# Input Data Download And Generation

This Jupyter Notebook provides a step-by-step guide on how to generate the necessary input files (`cell_map.csv`, `celltype_map.csv`, `gene_mapping.csv`, etc.) from `.h5ad` files. These generated files are required to run the analyses in the `bm.ipynb`, `cell_development.ipynb`, `grn.ipynb`, and `spatial.ipynb` notebooks.

The notebook is divided into four main sections, each dedicated to preparing the specific input data needed for one of the target notebooks. All file paths have been pre-configured based on the provided notebooks.

The `.h5ad` files used in this notebook can be downloaded from the following sources:
* The four **embryonic stem cell (ESC)** datasets for benchmarking can be accessed at <https://zenodo.org/records/6720690#.YrXQjHZBz4Y>.
* The **human kidney dataset** can be accessed at <https://cellxgene.cziscience.com/e/dea717d4-7bc0-4e46-950f-fd7e1cc8df7d.cxg/>.
* The **human hematopoietic dataset** can be accessed at <https://cellxgene.cziscience.com/e/cd2f23c1-aef1-48ae-8eb4-0bcf124e567d.cxg/>.
* The **mouse heart spatial transcriptomics dataset** can be accessed at <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE178636>.
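
Each task below reads its input from a fixed path, so it can help to confirm the files are in place before running anything. The following is a minimal sketch of such a check, assuming the downloaded files have been saved under the paths configured in the four tasks of this notebook. The helper name `find_missing_inputs` is hypothetical and not part of the original notebooks.

```python
from pathlib import Path

# Input paths as configured in the four tasks of this notebook.
EXPECTED_INPUTS = {
    "bm.ipynb": "./BM/data/mESC1/mESC1.h5ad",
    "cell_development.ipynb": "./Cell_Development/data/human_bone/HumanBone.h5ad",
    "grn.ipynb": "./GRN/data/kidney/kidney1_final_annotated.h5ad",
    "spatial.ipynb": "./Spatial/data/spatiotemporal_mouse/mouse_spatial.h5ad",
}


def find_missing_inputs(expected, root="."):
    """Return {notebook: path} for every expected input that is not a file under root."""
    return {
        notebook: path
        for notebook, path in expected.items()
        if not (Path(root) / path).is_file()
    }


# Report anything that still needs to be downloaded or moved into place.
for notebook, path in find_missing_inputs(EXPECTED_INPUTS).items():
    print(f"Missing input for {notebook}: {path}")
```

An empty result means every task should find its `.h5ad` file; any printed line names the notebook whose input is still missing.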

## Initial Setup

First, let's import the necessary Python libraries. Make sure you have `anndata` and `pandas` installed in your environment.

```bash
pip install anndata pandas
```

In [2]:
import anndata as ad
import pandas as pd
import os

print("Libraries imported successfully.")

Libraries imported successfully.


## Task 1: Generate Input for bm.ipynb

This task requires a gene mapping file (`gene_mapping.csv`). We will generate it from the corresponding `.h5ad` file.

In [None]:
# --- Configuration for bm.ipynb ---
# Path to the .h5ad file for the BM task, extracted from the notebook.
bm_h5ad_path = "./BM/data/mESC1/mESC1.h5ad"
# Directory where the output files will be saved.
bm_output_dir = './BM/data/mESC1/'

# --- Data Generation ---
print("Processing for bm.ipynb...")
os.makedirs(bm_output_dir, exist_ok=True)

try:
    # Load the AnnData object
    bm_adata = ad.read_h5ad(bm_h5ad_path)
    
    # Create the gene mapping DataFrame
    bm_gene_mapping_df = pd.DataFrame({
        'ID': range(len(bm_adata.var.index)),
        'Gene': bm_adata.var.index
    })

    # Save to CSV
    bm_gene_mapping_csv_path = os.path.join(bm_output_dir, 'gene_mapping.csv')
    bm_gene_mapping_df.to_csv(bm_gene_mapping_csv_path, index=False)

    print("✅ Successfully generated 'gene_mapping.csv'.")
    print("\nFile Preview:")
    display(bm_gene_mapping_df.head())

except FileNotFoundError:
    print(f"❌ ERROR: '{bm_h5ad_path}' was not found. Please ensure the path is correct relative to this notebook's location.")

Processing for bm.ipynb...
✅ Successfully generated 'gene_mapping.csv'.

File Preview:


|   | ID | Gene |
|---|----|------|
| 0 | 0 | 0610005C13Rik |
| 1 | 1 | 0610007N19Rik |
| 2 | 2 | 0610007P14Rik |
| 3 | 3 | 0610008F07Rik |
| 4 | 4 | 0610009B14Rik |
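
The generated file is a plain two-column table, so reading it back into lookup dictionaries is straightforward. The sketch below is only an illustration of the file format: it uses an in-memory copy of the five preview rows instead of the real `gene_mapping.csv`, and the names `id_to_gene` and `gene_to_id` are hypothetical. How `bm.ipynb` actually consumes the file is not shown here.

```python
import io

import pandas as pd

# In-memory stand-in for gene_mapping.csv, built from the preview rows above.
csv_text = """ID,Gene
0,0610005C13Rik
1,0610007N19Rik
2,0610007P14Rik
3,0610008F07Rik
4,0610009B14Rik
"""

gene_mapping = pd.read_csv(io.StringIO(csv_text))

# Build lookups in both directions from the two columns.
id_to_gene = dict(zip(gene_mapping["ID"].tolist(), gene_mapping["Gene"].tolist()))
gene_to_id = {gene: gene_id for gene_id, gene in id_to_gene.items()}

print(id_to_gene[0])
print(gene_to_id["0610009B14Rik"])
```

To read the real file, replace `io.StringIO(csv_text)` with the path that was written by the cell above.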


## Task 2: Generate Input for cell_development.ipynb

This task requires a cell map (`cell_map.csv`) and a cell type map (`celltype_map.csv`). The cell type column has been identified as `'cell_type'` from the notebook.

In [5]:
# --- Configuration for cell_development.ipynb ---
# Path to the .h5ad file for the cell development task.
cell_dev_h5ad_path = './Cell_Development/data/human_bone/HumanBone.h5ad'
# Column name in adata.obs that contains cell type labels.
cell_dev_cell_type_column = 'cell_type' 
# Output directory.
cell_dev_output_dir = './Cell_Development/data/human_bone/'

# --- Data Generation ---
print("\nProcessing for cell_development.ipynb...")
os.makedirs(cell_dev_output_dir, exist_ok=True)

try:
    # Load the AnnData object
    cell_dev_adata = ad.read_h5ad(cell_dev_h5ad_path)
    
    # 1. Generate cell_map.csv
    cell_map_df = pd.DataFrame({
        'cellName': cell_dev_adata.obs.index,
        'cellId': range(len(cell_dev_adata.obs))
    })
    cell_map_csv_path = os.path.join(cell_dev_output_dir, 'cell_map.csv')
    cell_map_df.to_csv(cell_map_csv_path, index=False)
    print("✅ Successfully generated 'cell_map.csv'.")
    print("\nFile Preview ('cell_map.csv'):")
    display(cell_map_df.head())

    # 2. Generate celltype_map.csv
    if cell_dev_cell_type_column in cell_dev_adata.obs.columns:
        unique_cell_types = cell_dev_adata.obs[cell_dev_cell_type_column].unique().tolist()
        celltype_map_df = pd.DataFrame({
            'celltype': unique_cell_types,
            'celltypeId': range(len(unique_cell_types))
        })
        celltype_map_csv_path = os.path.join(cell_dev_output_dir, 'celltype_map.csv')
        celltype_map_df.to_csv(celltype_map_csv_path, index=False)
        print("\n✅ Successfully generated 'celltype_map.csv'.")
        print("\nFile Preview ('celltype_map.csv'):")
        display(celltype_map_df.head())
    else:
        print(f"\n❌ ERROR: Column '{cell_dev_cell_type_column}' not found in the AnnData object. Please check the .h5ad file.")

except FileNotFoundError:
    print(f"❌ ERROR: '{cell_dev_h5ad_path}' was not found. Please ensure the path is correct relative to this notebook's location.")


Processing for cell_development.ipynb...
✅ Successfully generated 'cell_map.csv'.

File Preview ('cell_map.csv'):


|   | cellName | cellId |
|---|----------|--------|
| 0 | MantonBM6_HiSeq_5-CTGAAACAGACTAAGT-1 | 0 |
| 1 | MantonBM4_HiSeq_6-TAGAGCTTCACAGTAC-1 | 1 |
| 2 | MantonBM3_HiSeq_6-TGACTAGTCGATGAGG-1 | 2 |
| 3 | MantonBM7_HiSeq_4-TCAATCTGTCTCTTAT-1 | 3 |
| 4 | MantonBM3_HiSeq_5-GCGCAGTCATTCCTCG-1 | 4 |



✅ Successfully generated 'celltype_map.csv'.

File Preview ('celltype_map.csv'):


|   | celltype | celltypeId |
|---|----------|------------|
| 0 | hematopoietic multipotent progenitor cell | 0 |
| 1 | basophilic erythroblast | 1 |
| 2 | common myeloid progenitor | 2 |
| 3 | megakaryocyte progenitor cell | 3 |
| 4 | granulocyte monocyte progenitor cell | 4 |
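
The two files are generated independently: `cell_map.csv` numbers the cells in row order, while `celltype_map.csv` numbers the labels in order of first appearance, because `Series.unique()` preserves that order. The sketch below shows how the two could be combined into a per-cell table. It uses a toy `obs` table with hypothetical cell names, and whether `cell_development.ipynb` needs such a combined table is an assumption.

```python
import pandas as pd

# Toy stand-in for cell_dev_adata.obs; the cell names are hypothetical.
obs = pd.DataFrame(
    {
        "cell_type": [
            "common myeloid progenitor",
            "basophilic erythroblast",
            "common myeloid progenitor",
        ]
    },
    index=["cell_a", "cell_b", "cell_c"],
)

# Same numbering rule as the cell above: IDs follow order of first appearance.
unique_cell_types = obs["cell_type"].unique().tolist()
celltype_to_id = {name: i for i, name in enumerate(unique_cell_types)}

# One row per cell, carrying both the cell ID and its cell type ID.
cell_labels = pd.DataFrame({
    "cellName": obs.index,
    "cellId": range(len(obs)),
    "celltypeId": obs["cell_type"].map(celltype_to_id).to_numpy(),
})

print(cell_labels)
```

Because the IDs depend on row order, regenerating the maps from a reordered or subsetted `.h5ad` file can assign different IDs to the same label.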


## Task 3: Generate Input for grn.ipynb

This task requires a gene mapping file named `gene_mapping_final.csv`.

In [None]:
# --- Configuration for grn.ipynb ---
# Path to the .h5ad file for the GRN task.
grn_h5ad_path = "./GRN/data/kidney/kidney1_final_annotated.h5ad"
# Output directory.
grn_output_dir = "./GRN/data/kidney/"

# --- Data Generation ---
print("\nProcessing for grn.ipynb...")
os.makedirs(grn_output_dir, exist_ok=True)

try:
    # Load the AnnData object
    grn_adata = ad.read_h5ad(grn_h5ad_path)
    
    # Create the gene mapping DataFrame
    grn_gene_mapping_df = pd.DataFrame({
        'ID': range(len(grn_adata.var.index)),
        'Gene': grn_adata.var.index
    })

    # Save to CSV
    grn_gene_mapping_csv_path = os.path.join(grn_output_dir, 'gene_mapping_final.csv')
    grn_gene_mapping_df.to_csv(grn_gene_mapping_csv_path, index=False)

    print("✅ Successfully generated 'gene_mapping_final.csv'.")
    print("\nFile Preview:")
    display(grn_gene_mapping_df.head())

except FileNotFoundError:
    print(f"❌ ERROR: '{grn_h5ad_path}' was not found. Please ensure the path is correct relative to this notebook's location.")


Processing for grn.ipynb...
✅ Successfully generated 'gene_mapping_final.csv'.

File Preview:


|   | ID | Gene |
|---|----|------|
| 0 | 0 | MIR1302-2HG |
| 1 | 1 | OR4F5 |
| 2 | 2 | LINC01409 |
| 3 | 3 | FAM87B |
| 4 | 4 | LINC01128 |
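
Task 1 and Task 3 build their gene mapping in the same way, differing only in the input path and output file name. A minimal sketch of a shared helper follows. The name `build_gene_mapping` is hypothetical, and the duplicate-name check is an addition for illustration that the cells above do not perform.

```python
import pandas as pd


def build_gene_mapping(gene_names):
    """Return a DataFrame with a 0-based 'ID' column and a 'Gene' column."""
    genes = pd.Index(gene_names)
    # Added safeguard (not in the original cells): refuse ambiguous gene names.
    if not genes.is_unique:
        duplicated = genes[genes.duplicated()].unique().tolist()
        raise ValueError(f"Duplicate gene names: {duplicated[:5]}")
    return pd.DataFrame({"ID": range(len(genes)), "Gene": genes})


# Toy stand-in for adata.var.index, using names from the preview above.
mapping = build_gene_mapping(["MIR1302-2HG", "OR4F5", "LINC01409"])
print(mapping)
```

With a helper like this, each task would reduce to calling it on `adata.var.index` and writing the result with `to_csv(path, index=False)`.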


## Task 4: Generate Input for spatial.ipynb

This task requires `cell_map_spatial.csv` and `celltype_map_spatial.csv`. The cell type column is `'cell_annotion'`, spelled exactly as in the configuration of the code cell below.

In [None]:
# --- Configuration for spatial.ipynb ---
# Path to the .h5ad file for the spatial task.
spatial_h5ad_path = './Spatial/data/spatiotemporal_mouse/mouse_spatial.h5ad'
# Column name in adata.obs that contains cell type labels.
spatial_cell_type_column = 'cell_annotion'
# Output directory.
spatial_output_dir = './Spatial/data/spatiotemporal_mouse/'

# --- Data Generation ---
print("\nProcessing for spatial.ipynb...")
os.makedirs(spatial_output_dir, exist_ok=True)

try:
    # Load the AnnData object
    spatial_adata = ad.read_h5ad(spatial_h5ad_path)
    
    # 1. Generate cell_map_spatial.csv
    cell_map_spatial_df = pd.DataFrame({
        'cellName': spatial_adata.obs.index,
        'cellId': range(len(spatial_adata.obs))
    })
    cell_map_spatial_csv_path = os.path.join(spatial_output_dir, 'cell_map_spatial.csv')
    cell_map_spatial_df.to_csv(cell_map_spatial_csv_path, index=False)
    print("✅ Successfully generated 'cell_map_spatial.csv'.")
    print("\nFile Preview ('cell_map_spatial.csv'):")
    display(cell_map_spatial_df.head())

    # 2. Generate celltype_map_spatial.csv
    if spatial_cell_type_column in spatial_adata.obs.columns:
        unique_cell_types_spatial = spatial_adata.obs[spatial_cell_type_column].unique().tolist()
        celltype_map_spatial_df = pd.DataFrame({
            'celltype': unique_cell_types_spatial,
            'celltypeId': range(len(unique_cell_types_spatial))
        })
        celltype_map_spatial_csv_path = os.path.join(spatial_output_dir, 'celltype_map_spatial.csv')
        celltype_map_spatial_df.to_csv(celltype_map_spatial_csv_path, index=False)
        print("\n✅ Successfully generated 'celltype_map_spatial.csv'.")
        print("\nFile Preview ('celltype_map_spatial.csv'):")
        display(celltype_map_spatial_df.head())
    else:
        print(f"\n❌ ERROR: Column '{spatial_cell_type_column}' not found in the AnnData object. Please check the .h5ad file.")

except FileNotFoundError:
    print(f"❌ ERROR: '{spatial_h5ad_path}' was not found. Please ensure the path is correct relative to this notebook's location.")


Processing for spatial.ipynb...
✅ Successfully generated 'cell_map_spatial.csv'.

File Preview ('cell_map_spatial.csv'):


Unnamed: 0,cellName,cellId
0,e20-1.rds:BIN.13557_1,0
1,e20-1.rds:BIN.10045_1,1
2,e20-1.rds:BIN.7464_1,2
3,e20-1.rds:BIN.8814_1,3
4,e20-1.rds:BIN.14607_1,4



✅ Successfully generated 'celltype_map_spatial.csv'.

File Preview ('celltype_map_spatial.csv'):


|   | celltype | celltypeId |
|---|----------|------------|
| 0 | 4 | 0 |
| 1 | 1 | 1 |
| 2 | 0 | 2 |
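
Label column names differ between datasets (`'cell_type'` in Task 2, `'cell_annotion'` here), and the code above reports an error when the configured name is absent. One way to find the right column in an unfamiliar file is to list every `obs` column with its dtype and number of distinct values, since label columns usually have few. The sketch below does this on a toy table; the helper name `summarize_obs_columns` and the `x`/`y` coordinate columns are hypothetical.

```python
import pandas as pd


def summarize_obs_columns(obs):
    """Return one row per obs column with its dtype and number of distinct values."""
    summary = pd.DataFrame({
        "dtype": obs.dtypes.astype(str),
        "n_unique": obs.nunique(),
    })
    # Columns with the fewest distinct values come first.
    return summary.sort_values("n_unique")


# Toy stand-in for spatial_adata.obs; the coordinate columns are hypothetical.
toy_obs = pd.DataFrame(
    {
        "cell_annotion": [4, 1, 0, 4],
        "x": [10.5, 11.0, 12.25, 13.0],
        "y": [1.0, 2.0, 3.0, 4.0],
    },
    index=["bin_a", "bin_b", "bin_c", "bin_d"],
)

print(summarize_obs_columns(toy_obs))
```

On a real file, the same call would be made on `spatial_adata.obs` after loading it with `ad.read_h5ad`.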
