# Data Preprocessing and Tokenization

This notebook demonstrates how to preprocess mouse spatial transcriptomics data and tokenize it for Geneformer analysis.

## Pipeline Overview:
1. Load the raw data
2. Preprocess: add QC metrics, then convert mouse gene symbols to human Ensembl IDs
3. Save the preprocessed data
4. Tokenize
5. Verify the results


In [1]:
import scanpy as sc
from geneformer_utils import DataPreprocessor, tokenize_data
from datasets import load_from_disk




## Configuration


In [None]:
# Input/Output paths
INPUT_H5AD = "/home/wsg/SSW/data/mouse_E9.5_heart/mouse_E9.5_heart.h5ad"
MGI_FILE = "/home/wsg/SSW/data/HOM_MouseHumanSequence.rpt"
OUTPUT_DIR = "/home/wsg/SSW/data/mouse_E9.5_heart/token"
OUTPUT_PREFIX = "mouse_E9p5_heart"

# Specify the layer containing raw counts (None if using adata.X)
COUNT_LAYER = "total"

# Custom attribute mapping for tokenized data
CUSTOM_ATTRS = {
    "heart_regions": "cell_type",
    "heart_anno": "organ",
    "n_counts": "n_counts"
}


In [None]:
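# Alternative configuration for the E11.5 dataset.
# Running this cell overrides the E9.5 settings above; the outputs shown below were produced with it.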
INPUT_H5AD = "/home/wsg/SSW/data/mouse_E11.5_heart/mouse_E11.5_heart.h5ad"
MGI_FILE = "/home/wsg/SSW/data/HOM_MouseHumanSequence.rpt"
OUTPUT_DIR = "/home/wsg/SSW/data/mouse_E11.5_heart/token"
OUTPUT_PREFIX = "mouse_E11p5_heart"

COUNT_LAYER = "total"

CUSTOM_ATTRS = {
    "heart_regions": "cell_type",
    "heart_anno": "organ",
    "n_counts": "n_counts"
}
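
Before running the pipeline, it can help to check that every key in `CUSTOM_ATTRS` will be available in `adata.obs`. The sketch below assumes that the keys are `adata.obs` column names and the values are the attribute names written to the tokenized dataset (which matches the dataset columns shown in section 5), and that `n_counts` is created later by the QC step. The helper `missing_obs_columns` is hypothetical and not part of `geneformer_utils`.

```python
def missing_obs_columns(obs_columns, custom_attrs, added_later=("n_counts",)):
    """Return CUSTOM_ATTRS keys that are absent from the obs columns.

    Columns listed in `added_later` are skipped because a later step
    (assumed here to be the QC step) is expected to create them.
    """
    available = set(obs_columns) | set(added_later)
    return [key for key in custom_attrs if key not in available]


# Column names copied from the adata.obs preview in section 1.
obs_columns = [
    "ctype_user", "cml", "slices", "heart_anno",
    "heart_regions", "stage", "3d_spatial_density_heart_regions",
]
custom_attrs = {
    "heart_regions": "cell_type",
    "heart_anno": "organ",
    "n_counts": "n_counts",
}

missing = missing_obs_columns(obs_columns, custom_attrs)
print("Missing obs columns:", missing)
```

In the notebook itself, the same check could be run as `missing_obs_columns(adata.obs.columns, CUSTOM_ATTRS)` right after loading the data.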


## 1. Load Data


In [3]:
print("📂 Loading data...")
adata = sc.read_h5ad(INPUT_H5AD)
print(f"✅ Data loaded: {adata.n_obs} cells × {adata.n_vars} genes")
print("\nObservations (first 5):")
print(adata.obs.head())
print(f"\nVariable names (first 5): {list(adata.var_names[:5])}")


📂 Loading data...
✅ Data loaded: 98966 cells × 19746 genes

Observations (first 5):
                             ctype_user    cml     slices heart_anno  \
slice_40_80591  Cardiac muscle lineages    cml  slices_40      Heart   
slice_40_80762  Cardiac muscle lineages    cml  slices_40      Heart   
slice_40_80888  Cardiac muscle lineages    cml  slices_40      Heart   
slice_40_80908  Cardiac muscle lineages    cml  slices_40      Heart   
slice_40_80909                 Myocytes  other  slices_40      Heart   

                 heart_regions  stage  3d_spatial_density_heart_regions  
slice_40_80591  Left ventricle  E11.5                          0.627947  
slice_40_80762  Left ventricle  E11.5                          0.632751  
slice_40_80888  Left ventricle  E11.5                          0.558212  
slice_40_80908  Left ventricle  E11.5                          0.554837  
slice_40_80909     Left atrium  E11.5                          0.246904  

Variable names (first 5): ['0610

## 2. Data Preprocessing

### 2.1 Add QC Metrics


In [None]:
preprocessor = DataPreprocessor(mgi_file_path=MGI_FILE)

# Add QC metrics (n_counts, filter_pass, etc.)
adata = preprocessor.add_qc_metrics(adata, count_layer=COUNT_LAYER)


Adding QC metrics...
Moving counts from layer 'total' to adata.X...
Calculating n_counts and n_genes...
✅ QC metrics added. n_counts range: [108, 4124]
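
The log above reports per-cell `n_counts` and `n_genes`. As a rough illustration of what these two metrics mean (this is not the `DataPreprocessor.add_qc_metrics` implementation, which is not shown in this notebook), the sketch below computes them for a tiny made-up count matrix stored as plain Python lists. The function name `qc_metrics` is hypothetical.

```python
def qc_metrics(counts):
    """Compute per-cell total counts and number of detected genes.

    `counts` is a cells x genes matrix given as a list of rows.
    """
    n_counts = [sum(row) for row in counts]
    n_genes = [sum(1 for value in row if value > 0) for row in counts]
    return n_counts, n_genes


# Toy matrix: 3 cells x 4 genes (made-up values for illustration only).
toy_counts = [
    [5, 0, 2, 1],
    [0, 0, 7, 0],
    [3, 4, 0, 6],
]

n_counts, n_genes = qc_metrics(toy_counts)
print("n_counts:", n_counts)
print("n_genes:", n_genes)
```

On real data the same row sums would be taken over the raw count matrix, which is why `COUNT_LAYER` has to point at unnormalized counts.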


### 2.2 Gene ID Conversion: Mouse Symbol → Human Ensembl ID


In [None]:
# Convert gene IDs
adata = preprocessor.convert_mouse_to_human_ensembl(adata)

print("\n✅ Conversion complete!")
print(f"Final: {adata.n_obs} cells × {adata.n_vars} genes")
print(f"\nGene IDs (first 5): {list(adata.var_names[:5])}")
print("Mouse symbols preserved in adata.var['mouse_symbol']")


🚀 Converting Mouse Symbol -> Human Ensembl ID...
Input: 19746 mouse genes
Step 1: Parsing MGI ortholog file...


Input sequence provided is already in string format. No operation performed
Input sequence provided is already in string format. No operation performed


✅ Found 20181 ortholog pairs
Step 2: Converting Human Symbol -> Ensembl ID...


409 input query terms found dup hits:	[('RAN', 3), ('SYCP3', 5), ('ZNF670', 6), ('C19orf48P', 2), ('ATXN1-AS1', 2), ('SCYGR9', 2), ('ABCA1
15 input query terms found no hit:	['FAM210A', 'FAM210B', 'ATP6', 'ATP8', 'COX1', 'COX2', 'COX3', 'CYTB', 'ND1', 'ND2', 'ND3', 'ND4', '


Step 3: Applying mapping...
🎉 Result: 16490 / 19746 genes successfully mapped
✅ Ready! Gene IDs example: ['ENSG00000168887', 'ENSG00000248713', 'ENSG00000110696', 'ENSG00000180044', 'ENSG00000291362']

✅ Conversion complete!
Final: 98966 cells × 16490 genes

Gene IDs (first 5): ['ENSG00000168887', 'ENSG00000248713', 'ENSG00000110696', 'ENSG00000180044', 'ENSG00000291362']
Mouse symbols preserved in adata.var['mouse_symbol']
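
Step 1 in the log parses the MGI ortholog report. The sketch below shows one way such a parse could work. It assumes the report is tab-separated with the columns `DB Class Key`, `Common Organism Name`, and `Symbol`, and that mouse and human rows sharing a class key are orthologs; check these names against the header of the actual file. The function `build_ortholog_pairs` and the toy rows (including the class keys and the `Toy...` symbols) are hypothetical, and keeping only one-to-one classes is a choice of this sketch, not necessarily what `convert_mouse_to_human_ensembl` does.

```python
import csv
import io
from collections import defaultdict


def build_ortholog_pairs(handle):
    """Pair mouse and human symbols that share a homology class key."""
    mouse, human = defaultdict(list), defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = row["DB Class Key"]
        organism = row["Common Organism Name"]
        if organism.startswith("mouse"):
            mouse[key].append(row["Symbol"])
        elif organism == "human":
            human[key].append(row["Symbol"])
    # Keep only classes with exactly one mouse and one human gene,
    # so that every mouse symbol maps to a single human symbol.
    return {
        mouse[key][0]: human[key][0]
        for key in mouse
        if len(mouse[key]) == 1 and len(human.get(key, [])) == 1
    }


# Toy report: class 3 has no human gene, class 4 has two human genes.
toy_report = (
    "DB Class Key\tCommon Organism Name\tSymbol\n"
    "1\tmouse, laboratory\tTnnt2\n"
    "1\thuman\tTNNT2\n"
    "2\tmouse, laboratory\tMyh6\n"
    "2\thuman\tMYH6\n"
    "3\tmouse, laboratory\tToyMouseOnly\n"
    "4\tmouse, laboratory\tToyDup\n"
    "4\thuman\tTOYDUP1\n"
    "4\thuman\tTOYDUP2\n"
)

pairs = build_ortholog_pairs(io.StringIO(toy_report))
print(pairs)
```

With the real file, the same function could be called as `build_ortholog_pairs(open(MGI_FILE, newline=""))`. Genes without an unambiguous human ortholog are dropped at this stage, which is one reason the gene count can shrink after conversion.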


## 3. Save Preprocessed Data


In [6]:
import os
os.makedirs(OUTPUT_DIR, exist_ok=True)

preprocessed_path = f"{OUTPUT_DIR}/{OUTPUT_PREFIX}_ensembl_id.h5ad"
adata.write_h5ad(preprocessed_path)
print(f"✅ Preprocessed data saved to: {preprocessed_path}")


✅ Preprocessed data saved to: /home/wsg/SSW/data/mouse_E11.5_heart/token/mouse_E11p5_heart_ensembl_id.h5ad


## 4. Tokenization

Convert preprocessed data into token sequences for Geneformer.


In [7]:
tokenize_data(
    input_dir=OUTPUT_DIR,
    output_dir=OUTPUT_DIR,
    output_prefix=OUTPUT_PREFIX,
    custom_attr_dict=CUSTOM_ATTRS,
    file_format="h5ad",
    nproc=16
)


🚀 Tokenizing data from /home/wsg/SSW/data/mouse_E11.5_heart/token...
Tokenizing /home/wsg/SSW/data/mouse_E11.5_heart/token/mouse_E11p5_heart_ensembl_id.h5ad




Creating dataset.
✅ Tokenization complete! Output: /home/wsg/SSW/data/mouse_E11.5_heart/token/mouse_E11p5_heart.dataset
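
Geneformer represents each cell as a list of gene tokens ordered by rank rather than by raw expression value. The sketch below is a simplified illustration of that rank value encoding idea: counts are scaled by the cell total, divided by a per-gene median, and the detected genes are then sorted from highest to lowest. The function `rank_value_encode`, the gene names, the medians, and the token IDs are all made up for illustration; the real tokenizer uses its own dictionaries and also handles special tokens and truncation.

```python
def rank_value_encode(counts, gene_medians, token_ids, target_sum=10_000):
    """Return token IDs for detected genes, ordered by scaled expression."""
    total = sum(counts.values())
    scaled = {
        gene: (count / total * target_sum) / gene_medians[gene]
        for gene, count in counts.items()
        if count > 0 and gene in gene_medians and gene in token_ids
    }
    # Highest scaled expression first; ties broken by gene name for determinism.
    ranked = sorted(scaled, key=lambda gene: (-scaled[gene], gene))
    return [token_ids[gene] for gene in ranked]


# Toy cell with four placeholder genes (all values are made up).
counts = {"GENE_A": 10, "GENE_B": 30, "GENE_C": 0, "GENE_D": 60}
gene_medians = {"GENE_A": 1.0, "GENE_B": 6.0, "GENE_C": 2.0, "GENE_D": 20.0}
token_ids = {"GENE_A": 11, "GENE_B": 12, "GENE_C": 13, "GENE_D": 14}

tokens = rank_value_encode(counts, gene_medians, token_ids)
print(tokens)
```

In this toy cell, `GENE_D` has the highest raw count but ends up last because its median is also high, while the undetected `GENE_C` is left out entirely. Because only detected genes produce tokens, the sequence length varies from cell to cell.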


## 5. Verify Tokenization Results


In [8]:
dataset_path = f"{OUTPUT_DIR}/{OUTPUT_PREFIX}.dataset"

print(f"📂 Loading tokenized dataset from {dataset_path}...")
dataset = load_from_disk(dataset_path)

print("\n=== Dataset Overview ===")
print(dataset)

print("\n=== First Cell Example ===")
first_cell = dataset[0]
print(f"Columns: {list(first_cell.keys())}")
print(f"\nToken sequence length: {len(first_cell['input_ids'])}")
print(f"First 10 tokens: {first_cell['input_ids'][:10]}")
print("\nMetadata:")
for key, value in first_cell.items():
    if key != 'input_ids':
        print(f"  {key}: {value}")

print("\n✅ Tokenization verified successfully!")


📂 Loading tokenized dataset from /home/wsg/SSW/data/mouse_E11.5_heart/token/mouse_E11p5_heart.dataset...

=== Dataset Overview ===
Dataset({
    features: ['input_ids', 'cell_type', 'organ', 'n_counts', 'length'],
    num_rows: 98966
})

=== First Cell Example ===
Columns: ['input_ids', 'cell_type', 'organ', 'n_counts', 'length']

Token sequence length: 971
First 10 tokens: [2, 297, 6594, 12127, 12693, 7749, 18836, 10170, 4819, 1864]

Metadata:
  cell_type: Left ventricle
  organ: Heart
  n_counts: 2060.0
  length: 971

✅ Tokenization verified successfully!
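
Printing the first cell only confirms that one row looks reasonable. A few consistency checks across many rows can catch problems such as empty sequences or a `length` field that disagrees with `input_ids`. The sketch below assumes each row is a dict with `input_ids` and `length` keys, as in the output above; the helper `check_rows` is hypothetical and is demonstrated on made-up rows.

```python
def check_rows(rows, required=("input_ids", "length")):
    """Return a list of human-readable problems found in tokenized rows."""
    problems = []
    for index, row in enumerate(rows):
        absent = [name for name in required if name not in row]
        if absent:
            problems.append(f"row {index}: missing columns {absent}")
            continue
        if len(row["input_ids"]) == 0:
            problems.append(f"row {index}: empty token sequence")
        if row["length"] != len(row["input_ids"]):
            problems.append(f"row {index}: length field does not match input_ids")
    return problems


# Toy rows: the second one has an inconsistent length field.
toy_rows = [
    {"input_ids": [2, 297, 6594], "length": 3, "cell_type": "Left ventricle"},
    {"input_ids": [2, 297], "length": 5, "cell_type": "Left atrium"},
]

problems = check_rows(toy_rows)
print(problems)
```

Since iterating over a Hugging Face dataset yields one dict per row, the same function could be applied to a sample of the real data, for example `check_rows(dataset.select(range(1000)))`.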
