# 🧬 01. Data Preparation
**Objective:** Prepare the raw genomic data for training.

**What this notebook does:**
1.  **Validates** that the raw data files exist.
2.  **Inspects** the Single-Cell Atlas to understand the gene labeling format (IDs vs. Symbols).
3.  **Parses** the Genome Annotation (GTF) to extract coordinates for every gene.
4.  **Matches** the Atlas labels to the Genome coordinates automatically.
5.  **Calculates** GC Content (a key feature for the AI model) from the DNA sequence.
6.  **Exports** a clean `training_manifest.csv`.

### 1. Imports & Configuration
**Purpose:** Set up the environment and define file paths.
**How it works:**
- Imports the necessary libraries (Scanpy for single-cell data, Biopython for DNA sequences).
- Suppresses `FutureWarning` messages to keep the output clean.
- Builds paths relative to the project root instead of hard-coding them. The root is taken as the parent of the current working directory, so the notebook must be run from its own folder inside the project.

In [1]:
import sys
import os
import gzip
import re
import warnings
import pandas as pd
import scanpy as sc
import seaborn as sns
from Bio import SeqIO

# --- CONFIGURATION ---
# Ignore all FutureWarning messages (anndata emits many of them) to keep the output clean
warnings.filterwarnings("ignore", category=FutureWarning)

# Dynamic Path Setup
NOTEBOOK_DIR = os.getcwd()
PROJECT_ROOT = os.path.dirname(NOTEBOOK_DIR)
DATA_DIR = os.path.join(PROJECT_ROOT, 'data')

# File Paths
GENOME_PATH = os.path.join(DATA_DIR, 'raw', 'dm6.fa.gz')
ANNOTATION_PATH = os.path.join(DATA_DIR, 'raw', 'dmel-all-r6.54.gtf.gz')
ATLAS_PATH = os.path.join(DATA_DIR, 'raw', 'fly_cell_atlas.h5ad')
OUTPUT_MANIFEST = os.path.join(DATA_DIR, 'processed', 'training_manifest.csv')

# Verify existence
print(f"📂 Project Root: {PROJECT_ROOT}")
for p in [GENOME_PATH, ANNOTATION_PATH, ATLAS_PATH]:
    status = "✅ Found" if os.path.exists(p) else "❌ MISSING"
    print(f"{status}: {os.path.basename(p)}")

📂 Project Root: /Volumes/LaCie/Pracitce_Python_for_AIAP/genomic-decoder-fly
✅ Found: dm6.fa.gz
✅ Found: dmel-all-r6.54.gtf.gz
✅ Found: fly_cell_atlas.h5ad
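
The path setup above depends on the working directory being one level below the project root. As a minimal sketch of an alternative, the root could be located by walking upward until a `data` folder is found. The helper name `find_project_root` and the choice of `data` as the marker are assumptions made for illustration, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start, marker="data"):
    """Walk upward from `start` until a directory containing `marker` is found.

    `marker` is a hypothetical convention: the name of a folder that only
    exists at the project root.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")
```

In the notebook this would be called as `PROJECT_ROOT = find_project_root(Path.cwd())`. It raises a clear error instead of silently building paths that do not exist.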


### 2. Inspect the Atlas
**Purpose:** Check what kind of gene labels the Atlas uses.
**How it works:**
- Loads the `.h5ad` file using Scanpy.
- Extracts the list of gene names (`var_names`).
- Prints the first few examples so we can see whether they look like FlyBase IDs (e.g., **"FBgn0031208"**) or gene symbols (e.g., **"Act5C"**).

In [2]:
print("🔍 Inspecting Atlas Labels...")

try:
    # Load the Atlas
    adata = sc.read_h5ad(ATLAS_PATH)
    
    # Get the gene names
    atlas_genes = adata.var_names.tolist()
    
    print(f"   - Found {len(atlas_genes)} total genes in Atlas.")
    print(f"   - First 5 Gene Labels: {atlas_genes[:5]}")
        
except Exception as e:
    print(f"❌ Error loading Atlas: {e}")
    atlas_genes = []

🔍 Inspecting Atlas Labels...
   - Found 1838 total genes in Atlas.
   - First 5 Gene Labels: ['TNFRSF4', 'CPSF3L', 'ATAD3C', 'C1orf86', 'RER1']
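
The visual check above can also be done programmatically. The sketch below assumes FlyBase gene IDs follow the pattern `FBgn` plus seven digits, and treats everything else as symbol-like. The function name `classify_labels` and the 90%/10% thresholds are illustrative choices. The heuristic only separates IDs from non-IDs and cannot tell which organism a symbol belongs to. Labels such as `TNFRSF4` are also human (HGNC) gene symbols, so a symbol-like result does not by itself confirm that the labels are fly genes.

```python
import re

# FlyBase gene identifiers: "FBgn" followed by exactly seven digits.
FBGN_PATTERN = re.compile(r"^FBgn\d{7}$")


def classify_labels(labels, sample_size=100):
    """Guess the naming convention from the first `sample_size` labels.

    Returns "flybase_id", "symbol_like", "mixed", or "empty".
    The 0.9 / 0.1 cut-offs are arbitrary illustration values.
    """
    sample = [str(label) for label in list(labels)[:sample_size]]
    if not sample:
        return "empty"
    n_ids = sum(1 for label in sample if FBGN_PATTERN.match(label))
    fraction = n_ids / len(sample)
    if fraction > 0.9:
        return "flybase_id"
    if fraction < 0.1:
        return "symbol_like"
    return "mixed"


# The five labels printed by the cell above
print(classify_labels(["TNFRSF4", "CPSF3L", "ATAD3C", "C1orf86", "RER1"]))
```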


### 3. Parse Genomic Coordinates (GTF)
**Purpose:** Extract the chromosomal location of every gene.
**How it works:**
- Opens the GTF file (compressed or uncompressed).
- Uses **Regular Expressions (Regex)** to capture both the `gene_id` (e.g., FBgn0031208) and the `gene_symbol` (e.g., Fas2), falling back to `gene_name` when `gene_symbol` is absent.
- Keeping both lets us match the Atlas regardless of which naming convention it uses.

In [3]:
print("🔍 Parsing GTF for IDs and Symbols...")

def smart_parse_gtf(gtf_path):
    gene_data = []
    
    # Regex to capture gene_id AND gene_symbol
    id_pattern = re.compile(r'gene_id\s+"?([^";]+)"?')
    name_pattern = re.compile(r'gene_symbol\s+"?([^";]+)"?') # FlyBase style
    name_pattern_2 = re.compile(r'gene_name\s+"?([^";]+)"?') # Ensembl style
    
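    # Use gzip for .gz files and the built-in open() for plain-text GTFs
    opener = gzip.open if str(gtf_path).endswith('.gz') else open
    with opener(gtf_path, 'rt') as f: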
        for line in f:
            if line.startswith('#'): continue
            parts = line.strip().split('\t')
            if len(parts) < 9 or parts[2] != 'gene': continue
            
            # Extract ID
            id_match = id_pattern.search(parts[8])
            if not id_match: continue
            gene_id = id_match.group(1)
            
            # Extract Name (Try 'gene_symbol' first, then 'gene_name')
            name_match = name_pattern.search(parts[8])
            if not name_match:
                name_match = name_pattern_2.search(parts[8])
            
            gene_name = name_match.group(1) if name_match else "Unknown"

            gene_data.append({
                'gene_id': gene_id,       # Unique ID
                'gene_symbol': gene_name, # Human-readable name
                'chrom': parts[0],
                'start': int(parts[3]),
                'end': int(parts[4]),
                'length': int(parts[4]) - int(parts[3])  # end - start; the 1-based inclusive span is one base longer
            })
            
    return pd.DataFrame(gene_data)

# Run Parser
df_gtf = smart_parse_gtf(ANNOTATION_PATH)
print(f"✅ Parsed {len(df_gtf)} genes from GTF.")
df_gtf.head(3)

🔍 Parsing GTF for IDs and Symbols...
✅ Parsed 23932 genes from GTF.

| | gene_id | gene_symbol | chrom | start | end | length |
|---|---|---|---|---|---|---|
| 0 | FBgn0250732 | gfzf | 3R | 7145880 | 7150968 | 5088 |
| 1 | FBti0060344 | Unknown | 3R | 24185268 | 24185356 | 88 |
| 2 | FBgn0286036 | sisRNA:CR46358 | 3R | 4639789 | 4640004 | 215 |
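
To make the attribute parsing concrete, here is a small sketch that turns the ninth GTF column into a dictionary instead of using one regex per key. The helper `parse_gtf_attributes` and the sample line are hypothetical. The line is assembled from the first row of the table above, with placeholder values in the columns the table does not show.

```python
import re

# Matches `key "value"` pairs; unquoted values are ignored in this sketch.
ATTR_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_attributes(attr_field):
    """Return the quoted key/value pairs of a GTF attribute column as a dict."""
    return dict(ATTR_PATTERN.findall(attr_field))


# Hypothetical line: ID, symbol, chromosome and coordinates come from the table
# above; the source, score, strand and frame columns are placeholders.
sample_line = (
    '3R\tFlyBase\tgene\t7145880\t7150968\t.\t.\t.\t'
    'gene_id "FBgn0250732"; gene_symbol "gfzf";'
)

fields = sample_line.split("\t")
attrs = parse_gtf_attributes(fields[8])
start, end = int(fields[3]), int(fields[4])
span_bp = end - start + 1  # GTF coordinates are 1-based and inclusive

print(attrs["gene_id"], attrs["gene_symbol"], span_bp)
```

A dictionary makes it easy to try `gene_symbol` first and fall back to `gene_name` with `attrs.get("gene_symbol", attrs.get("gene_name", "Unknown"))`.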


### 4. Match, Calculate GC, and Save
**Purpose:** Connect the Atlas to the Genome and generate the training features.
**How it works:** 
1.  **Auto-Matching:** It compares the Atlas labels against the GTF IDs *and* Symbols, then keeps whichever column yields more matches (symbols win a tie).
2.  **GC Calculation:** It loads the full DNA genome, looks up the sequence for each matched gene, and calculates the fraction of G/C bases (a value between 0 and 1).
3.  **Export:** Saves the final clean dataset to `data/processed/training_manifest.csv`.
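
The auto-matching rule in step 1 can be illustrated on a toy table. This is a sketch under stated assumptions: `pick_matching_column` is a hypothetical helper, and the gene IDs and symbols below are made up. Candidates are listed with `gene_symbol` first so that a tie resolves to symbols, as in the notebook's rule.

```python
import pandas as pd


def pick_matching_column(df, labels, candidates=("gene_symbol", "gene_id")):
    """Count label matches per candidate column and return the best one.

    Ties go to the column listed first. Returns (None, counts) when
    no column matches any label.
    """
    label_list = list(set(labels))
    counts = {col: int(df[col].isin(label_list).sum()) for col in candidates}
    best = max(counts, key=counts.get)
    return (best if counts[best] > 0 else None), counts


# Made-up annotation table and label list, for illustration only
toy_gtf = pd.DataFrame({
    "gene_id": ["FBgn0000001", "FBgn0000002", "FBgn0000003"],
    "gene_symbol": ["geneA", "geneB", "geneC"],
})
toy_labels = ["geneA", "geneC", "geneX"]

column, counts = pick_matching_column(toy_gtf, toy_labels)
print(column, counts)
```

Comparing the winning count with `len(toy_labels)` gives a quick sense of how much of the label list was actually matched.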

In [4]:
print("🚀 Final Processing...")

# 1. LOGIC: Connect the dots
match_id = df_gtf[df_gtf['gene_id'].isin(atlas_genes)]
match_symbol = df_gtf[df_gtf['gene_symbol'].isin(atlas_genes)]

if len(match_id) > len(match_symbol):
    print(f"✅ MATCH FOUND: Using 'gene_id' ({len(match_id)} matches).")
    df_final = match_id.copy()
elif len(match_symbol) > 0:
    print(f"✅ MATCH FOUND: Using 'gene_symbol' ({len(match_symbol)} matches).")
    df_final = match_symbol.copy()
else:
    print("❌ FAILURE: No matches found. Atlas labels do not match GTF.")
    df_final = pd.DataFrame()

# 2. FEATURE ENGINEERING: Calculate GC Content
if not df_final.empty:
    print("   Calculating GC Content (this takes a moment)...")
    
    # Load Genome
    with gzip.open(GENOME_PATH, "rt") as handle:
        genome_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
        
    def get_gc(row):
        chrom = row['chrom']
        # Fix chr prefix issues (chr2L vs 2L)
        if chrom not in genome_dict:
            if f"chr{chrom}" in genome_dict: chrom = f"chr{chrom}"
            elif chrom.replace('chr', '') in genome_dict: chrom = chrom.replace('chr', '')
            else: return None
        
        # Extract Sequence (GTF coordinates are 1-based and inclusive, hence start-1)
        seq = str(genome_dict[chrom].seq[row['start']-1 : row['end']])
        if not seq: return 0.0
        return (seq.count('G') + seq.count('C') + seq.count('g') + seq.count('c')) / len(seq)

    df_final['gc_content'] = df_final.apply(get_gc, axis=1)
    df_final = df_final.dropna(subset=['gc_content'])
    
    # 3. SAVE
    os.makedirs(os.path.dirname(OUTPUT_MANIFEST), exist_ok=True)
    df_final.to_csv(OUTPUT_MANIFEST, index=False)
    print(f"🎉 SUCCESS! Manifest saved to: {OUTPUT_MANIFEST}")
    print(f"   Total Training Genes: {len(df_final)}")

🚀 Final Processing...
✅ MATCH FOUND: Using 'gene_symbol' (30 matches).
   Calculating GC Content (this takes a moment)...
🎉 SUCCESS! Manifest saved to: /Volumes/LaCie/Pracitce_Python_for_AIAP/genomic-decoder-fly/data/processed/training_manifest.csv
   Total Training Genes: 30
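
The GC calculation can be checked by hand on a short sequence. The sketch below uses hypothetical helper names (`slice_gene`, `gc_content_fraction`) and a made-up toy sequence. It mirrors the cell above in two respects: coordinates are treated as 1-based and inclusive, and every character in the slice, including `N`, counts toward the denominator.

```python
def slice_gene(chrom_seq, start, end):
    """Return the bases from `start` to `end`, both 1-based and inclusive."""
    return chrom_seq[start - 1:end]


def gc_content_fraction(sequence):
    """Fraction of G/C bases (0 to 1), counting soft-masked lowercase bases too."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


toy_chrom = "ATGCgcNNAT"                 # made-up 10-base "chromosome"
gene_seq = slice_gene(toy_chrom, 3, 6)   # bases 3..6 -> "GCgc"

print(gene_seq, gc_content_fraction(gene_seq))
print(gc_content_fraction(toy_chrom))    # 4 G/C bases out of 10
```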


# 🏁 Conclusion & Next Steps

**Status:** ✅ Phase 1 (Data Engineering) Complete.

We have built the foundation for the **Genomic Decoder**. By linking raw DNA sequence (Genome) to biological labels (Atlas), we now have a training manifest ready for machine learning. In this run it contains 30 genes: the subset of the 1838 Atlas labels that matched a gene symbol in the GTF.

### 🚀 My Endeavor: The Genomic Decoder Project
This notebook is just the first step in a larger ambition to decode the "language of life" using Artificial Intelligence. Here is the roadmap of my journey:

* [x] **Phase 1: Data Prep** - Map Genes to Genomic Coordinates & Calculate Features.
* [ ] **Phase 2: Model Architecture** - Build a Transformer-based model (DNABERT-style) to learn sequence patterns.
* [ ] **Phase 3: Training** - Teach the AI to predict gene expression from raw DNA.
* [ ] **Phase 4: Evaluation** - Benchmark the model against real biological ground truth.

**👉 Next Step:** Open `02_Model_Architecture.ipynb` to begin building the Neural Network.