# 🧠 02. Model Architecture
**Objective:** Build the Neural Network structure (DNA-BERT).

**What this notebook does:**
1.  **Tokenization:** Creates a custom tokenizer to convert DNA ("ATCG") into numbers.
2.  **Dataset Class:** Wraps our `training_manifest.csv` into a PyTorch-ready data loader.
3.  **Model Design:** Defines the Transformer architecture (Encoder-only, like BERT).
4.  **Verification:** Runs a "Forward Pass" on a single batch of data to prove the shapes align.

### 1. Environment & Device Setup
**Purpose:** Initialize the deep learning environment.
**How it works:**
* **Device Detection:** Checks for a hardware accelerator (such as `mps` on a Mac with an M-series chip) and falls back to the standard CPU (`cpu`) when none is found, so the model runs on the fastest backend available.
* **Path Configuration:** Dynamically finds your `data/processed` folder so we can load the manifest we created in the previous notebook.
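
The path logic can also be written with `pathlib` and checked before anything is loaded. This is a minimal sketch, assuming the notebook lives in a folder directly under the project root; `resolve_paths` is a hypothetical helper, not part of this project.

```python
from pathlib import Path


def resolve_paths(notebook_dir):
    """Derive project paths from the notebook directory.

    Assumes the notebook folder sits directly under the project root.
    """
    project_root = Path(notebook_dir).resolve().parent
    return {
        "project_root": project_root,
        "manifest": project_root / "data" / "processed" / "training_manifest.csv",
        "src": project_root / "src",
    }


paths = resolve_paths(Path.cwd())
if not paths["manifest"].exists():
    # Fail early with a readable message instead of a pandas error later on
    print(f"Manifest not found at {paths['manifest']}; run the previous notebook first.")
```

Checking for the manifest up front makes a missing file show up here rather than deep inside the dataset class.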

In [6]:
import os
import sys
import torch
import torch.nn as nn
import pandas as pd
import numpy as np

# --- CONFIGURATION ---
# Pick the best available device: cuda (NVIDIA), mps (Apple silicon), cpu otherwise
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"🚀 Using Device: {device}")

# Paths
NOTEBOOK_DIR = os.getcwd()
PROJECT_ROOT = os.path.dirname(NOTEBOOK_DIR)
DATA_PATH = os.path.join(PROJECT_ROOT, 'data', 'processed', 'training_manifest.csv')
SRC_DIR = os.path.join(PROJECT_ROOT, 'src')

# Add src to path
if SRC_DIR not in sys.path:
    sys.path.insert(0, SRC_DIR)

🚀 Using Device: cpu


### 2. The DNA Tokenizer
**Purpose:** Translate biological language (DNA) into machine language (Numbers).
**How it works:**
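* **Vocabulary:** We define a simple map: `A=2`, `T=3`, `C=4`, `G=5`, `N=6`. ID `0` is reserved for padding (PAD) and `1` for any unknown character (UNK).
* **Encoding:** The `encode` function takes a string like `"ATCG"`, looks up the numbers, and returns a Tensor `[2, 3, 4, 5]` (followed by padding up to `max_length`).
* **Padding:** Since the model is fed fixed-size inputs (`max_length`, e.g., exactly 2000 bp in the dataset below), longer sequences are truncated and any empty space in shorter ones is filled with `0` (PAD).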

In [7]:
class DNATokenizer:
    def __init__(self):
        # K-mer tokenization (using k=1 for simplicity, can upgrade to k=3 later)
        self.vocab = {'PAD': 0, 'UNK': 1, 'A': 2, 'T': 3, 'C': 4, 'G': 5, 'N': 6}
        self.pad_token_id = 0
    
    def encode(self, seq, max_length=1000):
        """Converts a sequence string to a fixed-length tensor of token IDs."""
        # Truncate if too long
        seq = seq[:max_length]
        
        # Map chars to ints
        ids = [self.vocab.get(char, self.vocab['UNK']) for char in seq.upper()]
        
        # Padding
        if len(ids) < max_length:
            ids += [self.pad_token_id] * (max_length - len(ids))
            
        return torch.tensor(ids, dtype=torch.long)

    def decode(self, ids):
        """Converts integers back to string (for checking)."""
        rev_vocab = {v: k for k, v in self.vocab.items()}
        chars = [rev_vocab.get(i.item(), '?') for i in ids if i.item() != self.pad_token_id]
        return "".join(chars)

# Test it
tokenizer = DNATokenizer()
test_seq = "ATCGGGCTA"
encoded = tokenizer.encode(test_seq, max_length=10)
decoded = tokenizer.decode(encoded)

print(f"Original: {test_seq}")
print(f"Encoded:  {encoded}")
print(f"Decoded:  {decoded}")

Original: ATCGGGCTA
Encoded:  tensor([2, 3, 4, 5, 5, 5, 4, 3, 2, 0])
Decoded:  ATCGGGCTA
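
The comment in `DNATokenizer` mentions a possible upgrade from `k=1` to `k=3`. A minimal sketch of what that could look like, assuming overlapping k-mers with a stride of 1 and a vocabulary built from all 4^k combinations of `ACGT`. The names `build_kmer_vocab` and `kmer_encode` are hypothetical and not part of this notebook.

```python
from itertools import product


def build_kmer_vocab(k=3, alphabet="ACGT"):
    """Map every possible k-mer to an integer ID; 0 and 1 stay reserved."""
    vocab = {"PAD": 0, "UNK": 1}
    for i, kmer in enumerate(product(alphabet, repeat=k), start=2):
        vocab["".join(kmer)] = i
    return vocab


def kmer_encode(seq, vocab, k=3, max_length=10):
    """Split seq into overlapping k-mers, map to IDs, then truncate or pad."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Any k-mer containing a character outside the alphabet (e.g. N) becomes UNK
    ids = [vocab.get(km, vocab["UNK"]) for km in kmers][:max_length]
    ids += [vocab["PAD"]] * (max_length - len(ids))
    return ids


vocab = build_kmer_vocab(k=3)
print(len(vocab))  # 66 = 64 k-mers + PAD + UNK
print(kmer_encode("ATCGGGCTA", vocab, k=3, max_length=10))
```

With `k=3` the vocabulary grows from 7 to 66 entries, so the model's `vocab_size` would need to change to match.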


### 3. The Genomic Dataset (The Pipeline)
**Purpose:** Stream data from disk to the model during training.
**How it works:**
* **Lazy Loading:** Instead of extracting and tokenizing all 100,000 sequences up front, we load the genome into memory once, indexed by chromosome name.
* **On-the-Fly Extraction:** When the model asks for "Item #42", this class looks up the row in the CSV, slices the exact coordinates out of the in-memory genome, tokenizes the sequence, and hands it to the model. Only the genome itself is held in RAM; tokenized sequences are produced one at a time rather than stored.
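
The lookup step can be illustrated on a toy genome. This is a sketch only: it assumes the manifest's `start` and `end` are 1-based and inclusive (which is what the `start - 1 : end` slice in the dataset class implies), and `normalize_chrom` and `extract` are hypothetical helper names.

```python
def normalize_chrom(chrom, genome):
    """Return the key under which chrom is stored, tolerating a 'chr' prefix mismatch."""
    chrom = str(chrom)
    if chrom in genome:
        return chrom
    if f"chr{chrom}" in genome:
        return f"chr{chrom}"
    if chrom.startswith("chr") and chrom[3:] in genome:
        return chrom[3:]
    raise KeyError(f"Chromosome {chrom!r} not found in genome")


def extract(genome, chrom, start, end):
    """Slice a region given 1-based inclusive coordinates (assumed convention)."""
    key = normalize_chrom(chrom, genome)
    return genome[key][int(start) - 1:int(end)]


# A plain dict of strings stands in for the SeqIO dictionary
toy_genome = {"chr2L": "ACGTACGTAC"}
print(extract(toy_genome, "2L", 3, 6))  # GTAC
```

Raising a clear `KeyError` for an unknown chromosome makes a manifest/genome mismatch easier to diagnose than a failure inside the slice.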

In [8]:
from torch.utils.data import Dataset, DataLoader
from Bio import SeqIO
import gzip

class GenomicDataset(Dataset):
    def __init__(self, manifest_path, genome_path, tokenizer, max_length=2000):
        self.manifest = pd.read_csv(manifest_path)
        self.tokenizer = tokenizer
        self.max_length = max_length
        
        # Load Genome into Memory (Fast lookup)
        print("Loading Genome for fast lookup...")
        with gzip.open(genome_path, "rt") as handle:
            self.genome = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
            
    def __len__(self):
        return len(self.manifest)
    
    def __getitem__(self, idx):
        row = self.manifest.iloc[idx]
        
        # Extract Sequence
        chrom = str(row['chrom'])
        # Handle chromosome naming mismatches (e.g. "2L" vs "chr2L")
        if chrom not in self.genome:
            if f"chr{chrom}" in self.genome:
                chrom = f"chr{chrom}"
            elif chrom.replace('chr', '') in self.genome:
                chrom = chrom.replace('chr', '')
        
        # Get raw string
        raw_seq = str(self.genome[chrom].seq[row['start']-1 : row['end']])
        
        # Tokenize
        input_ids = self.tokenizer.encode(raw_seq, self.max_length)
        
        # Target (Label): For now, we predict GC content as a dummy task
        # Later, this will be expression levels.
        label = torch.tensor(row['gc_content'], dtype=torch.float32)
        
        return input_ids, label

# Initialize Dataset
GENOME_PATH = os.path.join(PROJECT_ROOT, 'data', 'raw', 'dm6.fa.gz')
dataset = GenomicDataset(DATA_PATH, GENOME_PATH, tokenizer)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# Test a batch
sample_inputs, sample_labels = next(iter(dataloader))
print(f"Batch Shape: {sample_inputs.shape}") # Should be [4, 2000]

Loading Genome for fast lookup...
Batch Shape: torch.Size([4, 2000])


### 4. The Model: DNA-BERT
**Purpose:** The "Brain" of our project. A Transformer-based architecture designed to learn patterns in DNA sequences.
**Structure:**
1.  **Embedding Layer:** Converts integer tokens (e.g., `5` for G) into rich vectors (lists of 128 numbers) representing that base's properties.
2.  **Positional Encoding:** Adds information about *order*. In DNA, "ATG" (Start) is very different from "TGA" (Stop), so the model needs to know which letter comes first.
3.  **Transformer Encoder:** The heavy lifter. It uses "Self-Attention" to let every position look at every other position in the sequence, so the model can relate bases to one another even if they are far apart.
4.  **Regressor Head:** A final simple layer that boils the complex understanding down to a single number (Prediction).

In [9]:
class DNABert(nn.Module):
    def __init__(self, vocab_size=7, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        
        # 1. Embeddings: Convert IDs (0,1,2..) to Vectors
        self.embedding = nn.Embedding(vocab_size, d_model)
        
        # 2. Positional Encoding (Simple learned version)
        self.pos_encoding = nn.Parameter(torch.zeros(1, 5000, d_model))
        
        # 3. Transformer Encoder
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        
        # 4. Regression Head (Predicts a single number: GC/Expression)
        self.regressor = nn.Linear(d_model, 1)
        
    def forward(self, x):
        # x shape: [batch_size, seq_len]
        
        # Add embedding + pos encoding
        x = self.embedding(x) + self.pos_encoding[:, :x.size(1), :]
        
        # Pass through Transformer
        x = self.transformer(x)
        
        # Average Pool (Combine all positions, PAD included, into one vector)
        x = x.mean(dim=1)
        
        # Final Prediction
        x = self.regressor(x)
        # Drop only the last dim so a batch of 1 still yields shape [1]
        return x.squeeze(-1)

# Instantiate Model
model = DNABert().to(device)
print(model)

DNABert(
  (embedding): Embedding(7, 128)
  (transformer): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=128, bias=True)
        (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (regressor): Linear(in_features=128, out_features=1, bias=True)
)
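
One thing the model above does not do is tell the Transformer which positions are padding, so PAD tokens take part in attention and in the average pool. A possible refinement is sketched here, assuming PAD has ID `0` as in `DNATokenizer`. `MaskedDNAEncoder` is a hypothetical variant, not the model used in this notebook. It passes `src_key_padding_mask` to the encoder and averages over real bases only.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumed to match DNATokenizer.pad_token_id


class MaskedDNAEncoder(nn.Module):
    """Hypothetical DNABert variant that ignores PAD positions."""

    def __init__(self, vocab_size=7, d_model=128, nhead=4, num_layers=2, max_len=5000):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=PAD_ID)
        self.pos_encoding = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(
            layer, num_layers=num_layers, enable_nested_tensor=False
        )
        self.regressor = nn.Linear(d_model, 1)

    def forward(self, ids):
        pad_mask = ids.eq(PAD_ID)  # [batch, seq_len], True where PAD
        h = self.embedding(ids) + self.pos_encoding[:, :ids.size(1), :]
        # True entries in src_key_padding_mask are excluded as attention keys
        h = self.transformer(h, src_key_padding_mask=pad_mask)
        # Masked mean: sum over real positions, divide by their count
        keep = (~pad_mask).unsqueeze(-1).to(h.dtype)
        pooled = (h * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return self.regressor(pooled).squeeze(-1)


masked_model = MaskedDNAEncoder(d_model=32, nhead=4, num_layers=1).eval()
ids = torch.tensor([[2, 3, 4, 5, 0, 0], [2, 2, 3, 3, 4, 5]])
with torch.no_grad():
    print(masked_model(ids).shape)  # torch.Size([2])
```

With the mask in place, the prediction for a sequence should no longer depend on how much padding follows it. A row that consists entirely of PAD is not handled by this sketch and would need a separate guard.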


### 5. Verification: The "Dry Run"
**Purpose:** Safety check.
**How it works:**
* We grab a single batch of real data from our new DataLoader.
* We push it through the untrained model.
* **Goal:** We want to confirm that it does *not* crash. If we see "Forward Pass Successful" and a predicted number (even if it's random/wrong), it means the "plumbing" is connected correctly. We are ready to train.
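
The same check can be made a little stricter by disabling gradients and asserting on the output shape instead of reading it by eye. A minimal sketch, assuming a model that returns one prediction per sequence. `dry_run` and the stand-in `TinyRegressor` are hypothetical names used only to keep the example self-contained.

```python
import torch
import torch.nn as nn


def dry_run(model, inputs):
    """Run one forward pass without gradients and validate the output."""
    model.eval()  # turn off dropout so the check is deterministic
    with torch.no_grad():
        outputs = model(inputs)
    assert outputs.shape == (inputs.size(0),), f"unexpected shape {tuple(outputs.shape)}"
    assert torch.isfinite(outputs).all(), "non-finite prediction"
    return outputs


class TinyRegressor(nn.Module):
    """Stand-in model: embed, average over positions, project to one number."""

    def __init__(self, vocab_size=7, d_model=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        return self.head(self.embedding(x).mean(dim=1)).squeeze(-1)


fake_batch = torch.randint(0, 7, (4, 2000))  # same shape as a real batch
print(dry_run(TinyRegressor(), fake_batch).shape)  # torch.Size([4])
```

Remember to call `model.train()` again before the training loop, since `dry_run` leaves the model in evaluation mode.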

In [10]:
print("🧪 Starting Dry Run...")

# 1. Get Batch
inputs, targets = next(iter(dataloader))
inputs = inputs.to(device)

# 2. Pass through Model
outputs = model(inputs)

# 3. Check Outputs
print("\n✅ Forward Pass Successful!")
print(f"Input Shape:  {inputs.shape}")
print(f"Output Shape: {outputs.shape}")
print(f"Example Prediction: {outputs[0].item():.4f}")
print(f"Actual Target:      {targets[0].item():.4f}")

print("\n🚀 Ready for Training Loop in Notebook 03!")

🧪 Starting Dry Run...

✅ Forward Pass Successful!
Input Shape:  torch.Size([4, 2000])
Output Shape: torch.Size([4])
Example Prediction: 0.3524
Actual Target:      0.4741

🚀 Ready for Training Loop in Notebook 03!


# ✅ Architecture Verified

**Status:** Ready for Training.

We have successfully:
1.  **Built the Engine:** The DNA-BERT model is defined and instantiates without errors.
2.  **Connected the Fuel:** The Data Loader is successfully pulling DNA sequences from our raw genome file and tokenizing them.
3.  **Test Start:** The "Dry Run" confirmed that data flows through the model and produces a prediction output.

### 🚀 Next Step: Notebook 03
Now that the car is built and the fuel line is connected, it's time to teach it how to drive.
In **`03_Model_Training.ipynb`**, we will:
* Define the **Loss Function** (how the model measures its mistakes).
* Set up the **Optimizer** (how the model learns).
* Run the **Training Loop** to actually teach the AI to predict GC content.