# Introduction to Deep Learning 67822 - [Ex1](https://docs.google.com/document/d/11Q1ejfwTH_tHjdQob0gYLA3bS88lNsBStpBWz085rB0/edit?tab=t.0)
### NAME1 (ID1) & NAME2 (ID2)

### Section 1: Load and Prepare the Data
#### Split training data (from the .txt files)
We are training a model to classify 9-mer peptides based on whether they are detected by the immune system via specific HLA alleles. Each positive sample is associated with one of six common alleles. The negative samples are peptides not detected by any of the alleles.

When splitting the data into training and test sets, it’s crucial to avoid introducing bias. One tempting idea is to take the first 90% of each file for training and the last 10% for testing. However, this assumes that the peptide order inside each file is random — which may not be true. The files might be sorted by binding strength, similarity, or even alphabetically, which could skew the distribution.

To prevent such biases and ensure fair training and evaluation, we use a **stratified random split per allele**:

1. We load and shuffle the peptides from each positive allele file individually.
2. We split each file into a 90% training / 10% test set.
3. We do the same for the negative examples (from `negs.txt`).
4. Finally, we combine all subsets and shuffle them again.

This approach ensures that all alleles are represented in both the training and test sets, that the overall class balance between positive and negative samples is maintained, and that no ordering bias from the original files leaks into the learning process.
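The four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `split_per_source`, the seed, and the toy peptides are hypothetical and are not the contents of `dataset.py` or of the real `.txt` files.

```python
import random

def split_per_source(samples_by_source, test_fraction=0.1, seed=0):
    """Shuffle and split each source (allele file or negatives) separately, then merge."""
    rng = random.Random(seed)
    train, test = [], []
    for label, peptides in samples_by_source.items():
        shuffled = list(peptides)
        rng.shuffle(shuffled)  # remove any ordering present in the file
        n_test = round(len(shuffled) * test_fraction)
        test += [(p, label) for p in shuffled[:n_test]]
        train += [(p, label) for p in shuffled[n_test:]]
    # Shuffle again so samples are not grouped by source
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy data (hypothetical strings, not real peptides)
toy = {
    "A0201": [f"POS{i:02d}" for i in range(20)],
    "NEG": [f"NEG{i:02d}" for i in range(50)],
}
train, test = split_per_source(toy)
print(len(train), len(test))  # 63 7
```

Because every source is split on its own, each one contributes roughly 10% of its samples to the test set, which is what keeps the class proportions nearly identical in the two sets.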

In [1]:
from dataset import load_and_split_data

# Load and split data
datasets = load_and_split_data()

# Printing statistics will be handled inside the load_and_split_data function


--- A0207 Dataset ---
Train set size: 25024 (89.99%))
Test set size: 2782 (10.01%)

A0207 Train set distribution:
  NEG    (label 0): 22042 samples (88.08%)
  A0101  (label 1): 2982 samples (11.92%)

A0207 Test set distribution:
  NEG    (label 0): 2450 samples (88.07%)
  A0101  (label 1): 332 samples (11.93%)

--- A0101 Dataset ---
Train set size: 23184 (90.00%))
Test set size: 2577 (10.00%)

A0101 Train set distribution:
  NEG    (label 0): 22042 samples (95.07%)
  A0101  (label 1): 1142 samples (4.93%)

A0101 Test set distribution:
  NEG    (label 0): 2450 samples (95.07%)
  A0101  (label 1): 127 samples (4.93%)

--- A2402 Dataset ---
Train set size: 24028 (90.00%))
Test set size: 2671 (10.00%)

A2402 Train set distribution:
  NEG    (label 0): 22042 samples (91.73%)
  A0101  (label 1): 1986 samples (8.27%)

A2402 Test set distribution:
  NEG    (label 0): 2450 samples (91.73%)
  A0101  (label 1): 221 samples (8.27%)

--- A0201 Dataset ---
Train set size: 24394 (89.99%))
Test set s

### Section 2 

#### a. Peptide Representation

##### How would you represent these 9-mers of amino acids?

We considered two approaches:

**1. One-hot encoding:**  
Each amino acid is represented as a 20-dimensional one-hot vector. For a 9-mer peptide, this would require 180 input features. While this is straightforward, it’s sparse and does not capture biological similarities between amino acids.

**2. Embedding (used):**  
Instead, we map each amino acid to a dense embedding vector of size `d` (e.g., 4 or 8). This allows the model to learn meaningful representations during training, such as that hydrophobic or acidic amino acids may behave similarly.

Each peptide is converted to 9 indices (integers from 0–19), then embedded to get a `9 × d` matrix, which is then flattened for input to an MLP.
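As a small illustration of the index conversion, the sketch below maps a 9-mer to integer indices and, for comparison, to the flat one-hot vector from approach 1. The alphabetical ordering of the amino-acid letters is an assumption (the ordering used in our `dataset` module may differ), and the helper names are hypothetical.

```python
# Assumed ordering: the 20 standard one-letter codes, alphabetically
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def peptide_to_indices(peptide):
    """Convert a 9-mer string into a list of 9 integers in the range 0-19."""
    if len(peptide) != 9:
        raise ValueError("expected a 9-mer")
    return [AA_TO_INDEX[aa] for aa in peptide]

def one_hot(indices, n_classes=20):
    """Flat one-hot encoding: 9 positions x 20 classes = 180 features."""
    flat = []
    for idx in indices:
        row = [0] * n_classes
        row[idx] = 1
        flat.extend(row)
    return flat

indices = peptide_to_indices("SLYNTVATL")  # an arbitrary example 9-mer
flat = one_hot(indices)
print(indices)    # [15, 9, 19, 11, 16, 17, 0, 16, 9]
print(len(flat))  # 180
```

An embedding layer consumes the index list directly, so the sparse 180-dimensional vector never has to be built.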

##### How would you represent the associated alleles?

Each positive sample comes from a known allele, and each negative sample is from none. We label:
- `0` → NEG (non-detecting)
- `1–6` → Alleles A0101 to A2402

This forms a 7-class multi-class classification problem.

## Modified Data Processing Approach

We've updated our data processing to create 6 binary classification datasets, one for each allele. Each dataset includes:
- Positive examples: peptides binding to that specific allele (labeled as 1)
- Negative examples: peptides not binding to any allele (labeled as 0)

This approach allows us to train specialized models for each allele, potentially improving performance compared to a single multi-class model. The processed data is organized in a dictionary where each key is an allele name, and each value contains the training and testing tensors for that allele.
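The idea of one binary dataset per allele can be sketched as follows. This is a simplified illustration, not the code in `dataset.py`: the function name `build_binary_datasets` and the toy peptides are hypothetical, and the train/test split and tensor conversion are left out.

```python
def build_binary_datasets(positives_by_allele, negatives):
    """Pair each allele's positives (label 1) with the shared negatives (label 0)."""
    datasets = {}
    for allele, positives in positives_by_allele.items():
        samples = [(p, 1) for p in positives] + [(p, 0) for p in negatives]
        datasets[allele] = samples
    return datasets

# Toy data (hypothetical peptides)
positives_by_allele = {
    "A0101": ["AAAAAAAAA", "CCCCCCCCC"],
    "A0201": ["DDDDDDDDD"],
}
negatives = ["EEEEEEEEE", "FFFFFFFFF", "GGGGGGGGG"]

binary = build_binary_datasets(positives_by_allele, negatives)
for allele, samples in binary.items():
    n_pos = sum(label for _, label in samples)
    print(allele, len(samples), n_pos)
```

The same negatives are reused in every dataset, so the positive fraction differs from allele to allele depending on how many positives that allele has.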

In [2]:
from dataset import prepare_data

# Prepare tensor datasets for all alleles
# Our datasets variable is already in the format {allele: (train_data, test_data)}
processed_datasets = prepare_data(datasets)
# Create a dictionary to store alleles and their data in a more organized way
allele_data = {}
for allele, (X_train_allele, y_train_allele, X_test_allele, y_test_allele) in processed_datasets.items():
    allele_data[allele] = {
        'X_train': X_train_allele,
        'y_train': y_train_allele,
        'X_test': X_test_allele,
        'y_test': y_test_allele
    }


--- Processing A0207 Dataset ---
A0207 X_train shape: torch.Size([25024, 9])
A0207 y_train shape: torch.Size([25024])
A0207 X_test shape: torch.Size([2782, 9])
A0207 y_test shape: torch.Size([2782])

A0207 example input (peptide indices): tensor([15, 17, 17,  5,  7, 12, 17,  4, 14])
A0207 corresponding label (binary): tensor(0)

--- Processing A0101 Dataset ---
A0101 X_train shape: torch.Size([23184, 9])
A0101 y_train shape: torch.Size([23184])
A0101 X_test shape: torch.Size([2577, 9])
A0101 y_test shape: torch.Size([2577])

A0101 example input (peptide indices): tensor([ 2, 15, 15, 12, 18, 15,  0,  5,  8])
A0101 corresponding label (binary): tensor(0)

--- Processing A2402 Dataset ---
A2402 X_train shape: torch.Size([24028, 9])
A2402 y_train shape: torch.Size([24028])
A2402 X_test shape: torch.Size([2671, 9])
A2402 y_test shape: torch.Size([2671])

A2402 example input (peptide indices): tensor([ 0, 16,  9, 11, 15,  7, 17,  0,  8])
A2402 corresponding label (binary): tensor(0)

--- Pr

TEST PER PEPTIDE

In [3]:
from model import PeptideToHLAClassifier
from training import create_data_loaders, setup_training, train_model
from hyperparameters import allelse_data_hyperparameters


to_plot = {}
for selected_allele in allele_data.keys():

    print(f"--- Processing {selected_allele} Dataset ---")
    
    # Create model instance
    model = PeptideToHLAClassifier(
        emb_dim=allelse_data_hyperparameters[selected_allele]['EMB_DIM'],
        fc_hidden_dim=allelse_data_hyperparameters[selected_allele]['FC_HIDDEN_DIM'],
    )

    # Extract data for the selected allele
    X_train = allele_data[selected_allele]['X_train']
    y_train = allele_data[selected_allele]['y_train']
    X_test = allele_data[selected_allele]['X_test']
    y_test = allele_data[selected_allele]['y_test']

    # Create data loaders
    train_loader, test_loader = create_data_loaders(X_train, y_train, X_test, y_test)

    # Setup loss function and optimizer
    loss_fn, optimizer = setup_training(
        model=model,
        loss_function=allelse_data_hyperparameters[selected_allele]['loss_function'],
        learning_rate=allelse_data_hyperparameters[selected_allele]['LEARNING_RATE'],
        y_train=y_train)
    
    # Log the hyperparameters for this run, then train the model
    print(f"emb_dim: {allelse_data_hyperparameters[selected_allele]['EMB_DIM']}, "
          f"fc_hidden_dim: {allelse_data_hyperparameters[selected_allele]['FC_HIDDEN_DIM']}, "
          f"epochs: {allelse_data_hyperparameters[selected_allele]['EPOCHS']}, "
          f"batch_size: {allelse_data_hyperparameters[selected_allele]['BATCH_SIZE']}, "
          f"loss_function: {allelse_data_hyperparameters[selected_allele]['loss_function']}, "
          f"learning_rate: {allelse_data_hyperparameters[selected_allele]['LEARNING_RATE']}, "
          f"threshold: {allelse_data_hyperparameters[selected_allele]['THRESHOLD']}, "
          f"train_size: {len(X_train)}, "
          f"test_size: {len(X_test)}, "
          f"optimizer: {optimizer.__class__.__name__}"
          )

    train_losses, test_losses, accuracies = train_model(
        model=model,
        train_loader=train_loader,
        test_loader=test_loader,
        loss_fn=loss_fn,
        optimizer=optimizer,
        epochs=allelse_data_hyperparameters[selected_allele]['EPOCHS'],
        threshold=allelse_data_hyperparameters[selected_allele]['THRESHOLD'],
    )
    to_plot[selected_allele] = {
        'train_losses': train_losses,
        'test_losses': test_losses,
        'accuracies': accuracies,
        'epochs': allelse_data_hyperparameters[selected_allele]['EPOCHS'],
    }


--- Processing A0207 Dataset ---
emb_dim: 32, fc_hidden_dim: 128, epochs: 20, batch_size: 64, loss_function: BCELoss, learning_rate: 0.001, threshold: 0.5, train_size: 25024, test_size: 2782, optimizer: Adam
Epoch  1/20 | Train Loss: 0.2949 | Test Loss: 0.2108 | Accuracy: 89.11%
Epoch  2/20 | Train Loss: 0.2145 | Test Loss: 0.1986 | Accuracy: 89.47%
Epoch  3/20 | Train Loss: 0.2035 | Test Loss: 0.1911 | Accuracy: 89.68%
Epoch  4/20 | Train Loss: 0.1940 | Test Loss: 0.1932 | Accuracy: 89.04%
Epoch  5/20 | Train Loss: 0.1926 | Test Loss: 0.1918 | Accuracy: 88.93%
Epoch  6/20 | Train Loss: 0.1857 | Test Loss: 0.1886 | Accuracy: 89.61%
Epoch  7/20 | Train Loss: 0.1837 | Test Loss: 0.1870 | Accuracy: 89.47%
Epoch  8/20 | Train Loss: 0.1786 | Test Loss: 0.1905 | Accuracy: 89.79%
Epoch  9/20 | Train Loss: 0.1750 | Test Loss: 0.1903 | Accuracy: 89.29%
Epoch 10/20 | Train Loss: 0.1736 | Test Loss: 0.1894 | Accuracy: 88.96%
Epoch 11/20 | Train Loss: 0.1729 | Test Loss: 0.1932 | Accuracy: 89.47%


#### b. Model

##### What will the network’s input dimension be?
With embeddings of size `d` and peptides of length 9, the input dimension is `9 × d`.  
For example, using `d = 4`, the input to the MLP is of size 36.

##### Implement an MLP that keeps this dimension for 2 inner layers
We construct a small feedforward neural network (MLP) with the following layers:
- **Embedding layer:** Maps 20 amino acid types to `d`-dimensional learnable vectors.
- **Flatten layer:** Concatenates the 9 embedded amino acids into a single vector of size `9 × d`.
- **Two hidden layers:** Fully connected, both using the same dimension (`9 × d`) with ReLU activations.
- **Output layer:** A linear layer with 7 outputs, representing the 7 classification labels (6 alleles + NEG).

We use `CrossEntropyLoss` as our loss function, and the `Adam` optimizer. During training, we track both training and validation loss.
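A minimal PyTorch sketch of the architecture listed above is given here for orientation. The class name `PeptideMLP`, the default `emb_dim=4`, the learning rate, and the random batch are assumptions made for the example; this is not the `PeptideToHLAClassifier` imported from `model.py`.

```python
import torch
from torch import nn

class PeptideMLP(nn.Module):
    """Hypothetical sketch: embedding, flatten, two hidden layers of width 9*d, 7 outputs."""

    def __init__(self, emb_dim=4, peptide_len=9, n_amino_acids=20, n_classes=7):
        super().__init__()
        width = peptide_len * emb_dim
        self.embedding = nn.Embedding(n_amino_acids, emb_dim)
        self.net = nn.Sequential(
            nn.Flatten(),                 # (batch, 9, d) -> (batch, 9*d)
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, n_classes),  # raw logits; CrossEntropyLoss applies log-softmax
        )

    def forward(self, x):
        return self.net(self.embedding(x))

model = PeptideMLP(emb_dim=4)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed learning rate

# One training step on a random batch, only to show the shapes involved
batch = torch.randint(0, 20, (8, 9))   # 8 peptides, 9 amino-acid indices each
labels = torch.randint(0, 7, (8,))     # 7 classes: NEG plus 6 alleles
logits = model(batch)
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(logits.shape)  # torch.Size([8, 7])
```

The output layer returns raw logits because `CrossEntropyLoss` expects unnormalized scores and integer class labels.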

##### Defining the MLP Model

##### Loss & Optimization 

##### Training Loop

##### Does the input dimension cause training problems?
In our setup, each amino acid is embedded into a small vector (e.g. 4D), so a peptide of length 9 becomes a 36D input vector (`9 × 4`), and the hidden layers also use this dimension.

This is a relatively low-dimensional space (especially compared to one-hot encoding with 180 features). The network trains quickly and converges within a few epochs to 70% accuracy. No numerical instability is observed.
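To make the size difference concrete, the arithmetic below counts the trainable parameters of an MLP whose two hidden layers keep the input dimension, once for the embedded input (`d = 4`) and once for a one-hot input. The helper `mlp_parameter_count` is hypothetical, and the count assumes biases in every linear layer and 7 output classes.

```python
def mlp_parameter_count(input_dim, n_classes=7, embedding_params=0):
    """Parameters of: two square hidden layers + output layer (+ optional embedding table)."""
    hidden = 2 * (input_dim * input_dim + input_dim)  # weights + biases, twice
    output = input_dim * n_classes + n_classes
    return embedding_params + hidden + output

# Embedded input: 9 * 4 = 36 features, plus a 20 x 4 embedding table
embedded = mlp_parameter_count(9 * 4, embedding_params=20 * 4)
# One-hot input: 9 * 20 = 180 features, no embedding table
one_hot_count = mlp_parameter_count(9 * 20)

print(embedded)       # 3003
print(one_hot_count)  # 66427
```

Under these assumptions the embedded model is roughly 22 times smaller, which fits the fast training, but it also means the network has far less capacity.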

**Conclusion:** The embedding-based representation lets the model learn efficiently, without struggling with high-dimensional sparse inputs, but only up to 70% accuracy. That is not much for real-world peptides (if this MLP were used for real vaccine manufacturing, 70% would not be enough!). So yes, the input dimension definitely causes training problems.

##### Plot the resulting train and test losses.

In [4]:
from evaluation import plot_training_results

# Generate plots
for selected_allele, data in to_plot.items():
    train_losses = data['train_losses']
    test_losses = data['test_losses']
    accuracies = data['accuracies']
    epochs = data['epochs']
    plot_training_results(selected_allele, train_losses, test_losses, accuracies, epochs)



#### c. Model without constraints