# Introduction to Deep Learning 67822 - [Ex1](https://docs.google.com/document/d/11Q1ejfwTH_tHjdQob0gYLA3bS88lNsBStpBWz085rB0/edit?tab=t.0)
### NAME1 (ID1) & NAME2 (ID2)

### Section 1: Load and Prepare the Data
#### Split training data (from the .txt files)
We are training a model to classify 9-mer peptides based on whether they are detected by the immune system via specific HLA alleles. Each positive sample is associated with one of six common alleles. The negative samples are peptides not detected by any of the alleles.

When splitting the data into training and test sets, it’s crucial to avoid introducing bias. One tempting idea is to take the first 90% of each file for training and the last 10% for testing. However, this assumes that the peptide order inside each file is random, which may not be true. The files might be sorted by binding strength, similarity, or even alphabetically, and any such ordering would make the test set unrepresentative of the training set.

To prevent such biases and ensure fair training and evaluation, we use a **stratified random split per allele**:

1. We load and shuffle the peptides from each positive allele file individually.
2. We split each file into a 90% training / 10% test set.
3. We do the same for the negative examples (from `negs.txt`).
4. Finally, we combine all subsets and shuffle them again.

This approach ensures that all alleles are represented in both the training and test sets, that the overall balance between positive and negative samples is maintained, and that no ordering bias from the original files leaks into the learning process.
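The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code inside `load_and_split_data`: the function name `stratified_split`, its parameters, and the toy data are hypothetical, and the rounding rule (truncating the training size) is a guess.

```python
import random

def stratified_split(samples_by_label, train_frac=0.9, seed=0):
    """Shuffle each label's samples, split them, then shuffle the merged sets."""
    rng = random.Random(seed)              # fixed seed so the split is reproducible
    train, test = [], []
    for label, samples in samples_by_label.items():
        shuffled = list(samples)           # copy so the caller's list is untouched
        rng.shuffle(shuffled)              # removes any ordering in the source file
        n_train = int(len(shuffled) * train_frac)   # assumed rounding: truncate
        train += [(s, label) for s in shuffled[:n_train]]
        test += [(s, label) for s in shuffled[n_train:]]
    rng.shuffle(train)                     # final shuffle mixes the labels together
    rng.shuffle(test)
    return train, test

# Toy data standing in for the parsed .txt files (hypothetical values)
toy = {0: [f"NEG{i}" for i in range(50)], 1: [f"POS{i}" for i in range(20)]}
train_set, test_set = stratified_split(toy)
print(len(train_set), len(test_set))
```

Because the split is done per label before merging, each label keeps the same 90/10 ratio regardless of how many samples it has.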

In [1]:
from dataset import load_and_split_data

# Load and split data
train_data, test_data = load_and_split_data()

# Printing statistics will be handled inside the load_and_split_data function

Train set size: 33642 (89.99%)
Test set size: 3741 (10.01%)

Train set distribution:
  NEG    (label 0): 22042 samples (65.52%)
  A0101  (label 1): 1142 samples (3.39%)
  A0201  (label 2): 2352 samples (6.99%)
  A0203  (label 3): 1645 samples (4.89%)
  A0207  (label 4): 2982 samples (8.86%)
  A0301  (label 5): 1493 samples (4.44%)
  A2402  (label 6): 1986 samples (5.90%)

Test set distribution:
  NEG    (label 0): 2450 samples (65.49%)
  A0101  (label 1): 127 samples (3.39%)
  A0201  (label 2): 262 samples (7.00%)
  A0203  (label 3): 183 samples (4.89%)
  A0207  (label 4): 332 samples (8.87%)
  A0301  (label 5): 166 samples (4.44%)
  A2402  (label 6): 221 samples (5.91%)


### Section 2 

#### a. Peptide Representation

##### How would you represent these 9-mers of amino acids?

We considered two approaches:

**1. One-hot encoding:**  
Each amino acid is represented as a 20-dimensional one-hot vector. For a 9-mer peptide, this would require 180 input features. While this is straightforward, it’s sparse and does not capture biological similarities between amino acids.

**2. Embedding (used):**  
Instead, we map each amino acid to a dense embedding vector of size `d` (e.g., 4 or 8). This allows the model to learn meaningful representations during training, for example that hydrophobic amino acids, or acidic ones, may behave similarly to each other.

Each peptide is converted to 9 indices (integers from 0–19), then embedded to get a `9 × d` matrix, which is then flattened for input to an MLP.
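A minimal sketch of the two encodings, in plain Python. The alphabetical ordering of the 20 one-letter codes is an assumption (the index mapping used by `prepare_data` is not shown here), the helper names are hypothetical, and the peptide `"ACDEFGHIK"` is a made-up example.

```python
# Assumed ordering: the 20 standard amino acids, alphabetical by one-letter code
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_indices(peptide):
    """Map a 9-mer string to 9 integer indices in [0, 19]."""
    if len(peptide) != 9:
        raise ValueError(f"expected a 9-mer, got length {len(peptide)}")
    return [AA_TO_INDEX[aa] for aa in peptide]

def encode_one_hot(peptide):
    """Flattened one-hot encoding: 9 positions x 20 residues = 180 features."""
    n = len(AMINO_ACIDS)
    vec = [0] * (9 * n)
    for pos, idx in enumerate(encode_indices(peptide)):
        vec[pos * n + idx] = 1             # exactly one active feature per position
    return vec

indices = encode_indices("ACDEFGHIK")      # made-up peptide
one_hot = encode_one_hot("ACDEFGHIK")
print(indices)                             # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(len(one_hot), sum(one_hot))          # 180 9
```

The index form is what an embedding layer consumes; the one-hot form shows why that representation is sparse, with only 9 of its 180 entries non-zero.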

##### How would you represent the associated alleles?

Each positive sample comes from a known allele, and each negative sample is associated with none. We label:
- `0` → NEG (non-detecting)
- `1–6` → Alleles A0101 to A2402

This forms a 7-class multi-class classification problem.
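The label scheme can be written down as a small lookup table. The allele order follows the distribution printed in Section 1; the names `ALLELES`, `LABELS`, and `NUM_CLASSES` are illustrative and need not match the identifiers in `dataset.py`.

```python
# Allele order taken from the printed train/test distribution
ALLELES = ["A0101", "A0201", "A0203", "A0207", "A0301", "A2402"]

# NEG gets label 0, the six alleles get labels 1..6
LABELS = {"NEG": 0, **{name: i for i, name in enumerate(ALLELES, start=1)}}
NUM_CLASSES = len(LABELS)

print(LABELS)
print(NUM_CLASSES)
```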

In [2]:
from dataset import prepare_data

# Prepare tensor datasets
X_train, y_train, X_test, y_test = prepare_data(train_data, test_data)

X_train shape: torch.Size([33642, 9])
y_train shape: torch.Size([33642])
X_test shape: torch.Size([3741, 9])
y_test shape: torch.Size([3741])

Example input (peptide indices): tensor([17, 10,  9, 12, 16, 12, 17,  9, 14])
Corresponding label (allele class): tensor(0)


#### b. Model

##### What will the network’s input dimension be?
With embeddings of size `d` and peptides of length 9, the input dimension is `9 × d`.  
For example, using `d = 4`, the input to the MLP is of size 36.

##### Implement an MLP that keeps this dimension for 2 inner layers
We construct a small feedforward neural network (MLP) with the following layers:
- **Embedding layer:** Maps 20 amino acid types to `d`-dimensional learnable vectors.
- **Flatten layer:** Concatenates the 9 embedded amino acids into a single vector of size `9 × d`.
- **Two hidden layers:** Fully connected, both using the same dimension (`9 × d`) with ReLU activations.
- **Output layer:** A linear layer with 7 outputs, representing the 7 classification labels (6 alleles + NEG).
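The layer list above could look roughly like the following PyTorch module. This is a sketch of the described architecture, not the actual `PeptideClassifier2b` from `model.py`; the class name `PeptideMLPSketch` and its constructor parameters are hypothetical.

```python
import torch
from torch import nn

class PeptideMLPSketch(nn.Module):
    """Embedding -> flatten -> two hidden layers of width 9*d -> 7 logits."""

    def __init__(self, emb_dim=4, peptide_len=9, num_amino_acids=20, num_classes=7):
        super().__init__()
        hidden = peptide_len * emb_dim                   # 9 * d, kept for both hidden layers
        self.embedding = nn.Embedding(num_amino_acids, emb_dim)
        self.net = nn.Sequential(
            nn.Flatten(),                                # (batch, 9, d) -> (batch, 9*d)
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),              # raw logits, no softmax here
        )

    def forward(self, x):
        # x holds integer amino acid indices with shape (batch, 9)
        return self.net(self.embedding(x))

sketch = PeptideMLPSketch(emb_dim=4)
batch = torch.randint(0, 20, (5, 9))                     # 5 random index-encoded peptides
logits = sketch(batch)
print(logits.shape)                                      # torch.Size([5, 7])
```

The output layer returns raw logits because `CrossEntropyLoss` applies the log-softmax internally.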

We use `CrossEntropyLoss` as our loss function and `Adam` as the optimizer. During training, we track the training loss, the test loss, and the test accuracy after every epoch.
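As a rough illustration of how the two losses can be tracked per epoch, the loop below uses a tiny stand-in model and random data so that it runs on its own. The learning rate, the full-batch updates, and all variable names are assumptions; the actual `setup_training` and `train_model` functions are not reproduced here.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in model and random data, used only to make the loop runnable
model_sketch = nn.Sequential(nn.Embedding(20, 4), nn.Flatten(), nn.Linear(36, 7))
X_fit = torch.randint(0, 20, (64, 9))
y_fit = torch.randint(0, 7, (64,))
X_heldout = torch.randint(0, 20, (32, 9))
y_heldout = torch.randint(0, 7, (32,))

loss_fn_sketch = nn.CrossEntropyLoss()       # expects raw logits and integer class labels
opt_sketch = torch.optim.Adam(model_sketch.parameters(), lr=1e-3)  # lr is an assumption

fit_history, heldout_history = [], []
for epoch in range(3):
    # One full-batch update (a DataLoader would iterate over mini-batches instead)
    model_sketch.train()
    opt_sketch.zero_grad()
    fit_loss = loss_fn_sketch(model_sketch(X_fit), y_fit)
    fit_loss.backward()
    opt_sketch.step()

    # Held-out loss is computed without gradients and without updating weights
    model_sketch.eval()
    with torch.no_grad():
        heldout_loss = loss_fn_sketch(model_sketch(X_heldout), y_heldout)

    fit_history.append(fit_loss.item())
    heldout_history.append(heldout_loss.item())

print(fit_history, heldout_history)
```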

##### Defining the MLP Model

In [3]:
from model import PeptideClassifier2b

# Create model instance
model = PeptideClassifier2b(emb_dim=4)

##### Loss & Optimization 

In [4]:
from training import create_data_loaders, setup_training

# Create data loaders
train_loader, test_loader = create_data_loaders(X_train, y_train, X_test, y_test)

# Setup loss function and optimizer
loss_fn, optimizer = setup_training(model, y_train)

Model, optimizer, and loss function initialized!


##### Training Loop

In [5]:
from config import EPOCHS
from training import train_model

# Train the model
train_losses, test_losses, accuracies = train_model(
    model, train_loader, test_loader, loss_fn, optimizer, epochs=EPOCHS
)

Epoch  1/20 | Train Loss: 1.1041 | Test Loss: 0.9179 | Accuracy: 65.76%
Epoch  2/20 | Train Loss: 0.8572 | Test Loss: 0.8220 | Accuracy: 67.71%
Epoch  3/20 | Train Loss: 0.7950 | Test Loss: 0.7882 | Accuracy: 69.15%
Epoch  4/20 | Train Loss: 0.7635 | Test Loss: 0.7716 | Accuracy: 68.86%
Epoch  5/20 | Train Loss: 0.7442 | Test Loss: 0.7553 | Accuracy: 69.58%
Epoch  6/20 | Train Loss: 0.7290 | Test Loss: 0.7386 | Accuracy: 70.06%
Epoch  7/20 | Train Loss: 0.7161 | Test Loss: 0.7320 | Accuracy: 70.38%
Epoch  8/20 | Train Loss: 0.7059 | Test Loss: 0.7304 | Accuracy: 70.62%
Epoch  9/20 | Train Loss: 0.6952 | Test Loss: 0.7191 | Accuracy: 70.70%
Epoch 10/20 | Train Loss: 0.6880 | Test Loss: 0.7116 | Accuracy: 71.24%
Epoch 11/20 | Train Loss: 0.6800 | Test Loss: 0.7120 | Accuracy: 71.00%
Epoch 12/20 | Train Loss: 0.6753 | Test Loss: 0.7091 | Accuracy: 70.54%
Epoch 13/20 | Train Loss: 0.6698 | Test Loss: 0.7032 | Accuracy: 71.40%
Epoch 14/20 | Train Loss: 0.6643 | Test Loss: 0.7046 | Accuracy:

##### Does the input dimension cause training problems?
In our setup, each amino acid is embedded into a small vector (e.g., 4D), so a peptide of length 9 becomes a 36D input vector (`9 × 4`), and the hidden layers use this dimension as well.

This is a relatively low-dimensional space (especially compared to one-hot encoding with 180 features). The network trains quickly and reaches roughly 70% test accuracy within a few epochs. No numerical instability is observed.
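To make the size difference concrete, the parameter counts of the two variants can be worked out by hand. The helper below is a hypothetical sketch that assumes fully connected layers with biases, two hidden layers as wide as the input, and 7 output classes.

```python
def mlp_param_count(input_dim, num_classes=7, embedding_params=0):
    """Parameters of an MLP with two hidden layers of width input_dim, plus biases."""
    hidden = 2 * (input_dim * input_dim + input_dim)   # two Linear(input_dim, input_dim)
    output = input_dim * num_classes + num_classes     # Linear(input_dim, num_classes)
    return embedding_params + hidden + output

# Embedding variant: 20 amino acids x d=4 embedding weights, input dimension 9*4 = 36
embedded_params = mlp_param_count(9 * 4, embedding_params=20 * 4)

# One-hot variant: no embedding table, input dimension 9*20 = 180
one_hot_params = mlp_param_count(9 * 20)

print(embedded_params, one_hot_params)                 # 3003 66427
```

Under these assumptions the `d = 4` network has about 3 thousand parameters, roughly 22 times fewer than the one-hot equivalent.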

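**Conclusion:** The embedding-based representation lets the model train efficiently, without struggling with high-dimensional sparse inputs, but accuracy plateaus at about 70%. That is not much: predicting NEG for every test peptide would already score 65.49%, and if this MLP were used for real vaccine manufacturing, 70% would not be enough. So yes, definitely: the small dimension does not destabilize training, but it appears to limit how much the model can learn.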

##### Plot the resulting train and test losses.

In [6]:
from evaluation import plot_training_results

# Generate plots
plot_training_results(train_losses, test_losses, accuracies, EPOCHS)

#### c. Model without constraints

In [7]:
# Defining the MLP Model
from model import PeptideClassifier2c

# Create model instance
model2c = PeptideClassifier2c(emb_dim=32)

# Loss & Optimization 
from training import create_data_loaders, setup_training

# Create data loaders
train_loader, test_loader = create_data_loaders(X_train, y_train, X_test, y_test)

# Setup loss function and optimizer
loss_fn, optimizer = setup_training(model2c, y_train)

# Training Loop
from config import EPOCHS
from training import train_model

# Train the model
train_losses, test_losses, accuracies = train_model(
    model2c, train_loader, test_loader, loss_fn, optimizer, epochs=EPOCHS
)

# Plotting
from evaluation import plot_training_results

# Generate plots
plot_training_results(train_losses, test_losses, accuracies, EPOCHS)

Model, optimizer, and loss function initialized!
Epoch  1/20 | Train Loss: 1.0812 | Test Loss: 0.8618 | Accuracy: 66.45%
Epoch  2/20 | Train Loss: 0.7785 | Test Loss: 0.7541 | Accuracy: 70.03%
Epoch  3/20 | Train Loss: 0.7150 | Test Loss: 0.7018 | Accuracy: 70.68%
Epoch  4/20 | Train Loss: 0.6820 | Test Loss: 0.6898 | Accuracy: 71.29%
Epoch  5/20 | Train Loss: 0.6554 | Test Loss: 0.6731 | Accuracy: 72.07%
Epoch  6/20 | Train Loss: 0.6308 | Test Loss: 0.6971 | Accuracy: 71.21%
Epoch  7/20 | Train Loss: 0.6045 | Test Loss: 0.6614 | Accuracy: 72.09%
Epoch  8/20 | Train Loss: 0.5824 | Test Loss: 0.6900 | Accuracy: 70.11%
Epoch  9/20 | Train Loss: 0.5593 | Test Loss: 0.6845 | Accuracy: 71.45%
Epoch 10/20 | Train Loss: 0.5350 | Test Loss: 0.6861 | Accuracy: 71.24%
Epoch 11/20 | Train Loss: 0.5115 | Test Loss: 0.6992 | Accuracy: 70.52%
Epoch 12/20 | Train Loss: 0.4858 | Test Loss: 0.7222 | Accuracy: 71.18%
Epoch 13/20 | Train Loss: 0.4650 | Test Loss: 0.7381 | Accuracy: 69.39%
Epoch 14/20 | T