# Introduction to Deep Learning 67822 - [Ex1](https://docs.google.com/document/d/11Q1ejfwTH_tHjdQob0gYLA3bS88lNsBStpBWz085rB0/edit?tab=t.0)
#### NAME1 (ID1) & NAME2 (ID2)

##### Section 1: Load and Prepare the Data

###### Task 1: Split training data (from the .txt files)

We are training a model to classify 9-mer peptides based on whether they are detected by the immune system via specific HLA alleles. Each positive sample is associated with one of six common alleles. The negative samples are peptides not detected by any of the alleles.

When splitting the data into training and test sets, it’s crucial to avoid introducing bias. One tempting idea is to take the first 90% of each file for training and the last 10% for testing. However, this assumes that the peptide order inside each file is random, which may not be true. The files might be sorted by binding strength, by sequence similarity, or alphabetically, in which case the last 10% of a file would not be representative of the whole file.
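To make the ordering concern concrete, here is a minimal sketch on synthetic data. It assumes a file that happens to be sorted alphabetically; the peptides are randomly generated and are not taken from the dataset. It compares the first residues found in a "last 10%" test set against those in a test set cut after shuffling.

```python
import random

# Toy stand-in for one allele file, sorted alphabetically (an assumed ordering).
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
rng = random.Random(0)  # local, seeded generator so the demo is repeatable
peptides = sorted(
    "".join(rng.choice(amino_acids) for _ in range(9)) for _ in range(1000)
)

split_idx = int(len(peptides) * 0.9)

# Naive split: the test set is simply the tail of the sorted file.
naive_test = peptides[split_idx:]

# Shuffled split: permute a copy first, then cut at the same index.
shuffled = peptides[:]
rng.shuffle(shuffled)
shuffled_test = shuffled[split_idx:]


def first_letters(seqs):
    """Return the set of first residues seen in a list of peptides."""
    return {s[0] for s in seqs}


print("naive test first residues:   ", sorted(first_letters(naive_test)))
print("shuffled test first residues:", sorted(first_letters(shuffled_test)))
```

In this toy setting the naive test set only contains peptides starting with the last few letters of the alphabet, while the shuffled test set covers most of it.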

To prevent such biases and ensure fair training and evaluation, we use a **stratified random split per allele**:

1. We load and shuffle the peptides from each positive allele file individually.
2. We split each file into a 90% training / 10% test set.
3. We do the same for the negative examples (from `negs.txt`).
4. Finally, we combine all subsets and shuffle them again.

This approach ensures that all alleles are represented in both the training and test sets, that the overall balance between positive and negative samples is preserved, and that no ordering bias from the original files leaks into the learning process.

```python
import random
from pathlib import Path

# Config
data_dir = Path("Data/HLA_Dataset")
train_ratio = 0.9

# Locate all allele-positive files
allele_files = [f for f in data_dir.glob("*.txt") if "neg" not in f.name]

# Store train/test samples
pos_train, pos_test = [], []

# Process each positive file (1 per allele)
for file in allele_files:
    allele = file.stem.replace("_pos", "")
    with open(file) as f:
        peptides = [line.strip() for line in f if line.strip()]
        random.shuffle(peptides)  # shuffle within each allele
        split_idx = int(len(peptides) * train_ratio)
        pos_train += [(pep, allele, 1) for pep in peptides[:split_idx]]
        pos_test  += [(pep, allele, 1) for pep in peptides[split_idx:]]

# Process negatives
with open(data_dir / "negs.txt") as f:
    neg_peptides = [line.strip() for line in f if line.strip()]
    random.shuffle(neg_peptides)
    split_idx = int(len(neg_peptides) * train_ratio)
    neg_train = [(pep, "NEG", 0) for pep in neg_peptides[:split_idx]]
    neg_test  = [(pep, "NEG", 0) for pep in neg_peptides[split_idx:]]

# Final datasets
train_data = pos_train + neg_train
test_data = pos_test + neg_test
random.shuffle(train_data)
random.shuffle(test_data)

# Summary
train_pct = (len(train_data) / (len(train_data) + len(test_data))) * 100
test_pct = (len(test_data) / (len(train_data) + len(test_data))) * 100
neg_train_pct = (len(neg_train) / len(train_data)) * 100
pos_train_pct = (len(pos_train) / len(train_data)) * 100
neg_test_pct = (len(neg_test) / len(test_data)) * 100
pos_test_pct = (len(pos_test) / len(test_data)) * 100

print(f"Train set size: {len(train_data)}({train_pct:.2f}%)\nPos: {len(pos_train)}({pos_train_pct:.2f}%), Neg: {len(neg_train)}({neg_train_pct:.2f}%)")
print(f"Test set size:  {len(test_data)}({test_pct:.2f}%)\nPos: {len(pos_test)}({pos_test_pct:.2f}%), Neg: {len(neg_test)}({neg_test_pct:.2f}%)")
```

```
Train set size: 33642(89.99%)
Pos: 11600(34.48%), Neg: 22042(65.52%)
Test set size:  3741(10.01%)
Pos: 1291(34.51%), Neg: 2450(65.49%)
```
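The summary reports only overall sizes and the positive/negative ratio. A per-allele check could confirm that every group really appears in both sets. The following is a minimal sketch, assuming samples are stored as `(peptide, allele, label)` tuples as in the cell above. The helper name `summarize_split`, the toy peptides, and the allele labels are hypothetical and are used only for illustration.

```python
from collections import Counter


def summarize_split(train, test):
    """Count samples per group in each set and compute the positive fraction."""
    train_counts = Counter(allele for _, allele, _ in train)
    test_counts = Counter(allele for _, allele, _ in test)
    train_pos = sum(label for _, _, label in train) / len(train)
    test_pos = sum(label for _, _, label in test) / len(test)
    return train_counts, test_counts, train_pos, test_pos


# Toy data in the same (peptide, allele, label) shape; all values are made up.
toy_train = [("AAAAAAAAA", "A0101", 1), ("CCCCCCCCC", "A0201", 1),
             ("DDDDDDDDD", "NEG", 0), ("EEEEEEEEE", "NEG", 0)]
toy_test = [("FFFFFFFFF", "A0101", 1), ("GGGGGGGGG", "A0201", 1),
            ("HHHHHHHHH", "NEG", 0), ("IIIIIIIII", "NEG", 0)]

train_counts, test_counts, train_pos, test_pos = summarize_split(toy_train, toy_test)

# Every group seen in training should also appear in the test set.
missing = set(train_counts) - set(test_counts)
print("groups missing from test:", missing)
print(f"positive fraction: train={train_pos:.2f}, test={test_pos:.2f}")
```

Calling `summarize_split(train_data, test_data)` on the real split should give an empty `missing` set and positive fractions close to the 34.48% and 34.51% reported above.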
