# Data Exploration

In [3]:
import os
import sys
import pandas as pd
if os.getcwd().endswith('notebooks'):
    os.chdir('..')
sys.path.append('src') 
from src.data_loader import get_dataloader

In [4]:
experiment_path = "data/HUMAN"
data_loader = get_dataloader(experiment_path, folds=[1,2,3,4], batch_size=3)
type(data_loader)

torch.utils.data.dataloader.DataLoader

In [5]:
# First let's look at the metadata
df = pd.read_csv("data/HUMAN/HUMAN.csv")
df.head()

| Unnamed: 0 | mutant | mutated_sequence | DMS_score | DMS_score_bin |
|---|---|---|---|---|
| 0 | A101C | MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTND... | 0.573154 | 1 |
| 1 | A101F | MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTND... | 0.765705 | 1 |
| 2 | A101G | MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTND... | -2.460507 | 0 |
| 3 | A101H | MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTND... | -2.230238 | 0 |
| 4 | A101I | MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTND... | 1.122181 | 1 |


The metadata dataframe contains sequences of the same protein, each carrying a single mutation. The mutation is given in the `mutant` column: A101C means that the amino acid A (alanine) at position 101 was replaced with C (cysteine). The DMS_score is the value we are trying to predict.
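To make the mutation notation concrete, here is a minimal sketch of how a mutant string such as A101C could be split into its parts and applied to a sequence. The helpers `parse_mutant` and `apply_mutation` and the short `toy_wild_type` sequence are hypothetical and not part of this repository, and the sketch assumes that positions are 1-indexed and that each string describes a single substitution.

```python
import re
from typing import Tuple


def parse_mutant(mutant: str) -> Tuple[str, int, str]:
    """Split a string like 'A101C' into (wild-type residue, position, new residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutant)
    if match is None:
        raise ValueError(f"Unexpected mutant format: {mutant!r}")
    wt_residue, position, new_residue = match.groups()
    return wt_residue, int(position), new_residue


def apply_mutation(wild_type: str, mutant: str) -> str:
    """Return the sequence with the substitution applied (positions assumed 1-indexed)."""
    wt_residue, position, new_residue = parse_mutant(mutant)
    index = position - 1  # convert the 1-indexed position to a Python index
    if wild_type[index] != wt_residue:
        raise ValueError(
            f"Expected {wt_residue} at position {position}, found {wild_type[index]}"
        )
    return wild_type[:index] + new_residue + wild_type[index + 1:]


# Toy sequence for illustration only; it is not the real wild-type protein.
toy_wild_type = "MYGKII"

print(parse_mutant("A101C"))                 # ('A', 101, 'C')
print(apply_mutation(toy_wild_type, "G3K"))  # MYKKII
```

The check inside `apply_mutation` is a cheap way to catch an off-by-one in the position convention before it silently corrupts a sequence.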

In [6]:
# next let's see what data is returned by the dataloader:
for batch in data_loader:
    print(f"The type returned by the dataloader is {type(batch)}")
    print(f"The keys of the dataloader are {batch.keys()}")
    break

The type returned by the dataloader is <class 'dict'>
The keys of the dataloader are dict_keys(['embedding', 'mutant', 'DMS_score', 'mutant_sequence', 'logits', 'wt_logits', 'wt_embedding'])


In [9]:
print("embedding shape:", batch['embedding'].shape)
print("wt_embedding shape:", batch['wt_embedding'].shape)
print("mutants:", batch['mutant'])
print("DMS_score:", batch["DMS_score"])
print("mutant_sequence:", batch["mutant_sequence"])
print("logits shape:", batch["logits"].shape)
print("wt_logits shape:", batch["wt_logits"].shape)

embedding shape: torch.Size([3, 152, 1280])
wt_embedding shape: torch.Size([3, 152, 1280])
mutants: ['L108W', 'T106C', 'G102K']
DMS_score: tensor([-1.4484, -1.3308, -2.7186])
mutant_sequence: ['MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAHEVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTIWLISYGIRRLIKKSPSDVKPLPSPDTDVPLSSVEIENPETSDQ', 'MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAHEVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGCILLISYGIRRLIKKSPSDVKPLPSPDTDVPLSSVEIENPETSDQ', 'MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAHEVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAKVIGTILLISYGIRRLIKKSPSDVKPLPSPDTDVPLSSVEIENPETSDQ']
logits shape: torch.Size([3, 152, 33])
wt_logits shape: torch.Size([3, 152, 33])


[ESM is a protein language model](https://github.com/facebookresearch/esm) that produces embedded representations of proteins, which can then be used as features for downstream tasks (as we are doing here).

The embedding has shape \[batch_size, sequence_length, ESM_embedding_size\]. It is likely to be the most useful feature for our purposes, and you may choose not to use any of the other features.

logits and wt_logits: These are the raw ESM model outputs (before applying an activation function such as softmax) for the mutant and wild-type sequences, respectively. For each position in each sequence, the model outputs 33 scores, some of which reflect the predicted probability of a given amino acid type occurring at that position.
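
As a rough illustration of how these tensors might be turned into model inputs, the sketch below mean-pools the embedding over the sequence dimension and applies a softmax to the logits. It uses random tensors with the shapes printed above rather than a real batch, and mean pooling is only one possible choice, not necessarily the approach intended for this project.

```python
import torch

# Random stand-ins with the shapes printed above; a real batch would come
# from the dataloader (batch["embedding"], batch["logits"]).
batch_size, seq_len, emb_dim, n_scores = 3, 152, 1280, 33
embedding = torch.randn(batch_size, seq_len, emb_dim)
logits = torch.randn(batch_size, seq_len, n_scores)

# Mean-pool over the sequence dimension to get one fixed-size vector per sequence.
pooled = embedding.mean(dim=1)

# Softmax over the last dimension turns the raw scores at each position
# into a distribution that sums to 1.
probs = torch.softmax(logits, dim=-1)

print(pooled.shape)  # torch.Size([3, 1280])
print(probs.shape)   # torch.Size([3, 152, 33])
```

Note that this simple mean averages over all 152 positions; whether some positions should be excluded from the average is a design decision left open here.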

wt stands for wild type, the canonical sequence of the protein (without any mutation applied). The wild-type features are always the same, both within a batch and across batches.

If your model does not need the logits and/or the wt features, you can set return_logits=False and/or return_wt=False when calling `get_dataloader()`.