<a href="https://www.kaggle.com/code/ayushs9020/utc-pytorch-dataloader-cafa?scriptVersionId=135482977" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# CAFA 5 Protein Prediction

<img src = "https://i.pinimg.com/originals/8e/de/73/8ede737acdcf671e000ae0f87a742e40.png" width = 400>

The `goal` of this competition is to `predict the function of a set of proteins`. We will `develop a model trained` on the `amino-acid sequences` of the `proteins and on other data`. Our work `will help researchers` better `understand the function of proteins`, which is `important for discovering` `how cells, tissues, and organs work`.

# 1 | Basic Terminologies 💻

* $Structure$ $of$ $a$ $Protein$
* $Gene$ $Ontology$ $(GO)$

## 1.1 | Structure of a Protein

So what actually is a **Protein...?**

First of all, let's understand the structure of an **Atom**.

<img src = "https://www.sciencefacts.net/wp-content/uploads/2020/11/Parts-of-an-Atom-Diagram.jpg"  width = 300>

This is a really good image I found of the `structure of an atom`. Physicists use more detailed models, but this `simplified picture` is good enough for our purposes.

In the centre we have the `Nucleus`. The `Nucleus` is made up of $2$ kinds of particles, the `Proton` and the `Neutron`. A `Proton` is a `positively charged particle` and a `Neutron` is a `neutral particle`. An `Electron`, a `negatively charged particle`, `orbits` the `Nucleus` at some `distance`.

The `more` `Electrons` and `Protons` we add, the `bigger the atom becomes`.

There are `different shells` where the `Electrons reside`. The `closer a shell` is to the `Nucleus`, the `fewer Electrons` it can hold. There are mainly $4$ shells. 

|Shell|Maximum Electrons|
|---|---
|$K$|$2$
|$L$|$8$
|$M$|$18$
|$N$|$32$

Once an atom `fills its outermost shell` with `Electrons`, it becomes a `stable atom` and tends `not to donate` or `receive any extra Electrons`.

<img src = "https://cdn1.byjus.com/wp-content/uploads/2022/01/word-image128.png" width = 400>

`Different atoms combine` to `share Electrons` and become `stable Molecules` 

<img src = "https://www.astrochem.org/sci_img/Amino_Acid_Structure.jpg" width = 300>

An `Amino Acid` is made up of mainly $4$ different atoms
`[H , C , O , N]`. 

<img src = "https://upload.wikimedia.org/wikipedia/commons/thumb/5/51/L-amino_acid_structure.svg/1200px-L-amino_acid_structure.svg.png" alt = "Bro use Light Theme" width = 300 >

The central `Carbon` has one more bond available, and the group attached there is called the `Side Chain`, which can be of different types. This `Side Chain` is what distinguishes one `Amino Acid` from another, and it gives us the $20$ `standard` `Amino Acids`.

When we join `Amino Acids` with `peptide bonds`, we get `Proteins`. Connecting different `Amino Acids` in different orders produces different `Proteins`.
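
In data files a `Protein` is usually written as a string of one-letter `Amino Acid` codes. Here is a minimal sketch, assuming the $20$ standard one-letter codes and reusing the first line of the sequence that appears in the `train_sequences.fasta` preview in section 2, that counts how often each `Amino Acid` occurs:

```python
from collections import Counter

# The 20 standard amino acids, by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# First sequence line of protein P20536 (copied from the FASTA preview in section 2).
sequence = "MNSVTVSHAPYTITYHDDWEPVMSQLVEFYNEVASWLLRDETSPIPDKFFIQLKQPLRNK"

counts = Counter(sequence)

# Every character should be one of the 20 standard codes.
assert set(counts) <= set(AMINO_ACIDS)

print(len(sequence))          # number of residues in this fragment
print(counts.most_common(3))  # the three most frequent amino acids
```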

## 1.2 | Gene Ontology (GO)

$Gene$ $Ontology$ $(GO)$ is a `controlled vocabulary` that `describes the functions of genes` and gene products. It is constantly being updated as new information becomes available.

There are mainly $3$ `Ontologies`

|Ontology|What it describes|Example|
|---|---|---
|$Biological$ $Process$|Describes the `biological processes`|A gene product might be involved in the process of `cell cycle`/`signal transduction.`
|$Cellular$ $Component$|Describes the `cellular components`|A gene product might be located in the `nucleus`/`cytoplasm`.
|$Molecular$ $Function$|Describes the `molecular functions`|A gene product might be involved in the `catalysis of a reaction`/`binding of a molecule`.

**[Gene Ontology Documentation](http://geneontology.org/docs/ontology-documentation/)**

# 2 | Data 📊

In [1]:
import pandas as pd 
import re
from Bio.Seq import Seq

The $Training$ $Set$ contains all `proteins with annotated terms` that have been validated by 
* $Experimental$
* $High-Throughput$ $Evidence$
* [$Traceable$ $Author$ $Statement$](https://wiki.geneontology.org/index.php/Traceable_Author_Statement_(TAS)#:~:text=The%20TAS%20evidence%20code%20covers,annotations%20come%20from%20review%20articles.)
* [$Inferred$ $by$ $Curator$ $(IC)$](https://wiki.geneontology.org/Inferred_by_Curator_(IC)) 

**Any other sources of Data are allowed**

### 2.1.1.1 | Go-Basic.obo

The $Ontology$ data is in the file `go-basic.obo`. This file is in $OBO$ format (`Open Biomedical Ontologies`). The nodes in `the graph are indexed` by their `GO term ID`; for example, the roots of the three sub-ontologies are:
```
subontology_roots = {'BPO':'GO:0008150',
                     'CCO':'GO:0005575',
                     'MFO':'GO:0003674'}
```

In [2]:
with open('/kaggle/input/cafa-5-protein-function-prediction/Train/go-basic.obo') as file :
    
    content = file.read()
    stanzas =  re.findall(r'\[Term\][\s\S]*?(?=\n\[|$)' , content)
    
print(stanzas[0])

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
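
Each stanza is just a list of `key: value` lines, so it can be turned into a dictionary. This is a rough sketch rather than a full OBO parser: `parse_stanza` is a hypothetical helper, and it assumes every field sits on one line and that repeated keys such as `is_a` should be collected in a list.

```python
# A [Term] stanza copied from the output above (definition line omitted for brevity).
stanza = """[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution"""


def parse_stanza(text):
    """Turn one [Term] stanza into a dict; every key maps to a list of values."""
    term = {}
    for line in text.splitlines():
        if ": " not in line:
            continue  # skips the "[Term]" header and blank lines
        key, value = line.split(": ", 1)
        term.setdefault(key, []).append(value)
    return term


term = parse_stanza(stanza)
# Parent IDs are the part of each is_a line before the " ! " comment.
parents = [value.split(" ! ")[0] for value in term.get("is_a", [])]

print(term["id"][0], term["namespace"][0], parents)
```

The `is_a` parents are what link the terms into a graph, with the three roots listed earlier at the top.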



### 2.1.1.2 | Training Sequences.fasta

This file contains only the `sequences` for `proteins` with `annotations` in the dataset (`labeled proteins`).

The file is in `FASTA` format. 

To obtain the full set of protein sequences, including unlabeled proteins, use the Swiss-Prot and TrEMBL databases (both part of UniProt).

In [3]:
with open("/kaggle/input/cafa-5-protein-function-prediction/Train/train_sequences.fasta" , "r") as file:
    x = file.read()
    
x[:200]

'>P20536 sp|P20536|UNG_VACCC Uracil-DNA glycosylase OS=Vaccinia virus (strain Copenhagen) OX=10249 GN=UNG PE=1 SV=1\nMNSVTVSHAPYTITYHDDWEPVMSQLVEFYNEVASWLLRDETSPIPDKFFIQLKQPLRNK\nRVCVCGIDPYPKDGTGVPFESPNF'
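
In `FASTA`, every record starts with a `>` header line, followed by the sequence wrapped over several lines. Below is a minimal sketch of a reader. `parse_fasta` is a hypothetical helper that assumes the accession is the first token after `>`, as in the preview above. Biopython's `SeqIO.parse(handle, "fasta")` does the same job more robustly.

```python
def parse_fasta(text):
    """Return {accession: sequence} for FASTA text whose headers look like '>P20536 sp|...'."""
    sequences = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The accession is the first whitespace-separated token after '>'.
            current_id = line[1:].split()[0]
            sequences[current_id] = []
        elif current_id is not None:
            sequences[current_id].append(line)
    # Sequence lines are wrapped in the file, so join them back together.
    return {pid: "".join(parts) for pid, parts in sequences.items()}


# The start of the first record, as shown in the preview above (cut off at 200 characters there).
sample = (
    ">P20536 sp|P20536|UNG_VACCC Uracil-DNA glycosylase OS=Vaccinia virus "
    "(strain Copenhagen) OX=10249 GN=UNG PE=1 SV=1\n"
    "MNSVTVSHAPYTITYHDDWEPVMSQLVEFYNEVASWLLRDETSPIPDKFFIQLKQPLRNK\n"
    "RVCVCGIDPYPKDGTGVPFESPNF"
)

records = parse_fasta(sample)
print(list(records), len(records["P20536"]))
```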

### 2.1.1.3 | Train Terms.tsv

This file contains the list of annotated terms (the `ground truth`) for the proteins in `train_sequences.fasta`. 

In [4]:
pd.read_csv("/kaggle/input/cafa-5-protein-function-prediction/Train/train_terms.tsv" , sep = "\t")

Unnamed: 0,EntryID,term,aspect
0,A0A009IHW8,GO:0008152,BPO
1,A0A009IHW8,GO:0034655,BPO
2,A0A009IHW8,GO:0072523,BPO
3,A0A009IHW8,GO:0044270,BPO
4,A0A009IHW8,GO:0006753,BPO
...,...,...,...
5363858,X5L565,GO:0050649,MFO
5363859,X5L565,GO:0016491,MFO
5363860,X5M5N0,GO:0005515,MFO
5363861,X5M5N0,GO:0005488,MFO


The first column indicates the `protein's UniProt accession ID`, the second is the `GO term ID`, and the third indicates in which `ontology the term appears`.
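
Since the file has one row per `(protein, term)` pair, a common first step is to group the terms by protein. Here is a small sketch using only the standard library and a few rows copied from the preview above, so it runs without the competition data. The variable names are illustrative.

```python
import csv
import io
from collections import Counter, defaultdict

# A few rows in the same layout as train_terms.tsv (values taken from the preview above).
sample_tsv = (
    "EntryID\tterm\taspect\n"
    "A0A009IHW8\tGO:0008152\tBPO\n"
    "A0A009IHW8\tGO:0034655\tBPO\n"
    "X5L565\tGO:0050649\tMFO\n"
    "X5L565\tGO:0016491\tMFO\n"
    "X5M5N0\tGO:0005515\tMFO\n"
)

terms_per_protein = defaultdict(set)  # protein ID -> set of GO term IDs
rows_per_aspect = Counter()           # aspect (BPO/CCO/MFO) -> number of rows

for row in csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"):
    terms_per_protein[row["EntryID"]].add(row["term"])
    rows_per_aspect[row["aspect"]] += 1

print(dict(rows_per_aspect))
print(sorted(terms_per_protein["X5L565"]))
```

With the full file loaded in pandas, `df.groupby("EntryID")["term"].apply(set)` gives the same grouping.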

### 2.1.1.4 | Train Taxonomy.tsv

This file contains the list of `proteins and the species to which they belong`, represented by a `taxonomic identifier` (`taxon ID`) number.

In [5]:
pd.read_csv("/kaggle/input/cafa-5-protein-function-prediction/Train/train_taxonomy.tsv" , sep = "\t")

Unnamed: 0,EntryID,taxonomyID
0,Q8IXT2,9606
1,Q04418,559292
2,A8DYA3,7227
3,Q9UUI3,284812
4,Q57ZS4,185431
...,...,...
142241,Q5TD07,9606
142242,A8BB17,7955
142243,A0A2R8QBB1,7955
142244,P0CT72,284812


### 2.1.1.5 | IA.txt

IA.txt contains the information accretion (weights) for each GO term. These weights are used to compute weighted precision and recall, as described in the Evaluation section. 
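
To get a feel for how such weights enter a metric, here is a sketch of weighted precision and recall for a single protein. It assumes the usual form (sum of the weights of correctly predicted terms, divided by the sum of the weights of predicted or true terms). The weights and term names are made up, and the exact procedure (thresholds, averaging over proteins) is the one defined in the competition's Evaluation section, not this snippet.

```python
def weighted_precision_recall(predicted, truth, ia):
    """IA-weighted precision and recall for one protein.

    predicted, truth: sets of GO term IDs; ia: dict mapping term -> weight.
    Terms missing from `ia` are treated as weight 0 (an assumption of this sketch).
    """
    def total(terms):
        return sum(ia.get(t, 0.0) for t in terms)

    hit = total(predicted & truth)
    precision = hit / total(predicted) if total(predicted) > 0 else 0.0
    recall = hit / total(truth) if total(truth) > 0 else 0.0
    return precision, recall


# Hypothetical weights and term sets, chosen only to make the arithmetic easy to follow.
ia = {"GO:A": 1.0, "GO:B": 3.0, "GO:C": 2.0}
predicted = {"GO:A", "GO:B"}
truth = {"GO:B", "GO:C"}

precision, recall = weighted_precision_recall(predicted, truth, ia)
print(precision, recall)
```

Only `GO:B` is both predicted and true, so precision is $3 / (1 + 3)$ and recall is $3 / (3 + 2)$: a rare, high-weight term counts more than a common one.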

## 2.1.2 | Test Set

The $Test$ $Set$ is `unknown at the beginning` of the competition. It will contain `protein sequences` (and `their functions`) from the `test superset` that `gained experimental annotations` between the `submission deadline` and the `time of evaluation`.

# 3 | PyTorch DataLoader ⚙️

In [6]:
import numpy as np
import tqdm

import torch
from torch.utils.data import Dataset

The dataloader is heavily inspired by **[Henri Upton](https://www.kaggle.com/henriupton)=>[ProteiNet 🧬 PyTorch+EMS2/T5/ProtBERT Embeddings](https://www.kaggle.com/code/henriupton/proteinet-pytorch-ems2-t5-protbert-embeddings)**

First we will make a simple class...

In [7]:
class P_Dataset(Dataset):
    pass

Now we will load the `embeds , id` from the `T5 Embeds`

Embeds (embeddings) represent a non-numerical element, here a protein sequence, as a list of numbers. 

In [8]:
class P_Dataset(Dataset):
    
    def __init__(self):
        
        super().__init__()
        
        embeds = np.load("/kaggle/input/t5embeds/test_embeds.npy")
        ids = np.load("/kaggle/input/t5embeds/test_ids.npy")

In case you want to see what these `embeds` and `ids` look like, open the hidden cells below.

In [9]:
embeds = np.load("/kaggle/input/t5embeds/test_embeds.npy")
ids = np.load("/kaggle/input/t5embeds/test_ids.npy")

embeds , embeds.shape , ids , ids.shape

(array([[ 0.05470492,  0.06342026, -0.01531996, ..., -0.04331931,
          0.03600927,  0.06309301],
        [ 0.09037268,  0.08984205, -0.02388695, ..., -0.05335043,
          0.01964429,  0.07962959],
        [ 0.04358805,  0.03957234, -0.01433173, ..., -0.04446448,
          0.03097377,  0.04032155],
        ...,
        [ 0.03274843,  0.14186755,  0.03414193, ...,  0.0206462 ,
          0.05467706, -0.01504807],
        [ 0.05271317,  0.15701264,  0.04327936, ...,  0.02159062,
          0.0625835 , -0.01490347],
        [ 0.04187706,  0.13404095,  0.0790938 , ...,  0.0541121 ,
          0.01576663, -0.02881737]]),
 (141865, 1024),
 array(['Q9CQV8', 'P62259', 'P68510', ..., 'C0HK73', 'C0HK74',
        'A0A3G2FQK2'], dtype='<U10'),
 (141865,))

Now we will make the `embeds_list`

In [10]:
class P_Dataset(Dataset):
    
    def __init__(self):
        
        super().__init__()
        
        embeds = np.load("/kaggle/input/t5embeds/test_embeds.npy")
        ids = np.load("/kaggle/input/t5embeds/test_ids.npy")
        
        self.embeds_list = [row for row in embeds]

Below is the output for the `embeds_list`

In [11]:
embeds_list = [row for row in embeds]

embeds_list[0]

array([ 0.05470492,  0.06342026, -0.01531996, ..., -0.04331931,
        0.03600927,  0.06309301])

For the `training` we need the `targets`, which we will take from the `train_targets_top500.npy`.

In [12]:
class P_Dataset(Dataset):
    
    def __init__(self , datatype):
        
        super().__init__()
        
        self.datatype  = datatype
        
        # Load the split that matches `datatype` ("train" or "test"), so that train
        # targets are not paired with test embeddings. Assumption: the t5embeds
        # dataset also provides train_embeds.npy / train_ids.npy.
        embeds = np.load(f"/kaggle/input/t5embeds/{datatype}_embeds.npy")
        ids = np.load(f"/kaggle/input/t5embeds/{datatype}_ids.npy")
        
        self.embeds_list = [row for row in embeds]
        
        if self.datatype == "train" : 
            
            targets = np.load("/kaggle/input/train-targets-top500/train_targets_top500.npy")[:len(self.embeds_list)]
            self.targets_list = [row for row in targets]

These are our targets

In [13]:
targets = np.load("/kaggle/input/train-targets-top500/train_targets_top500.npy")[:len(embeds_list)]
targets_list = [row for row in targets]

targets_list[0]

array([0., 1., 0., 1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
       1., 0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1.,
       0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.
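
Each target row is a multi-hot vector: position $j$ is $1$ when the protein is annotated with the $j$-th label. The notebook does not show how `train_targets_top500.npy` was built, so the following is only one plausible construction, assuming the columns are the most frequent `GO terms`. `build_targets` and the toy annotations are hypothetical.

```python
from collections import Counter

import numpy as np


def build_targets(annotations, protein_ids, num_labels):
    """Return (labels, matrix) where matrix[i, j] == 1 if protein i has label j.

    annotations: list of (protein_id, term) pairs.
    labels: the `num_labels` most frequent terms, most frequent first.
    """
    term_counts = Counter(term for _, term in annotations)
    labels = [term for term, _ in term_counts.most_common(num_labels)]
    column = {term: j for j, term in enumerate(labels)}
    row = {pid: i for i, pid in enumerate(protein_ids)}

    matrix = np.zeros((len(protein_ids), len(labels)), dtype=np.float32)
    for pid, term in annotations:
        if term in column:  # terms outside the top labels are dropped
            matrix[row[pid], column[term]] = 1.0
    return labels, matrix


# Toy annotations (hypothetical protein IDs; GO IDs used only as labels).
annotations = [
    ("P1", "GO:0005515"), ("P2", "GO:0005515"), ("P3", "GO:0005515"),
    ("P1", "GO:0008152"), ("P2", "GO:0008152"),
    ("P3", "GO:0016491"),
]
labels, targets = build_targets(annotations, ["P1", "P2", "P3"], num_labels=2)
print(labels)
print(targets)
```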

Now we will just add the `__len__` and `__getitem__` methods to the class 

In [14]:
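class P_Dataset(Dataset):

    def __init__(self , datatype ):
        super().__init__()

        # Load the split that matches `datatype` ("train" or "test"), so that train
        # targets are not paired with test embeddings. Assumption: the t5embeds
        # dataset also provides train_embeds.npy / train_ids.npy.
        embeds = np.load(f"/kaggle/input/t5embeds/{datatype}_embeds.npy")
        
        self.datatype = datatype
        self.ids = np.load(f"/kaggle/input/t5embeds/{datatype}_ids.npy")
        self.embeds_list = [row for row in embeds]

        if datatype == "train":

            targets = np.load("/kaggle/input/train-targets-top500/train_targets_top500.npy")[:len(self.embeds_list)]
            
            self.targets_list = [row for row in targets]

    def __len__(self):
        # Count the embeddings so that len() also works for the test split,
        # which has no targets.
        return len(self.embeds_list)

    def __getitem__(self , index):

        embed = torch.tensor(self.embeds_list[index] , dtype = torch.float32)

        if self.datatype == "train":

            targets = torch.tensor(self.targets_list[index], dtype = torch.float32)

            return embed, targets

        # For the test split, return the protein ID instead of targets.
        protein_id = self.ids[index]

        return embed, protein_id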

In [15]:
train_data = P_Dataset(datatype = "train")
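
A `Dataset` like this is normally consumed through `torch.utils.data.DataLoader`, which takes care of batching and shuffling. Here is a minimal sketch using a small stand-in dataset with random arrays, so that it runs without the Kaggle input files. `ToyDataset`, its sizes and the batch size are illustrative assumptions.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Stand-in with the same (embedding, targets) item layout as the training mode above."""

    def __init__(self, num_items=10, embed_dim=1024, num_labels=500):
        rng = np.random.default_rng(0)
        self.embeds = rng.normal(size=(num_items, embed_dim)).astype(np.float32)
        self.targets = (rng.random((num_items, num_labels)) < 0.05).astype(np.float32)

    def __len__(self):
        return len(self.embeds)

    def __getitem__(self, index):
        return torch.from_numpy(self.embeds[index]), torch.from_numpy(self.targets[index])


loader = DataLoader(ToyDataset(), batch_size=4, shuffle=True)

for embeds_batch, targets_batch in loader:
    # Default collation stacks the items, giving tensors of shape (batch, features).
    print(embeds_batch.shape, targets_batch.shape)
```

The same call works with `train_data`: `DataLoader(train_data, batch_size=..., shuffle=True)`.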

# 4 | TO DO LIST 📄

```
TO DO 1 : VISUALIZE THE DATA

TO DO 2 : TRAIN A MODEL

TO DO 3 : TRY DIFFERENT MODELS

TO DO 4 : ADD WANDB SUPPORT

TO DO 5 : ADD TENSORFLOW DATA LOADER

TO DO 6 : TRAIN A TF MODEL

TO DO 7 : IMPROVE RESULTS

TO DO 8 : DECREASE TRAINING TIME

TO DO 9 : DANCE 
```

# 5 | Ending 🏁

**THAT'S IT FOR TODAY GUYS**

**WE WILL GO DEEPER INTO THE DATA IN THE UPCOMING VERSIONS**

**PLEASE COMMENT YOUR THOUGHTS, HIGHLY APPRECIATED**

**DON'T FORGET TO UPVOTE IF YOU LIKED MY WORK $:)$**

<img src = "https://i.imgflip.com/19aadg.jpg">

**PEACE OUT $:)$**