In [1]:
import pandas as pd

# Glycan-Structures-CFG611.txt

In [4]:
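# sep is a regular expression, so pass it as a raw string and select the Python parsing engine explicitly
glycans = pd.read_csv('../data/Glycan-Structures-CFG611.txt', sep=r"\s", engine="python")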


Features:

1. Name is the GlycanID used in Fractions file

2. IUPAC is a naming convention for glycans. It serves as a human-readable identifier, so nobody has to read a long SMILES string to tell glycans apart.
*Probably won't be used in training, since we will mainly use SMILES, but it could help when presenting results to scientists.

3. SMILES (Simplified Molecular Input Line Entry System) is an ASCII string that represents the chemical structure of the glycan: which atoms are bonded to each other and where.

**SMILES is our bread and butter for representing the glycan. Look into techniques for embedding it so it can be passed into our prediction model, for example Morgan fingerprints, graph neural networks, or one-hot encoding.**

Note: Every single value in this table is unique
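
Of the embedding options listed above, one-hot encoding is the simplest to try first. The sketch below is a minimal character-level version in plain Python. The helper names `build_vocab` and `one_hot_encode` and the toy strings are illustrative assumptions, not part of the data set or of any library. Splitting per character is also a simplification: multi-character SMILES tokens such as `@@` or `Cl` end up as separate symbols.

```python
def build_vocab(smiles_list):
    """Map every character seen in the SMILES strings to an integer index."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot_encode(smiles, vocab, max_len):
    """Return a max_len x len(vocab) matrix as a list of lists.

    Rows past the end of the string stay all-zero (padding).
    """
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][vocab[ch]] = 1
    return matrix


# Toy strings for illustration only, not taken from the glycan table.
example_smiles = ["OC[C@@H](O)C", "CC(=O)N"]
vocab = build_vocab(example_smiles)
max_len = max(len(s) for s in example_smiles)
encoded = [one_hot_encode(s, vocab, max_len) for s in example_smiles]
```

Every string becomes a matrix of the same shape, which is what a fixed-size model input needs. With the real table, `example_smiles` would presumably be replaced by `glycans['SMILES']`.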

In [5]:
glycans.head()

Unnamed: 0,Name,IUPAC,SMILES
0,CFG-007-Sp8,Gal(α-Sp8,OC[C@@H](O1)[C@H](O)[C@H](O)[C@@H](O)[C@H]1-OC...
1,CFG-008-Sp8,Glc(α-Sp8,OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@H]1-O...
2,CFG-009-Sp8,Man(α-Sp8,OC[C@@H](O1)[C@@H](O)[C@H](O)[C@H](O)[C@H]1-OC...
3,CFG-010-Sp8,GalNAc(α-Sp8,OC[C@@H](O1)[C@H](O)[C@H](O)[C@@H](NC(=O)C)[C@...
4,CFG-010-Sp15,GalNAc(α-Sp15,OC[C@@H](O1)[C@H](O)[C@H](O)[C@@H](NC(=O)C)[C@...


# Protein-Sequence-Table.txt

Features:

1. ProteinGroup - ID for the ProteinGroup and used in Fractions-Bound-Table

2. Accession - Unique ID for a protein in the Uniprot database. "An accession number (AC) is assigned to each sequence upon inclusion into UniProtKB." Reference of quote: https://www.uniprot.org/help/difference_accession_entryname


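3. Uniprot - ID for a protein stored in UniProt, a globally accessible protein database. The values in this column (e.g. LEG1_HUMAN) follow the UniProtKB entry-name format, as opposed to the accession number above. See this video for more info: https://youtu.be/GusiW6YUpr0, https://www.uniprot.org/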

4. Description - Describes the function of a protein in scientific terms. Ex: "Fucose-specific lectin" refers to a lectin (a type of protein) that specifically binds to the carbohydrate fucose.
(Could maybe encode the words as a feature)

5. Amino Acid Sequence - The sequence of amino acids making up the protein. This is the main input for understanding the protein's structure and its binding sites. Could embed with things like ESM3, a transformer, one-hot encoding, etc.

Note: Every single value in this table is unique

We can't simply split on spaces here, because the Description value itself contains spaces. Instead, separate Description from Amino Acid Sequence at the last space in the row.
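
The last-space rule can be checked on a single row without opening the file. The sketch below is a compact variant that uses `str.split` and `str.rsplit` with `maxsplit`. The function name `parse_protein_line` is a hypothetical helper, and the row is a toy example that follows the file's layout with a shortened placeholder sequence.

```python
def parse_protein_line(line):
    """Split one row into its five fields, keeping spaces inside the description."""
    # The first three fields never contain spaces, so peel them off from the left.
    protein_group, accession, uniprot, rest = line.strip().split(maxsplit=3)
    # The sequence contains no spaces, so the last space marks where it starts.
    description, sequence = rest.rsplit(maxsplit=1)
    return protein_group, accession, uniprot, description, sequence


# Toy row: the sequence is a shortened placeholder, not the full value from the file.
row = "2 P22972 LEC1_ULEEU Anti-H(O) lectin 1 SDDLSFKF"
fields = parse_protein_line(row)
print(fields)
```

A row with an empty description would raise a `ValueError` at the second unpacking, which is arguably preferable to silently shifting the columns.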


In [13]:
rows = []
with open('../data/Protein-Sequence-Table.txt', 'r') as file:
    next(file) # skip header line
    for line in file:
        parts = line.split()
        
        protein_group = parts[0]
        accession = parts[1]
        uniprot = parts[2]
        description_and_sequence = ' '.join(parts[3:]) 
        
        # split on last space then assign description to before split and amino acid after the split
        split_index = description_and_sequence.rfind(' ')
        description = description_and_sequence[:split_index].strip()
        amino_acid_sequence = description_and_sequence[split_index + 1:].strip()
        
        
        rows.append([protein_group, accession, uniprot, description, amino_acid_sequence])


proteins = pd.DataFrame(rows, columns=['ProteinGroup', 'Accession', 'Uniprot', 'Description', 'Amino Acid Sequence'])

In [14]:
proteins.head()

Unnamed: 0,ProteinGroup,Accession,Uniprot,Description,Amino Acid Sequence
0,1,Q41358,SNAIB_SAMNI,Ribosome-inactivating protein SNAI,MRLVAKLLYLAVLAICGLGIHGALTHPRVTPPVYPSVSFNLTGADT...
1,2,P22972,LEC1_ULEEU,Anti-H(O) lectin 1,SDDLSFKFKNFSQNGKDLSFQGDASVIETGVLQLNKVGNNLPDETG...
2,4,A8WDZ4,A8WDZ4_CANEN,Concanavalin A,MAISKKSSLFLPIFTFITMFLMVVNKVSSSTHETNALHFMFNQFSK...
3,6,P09382,LEG1_HUMAN,Galectin-1,MACGLVASNLNLKPGECLRVRGEVAPDAKSFVLNLGKDSNNLCLHF...
4,7,P16045,LEG1_MOUSE,Galectin-1,MACGLVASNLNLKPGECLKVRGEVASDAKSFVLNLGKDSNNLCLHF...


# Fractions-Bound-Table.txt

Features:

1. ObjId - some kind of object ID

2. ProteinGroup - ProteinGroup feature from protein sequences table

3. Concentration - The concentration of the glycan used in the sample that produced the f (strength) reading. It matters because it tells us how strong a reading a given amount of glycan produces.

4. GlycanID - The GlycanID used in the Glycan Structures table

5. f - the binding-strength reading (described in the slides/demo as the luminosity of the reaction).

**We are predicting f**

In [20]:
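# sep is a regular expression, so pass it as a raw string and select the Python parsing engine explicitly
fractions = pd.read_csv('../data/Fractions-Bound-Table.txt', sep=r"\s", engine="python")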


In [21]:
fractions.head()

Unnamed: 0,ObjId,ProteinGroup,Concentration,GlycanID,f
0,1004699,1,0.001,CFG-007-Sp8,0.0
1,1004699,1,0.001,CFG-008-Sp8,0.000154
2,1004699,1,0.001,CFG-009-Sp8,8.2e-05
3,1004699,1,0.001,CFG-010-Sp15,0.00029
4,1004699,1,0.001,CFG-010-Sp8,0.0


In [23]:
fractions.describe()

Unnamed: 0,ObjId,ProteinGroup,Concentration,f
count,334679.0,334679.0,334679.0,334679.0
mean,1004931.0,66.253589,37.540266,0.01338208
std,632.7957,42.799381,67.169264,0.07008441
min,1003786.0,1.0,0.001,0.0
25%,1004512.0,29.0,1.0,0.0
50%,1004714.0,63.0,5.0,5.260004e-07
75%,1005506.0,102.0,30.0,0.0002794406
max,1006422.0,147.0,500.0,0.9246077


In [25]:
fractions.nunique()

ObjId               548
ProteinGroup        147
Concentration        23
GlycanID            611
f                167464
dtype: int64
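
Since `ProteinGroup` links the fractions table to the protein table and `GlycanID` matches `Name` in the glycan table, the three tables can be combined into one training frame. The sketch below shows one way to do that with `DataFrame.merge`. It uses small toy frames so it runs on its own: the SMILES strings and sequences are placeholders, and the `f` value in the last row is made up. One assumption worth checking against the real data: the manual parser produces `ProteinGroup` as strings while `read_csv` infers integers, so the key is cast before merging.

```python
import pandas as pd

# Toy stand-ins for the three tables; SMILES and sequences are placeholders.
glycans = pd.DataFrame({
    "Name": ["CFG-007-Sp8", "CFG-008-Sp8"],
    "SMILES": ["<smiles-A>", "<smiles-B>"],
})
proteins = pd.DataFrame({
    "ProteinGroup": ["1", "2"],  # strings, as produced by the manual line parser
    "Amino Acid Sequence": ["MRLV", "SDDL"],
})
fractions = pd.DataFrame({
    "ProteinGroup": [1, 1, 2],
    "Concentration": [0.001, 0.001, 0.001],
    "GlycanID": ["CFG-007-Sp8", "CFG-008-Sp8", "CFG-007-Sp8"],
    "f": [0.0, 0.000154, 0.0],  # toy values
})

# Align key dtypes first; merging an integer column with a string column fails.
proteins["ProteinGroup"] = proteins["ProteinGroup"].astype(int)

# Left joins keep every measurement; validate raises if a key is duplicated
# on the protein or glycan side.
merged = (
    fractions
    .merge(proteins, on="ProteinGroup", how="left", validate="many_to_one")
    .merge(glycans, left_on="GlycanID", right_on="Name", how="left",
           validate="many_to_one")
)
print(merged[["ProteinGroup", "GlycanID", "Concentration", "f"]])
```

After the merge, each row carries the protein sequence, the glycan SMILES, the concentration, and the target `f`. Checking `merged['SMILES'].isna().sum()` on the real data would show whether any `GlycanID` has no match in the glycan table.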