# Introduction

This notebook serves as a guide for new developers using the `chebai` package. If you just want to run the experiments, you can refer to the [README.md](https://github.com/ChEB-AI/python-chebai/blob/dev/README.md) and the [wiki](https://github.com/ChEB-AI/python-chebai/wiki) for the basic commands. This notebook explains what happens under the hood for the GO-UniProt dataset. It covers
- how to instantiate a data class and generate data
- how the data is processed and stored
- and how to work with different protein encodings.

The `chebai` package simplifies the handling of these datasets by **automatically creating** them as needed. This means that you do not have to input any data manually; the package will generate and organize the data files based on the parameters and encodings selected. This ensures that the right data is available and properly formatted. You can, however, provide your own data files, for instance if you want to replicate a specific experiment.

---


# 1. Instantiation of a Data Class

To start working with `chebai`, you first need to instantiate a GO-UniProt data class. This class is responsible for managing, interacting with, and preprocessing the GO and UniProt data.

In [None]:
from chebai.preprocessing.datasets.go_uniprot import GOUniProtOver250

In [7]:
go_class = GOUniProtOver250(go_branch="BP")

### Inheritance Hierarchy

GO_UniProt data classes inherit from [`_DynamicDataset`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L597), which in turn inherits from [`XYBaseDataModule`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L22). Specifically:

- **`_DynamicDataset`**: This class serves as an intermediate base class that provides additional functionality or customization for datasets that require dynamic behavior. It inherits from `XYBaseDataModule`, which provides the core methods for data loading and processing.

- **`XYBaseDataModule`**: This is the base class for data modules, providing foundational properties and methods for handling and processing datasets, including data splitting, loading, and preprocessing.

In summary, GO_UniProt data classes are designed to manage and preprocess protein data effectively by leveraging the capabilities provided by `XYBaseDataModule` through the `_DynamicDataset` intermediary.


### Configuration Parameters

Data classes related to proteins can be configured using the following main parameters:

- **`go_branch (str)`**: The Gene Ontology (GO) branch. The default value is `"all"`, which includes all branches of GO in the dataset.
  - **`"BP"`**: Biological Process branch.
  - **`"MF"`**: Molecular Function branch.
  - **`"CC"`**: Cellular Component branch.

This allows for more specific datasets focused on a particular aspect of gene function.

- **`splits_file_path (str, optional)`**: Path to a CSV file containing data splits. If not provided, the class will handle splits internally. The default is `None`.

### Additional Input Parameters

To get more control over various aspects of data loading, processing, and splitting, you can refer to documentation of additional parameters in docstrings of the respective classes: [`_GOUniProtDataExtractor`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/go_uniprot.py#L33), [`XYBaseDataModule`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L22), [`_DynamicDataset`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L597), etc.


# Available GO-UniProt Data Classes

__Note__: The implementation of these classes is available [here](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/go_uniprot.py).

Several GO-UniProt dataset classes are available. Usually, you want to use `GOUniProtOver250` or `GOUniProtOver50`. Both inherit from `_GOUniProtOverX`. The number indicates the threshold for selecting label classes. The selection is based on the annotations of each GO term and its ancestors across the dataset.

Refer to the `select_classes` method of `_GOUniProtOverX` for more details on the selection process.

If you need a different threshold, you can create your own subclass.
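The idea of threshold-based class selection can be pictured with the following minimal sketch. It is not the `select_classes` implementation: the helper `select_classes_by_count` and the toy annotations are hypothetical, the sketch assumes each protein's GO ID list already includes ancestor terms, and whether the real comparison is `>=` or `>` is an assumption.

```python
from collections import Counter


def select_classes_by_count(annotations, threshold):
    """Return sorted GO IDs annotated to at least `threshold` proteins.

    `annotations` maps a protein ID to its GO IDs (ancestors assumed included).
    """
    counts = Counter()
    for go_ids in annotations.values():
        # Count each GO ID once per protein, even if it is listed twice.
        counts.update(set(go_ids))
    return sorted(go_id for go_id, n in counts.items() if n >= threshold)


# Toy data: three hypothetical proteins with hypothetical GO IDs.
toy_annotations = {
    "P1": [41, 75, 122],
    "P2": [41, 122],
    "P3": [41, 165],
}
selected = select_classes_by_count(toy_annotations, threshold=2)
print(selected)  # [41, 122]
```

A lower threshold keeps more (and rarer) classes as labels; a higher one keeps only frequently annotated classes.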

---

# 2. Preparation / Setup Methods

Once a GOUniProt data class instance is created, it typically requires preparation before use. This step is to generate the actual dataset.

In [None]:
go_class.prepare_data()
go_class.setup()

### Automatic Execution
These methods are executed automatically within the data class instance. Users do not need to call them explicitly, as the code internally manages the preparation and setup of data, ensuring that it is ready for subsequent use in training and validation processes.


### Why is Preparation Needed?

- **Data Availability**: The preparation step ensures that the required GOUniProt data files are downloaded or loaded, which are essential for analysis.
- **Data Integrity**: It ensures that the data files are transformed into a compatible format required for model input.

### Main Methods for Data Preprocessing

The data preprocessing in a data class involves two main methods:

1. **`prepare_data` Method**:
   - **Purpose**: This method checks for the presence of raw data in the specified directory. If the raw data is missing, it fetches the ontology, creates a dataframe, and saves it to a file (`data.pkl`). The dataframe includes columns such as IDs, data representations, and labels.
   - **Documentation**: [PyTorch Lightning - `prepare_data`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#prepare-data)

2. **`setup` Method**:
   - **Purpose**: This method sets up the data module for training, validation, and testing. It checks for the processed data and, if necessary, performs additional setup to ensure the data is ready for model input. It also handles cross-validation settings if enabled.
   - **Description**: Transforms `data.pkl` into a model input data format (`data.pt`), ensuring that the data is in a format compatible for input to the model. The transformed data contains the following keys: `ident`, `features`, `labels`, and `group`. This method uses a subclass of Data Reader to perform the transformation.
   - **Documentation**: [PyTorch Lightning - `setup`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#setup)

These methods ensure that the data is correctly prepared and set up for subsequent use in training and validation processes.
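The "check, then generate only if missing" behaviour described for `prepare_data` can be sketched in plain Python. This is an illustration of the pattern only, not the `chebai` code: the helper `ensure_processed` and the toy row are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def ensure_processed(processed_dir, build_rows):
    """Create data.pkl only if it is missing, then return its path."""
    target = Path(processed_dir) / "data.pkl"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        with target.open("wb") as handle:
            pickle.dump(build_rows(), handle)
    return target


calls = []


def build_rows():
    calls.append(1)  # record how often the expensive step runs
    return [{"swiss_id": "14331_ARATH", "go_ids": [19222]}]


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_processed(Path(tmp) / "processed", build_rows)
    second = ensure_processed(Path(tmp) / "processed", build_rows)
    build_count = len(calls)
    paths_match = first == second

print(build_count)  # 1
```

The second call finds the existing file and skips the build step, which is why repeated runs do not regenerate the dataset.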

---

# 3. Overview of the 3 preprocessing stages

The `chebai` library follows a three-stage preprocessing pipeline, which is reflected in its file structure:

1. **Raw Data Stage**:
   - **Files**: `go-basic.obo` and `uniprot_sprot.dat`
   - **Description**: This stage contains the raw GO ontology data and the raw UniProtKB/Swiss-Prot data, serving as the initial input for further processing.
   - **File Paths**:
     - `data/GO_UniProt/raw/go-basic.obo`
     - `data/GO_UniProt/raw/uniprot_sprot.dat`

2. **Processed Data Stage 1**:
   - **File**: `data.pkl`
   - **Description**: This stage includes the data after initial processing. It contains sequence strings, class columns, and metadata but lacks data splits.
   - **File Path**: `data/GO_UniProt/${dataset_name}/processed/data.pkl`
   - **Additional File**: `classes.txt` - A file listing the relevant GO classes.

3. **Processed Data Stage 2**:
   - **File**: `data.pt`
   - **Description**: This final stage includes the encoded data in a format compatible with PyTorch, ready for model input. This stage also references data splits when available.
   - **File Path**: `data/GO_UniProt/${dataset_name}/processed/${reader_name}/data.pt`
   - **Additional File**: `splits.csv` - Contains saved splits for reproducibility.

**Note**: If `go_branch` is specified, the `dataset_name` will include the branch name in the format `${dataset_name}_${go_branch}`. Otherwise, it will just be `${dataset_name}`.

### Summary of File Paths

- **Raw Data**: `data/GO_UniProt/raw`
- **Processed Data 1**: `data/GO_UniProt/${dataset_name}/processed`
- **Processed Data 2**: `data/GO_UniProt/${dataset_name}/processed/${reader_name}`

This structured approach to data management ensures that each stage of data processing is well-organized and documented, from raw data acquisition to the preparation of model-ready inputs. It also facilitates reproducibility and traceability across different experiments.
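The path layout above can be expressed as a small helper. This is a sketch, not part of `chebai`: the function `processed_paths` is hypothetical, and the values `"GO250"` and `"protein_token"` are placeholder names for `${dataset_name}` and `${reader_name}`.

```python
from pathlib import PurePosixPath


def processed_paths(dataset_name, reader_name, go_branch=None, base="data/GO_UniProt"):
    """Build the stage paths following the layout described in this section."""
    # The branch name is appended to the dataset name when one is given.
    full_name = f"{dataset_name}_{go_branch}" if go_branch else dataset_name
    root = PurePosixPath(base)
    stage1 = root / full_name / "processed"
    return {
        "raw": root / "raw",
        "data_pkl": stage1 / "data.pkl",
        "data_pt": stage1 / reader_name / "data.pt",
    }


paths = processed_paths("GO250", "protein_token", go_branch="BP")
for stage, path in paths.items():
    print(stage, path)
```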

### Data Splits

- **Creation**: Data splits are generated dynamically "on the fly" during training and evaluation to ensure flexibility and adaptability to different tasks.
- **Reproducibility**: To maintain consistency across different runs, splits can be reproduced by comparing hashes with a fixed seed value.


---

# 4. Data Files and their structure

`chebai` creates and manages several data files during its operation. These files store the protein data and metadata needed for different tasks. Let’s explore these files and their content.


## <u>go-basic.obo</u> File

**Description**: The `go-basic.obo` file is a key resource in the Gene Ontology (GO) dataset, containing the ontology data that defines various biological processes, molecular functions, and cellular components, as well as their relationships. This file is downloaded directly from the Gene Ontology Consortium and serves as the foundational raw data for further processing in GO-based applications.

#### Example of a Term Document

```plaintext
[Term]
id: GO:0000032
name: cell wall mannoprotein biosynthetic process
namespace: biological_process
def: "The chemical reactions and pathways resulting in the formation of cell wall mannoproteins, any cell wall protein that contains covalently bound mannose residues." [GOC:ai]
synonym: "cell wall mannoprotein anabolism" EXACT []
is_a: GO:0006057 ! mannoprotein biosynthetic process
is_a: GO:0031506 ! cell wall glycoprotein biosynthetic process
```

**File Path**: `data/GO_UniProt/raw/go-basic.obo`

### Structure of `go-basic.obo`

The `go-basic.obo` file is organized into blocks of text known as "term documents." Each block starts with a `[Term]` header and contains various attributes that describe a specific biological process, molecular function, or cellular component within the GO ontology. These attributes include identifiers, names, relationships to other terms, and more.



### Breakdown of Attributes

Each term document in the `go-basic.obo` file consists of the following key attributes:

- **`[Term]`**: 
  - **Description**: Indicates the beginning of a new term in the ontology. Each term represents a distinct biological process, molecular function, or cellular component.

- **`id: GO:0000032`**: 
  - **Description**: A unique identifier for the biological term within the GO ontology.
  - **Example**: `GO:0000032` refers to the term "cell wall mannoprotein biosynthetic process."

- **`name: cell wall mannoprotein biosynthetic process`**: 
  - **Description**: The name of the biological process, molecular function, or cellular component being described.
  - **Example**: The name "cell wall mannoprotein biosynthetic process" is a descriptive label for the GO term with the identifier `GO:0000032`.

- **`namespace: biological_process`**: 
  - **Description**: Specifies which ontology the term belongs to. The main namespaces are `biological_process`, `molecular_function`, and `cellular_component`.

- **`is_a: GO:0006057`**: 
  - **Description**: Defines hierarchical relationships to other terms within the ontology. The `is_a` attribute indicates that the current term is a subclass or specific instance of the referenced term.
  - **Example**: The term `GO:0000032` ("cell wall mannoprotein biosynthetic process") is a subclass of `GO:0006057` and subclass of `GO:0031506`.
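To make the attribute breakdown concrete, here is a minimal sketch that reads one `[Term]` stanza into a dictionary. The helper `parse_term` is hypothetical and handles only the attributes discussed above; it is not how `chebai` reads the ontology.

```python
def parse_term(stanza):
    """Parse one [Term] stanza into id, name, namespace, and is_a parents."""
    term = {"is_a": []}
    for line in stanza.strip().splitlines():
        if line.startswith("[") or ": " not in line:
            continue  # skip the [Term] header and lines without "key: value"
        key, value = line.split(": ", 1)
        if key == "is_a":
            # Keep only the GO ID; the text after "!" is a human-readable comment.
            term["is_a"].append(value.split("!")[0].strip())
        elif key in ("id", "name", "namespace"):
            term[key] = value.strip()
    return term


stanza = """
[Term]
id: GO:0000032
name: cell wall mannoprotein biosynthetic process
namespace: biological_process
is_a: GO:0006057 ! mannoprotein biosynthetic process
is_a: GO:0031506 ! cell wall glycoprotein biosynthetic process
"""
term = parse_term(stanza)
print(term["id"], term["is_a"])
```

The two `is_a` entries are the direct parents of the term; following them repeatedly yields all ancestors.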


## <u>uniprot_sprot.dat</u> File

**Description**: The `uniprot_sprot.dat` file is a key component of the UniProtKB/Swiss-Prot dataset. It contains curated protein sequences with detailed annotation. Each entry in the file corresponds to a reviewed protein sequence, complete with metadata about its biological function, taxonomy, gene name, cross-references to other databases, and more. Below is a breakdown of the structure and key attributes in the file, using the provided example.


### Example of a Protein Entry

```plaintext
ID   002L_FRG3G              Reviewed;         320 AA.
AC   Q6GZX3;
DT   28-JUN-2011, integrated into UniProtKB/Swiss-Prot.
DT   19-JUL-2004, sequence version 1.
DT   08-NOV-2023, entry version 46.
DE   RecName: Full=Uncharacterized protein 002L;
GN   ORFNames=FV3-002L;
OS   Frog virus 3 (isolate Goorha) (FV-3).
OC   Viruses; Varidnaviria; Bamfordvirae; Nucleocytoviricota; Megaviricetes;
OX   NCBI_TaxID=654924;
OH   NCBI_TaxID=8404; Lithobates pipiens (Northern leopard frog) (Rana pipiens).
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX   PubMed=15165820; DOI=10.1016/j.virol.2004.02.019;
RA   Tan W.G., Barkman T.J., Gregory Chinchar V., Essani K.;
RT   "Comparative genomic analyses of frog virus 3, type species of the genus
RT   Ranavirus (family Iridoviridae).";
RL   Virology 323:70-84(2004).
CC   -!- SUBCELLULAR LOCATION: Host membrane {ECO:0000305}; Single-pass membrane
CC       protein {ECO:0000305}.
DR   EMBL; AY548484; AAT09661.1; -; Genomic_DNA.
DR   RefSeq; YP_031580.1; NC_005946.1.
DR   GeneID; 2947774; -.
DR   KEGG; vg:2947774; -.
DR   Proteomes; UP000008770; Segment.
DR   GO; GO:0033644; C:host cell membrane; IEA:UniProtKB-SubCell.
DR   GO; GO:0016020; C:membrane; IEA:UniProtKB-KW.
PE   4: Predicted;
KW   Host membrane; Membrane; Reference proteome; Transmembrane;
KW   Transmembrane helix.
FT   CHAIN           1..320
FT                   /note="Uncharacterized protein 002L"
FT                   /id="PRO_0000410509"
SQ   SEQUENCE   320 AA;  34642 MW;  9E110808B6E328E0 CRC64;
     MSIIGATRLQ NDKSDTYSAG PCYAGGCSAF TPRGTCGKDW DLGEQTCASG FCTSQPLCAR
     IKKTQVCGLR YSSKGKDPLV SAEWDSRGAP YVRCTYDADL IDTQAQVDQF VSMFGESPSL
     AERYCMRGVK NTAGELVSRV SSDADPAGGW CRKWYSAHRG PDQDAALGSF CIKNPGAADC
     KCINRASDPV YQKVKTLHAY PDQCWYVPCA ADVGELKMGT QRDTPTNCPT QVCQIVFNML
     DDGSVTMDDV KNTINCDFSK YVPPPPPPKP TPPTPPTPPT PPTPPTPPTP PTPRPVHNRK
     VMFFVAGAVL VAILISTVRW
//
```

**File Path**: `data/GO_UniProt/raw/uniprot_sprot.dat`


### Structure of `uniprot_sprot.dat`

The `uniprot_sprot.dat` file is organized into blocks of text, each representing a single protein entry. These blocks contain specific tags and fields that describe different aspects of the protein, including its sequence, function, taxonomy, and cross-references to external databases.

### Breakdown of Attributes

Each protein entry in the `uniprot_sprot.dat` file is structured with specific tags and sections that describe the protein in detail. Here's a breakdown of the key attributes:

- **`ID`**: 
  - **Description**: Contains the unique identifier for the protein and its status (e.g., `Reviewed` indicates the sequence has been manually curated).
  - **Example**: `002L_FRG3G` is the identifier for the protein from Frog virus 3.

- **`AC`**: 
  - **Description**: Accession number, a unique identifier for the protein sequence.
  - **Example**: `Q6GZX3` is the accession number for this entry.

- **`DR`**: 
  - **Description**: Cross-references to other databases like EMBL, RefSeq, KEGG, and GeneID.
  - **Example**: This entry is cross-referenced with the EMBL database, RefSeq, GO, etc.

- **`GO`** (within `DR` lines): 
  - **Description**: Gene Ontology annotations, given as `DR` lines whose first field is `GO`. They describe the cellular component, biological process, or molecular function associated with the protein.
  - **Example**: The protein is associated with the GO terms `GO:0033644` (host cell membrane) and `GO:0016020` (membrane).

- **`SQ`**: 
  - **Description**: The amino acid sequence of the protein.
  - **Example**: The sequence consists of 320 amino acids.

The `uniprot_sprot.dat` file is an extensively curated resource, containing comprehensive protein data used for various bioinformatics applications.
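As an illustration of the line tags, the sketch below pulls the identifier, accession, and sequence out of one entry. The helper `parse_entry` is hypothetical, and the entry is abbreviated to the `ID`, `AC`, and `SQ` parts with only the first two sequence lines, so the parsed sequence is shorter than the 320 residues stated in the header.

```python
def parse_entry(entry_text):
    """Extract swiss_id, accession list, and sequence from one flat-file entry."""
    swiss_id, accessions, seq_parts = None, [], []
    in_sequence = False
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break  # "//" terminates an entry
        if in_sequence:
            # Sequence lines are grouped in blocks of ten separated by spaces.
            seq_parts.append(line.replace(" ", ""))
        elif line.startswith("ID   "):
            swiss_id = line.split()[1]
        elif line.startswith("AC   "):
            accessions += [a.strip() for a in line[5:].split(";") if a.strip()]
        elif line.startswith("SQ   "):
            in_sequence = True
    return {"swiss_id": swiss_id, "accession": accessions, "sequence": "".join(seq_parts)}


entry = """ID   002L_FRG3G              Reviewed;         320 AA.
AC   Q6GZX3;
SQ   SEQUENCE   320 AA;  34642 MW;  9E110808B6E328E0 CRC64;
     MSIIGATRLQ NDKSDTYSAG PCYAGGCSAF TPRGTCGKDW DLGEQTCASG FCTSQPLCAR
     IKKTQVCGLR YSSKGKDPLV SAEWDSRGAP YVRCTYDADL IDTQAQVDQF VSMFGESPSL
//"""
parsed = parse_entry(entry)
print(parsed["swiss_id"], parsed["accession"], len(parsed["sequence"]))
```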

__Note__: For more detailed information, refer [here](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/keywlist.txt).

Consider the following line from the example above:
```plaintext
DR   GO; GO:0033644; C:host cell membrane; IEA:UniProtKB-SubCell.
```

The line contains a **Gene Ontology (GO) annotation** describing the protein's subcellular location. Here's a detailed breakdown:

- **`GO:0033644`**: This is the specific **GO term** identifier for "host cell membrane," which indicates that the protein is associated with or located at the membrane of the host cell.

- **`IEA`**: This stands for **Inferred from Electronic Annotation**, which is part of the **GO Evidence Codes**. **IEA** indicates that the annotation was automatically generated based on computational methods rather than direct experimental evidence. While **IEA** annotations are useful, they are generally considered less reliable than manually curated or experimentally verified evidence codes.
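The fields of such a line can be separated with a few string operations. The following is a minimal sketch with a hypothetical helper `parse_go_xref`; it assumes the term name contains no semicolon, and it expands the aspect letters `C`, `F`, and `P` to the three GO namespaces.

```python
ASPECTS = {
    "C": "cellular_component",
    "F": "molecular_function",
    "P": "biological_process",
}


def parse_go_xref(line):
    """Split a 'DR   GO; ...' line into GO ID, aspect, term name, evidence code."""
    fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
    if len(fields) != 4 or fields[0] != "GO":
        return None  # not a GO cross-reference
    _, go_id, aspect_and_name, evidence = fields
    aspect, name = aspect_and_name.split(":", 1)
    return {
        "go_id": go_id,
        "aspect": ASPECTS.get(aspect, aspect),
        "name": name,
        # The evidence field is "<code>:<source>"; keep only the code.
        "evidence": evidence.split(":", 1)[0],
    }


xref = parse_go_xref("DR   GO; GO:0033644; C:host cell membrane; IEA:UniProtKB-SubCell.")
print(xref)
```

Having the evidence code as a separate field makes it straightforward to keep or drop annotations by evidence type.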

__Note__: For more details on evidence codes, see Section 5.2.

## <u>data.pkl</u> File

**Description**: This file is generated by the `prepare_data` method and contains the processed GO data in a dataframe format. It includes protein IDs, data representations (such as sequence strings), and class columns with boolean values.

In [5]:
import pandas as pd
import os

In [8]:
pkl_df = pd.DataFrame(
    pd.read_pickle(
        os.path.join(
            go_class.processed_dir_main,
            go_class.processed_dir_main_file_names_dict["data"],
        )
    )
)
print("Size of the data (rows x columns): ", pkl_df.shape)
pkl_df.head()

Size of the data (rows x columns):  (32933, 1049)


Unnamed: 0,swiss_id,accession,go_ids,sequence,41,75,122,165,209,226,...,2000145,2000146,2000147,2000241,2000242,2000243,2000377,2001141,2001233,2001234
8,14331_ARATH,"P42643,Q945M2,Q9M0S7",[19222],MATPGASSARDEFVYMAKLAEQAERYEEMVEFMEKVAKAVDKDELT...,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,14331_CAEEL,"P41932,Q21537","[132, 1708, 5634, 5737, 5938, 6611, 7346, 8340...",MSDTVEELVQRAKLAEQAERYDDMAAAMKKVTEQGQELSNEERNLL...,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
10,14331_MAIZE,P49106,"[3677, 5634, 10468, 44877]",MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVE...,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13,14332_MAIZE,Q01526,"[3677, 5634, 10468, 44877]",MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVE...,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
14,14333_ARATH,"P42644,F4KBI7,Q945L2","[5634, 5737, 6995, 9409, 9631, 16036, 19222, 5...",MSTREENVYMAKLAEQAERYEEMVEFMEKVAKTVDVEELSVEERNL...,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


**File Path**: `data/GO_UniProt/${dataset_name}/processed/data.pkl`


### Structure of `data.pkl`
`data.pkl` has the following structure:
- **Column 0**: Contains the Swiss-Prot identifier (`swiss_id`) of each protein.
- **Column 1**: Contains the accession(s) of each protein.
- **Column 2**: Contains the list of GO IDs (Gene Ontology identifiers) that each protein is annotated with.
- **Column 3**: Contains the amino acid sequence of the protein in one-letter notation.
- **Column 4 and onwards**: Contains the labels, one boolean column per selected GO class.

This structure ensures that the data is organized and ready for further processing, such as further encoding.
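The relation between the `go_ids` column and the boolean label columns can be sketched as follows. This assumes a label is `True` exactly when the class ID appears in the protein's GO ID list; the helper `build_label_row` and the example protein are hypothetical.

```python
def build_label_row(go_ids, selected_classes):
    """Return one boolean per selected class: True if the protein has that GO ID."""
    annotated = set(go_ids)
    return [class_id in annotated for class_id in selected_classes]


selected_classes = [41, 75, 122, 165, 209]  # first label columns shown above
row = {
    "swiss_id": "EXAMPLE_PROT",  # hypothetical protein
    "go_ids": [75, 209, 19222],  # hypothetical annotations
}
labels = build_label_row(row["go_ids"], selected_classes)
print(dict(zip(selected_classes, labels)))
# {41: False, 75: True, 122: False, 165: False, 209: True}
```

GO IDs that are not among the selected classes (here `19222`) simply produce no label column.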


## <u>data.pt</u> File

**Description**: Generated by the `setup` method, this file contains encoded data in a format compatible with the PyTorch library. It includes keys such as `ident`, `features`, `labels`, and `group`, making it ready for model input.

In [9]:
import torch

In [12]:
data_pt = torch.load(
    os.path.join(go_class.processed_dir, go_class.processed_file_names_dict["data"]),
    weights_only=False,
)
print("Type of loaded data:", type(data_pt))
print("Content of the data file: \n", data_pt[0])

Type of loaded data: <class 'list'>
Content of the data file: 
 {'features': [10, 14, 15, 23, 13, 14, 11, 11, 14, 16, 20, 27, 25, 28, 22, 10, 14, 21, 17, 14, 27, 18, 14, 27, 16, 22, 27, 27, 10, 28, 27, 25, 10, 27, 21, 28, 14, 21, 14, 28, 20, 21, 20, 27, 17, 15, 28, 27, 27, 16, 19, 17, 17, 11, 28, 14, 22, 21, 19, 28, 12, 13, 14, 16, 16, 14, 11, 26, 16, 12, 12, 11, 11, 12, 27, 18, 21, 27, 27, 11, 16, 13, 19, 20, 20, 29, 28, 11, 17, 12, 16, 20, 22, 16, 11, 21, 12, 27, 15, 27, 17, 11, 20, 12, 24, 20, 13, 12, 17, 21, 17, 17, 20, 15, 12, 17, 28, 23, 14, 14, 14, 11, 13, 20, 11, 21, 28, 25, 22, 17, 21, 10, 21, 13, 20, 22, 29, 16, 22, 17, 14, 27, 25, 21, 11, 13, 18, 27, 16, 21, 20, 14, 14, 27, 29, 15, 17, 15, 14, 22, 21, 14, 14, 18, 20, 12, 14, 19, 11, 27, 17, 14, 23, 15, 29, 23, 12, 16, 17, 13, 17, 14, 17, 19, 25, 11, 28, 25, 22, 22, 27, 12, 17, 19, 11, 23, 20, 16, 14, 24, 19, 17, 14, 21, 18, 14, 25, 20, 27, 14, 12, 14, 27, 17, 20, 15, 17, 13, 27, 27, 11, 22, 21, 20, 11, 15, 17, 12, 10, 18, 17

**File Path**: `data/GO_UniProt/${dataset_name}/processed/${reader_name}/data.pt`

The `data.pt` file is a list where each element is a dictionary with the following keys:

- **`features`**: 
  - **Description**: This key holds the input features for the model. In the output above, the features are the token indices produced by the reader for the protein sequence.

- **`labels`**: 
  - **Description**: This key contains the labels or target values associated with each instance. Labels are also stored as tensors and are used by the model to learn and make predictions.

- **`ident`**: 
  - **Description**: This key holds identifiers for each data instance. These identifiers help track and reference the individual samples in the dataset.


## <u>classes.txt</u> File

**Description**: This file lists the GO classes selected based on a specified threshold. It ensures that only the relevant classes are included in the dataset for analysis.

In [13]:
with open(os.path.join(go_class.processed_dir_main, "classes.txt"), "r") as file:
    for i in range(5):
        line = file.readline()
        print(line.strip())

41
75
122
165
209


**File Path**: `data/GO_UniProt/${dataset_name}/processed/classes.txt`

The `classes.txt` file lists the selected GO classes. These classes are chosen based on the threshold of the dataset class (for example, 250 for `GOUniProtOver250`). Each line contains one GO class ID, matching the label column names in `data.pkl`.

This file is essential for organizing the data and ensuring that only relevant classes, as defined by the threshold, are included in subsequent processing and analysis tasks.

## <u>splits.csv</u> File

**Description**: This file contains saved data splits from previous runs. During subsequent runs, it is used to reconstruct the train, validation, and test splits by filtering the encoded data (`data.pt`) based on the IDs stored in `splits.csv`.

In [14]:
csv_df = pd.read_csv(os.path.join(go_class.processed_dir_main, "splits.csv"))
csv_df.head()

Unnamed: 0,id,split
0,14331_ARATH,train
1,14331_CAEEL,train
2,14331_MAIZE,train
3,14332_MAIZE,train
4,14333_ARATH,train


**File Path**: `data/GO_UniProt/${dataset_name}/processed/splits.csv`

The `splits.csv` file contains the saved data splits from previous runs, including the train, validation, and test sets. During subsequent runs, this file is used to reconstruct these splits by filtering the encoded data (`data.pt`) based on the IDs stored in `splits.csv`. This ensures consistency and reproducibility in data splitting, allowing for reliable evaluation and comparison of model performance across different runs.
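Reconstructing splits by filtering on IDs can be sketched with the standard library alone. This is an illustration, not the `chebai` implementation: the helper `rebuild_splits` is hypothetical, the toy CSV omits the index column of the real file, the split name `validation` is an assumption, and the feature and label values are made up.

```python
import csv
import io


def rebuild_splits(split_rows, records):
    """Group encoded records by the split recorded for their identifier."""
    split_of = {row["id"]: row["split"] for row in split_rows}
    grouped = {}
    for record in records:
        split = split_of.get(record["ident"])
        if split is not None:  # ignore records without a saved split
            grouped.setdefault(split, []).append(record)
    return grouped


# Toy stand-in for splits.csv.
splits_file = io.StringIO(
    "id,split\n"
    "14331_ARATH,train\n"
    "14331_CAEEL,train\n"
    "14331_MAIZE,validation\n"
)
# Toy stand-in for the entries of data.pt.
records = [
    {"ident": "14331_ARATH", "features": [10, 14, 15], "labels": [False]},
    {"ident": "14331_CAEEL", "features": [10, 11, 20], "labels": [False]},
    {"ident": "14331_MAIZE", "features": [10, 14, 11], "labels": [True]},
]
splits = rebuild_splits(csv.DictReader(splits_file), records)
print({name: [r["ident"] for r in items] for name, items in splits.items()})
```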

---

## 5.1 Protein Representation Using Amino Acid Sequence Notation

Proteins are composed of chains of amino acids, and these sequences can be represented using a one-letter notation for each amino acid. This notation provides a concise way to describe the primary structure of a protein.

### Example Protein Sequence

Protein: **Lysozyme C** from **Gallus gallus** (Chicken).  
[Lysozyme C - UniProtKB P00698](https://www.uniprot.org/uniprotkb/P00698/entry#function)

- **Sequence**: `MRSLLILVLCFLPLAALGKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL`
- **Sequence Length**: 147

In this sequence, each letter corresponds to a specific amino acid. This notation is widely used in bioinformatics and molecular biology to represent protein sequences.

### Tokenization and Encoding

To tokenize and numerically encode this protein sequence, the `ProteinDataReader` class is used. This class allows for n-gram tokenization, where the `n_gram` parameter defines the size of the tokenized units. If `n_gram` is not provided (default is `None`), each amino acid letter is treated as a single token.

For more details, you can explore the implementation of the `ProteinDataReader` class in the source code [here](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/reader.py).

In [15]:
from chebai.preprocessing.reader import ProteinDataReader

In [16]:
protein_dr = ProteinDataReader()

In [17]:
protein_dr._read_data("MRSLLILVLCFLPLAALGK")

[10, 16, 11, 17, 17, 12, 17, 28, 17, 24, 25, 17, 23, 17, 14, 14, 17, 13, 21]

The numbers mentioned above refer to the index of each individual token from the [`tokens.txt`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/bin/protein_token/tokens.txt) file, which is used by the `ProteinDataReader` class. 

Each token in the `tokens.txt` file corresponds to a specific amino-acid letter, and these tokens are referenced by their index. Additionally, the index values are offset by the `EMBEDDING_OFFSET`, ensuring that the token embeddings are adjusted appropriately during processing.
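The index-plus-offset scheme and the effect of `n_gram` can be sketched as follows. This is not the `ProteinDataReader` code: the functions `tokenize` and `encode` are hypothetical, the five-letter vocabulary is a toy stand-in for `tokens.txt`, the offset value of 10 is an assumption, and overlapping n-gram windows are assumed.

```python
EMBEDDING_OFFSET = 10  # assumed value for this sketch


def tokenize(sequence, n_gram=None):
    """Split into single letters, or into overlapping n-grams if n_gram is set."""
    if n_gram is None:
        return list(sequence)
    return [sequence[i:i + n_gram] for i in range(len(sequence) - n_gram + 1)]


def encode(sequence, vocab, n_gram=None):
    """Map each token to its position in `vocab` plus the embedding offset."""
    index_of = {token: i for i, token in enumerate(vocab)}
    return [index_of[token] + EMBEDDING_OFFSET for token in tokenize(sequence, n_gram)]


toy_vocab = ["M", "S", "I", "G", "A"]  # toy vocabulary, not the real tokens.txt
encoded = encode("MSIGAM", toy_vocab)
trigrams = tokenize("MSIGA", n_gram=3)
print(encoded)   # [10, 11, 12, 13, 14, 10]
print(trigrams)  # ['MSI', 'SIG', 'IGA']
```

With n-grams, the vocabulary consists of n-letter tokens instead of single letters, so it grows quickly with `n_gram`.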

### The 20 Amino Acids and Their One-Letter Notations

Here is a list of the 20 standard amino acids, along with their one-letter notations and descriptions:

| One-Letter Notation | Amino Acid Name      | Description                                             |
|---------------------|----------------------|---------------------------------------------------------|
| **A**               | Alanine              | Non-polar, aliphatic amino acid.                        |
| **C**               | Cysteine             | Polar, contains a thiol group, forms disulfide bonds.   |
| **D**               | Aspartic Acid        | Acidic, negatively charged at physiological pH.         |
| **E**               | Glutamic Acid        | Acidic, negatively charged at physiological pH.         |
| **F**               | Phenylalanine        | Aromatic, non-polar.                                    |
| **G**               | Glycine              | Smallest amino acid, non-polar.                         |
| **H**               | Histidine            | Polar, positively charged, can participate in enzyme active sites. |
| **I**               | Isoleucine           | Non-polar, aliphatic.                                   |
| **K**               | Lysine               | Basic, positively charged at physiological pH.          |
| **L**               | Leucine              | Non-polar, aliphatic.                                   |
| **M**               | Methionine           | Non-polar, contains sulfur, start codon in mRNA translation. |
| **N**               | Asparagine           | Polar, uncharged.                                       |
| **P**               | Proline              | Non-polar, introduces kinks in protein chains.          |
| **Q**               | Glutamine            | Polar, uncharged.                                       |
| **R**               | Arginine             | Basic, positively charged, involved in binding phosphate groups. |
| **S**               | Serine               | Polar, can be phosphorylated.                           |
| **T**               | Threonine            | Polar, can be phosphorylated.                           |
| **V**               | Valine               | Non-polar, aliphatic.                                   |
| **W**               | Tryptophan           | Aromatic, non-polar, largest amino acid.                |
| **Y**               | Tyrosine             | Aromatic, polar, can be phosphorylated.                 |

### Understanding Protein Sequences

In the example sequence, each letter represents one of the above amino acids. The sequence reflects the specific order of amino acids in the protein, which is critical for its structure and function.

This notation is used extensively in various bioinformatics tools and databases to study protein structure, function, and interactions.


_Note_: For more on amino acid sequences, see https://en.wikipedia.org/wiki/Protein_primary_structure

---

## 5.2 More on GO Evidence Codes

The **Gene Ontology (GO) Evidence Codes** provide a way to indicate the level of evidence supporting a GO annotation. Here's a list of the GO evidence codes with brief descriptions:

| **Evidence Code**     | **Description** |
|-----------------------|-----------------|
| **EXP**               | [Inferred from Experiment (EXP)](http://wiki.geneontology.org/index.php/Inferred_from_Experiment_(EXP)) |
| **IDA**               | [Inferred from Direct Assay (IDA)](http://wiki.geneontology.org/index.php/Inferred_from_Direct_Assay_(IDA)) |
| **IPI**               | [Inferred from Physical Interaction (IPI)](http://wiki.geneontology.org/index.php/Inferred_from_Physical_Interaction_(IPI)) |
| **IMP**               | [Inferred from Mutant Phenotype (IMP)](http://wiki.geneontology.org/index.php/Inferred_from_Mutant_Phenotype_(IMP)) |
| **IGI**               | [Inferred from Genetic Interaction (IGI)](http://wiki.geneontology.org/index.php/Inferred_from_Genetic_Interaction_(IGI)) |
| **IEP**               | [Inferred from Expression Pattern (IEP)](http://wiki.geneontology.org/index.php/Inferred_from_Expression_Pattern_(IEP)) |
| **HTP**               | [Inferred from High Throughput Experiment (HTP)](http://wiki.geneontology.org/index.php/Inferred_from_High_Throughput_Experiment_(HTP)) |
| **HDA**               | [Inferred from High Throughput Direct Assay (HDA)](http://wiki.geneontology.org/index.php/Inferred_from_High_Throughput_Direct_Assay_(HDA)) |
| **HMP**               | [Inferred from High Throughput Mutant Phenotype (HMP)](http://wiki.geneontology.org/index.php/Inferred_from_High_Throughput_Mutant_Phenotype_(HMP)) |
| **HGI**               | [Inferred from High Throughput Genetic Interaction (HGI)](http://wiki.geneontology.org/index.php/Inferred_from_High_Throughput_Genetic_Interaction_(HGI)) |
| **HEP**               | [Inferred from High Throughput Expression Pattern (HEP)](http://wiki.geneontology.org/index.php/Inferred_from_High_Throughput_Expression_Pattern_(HEP)) |
| **IBA**               | [Inferred from Biological aspect of Ancestor (IBA)](http://wiki.geneontology.org/index.php/Inferred_from_Biological_aspect_of_Ancestor_(IBA)) |
| **IBD**               | [Inferred from Biological aspect of Descendant (IBD)](http://wiki.geneontology.org/index.php/Inferred_from_Biological_aspect_of_Descendant_(IBD)) |
| **IKR**               | [Inferred from Key Residues (IKR)](http://wiki.geneontology.org/index.php/Inferred_from_Key_Residues_(IKR)) |
| **IRD**               | [Inferred from Rapid Divergence (IRD)](http://wiki.geneontology.org/index.php/Inferred_from_Rapid_Divergence(IRD)) |
| **ISS**               | [Inferred from Sequence or Structural Similarity (ISS)](http://wiki.geneontology.org/index.php/Inferred_from_Sequence_or_structural_Similarity_(ISS)) |
| **ISO**               | [Inferred from Sequence Orthology (ISO)](http://wiki.geneontology.org/index.php/Inferred_from_Sequence_Orthology_(ISO)) |
| **ISA**               | [Inferred from Sequence Alignment (ISA)](http://wiki.geneontology.org/index.php/Inferred_from_Sequence_Alignment_(ISA)) |
| **ISM**               | [Inferred from Sequence Model (ISM)](http://wiki.geneontology.org/index.php/Inferred_from_Sequence_Model_(ISM)) |
| **RCA**               | [Inferred from Reviewed Computational Analysis (RCA)](http://wiki.geneontology.org/index.php/Inferred_from_Reviewed_Computational_Analysis_(RCA)) |
| **IEA**               | [Inferred from Electronic Annotation (IEA)](http://wiki.geneontology.org/index.php/Inferred_from_Electronic_Annotation_(IEA)) |
| **TAS**               | [Traceable Author Statement (TAS)](http://wiki.geneontology.org/index.php/Traceable_Author_Statement_(TAS)) |
| **NAS**               | [Non-traceable Author Statement (NAS)](http://wiki.geneontology.org/index.php/Non-traceable_Author_Statement_(NAS)) |
| **IC**                | [Inferred by Curator (IC)](http://wiki.geneontology.org/index.php/Inferred_by_Curator_(IC)) |
| **ND**                | [No Biological Data Available (ND)](http://wiki.geneontology.org/index.php/No_biological_Data_available_(ND)_evidence_code) |
| **NR**                | Not Recorded |


### **Grouping of Codes**:

- **Experimental Evidence Codes**:
  - **EXP**, **IDA**, **IPI**, **IMP**, **IGI**, **IEP**
  
- **High-Throughput Experimental Codes**:
  - **HTP**, **HDA**, **HMP**, **HGI**, **HEP**

- **Phylogenetically-Inferred Codes**:
  - **IBA**, **IBD**, **IKR**, **IRD**

- **Author/Curator Inferred Codes**:
  - **TAS**, **IC**, **NAS**

- **Computational Evidence Codes**:
  - **IEA**, **ISS**, **ISA**, **ISM**, **ISO**, **RCA**

- **Others**:
  - **ND** (No Biological Data Available), **NR** (Not Recorded)


These evidence codes ensure transparency and give researchers an understanding of how confident they can be in a particular GO annotation.
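The grouping above lends itself to a simple lookup table, for example to filter annotations by evidence type. The sketch below is illustrative only: the group keys, the helpers `evidence_group` and `keep_annotations`, and the third annotation in the toy list are hypothetical.

```python
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "high_throughput": {"HTP", "HDA", "HMP", "HGI", "HEP"},
    "phylogenetic": {"IBA", "IBD", "IKR", "IRD"},
    "author_curator": {"TAS", "IC", "NAS"},
    "computational": {"IEA", "ISS", "ISA", "ISM", "ISO", "RCA"},
    "other": {"ND", "NR"},
}


def evidence_group(code):
    """Return the group name for an evidence code, or None if it is unknown."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return None


def keep_annotations(annotations, allowed_groups):
    """Keep (go_id, code) pairs whose evidence code is in an allowed group."""
    return [
        (go_id, code)
        for go_id, code in annotations
        if evidence_group(code) in allowed_groups
    ]


annotations = [
    ("GO:0033644", "IEA"),
    ("GO:0016020", "IEA"),
    ("GO:0005634", "IDA"),  # hypothetical experimentally supported annotation
]
kept = keep_annotations(annotations, {"experimental", "high_throughput"})
print(kept)  # [('GO:0005634', 'IDA')]
```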

__Note__: For more information on GO evidence codes, see [here](https://geneontology.org/docs/guide-go-evidence-codes/).

---