# Introduction

This notebook is a guide for new users of the `chebai` package and focuses on its support for Gene Ontology (GO) and Swiss UniProt protein data. It explains how to instantiate the main data class, how the data files are structured, and how proteins are represented as amino acid sequences.

One key aspect of the package is its **dataset management system**. In the training process, chemical datasets play a critical role by providing the necessary data for model learning and validation. The chebai package simplifies the handling of these datasets by **automatically creating** them as needed. This means that users do not have to manually prepare datasets before running models; the package will generate and organize the data files based on the parameters and encodings selected. This feature ensures that the right data is available and formatted properly.

---

# Information for Protein Dataset

# 1. Instantiation of a Data Class

To start working with `chebai`, you first need to instantiate a GO_UniProt data class. This class is responsible for managing, interacting with, and preprocessing the GO and UniProt data.

### Inheritance Hierarchy

GO_UniProt data classes inherit from [`_DynamicDataset`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L597), which in turn inherits from [`XYBaseDataModule`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L22). Specifically:

- **`_DynamicDataset`**: This class serves as an intermediate base class that provides additional functionality or customization for datasets that require dynamic behavior. It inherits from `XYBaseDataModule`, which provides the core methods for data loading and processing.

- **`XYBaseDataModule`**: This is the base class for data modules, providing foundational properties and methods for handling and processing datasets, including data splitting, loading, and preprocessing.

In summary, GO_UniProt data classes manage and preprocess protein data by leveraging the capabilities provided by `XYBaseDataModule` through the `_DynamicDataset` intermediary.
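
The chain described above can be pictured with a few stand-in classes. This sketch does not import `chebai`: the class bodies are empty placeholders that only mirror the names used in this section, `GOUniProtExample` is a hypothetical name, and the real package may place further intermediate classes between them.

```python
# Stand-in classes that mirror only the names from the hierarchy above.
# They are empty placeholders, not the chebai implementations.
class XYBaseDataModule:
    """Placeholder for the base data module."""


class _DynamicDataset(XYBaseDataModule):
    """Placeholder for the intermediate base class."""


class GOUniProtExample(_DynamicDataset):
    """Hypothetical GO_UniProt data class, used for illustration only."""


# The method resolution order lists the classes Python searches,
# most specific first.
mro_names = [cls.__name__ for cls in GOUniProtExample.__mro__]
print(" -> ".join(mro_names))
```

With `chebai` installed, the same `__mro__` inspection can be applied to a real data class to see its full chain.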


### Configuration Parameters

Data classes related to proteins can be configured using the following main parameters:

- **`go_branch (str)`**: The Gene Ontology (GO) branch. The default value is `"all"`, which includes all branches of GO in the dataset.

- **`splits_file_path (str, optional)`**: Path to a CSV file containing data splits. If not provided, the class will handle splits internally. The default is `None`.

### Additional Input Parameters

To get more control over various aspects of data loading, processing, and splitting, you can refer to documentation of additional parameters in docstrings of the respective classes: [`_GOUniProtDataExtractor`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/go_uniprot.py#L33), [`XYBaseDataModule`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L22), [`_DynamicDataset`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L597), etc.

### Available GOUniProt Data Classes

__Note__: The implementations of these classes are available [here](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/go_uniprot.py).

#### `GOUniProtOver250`

A class for extracting data from the Gene Ontology and Swiss UniProt dataset with a threshold of 250 for selecting classes.

- **Inheritance**: Inherits from `_GOUniProtOverX`.

#### `GOUniProtOver50`

A class for extracting data from the Gene Ontology and Swiss UniProt dataset with a threshold of 50 for selecting classes.

- **Inheritance**: Inherits from `_GOUniProtOverX`.
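
To make the idea of a selection threshold concrete, here is a minimal sketch in plain Python. It assumes the threshold is a minimum number of annotated proteins per GO class and that the comparison is "at least"; both are assumptions, and the function name `select_classes` and the toy data are hypothetical. The real classes may count annotations differently.

```python
from collections import Counter


def select_classes(annotations, threshold):
    """Return the sorted GO IDs annotated to at least `threshold` proteins."""
    # Count each GO ID once per protein, even if it is listed twice.
    counts = Counter(
        go_id for go_ids in annotations.values() for go_id in set(go_ids)
    )
    return sorted(go_id for go_id, n in counts.items() if n >= threshold)


# Toy annotations: protein ID -> list of GO IDs (made-up values).
toy_annotations = {
    "prot_1": [5634, 5737],
    "prot_2": [5634],
    "prot_3": [5634, 5737, 3677],
}

# With a threshold of 2, the class annotated to a single protein is dropped.
selected = select_classes(toy_annotations, threshold=2)
print(selected)  # [5634, 5737]
```

Under this reading, `GOUniProtOver250` and `GOUniProtOver50` would differ only in the value passed as `threshold`.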


### Instantiation Example

In [None]:
from chebai.preprocessing.datasets.go_uniprot import GOUniProtOver250

In [2]:
go_class = GOUniProtOver250()

## GOUniProt Data File Structure

1. **`Raw Data Files`**: (e.g., `.obo` file and `.dat` file)
   - **Description**: These files contain the raw GO ontology and Swiss UniProt data, which are downloaded directly from their respective websites. They serve as the foundation for data processing. Since there are no versions associated with this dataset, common raw files are used for all subsets of the data.
   - **File Paths**:
     - `data/GO_UniProt/raw/${filename}.obo`
     - `data/GO_UniProt/raw/${filename}.dat`

2. **`data.pkl`**
   - **Description**: This file is generated by the `prepare_data` method and contains the processed data in a dataframe format. It includes protein IDs, data representations (amino acid sequences), and class columns with boolean values.
   - **File Path**: `data/GO_UniProt/${dataset_name}/processed/data.pkl`

3. **`data.pt`**
   - **Description**: Generated by the `setup` method, this file contains encoded data in a format compatible with the PyTorch library. It includes keys such as `ident`, `features`, `labels`, and `group`, making it ready for model input.
   - **File Path**: `data/GO_UniProt/${dataset_name}/processed/${reader_name}/data.pt`

4. **`classes.txt`**
   - **Description**: This file lists the selected GO or UniProt classes based on a specified threshold. It ensures that only the relevant classes are included in the dataset for analysis.
   - **File Path**: `data/GO_UniProt/${dataset_name}/processed/classes.txt`

5. **`splits.csv`**
   - **Description**: This file contains saved data splits from previous runs. During subsequent runs, it is used to reconstruct the train, validation, and test splits by filtering the encoded data (`data.pt`) based on the IDs stored in `splits.csv`.
   - **File Path**: `data/GO_UniProt/${dataset_name}/processed/splits.csv`
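
The way `splits.csv` is used to rebuild the splits can be sketched as a simple filter over the encoded records. The column names `id` and `split`, the split labels, and the toy records below are assumptions made for illustration; the actual file layout is not shown in this notebook.

```python
import csv
import io

# Hypothetical contents of a splits file; the column names are assumptions.
splits_csv = io.StringIO(
    "id,split\n"
    "prot_1,train\n"
    "prot_2,validation\n"
    "prot_3,test\n"
    "prot_4,train\n"
)

# Stand-in for the encoded records in data.pt (made-up, shortened values).
encoded = [
    {"ident": "prot_1", "features": [10, 14], "labels": [True, False], "group": None},
    {"ident": "prot_2", "features": [12, 11], "labels": [False, False], "group": None},
    {"ident": "prot_3", "features": [20, 27], "labels": [False, True], "group": None},
    {"ident": "prot_4", "features": [15, 23], "labels": [True, True], "group": None},
]

# Map every stored ID to the split it was assigned to in the earlier run.
split_of = {row["id"]: row["split"] for row in csv.DictReader(splits_csv)}


def records_for(split_name):
    """Keep only the encoded records whose ID belongs to `split_name`."""
    return [rec for rec in encoded if split_of.get(rec["ident"]) == split_name]


train = records_for("train")
validation = records_for("validation")
test = records_for("test")
print(len(train), len(validation), len(test))  # 2 1 1
```

Because the file stores only IDs, the same splits can be reproduced as long as the encoded data still contains those IDs.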

**Note**: If `go_branch` is specified, the `dataset_name` will include the branch name in the format `${dataset_name}_${go_branch}`. Otherwise, it will just be `${dataset_name}`.
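
The layout above, including the naming rule from the note, can be summarized in a small helper. `expected_paths` is a hypothetical function written for this notebook, not part of `chebai`. Here `go_branch=None` stands for "no branch specified"; how the default value `"all"` maps onto the directory name is not modeled.

```python
from pathlib import PurePosixPath


def expected_paths(dataset_name, reader_name, go_branch=None, base="data/GO_UniProt"):
    """Build the documented file locations for one dataset variant."""
    # Per the note above, the branch name is appended when one is specified.
    full_name = f"{dataset_name}_{go_branch}" if go_branch else dataset_name
    processed = PurePosixPath(base) / full_name / "processed"
    return {
        "raw_dir": PurePosixPath(base) / "raw",
        "data.pkl": processed / "data.pkl",
        "data.pt": processed / reader_name / "data.pt",
        "classes.txt": processed / "classes.txt",
        "splits.csv": processed / "splits.csv",
    }


# Values taken from the paths used later in this notebook.
paths = expected_paths("GO250", "protein_token", go_branch="BP")
for name, path in paths.items():
    print(f"{name:12} {path}")
```

For these arguments, the `data.pkl` and `data.pt` entries match the paths loaded in the cells further down.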


# 2. Preparation / Setup Methods

Once a GO_UniProt data class instance is created, it typically requires preparation before use. This step downloads or loads the relevant data files and sets up the internal data structures.

### Automatic Execution

These methods are normally executed automatically by the data class instance, so users do not need to call them explicitly. The code manages the preparation and setup of data internally, ensuring that it is ready for training and validation.


### Why is Preparation Needed?

- **Data Availability**: The preparation step ensures that the required GO and UniProt data files, which are essential for analysis, are downloaded or loaded.
- **Data Integrity**: It ensures that the data files are transformed into a compatible format required for model input.

### Main Methods for Data Preprocessing

The data preprocessing in a data class involves two main methods:

1. **`prepare_data` Method**:
   - **Purpose**: This method checks for the presence of raw data in the specified directory. If the raw data is missing, it fetches the ontology, creates a dataframe, and saves it to a file (`data.pkl`). The dataframe includes columns such as IDs, data representations, and labels.
   - **Documentation**: [PyTorch Lightning - `prepare_data`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#prepare-data)

2. **`setup` Method**:
   - **Purpose**: This method sets up the data module for training, validation, and testing. It checks for the processed data and, if necessary, performs additional setup to ensure the data is ready for model input. It also handles cross-validation settings if enabled.
   - **Description**: Transforms `data.pkl` into the model input format (`data.pt`). The transformed data contains the following keys: `ident`, `features`, `labels`, and `group`. A data reader subclass performs the transformation.
   - **Documentation**: [PyTorch Lightning - `setup`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#setup)

These methods ensure that the data is correctly prepared and set up for subsequent use in training and validation processes.
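
The two-step flow can be sketched without `chebai` or PyTorch. The functions `prepare_data_sketch` and `setup_sketch` below are hypothetical stand-ins: they use a single existence check on the output file, store plain Python rows with `pickle`, and return the encoded records instead of writing `data.pt`. The `encode` callable is a placeholder for whatever a data reader would do.

```python
import pickle
import tempfile
from pathlib import Path


def prepare_data_sketch(processed_dir, build_rows):
    """Create data.pkl only if it does not exist yet, then return its path."""
    pkl_path = Path(processed_dir) / "data.pkl"
    if not pkl_path.exists():
        pkl_path.parent.mkdir(parents=True, exist_ok=True)
        with pkl_path.open("wb") as fh:
            pickle.dump(build_rows(), fh)
    return pkl_path


def setup_sketch(pkl_path, encode):
    """Turn the rows stored in data.pkl into model-ready dictionaries."""
    with Path(pkl_path).open("rb") as fh:
        rows = pickle.load(fh)
    return [
        {
            "ident": row["id"],
            "features": encode(row["sequence"]),
            "labels": row["labels"],
            "group": None,
        }
        for row in rows
    ]


calls = []


def build_rows():
    """Toy stand-in for downloading and parsing the raw files."""
    calls.append(1)
    return [{"id": "prot_1", "sequence": "MKV", "labels": [True, False]}]


with tempfile.TemporaryDirectory() as tmp:
    first = prepare_data_sketch(tmp, build_rows)
    second = prepare_data_sketch(tmp, build_rows)  # file exists, nothing is rebuilt
    # Placeholder encoding: character codes, not a real protein tokenizer.
    records = setup_sketch(second, encode=lambda seq: [ord(ch) for ch in seq])

print(len(calls), sorted(records[0]))
```

The second call to `prepare_data_sketch` finds the existing file and skips the expensive step, which is the behavior that makes repeated runs cheap.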

In [None]:
go_class.prepare_data()
go_class.setup()

## data.pkl

In [3]:
import pandas as pd

In [7]:
pkl_df = pd.DataFrame(pd.read_pickle(r"data/GO_UniProt/GO250_BP/processed/data.pkl"))
print("Size of the data (rows x columns): ", pkl_df.shape)
pkl_df.head()

Size of the data (rows x columns):  (27459, 1050)


| | swiss_id | accession | go_ids | sequence | 41 | 75 | 122 | 165 | 209 | 226 | ... | 2000145 | 2000146 | 2000147 | 2000241 | 2000243 | 2000377 | 2001020 | 2001141 | 2001233 | 2001234 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 14331_ARATH | P42643,Q945M2,Q9M0S7 | [19222] | MATPGASSARDEFVYMAKLAEQAERYEEMVEFMEKVAKAVDKDELT... | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 9 | 14331_CAEEL | P41932,Q21537 | [132, 1708, 5634, 5737, 5938, 6611, 7346, 8340... | MSDTVEELVQRAKLAEQAERYDDMAAAMKKVTEQGQELSNEERNLL... | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 10 | 14331_MAIZE | P49106 | [3677, 5634, 10468, 44877] | MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVE... | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 13 | 14332_MAIZE | Q01526 | [3677, 5634, 10468, 44877] | MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVE... | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 14 | 14333_ARATH | P42644,F4KBI7,Q945L2 | [5634, 5737, 6995, 9409, 9631, 16036, 19222, 5... | MSTREENVYMAKLAEQAERYEEMVEFMEKVAKTVDVEELSVEERNL... | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


## data.pt

In [8]:
import torch

In [11]:
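# Load the encoded dataset; the file holds a list with one dict per protein.
# Recent PyTorch releases default to weights_only=True, which may reject
# non-tensor objects; for trusted files, pass weights_only=False if loading fails.
data_pt = torch.load(r"data/GO_UniProt/GO250_BP/processed/protein_token/data.pt")
print("Type of loaded data:", type(data_pt))
# Show the first record only.
print(data_pt[0])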

Type of loaded data: <class 'list'>
{'features': [10, 14, 15, 23, 13, 14, 11, 11, 14, 16, 20, 27, 25, 28, 22, 10, 14, 21, 17, 14, 27, 18, 14, 27, 16, 22, 27, 27, 10, 28, 27, 25, 10, 27, 21, 28, 14, 21, 14, 28, 20, 21, 20, 27, 17, 15, 28, 27, 27, 16, 19, 17, 17, 11, 28, 14, 22, 21, 19, 28, 12, 13, 14, 16, 16, 14, 11, 26, 16, 12, 12, 11, 11, 12, 27, 18, 21, 27, 27, 11, 16, 13, 19, 20, 20, 29, 28, 11, 17, 12, 16, 20, 22, 16, 11, 21, 12, 27, 15, 27, 17, 11, 20, 12, 24, 20, 13, 12, 17, 21, 17, 17, 20, 15, 12, 17, 28, 23, 14, 14, 14, 11, 13, 20, 11, 21, 28, 25, 22, 17, 21, 10, 21, 13, 20, 22, 29, 16, 22, 17, 14, 27, 25, 21, 11, 13, 18, 27, 16, 21, 20, 14, 14, 27, 29, 15, 17, 15, 14, 22, 21, 14, 14, 18, 20, 12, 14, 19, 11, 27, 17, 14, 23, 15, 29, 23, 12, 16, 17, 13, 17, 14, 17, 19, 25, 11, 28, 25, 22, 22, 27, 12, 17, 19, 11, 23, 20, 16, 14, 24, 19, 17, 14, 21, 18, 14, 25, 20, 27, 14, 12, 14, 27, 17, 20, 15, 17, 13, 27, 27, 11, 22, 21, 20, 11, 15, 17, 12, 10, 18, 17, 17, 16, 20, 19, 17, 15, 17

## Protein Representation Using Amino Acid Sequence Notation

Proteins are composed of chains of amino acids, and these sequences can be represented using a one-letter notation for each amino acid. This notation provides a concise way to describe the primary structure of a protein.

### Example Protein Sequence

Protein: **Lysozyme C** from **Gallus gallus** (Chicken).  
[Lysozyme C - UniProtKB P00698](https://www.uniprot.org/uniprotkb/P00698/entry#function)

- **Sequence**: `MRSLLILVLCFLPLAALGKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL`
- **Sequence Length**: 147

In this sequence, each letter corresponds to a specific amino acid. This notation is widely used in bioinformatics and molecular biology to represent protein sequences.
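
Because the notation is just a string, basic properties of the example can be checked with a few lines of standard-library Python. The variable names are arbitrary; the sequence is the one listed above.

```python
from collections import Counter

# Lysozyme C (P00698) sequence exactly as listed above.
lysozyme_c = (
    "MRSLLILVLCFLPLAALGKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYG"
    "ILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL"
)

# One character per residue, so the string length is the sequence length.
print("Length:", len(lysozyme_c))  # Length: 147

# Count how often each one-letter code occurs.
composition = Counter(lysozyme_c)
print("Distinct residues:", len(composition))
print("Most common:", composition.most_common(3))
```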

### The 20 Amino Acids and Their One-Letter Notations

Here is a list of the 20 standard amino acids, along with their one-letter notations and descriptions:

| One-Letter Notation | Amino Acid Name      | Description                                             |
|---------------------|----------------------|---------------------------------------------------------|
| **A**               | Alanine              | Non-polar, aliphatic amino acid.                        |
| **C**               | Cysteine             | Polar, contains a thiol group, forms disulfide bonds.   |
| **D**               | Aspartic Acid        | Acidic, negatively charged at physiological pH.         |
| **E**               | Glutamic Acid        | Acidic, negatively charged at physiological pH.         |
| **F**               | Phenylalanine        | Aromatic, non-polar.                                    |
| **G**               | Glycine              | Smallest amino acid, non-polar.                         |
| **H**               | Histidine            | Polar, basic; only partially protonated at physiological pH, often found in enzyme active sites. |
| **I**               | Isoleucine           | Non-polar, aliphatic.                                   |
| **K**               | Lysine               | Basic, positively charged at physiological pH.          |
| **L**               | Leucine              | Non-polar, aliphatic.                                   |
| **M**               | Methionine           | Non-polar, contains sulfur; encoded by the start codon (AUG). |
| **N**               | Asparagine           | Polar, uncharged.                                       |
| **P**               | Proline              | Non-polar, introduces kinks in protein chains.          |
| **Q**               | Glutamine            | Polar, uncharged.                                       |
| **R**               | Arginine             | Basic, positively charged, involved in binding phosphate groups. |
| **S**               | Serine               | Polar, can be phosphorylated.                           |
| **T**               | Threonine            | Polar, can be phosphorylated.                           |
| **V**               | Valine               | Non-polar, aliphatic.                                   |
| **W**               | Tryptophan           | Aromatic, non-polar, largest amino acid.                |
| **Y**               | Tyrosine             | Aromatic, polar, can be phosphorylated.                 |
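
The table doubles as a simple validity check: a sequence written in the standard notation should contain only these 20 letters. The helper below is a hypothetical sketch for this notebook. Sequence databases also use a few additional codes (for example `X` for an unknown residue), which this check reports as non-standard rather than treating as errors.

```python
# The 20 one-letter codes from the table above.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def non_standard_letters(sequence):
    """Return the sorted characters that are not standard one-letter codes."""
    return sorted(set(sequence.upper()) - STANDARD_AMINO_ACIDS)


print(len(STANDARD_AMINO_ACIDS))          # 20
print(non_standard_letters("MKVLAAGIC"))  # []
print(non_standard_letters("MKXVLBU"))    # ['B', 'U', 'X']
```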

### Understanding Protein Sequences

In the example sequence, each letter represents one of the above amino acids. The sequence reflects the specific order of amino acids in the protein, which is critical for its structure and function.

This notation is used extensively in various bioinformatics tools and databases to study protein structure, function, and interactions.
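
Models need numbers rather than letters, which is why the `features` entry in `data.pt` is a list of integers. The sketch below shows one way such a mapping could look. The alphabetical vocabulary, the `OFFSET` of 10, and the function names are assumptions chosen for illustration; the actual token assignment used by the `chebai` reader is not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: low indices are reserved (e.g. for padding or special tokens),
# so residue indices start at OFFSET.
OFFSET = 10

TOKEN_INDEX = {aa: i + OFFSET for i, aa in enumerate(AMINO_ACIDS)}
INDEX_TOKEN = {i: aa for aa, i in TOKEN_INDEX.items()}


def encode(sequence):
    """Map each one-letter code to its integer index."""
    return [TOKEN_INDEX[aa] for aa in sequence]


def decode(tokens):
    """Map integer indices back to the one-letter sequence."""
    return "".join(INDEX_TOKEN[i] for i in tokens)


tokens = encode("MKVFGR")
print(tokens)          # [20, 18, 27, 14, 15, 24]
print(decode(tokens))  # MKVFGR
```

The mapping is reversible, so an encoded record can always be translated back into the sequence it came from.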


_Note_: For more on amino acid sequences, see https://en.wikipedia.org/wiki/Protein_primary_structure