# Introduction

This notebook serves as a guide for new developers using the `chebai` package. If you just want to run the experiments, you can refer to the [README.md](https://github.com/ChEB-AI/python-chebai/blob/dev/README.md) and the [wiki](https://github.com/ChEB-AI/python-chebai/wiki) for the basic commands. This notebook explains what happens under the hood for the SCOPe dataset. It covers
- how to instantiate a data class and generate data
- how the data is processed and stored
- and how to work with different protein encodings.

The `chebai` package simplifies the handling of these datasets by **automatically downloading and processing** them as needed. This means that you do not have to input any data manually; the package will generate and organize the data files based on the parameters and encodings selected. You can however provide your own data files, for instance if you want to replicate a specific experiment.

---


### Overview of SCOPe Data and its Usage in Protein-Related Tasks

#### **What is SCOPe?**

The **Structural Classification of Proteins — extended (SCOPe)** is a comprehensive database that extends the original SCOP (Structural Classification of Proteins) database. SCOPe offers a detailed classification of protein domains based on their structural and evolutionary relationships.

The SCOPe database, like SCOP, organizes proteins into a hierarchy of domains based on structural similarities, which is crucial for understanding evolutionary patterns and functional aspects of proteins. This hierarchical structure is comparable to taxonomy in biology, where species are classified based on shared characteristics.

#### **SCOPe Hierarchy:**
By analogy with taxonomy, SCOP was created as a hierarchy of several levels where the <u>fundamental unit of classification is a **domain** </u> in the experimentally determined protein structure. Starting at the bottom, the hierarchy of SCOP domains comprises the following levels:

1. **Species**: Representing distinct protein sequences and their naturally occurring or artificially created variants.
2. **Protein**: Groups together similar sequences with essentially the same functions. These can originate from different biological species or represent isoforms within the same species.
3. **Family**: Contains proteins with similar sequences but typically distinct functions.
4. **Superfamily**: Bridges protein families with common functional and structural features, often inferred from a shared evolutionary ancestor.
5. **Fold**: Groups structurally similar superfamilies. 
6. **Class**: Based on secondary structure content and organization. This level classifies proteins based on their secondary structure properties, such as alpha-helices and beta-sheets.



For more details, you can refer to the [SCOPe documentation](https://scop.berkeley.edu/help/ver=2.08).

---

#### **Why are We Using SCOPe?**

We are integrating the SCOPe data into our pipeline as part of an ontology pretraining task for protein-related models. SCOPe is a great fit for our goal because it is primarily **structure-based**, unlike other protein-related databases like Gene Ontology (GO), which focuses more on functional classes.

Our primary objective is to reproduce **ontology pretraining** on a protein-related task, and SCOPe provides the structural ontology that we need for this. The steps in our pipeline are aligned as follows:

| **Stage**                | **Chemistry Task**                  | **Proteins Task**                              |
|--------------------------|-------------------------------------|------------------------------------------------|
| **Unsupervised Pretraining** | Mask pretraining (ELECTRA)         | Mask pretraining (ESM2, optional)              |
| **Ontology Pretraining** | ChEBI                               | SCOPe                                          |
| **Finetuning Task**     | Toxicity, Solubility, etc.          | GO (MF, BP, CC branches)                      |

                                                                                                                                                        
This integration will allow us to use **SCOPe** for tasks such as **protein classification** and will contribute to the success of **pretraining models** for protein structures. The data will be processed with the same approach as the GO data, with **different labels** corresponding to the SCOPe classification system.

---

#### **Why SCOPe is Suitable for Our Task**

1. **Structure-Based Classification**: SCOPe is primarily concerned with the structural characteristics of proteins, making it ideal for protein structure pretraining tasks. This contrasts with other ontology databases like **GO**, which categorize proteins based on more complex functional relationships.
   
2. **Manageable Size**: SCOPe contains around **140,000 entries**, making it a manageable dataset for training models. This is similar in size to **ChEBI**, which is used in the chemical domain, and ensures we can work with it effectively for pretraining.


### Protein Data Bank (PDB)

The **Protein Data Bank (PDB)** is a global repository that stores 3D structural data of biological macromolecules like proteins and nucleic acids. It contains information obtained through experimental methods such as **X-ray crystallography**, **NMR spectroscopy**, and **cryo-EM**. The data includes atomic coordinates, secondary structure details, and experimental conditions.

The PDB is an essential resource for **structural biology**, **bioinformatics**, and **drug discovery**, enabling scientists to understand protein functions, interactions, and mechanisms at the molecular level.

For more details, visit the [RCSB PDB website](https://www.rcsb.org/).


### Understanding [SCOPe](https://scop.berkeley.edu/) and [PDB](https://www.rcsb.org/)  


1. **Protein domains form chains.**  
2. **Chains form complexes** (protein complexes or structures).  
3. These **complexes are the entries in PDB**, represented by unique identifiers like `"1A3N"`.  

---

#### **Protein Domain**  
A **protein domain** is a **structural and functional unit** of a protein.  


##### Key Characteristics:
- **Domains are part of a protein chain.**  
- A domain can span:  
  1. **The entire chain** (single-domain protein):  
     - In this case, the protein domain is equivalent to the chain itself.  
     - Example:  
       - All chains of the **PDB structure "1A3N"** are single-domain proteins.  
       - Each chain has a SCOPe domain identifier. 
       - For example, Chain **A**:  
         - Domain identifier: `d1a3na_`  
         - Breakdown of the identifier:  
           - `d`: Denotes domain.  
           - `1a3n`: Refers to the PDB protein structure identifier.  
           - `a`: Specifies the chain within the structure.  (`_` for None and `.` for multiple chains)
           - `_`: Indicates the domain spans the entire chain (single-domain protein).  
         - Example: [PDB Structure 1A3N - Chain A](https://www.rcsb.org/sequence/1A3N#A)
  2. **A specific portion of the chain** (multi-domain protein):  
     - Here, a single chain contains multiple domains.  
     - Example: Chain **A** of the **PDB structure "1PKN"** contains three domains: `d1pkna1`, `d1pkna2`, `d1pkna3`.  
     - Example: [PDB Structure 1PKN - Chain A](https://www.rcsb.org/annotations/1PKN).  
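
The identifier breakdown above can be turned into a small parser. This is a minimal sketch that assumes the seven-character layout described here (`d`, four-character PDB ID, chain character, domain character). The `ScopeSid` type and `parse_sid` function are hypothetical helpers for illustration and are not part of `chebai`.

```python
from typing import NamedTuple


class ScopeSid(NamedTuple):
    pdb_id: str  # four-character PDB structure identifier, e.g. "1a3n"
    chain: str   # chain character; "_" for none, "." for multiple chains
    domain: str  # "_" if the domain spans the whole chain, otherwise a domain marker


def parse_sid(sid: str) -> ScopeSid:
    """Split a seven-character SCOPe domain identifier into its parts."""
    # Assumption: only the layout described in the text is handled here.
    if len(sid) != 7 or not sid.startswith("d"):
        raise ValueError(f"unexpected sid format: {sid!r}")
    return ScopeSid(pdb_id=sid[1:5], chain=sid[5], domain=sid[6])


print(parse_sid("d1a3na_"))  # single-domain chain A of 1A3N
print(parse_sid("d1pkna2"))  # second domain on chain A of 1PKN
```

Identifiers that do not follow this layout are rejected rather than guessed at, which keeps the sketch honest about what it covers.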

---

#### **Protein Chain**  
A **protein chain** refers to the entire **polypeptide chain** observed in a protein's 3D structure (as described in PDB files).  

##### Key Points:
- A chain can consist of **one or multiple domains**:
  - **Single-domain chain**: The chain and domain are identical.  
    - Example: Myoglobin.  
  - **Multi-domain chain**: Contains several domains, each with distinct structural and functional roles.  
- Chains assemble to form **protein complexes** or **structures**.  


---

#### **Key Observations About SCOPe**  
- The **fundamental classification unit** in SCOPe is the **protein domain**, not the entire protein.  
- _**The taxonomy in SCOPe is not for the entire protein (i.e., the full-length amino acid sequence as encoded by a gene) but for protein domains, which are smaller, structurally and functionally distinct regions of the protein.**_


--- 

**SCOPe 2.08 Data Analysis:**

An analysis of the relevant data in the current SCOPe version (2.08) gives the following counts:

- **Classes**: 12
- **Folds**: 1485
- **Superfamilies**: 2368
- **Families**: 5431
- **Proteins**: 13,514
- **Species**: 30,294
- **Domains**: 344,851

For more detailed statistics, please refer to the official SCOPe website:

- [SCOPe 2.08 Statistics](https://scop.berkeley.edu/statistics/ver=2.08)
- [SCOPe 2.08 Release](https://scop.berkeley.edu/ver=2.08)

---

## SCOPe Labeling 

- Use SCOPe labels for protein domains.
- Map them back to their **protein-chain** sequences (protein sequence label = sum of all domain labels).
- Train on protein sequences.
- This pretraining task would be comparable to GO-based training.
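
The mapping from domain labels to chain labels can be sketched as follows. This is an illustration only: it reads "sum of all domain labels" as a set union, and the label names and assignments in `DOMAIN_LABELS` are hypothetical, not taken from SCOPe.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical label assignments for the three domains on chain A of 1PKN.
DOMAIN_LABELS: Dict[str, Set[str]] = {
    "d1pkna1": {"class_x", "fold_x1"},
    "d1pkna2": {"class_y", "fold_y1"},
    "d1pkna3": {"class_x", "fold_x2"},
}


def chain_labels(sids: Iterable[str], domain_labels: Dict[str, Set[str]]) -> Set[str]:
    """Union of the labels of all domains located on one chain."""
    labels: Set[str] = set()
    for sid in sids:
        labels |= domain_labels.get(sid, set())
    return labels


def to_bool_row(labels: Set[str], all_labels: List[str]) -> List[bool]:
    """Turn a label set into one boolean value per known label."""
    return [name in labels for name in all_labels]


all_labels = sorted(set().union(*DOMAIN_LABELS.values()))
row = to_bool_row(chain_labels(["d1pkna1", "d1pkna2", "d1pkna3"], DOMAIN_LABELS), all_labels)
print(dict(zip(all_labels, row)))
```

The boolean row mirrors the multi-label layout used for training: one column per label, `True` wherever any domain of the chain carries that label.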

--- 

In [1]:
# To run this notebook, you need to change the working directory of the jupyter notebook to root dir of the project.
import os

# Root directory name of the project
expected_root_dir = "python-chebai"

# Check if the current directory ends with the expected root directory name
if not os.getcwd().endswith(expected_root_dir):
    os.chdir("..")  # Move up one directory level
    if os.getcwd().endswith(expected_root_dir):
        print("Changed to project root directory:", os.getcwd())
    else:
        print("Warning: Directory change unsuccessful. Current directory:", os.getcwd())
else:
    print("Already in the project root directory:", os.getcwd())

Changed to project root directory: G:\github-aditya0by0\python-chebai


# 1. Instantiation of a Data Class

To start working with `chebai`, you first need to instantiate a SCOPe data class. This class is responsible for managing, interacting with, and preprocessing the SCOPe protein data.

In [6]:
from chebai.preprocessing.datasets.scope.scope import SCOPeOver50

In [7]:
scope_class = SCOPeOver50(scope_version="2.08")


### Inheritance Hierarchy

SCOPe data classes inherit from [`_DynamicDataset`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L598), which in turn inherits from [`XYBaseDataModule`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L23). Specifically:

- **`_DynamicDataset`**: This class serves as an intermediate base class that provides additional functionality or customization for datasets that require dynamic behavior. It inherits from `XYBaseDataModule`, which provides the core methods for data loading and processing.

- **`XYBaseDataModule`**: This is the base class for data modules, providing foundational properties and methods for handling and processing datasets, including data splitting, loading, and preprocessing.

In summary, SCOPe data classes are designed to manage and preprocess protein data effectively by leveraging the capabilities provided by `XYBaseDataModule` through the `_DynamicDataset` intermediary.


### Input parameters
A SCOPe data class can be configured with a range of parameters, including:

- **scope_version (str)**: Specifies the version of the SCOPe database to be used. Specifying a version ensures the reproducibility of your experiments by using a consistent dataset.

- **scope_version_train (str, optional)**: The version of SCOPe to use specifically for training and validation. If not set, the `scope_version` specified will be used for all data splits, including training, validation, and test. Defaults to `None`.

- **splits_file_path (str, optional)**: Path to a CSV file containing data splits. If not provided, the class will handle splits internally. Defaults to `None`.

### Additional Input Parameters

To get more control over various aspects of data loading, processing, and splitting, you can refer to documentation of additional parameters in docstrings of the respective classes: [`_SCOPeDataExtractor`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/scope/scope.py#L31), [`XYBaseDataModule`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L22), [`_DynamicDataset`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L597), etc.


# Available SCOPe Data Classes

__Note__: Check the code implementation of classes [here](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/scope/scope.py):

There is a range of available dataset classes for SCOPe. Usually, you want to use `SCOPeOver2000` or `SCOPeOver50`. The number indicates the threshold for selecting label classes: SCOPe classes which have at least 2000 / 50 subclasses will be used as labels.

Both inherit from `SCOPeOverX`. If you need a different threshold, you can create your own subclass. By default, `SCOPeOverX` uses the Protein encoding (see Section 5).

Finally, `SCOPeOver2000Partial` extracts a part of SCOPe based on a given top class, with a threshold of 2000 for selecting labels.
This class inherits from `SCOPEOverXPartial`.
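
To make the threshold idea concrete, here is a toy sketch of threshold-based label selection. It assumes, following the description above, that a class qualifies when it has at least `threshold` subclasses anywhere below it in the hierarchy. The hierarchy, the node names, and both helper functions are hypothetical, and the actual selection logic in `chebai` may differ.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: each node maps to its parent (None for the root).
PARENT_OF: Dict[str, Optional[str]] = {
    "class_a": None,
    "fold_a1": "class_a",
    "fold_a2": "class_a",
    "superfamily_a1x": "fold_a1",
    "superfamily_a1y": "fold_a1",
}


def count_descendants(parent_of: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Count, for every node, how many nodes lie below it (transitively)."""
    counts: Dict[str, int] = defaultdict(int)
    for node in parent_of:
        ancestor = parent_of[node]
        while ancestor is not None:
            counts[ancestor] += 1
            ancestor = parent_of.get(ancestor)
    return dict(counts)


def select_labels(parent_of: Dict[str, Optional[str]], threshold: int) -> List[str]:
    """Keep only nodes with at least `threshold` descendants."""
    counts = count_descendants(parent_of)
    return sorted(node for node, count in counts.items() if count >= threshold)


print(select_labels(PARENT_OF, threshold=2))  # ['class_a', 'fold_a1']
print(select_labels(PARENT_OF, threshold=3))  # ['class_a']
```

Raising the threshold shrinks the label set to the broader classes, which is the trade-off between `SCOPeOver50` and `SCOPeOver2000`.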


---

# 2. Preparation / Setup Methods

Now we have a SCOPe data class with all the relevant parameters. Next, we need to generate the actual dataset.

In [8]:
scope_class.prepare_data()
scope_class.setup()

Checking for processed data in data\SCOPe\version_2.08\SCOPe50\processed
Missing processed data file (`data.pkl` file)
Missing PDB raw data, Downloading PDB sequence data....
Downloading to temporary file C:\Users\HP\AppData\Local\Temp\tmpsif7r129
Downloaded to C:\Users\HP\AppData\Local\Temp\tmpsif7r129
Unzipping the file....
Unpacked and saved to data\SCOPe\pdb_sequences.txt
Removed temporary file C:\Users\HP\AppData\Local\Temp\tmpsif7r129
Missing Scope: cla.txt raw data, Downloading...




Missing Scope: hie.txt raw data, Downloading...
Missing Scope: des.txt raw data, Downloading...
Extracting class hierarchy...
Computing transitive closure
Process graph
101 labels has been selected for specified threshold, 
Constructing data.pkl file .....


Check for processed data in data\SCOPe\version_2.08\SCOPe50\processed\protein_token
Cross-validation enabled: False


Missing transformed data (`data.pt` file). Transforming data.... 
Processing 60298 lines...


100%|█████████████████████████████████████████████████████████████████████████| 60298/60298 [00:53<00:00, 1119.10it/s]


Saving 21 tokens to G:\github-aditya0by0\python-chebai\chebai\preprocessing\bin\protein_token\tokens.txt...
First 10 tokens: ['M', 'S', 'I', 'G', 'A', 'T', 'R', 'L', 'Q', 'N']



### Automatic Execution: 
These methods are executed automatically when using the training command `chebai fit`. Users do not need to call them explicitly, as the code internally manages the preparation and setup of data, ensuring that it is ready for subsequent use in training and validation processes.

### Why is Preparation Needed?

- **Data Availability**: The preparation step ensures that the required SCOPe data files are downloaded or loaded, which are essential for analysis.
- **Data Integrity**: It ensures that the data files are transformed into a compatible format required for model input.

### Main Methods for Data Preprocessing

The data preprocessing in a data class involves two main methods:

1. **`prepare_data` Method**:
   - **Purpose**: This method checks for the presence of raw data in the specified directory. If the raw data is missing, it fetches the ontology, creates a dataframe, and saves it to a file (`data.pkl`). The dataframe includes columns such as IDs, data representations, and labels. This step is independent of input encodings.
   - **Documentation**: [PyTorch Lightning - `prepare_data`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#prepare-data)

2. **`setup` Method**:
   - **Purpose**: This method sets up the data module for training, validation, and testing. It checks for the processed data and, if necessary, performs additional setup to ensure the data is ready for model input. It also handles cross-validation settings if enabled.
   - **Description**: Transforms `data.pkl` into a model input data format (`data.pt`), tokenizing the input according to the specified encoding. The transformed data contains the following keys: `ident`, `features`, `labels`, and `group`. This method uses a subclass of Data Reader to perform the tokenization.
   - **Documentation**: [PyTorch Lightning - `setup`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#setup)

These methods ensure that the data is correctly prepared and set up for subsequent use in training and validation processes.

---

# 3. Overview of the 3 preprocessing stages

The `chebai` library follows a three-stage preprocessing pipeline, which is reflected in its file structure:

1. **Raw Data Stage**:
   - **Files**: `cla.txt`, `des.txt` and `hie.txt`. Please find description of each file [here](https://scop.berkeley.edu/help/ver=2.08#parseablefiles-2.08).
   - **Description**: This stage contains the raw SCOPe data in txt format, serving as the initial input for further processing.
   - **File Path**: `data/SCOPe/version_${scope_version}/raw/${filename}.txt`

2. **Processed Data Stage 1**:
   - **File**: `data.pkl`
   - **Description**: This stage includes the data after initial processing. It contains protein sequence strings, class columns, and metadata but lacks data splits.
   - **File Path**: `data/SCOPe/version_${scope_version}/${dataset_name}/processed/data.pkl`
   - **Additional File**: `classes.txt` - A file listing the relevant SCOPe classes.

3. **Processed Data Stage 2**:
   - **File**: `data.pt`
   - **Description**: This final stage includes the encoded data in a format compatible with PyTorch, ready for model input. This stage also references data splits when available.
   - **File Path**: `data/SCOPe/version_${scope_version}/${dataset_name}/processed/${reader_name}/data.pt`
   - **Additional File**: `splits.csv` - Contains saved splits for reproducibility.

This structured approach to data management ensures that each stage of data processing is well-organized and documented, from raw data acquisition to the preparation of model-ready inputs. It also facilitates reproducibility and traceability across different experiments.
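
The path templates above can be assembled programmatically. The helper below is a hypothetical sketch that only mirrors the templates listed in this section; the example values `SCOPe50` and `protein_token` are taken from the log output shown earlier. In the notebook itself, the `processed_dir_main` and `processed_dir` attributes of the data class are what is used to locate the processed files.

```python
from pathlib import PurePosixPath
from typing import Dict


def stage_paths(scope_version: str, dataset_name: str, reader_name: str,
                base: str = "data") -> Dict[str, object]:
    """Build the file locations of the three preprocessing stages."""
    version_dir = PurePosixPath(base) / "SCOPe" / f"version_{scope_version}"
    processed = version_dir / dataset_name / "processed"
    return {
        "raw": [version_dir / "raw" / name for name in ("cla.txt", "des.txt", "hie.txt")],
        "data.pkl": processed / "data.pkl",
        "data.pt": processed / reader_name / "data.pt",
    }


paths = stage_paths("2.08", "SCOPe50", "protein_token")
for stage, location in paths.items():
    print(stage, "->", location)
```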

### Data Splits

- **Creation**: Data splits are generated dynamically "on the fly" during training and evaluation to ensure flexibility and adaptability to different tasks.
- **Reproducibility**: To maintain consistency across different runs, splits can be reproduced by comparing hashes with a fixed seed value.


# 4. Data Files and their structure

`chebai` creates and manages several data files during its operation. These files store the protein data and metadata essential for different tasks. Let's explore these files and their content.


## <u>raw files</u>
- cla.txt, des.txt and hie.txt

For a detailed description of the raw files and their structure, please refer to the official website [here](https://scop.berkeley.edu/help/ver=2.08#parseablefiles-2.08).


## <u>data.pkl</u> File

**Description**: Generated by the `prepare_data` method, this file contains the processed data as a dataframe. It includes an ID, the `sids` (SCOPe domain identifiers) used to label the corresponding sequence, the protein-chain sequence itself, and one boolean column per label.

In [3]:
import pandas as pd
import os

In [8]:
pkl_df = pd.DataFrame(
    pd.read_pickle(
        os.path.join(
            scope_class.processed_dir_main,
            scope_class.processed_main_file_names_dict["data"],
        )
    )
)
print("Size of the data (rows x columns): ", pkl_df.shape)
pkl_df.head()

Size of the data (rows x columns):  (60424, 1035)


| | id | sids | sequence | class_46456 | class_48724 | class_51349 | class_53931 | class_56572 | class_56835 | class_56992 | ... | species_187294 | species_56257 | species_186882 | species_56690 | species_161316 | species_57962 | species_58067 | species_267696 | species_311502 | species_311501 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | [d4oq9a_, d4oq9b_, d4oq9c_, d4oq9d_, d4niaa_, ... | AAAAAAAAAA | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 2 | [d7dxhc_] | AAAAAAAAAAAAAAAAAAAAAAA | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 3 | [d1gkub1, d1gkub2, d1gkub3, d1gkub4] | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAASLCLFPEDFLLKEF... | False | False | True | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 3 | 4 | [d3c9wa2, d3c9wb2, d3c9wa3, d3c9wb3] | AAAAAAGPEMVRGQVFDVGPRYTNLSYIGEGAYGMVCSAYDNLNKV... | False | False | False | True | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 4 | 5 | [d1xwaa1, d1xwab_, d1xwac_, d1xwad_, d1xwaa2] | AAAAAMVYQVKDKADLDGQLTKASGKLVVLDFFATWCGPCKMISPK... | False | False | True | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |


**File Path**: `data/SCOPe/version_${scope_version}/${dataset_name}/processed/data.pkl`


### Structure of `data.pkl`
`data.pkl` has the following structure:
- **Column 0**: Contains the ID of each data instance.
- **Column 1**: Contains the `sids` associated with the corresponding protein-chain sequence.
- **Column 2**: Contains the protein-chain sequence.
- **Column 3 and onwards**: Contain the labels, one boolean column per label.

This structure keeps the data organized and ready for the next step, encoding the sequences for model input.


## <u>data.pt</u> File


**Description**: Generated by the `setup` method, this file contains encoded data in a format compatible with the PyTorch library, specifically as a list of dictionaries. Each dictionary in this list includes keys such as `ident`, `features`, `labels`, and `group`, ready for model input.

In [9]:
import torch

In [10]:
data_pt = torch.load(
    os.path.join(
        scope_class.processed_dir, scope_class.processed_file_names_dict["data"]
    ),
    weights_only=False,
)
print("Type of loaded data:", type(data_pt))

Type of loaded data: <class 'list'>


In [12]:
for i in range(5, 6):
    print(data_pt[i])

{'features': [14, 14, 14, 14, 20, 15, 15, 28, 15, 18, 25, 17, 18, 11, 25, 21, 27, 19, 14, 27, 19, 13, 14, 17, 16, 21, 25, 22, 27, 28, 12, 10, 20, 19, 13, 13, 14, 28, 17, 20, 20, 12, 19, 11, 17, 15, 27, 28, 15, 12, 17, 14, 23, 11, 19, 27, 14, 26, 19, 11, 11, 19, 12, 19, 19, 28, 17, 16, 20, 16, 19, 21, 10, 16, 18, 12, 17, 19, 10, 29, 12, 12, 21, 20, 16, 17, 19, 28, 20, 21, 12, 16, 18, 21, 19, 14, 19, 17, 12, 14, 18, 28, 23, 15, 28, 19, 19, 19, 15, 25, 17, 22, 25, 19, 28, 16, 13, 27, 13, 11, 20, 15, 28, 12, 15, 28, 27, 13, 13, 13, 28, 19, 14, 15, 28, 12, 18, 14, 20, 28, 14, 18, 15, 19, 13, 22, 28, 29, 12, 12, 20, 29, 28, 17, 13, 28, 23, 22, 15, 15, 28, 17, 13, 21, 17, 27, 11, 20, 23, 10, 10, 11, 20, 15, 22, 21, 10, 13, 21, 25, 11, 29, 25, 19, 20, 18, 17, 19, 19, 15, 18, 16, 16, 25, 15, 22, 25, 28, 23, 16, 20, 21, 13, 26, 18, 21, 15, 27, 17, 20, 22, 23, 11, 14, 29, 21, 21, 17, 25, 10, 14, 20, 25, 11, 22, 29, 11, 21, 11, 12, 17, 27, 16, 29, 17, 14, 12, 11, 20, 21, 27, 22, 15, 10, 21, 20, 17

**File Path**: `data/SCOPe/version_${scope_version}/${dataset_name}/processed/${reader_name}/data.pt`


### Structure of `data.pt`

The `data.pt` file is a list where each element is a dictionary with the following keys:

- **`features`**: 
  - **Description**: This key holds the input features for the model. In the entry printed above, they are the token indices of the encoded protein sequence.

- **`labels`**: 
  - **Description**: This key contains the labels or target values associated with each instance. Labels are also stored as tensors and are used by the model to learn and make predictions.

- **`ident`**: 
  - **Description**: This key holds identifiers for each data instance. These identifiers help track and reference the individual samples in the dataset.


## <u>classes.txt</u> File

**Description**: A file containing the list of selected SCOPe **labels** based on the specified threshold. This file is crucial for ensuring that only relevant **labels** are included in the dataset.

In [9]:
with open(os.path.join(scope_class.processed_dir_main, "classes.txt"), "r") as file:
    for i in range(15):
        line = file.readline()
        print(line.strip())

class_48724
class_53931
class_310555
fold_48725
fold_56111
fold_56234
fold_310573
superfamily_48726
superfamily_56112
superfamily_56235
superfamily_310607
family_48942
family_56251
family_191359
family_191470



**File Path**: `data/SCOPe/version_${scope_version}/${dataset_name}/processed/classes.txt`

The `classes.txt` file lists the selected SCOPe classes, chosen based on the specified threshold. Each line holds one label: the hierarchy level followed by the SCOPe class ID (for example, `fold_48725`), which identifies a specific class within the SCOPe ontology.

This file is essential for organizing the data and ensuring that only relevant classes, as defined by the threshold, are included in subsequent processing and analysis tasks.


## <u>splits.csv</u> File

**Description**: Contains saved data splits from previous runs. During subsequent runs, this file is used to reconstruct the train, validation, and test splits by filtering the encoded data (`data.pt`) based on the IDs stored in `splits.csv`.

In [10]:
csv_df = pd.read_csv(os.path.join(scope_class.processed_dir_main, "splits.csv"))
csv_df.head()

| | id | split |
|---|---|---|
| 0 | 1 | train |
| 1 | 3 | train |
| 2 | 4 | train |
| 3 | 6 | train |
| 4 | 9 | train |




**File Path**: `data/SCOPe/version_${scope_version}/${dataset_name}/processed/splits.csv`

Because the splits are restored from the stored IDs rather than regenerated, data splitting stays consistent and reproducible, which allows reliable evaluation and comparison of model performance across different runs.
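
Reconstructing a split by filtering on stored IDs can be sketched as follows. The sketch uses toy data and only the standard library: the CSV content, the split names `validation` and `test`, and both helper functions are assumptions for illustration, and the entries stand in for the list loaded from `data.pt`.

```python
import csv
import io
from typing import Dict, List, Set

# Toy stand-ins for splits.csv and for the list of encoded entries in data.pt.
splits_csv = io.StringIO("id,split\n1,train\n3,train\n2,validation\n5,test\n")
encoded = [{"ident": i, "features": [10 + i]} for i in range(1, 6)]


def load_split_ids(handle) -> Dict[str, Set[int]]:
    """Group the stored IDs by split name."""
    ids: Dict[str, Set[int]] = {}
    for row in csv.DictReader(handle):
        ids.setdefault(row["split"], set()).add(int(row["id"]))
    return ids


def filter_by_split(entries: List[dict], split_ids: Dict[str, Set[int]], split: str) -> List[dict]:
    """Keep only the entries whose identifier belongs to the requested split."""
    wanted = split_ids.get(split, set())
    return [entry for entry in entries if entry["ident"] in wanted]


split_ids = load_split_ids(splits_csv)
train_entries = filter_by_split(encoded, split_ids, "train")
print([entry["ident"] for entry in train_entries])  # [1, 3]
```

Entries whose ID does not appear in the file (ID 4 in this toy data) end up in no split, so a stale `splits.csv` silently drops data; checking the combined size of all splits against the dataset size is a cheap safeguard.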


In [11]:
# `SCOPeOver2000` is defined in the same module as `SCOPeOver50` and must be imported first.
from chebai.preprocessing.datasets.scope.scope import SCOPeOver2000

# You can specify a literal path for `splits_file_path`, or, if another data class instance
# (here `scope_class`) is already defined, reuse its `splits_file_path` attribute for consistency.
scope_class_with_splits = SCOPeOver2000(
    scope_version="2.08",
    # splits_file_path="data/SCOPe/version_2.08/SCOPe50/processed/splits.csv",  # Literal path option
    splits_file_path=scope_class.splits_file_path,  # Use path from the existing `scope_class` instance
)

---

## 5.1 Protein Representation Using Amino Acid Sequence Notation

Proteins are composed of chains of amino acids, and these sequences can be represented using a one-letter notation for each amino acid. This notation provides a concise way to describe the primary structure of a protein.

### Example Protein Sequence

Protein-Chain: PDB ID:**1cph** Chain ID:**B** mol:protein length:30  INSULIN (PH 10)
<br>See [1cph_B](https://www.rcsb.org/sequence/1CPH)

- **Sequence**: `FVNQHLCGSHLVEALYLVCGERGFFYTPKA`
- **Sequence Length**: 30

In this sequence, each letter corresponds to a specific amino acid. This notation is widely used in bioinformatics and molecular biology to represent protein sequences.

### Tokenization and Encoding

To tokenize and numerically encode this protein sequence, the `ProteinDataReader` class is used. This class allows for n-gram tokenization, where the `n_gram` parameter defines the size of the tokenized units. If `n_gram` is not provided (default is `None`), each amino acid letter is treated as a single token.

For more details, you can explore the implementation of the `ProteinDataReader` class in the source code [here](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/reader.py).

In [12]:
from chebai.preprocessing.reader import ProteinDataReader

In [13]:
protein_dr_3gram = ProteinDataReader(n_gram=3)
protein_dr = ProteinDataReader()

In [14]:
protein = "FVNQHLCGSHLVEALYLVCGERGFFYTPKA"
print(protein_dr._read_data(protein))
print(protein_dr_3gram._read_data(protein))

[25, 28, 19, 18, 29, 17, 24, 13, 11, 29, 17, 28, 27, 14, 17, 22, 17, 28, 24, 13, 27, 16, 13, 25, 25, 22, 15, 23, 21, 14]
[5023, 2218, 3799, 2290, 6139, 2208, 6917, 4674, 484, 439, 2737, 851, 365, 2624, 3240, 4655, 1904, 3737, 1453, 2659, 5160, 3027, 2355, 7163, 4328, 3115, 6207, 1234]


The numbers above are derived from the index of each token in the [`tokens.txt`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/bin/protein_token/tokens.txt) file, which is used by the `ProteinDataReader` class.

With the default reader, each token in `tokens.txt` corresponds to a single amino-acid letter. The encoded values are the token indices shifted by the constant `EMBEDDING_OFFSET`.
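
The encoding step can be mimicked with a small pure-Python sketch. Two things here are assumptions inferred from the outputs in this notebook rather than read from the package: an offset of 10 (for instance, `N` is at position 9 among the first ten tokens reported during setup and is encoded as 19 above), and overlapping n-grams (the 30-letter insulin chain yields 28 values for `n_gram=3`). The functions `split_tokens` and `encode` are hypothetical and do not reproduce `ProteinDataReader` itself.

```python
from typing import List, Optional

# Assumption: offset value inferred from the outputs above, not read from the package.
EMBEDDING_OFFSET = 10

# The first ten tokens reported by the setup step; the real tokens.txt holds more entries.
TOY_TOKENS = ["M", "S", "I", "G", "A", "T", "R", "L", "Q", "N"]


def split_tokens(sequence: str, n_gram: Optional[int] = None) -> List[str]:
    """Return single letters, or overlapping n-grams when n_gram is given."""
    if n_gram is None:
        return list(sequence)
    return [sequence[i : i + n_gram] for i in range(len(sequence) - n_gram + 1)]


def encode(sequence: str, tokens: List[str], n_gram: Optional[int] = None) -> List[int]:
    """Map each token to its position in the token list, shifted by the offset."""
    return [tokens.index(tok) + EMBEDDING_OFFSET for tok in split_tokens(sequence, n_gram)]


print(encode("NQLGS", TOY_TOKENS))  # [19, 18, 17, 13, 11]
print(len(split_tokens("FVNQHLCGSHLVEALYLVCGERGFFYTPKA", n_gram=3)))  # 28
```

The sketch raises a `ValueError` for letters outside the toy token list, so it only works for sequences built from those ten letters.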

---