# Introduction

This notebook serves as a guide for new users of the `chebai` package, which is used for working with chemical data, especially focusing on ChEBI (Chemical Entities of Biological Interest). This notebook will explain how to instantiate the main data class, how the data files are structured, and how to work with different molecule encodings.

---


# 1. Instantiation of a Data Class

To start working with `chebai`, you first need to instantiate a ChEBI data class. This class is responsible for managing, interacting with, and preprocessing the ChEBI chemical data.

### Inheritance Hierarchy

ChEBI data classes inherit from `_DynamicDataset`, which in turn inherits from `XYBaseDataModule`. Specifically:

- **`_DynamicDataset`**: This class serves as an intermediate base class that provides additional functionality or customization for datasets that require dynamic behavior. It inherits from `XYBaseDataModule`, which provides the core methods for data loading and processing.

- **`XYBaseDataModule`**: This is the base class for data modules, providing foundational properties and methods for handling and processing datasets, including data splitting, loading, and preprocessing.

In summary, ChEBI data classes are designed to manage and preprocess chemical data effectively by leveraging the capabilities provided by `XYBaseDataModule` through the `_DynamicDataset` intermediary.

### Explanation
A ChEBI data class can be configured with the following main parameters:

- **chebi_version (int)**: Specifies the version of the ChEBI database to be used. The default is `200`. Specifying a version ensures the reproducibility of your experiments by using a consistent dataset.

- **chebi_version_train (int, optional)**: The version of ChEBI to use specifically for training and validation. If not set, the `chebi_version` specified will be used for all data splits, including training, validation, and test. Defaults to `None`.

- **single_class (int, optional)**: The ID of the single class to predict. If not set, predictions will be made for all available labels. Defaults to `None`.

- **dynamic_data_split_seed (int, optional)**: The seed for random data splitting, which ensures reproducibility. Defaults to `42`.

- **splits_file_path (str, optional)**: Path to a CSV file containing data splits. If not provided, the class will handle splits internally. Defaults to `None`.

- **kwargs**: Additional keyword arguments passed to `XYBaseDataModule`.

These parameters provide flexibility in handling and processing the data, allowing you to set specific versions for different stages of analysis and manage how data is split for training and validation.
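
To see why a fixed `dynamic_data_split_seed` makes splits reproducible, here is a minimal sketch that uses only the standard library. The `seeded_split` helper and the sample IDs are hypothetical illustrations; `chebai` performs its own splitting internally, which may differ in detail.

```python
import random

def seeded_split(ids, train_ratio, seed):
    """Shuffle ids with a fixed seed and cut them into two parts (hypothetical helper)."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # private RNG, global random state is untouched
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

sample_ids = list(range(100))  # placeholder IDs, not real ChEBI IDs
first = seeded_split(sample_ids, 0.85, seed=42)
second = seeded_split(sample_ids, 0.85, seed=42)
print(first == second)  # the same seed yields the same split
```

Changing the seed would, in general, produce a different assignment of IDs to the two parts, which is why the seed should be recorded alongside experiment results.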

### Additional Input Parameters

The `XYBaseDataModule` class, from which ChEBI data classes inherit, includes several important parameters for data loading and processing:

- **batch_size (int)**: The batch size for data loading. Default is `1`.

- **train_split (float)**: The ratio of training data to total data and the ratio of test data to (validation + test) data. Default is `0.85`.

- **reader_kwargs (dict)**: Additional keyword arguments to be passed to the data reader. Default is `None`.

- **prediction_kind (str)**: Specifies the kind of prediction to be performed, relevant only for the `predict_dataloader`. Default is `"test"`.

- **data_limit (Optional[int])**: The maximum number of data samples to load. If set to `None`, the complete dataset will be used. Default is `None`.

- **label_filter (Optional[int])**: The index of the label to filter. Default is `None`.

- **balance_after_filter (Optional[float])**: The ratio of negative samples to positive samples after filtering. Default is `None`.

- **num_workers (int)**: The number of worker processes for data loading. Default is `1`.

- **inner_k_folds (int)**: The number of folds for inner cross-validation. Use `-1` to disable inner cross-validation. Default is `-1`.

- **fold_index (Optional[int])**: The index of the fold to use for training and validation. Default is `None`.

- **base_dir (Optional[str])**: The base directory for storing processed and raw data. Default is `None`.

- **kwargs**: Additional keyword arguments.

These parameters allow you to control various aspects of data loading, processing, and splitting, providing flexibility in how datasets are managed throughout your analysis pipeline.
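
The two roles of `train_split` are easiest to see with numbers. The sketch below applies the definition given above to the default value of `0.85`; the helper name `split_fractions` is hypothetical and not part of `chebai`.

```python
def split_fractions(train_split):
    """Return (train, validation, test) fractions of the full dataset."""
    train = train_split
    remainder = 1.0 - train_split      # shared by validation and test
    test = remainder * train_split     # test / (validation + test) == train_split
    validation = remainder - test
    return train, validation, test

train, validation, test = split_fractions(0.85)
print(f"train={train:.4f} validation={validation:.4f} test={test:.4f}")
```

With the default, about 85% of the data is used for training, 12.75% for testing, and 2.25% for validation.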


# Available ChEBI Data Classes

## `ChEBIOver100`
A class for extracting data from the ChEBI dataset with a threshold of 100 for selecting classes.

- **Inheritance**: Inherits from `ChEBIOverX`.

## `ChEBIOver50`
A class for extracting data from the ChEBI dataset with a threshold of 50 for selecting classes.

- **Inheritance**: Inherits from `ChEBIOverX`.

## `ChEBIOver100DeepSMILES`
A class for extracting data from the ChEBI dataset using the DeepChem SMILES reader with a threshold of 100.

- **Inheritance**: Inherits from `ChEBIOverXDeepSMILES` and `ChEBIOver100`.

## `ChEBIOver100SELFIES`
A class for extracting data from the ChEBI dataset using the SELFIES reader with a threshold of 100.

- **Inheritance**: Inherits from `ChEBIOverXSELFIES` and `ChEBIOver100`.

## `ChEBIOver50SELFIES`
A class for extracting data from the ChEBI dataset using the SELFIES reader with a threshold of 50.

- **Inheritance**: Inherits from `ChEBIOverXSELFIES` and `ChEBIOver50`.

## `ChEBIOver50Partial`
A dataset class that extracts a part of ChEBI based on subclasses of a given top class, with a threshold of 50 for selecting classes.

- **Inheritance**: Inherits from `ChEBIOverXPartial` and `ChEBIOver50`.


In [18]:
from chebai.preprocessing.datasets.chebi import ChEBIOver50

In [20]:
chebi_class = ChEBIOver50(chebi_version=231)

---

# 2. Preparation / Setup Methods

Once a ChEBI data class instance is created, it typically requires preparation before use. This step is necessary to download or load the relevant data files and set up the internal data structures.

### Why is Preparation Needed?

- **Data Availability**: The preparation step ensures that the required ChEBI data files are downloaded or loaded, which are essential for analysis.
- **Data Integrity**: It ensures that the data files are up-to-date and compatible with the specified ChEBI version.

### Main Methods for Data Preprocessing

The data preprocessing in a data class involves two main methods:

1. **`prepare_data` Method**:
   - **Purpose**: This method checks for the presence of raw data in the specified directory. If the raw data is missing, it fetches the ontology, creates a dataframe, and saves it to a file (`data.pkl`). The dataframe includes columns such as IDs, data representations, and labels.
   - **Documentation**: [PyTorch Lightning - `prepare_data`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#prepare-data)

2. **`setup` Method**:
   - **Purpose**: This method sets up the data module for training, validation, and testing. It checks for the processed data and, if necessary, performs additional setup to ensure the data is ready for model input. It also handles cross-validation settings if enabled.
   - **Description**: Transforms `data.pkl` into a model input data format (`data.pt`), ensuring that the data is in a format compatible for input to the model. The transformed data contains the following keys: `ident`, `features`, `labels`, and `group`. This method uses a subclass of Data Reader to perform the transformation.
   - **Documentation**: [PyTorch Lightning - `setup`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#setup)

These methods ensure that the data is correctly prepared and set up for subsequent use in training and validation processes.
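
The "create only if missing" behavior of the two methods can be sketched with the standard library. This is an illustrative pattern, not the `chebai` implementation: the helpers `prepare` and `setup_encoded`, the toy record, and the use of `pickle` for both stages are assumptions made to keep the example self-contained.

```python
import pickle
import tempfile
from pathlib import Path

def prepare(processed_dir):
    """Stage 1: write data.pkl only if it does not exist yet."""
    target = processed_dir / "data.pkl"
    if not target.exists():
        rows = [{"id": 17051, "name": "fluoride", "SMILES": "[F-]"}]  # toy record
        target.write_bytes(pickle.dumps(rows))
    return target

def setup_encoded(processed_dir, reader_name):
    """Stage 2: derive the reader-specific file from data.pkl if it is missing."""
    target = processed_dir / reader_name / "data.pt"
    if not target.exists():
        rows = pickle.loads((processed_dir / "data.pkl").read_bytes())
        # Placeholder encoding: split the SMILES string into characters.
        encoded = [{"ident": r["id"], "features": list(r["SMILES"])} for r in rows]
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(pickle.dumps(encoded))  # pickle stands in for torch.save
    return target

with tempfile.TemporaryDirectory() as tmp:
    processed = Path(tmp) / "processed"
    processed.mkdir()
    prepare(processed)
    encoded_path = setup_encoded(processed, "smiles_token")
    created = encoded_path.exists()
    relative = encoded_path.relative_to(tmp).as_posix()
print(created, relative)
```

Calling either helper a second time leaves the existing files untouched, which mirrors why repeated runs of the notebook do not repeat the expensive preprocessing.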


In [36]:
chebi_class.prepare_data()
chebi_class.setup()

Check for processed data in data\chebi_v231\ChEBI50\processed\smiles_token
Cross-validation enabled: False


Check for processed data in data\chebi_v231\ChEBI50\processed
saving 771 tokens to G:\github-aditya0by0\python-chebai\chebai\preprocessing\bin\smiles_token\tokens.txt...
first 10 tokens: ['[*-]', '[Al-]', '[F-]', '.', '[H]', '[N]', '(', ')', '[Ag+]', 'C']


---

# 3. Different Data Files Created and their Structure


`chebai` creates and manages several data files during its operation. These files store various chemical data and metadata essential for different tasks. Let’s explore these files and their structures.

### Data Files

1. **Raw data files** (e.g., the `.obo` file)
   - **Description**: Contains the raw ChEBI ontology data, downloaded directly from the ChEBI website. This file serves as the foundation for data processing.
   - **File Path**: `data/${chebi_version}/${dataset_name}/raw/${filename}.obo`

2. **`data.pkl`**
   - **Description**: Generated by the `prepare_data` method, this file contains processed data in a dataframe format. It includes chemical IDs, data representations (such as SMILES strings), and class columns with boolean values.
   - **File Path**: `data/${chebi_version}/${dataset_name}/processed/data.pkl`

3. **`data.pt`**
   - **Description**: Generated by the `setup` method, this file contains encoded data in a format compatible with the PyTorch library. It includes keys such as `ident`, `features`, `labels`, and `group`, ready for model input.
   - **File Path**: `data/${chebi_version}/${dataset_name}/processed/${reader_name}/data.pt`

4. **`classes.txt`**
   - **Description**: A file containing the list of selected ChEBI classes based on the specified threshold. This file is crucial for ensuring that only relevant classes are included in the dataset.
   - **File Path**: `data/${chebi_version}/${dataset_name}/processed/classes.txt`

5. **`splits.csv`**
   - **Description**: Contains saved data splits from previous runs. During subsequent runs, this file is used to reconstruct the train, validation, and test splits by filtering the encoded data (`data.pt`) based on the IDs stored in `splits.csv`.
   - **File Path**: `data/${chebi_version}/${dataset_name}/processed/splits.csv`

### File Structure and Preprocessing Stages

The `chebai` library follows a three-stage preprocessing pipeline, which is reflected in its file structure:

1. **Raw Data Stage**:
   - **File**: `chebi.obo`
   - **Description**: This stage contains the raw ChEBI ontology data, serving as the initial input for further processing.
   - **File Path**: `data/${chebi_version}/${dataset_name}/raw/${filename}.obo`

2. **Processed Data Stage 1**:
   - **File**: `data.pkl`
   - **Description**: This stage includes the data after initial processing. It contains SMILES strings, class columns, and metadata but lacks data splits.
   - **File Path**: `data/${chebi_version}/${dataset_name}/processed/data.pkl`
   - **Additional File**: `classes.txt` - A file listing the relevant ChEBI classes.

3. **Processed Data Stage 2**:
   - **File**: `data.pt`
   - **Description**: This final stage includes the encoded data in a format compatible with PyTorch, ready for model input. This stage also references data splits when available.
   - **File Path**: `data/${chebi_version}/${dataset_name}/processed/${reader_name}/data.pt`
   - **Additional File**: `splits.csv` - Contains saved splits for reproducibility.

### Data Splits

- **Creation**: Data splits are generated dynamically "on the fly" during training and evaluation to ensure flexibility and adaptability to different tasks.
- **Reproducibility**: To maintain consistency across different runs, splits can be reproduced by comparing hashes with a fixed seed value.

### Summary of File Paths

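- **Raw Data**: `data/${chebi_version}/${dataset_name}/raw`
- **Processed Data 1**: `data/${chebi_version}/${dataset_name}/processed`
- **Processed Data 2**: `data/${chebi_version}/${dataset_name}/processed/${reader_name}`

In the examples in this notebook, the version directory is named `chebi_v` followed by the version number (for example `chebi_v231`), `${dataset_name}` is `ChEBI50`, and `${reader_name}` is `smiles_token`.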
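
The layout above can be expressed with `pathlib`. This is a sketch, not `chebai` code: the helper `stage_dirs` is hypothetical, and the directory names are taken from the paths that appear in this notebook's examples (`chebi_v231`, `ChEBI50`, `smiles_token`).

```python
from pathlib import PurePosixPath

def stage_dirs(version_dir, dataset_name, reader_name, base="data"):
    """Return the directory used by each preprocessing stage (hypothetical helper)."""
    root = PurePosixPath(base) / version_dir / dataset_name
    return {
        "raw": root / "raw",
        "processed": root / "processed",
        "encoded": root / "processed" / reader_name,
    }

dirs = stage_dirs("chebi_v231", "ChEBI50", "smiles_token")
for stage, path in dirs.items():
    print(f"{stage:>9}: {path}")
```

Because only the last directory depends on the reader, switching to a different encoding reuses the same `raw` and `processed` directories.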

This structured approach to data management ensures that each stage of data processing is well-organized and documented, from raw data acquisition to the preparation of model-ready inputs. It also facilitates reproducibility and traceability across different experiments.


---

# 4. Information Stored in the Files


## `chebi.obo` File

The `chebi.obo` file is a key resource in the ChEBI (Chemical Entities of Biological Interest) dataset, containing the ontology data that defines various chemical entities and their relationships. This file is downloaded directly from the ChEBI database and serves as the foundational raw data for further processing in `chebai`.

### Structure of `chebi.obo`

The `chebi.obo` file is organized into blocks of text known as "term documents." Each block starts with a `[Term]` header and contains various attributes that describe a specific chemical entity within the ChEBI ontology. These attributes include identifiers, names, relationships to other entities, and more.

#### Example of a Term Document

```plaintext
[Term]
id: CHEBI:24867
name: monoatomic ion
subset: 3_STAR
synonym: "monoatomic ions" RELATED [ChEBI]
is_a: CHEBI:24870
is_a: CHEBI:33238
```

### Breakdown of Attributes

Each term document in the `chebi.obo` file consists of the following key attributes:

- **`[Term]`**: 
  - **Description**: Indicates the beginning of a new term in the ontology. Each term represents a distinct chemical entity.

- **`id: CHEBI:24867`**: 
  - **Description**: A unique identifier for the chemical entity within the ChEBI database.
  - **Example**: `CHEBI:24867` refers to the entity "monoatomic ion."

- **`name: monoatomic ion`**: 
  - **Description**: The common name of the chemical entity. This is the main descriptor used to identify the term.
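  - **Example**: "monoatomic ion" is the name of the entity `CHEBI:24867`.

- **`subset: 3_STAR`**:
  - **Description**: Names a subset of the ontology that the term belongs to, here `3_STAR`.

- **`synonym: "monoatomic ions" RELATED [ChEBI]`**:
  - **Description**: An alternative name for the entity. The `RELATED` scope marks it as a related term within the ChEBI ontology, and `[ChEBI]` gives the source of the synonym.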

- **`is_a: CHEBI:24870`** and **`is_a: CHEBI:33238`**: 
  - **Description**: Defines hierarchical relationships to other terms within the ontology. The `is_a` attribute indicates that the current entity is a subclass or specific instance of the referenced term.
  - **Example**: The entity `CHEBI:24867` ("monoatomic ion") is a subclass of both `CHEBI:24870` and `CHEBI:33238`, meaning that it belongs to both parent classes.
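
As an illustration of this structure, the following sketch parses the example term document into a dictionary using only the standard library. The `parse_term` helper is hypothetical; it is a simplified reader for this one example and not the parser that `chebai` uses.

```python
TERM_DOC = '''[Term]
id: CHEBI:24867
name: monoatomic ion
subset: 3_STAR
synonym: "monoatomic ions" RELATED [ChEBI]
is_a: CHEBI:24870
is_a: CHEBI:33238
'''

def parse_term(text):
    """Collect the attributes of one [Term] block; repeated keys become list entries."""
    term = {}
    for line in text.splitlines():
        if not line or line == "[Term]":
            continue
        key, _, value = line.partition(": ")
        term.setdefault(key, []).append(value)
    return term

term = parse_term(TERM_DOC)
print(term["id"][0], "->", term["is_a"])
```

Storing every attribute as a list keeps repeated keys such as `is_a` intact, which matters because a term can have more than one parent.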

## `data.pkl` File

The `data.pkl` file, generated during the preprocessing stage, contains the processed ChEBI data in a dataframe format. Below is an example of how this data is structured:



### Structure of `data.pkl`
`data.pkl` has the following structure:

- **Column 0**: Contains the ID of each ChEBI data instance.
- **Column 1**: Contains the name of each ChEBI data instance.
- **Column 2**: Contains the SMILES representation of the chemical.
- **Column 3 and onwards**: Contain the boolean labels, one column per selected ChEBI class.

This structure keeps the data organized and ready for the next step, encoding for model input.


In [49]:
import pandas as pd

In [53]:
pkl_df = pd.DataFrame(pd.read_pickle(r"data/chebi_v200/ChEBI50/processed/data.pkl"))
print("Size of the data (rows x columns): ", pkl_df.shape)
pkl_df.head()

Size of the data (rows x columns):  (129184, 1335)


|  | id | name | SMILES | 1722 | 2468 | 2571 | 2580 | 2634 | 3098 | 3992 | ... | 143017 | 143212 | 143813 | 146180 | 147334 | 156473 | 166828 | 166904 | 167497 | 167559 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33429 | monoatomic monoanion | `[*-]` | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 30151 | aluminide(1-) | `[Al-]` | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 16042 | halide anion | `[*-]` | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 17051 | fluoride | `[F-]` | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 28741 | sodium fluoride | `[F-].[Na+]` | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


## `data.pt` File

The `data.pt` file is an important output of the preprocessing stage in `chebai`. It contains data in a format compatible with PyTorch, specifically as a list of dictionaries. Each dictionary in this list is structured to hold key information used for model training and evaluation.

### Structure of `data.pt`

The `data.pt` file is a list where each element is a dictionary with the following keys:

- **`features`**:
  - **Description**: The input features for the model. In the example output below, they are lists of integer token indices produced by the data reader.

- **`labels`**:
  - **Description**: The target values for each instance. In the example output below, they are boolean arrays with one entry per class.

- **`ident`**:
  - **Description**: The identifier of each data instance, used to track and reference individual samples in the dataset.

- **`group`**:
  - **Description**: An additional per-instance field; it is `None` for every entry in the example output below.


In [75]:
import torch

In [77]:
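# weights_only=False is needed on recent PyTorch versions because the file holds plain Python and NumPy objects
data_pt = torch.load(r"data/chebi_v200/ChEBI50/processed/smiles_token/data.pt", weights_only=False)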
print("Type of loaded data:", type(data_pt))

Type of loaded data: <class 'list'>


In [81]:
for i in range(5):
    print(data_pt[i])

{'features': [10], 'labels': array([False, False, False, ..., False, False, False]), 'ident': 33429, 'group': None}
{'features': [11], 'labels': array([False, False, False, ..., False, False, False]), 'ident': 30151, 'group': None}
{'features': [10], 'labels': array([False, False, False, ..., False, False, False]), 'ident': 16042, 'group': None}
{'features': [12], 'labels': array([False, False, False, ..., False, False, False]), 'ident': 17051, 'group': None}
{'features': [12, 13, 32], 'labels': array([False, False, False, ..., False, False, False]), 'ident': 28741, 'group': None}
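
The `features` values above line up with the token list printed during `setup`: `[*-]` is the first token and is encoded as `10`, and `[Al-]` is the second and is encoded as `11`. The sketch below reproduces that mapping under two assumptions that are inferred from this output rather than taken from the `chebai` source: a token's feature value is its position in `tokens.txt` plus an offset of 10, and a simple regular expression is enough to tokenize these SMILES strings. The names `OFFSET`, `TOKEN_PATTERN`, and `encode` are hypothetical.

```python
import re

# First ten tokens as printed by setup() earlier in this notebook.
TOKENS = ['[*-]', '[Al-]', '[F-]', '.', '[H]', '[N]', '(', ')', '[Ag+]', 'C']
OFFSET = 10  # assumption: inferred from the output, where '[*-]' maps to 10

# Bracket atoms, two-letter halogens, or any single character (simplified).
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def encode(smiles):
    """Map each token to its position in TOKENS plus OFFSET."""
    return [TOKENS.index(tok) + OFFSET for tok in TOKEN_PATTERN.findall(smiles)]

print(encode("[*-]"))  # matches 'features': [10] for ident 33429
print(encode("[F-]"))  # matches 'features': [12] for ident 17051
```

Tokens outside the first ten, such as `[Na+]`, would need the full `tokens.txt` file; this sketch raises a `ValueError` for them.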


## `classes.txt` File

The `classes.txt` file lists selected ChEBI (Chemical Entities of Biological Interest) classes. These classes are chosen based on a specified threshold, which is typically used for filtering or categorizing the dataset. Each line in the file corresponds to a unique ChEBI class ID, identifying specific chemical entities within the ChEBI ontology.

This file is essential for organizing the data and ensuring that only relevant classes, as defined by the threshold, are included in subsequent processing and analysis tasks.


In [87]:
with open(r"data/chebi_v200/ChEBI50/processed/classes.txt", "r") as file:
    for i in range(5):
        line = file.readline()
        print(line.strip())

1722
2468
2571
2580
2634
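
The first entries of `classes.txt` match the first label columns of `data.pkl` shown earlier (`1722`, `2468`, `2571`, ...). Assuming that position `i` in a label vector corresponds to line `i` of `classes.txt`, which this ordering suggests but this notebook does not state explicitly, a label vector can be translated back into ChEBI class IDs. The helper `positive_classes` and the example label vector are hypothetical.

```python
def positive_classes(labels, class_ids):
    """Return the class IDs whose label is True (hypothetical helper)."""
    return [cid for cid, flag in zip(class_ids, labels) if flag]

class_ids = [1722, 2468, 2571, 2580, 2634]   # first five lines of classes.txt
labels = [False, True, False, False, True]   # made-up label vector for illustration
print(positive_classes(labels, class_ids))
```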


## `splits.csv` File

The `splits.csv` file contains the saved data splits from previous runs, including the train, validation, and test sets. During subsequent runs, this file is used to reconstruct these splits by filtering the encoded data (`data.pt`) based on the IDs stored in `splits.csv`. This ensures consistency and reproducibility in data splitting, allowing for reliable evaluation and comparison of model performance across different runs.


In [98]:
csv_df = pd.read_csv(r"data/chebi_v231/ChEBI50/processed/splits.csv")
csv_df.head()

|  | id | split |
|---|---|---|
| 0 | 33429 | train |
| 1 | 30151 | train |
| 2 | 17051 | train |
| 3 | 32129 | train |
| 4 | 30340 | train |
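
The filtering step described in this section can be sketched as follows. The code uses only the standard library; the in-memory CSV and the encoded records are toy data, the `validation` and `test` assignments are invented for illustration, and `rebuild_splits` is a hypothetical helper rather than a `chebai` function.

```python
import csv
import io

SPLITS_CSV = """id,split
33429,train
30151,train
17051,validation
28741,test
"""  # toy assignments, not the contents of a real splits.csv

encoded = [  # stand-in for the list of dictionaries stored in data.pt
    {"ident": 33429, "features": [10]},
    {"ident": 30151, "features": [11]},
    {"ident": 17051, "features": [12]},
    {"ident": 28741, "features": [12, 13, 32]},
]

def rebuild_splits(records, csv_text):
    """Group encoded records by the split recorded for their ID."""
    reader = csv.DictReader(io.StringIO(csv_text))
    split_of = {int(row["id"]): row["split"] for row in reader}
    splits = {}
    for record in records:
        name = split_of.get(record["ident"])
        if name is not None:  # IDs missing from the CSV are skipped
            splits.setdefault(name, []).append(record)
    return splits

splits = rebuild_splits(encoded, SPLITS_CSV)
print({name: [r["ident"] for r in recs] for name, recs in splits.items()})
```

Keying the lookup on the ID rather than on row position means the reconstruction does not depend on the order in which the encoded records are stored.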


---

# 5. Example Molecule: Different Encodings

`chebai` supports various encodings for molecules, such as SMILES and SELFIES. Let's take an example molecule and explore its different encodings.

### Explanation:
- **SMILES (Simplified Molecular Input Line Entry System)**: A linear notation for representing molecular structures.
- **SELFIES (SELF-referencIng Embedded Strings)**: A robust encoding in which every string corresponds to a valid molecular graph.

To illustrate different encodings of a molecule, let's consider the molecule **benzene**, which has the chemical formula **C₆H₆**. Here are the different encodings for benzene:

### 1. **SMILES (Simplified Molecular Input Line Entry System)**
   - **Benzene SMILES**: `c1ccccc1`
   - **Explanation**: 
     - `c1ccccc1` represents a six-membered aromatic ring, with lowercase `c` indicating aromatic carbon atoms.

### 2. **SELFIES (SELF-referencIng Embedded Strings)**
   - **Benzene SELFIES**: `[C][=C][C][=C][C][=C][Ring1][=Branch1]`
   - **Explanation**: 
     - Each `[C]` represents a carbon atom, and `[=C]` represents a carbon atom attached by a double bond.
     - SELFIES encodes the alternating single and double bonds of benzene's Kekulé structure, and the trailing `[Ring1][=Branch1]` symbols close the six-membered ring.

### 3. **InChI (IUPAC International Chemical Identifier)**
   - **Benzene InChI**: `InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H`
   - **Explanation**: 
     - This InChI string provides a systematic representation of benzene's structure, showing the connections between the carbon and hydrogen atoms.

### 4. **InChIKey**
   - **Benzene InChIKey**: `UHOVQNZJYSORNB-UHFFFAOYSA-N`
   - **Explanation**: 
     - A hashed, fixed-length version of the InChI string, used for easier database searching and indexing.
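
The fixed layout of a standard InChIKey can be checked with a regular expression: 14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, and one final letter, for 27 characters in total. The helper `looks_like_inchikey` is a hypothetical name, and the check only validates the layout, not whether the key belongs to a real compound.

```python
import re

INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def looks_like_inchikey(key):
    """Check the 14-10-1 block layout of an InChIKey."""
    return INCHIKEY_PATTERN.fullmatch(key) is not None

benzene_key = "UHOVQNZJYSORNB-UHFFFAOYSA-N"
print(len(benzene_key), looks_like_inchikey(benzene_key))
```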

### 5. **Canonical SMILES**
   - **Benzene Canonical SMILES**: `c1ccccc1`
   - **Explanation**:
     - The canonical SMILES for benzene is identical to the regular SMILES, ensuring a unique and consistent representation for database use.

### 6. **SMARTS (SMILES Arbitrary Target Specification)**
   - **Benzene SMARTS**: `[c]1[c][c][c][c][c]1`
   - **Explanation**: 
     - This SMARTS pattern represents the benzene ring structure, which can be used for substructure searching in larger molecules.

These different encodings provide various ways to represent the structure and properties of benzene, each suited to different computational tasks such as molecule identification, database searches, and pattern recognition in cheminformatics.