
# Table of Contents
1. [Setup](#1)
2. [Load Data](#2)
3. [Data Preprocessing](#3)
4. [Model Prediction](#4)
5. [Results Analysis](#5)


# 1. Setup<a id = 1></a>
In this section, we mount Google Drive to access the required datasets and add the project's `predictions` folder to the Python import path.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import sys
sys.path.append('/content/drive/MyDrive/Chemoinformatics/Projects/acetylcholinesterase_2016/predictions/')
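
The same Drive prefix recurs in every path used later in the notebook. One way to keep those paths in sync, sketched here with `pathlib`, is to derive them from a single base directory. The constant names (`PROJECT_DIR`, `DATA_CSV`, and so on) are illustrative and do not appear in the original notebook; the folder and file names are the ones used in the other cells.

```python
from pathlib import PurePosixPath

# Base project folder on the mounted Drive (taken from the paths used in this notebook).
PROJECT_DIR = PurePosixPath(
    "/content/drive/MyDrive/Chemoinformatics/Projects/acetylcholinesterase_2016"
)

# Hypothetical constant names; the file names match the ones used in the notebook.
PREDICTIONS_DIR = PROJECT_DIR / "predictions"
DATA_CSV = PROJECT_DIR / "data" / "acetylcholinesterase_bioactivity_data_3class_pIC50.csv"
TRAINING_DESCRIPTORS_CSV = PROJECT_DIR / "data" / "descriptors_output.csv"
MODEL_PATH = PREDICTIONS_DIR / "rf_model.pkl"

print(MODEL_PATH)
```

In Colab, `pathlib.Path` behaves the same way, and `sys.path.append(str(PREDICTIONS_DIR))` could stand in for the literal string in the cell above.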

# 2. Load Data<a id = 2></a>
Here, we load the provided bioactivity dataset and select a few sample molecules for demonstration purposes.


In [5]:
import pandas as pd

# Load the provided data
df_sample = pd.read_csv("/content/drive/MyDrive/Chemoinformatics/Projects/acetylcholinesterase_2016/data/acetylcholinesterase_bioactivity_data_3class_pIC50.csv")

# Select the first five molecules for demonstration
sample_molecules = df_sample[['canonical_smiles', 'molecule_chembl_id']].head(5)
sample_molecules

| | canonical_smiles | molecule_chembl_id |
|---|---|---|
| 0 | CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1 | CHEMBL133897 |
| 1 | O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1 | CHEMBL336398 |
| 2 | CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1 | CHEMBL131588 |
| 3 | O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F | CHEMBL130628 |
| 4 | CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C | CHEMBL130478 |


# 3. Data Preprocessing<a id = 3></a>
Before prediction, the data goes through several preprocessing steps: the SMILES strings and ChEMBL IDs are written to a `.smi` file, and PaDEL-Descriptor computes molecular descriptors from that file so that the input is in the format the prediction model expects.


In [6]:
# Write SMILES and ChEMBL ID as tab-separated columns (no header, no index) to a .smi file
sample_molecules_selection = sample_molecules[['canonical_smiles', 'molecule_chembl_id']]
sample_file_path = "sample_molecules.smi"
sample_molecules_selection.to_csv(sample_file_path, sep='\t', index=False, header=False)

sample_file_path

'sample_molecules.smi'

In [7]:
! unzip /content/drive/MyDrive/Chemoinformatics/Projects/acetylcholinesterase_2016/predictions/padel.zip

Archive:  /content/drive/MyDrive/Chemoinformatics/Projects/acetylcholinesterase_2016/predictions/padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  infla

In [9]:
! bash /content/drive/MyDrive/Chemoinformatics/Projects/acetylcholinesterase_2016/predictions/padel.sh

Processing CHEMBL133897 in sample_molecules.smi (1/5). 
Processing CHEMBL336398 in sample_molecules.smi (2/5). 
Processing CHEMBL131588 in sample_molecules.smi (3/5). Average speed: 1.82 s/mol.
Processing CHEMBL130628 in sample_molecules.smi (4/5). Average speed: 0.92 s/mol.
Processing CHEMBL130478 in sample_molecules.smi (5/5). Average speed: 1.21 s/mol.
Descriptor calculation completed in 2.781 secs . Average speed: 0.56 s/mol.
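
The log above lists the molecules in input order, yet the `Name` column read back from `sample_descriptors_output.csv` appears in a different order in the results table of section 5. The descriptor rows therefore cannot be assumed to follow the order of the `.smi` file (possibly because molecules are processed in parallel, though the notebook does not say). A minimal sketch of restoring a known order with the standard `csv` module, assuming the output has a `Name` column holding the ChEMBL ID; the helper name `reorder_rows` is hypothetical and the descriptor column below is made up:

```python
import csv
import io


def reorder_rows(rows, wanted_ids, id_column="Name"):
    """Return rows sorted to match wanted_ids; raise if an ID is missing."""
    by_id = {row[id_column]: row for row in rows}
    missing = [mol_id for mol_id in wanted_ids if mol_id not in by_id]
    if missing:
        raise KeyError(f"IDs missing from descriptor output: {missing}")
    return [by_id[mol_id] for mol_id in wanted_ids]


# Tiny stand-in for a descriptor file; the "Desc1" column and its values are invented.
descriptor_csv = io.StringIO(
    "Name,Desc1\n"
    "CHEMBL336398,0\n"
    "CHEMBL133897,1\n"
)
rows = list(csv.DictReader(descriptor_csv))
ordered = reorder_rows(rows, ["CHEMBL133897", "CHEMBL336398"])
print([row["Name"] for row in ordered])
```

Matching on the ID instead of the row position also makes a missing molecule fail loudly instead of silently shifting every later row.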


# 4. Model Prediction<a id = 4></a>

After preprocessing the data and ensuring that the model and the input data are compatible, we can proceed to predict the bioactivity of the given compounds using our trained model.

The `predict_pIC50` function, which I defined previously and import here from the `predict_function` module, performs this prediction. It takes the input data path, the model path, and the original data path as arguments and returns the predicted pIC50 values.
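
The source of `predict_pIC50` is not shown in this notebook, so the sketch below is only a guess at one step such a function might need: putting the new molecules' descriptor columns in the same order as the training descriptors (a plausible reason for passing the original data path). The helper `align_features`, the stand-in `MeanModel`, and all descriptor names and values are hypothetical. A real implementation would presumably load `rf_model.pkl` with `joblib.load` and call the loaded model's `predict` method instead.

```python
def align_features(row, training_columns, fill_value=0.0):
    """Order one molecule's descriptors like the training columns.

    Descriptors absent from the row get fill_value; extra ones are dropped.
    """
    return [float(row.get(name, fill_value)) for name in training_columns]


class MeanModel:
    """Stand-in for a trained regressor; NOT the notebook's model."""

    def predict(self, matrix):
        return [sum(features) / len(features) for features in matrix]


# Made-up descriptor names and values, for illustration only.
training_columns = ["FP1", "FP2", "FP3"]
sample_rows = [
    {"Name": "mol_a", "FP3": "1", "FP1": "0"},  # FP2 missing, columns shuffled
    {"Name": "mol_b", "FP1": "1", "FP2": "1", "FP3": "1", "FP9": "1"},  # extra column
]

matrix = [align_features(row, training_columns) for row in sample_rows]
print(matrix)
print(MeanModel().predict(matrix))
```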

In [10]:
sample_descriptors = pd.read_csv('sample_descriptors_output.csv')

In [11]:
!pip install joblib



In [12]:
from predict_function import predict_pIC50

# Paths to the required files
input_data_path = 'sample_descriptors_output.csv'
model_path = '/content/drive/MyDrive/Chemoinformatics/Projects/acetylcholinesterase_2016/predictions/rf_model.pkl'
original_data_path = '/content/drive/MyDrive/Chemoinformatics/Projects/acetylcholinesterase_2016/data/descriptors_output.csv'

# Get predictions
predictions = predict_pIC50(input_data_path, model_path, original_data_path)

In [13]:
print(predictions)

[6.7224679  5.74351207 6.58242185 5.32671159 5.60673983]
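
For readers more used to concentrations: pIC50 is conventionally defined as -log10 of the IC50 expressed in mol/L, so each predicted value can be converted back to an IC50. The sketch below assumes the dataset follows that convention; the function name `pic50_to_ic50_nm` is illustrative.

```python
def pic50_to_ic50_nm(pic50):
    """Convert pIC50 to IC50 in nanomolar, assuming pIC50 = -log10(IC50 in mol/L)."""
    return 10 ** (9 - pic50)


# The five predicted values printed above.
predicted_pic50 = [6.7224679, 5.74351207, 6.58242185, 5.32671159, 5.60673983]
for value in predicted_pic50:
    print(f"pIC50 {value:.2f} -> IC50 of about {pic50_to_ic50_nm(value):.0f} nM")
```

Under this definition, a difference of one pIC50 unit corresponds to a tenfold difference in IC50, which is worth keeping in mind when judging prediction errors.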


In [14]:
original_data = pd.read_csv("/content/drive/MyDrive/Chemoinformatics/Projects/acetylcholinesterase_2016/data/acetylcholinesterase_bioactivity_data_3class_pIC50.csv")

In [15]:
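selected_molecules = ['CHEMBL336398', 'CHEMBL133897', 'CHEMBL131588', 'CHEMBL130628', 'CHEMBL130478']
# Note: isin() keeps the dataset's own row order, not the order of selected_molecules,
# so these values must be matched to molecules by ID rather than by position.
selected_data = original_data[original_data['molecule_chembl_id'].isin(selected_molecules)][['pIC50']]
selected_pIC50_values = selected_data['pIC50'].values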

# 5. Results Analysis<a id = 5></a>

By comparing the predicted values with the experimental pIC50 values, we can assess how closely the model reproduces known bioactivities.

In this notebook, I took five compounds from the original dataset and compared their predicted pIC50 values with the experimental ones. Because these compounds come from the dataset itself, and may have been part of the training data, the comparison is a sanity check of the prediction pipeline rather than a measure of how well the model generalizes to new, unseen compounds.
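
A per-molecule comparison is easier to judge alongside summary errors. A minimal sketch of mean absolute error (MAE) and root-mean-square error (RMSE), with both inputs keyed by molecule ID so that values are paired by identity rather than by row position. The function name `regression_errors` is hypothetical, and the IDs and numbers are placeholders, not values from the dataset.

```python
import math


def regression_errors(predicted, observed):
    """Return (MAE, RMSE) over the molecule IDs present in both mappings."""
    shared_ids = sorted(set(predicted) & set(observed))
    if not shared_ids:
        raise ValueError("no molecule IDs in common")
    residuals = [predicted[mol_id] - observed[mol_id] for mol_id in shared_ids]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return mae, rmse


# Placeholder IDs and values, not taken from the dataset.
predicted = {"mol_a": 6.0, "mol_b": 5.0}
observed = {"mol_b": 5.5, "mol_a": 7.0}  # different order on purpose
mae, rmse = regression_errors(predicted, observed)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```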

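In [16]:
molecule_names = sample_descriptors['Name']

# The descriptor file lists molecules in a different order than the dataset,
# so look up each experimental pIC50 by ChEMBL ID instead of relying on row position
pIC50_by_id = (
    original_data.drop_duplicates(subset='molecule_chembl_id')
    .set_index('molecule_chembl_id')['pIC50']
)

# Create a DataFrame
results_df = pd.DataFrame({
    'Molecule Name': molecule_names,
    'Predicted pIC50': predictions,
    'Original pIC50': molecule_names.map(pIC50_by_id).values
})

# Display the DataFrame
print(results_df)

  Molecule Name  Predicted pIC50  Original pIC50
0  CHEMBL336398         6.722468        7.000000
1  CHEMBL133897         5.743512        6.124939
2  CHEMBL130628         6.582422        6.522879
3  CHEMBL131588         5.326712        4.301030
4  CHEMBL130478         5.606740        6.096910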
