<a href="https://colab.research.google.com/github/Abeeraiftikhar/AI_and_Drug_Discovery_Course_2026/blob/main/Assignment_2_QSAR_Data_Curation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **AI And Biotechnology/Bioinformatics**

## **AI and Drug Discovery Course: QSAR Modeling**
This notebook demonstrates how to collect and preprocess bioactivity data from ChEMBL for QSAR modeling.

## **Target Selection: BCL2**

BCL2 (B-cell lymphoma 2) was selected as the molecular target for this assignment due to its central role in apoptosis regulation and its strong clinical relevance in cancer research, particularly in hematological malignancies.

Bioactivity data associated with BCL2 were retrieved from the ChEMBL database and evaluated for QSAR analysis. A total of 86 IC₅₀ bioactivity records were identified. The dataset was curated and analyzed for educational and methodological demonstration of QSAR data retrieval and preprocessing steps.

# **Part 1: Data Collection & Curation**

**First, connect Google Colab to Google Drive so that files on Drive are accessible from within Colab.**

This allows us to:
* Save datasets
* Reload data across sessions
* Organize project files




In [4]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

MessageError: Error: credential propagation was unsuccessful

**Now create a "data" folder inside the "Colab Notebooks" folder on Google Drive.**

In [5]:
! mkdir "/content/gdrive/My Drive/Colab Notebooks/data"
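
The mount step above reported a credential error, so the Drive paths used later in this notebook cannot be reached in that session. The following is a minimal sketch of one way to keep working in that case. The helper name `resolve_data_dir` and the local fallback folder `data` are assumptions for illustration, not part of Colab or of the original workflow.

```python
from pathlib import Path


def resolve_data_dir(drive_dir: str, fallback_dir: str = "data") -> Path:
    """Return the Drive data folder if Drive is mounted, else a local folder.

    Hypothetical helper: the Drive folder is used only when its parent
    directory already exists, which is taken here as a sign that the
    mount succeeded.
    """
    drive_path = Path(drive_dir)
    target = drive_path if drive_path.parent.is_dir() else Path(fallback_dir)
    # Create the chosen folder if it is missing; do nothing if it exists.
    target.mkdir(parents=True, exist_ok=True)
    return target


data_dir = resolve_data_dir("/content/gdrive/My Drive/Colab Notebooks/data")
print(data_dir)
```

Files written to the local fallback live only as long as the Colab runtime, so they would still need to be downloaded or copied to Drive once the mount works.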


## Install and Import Required Libraries
We install the ChEMBL web resource client package so that we can retrieve bioactivity data.

In [6]:
!pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-25.3.0-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-2.2.1-py3-none-any.whl.metadata (5.6 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Downloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
Downloading cattrs-25.3.0-py3-none-any.whl (70 kB)

# Import Libraries
* pandas for data handling
* `new_client` from `chembl_webresource_client` for accessing the ChEMBL database

In [7]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

# Step 1: Search for Target Protein

## **Target Identification (BCL2)**
Search ChEMBL for the BCL2 target and select the most relevant entry.


In [33]:
target = new_client.target
target_query = target.search("BCL2")
targets = pd.DataFrame.from_dict(target_query)
targets.head()


Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mus musculus,Apoptosis regulator Bcl-2,17.0,False,CHEMBL3309111,"[{'accession': 'P10417', 'component_descriptio...",SINGLE PROTEIN,10090
1,[],Homo sapiens,Apoptosis regulator Bcl-2,16.0,False,CHEMBL4860,"[{'accession': 'P10415', 'component_descriptio...",SINGLE PROTEIN,9606
2,[],Homo sapiens,BCL2/BCL2L11,16.0,False,CHEMBL5169264,"[{'accession': 'P10415', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
3,[],Homo sapiens,BCL2/BID,16.0,False,CHEMBL5169265,"[{'accession': 'P10415', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
4,[],Homo sapiens,BCL2/BAD,16.0,False,CHEMBL5169266,"[{'accession': 'Q92934', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606


**Retrieve bioactivity data for the selected target**

In [52]:
selected_target = targets.target_chembl_id[1]
selected_target

'CHEMBL4860'
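
Selecting the target by position (`[1]`) depends on the order in which ChEMBL returns the search results. As a hedged alternative, the sketch below picks the entry by organism and target type instead. The helper `pick_target` is hypothetical, and the small DataFrame only reproduces three rows of the search output shown above so that the example runs without a network connection.

```python
import pandas as pd

# Three rows copied from the search output above (only the columns used here).
targets_demo = pd.DataFrame({
    "organism": ["Mus musculus", "Homo sapiens", "Homo sapiens"],
    "pref_name": ["Apoptosis regulator Bcl-2",
                  "Apoptosis regulator Bcl-2",
                  "BCL2/BCL2L11"],
    "target_type": ["SINGLE PROTEIN",
                    "SINGLE PROTEIN",
                    "PROTEIN-PROTEIN INTERACTION"],
    "target_chembl_id": ["CHEMBL3309111", "CHEMBL4860", "CHEMBL5169264"],
})


def pick_target(df, organism="Homo sapiens", target_type="SINGLE PROTEIN"):
    """Return the single ChEMBL ID matching organism and target type."""
    mask = (df["organism"] == organism) & (df["target_type"] == target_type)
    matches = df.loc[mask, "target_chembl_id"]
    # Fail loudly instead of silently picking an arbitrary row.
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]


selected = pick_target(targets_demo)
print(selected)
```

With the real `targets` DataFrame the same call would be `pick_target(targets)`, assuming the search still returns exactly one human single-protein entry.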

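**Now retrieve only the bioactivity records for the selected target, Apoptosis regulator Bcl-2 (CHEMBL4860), that report IC50 values.** The filter selects on `standard_type` only; the records displayed further down list their units as nM (nanomolar).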

In [53]:
activity = new_client.activity
results = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [37]:
df1 = pd.DataFrame.from_dict(results)
df1.head(5)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,235378,17623335,[],CHEMBL3887056,Fluorescence Polarisation Assay: The fluoresce...,B,,,BAO_0000190,...,Mus musculus,Apoptosis regulator Bcl-2,10090,,,IC50,nM,UO_0000065,,43.0
1,,235379,17623336,[],CHEMBL3887056,Fluorescence Polarisation Assay: The fluoresce...,B,,,BAO_0000190,...,Mus musculus,Apoptosis regulator Bcl-2,10090,,,IC50,nM,UO_0000065,,30.0
2,,323572,17702609,[],CHEMBL3887056,Fluorescence Polarisation Assay: The fluoresce...,B,,,BAO_0000190,...,Mus musculus,Apoptosis regulator Bcl-2,10090,,,IC50,nM,UO_0000065,,18.0
3,,323573,17702610,[],CHEMBL3887056,Fluorescence Polarisation Assay: The fluoresce...,B,,,BAO_0000190,...,Mus musculus,Apoptosis regulator Bcl-2,10090,,,IC50,nM,UO_0000065,,20.0
4,,323574,17702611,[],CHEMBL3887056,Fluorescence Polarisation Assay: The fluoresce...,B,,,BAO_0000190,...,Mus musculus,Apoptosis regulator Bcl-2,10090,,,IC50,nM,UO_0000065,,20.0


In [38]:
df1.standard_type.unique()

array(['IC50'], dtype=object)

In [39]:
df1["standard_value"].isna().sum()

np.int64(0)
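
The checks above confirm the activity type and the absence of missing values, but they do not look at units or at the relation sign (for example `>` for "greater than"). The sketch below shows one possible extra filter on the `standard_units` and `standard_relation` columns, which appear in the column list of the raw data. The helper `keep_exact_nm` and the toy rows are made up for illustration, and whether to discard censored values is a modeling choice rather than a rule.

```python
import pandas as pd

# Toy records (hypothetical values) using column names from the raw data.
demo = pd.DataFrame({
    "standard_value": ["43.0", "30.0", "5.0", "100.0"],
    "standard_units": ["nM", "nM", "ug.mL-1", "nM"],
    "standard_relation": ["=", "=", "=", ">"],
})


def keep_exact_nm(df):
    """Keep rows measured in nM with an exact ('=') relation."""
    mask = (df["standard_units"] == "nM") & (df["standard_relation"] == "=")
    out = df.loc[mask].copy()
    # Values may arrive as strings, so convert them to numbers explicitly.
    out["standard_value"] = pd.to_numeric(out["standard_value"], errors="coerce")
    return out.dropna(subset=["standard_value"])


filtered = keep_exact_nm(demo)
print(filtered)
```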

**Finally, save the resulting bioactivity data to the CSV file** **bioactivity_raw_data.csv**.

In [40]:
df1.to_csv('bioactivity_raw_data.csv', index=False)

**Now copy the "bioactivity_raw_data.csv" file to the "data" folder on Google Drive.**

In [19]:
! cp bioactivity_raw_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

cp: cannot create regular file '/content/gdrive/My Drive/Colab Notebooks/data': No such file or directory


In [41]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data"

ls: cannot access '/content/gdrive/My Drive/Colab Notebooks/data': No such file or directory


In [42]:
! head bioactivity_raw_data.csv

action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,235378,17623335,[],CHEMBL3887056,"Fluorescence Polarisation Assay: The fluorescence polarisation tests were carried out on microplates (384 wells). The Bcl-2 protein, at a final concentration of 2.50×10−8 M, is mixed with a fluorescent peptide (Fluorescein-REIGAQLRRMADDLNAQY), at a final concentration of 1.0

# **Step 3: Bioactivity Data Retrieval (IC50)**
**Retrieve bioactivity data (IC50) for the selected BCL2 target.**

**Inspect Missing Values**

In [43]:
print(list(df1.columns))

['action_type', 'activity_comment', 'activity_id', 'activity_properties', 'assay_chembl_id', 'assay_description', 'assay_type', 'assay_variant_accession', 'assay_variant_mutation', 'bao_endpoint', 'bao_format', 'bao_label', 'canonical_smiles', 'data_validity_comment', 'data_validity_description', 'document_chembl_id', 'document_journal', 'document_year', 'ligand_efficiency', 'molecule_chembl_id', 'molecule_pref_name', 'parent_molecule_chembl_id', 'pchembl_value', 'potential_duplicate', 'qudt_units', 'record_id', 'relation', 'src_id', 'standard_flag', 'standard_relation', 'standard_text_value', 'standard_type', 'standard_units', 'standard_upper_value', 'standard_value', 'target_chembl_id', 'target_organism', 'target_pref_name', 'target_tax_id', 'text_value', 'toid', 'type', 'units', 'uo_units', 'upper_value', 'value']


In [44]:
df1["standard_value"].isna().sum()

np.int64(0)

**Filter Rows with Valid Bioactivity Values**

In [45]:
df2 = df1[df1["standard_value"].notna()]
df2.head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,235378,17623335,[],CHEMBL3887056,Fluorescence Polarisation Assay: The fluoresce...,B,,,BAO_0000190,...,Mus musculus,Apoptosis regulator Bcl-2,10090,,,IC50,nM,UO_0000065,,43.0
1,,235379,17623336,[],CHEMBL3887056,Fluorescence Polarisation Assay: The fluoresce...,B,,,BAO_0000190,...,Mus musculus,Apoptosis regulator Bcl-2,10090,,,IC50,nM,UO_0000065,,30.0
2,,323572,17702609,[],CHEMBL3887056,Fluorescence Polarisation Assay: The fluoresce...,B,,,BAO_0000190,...,Mus musculus,Apoptosis regulator Bcl-2,10090,,,IC50,nM,UO_0000065,,18.0
3,,323573,17702610,[],CHEMBL3887056,Fluorescence Polarisation Assay: The fluoresce...,B,,,BAO_0000190,...,Mus musculus,Apoptosis regulator Bcl-2,10090,,,IC50,nM,UO_0000065,,20.0
4,,323574,17702611,[],CHEMBL3887056,Fluorescence Polarisation Assay: The fluoresce...,B,,,BAO_0000190,...,Mus musculus,Apoptosis regulator Bcl-2,10090,,,IC50,nM,UO_0000065,,20.0


**Assign Bioactivity Classes**
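Define active, intermediate, and inactive classes based on IC50 values: compounds with IC50 ≤ 1,000 nM are labeled active, those with IC50 ≥ 10,000 nM inactive, and everything in between intermediate.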


In [46]:
bioactivity_class = []
for value in df2.standard_value:
    value = float(value)
    if value >= 10000:
        bioactivity_class.append("inactive")
    elif value <= 1000:
        bioactivity_class.append("active")
    else:
        bioactivity_class.append("intermediate")
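
The same thresholds can also be applied without an explicit loop. The following is a sketch under the assumption that the values are IC50 measurements in nM; the function name `classify_ic50` is hypothetical and `numpy.select` is swapped in for the `if`/`elif` chain above.

```python
import numpy as np
import pandas as pd


def classify_ic50(values_nm):
    """Label IC50 values (nM) as active, intermediate, or inactive."""
    v = pd.to_numeric(pd.Series(values_nm), errors="coerce")
    labels = np.select(
        [v >= 10000, v <= 1000],      # conditions are checked in this order
        ["inactive", "active"],
        default="intermediate",
    )
    # Missing or unparseable values get no label instead of "intermediate".
    return pd.Series(labels, index=v.index).where(v.notna())


classes = classify_ic50(["43.0", 1000, 5000, 10000])
print(classes.tolist())
```

In the notebook this could be used as `bioactivity_class = classify_ic50(df2.standard_value).tolist()`, provided `df2` has no missing values, which the earlier check indicates.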

**Extract Relevant Columns**

In [47]:
molecule_ids = df2.molecule_chembl_id.tolist()
canonical_smiles = df2.canonical_smiles.tolist()
standard_values = df2.standard_value.tolist()

In [48]:
data = list(zip(
    molecule_ids,
    canonical_smiles,
    standard_values,
    bioactivity_class,
))

**Create Preprocessed bioactivity Dataset**

In [49]:

df3 = pd.DataFrame(
    data,
    columns=[
        "molecule_chembl_id",
        "canonical_smiles",
        "standard_value",
        "bioactivity_class",
    ]
)
df3.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL3962412,CCOc1ccc(C(=O)N2Cc3ccccc3C[C@H]2CN2CCOCC2)c(-c...,43.0,active
1,CHEMBL3944855,O=C(c1cc(-c2ccc3c(c2C(=O)N2Cc4ccccc4C[C@H]2CON...,30.0,active
2,CHEMBL3958369,O=C(c1cc(-c2cc3c(cc2C(=O)N2Cc4ccccc4C[C@H]2CN2...,18.0,active
3,CHEMBL3983424,Cc1ccc(N(C(=O)c2cc(-c3cc4c(cc3C(=O)N3Cc5ccccc5...,20.0,active
4,CHEMBL3927695,O=C(c1cc(-c2cc3c(cc2C(=O)N2Cc4ccccc4C[C@H]2CN2...,20.0,active


**Remove Compounds without Valid SMILES**. Drop rows with **NaN**, **empty** or **None** SMILES values.

In [50]:
df3 = df3.dropna(subset=["canonical_smiles"])
df3 = df3[df3["canonical_smiles"].str.lower() != "none"]
df3 = df3[df3["canonical_smiles"].str.strip() != ""]
df3.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL3962412,CCOc1ccc(C(=O)N2Cc3ccccc3C[C@H]2CN2CCOCC2)c(-c...,43.0,active
1,CHEMBL3944855,O=C(c1cc(-c2ccc3c(c2C(=O)N2Cc4ccccc4C[C@H]2CON...,30.0,active
2,CHEMBL3958369,O=C(c1cc(-c2cc3c(cc2C(=O)N2Cc4ccccc4C[C@H]2CN2...,18.0,active
3,CHEMBL3983424,Cc1ccc(N(C(=O)c2cc(-c3cc4c(cc3C(=O)N3Cc5ccccc5...,20.0,active
4,CHEMBL3927695,O=C(c1cc(-c2cc3c(cc2C(=O)N2Cc4ccccc4C[C@H]2CN2...,20.0,active
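
The steps above do not check whether the same compound appears more than once, which may happen when a molecule has several reported measurements. One possible way to collapse such repeats is sketched below. The helper `collapse_duplicates`, the choice of the median, and the toy rows (with placeholder IDs) are all assumptions for illustration.

```python
import pandas as pd

# Toy rows with placeholder IDs; CHEMBL_A appears twice on purpose.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_A", "CHEMBL_B"],
    "canonical_smiles": ["CCO", "CCO", "c1ccccc1"],
    "standard_value": ["10.0", "30.0", "500.0"],
})


def collapse_duplicates(df):
    """Keep one row per compound, using the median of its IC50 values."""
    out = df.copy()
    out["standard_value"] = pd.to_numeric(out["standard_value"], errors="coerce")
    return out.groupby(
        ["molecule_chembl_id", "canonical_smiles"], as_index=False
    )["standard_value"].median()


deduped = collapse_duplicates(demo)
print(deduped)
```

After collapsing, the bioactivity class would need to be assigned again from the aggregated `standard_value`.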


**Save Preprocessed Bioactivity Data.** Save the cleaned dataset to CSV and copy to Google Drive.

In [51]:
df3.to_csv("bioactivity_preprocessed_data.csv", index=False)

!cp bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"
!ls "/content/gdrive/My Drive/Colab Notebooks/data"

cp: cannot create regular file '/content/gdrive/My Drive/Colab Notebooks/data': No such file or directory
ls: cannot access '/content/gdrive/My Drive/Colab Notebooks/data': No such file or directory


## **Results and Conclusion**

A total of 86 IC₅₀ bioactivity records associated with BCL2 were retrieved from the ChEMBL database. Curation consisted of keeping records with a reported `standard_value`, labeling each compound as active, intermediate, or inactive from its IC₅₀, and dropping rows without a usable SMILES string. The notebook does not remove duplicate measurements or convert units.

Following preprocessing, the curated dataset is suitable for demonstrating QSAR modeling workflows, including feature generation, activity standardization, and preliminary exploratory analysis. Overall, this analysis illustrates the practical steps and considerations in QSAR data curation and confirms that the BCL2 dataset is of sufficient quality for methodological and educational purposes.
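
As an illustration of what activity standardization can look like, a common convention is to convert IC50 values to pIC50, the negative base-10 logarithm of the molar concentration. For values in nM this gives pIC50 = 9 − log10(IC50). The sketch below assumes nM input; the function name `ic50_nm_to_pic50` is hypothetical and this step is not part of the notebook above.

```python
import numpy as np
import pandas as pd


def ic50_nm_to_pic50(values_nm):
    """Convert IC50 values given in nM to pIC50 (-log10 of molar IC50)."""
    v = pd.to_numeric(pd.Series(values_nm), errors="coerce")
    # The logarithm is undefined for zero or negative values; mask them out.
    v = v.where(v > 0)
    # 1 nM = 1e-9 M, so -log10(v * 1e-9) = 9 - log10(v).
    return 9.0 - np.log10(v)


pic50 = ic50_nm_to_pic50([1.0, 1000.0, 10000.0])
print(pic50.round(2).tolist())
```

On this scale the class thresholds used earlier (1,000 nM and 10,000 nM) correspond to pIC50 values of 6 and 5.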

## **End of Part 1: Data Collection and Curation**