# **AI And Biotechnology/Bioinformatics**

## **AI and Drug Discovery Course: QSAR Modeling**
This notebook demonstrates how to collect and preprocess bioactivity data from ChEMBL for QSAR modeling.

# **Part 1: Data Collection & Curation**

**First, connect Google Colab to Google Drive so that the files on Drive are accessible from within Colab.**

This allows us to:
* Save datasets
* Reload data across sessions
* Organize project files




In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

**Now create a "data" folder inside the "Colab Notebooks" folder on Google Drive.**

In [None]:
# -p avoids an error if the folder already exists from an earlier session
! mkdir -p "/content/gdrive/My Drive/Colab Notebooks/data"

## Install and Import Required Libraries
Install the ChEMBL web service client so that we can retrieve bioactivity data.

In [None]:
!pip install chembl_webresource_client

# Import Libraries
* pandas for data handling
* new_client from chembl_webresource_client for accessing the ChEMBL database

In [None]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

# Step 1: Search for Target Protein

## **Target Identification (KRAS)**
Search ChEMBL for the KRAS target and select the most relevant entry.


In [None]:
target = new_client.target
target_query = target.search("KRAS")
targets = pd.DataFrame.from_dict(target_query)
targets.head()


**Select the first search hit as the target and retrieve its ChEMBL ID**

In [None]:
selected_target = targets.target_chembl_id[0]
selected_target

**Now retrieve the bioactivity data for the selected target, GTPase KRas (CHEMBL2189121), keeping only records with reported IC50 values in nM (nanomolar) units.**

In [None]:
activity = new_client.activity
results = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df1 = pd.DataFrame.from_dict(results)
df1.head(5)

In [None]:
df1.standard_type.unique()
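
The query above filters on `standard_type` only, so it can be worth confirming which units the returned records actually use. The cell below is a minimal sketch on a made-up toy table (the IDs and values are hypothetical, not real ChEMBL records). It assumes the `standard_units` field that ChEMBL activity records carry, and the same pattern could be applied to `df1`.

```python
import pandas as pd

# Toy records mimicking a few ChEMBL activity fields (IDs and values are made up)
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_units": ["nM", "ug.mL-1", "nM"],
    "standard_value": ["120.0", "3.5", None],
})

# Count how many records use each unit
unit_counts = toy["standard_units"].value_counts()

# Keep only the records reported in nM
toy_nm = toy[toy["standard_units"] == "nM"]
print(unit_counts)
print(toy_nm)
```

If every record already reports nM, the filter is a no-op and the later pIC50 conversion can treat `standard_value` as nanomolar.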

**Finally, save the resulting bioactivity data to a CSV file named bioactivity_raw_data.csv.**

In [None]:
df1.to_csv('bioactivity_raw_data.csv', index=False)

**Now copy the "bioactivity_raw_data.csv" file to the "data" folder on Google Drive.**

In [None]:
! cp bioactivity_raw_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
! head bioactivity_raw_data.csv

# **Step 3: Bioactivity Data Preprocessing (IC50)**
**Clean the IC50 bioactivity data retrieved for the selected KRAS target.**

**Inspect Missing Values**

In [None]:
# Count missing values in the column used for classification below
df1["standard_value"].isna().sum()

**Filter Rows with Valid Bioactivity Values**

In [None]:
# Keep only rows that have a reported bioactivity value
df2 = df1[df1["standard_value"].notna()]
df2.head()

**Assign Bioactivity Classes**
Define classes from the IC50 values: **active** (≤ 1,000 nM), **inactive** (≥ 10,000 nM), and **intermediate** (everything in between).


In [None]:
bioactivity_class = []
for value in df2.standard_value:
    value = float(value)
    if value >= 10000:
        bioactivity_class.append("inactive")
    elif value <= 1000:
        bioactivity_class.append("active")
    else:
        bioactivity_class.append("intermediate")
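
As an alternative to the loop, the same thresholds can be applied in one vectorized step. This is a minimal sketch, not part of the original workflow: `classify_ic50` is a hypothetical helper name, and it assumes the values are IC50 in nM. Note that a value that cannot be parsed as a number falls through to "intermediate" here, so missing values should be filtered out first, as done above.

```python
import numpy as np
import pandas as pd

def classify_ic50(values_nm):
    """Label IC50 values (nM) as active, inactive, or intermediate."""
    values = pd.to_numeric(pd.Series(values_nm), errors="coerce")
    conditions = [values >= 10000, values <= 1000]
    labels = ["inactive", "active"]
    # Anything matching neither condition is labeled intermediate
    classes = np.select(conditions, labels, default="intermediate")
    return pd.Series(classes, index=values.index)

# ChEMBL returns standard_value as strings, so the example uses strings too
example_classes = classify_ic50(["50", "5000", "10000", "1000"]).tolist()
print(example_classes)
```

The boundary values behave like the loop: exactly 1,000 nM counts as active and exactly 10,000 nM counts as inactive.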

**Extract Relevant Columns**

In [None]:
molecule_ids = df2.molecule_chembl_id.tolist()
canonical_smiles = df2.canonical_smiles.tolist()
standard_values = df2.standard_value.tolist()

In [None]:
data = list(zip(
    molecule_ids,
    canonical_smiles,
    standard_values,
    bioactivity_class,
))

**Create Preprocessed bioactivity Dataset**

In [None]:

df3 = pd.DataFrame(
    data,
    columns=[
        "molecule_chembl_id",
        "canonical_smiles",
        "standard_value",
        "bioactivity_class",
    ]
)
df3.head()

**Remove Compounds without Valid SMILES**. Drop rows with **NaN**, **empty** or **None** SMILES values.

In [None]:
df3 = df3.dropna(subset=["canonical_smiles"])
df3 = df3[df3["canonical_smiles"].str.lower() != "none"]
df3 = df3[df3["canonical_smiles"].str.strip() != ""]
df3.head()

**Save Preprocessed Bioactivity Data.** Save the cleaned dataset to CSV and copy to Google Drive.

In [None]:
df3.to_csv("bioactivity_preprocessed_data.csv", index=False)

!cp bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"
!ls "/content/gdrive/My Drive/Colab Notebooks/data"

## **End of Part 1: Data Collection and Curation**

## **Part 2: Lipinski Descriptor Calculation & Exploratory Data Analysis**

## **Install rdkit**

In [None]:
# RDKit is published on PyPI, so pip is enough on current Colab runtimes.
# The older Miniconda (Python 3.7) route no longer matches Colab's Python version.
! pip install rdkit

## **Import bioactivity data**

In [1]:
import pandas as pd

In [None]:
df = pd.read_csv('bioactivity_preprocessed_data.csv')
df.head()

## **Calculate Lipinski descriptors**
Christopher Lipinski, a scientist at Pfizer, formulated a set of rules of thumb for evaluating the **druglikeness** of compounds.
Druglikeness here is judged on **Absorption, Distribution, Metabolism and Excretion (ADME)**, also known as the pharmacokinetic profile. Lipinski derived the rules from an analysis of orally active drugs, and they became known as **Lipinski's Rule** or the **Rule-of-Five**.

Lipinski's Rule states that an orally active compound generally has:
* **Molecular weight** < 500 Dalton
* **Octanol-water partition** coefficient (LogP) ≤ 5
* **Hydrogen bond donors** ≤ 5
* **Hydrogen bond acceptors** ≤ 10
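
To make the rule concrete, the cell below counts how many criteria a compound violates. It is a minimal sketch rather than part of the original workflow: `lipinski_violations` is a hypothetical helper, the descriptor values passed in are made-up illustrations, and it assumes the commonly cited thresholds (MW under 500, LogP at most 5, at most 5 donors, at most 10 acceptors).

```python
def lipinski_violations(mw, logp, h_donors, h_acceptors):
    """Count how many of the four Rule-of-Five criteria are violated."""
    checks = [
        mw >= 500,         # molecular weight should stay under 500 Da
        logp > 5,          # LogP should not exceed 5
        h_donors > 5,      # no more than 5 hydrogen bond donors
        h_acceptors > 10,  # no more than 10 hydrogen bond acceptors
    ]
    return sum(checks)

# Hypothetical descriptor values for a small and a large compound
small = lipinski_violations(mw=180.2, logp=1.3, h_donors=1, h_acceptors=3)
large = lipinski_violations(mw=650.0, logp=6.2, h_donors=6, h_acceptors=11)
print(small, large)
```

Once the descriptors are computed with RDKit further down, the same function could be applied row by row to flag compounds that break the rule.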

In [None]:
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

In [None]:
def lipinski(smiles):
    """Return a DataFrame of Lipinski descriptors, one row per SMILES string."""
    rows = []
    for smi in smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            # Unparsable SMILES: keep a NaN row so the output stays aligned with the input
            rows.append([np.nan] * 4)
            continue
        rows.append([
            Descriptors.MolWt(mol),
            Descriptors.MolLogP(mol),
            Lipinski.NumHDonors(mol),
            Lipinski.NumHAcceptors(mol),
        ])

    columnNames = ["MW", "LogP", "NumHDonors", "NumHAcceptors"]
    return pd.DataFrame(rows, columns=columnNames)

In [None]:
df_lipinski = lipinski(df.canonical_smiles)
df_lipinski

In [None]:
df

In [None]:
df_lipinski

In [None]:
df_combined = pd.concat([df,df_lipinski], axis=1)

In [None]:
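def pIC50(df_in):
    """Add a pIC50 column computed from 'standard_value_norm' (nM) and drop that column."""
    df_out = df_in.copy()

    # Convert nM to M, then take the negative log10
    molar = df_out['standard_value_norm'].astype(float) * 1e-9
    df_out['pIC50'] = -np.log10(molar)

    # drop(columns=...) replaces the positional axis argument removed in pandas 2.0
    return df_out.drop(columns='standard_value_norm')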
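
The conversion used in `pIC50` can be checked by hand: with IC50 in nM, pIC50 = -log10(IC50 × 10⁻⁹) = 9 - log10(IC50). The cell below is a small worked sketch of that arithmetic; `nm_to_pic50` is a hypothetical helper and the input values are chosen only for illustration.

```python
import numpy as np

def nm_to_pic50(ic50_nm):
    """Convert IC50 values given in nM to pIC50 (negative log10 of the molar value)."""
    ic50_nm = np.asarray(ic50_nm, dtype=float)
    # -log10(x * 1e-9) simplifies to 9 - log10(x)
    return 9.0 - np.log10(ic50_nm)

examples = nm_to_pic50([1.0, 1000.0, 1e8, 1e10])
print(examples)  # approximately [ 9.  6.  1. -1.]
```

More potent compounds (smaller IC50) get larger pIC50 values, and anything above 10⁹ nM turns negative, which is the motivation for capping `standard_value` at 100,000,000 nM in `norm_value` before converting.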

In [None]:
df_combined.standard_value.describe()

In [None]:
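# An IC50 of 100,000,000 nM (0.1 M) gives a pIC50 of 1
-np.log10( (10**-9)* 100000000 )

In [None]:
# An IC50 of 10,000,000,000 nM (10 M) gives a negative pIC50 of -1
-np.log10( (10**-9)* 10000000000 )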

In [None]:
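def norm_value(df_in, cap=100000000):
    """Cap 'standard_value' (nM) at `cap` and store the result as 'standard_value_norm'."""
    df_out = df_in.copy()

    # Values above the cap would otherwise give pIC50 values below 1
    df_out['standard_value_norm'] = df_out['standard_value'].astype(float).clip(upper=cap)

    # drop(columns=...) replaces the positional axis argument removed in pandas 2.0
    return df_out.drop(columns='standard_value')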

In [None]:
df_norm = norm_value(df_combined)
df_norm

In [None]:
df_norm.standard_value_norm.describe()

In [None]:
df_final = pIC50(df_norm)
df_final

In [None]:
df_final.pIC50.describe()

In [None]:
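# Keep only the active and inactive classes (the column is named 'bioactivity_class' in Part 1)
df4 = df_final[df_final['bioactivity_class'] != 'intermediate']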
df4

In [None]:
df4.to_csv('bioactivity_pIC50_data.csv', index=False)

In [None]:
import seaborn as sns
sns.set(style='ticks')
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize=(5.5, 5.5))

sns.countplot(x='bioactivity_class', data=df4, edgecolor='black')

plt.xlabel('Bioactivity class', fontsize=14, fontweight='bold')
plt.ylabel('Frequency', fontsize=14, fontweight='bold')

plt.savefig('plot_bioactivity_class.pdf')

In [None]:
plt.figure(figsize=(5.5, 5.5))

sns.scatterplot(x='MW', y='LogP', data=df4, hue='bioactivity_class', size='pIC50', edgecolor='black', alpha=0.7)

plt.xlabel('MW', fontsize=14, fontweight='bold')
plt.ylabel('LogP', fontsize=14, fontweight='bold')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)
plt.savefig('plot_MW_vs_LogP.pdf')