STUDENT: Hamza Age Daudo

PROJECT: VIRTUAL SCREENING THROUGH 3D-QSAR ANALYSIS OF POTENTIAL AGENTS USED IN THE FIGHT AGAINST LEPROSY - CLOFAZIMINE ANALOGUES.

INSTALLATIONS AND IMPORTS:

This section should be executed every time this Notebook is reopened.

1.1. Performing the necessary installations/uninstallations:

In [None]:
!pip install fastapi kaleido python-multipart uvicorn
# pandas is imported in section 1.2, so it is installed here as well:
!pip install pandas chembl_webresource_client

Defaulting to user installation because normal site-packages is not writeable



[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: C:\Users\DELL\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip



1.2. Importing necessary libraries:

In [None]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

ModuleNotFoundError: No module named 'pandas'

## **PART 2:**

DATASET SELECTION:

The chosen database is ChEMBL (https://www.ebi.ac.uk/chembl/). It is a database of bioactive, drug-like small molecules, containing 2D structures, calculated properties (e.g., logP, molecular weight, Lipinski parameters, etc.), and abstracted bioactivities (e.g., binding constants, pharmacology, and ADMET data). The data are summarized and curated from primary scientific literature, covering a significant portion of structure-activity relationships (SAR) and modern drug discovery.




2.1 Searching for datasets targeting: "CHEMBL235"

In [None]:
alvo = new_client.target
pesquisa_alvo = alvo.search('CHEMBL235')
ds = pd.DataFrame.from_dict(pesquisa_alvo)
ds

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P37231', 'xref_name': None, 'xre...",Homo sapiens,Peroxisome proliferator-activated receptor gamma,12.0,False,CHEMBL235,"[{'accession': 'P37231', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Peroxisome proliferator-activated receptor gam...,7.0,False,CHEMBL2095163,"[{'accession': 'P37231', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
2,[],Homo sapiens,Peroxisome proliferator-activated receptor gam...,6.0,False,CHEMBL2095161,"[{'accession': 'P37231', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
3,[],Homo sapiens,Peroxisome proliferator-activated receptor gam...,6.0,False,CHEMBL2095162,"[{'accession': 'P37231', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
4,[],Homo sapiens,Peroxisome proliferator-activated receptor gam...,6.0,False,CHEMBL2096976,"[{'accession': 'P37231', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
5,[],Homo sapiens,PPAR alpha/gamma,6.0,False,CHEMBL2111325,"[{'accession': 'P37231', 'component_descriptio...",SELECTIVITY GROUP,9606
6,[],Homo sapiens,PPAR delta/gamma,6.0,False,CHEMBL2111371,"[{'accession': 'P37231', 'component_descriptio...",SELECTIVITY GROUP,9606
7,[],Homo sapiens,Peroxisome proliferator-activated receptor,5.0,False,CHEMBL3559683,"[{'accession': 'P37231', 'component_descriptio...",PROTEIN FAMILY,9606
8,[],Homo sapiens,RXR alpha/PPAR gamma,4.0,False,CHEMBL2111394,"[{'accession': 'P37231', 'component_descriptio...",PROTEIN COMPLEX,9606


2.2 Searching for a specific target within the dataset:

In [None]:
# Defining the target to be searched:
alvo = "Peroxisome proliferator-activated receptor gamma"

# Checking whether any element in pref_name contains this target (literal substring match):
contains_alvo = ds['pref_name'].str.contains(alvo, regex=False)

# Obtaining the indices of the rows with the defined target:
indices_com_alvo = ds[contains_alvo].index.tolist()

if contains_alvo.any():
    print(f"At least one element contains the term: {alvo}")
    print(f"Indices of the rows with the term '{alvo}': {indices_com_alvo}")
else:
    print(f"No element contains the term: {alvo}")

At least one element contains the term: Peroxisome proliferator-activated receptor gamma
Indices of the rows with the term 'Peroxisome proliferator-activated receptor gamma': [0, 1, 2, 3, 4]
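
The search above matches five rows (indices 0 to 4), although only row 0 is the single-protein target CHEMBL235. A minimal sketch in plain Python of the difference between a substring match and an exact match is shown below. The list of names is hypothetical: the real names of rows 1 to 4 are truncated in the output above, so the suffix used here is a placeholder.

```python
# Hypothetical preferred names; "/partner A" is a placeholder suffix,
# not a real ChEMBL name.
pref_names = [
    "Peroxisome proliferator-activated receptor gamma",
    "Peroxisome proliferator-activated receptor gamma/partner A",
    "PPAR alpha/gamma",
    "Peroxisome proliferator-activated receptor",
]
alvo = "Peroxisome proliferator-activated receptor gamma"

# Substring match: every name that contains the term.
substring_hits = [i for i, name in enumerate(pref_names) if alvo in name]

# Exact match: only the name that is identical to the term.
exact_hits = [i for i, name in enumerate(pref_names) if name == alvo]

print(substring_hits)  # [0, 1]
print(exact_hits)      # [0]
```

Selecting the target by its target_chembl_id, as done in section 2.3, avoids this ambiguity.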


2.3 Converting IC50 values to a standard concentration unit (molar - M) and generating a single dataframe:

To make bioactivity data reported in different units comparable, the values reported in nM, µM, and mM are converted to a single standard unit, molar (M).

Note: For bioassays that report mass/volume concentrations, such as µg/mL, the molar mass of each compound would be required to make this conversion feasible.
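
The conversion rule, including the mass/volume case mentioned in the note, can be sketched as a small helper. This is only an illustration under stated assumptions: the function name to_molar, the two factor tables, and the molar mass in the example are hypothetical and are not part of the notebook's pipeline.

```python
# Factors that convert a molar-based unit to mol/L (M).
MOLAR_FACTORS = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}

# Factors that convert a mass/volume unit to g/L.
MASS_PER_VOLUME_FACTORS = {"mg/mL": 1.0, "ug/mL": 1e-3, "ng/mL": 1e-6}


def to_molar(value, unit, molar_mass=None):
    """Return the concentration in mol/L.

    molar_mass (in g/mol) is only needed for mass/volume units.
    """
    if unit in MOLAR_FACTORS:
        return value * MOLAR_FACTORS[unit]
    if unit in MASS_PER_VOLUME_FACTORS:
        if molar_mass is None:
            raise ValueError(f"molar mass is required to convert {unit}")
        grams_per_litre = value * MASS_PER_VOLUME_FACTORS[unit]
        return grams_per_litre / molar_mass
    raise ValueError(f"unsupported unit: {unit}")


# 14 nM expressed in M:
print(to_molar(14, "nM"))
# 1 ug/mL of a compound with an assumed molar mass of 500 g/mol:
print(to_molar(1.0, "ug/mL", molar_mass=500.0))
```

Raising an error for an unknown unit, instead of passing the value through, keeps unconverted values from being mixed silently with molar ones.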

In [None]:
# Selecting the rows of ds whose 'target_chembl_id' is in this list: 'CHEMBL235'

ensaios = ds[ds['target_chembl_id'].isin(["CHEMBL235"])]
ensaios

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P37231', 'xref_name': None, 'xre...",Homo sapiens,Peroxisome proliferator-activated receptor gamma,12.0,False,CHEMBL235,"[{'accession': 'P37231', 'component_descriptio...",SINGLE PROTEIN,9606


In [None]:
indices_com_ensaio = ensaios.index
indices_com_ensaio

Index([0], dtype='int64')

In [None]:
# Creating a list to store individual DataFrames (Required only during the first execution!):
dfs = []

# Factors that convert each reported unit to the standard unit (Molar - M):
fatores_para_M = {"nM": 1e-9, "uM": 1e-6, "mM": 1e-3, "M": 1.0}

atividade = new_client.activity

# Iterating over the different indices:
for i in indices_com_ensaio:
    ds_selecionado_i = ds.target_chembl_id[i]

    for unidade, fator in fatores_para_M.items():
        # Filtering bioactive compounds with IC50 data reported in this unit:
        resultado = (
            atividade.filter(target_chembl_id=ds_selecionado_i)
            .filter(standard_type="IC50")
            .filter(units=unidade)
        )

        # Creating a DataFrame for this unit:
        df_i = pd.DataFrame.from_dict(resultado)

        # Converting the reported value to the standard unit (Molar - M):
        if not df_i.empty and 'value' in df_i:
            df_i['value'] = df_i['value'].astype(float) * fator

        # Adding the DataFrame to the list:
        dfs.append(df_i)

# Concatenating the individual DataFrames into a single DataFrame:
df_assays = pd.concat(dfs, ignore_index=True)
df_assays['units'] = 'M'

# Displaying the final DataFrame:
display(df_assays)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,184830,[],CHEMBL760384,Displacement of PPARgamma agonist from human P...,F,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,1.400000e-08
1,,,186297,[],CHEMBL760384,Displacement of PPARgamma agonist from human P...,F,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,1.800000e-08
2,,,189892,[],CHEMBL760384,Displacement of PPARgamma agonist from human P...,F,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,1.280000e-07
3,,,191184,[],CHEMBL760384,Displacement of PPARgamma agonist from human P...,F,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,1.500000e-08
4,,,194645,[],CHEMBL760384,Displacement of PPARgamma agonist from human P...,F,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,1.000000e-08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1905,"{'action_type': 'AGONIST', 'description': 'Bin...",,25031279,[],CHEMBL5241789,Agonist activity at PPARgamma (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,2.310000e-05
1906,"{'action_type': 'AGONIST', 'description': 'Bin...",,25031283,[],CHEMBL5241789,Agonist activity at PPARgamma (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,5.660000e-07
1907,"{'action_type': 'ANTAGONIST', 'description': '...",,25031302,[],CHEMBL5241798,Antagonist activity at PPARgamma (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,3.500000e-06
1908,"{'action_type': 'PARTIAL AGONIST', 'descriptio...",,25092762,[],CHEMBL5258849,Partial agonist activity at PPARgamma (unknown...,B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,5.300000e-06


In [None]:
df_assays["value"].isnull().sum()

18

In [None]:
# Dropping the rows of df_assays that have no reported IC50 value:
df_assays.dropna(subset=['value'], inplace=True)

In [None]:
df_assays

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,184830,[],CHEMBL760384,Displacement of PPARgamma agonist from human P...,F,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,1.400000e-08
1,,,186297,[],CHEMBL760384,Displacement of PPARgamma agonist from human P...,F,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,1.800000e-08
2,,,189892,[],CHEMBL760384,Displacement of PPARgamma agonist from human P...,F,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,1.280000e-07
3,,,191184,[],CHEMBL760384,Displacement of PPARgamma agonist from human P...,F,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,1.500000e-08
4,,,194645,[],CHEMBL760384,Displacement of PPARgamma agonist from human P...,F,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,1.000000e-08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1905,"{'action_type': 'AGONIST', 'description': 'Bin...",,25031279,[],CHEMBL5241789,Agonist activity at PPARgamma (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,2.310000e-05
1906,"{'action_type': 'AGONIST', 'description': 'Bin...",,25031283,[],CHEMBL5241789,Agonist activity at PPARgamma (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,5.660000e-07
1907,"{'action_type': 'ANTAGONIST', 'description': '...",,25031302,[],CHEMBL5241798,Antagonist activity at PPARgamma (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,3.500000e-06
1908,"{'action_type': 'PARTIAL AGONIST', 'descriptio...",,25092762,[],CHEMBL5258849,Partial agonist activity at PPARgamma (unknown...,B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,5.300000e-06


In [None]:
# Calculate the percentage of each category in the 'assay_type' column
assay_type_percentages = df_assays['assay_type'].value_counts(normalize=True) * 100
print(assay_type_percentages)

assay_type
B    90.591966
F     8.350951
A     1.057082
Name: proportion, dtype: float64


In [None]:
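# Filtering the DataFrame to include only rows where 'assay_type' is 'B', the most frequent category above
# (the variable keeps its original name, df_assays_f_only, because later cells use it):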

df_assays_f_only = df_assays[df_assays['assay_type'] == 'B']
df_assays_f_only

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
14,,,541035,[],CHEMBL759456,Inhibitory concentration of the PPAR gamma),B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,5.000000e-08
15,,,907914,[],CHEMBL759455,Inhibitory activity of compound against the bi...,B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,4.900000e-06
16,,,924270,[],CHEMBL759455,Inhibitory activity of compound against the bi...,B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,1.700000e-06
17,,,927129,[],CHEMBL759455,Inhibitory activity of compound against the bi...,B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,5.500000e-07
18,,,927135,[],CHEMBL759455,Inhibitory activity of compound against the bi...,B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,6.500000e-06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1905,"{'action_type': 'AGONIST', 'description': 'Bin...",,25031279,[],CHEMBL5241789,Agonist activity at PPARgamma (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,2.310000e-05
1906,"{'action_type': 'AGONIST', 'description': 'Bin...",,25031283,[],CHEMBL5241789,Agonist activity at PPARgamma (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,5.660000e-07
1907,"{'action_type': 'ANTAGONIST', 'description': '...",,25031302,[],CHEMBL5241798,Antagonist activity at PPARgamma (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,3.500000e-06
1908,"{'action_type': 'PARTIAL AGONIST', 'descriptio...",,25092762,[],CHEMBL5258849,Partial agonist activity at PPARgamma (unknown...,B,,,BAO_0000190,...,Homo sapiens,Peroxisome proliferator-activated receptor gamma,9606,,,IC50,M,UO_0000065,,5.300000e-06


In [None]:
## Assigning the class of compounds: active if IC50 < 1000 nM, inactive if IC50 >= 10000 nM, and intermediate if IC50 is between 1000 nM and 10000 nM.
## The variable of interest is always "standard_value".

bioactivity_class = []
for i in df_assays_f_only.standard_value:
    if float(i) >= 10000:
        bioactivity_class.append("Inactive")
    elif float(i) < 1000:
        bioactivity_class.append("Active")
    else:
        bioactivity_class.append("Intermediate")
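
The thresholds used in the loop above can also be written as a small function, which makes the boundary values easy to check. This is a sketch only: the name classify_ic50 and its keyword arguments are hypothetical, and the example values are made up.

```python
def classify_ic50(ic50_nM, active_below=1000.0, inactive_from=10000.0):
    """Classify an IC50 given in nM with the same rules as the loop above."""
    ic50_nM = float(ic50_nM)
    if ic50_nM >= inactive_from:
        return "Inactive"
    if ic50_nM < active_below:
        return "Active"
    return "Intermediate"


# Made-up values placed on and around the two thresholds:
examples = [14.0, 999.9, 1000.0, 9999.0, 10000.0, 25000.0]
labels = [classify_ic50(v) for v in examples]
print(labels)
# ['Active', 'Active', 'Intermediate', 'Intermediate', 'Inactive', 'Inactive']
```

A value of exactly 1000 nM is classified as intermediate and a value of exactly 10000 nM as inactive.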

In [None]:
# Viewing the bioactive compounds
df_assays_f_only.molecule_chembl_id

Unnamed: 0,molecule_chembl_id
14,CHEMBL320553
15,CHEMBL149676
16,CHEMBL344282
17,CHEMBL278590
18,CHEMBL424133
...,...
1905,CHEMBL150
1906,CHEMBL13045
1907,CHEMBL379064
1908,CHEMBL5290209


In [None]:
## 7.1. Iterating through the bioactive compounds:
mol_cid = []
for i in df_assays_f_only.molecule_chembl_id:
    mol_cid.append(i)

In [None]:
# Printing the variable mol_cid:
mol_cid

['CHEMBL320553',
 'CHEMBL149676',
 'CHEMBL344282',
 'CHEMBL278590',
 'CHEMBL424133',
 'CHEMBL147086',
 'CHEMBL345721',
 'CHEMBL148774',
 'CHEMBL148639',
 'CHEMBL356170',
 'CHEMBL149438',
 'CHEMBL24458',
 'CHEMBL408',
 'CHEMBL25259',
 'CHEMBL294807',
 'CHEMBL59132',
 'CHEMBL65458',
 'CHEMBL64972',
 'CHEMBL66206',
 'CHEMBL121',
 'CHEMBL23874',
 'CHEMBL23296',
 'CHEMBL23670',
 'CHEMBL279053',
 'CHEMBL278994',
 'CHEMBL121',
 'CHEMBL23881',
 'CHEMBL25710',
 'CHEMBL180504',
 'CHEMBL181644',
 'CHEMBL181656',
 'CHEMBL182230',
 'CHEMBL121',
 'CHEMBL981',
 'CHEMBL182884',
 'CHEMBL181954',
 'CHEMBL363532',
 'CHEMBL360368',
 'CHEMBL192691',
 'CHEMBL373142',
 'CHEMBL370747',
 'CHEMBL365387',
 'CHEMBL192646',
 'CHEMBL370074',
 'CHEMBL363805',
 'CHEMBL192919',
 'CHEMBL365315',
 'CHEMBL371341',
 'CHEMBL179330',
 'CHEMBL364894',
 'CHEMBL191087',
 'CHEMBL372913',
 'CHEMBL195367',
 'CHEMBL121',
 'CHEMBL66206',
 'CHEMBL81592',
 'CHEMBL82034',
 'CHEMBL199225',
 'CHEMBL206794',
 'CHEMBL204057',
 'CHEMBL1783

In [None]:
## 7.2. Collecting the canonical SMILES into a list (from the same filtered rows as mol_cid).
canonical_smiles = []
for i in df_assays_f_only.canonical_smiles:
    canonical_smiles.append(i)

In [None]:
## 7.3. Collecting standard_value into a list (from the same filtered rows as mol_cid).
standard_value = []
for i in df_assays_f_only.standard_value:
    standard_value.append(i)

In [None]:
## 7.4. Combining the four variables into the same DataFrame.
dados_tupla = list(zip(mol_cid, canonical_smiles, bioactivity_class, standard_value))
df3 = pd.DataFrame( dados_tupla,  columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value'])
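
zip pairs items by position and stops at the shortest list, so the four lists need to come from the same rows in the same order. A minimal sketch of a length check before combining is shown below, in plain Python with made-up stand-in values (not real ChEMBL data).

```python
# Made-up stand-ins for the four columns.
columns = {
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
    "canonical_smiles": ["CCO", "CCN", "CCC"],
    "bioactivity_class": ["Active", "Intermediate", "Inactive"],
    # One extra value, as happens when a list is built from a different table:
    "standard_value": [14.0, 1800.0, 25000.0, 300.0],
}

# Plain zip silently drops the extra value and hides the mismatch.
rows_truncated = list(zip(*columns.values()))

# Comparing the lengths first makes the mismatch visible.
lengths = {name: len(values) for name, values in columns.items()}
same_length = len(set(lengths.values())) == 1

print(len(rows_truncated))  # 3
print(same_length)          # False
print(lengths)
```

Equal lengths are necessary but not sufficient: the lists also have to be built from the same filtered DataFrame so that each position refers to the same compound.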

In [None]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,standard_value
0,CHEMBL320553,Cc1oc(-c2ccccc2)nc1CCOc1ccc(C[C@](C)(Oc2ccccc2...,Active,14.0
1,CHEMBL149676,Cc1oc(-c2ccccc2)nc1CCOc1ccc(CC(C)(Oc2ccccc2)C(...,Intermediate,18.0
2,CHEMBL344282,Cc1oc(-c2ccccc2)nc1CCOc1ccc(CC(Oc2ccccc2)C(=O)...,Intermediate,128.0
3,CHEMBL278590,Cc1oc(C2CCCCC2)nc1CCOc1ccc(C[C@](C)(Oc2ccccc2)...,Active,15.0
4,CHEMBL424133,Cc1oc(-c2cccs2)nc1CCOc1ccc(C[C@](C)(Oc2ccccc2)...,Intermediate,10.0
...,...,...,...,...
1709,CHEMBL150,CCN(CCOc1ccc(/C=C2\SC(=O)NC2=O)cc1)c1ccccn1,Inactive,15000.0
1710,CHEMBL13045,CN(CCOc1ccc(CC2SC(=O)NC2=O)cc1)c1nc2ccccc2s1,Active,20000.0
1711,CHEMBL379064,CN(CCOc1ccc(/C=C2\SC(=O)NC2=O)cc1)c1nc2ccccc2s1,Intermediate,13000.0
1712,CHEMBL5290209,CCN(CCOc1ccc(CC2SC(=O)NC2=O)cc1)c1nc2ccccc2s1,Intermediate,25000.0


In [None]:

# Saving the DataFrame to a CSV file.

df3.to_csv('Peroxisome proliferator-activated receptor gamma.csv', index=False)
