<a href="https://colab.research.google.com/github/SamritiSharma123/DIGITALBLOG/blob/main/ChEMBL_data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing the ChEMBL Library
The ChEMBL library is a programatic way to access the ChEMBL database and retrieve disease/drug targets for a specific condition.


In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-24.1.3-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-2.2.0-py3-none-any.whl.metadata (4.9 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m55.2/55.2 kB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.4/61.4 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading cattrs-24.1.3-py3-none-any.whl (66 kB)
[2K   [90m━━━━━━━━━━━━━━

In [2]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

In [3]:
target = new_client.target
target_query = target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
targets.head(5)

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Coronavirus,Coronavirus,17.0,False,CHEMBL613732,[],ORGANISM,11119
1,[],Feline coronavirus,Feline coronavirus,14.0,False,CHEMBL612744,[],ORGANISM,12663
2,[],Murine coronavirus,Murine coronavirus,14.0,False,CHEMBL5209664,[],ORGANISM,694005
3,[],Canine coronavirus,Canine coronavirus,14.0,False,CHEMBL5291668,[],ORGANISM,11153
4,[],Human coronavirus 229E,Human coronavirus 229E,13.0,False,CHEMBL613837,[],ORGANISM,11137


## Select for Bioactivity Data for SARS-CoV Proteinase
This is where we filter out the target receptor protein data for COVID-19


In [4]:
selected_protein_targets = targets.target_chembl_id[4]
selected_protein_targets

'CHEMBL613837'

### IC50 Measurements
Half-maximal inhibitory concentration (IC50) is the most widely used and informative measure of a drug's efficacy. It indicates how much drug is needed to inhibit a biological process by half, thus providing a measure of potency of an antagonist drug in pharmacological research. (https://pubmed.ncbi.nlm.nih.gov/27365221/)

In [5]:
# Retrieve Bioactivity data for the selected targets
bioactivity_data = new_client.activity
# Filter data for those specific targets and set standard measuement unit to IC50 measurements
filtered_data = bioactivity_data.filter(target_chembl_id=selected_protein_targets).filter(standat_type="IC50")
# Create DataFrame from the filtered data stored in a dictionary, remove None/NA and then store it in a csv file for reusability
bioactivity_DF = pd.DataFrame.from_dict(filtered_data)
bioactivity_DF = bioactivity_DF[bioactivity_DF.standard_value.notna()]
bioactivity_DF.to_csv('raw_bioactivity_data.csv', index = False)
bioactivity_DF.head(5)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1749289,[],CHEMBL867944,Inhibition of human coronavirus 229E 3CL protease,F,,,BAO_0000192,...,Human coronavirus 229E,Human coronavirus 229E,11137,,,Ki,uM,UO_0000065,,10.0
1,,,1749292,[],CHEMBL867944,Inhibition of human coronavirus 229E 3CL protease,F,,,BAO_0000192,...,Human coronavirus 229E,Human coronavirus 229E,11137,,,Ki,uM,UO_0000065,,0.068
2,,,13500391,[],CHEMBL2445767,Antiviral activity against Coronavirus 229E in...,F,,,BAO_0000188,...,Human coronavirus 229E,Human coronavirus 229E,11137,,,EC50,uM,UO_0000065,,0.5
3,,,13500392,[],CHEMBL2445767,Antiviral activity against Coronavirus 229E in...,F,,,BAO_0000188,...,Human coronavirus 229E,Human coronavirus 229E,11137,,,EC50,uM,UO_0000065,,0.2
4,,,17963318,[],CHEMBL3994808,Antiviral activity against Human coronavirus 2...,F,,,BAO_0000188,...,Human coronavirus 229E,Human coronavirus 229E,11137,,,EC50,uM,UO_0000065,,112.0


### Activity Level Filtering
Now that we have the compounds involved, we can label them as being either active of inactive relative to a certain activity measurement threshold. We would label three classes:    
-  Active: activity<1000 nM.   
-  Inactive: activity>10000 nM.     
-  Moderate: 1000<activity<10000 nM.

In [6]:
activity_classes = []
for i in bioactivity_DF.standard_value:
  if float(i) >= 10000:
    activity_classes.append("inactive")
  elif float(i) <= 1000:
    activity_classes.append("active")
  else:
    activity_classes.append("moderate")

### Selecting Relevant Columns

In [7]:
## Select columns of interest
columns = ['molecule_chembl_id','canonical_smiles', 'standard_value']
bioactivity_DF = bioactivity_DF[columns]
bioactivity_DF['activity_class'] = activity_classes
bioactivity_DF.head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bioactivity_DF['activity_class'] = activity_classes


Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,activity_class
0,CHEMBL20636,CCOC(=O)/C=C/[C@H](C[C@@H]1CCNC1=O)NC(=O)[C@H]...,10000.0,inactive
1,CHEMBL213054,CC(OC(C)(C)C)[C@H](NC(=O)OCc1ccccc1)C(=O)N[C@@...,68.0,active
2,CHEMBL2441745,CC(C)C[C@H](NC(=O)[C@H](Cc1cccc2ccccc12)NC(=O)...,500.0,active
3,CHEMBL2441741,CC(C)C[C@H](NC(=O)[C@H](Cc1cccc2ccccc12)NC(=O)...,200.0,active
4,CHEMBL1643,NC(=O)c1ncn([C@@H]2O[C@H](CO)[C@@H](O)[C@H]2O)n1,112000.0,inactive
5,CHEMBL4070109,CCOP(=O)(OCC)[C@@H]1C[C@@H](Cn2c(=O)n(C(=O)c3c...,39500.0,inactive
6,CHEMBL4127582,C/C(=C\CP(=O)(N[C@@H](C)C(=O)OC(C)C)Oc1cccc2cc...,100000.0,inactive
7,CHEMBL4126212,C/C(=C\CP(=O)(N[C@@H](C)C(=O)OCc1ccccc1)Oc1ccc...,100000.0,inactive
8,CHEMBL4128569,C/C(=C\CP(=O)(N[C@@H](C)C(=O)OC(C)C)Oc1ccccc1)...,100000.0,inactive
9,CHEMBL4127182,C/C(=C\CP(=O)(N[C@@H](C)C(=O)OCc1ccccc1)Oc1ccc...,100000.0,inactive


In [8]:
## Saving preprocessed data to csv
bioactivity_DF.to_csv('preprocessed_bioactivity_data.csv', index= False)