 ## Drug Discovery
 
In this project we will explore drug discovery for HIV using the ChEMBL database.
## So what is the ChEMBL database?
   <li> ChEMBL is a database that combines chemical, biological and genomic data, making it possible to link a drug compound to its target or to an illness. </li>
   <li> It is a great resource for many bioinformaticians because it contains 12 million bioactivity data points extracted from 51,000 publications. </li>

Here are the basic steps for accessing the ChEMBL database from Python.
1. Install the ChEMBL library

In [12]:
pip install chembl_webresource_client

Note: you may need to restart the kernel to use updated packages.


Now we need to import the libraries we will use to work with the data: pandas and chembl_webresource_client (from which we import new_client).
<li> pandas ~ this is an open source library that is useful for data analysis and data manipulation. It is one of the most commonly used tools in data science and machine learning. Some key features include: </li>
        <li> Data Structures: There are two main data structures, the DataFrame and the Series. The DataFrame is like a spreadsheet or an SQL table, while the Series is a one-dimensional array, like a single column. </li>
        <li> Data Manipulation: As the name suggests, pandas can manipulate data, for example by filtering, sorting, merging and reshaping it. </li>
        <li> Data Cleaning: This covers fixing missing data, removing duplicate data and handling inconsistencies. </li>
        <li> Data Analysis: Again, as the name suggests, data analysis is a major feature of pandas. This can include calculating descriptive statistics, generating summary reports, and visualizing data. </li>
        <li> Time Series Analysis: pandas is a great tool to use when we are dealing with features like dates and times. </li>
        <li> Reading/Writing: It supports reading and writing data in different file formats such as SQL, CSV, Excel, JSON, etc. </li>
        <li> Integration: pandas can integrate with other Python libraries like NumPy and scikit-learn (which is a library used for machine learning). </li>
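
As a small illustration of the two data structures and a few of the features listed above, here is a minimal sketch. The values and the column names `compound` and `ic50_um` are made up for this example and are not ChEMBL fields.

```python
import pandas as pd

# Made-up values for illustration only; not taken from ChEMBL.
frame = pd.DataFrame({
    "compound": ["A", "B", "C", "C"],
    "ic50_um": [9.0, None, 111.7, 111.7],
})

# A single column of a DataFrame is a Series (one-dimensional).
column = frame["ic50_um"]

# Data cleaning: drop exact duplicate rows, then rows with a missing value.
cleaned = frame.drop_duplicates().dropna(subset=["ic50_um"])

# Data analysis: descriptive statistics on the cleaned column.
summary = cleaned["ic50_um"].describe()

print(type(frame).__name__, type(column).__name__)
print(cleaned)
print(summary["mean"])
```

The duplicate row for compound C and the row with the missing value are removed, so the statistics are computed on two rows only.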

The second library we will use is this one:
<li> chembl_webresource_client: this library helps us connect to the ChEMBL database. From it we will import new_client. </li>
<li> new_client: this is an object named new_client defined in the chembl_webresource_client.new_client module. We use it to reach ChEMBL resources such as target and activity. </li>


In [13]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

# Here we will search for Target Proteins

In [14]:
target = new_client.target
target_query = target.search('HIV')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Human immunodeficiency virus,HIV protease,19.0,False,CHEMBL3638323,"[{'accession': 'Q9YQ30', 'component_descriptio...",SINGLE PROTEIN,12721
1,[],HIV-1 M:B_Lai,HIV-1 M:B_Lai,15.0,False,CHEMBL612775,[],ORGANISM,290579
2,[],Homo sapiens,Transcription factor HIVEP2,12.0,False,CHEMBL4523214,"[{'accession': 'P31629', 'component_descriptio...",SINGLE PROTEIN,9606
3,[],Homo sapiens,Ubiquitin thioesterase OTU1,12.0,False,CHEMBL4630833,"[{'accession': 'Q5VVQ6', 'component_descriptio...",SINGLE PROTEIN,9606
4,"[{'xref_id': 'P51681', 'xref_name': None, 'xre...",Homo sapiens,C-C chemokine receptor type 5,11.0,False,CHEMBL274,"[{'accession': 'P51681', 'component_descriptio...",SINGLE PROTEIN,9606
5,"[{'xref_id': 'P15822', 'xref_name': None, 'xre...",Homo sapiens,Human immunodeficiency virus type I enhancer-b...,9.0,False,CHEMBL2909,"[{'accession': 'P15822', 'component_descriptio...",SINGLE PROTEIN,9606
6,"[{'xref_id': 'Q92993', 'xref_name': None, 'xre...",Homo sapiens,Histone acetyltransferase KAT5,9.0,False,CHEMBL5750,"[{'accession': 'Q92993', 'component_descriptio...",SINGLE PROTEIN,9606
7,[],Homo sapiens,CCR5/mu opioid receptor complex,9.0,False,CHEMBL3301384,"[{'accession': 'P51681', 'component_descriptio...",PROTEIN COMPLEX,9606
8,[],Homo sapiens,Zinc finger and BTB domain-containing protein 7A,7.0,False,CHEMBL5069375,"[{'accession': 'O95365', 'component_descriptio...",SINGLE PROTEIN,9606
9,[],Homo sapiens,80S Ribosome,0.0,False,CHEMBL3987582,"[{'accession': 'P08865', 'component_descriptio...",PROTEIN NUCLEIC-ACID COMPLEX,9606


Now let's say we want to select a particular entry. Here we pick Histone acetyltransferase KAT5 (also called TIP60 in the assay descriptions), which sits at index 6. Row labels start at 0, so index 6 is the seventh row of the table; the entry just above it, Human immunodeficiency virus type I enhancer-b..., is at index 5. To select the entry we use this code:

In [15]:
select_target = targets.target_chembl_id[6]
select_target

'CHEMBL5750'

The ID 'CHEMBL5750' is the unique identifier of this target (Histone acetyltransferase KAT5) in the ChEMBL database. We will use it to retrieve the bioactivity data recorded for the target.
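
Picking a row by its position is fragile, since the order of search results may change. One hedged alternative is to look the target up by its `pref_name`. The sketch below uses a small hand-copied subset of the table above in place of a live query; `sample_targets` and `wanted_name` are names introduced here for illustration.

```python
import pandas as pd

# Hand-copied subset of the search results shown above (no network call).
sample_targets = pd.DataFrame({
    "pref_name": [
        "HIV protease",
        "C-C chemokine receptor type 5",
        "Histone acetyltransferase KAT5",
    ],
    "target_chembl_id": ["CHEMBL3638323", "CHEMBL274", "CHEMBL5750"],
    "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN", "SINGLE PROTEIN"],
})

wanted_name = "Histone acetyltransferase KAT5"

# Boolean mask that is True only for rows whose preferred name matches exactly.
mask = sample_targets["pref_name"] == wanted_name
matches = sample_targets.loc[mask, "target_chembl_id"]

# Fail loudly if the name is absent or ambiguous instead of picking silently.
if len(matches) != 1:
    raise ValueError(f"expected exactly one match for {wanted_name!r}, got {len(matches)}")

selected_id = matches.iloc[0]
print(selected_id)
```

With the real `targets` DataFrame, the same mask could be applied in place of `sample_targets`.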

Now, if we want to obtain only the bioactivity data for this target, meaning we only care about how compounds act on that particular protein, we use this code:

In [16]:
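# Handle for querying ChEMBL activity records
activity = new_client.activity
# Keep only activities recorded against the selected single-protein target
res = activity.filter(target_chembl_id=select_target).filter(target_type="SINGLE PROTEIN")
# Turn the query result into a DataFrame and preview the first 3 rows
df = pd.DataFrame.from_dict(res)
df.head(3)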

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,2490443,[],CHEMBL1029727,Inhibition of TIP60 in human HeLa cell extract...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,9.0
1,,,2490445,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,143.35
2,,,2490449,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,111.7


In this code we did a few things. We created a new variable named activity, which refers to the activity attribute of the new_client object. This gives us an object we can use to query the activity records in the database. We then applied two filters: the first keeps only the records whose target_chembl_id matches select_target (the variable we created earlier), and the second keeps only the records whose target_type is "SINGLE PROTEIN". Finally, the command df.head(3) shows the first 3 rows of the DataFrame.

# Saving the data into a CSV file

In [17]:
df.to_csv('bioactivity_data_HIV.csv', index=False)
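
To see what `index=False` changes, the sketch below writes a tiny table (two rows copied from the output above) to in-memory buffers instead of files and reads them back. The variable names are introduced here for illustration.

```python
import io
import pandas as pd

# Two rows copied from the table above, just enough to show the effect.
small = pd.DataFrame({
    "assay_chembl_id": ["CHEMBL1029727", "CHEMBL1025546"],
    "value": [9.0, 143.35],
})

# index=False leaves the row labels (0, 1, ...) out of the CSV.
without_index = io.StringIO()
small.to_csv(without_index, index=False)

# The default (index=True) writes the labels as an unnamed first column.
with_index = io.StringIO()
small.to_csv(with_index)

# Rewind both buffers and read them back.
without_index.seek(0)
with_index.seek(0)
restored = pd.read_csv(without_index)
restored_with_index = pd.read_csv(with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```

When the index is written, reading the file back produces an extra column that pandas names `Unnamed: 0`, which is why `index=False` is usually the cleaner choice for data that will be reloaded later.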

# Handling missing data
Some values may be missing from the data we retrieved. To keep only the rows that have an entry in the value column, we use this code:

In [19]:
df2 = df[df.value.notna()]
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,2490443,[],CHEMBL1029727,Inhibition of TIP60 in human HeLa cell extract...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,9.0
1,,,2490445,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,143.35
2,,,2490449,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,111.7
3,,,2490453,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,25.87
4,,,2490457,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,17.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79,,,24342485,[],CHEMBL5029729,Inhibition of TIP60 (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,50.0
80,,,24342486,[],CHEMBL5029729,Inhibition of TIP60 (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,50.0
81,,,24849004,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5147377,Inhibition of GST tagged human TIP60 expressed...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,50.0
82,"{'action_type': 'INHIBITOR', 'description': 'N...",,24849005,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5147377,Inhibition of GST tagged human TIP60 expressed...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,37.2


Here the code df[df.value.notna()] creates a new DataFrame (df2) from the original one (df), keeping only the rows whose value is not missing and dropping the rest. In this case there seems to be no missing data.
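
The filtering step can be tried on a toy table first. This is a minimal sketch with made-up rows; `raw`, `kept` and `same` are illustrative names, and `dropna(subset=...)` is shown as an equivalent spelling.

```python
import pandas as pd

# Made-up rows; the second one has no measured value.
raw = pd.DataFrame({
    "activity_id": [1, 2, 3],
    "value": [9.0, None, 111.7],
})

# Count how many entries are missing before filtering.
missing_count = int(raw["value"].isna().sum())

# Boolean-mask filtering, as used in the notebook.
kept = raw[raw["value"].notna()]

# Equivalent result using dropna restricted to one column.
same = raw.dropna(subset=["value"])

print(missing_count)
print(kept.index.tolist())
```

Note that the surviving rows keep their original index labels (0 and 2 here), so the index of the filtered table can have gaps.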

# Preprocessing the Data 
Here we will label whether these compounds are active or inactive. First we query the activities again, this time filtering by standard type (IC50), and then we label each compound as active or inactive depending on its value.

In [27]:
# Query the same target again, this time keeping only IC50 measurements:
select_target = targets.target_chembl_id[6]
act = new_client.activity
res = act.filter(target_chembl_id=select_target).filter(standard_type="IC50")
df = pd.DataFrame.from_dict(res)
df


Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,2490443,[],CHEMBL1029727,Inhibition of TIP60 in human HeLa cell extract...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,9.0
1,,,2490445,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,143.35
2,,,2490449,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,111.7
3,,,2490453,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,25.87
4,,,2490457,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,17.3
5,,,2490461,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,20.91
6,,,2490465,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,79.62
7,,,2490469,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,29.75
8,,,2490473,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,374.59
9,,,2490477,[],CHEMBL1025546,Inhibition of recombinant Tip60 (1-513) expres...,B,,,BAO_0000190,...,Homo sapiens,Histone acetyltransferase KAT5,9606,,,IC50,uM,UO_0000065,,200.0


Now we create the labels that define whether the compounds are active or inactive. In our process, compounds with a value of 150 or less are active and compounds with a value of 200 or more are inactive. Values between 150 and 200 are intermediate.

In [28]:
bioactivity_class = []
for i in df.value:
  if float(i) >= 200:
    bioactivity_class.append("inactive")
  elif float(i) <= 150:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
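
The loop above can also be written without explicit iteration. The sketch below is one possible alternative, not the notebook's method: it applies the same thresholds with `numpy.select` to made-up values, and the helper name `label_bioactivity` is hypothetical. Unlike the loop, it leaves missing values unlabelled instead of raising an error.

```python
import numpy as np
import pandas as pd


def label_bioactivity(values, active_max=150.0, inactive_min=200.0):
    """Label each value as active, inactive or intermediate (hypothetical helper)."""
    # Convert text such as "9.0" to numbers; unparseable or missing entries become NaN.
    numeric = pd.to_numeric(values, errors="coerce")
    conditions = [
        (numeric >= inactive_min).to_numpy(),
        (numeric <= active_max).to_numpy(),
    ]
    choices = ["inactive", "active"]
    # Rows matching neither condition fall through to the default.
    labels = np.select(conditions, choices, default="intermediate")
    result = pd.Series(labels, index=values.index, name="bioactivity_class")
    # Keep missing values unlabelled rather than calling them intermediate.
    return result.where(numeric.notna())


# Made-up values, including one missing entry.
example_values = pd.Series(["9.0", "143.35", "175", "200.0", "374.59", None])
example_labels = label_bioactivity(example_values)
print(example_labels)
```

Because the result keeps the index of the input, it can be attached to the same DataFrame as a new column without any realignment.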

Now we want to combine the 4 columns (assay_chembl_id, target_organism, target_pref_name, and value) with bioactivity_class in a single DataFrame. Remember that bioactivity_class is the list of labels we just created; we convert it to a pandas Series so that the command pd.concat can join it to the selected columns.

In [29]:
selection = ['assay_chembl_id','target_organism', 'target_pref_name', 'value']
df3 = df2[selection]
df3

Unnamed: 0,assay_chembl_id,target_organism,target_pref_name,value
0,CHEMBL1029727,Homo sapiens,Histone acetyltransferase KAT5,9.0
1,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,143.35
2,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,111.7
3,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,25.87
4,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,17.3
...,...,...,...,...
79,CHEMBL5029729,Homo sapiens,Histone acetyltransferase KAT5,50.0
80,CHEMBL5029729,Homo sapiens,Histone acetyltransferase KAT5,50.0
81,CHEMBL5147377,Homo sapiens,Histone acetyltransferase KAT5,50.0
82,CHEMBL5147377,Homo sapiens,Histone acetyltransferase KAT5,37.2


In [32]:
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class')
df4 = pd.concat([df3, bioactivity_class], axis=1)
df4

Unnamed: 0,assay_chembl_id,target_organism,target_pref_name,value,bioactivity_class
0,CHEMBL1029727,Homo sapiens,Histone acetyltransferase KAT5,9.0,active
1,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,143.35,active
2,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,111.7,active
3,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,25.87,active
4,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,17.3,active
...,...,...,...,...,...
81,CHEMBL5147377,Homo sapiens,Histone acetyltransferase KAT5,50.0,
82,CHEMBL5147377,Homo sapiens,Histone acetyltransferase KAT5,37.2,
83,CHEMBL5214072,Homo sapiens,Histone acetyltransferase KAT5,8.7,
12,,,,,active


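Notice there are some missing values in the bioactivity_class column as well as in other columns. This happens because pd.concat with axis=1 matches rows by index label, and df3 and bioactivity_class do not have exactly the same labels. Let's drop the incomplete rows: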

In [38]:
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class')
df4 = pd.concat([df3, bioactivity_class], axis=1)
df4 = df4.dropna(subset=['bioactivity_class', "target_pref_name", "target_organism", "assay_chembl_id"])
df4

Unnamed: 0,assay_chembl_id,target_organism,target_pref_name,value,bioactivity_class
0,CHEMBL1029727,Homo sapiens,Histone acetyltransferase KAT5,9.0,active
1,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,143.35,active
2,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,111.7,active
3,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,25.87,active
4,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,17.3,active
5,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,20.91,active
6,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,79.62,active
7,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,29.75,active
8,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,374.59,inactive
9,CHEMBL1025546,Homo sapiens,Histone acetyltransferase KAT5,200.0,inactive
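
The missing values seen before the `dropna` call come from index alignment: `pd.concat(..., axis=1)` matches rows by label, so a label present in only one input gets NaN in the other input's columns. The sketch below reproduces this with made-up data and shows one possible way to avoid it, by computing the labels from the same rows they are attached to. The names `left`, `labels` and `classify` are illustrative.

```python
import pandas as pd

# Made-up stand-in for a filtered table: the row labelled 1 was dropped earlier.
left = pd.DataFrame({"value": [9.0, 250.0, 175.0]}, index=[0, 2, 3])

# Labels built from a different source, with a fresh 0, 1, 2 index.
labels = pd.Series(["active", "active", "inactive"], name="bioactivity_class")

# Rows are matched by label, so labels 1 and 3 each end up with one NaN.
misaligned = pd.concat([left, labels], axis=1)


def classify(value):
    """Same thresholds as the notebook: >= 200 inactive, <= 150 active."""
    if value >= 200:
        return "inactive"
    if value <= 150:
        return "active"
    return "intermediate"


# Computing the labels from the same rows keeps everything aligned.
aligned = left.assign(bioactivity_class=left["value"].map(classify))

print(misaligned)
print(aligned)
```

With this approach no rows need to be dropped afterwards, because every label is derived from the row it sits in.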


Now we save this as a CSV file and we are good to go!

In [39]:
df4.to_csv('bioactivity_data_preprocessed.csv', index=False)