# Bioinformatics Project - Computational Drug Discovery

## [Part 1] Download Bioactivity Data

### ChEMBL Database

The ChEMBL Database contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells and 33,000 indications.

### Installing libraries

In [1]:
! pip install chembl_webresource_client

Collecting attrs<22.0,>=21.2 (from requests-cache~=0.7.0->chembl_webresource_client)
  Using cached attrs-21.4.0-py2.py3-none-any.whl (60 kB)
Installing collected packages: attrs
  Attempting uninstall: attrs
    Found existing installation: attrs 23.1.0
    Uninstalling attrs-23.1.0:
      Successfully uninstalled attrs-23.1.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
referencing 0.32.0 requires attrs>=22.2.0, but you have attrs 21.4.0 which is incompatible.
jsonschema 4.20.0 requires attrs>=22.2.0, but you have attrs 21.4.0 which is incompatible.
Successfully installed attrs-21.4.0


### Importing libraries

In [10]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

## Search for Target protein

### Target search for Acetylcholinesterase

In [11]:
# Target search for acetylcholinesterase

target = new_client.target
target_query = target.search("acetylcholinesterase")
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P22303', 'xref_name': None, 'xre...",Homo sapiens,Acetylcholinesterase,27.0,False,CHEMBL220,"[{'accession': 'P22303', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Cholinesterases; ACHE & BCHE,27.0,False,CHEMBL2095233,"[{'accession': 'P06276', 'component_descriptio...",SELECTIVITY GROUP,9606
2,[],Drosophila melanogaster,Acetylcholinesterase,18.0,False,CHEMBL2242744,"[{'accession': 'P07140', 'component_descriptio...",SINGLE PROTEIN,7227
3,[],Bemisia tabaci,AChE2,16.0,False,CHEMBL2366409,"[{'accession': 'B3SST5', 'component_descriptio...",SINGLE PROTEIN,7038
4,[],Leptinotarsa decemlineata,Acetylcholinesterase,16.0,False,CHEMBL2366490,"[{'accession': 'Q27677', 'component_descriptio...",SINGLE PROTEIN,7539
5,"[{'xref_id': 'P04058', 'xref_name': None, 'xre...",Torpedo californica,Acetylcholinesterase,15.0,False,CHEMBL4780,"[{'accession': 'P04058', 'component_descriptio...",SINGLE PROTEIN,7787
6,"[{'xref_id': 'P21836', 'xref_name': None, 'xre...",Mus musculus,Acetylcholinesterase,15.0,False,CHEMBL3198,"[{'accession': 'P21836', 'component_descriptio...",SINGLE PROTEIN,10090
7,"[{'xref_id': 'P37136', 'xref_name': None, 'xre...",Rattus norvegicus,Acetylcholinesterase,15.0,False,CHEMBL3199,"[{'accession': 'P37136', 'component_descriptio...",SINGLE PROTEIN,10116
8,"[{'xref_id': 'O42275', 'xref_name': None, 'xre...",Electrophorus electricus,Acetylcholinesterase,15.0,False,CHEMBL4078,"[{'accession': 'O42275', 'component_descriptio...",SINGLE PROTEIN,8005
9,"[{'xref_id': 'P23795', 'xref_name': None, 'xre...",Bos taurus,Acetylcholinesterase,15.0,False,CHEMBL4768,"[{'accession': 'P23795', 'component_descriptio...",SINGLE PROTEIN,9913


## Select and retrieve bioactivity data for Human Acetylcholinesterase (first entry)

In [12]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL220'
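Picking the target by position works only while the human single-protein entry stays in the first row. A minimal sketch of selecting by content instead is shown here on a small hand-built frame whose rows are copied from the search result above. The helper name `pick_target` is hypothetical and is not part of the ChEMBL client.

```python
import pandas as pd

# Three rows copied from the search result above (only the columns used here).
sample_targets = pd.DataFrame(
    {
        "organism": ["Homo sapiens", "Homo sapiens", "Mus musculus"],
        "pref_name": [
            "Acetylcholinesterase",
            "Cholinesterases; ACHE & BCHE",
            "Acetylcholinesterase",
        ],
        "target_chembl_id": ["CHEMBL220", "CHEMBL2095233", "CHEMBL3198"],
        "target_type": ["SINGLE PROTEIN", "SELECTIVITY GROUP", "SINGLE PROTEIN"],
    }
)


def pick_target(targets, organism, target_type="SINGLE PROTEIN"):
    """Return the ChEMBL ID of the first row matching organism and target type.

    Hypothetical helper for illustration only.
    """
    mask = (targets["organism"] == organism) & (targets["target_type"] == target_type)
    matches = targets.loc[mask, "target_chembl_id"]
    if matches.empty:
        raise ValueError(f"no {target_type} target found for {organism}")
    return matches.iloc[0]


human_ache = pick_target(sample_targets, "Homo sapiens")
print(human_ache)
```

On the real search result the same call would be `pick_target(targets, "Homo sapiens")`, assuming the `organism` and `target_type` columns are present as in the table above.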

Here, we will retrieve only the bioactivity data for Human Acetylcholinesterase (CHEMBL220) that are reported as IC50 values.

In [13]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [14]:
df = pd.DataFrame.from_dict(res)
df

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
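A dropped connection like the one above may be transient, so one option is to retry the download a few times with an increasing pause. The sketch below is a generic retry helper and not part of the ChEMBL client. The name `retry` and its parameters are our own. It catches `OSError` by default, on the assumption that the network error raised here derives from it. If it does not, pass the exact exception class from the traceback.

```python
import time


def retry(fetch, attempts=3, base_delay=1.0, exceptions=(OSError,)):
    """Call fetch() until it succeeds or the attempts run out.

    Sleeps base_delay, 2 * base_delay, 4 * base_delay, ... seconds between tries.
    Hypothetical helper for illustration only.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)


# Stand-in for the real download: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Connection aborted.")
    return ["record"]


result = retry(flaky_fetch, attempts=5, base_delay=0.0)
print(result, calls["count"])
```

In this notebook the wrapped call would be something like `df = retry(lambda: pd.DataFrame.from_dict(res))`, since building the DataFrame is the step that raised the error.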

Finally, we will save the resulting bioactivity data to the CSV file acetylcholinesterase_01_bioactivity_data_raw.csv.

In [None]:
df.to_csv("acetylcholinesterase_01_bioactivity_data_raw.csv", index=False)

## Handling missing data

If a compound has a missing value in the standard_value or canonical_smiles column, drop it.

In [None]:
df2 = df[df.standard_value.notna()] # notna(): True where the value is not NA, False where it is; isna() is the inverse
df2 = df2[df2.canonical_smiles.notna()] # build the mask from df2 so it aligns with the rows being filtered
df2

In [None]:
len(df2.canonical_smiles.unique()) # unique(): array of the column's distinct values, duplicates removed

In [None]:
df2_nr = df2.drop_duplicates(["canonical_smiles"])
df2_nr
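Since the download above failed, the cleaning steps cannot be seen on real data here. As a small self-contained sketch, the same two steps (dropping rows with missing values, then keeping one row per unique SMILES string) can be tried on a toy frame. The molecule names and values below are made up for illustration.

```python
import pandas as pd

# Toy data: the IDs and values are placeholders, not real ChEMBL records.
toy = pd.DataFrame(
    {
        "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C", "MOL_D", "MOL_E"],
        "canonical_smiles": ["CCO", None, "CCO", "CCN", "c1ccccc1"],
        "standard_value": ["750.0", "100.0", "300.0", None, "5000.0"],
    }
)

# Drop rows missing either column, then keep the first row per unique SMILES.
cleaned = (
    toy.dropna(subset=["standard_value", "canonical_smiles"])
    .drop_duplicates(subset=["canonical_smiles"])
    .reset_index(drop=True)
)
print(cleaned)
```

MOL_B and MOL_D are dropped for missing values, and MOL_C is dropped as a duplicate of MOL_A. Note that `drop_duplicates` keeps the first occurrence by default, so which measurement survives depends on row order.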

## Data pre-processing of the bioactivity data

### Combine the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame

In [None]:
selection = ["molecule_chembl_id", "canonical_smiles", "standard_value"]
df3 = df2_nr[selection]
df3

Save the DataFrame to a CSV file.

In [None]:
df3.to_csv("acetylcholinesterase_02_bioactivity_data_preprocessed.csv", index=False)

## Labeling compounds as either being active, inactive or intermediate

The bioactivity data are reported as IC50 values. Compounds with values of 1,000 nM or less are considered active, those with values of 10,000 nM or more are considered inactive, and those in between 1,000 and 10,000 nM are referred to as intermediate.

In [None]:
df4 = pd.read_csv("acetylcholinesterase_02_bioactivity_data_preprocessed.csv")

In [None]:
bioactivity_threshold = []
for i in df4.standard_value:
    if float(i) >= 10000:
        bioactivity_threshold.append("inactive")
    elif float(i) <= 1000:
        bioactivity_threshold.append("active")
    else:
        bioactivity_threshold.append("intermediate")

In [None]:
bioactivity_class = pd.Series(bioactivity_threshold, name="class")
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5
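The labeling loop above can also be written without an explicit loop. The sketch below uses `numpy.select` with the same cutoffs (10,000 or more is inactive, 1,000 or less is active, everything else is intermediate). The function name `label_bioactivity` is our own, and the sample values are made up.

```python
import numpy as np
import pandas as pd


def label_bioactivity(ic50_nm):
    """Label IC50 values (nM) using the same cutoffs as the loop above.

    Assumes missing values were already removed: a NaN matches neither
    condition and would fall through to "intermediate".
    """
    values = pd.to_numeric(pd.Series(ic50_nm), errors="coerce")
    labels = np.select(
        [values >= 10000, values <= 1000],  # conditions are checked in this order
        ["inactive", "active"],
        default="intermediate",
    )
    return pd.Series(labels, index=values.index, name="class")


# Made-up values covering both boundaries.
sample = pd.Series([500, 1000, 5000, 10000, 25000])
labels = label_bioactivity(sample)
print(labels.tolist())
```

With the real data this would replace the loop and the `pd.Series(...)` line, for example `df5 = pd.concat([df4, label_bioactivity(df4.standard_value)], axis=1)`.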

Save the DataFrame to a CSV file.

In [None]:
df5.to_csv("acetylcholinesterase_02_bioactivity_data_curated.csv", index=False)

In [None]:
! zip acetylcholinesterase.zip *.csv

In [None]:
! ls -l
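The `! zip` and `! ls` cells rely on shell tools being available in the notebook environment. Where they are not, the same bundling step can be sketched with Python's standard library. The function name `zip_csv_files` and the demo file names are our own. The demo runs in a temporary directory so it does not touch the project files.

```python
import glob
import os
import tempfile
import zipfile


def zip_csv_files(directory, archive_name):
    """Zip every .csv file in directory into archive_name; return the archive path."""
    paths = sorted(glob.glob(os.path.join(directory, "*.csv")))
    archive_path = os.path.join(directory, archive_name)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            # Store only the file name, not the full directory path.
            archive.write(path, arcname=os.path.basename(path))
    return archive_path


# Demo in a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("a.csv", "b.csv", "notes.txt"):
        with open(os.path.join(workdir, name), "w") as handle:
            handle.write("x\n")
    archive_path = zip_csv_files(workdir, "bundle.zip")
    with zipfile.ZipFile(archive_path) as archive:
        archived = sorted(archive.namelist())

print(archived)
```

For this notebook the call would be `zip_csv_files(".", "acetylcholinesterase.zip")`.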