**Bioinformatics Project - Computational Drug Discovery**: Monkeypox virus

 STEP 1 :
Data Collection and Pre-Processing from the ChEMBL Database.

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.


In [None]:
! pip install chembl_webresource_client

## **Importing libraries**

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

In [None]:
# Target search for Monkeypox virus
target = new_client.target
target_query = target.search('Monkeypox virus')
targets = pd.DataFrame.from_dict(target_query)
targets

### **Select and retrieve bioactivity data for the *Monkeypox virus* organism target (the first entry)**





In [None]:
selected_target = targets.target_chembl_id[0]
selected_target
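
Picking row 0 depends on the order in which the search results come back. A minimal sketch of a more explicit selection, matching on the `organism` column, might look like the following. The rows are made-up placeholders standing in for the real `targets` table, the IDs are not real ChEMBL IDs, and `pick_target` is a hypothetical helper.

```python
import pandas as pd

# Made-up stand-in for the search result; the IDs are placeholders, not real ChEMBL IDs.
toy_targets = pd.DataFrame({
    "organism": ["Vaccinia virus", "Monkeypox virus", "Homo sapiens"],
    "target_type": ["ORGANISM", "ORGANISM", "SINGLE PROTEIN"],
    "target_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],
})

def pick_target(targets_df, organism):
    """Hypothetical helper: return the first target_chembl_id whose organism matches exactly."""
    matches = targets_df[targets_df["organism"] == organism]
    if matches.empty:
        raise ValueError(f"no target found for organism {organism!r}")
    return matches["target_chembl_id"].iloc[0]

selected = pick_target(toy_targets, "Monkeypox virus")
print(selected)
```

With the real table, the same call would be `pick_target(targets, 'Monkeypox virus')`, assuming the search result exposes an `organism` column.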

Here, we retrieve only the bioactivity data for *Monkeypox virus* (CHEMBL613120) that are reported as IC50 values.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
df.head()

In [None]:
df.standard_type.unique()
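
The activity thresholds applied later in this notebook are expressed in nM, so it may be worth checking the `standard_units` column as well. The sketch below uses made-up rows in place of the real `df` and assumes that records in other units should simply be set aside; converting them instead would be an equally valid choice.

```python
import pandas as pd

# Made-up rows standing in for the downloaded activity table.
toy = pd.DataFrame({
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_units": ["nM", "ug.mL-1", "nM"],
    "standard_value": ["120", "3.5", "45000"],
})

# Count how many records use each unit (missing units are counted too).
unit_counts = toy["standard_units"].value_counts(dropna=False)
print(unit_counts)

# Keep only the records already expressed in nM.
nm_only = toy[toy["standard_units"] == "nM"]
print(len(nm_only), "of", len(toy), "records are in nM")
```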

Finally, we save the resulting bioactivity data to the CSV file **mpox_bioactivity_data.csv**.

In [None]:
df.to_csv('mpox_bioactivity_data.csv', index=False)

## **Copying files to Google Drive**


First, we mount Google Drive in Colab so that we can access our Google Drive from within Colab.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)


Next, we create a **data0** folder in our **Colab Notebooks** folder on Google Drive.

In [None]:
! mkdir -p "/content/gdrive/My Drive/Colab Notebooks/data0"

In [None]:
! cp mpox_bioactivity_data.csv "/content/gdrive/My Drive/Colab Notebooks/data0"

In [None]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data0"

Let's see the CSV files that we have so far.


In [None]:
! ls

Let's take a glimpse of the **mpox_bioactivity_data.csv** file that we've just created.

In [None]:
! head mpox_bioactivity_data.csv

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [None]:
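# Keep rows that have a standard_value
df2 = df[df.standard_value.notna()]
# Build the second mask from df2 itself so the mask and the frame share the same index
df2 = df2[df2.canonical_smiles.notna()]
df2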

In [None]:
len(df2.canonical_smiles.unique())

In [None]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr
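
The missing-value and de-duplication steps above can also be written as a single chain with `dropna` and `drop_duplicates`. This is only a sketch on made-up rows (the IDs and values are placeholders), not a replacement for the cells above.

```python
import pandas as pd

# Made-up rows: M2 lacks a SMILES string, M4 lacks a value, M3 repeats M1's SMILES.
toy = pd.DataFrame({
    "molecule_chembl_id": ["M1", "M2", "M3", "M4", "M5"],
    "canonical_smiles": ["CCO", None, "CCO", "c1ccccc1", "CCN"],
    "standard_value": ["100", "200", "300", None, "50"],
})

clean = (
    toy.dropna(subset=["standard_value", "canonical_smiles"])  # drop rows missing either field
       .drop_duplicates(subset=["canonical_smiles"])           # keep the first row per SMILES
       .reset_index(drop=True)
)
print(clean)
```

Note that `drop_duplicates` keeps the first occurrence by default, so which measurement survives for a repeated compound depends on row order.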

## **Data pre-processing of the bioactivity data**

### **Combine the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [None]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Save the DataFrame to a CSV file.

In [None]:
df3.to_csv('mpox_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values, taken here to be in nM. Compounds with values of 1,000 nM or less are considered **active**, those with values of 10,000 nM or more are considered **inactive**, and those with values between 1,000 and 10,000 nM are labeled **intermediate**.

In [None]:
df4 = pd.read_csv('mpox_bioactivity_data_preprocessed.csv')

In [None]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")

In [None]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5
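
For larger tables, the same labeling rule can be expressed without an explicit loop. The sketch below uses `numpy.select` on made-up values; `label_bioactivity` and its parameter names are hypothetical, and the cutoffs mirror the ones used in the loop above.

```python
import numpy as np
import pandas as pd

def label_bioactivity(values, active_max=1000.0, inactive_min=10000.0):
    """Hypothetical helper: label IC50 values (assumed to be in nM) by potency class."""
    numeric = pd.to_numeric(values, errors="coerce")
    arr = numeric.to_numpy(dtype=float)
    # Conditions are checked in order; values matching neither become "intermediate".
    # Missing values also match neither, so drop them before labeling.
    labels = np.select(
        [arr >= inactive_min, arr <= active_max],
        ["inactive", "active"],
        default="intermediate",
    )
    return pd.Series(labels, index=numeric.index, name="class")

# Made-up values, stored as strings the way they might arrive from the API.
toy_values = pd.Series(["500", "1000", "5000", "10000", "25000"])
toy_labels = label_bioactivity(toy_values)
print(toy_labels.tolist())
```

The result has the same index as the input, so it can be joined to the table with `pd.concat([df4, label_bioactivity(df4.standard_value)], axis=1)`.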

Save the labeled DataFrame to a CSV file, overwriting the earlier preprocessed file.

In [None]:
df5.to_csv('mpox_bioactivity_data_preprocessed.csv', index=False)

In [None]:
! zip mpox.zip *.csv

In [None]:
! ls -l