<a href="https://colab.research.google.com/github/Abaysew/Abaysew/blob/main/python/CDD_ML_Part_1_bioactivity_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Computational Drug Discovery [Part 1] Download Bioactivity Data**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will be building a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will be building a machine learning model using the ChEMBL bioactivity data.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.7-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Collecting itsdangerous>=2.0.1
  Downloading itsdangerous-

## **Importing libraries**

In [2]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for *Mycobacterium tuberculosis***

### **Select and retrieve bioactivity data for *ATP synthase* (one hundred eighteenth entry, index 117)**

In [3]:
# Target search for Mycobacterium tuberculosis
target = new_client.target
target_query = target.search('Mycobacterium tuberculosis')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mycobacterium tuberculosis,Mycobacterium tuberculosis,27.0,False,CHEMBL360,[],ORGANISM,1773
1,[],Mycobacterium tuberculosis H37Rv,Mycobacterium tuberculosis H37Rv,24.0,False,CHEMBL2111188,[],ORGANISM,83332
2,[],Mycobacterium tuberculosis H37Ra,Mycobacterium tuberculosis H37Ra,24.0,False,CHEMBL2366634,[],ORGANISM,419947
3,[],Mycobacterium,Mycobacterium,13.0,True,CHEMBL614981,[],ORGANISM,1763
4,[],Mycobacterium tuberculosis,PYRAZINAMIDASE/NICOTINAMIDAS PNCA (PZase),13.0,False,CHEMBL1697663,"[{'accession': 'Q50575', 'component_descriptio...",SINGLE PROTEIN,1773
...,...,...,...,...,...,...,...,...,...
115,[],Mycobacterium tuberculosis,Ribonucleoside-diphosphate reductase subunit a...,7.0,False,CHEMBL2346487,"[{'accession': 'P9WH75', 'component_descriptio...",SINGLE PROTEIN,1773
116,[],Mycobacterium tuberculosis,70S ribosome,7.0,False,CHEMBL2363965,"[{'accession': 'P9WHE1', 'component_descriptio...",PROTEIN NUCLEIC-ACID COMPLEX,1773
117,[],Mycobacterium tuberculosis,ATP synthase,7.0,False,CHEMBL2364166,"[{'accession': 'P9WPU9', 'component_descriptio...",PROTEIN COMPLEX,1773
118,[],Mycobacterium tuberculosis,Thioredoxin reductase,7.0,False,CHEMBL2390811,"[{'accession': 'P9WHH1', 'component_descriptio...",SINGLE PROTEIN,1773


We will assign the one hundred eighteenth entry (index 117, which corresponds to the target protein complex *ATP synthase*) to the ***selected_target*** variable.

In [4]:
selected_target = targets.target_chembl_id[117]
selected_target

'CHEMBL2364166'
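
Selecting the row by its position (117) depends on the order of the search results, which may differ between runs or ChEMBL releases. The following is a minimal sketch, not the notebook's own method, that selects the target by `pref_name` and `target_type` instead. The frame `targets_demo` is a hypothetical stand-in for the search results, with a few values copied from the table above.

```python
import pandas as pd

# Hypothetical stand-in for the search results (a few rows copied from the table above).
targets_demo = pd.DataFrame({
    "pref_name": ["Mycobacterium tuberculosis", "70S ribosome",
                  "ATP synthase", "Thioredoxin reductase"],
    "target_type": ["ORGANISM", "PROTEIN NUCLEIC-ACID COMPLEX",
                    "PROTEIN COMPLEX", "SINGLE PROTEIN"],
    "target_chembl_id": ["CHEMBL360", "CHEMBL2363965",
                         "CHEMBL2364166", "CHEMBL2390811"],
})

# Select by name and type rather than by row position.
mask = (targets_demo["pref_name"] == "ATP synthase") & (
    targets_demo["target_type"] == "PROTEIN COMPLEX"
)
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail loudly if the name is missing or ambiguous.
if len(matches) != 1:
    raise ValueError(f"expected exactly one match, found {len(matches)}")

selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

The explicit length check guards against silently picking the wrong target when several entries share a name.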

Here, we will retrieve only the bioactivity data for *ATP synthase* (CHEMBL2364166) that are reported as IC$_{50}$ values in nM (nanomolar) units.

In [5]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [6]:
df = pd.DataFrame.from_dict(res)

In [9]:
df.head(3)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,16438287,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '14.09', 'le': '0.26', 'lle': '1.38', ...",CHEMBL3752792,,CHEMBL3752792,7.16,False,http://www.openphacts.org/units/Nanomolar,2746818,=,1,True,=,,IC50,nM,,70.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.07
1,,16438288,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '12.45', 'le': '0.26', 'lle': '0.77', ...",CHEMBL3751877,,CHEMBL3751877,7.3,False,http://www.openphacts.org/units/Nanomolar,2746819,=,1,True,=,,IC50,nM,,50.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.05
2,,16438289,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '13.22', 'le': '0.24', 'lle': '0.52', ...",CHEMBL3754648,,CHEMBL3754648,7.4,False,http://www.openphacts.org/units/Nanomolar,2746820,=,1,True,=,,IC50,nM,,40.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.04


In [10]:
df.standard_type.unique()

array(['IC50'], dtype=object)
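
The query above filters on `standard_type` only, so it can be worth confirming that every row also reports its value in nM. A minimal sketch of such a check follows. The frame `df_demo` is a hypothetical stand-in that mimics the `standard_*` columns shown above.

```python
import pandas as pd

# Hypothetical rows mimicking the standard_* columns of the retrieved data.
df_demo = pd.DataFrame({
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_units": ["nM", "nM", "nM"],
    "standard_value": ["70.0", "50.0", "40.0"],
})

# Collect the distinct units, ignoring missing entries.
units = df_demo["standard_units"].dropna().unique().tolist()

# True only when nM is the single unit present.
all_nanomolar = units == ["nM"]
print(units, all_nanomolar)
```

If other units showed up, those rows would need to be converted or excluded before applying thresholds expressed in nM.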

Finally, we will save the resulting bioactivity data to the CSV file **bioactivity_data.csv**.

In [11]:
df.to_csv('bioactivity_data.csv', index=False)

## **Copying files to Google Drive**

First, we need to mount Google Drive in Colab so that we can access our Google Drive from within Colab.

In [12]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)


Mounted at /content/gdrive/


Next, we create a **data_my** folder in our **Colab Notebooks** folder on Google Drive.

In [35]:
ls "/content/gdrive/My Drive/Colab Notebooks/"

'Copy of CDD-ML-Part-1-bioactivity-data (1).ipynb'
'Copy of CDD-ML-Part-1-bioactivity-data (2).ipynb'
'Copy of CDD-ML-Part-1-Bioactivity-Data-Concised.ipynb'
'Copy of CDD-ML-Part-1-bioactivity-data.ipynb'
 data
 data2/
 data3/
 data_my/
 My-first-notebook.ipynb
 Untitled
 Untitled0.ipynb


In [28]:
! mkdir "/content/gdrive/My Drive/Colab Notebooks/data_my"

mkdir: cannot create directory ‘/content/gdrive/My Drive/Colab Notebooks/data_my’: File exists
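
The "File exists" message appears when the folder was already created in an earlier run. As a sketch of an idempotent alternative in Python, `Path.mkdir(exist_ok=True)` creates the folder only when needed and `shutil.copy` copies the file into it. A temporary directory stands in for the Google Drive mount point here, which is an assumption made so that the example runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path

# A temporary directory stands in for the Drive mount point (assumption for this sketch).
base = Path(tempfile.mkdtemp())

# Small placeholder file standing in for the exported CSV.
source = base / "bioactivity_data.csv"
source.write_text("molecule_chembl_id,standard_value\nCHEMBL3752792,70.0\n")

# Create the destination folder; repeating the call does not raise an error.
target_dir = base / "Colab Notebooks" / "data_my"
target_dir.mkdir(parents=True, exist_ok=True)
target_dir.mkdir(parents=True, exist_ok=True)

# Copy the file into the folder; shutil.copy returns the destination path.
copied = Path(shutil.copy(source, target_dir))
print(copied.name)
```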


In [33]:
! cp bioactivity_data.csv "/content/gdrive/My Drive/Colab Notebooks/data_my"

In [34]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data_my"

total 9
-rw------- 1 root root 8722 Dec 13 07:59 bioactivity_data.csv


Let's see the CSV files that we have so far.

In [36]:
! ls

bioactivity_data.csv  gdrive  sample_data


Let's take a glimpse of the **bioactivity_data.csv** file that we've just created.

In [37]:
! head bioactivity_data.csv

activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,16438287,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP synthase,B,,,BAO_0000190,BAO_0000223,protein complex format,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)C)c1ccccc1,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '14.09', 'le': '0.26', 'lle': '1.38', 'sei': '10.57'}",CHEMBL3752792,,CHEMBL3752792,7.16,Fa

## **Handling missing data**
If any compound has a missing value in the **standard_value** column, drop it.

In [38]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,16438287,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '14.09', 'le': '0.26', 'lle': '1.38', ...",CHEMBL3752792,,CHEMBL3752792,7.16,False,http://www.openphacts.org/units/Nanomolar,2746818,=,1,True,=,,IC50,nM,,70.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.07
1,,16438288,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '12.45', 'le': '0.26', 'lle': '0.77', ...",CHEMBL3751877,,CHEMBL3751877,7.3,False,http://www.openphacts.org/units/Nanomolar,2746819,=,1,True,=,,IC50,nM,,50.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.05
2,,16438289,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '13.22', 'le': '0.24', 'lle': '0.52', ...",CHEMBL3754648,,CHEMBL3754648,7.4,False,http://www.openphacts.org/units/Nanomolar,2746820,=,1,True,=,,IC50,nM,,40.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.04
3,,16438290,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '11.78', 'le': '0.24', 'lle': '-0.13',...",CHEMBL3752547,,CHEMBL3752547,7.52,False,http://www.openphacts.org/units/Nanomolar,2746821,=,1,True,=,,IC50,nM,,30.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.03
4,,16438291,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '10.26', 'le': '0.21', 'lle': '-1.10',...",CHEMBL3753673,,CHEMBL3753673,6.55,False,http://www.openphacts.org/units/Nanomolar,2746822,=,1,True,=,,IC50,nM,,280.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.28
5,,16438292,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,C=CCCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCCC=C)C(O...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '10.65', 'le': '0.22', 'lle': '-1.33',...",CHEMBL3751865,,CHEMBL3751865,7.1,False,http://www.openphacts.org/units/Nanomolar,2746823,=,1,True,=,,IC50,nM,,80.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.08
6,,16438293,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '13.49', 'le': '0.24', 'lle': '0.60', ...",CHEMBL3753028,,CHEMBL3753028,7.52,False,http://www.openphacts.org/units/Nanomolar,2746824,=,1,True,=,,IC50,nM,,30.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.03
7,,16438294,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '13.64', 'le': '0.28', 'lle': '1.47', ...",CHEMBL3754231,,CHEMBL3754231,8.0,False,http://www.openphacts.org/units/Nanomolar,2746825,=,1,True,=,,IC50,nM,,10.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.01
8,,16438295,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '16.14', 'le': '0.29', 'lle': '2.08', ...",CHEMBL3754673,,CHEMBL3754673,9.0,False,http://www.openphacts.org/units/Nanomolar,2746826,=,1,True,=,,IC50,nM,,1.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.001
9,,16438296,[],CHEMBL3757365,Inhibition of Mycobacterium tuberculosis ATP s...,B,,,BAO_0000190,BAO_0000223,protein complex format,CN(C)CCC(O)(c1cccc(Br)c1)C1c2cc3ccccc3nc2OCC/C...,,,CHEMBL3751774,MedChemComm,2015,"{'bei': '12.53', 'le': '0.25', 'lle': '0.38', ...",CHEMBL3753969,,CHEMBL3753969,8.0,False,http://www.openphacts.org/units/Nanomolar,2746827,=,1,True,=,,IC50,nM,,10.0,CHEMBL2364166,Mycobacterium tuberculosis,ATP synthase,1773,,,IC50,uM,UO_0000065,,0.01


For this dataset, there is no missing data. However, the code cell above can be reused for the bioactivity data of other target proteins.
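
As an illustration of what the filter does when values are missing, here is a minimal sketch on made-up data. It uses `dropna(subset=...)`, which behaves like the `notna()` mask above, and additionally resets the index and converts the values to numbers. The compound IDs are hypothetical, and treating `standard_value` as text is an assumption of this sketch.

```python
import pandas as pd

# Made-up rows; the second compound has no standard_value.
df_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],  # hypothetical IDs
    "standard_value": ["70.0", None, "280.0"],
})

# Drop rows without a standard_value and renumber the remaining rows from 0.
df2_demo = df_demo.dropna(subset=["standard_value"]).reset_index(drop=True)

# Convert the text values to floats so they can be compared numerically.
df2_demo["standard_value"] = pd.to_numeric(df2_demo["standard_value"])
print(df2_demo)
```

Resetting the index keeps the row labels contiguous, which matters later if the frame is combined with other data by position.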

## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are reported as IC50 values in nM. Compounds with values of 1,000 nM or less will be considered **active**, while those with values of 10,000 nM or more will be considered **inactive**. Compounds with values between 1,000 and 10,000 nM will be referred to as **intermediate**.

In [39]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
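
The same thresholds can also be applied without an explicit loop. The sketch below uses `numpy.select` on made-up IC50 values to show how the boundary cases (exactly 1,000 and exactly 10,000 nM) are labeled. The names `values` and `bioactivity_class_demo` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up IC50 values in nM, including both boundary cases.
values = pd.Series([1.0, 1000.0, 5000.0, 10000.0, 25000.0]).to_numpy()

# Conditions are checked in order; the first match wins.
conditions = [values >= 10000, values <= 1000]
labels = ["inactive", "active"]

# Anything matching neither condition is labeled "intermediate".
bioactivity_class_demo = np.select(conditions, labels, default="intermediate").tolist()
print(bioactivity_class_demo)
```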

### **Collect the *molecule_chembl_id* values into a list**

In [42]:
mol_cid = []
for i in df2.molecule_chembl_id:
  mol_cid.append(i)

In [43]:
mol_cid

['CHEMBL3752792',
 'CHEMBL3751877',
 'CHEMBL3754648',
 'CHEMBL3752547',
 'CHEMBL3753673',
 'CHEMBL3751865',
 'CHEMBL3753028',
 'CHEMBL3754231',
 'CHEMBL3754673',
 'CHEMBL3753969',
 'CHEMBL3402635',
 'CHEMBL3402627',
 'CHEMBL3402631',
 'CHEMBL3400982',
 'CHEMBL3402630',
 'CHEMBL3402629']

### **Collect the *canonical_smiles* values into a list**

In [44]:
canonical_smiles = []
for i in df2.canonical_smiles:
  canonical_smiles.append(i)

### **Collect the *standard_value* values into a list**

In [45]:
standard_value = []
for i in df2.standard_value:
  standard_value.append(i)
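
The three loops above each copy one column into a list. A shorter equivalent, shown here as a sketch on a small hypothetical frame, is the `tolist()` method of a pandas Series. The SMILES strings below are simple placeholders, not the compounds from the dataset.

```python
import pandas as pd

# Hypothetical two-row frame with the three columns used above.
df2_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL3752792", "CHEMBL3751877"],
    "canonical_smiles": ["CCO", "CCN"],  # placeholder SMILES, not the real structures
    "standard_value": ["70.0", "50.0"],
})

# Each column becomes a plain Python list in row order.
mol_cid_demo = df2_demo["molecule_chembl_id"].tolist()
canonical_smiles_demo = df2_demo["canonical_smiles"].tolist()
standard_value_demo = df2_demo["standard_value"].tolist()
print(mol_cid_demo)
```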

### **Combine the 4 lists into a dataframe**

In [46]:
data_tuples = list(zip(mol_cid, canonical_smiles, bioactivity_class, standard_value))
df3 = pd.DataFrame( data_tuples,  columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value'])

In [47]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,standard_value
0,CHEMBL3752792,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,active,70.0
1,CHEMBL3751877,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,active,50.0
2,CHEMBL3754648,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,active,40.0
3,CHEMBL3752547,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,active,30.0
4,CHEMBL3753673,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,active,280.0
5,CHEMBL3751865,C=CCCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCCC=C)C(O...,active,80.0
6,CHEMBL3753028,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,active,30.0
7,CHEMBL3754231,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,active,10.0
8,CHEMBL3754673,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,active,1.0
9,CHEMBL3753969,CN(C)CCC(O)(c1cccc(Br)c1)C1c2cc3ccccc3nc2OCC/C...,active,10.0


### **Alternative method**

In [48]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL3752792,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,70.0
1,CHEMBL3751877,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,50.0
2,CHEMBL3754648,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,40.0
3,CHEMBL3752547,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,30.0
4,CHEMBL3753673,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,280.0
5,CHEMBL3751865,C=CCCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCCC=C)C(O...,80.0
6,CHEMBL3753028,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,30.0
7,CHEMBL3754231,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,10.0
8,CHEMBL3754673,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,1.0
9,CHEMBL3753969,CN(C)CCC(O)(c1cccc(Br)c1)C1c2cc3ccccc3nc2OCC/C...,10.0


In [50]:
pd.concat([df3,pd.Series(bioactivity_class)], axis=1)

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,0
0,CHEMBL3752792,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,70.0,active
1,CHEMBL3751877,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,50.0,active
2,CHEMBL3754648,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,40.0,active
3,CHEMBL3752547,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,30.0,active
4,CHEMBL3753673,C=CCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCC=C)C(O)(...,280.0,active
5,CHEMBL3751865,C=CCCOc1nc2ccccc2cc1C(c1cc2ccccc2nc1OCCC=C)C(O...,80.0,active
6,CHEMBL3753028,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,30.0,active
7,CHEMBL3754231,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,10.0,active
8,CHEMBL3754673,COc1nc2ccccc2cc1C(c1cc2ccccc2nc1OC)C(O)(CCN(C)...,1.0,active
9,CHEMBL3753969,CN(C)CCC(O)(c1cccc(Br)c1)C1c2cc3ccccc3nc2OCC/C...,10.0,active
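
In the cell above, the class column is labeled `0` because the Series has no name, and the concatenated frame is displayed but not assigned, so `df3` itself still has only three columns. A minimal sketch of one way to handle both points follows. It names the Series, aligns it to the frame's index, and stores the result. The data and the `_demo` names are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose index has a gap, as if row 1 had been dropped earlier.
df3_demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_C"],  # made-up IDs
        "standard_value": [70.0, 25000.0],
    },
    index=[0, 2],
)
classes = ["active", "inactive"]

# Give the Series a name and the same index so that rows line up one to one.
class_column = pd.Series(classes, index=df3_demo.index, name="bioactivity_class")

# Assign the result so that later cells (such as to_csv) see the new column.
df4_demo = pd.concat([df3_demo, class_column], axis=1)
print(df4_demo)
```

Without the shared index, `pd.concat` aligns on row labels and would introduce missing values wherever the labels differ.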


Save the dataframe to a CSV file.

In [51]:
df3.to_csv('bioactivity_preprocessed_data.csv', index=False)

In [52]:
! ls -l

total 24
-rw-r--r-- 1 root root 8722 Dec 13 07:30 bioactivity_data.csv
-rw-r--r-- 1 root root 1405 Dec 13 08:13 bioactivity_preprocessed_data.csv
drwx------ 5 root root 4096 Dec 13 07:32 gdrive
drwxr-xr-x 1 root root 4096 Dec  3 14:33 sample_data


Let's copy it to Google Drive.

In [53]:
! cp bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data_my"

In [57]:
! ls "/content/gdrive/My Drive/Colab Notebooks/data_my"

bioactivity_data.csv  bioactivity_preprocessed_data.csv


---