## SMILES retrieval

In this notebook we retrieve the SMILES strings for our compounds from the ChEMBL database. I will be using ChEMBL's Python web resource client to retrieve this information, although it can be retrieved much more easily by installing a SQL client and querying the database directly. The SMILES strings are the structural feature we need to work with.

First we load the data. We will remove all compounds for which SMILES information is not available. It may be available in other databases, but for this project I prefer to work strictly with ChEMBL data.

In [1]:
import pandas as pd

# default comma separator: recent pandas versions reject sep='\n'
comp = pd.read_csv('../cleaned_data/imp_comp.txt')

In [2]:
comp.head()

Unnamed: 0,chem_id
0,CHEMBL1241824
1,CHEMBL583947
2,CHEMBL404160
3,CHEMBL2112734
4,CHEMBL1994241


## Load the ChEMBL client and download data

In [3]:
import logging
from operator import itemgetter
from IPython.display import Image, display

from chembl_webresource_client.new_client import new_client


In [4]:
'''
We will look at all the available resources in the ChEMBL database and count how many molecules it contains
'''


available_resources = [resource for resource in dir(new_client) if not resource.startswith('_')]
print(available_resources)
print(len(available_resources))

molecule = new_client.molecule
molecule.set_format('json')
print("%s molecules available in ChEMBL" % len(molecule.all()))

['activity', 'activity_supplementary_data_by_activity', 'assay', 'assay_class', 'atc_class', 'binding_site', 'biotherapeutic', 'cell_line', 'chembl_id_lookup', 'compound_record', 'compound_structural_alert', 'description', 'document', 'document_similarity', 'document_term', 'drug', 'drug_indication', 'go_slim', 'image', 'mechanism', 'metabolism', 'molecule', 'molecule_form', 'official', 'organism', 'protein_class', 'similarity', 'source', 'substructure', 'target', 'target_component', 'target_prediction', 'target_relation', 'tissue', 'xref_source']
35
1879206 molecules available in ChEMBL


In [5]:
# we can get information about a compound by using its ChEMBL ID

record = molecule.get('CHEMBL3904876')  # test the lookup with a single ChEMBL ID

In [6]:
# the record is a dictionary containing chemical information on the molecule

type(record), record.keys()

(dict,

In [7]:
# the SMILES string we need is stored under the following keys

record['molecule_structures']['canonical_smiles']

'CCc1c(N)ncnc1N2CCC(CC2)c3nc(cn3CCN4CCC4)c5ccnc(OC)c5'
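
The nested lookup above fails if a record has no structure attached. As a minimal sketch, assuming that `molecule_structures` may be missing or `None` for some records, a small helper can return `None` instead of raising. The name `extract_smiles` and the two sample records are hypothetical and hand-written for illustration; they are not real API output.

```python
from typing import Optional


def extract_smiles(record: Optional[dict]) -> Optional[str]:
    """Return the canonical SMILES from a molecule record, or None if absent."""
    if not record:
        return None
    # 'molecule_structures' is assumed to be missing or None when no structure is stored
    structures = record.get('molecule_structures') or {}
    return structures.get('canonical_smiles') or None


# Hand-written records that mimic the shape used above (not real API output)
with_structure = {'molecule_structures': {'canonical_smiles': 'CCO'}}
without_structure = {'molecule_structures': None}

print(extract_smiles(with_structure))
print(extract_smiles(without_structure))
```

With the real client this would be called as `extract_smiles(molecule.get(chem_id))`, which keeps the missing-structure case separate from network or lookup errors.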

In [8]:
# we will create a new column to hold smiles values
comp['Smiles'] = ''
comp.head()

Unnamed: 0,chem_id,Smiles
0,CHEMBL1241824,
1,CHEMBL583947,
2,CHEMBL404160,
3,CHEMBL2112734,
4,CHEMBL1994241,


In [None]:
'''
The following code fetches the SMILES for our compounds.
It is left commented out; the run shown resumes from index 60000
and pauses for 5 seconds after every 1000 compounds.
'''

# import time

# start = 60000  # resume point for this run

# for ind in range(start, len(df_tr)):
#     chem_id = df_tr.loc[ind, 'compound_id']
#     try:
#         record = molecule.get(chem_id)
#         df_tr.loc[ind, 'Smiles'] = record['molecule_structures']['canonical_smiles']
#     except Exception:
#         # lookup failed or the record has no structure
#         df_tr.loc[ind, 'Smiles'] = 'None'
#     if (ind + 1) % 1000 == 0:
#         time.sleep(5)  # brief pause between batches
#         print(f"Done with {ind + 1} compounds")
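
The retrieval loop can also be written as a function that takes the lookup as an argument, which makes it possible to try the logic offline before pointing it at the web service. This is only a sketch: `fetch_smiles`, `get_record`, `pause_every` and `pause_seconds` are hypothetical names, and `fake_db` is a made-up stand-in for `molecule.get`.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def fetch_smiles(chem_ids: Iterable[str],
                 get_record: Callable[[str], dict],
                 pause_every: int = 1000,
                 pause_seconds: float = 5.0) -> Dict[str, Optional[str]]:
    """Look up the canonical SMILES for each ID, pausing between batches."""
    smiles_by_id: Dict[str, Optional[str]] = {}
    for count, chem_id in enumerate(chem_ids, start=1):
        try:
            record = get_record(chem_id)
            smiles_by_id[chem_id] = record['molecule_structures']['canonical_smiles']
        except Exception:
            # broad on purpose, like the loop above: failed lookup or no structure
            smiles_by_id[chem_id] = None
        if count % pause_every == 0:
            time.sleep(pause_seconds)
    return smiles_by_id


# Stand-in for molecule.get so the sketch runs offline (values are made up)
fake_db = {
    'ID_A': {'molecule_structures': {'canonical_smiles': 'CCO'}},
    'ID_B': {'molecule_structures': None},
}
result = fetch_smiles(['ID_A', 'ID_B', 'ID_C'], fake_db.__getitem__, pause_seconds=0)
print(result)
```

Against the real service the call would look like `fetch_smiles(comp['chem_id'], molecule.get)`, and the result could be attached with `comp['Smiles'] = comp['chem_id'].map(result)`. Storing `None` rather than the string `'None'` lets pandas treat the missing values as missing.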
        

Once we have collected all the SMILES strings, we create a new dataframe holding a molecular fingerprint for each one. The fingerprints will help our model because they contain binary information about each molecule's substructures. We then merge them with the compound table to create a single file, which we save at the end.

In [None]:
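# create a list holding the ECFP4 bits for each compound
from rdkit import Chem
from rdkit.Chem import AllChem

# NOTE: df_test_k_sep is not defined in this notebook; its rows must line up with comp
# Morgan fingerprint with radius 2 (ECFP4), folded to 1024 bits
ECFP4 = [list(AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(i),2,nBits=1024).ToBitString()) 
                                                  for i in df_test_k_sep["Compound_SMILES"]]

# turn the list into a dataframe
df_ECFP4 = pd.DataFrame(ECFP4, columns=["ECFP4."+str(i) for i in range(1,1025)])
# concat it with our previous file
comp = pd.concat([comp,df_ECFP4],axis=1)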
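
To make the column layout concrete, the sketch below maps a bit string to named columns using only the standard library. The function name `bitstring_to_columns` is hypothetical, and the 8-bit strings are toy values standing in for the 1024-character output of `ToBitString()`; they are not real fingerprints.

```python
from typing import Dict, List


def bitstring_to_columns(bits: str, prefix: str = 'ECFP4.') -> Dict[str, int]:
    """Map a fingerprint bit string to {column name: 0 or 1}, numbering from 1."""
    return {f'{prefix}{pos}': int(bit) for pos, bit in enumerate(bits, start=1)}


# Toy 8-bit strings standing in for ToBitString() output (not real fingerprints)
toy_bits = ['10010000', '00010010']
rows: List[Dict[str, int]] = [bitstring_to_columns(b) for b in toy_bits]

print(list(rows[0])[:3])
print(rows[0]['ECFP4.1'], rows[0]['ECFP4.4'], rows[1]['ECFP4.7'])
```

Passing `rows` to `pd.DataFrame` gives one row per compound and one column per bit. Unlike `list(...)` on the bit string, which yields the characters `'0'` and `'1'`, this version stores integers.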

In [9]:
comp.to_csv('../cleaned_data/ECFP4.tsv', sep='\t')  # tab-separated, to match the .tsv extension