## Data Mining and Cleaning for Kinase Inhibitor Data

Having chosen PKIDB as the source of our kinase inhibitor information, we extracted the data from the website and saved it to a CSV file. We now need to clean this data and extract further information from it so that the resulting table is suitable for import into our database.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re

In [2]:
#We read our CSV file using pandas to make it a dataframe object.
df = pd.read_csv('PKIDB.csv', header=0, encoding = 'unicode_escape')

df

Unnamed: 0,INN_Name,BrandName,Phase,Applicants,Links,LigID,pdbID,Type,RoF,MW,...,Indications,Targets,Kinase families,Canonical_Smiles_InChiKey,First_Approval,SC_Patent,Chirality,Synonyms,FDA approved,Melting points (°C)
0,Abemaciclib,Verzenio,4.0,Eli Lilly,ChemSpider\nChEMBL\nPubChem\nDrugBank\nRCSB\nP...,'6ZV',5l2s,1.0,1,506.3,...,"On September 28, 2017, the Food and Drug Admin...",CDK4\nCDK6,CMGC,Smiles=CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc...,2017.0,US-7855211-B2,Achiral Molecule,Verzenio\nAbemaciclib\nLY-2835219,Y,
1,Abivertinib,,3.0,ACEA Biosciences,ChemSpider\nPubChem\nDrugBank\nGuide to Pharma...,,,,0,487.2,...,"AC0010 is an orally active, irreversible EGFR ...",,,Smiles=CN1CCN(CC1)c2ccc(cc2F)Nc3nc4c(cc[nH]4)c...,,,,AC0010\nAvitinib,,
2,Acalabrutinib,Calquence,4.0,Astrazeneca,ChemSpider\nChEMBL\nPubChem\nDrugBank\nGuide t...,,,,0,465.2,...,Acalabrutinib is currently indicated for the t...,BTK,Tyr,Smiles=CC#CC(=O)N1CCC[C@H]1c2nc(c3n2ccnc3N)c4c...,2017.0,US-9290504-B2,Single Stereoisomer,Calquence\nACP-196\nAcalabrutinib,Y,
3,Acalisib,,2.0,Gilead Sciences,ChemSpider\nPubChem\nDrugBank\nGuide to Pharma...,,,,0,401.1,...,,PIK3CA,Atypical,Smiles=C[C@@H](c1nc2ccc(cc2c(=O)n1c3ccccc3)F)N...,,,Single Stereoisomer,CAL-120,,
4,Acumapimod,,2.0,Mereo BioPharma,ChemSpider\nPubChem\nDrugBank\nGuide to Pharma...,,,,0,385.2,...,,MAPK14\nMAPK11\nMAPK13\nMAPK12,CMGC,Smiles=Cc1ccc(cc1n2c(c(cn2)C(=O)c3cccc(c3)C#N)...,,,Single Stereoisomer,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,Vistusertib,,2.0,Astrazeneca,ChemSpider\nChEMBL\nPubChem\nDrugBank\nGuide t...,,,,0,462.2,...,,MTOR,Atypical,Smiles=C[C@H]1COCCN1c2c3ccc(nc3nc(n2)N4CCOC[C@...,,,Single Stereoisomer,AZD-2014\nAZD2014\nVistusertib,,
214,Volasertib,,3.0,Boehringer Ingelheim,ChemSpider\nChEMBL\nDrugBank\nRCSB\nPDBe\nGuid...,'IBI',3fc2 5v67 5vbr,1.0,1,618.4,...,,PLK1,Other,Smiles=CC[C@H]1N(c2nc(ncc2N(C1=O)C)Nc3c(cc(cc3...,,,Single Stereoisomer,BI-6727\nVolasertib,,
215,Voruciclib,,2.0,Piramal Enterprises,ChemSpider\nChEMBL\nPubChem\nDrugBank\nGuide t...,,,,0,469.1,...,,CDK4,CMGC,Smiles=CN1CC[C@H]([C@@H]1CO)c2c(cc(c3c2oc(cc3=...,,,Single Stereoisomer,,,
216,Voxtalisib,,2.0,Sanofi,ChemSpider\nChEMBL\nPubChem\nDrugBank\nZINC\nF...,,,,0,270.1,...,,PIK3CA\nMTOR,Atypical,Smiles=CCn1c2c(cc(c1=O)c3ccn[nH]3)c(nc(n2)N)C\...,,,Achiral Molecule,SAR-245409\nVoxtalisib\nXL-765,,


Our table is not yet in the format we need. At present, the Targets column holds all the kinases for an inhibitor in a single newline-separated cell. For our database, each target needs to be listed in its own row, so we create a new dataframe and reshape the data accordingly.

In [3]:
#So we take the values under the INN_Name and Targets column and place them into a new dataframe.
INN = df.INN_Name
TARGETS = df.Targets
df1 = pd.DataFrame({'INN_Name': INN, 'Targets': TARGETS})
df1

Unnamed: 0,INN_Name,Targets
0,Abemaciclib,CDK4\nCDK6
1,Abivertinib,
2,Acalabrutinib,BTK
3,Acalisib,PIK3CA
4,Acumapimod,MAPK14\nMAPK11\nMAPK13\nMAPK12
...,...,...
213,Vistusertib,MTOR
214,Volasertib,PLK1
215,Voruciclib,CDK4
216,Voxtalisib,PIK3CA\nMTOR


In [4]:
#We then reshape our new dataframe so that each kinase target is split out and placed in its own row.
df1 = (
    df1.set_index(df1.columns.drop('Targets').tolist()) #index on every column except Targets (here just INN_Name)
       .Targets.str.split('\n', expand=True) #split the Targets strings on newlines, giving one kinase per column
       .stack() #stack those columns into rows; missing (NaN) cells are dropped here, so inhibitors with no listed target disappear
       .reset_index() #turn the index levels back into ordinary columns
       .rename(columns={0: 'Targets'}) #the stacked values land in a column named 0, so we rename it to 'Targets'
       .loc[:, df1.columns] #keep only INN_Name and Targets, in their original order
)

df1

Unnamed: 0,INN_Name,Targets
0,Abemaciclib,CDK4
1,Abemaciclib,CDK6
2,Acalabrutinib,BTK
3,Acalisib,PIK3CA
4,Acumapimod,MAPK14
...,...,...
488,Vistusertib,MTOR
489,Volasertib,PLK1
490,Voruciclib,CDK4
491,Voxtalisib,PIK3CA
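
The same reshaping can also be written with `DataFrame.explode`. The block below is only a minimal sketch on a three-row toy dataframe (the name `toy` and the `dropna` step are our assumptions, added so that inhibitors without a target are removed just as in the output above); it is not the code that produced the table.

```python
import pandas as pd

# Toy data mirroring the INN_Name/Targets layout used above.
toy = pd.DataFrame({
    "INN_Name": ["Abemaciclib", "Abivertinib", "Acalabrutinib"],
    "Targets": ["CDK4\nCDK6", None, "BTK"],
})

long_format = (
    toy.dropna(subset=["Targets"])                               # drop inhibitors with no listed target
       .assign(Targets=lambda d: d["Targets"].str.split("\n"))   # newline-separated string -> list
       .explode("Targets")                                       # one row per list element
       .reset_index(drop=True)                                   # renumber the rows from 0
)
print(long_format)
```

Whether to drop inhibitors that have no target, or to keep them with an empty Targets value, is a design choice; removing the `dropna` line keeps them.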


Now that our initial data is in the correct format, we need to bring the rest of the data we are interested in into the same format.

In [5]:
#We drop the columns of the original dataframe that we do not need, so that only the data we want in our database
#remains.
df = df.drop(columns=['BrandName', 'Phase', 'Applicants', 'Links', 'LigID', 'pdbID', 'Type', 'Indications', 'Targets',
                      'Kinase families', 'First_Approval', 'SC_Patent', 'Chirality', 'Synonyms', 'FDA approved', 
                      'Melting points (°C)'])

df

Unnamed: 0,INN_Name,RoF,MW,LogP,TPSA,HBA,HBD,NRB,Canonical_Smiles_InChiKey
0,Abemaciclib,1,506.3,4.9,75.0,8,1,7,Smiles=CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc...
1,Abivertinib,0,487.2,4.5,98.4,7,3,7,Smiles=CN1CCN(CC1)c2ccc(cc2F)Nc3nc4c(cc[nH]4)c...
2,Acalabrutinib,0,465.2,3.3,118.5,7,2,4,Smiles=CC#CC(=O)N1CCC[C@H]1c2nc(c3n2ccnc3N)c4c...
3,Acalisib,0,401.1,3.4,101.4,7,2,4,Smiles=C[C@@H](c1nc2ccc(cc2c(=O)n1c3ccccc3)F)N...
4,Acumapimod,0,385.2,2.8,113.8,6,2,5,Smiles=Cc1ccc(cc1n2c(c(cn2)C(=O)c3cccc(c3)C#N)...
...,...,...,...,...,...,...,...,...,...
213,Vistusertib,0,462.2,2.5,92.7,8,1,4,Smiles=C[C@H]1COCCN1c2c3ccc(nc3nc(n2)N4CCOC[C@...
214,Volasertib,1,618.4,4.3,106.2,9,2,10,Smiles=CC[C@H]1N(c2nc(ncc2N(C1=O)C)Nc3c(cc(cc3...
215,Voruciclib,0,469.1,4.3,94.1,6,3,3,Smiles=CN1CC[C@H]([C@@H]1CO)c2c(cc(c3c2oc(cc3=...
216,Voxtalisib,0,270.1,1.1,102.5,6,2,2,Smiles=CCn1c2c(cc(c1=O)c3ccn[nH]3)c(nc(n2)N)C\...


In [6]:
#We want the SMILES string and the InChIKey of each inhibitor in two separate columns, so we split the contents of
#Canonical_Smiles_InChiKey on the newline that separates the two values and store the pieces in two new columns.
df[['Smiles','InChi_Key']] = df.Canonical_Smiles_InChiKey.str.split("\n", expand=True)
df

Unnamed: 0,INN_Name,RoF,MW,LogP,TPSA,HBA,HBD,NRB,Canonical_Smiles_InChiKey,Smiles,InChi_Key
0,Abemaciclib,1,506.3,4.9,75.0,8,1,7,Smiles=CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc...,Smiles=CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc...,InChiKey=UZWDCWONPYILKI-UHFFFAOYSA-N
1,Abivertinib,0,487.2,4.5,98.4,7,3,7,Smiles=CN1CCN(CC1)c2ccc(cc2F)Nc3nc4c(cc[nH]4)c...,Smiles=CN1CCN(CC1)c2ccc(cc2F)Nc3nc4c(cc[nH]4)c...,InChiKey=UOFYSRZSLXWIQB-UHFFFAOYSA-N
2,Acalabrutinib,0,465.2,3.3,118.5,7,2,4,Smiles=CC#CC(=O)N1CCC[C@H]1c2nc(c3n2ccnc3N)c4c...,Smiles=CC#CC(=O)N1CCC[C@H]1c2nc(c3n2ccnc3N)c4c...,InChiKey=WDENQIQQYWYTPO-IBGZPJMESA-N
3,Acalisib,0,401.1,3.4,101.4,7,2,4,Smiles=C[C@@H](c1nc2ccc(cc2c(=O)n1c3ccccc3)F)N...,Smiles=C[C@@H](c1nc2ccc(cc2c(=O)n1c3ccccc3)F)N...,InChiKey=DOCINCLJNAXZQF-LBPRGKRZSA-N
4,Acumapimod,0,385.2,2.8,113.8,6,2,5,Smiles=Cc1ccc(cc1n2c(c(cn2)C(=O)c3cccc(c3)C#N)...,Smiles=Cc1ccc(cc1n2c(c(cn2)C(=O)c3cccc(c3)C#N)...,InChiKey=VGUSQKZDZHAAEE-UHFFFAOYSA-N
...,...,...,...,...,...,...,...,...,...,...,...
213,Vistusertib,0,462.2,2.5,92.7,8,1,4,Smiles=C[C@H]1COCCN1c2c3ccc(nc3nc(n2)N4CCOC[C@...,Smiles=C[C@H]1COCCN1c2c3ccc(nc3nc(n2)N4CCOC[C@...,InChiKey=JUSFANSTBFGBAF-IRXDYDNUSA-N
214,Volasertib,1,618.4,4.3,106.2,9,2,10,Smiles=CC[C@H]1N(c2nc(ncc2N(C1=O)C)Nc3c(cc(cc3...,Smiles=CC[C@H]1N(c2nc(ncc2N(C1=O)C)Nc3c(cc(cc3...,InChiKey=SXNJFOWDRLKDSF-STROYTFGSA-N
215,Voruciclib,0,469.1,4.3,94.1,6,3,3,Smiles=CN1CC[C@H]([C@@H]1CO)c2c(cc(c3c2oc(cc3=...,Smiles=CN1CC[C@H]([C@@H]1CO)c2c(cc(c3c2oc(cc3=...,InChiKey=MRPGRAKIAJJGMM-OCCSQVGLSA-N
216,Voxtalisib,0,270.1,1.1,102.5,6,2,2,Smiles=CCn1c2c(cc(c1=O)c3ccn[nH]3)c(nc(n2)N)C\...,Smiles=CCn1c2c(cc(c1=O)c3ccn[nH]3)c(nc(n2)N)C,InChiKey=RGHYDLZMTYDBDT-UHFFFAOYSA-N


In [7]:
#This leaves a now redundant Canonical_Smiles_InChiKey column, so we remove it from the dataframe.
df = df.drop(columns=['Canonical_Smiles_InChiKey'])
df

Unnamed: 0,INN_Name,RoF,MW,LogP,TPSA,HBA,HBD,NRB,Smiles,InChi_Key
0,Abemaciclib,1,506.3,4.9,75.0,8,1,7,Smiles=CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc...,InChiKey=UZWDCWONPYILKI-UHFFFAOYSA-N
1,Abivertinib,0,487.2,4.5,98.4,7,3,7,Smiles=CN1CCN(CC1)c2ccc(cc2F)Nc3nc4c(cc[nH]4)c...,InChiKey=UOFYSRZSLXWIQB-UHFFFAOYSA-N
2,Acalabrutinib,0,465.2,3.3,118.5,7,2,4,Smiles=CC#CC(=O)N1CCC[C@H]1c2nc(c3n2ccnc3N)c4c...,InChiKey=WDENQIQQYWYTPO-IBGZPJMESA-N
3,Acalisib,0,401.1,3.4,101.4,7,2,4,Smiles=C[C@@H](c1nc2ccc(cc2c(=O)n1c3ccccc3)F)N...,InChiKey=DOCINCLJNAXZQF-LBPRGKRZSA-N
4,Acumapimod,0,385.2,2.8,113.8,6,2,5,Smiles=Cc1ccc(cc1n2c(c(cn2)C(=O)c3cccc(c3)C#N)...,InChiKey=VGUSQKZDZHAAEE-UHFFFAOYSA-N
...,...,...,...,...,...,...,...,...,...,...
213,Vistusertib,0,462.2,2.5,92.7,8,1,4,Smiles=C[C@H]1COCCN1c2c3ccc(nc3nc(n2)N4CCOC[C@...,InChiKey=JUSFANSTBFGBAF-IRXDYDNUSA-N
214,Volasertib,1,618.4,4.3,106.2,9,2,10,Smiles=CC[C@H]1N(c2nc(ncc2N(C1=O)C)Nc3c(cc(cc3...,InChiKey=SXNJFOWDRLKDSF-STROYTFGSA-N
215,Voruciclib,0,469.1,4.3,94.1,6,3,3,Smiles=CN1CC[C@H]([C@@H]1CO)c2c(cc(c3c2oc(cc3=...,InChiKey=MRPGRAKIAJJGMM-OCCSQVGLSA-N
216,Voxtalisib,0,270.1,1.1,102.5,6,2,2,Smiles=CCn1c2c(cc(c1=O)c3ccn[nH]3)c(nc(n2)N)C,InChiKey=RGHYDLZMTYDBDT-UHFFFAOYSA-N


The values in the Smiles and InChi_Key columns still carry the prefixes 'Smiles=' and 'InChiKey=', which we do not want in our database, so we remove them here.

In [8]:
#We put the values in the Smiles and InChi_Key columns in variables.
SmilesList = df['Smiles'].values
InCHIList = df['InChi_Key'].values

#Apply a substitution regex that matches the 'Smiles=' and 'InChiKey=' prefixes in those columns and replaces them with empty strings.
Smiles_regex = [re.sub('(Smiles=)', '', Smile) for Smile in SmilesList]
InCHI_regex = [re.sub('(InChiKey=)', '', InCHI) for InCHI in InCHIList]

#We remove the old Smiles and InChi_Key columns
df = df.drop(columns=['Smiles', 'InChi_Key'])

#We then add the new Smiles and InChi_Key columns filled with our regex cleaned data lists.
df['Smiles'] = Smiles_regex
df['InChi_Key'] = InCHI_regex
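
The prefixes can also be stripped in place with the pandas string methods, which skip missing values instead of raising an error. This is a minimal sketch on a toy dataframe (`toy` is a hypothetical name; the first row reuses the Voxtalisib values shown above, the second row is an assumed missing entry), not a replacement for the cell above.

```python
import pandas as pd

# One real-looking row and one row with missing values.
toy = pd.DataFrame({
    "Smiles": ["Smiles=CCn1c2c(cc(c1=O)c3ccn[nH]3)c(nc(n2)N)C", None],
    "InChi_Key": ["InChiKey=RGHYDLZMTYDBDT-UHFFFAOYSA-N", None],
})

# regex=False treats the pattern as a literal string, so no escaping is needed.
toy["Smiles"] = toy["Smiles"].str.replace("Smiles=", "", regex=False)
toy["InChi_Key"] = toy["InChi_Key"].str.replace("InChiKey=", "", regex=False)
print(toy)
```

Note that `str.replace` removes the text wherever it occurs, not only at the start of the value; for these two columns that makes no practical difference.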

In [9]:
#Now we merge the cleaned columns into the dataframe that already has the format we want (df1).
df1 = pd.merge(df1,df[['INN_Name', 'RoF', 'MW', 'LogP', 'TPSA', 'HBA', 'HBD', 'NRB', 'Smiles', 'InChi_Key' ]],
               on='INN_Name', how='inner')
df1

Unnamed: 0,INN_Name,Targets,RoF,MW,LogP,TPSA,HBA,HBD,NRB,Smiles,InChi_Key
0,Abemaciclib,CDK4,1,506.3,4.9,75.0,8,1,7,CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc5c(c(c4...,UZWDCWONPYILKI-UHFFFAOYSA-N
1,Abemaciclib,CDK6,1,506.3,4.9,75.0,8,1,7,CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc5c(c(c4...,UZWDCWONPYILKI-UHFFFAOYSA-N
2,Acalabrutinib,BTK,0,465.2,3.3,118.5,7,2,4,CC#CC(=O)N1CCC[C@H]1c2nc(c3n2ccnc3N)c4ccc(cc4)...,WDENQIQQYWYTPO-IBGZPJMESA-N
3,Acalisib,PIK3CA,0,401.1,3.4,101.4,7,2,4,C[C@@H](c1nc2ccc(cc2c(=O)n1c3ccccc3)F)Nc4c5c(n...,DOCINCLJNAXZQF-LBPRGKRZSA-N
4,Acumapimod,MAPK14,0,385.2,2.8,113.8,6,2,5,Cc1ccc(cc1n2c(c(cn2)C(=O)c3cccc(c3)C#N)N)C(=O)...,VGUSQKZDZHAAEE-UHFFFAOYSA-N
...,...,...,...,...,...,...,...,...,...,...,...
488,Vistusertib,MTOR,0,462.2,2.5,92.7,8,1,4,C[C@H]1COCCN1c2c3ccc(nc3nc(n2)N4CCOC[C@@H]4C)c...,JUSFANSTBFGBAF-IRXDYDNUSA-N
489,Volasertib,PLK1,1,618.4,4.3,106.2,9,2,10,CC[C@H]1N(c2nc(ncc2N(C1=O)C)Nc3c(cc(cc3)C(=O)N...,SXNJFOWDRLKDSF-STROYTFGSA-N
490,Voruciclib,CDK4,0,469.1,4.3,94.1,6,3,3,CN1CC[C@H]([C@@H]1CO)c2c(cc(c3c2oc(cc3=O)c4ccc...,MRPGRAKIAJJGMM-OCCSQVGLSA-N
491,Voxtalisib,PIK3CA,0,270.1,1.1,102.5,6,2,2,CCn1c2c(cc(c1=O)c3ccn[nH]3)c(nc(n2)N)C,RGHYDLZMTYDBDT-UHFFFAOYSA-N


A large portion of our data is now in the correct format in the new dataframe, but we are still missing two details that we want to include:

1) ChEMBL IDs for each of the inhibitors

2) Images of the Chemical Structures for each of the inhibitors

In [10]:
#To obtain the ChEMBL IDs we use BeautifulSoup to parse the html of the PKIDB website 
r = requests.get("http://www.icoa.fr/pkidb/")
soup = BeautifulSoup(r.text, 'html.parser')

#Searching for all 'tr' tags to obtain all of the table rows of the website and place them into an object
table_rows = soup.find_all('tr')

#We then select the rows of the table that actually contain the data we want, i.e. all but the very first one
#(index 0), which is the header.
data_rows = table_rows[1:]

In [11]:
#Set up an empty list to collect all of the Inhibitor Names and their corresponding ChEMBL IDs.
Name_and_ChEMBLIDs = []

for row in data_rows: #for each row in our table rows containing data
    Name = row.find('td').text #extract the text of the first table data tag (the inhibitor name) and put it into a variable.
    ChEMBL = row.find_all('a')[1] #extract the 2nd 'a' tag in the data (ChEMBL link) and put it into a variable
    ChEMBL = str(ChEMBL) #convert it into a string
    ChEMBL_regex = re.search(r"(CHEMBL\d+)",ChEMBL) #use a regex to search for just the ChEMBL ID from the whole url.
    if ChEMBL_regex is not None: #if the regex finds a match
        ChEMBL_regex = ChEMBL_regex.group(0) #extract the match value
    else:
        continue 
        
    Name_and_ChEMBLIDs.append((Name, ChEMBL_regex)) #then append both the name and the ID into our empty list.
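
The loop above assumes that the ChEMBL link is always the second 'a' tag of a row. A slightly more defensive variant scans every link of the row for a ChEMBL ID. The sketch below shows only that search step, using plain strings; the function name `find_chembl_id` and the example URLs are hypothetical, and in the scraper the list could be built with something like `[a.get('href') for a in row.find_all('a')]`.

```python
import re

CHEMBL_PATTERN = re.compile(r"CHEMBL\d+")

def find_chembl_id(hrefs):
    """Return the first ChEMBL ID found in any of the given link targets, else None."""
    for href in hrefs:
        match = CHEMBL_PATTERN.search(href or "")  # 'or ""' guards against links without an href
        if match:
            return match.group(0)
    return None

# Hypothetical link lists standing in for the links of one table row each.
row_with_id = ["http://example.org/chemspider/1", "http://example.org/compound/CHEMBL941"]
row_without_id = ["http://example.org/chemspider/2"]

print(find_chembl_id(row_with_id))
print(find_chembl_id(row_without_id))
```

Rows for which the function returns None can then be skipped or recorded, depending on whether inhibitors without a ChEMBL ID should stay in the table.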

In [12]:
#We then create a new dataframe with the Inhibitor Names and their corresponding ChEMBL IDs 
ChEMBL_ID = pd.DataFrame(Name_and_ChEMBLIDs, columns=['INN_Name', 'ChEMBL_ID'])
ChEMBL_ID

Unnamed: 0,INN_Name,ChEMBL_ID
0,Leniolisib,CHEMBL3643413
1,Nemiralisib,CHEMBL2216859
2,Oclacitinib,CHEMBL2103874
3,Toceranib,CHEMBL13608
4,Dezapelisib,CHEMBL2216863
...,...,...
192,Imatinib,CHEMBL941
193,Ponatinib,CHEMBL1171837
194,Lapatinib,CHEMBL554
195,Ribociclib,CHEMBL3545110


In [13]:
#We then merge this dataframe with the dataframe containing the majority of our data.
df1 = pd.merge(df1,ChEMBL_ID[['INN_Name', 'ChEMBL_ID']],
               on='INN_Name', how='inner')
df1

Unnamed: 0,INN_Name,Targets,RoF,MW,LogP,TPSA,HBA,HBD,NRB,Smiles,InChi_Key,ChEMBL_ID
0,Abemaciclib,CDK4,1,506.3,4.9,75.0,8,1,7,CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc5c(c(c4...,UZWDCWONPYILKI-UHFFFAOYSA-N,CHEMBL3301610
1,Abemaciclib,CDK6,1,506.3,4.9,75.0,8,1,7,CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc5c(c(c4...,UZWDCWONPYILKI-UHFFFAOYSA-N,CHEMBL3301610
2,Acalabrutinib,BTK,0,465.2,3.3,118.5,7,2,4,CC#CC(=O)N1CCC[C@H]1c2nc(c3n2ccnc3N)c4ccc(cc4)...,WDENQIQQYWYTPO-IBGZPJMESA-N,CHEMBL3707348
3,Adavosertib,WEE1,1,500.3,2.9,104.3,10,2,7,CC(C)(c1cccc(n1)n2c3c(cnc(n3)Nc4ccc(cc4)N5CCN(...,BKWJAKQVGHWELA-UHFFFAOYSA-N,CHEMBL1976040
4,Afatinib,EGFR,0,485.2,4.4,88.6,7,2,8,CN(C)C/C=C/C(=O)Nc1cc2c(cc1O[C@H]3CCOC3)ncnc2N...,ULXXDDBFHOBEHA-CWDCEQMOSA-N,CHEMBL1173655
...,...,...,...,...,...,...,...,...,...,...,...,...
464,Vistusertib,MTOR,0,462.2,2.5,92.7,8,1,4,C[C@H]1COCCN1c2c3ccc(nc3nc(n2)N4CCOC[C@@H]4C)c...,JUSFANSTBFGBAF-IRXDYDNUSA-N,CHEMBL2336325
465,Volasertib,PLK1,1,618.4,4.3,106.2,9,2,10,CC[C@H]1N(c2nc(ncc2N(C1=O)C)Nc3c(cc(cc3)C(=O)N...,SXNJFOWDRLKDSF-STROYTFGSA-N,CHEMBL1233528
466,Voruciclib,CDK4,0,469.1,4.3,94.1,6,3,3,CN1CC[C@H]([C@@H]1CO)c2c(cc(c3c2oc(cc3=O)c4ccc...,MRPGRAKIAJJGMM-OCCSQVGLSA-N,CHEMBL3905910
467,Voxtalisib,PIK3CA,0,270.1,1.1,102.5,6,2,2,CCn1c2c(cc(c1=O)c3ccn[nH]3)c(nc(n2)N)C,RGHYDLZMTYDBDT-UHFFFAOYSA-N,CHEMBL3545366
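
Because this is an inner merge, every inhibitor without a scraped ChEMBL ID is dropped, which is why the table shrinks from 492 to 468 rows (Acalisib, for example, no longer appears). To see which names are lost, a left merge with `indicator=True` can be used. This is a minimal sketch on toy data (`main`, `ids`, `checked` and `missing` are hypothetical names), not part of the pipeline itself.

```python
import pandas as pd

# Toy stand-ins for df1 and the ChEMBL_ID dataframe.
main = pd.DataFrame({
    "INN_Name": ["Abemaciclib", "Abemaciclib", "Acalisib"],
    "Targets": ["CDK4", "CDK6", "PIK3CA"],
})
ids = pd.DataFrame({"INN_Name": ["Abemaciclib"], "ChEMBL_ID": ["CHEMBL3301610"]})

# how="left" keeps every row of main; the _merge column records where each row matched.
checked = main.merge(ids, on="INN_Name", how="left", indicator=True)
missing = sorted(checked.loc[checked["_merge"] == "left_only", "INN_Name"].unique())
print(missing)  # ['Acalisib']
```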


In [14]:
#Create an empty list to contain all of our Inhibitor structure image links
image_list = []

for Name in df1['INN_Name']: #for each inhibitor under the INN_Name column of our main dataframe
    image_link = "http://www.icoa.fr/pkidb/static/img/mol/"+Name+".svg" #create the image link by inserting the inhibitor name
                                                                        #into a url template
    image_list.append(image_link) #then append that image link to our list

#Then we simply add a new image_link column to our dataframe, filled with our list of image links
df1["image_link"] = image_list
df1

Unnamed: 0,INN_Name,Targets,RoF,MW,LogP,TPSA,HBA,HBD,NRB,Smiles,InChi_Key,ChEMBL_ID,image_link
0,Abemaciclib,CDK4,1,506.3,4.9,75.0,8,1,7,CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc5c(c(c4...,UZWDCWONPYILKI-UHFFFAOYSA-N,CHEMBL3301610,http://www.icoa.fr/pkidb/static/img/mol/Abemac...
1,Abemaciclib,CDK6,1,506.3,4.9,75.0,8,1,7,CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc5c(c(c4...,UZWDCWONPYILKI-UHFFFAOYSA-N,CHEMBL3301610,http://www.icoa.fr/pkidb/static/img/mol/Abemac...
2,Acalabrutinib,BTK,0,465.2,3.3,118.5,7,2,4,CC#CC(=O)N1CCC[C@H]1c2nc(c3n2ccnc3N)c4ccc(cc4)...,WDENQIQQYWYTPO-IBGZPJMESA-N,CHEMBL3707348,http://www.icoa.fr/pkidb/static/img/mol/Acalab...
3,Adavosertib,WEE1,1,500.3,2.9,104.3,10,2,7,CC(C)(c1cccc(n1)n2c3c(cnc(n3)Nc4ccc(cc4)N5CCN(...,BKWJAKQVGHWELA-UHFFFAOYSA-N,CHEMBL1976040,http://www.icoa.fr/pkidb/static/img/mol/Adavos...
4,Afatinib,EGFR,0,485.2,4.4,88.6,7,2,8,CN(C)C/C=C/C(=O)Nc1cc2c(cc1O[C@H]3CCOC3)ncnc2N...,ULXXDDBFHOBEHA-CWDCEQMOSA-N,CHEMBL1173655,http://www.icoa.fr/pkidb/static/img/mol/Afatin...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
464,Vistusertib,MTOR,0,462.2,2.5,92.7,8,1,4,C[C@H]1COCCN1c2c3ccc(nc3nc(n2)N4CCOC[C@@H]4C)c...,JUSFANSTBFGBAF-IRXDYDNUSA-N,CHEMBL2336325,http://www.icoa.fr/pkidb/static/img/mol/Vistus...
465,Volasertib,PLK1,1,618.4,4.3,106.2,9,2,10,CC[C@H]1N(c2nc(ncc2N(C1=O)C)Nc3c(cc(cc3)C(=O)N...,SXNJFOWDRLKDSF-STROYTFGSA-N,CHEMBL1233528,http://www.icoa.fr/pkidb/static/img/mol/Volase...
466,Voruciclib,CDK4,0,469.1,4.3,94.1,6,3,3,CN1CC[C@H]([C@@H]1CO)c2c(cc(c3c2oc(cc3=O)c4ccc...,MRPGRAKIAJJGMM-OCCSQVGLSA-N,CHEMBL3905910,http://www.icoa.fr/pkidb/static/img/mol/Voruci...
467,Voxtalisib,PIK3CA,0,270.1,1.1,102.5,6,2,2,CCn1c2c(cc(c1=O)c3ccn[nH]3)c(nc(n2)N)C,RGHYDLZMTYDBDT-UHFFFAOYSA-N,CHEMBL3545366,http://www.icoa.fr/pkidb/static/img/mol/Voxtal...
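
The links are built purely from the inhibitor name, so a name containing a space or another special character would give an invalid URL. Whether PKIDB has such names is not something we checked; the sketch below simply percent-encodes the name to be safe. The function name `build_image_link` is hypothetical, and the code does not verify that an image actually exists at the resulting address.

```python
from urllib.parse import quote

BASE_URL = "http://www.icoa.fr/pkidb/static/img/mol/"

def build_image_link(name):
    """Build the structure image URL for one inhibitor name, percent-encoding unsafe characters."""
    return BASE_URL + quote(name) + ".svg"

print(build_image_link("Abemaciclib"))
```

With pandas the same function could be applied to the whole column, e.g. `df1["INN_Name"].map(build_image_link)`.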


In [16]:
#Finally, we insert another column to act as our ID column so that we have a primary key for the inhibitor table in our database
df1.insert(0, 'INHIBITOR_ID', range(1, 1 + len(df1)))
df1

Unnamed: 0,INHIBITOR_ID,INN_Name,Targets,RoF,MW,LogP,TPSA,HBA,HBD,NRB,Smiles,InChi_Key,ChEMBL_ID,image_link
0,1,Abemaciclib,CDK4,1,506.3,4.9,75.0,8,1,7,CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc5c(c(c4...,UZWDCWONPYILKI-UHFFFAOYSA-N,CHEMBL3301610,http://www.icoa.fr/pkidb/static/img/mol/Abemac...
1,2,Abemaciclib,CDK6,1,506.3,4.9,75.0,8,1,7,CCN1CCN(CC1)Cc2ccc(nc2)Nc3ncc(c(n3)c4cc5c(c(c4...,UZWDCWONPYILKI-UHFFFAOYSA-N,CHEMBL3301610,http://www.icoa.fr/pkidb/static/img/mol/Abemac...
2,3,Acalabrutinib,BTK,0,465.2,3.3,118.5,7,2,4,CC#CC(=O)N1CCC[C@H]1c2nc(c3n2ccnc3N)c4ccc(cc4)...,WDENQIQQYWYTPO-IBGZPJMESA-N,CHEMBL3707348,http://www.icoa.fr/pkidb/static/img/mol/Acalab...
3,4,Adavosertib,WEE1,1,500.3,2.9,104.3,10,2,7,CC(C)(c1cccc(n1)n2c3c(cnc(n3)Nc4ccc(cc4)N5CCN(...,BKWJAKQVGHWELA-UHFFFAOYSA-N,CHEMBL1976040,http://www.icoa.fr/pkidb/static/img/mol/Adavos...
4,5,Afatinib,EGFR,0,485.2,4.4,88.6,7,2,8,CN(C)C/C=C/C(=O)Nc1cc2c(cc1O[C@H]3CCOC3)ncnc2N...,ULXXDDBFHOBEHA-CWDCEQMOSA-N,CHEMBL1173655,http://www.icoa.fr/pkidb/static/img/mol/Afatin...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
464,465,Vistusertib,MTOR,0,462.2,2.5,92.7,8,1,4,C[C@H]1COCCN1c2c3ccc(nc3nc(n2)N4CCOC[C@@H]4C)c...,JUSFANSTBFGBAF-IRXDYDNUSA-N,CHEMBL2336325,http://www.icoa.fr/pkidb/static/img/mol/Vistus...
465,466,Volasertib,PLK1,1,618.4,4.3,106.2,9,2,10,CC[C@H]1N(c2nc(ncc2N(C1=O)C)Nc3c(cc(cc3)C(=O)N...,SXNJFOWDRLKDSF-STROYTFGSA-N,CHEMBL1233528,http://www.icoa.fr/pkidb/static/img/mol/Volase...
466,467,Voruciclib,CDK4,0,469.1,4.3,94.1,6,3,3,CN1CC[C@H]([C@@H]1CO)c2c(cc(c3c2oc(cc3=O)c4ccc...,MRPGRAKIAJJGMM-OCCSQVGLSA-N,CHEMBL3905910,http://www.icoa.fr/pkidb/static/img/mol/Voruci...
467,468,Voxtalisib,PIK3CA,0,270.1,1.1,102.5,6,2,2,CCn1c2c(cc(c1=O)c3ccn[nH]3)c(nc(n2)N)C,RGHYDLZMTYDBDT-UHFFFAOYSA-N,CHEMBL3545366,http://www.icoa.fr/pkidb/static/img/mol/Voxtal...


In [18]:
#Writing our final dataframe into a CSV file
df1.to_csv('Inhibitor_Table.csv')
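
By default `to_csv` also writes the dataframe index as a first, unnamed column, which reappears as 'Unnamed: 0' when the file is read back. Since INHIBITOR_ID already serves as our key, passing `index=False` may be preferable. The sketch below illustrates the difference on a toy dataframe written to in-memory buffers (`toy` and the buffer names are hypothetical); it does not touch Inhibitor_Table.csv.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "INHIBITOR_ID": [1, 2],
    "INN_Name": ["Abemaciclib", "Abemaciclib"],
    "Targets": ["CDK4", "CDK6"],
})

# Write once with the default index and once without it.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)

without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)

cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_with)     # ['Unnamed: 0', 'INHIBITOR_ID', 'INN_Name', 'Targets']
print(cols_without)  # ['INHIBITOR_ID', 'INN_Name', 'Targets']
```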