# Merging LTKB data with TG Gates

Author: Tim Dudgeon (tdudgeon@informaticsmatters.com)

This notebook illustrates how to join data between multiple datasets.
We have a set of structures from TG Gates and want to add information from the LTKB dataset so that we can use the DILI outcomes as a machine learning label and use the TG Gates data for model generation.

Both datasets have PubChem CIDs so it will be easy to join by these values. But we want to be able to check that the process is accurate (there may be incorrect CIDs in the data).

We also want to try joining the data by finding identical structures in the two sets. This is a bit more complex, but RDKit makes it possible.

In [131]:
from IPython.display import Image
import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
PandasTools.molRepresentation='svg'

IPythonConsole.molSize = (500,300)

In [132]:
# read the LTKB data from the CSV file
ltkb = pd.read_csv('ltkb.csv', sep=',')
ltkb.dtypes

LTKBID                             object
PubChem_CID                       float64
LabelCompoundName                  object
ApprovalYear                      float64
DILIConcern                        object
SeverityClass                       int64
LabelSection                       object
AdjudicatedDILI                    object
vDILIConcern                       object
Greene_Annotation                  object
Sakatis_Annotation                 object
Xu_Annotation                      object
Zhu_Annotation                     object
SMILES                             object
URL (accessed on 1/21/2016)        object
VER_DISP (1=LTKB-BD, 2=others)      int64
dtype: object

In [133]:
# let's look at the shape of the LTKB dataset
ltkb.shape

(1036, 16)

In [134]:
# read the TG Gates data from the tab-separated file
tggates = pd.read_csv('tg_gates.tab', sep='\t')
tggates.dtypes

CHEMBLID            object
PubChem CID          int64
Compound            object
StdInChIKey         object
CANONICAL_SMILES    object
dtype: object

In [135]:
# let's look at the shape of the TG Gates dataset
tggates.shape

(151, 5)

First let's try joining on the PubChem compound identifier (CID), as that's present in both sets.
Some LTKB rows have no CID, which is why pandas read that column as float64 rather than as an integer. We need to remove those rows and convert the column to the same integer datatype as is found in the TG Gates data.

In [136]:
# drop rows that have no PubChem CID, then convert the column from float to integer
ltkb.dropna(subset=['PubChem_CID'], how='any', inplace=True)
ltkb['PubChem_CID'] = ltkb['PubChem_CID'].astype('int64')
ltkb

Unnamed: 0,LTKBID,PubChem_CID,LabelCompoundName,ApprovalYear,DILIConcern,SeverityClass,LabelSection,AdjudicatedDILI,vDILIConcern,Greene_Annotation,Sakatis_Annotation,Xu_Annotation,Zhu_Annotation,SMILES,URL (accessed on 1/21/2016),"VER_DISP (1=LTKB-BD, 2=others)"
0,LT01185,5361919,ceftriaxone,1984.0,Less-DILI-Concern,4,Adverse reactions,Yes,vLess-DILI-Concern,HH,non-hepatotixic,Positive,Postive,CN1C(=NC(=O)C(=O)N1)SCC2=C(N3[C@@H]([C@@H](C3=O)NC(=O)/C(=N/OC)/C4=CSC(=N4)N)SC2)C(=O)O,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=7c6b995b-264f-494c-8c10-6c12b02c214e,2
1,LT01842,5578,trimethoprim,1980.0,Less-DILI-Concern,4,Adverse reactions,Yes,vLess-DILI-Concern,HH,non-hepatotixic,Positive,Postive,COC1=CC(=CC(=C1OC)OC)CC2=CN=C(N=C2N)N,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=24daed1d-d20b-4b39-bfa3-efaf05c00380,2
2,LT00036,5353990,tetracycline,1953.0,Less-DILI-Concern,2,Warnings and precautions,Yes,vLess-DILI-Concern,HH,Hepatotoxic,Positive,Postive,CC1(C2CC3C(C(=O)/C(=C(\N)/O)/C(=O)C3(C(=O)C2=C(C4=C1C=CC=C4O)O)O)N(C)C)O,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=eaa5b1bd-0927-46e4-b941-32fcc492d258,1
3,LT00289,2955,dapsone,1979.0,Less-DILI-Concern,3,Warnings and precautions,Yes,vLess-DILI-Concern,HH,Hepatotoxic,Positive,Postive,C1=CC(=CC=C1N)S(=O)(=O)C2=CC=C(C=C2)N,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=0792169d-c6f9-4af0-93ae-b75d710c47a9,2
4,LT00166,1046,pyrazinamide,1971.0,Less-DILI-Concern,3,Warnings and precautions,Yes,vLess-DILI-Concern,HH,Hepatotoxic,Positive,Postive,C1=CN=C(C=N1)C(=O)N,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=ecee3128-a47f-4ced-a1db-4f58c5f3e9c6,1
5,LT00098,3339,fenofibrate,1993.0,Less-DILI-Concern,3,Warnings and precautions,Yes,vLess-DILI-Concern,HH,Hepatotoxic,Positive,Postive,CC(C)OC(=O)C(C)(C)OC1=CC=C(C=C1)C(=O)C2=CC=C(C=C2)Cl,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=2fe0bbd6-47a4-4836-beb1-0b635e3ab647,1
6,LT00013,2907,cyclophosphamide,1959.0,Less-DILI-Concern,5,Adverse reactions,Yes,vLess-DILI-Concern,HH,Hepatotoxic,Positive,Postive,C1CNP(=O)(OC1)N(CCCl)CCCl,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=369e6911-66bd-4544-9901-cf887fdf1dc2,1
7,LT00068,2726,chlorpromazine,1957.0,Less-DILI-Concern,2,Adverse reactions,Yes,vLess-DILI-Concern,HH,Hepatotoxic,Positive,Postive,CN(C)CCCN1C2=CC=CC=C2SC3=C1C=C(C=C3)Cl,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=b9d57be1-17cb-42ac-8334-ef3494be7f17,1
8,LT00335,31703,doxorubicin,1974.0,Less-DILI-Concern,3,Adverse reactions,Yes,vLess-DILI-Concern,,Hepatotoxic,Positive,Postive,C[C@H]1[C@H]([C@H](C[C@@H](O1)O[C@H]2C[C@@](CC3=C(C4=C(C(=C23)O)C(=O)C5=C(C4=O)C=CC=C5OC)O)(C(=O)CO)O)N)O,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=21d9c619-7e94-49e2-ac41-31e9ea96554a,2
9,LT01225,29029,clindamycin,1970.0,Less-DILI-Concern,3,Warnings and precautions,Yes,vLess-DILI-Concern,HH,,Positive,Postive,CCC[C@@H]1C[C@H](N(C1)C)C(=O)NC([C@@H]2[C@@H]([C@@H]([C@H]([C@H](O2)SC)O)O)O)C(C)Cl,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=ab7233b1-b68e-458b-9e1d-4fc3fa15e8b6,2
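
Before merging, it can be worth confirming that the key column really has an integer dtype and that no CID occurs more than once, since duplicated keys would multiply rows in the join. The following is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration; only the column name `PubChem_CID` comes from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the LTKB table; the values are illustrative only.
toy = pd.DataFrame({
    'LTKBID': ['A1', 'A2', 'A3', 'A4'],
    'PubChem_CID': [1986.0, np.nan, 2519.0, 2519.0],
})

# A missing value forces pandas to store the column as float64.
print(toy['PubChem_CID'].dtype)

# Drop rows without a CID, then convert to a plain integer dtype.
toy = toy.dropna(subset=['PubChem_CID']).copy()
toy['PubChem_CID'] = toy['PubChem_CID'].astype('int64')

# Any CID that occurs more than once will produce extra rows in a merge.
dupes = toy[toy['PubChem_CID'].duplicated(keep=False)]
print(dupes)
```

Running the same `duplicated` check on the real `ltkb` and `tggates` frames would show whether the row count of the merged table can be read as a count of distinct compounds.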


In [137]:
# rename the PubChem CID field so that it's the same as in the LTKB data
tggates.rename({'PubChem CID':'PubChem_CID'}, axis='columns', inplace=True)
tggates

Unnamed: 0,CHEMBLID,PubChem_CID,Compound,StdInChIKey,CANONICAL_SMILES
0,CHEMBL273386,1493,"2,4-dinitrophenol",UFBJCMHMOXMLKC-UHFFFAOYSA-N,Oc1ccc(cc1[N+](=O)[O-])[N+](=O)[O-]
1,CHEMBL351487,11831,2-nitrofluorene,XFOHWECQTFIEIX-UHFFFAOYSA-N,[O-][N+](=O)c1ccc2c(Cc3ccccc23)c1
2,CHEMBL1566,444254,acarbose,XUFXOAAUWZOOIT-SXARVLRPSA-N,C[C@H]1O[C@H](O[C@H]2[C@H](O)[C@@H](O)[C@@H](O[C@H]3[C@H](O)[C@@H](O)[C@H](O)O[C@@H]3CO)O[C@@H]2CO)[C@H](O)[C@@H](O)[C@@H]1N[C@H]4C=C(CO)[C@@H](O)[C@H](O)[C@H]4O
3,CHEMBL16081,178,acetamide,DLFVBJFMPXGRIB-UHFFFAOYSA-N,CC(=O)N
4,CHEMBL311469,5897,acetamidofluorene,CZIHNRWJTSTCEX-UHFFFAOYSA-N,CC(=O)Nc1ccc2c(Cc3ccccc23)c1
5,CHEMBL20,1986,acetazolamide,BZKPWHYZMXOIDC-UHFFFAOYSA-N,CC(=O)Nc1nnc(s1)S(=O)(=O)N
6,CHEMBL1697694,186907,aflatoxin b1,OQIQSTLJSLGHID-WNWIJWBNSA-N,COc1cc2O[C@H]3OC=C[C@H]3c2c4OC(=O)C5=C(CCC5=O)c14
7,CHEMBL2105617,70691408,ajmaline,CJDRUOGAGYHKKD-FUIWMBJSSA-N,CCC1C2CC3C4N(C)c5ccccc5C46CC(C2[C@H]6O)N3C1O
8,CHEMBL54349,54897,alpidem,JRTIDHTUMYMPRU-UHFFFAOYSA-N,CCCN(CCC)C(=O)Cc1c(nc2ccc(Cl)cn12)c3ccc(Cl)cc3
9,CHEMBL267345,5280965,amphotericin b,APKFDSVGJQXUKY-INPOYWNPSA-N,C[C@@H]1OC(=O)C[C@H](O)C[C@H](O)CC[C@@H](O)[C@H](O)C[C@H](O)C[C@]2(O)C[C@H](O)[C@H]([C@H](C[C@@H](O[C@@H]3O[C@H](C)[C@@H](O)[C@H](N)[C@@H]3O)\C=C\C=C\C=C\C=C\C=C\C=C\C=C\[C@H](C)[C@@H](O)[C@H]1C)O2)C(=O)O


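Merge the data using the common PubChem_CID values. By default `pd.merge` performs an inner join, so only rows whose CID appears in both sets are kept.
We end up with 89 rows of data, compared with 151 rows in the TG Gates set.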

In [138]:
joinedByPubchemCID = pd.merge(tggates, ltkb, on='PubChem_CID')
joinedByPubchemCID.shape

(89, 20)
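
Because the default inner join silently drops rows that have no partner, it can help to audit what was left behind. One option, sketched here on toy data, is an outer join with `indicator=True`. The frames `left` and `right` are hypothetical stand-ins, and which compound appears in which toy frame is arbitrary and says nothing about the real datasets.

```python
import pandas as pd

# Toy stand-ins for the TG Gates and LTKB tables.
left = pd.DataFrame({
    'PubChem_CID': [178, 1986, 2519],
    'Compound': ['acetamide', 'acetazolamide', 'caffeine'],
})
right = pd.DataFrame({
    'PubChem_CID': [1986, 2519, 5578],
    'LabelCompoundName': ['acetazolamide', 'caffeine', 'trimethoprim'],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
audit = pd.merge(left, right, on='PubChem_CID', how='outer', indicator=True)

matched = audit[audit['_merge'] == 'both']
unmatched_left = audit[audit['_merge'] == 'left_only']

print(len(matched), 'matched;', len(unmatched_left), 'left rows without a match')
print(unmatched_left[['PubChem_CID', 'Compound']])
```

The `left_only` rows are the ones an inner join would have discarded, so they are the natural starting point when checking for missing or incorrect CIDs.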

In [139]:
joinedByPubchemCID

Unnamed: 0,CHEMBLID,PubChem_CID,Compound,StdInChIKey,CANONICAL_SMILES,LTKBID,LabelCompoundName,ApprovalYear,DILIConcern,SeverityClass,LabelSection,AdjudicatedDILI,vDILIConcern,Greene_Annotation,Sakatis_Annotation,Xu_Annotation,Zhu_Annotation,SMILES,URL (accessed on 1/21/2016),"VER_DISP (1=LTKB-BD, 2=others)"
0,CHEMBL20,1986,acetazolamide,BZKPWHYZMXOIDC-UHFFFAOYSA-N,CC(=O)Nc1nnc(s1)S(=O)(=O)N,LT00498,acetazolamide,1953.0,Most-DILI-Concern,8,Warnings and precautions,Yes,vMost-DILI-Concern,HH,,Positive,,CC(=O)NC1=NN=C(S1)S(=O)(=O)N,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=a0dc2da1-f42f-483a-83de-56ad81b728df,2
1,CHEMBL54349,54897,alpidem,JRTIDHTUMYMPRU-UHFFFAOYSA-N,CCCN(CCC)C(=O)Cc1c(nc2ccc(Cl)cn12)c3ccc(Cl)cc3,LT01060,alpidem,,Most-DILI-Concern,8,Withdrawn,,vMost-DILI-Concern,HH,Hepatotoxic,Positive,,CCCN(CCC)C(=O)CC1=C(N=C2N1C=C(C=C2)Cl)C3=CC=C(C=C3)Cl,,1
2,CHEMBL1089221,2313,bendazac,BYFMCKSPFYVMOU-UHFFFAOYSA-N,OC(=O)COc1nn(Cc2ccccc2)c3ccccc13,LT01117,bendazac,,Most-DILI-Concern,8,Withdrawn,,vMost-DILI-Concern,,,,,C1=CC=C(C=C1)CN2C3=CC=CC=C3C(=N2)OCC(=O)O,,1
3,CHEMBL232201,6237,benziodarone,CZCHIEJNWPNBDE-UHFFFAOYSA-N,CCc1oc2ccccc2c1C(=O)c3cc(I)c(O)c(I)c3,LT01121,benziodarone,,Most-DILI-Concern,8,Withdrawn,,vMost-DILI-Concern,,,,,CCC1=C(C2=CC=CC=C2O1)C(=O)C3=CC(=C(C(=C3)I)O)I,,1
4,CHEMBL49,2477,buspirone,QWCRAEMEVRGPNT-UHFFFAOYSA-N,O=C1CC2(CCCC2)CC(=O)N1CCCCN3CCN(CC3)c4ncccn4,LT01148,buspirone,1986.0,Less-DILI-Concern,3,Adverse reactions,,Ambiguous DILI-concern,NE,non-hepatotixic,Negative,,C1CCC2(C1)CC(=O)N(C(=O)C2)CCCCN3CCN(CC3)C4=NC=CC=N4,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=3451ced4-f2a3-476e-a279-b96c8c3f0acf,1
5,CHEMBL113,2519,caffeine,RYYVLZVUVIJVGH-UHFFFAOYSA-N,CN1C(=O)N(C)c2ncn(C)c2C1=O,LT01888,caffeine,1999.0,No-DILI-Concern,0,No match,,vNo-DILI-Concern,NE,non-hepatotixic,Negative,,CN1C=NC2=C1C(=O)N(C(=O)N2C)C,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=fbcd234e-57e1-4da2-a96c-e3f7b7cf6ec0,2
6,CHEMBL1560,44093,captopril,FAKRSMQSSFJEIM-RQJHMYQMSA-N,C[C@H](CS)C(=O)N1CCC[C@H]1C(=O)O,LT00059,captopril,1981.0,Less-DILI-Concern,7,Adverse reactions,Yes,vLess-DILI-Concern,HH,,Positive,Postive,C[C@H](CS)C(=O)N1CCC[C@H]1C(=O)O,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=9f7dae34-0112-4c11-b032-4f951e139705,1
7,CHEMBL1467,2094,allopurinol,OFCNXPDARWKPPY-UHFFFAOYSA-N,O=C1N=CN=C2NNC=C12,LT00043,allopurinol,1966.0,Most-DILI-Concern,8,Warnings and precautions,Yes,vMost-DILI-Concern,HH,Hepatotoxic,Positive,Postive,C1=C2C(=NC=NC2=O)NN1,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=73cd79c1-6bab-4d7b-ae8b-0176efbaf5b9,1
8,CHEMBL633,2157,amiodarone,IYIKLHRQXLHMJQ-UHFFFAOYSA-N,CCCCc1oc2ccccc2c1C(=O)c3cc(I)c(OCCN(CC)CC)c(I)c3,LT00046,amiodarone,1985.0,Most-DILI-Concern,8,Box warning,Yes,vMost-DILI-Concern,HH,Hepatotoxic,Positive,Postive,CCCCC1=C(C2=CC=CC=C2O1)C(=O)C3=CC(=C(C(=C3)I)OCCN(CC)CC)I,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=51be39a2-9134-402c-95ad-defe47406ff8,1
9,CHEMBL629,2160,amitriptyline,KRMDCWKBEZIMAB-UHFFFAOYSA-N,CN(C)CCC=C1c2ccccc2CCc3ccccc13,LT00047,amitriptyline,1961.0,Less-DILI-Concern,5,Adverse reactions,Yes,vLess-DILI-Concern,WE,Hepatotoxic,Negative,Postive,CN(C)CCC=C1C2=CC=CC=C2CCC3=CC=CC=C31,http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=091b3ffc-ccef-47f9-b13a-ef2812e02630,2


Now let's look at joining by chemical structure. We will do this by generating canonical SMILES for each structure and comparing those. But the molecules are not ready for this. In both datasets (especially LTKB) we have multi-fragment molecules, and we can't be sure they have been represented in the same way in both sets, e.g. they could be in different charged forms, or as Kekulé forms in one set and aromatic in the other. To fix this we use the RDKit standardizer, which does its best to produce a 'standard' representation that gives the best chance of matching identical molecules. Note that this is not a perfect process and we should look carefully at the results.

In [140]:
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit import RDLogger

# the standardizer currently emits a lot of log output, so we'll restrict logging to errors for the moment
RDLogger.logger().setLevel(RDLogger.ERROR)

uncharger = rdMolStandardize.Uncharger()

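# define the standardize function
def standardize(smiles):
    # parse the SMILES; RDKit returns None if it cannot be parsed
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None:
        # keep the largest organic fragment (typically dropping counter-ions)
        mol = rdMolStandardize.FragmentParent(mol)
        # neutralize charges where possible
        mol = uncharger.uncharge(mol)
    return mol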

# Apply standardization and add the molecules to the data frame.
# Even though the TG Gates data claims to be canonical smiles we don't know what chemistry engine
# was used for this so we must canonicalize both sets with RDKit to be sure.
tggates['MOL'] = tggates['CANONICAL_SMILES'].apply(standardize)
ltkb['MOL'] = ltkb['SMILES'].apply(standardize)

# and revert logging back to normal
RDLogger.logger().setLevel(RDLogger.INFO)
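
As a quick sanity check of the idea, the same two standardization steps can be tried on a few small molecules whose equivalence is obvious by eye. This is a sketch only: the helper name `canonical_parent_smiles` and the test molecules are assumptions chosen for illustration, and the behaviour described in the comments is what we would expect rather than a guarantee for every RDKit version.

```python
from rdkit import Chem, RDLogger
from rdkit.Chem.MolStandardize import rdMolStandardize

# Restrict RDKit logging to errors for this check.
RDLogger.logger().setLevel(RDLogger.ERROR)

uncharger = rdMolStandardize.Uncharger()

def canonical_parent_smiles(smiles):
    """Hypothetical helper: parent fragment, neutralized, as canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.FragmentParent(mol)
    mol = uncharger.uncharge(mol)
    return Chem.MolToSmiles(mol)

# Kekule and aromatic benzene should give the same string.
benzene_kekule = canonical_parent_smiles('C1=CC=CC=C1')
benzene_aromatic = canonical_parent_smiles('c1ccccc1')

# Sodium acetate should reduce to the same parent as acetic acid.
acetate_salt = canonical_parent_smiles('CC(=O)[O-].[Na+]')
acetic_acid = canonical_parent_smiles('CC(=O)O')

print(benzene_kekule, benzene_aromatic)
print(acetate_salt, acetic_acid)
```

If pairs like these did not come out identical, a join on the resulting strings would miss genuine matches, which is the reason for standardizing both sets with the same toolkit.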

In [141]:
# Now generate canonical SMILES for all molecules
tggates['CANSMI'] = tggates['MOL'].apply(lambda mol: Chem.MolToSmiles(mol))
ltkb['CANSMI'] = ltkb['MOL'].apply(lambda mol: Chem.MolToSmiles(mol))

In [142]:
# and join using the canonical smiles.
joinedBySmiles = pd.merge(tggates, ltkb, on='CANSMI')
joinedBySmiles.shape
# We end up with 86 rows, 3 fewer than when we used the PubChem CIDs.
# That's pretty good, but we should really look at this more carefully.

(86, 24)
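
One way to follow up on the difference is to compare which compound pairs each join produced, and to look for rows in the structure-based join where the two CIDs disagree. The sketch below uses toy frames. `by_cid` and `by_smiles` are hypothetical stand-ins for `joinedByPubchemCID` and `joinedBySmiles`, and the IDs and CIDs are invented. The `_x` and `_y` suffixes are what pandas adds to overlapping non-key columns in a merge.

```python
import pandas as pd

# Toy stand-ins for the CID-based and SMILES-based joins.
by_cid = pd.DataFrame({
    'CHEMBLID': ['C1', 'C2', 'C3'],
    'LTKBID': ['L1', 'L2', 'L3'],
})
by_smiles = pd.DataFrame({
    'CHEMBLID': ['C1', 'C2', 'C4'],
    'LTKBID': ['L1', 'L2', 'L4'],
    'PubChem_CID_x': [10, 20, 40],
    'PubChem_CID_y': [10, 20, 41],
})

cid_pairs = set(zip(by_cid['CHEMBLID'], by_cid['LTKBID']))
smiles_pairs = set(zip(by_smiles['CHEMBLID'], by_smiles['LTKBID']))

# Pairs found by only one method are the ones worth inspecting by eye.
only_cid = cid_pairs - smiles_pairs
only_smiles = smiles_pairs - cid_pairs

# Same structure but different CIDs: a possible CID error,
# or one set recording a salt form and the other the parent.
cid_conflicts = by_smiles[by_smiles['PubChem_CID_x'] != by_smiles['PubChem_CID_y']]

print(only_cid, only_smiles)
print(cid_conflicts)
```

Pairs that appear only in the CID join point at structures that standardization failed to reconcile, while pairs that appear only in the SMILES join point at CIDs that differ between the two sources.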