## Data needed for this notebook to work
- [EPA ToxCast data](https://www.epa.gov/chemical-research/exploring-toxcast-data#Download) (this notebook uses version 3.3)
- [PhenomeXcan data](https://github.com/hakyimlab/phenomexcan) 
- [SIDER side effects or indications](http://sideeffects.embl.de/download/)

Put these files into the `input_data` directory.

## First, get CUIs for each UK Biobank phenotype

In [11]:
import pandas as pd 

phenomeXcan = 'input_data/'
spredi = pd.read_csv(phenomeXcan + "s-predixcan-phenomexcan.csv",sep=",")
spredi = spredi.loc[~spredi['full_code'].str.startswith(tuple("20114|20112|20113|20107|20110|2976|20111".split("|"))) &
                   ~spredi['full_code'].str.lower().str.contains("medication")]
spredi = spredi.loc[spredi['n_cases'] > 200,:]

toget = []
icds = []
ncis = []
cancercode = []
for i in spredi['description']:
    if i.startswith("Non-cancer"):
        nci = i.split(":")[1].strip()
        toget.append(nci)
        ncis.append(nci)
    elif i.startswith("Diagnoses - main ICD10"):
        code = i.split(":")[1].split()[0].strip()
        toget.append(code)
        icds.append(code)
    elif i.startswith("Cancer code"):
        code = i.split(":")[1].strip()
        toget.append(code)
        cancercode.append(code)
    else:
        toget.append(i)
        

In [11]:
## use UMLS table to look up CUI for coded phenotypes
icd10 = pd.read_csv("icd10",sep="|",usecols=[0,13],header=None)
icdcui = {i:list(set(icd10.loc[icd10[13]==i,0].values)) for i in icds}

## for non-coded phenotypes, use API to get CUIs
left = set(toget)- set(icds) #- set(ncis) - set(cancercode)
left = [l for l in left if not l.startswith("Job ")]
with open("missing_phe.txt",'w') as f:
    f.write("\n".join(left) + "\n")

Use the script `umls_search_cui.sh` to get the matching possible CUIs. `bash umls_search_cui.sh missing_phe.txt` (requires UMLS API scripts to be available locally).

Once run, read in the results:

In [12]:
def read_res(fn):
    """Read `phenotype%%%CUI,CUI,...` lines into a {phenotype: [CUIs]} dict."""
    illcui = {}
    with open(fn) as f:
        for i in f:
            li = i.strip().split("%%%")
            # drop the trailing comma before splitting the CUI list
            illcui[li[0]] = li[1].strip(",").split(",")
    return illcui
nonicd = read_res("resmissing_phe.txt")
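
As a minimal sketch of the line format that `read_res` appears to expect (a phenotype name, the `%%%` separator, then a comma-separated CUI list with a trailing comma), the parsing of a single line can be tried in isolation. The helper name `parse_result_line` and the CUIs below are hypothetical placeholders, not output of the UMLS script.

```python
def parse_result_line(line, sep="%%%"):
    """Split one result line into (phenotype, [CUIs])."""
    name, _, rest = line.strip().partition(sep)
    # Strip the trailing comma, then drop any empty entries.
    cuis = [c.strip() for c in rest.strip(",").split(",") if c.strip()]
    return name, cuis

# Hypothetical line; the CUIs are placeholders, not real lookups.
example = "asthma%%%C0000001,C0000002,\n"
name, cuis = parse_result_line(example)
print(name, cuis)
```

Unlike `read_res`, this sketch returns an empty list when no CUI was found, rather than a list holding one empty string.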

Now combine the ICD matches with the searched matches to get the CUIs for each phenotype. The mapping is not perfect, so we will check how it lines up with the SIDER phenotypes.

In [13]:
cuis = {k:icdcui[k] if k in icdcui else
       nonicd[k] if k in nonicd else
        [] if k.startswith("Job ") else 'WHAT' for k in toget}

gwascuis = {dd:cuis[toget[i]] for i, dd in enumerate(spredi["description"])}
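
The lookup order used above (ICD10 table first, then the searched matches, then an empty list for job phenotypes) can be written as a small function, which may be easier to read than the nested conditional. This is a sketch on toy inputs: `resolve_cuis` is a hypothetical helper, the CUIs are placeholders, and it returns `None` where the cell above uses the `'WHAT'` marker.

```python
def resolve_cuis(key, icd_map, search_map):
    """Return CUIs for a phenotype key, preferring ICD10 matches."""
    if key in icd_map:
        return icd_map[key]
    if key in search_map:
        return search_map[key]
    if key.startswith("Job "):
        return []      # job phenotypes are deliberately left unmapped
    return None        # unexpected key: flag it for inspection

# Toy inputs; the CUIs are placeholders, not real UMLS identifiers.
icd_map = {"M54": ["C0000001"]}
search_map = {"gout": ["C0000002"], "M54": ["C0000003"]}
resolved = {k: resolve_cuis(k, icd_map, search_map)
            for k in ["M54", "gout", "Job coding: teacher", "unknown"]}
print(resolved)
```

Any key that resolves to `None` was in neither lookup, so it is worth checking before continuing.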


## Now, load SIDER to match phenotypes using CUIs

Load SIDER and get the MedDRA terms and UMLS CUIs.

In [31]:
## this file is obtained directly from SIDER
sider = pd.read_csv("input_data/meddra_all_se.tsv",sep="\t",header=None)
sider['cid'] = sider[0].str.slice(4).map(int)

sider.columns = ['stitch_flat','stitch_stereo','UMLS_label', 'type','meddra','se_name','pubchem_cid']
pt = sider.loc[sider['type']=='PT',].drop_duplicates(['stitch_flat','meddra']).copy()

from collections import Counter
pt_ae = pd.DataFrame(Counter(pt['meddra']),index=['ct']).transpose()
ptfreq = pt.loc[pt['meddra'].isin(pt_ae.loc[(pt_ae['ct']>4),:].index),:]

Now we use the CUIs to link MedDRA terms to UK Biobank phenotypes.

In [14]:
all_se = ptfreq.loc[:,['meddra','se_name']].drop_duplicates('meddra')
ukmatch = {}
matchnames = []
manual = {}
for i in all_se['meddra']:
    mat = [k for k,v in gwascuis.items() if i in v]
    ukmatch[i] = mat
    matchnames.append('"' + '","'.join(mat) + '"')  # quote every matched name
    if len(mat):
        manual[all_se.loc[all_se['meddra']==i,'se_name'].values[0]] = mat
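
The loop above scans every phenotype for each MedDRA CUI. An alternative, sketched here under the assumption that each phenotype maps to a list of CUIs, is to invert the mapping once and then look each CUI up directly. The function name `invert_cui_map` and the toy data are illustrative only.

```python
from collections import defaultdict

def invert_cui_map(pheno_to_cuis):
    """Build {CUI: [phenotypes]} from {phenotype: [CUIs]}."""
    cui_to_phenos = defaultdict(list)
    for pheno, cuis in pheno_to_cuis.items():
        if not isinstance(cuis, (list, tuple, set)):
            continue  # skip marker values that are not CUI collections
        for cui in cuis:
            cui_to_phenos[cui].append(pheno)
    return cui_to_phenos

# Toy data; the CUIs are placeholders, not real UMLS identifiers.
toy = {
    "Non-cancer illness code, self-reported: back pain": ["C0000001"],
    "Diagnoses - main ICD10: M54 Dorsalgia": ["C0000001", "C0000002"],
    "unmatched phenotype": "WHAT",
}
index = invert_cui_map(toy)
print(index["C0000001"])
```

A CUI that matches nothing is simply absent from the index, so `index.get(cui, [])` gives the same empty result as the list comprehension.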

In [15]:
manual

{'Anaemia': ['Non-cancer illness code, self-reported: anaemia'],
 'Arrhythmia': ['Non-cancer illness code, self-reported: heart arrhythmia',
  'Non-cancer illness code, self-reported: irregular heart beat'],
 'Atrial fibrillation': ['Non-cancer illness code, self-reported: atrial fibrillation'],
 'Back pain': ['Non-cancer illness code, self-reported: back pain',
  'Dorsalgia',
  'Diagnoses - main ICD10: M54 Dorsalgia'],
 'Bronchitis': ['Non-cancer illness code, self-reported: bronchitis',
  'Bronchitis'],
 'Chest pain': ['Chest pain or discomfort'],
 'Constipation': ['Non-cancer illness code, self-reported: constipation'],
 'Cough': ['Diagnoses - main ICD10: R05 Cough'],
 'Depression': ['Non-cancer illness code, self-reported: depression',
  'Depression'],
 'Dysgeusia': ['Non-cancer illness code, self-reported: gout'],
 'Dyspepsia': ['Non-cancer illness code, self-reported: dyspepsia / indigestion',
  'Diagnoses - main ICD10: K30 Dyspepsia'],
 'Rash': ['Diagnoses - main ICD10: R21 Rash

Then, a student removed some spurious or too-general linkages (e.g., "Neoplasms"). The curated result is saved in the file `manual_matching_curation.py`.

This is a dictionary where the keys are MedDRA terms in SIDER and the values are phenotypes in UK Biobank/PhenomeXcan.

In [10]:
import sys

sys.path.append("./code")
import manual_matching_curation
manual = manual_matching_curation.get_manual()

In [12]:
sider2ukb = {}
for k in manual:
    x = spredi.loc[spredi['description'].isin(manual[k]),:].sort_values('n_cases',ascending=False).iloc[0,:]
    sider2ukb[k] = [x['description'], x['n_cases']]

from collections import defaultdict
grouped = defaultdict(list)
for k,v in sider2ukb.items():
    grouped[v[0]].append(k)
    
sidername2ukbname = {}
for k, v in grouped.items():
    for sname in v:
        sidername2ukbname[sname] = k

In [None]:
pt_match = ptfreq.loc[ptfreq['se_name'].isin(sidername2ukbname),:].copy()
pt_match['ukb'] = [sidername2ukbname[n] for n in pt_match['se_name']]

pt_match['ae'] = 1
ptstack = pt_match.loc[:,['pubchem_cid','ukb','ae']].drop_duplicates(['pubchem_cid','ukb']).set_index(['pubchem_cid','ukb']).sort_index() #.unstack().shape
ptstack = ptstack.transpose().stack('ukb')
ptstack = ptstack.mask(pd.isnull(ptstack), other=0)
ptstack = ptstack.droplevel(0,0)
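
For comparison, a similar phenotype-by-drug indicator matrix can be sketched with `pd.crosstab` on a toy table. This is not the notebook's method, only an illustration of the target shape (phenotypes as rows, PubChem CIDs as columns, 1 where a side effect is recorded); the CIDs and phenotype names below are made up.

```python
import pandas as pd

# Toy side-effect table; CIDs and phenotype names are placeholders.
toy = pd.DataFrame({
    "pubchem_cid": [1, 1, 2, 2],
    "ukb": ["Asthma", "Asthma", "Asthma", "Gout"],
})

# Count drug/phenotype pairs, then cap counts at 1 to get an indicator.
matrix = pd.crosstab(toy["ukb"], toy["pubchem_cid"]).clip(upper=1)
print(matrix)
```

Pairs that never occur are filled with 0, which plays the role of the `mask(..., other=0)` step above.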

Completed side effect data matched to PhenomeXcan.

## Processing ToxCast data
First, create a list of all compound SMILES and use [PubChem](https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi) to convert these to PubChem CIDs.

In [20]:

epa = pd.read_csv("input_data/INVITRODBv3_20181017.csv",sep=",")

y = epa.loc[epa['SMILES']!="-",'SMILES']
z = [i.split()[0] if len(i.split()) > 1 else i for i in y ]

### make file for pubchem upload
with open("intermediate_files/smiles_clean",'w') as f:
    f.write("\n".join(z) + "\n")
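
The cleaning step above drops the `-` placeholder and keeps only the first whitespace-separated token of each SMILES string. A standalone sketch of the same idea follows; `clean_smiles` is a hypothetical helper and the strings are toy examples, with the trailing annotation standing in for whatever extra text follows the structure.

```python
def clean_smiles(values):
    """Drop '-' placeholders and keep the first token of each SMILES string."""
    cleaned = []
    for s in values:
        if s == "-":
            continue               # placeholder for a missing structure
        tokens = s.split()
        if tokens:                 # also skips empty or whitespace-only entries
            cleaned.append(tokens[0])
    return cleaned

# Toy input: ethanol, a placeholder, and acetic acid with a trailing annotation.
print(clean_smiles(["CCO", "-", "CC(=O)O |note|"]))
```

Unlike the list comprehension above, this version also skips empty strings. It assumes every value is a string, so missing values would need to be removed first.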


Load in the result saved in `smiles_invitrodb`

In [25]:
  
smiles2cid = pd.read_csv("intermediate_files/smiles_invitrodb",sep="\t")
smiles2cid.columns = ['SMILES','cid']
smiles2cid = smiles2cid.drop_duplicates()

epa2 = epa.merge(smiles2cid,on='SMILES',how='left')
epa2['casrn-clean'] = epa2['CASRN'].str.replace("-","").str.replace("_","")

Now get the SIDER PubChem IDs. We will keep only these compounds (ToxCast contains many others).

In [32]:
subsel = epa2.loc[epa2['cid'].isin(sider['pubchem_cid']),'casrn-clean'].values

In [33]:
len(subsel)

430

Go through the various databases and extract the level 5 data, particularly the endpoint (`aenm`), the drug tested (`code`), and the result (`hit_pct`).

In [None]:
dbs = {}
import glob
import os
for i in glob.glob("INVITRODB_V3_3_SUMMARY/EXPOR*csv"):
    name = os.path.basename(i).replace("EXPORT_LVL5&6_","").replace("_200730.csv","")
    dbs[name] = []
    for ep in pd.read_csv(i,sep=",", chunksize=50000):
        code = ep['code'].mask(pd.isnull(ep['code']),other='').str.slice(1)
        ep['code-cln']= code    
        print(' chems=',
             len(set(code)),  "/", 
             len(set(code) - set(epa2['casrn-clean'])), 
              ' endpoints=', len(set(ep['aenm'].mask(pd.isnull(ep['aenm']),other=''))))
        dbs[name].append(ep.loc[ep['code-cln'].isin(subsel),])
    dbs[name]= pd.concat(dbs[name],axis=0)
    
cols = ['code-cln', 'aenm', 'cnst_prob', 'hill_prob', 'gnls_prob',
        'hit_pct', 'total_hitc', 'cnst_pct', 'hill_pct', 'gnls_pct']
x = pd.concat([v.loc[:, cols].groupby(['code-cln', 'aenm']).agg('mean').reset_index()
               for v in dbs.values()], axis=0)
x['cid'] = epa2.set_index("casrn-clean").loc[list(x['code-cln'].values),'cid'].values

xstk = x.loc[:,['cid','aenm','hit_pct']].set_index(['cid','aenm']).sort_index().transpose().stack('aenm')
xstk = xstk.droplevel(0,0)
xstk.to_csv("epa.txt")
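
The chunked read, filter, and average pattern used above can be tried on a small in-memory CSV. This is a sketch under assumptions: the column names mirror the ones used in this notebook, but the codes, endpoints, and values are toy data, and the chunk size is set to 2 only to force several chunks.

```python
import io
import pandas as pd

# Toy stand-in for one level 5 export file.
csv_text = """code,aenm,hit_pct
C50000,assay_a,0.2
C50000,assay_a,0.4
C64175,assay_a,0.9
C50000,assay_b,1.0
"""
keep = {"50000"}  # cleaned codes of the compounds we want to retain

parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Drop the one-character prefix so codes match the cleaned CASRN format.
    chunk["code-cln"] = chunk["code"].fillna("").str.slice(1)
    parts.append(chunk.loc[chunk["code-cln"].isin(keep)])

filtered = pd.concat(parts, axis=0)
# Average repeated measurements per (compound, endpoint) pair.
summary = filtered.groupby(["code-cln", "aenm"])["hit_pct"].mean().reset_index()
print(summary)
```

Filtering inside the loop keeps memory use bounded by the chunk size plus the retained rows, which matters because the full exports are large.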