# Description
### (April 26 2020)
* In this notebook, we use the [BioPython-enabled API](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc271) to download all compounds from the KEGG database.
* We will then try to retrieve as many of the compounds listed in /tables/nature_supplementary_cas_no.tsv as possible, keeping each one for which we find a matching CAS number.
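The CAS matching described above could look roughly like the following sketch. The column name `CAS` for the supplementary table is an assumption (the header of nature_supplementary_cas_no.tsv is not shown in this notebook), and the small in-memory frames with illustrative values stand in for the real files. `Kegg-id` and `CAS-number` are the column names used when the KEGG table is saved at the end of this notebook.

```python
import pandas as pd

# Stand-in for the KEGG table built in this notebook (values are illustrative)
kegg = pd.DataFrame({
    'Kegg-id': ['cpd:C00001', 'cpd:C00002', 'cpd:C00003'],
    'CAS-number': ['7732-18-5', '56-65-5', ''],
})

# Stand-in for the supplementary table; the 'CAS' column name is hypothetical
supp = pd.DataFrame({'CAS': ['56-65-5', '50-00-0']})

# Drop KEGG rows without a CAS number so empty strings can never match
kegg_with_cas = kegg[kegg['CAS-number'] != '']

# Left join keeps every supplementary compound; the indicator marks matches
matched = supp.merge(kegg_with_cas, how='left', left_on='CAS',
                     right_on='CAS-number', indicator=True)
n_matched = int((matched['_merge'] == 'both').sum())
print('{} of {} supplementary compounds matched'.format(n_matched, len(supp)))
```

A left join is used here so that unmatched supplementary compounds stay visible in the result instead of silently disappearing.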

In [1]:
# Imports
from Bio.KEGG import REST
import pandas as pd
import time

In [2]:
# Read all KEGG compounds
compounds = REST.kegg_list('compound').read()

In [4]:
# Build dictionary of compound-id -> compound-name
kegg_compounds = {}
for line in compounds.rstrip().split('\n'):
    entry,description = line.split('\t')
    kegg_compounds[entry] = description
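
As a quick offline check of the parsing step, the same logic can be run on a short sample string. The helper name `parse_kegg_list` and the sample text are illustrative assumptions; the sample follows the tab-separated `entry<TAB>description` layout that the loop above relies on. Splitting with `split('\t', 1)` is a defensive choice in this sketch, so that a stray tab inside a description would not break the unpacking.

```python
def parse_kegg_list(text):
    """Map each entry id to its description from tab-separated list text."""
    mapping = {}
    for line in text.rstrip().split('\n'):
        if not line:
            continue  # skip blank lines (e.g. an empty response)
        # Split on the first tab only, so the description is kept whole
        entry, description = line.split('\t', 1)
        mapping[entry] = description
    return mapping

# Illustrative sample in the same layout as the kegg_list output
sample = "cpd:C00001\tH2O; Water\ncpd:C00002\tATP; Adenosine 5'-triphosphate\n"
sample_compounds = parse_kegg_list(sample)
print(sample_compounds['cpd:C00001'])
```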

In [5]:
# Get KEGG compound entry ids
kegg_ids = list(kegg_compounds.keys())

In [8]:
print('{} KEGG compound ids found'.format(len(kegg_ids)))

18700 KEGG compound ids found


In [17]:
# Mine all external DB identifiers for each compound

#NOTE this will take a while!
start_time = time.time()

pubchem_numbers = []
chembl_numbers = []
cas_numbers = []

for i, compound in enumerate(kegg_ids):

    # Log progress
    if i%1000 == 0:  
        print('Processed {} compounds'.format(i))
    
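    # Drop the 'cpd:' prefix and fetch the flat-file record for this compound
    c = compound.replace('cpd:','')
    test = REST.kegg_get(c).read()
    
    # str.index raises ValueError when a label is absent from the record;
    # in that case an empty string is stored so all lists stay aligned
    try:
        pubchem = test[test.index('PubChem'):].split('\n')[0].replace('PubChem: ','')
        pubchem_numbers.append(pubchem)
    except ValueError:
        pubchem_numbers.append('')
    
    try:
        chembl = test[test.index('ChEMBL'):].split('\n')[0].replace('ChEMBL: ','')
        chembl_numbers.append(chembl)
    except ValueError:
        chembl_numbers.append('')
    
    try:
        cas = test[test.index('CAS:'):].split('\n')[0].replace('CAS: ','')
        cas_numbers.append(cas)
    except ValueError:
        cas_numbers.append('')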

print('Total time elapsed: {} seconds'.format(time.time() - start_time))

Processed 6400 compounds
Processed 6500 compounds
Processed 6600 compounds
Processed 6700 compounds
Processed 6800 compounds
Processed 6900 compounds
Processed 7000 compounds
Processed 7100 compounds
Processed 7200 compounds
Processed 7300 compounds
Processed 7400 compounds
Processed 7500 compounds
Processed 7600 compounds
Processed 7700 compounds
Processed 7800 compounds
Processed 7900 compounds
Processed 8000 compounds
Processed 8100 compounds
Processed 8200 compounds
Processed 8300 compounds
Processed 8400 compounds
Processed 8500 compounds
Processed 8600 compounds
Processed 8700 compounds
Processed 8800 compounds
Processed 8900 compounds
Processed 9000 compounds
Processed 9100 compounds
Processed 9200 compounds
Processed 9300 compounds
Processed 9400 compounds
Processed 9500 compounds
Processed 9600 compounds
Processed 9700 compounds
Processed 9800 compounds
Processed 9900 compounds
Processed 10000 compounds
Processed 10100 compounds
Processed 10200 compounds
Processed 10300 compounds
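
The identifier extraction in the loop above can also be written as a small helper that only matches labels at the start of a DBLINKS line, which avoids picking up the same word elsewhere in a record. This is a sketch under assumptions: the names `extract_dblink`, `split_records` and `chunked` are hypothetical, and the sample record is an abbreviated, illustrative flat-file entry. The batching helpers are included because, to our knowledge, the KEGG `get` operation accepts up to 10 entries per request (and `REST.kegg_get` accepts a list of ids), with records in the response terminated by `///`; no network call is made in this sketch.

```python
import re

def extract_dblink(record, label):
    """Return the value after 'label:' on a DBLINKS line, or '' if absent."""
    pattern = r'^(?:DBLINKS)?[ \t]+' + re.escape(label) + r':[ \t]*(.*)$'
    match = re.search(pattern, record, flags=re.MULTILINE)
    return match.group(1).strip() if match else ''

def split_records(text):
    """Split a multi-entry response into records on the '///' terminator."""
    parts = re.split(r'^///[ \t]*$', text, flags=re.MULTILINE)
    return [part for part in parts if part.strip()]

def chunked(items, size=10):
    """Return consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Abbreviated, illustrative record in KEGG flat-file layout
sample_record = (
    'ENTRY       C00001                      Compound\n'
    'NAME        H2O;\n'
    '            Water\n'
    'DBLINKS     CAS: 7732-18-5\n'
    '            PubChem: 3303\n'
    '            ChEMBL: CHEMBL1098659\n'
    '///\n'
)

for label in ('PubChem', 'ChEMBL', 'CAS'):
    print(label, extract_dblink(sample_record, label))
```

With these helpers, the main loop could iterate over `chunked(kegg_ids)` and parse each record returned by `split_records`, which would cut the number of requests roughly tenfold.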

In [None]:
# Save results to a dataframe
kegg_table = pd.DataFrame({
    'Kegg-id': kegg_ids,
    'Pubchem-id': pubchem_numbers,
    'Chembl-id': chembl_numbers,
    'CAS-number': cas_numbers,
})
kegg_table.to_csv('../tables/kegg_compounds.tsv', sep='\t', index=False)
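
When the saved table is read back later, pandas would by default parse the numeric PubChem ids as numbers and turn the empty strings into NaN. The following is a minimal sketch of a round trip that keeps every identifier as text. It uses an in-memory buffer instead of ../tables/kegg_compounds.tsv, and the rows are illustrative (`cpd:C99999` is a made-up id for a compound without external links).

```python
import io
import pandas as pd

# Illustrative rows using the same column names as the saved table
table = pd.DataFrame({
    'Kegg-id': ['cpd:C00001', 'cpd:C99999'],
    'Pubchem-id': ['3303', ''],
    'CAS-number': ['7732-18-5', ''],
})

# Write to an in-memory buffer with the same options as the notebook
buffer = io.StringIO()
table.to_csv(buffer, sep='\t', index=False)
buffer.seek(0)

# dtype=str keeps ids as text; keep_default_na=False keeps '' instead of NaN
reloaded = pd.read_csv(buffer, sep='\t', dtype=str, keep_default_na=False)
print(reloaded)
```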

In [34]:
print('Finished with success')

Finished with success
