# Description
### (April 26 2020)
* In this notebook, we use [PubChemPy](https://pubchempy.readthedocs.io/en/latest/guide/gettingstarted.html) to obtain the SMILES and InChI strings for all compounds listed in `../tables/nature_supplementary_compounds.tsv`.

* Unlike `nb1`, we search for these compounds by passing the CAS number as the name (synonym) query in PubChemPy.

In [None]:
# Imports
import pubchempy as pcp
import pandas as pd
import time

## Get SMILES and InChI strings from Pubchem IDs in `../tables/nature_supplementary_compounds.tsv`

In [None]:
nature_compounds = pd.read_csv('../tables/nature_supplementary_compounds.tsv', sep='\t')
nature_compounds.head(3)

In [None]:
def getPubchemFromName(compound_name):
    """Return the PubChem CID (as a string) for a name or synonym query."""
    results = pcp.get_compounds(compound_name, 'name')
    if len(results) == 1:
        return str(results[0].cid)
    else:
        # NOTE: reached for zero matches as well as for several; search these CAS numbers manually to complete the table
        return 'multiple'
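
Entries that come back as `'multiple'` need manual curation, so it helps to list them explicitly. The helper below is a minimal sketch: the name `findUnresolved` and the example values are hypothetical and do not come from the table.

```python
def findUnresolved(queries, results, placeholder='multiple'):
    """Return the queries whose lookup result equals the placeholder."""
    return [q for q, r in zip(queries, results) if r == placeholder]

# Illustrative inputs only; in the notebook these would be the CAS-number
# column and the list returned by mapping getPubchemFromName over it.
example_cas = ['50-00-0', '64-17-5', '67-56-1']
example_results = ['712', 'multiple', '887']
print(findUnresolved(example_cas, example_results))
```

Once the lookup has run, something like `findUnresolved(list(nature_compounds['CAS-number']), pubchem_numbers)` would give the CAS numbers to search by hand.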

In [None]:
#NOTE We will use CAS-number instead of name as our query in PubChem
pubchem_numbers = [getPubchemFromName(n) for n in list(nature_compounds['CAS-number'])]
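
The string-retrieval loop in this notebook calls `getStringRepresentations`, which is not defined here. The original helper is not shown, so the version below is only an assumed sketch of what it might look like. It relies on `pcp.Compound.from_cid` and the `isomeric_smiles`, `canonical_smiles`, and `inchi` attributes of a PubChemPy `Compound`; these attribute names may differ in newer PubChemPy releases. The `fetch` and `pause` parameters are additions made for this sketch so that it can be tried without network access.

```python
import time
from types import SimpleNamespace

def getStringRepresentations(pubchem_id, fetch=None, pause=0.2):
    """Return (isomeric SMILES, canonical SMILES, InChI) for a PubChem CID.

    Assumed implementation; the notebook's own helper is not shown.
    """
    if fetch is None:
        # Imported lazily so the sketch can be exercised offline with a stub
        import pubchempy as pcp
        fetch = pcp.Compound.from_cid
    compound = fetch(int(pubchem_id))
    time.sleep(pause)  # short pause between requests to avoid hammering the service
    return compound.isomeric_smiles, compound.canonical_smiles, compound.inchi

# Offline check with a stand-in object (placeholder values, not PubChem data)
fake = SimpleNamespace(isomeric_smiles='iso', canonical_smiles='cano', inchi='inchi')
print(getStringRepresentations('123', fetch=lambda cid: fake, pause=0))
```

Note that `int(pubchem_id)` raises an error for missing IDs or for the `'multiple'` placeholder, so those rows would need to be resolved or filtered out first.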

In [None]:
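# Get strings
# NOTE: pubchem_numbers is reassigned here to the table's 'Pubchem-id' column, replacing the lookup results above
pubchem_numbers = list(nature_compounds['Pubchem-id'])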
isomeric_smiles = []
canonical_smiles = []
inchi_strings = []
for pubchem_id in pubchem_numbers:
    iso_smiles, cano_smiles, inchi = getStringRepresentations(pubchem_id)
    isomeric_smiles.append(iso_smiles)
    canonical_smiles.append(cano_smiles)
    inchi_strings.append(inchi)

In [None]:
# Save results to dataframe
reference_df = pd.DataFrame()
reference_df['Pubchem-id'] = pubchem_numbers
reference_df['Isomeric-SMILES'] = isomeric_smiles
reference_df['Canonical-SMILES'] = canonical_smiles
reference_df['InChI'] = inchi_strings
reference_df.to_csv('../data/reference_compound_strings.tsv', sep='\t', index=False)