# Description
### (April 26 2020)
* In this notebook, we use [PubChemPy](https://pubchempy.readthedocs.io/en/latest/guide/gettingstarted.html) to obtain the SMILES and InChI strings for all compounds found in KEGG (see `nb0_parseKEGGcompunds.ipynb`).

* We repeat the SMILES extraction for the compounds in `../tables/nature_supplementary_compounds.tsv`. Their PubChem IDs were found by searching PubChemPy with each CAS number as a synonym name, as follows:

```python
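def getPubchemFromName(compound_name):
    # Search PubChem by name; a CAS number is matched as a synonym
    results = pcp.get_compounds(compound_name, 'name')
    if len(results) == 1:
        # Exactly one hit: return its PubChem compound ID (CID) as a string
        return str(results[0].cid)
    else:
        # Zero hits or more than one hit: flag the entry for manual review
        return 'multiple'

pubchem_numbers = [getPubchemFromName(n) for n in nature_compounds['CAS-number']]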
```
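Because the lookup above uses the CAS number as the search string, a mistyped CAS number would simply return no hit. As a minimal sketch (not part of the original workflow), the CAS check digit can be verified before querying PubChem. The helper name `is_valid_cas` is hypothetical; the three CAS numbers are the ones shown in the table preview in this notebook.

```python
import re

# CAS format: 2-7 digits, 2 digits, 1 check digit, separated by hyphens
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

def is_valid_cas(cas_number):
    """Return True if the string has CAS format and a correct check digit.

    Hypothetical helper for illustration; it is not part of PubChemPy.
    """
    match = CAS_PATTERN.match(str(cas_number).strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one
    total = sum(weight * int(d)
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit

print([is_valid_cas(n) for n in ["116-63-2", "521-24-4", "528-44-9"]])
# → [True, True, True]
```

Entries that fail this check could be corrected by hand before running the PubChem search.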

In [2]:
# Imports
import pubchempy as pcp
import pandas as pd

## Get SMILES and InChI strings from PubChem IDs in `../tables/nature_supplementary_compounds.tsv`

In [25]:
nature_compounds = pd.read_csv('../tables/nature_supplementary_compounds.tsv', sep='\t')
nature_compounds.head(3)

Unnamed: 0,Name,CAS-number,MW,Pubchem-id
0,1-Amino-2-naphthol-4-sulfonic acid,116-63-2,239.25,8316
1,"1,2-Naphthoquinone-4-sulfonic acid sodium salt",521-24-4,260.2,516996
2,"1,2,4-Benzenetricarboxylic acid",528-44-9,210.14,10708


In [26]:
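# Function to extract string representations from PubChem
def getStringRepresentations(pubchem_id):
    """Return (isomeric SMILES, canonical SMILES, InChI) for a PubChem CID."""
    cid = int(pubchem_id)
    # One request to PubChem per compound
    c = pcp.Compound.from_cid(cid)
    return c.isomeric_smiles, c.canonical_smiles, c.inchi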
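Note that `getStringRepresentations` calls `int(pubchem_id)`, which raises an error for any entry that is not a number, such as the `'multiple'` placeholder returned by the CAS lookup or a missing value. The following is a minimal sketch of a tolerant wrapper, assuming such entries should be skipped and recorded as `None`. The names `safe_string_representations` and `fake_fetch` are hypothetical; `fake_fetch` is an offline stand-in that returns placeholder text rather than real SMILES, so the example runs without network access.

```python
def safe_string_representations(pubchem_id, fetch):
    """Call fetch(cid) for a usable CID, else return (None, None, None).

    `fetch` is any function with the same signature and return shape as
    getStringRepresentations (assumption made for this sketch).
    """
    try:
        cid = int(pubchem_id)
    except (TypeError, ValueError):
        # Covers 'multiple', None, and NaN from an empty table cell
        return None, None, None
    if cid <= 0:
        return None, None, None
    return fetch(cid)

def fake_fetch(cid):
    # Offline stand-in: placeholder strings, NOT real chemical identifiers
    return f"iso-{cid}", f"cano-{cid}", f"inchi-{cid}"

print(safe_string_representations("8316", fake_fetch))
print(safe_string_representations("multiple", fake_fetch))
```

In the notebook, `getStringRepresentations` would be passed in place of `fake_fetch`, and rows that come back as `None` could be filtered out or curated by hand.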

In [27]:
# Get strings
pubchem_numbers = list(nature_compounds['Pubchem-id'])
isomeric_smiles = []
canonical_smiles = []
inchi_strings = []
for pubchem_id in pubchem_numbers:
    iso_smiles, cano_smiles, inchi = getStringRepresentations(pubchem_id)
    isomeric_smiles.append(iso_smiles)
    canonical_smiles.append(cano_smiles)
    inchi_strings.append(inchi)

In [31]:
# Save results to dataframe
reference_df = pd.DataFrame()
reference_df['Pubchem-id'] = pubchem_numbers
reference_df['Isomeric-SMILES'] = isomeric_smiles
reference_df['Canonical-SMILES'] = canonical_smiles
reference_df['InChI'] = inchi_strings
reference_df.to_csv('../data/reference_compound_strings.tsv', sep='\t', index=False)