# Metabolites
In the first part of this notebook we create a dataframe containing all the available information for the metabolites included in our reconstruction. This dataframe constitutes the **"Metabolites Sheet"** of the reconstruction. In the second part we retrieve missing information from external databases and identify duplicated metabolites in our dataset. <br><br>
[1. Generation of Metabolites dataset](#generation) <br>
&nbsp;&nbsp;&nbsp;&nbsp;**1.1 Retrieve a list of all the metabolites from our reconstruction** <br>
&nbsp;&nbsp;&nbsp;&nbsp;**1.2 Retrieve information from all the metabolites on Recon3D, iCHO2291 and iCHO1766**<br>
&nbsp;&nbsp;&nbsp;&nbsp;**1.3 Add all the metabolites information into our metabolites dataset** <br>
&nbsp;&nbsp;&nbsp;&nbsp;**1.4 Unique metabolite identification** <br><br>
[2. Retrieve Missing Information from Databases](#curation) <br>
&nbsp;&nbsp;&nbsp;&nbsp;**2.1 Update missing information in metabolites dataset from BiGG** <br>
&nbsp;&nbsp;&nbsp;&nbsp;**2.2 Update missing information in metabolites dataset from PubChem**<br>
&nbsp;&nbsp;&nbsp;&nbsp;**2.3 Homogenize information and Update Google Sheet file**
<br><br>
[3. Identification of Duplicated Metabolites](#duplicated) <br>
&nbsp;&nbsp;&nbsp;&nbsp;**3.1 Identification of duplicated metabolites by their Names and Formulas**<br>
&nbsp;&nbsp;&nbsp;&nbsp;**3.2 Identification of duplicated metabolites by their PubChem IDs**<br>
&nbsp;&nbsp;&nbsp;&nbsp;**3.3 Identification of duplicated metabolites by their Inchi**<br>

[4. Statistical Analysis of the Information in the Metabolites Dataset](#information) <br>
&nbsp;&nbsp;&nbsp;&nbsp;**4.1 Calculate the missing Information for Relevant Metabolites** <br>
&nbsp;&nbsp;&nbsp;&nbsp;**4.2 Update missing information in metabolites dataset from other databases** <br>
&nbsp;&nbsp;&nbsp;&nbsp;**4.3 Identification of duplicated metabolites** <br>

<a id='generation'></a>
## 1. Generation of Metabolites dataset
We start by creating a list of all the metabolites included in the reactions of our reconstruction (1). Then we create a dataset containing all the metabolite information from the Recon3D, iCHO2291 and iCHO1766 models, including supplementary information from Recon3D (2). We then map this information back onto the metabolites of our reconstruction and generate an Excel file for uploading into Google Sheets (3). Finally, we estimate how many duplicated metabolites we have in our dataset by counting repeated values in different identifiers (4).

In [None]:
# Import libraries
import gspread
import pandas as pd
import numpy as np
import requests
import time

import cobra
from cobra import Model
from cobra.io import read_sbml_model

from tqdm.notebook import tqdm

from google_sheet import GoogleSheet
from utils import df_to_dict

### 1.1 Retrieve a list of all the metabolites from our reconstruction
The list of all the reactions, together with the metabolites involved in them, is stored in the Rxns sheet of the Google Sheet.

In [None]:
KEY_FILE_PATH = 'credentials.json'
SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

# Initialize the GoogleSheet object
sheet = GoogleSheet(SPREADSHEET_ID, KEY_FILE_PATH)

# Read data from the Google Sheet and create the "rxns" df
sheet_rxns = 'Rxns'
rxns = sheet.read_google_sheet(sheet_rxns)

In [None]:
# Create a cobra model to identify the metabolites involved in our reconstruction
model = cobra.Model("iCHOxxxx")
lr = []

for _, row in rxns.iterrows():
    r = cobra.Reaction(row['Reaction'])
    lr.append(r)
    
model.add_reactions(lr)
model

In [None]:
# With the built-in method "build_reaction_from_string" we can identify the metabolites.
# Reactions were added in the same order as the rows of "rxns", so we index by position.
for i, r in enumerate(tqdm(model.reactions)):
    r.build_reaction_from_string(rxns['Reaction Formula'].iloc[i])

In [None]:
# We first create a list of the metabolites and then a pandas df with it
metabolites_list = []
for met in model.metabolites:
    metabolites_list.append(met.id)
    
metabolites = pd.DataFrame(metabolites_list, columns =['BiGG ID'])
metabolites
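
For intuition about what `build_reaction_from_string` extracts, the sketch below pulls metabolite IDs out of a reaction formula with plain string handling. It assumes that species are separated by ` + ` and that the arrow is one of `-->`, `<--` or `<=>`; the function name and the sample formula are hypothetical, and COBRApy's own parser remains the reference.

```python
import re

# Arrows assumed to appear in the 'Reaction Formula' column (an assumption).
ARROW_PATTERN = re.compile(r"\s(?:<=>|-->|<--)\s")

def metabolite_ids(formula: str) -> list[str]:
    """Return the metabolite IDs of a formula in order of first appearance."""
    ids = []
    for side in ARROW_PATTERN.split(formula):
        for term in side.split(" + "):
            tokens = term.split()
            if not tokens:
                continue  # empty side, e.g. an exchange reaction
            # The ID is the last token; a leading coefficient such as "2.0" is dropped.
            met_id = tokens[-1]
            if met_id not in ids:
                ids.append(met_id)
    return ids

example = "atp_c + h2o_c --> adp_c + h_c + pi_c"
print(metabolite_ids(example))
```
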

### 1.2 Retrieve information from all the metabolites on Recon3D, iCHO2291 and iCHO1766
We use two data sources for this. First, we take the metabolite ID, Name, Formula and Compartment from the Recon3D, iCHO2291 and iCHO1766 model files. We then add the metadata available for these metabolites in the Recon3D supplementary files.

In [None]:
# read the Recon3D model
recon3d_model = read_sbml_model('../Data/GPR_Curation/Recon3D.xml')

In [None]:
# Generate a dataset containing all the metabolites, chemical formula of each metabolite and compartment
num_rows = len(recon3d_model.metabolites)
recon3d_model_metabolites = pd.DataFrame(index=range(num_rows), columns=['BiGG ID', 'Name', 'Formula', 'Compartment'])
for i,met in enumerate(recon3d_model.metabolites):
    id_ = met.id
    name = met.name
    formula = met.formula
    comp = met.compartment
    recon3d_model_metabolites.iloc[i] = [id_, name, formula, comp]

In [None]:
recon3d_model_metabolites

In [None]:
# read the Yeo's model
iCHO2291_model = read_sbml_model('../Data/Reconciliation/models/iCHO2291.xml')

In [None]:
# Generate a dataset containing all the metabolites, chemical formula of each metabolite and compartment from Yeo's model
num_rows = len(iCHO2291_model.metabolites)
iCHO2291_model_metabolites = pd.DataFrame(index=range(num_rows), columns=['BiGG ID', 'Name', 'Formula', 'Compartment'])
for i,met in enumerate(iCHO2291_model.metabolites):
    id_ = met.id
    name = met.name
    formula = met.formula
    comp = met.compartment
    iCHO2291_model_metabolites.iloc[i] = [id_, name, formula, comp]
    
iCHO2291_model_metabolites['BiGG ID'] = iCHO2291_model_metabolites['BiGG ID'].str.replace("[", "_", regex=False)
iCHO2291_model_metabolites['BiGG ID'] = iCHO2291_model_metabolites['BiGG ID'].str.replace("]", "", regex=False)
iCHO2291_model_metabolites

In [None]:
# read Hefzi's model
iCHO1766_model = read_sbml_model('../Data/Reconciliation/models/iCHOv1_final.xml')

In [None]:
# Generate a dataset containing all the metabolites, chemical formula of each metabolite and compartment from Hefzi's model
num_rows = len(iCHO1766_model.metabolites)
iCHO1766_model_metabolites = pd.DataFrame(index=range(num_rows), columns=['BiGG ID', 'Name', 'Formula', 'Compartment'])
for i,met in enumerate(iCHO1766_model.metabolites):
    id_ = met.id
    name = met.name
    formula = met.formula
    comp = met.compartment
    iCHO1766_model_metabolites.iloc[i] = [id_, name, formula, comp]

iCHO1766_model_metabolites

In [None]:
models_metabolites = pd.concat([recon3d_model_metabolites, iCHO2291_model_metabolites, iCHO1766_model_metabolites])
models_metabolites = models_metabolites.groupby('BiGG ID').first()
models_metabolites = models_metabolites.reset_index(drop = False)
models_metabolites
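
`groupby('BiGG ID').first()` keeps, for every column, the first non-null value found for each ID, so the concatenation order (Recon3D, then iCHO2291, then iCHO1766) sets the precedence between models. The plain-Python sketch below illustrates that rule on hypothetical records; the function name is ours and is not part of the notebook.

```python
def merge_first_non_null(records, key="BiGG ID"):
    """Merge dict records by key, keeping the first non-None value per field."""
    merged = {}
    for record in records:
        target = merged.setdefault(record[key], {})
        for field, value in record.items():
            if field == key:
                continue
            # Only fill a field that is still missing.
            if target.get(field) is None:
                target[field] = value
    return merged

# Hypothetical rows in concatenation order: Recon3D first, then a CHO model.
rows = [
    {"BiGG ID": "glc__D_c", "Name": "D-Glucose", "Formula": None},
    {"BiGG ID": "glc__D_c", "Name": "Glucose", "Formula": "C6H12O6"},
]
merged = merge_first_non_null(rows)
print(merged)
```

Note that an empty string counts as a value under this rule, so a blank field coming from an earlier model is not overwritten by a later one.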

In [None]:
#Generation of a dataset containing all the information from Recon3D metabolites Supplementary Data.
recon3d_metabolites_meta = pd.read_excel('../Data/Metabolites/metabolites.recon3d.xlsx', header = 0)
recon3d_metabolites_meta['BiGG ID'] = recon3d_metabolites_meta['BiGG ID'].str.replace("[", "_", regex=False)
recon3d_metabolites_meta['BiGG ID'] = recon3d_metabolites_meta['BiGG ID'].str.replace("]", "", regex=False)
recon3d_metabolites_meta

In [None]:
# Transform "recon3d_metabolites_meta" into a dict to map it into "models_metabolites"
recon3dmet_dict = df_to_dict(recon3d_metabolites_meta, 'BiGG ID')

In [None]:
# Map the Recon3D metadata into the "models_metabolites" dataset
models_metabolites[['KEGG','CHEBI', 'PubChem','Inchi', 'Hepatonet', 'EHMNID', 'SMILES', 'INCHI2',
                          'CC_ID','Stereoisomer Information of Metabolite Identified', 'Charge of the Metabolite Identified',
    'CID_ID','PDB (ligand-expo) Experimental Coordinates  File Url', 'Pub Chem Url',
    'ChEBI Url']] = models_metabolites['BiGG ID'].apply(lambda x: pd.Series(recon3dmet_dict.get(x, None), dtype=object))

In [None]:
models_metabolites

In [None]:
# Transform the combined "models_metabolites" dataset into a dictionary to map it into our dataset
final_met_dict = df_to_dict(models_metabolites, 'BiGG ID')

### 1.3 Add all the metabolites information into our metabolites dataset
With the dictionary created in **1.2**, we map this information into the metabolites dataset created in **1.1**, which contains all the metabolites of our reconstruction.

In [None]:
metabolites[['Name', 'Formula', 'Compartment', 'KEGG','CHEBI', 'PubChem','Inchi', 'Hepatonet', 'EHMNID', 'SMILES',
             'INCHI2','CC_ID','Stereoisomer Information of Metabolite Identified', 'Charge of the Metabolite Identified',
    'CID_ID','PDB (ligand-expo) Experimental Coordinates  File Url', 'Pub Chem Url',
    'ChEBI Url']] = metabolites['BiGG ID'].apply(lambda x: pd.Series(final_met_dict.get(x, None), dtype=object))

In [None]:
# Update the Compartment column in the final dataset
compartment_names = {
    'c': 'c - cytosol',
    'l': 'l - lysosome',
    'm': 'm - mitochondria',
    'r': 'r - endoplasmic reticulum',
    'e': 'e - extracellular space',
    'x': 'x - peroxisome/glyoxysome',
    'n': 'n - nucleus',
    'g': 'g - golgi apparatus',
    'im': 'im - intermembrane space of mitochondria',
}
# Values without an entry in the mapping (including missing ones) are left unchanged
metabolites['Compartment'] = metabolites['Compartment'].map(lambda c: compartment_names.get(c, c))

In [None]:
# The dataset generated is stored as an Excel file in the "Data" folder
metabolites.to_excel('../Data/Metabolites/metabolites.xlsx')

### 1.4 Unique metabolite identification
This next block of code gives us an idea of how many duplicated metabolites we have in our generated dataset based on the IDs, Name, Formula and KEGG IDs.

In [None]:
##### ----- Generate datasets from Google Sheet ----- #####

#Credential file
KEY_FILE_PATH = 'credentials.json'

# #CHO Network Reconstruction + Recon3D_v2 Google Sheet ID
# SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

#CHO Network Reconstruction + Recon3D_v3 Google Sheet ID
SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

# Initialize the GoogleSheet object
sheet = GoogleSheet(SPREADSHEET_ID, KEY_FILE_PATH)

# Read data from the Google Sheet
sheet_met = 'Metabolites'
metabolites = sheet.read_google_sheet(sheet_met)

In [None]:
print("Duplicated metabolites by BiGG ID = ", len(metabolites['BiGG ID']) - len(metabolites['BiGG ID'].unique()))
print("Duplicated metabolites by Name = ", len(metabolites['Name']) - len(metabolites['Name'].unique()))
print("Duplicated metabolites by Formula = ", len(metabolites['Formula']) - len(metabolites['Formula'].unique()))
print("Duplicated metabolites by KEGG = ", len(metabolites['KEGG']) - len(metabolites['KEGG'].unique()))
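
Because `unique()` treats an empty cell as one more value, a column with many blanks reports every blank after the first as a duplicate. The sketch below counts repeats while skipping missing entries. The set of placeholders is an assumption based on the empty strings and `'NaN'` strings used elsewhere in this notebook, and the sample list is illustrative only.

```python
from collections import Counter

MISSING = {"", "NaN", "None", None}  # assumed placeholders for missing values

def count_duplicates(values):
    """Number of entries that repeat an earlier non-missing value."""
    counts = Counter(v for v in values if v not in MISSING)
    return sum(n - 1 for n in counts.values())

kegg_ids = ["C00031", "", "C00031", "", "NaN", "C00002"]
print(count_duplicates(kegg_ids))
```

The same metabolite also appears once per compartment, so repeated names and formulas are expected even when no true duplicate exists.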

<a id='curation'></a>
## 2. Retrieve Missing Information from Databases
In this second part of the notebook we fill in missing information in the metabolites dataset generated above. Since many metabolites have been manually curated in the "Metabolites" Google Sheet file, we generate a new dataframe using the GoogleSheet class so that the metabolites dataset includes all those changes.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import time
import requests
from bs4 import BeautifulSoup

import cobra
from cobra import Model, Reaction

from tqdm.notebook import tqdm

from google_sheet import GoogleSheet
from metabolite_identifiers import getPubchemCID, getChEMBLID, getCIDSmilesInChI, getCIDFormula, homogenize_info

In [None]:
#Generate the "metabolites" dataset from our Google Sheet file

#Credential file
KEY_FILE_PATH = 'credentials.json'

#CHO Network Reconstruction + Recon3D_v3 Google Sheet ID
SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

# Initialize the GoogleSheet object
sheet = GoogleSheet(SPREADSHEET_ID, KEY_FILE_PATH)

# Read data from the Google Sheet
sheet_met = 'Metabolites'
metabolites = sheet.read_google_sheet(sheet_met)

### 2.1 Update missing information in metabolites dataset from BiGG

In [None]:
# Get BiGG descriptive names from the BiGG database

# Unknown Mets: metabolites without names
# (copied, so that the rows can be edited before being written back to "metabolites")
unkown_mets = metabolites[metabolites['Name'] == ''].copy()

Descriptive_Names = [''] * len(unkown_mets)
Formulae = [''] * len(Descriptive_Names)
Changed = [True] * len(Descriptive_Names)

for Met_Counter, metID in enumerate(tqdm(unkown_mets['BiGG ID'].iloc[:])):
    print(Met_Counter)
    input_str = metID[:-2]
    response = requests.get(f"http://bigg.ucsd.edu/universal/metabolites/{input_str}")
    time.sleep(1)
    # Check if the request was successful
    if response.status_code != 200:
        D_Name = "BiGG ID not found in BiGG"
        Formulae_B = "BiGG ID not found in BiGG"
        Changed[Met_Counter] = False       
    else:    
        soup = BeautifulSoup(response.content, 'html.parser')
        # find() returns None when a header is missing, so check before reading the text
        N_Header = soup.find('h4', string='Descriptive name:')
        N_Formulae = soup.find('h4', string='Formulae in BiGG models: ')
        name_tag = N_Header.find_next_sibling('p') if N_Header is not None else None
        formula_tag = N_Formulae.find_next_sibling('p') if N_Formulae is not None else None
        D_Name = name_tag.text if name_tag is not None else "Name not found in BiGG"
        Formulae_B = formula_tag.text if formula_tag is not None else "Formula not found in BiGG"
    Descriptive_Names[Met_Counter] = D_Name
    Formulae[Met_Counter] = Formulae_B
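
The slice `metID[:-2]` removes a one-letter compartment suffix such as `_c`. If IDs in the two-letter `im` compartment occur, the same slice would leave a trailing underscore (`h_im` becomes `h_`). A minimal sketch of a suffix-aware alternative follows; the helper name is hypothetical and the compartment codes are the ones listed in section 1.3.

```python
# Compartment codes taken from the mapping in section 1.3.
COMPARTMENTS = {"c", "l", "m", "r", "e", "x", "n", "g", "im"}

def strip_compartment(met_id: str) -> str:
    """Remove a trailing '_<compartment>' suffix if one is present."""
    base, sep, suffix = met_id.rpartition("_")
    if sep and base and suffix in COMPARTMENTS:
        return base
    return met_id  # no recognised suffix: leave the ID untouched

for met_id in ["atp_c", "h_im", "zym_int2"]:
    print(met_id, "->", strip_compartment(met_id))
```
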

In [None]:
for Met_Counter, metID in enumerate(unkown_mets['BiGG ID']):
    print('before',unkown_mets['BiGG ID'].iloc[Met_Counter])
    print('before',unkown_mets['Formula'].iloc[Met_Counter])
    print('before',unkown_mets['Name'].iloc[Met_Counter])
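    # Assign through a single .iloc call to avoid chained assignment on a copy
    if unkown_mets['Formula'].iloc[Met_Counter] == '':
        unkown_mets.iloc[Met_Counter, unkown_mets.columns.get_loc('Formula')] = Formulae[Met_Counter]
    unkown_mets.iloc[Met_Counter, unkown_mets.columns.get_loc('Name')] = Descriptive_Names[Met_Counter]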
    print('..............................................')
    print('after',unkown_mets['BiGG ID'].iloc[Met_Counter])
    print('after',unkown_mets['Formula'].iloc[Met_Counter])
    print('after',unkown_mets['Name'].iloc[Met_Counter])
    print('..............................................')
    print('..............................................')
    print('..............................................')

In [None]:
metabolites.update(unkown_mets)

# Manual Curation
for bigg_id in metabolites['BiGG ID']:
    # xtra = Xanthurenic acid; C10H6NO4
    # http://bigg.ucsd.edu/models/iCHOv1/reactions/r0647
    if 'xtra' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'Xanthurenic acid'
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Formula'] = 'C10H6NO4'
    # chedxch = Bilirubin-monoglucuronoside; C39H42N4O122-
    # Reactions name = 'ATP-binding Cassette (ABC) TCDB:3.A.1.208.2' --> https://metabolicatlas.org/identifier/TCDB/3.A.1.208.2
    elif 'chedxch' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'Bilirubin-monoglucuronoside'
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Formula'] = 'C39H42N4O122-'
    # Source: ChatGPT
    elif '3hoc246_6Z_9Z_12Z_15Z_18Z_21Zcoa' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'CoA molecule that has a 24-carbon fatty acid with six double bonds, with the location of the double bonds specified by the numbers and Zs'
    # Source: ChatGPT
    elif 'c247_2Z_6Z_9Z_12Z_15Z_18Z_21Zcoa' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'CoA molecule that has a modified version of the same 24-carbon fatty acid, with a hydroxyl group added at the third carbon position'
    # Source: ChatGPT
    elif '3hoc143_5Z_8Z_11Zcoa' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'CoA molecule that has a 14-carbon fatty acid with three double bonds, with the location of the double bonds specified by the numbers and Zs.'
    # Source: ChatGPT
    elif '3oc143_5Z_8Z_11Zcoa' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'CoA molecule that has a modified version of the same 14-carbon fatty acid, with the hydroxyl group removed and one of the double bonds converted to a keto group'
    # Source: ChatGPT
    elif 'acgalgalacglcgalgluside' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'Complex glycosphingolipid that contains multiple sugar residues'

### 2.2 Update missing information in metabolites dataset from PubChem
Here we use functions from the **metabolite_identifiers** module to fetch Inchi, SMILES and database identifiers for the metabolites in our reconstruction.

In [None]:
# Get PubChem IDs using the getPubchemCID() function

counter = 0
no_match = [] #create an empty list with PubChem IDs that don't match with the formulas in the dataset
for i,met in tqdm(metabolites.iterrows()):
    cmp = met['Name']
    if met['PubChem']=='NaN':
        pubchem_id = getPubchemCID(cmp,'')
         
        if pubchem_id:
            if (len(pubchem_id)>1): #If there is more than 1 Pubchem ID, check which one correspond to our metabolite
                match_found = False
                for _id in pubchem_id:
                    form = getCIDFormula(_id)
                    
                    # Compare the formula obtained from the PubChem ID to the one in our dataset
                    if (form == met['Formula']):
                        match_found = True
                        metabolites.loc[i, 'PubChem'] = _id
                        print('Match found:'+met['BiGG ID'], _id)
                        break # break the loop as we found the match
                        
                if not match_found:  # if no match was found
                    _id = pubchem_id[0]  # take the first ID in the pubchem_id list
                    metabolites.loc[i, 'PubChem'] = _id  
                    print('No match found:'+met['BiGG ID'], pubchem_id)
                    no_match.append([met['BiGG ID'], pubchem_id])
                    
            # If there is only one ID associated to that metabolite        
            else:
                metabolites.loc[i, 'PubChem'] = pubchem_id[0]
                print(met['BiGG ID'], pubchem_id[0])
            counter +=1
            print(counter)
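
The selection rule used in the loop above can be isolated as a small pure function, which makes it easier to test without calling PubChem. This is a sketch under assumptions: `choose_cid` is a hypothetical name, and the lookup passed as `formula_of` stands in for `getCIDFormula`. Unlike the loop above, it also flags a single candidate whose formula disagrees with the dataset.

```python
def choose_cid(candidates, expected_formula, formula_of):
    """Return (cid, matched).

    The first candidate whose formula equals `expected_formula` is returned
    with matched=True; otherwise the first candidate with matched=False.
    """
    if not candidates:
        return None, False
    for cid in candidates:
        if formula_of(cid) == expected_formula:
            return cid, True
    return candidates[0], False

# Stub lookup standing in for getCIDFormula (IDs and values are made up).
fake_formulas = {"A": "C6H12O6", "B": "C6H12O6", "C": "C5H10O5"}

print(choose_cid(["C", "B"], "C6H12O6", fake_formulas.get))
print(choose_cid(["C"], "C6H12O6", fake_formulas.get))
```

Entries returned with `matched=False` correspond to the `no_match` list and are the ones worth reviewing by hand.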


In [None]:
# Get the Inchi and SMILES for the metabolites with PubChem IDs retrieved previously

counter = 0
for i,met in metabolites.iterrows():
    if (met['PubChem'] != 'NaN' and (met['Inchi']=='NaN' or met['SMILES']=='NaN')):
        try:
            Inchi_SMILES = getCIDSmilesInChI(met['PubChem'])
            SMILES = Inchi_SMILES[0]
            Inchi = Inchi_SMILES[1]
            
            if met['Inchi']=='NaN':
                metabolites.loc[i, 'Inchi'] = Inchi
            if met['SMILES']=='NaN':
                metabolites.loc[i, 'SMILES'] = SMILES
                
            print(met['BiGG ID'])
            print(SMILES)
            print(Inchi)
            print('............')
        except KeyError:
            print(met['BiGG ID']+' Inchi and SMILES cannot be retrieved')
        
        counter +=1
        print(counter)

### 2.3 Homogenize information and Update Google Sheet file
The information retrieved in **2.1** and **2.2** is first homogenized so that each metabolite carries the same information in all of its compartments. The Google Sheet file is then updated.

In [None]:
print('Before homogenization')
print(len(metabolites[metabolites['PubChem']=='NaN']))
print(len(metabolites[metabolites['Inchi']=='NaN']))
print(len(metabolites[metabolites['SMILES']=='NaN']))

In [None]:
# Homogenize the columns in your DataFrame
metabolites = homogenize_info(metabolites)
metabolites
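
`homogenize_info` comes from the project's `metabolite_identifiers` module and its implementation is not shown here. As a rough sketch of the idea described above, and not of that function itself, the hypothetical helper below copies the first known value of each identifier to the rows of the same metabolite in other compartments. It assumes one-underscore compartment suffixes and the missing-value placeholders listed in `MISSING`.

```python
MISSING = {"", "NaN", None}  # assumed placeholders for missing values

def homogenize(rows, fields=("PubChem", "Inchi", "SMILES")):
    """Fill missing fields with values from the same metabolite in other compartments.

    `rows` is a list of dicts with a compartment-suffixed 'BiGG ID' (e.g. 'atp_c').
    """
    known = {}
    # First pass: remember the first known value per (base ID, field).
    for row in rows:
        base = row["BiGG ID"].rpartition("_")[0]
        for field in fields:
            value = row.get(field)
            if value not in MISSING:
                known.setdefault((base, field), value)
    # Second pass: fill the gaps, leaving the placeholder when nothing is known.
    for row in rows:
        base = row["BiGG ID"].rpartition("_")[0]
        for field in fields:
            if row.get(field) in MISSING:
                row[field] = known.get((base, field), row.get(field))
    return rows

rows = [
    {"BiGG ID": "atp_c", "PubChem": "5957", "Inchi": "NaN", "SMILES": "NaN"},
    {"BiGG ID": "atp_m", "PubChem": "NaN", "Inchi": "NaN", "SMILES": "NaN"},
]
print(homogenize(rows)[1]["PubChem"])
```
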

In [None]:
print('After homogenization')
print(len(metabolites[metabolites['PubChem']=='NaN']))
print(len(metabolites[metabolites['Inchi']=='NaN']))
print(len(metabolites[metabolites['SMILES']=='NaN']))

In [None]:
################################################
#### -------------------------------------- ####
#### ---- Update the Google Sheet file ---- ####
#### -------------------------------------- ####
################################################

sheet.update_google_sheet(sheet_met, metabolites)
print("Google Sheet updated.")

<a id='duplicated'></a>
## 3. Identification of Duplicated Metabolites 
Here we use the customized functions **getCanonical** and **similarity_calc** from the **metabolite_identifiers** module to identify duplicated metabolites through their SMILES. 

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import time
import requests
from bs4 import BeautifulSoup
from scipy.stats import mode

import cobra
from cobra import Model, Reaction

from tqdm.notebook import tqdm

from google_sheet import GoogleSheet
from metabolite_identifiers import getCanonical, similarity_calc

In [2]:
KEY_FILE_PATH = 'credentials.json'
SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

# Initialize the GoogleSheet object
sheet = GoogleSheet(SPREADSHEET_ID, KEY_FILE_PATH)

# Read data from the Google Sheet
sheet_met = 'Metabolites'
sheet_rxns = 'Rxns'
shee_attributes = 'Attributes'

met = sheet.read_google_sheet(sheet_met)
rxns = sheet.read_google_sheet(sheet_rxns)
attributes = sheet.read_google_sheet(shee_attributes)

### 3.1 Identification of duplicated metabolites by their Names and Formulas
The first step in the identification of duplicated metabolites is to identify those that share exactly the same **name** and **formula**, in which case the metabolites involved are automatically labeled as duplicated and fixed.

In [6]:
# Convert metabolites names to lower case and remove the compartment
met['Name'] = met['Name'].str.lower()
met_copy = met.copy()
met_copy['BiGG ID'] = met_copy['BiGG ID'].str[:-2]
met_copy

Unnamed: 0,Curated,BiGG ID,Name,Formula,Compartment,KEGG,CHEBI,ChEMBLID,PubChem,Inchi,...,EHMNID,SMILES,INCHI2,CC_ID,Stereoisomer Information of Metabolite Identified,Charge of the Metabolite Identified,CID_ID,PDB (ligand-expo) Experimental Coordinates File Url,Pub Chem Url,ChEBI Url
0,,10fthf5glu,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,c - cytosol,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
1,,10fthf5glu,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,e - extracellular space,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
2,,10fthf5glu,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,l - lysosome,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
3,,10fthf5glu,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,m - mitochondria,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
4,,10fthf6glu,10-formyltetrahydrofolate-[glu](6),C45H51N12O22,c - cytosol,,,,,InChI=1/C45H58N12O22/c46-45-55-36-35(38(67)56-...,...,,N=c1nc([O-])c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)N[...,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7757,PD,zym_int2,"5alpha-cholesta-8,24-dien-3-one",C27H42O,r - endoplasmic reticulum,,,,22298942,InChI=1S/C27H42O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,CC(CCC=C(C)C)C1CCC2C1(CCC3=C2CCC4C3(CCC(=O)C4)C)C,InChI=1S/C27H42O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,22298942,,https://pubchem.ncbi.nlm.nih.gov/compound/2229...,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
7758,,zymst,zymosterol c27h44o,C27H44O,c - cytosol,,,,92746,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,92746,,https://pubchem.ncbi.nlm.nih.gov/compound/92746,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
7759,,zymst,zymosterol c27h44o,C27H44O,r - endoplasmic reticulum,,,,92746,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,92746,,https://pubchem.ncbi.nlm.nih.gov/compound/92746,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
7760,,zymstnl,5alpha-cholest-8-en-3beta-ol,C27H46O,c - cytosol,,16608,,101770,InChI=1S/C27H46O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H46O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,101770,,http://pubchem.ncbi.nlm.nih.gov/compound/101770,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...


In [7]:
# Generate a list with duplicated metabolites

grouped = met_copy.groupby(['Name', 'Formula'])

# Initialize an empty list to store the results
duplicated_metabolites = []

# Iterate over the grouped DataFrame
for (Name, Formula), group in grouped:
    # Check if the group has more than one element (i.e., duplicate) and filter out those metabolites whose names are unknown
    if group['BiGG ID'].nunique() > 1 and Name != 'bigg id not found in bigg':
        unique_ids = group['BiGG ID'].unique()
        duplicated_metabolites.append((Name, Formula, unique_ids))

In [8]:
# Generate empty dict to store the existence of each duplicated metabolite in BiGG
duplicated_dict = {}


for metabolite in tqdm(duplicated_metabolites):
    duplicated_dict[metabolite[0]] = {}
    for big_id in metabolite[2]:
        time.sleep(1)
        # Generate a tag for each metabolite
        response = requests.get(f"http://bigg.ucsd.edu/universal/metabolites/{big_id}")
        if response.status_code == 200:
            #if the metabolite is in BiGG "OK"
            duplicated_dict[metabolite[0]][big_id] = 'OK'
        else:
            #if is not "NO"
            duplicated_dict[metabolite[0]][big_id] ='NO'
        



In [9]:
# Eliminate proton from the duplicated_dict (no error if it is absent)
duplicated_dict.pop('proton', None)
duplicated_dict

{}

In [10]:
#Examine the generated dictionary of duplicated values
duplicated_dict

{}

In [None]:
# Create a dictionary to store the 'OK' subkey for each key in duplicated_dict
ok_dict = {}

# Iterate over keys in duplicated_dict
for key in duplicated_dict:
    # Create an empty list to store 'NO' subkeys for this key
    no_list = []
    # Iterate over subkeys and values in sub-dictionary
    for subkey, value in duplicated_dict[key].items():
        # If the value is 'OK', save the subkey to a variable
        if value == 'OK':
            ok_dict[key] = subkey
        # If the value is 'NO', add the subkey to the list
        elif value == 'NO':
            no_list.append(subkey)
    # Replace all 'NO' subkeys with the 'OK' subkey for this key in all the datasets
    if key in ok_dict:
        ok_subkey = ok_dict[key]
        for no_subkey in no_list:
            met['BiGG ID'] = met['BiGG ID'].str.replace(no_subkey, ok_subkey)
            rxns['Reaction Formula'] = rxns['Reaction Formula'].str.replace(no_subkey, ok_subkey)
            attributes['Reaction Formula'] = attributes['Reaction Formula'].str.replace(no_subkey, ok_subkey)
    # Reset the 'ok_subkey' and 'no_subkey' variables at the end of each iteration over keys
    ok_dict[key] = None
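
One caveat with `str.replace(no_subkey, ok_subkey)` is that it substitutes the text wherever it occurs, including inside longer identifiers. A hedged alternative is to replace an ID only when it is a complete base ID followed by a compartment suffix. The helper name and the sample IDs below are made up for illustration, and the compartment codes are those from section 1.3.

```python
import re

# Compartment codes from section 1.3.
COMPARTMENTS = ("im", "c", "l", "m", "r", "e", "x", "n", "g")

def replace_metabolite_id(text: str, old: str, new: str) -> str:
    """Replace `old` only where it is a whole base ID followed by '_<compartment>'."""
    suffix = "|".join(COMPARTMENTS)
    # (?<!\w): not preceded by an ID character; lookahead: compartment suffix, then end of ID.
    pattern = rf"(?<!\w){re.escape(old)}(?=_(?:{suffix})(?!\w))"
    # A function replacement avoids interpreting backslashes in `new`.
    return re.sub(pattern, lambda _match: new, text)

formula = "metX_c + metX2_c --> nad_c"
print(replace_metabolite_id(formula, "metX", "nadh"))
```

With pandas, the same function could be applied per cell, for example through `Series.map`, instead of a substring replacement.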

In [None]:
##############################################################
#### ---------------------------------------------------- ####
#### ---- Update Rxns and  Attributes Google Sheets ----- ####
#### ---------------------------------------------------- ####
##############################################################

sheet.update_google_sheet(sheet_rxns, rxns)
sheet.update_google_sheet(shee_attributes, attributes)
print("Rxns and Attributes Google Sheet updated.")

### 3.2 Identification of duplicated metabolites by their PubChem IDs
Once the duplicated metabolites have been identified by their names and formulas we then move to the identification of duplicated metabolites by their **PubChem IDs** retrieved in **2.2**.

In [11]:
# Generate a dict with metabolite IDs (without the compartment) as keys and their PubChem, Inchi and SMILES as values
met_copy = met_copy.groupby('BiGG ID').first()
met_copy = met_copy.reset_index()
met_dict = met_copy.set_index('BiGG ID')[['PubChem','Inchi','SMILES']].to_dict(orient='index')

met_dict

{'10fthf': {'PubChem': '122347',
  'Inchi': 'InChI=1S/C20H23N7O7/c21-20-25-16-15(18(32)26-20)23-11(7-22-16)8-27(9-28)12-3-1-10(2-4-12)17(31)24-13(19(33)34)5-6-14(29)30/h1-4,9,11,13,23H,5-8H2,(H,24,31)(H,29,30)(H,33,34)(H4,21,22,25,26,32)/p-2/t11-,13+/m1/s1',
  'SMILES': '[H]C(=O)N(C[C@H]1CNc2nc(N)[nH]c(=O)c2N1)c1ccc(cc1)C(=O)N[C@@H](CCC([O-])=O)C([O-])=O'},
 '10fthf5glu': {'PubChem': 'NaN',
  'Inchi': 'InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50-40)43-19(15-42-32)16-51(17-52)20-3-1-18(2-4-20)33(62)47-24(38(67)68)5-10-26(53)44-21(6-11-27(54)55)34(63)45-22(7-12-28(56)57)35(64)46-23(8-13-29(58)59)36(65)48-25(39(69)70)9-14-30(60)61/h1-4,17,19,21-25,43H,5-16H2,(H,44,53)(H,45,63)(H,46,64)(H,47,62)(H,48,65)(H,54,55)(H,56,57)(H,58,59)(H,60,61)(H,67,68)(H,69,70)(H4,41,42,49,50,66)/p-6',
  'SMILES': 'N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CCC(=O)NC(CCC(=O)[O-])C(=O)NC(CCC(=O)[O-])C(=O)NC(CCC(=O)[O-])C(=O)NC(CCC(=O)[O-])C(=O)[O-])C(=O)[O-])cc1)N2'},
 '10fthf6glu': {'PubChem': 'NaN',
  'In

In [12]:
# Create an inverted dictionary with PubChem IDs as keys and Metabolite IDs as values
inverted_data = {}

for key, value in met_dict.items():
    pubchem_value = value['PubChem']
    if pubchem_value == 'NaN':
        continue  # Skip if the PubChem value is 'NaN'
    if pubchem_value not in inverted_data:
        inverted_data[pubchem_value] = [key]
    else:
        inverted_data[pubchem_value].append(key)

# Check for PubChem values associated with more than one key
for pubchem_value, keys in inverted_data.items():
    if len(keys) > 1:
        print(f"PubChem value '{pubchem_value}' is associated with keys: {keys}")


PubChem value 'None' is associated with keys: ['3oc101_7Zcoa', 'HC02111', 'sphmyln_cho']
PubChem value '536537' is associated with keys: ['CE2953', 'CE2963']
PubChem value '439153' is associated with keys: ['HC02112', 'nadh']
PubChem value '446013' is associated with keys: ['HC02114', 'fadh2']
PubChem value '439415' is associated with keys: ['HC02119', 'ametam']
PubChem value '73427352' is associated with keys: ['acglcgal14acglcgalgluside_cho', 'acngalacglcgalgluside_cho']
PubChem value '73427349' is associated with keys: ['galacglcgal14acglcgalgluside_cho', 'galacglcgalacglcgal14acglcgalgluside_cho']
PubChem value '1038' is associated with keys: ['h', 'h_']
PubChem value '53477847' is associated with keys: ['octd11ecoa', 'vacccoa']
PubChem value 'C02530' is associated with keys: ['xolest2_cho', 'xolest_cho']


The list of metabolites generated here is then manually curated since many of the duplicated PubChem IDs could be wrongly assigned.
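
The output above includes a group keyed by `'None'`, which is a placeholder rather than a real identifier. A minimal sketch of a reusable helper that skips such placeholders is given below. The function name, the `MISSING` set and the toy data are assumptions for illustration; the same helper would apply to the `Inchi` field.

```python
from collections import defaultdict

MISSING = {"", "NaN", "None", None}  # assumed placeholders for missing values

def shared_identifiers(met_dict, field):
    """Map each identifier value to the metabolite IDs sharing it (two or more IDs only)."""
    inverted = defaultdict(list)
    for met_id, info in met_dict.items():
        value = info.get(field)
        if value in MISSING:
            continue
        inverted[value].append(met_id)
    return {value: ids for value, ids in inverted.items() if len(ids) > 1}

# Toy data with made-up identifier values.
toy = {
    "metA": {"PubChem": "1"},
    "metB": {"PubChem": "1"},
    "metC": {"PubChem": "None"},
    "metD": {"PubChem": "None"},
}
print(shared_identifiers(toy, "PubChem"))
```
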

### 3.3 Identification of duplicated metabolites by their Inchi
Next, we identify those metabolites that are duplicated by comparing their **Inchi** strings. Here we repeat the procedure used in **3.2** and create an inverted dictionary with the **Inchi** as keys and **Metabolite IDs** as values.

In [18]:
# Create an inverted dictionary with Inchis as keys and Metabolite IDs as values
inverted_data = {}

for key, value in met_dict.items():
    inchi_value = value['Inchi']
    if inchi_value in ('NaN', '', None):
        continue  # Skip missing Inchi values ('NaN', empty string or None)
    if inchi_value not in inverted_data:
        inverted_data[inchi_value] = [key]
    else:
        inverted_data[inchi_value].append(key)

# Check for Inchi values associated with more than one key
for inchi_value, keys in inverted_data.items():
    if len(keys) > 1:
        print(f"Inchi value '{inchi_value}' is associated with keys: {keys}")


Inchi value 'InChI=1S/C30H50N7O17P3S/c1-4-5-6-7-8-9-10-21(39)58-14-13-32-20(38)11-12-33-28(42)25(41)30(2,3)16-51-57(48,49)54-56(46,47)50-15-19-24(53-55(43,44)45)23(40)29(52-19)37-18-36-22-26(31)34-17-35-27(22)37/h9-10,17-19,23-25,29,40-41H,4-8,11-16H2,1-3H3,(H,32,38)(H,33,42)(H,46,47)(H,48,49)(H2,31,34,35)(H2,43,44,45)/p-4/b10-9+/t19-,23-,24-,25+,29-/m1/s1' is associated with keys: ['CE4806', 'M00056']
Inchi value 'InChI=1S/C20H30O5/c1-2-3-6-9-15(21)12-13-17-16(18(22)14-19(17)23)10-7-4-5-8-11-20(24)25/h3-4,6-7,12-13,15-17,19,21,23H,2,5,8-11,14H2,1H3,(H,24,25)/b6-3-,7-4-,13-12+/t15-,16+,17+,19+/m0/s1' is associated with keys: ['CE7112', 'CE7115', 'HC02213']
Inchi value 'InChI=1S/C52H92N2O28/c1-4-5-6-7-8-9-10-11-12-13-14-15-16-17-27(62)26(53-24(2)60)23-73-49-41(70)39(68)45(32(22-59)78-49)80-52-42(71)46(35(64)29(19-56)76-52)81-48-33(54-25(3)61)37(66)44(31(21-58)77-48)79-51-43(72)47(36(65)30(20-57)75-51)82-50-40(69)38(67)34(63)28(18-55)74-50/h16-17,26-52,55-59,62-72H,4-15,18-23H2,1-3H3,(H,

The list of metabolites generated here is then manually curated since many of the duplicated Inchis could be wrongly annotated.
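Sections **3.2** and **3.3** apply the same inversion to two different fields, so the step can be written once and reused. The following is a minimal sketch under the assumption that `met_dict` maps each Metabolite ID to a dictionary with `'PubChem'` and `'Inchi'` keys, as in the records used in this notebook. `find_duplicates` and `MISSING` are hypothetical names introduced here for illustration.

```python
from collections import defaultdict

# Assumption: these are the placeholders used for missing values in met_dict
MISSING = ('NaN', '', None)


def find_duplicates(met_dict, field):
    """Return {field value: [metabolite IDs]} for values shared by 2+ metabolites."""
    inverted = defaultdict(list)
    for met_id, record in met_dict.items():
        value = record.get(field)
        if value in MISSING:
            continue  # skip metabolites with no value for this field
        inverted[value].append(met_id)
    return {value: ids for value, ids in inverted.items() if len(ids) > 1}


# Toy records; only the PubChem values are taken from the output in 3.2
toy = {
    'HC02112': {'PubChem': '439153', 'Inchi': 'NaN'},
    'nadh': {'PubChem': '439153', 'Inchi': ''},
    'h': {'PubChem': '1038', 'Inchi': None},
}
print(find_duplicates(toy, 'PubChem'))  # {'439153': ['HC02112', 'nadh']}
print(find_duplicates(toy, 'Inchi'))    # {}
```

Keeping the missing-value placeholders in one tuple makes it easy to extend the filter, for example with the string `'None'` that appears among the PubChem values in **3.2**.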

In [3]:
met

Unnamed: 0,Curated,BiGG ID,Name,Formula,Compartment,KEGG,CHEBI,ChEMBLID,PubChem,Inchi,...,EHMNID,SMILES,INCHI2,CC_ID,Stereoisomer Information of Metabolite Identified,Charge of the Metabolite Identified,CID_ID,PDB (ligand-expo) Experimental Coordinates File Url,Pub Chem Url,ChEBI Url
0,,10fthf5glu_c,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,c - cytosol,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
1,,10fthf5glu_e,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,e - extracellular space,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
2,,10fthf5glu_l,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,l - lysosome,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
3,,10fthf5glu_m,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,m - mitochondria,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
4,,10fthf6glu_c,10-formyltetrahydrofolate-[glu](6),C45H51N12O22,c - cytosol,,,,,InChI=1/C45H58N12O22/c46-45-55-36-35(38(67)56-...,...,,N=c1nc([O-])c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)N[...,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7752,PD,zym_int2_r,"5alpha-Cholesta-8,24-dien-3-one",C27H42O,r - endoplasmic reticulum,,,,22298942,InChI=1S/C27H42O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,CC(CCC=C(C)C)C1CCC2C1(CCC3=C2CCC4C3(CCC(=O)C4)C)C,InChI=1S/C27H42O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,22298942,,https://pubchem.ncbi.nlm.nih.gov/compound/2229...,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
7753,,zymst_c,zymosterol c27h44o,C27H44O,c - cytosol,,,,92746,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,92746,,https://pubchem.ncbi.nlm.nih.gov/compound/92746,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
7754,,zymst_r,zymosterol c27h44o,C27H44O,r - endoplasmic reticulum,,,,92746,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,92746,,https://pubchem.ncbi.nlm.nih.gov/compound/92746,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
7755,,zymstnl_c,5alpha-cholest-8-en-3beta-ol,C27H46O,c - cytosol,,16608,,101770,InChI=1S/C27H46O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H46O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,101770,,http://pubchem.ncbi.nlm.nih.gov/compound/101770,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...


In [4]:
# Store the original column order
column_order = met.columns.tolist()

# Group by 'BiGG ID' and keep the first non-null value in each group, then reset the index
met = met.groupby('BiGG ID').first().reset_index()

# Rearrange the columns to the original order
met = met[column_order]

met

Unnamed: 0,Curated,BiGG ID,Name,Formula,Compartment,KEGG,CHEBI,ChEMBLID,PubChem,Inchi,...,EHMNID,SMILES,INCHI2,CC_ID,Stereoisomer Information of Metabolite Identified,Charge of the Metabolite Identified,CID_ID,PDB (ligand-expo) Experimental Coordinates File Url,Pub Chem Url,ChEBI Url
0,,10fthf5glu_c,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,c - cytosol,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
1,,10fthf5glu_e,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,e - extracellular space,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
2,,10fthf5glu_l,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,l - lysosome,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
3,,10fthf5glu_m,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,m - mitochondria,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
4,,10fthf6glu_c,10-formyltetrahydrofolate-[glu](6),C45H51N12O22,c - cytosol,,,,,InChI=1/C45H58N12O22/c46-45-55-36-35(38(67)56-...,...,,N=c1nc([O-])c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)N[...,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7750,PD,zym_int2_r,"5alpha-Cholesta-8,24-dien-3-one",C27H42O,r - endoplasmic reticulum,,,,22298942,InChI=1S/C27H42O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,CC(CCC=C(C)C)C1CCC2C1(CCC3=C2CCC4C3(CCC(=O)C4)C)C,InChI=1S/C27H42O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,22298942,,https://pubchem.ncbi.nlm.nih.gov/compound/2229...,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
7751,,zymst_c,zymosterol c27h44o,C27H44O,c - cytosol,,,,92746,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,92746,,https://pubchem.ncbi.nlm.nih.gov/compound/92746,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
7752,,zymst_r,zymosterol c27h44o,C27H44O,r - endoplasmic reticulum,,,,92746,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,92746,,https://pubchem.ncbi.nlm.nih.gov/compound/92746,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
7753,,zymstnl_c,5alpha-cholest-8-en-3beta-ol,C27H46O,c - cytosol,,16608,,101770,InChI=1S/C27H46O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H46O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,101770,,http://pubchem.ncbi.nlm.nih.gov/compound/101770,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...


In [19]:
# Normalize all the SMILES to Canonical
dict_metabolites_canonical = {k: getCanonical(v) for k, v in met_dict.items() if str(v) != 'NaN'}

{'PubChem': '122347', 'Inchi': 'InChI=1S/C20H23N7O7/c21-20-25-16-15(18(32)26-20)23-11(7-22-16)8-27(9-28)12-3-1-10(2-4-12)17(31)24-13(19(33)34)5-6-14(29)30/h1-4,9,11,13,23H,5-8H2,(H,24,31)(H,29,30)(H,33,34)(H4,21,22,25,26,32)/p-2/t11-,13+/m1/s1', 'SMILES': '[H]C(=O)N(C[C@H]1CNc2nc(N)[nH]c(=O)c2N1)c1ccc(cc1)C(=O)N[C@@H](CCC([O-])=O)C([O-])=O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50-40)43-19(15-42-32)16-51(17-52)20-3-1-18(2-4-20)33(62)47-24(38(67)68)5-10-26(53)44-21(6-11-27(54)55)34(63)45-22(7-12-28(56)57)35(64)46-23(8-13-29(58)59)36(65)48-25(39(69)70)9-14-30(60)61/h1-4,17,19,21-25,43H,5-16H2,(H,44,53)(H,45,63)(H,46,64)(H,47,62)(H,48,65)(H,54,55)(H,56,57)(H,58,59)(H,60,61)(H,67,68)(H,69,70)(H4,41,42,49,50,66)/p-6', 'SMILES': 'N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CCC(=O)NC(CCC(=O)[O-])C(=O)NC(CCC(=O)[O-])C(=O)NC(CCC(=O)[O-])C(=O)NC(CCC(=O)[O-])C(=O)[O-])C(=O)[O-])cc1)N2'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inc

{'PubChem': '45266568', 'Inchi': 'InChI=1S/C26H42N7O18P3S/c1-13(14(2)34)25(39)55-8-7-28-16(35)5-6-29-23(38)20(37)26(3,4)10-48-54(45,46)51-53(43,44)47-9-15-19(50-52(40,41)42)18(36)24(49-15)33-12-32-17-21(27)30-11-31-22(17)33/h11-13,15,18-20,24,36-37H,5-10H2,1-4H3,(H,28,35)(H,29,38)(H,43,44)(H,45,46)(H2,27,30,31)(H2,40,41,42)/p-4/t13?,15-,18-,19-,20+,24-/m1/s1', 'SMILES': 'CC(C(C)=O)C(=O)SCCNC(=O)CCNC(=O)[C@H](O)C(C)(C)COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H]([C@H](O)[C@@H]1OP([O-])([O-])=O)n1cnc2c(N)ncnc12'} Not a valid SMILES string
{'PubChem': '5280564', 'Inchi': 'InChI=1S/C26H42N7O17P3S/c1-5-14(2)25(38)54-9-8-28-16(34)6-7-29-23(37)20(36)26(3,4)11-47-53(44,45)50-52(42,43)46-10-15-19(49-51(39,40)41)18(35)24(48-15)33-13-32-17-21(27)30-12-31-22(17)33/h5,12-13,15,18-20,24,35-36H,6-11H2,1-4H3,(H,28,34)(H,29,37)(H,42,43)(H,44,45)(H2,27,30,31)(H2,39,40,41)/p-4/b14-5+/t15-,18-,19-,20+,24-/m1/s1', 'SMILES': 'C\\C=C(/C)C(=O)SCCNC(=O)CCNC(=O)[C@H](O)C(C)(C)COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '11966216', 'Inchi': 'InChI=1S/C29H50N7O18P3S/c1-4-5-6-7-17(37)12-20(39)58-11-10-31-19(38)8-9-32-27(42)24(41)29(2,3)14-51-57(48,49)54-56(46,47)50-13-18-23(53-55(43,44)45)22(40)28(52-18)36-16-35-21-25(30)33-15-34-26(21)36/h15-18,22-24,28,37,40-41H,4-14H2,1-3H3,(H,31,38)(H,32,42)(H,46,47)(H,48,49)(H2,30,33,34)(H2,43,44,45)/p-4/t17-,18+,22+,23+,24-,28+/m0/s1', 'SMILES': 'CCCCC[C@H](O)CC(=O)SCCNC(=O)CCNC(=O)[C@H](O)C(C)(C)COP(O)(=O)OP(O)(=O)OC[C@H]1O[C@H]([C@H](O)[C@@H]1OP(O)(O)=O)n1cnc2c(N)ncnc12'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES 

{'PubChem': '22833663', 'Inchi': 'InChI=1S/C39H68N7O18P3S/c1-4-5-6-7-8-9-10-11-12-13-14-15-16-17-27(47)22-30(49)68-21-20-41-29(48)18-19-42-37(52)34(51)39(2,3)24-61-67(58,59)64-66(56,57)60-23-28-33(63-65(53,54)55)32(50)38(62-28)46-26-45-31-35(40)43-25-44-36(31)46/h25-26,28,32-34,38,50-51H,4-24H2,1-3H3,(H,41,48)(H,42,52)(H,56,57)(H,58,59)(H2,40,43,44)(H2,53,54,55)/p-4/t28-,32-,33-,34+,38-/m1/s1', 'SMILES': 'CCCCCCCCCCCCCCCC(=O)CC(=O)SCCNC(=O)CCNC(=O)[C@H](O)C(C)(C)COP(O)(=O)OP(O)(=O)OC[C@H]1O[C@H]([C@H](O)[C@@H]1OP(O)(O)=O)n1cnc2c(N)ncnc12'} Not a valid SMILES string
{'PubChem': '11966312', 'Inchi': 'InChI=1S/C31H50N7O18P3S/c1-17(2)19(7-6-18(3)39)12-22(41)60-11-10-33-21(40)8-9-34-29(44)26(43)31(4,5)14-53-59(50,51)56-58(48,49)52-13-20-25(55-57(45,46)47)24(42)30(54-20)38-16-37-23-27(32)35-15-36-28(23)38/h15-16,19-20,24-26,30,42-43H,1,6-14H2,2-5H3,(H,33,40)(H,34,44)(H,48,49)(H,50,51)(H2,32,35,36)(H2,45,46,47)/t19-,20+,24+,25+,26-,30+/m0/s1', 'SMILES': 'CC(=C)C(CCC(=O)C)CC(=O)SCCNC(=O)CCNC(=

{'PubChem': '439400', 'Inchi': 'InChI=1S/C6H13O7P/c1-6(9,4-5(7)8)2-3-13-14(10,11)12/h9H,2-4H2,1H3,(H,7,8)(H2,10,11,12)/p-3/t6-/m1/s1', 'SMILES': 'C[C@@](O)(CCOP([O-])([O-])=O)CC([O-])=O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '439954', 'Inchi': 'InChI=1S/C6H11NO3/c7-4-2-1-3-5(8)6(9)10/h1-4,7H2,(H,9,10)', 'SMILES': 'C(CCN)CC(=O)C(=O)O'} Not a valid SMILES string
{'PubChem': '169691', 'Inchi': 'InChI=1/C44H54N12O21/c45-21(36(66)67)5-10-26(57)43(15-30(61)62,27(58)11-6-22(46)37(68)69)44(28(59)12-7-23(47)38(70)71,41(76)77-31(63)14-9-25(49)40(74)75)56(29(60)13-8-24(48)39(72)73)35(65)18-1-3-19(4-2-18)51-16-20-17-52-33-32(53-20)34(64)55-42(50)54-33/h1-4,17,21-25,51H,5-16,45-49H2,(H,61,62)(H,66,67)(H,68,69)(H,70,71)(H,72,73)(H,74,75)(H3,50,52,54,55,64)/t21-,22-,23-,24-,25-,44+/m0/s1', 'SMILES': 'C1=CC(=CC=C1C(=O)N(C(=O)CCC(C(=O)O)N)C(C(=O)CC

{'PubChem': '5356421', 'Inchi': 'InChI=1/C18H32O3/c1-2-3-10-13-16-17(21-16)14-11-8-6-4-5-7-9-12-15-18(19)20/h8,11,16-17H,2-7,9-10,12-15H2,1H3,(H,19,20)/b11-8-', 'SMILES': 'CCCCCC1C(O1)CC=CCCCCCCCC(=O)O'} Not a valid SMILES string
{'PubChem': '37456', 'Inchi': 'InChI=1S/C20H12O/c1-2-11-4-5-13-10-14-7-9-16-20(21-16)19(14)15-8-6-12(3-1)17(11)18(13)15/h1-10,16,20H', 'SMILES': 'C1=CC2=C3C(=C1)C=CC4=C3C(=CC5=C4C6C(O6)C=C5)C=C2'} Not a valid SMILES string
{'PubChem': '37786', 'Inchi': 'InChI=1S/C20H12O/c1-2-6-13-12(4-1)10-16-18-14(13)9-8-11-5-3-7-15(17(11)18)19-20(16)21-19/h1-10,19-20H', 'SMILES': 'C1=CC=C2C3=C4C(=CC2=C1)C5C(O5)C6=CC=CC(=C64)C=C3'} Not a valid SMILES string
{'PubChem': '5781', 'Inchi': 'InChI=1/C4H2N2O4/c7-1-2(8)5-4(10)6-3(1)9/h(H2,5,6,8,9,10)', 'SMILES': 'O=C1NC(=O)C(=O)C(=O)N1'} Not a valid SMILES string
{'PubChem': '131769843', 'Inchi': 'InChI=1/C27H48O3/c1-17(7-6-12-25(2,3)30)20-8-9-21-24-22(11-14-27(20,21)5)26(4)13-10-19(28)15-18(26)16-23(24)29/h17-24,28-30H,6-16H2,1-5H3

{'PubChem': '53477797', 'Inchi': 'InChI=1/C18H22O3/c1-18-9-8-11-10-4-6-15(19)17(21)13(10)3-2-12(11)14(18)5-7-16(18)20/h4,6,11-12,14,19,21H,2-3,5,7-9H2,1H3/t11?,12?,14?,18-/m1/s1', 'SMILES': 'CC12CCC3C(C1CCC2=O)CCC4=C3C=CC(=C4O)O'} Not a valid SMILES string
{'PubChem': '13267935', 'Inchi': 'InChI=1S/C19H26O3/c1-19-8-7-12-13(15(19)5-6-18(19)21)4-3-11-9-17(22-2)16(20)10-14(11)12/h9-10,12-13,15,18,20-21H,3-8H2,1-2H3/t12-,13+,15-,18-,19-/m0/s1', 'SMILES': 'CC12CCC3C(C1CCC2O)CCC4=CC(=C(C=C34)O)OC'} Not a valid SMILES string
{'PubChem': '53480676', 'Inchi': 'InChI=1/C19H24O3/c1-19-8-7-12-13(15(19)5-6-18(19)21)4-3-11-9-17(22-2)16(20)10-14(11)12/h9-10,12-13,15,20H,3-8H2,1-2H3/t12?,13?,15?,19-/m1/s1', 'SMILES': 'CC12CCC3C(C1CCC2=O)CCC4=CC(=C(C=C34)O)OC'} Not a valid SMILES string
{'PubChem': '29983092', 'Inchi': 'InChI=1/C19H26O3/c1-19-10-9-12-11-5-7-16(20)18(22-2)14(11)4-3-13(12)15(19)6-8-17(19)21/h5,7,12-13,15,17,20-21H,3-4,6,8-10H2,1-2H3/t12-,13-,15+,17-,19-/m0/s1', 'SMILES': 'CC12CCC3C(C1CCC

{'PubChem': '53481584', 'Inchi': 'InChI=1S/C57H90N18O16/c1-30(2)27-38(71-47(82)34-18-20-44(78)66-34)49(84)72-39(28-31-14-16-32(76)17-15-31)50(85)67-35(19-21-45(79)80)48(83)73-40(29-43(59)77)51(86)70-36(9-3-4-22-58)53(88)74-25-7-12-41(74)52(87)68-33(10-5-23-64-56(60)61)46(81)69-37(11-6-24-65-57(62)63)54(89)75-26-8-13-42(75)55(90)91/h14-17,30,33-42,76H,3-13,18-29,58H2,1-2H3,(H2,59,77)(H,66,78)(H,67,85)(H,68,87)(H,69,81)(H,70,86)(H,71,82)(H,72,84)(H,73,83)(H,79,80)(H,90,91)(H4,60,61,64)(H4,62,63,65)/t33-,34+,35+,36+,37+,38+,39-,40+,41-,42-/m0/s1', 'SMILES': 'CC(C)CC(C(=O)NC(CC1=CC=C(C=C1)O)C(=O)NC(CCC(=O)O)C(=O)NC(CC(=O)N)C(=O)NC(CCCCN)C(=O)N2CCCC2C(=O)NC(CCCN=C(N)N)C(=O)NC(CCCN=C(N)N)C(=O)N3CCCC3C(=O)O)NC(=O)C4CCC(=O)N4'} Not a valid SMILES string
{'PubChem': '53481585', 'Inchi': 'InChI=1S/C21H33N3O5/c1-5-13(4)18(20(27)23-17(21(28)29)10-12(2)3)24-19(26)16(22)11-14-6-8-15(25)9-7-14/h6-9,12-13,16-18,25H,5,10-11,22H2,1-4H3,(H,23,27)(H,24,26)(H,28,29)/t13?,16-,17-,18+/m1/s1', 'SMILES': 'CCC(

{'PubChem': '131769873', 'Inchi': 'InChI=1S/C43H66N7O17P3S/c1-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-34(52)71-27-26-45-33(51)24-25-46-41(55)38(54)43(2,3)29-64-70(61,62)67-69(59,60)63-28-32-37(66-68(56,57)58)36(53)42(65-32)50-31-49-35-39(44)47-30-48-40(35)50/h5-6,8-9,11-12,14-15,17-18,22-23,30-32,36-38,42,53-54H,4,7,10,13,16,19-21,24-29H2,1-3H3,(H,45,51)(H,46,55)(H,59,60)(H,61,62)(H2,44,47,48)(H2,56,57,58)/b6-5-,9-8-,12-11-,15-14-,18-17-,23-22+/t32-,36+,37+,38-,42-/m0/s1', 'SMILES': 'CCC=CCC=CCC=CCC=CCC=CCCCC=CC(=O)SCCNC(=O)CCNC(=O)C(C(C)(C)COP(=O)(O)OP(=O)(O)OCC1C(C(C(O1)N2C=NC3=C(N=CN=C32)N)O)OP(=O)(O)O)O'} Not a valid SMILES string
{'PubChem': '71581213', 'Inchi': '', 'SMILES': ''} Not a valid SMILES string
{'PubChem': '53481432', 'Inchi': 'InChI=1/C45H70N7O18P3S/c1-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-33(53)28-36(55)74-27-26-47-35(54)24-25-48-43(58)40(57)45(2,3)30-67-73(64,65)70-72(62,63)66-29-34-39(69-71(59,60)61)38(56)44(68-34)52-32-51-37-41(46)49-3

{'PubChem': 'NaN', 'Inchi': 'InChI=1S/C45H78N7O18P3S/c1-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-33(53)28-36(55)74-27-26-47-35(54)24-25-48-43(58)40(57)45(2,3)30-67-73(64,65)70-72(62,63)66-29-34-39(69-71(59,60)61)38(56)44(68-34)52-32-51-37-41(46)49-31-50-42(37)52/h11-12,31-32,34,38-40,44,56-57H,4-10,13-30H2,1-3H3,(H,47,54)(H,48,58)(H,62,63)(H,64,65)(H2,46,49,50)(H2,59,60,61)/p-4/b12-11-/t34?,38?,39?,40?,44-/m0/s1', 'SMILES': 'CCCCCCCC/C=C\\CCCCCCCCCCCC(=O)CC(=O)SCCNC(=O)CCNC(=O)C(O)C(C)(C)COP(=O)([O-])OP(=O)([O-])OCC1O[C@H](n2cnc3c(N)ncnc32)C(O)C1OP(=O)([O-])[O-]'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '72193825', 'Inchi': 'InChI=1S/C45H78N7O17P3S/c1-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-36(54)73-29-28-47-35(53)26-27-48-43(57)40(56)45(2,3)31-66-72(63,64)69-71(61,62)65-30-34-39(68-70(58,59)60)38(55)44(67-34)52-33-51-37-41(46)49-32-50-42(37)52/h11-12,24-25,32-34,38-40,44,55-56H

{'PubChem': '100185', 'Inchi': 'InChI=1/C11H13NO4/c1-11(10(15)16)7-5-9(14)8(13)4-6(7)2-3-12-11/h4-5,12-14H,2-3H2,1H3,(H,15,16)/p-1', 'SMILES': 'CC1(C2=CC(=C(C=C2CCN1)O)O)C(=O)O'} Not a valid SMILES string
{'PubChem': '320322', 'Inchi': 'InChI=1/C12H15NO4/c1-12(11(15)16)8-6-10(17-2)9(14)5-7(8)3-4-13-12/h5-6,13-14H,3-4H2,1-2H3,(H,15,16)/p-1', 'SMILES': 'CC1(C2=CC(=C(C=C2CCN1)O)OC)C(=O)O'} Not a valid SMILES string
{'PubChem': '20844', 'Inchi': 'InChI=1/C10H11NO2/c1-6-8-5-10(13)9(12)4-7(8)2-3-11-6/h4-5,12-13H,2-3H2,1H3', 'SMILES': 'CC1=NCCC2=CC(=C(C=C12)O)O'} Not a valid SMILES string
{'PubChem': '123349', 'Inchi': 'InChI=1S/HNO3/c2-1-4-3/h3H/p-1', 'SMILES': '[O-]ON=O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '54518673', 'Inchi': 'InChI=1S/C20H26O3/c1-14(7-6-8-15(2)13-19(22)23)9-11-17-16(3)10-12-18(21)20(17,4)5/h6-9,11,13H,10,12H2,1-5H3,(H,22,23)', 'SMILES': 'CC1=C(C(C(=O)CC1)(C)C)C=CC(=CC=CC(=CC(=O)O)C)C'} Not a 

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '53481456', 'Inchi': 'InChI=1S/C20H30O6/c21-17(11-6-2-1-3-9-15-19(23)24)12-7-4-5-8-13-18(22)14-10-16-20(25)26/h2,4-6,8,13,18,22H,1,3,7,9-12,14-16H2,(H,23,24)(H,25,26)/b5-4+,6-2-,13-8-/t18-/m0/s1', 'SMILES': 'C(CCC(=O)O)CC=CCC(=O)CCC=CC=CC(CCCC(=O)O)O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '53481512', 'Inchi': 'InChI=1/C23H35NO6S/c24-19(23(29)30)18-31-21(20(26)14-13-16-22(27)28)15-11-9-7-5-3-1-2-4-6-8-10-12-17-25/h2-5,7,9,11,15,17,19-21,26H,1,6,8,10,12-14,16,18,24H2,(H,27,28)(H,29,30)/p-1/b4-2-,5-3-,9-7+,15-11+/t19-,20+,21-/m1/s1', 'SMILES': 'C(CCC=O)CC=CCC=CC=CC=CC(C(CCCC(=O)O)O)SCC(C(=O)O)N'} Not a valid SMILES string
{'PubChem': '53481508', 'Inchi': 'InChI=1S/C23H3

{'PubChem': '71328509', 'Inchi': 'InChI=1/C9H16O2/c1-2-3-4-5-8-9(11-8)6-7-10/h7-9H,2-6H2,1H3', 'SMILES': 'CCCCCC1C(O1)CC=O'} Not a valid SMILES string
{'PubChem': '53394277', 'Inchi': 'InChI=1S/C20H30O4/c1-2-3-4-5-6-7-8-9-10-11-12-13-14-16-19(24-23)17-15-18-20(21)22/h3-4,6-7,9-10,12-14,16,19,23H,2,5,8,11,15,17-18H2,1H3,(H,21,22)', 'SMILES': 'CCC=CCC=CCC=CCC=CC=CC(CCCC(=O)O)OO'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '53481464', 'Inchi': 'InChI=1S/C28H40O4/c1-19(12-8-14-21(3)27(30)31)10-7-11-20(2)13-9-16-28(6)17-15-24-18-25(29)22(4)23(5)26(24)32-28/h10,13-14,18,29H,7-9,11-12,15-17H2,1-6H3,(H,30,31)/b19-10+,20-13+,21-14+/t28-/m1/s1', 'SMILES': 'CC1=C(C=C2CCC(OC2=C1C)(C)CCC=C(C)CCC=C(C)CCC=C(C)C(=O)O)O'} Not a valid SMILES string
{'PubChem': '53481468', 'Inchi': 'InChI=1S/C28H42O3/c1-20(12-8-13-22(3)19-29)10-7-11-21(2)14-9-16-28(6)17-15-25-18-26(30)23(4)24(5)27(25)31-28/h10,13-14,18,29-30H,7-9,11-12,15-17,19H2,1-

{'PubChem': '135398702', 'Inchi': 'InChI=1S/C9H13N5O3/c1-3(15)6(16)4-2-11-7-5(12-4)8(17)14-9(10)13-7/h3-4,12,15H,2H2,1H3,(H4,10,11,13,14,17)', 'SMILES': 'CC(C(=O)C1CNC2=C(N1)C(=O)NC(=N2)N)O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '135402024', 'Inchi': 'InChI=1S/C9H13N5O4/c10-9-13-7-5(8(18)14-9)12-3(1-11-7)6(17)4(16)2-15/h4,6,15-17H,1-2H2,(H4,10,11,13,14,18)', 'SMILES': 'Nc1nc2NCC(=Nc2c(=O)[nH]1)[C@H](O)[C@H](O)CO'} Not a valid SMILES string
{'PubChem': '440565', 'Inchi': 'InChI=1S/C13H25NO2S2/c1-3-10(2)13(16)18-9-8-11(17)6-4-5-7-12(14)15/h10-11,17H,3-9H2,1-2H3,(H2,14,15)', 'SMILES': 'CCC(C)C(=O)SCCC(S)CCCCC(N)=O'} Not a valid SMILES string
{'PubChem': '440566', 'Inchi': 'InChI=1S/C13H25NO2S2/c1-10(2)9-13(16)18-8-7-11(17)5-3-4-6-12(14)15/h10-11,17H,3-9H2,1-2H3,(H2,14,15)', 'SMILES': 'CC(C)CC(=O)SCCC(S)CCCCC(N)=O'} Not a valid SMILES string
{'PubChem': '25244267', 'Inchi': 'InChI=1S/C6H6O7/c7-3(8)1-2(5(10)11)4(

{'PubChem': '23951', 'Inchi': 'InChI=1S/Sm', 'SMILES': '[Sm]'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} N

{'PubChem': '5312529', 'Inchi': 'InChI=1S/C20H34O2/c1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20(21)22/h3-4,6-7,9-10H,2,5,8,11-19H2,1H3,(H,21,22)/b4-3-,7-6-,10-9-', 'SMILES': 'CCC=CCC=CCC=CCCCCCCCCCC(=O)O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '5312518', 'Inchi': 'InChI=1S/C20H38O2/c1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20(21)22/h7-8H,2-6,9-19H2,1H3,(H,21,22)/b8-7-', 'SMILES': 'CCCCCCC=CCCCCCCCCCCCC(=O)O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '5312441', 'Inchi': 'InChI=1S/C18H34O2/c1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18(19)20/h5-6H,2-4,7-17H2,1H3,(H,19,20)/b6-5-', 'SMILES': 'CCCCC=CCCCCCCCCCCCC(=O)O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '5312554', 'Inchi

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '119521', 'Inchi': 'InChI=1S/C2H2Cl2O/c3-2(4)1-5-2/h1H2', 'SMILES': 'C1C(O1)(Cl)Cl'} Not a valid SMILES string
{'PubChem': '6366', 'Inchi': 'InChI=1S/C2H2Cl2/c1-2(3)4/h1H2', 'SMILES': 'C=C(Cl)Cl'} Not a valid SMILES string
{'PubChem': '53297446', 'Inchi': 'InChI=1S/C17H16O8/c1-24-12-4-10(21)13(8(5-18)11(22)6-19)16-15(12)7-2-3-9(20)14(7)17(23)25-16/h4,6,8,11,18,21-22H,2-3,5H2,1H3', 'SMILES': 'COC1=C2C3=C(C(=O)CC3)C(=O)OC2=C(C(=C1)O)C(CO)C(C=O)O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '7839', 'Inchi': 'InChI=1S/C2H4B

{'PubChem': '7069', 'Inchi': 'InChI=1S/C16H12N2O4/c1-21-15-7-11(3-5-13(15)17-9-19)12-4-6-14(18-10-20)16(8-12)22-2/h3-8H,1-2H3', 'SMILES': 'O(P(=O)(OCCN)O)C[C@H](O)COC(*)=O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '11825433', 'Inchi': 'InChI=1S/C19H41O6P/c1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-24-17-19(20)18-25-26(21,22)23/h19-20H,2-18H2,1H3,(H2,21,22,23)/t19-/m1/s1', 'SMILES': 'CCCCCCCCCCCCCCCCOCC(COP(=O)(O)O)O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '11954052', 'Inchi': 'InChI=1S/C10H9NO4/c12-9-5-4-6-7(10(9)13)2-1-3-8(6)11(14)15/h1-5,9-10,12-13H', 'SMILES': 'C1=CC2=C(C=CC(C2O)O)C(=C1)[N+](=O)[O-]'} Not a valid SMILES string
{'PubChem': '11954057', 'Inchi': 'InChI=1S/C20H24N4O9S/c21-12(20(30)31)5-7-16(26)23-13(19(29)22-8-17(27)28)9-34-18-11-2-1-3-14(24(32)33)10(11)4-6-15(18)25/h1-4,6,12-13,15,18,25H,5,7-9,21H2,(H,22,29)(H,23,26)(H,27

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '71581247', 'Inchi': None, 'SMILES': None} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '183009', 'Inchi': 'InChI=1S/C16H23N3O8/c1-19(18-

{'PubChem': '28598', 'Inchi': 'InChI=1S/C20H12O/c21-16-8-6-14-10-15-5-4-12-2-1-3-13-7-9-17(18(14)11-16)20(15)19(12)13/h1-11,21H', 'SMILES': 'C1=CC2=C3C(=C1)C=CC4=C5C=C(C=CC5=CC(=C43)C=C2)O'} Not a valid SMILES string
{'PubChem': '115064', 'Inchi': 'InChI=1S/C20H12O2/c21-12-6-4-11-8-16-18-13(15(11)9-12)7-5-10-2-1-3-14(17(10)18)19-20(16)22-19/h1-9,19-21H', 'SMILES': 'C1=CC2=C3C(=C1)C4C(O4)C5=C3C(=C6C=C(C=CC6=C5)O)C=C2'} Not a valid SMILES string
{'PubChem': '904', 'Inchi': 'InChI=1S/C8H9NO/c1-7(10)9-8-5-3-2-4-6-8/h2-6H,1H3,(H,9,10)', 'SMILES': 'CC(=O)NC1=CC=CC=C1'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '44224030', 'Inchi': 'InChI=1S/C16H12O9/c1-23-10-2-8(19)11(6(3-17)9(20)4-18)14-12(10)7-5-24-15(21)13(7)16(22)25-14/h2-4,6,9,19-20H,5H2,1H3', 'SMILES': 'COC1=C2C3=C(C(=O)OC3)C(=O)OC2=C(C(=C1)O)C(C=O)C(C=O)O'} Not a valid SMILES string
{'

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '2756', 'Inchi': 'InChI=1S/C10H16N6S/c1-8-9(16-7-15-8)5-17-4-3-13-10(12-2)14-6-11/h7H,3-5H2,1-2H3,(H,15,16)(H2,12,13,14)', 'SMILES': 'CC1=C(N=CN1)CSCCNC(=NC)NC#N'} Not a valid SMILES string
{'PubChem': '114918', 'Inchi': 'InChI=1S/C14H8N2O6/c15-10-6(17)4-8-12(9(10)14(20)21)16-11-5(13(18)19)2-1-3-7(11)22-8/h1-4H,15H2,(H,18,19)(H,20,21)', 'SMILES': 'C1=CC(=C2C(=C1)OC3=CC(=O)C(=C(C3=N2)C(=O)O)N)C(=O)O'} Not a valid SMILES string
{'PubChem': '91828294', 'Inchi': 'InChI=1S/C35H60N7O18P3S/c1-4-5-6-7-8-9-10-11-12-13-23(43)18-26(45)64-17-16-37-25(44)14-15-38-33(48)30(47)35(2,3)20-57-63(54,55)60-62(52,53)56-19-24-29(59-61(49,50)51)28(46)34(58-24)42-22-41-27-31(36)39-21-40-32(2

{'PubChem': '28486', 'Inchi': 'InChI=1S/Li/q+1', 'SMILES': '[Li+]'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '157010245', 'Inchi': 'InChI=1S/C22H31N10O17P3/c1-30-7-32(18-11(30)19(36)29-22(24)28-18)20-14(35)12(33)8(46-20)3-44-50(37,38)48-52(41,42)49-51(39,40)45-4-9-13(34)15(43-2)21(47-9)31-6-27-10-16(23)25-5-26-17(10)31/h5-9,12-15,20-21,33-35H,3-4H2,1-2H3,(H7-,23,24,25,26,28,29,36,37,38,39,40,41,42)/p-2', 'SMILES': 'CN1C=[N+](C2=C1C(=O)NC(=N2)N)C3C(C(C(O3)COP(=O)([O-])OP(=O)([O-])OP(=O)([O-])OCC4C(C(C(O4)N5C=NC6=C(N=CN=C65)N)OC)O)O)O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': '

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '156317', 'Inchi': 'InChI=1S/C20H18O2/c1-11-13-5-3-4-6-14(13)12(2)19-15(11)7-8-17-16(19)9-10-18(21)20(17)22/h3-10,18,20-22H,1-2H3/t18-,20-/m0/s1', 'SMILES': 'CC1=C2C=CC3=C(C2=C(C4=CC=CC=C14)C)C=CC(C3O)O'} Not

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'OC[C@H](NC([*])=O)C(=O)N[*]'} Not a valid SMILES string
{'PubChem': '189122', 'Inchi': 'InChI=1S/C11H17N3O7S/c12-6(11(20)21)1-2-8(16)14-7(4-22-5-15)10(19)13-3-9(17)18/h5-7H,1-4,12H2,(H,13,19)(H,14,16)(H,17,18)(H,20,21)/p-1/t6-,7-/m0/s1', 'SMILES': '[NH3+][C@@H](CCC(=O)N[C@@H](CSC=O)C(=O)NCC([O-])=O)C([O-])=O'} Not a valid SMILES string
{'PubChem': '44176418', 'Inchi': 'InChI=1S/C10H17N3O6S2/c11-5(10(18)19)1-2-7(14)13-6(4-21-20)9(17)12-3-8(15)16/h5-6,20H,1-4,11H2,(H,12,17)(H,13,14)(H,15,16)(H,18,19)/t5-,6-/m0/s1', 'SMILES': 'C(CC(=O)NC(CSS)C(=O)NCC(=O)O)C(C(=O)O)N'} Not a valid SMILES string
{'PubChem': '53477723', 'Inchi': 'InChI=1S/C30H50O/c1-24(2)14-11-17-27(5)20-12-18-25(3)15-9-10-16-26(4)19-13-21-28(6)22-23-29-30(7,8)31-29/h14-16,20-21,29H,9-13,17-19,22-23H2,1-8H3/b25-15+,26-16+,27-2

{'PubChem': '148559', 'Inchi': 'InChI=1S/C11H22N6O4/c1-6(12)9(20)17-7(3-2-4-15-11(13)14)10(21)16-5-8(18)19/h6-7H,2-5,12H2,1H3,(H,16,21)(H,17,20)(H,18,19)(H4,13,14,15)/t6-,7-/m0/s1', 'SMILES': 'CC(C(=O)NC(CCCN=C(N)N)C(=O)NCC(=O)O)N'} Not a valid SMILES string
{'PubChem': '145453525', 'Inchi': 'InChI=1S/C13H24N4O5/c1-6(2)4-9(13(21)22)17-12(20)8(5-10(15)18)16-11(19)7(3)14/h6-9H,4-5,14H2,1-3H3,(H2,15,18)(H,16,19)(H,17,20)(H,21,22)/t7-,8-,9-/m0/s1', 'SMILES': 'CC(C)CC(C(=O)O)NC(=O)C(CC(=O)N)NC(=O)C(C)N'} Not a valid SMILES string
{'PubChem': '14299171', 'Inchi': 'InChI=1S/C11H22N4O4/c1-7(13)10(17)14-6-9(16)15-8(11(18)19)4-2-3-5-12/h7-8H,2-6,12-13H2,1H3,(H,14,17)(H,15,16)(H,18,19)/t7-,8-/m0/s1', 'SMILES': 'CC(C(=O)NCC(=O)NC(CCCCN)C(=O)O)N'} Not a valid SMILES string
{'PubChem': '7019972', 'Inchi': 'InChI=1S/C12H19N5O4/c1-6(13)10(18)17-9(3-8-4-14-5-15-8)11(19)16-7(2)12(20)21/h4-7,9H,3,13H2,1-2H3,(H,14,15)(H,16,19)(H,17,18)(H,20,21)/t6-,7-,9-/m0/s1', 'SMILES': 'CC(C(=O)NC(CC1=CN=CN1)C(=O)NC(C)

{'PubChem': '83887', 'Inchi': 'InChI=1S/C4H7NO4/c5-2(4(8)9)1-3(6)7/h2H,1,5H2,(H,6,7)(H,8,9)/p-1/t2-/m1/s1', 'SMILES': '[NH3+][C@H](CC([O-])=O)C([O-])=O'} Not a valid SMILES string
{'PubChem': '5960', 'Inchi': 'InChI=1S/C4H7NO4/c5-2(4(8)9)1-3(6)7/h2H,1,5H2,(H,6,7)(H,8,9)/p-1/t2-/m0/s1', 'SMILES': 'C(C(C(=O)O)N)C(=O)O'} Not a valid SMILES string
{'PubChem': '145453999', 'Inchi': 'InChI=1S/C13H25N7O5/c1-6(19-11(23)7(14)5-9(15)21)10(22)20-8(12(24)25)3-2-4-18-13(16)17/h6-8H,2-5,14H2,1H3,(H2,15,21)(H,19,23)(H,20,22)(H,24,25)(H4,16,17,18)/t6-,7-,8-/m0/s1', 'SMILES': 'CC(C(=O)NC(CCCN=C(N)N)C(=O)O)NC(=O)C(CC(=O)N)N'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '4130574', 'Inchi': 'InChI=1S/C9H14N2O7/c10-4(3-7(14)15)8(16)11-5(9(17)18)1-2-6(12)13/h4-5H,1-3,10H2,(H,11,16)(H,12,13)(H,14,15)(H,17,18)', 'SMILES': 'C(CC(=O)O)C(C(=O)O)NC(=O)C(CC(=O)O)N'} Not a valid SMILES string
{'PubChem': '145454428', 'Inchi': 'InChI=1S/C14H21N3

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '53481667', 'Inchi': 'InChI=1S/C15H27NO4/c1-5-6-7-8-9-10-15(19)20-13(16(2,3)4)11-12-14(17)18/h9-10,13H,5-8,11-12H2,1-4H3/b10-9+/t13-/m0/s1', 'SMILES': 'CCCCCC=CC(=O)OC(CCC(=O)[O-])[N+](C)(C)C'} Not a valid SMILES string
{'PubChem': '119058157', 'Inchi': 'InChI=1S/C29H46N7O17P3S/c1-4-5-6-7-8-9-20(38)57-13-12-31-19(37)10-11-32-27(41)24(40)29(2,3)15-50-56(47,48)53-55(45,46)49-14-18-23(52-54(42,43)44)22(39)28(51-18)36-17-35-21-25(30)33-16-34-26(21)36/h5-6,8-9,16-18,22-24,28,39-40H,4,7,10-15H2,1-3H3,(H,31,37)(H,32,41)(H,45,46)(H,47,48)(H2,30,33,34)(H2,42,43,44)/b6-5-,9-8+/t18-,22-,23-,24+,28-/m1/s1', 'SMILES': 'CCC=CCC=CC(=O)SCCNC(=O)CCNC(=O)C(C(C)(C)COP(=O)(O)OP(=O)(O)OCC1C(C(C(O1)N2C=NC3=C(N=CN=C32)N)O)OP(=O)(O)O)O'} Not a valid SMILES string
{'Pu

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'O[C@H]([*])[C@H](COP([O-])([O-])=O)NC([*])=O'} Not a valid SMILES string
{'PubChem': '10917', 'Inchi': 'InChI=1S/C7H15NO3/c1-8(2,3)5-6(9)4-7(10)11/h6,9H,4-5H2,1-3H3/t6-/m1/s1', 'SMILES': 'C[N+](C)(C)C[C@H](O)CC([O-])=O'} Not a valid SMILES string
{'PubChem': '137319715', 'Inchi': 'InChI=1S/C4H7N3O/c1-7-2-3(8)6-4(7)5/h2H2,1H3,(H2,5,6,8)', 'SMILES': 'CN1CC(=O)N=C1N'} Not a valid SMILES string
{'PubChem': '5754', 'Inchi': 'InChI=1S/C21H30O5/c1-19-7-5-13(23)9-12(19)3-4-14-15-6-8-21(26,17(25)11-22)20(15,2)10-16(24)18(14)19/h9,14-16,18,22,24,26H,3-8,10-11H2,1-2H3/t14-,15-,16-,18+,19-,20-,21-/m0/s1', 'SMILES': '[H][C@@]12CCC3=CC(=O)CC[C@]3(C)[C@@]1([H])[C@@H](O)C[C@@]1(C)[C@@]2([H])CC[C@]1(O)C(=O)CO'} Not a valid SMILES string
{'PubChem': '5753', 'Inchi': 'InChI=1S/C21H30O4/c1-20-8-7-13(23)9-12(20)3-4-14-15-5-6-16(18(25)11-22)21(15,2)10-17(24)19(14)20/h9,14-17,19,22,24H,3-8,10-11H2,1-2H3/t14-,15-,16+,17-,19+,20-,21-/m0/s1', 'SMILES': '[H][C@@]1(CC

{'PubChem': '6175', 'Inchi': 'InChI=1S/C9H13N3O5/c10-5-1-2-12(9(16)11-5)8-7(15)6(14)4(3-13)17-8/h1-2,4,6-8,13-15H,3H2,(H2,10,11,16)/t4-,6-,7-,8-/m1/s1', 'SMILES': 'Nc1ccn([C@@H]2O[C@H](CO)[C@@H](O)[C@H]2O)c(=O)n1'} Not a valid SMILES string
{'PubChem': '13730', 'Inchi': 'InChI=1S/C10H13N5O3/c11-9-8-10(13-3-12-9)15(4-14-8)7-1-5(17)6(2-16)18-7/h3-7,16-17H,1-2H2,(H2,11,12,13)/t5-,6+,7+/m0/s1', 'SMILES': 'Nc1ncnc2n(cnc12)[C@H]1C[C@H](O)[C@@H](CO)O1'} Not a valid SMILES string
{'PubChem': '439182', 'Inchi': 'InChI=1S/C10H13N5O3/c1-4-6(16)7(17)10(18-4)15-3-14-5-8(11)12-2-13-9(5)15/h2-4,6-7,10,16-17H,1H3,(H2,11,12,13)/t4-,6-,7-,10-/m1/s1', 'SMILES': 'CC1C(C(C(O1)N2C=NC3=C(N=CN=C32)N)O)O'} Not a valid SMILES string
{'PubChem': '188966', 'Inchi': 'InChI=1S/C10H15N5O9P2/c11-9-8-10(13-3-12-9)15(4-14-8)7-1-5(16)6(23-7)2-22-26(20,21)24-25(17,18)19/h3-7,16H,1-2H2,(H,20,21)(H2,11,12,13)(H2,17,18,19)/p-3/t5-,6+,7+/m0/s1', 'SMILES': 'Nc1ncnc2n(cnc12)[C@H]1C[C@H](O)[C@@H](COP(O)(=O)OP(O)(O)=O)O1'} Not a

{'PubChem': '673', 'Inchi': 'InChI=1S/C4H9NO2/c1-5(2)3-4(6)7/h3H2,1-2H3,(H,6,7)', 'SMILES': 'CN(C)CC(O)=O'} Not a valid SMILES string
{'PubChem': '90008792', 'Inchi': 'InChI=1/C30H52N7O17P3S/c1-17(2)7-6-8-18(3)29(42)58-12-11-32-20(38)9-10-33-27(41)24(40)30(4,5)14-51-57(48,49)54-56(46,47)50-13-19-23(53-55(43,44)45)22(39)28(52-19)37-16-36-21-25(31)34-15-35-26(21)37/h15-19,22-24,28,39-40H,6-14H2,1-5H3,(H,32,38)(H,33,41)(H,46,47)(H,48,49)(H2,31,34,35)(H2,43,44,45)/p-4/t18-,19+,22-,23-,24?,28+/m0/s1', 'SMILES': 'CC(C)CCCC(C)C(=O)SCCNC(=O)CCNC(=O)[C@H](O)C(C)(C)COP(O)(=O)OP(O)(=O)OC[C@H]1O[C@H]([C@H](O)[C@@H]1OP(O)(O)=O)n1cnc2c(N)ncnc12'} Not a valid SMILES string
{'PubChem': '53477823', 'Inchi': 'InChI=1/C16H31NO4/c1-12(2)8-7-9-13(3)16(20)21-14(10-15(18)19)11-17(4,5)6/h12-14H,7-11H2,1-6H3', 'SMILES': 'CC(C)CCCC(C)C(=O)OC(CC([O-])=O)C[N+](C)(C)C'} Not a valid SMILES string
{'PubChem': '123831', 'Inchi': 'InChI=1S/C8H18N4O2/c1-12(2)8(10)11-5-3-4-6(9)7(13)14/h6H,3-5,9H2,1-2H3,(H2,10,11)(H,13,1

{'PubChem': '445639', 'Inchi': 'InChI=1S/C18H34O2/c1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18(19)20/h9-10H,2-8,11-17H2,1H3,(H,19,20)/p-1/b10-9+', 'SMILES': 'CCCCCCCCC=CCCCCCCCC(=O)O'} Not a valid SMILES string
{'PubChem': '6441392', 'Inchi': 'InChI=1/C25H47NO4/c1-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-25(29)30-23(21-24(27)28)22-26(2,3)4/h12-13,23H,5-11,14-22H2,1-4H3/b13-12+', 'SMILES': 'CCCCCCCCC=CCCCCCCCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '5460032', 'Inchi': 'InChI=1S/C4H8O4/c5-1-3(7)4(8)2-6/h3,5-7H,1-2H2/t3-/m0/s1', 'SMILES': 'OC[C@H](O)C(=O)CO'} Not a valid S

{'PubChem': '70678546', 'Inchi': 'InChI=1S/C51H90N2O27/c1-4-5-6-7-8-9-10-11-12-13-14-15-16-17-27(60)26(52-23-58)22-71-48-41(69)38(66)44(30(20-56)75-48)78-50-42(70)39(67)43(31(21-57)76-50)77-47-32(53-25(3)59)45(35(63)29(19-55)73-47)79-51-46(37(65)34(62)28(18-54)74-51)80-49-40(68)36(64)33(61)24(2)72-49/h16-17,23-24,26-51,54-57,60-70H,4-15,18-22H2,1-3H3,(H,52,58)(H,53,59)/b17-16+/t24-,26-,27+,28+,29+,30+,31+,32+,33+,34-,35+,36+,37-,38+,39+,40-,41+,42+,43-,44+,45+,46+,47-,48+,49-,50-,51-/m0/s1', 'SMILES': 'CCCCCCCCCCCCC\\C=C\\[C@@H](O)[C@H](CO[C@@H]1O[C@H](CO)[C@@H](O[C@@H]2O[C@H](CO)[C@H](O[C@@H]3O[C@H](CO)[C@@H](O)[C@H](O[C@@H]4O[C@H](CO)[C@H](O)[C@H](O)[C@H]4O[C@@H]4O[C@@H](C)[C@@H](O)[C@@H](O)[C@@H]4O)[C@H]3NC(C)=O)[C@H](O)[C@H]2O)[C@H](O)[C@H]1O)NC([*])=O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'CCCCCCCCCCCCC\\C=C\\[C@@H](O)[C@H](CO[C@@H]1O[C@H](CO)[C@@H](O[C@@H]2O[C@H](CO)[C@H](O)[C@H](O[C@@H]3O[C@H](CO)[C@@H](O[C@@H]4O[C@@H](C)[C@@H](O)[C@@H](O)[C@@H

{'PubChem': '145454983', 'Inchi': 'InChI=1S/C14H24N6O7/c15-6(1-3-9(16)21)12(24)20-8(5-11(18)23)13(25)19-7(14(26)27)2-4-10(17)22/h6-8H,1-5,15H2,(H2,16,21)(H2,17,22)(H2,18,23)(H,19,25)(H,20,24)(H,26,27)/t6-,7-,8-/m0/s1', 'SMILES': 'C(CC(=O)N)C(C(=O)NC(CC(=O)N)C(=O)NC(CCC(=O)N)C(=O)O)N'} Not a valid SMILES string
{'PubChem': '145455084', 'Inchi': 'InChI=1S/C17H24N8O5/c18-11(1-2-14(19)26)15(27)24-12(3-9-5-20-7-22-9)16(28)25-13(17(29)30)4-10-6-21-8-23-10/h5-8,11-13H,1-4,18H2,(H2,19,26)(H,20,22)(H,21,23)(H,24,27)(H,25,28)(H,29,30)/t11-,12-,13-/m0/s1', 'SMILES': 'C1=C(NC=N1)CC(C(=O)NC(CC2=CN=CN2)C(=O)O)NC(=O)C(CCC(=O)N)N'} Not a valid SMILES string
{'PubChem': '145455087', 'Inchi': 'InChI=1S/C17H29N7O5/c18-6-2-1-3-12(17(28)29)23-16(27)13(7-10-8-21-9-22-10)24-15(26)11(19)4-5-14(20)25/h8-9,11-13H,1-7,18-19H2,(H2,20,25)(H,21,22)(H,23,27)(H,24,26)(H,28,29)/t11-,12-,13-/m0/s1', 'SMILES': 'C1=C(NC=N1)CC(C(=O)NC(CCCCN)C(=O)O)NC(=O)C(CCC(=O)N)N'} Not a valid SMILES string
{'PubChem': '145455141', 'In

{'PubChem': '75104755', 'Inchi': 'InChI=1S/C78H131N5O47/c1-6-7-8-9-10-11-12-13-14-15-16-17-18-19-37(96)36(79-31-91)30-117-70-58(108)57(107)61(46(28-89)120-70)122-72-60(110)68(130-78(75(115)116)22-39(98)49(81-33(3)93)65(127-78)53(103)42(101)24-85)62(47(29-90)121-72)123-69-51(83-35(5)95)63(54(104)43(25-86)118-69)124-71-59(109)67(55(105)44(26-87)119-71)129-77(74(113)114)21-40(99)50(82-34(4)94)66(128-77)56(106)45(27-88)125-76(73(111)112)20-38(97)48(80-32(2)92)64(126-76)52(102)41(100)23-84/h18-19,31,36-72,84-90,96-110H,6-17,20-30H2,1-5H3,(H,79,91)(H,80,92)(H,81,93)(H,82,94)(H,83,95)(H,111,112)(H,113,114)(H,115,116)', 'SMILES': 'CCCCCCCCCCCCCC=CC(C(COC1C(C(C(C(O1)CO)OC2C(C(C(C(O2)CO)OC3C(C(C(C(O3)CO)O)OC4C(C(C(C(O4)CO)O)OC5(CC(C(C(O5)C(C(CO)OC6(CC(C(C(O6)C(C(CO)O)O)NC(=O)C)O)C(=O)O)O)NC(=O)C)O)C(=O)O)O)NC(=O)C)OC7(CC(C(C(O7)C(C(CO)O)O)NC(=O)C)O)C(=O)O)O)O)O)NC=O)O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inc

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '964', 'Inchi': 'InChI=1S/C3H4O4/c4-1-2(5)3(6)7/h4H,1H2,(H,6,7)/p-1', 'SMILES': 'C(C(=O)C(=O)O)O'} Not a valid SMILES string
{'PubChem': '6438629', 'Inchi': 'InChI=1S/C20H28O3/c1-14(7-6-8-15(2)13-19(22)23)9-10-17-16(3)18(21)11-12-20(17,4)5/h6-10,13,18,21H,11-12H2,1-5H3,(H,22,23)/b8-6+,10-9+,14-7+,15-13+', 'SMILES': 'CC1=C(C(CCC1O)(C)C)C=CC(=CC=CC(=CC(=O)O)C)C'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} 

{'PubChem': '807', 'Inchi': 'InChI=1S/I2/c1-2', 'SMILES': 'II'} Not a valid SMILES string
{'PubChem': '1195', 'Inchi': 'InChI=1S/C5H12O7P2/c1-5(2)3-4-11-14(9,10)12-13(6,7)8/h1,3-4H2,2H3,(H,9,10)(H2,6,7,8)/p-3', 'SMILES': 'CC(=C)CCOP(O)(=O)OP(O)(O)=O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '45266606', 'Inchi': 'InChI=1S/C26H40N7O19P3S/c1-13(25(39)40)8-16(35)56-7-6-28-15(34)4-5-29-23(38)20(37)26(2,3)10-49-55(46,47)52-54(44,45)48-9-14-19(51-53(41,42)43)18(36)24(50-14)33-12-32-17-21(27)30-11-31-22(17)33/h11-12,14,18-20,24,36-37H,1,4-10H2,2-3H3,(H,28,34)(H,29,38)(H,39,40)(H,44,45)(H,46,47)(H2,27,30,31)(H2,41,42,43)/p-5/t14-,18-,19-,20+,24-/m1/s1', 'SMILES': 'CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H]([C@H](O)[C@@H]1OP([O-])([O-])=O)n1cnc2c(N)ncnc12)[C@@H](O)C(=O)NCCC(=O)NCCSC(=O)CC(=C)C([O-])=O'} Not a valid SMILES string
{'PubChem': 

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid 

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '5282457', 'Inchi': 'InChI=1/C18H32O2/c1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18(19)20/h6-7,9-10H,2-5,8,11-17H2,1H3,(H,19,20)/b7-6+,10-9+', 'SMILES': 'CCCCC\\C=C\\C\\C=C\\CCCCCCCC(O)=O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '6441626', 'Inchi': 'InChI=1S/C39H66N7O17P3S/c1-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-30(48)67-23-22-41-29(47)20-21-42-37(51)34(50)39(2,3)25-60-66(57,58)63-65(55,56)59-24-28-33(62-64(52,53)54)32(49)38(61-28)46-27-45-31-35(40)43-26-44-36(31)46/h8-9,11-12,26-28,32-34,38,49-50H,4-7,10,13-25H2,1-3H3,(H,41,47)(H,42,51)(H,55,56)(H,57,58)(H2,40,43,44)(H2,52,53,54)/b9-8+,

{'PubChem': '638129', 'Inchi': 'InChI=1S/C5H6O4/c1-3(5(8)9)2-4(6)7/h2H,1H3,(H,6,7)(H,8,9)/p-2/b3-2+', 'SMILES': 'C\\C(=C/C([O-])=O)C([O-])=O'} Not a valid SMILES string
{'PubChem': '6137', 'Inchi': 'InChI=1S/C5H11NO2S/c1-9-3-2-4(6)5(7)8/h4H,2-3,6H2,1H3,(H,7,8)/t4-/m0/s1', 'SMILES': 'CSCC[C@H](N)C(O)=O'} Not a valid SMILES string
{'PubChem': '145456846', 'Inchi': 'InChI=1S/C17H34N6O4S/c1-10(2)9-13(16(26)27)23-15(25)12(5-4-7-21-17(19)20)22-14(24)11(18)6-8-28-3/h10-13H,4-9,18H2,1-3H3,(H,22,24)(H,23,25)(H,26,27)(H4,19,20,21)/t11-,12-,13-/m0/s1', 'SMILES': 'CC(C)CC(C(=O)O)NC(=O)C(CCCN=C(N)N)NC(=O)C(CCSC)N'} Not a valid SMILES string
{'PubChem': '145456868', 'Inchi': 'InChI=1S/C18H26N4O6S/c1-29-7-6-12(19)16(25)21-13(9-15(20)24)17(26)22-14(18(27)28)8-10-2-4-11(23)5-3-10/h2-5,12-14,23H,6-9,19H2,1H3,(H2,20,24)(H,21,25)(H,22,26)(H,27,28)/t12-,13-,14-/m0/s1', 'SMILES': 'CSCCC(C(=O)NC(CC(=O)N)C(=O)NC(CC1=CC=C(C=C1)O)C(=O)O)N'} Not a valid SMILES string
{'PubChem': '145456920', 'Inchi': 'InChI=1S/C

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '14180', 'Inchi': 'InChI=1S/C11H15N2O8P/c12-10(16)6-2-1-3-13(4-6)11-9(15)8(14)7(21-11)5-20-22(17,18)19/h1-4,7-9,11,14-15H,5H2,(H3-,12,16,17,18,19)/p-1/t7-,8-,9-,11-/m1/s1', 'SMILES': 'NC(=O)c1ccc[n+](c1)[C@@H]1O[C@H](COP(O)([O-])=O)[C@@H](O)[C@H]1O'} Not a valid SMILES string
{'PubChem': '439791', 'Inchi': 'InChI=1S/C5H14N2/c1-7-5-3-2-4-6/h7H,2-6H2,1H3/p+2', 'SMILES': 'C[NH2+]CCCC[NH3+]'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '150885', 'Inchi': 'InChI=1S/C11H14N2O/c1-12-5-4-8-7-13-11-3-2-9(14)6-10(8)11/h2-3,6-7,12-14H,4-5H2,1H3', 'SMILES': 'CNCCc1c[nH]c2ccc(O)cc12'} Not a valid SMILES string
{'PubChem': '145068', 'Inchi': 'InChI=1S/NO/c1-2', 'SMILES': '[N]=O'} Not a valid SMILES string
{'PubChem': '24529', 'Inchi': 'InChI=1S/HNO2/c2-1-3/h(H,2,3)/p-1', 'SMILES': '[H]ON=O'} Not a valid SMILES string
{'PubChem': '439607', '

{'PubChem': '9546675', 'Inchi': 'InChI=1S/C26H52NO7P/c1-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-26(29)34-25(23-28)24-33-35(30,31)32-22-21-27(2,3)4/h12-13,25,28H,5-11,14-24H2,1-4H3/p+1/b13-12-', 'SMILES': 'CCCCCCCCC=CCCCCCCCC(=O)OC(CO)COP(=O)(O)OCC[N+](C)(C)C'} Not a valid SMILES string
{'PubChem': '9546667', 'Inchi': 'InChI=1S/C24H50NO7P/c1-5-6-7-8-9-10-11-12-13-14-15-16-17-18-24(27)32-23(21-26)22-31-33(28,29)30-20-19-25(2,3)4/h23,26H,5-22H2,1-4H3/p+1', 'SMILES': 'CCCCCCCCCCCCCCCC(=O)OC(CO)COP(=O)(O)OCC[N+](C)(C)C'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '24798685', 'Inchi': 'InChI=1S/C10H20NO8P/c1-11(2,3)4-5-18-20(14,15)19-7-10(17-9-13)6-16-8-12/h8-10H,4-7H2,1-3H3/p+1/t10-/m1/s1', 'SMILES': 'C[N+](C)(C)CCOP(O)(=O)OC[C@@H](COC([*])=O)OC([*])=O'} Not a valid SMILES string
{'PubChem': '24779476', 'Inchi': 'InChI=1S/C28H50NO7P/c1-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-28(31)34-25-27(30)26-36-37(32,33

{'PubChem': '76808', 'Inchi': 'InChI=1S/C15H22N2O3/c1-10(2)8-13(15(19)20)17-14(18)12(16)9-11-6-4-3-5-7-11/h3-7,10,12-13H,8-9,16H2,1-2H3,(H,17,18)(H,19,20)/t12-,13-/m0/s1', 'SMILES': 'CC(C)CC(C(=O)O)NC(=O)C(CC1=CC=CC=C1)N'} Not a valid SMILES string
{'PubChem': '71464636', 'Inchi': 'InChI=1S/C19H27N3O6/c1-11(2)8-13(20)17(25)21-14(9-12-6-4-3-5-7-12)18(26)22-15(19(27)28)10-16(23)24/h3-7,11,13-15H,8-10,20H2,1-2H3,(H,21,25)(H,22,26)(H,23,24)(H,27,28)/t13-,14-,15-/m0/s1', 'SMILES': 'CC(C)CC(C(=O)NC(CC1=CC=CC=C1)C(=O)NC(CC(=O)O)C(=O)O)N'} Not a valid SMILES string
{'PubChem': '145457254', 'Inchi': 'InChI=1S/C21H29N5O4/c1-13(2)8-17(25-19(27)16(22)9-14-6-4-3-5-7-14)20(28)26-18(21(29)30)10-15-11-23-12-24-15/h3-7,11-13,16-18H,8-10,22H2,1-2H3,(H,23,24)(H,25,27)(H,26,28)(H,29,30)/t16-,17-,18-/m0/s1', 'SMILES': 'CC(C)CC(C(=O)NC(CC1=CN=CN1)C(=O)O)NC(=O)C(CC2=CC=CC=C2)N'} Not a valid SMILES string
{'PubChem': '53491907', 'Inchi': 'InChI=1S/C18H28N4O4/c1-12(18(25)26)21-17(24)15(9-5-6-10-19)22-16(23)14(

{'PubChem': '439322', 'Inchi': 'InChI=1S/C11H22N2O4S/c1-11(2,7-14)9(16)10(17)13-4-3-8(15)12-5-6-18/h9,14,16,18H,3-7H2,1-2H3,(H,12,15)(H,13,17)/t9-/m0/s1', 'SMILES': 'CC(C)(CO)[C@@H](O)C(=O)NCCC(=O)NCCS'} Not a valid SMILES string
{'PubChem': '1053', 'Inchi': 'InChI=1S/C8H13N2O5P/c1-5-8(11)7(2-9)6(3-10-5)4-15-16(12,13)14/h3,11H,2,4,9H2,1H3,(H2,12,13,14)/p-1', 'SMILES': 'Cc1ncc(COP(O)(O)=O)c(CN)c1O'} Not a valid SMILES string
{'PubChem': '1052', 'Inchi': 'InChI=1S/C8H12N2O2/c1-5-8(12)7(2-9)6(4-11)3-10-5/h3,11-12H,2,4,9H2,1H3/p+1', 'SMILES': 'Cc1ncc(CO)c(CN)c1O'} Not a valid SMILES string
{'PubChem': '1050', 'Inchi': 'InChI=1S/C8H9NO3/c1-5-8(12)7(4-11)6(3-10)2-9-5/h2,4,10,12H,3H2,1H3', 'SMILES': 'CC1=NC=C(C(=C1O)C=O)CO'} Not a valid SMILES string
{'PubChem': '1051', 'Inchi': 'InChI=1S/C8H10NO6P/c1-5-8(11)7(3-10)6(2-9-5)4-15-16(12,13)14/h2-3,11H,4H2,1H3,(H2,12,13,14)/p-2', 'SMILES': '[H]C(=O)c1c(COP([O-])([O-])=O)cnc(C)c1O'} Not a valid SMILES string
{'PubChem': '1054', 'Inchi': 'InChI=1S/

{'PubChem': '1092', 'Inchi': 'InChI=1S/H3O3PSe/c1-4(2,3)5/h(H3,1,2,3,5)/p-3', 'SMILES': 'OP(O)(O)=[Se]'} Not a valid SMILES string
{'PubChem': '71077', 'Inchi': 'InChI=1S/C3H7NO3/c4-2(1-5)3(6)7/h2,5H,1,4H2,(H,6,7)/t2-/m1/s1', 'SMILES': 'N[C@H](CO)C(O)=O'} Not a valid SMILES string
{'PubChem': '5951', 'Inchi': 'InChI=1S/C3H7NO3/c4-2(1-5)3(6)7/h2,5H,1,4H2,(H,6,7)/t2-/m0/s1', 'SMILES': 'N[C@@H](CO)C(O)=O'} Not a valid SMILES string
{'PubChem': '145457660', 'Inchi': 'InChI=1S/C12H24N6O5/c1-6(11(22)23)17-10(21)8(3-2-4-16-12(14)15)18-9(20)7(13)5-19/h6-8,19H,2-5,13H2,1H3,(H,17,21)(H,18,20)(H,22,23)(H4,14,15,16)/t6-,7-,8-/m0/s1', 'SMILES': 'CC(C(=O)O)NC(=O)C(CCCN=C(N)N)NC(=O)C(CO)N'} Not a valid SMILES string
{'PubChem': '145457673', 'Inchi': 'InChI=1S/C20H29N7O5/c21-13(10-28)17(29)26-15(6-3-7-24-20(22)23)18(30)27-16(19(31)32)8-11-9-25-14-5-2-1-4-12(11)14/h1-2,4-5,9,13,15-16,25,28H,3,6-8,10,21H2,(H,26,29)(H,27,30)(H,31,32)(H4,22,23,24)/t13-,15-,16-/m0/s1', 'SMILES': 'C1=CC=C2C(=C1)C(=CN2)CC(C(

{'PubChem': '2733768', 'Inchi': 'InChI=1S/C26H45NO6S/c1-16(4-9-24(30)27-12-13-34(31,32)33)20-7-8-21-19-6-5-17-14-18(28)10-11-25(17,2)22(19)15-23(29)26(20,21)3/h16-23,28-29H,4-15H2,1-3H3,(H,27,30)(H,31,32,33)/t16-,17-,18-,19+,20-,21+,22+,23+,25+,26-/m1/s1', 'SMILES': 'CC(CCC(=O)NCCS(=O)(=O)O)C1CCC2C1(C(CC3C2CCC4C3(CCC(C4)O)C)O)C'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'C[N+](C)(C)CC(CC([O-])=O)OC([*])=O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '21252281', 'Inchi': 'InChI=1S/C35H58N7O17P3S/c1-4-5-6-7-8-9-10-11-12-13-14-15-26(44)63-19-18-37-25(43)16-17-38-33(47)30(46)35(2,3)21-56-62(53,54)59-61(51,52)55-20-24-29(58-60(48,49)50)28(45)34(57-24)42-23-41-27-31(36)39-22-40-32(27)42/h8-9,11-12,22-24,28-30,34,45-46H,4-7,10,13-21H2,1-3

{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid 

{'PubChem': '145458809', 'Inchi': 'InChI=1S/C19H29N3O5S/c1-11(2)16(18(25)21-15(19(26)27)8-9-28-3)22-17(24)14(20)10-12-4-6-13(23)7-5-12/h4-7,11,14-16,23H,8-10,20H2,1-3H3,(H,21,25)(H,22,24)(H,26,27)/t14-,15-,16-/m0/s1', 'SMILES': 'CC(C)C(C(=O)NC(CCSC)C(=O)O)NC(=O)C(CC1=CC=C(C=C1)O)N'} Not a valid SMILES string
{'PubChem': '46173749', 'Inchi': 'InChI=1S/C94H156N8O26P2/c1-59(2)31-21-32-60(3)33-22-34-61(4)35-23-36-62(5)37-24-38-63(6)39-25-40-64(7)41-26-42-65(8)43-27-44-66(9)45-28-46-67(10)47-29-48-68(11)49-30-50-69(12)54-56-122-129(118,119)128-130(120,121)127-94-82(100-75(18)106)86(85(79(58-104)125-94)126-93-81(99-74(17)105)84(109)83(108)78(57-103)124-93)123-73(16)89(112)96-71(14)88(111)102-77(92(116)117)52-53-80(107)101-76(51-19-20-55-95)90(113)97-70(13)87(110)98-72(15)91(114)115/h31,33,35,37,39,41,43,45,47,49,54,70-73,76-79,81-86,93-94,103-104,108-109H,19-30,32,34,36,38,40,42,44,46,48,50-53,55-58,95H2,1-18H3,(H,96,112)(H,97,113)(H,98,110)(H,99,105)(H,100,106)(H,101,107)(H,102,111)(H,114,1

{'PubChem': '439692', 'Inchi': 'InChI=1S/C5H8O5/c6-2-1-10-5(9)4(8)3(2)7/h2-4,6-8H,1H2/t2-,3+,4-/m1/s1', 'SMILES': 'C1C(C(C(C(=O)O1)O)O)O'} Not a valid SMILES string
{'PubChem': '6971043', 'Inchi': 'InChI=1S/C5H10O6/c6-1-2(7)3(8)4(9)5(10)11/h2-4,6-9H,1H2,(H,10,11)/t2-,3+,4-/m0/s1', 'SMILES': 'C(C(C(C(C(=O)O)O)O)O)O'} Not a valid SMILES string
{'PubChem': 'NaN', 'Inchi': 'NaN', 'SMILES': 'NaN'} Not a valid SMILES string
{'PubChem': '6912', 'Inchi': 'InChI=1S/C5H12O5/c6-1-3(8)5(10)4(9)2-7/h3-10H,1-2H2/t3-,4+,5+', 'SMILES': 'OC[C@H](O)[C@@H](O)[C@H](O)CO'} Not a valid SMILES string
{'PubChem': '5289590', 'Inchi': 'InChI=1S/C5H10O5/c6-1-3(8)5(10)4(9)2-7/h3,5-8,10H,1-2H2/t3-,5+/m1/s1', 'SMILES': '[H][C@@](O)(CO)[C@]([H])(O)C(=O)CO'} Not a valid SMILES string
{'PubChem': '22253', 'Inchi': 'InChI=1S/C5H10O5/c6-1-3(8)5(10)4(9)2-7/h3,5-8,10H,1-2H2/t3-,5+/m0/s1', 'SMILES': 'C(C(C(C(=O)CO)O)O)O'} Not a valid SMILES string
{'PubChem': '92729', 'Inchi': 'InChI=1S/C28H48O2/c1-20(2)11-8-12-21(3)13-9-1

In [None]:
# Initialize an empty dictionary to hold the results
similarity_scores = {}

# Iterate over each pair of keys in the dictionary
for key1, smi1 in tqdm(dict_metabolites_canonical.items()):
    for key2, smi2 in dict_metabolites_canonical.items():
        # Avoid comparing a molecule to itself
        if key1 != key2:
            # Construct a unique identifier for this pair of molecules
            pair_id = f"{key1}_{key2}"

            # Calculate the similarity between the two molecules
            similarity = similarity_calc(smi1, smi2)

            # Store the similarity score in the results dictionary
            similarity_scores[pair_id] = similarity

print(similarity_scores)
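
The loop above scores every ordered pair, so each pair of molecules is computed twice (once as `A_B` and once as `B_A`). A minimal sketch of scoring each unordered pair only once with `itertools.combinations` is given below. It assumes the similarity measure is symmetric. `toy_similarity` is a hypothetical stand-in for `similarity_calc` (a Jaccard index on character bigrams, not a chemical fingerprint), and the example dictionary uses three PubChem IDs and SMILES strings taken from the output above.

```python
from itertools import combinations


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for similarity_calc: Jaccard index on character bigrams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def pairwise_scores(smiles_by_id, similarity=toy_similarity):
    """Score each unordered pair once, keyed by an (id1, id2) tuple."""
    return {
        (id1, id2): similarity(smi1, smi2)
        for (id1, smi1), (id2, smi2) in combinations(smiles_by_id.items(), 2)
    }


# Small illustrative input: PubChem ID -> SMILES
example = {
    "5951": "N[C@@H](CO)C(O)=O",
    "71077": "N[C@H](CO)C(O)=O",
    "807": "II",
}
scores = pairwise_scores(example)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

Tuple keys are used here instead of `f"{key1}_{key2}"` because identifiers that already contain underscores would make the joined string ambiguous. Passing `similarity=similarity_calc` would apply the same pattern to the notebook's own function, assuming it takes two SMILES strings and returns a number.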

In [4]:
# Update the Google Sheet with the modified DataFrame
sheet.update_google_sheet(sheet_rxns, rxns)
sheet.update_google_sheet(shee_attributes, attributes)
sheet.update_google_sheet(sheet_met, met)
print("Google Sheet updated.")

Google Sheet updated.
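
Once the sheets are updated, the metabolite IDs used in the "Rxns" sheet can be compared with those listed in the "Metabolites" sheet. A minimal sketch of that comparison with set differences follows; `compare_ids` and the two ID lists are hypothetical and serve only as an illustration.

```python
def compare_ids(model_ids, sheet_ids):
    """Return the IDs found on only one side, sorted for stable display."""
    model_set, sheet_set = set(model_ids), set(sheet_ids)
    return {
        "only_in_reactions": sorted(model_set - sheet_set),
        "only_in_metabolites_sheet": sorted(sheet_set - model_set),
    }


# Hypothetical ID lists, for illustration only
ids_from_reactions = ["atp_c", "adp_c", "h2o_c", "pi_c"]
ids_from_sheet = ["atp_c", "adp_c", "h2o_c", "h_c"]

report = compare_ids(ids_from_reactions, ids_from_sheet)
print(report)
```

An empty list on both sides would indicate that the two sheets reference the same set of metabolites.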


In [20]:
# Check for differences between the metabolites in the "Rxns" and "Metabolites" Sheets

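model = Model("iCHO")
lr = []
for _, row in rxns.iterrows():
    r = Reaction(row['Reaction'])
    lr.append(r)
model.add_reactions(lr)

# Build each reaction from its formula string; metabolites not yet in the model are created on the fly
for i, r in enumerate(tqdm(model.reactions)):
    print(r.id)
    # Positional lookup keeps row i aligned with reaction i even if the index is not 0..n-1
    r.build_reaction_from_string(rxns['Reaction Formula'].iloc[i])

# Metabolite IDs found in the reaction formulas
model_met_list = [m.id for m in model.metabolites]

# Metabolite IDs listed in the "Metabolites" sheet
sheet_met_list = list(met['BiGG ID'])

# Note: these assignments reuse the names `model` and `sheet`, replacing the
# cobra model and the Google Sheets helper used in the cells above
model = set(model_met_list)
sheet = set(sheet_met_list)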

Set parameter Username
Academic license - for non-commercial use only - expires 2024-03-24


  0%|          | 0/10547 [00:00<?, ?it/s]

10FTHF5GLUtl
unknown metabolite '10fthf5glu_c' created
unknown metabolite '10fthf5glu_l' created
10FTHF5GLUtm
unknown metabolite '10fthf5glu_m' created
10FTHF6GLUtl
unknown metabolite '10fthf6glu_c' created
unknown metabolite '10fthf6glu_l' created
10FTHF6GLUtm
unknown metabolite '10fthf6glu_m' created
10FTHF7GLUtl
unknown metabolite '10fthf7glu_c' created
unknown metabolite '10fthf7glu_l' created
10FTHF7GLUtm
unknown metabolite '10fthf7glu_m' created
10FTHFtl
unknown metabolite '10fthf_c' created
unknown metabolite '10fthf_l' created
10FTHFtm
unknown metabolite '10fthf_m' created
11DOCRTSLte
unknown metabolite '11docrtsl_c' created
unknown metabolite 'atp_c' created
unknown metabolite 'h2o_c' created
unknown metabolite '11docrtsl_e' created
unknown metabolite 'adp_c' created
unknown metabolite 'h_c' created
unknown metabolite 'pi_c' created
11DOCRTSLtm
unknown metabolite '11docrtsl_m' created
11DOCRTSLtr
unknown metabolite '11docrtsl_r' created
11DOCRTSTRNte
unknown metabolite '11docr

unknown metabolite 'acgalfucgalacglcgal14acglcgalgluside_cho_g' created
ABO7g
unknown metabolite 'fucgalacgalfucgalacglcgal14acglcgalgluside_cho_g' created
unknown metabolite 'acgalfucgalacgalfucgalacglcgal14acglcgalgluside_cho_g' created
ABO8g
unknown metabolite 'galfucgalacglcgal14acglcgalgluside_cho_g' created
ABO9g
unknown metabolite 'fucfucgalacglcgalacglcgal14acglcgalgluside_cho_g' created
unknown metabolite 'galgalfucfucgalacglcgalacglcgal14acglcgalgluside_cho_g' created
ABTArm
unknown metabolite 'sucsal_m' created
ABTD
unknown metabolite 'abt_c' created
unknown metabolite 'xylu_L_c' created
ABTti
unknown metabolite 'abt_e' created
ABUTD
unknown metabolite '4abutn_c' created
ABUTDm
unknown metabolite '4abutn_m' created
ABUTt2r
unknown metabolite '4abut_e' created
ABUTt2rL
unknown metabolite '4abut_l' created
ABUTt4_2_r
ACACT100m
unknown metabolite '3odcoa_m' created
unknown metabolite 'occoa_m' created
ACACT102n3m
unknown metabolite 'octe5coa_m' created
ACACT102n3p
unknown metab

unknown metabolite 'pro_L_c' created
unknown metabolite 'pro_L_e' created
ASPPROLYSt
unknown metabolite 'aspprolys_e' created
unknown metabolite 'aspprolys_c' created
ASPTA
ASPTAm
unknown metabolite 'oaa_m' created
ASPTRAH
unknown metabolite 'asptrna_c' created
unknown metabolite 'trnaasp_c' created
ASPTRAHm
unknown metabolite 'asptrna_m' created
unknown metabolite 'trnaasp_m' created
ASPTRS
ASPTRSm
ASPVALASNt
unknown metabolite 'aspvalasn_e' created
unknown metabolite 'aspvalasn_c' created
ASPt6
ASPt7l
unknown metabolite 'asp_L_l' created
ASPte
ATAH
unknown metabolite 'urdglyc_m' created
ATAn
unknown metabolite 'h2o_n' created
unknown metabolite 'pi_n' created
ATAx
unknown metabolite 'atp_x' created
unknown metabolite 'pi_x' created
ATP1ter
unknown metabolite 'adp_r' created
unknown metabolite 'atp_r' created
ATP2ter
ATPH1e
unknown metabolite 'atp_e' created
unknown metabolite 'amp_e' created
unknown metabolite 'pi_e' created
ATPH2e
unknown metabolite 'adp_e' created
ATPM
ATPS4m
unkno

CYTDtn
CYTK1
CYTK10
CYTK10n
unknown metabolite 'cmp_n' created
unknown metabolite 'dgtp_n' created
unknown metabolite 'cdp_n' created
unknown metabolite 'dgdp_n' created
CYTK11
unknown metabolite 'dcmp_c' created
unknown metabolite 'dcdp_c' created
CYTK11n
unknown metabolite 'dcmp_n' created
unknown metabolite 'dcdp_n' created
CYTK12
unknown metabolite 'dctp_c' created
CYTK12n
unknown metabolite 'dctp_n' created
CYTK13
CYTK13n
unknown metabolite 'datp_n' created
unknown metabolite 'dadp_n' created
CYTK14
CYTK14n
unknown metabolite 'utp_n' created
unknown metabolite 'udp_n' created
CYTK1m
unknown metabolite 'cdp_m' created
CYTK1n
CYTK2
CYTK2m
unknown metabolite 'dcmp_m' created
unknown metabolite 'dcdp_m' created
CYTK2n
CYTK3
CYTK3n
unknown metabolite 'gdp_n' created
CYTK4
CYTK4n
CYTK5
CYTK5n
CYTK6
CYTK6n
CYTK7
CYTK7n
CYTK8
CYTK8n
CYTK9
CYTK9n
Coqe
unknown metabolite 'q10_e' created
unknown metabolite 'q10_c' created
D3AIBTm
DACGST
unknown metabolite '23dh1i56dio_c' created
unknown meta

FAOXC102C81x
FAOXC102_4Z_7Zm
FAOXC102_4Z_7Zx
FAOXC102m
FAOXC102x
FAOXC103C102m
FAOXC103C102x
FAOXC10C10OHm
FAOXC10DCC8DCx
FAOXC11
FAOXC11090m
unknown metabolite 'c110coa_m' created
unknown metabolite 'c90coa_m' created
FAOXC11BRC9BRx
unknown metabolite 'tmuncoa_x' created
FAOXC11C9m
FAOXC120100m
FAOXC120100x
FAOXC121C101m
FAOXC121C10x
FAOXC121_3Em
unknown metabolite 'c121_3Ecoa_m' created
FAOXC121_3Zm
FAOXC121_5Em
unknown metabolite 'c121_5Ecoa_m' created
FAOXC121x
FAOXC122C101m
unknown metabolite '2ddecdicoa_m' created
FAOXC122_3E_6Em
FAOXC122_3Z_6Zm
FAOXC122_3Z_6Zx
FAOXC122m
FAOXC122x
unknown metabolite '2ddecdicoa_x' created
FAOXC123C102m
FAOXC123C102x
FAOXC123_3Z_6Z_9Zm
FAOXC123_3Z_6Z_9Zx
FAOXC123m
FAOXC123x
FAOXC12C12OHm
FAOXC12DCC10DCx
FAOXC12DCTc
unknown metabolite 'c12dc_x' created
unknown metabolite 'c12dc_c' created
FAOXC12DCc
unknown metabolite 'c12dccoa_c' created
FAOXC12DCx
FAOXC130110m
unknown metabolite 'c130coa_m' created
FAOXC13BRC11BRx
FAOXC13C11m
unknown metabolite '

FUCFUCFUCGALACGLCGAL14ACGLCGALGLUSIDEtg
unknown metabolite 'fucfucfucgalacglcgal14acglcgalgluside_cho_g' created
FUCFUCGALACGLCGALGLUSIDEte
unknown metabolite 'fucfucgalacglcgalgluside_cho_e' created
unknown metabolite 'fucfucgalacglcgalgluside_cho_c' created
FUCFUCGALACGLCGALGLUSIDEtg
unknown metabolite 'fucfucgalacglcgalgluside_cho_g' created
FUCGAL14ACGLCGALGLUSIDEte
unknown metabolite 'fucgal14acglcgalgluside_cho_e' created
unknown metabolite 'fucgal14acglcgalgluside_cho_c' created
FUCGAL14ACGLCGALGLUSIDEtg
unknown metabolite 'fucgal14acglcgalgluside_cho_g' created
FUCGALFUCGALACGLCGALGLUSIDEte
unknown metabolite 'fucgalfucgalacglcgalgluside_cho_e' created
unknown metabolite 'fucgalfucgalacglcgalgluside_cho_c' created
FUCGALFUCGALACGLCGALGLUSIDEtg
unknown metabolite 'fucgalfucgalacglcgalgluside_cho_g' created
FUCGALGBSIDEte
unknown metabolite 'fucgalgbside_cho_c' created
unknown metabolite 'fucgalgbside_cho_e' created
FUCGALGBSIDEtg
unknown metabolite 'fucgalgbside_cho_g' created
F

GLNS
GLNSERNaEx
GLNSP2
unknown metabolite 'uaaGgla_c' created
unknown metabolite 'uaaGgtla_c' created
GLNTHRNaEx
GLNTRAH
unknown metabolite 'glntrna_c' created
unknown metabolite 'trnagln_c' created
GLNTRAHm
unknown metabolite 'glntrna_m' created
unknown metabolite 'gln_L_m' created
unknown metabolite 'trnagln_m' created
GLNTRPGLUt
unknown metabolite 'glntrpglu_e' created
unknown metabolite 'glntrpglu_c' created
GLNTRS
GLNTRSm
GLNTYRLEUt
unknown metabolite 'glntyrleu_e' created
unknown metabolite 'glntyrleu_c' created
GLNt4
GLNt7l
unknown metabolite 'gln_L_l' created
GLNtN1
GLNtm
GLNyLATthc
GLPASE1
GLPASE2
GLPASE2a
GLRASE
unknown metabolite 'gullac_c' created
GLU5Km
GLU5SAtmc
GLUARGLEUt
unknown metabolite 'gluargleu_e' created
unknown metabolite 'gluargleu_c' created
GLUASNLEUt
unknown metabolite 'gluasnleu_e' created
unknown metabolite 'gluasnleu_c' created
GLUB0AT3tc
GLUCYS
unknown metabolite 'glucys_c' created
GLUDC
GLUDxm
GLUDym
GLUGLUt
unknown metabolite 'gluglu_e' created
unknown

HMR_0197
HMR_0200
unknown metabolite 'M00129_c' created
HMR_0201
HMR_0203
unknown metabolite 'M00117_e' created
unknown metabolite 'M00117_c' created
HMR_0204
unknown metabolite 'CE0784_c' created
HMR_0206
HMR_0207
HMR_0208
unknown metabolite 'M02745_e' created
unknown metabolite 'M02745_c' created
HMR_0209
unknown metabolite 'M01141_c' created
HMR_0210
HMR_0211
HMR_0214
HMR_0215
unknown metabolite 'ptdca_e' created
HMR_0230
HMR_0232
unknown metabolite 'M01197_e' created
unknown metabolite 'M01197_c' created
HMR_0233
unknown metabolite 'M01191_c' created
HMR_0234
HMR_0235
HMR_0238
HMR_0239
unknown metabolite 'hpdca_e' created
HMR_0240
unknown metabolite 'M00003_e' created
unknown metabolite 'M00003_c' created
HMR_0241
unknown metabolite 'M00004_c' created
HMR_0242
HMR_0243
HMR_0244
unknown metabolite 'M01238_e' created
unknown metabolite 'M01238_c' created
HMR_0245
unknown metabolite 'M01237_c' created
HMR_0246
HMR_0247
HMR_0253
HMR_0254
unknown metabolite 'M00019_e' created
unknown me

HMR_0674
HMR_0675
HMR_0676
HMR_0677
HMR_0678
HMR_0679
HMR_0680
HMR_0681
HMR_0682
HMR_0683
HMR_0684
HMR_0703
unknown metabolite 'M02114_x' created
HMR_0705
unknown metabolite 'M00550_x' created
HMR_0706
unknown metabolite 'M02017_x' created
HMR_0708
unknown metabolite 'M02017_c' created
unknown metabolite 'M00532_c' created
HMR_0715
unknown metabolite 'M02748_c' created
HMR_0716
unknown metabolite 'phsphings_c' created
HMR_0718
HMR_0719
HMR_0733
unknown metabolite 'HC02048_g' created
HMR_0750
HMR_0753
HMR_0758
unknown metabolite 'sphings_c' created
HMR_0765
unknown metabolite 'udpgal_r' created
unknown metabolite 'galgluside_cho_r' created
HMR_0767
HMR_0770
HMR_0775
unknown metabolite 'sphs1p_c' created
HMR_0783
unknown metabolite 'sphs1p_e' created
unknown metabolite 'sphings_e' created
HMR_0792
unknown metabolite 'paps_l' created
unknown metabolite 'pap_l' created
HMR_0793
HMR_0803
HMR_0805
unknown metabolite 'udpacgal_c' created
HMR_0806
unknown metabolite 'udpacgal_l' created
HMR_08

HMR_2193
HMR_2210
unknown metabolite 'CE2242_c' created
HMR_2211
unknown metabolite 'CE2253_c' created
HMR_2215
unknown metabolite '3ohxccoa_c' created
HMR_2217
unknown metabolite 'M00783_c' created
HMR_2218
unknown metabolite 'M00049_c' created
HMR_2219
HMR_2227
unknown metabolite 'M02773_c' created
HMR_2228
unknown metabolite 'M00898_c' created
HMR_2229
unknown metabolite 'M00796_c' created
HMR_2230
unknown metabolite 'M00062_c' created
HMR_2231
unknown metabolite 'M02693_c' created
HMR_2232
unknown metabolite 'M00876_c' created
HMR_2233
unknown metabolite 'M00781_c' created
HMR_2234
unknown metabolite 'M00047_c' created
HMR_2235
unknown metabolite 'M02106_c' created
HMR_2236
unknown metabolite 'M00888_c' created
HMR_2237
unknown metabolite 'M00791_c' created
HMR_2238
unknown metabolite 'M00055_c' created
HMR_2239
unknown metabolite 'M02615_c' created
HMR_2240
unknown metabolite 'M00910_c' created
HMR_2241
unknown metabolite 'M00805_c' created
HMR_2242
unknown metabolite 'M00070_c' created

HMR_2837
unknown metabolite 'arachcrn_r' created
HMR_2838
HMR_2839
unknown metabolite 'M01777_r' created
HMR_2840
unknown metabolite 'CE5151_r' created
HMR_2841
unknown metabolite 'M01775_r' created
HMR_2842
unknown metabolite 'M01236_r' created
HMR_2843
unknown metabolite 'M01776_r' created
HMR_2844
unknown metabolite 'M00018_r' created
HMR_2845
unknown metabolite 'M00122_r' created
HMR_2846
unknown metabolite 'M00123_r' created
HMR_2847
unknown metabolite 'M00100_r' created
HMR_2848
unknown metabolite 'M00101_r' created
HMR_2849
unknown metabolite 'M02051_r' created
HMR_2850
unknown metabolite 'M02052_r' created
HMR_2851
unknown metabolite 'M01724_r' created
HMR_2852
HMR_2853
unknown metabolite 'M01727_r' created
HMR_2854
HMR_2855
unknown metabolite 'M01726_r' created
HMR_2856
unknown metabolite 'M00006_r' created
HMR_2857
unknown metabolite 'M02637_r' created
HMR_2859
HMR_2861
unknown metabolite 'strdnccrn_r' created
HMR_2862
HMR_2863
unknown metabolite 'eicostetcrn_r' created
HMR_2

HMR_3232
unknown metabolite 'M00849_m' created
HMR_3233
HMR_3234
unknown metabolite 'M03022_m' created
HMR_3235
unknown metabolite 'M01573_m' created
HMR_3236
unknown metabolite 'M00885_m' created
HMR_3237
HMR_3240
unknown metabolite 'M03014_m' created
HMR_3241
unknown metabolite 'M00702_m' created
HMR_3242
unknown metabolite 'M00843_m' created
HMR_3243
HMR_3244
unknown metabolite 'M03024_m' created
HMR_3245
HMR_3246
unknown metabolite 'M00841_m' created
HMR_3247
HMR_3256
unknown metabolite 'HC10853_m' created
HMR_3258
unknown metabolite 'HC12594_m' created
HMR_3272
HMR_3288
HMR_3296
HMR_3316
HMR_3321
HMR_3322
HMR_3326
unknown metabolite 'M03016_x' created
HMR_3327
unknown metabolite 'M00715_x' created
HMR_3328
unknown metabolite 'M00879_x' created
HMR_3329
HMR_3330
HMR_3331
HMR_3332
HMR_3333
HMR_3334
unknown metabolite 'CE5154_x' created
HMR_3335
unknown metabolite 'CE5153_x' created
HMR_3336
HMR_3337
unknown metabolite 'CE5151_x' created
HMR_3338
HMR_3339
unknown metabolite 'CE5148_x

HMR_4964
HMR_5130
unknown metabolite 'trnatyr_c' created
unknown metabolite 'tyrtrna_c' created
HMR_5144
unknown metabolite 'mettrna_c' created
unknown metabolite 'fmettrna_c' created
HMR_5146
unknown metabolite 'trnapro_c' created
unknown metabolite 'protrna_c' created
HMR_5149
unknown metabolite 'trnatrp_c' created
unknown metabolite 'trptrna_c' created
HMR_5166
unknown metabolite 'leutrna_c' created
unknown metabolite 'lystrna_c' created
unknown metabolite 'phetrna_c' created
unknown metabolite 'sertrna_c' created
unknown metabolite 'thrtrna_c' created
unknown metabolite 'valtrna_c' created
unknown metabolite 'HC00004_c' created
unknown metabolite 'trnaleu_c' created
unknown metabolite 'trnalys_c' created
unknown metabolite 'trnamet_c' created
unknown metabolite 'trnaphe_c' created
unknown metabolite 'trnaser_c' created
unknown metabolite 'trnathr_c' created
unknown metabolite 'trnaval_c' created
HMR_5241
unknown metabolite 'M03146_l' created
HMR_5246
unknown metabolite 'M01569_l' created

HMR_9530
unknown metabolite 'M02513_c' created
unknown metabolite 'M02521_c' created
HMR_9531
unknown metabolite 'M01018_c' created
unknown metabolite 'M02702_c' created
HMR_9532
unknown metabolite 'M02701_c' created
HMR_9534
unknown metabolite 'M00155_c' created
HMR_9535
unknown metabolite 'M00822_c' created
unknown metabolite 'M00823_c' created
HMR_9538
unknown metabolite 'M02829_c' created
HMR_9539
unknown metabolite 'M03144_c' created
HMR_9541
unknown metabolite 'M02339_c' created
unknown metabolite 'M00196_c' created
HMR_9542
unknown metabolite 'M02706_c' created
unknown metabolite 'M02707_c' created
HMR_9543
unknown metabolite 'M02375_c' created
unknown metabolite 'M01128_c' created
HMR_9544
unknown metabolite 'M02704_c' created
unknown metabolite 'trdox_c' created
unknown metabolite 'M02703_c' created
unknown metabolite 'trdrd_c' created
HMR_9545
unknown metabolite 'M02708_c' created
unknown metabolite 'M02705_c' created
HMR_9546
unknown metabolite 'M02848_c' created
unknown met

ITCOALm
ITPtm
unknown metabolite 'itp_m' created
ITPtn
unknown metabolite 'itp_n' created
IVCOAACBP
IVCRNe
IZPN
It
KAS8
KCC2t
KCCt
KDNH
unknown metabolite 'kdn_c' created
KHK
KHK2
unknown metabolite 'xylu_D_c' created
KHK3
KHte
KSII_CORE2t
unknown metabolite 'ksii_core2_g' created
unknown metabolite 'ksii_core2_e' created
KSII_CORE2tly
unknown metabolite 'ksii_core2_l' created
KSII_CORE4t
unknown metabolite 'ksii_core4_g' created
unknown metabolite 'ksii_core4_e' created
KSII_CORE4tly
unknown metabolite 'ksii_core4_l' created
KSIt
unknown metabolite 'ksi_g' created
KSItly
KYN
KYN3OX
KYNAKGAT
unknown metabolite '4aphdob_c' created
KYNAKGATm
unknown metabolite 'Lkynr_m' created
unknown metabolite '4aphdob_m' created
KYNATESYN
unknown metabolite 'kynate_c' created
KYNATESYNm
unknown metabolite 'kynate_m' created
KYNATEtr
unknown metabolite 'kynate_e' created
Kt3g
unknown metabolite 'k_g' created
LACLt
unknown metabolite 'lac_L_x' created
LACZe
unknown metabolite 'lcts_e' created
LACZly
un

MCLOR
MCOATA
MCOATAm
MCPST
unknown metabolite 'tcynt_c' created
MDH
MDHm
MDHx
MDRPD
unknown metabolite '5mdru1p_c' created
ME1m
ME2
ME2m
MECOALm
unknown metabolite 'mescon_m' created
unknown metabolite 'mescoa_m' created
MECOAS1m
MELATN23DOX
unknown metabolite 'fna5moxam_c' created
MELATNOX
MEOHt2
unknown metabolite 'meoh_e' created
MEOHtly
unknown metabolite 'meoh_l' created
MEOHtr
MEPIVESSte
unknown metabolite 'mepi_e' created
MERCPLACCYSt
unknown metabolite 'mercplaccys_e' created
MESCOALm
METARGLEUt
unknown metabolite 'metargleu_e' created
unknown metabolite 'metargleu_c' created
METASNTYRt
unknown metabolite 'metasntyr_e' created
unknown metabolite 'metasntyr_c' created
METAT
METATB0tc
unknown metabolite 'met_L_e' created
METB0AT3tc
METGLNTYRt
unknown metabolite 'metglntyr_e' created
unknown metabolite 'metglntyr_c' created
METGLYARGt
unknown metabolite 'metglyarg_e' created
unknown metabolite 'metglyarg_c' created
METHISLYSt
unknown metabolite 'methislys_e' created
unknown metabo

P4504F123r
P4504F81r
unknown metabolite '18harachd_r' created
P4507A1r
P4507B11r
unknown metabolite 'xoltri25_r' created
P4507B12r
unknown metabolite 'xoltri27_r' created
P4508B11r
unknown metabolite 'xoldiolone_r' created
P4508B13r
P450LTB4r
unknown metabolite 'leuktrB4wcooh_r' created
P450SCC1m
unknown metabolite '20ahchsterol_m' created
P5CDm
P5CR
P5CRm
P5CRx
P5CRxm
PACCOAL
PAFABCt
unknown metabolite 'paf_cho_e' created
PAFH
PAFHe
PAFS
PAFt
PAIL45P_HStn
unknown metabolite 'pail45p_cho_c' created
unknown metabolite 'pail45p_cho_n' created
PAIL4P_HStn
unknown metabolite 'pail4p_cho_c' created
unknown metabolite 'pail4p_cho_n' created
PAILAR_HSPLA2
unknown metabolite 'pailar_hs_c' created
PAILPALM_HSPLA2
unknown metabolite 'pailpalm_hs_c' created
PAIL_HStn
unknown metabolite 'pail_cho_n' created
PALFATPtc
PAN4PP
unknown metabolite 'ptth_c' created
PAPStg
PAPTT
unknown metabolite 'ApoACP_c' created
PAPtg
PA_HSter
PA_HStg
unknown metabolite 'pa_cho_g' created
PA_HStn
PCACTDMHPm
PCACTPRIS

PRISTtx
PRO1x
PRO1xm
PROAKGOX1r
unknown metabolite 'akg_r' created
unknown metabolite '4hpro_LT_r' created
unknown metabolite 'succ_r' created
PROARGASPt
unknown metabolite 'proargasp_e' created
unknown metabolite 'proargasp_c' created
PROARGCYSt
unknown metabolite 'proargcys_e' created
unknown metabolite 'proargcys_c' created
PROASNCYSt
unknown metabolite 'proasncys_e' created
unknown metabolite 'proasncys_c' created
PROCYSt
unknown metabolite 'procys_e' created
unknown metabolite 'procys_c' created
PROD2
PROD2m
PRODt2r
unknown metabolite 'pro_D_e' created
PRODt2rL
unknown metabolite 'pro_D_l' created
PROGLNPROt
unknown metabolite 'proglnpro_e' created
unknown metabolite 'proglnpro_c' created
PROGLULYSt
unknown metabolite 'proglulys_e' created
unknown metabolite 'proglulys_c' created
PROGLYPEPT1tc
unknown metabolite 'progly_e' created
unknown metabolite 'progly_c' created
PROGLYPRO1c
PROHISTYRt
unknown metabolite 'prohistyr_e' created
unknown metabolite 'prohistyr_c' created
PROHISt
u

RE1517X
RE1518M
unknown metabolite 'CE2432_m' created
RE1518X
RE1519X
RE1520M
RE1520X
unknown metabolite 'CE2420_x' created
RE1521M
unknown metabolite 'CE2417_m' created
RE1521X
unknown metabolite 'CE2417_x' created
RE1522M
unknown metabolite 'CE2418_m' created
RE1522X
unknown metabolite 'CE2418_x' created
RE1523M
RE1523X
RE1525C
unknown metabolite 'CE2418_c' created
unknown metabolite 'CE2422_c' created
RE1525M
unknown metabolite 'CE2422_m' created
RE1525X
unknown metabolite 'CE2422_x' created
RE1526C
unknown metabolite 'CE2417_c' created
unknown metabolite 'CE2424_c' created
RE1526M
unknown metabolite 'CE2424_m' created
RE1526X
unknown metabolite 'CE2424_x' created
RE1527C
unknown metabolite 'CE2420_c' created
unknown metabolite 'CE0693_c' created
RE1527M
RE1527X
unknown metabolite 'CE0693_x' created
RE1530C
RE1530M
RE1531M
RE1531X
RE1532M
RE1532X
RE1533M
RE1533X
RE1534M
RE1534X
RE1537C
unknown metabolite '3aap_c' created
RE1537X
unknown metabolite '3aap_x' created
unknown metabolite

unknown metabolite 'CE2962_c' created
RE2149R
unknown metabolite 'CE2961_r' created
unknown metabolite 'CE2962_r' created
RE2150C
unknown metabolite 'retnglc_c' created
RE2150R
unknown metabolite 'retnglc_r' created
RE2151C
unknown metabolite 'CE5757_c' created
RE2151R
unknown metabolite 'CE5757_r' created
RE2152C
unknown metabolite 'CE1162_c' created
unknown metabolite 'CE2955_c' created
RE2154C
RE2155C
RE2155R
unknown metabolite 'CE5072_r' created
RE2156M
unknown metabolite 'cyst_L_m' created
unknown metabolite 'CE5082_m' created
RE2202C
unknown metabolite 'CE5775_c' created
RE2203C
unknown metabolite 'CE5776_c' created
RE2220C
unknown metabolite 'CE1293_c' created
RE2221C
unknown metabolite 'CE1297_c' created
unknown metabolite 'CE1294_c' created
RE2221M
unknown metabolite 'CE1297_m' created
unknown metabolite 'CE1294_m' created
RE2223M
unknown metabolite 'CE1310_m' created
RE2235C
RE2235R
unknown metabolite 'C05300_r' created
RE2240C
RE2248C
unknown metabolite 'CE2963_c' created
RE

RE3020C
unknown metabolite 'CE5140_c' created
RE3020R
unknown metabolite 'CE5140_r' created
RE3021C
unknown metabolite 'CE5141_c' created
RE3022C
unknown metabolite 'CE5525_c' created
RE3033C
unknown metabolite 'CE2567_c' created
RE3033N
RE3033R
unknown metabolite 'CE2567_r' created
RE3036C
unknown metabolite 'CE7172_c' created
RE3036N
unknown metabolite 'CE7172_n' created
RE3038C
unknown metabolite 'C06315_c' created
RE3038N
unknown metabolite 'C06315_n' created
RE3038R
unknown metabolite 'C06315_r' created
RE3038X
unknown metabolite 'C06315_x' created
RE3040C
RE3040R
unknown metabolite 'C06314_r' created
RE3040X
unknown metabolite 'C06314_x' created
RE3041C
RE3041N
unknown metabolite 'leuktrA4_n' created
RE3044C
RE3044N
RE3050R
RE3051C
RE3066X
unknown metabolite 'CE5122_x' created
unknown metabolite 'CE5123_x' created
RE3072X
RE3074X
RE3075C
unknown metabolite 'CE2414_c' created
unknown metabolite 'CE5122_c' created
RE3075X
unknown metabolite 'CE2414_x' created
RE3076X
RE3079C
unknow

RE3336M
unknown metabolite 'CE5345_m' created
unknown metabolite 'CE5346_m' created
RE3336X
unknown metabolite 'CE5345_x' created
unknown metabolite 'CE5346_x' created
RE3337M
unknown metabolite 'CE5344_m' created
RE3337X
unknown metabolite 'CE5344_x' created
RE3338C
unknown metabolite 'CE5344_c' created
unknown metabolite 'CE5345_c' created
RE3338M
RE3338X
RE3339C
unknown metabolite 'CE5346_c' created
unknown metabolite 'CE5307_c' created
RE3339M
unknown metabolite 'CE5307_m' created
RE3339X
unknown metabolite 'CE5307_x' created
RE3340C
unknown metabolite 'CE5329_c' created
unknown metabolite 'CE5331_c' created
RE3340M
unknown metabolite 'CE5329_m' created
unknown metabolite 'CE5331_m' created
RE3340X
unknown metabolite 'CE5329_x' created
unknown metabolite 'CE5331_x' created
RE3341M
unknown metabolite 'CE5342_m' created
unknown metabolite 'CE5341_m' created
RE3341X
unknown metabolite 'CE5342_x' created
unknown metabolite 'CE5341_x' created
RE3342M
unknown metabolite 'CE5337_m' created

S3T3g
S3TASE1ly
unknown metabolite 'hs_deg11_l' created
S3TASE2ly
S3TASE3ly
S4T1g
S4T2g
S4T3g
unknown metabolite 'cs_e_pre5a_g' created
S4T4g
S4T5g
unknown metabolite 'cs_e_pre5b_g' created
S4T6g
S4TASE1ly
S4TASE2ly
S4TASE3ly
S4TASE4ly
unknown metabolite 'cs_e_deg1_l' created
S4TASE5ly
unknown metabolite 'cs_e_deg5_l' created
S6T10g
S6T11g
S6T12g
S6T13g
S6T14g
S6T15g
S6T16g
S6T17g
S6T18g
S6T19g
S6T1g
S6T20g
S6T21g
S6T22g
S6T23g
S6T24g
S6T25g
S6T2g
S6T3g
S6T4g
S6T5g
S6T6g
S6T7g
S6T8g
S6T9g
S6TASE10ly
unknown metabolite 'ksi_deg4_l' created
S6TASE11ly
S6TASE12ly
S6TASE13ly
S6TASE14ly
S6TASE15ly
S6TASE16ly
S6TASE17ly
S6TASE18ly
S6TASE19ly
S6TASE1ly
S6TASE20ly
S6TASE21ly
S6TASE22ly
unknown metabolite 'ksii_core2_deg1_l' created
S6TASE23ly
S6TASE24ly
S6TASE25ly
unknown metabolite 'ksii_core4_deg1_l' created
S6TASE26ly
S6TASE2ly
S6TASE3ly
S6TASE4ly
S6TASE5ly
S6TASE6ly
S6TASE7ly
S6TASE8ly
S6TASE9ly
SACCD3m
unknown metabolite 'saccrp_L_m' created
SACCD4m
SADT
SADTn
unknown metabolite 'so4_n' created

unknown metabolite 'thrilearg_c' created
THRMETARGt
unknown metabolite 'thrmetarg_e' created
unknown metabolite 'thrmetarg_c' created
THRPHEARGt
unknown metabolite 'thrphearg_e' created
unknown metabolite 'thrphearg_c' created
THRPHELAT2tc
THRS
THRSERARGt
unknown metabolite 'thrserarg_e' created
unknown metabolite 'thrserarg_c' created
THRSERNaEx
THRTHRARGt
unknown metabolite 'thrthrarg_e' created
unknown metabolite 'thrthrarg_c' created
THRTRAH
THRTRAHm
unknown metabolite 'thrtrna_m' created
unknown metabolite 'trnathr_m' created
THRTRS
THRTRSm
THRTYRMETt
unknown metabolite 'thrtyrmet_e' created
unknown metabolite 'thrtyrmet_c' created
THRt4
THRt7l
THYMDt1
THYMDtl
THYMDtm
THYMDtr2
THYMt
unknown metabolite 'thym_e' created
THYOCHOLabc
unknown metabolite 'thyochol_e' created
THYOCHOLt
THYOCHOLt2
THYOXt
unknown metabolite 'thyox_L_e' created
THYOXt2
THYPX
THYST
unknown metabolite 'thyoxs_c' created
THYSTte
unknown metabolite 'thyoxs_e' created
TIDSSULF
TIGCRNe
TIGGLYc
unknown metabolite 

XYLOR
unknown metabolite 'xylnact__D_c' created
XYLR
XYLTD_Dr
XYLTer
XYLTt
unknown metabolite 'xylt_e' created
XYLUR
XYLt
unknown metabolite 'xyl_D_e' created
XYLtly
YVITEt
Zn2t
unknown metabolite 'zn2_e' created
Znabc
biomass
biomass_prod
biomass_producing
gthox_export
peplys_synthesis
q10h2tc
q10tm
r0001
unknown metabolite 'HC02119_c' created
r0002
r0009
r0013
unknown metabolite 'HC00822_l' created
r0016
r0021
r0022
r0023
unknown metabolite 'HC00617_c' created
unknown metabolite 'HC00619_c' created
r0027
r0028
r0033
unknown metabolite 'dpcoa_m' created
r0034
r0047
r0051
r0060
r0062
r0068
unknown metabolite 'HC01672_c' created
r0074
r0081
r0082
r0083
unknown metabolite 'HC01434_m' created
r0084
unknown metabolite 'HC01434_x' created
r0085
unknown metabolite 'HC00591_c' created
r0086
unknown metabolite 'HC00591_m' created
r0093
unknown metabolite 'udpg_e' created
unknown metabolite 'g1p_e' created
r0097
r0113
unknown metabolite 'acmana_r' created
r0119
r0120
r0121
unknown metabolite 'f

r1113
r1116
r1117
r1129
unknown metabolite 'HC00004_r' created
r1135
unknown metabolite 'HC02110_r' created
r1143
r1144
r1147
r1148
r1150
r1154
unknown metabolite '2obut_m' created
r1155
r1156
r1159
unknown metabolite 'cdpchol_r' created
r1162
r1163
r1164
unknown metabolite 'HC02020_r' created
r1165
unknown metabolite 'hdd2coa_r' created
unknown metabolite 'HC02021_r' created
r1166
unknown metabolite 'HC02022_r' created
r1167
r1168
unknown metabolite 'HC02024_r' created
r1169
unknown metabolite 'HC02025_r' created
r1170
unknown metabolite 'HC02026_r' created
r1171
unknown metabolite 'HC02027_r' created
r1172
r1173
unknown metabolite 'HC02020_l' created
r1174
unknown metabolite 'hdcea_r' created
r1175
r1176
unknown metabolite 'HC02022_l' created
unknown metabolite 'ocdca_l' created
r1177
r1178
unknown metabolite 'HC02023_l' created
unknown metabolite 'ocdcea_l' created
r1179
r1180
r1181
r1182
unknown metabolite 'lnlncg_r' created
r1183
r1184
r1185
unknown metabolite 'HC02029_c' created


In [21]:
diff1 = model - sheet
print(f'Metabolites in the Rxns Sheet not present in the Metabolites Sheet:{list(diff1)}\n')


diff2 = sheet - model
print(f'Metabolites in the Metabolites Sheet not present in the Rxns Sheet:{list(diff2)}\n')

equal = (sheet == model)
if equal:
    print('Both sheets contain exactly the same metabolites')

Metabolites in the Rxns Sheet not present in the Metabolites Sheet:[]

Metabolites in the Metabolites Sheet not present in the Rxns Sheet:[]

Both sheets contain exactly the same metabolites
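
The comparison above relies on Python set arithmetic. As a minimal sketch with made-up identifiers (the variable names `model_mets` and `sheet_mets` and the IDs below are illustrative assumptions, not values taken from the reconstruction), the three operations involved look like this:

```python
# Toy metabolite ID sets; these identifiers are illustrative only.
model_mets = {"glc_D_c", "atp_c", "adp_c", "h2o_c"}
sheet_mets = {"glc_D_c", "atp_c", "adp_c", "pi_c"}

# Used in at least one reaction but missing from the metabolites sheet
only_in_model = model_mets - sheet_mets
# Listed in the metabolites sheet but not used by any reaction
only_in_sheet = sheet_mets - model_mets
# Everything that appears on exactly one side (symmetric difference)
in_one_side_only = model_mets ^ sheet_mets

print(sorted(only_in_model))
print(sorted(only_in_sheet))
print(model_mets == sheet_mets)
```

Two sets compare equal exactly when both differences are empty, which is why the cell above prints two empty lists together with the confirmation message.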


#### Identification of missing Metabolites

In [22]:
from google_sheet import GoogleSheet

KEY_FILE_PATH = 'credentials.json'
SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

sheet = GoogleSheet(SPREADSHEET_ID, KEY_FILE_PATH)
sheet_met = 'Metabolites'
met = sheet.read_google_sheet(sheet_met)

met['Name'] = met['Name'].str.lower()
met_copy = met.copy()
met_copy['BiGG ID'] = met_copy['BiGG ID'].str[:-2]

In [23]:
import numpy as np

# Boolean masks flagging the metabolites that lack an identifier in each database.
# KEGG, CHEBI and ChEMBLID are compared against '', PubChem against the string 'NaN'.
empty_cells_KEGG = met_copy['KEGG'] == ''
empty_cells_CHEBI = met_copy['CHEBI'] == ''
empty_cells_ChEMBILD = met_copy['ChEMBLID'] == ''
empty_cells_PubChem = met_copy['PubChem'] == 'NaN'

# Metabolites with no identifier in any of the four databases
no_ids_mask = empty_cells_KEGG & empty_cells_CHEBI & empty_cells_ChEMBILD & empty_cells_PubChem
empty_cells = np.sum(no_ids_mask)
empty_mets = met_copy.loc[no_ids_mask]
print(f"Number of empty cells (Mets with no IDs): {empty_cells}")

Number of empty cells (Mets with no IDs): 2347
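
The cell above tests each column against a single placeholder (`''` or `'NaN'`). If the sheet ever mixes placeholders, a small normalizing predicate may be more robust. The following is only a sketch under that assumption: the helper names `is_missing` and `has_no_ids`, the set of placeholder strings, and the toy rows are all hypothetical.

```python
import math

# Placeholder strings treated as "no value"; this set is an assumption.
MISSING_TOKENS = {"", "nan", "none", "null"}

def is_missing(value):
    """Return True when a cell holds no usable identifier."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip().lower() in MISSING_TOKENS

def has_no_ids(row, id_columns=("KEGG", "CHEBI", "ChEMBLID", "PubChem")):
    """Return True when every identifier column of a row (a dict) is missing."""
    return all(is_missing(row.get(col)) for col in id_columns)

# Toy rows standing in for lines of the Metabolites sheet
rows = [
    {"BiGG ID": "atp", "KEGG": "C00002", "CHEBI": "", "ChEMBLID": "", "PubChem": "NaN"},
    {"BiGG ID": "M00129", "KEGG": "", "CHEBI": "", "ChEMBLID": "", "PubChem": "NaN"},
    {"BiGG ID": "xyz", "KEGG": None, "CHEBI": float("nan"), "ChEMBLID": " ", "PubChem": "nan"},
]
no_id_mets = [row["BiGG ID"] for row in rows if has_no_ids(row)]
print(no_id_mets)
```

On a DataFrame the same predicate could be applied column by column, for instance with `met_copy[col].map(is_missing)`, and the resulting boolean Series combined with `&` as in the cell above.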


##### Check if the Mets belong only in one reconstruction

In [24]:
import pandas as pd
import re

Recon = pd.read_excel('../Data/Reconciliation/datasets/rxns_recon3d_toadd.xlsx')
Recon_Mets = Recon['m_metabolites'].copy()

Hefzi = pd.read_excel('../Data/Reconciliation/datasets/hefzi_final.xlsx')
Hefzi_Mets = Hefzi['Reaction Formula'].copy()

iCHO2101 = pd.read_excel('../Data/Reconciliation/datasets/iCHO2101.xlsx', sheet_name='Supplementary Table 10', skiprows=1)
iCHO2101_Mets = iCHO2101['Reaction'].copy()

iCHO2291 = pd.read_excel('../Data/Reconciliation/datasets/iCHO2291_final.xlsx')
iCHO2291_Mets = iCHO2291['Reaction Formula'].copy()


In [25]:
import re

def extract_ids(formulas, operators, compartment_sep):
    """Tokenize reaction formulas and keep the text before `compartment_sep` in each token."""
    tokens = []
    # `formula` (not `met`) is the loop variable so the `met` DataFrame is not overwritten
    for formula in formulas:
        elements = re.findall(r'[+]+|-->|<=>|\b\w+\b', formula)
        tokens.extend(elem for elem in elements if elem not in operators)
    return [token.split(compartment_sep)[0] for token in tokens]

# Operators and separators are kept exactly as in the original per-model loops
ReconMets = extract_ids(Recon_Mets, ['+', '-->', '<=>'], '_')
HefziMets = extract_ids(Hefzi_Mets, ['+', '-->', '<=>'], '_')
iCHO2101Mets = extract_ids(iCHO2101_Mets, ['+', '=>', '<=>'], '[]')
iCHO2291Mets = extract_ids(iCHO2291_Mets, ['+', '-->', '<=>'], '[')
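
The cell above tokenizes formulas on word boundaries and keeps the text before the first separator. For comparison, here is a hedged sketch of a stricter parser that splits on whitespace, drops stoichiometric coefficients, and strips only the trailing compartment tag. It is not the method used in this notebook: the function name `extract_metabolite_ids`, the list of arrows, and the two compartment conventions (`_c` suffix and `[c]` suffix) are assumptions about the formula syntax.

```python
import re

# Reaction arrows that may separate substrates from products (assumed list)
ARROWS = {"-->", "<=>", "=>", "<--", "<=", "->", "<-"}

def extract_metabolite_ids(formula):
    """Return metabolite IDs from a reaction formula, without compartment tags."""
    ids = []
    for token in formula.split():
        if token == "+" or token in ARROWS:
            continue
        # Skip stoichiometric coefficients such as "2" or "0.5"
        if re.fullmatch(r"\d+(\.\d+)?", token):
            continue
        bracketed = re.fullmatch(r"(.+)\[(\w+)\]", token)
        if bracketed:
            # Bracket style, e.g. "atp[c]" -> "atp"
            ids.append(bracketed.group(1))
        else:
            # Underscore style, e.g. "gln_L_c" -> "gln_L"
            ids.append(re.sub(r"_[a-z]$", "", token))
    return ids

print(extract_metabolite_ids("gln_L_c + h2o_c --> glu_L_c + nh4_c"))
print(extract_metabolite_ids("2 atp[c] + glc_D[e] <=> adp[c]"))
```

Stripping only the final compartment tag keeps stereo-specific IDs such as `gln_L` intact, which matches the `.str[:-2]` trimming applied to the `BiGG ID` column earlier.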


In [26]:
import itertools
from collections import defaultdict

# Initialize counters
single_dataset_counters = {name: 0 for name in ['Recon', 'Hefzi', 'iCHO2291', 'iCHO2101']}
all_Counter = 0

datasets = {
    'Recon': ReconMets,
    'Hefzi': HefziMets,
    'iCHO2291': iCHO2291Mets,
    'iCHO2101': iCHO2101Mets
}

shared_counters = defaultdict(int)

for noIDMet in empty_mets['BiGG ID']:
    datasets_with_met = [name for name, mets in datasets.items() if noIDMet in mets]

    if len(datasets_with_met) == 1:
        single_dataset_counters[datasets_with_met[0]] += 1
    elif 2 <= len(datasets_with_met) <= len(datasets):
        combination = tuple(sorted(datasets_with_met))
        shared_counters[combination] += 1
    if len(datasets_with_met) == len(datasets):
        all_Counter += 1

for dataset, count in single_dataset_counters.items():
    print(f"Number of mets ONLY in {dataset}: {count}")

for combination, count in shared_counters.items():
    print(f"Number of mets shared ONLY by {', '.join(combination)}: {count}")

print(f"Number of mets shared by all the models: {all_Counter}")


Number of mets ONLY in Recon: 954
Number of mets ONLY in Hefzi: 2
Number of mets ONLY in iCHO2291: 254
Number of mets ONLY in iCHO2101: 14
Number of mets shared ONLY by Hefzi, iCHO2101, iCHO2291: 208
Number of mets shared ONLY by iCHO2101, iCHO2291: 373
Number of mets shared ONLY by Hefzi, Recon, iCHO2101, iCHO2291: 77
Number of mets shared ONLY by Hefzi, Recon, iCHO2101: 141
Number of mets shared ONLY by Hefzi, iCHO2101: 191
Number of mets shared ONLY by Recon, iCHO2291: 8
Number of mets shared ONLY by Recon, iCHO2101, iCHO2291: 2
Number of mets shared by all the models: 77
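
The counting logic above assigns each metabolite to the exact combination of models that contain it. A compact sketch of the same idea follows, using sets for constant-time membership tests and a single `Counter` keyed by the combination. The function name `membership_profile` and the toy data are hypothetical.

```python
from collections import Counter

def membership_profile(met_id, datasets):
    """Return the sorted tuple of dataset names whose ID set contains met_id."""
    return tuple(sorted(name for name, ids in datasets.items() if met_id in ids))

# Toy ID sets standing in for the per-model metabolite lists
datasets = {
    "Recon": {"atp", "M00129", "CE2242"},
    "Hefzi": {"atp", "glntrna"},
    "iCHO2291": {"atp", "glntrna", "M00129"},
}
query = ["atp", "glntrna", "M00129", "CE2242", "not_present"]

# Each metabolite gets exactly one profile, so the counts are mutually exclusive
profiles = Counter(membership_profile(m, datasets) for m in query)
for combo, count in sorted(profiles.items()):
    label = ", ".join(combo) if combo else "no dataset"
    print(f"{label}: {count}")
```

The empty tuple collects IDs found in none of the datasets, a case the loop above skips without reporting.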


<a id='information'></a>
## 4. Statistical Analysis of the Information in the Metabolites Dataset
Here we use the .txt file generated in **Final CHO Model 3.6**, which lists the metabolites relevant to the "biomass" and "biomass_producing" optimized models. We use this list to estimate what fraction of the metabolites in our reconstruction it represents and how much information is missing for those metabolites.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
from skimage import draw
from wordcloud import WordCloud
from collections import Counter

from cobra import Model, Reaction, Metabolite
from tqdm.notebook import tqdm

from google_sheet import GoogleSheet

### 4.1 Calculate the missing Information for Relevant Metabolites

In [None]:
##### ----- Generate datasets from Google Sheet ----- #####

#Credential file
KEY_FILE_PATH = 'credentials.json'

# #CHO Network Reconstruction + Recon3D_v2 Google Sheet ID
# SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

#CHO Network Reconstruction + Recon3D_v3 Google Sheet ID
SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

# Initialize the GoogleSheet object
sheet = GoogleSheet(SPREADSHEET_ID, KEY_FILE_PATH)

# Read data from the Google Sheet
sheet_met = 'Metabolites'
sheet_rxns = 'Rxns'
sheet_attributes = 'Attributes'
sheet_boundary = 'BoundaryRxns'

metabolites = sheet.read_google_sheet(sheet_met)
rxns = sheet.read_google_sheet(sheet_rxns)
rxns_attributes = sheet.read_google_sheet(sheet_attributes)

In [None]:
metabolites

In [None]:
## ---- Generate a df of Relevant Metabolites ---- ##
rel_mets = pd.read_csv('metabolites.txt', sep=' ', header=None)
rel_mets_list = list(rel_mets[0])
rel_mets_df = metabolites[metabolites['BiGG ID'].isin(rel_mets_list)].copy()
non_rel_mets_df = metabolites[~metabolites['BiGG ID'].isin(rel_mets_list)].copy()

In [None]:
# Calculate the percentage of Relevant Metabolites with and without Info 
info = []
no_info = []
for i,m in rel_mets_df.iterrows():
    if (m['PubChem']!='NaN' or m['Inchi']!='NaN' or m['SMILES']!='NaN'):
        info.append(m['BiGG ID'])
    if (m['PubChem']=='NaN' and m['Inchi']=='NaN' and m['SMILES']=='NaN'):
        no_info.append(m['BiGG ID'])
        
print(f'Percentage of metabolites with info: {len(info)/len(rel_mets_list)*100}%')
print(f'Percentage of metabolites with no info: {len(no_info)/len(rel_mets_list)*100}%')
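
The two percentages above are complementary only when both counts are divided by the number of rows actually inspected. In the cell above the denominator is `len(rel_mets_list)`, so if any ID from `metabolites.txt` were absent from the sheet, the two values would not sum to 100. A minimal sketch that avoids this, assuming rows are available as dictionaries and that `'NaN'` is the only placeholder (the function name `annotation_coverage` and the toy records are hypothetical):

```python
def annotation_coverage(records, fields=("PubChem", "Inchi", "SMILES"), missing="NaN"):
    """Return (% with at least one field filled, % with none) over the given records."""
    if not records:
        return 0.0, 0.0
    with_info = sum(
        any(rec.get(field, missing) != missing for field in fields)
        for rec in records
    )
    pct_info = 100 * with_info / len(records)
    return pct_info, 100 - pct_info

# Toy records standing in for rows of rel_mets_df
records = [
    {"BiGG ID": "atp_c", "PubChem": "5957", "Inchi": "NaN", "SMILES": "NaN"},
    {"BiGG ID": "h2o_c", "PubChem": "962", "Inchi": "NaN", "SMILES": "O"},
    {"BiGG ID": "M00129_c", "PubChem": "NaN", "Inchi": "NaN", "SMILES": "NaN"},
    {"BiGG ID": "glc_D_c", "PubChem": "NaN", "Inchi": "NaN", "SMILES": "C(C1C(C(C(C(O1)O)O)O)O)O"},
]
pct_info, pct_no_info = annotation_coverage(records)
print(f"with info: {pct_info}%, without info: {pct_no_info}%")
```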

In [None]:
# Plot the results

# The sizes of the lists
size_A = len(rel_mets_list)
size_B = len(info)
size_C = len(no_info)

# Calculate the percentages
percentage_B = size_B / size_A * 100
percentage_C = size_C / size_A * 100

# Create a bar plot with a tall and thin bar
plt.figure(figsize=(1,8))  # Adjust the size of the plot. Increase the second number to make it taller
plt.bar(1, percentage_B, color='blue', label='Mets. with Info', width=0.1)  # Decrease the width to make the bar thinner
plt.bar(1, percentage_C, bottom=percentage_B, color='green', label='Mets. with No Info', width=0.1)

# Set the labels and title
plt.ylabel('Percentage of Metabolites')
plt.xticks([])  # Hide x ticks
plt.yticks(np.arange(0, 101, 20))  # Set the y ticks
plt.gca().yaxis.set_major_formatter(PercentFormatter())  # Format the y ticks as percentages
plt.ylim([0, 100])  # Set the y limit
plt.box(False)  # Remove the box around the plot
plt.legend(loc='upper right', bbox_to_anchor=(2.3, 1.13))  # Move the legend to the upper right corner

# Save and Show the plot
#plt.savefig('percentage_relevant_mets.png', dpi=300, bbox_inches='tight')
plt.show()

### 4.2 Plot the percentage of the total metabolites comprised by the relevant metabolites

In [None]:
# Plot the results

# The sizes of the lists
size_A = len(list(metabolites['BiGG ID']))
size_B = len(list(rel_mets_df['BiGG ID']))
size_C = len(list(non_rel_mets_df['BiGG ID']))

# Calculate the percentages
percentage_B = size_B / size_A * 100
percentage_C = size_C / size_A * 100

# Create a bar plot with a tall and thin bar
plt.figure(figsize=(1,8))  # Adjust the size of the plot. Increase the second number to make it taller
plt.bar(1, percentage_B, color='blue', label='Relevant Mets.', width=0.1)  # Decrease the width to make the bar thinner
plt.bar(1, percentage_C, bottom=percentage_B, color='gold', label='Rest of the dataset', width=0.1)

# Set the labels and title
plt.ylabel('Percentage of Metabolites')
plt.xticks([])  # Hide x ticks
plt.yticks(np.arange(0, 101, 20))  # Set the y ticks
plt.gca().yaxis.set_major_formatter(PercentFormatter())  # Format the y ticks as percentages
plt.ylim([0, 100])  # Set the y limit
plt.box(False)  # Remove the box around the plot
plt.legend(loc='upper right', bbox_to_anchor=(2.3, 1.13))  # Move the legend to the upper right corner

# Save and Show the plot
print(percentage_B)
print(percentage_C)
plt.savefig('percentage_relevant_mets.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Calculate the percentage of non-relevant metabolites with and without info
info2 = []
no_info2 = []
for i,m in non_rel_mets_df.iterrows():
    if (m['PubChem']!='NaN' or m['Inchi']!='NaN' or m['SMILES']!='NaN'):
        info2.append(m['BiGG ID'])
    if (m['PubChem']=='NaN' and m['Inchi']=='NaN' and m['SMILES']=='NaN'):
        no_info2.append(m['BiGG ID'])
        
print(f'Percentage of metabolites with info: {len(info2)/len(non_rel_mets_df)*100}%')
print(f'Percentage of metabolites with no info: {len(no_info2)/len(non_rel_mets_df)*100}%')

In [None]:
len(info2)

### 4.3 Subsystems

In [None]:
##### ----- Create a model and add reactions ----- #####
model = Model("iCHO")
lr = []
for _, row in rxns.iterrows():
    r = Reaction(row['Reaction'])
    lr.append(r)    
model.add_reactions(lr)
model

In [None]:
##### ----- Add information to each one of the reactions ----- #####
for i,r in enumerate(tqdm(model.reactions)):
    print(r.id)
    r.build_reaction_from_string(rxns['Reaction Formula'][i])
    r.name = rxns['Reaction Name'][i]
    r.subsystem = rxns['Subsystem'][i]

In [None]:
# List of metabolite IDs


subsystems_rel = []
subsystems_info = []
subsystems_non_info = []

# Loop over the list of metabolites in the relevant metabolites list
for met_id in rel_mets_list:
    # Get the metabolite
    try:
        metabolite = model.metabolites.get_by_id(met_id)
    except KeyError:
        print(f'Metabolite {met_id} not in the model')
        continue  # skip this ID so a stale `metabolite` is not reused below
    
    # Get the reactions involving this metabolite
    reactions = metabolite.reactions

    # Add the subsystems for these reactions to our set
    for r in reactions:
        subsystems_rel.append(r.subsystem)

subs_rel_freq = Counter(subsystems_rel)
subs_rel_freq = Counter({key: subs_rel_freq[key] for key in subs_rel_freq if 'TRANSPORT' not in key})
subs_rel_freq = Counter({key: subs_rel_freq[key] for key in subs_rel_freq if 'EXCHANGE' not in key})


# Loop over the list of metabolites in the metabolites with information
for met_id in info2:
    # Get the metabolite
    try:
        metabolite = model.metabolites.get_by_id(met_id)
    except KeyError:
        print(f'Metabolite {met_id} not in the model')
        continue  # skip this ID so a stale `metabolite` is not reused below
    
    # Get the reactions involving this metabolite
    reactions = metabolite.reactions

    # Add the subsystems for these reactions to our set
    for r in reactions:
        subsystems_info.append(r.subsystem)

subs_info_freq = Counter(subsystems_info)
subs_info_freq = Counter({key: subs_info_freq[key] for key in subs_info_freq if 'TRANSPORT' not in key})
subs_info_freq = Counter({key: subs_info_freq[key] for key in subs_info_freq if 'EXCHANGE' not in key})


# Loop over the list of metabolites in the metabolites with no information
for met_id in no_info2:
    # Get the metabolite
    try:
        metabolite = model.metabolites.get_by_id(met_id)
    except KeyError:
        print(f'Metabolite {met_id} not in the model')
        continue  # skip this ID so a stale `metabolite` is not reused below
    
    # Get the reactions involving this metabolite
    reactions = metabolite.reactions

    # Add the subsystems for these reactions to our set
    for r in reactions:
        subsystems_non_info.append(r.subsystem)

subs_non_info_freq = Counter(subsystems_non_info)
subs_non_info_freq = Counter({key: subs_non_info_freq[key] for key in subs_non_info_freq if 'TRANSPORT' not in key})
subs_non_info_freq = Counter({key: subs_non_info_freq[key] for key in subs_non_info_freq if 'EXCHANGE' not in key})
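
The three loops above differ only in the input list and the output counter, so they could be folded into one helper. The following is a minimal sketch, not part of the original notebook: `subsystem_counts` and its `exclude` parameter are hypothetical names, and the function assumes only that `model.metabolites.get_by_id` raises `KeyError` for unknown IDs and that every reaction has a string `subsystem`.

```python
from collections import Counter


def subsystem_counts(model, met_ids, exclude=("TRANSPORT", "EXCHANGE")):
    """Count the subsystems of all reactions involving the given metabolites.

    Hypothetical helper: `model` only needs `model.metabolites.get_by_id`,
    which is assumed to raise KeyError for IDs that are not in the model.
    """
    counts = Counter()
    for met_id in met_ids:
        try:
            metabolite = model.metabolites.get_by_id(met_id)
        except KeyError:
            print(f"Metabolite {met_id} not in the model")
            continue  # nothing to count for an unknown metabolite
        counts.update(r.subsystem for r in metabolite.reactions)
    # Drop subsystems whose name contains any of the excluded keywords
    return Counter({name: n for name, n in counts.items()
                    if not any(word in name for word in exclude)})
```

With such a helper, each frequency table would reduce to a single call, for example `subs_rel_freq = subsystem_counts(model, rel_mets_list)`.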

In [None]:
#subs_rel_freq
#subs_info_freq
#subs_non_info_freq

In [None]:
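# Combine the subsystem counts for the relevant metabolites and the metabolites with information
mets_with_info = subs_rel_freq + subs_info_freq
# Keep only the subsystems that never appear among the metabolites without information
C3 = Counter({key: mets_with_info[key] for key in mets_with_info if key not in subs_non_info_freq})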
C3
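
To make the `Counter` arithmetic in the cell above concrete, here is a small sketch with made-up subsystem counts (the names and numbers are illustrative only, not taken from the reconstruction). Adding two counters sums the counts of shared keys, and the dictionary comprehension then removes every key that is present in the third counter.

```python
from collections import Counter

# Hypothetical subsystem counts, for illustration only
rel = Counter({"GLYCOLYSIS": 3, "UREA CYCLE": 1})
info = Counter({"GLYCOLYSIS": 2, "HEME SYNTHESIS": 4})
non_info = Counter({"UREA CYCLE": 5})

# Counts for keys present in both counters are summed
combined = rel + info

# Keep only the keys that do not occur in `non_info`
only_with_info = Counter({k: v for k, v in combined.items() if k not in non_info})
print(only_with_info)
```

Here `UREA CYCLE` is dropped because it also occurs in `non_info`, while `GLYCOLYSIS` keeps the summed count of 5.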

In [None]:
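# Plot the subsystem frequencies as a circular word cloud
import numpy as np
import matplotlib.pyplot as plt
from skimage import draw
from wordcloud import WordCloud

radius = 500  # you can change to the size you need
# WordCloud treats white (255) mask entries as masked out, so start from an all-white
# square and set the disk where words may be drawn to 0
circle_img = np.full((2*radius, 2*radius), 255, np.uint8)
rr, cc = draw.disk((radius, radius), radius)
circle_img[rr, cc] = 0

# Create the word cloud (with a mask, the image takes the mask's shape, so width and height are not needed)
wordcloud = WordCloud(mask=circle_img, background_color="rgba(255, 255, 255, 0)", mode="RGBA").generate_from_frequencies(C3)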

plt.figure(figsize=(8,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

plt.savefig('wordcloud.png', bbox_inches='tight', transparent=True, pad_inches=0)
plt.show()
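
A note on the mask convention: as far as the `wordcloud` documentation describes it, white entries (255) of the mask are masked out and all other entries are free to draw on. The sketch below builds a circular mask under that assumption using NumPy only, as an alternative to `skimage.draw.disk`; `circle_mask` is a hypothetical name.

```python
import numpy as np

radius = 500

# Row and column coordinates of a (2*radius) x (2*radius) grid
y, x = np.ogrid[:2 * radius, :2 * radius]

# True for pixels that lie outside the circle centred in the grid
outside = (x - radius) ** 2 + (y - radius) ** 2 > radius ** 2

# 255 (white) = masked out, 0 = free to draw on
circle_mask = np.where(outside, 255, 0).astype(np.uint8)
```

The resulting array could be passed as `mask=circle_mask` when constructing the `WordCloud`.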

##### Pandas AI

In [None]:
import os

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

# Instantiate the LLM. The API key is read from the environment; never hard-code it in the notebook.
llm = OpenAI(api_token=os.environ['OPENAI_API_KEY'])

pandas_ai = PandasAI(llm, conversational=True)
pandas_ai.run(met, prompt='Plot a pie chart of the number of metabolites in each compartment, using a different color for each slice')

In [None]:
pandas_ai = PandasAI(llm, conversational=True)
pandas_ai.run(met, prompt='How many metabolites are in the nucleus compartment?')

In [None]:
# Convert metabolite names to lower case, strip the two-character compartment suffix
# (e.g. "_m") from the BiGG ID, and keep one row per compartment-free ID
met['Name'] = met['Name'].str.lower()
met_copy = met.copy()
met_copy['BiGG ID'] = met_copy['BiGG ID'].str[:-2]
met = met_copy.groupby('BiGG ID').first().reset_index()
met
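
The slice `.str[:-2]` in the cell above assumes that every BiGG ID ends in an underscore plus a one-letter compartment code. A slightly more defensive variant, sketched below on made-up rows (the IDs and names are illustrative only), splits once from the right on `_` instead. Note also that `groupby(...).first()` returns the first non-null value of each column, so a collapsed row may combine values from different compartments.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "BiGG ID": ["glc__D_c", "glc__D_m", "atp_c"],
    "Name": ["d-glucose", "d-glucose", "atp"],
})

# Split once from the right on "_" and keep the part before the compartment code
toy["BiGG ID"] = toy["BiGG ID"].str.rsplit("_", n=1).str[0]

# One row per compartment-free ID, taking the first non-null value of each column
unique = toy.groupby("BiGG ID").first().reset_index()
print(unique)
```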

In [None]:
pandas_ai = PandasAI(llm, conversational=False)
pandas_ai.run(met, prompt='Which metabolites better correlate?')

In [None]:
met

In [None]:
import pandas as pd

# Rows transcribed from the printed DataFrame, keyed by the original row index.
# '' marks a blank cell, None a missing value; names ending in '...' were already truncated in the printout.
columns = ['Curated', 'BiGG ID', 'Name', 'Formula', 'Compartment', 'KEGG', 'CHEBI', 'PubChem']
mito, perox, cyto, lyso = 'm - mitochondria', 'x - peroxisome/glyoxysome', 'c - cytosol', 'l - lysosome'
rows = {
    176: ('', 'M00056_m', '(2e)-nonenoyl-coa', 'C30H46N7O17P3S', mito, None, None, None),
    193: ('', 'M00071_m', '(2e)-undecenoyl-coa', 'C32H50N7O17P3S', mito, '', '', ''),
    1014: ('', 'CE2038_x', 'trans-2,3-dehydropristanoyl coenzyme a', 'C40H66N7O17P3S', perox, '', '63803', '56927963'),
    1352: ('', 'CE4799_m', '2,6-dimethyl-trans-2-heptenoyl coenzyme a', 'C30H46N7O17P3S', mito, '', '', ''),
    1360: ('', 'CE4806_m', '4(r),8-dimethyl-trans-2-nonenoyl coenzyme a', 'C32H50N7O17P3S', mito, '', '', ''),
    1361: ('', 'CE4807_m', '4-methyl-trans-2-pentenoyl coenzyme a', 'C27H40N7O17P3S', mito, '', '', ''),
    1876: ('', 'CE5938_x', '(4r,8r,12r)-trimethyl-2e-tridecenoyl coenzyme a', 'C37H60N7O17P3S', perox, '', '', '53481434'),
    1982: ('', 'leuktrB4_c', '5,12-dihydroxy-6,8,10,14-eicosatetraenoic acid', 'C20H31O4', cyto, None, None, None),
    2531: ('', 'M00056_m', '(2e)-nonenoyl coenzyme a', 'C30H46N7O17P3S', mito, None, None, None),
    2540: ('', 'M00071_m', '(2e)-undecenoyl coenzyme a', 'C32H50N7O17P3S', mito, '', '', ''),
    2916: ('', 'M01191_m', '7z-hexadecenoyl coenzyme a', 'C37H60N7O17P3S', mito, None, None, None),
    2918: ('', 'M01191_x', '7z-hexadecenoyl coenzyme a', 'C37H60N7O17P3S', perox, None, None, None),
    3019: ('', 'xolest226_hs_l', 'cholesteryl docosahexanoate, cholesterol-ester...', 'C49H76O2', lyso, None, None, None),
    3023: ('', 'xolest205_hs_l', '1-timnodnoyl-cholesterol, cholesterol-ester (2...', 'C47H74O2', lyso, None, None, None),
    5636: ('', 'M01191_x', '7z-hexadecenoyl coenzyme a', 'C37H60N7O17P3S', perox, None, None, None),
    5794: ('', 'M01191_m', '7z-hexadecenoyl coenzyme a', 'C37H60N7O17P3S', mito, None, None, None),
    5795: ('', 'M01191_x', '7z-hexadecenoyl coenzyme a', 'C37H60N7O17P3S', perox, None, None, None),
    6078: ('', 'leuktrB4_c', 'leukotriene b4(1-)', 'C20H31O4', cyto, '', '15647', '5280492'),
    7439: ('', 'CE4799_m', '2,6-dimethyl-trans-2-heptenoyl-coa', 'C30H46N7O17P3S', mito, '', '', ''),
    7440: ('', 'CE4807_m', '4-methyl-trans-2-pentenoyl-coa', 'C27H40N7O17P3S', mito, '', '', ''),
    7441: ('', 'CE2038_x', 'trans-2,3-dehydropristanoyl-coa', 'C40H66N7O17P3S', perox, None, None, None),
    7442: ('', 'CE4806_m', '4(r),8-dimethyl-trans-2-nonenoyl-coa', 'C32H50N7O17P3S', mito, '', '', ''),
    7443: ('', 'CE5938_x', '(4r,8r,12r)-trimethyl-(2e)-tridecenoyl-coa', 'C37H60N7O17P3S', perox, None, None, None),
    8036: ('Than', 'xolest205_hs_l', '1-timnodnoyl-cholesterol, cholesterol-ester (2...', 'C47H74O2', lyso, '', '', '53477889'),
    8039: ('Than', 'xolest226_hs_l', 'cholesteryl docosahexanoate, cholesterol-ester...', 'C49H76O2', lyso, '', '', '14274978'),
}

# Create a DataFrame
df = pd.DataFrame.from_dict(rows, orient='index', columns=columns)


In [None]:
df