# Metabolites

## Description
In this notebook we create a dataframe containing all the available information for the metabolites included in our reconstruction.<br>
[1. Generation of Metabolites dataset](#generation) <br>
&nbsp;&nbsp;&nbsp;&nbsp;**1.1 Retrieve a list of all the metabolites from our reconstruction** <br>
&nbsp;&nbsp;&nbsp;&nbsp;**1.2 Retrieve information from all the metabolites on Recon3D, iCHO2291 and iCHO1766**<br>
&nbsp;&nbsp;&nbsp;&nbsp;**1.3 Add all the metabolites information into our metabolites dataset** <br>
&nbsp;&nbsp;&nbsp;&nbsp;**1.4 Unique metabolite identification** <br><br>
[2. Metabolites Curation](#curation) <br>
&nbsp;&nbsp;&nbsp;&nbsp;**2.1 Update missing information in metabolites dataset from BiGG** <br>
&nbsp;&nbsp;&nbsp;&nbsp;**2.2 Update missing information in metabolites dataset from other databases** <br>
&nbsp;&nbsp;&nbsp;&nbsp;**2.3 Identification of duplicated metabolites** <br>

In [None]:
# Import libraries
import gspread
import pandas as pd
import numpy as np
import requests
import time

import cobra
from cobra import Model
from cobra.io import read_sbml_model

from tqdm.notebook import tqdm

from google_sheet import GoogleSheet

In [None]:
# Define functions

def df_to_dict(df, key_col):
    """
    This function takes a pandas dataframe and a key column, and returns a dictionary
    with the key column as the dictionary keys and the rest of the columns as the values.
    """
    # Create an empty dictionary to hold the key-value pairs
    my_dict = {}
    
    # Loop through each row in the dataframe
    for index, row in df.iterrows():
        # Use the value in the key column as the dictionary key
        key_value = row[key_col]
        
        # Use the rest of the columns as the dictionary values
        value_dict = row.drop(key_col).to_dict()
        
        # Add the key-value pair to the dictionary
        my_dict[key_value] = value_dict
    
    return my_dict

<a id='generation'></a>
## 1. Generation of Metabolites dataset
We start by creating a list of all the metabolites included in the reactions of our reconstruction (1.1). Then we create a dataset containing all the metabolite information from the Recon3D, iCHO2291 and iCHO1766 models, including supplementary information from Recon3D (1.2). Finally, we map this information back onto the metabolites of our reconstruction and generate an Excel file for uploading to Google Sheets (1.3).

### 1.1 Retrieve a list of all the metabolites from our reconstruction
The reactions and the metabolites involved in them are listed in the Rxns sheet of the Google Sheet.

In [None]:
from google_sheet import GoogleSheet

KEY_FILE_PATH = 'credentials.json'
SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

# Initialize the GoogleSheet object
sheet = GoogleSheet(SPREADSHEET_ID, KEY_FILE_PATH)

# Read data from the Google Sheet and create the "rxns" df
sheet_rxns = 'Rxns'
rxns = sheet.read_google_sheet(sheet_rxns)

In [None]:
# Create a cobra model to identify the metabolites involved in our reconstruction
model = cobra.Model("iCHOxxxx")
lr = []

for _, row in rxns.iterrows():
    r = cobra.Reaction(row['Reaction'])
    lr.append(r)
    
model.add_reactions(lr)
model

In [None]:
# With the built-in method "build_reaction_from_string" we can identify the metabolites
# (the formulas come from the "rxns" dataframe read from the Google Sheet)
for i,r in enumerate(tqdm(model.reactions)):
    r.build_reaction_from_string(rxns['Reaction Formula'][i])

In [None]:
# We first create a list of the metabolites and then a pandas df with it
metabolites_list = []
for met in model.metabolites:
    metabolites_list.append(met.id)
    
metabolites = pd.DataFrame(metabolites_list, columns =['BiGG ID'])
metabolites

### 1.2 Retrieve information from all the metabolites on Recon3D, iCHO2291 and iCHO1766
We use two data sources for this. First, we take information from the Recon3D, iCHO2291 and iCHO1766 model files (.xml), from which we get the metabolite ID, Name, Formula and Compartment. We then add the metadata for the available metabolites from the Recon3D supplementary files.

In [None]:
# read the Recon3D model
recon3d_model = read_sbml_model('../Data/GPR_Curation/Recon3D.xml')

In [None]:
# Generate a dataset containing all the metabolites, chemical formula of each metabolite and compartment
num_rows = len(recon3d_model.metabolites)
recon3d_model_metabolites = pd.DataFrame(index=range(num_rows), columns=['BiGG ID', 'Name', 'Formula', 'Compartment'])
for i,met in enumerate(recon3d_model.metabolites):
    id_ = met.id
    name = met.name
    formula = met.formula
    comp = met.compartment
    recon3d_model_metabolites.iloc[i] = [id_, name, formula, comp]

In [None]:
recon3d_model_metabolites

In [None]:
# read the Yeo's model
iCHO2291_model = read_sbml_model('../Data/Reconciliation/models/iCHO2291.xml')

In [None]:
# Generate a dataset containing all the metabolites, chemical formula of each metabolite and compartment from Yeo's model
num_rows = len(iCHO2291_model.metabolites)
iCHO2291_model_metabolites = pd.DataFrame(index=range(num_rows), columns=['BiGG ID', 'Name', 'Formula', 'Compartment'])
for i,met in enumerate(iCHO2291_model.metabolites):
    id_ = met.id
    name = met.name
    formula = met.formula
    comp = met.compartment
    iCHO2291_model_metabolites.iloc[i] = [id_, name, formula, comp]
    
iCHO2291_model_metabolites['BiGG ID'] = iCHO2291_model_metabolites['BiGG ID'].str.replace("[", "_", regex=False)
iCHO2291_model_metabolites['BiGG ID'] = iCHO2291_model_metabolites['BiGG ID'].str.replace("]", "", regex=False)
iCHO2291_model_metabolites
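
The two `str.replace` calls above turn bracketed compartment tags such as `h2o[c]` into the underscore form `h2o_c`. A minimal sketch of the same normalization as a single helper follows. The name `normalize_met_id` is hypothetical (it is not part of the notebook), the sample IDs are illustrations only, and unlike the original calls it only rewrites a bracketed tag at the end of the ID.

```python
import re

# Hypothetical helper (not in the original notebook): convert an ID such as
# "h2o[c]" into "h2o_c". IDs without a trailing [compartment] tag are
# returned unchanged.
_BRACKET_TAG = re.compile(r"\[([A-Za-z]+)\]$")

def normalize_met_id(met_id: str) -> str:
    """Rewrite a trailing "[x]" compartment tag as "_x"."""
    return _BRACKET_TAG.sub(r"_\1", met_id)

print(normalize_met_id("h2o[c]"))
print(normalize_met_id("atp_m"))
```

Applied to a pandas column, a helper like this could be passed to `Series.map`, which keeps the conversion in one place if other models use the same bracket notation.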

In [None]:
# read Hefzi's model
iCHO1766_model = read_sbml_model('../Data/Reconciliation/models/iCHOv1_final.xml')

In [None]:
# Generate a dataset containing all the metabolites, chemical formula of each metabolite and compartment from Hefzi's model
num_rows = len(iCHO1766_model.metabolites)
iCHO1766_model_metabolites = pd.DataFrame(index=range(num_rows), columns=['BiGG ID', 'Name', 'Formula', 'Compartment'])
for i,met in enumerate(iCHO1766_model.metabolites):
    id_ = met.id
    name = met.name
    formula = met.formula
    comp = met.compartment
    iCHO1766_model_metabolites.iloc[i] = [id_, name, formula, comp]

iCHO1766_model_metabolites

In [None]:
models_metabolites = pd.concat([recon3d_model_metabolites, iCHO2291_model_metabolites, iCHO1766_model_metabolites])
models_metabolites = models_metabolites.groupby('BiGG ID').first()
models_metabolites = models_metabolites.reset_index(drop = False)
models_metabolites
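
The `groupby('BiGG ID').first()` step above keeps, for every ID, the first non-null value found in each column, so the order of the frames in `pd.concat` decides which model wins when two models disagree. The sketch below reproduces that precedence rule with plain dictionaries. The function name and the toy records are invented for illustration, and it treats only `None` as missing (in pandas an empty string is not null either).

```python
from typing import Dict, Iterable, Optional

Record = Dict[str, Optional[str]]

def merge_first_non_null(sources: Iterable[Iterable[Record]],
                         key: str = "BiGG ID") -> Dict[str, Record]:
    """Merge records by key; earlier sources take precedence per field."""
    merged: Dict[str, Record] = {}
    for records in sources:
        for rec in records:
            target = merged.setdefault(rec[key], {})
            for field, value in rec.items():
                if field == key:
                    continue
                # Fill a field only while it is still missing
                if target.get(field) is None:
                    target[field] = value
    return merged

# Toy records, invented for illustration only
first_model = [{"BiGG ID": "glc__D_c", "Name": "D-Glucose", "Formula": None}]
second_model = [
    {"BiGG ID": "glc__D_c", "Name": "glucose", "Formula": "C6H12O6"},
    {"BiGG ID": "pyr_c", "Name": "Pyruvate", "Formula": "C3H3O3"},
]
merged = merge_first_non_null([first_model, second_model])
print(merged["glc__D_c"])
```

Here the name comes from the first source and the formula from the second, which mirrors how Recon3D values take priority over the iCHO models in the cell above.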

In [None]:
#Generation of a dataset containing all the information from Recon3D metabolites Supplementary Data.
recon3d_metabolites_meta = pd.read_excel('../Data/Metabolites/metabolites.recon3d.xlsx', header = 0)
recon3d_metabolites_meta['BiGG ID'] = recon3d_metabolites_meta['BiGG ID'].str.replace("[", "_", regex=False)
recon3d_metabolites_meta['BiGG ID'] = recon3d_metabolites_meta['BiGG ID'].str.replace("]", "", regex=False)
recon3d_metabolites_meta

In [None]:
# Transformation of the "recon3d_metabolites_meta" into a dict to map it into the "models_metabolites" dataset
recon3dmet_dict = df_to_dict(recon3d_metabolites_meta, 'BiGG ID')

In [None]:
# Mapping into the "models_metabolites" dataset
models_metabolites[['KEGG','CHEBI', 'PubChem','Inchi', 'Hepatonet', 'EHMNID', 'SMILES', 'INCHI2',
                          'CC_ID','Stereoisomer Information of Metabolite Identified', 'Charge of the Metabolite Identified',
    'CID_ID','PDB (ligand-expo) Experimental Coordinates  File Url', 'Pub Chem Url',
    'ChEBI Url']] = models_metabolites['BiGG ID'].apply(lambda x: pd.Series(recon3dmet_dict.get(x, None), dtype=object))

In [None]:
models_metabolites

In [None]:
# Transform the final Recon3D Metabolites dataset into a dictionary to map it into our dataset
final_met_dict = df_to_dict(models_metabolites, 'BiGG ID')

### 1.3 Add all the metabolites information into our metabolites dataset
With the dictionary created in **Step 1.2** we can map the information onto the metabolites dataset created in **Step 1.1**, which contains all the metabolites of our reconstruction.

In [None]:
metabolites[['Name', 'Formula', 'Compartment', 'KEGG','CHEBI', 'PubChem','Inchi', 'Hepatonet', 'EHMNID', 'SMILES',
             'INCHI2','CC_ID','Stereoisomer Information of Metabolite Identified', 'Charge of the Metabolite Identified',
    'CID_ID','PDB (ligand-expo) Experimental Coordinates  File Url', 'Pub Chem Url',
    'ChEBI Url']] = metabolites['BiGG ID'].apply(lambda x: pd.Series(final_met_dict.get(x, None), dtype=object))

In [None]:
# Update the Compartment column in the final dataset
for i,row in metabolites.iterrows():
    if row['Compartment'] == 'c':
        metabolites.loc[i, 'Compartment'] = 'c - cytosol'
    if row['Compartment'] == 'l':
        metabolites.loc[i, 'Compartment'] = 'l - lysosome'
    if row['Compartment'] == 'm':
        metabolites.loc[i, 'Compartment'] = 'm - mitochondria'
    if row['Compartment'] == 'r':
        metabolites.loc[i, 'Compartment'] = 'r - endoplasmic reticulum'
    if row['Compartment'] == 'e':
        metabolites.loc[i, 'Compartment'] = 'e - extracellular space'
    if row['Compartment'] == 'x':
        metabolites.loc[i, 'Compartment'] = 'x - peroxisome/glyoxysome'
    if row['Compartment'] == 'n':
        metabolites.loc[i, 'Compartment'] = 'n - nucleus'
    if row['Compartment'] == 'g':
        metabolites.loc[i, 'Compartment'] = 'g - golgi apparatus'
    if row['Compartment'] == 'im':
        metabolites.loc[i, 'Compartment'] = 'im - intermembrane space of mitochondria'
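
The chain of `if` statements above is a lookup from a compartment code to a descriptive label. A minimal sketch of the same mapping as a dictionary follows. The labels are copied from the cell above, while the names `COMPARTMENT_LABELS` and `label_compartment` are hypothetical.

```python
# Labels copied from the cell above; the names below are illustrative only.
COMPARTMENT_LABELS = {
    "c": "c - cytosol",
    "l": "l - lysosome",
    "m": "m - mitochondria",
    "r": "r - endoplasmic reticulum",
    "e": "e - extracellular space",
    "x": "x - peroxisome/glyoxysome",
    "n": "n - nucleus",
    "g": "g - golgi apparatus",
    "im": "im - intermembrane space of mitochondria",
}

def label_compartment(code):
    """Return the descriptive label, or the input unchanged if the code is unknown."""
    return COMPARTMENT_LABELS.get(code, code)

print(label_compartment("r"))
print(label_compartment("im"))
```

With pandas, a function like this could be applied to the whole column in one step via `metabolites['Compartment'].map(label_compartment)`, avoiding the row-by-row loop.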

In [None]:
# The dataset generated is stored as an Excel file in the "Data" folder
metabolites.to_excel('../Data/Metabolites/metabolites.xlsx')

### 1.4 Unique metabolite identification
This next block of code gives us an idea of how many duplicated metabolites we have in our generated dataset based on the IDs, Name, Formula and KEGG IDs.

In [None]:
##### ----- Generate datasets from Google Sheet ----- #####

#Credential file
KEY_FILE_PATH = 'credentials.json'

# #CHO Network Reconstruction + Recon3D_v2 Google Sheet ID
# SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

#CHO Network Reconstruction + Recon3D_v3 Google Sheet ID
SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

# Initialize the GoogleSheet object
sheet = GoogleSheet(SPREADSHEET_ID, KEY_FILE_PATH)

# Read data from the Google Sheet
sheet_met = 'Metabolites'
metabolites = sheet.read_google_sheet(sheet_met)

In [None]:
print("Duplicated metabolites by BiGG ID = ", len(metabolites['BiGG ID']) - len(metabolites['BiGG ID'].unique()))
print("Duplicated metabolites by Name = ", len(metabolites['Name']) - len(metabolites['Name'].unique()))
print("Duplicated metabolites by Formula = ", len(metabolites['Formula']) - len(metabolites['Formula'].unique()))
print("Duplicated metabolites by KEGG = ", len(metabolites['KEGG']) - len(metabolites['KEGG'].unique()))
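
The counts above are computed as total rows minus unique values, so every blank cell after the first one is also counted as a duplicate. For sparsely filled columns such as KEGG this can inflate the number. The sketch below shows the difference on invented sample values. `count_surplus` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def count_surplus(values, ignore_blank=True):
    """Number of entries beyond the first occurrence of each value."""
    if ignore_blank:
        values = [v for v in values if v not in ("", None)]
    counts = Counter(values)
    return sum(n - 1 for n in counts.values())

# Invented sample column with three blank cells
kegg = ["C00031", "C00031", "", "", "", "C00022"]
print(count_surplus(kegg, ignore_blank=False))  # blanks count as duplicates of each other
print(count_surplus(kegg))                      # blanks are skipped
```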

<a id='curation'></a>
## 2. Metabolites Curation
In this second part of the notebook we curate missing information in the metabolites dataset generated above. Since many metabolites have been manually curated in the "Metabolites" Google Sheet, we generate a new dataframe using the gspread library to obtain the metabolites dataset with all the changes.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import time

from tqdm.notebook import tqdm

from google_sheet import GoogleSheet
from metabolite_identifiers import getPubchemCID, getChEMBLID, getCIDSmilesInChI, getCIDFormula

### 2.1 Update missing information in metabolites dataset from BiGG

In [None]:
#Credential file
KEY_FILE_PATH = 'credentials.json'

#CHO Network Reconstruction + Recon3D_v3 Google Sheet ID
SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

# Initialize the GoogleSheet object
sheet = GoogleSheet(SPREADSHEET_ID, KEY_FILE_PATH)

# Read data from the Google Sheet
sheet_met = 'Metabolites'
metabolites = sheet.read_google_sheet(sheet_met)

In [None]:
# Get BiGG descriptive names from the BiGG database
import requests
from bs4 import BeautifulSoup
import time

# Unknown mets: metabolites without names (copied so later edits do not act on a view of "metabolites")
unkown_mets = metabolites[metabolites['Name'] == ''].copy()

Descriptive_Names = [''] * len(unkown_mets)
Formulae = [''] * len(Descriptive_Names)
Changed = [True] * len(Descriptive_Names)

for Met_Counter, metID in enumerate(tqdm(unkown_mets['BiGG ID'].iloc[:])):
    print(Met_Counter)
    input_str = metID[:-2]
    response = requests.get(f"http://bigg.ucsd.edu/universal/metabolites/{input_str}")
    time.sleep(1)
    # Check if the request was successful
    if response.status_code != 200:
        D_Name = "BiGG ID not found in BiGG"
        Formulae_B = "BiGG ID not found in BiGG"
        Changed[Met_Counter] = False       
    else:    
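        soup = BeautifulSoup(response.content, 'html.parser')
        # soup.find returns None when a header is missing, so check before reading the sibling
        N_Header = soup.find('h4', string='Descriptive name:')
        N_Formulae = soup.find('h4', string='Formulae in BiGG models: ')
        name_tag = N_Header.find_next_sibling('p') if N_Header is not None else None
        formula_tag = N_Formulae.find_next_sibling('p') if N_Formulae is not None else None
        D_Name = name_tag.text if name_tag is not None else "Name not found in BiGG"
        Formulae_B = formula_tag.text if formula_tag is not None else "Formula not found in BiGG"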
    Descriptive_Names[Met_Counter] = D_Name
    Formulae[Met_Counter] = Formulae_B
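
The lookup above removes the compartment with `metID[:-2]`, which assumes a one-letter compartment code. The compartment list in section 1.3 also includes the two-letter code `im`, for which the slice would leave a trailing underscore. A minimal sketch of a suffix-aware split follows. `split_compartment` and the sample ID `h_im` are hypothetical. The compartment codes are the ones listed in section 1.3.

```python
# Compartment codes as listed in section 1.3 of this notebook
KNOWN_COMPARTMENTS = {"c", "l", "m", "r", "e", "x", "n", "g", "im"}

def split_compartment(met_id):
    """Split "base_comp" into (base, comp); return (met_id, None) if no known suffix."""
    base, sep, suffix = met_id.rpartition("_")
    if sep and suffix in KNOWN_COMPARTMENTS:
        return base, suffix
    return met_id, None

print(split_compartment("10fthf5glu_c"))
print(split_compartment("h_im"))
```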

In [None]:
for Met_Counter, metID in enumerate(unkown_mets['BiGG ID']):
    print('before',unkown_mets['BiGG ID'].iloc[Met_Counter])
    print('before',unkown_mets['Formula'].iloc[Met_Counter])
    print('before',unkown_mets['Name'].iloc[Met_Counter])
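    # Assign through a single .iloc call to avoid chained assignment
    if unkown_mets['Formula'].iloc[Met_Counter] == '':
        unkown_mets.iloc[Met_Counter, unkown_mets.columns.get_loc('Formula')] = Formulae[Met_Counter]
    unkown_mets.iloc[Met_Counter, unkown_mets.columns.get_loc('Name')] = Descriptive_Names[Met_Counter]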
    print('..............................................')
    print('after',unkown_mets['BiGG ID'].iloc[Met_Counter])
    print('after',unkown_mets['Formula'].iloc[Met_Counter])
    print('after',unkown_mets['Name'].iloc[Met_Counter])
    print('..............................................')
    print('..............................................')
    print('..............................................')

In [None]:
metabolites.update(unkown_mets)

# Manual Curation
for bigg_id in metabolites['BiGG ID']:
    # xtra = Xanthurenic acid; C10H6NO4
    # http://bigg.ucsd.edu/models/iCHOv1/reactions/r0647
    if 'xtra' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'Xanthurenic acid'
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Formula'] = 'C10H6NO4'
    # chedxch = Bilirubin-monoglucuronoside; C39H42N4O122-
    # Reactions name = 'ATP-binding Cassette (ABC) TCDB:3.A.1.208.2' --> https://metabolicatlas.org/identifier/TCDB/3.A.1.208.2
    elif 'chedxch' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'Bilirubin-monoglucuronoside'
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Formula'] = 'C39H42N4O122-'
    # Name suggested by ChatGPT
    elif '3hoc246_6Z_9Z_12Z_15Z_18Z_21Zcoa' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'CoA molecule that has a 24-carbon fatty acid with six double bonds, with the location of the double bonds specified by the numbers and Zs'
    # Name suggested by ChatGPT
    elif 'c247_2Z_6Z_9Z_12Z_15Z_18Z_21Zcoa' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'CoA molecule that has a modified version of the same 24-carbon fatty acid, with a hydroxyl group added at the third carbon position'
    # Name suggested by ChatGPT
    elif '3hoc143_5Z_8Z_11Zcoa' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'CoA molecule that has a 14-carbon fatty acid with three double bonds, with the location of the double bonds specified by the numbers and Zs.'
    # Name suggested by ChatGPT
    elif '3oc143_5Z_8Z_11Zcoa' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'CoA molecule that has a modified version of the same 14-carbon fatty acid, with the hydroxyl group removed and one of the double bonds converted to a keto group'
    # Name suggested by ChatGPT
    elif 'acgalgalacglcgalgluside' in bigg_id:
        metabolites.loc[metabolites['BiGG ID'] == bigg_id, 'Name'] = 'Complex glycosphingolipid that contains multiple sugar residues'

    # 12e8hdx: identity still unresolved
    # hdxur: dead end

metabolites.to_excel('../Data/Metabolites/metabolites_final.xlsx')

### 2.2 Update missing information in metabolites dataset from other databases
Here we use functions from the `metabolite_identifiers` module to try to fetch InChI, SMILES and database identifiers for all the metabolites in our reconstruction.

In [2]:
#Credential file
KEY_FILE_PATH = 'credentials.json'

#CHO Network Reconstruction + Recon3D_v3 Google Sheet ID
SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

# Initialize the GoogleSheet object
sheet = GoogleSheet(SPREADSHEET_ID, KEY_FILE_PATH)

# Read data from the Google Sheet
sheet_met = 'Metabolites'
metabolites = sheet.read_google_sheet(sheet_met)

In [3]:
metabolites

Unnamed: 0,Curated,BiGG ID,Name,Formula,Compartment,KEGG,CHEBI,ChEMBLID,PubChem,Inchi,...,EHMNID,SMILES,INCHI2,CC_ID,Stereoisomer Information of Metabolite Identified,Charge of the Metabolite Identified,CID_ID,PDB (ligand-expo) Experimental Coordinates File Url,Pub Chem Url,ChEBI Url
0,,10fthf5glu_c,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,c - cytosol,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
1,,10fthf5glu_e,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,e - extracellular space,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
2,,10fthf5glu_l,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,l - lysosome,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
3,,10fthf5glu_m,10-formyltetrahydrofolate-[glu](5),C40H45N11O19,m - mitochondria,,,,,InChI=1S/C40H51N11O19/c41-40-49-32-31(37(66)50...,...,,N=c1nc(O)c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)NC(CC...,,,,,,,,
4,,10fthf6glu_c,10-formyltetrahydrofolate-[glu](6),C45H51N12O22,c - cytosol,,,,,InChI=1/C45H58N12O22/c46-45-55-36-35(38(67)56-...,...,,N=c1nc([O-])c2c([nH]1)NCC(CN(C=O)c1ccc(C(=O)N[...,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7984,,zym_int2_r,zymosterol intermediate 2 c27h42o,C27H42O,r - endoplasmic reticulum,,18252,,22298942,InChI=1S/C27H42O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H42O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,22298942,,https://pubchem.ncbi.nlm.nih.gov/compound/2229...,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
7985,,zymst_c,zymosterol c27h44o,C27H44O,c - cytosol,,,,92746,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,92746,,https://pubchem.ncbi.nlm.nih.gov/compound/92746,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
7986,,zymst_r,zymosterol c27h44o,C27H44O,r - endoplasmic reticulum,,,,92746,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,92746,,https://pubchem.ncbi.nlm.nih.gov/compound/92746,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
7987,,zymstnl_c,5alpha-cholest-8-en-3beta-ol,C27H46O,c - cytosol,,16608,,101770,InChI=1S/C27H46O/c1-18(2)7-6-8-19(3)23-11-12-2...,...,C03845,[H][C@@]12CCC3=C(CC[C@]4(C)[C@]([H])(CC[C@@]34...,InChI=1S/C27H46O/c1-18(2)7-6-8-19(3)23-11-12-2...,,,Neutral,101770,,http://pubchem.ncbi.nlm.nih.gov/compound/101770,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...


In [None]:
counter = 0
no_match = [] #create an empty list with PubChem IDs that don't match with the formulas in the dataset
for i,met in metabolites.iterrows():
    if met['PubChem'] == '':
        try:
            pubchem_id = getPubchemCID(met['Name'],met['SMILES'])
        except Exception:
            # Fall back to a name-only search if the query with SMILES fails
            pubchem_id = getPubchemCID(met['Name'],'')
         
        if pubchem_id:
            if (len(pubchem_id)>1): #If there is more than 1 Pubchem ID, check which one correspond to our metabolite
                match_found = False
                for _id in pubchem_id:
                    form = getCIDFormula(_id)
                    
                    # Compare the formula obtained from the PubChem ID to the one in our dataset
                    if (form == met['Formula']):
                        match_found = True
                        metabolites.loc[i, 'PubChem'] = _id
                        inchi_smiles = getCIDSmilesInChI(_id)
                        smiles = inchi_smiles[0]
                        inchi = inchi_smiles[1]
                        if (met['Inchi'] == ''):
                            metabolites.loc[i, 'Inchi'] = inchi
                        if (met['SMILES'] == ''):
                            metabolites.loc[i, 'SMILES'] = smiles   
                        print('Match found:'+met['BiGG ID'], _id, inchi, smiles)
                        break # break the loop as we found the match
                        
                if not match_found:  # if no match was found
                    _id = pubchem_id[0]  # take the first ID
                    metabolites.loc[i, 'PubChem'] = _id
                    inchi_smiles = getCIDSmilesInChI(_id)
                    smiles = inchi_smiles[0]
                    inchi = inchi_smiles[1]
                    if (met['Inchi'] == ''):
                        metabolites.loc[i, 'Inchi'] = inchi
                    if (met['SMILES'] == ''):
                        metabolites.loc[i, 'SMILES'] = smiles   
                    print('Not match found:'+met['BiGG ID'], pubchem_id, inchi, smiles)
                    no_match.append([met['BiGG ID'], pubchem_id, inchi, smiles])
                    
            # If there is only one ID associated to that metabolite        
            else:
                metabolites.loc[i, 'PubChem'] = pubchem_id[0]
                inchi_smiles = getCIDSmilesInChI(pubchem_id[0])
                smiles = inchi_smiles[0]
                inchi = inchi_smiles[1]
                if (met['Inchi'] == ''):
                    metabolites.loc[i, 'Inchi'] = inchi
                if (met['SMILES'] == ''):
                    metabolites.loc[i, 'SMILES'] = smiles
                print(met['BiGG ID'], pubchem_id[0], inchi, smiles)
            counter +=1
            print(counter)
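
The selection rule in the cell above is: when PubChem returns several candidate CIDs, keep the first one whose formula equals the formula in our dataset, and otherwise fall back to the first candidate and record it for manual review. The sketch below isolates that rule so it can be checked without network access. `pick_cid` is a hypothetical helper, `formula_of` stands in for `getCIDFormula`, and the CIDs and formulas are fake. Unlike the original cell, it also reports whether a single candidate matched the formula.

```python
from typing import Callable, Optional, Sequence, Tuple

def pick_cid(candidates: Sequence[str],
             expected_formula: str,
             formula_of: Callable[[str], Optional[str]]) -> Tuple[Optional[str], bool]:
    """Return (chosen CID, formula_matched) following the rule described above."""
    if not candidates:
        return None, False
    for cid in candidates:
        if formula_of(cid) == expected_formula:
            return cid, True
    # No formula match: fall back to the first candidate and flag it
    return candidates[0], False

# Fake lookup table standing in for a PubChem query
fake_formulas = {"111": "C6H12O6", "222": "C3H4O3"}
print(pick_cid(["111", "222"], "C3H4O3", fake_formulas.get))
print(pick_cid(["111", "222"], "C2H6O", fake_formulas.get))
```

Keeping the decision in one function would also remove the three near-identical branches that write `PubChem`, `Inchi` and `SMILES` in the loop.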


In [10]:
counter = 0
no_match = [] #create an empty list with PubChem IDs that don't match with the formulas in the dataset
for i,met in metabolites.iterrows():
    cmp = met['Name']
    if met['PubChem']=='NaN':
        pubchem_id = getPubchemCID(cmp,'')
         
        if pubchem_id:
            if (len(pubchem_id)>1): #If there is more than 1 Pubchem ID, check which one correspond to our metabolite
                match_found = False
                for _id in pubchem_id:
                    form = getCIDFormula(_id)
                    
                    # Compare the formula obtained from the PubChem ID to the one in our dataset
                    if (form == met['Formula']):
                        match_found = True
                        metabolites.loc[i, 'PubChem'] = _id
                        print('Match found:'+met['BiGG ID'], _id)
                        break # break the loop as we found the match
                        
                if not match_found:  # if no match was found
                    _id = pubchem_id[0]  # take the first ID in the pubchem_id list
                    metabolites.loc[i, 'PubChem'] = _id  
                    print('Not match found:'+met['BiGG ID'], pubchem_id)
                    no_match.append([met['BiGG ID'], pubchem_id])
                    
            # If there is only one ID associated to that metabolite        
            else:
                metabolites.loc[i, 'PubChem'] = pubchem_id[0]
                print(met['BiGG ID'], pubchem_id[0])
            counter +=1
            print(counter)


12dhlkB4_c 5280876
1
13dampp_e 4030255
2
15HPET_x 5280893
3
15kprostgf2_c 5280887
4
15kprostgf2_e 5280887
5
Match found:17ahprgnlone_e 91451
6
17ahprgstrn_e 6238
7
1a2425thvitd2_c 9547253
8
1a2425thvitd2_e 9547253
9
Match found:1a2425thvitd3_c 9910685
10
Not match found:1p2cbxl_c ['440046', '11966218']
11
20ahchsterol_c 121935
12
20ahchsterol_r 121935
13
21hprgnlone_c 247020
14
23dh1i56dio_c 170262
15
2aobut_c 440033
16
2dsis_c 194428
17
2h34hppr_c 636708
18
2h3mv_c 10796774
19
2hdvk_c 11953813
20
Not match found:2hyoxplac_e ['11970', '6933325', '101569100']
21
2obut_e 3593277
22
2ombz_c 5280835
23
2ommb_c 5280836
24
Not match found:2pentcoa_m ['11966167', '86290090']
25
Not match found:34dhoxmand_e ['85782', '19793107']
26
34dhpe_c 82755
27
34dhpe_e 82755
28
34dhpha_e 5460350
29
3a23dsi_c 46173947
30
3aap_c 5460495
31
3aap_x 5460495
32
3h3mglt_m 1662
33
3hbcoa_c 45266553
34
3hdcoa_x 11966158
35
Not match found:3hddcoa_x ['11966173', '440603', '16061157']
36
3hhcoa_m 11966160
37
3hhcoa

M00431_c 252379
335
M00433_c 92854
336
M00434_c 12850019
337
M00435_c 53297434
338
M00440_c 52929760
339
M00441_c 52929762
340
M00442_c 52929759
341
M00445_c 52929763
342
M00450_c 52929753
343
M00452_c 52929765
344
M00457_c 52929758
345
M00460_c 52929754
346
M00471_c 52929768
347
M00476_c 52929761
348
M00477_c 49837859
349
M00478_c 52929752
350
M00479_c 52929757
351
M00480_c 52929764
352
M00481_c 44575057
353
M00484_c 52929750
354
M00492_c 52929766
355
M00493_c 9547180
356
M00494_c 44575059
357
M00498_c 52929767
358
M00532_c 11825433
359
M00540_c 11954052
360
M00541_c 11954057
361
M00542_c 11954056
362
M00543_c 11954055
363
M00544_c 11954054
364
M00545_c 6849
365
M00546_c 11954051
366
M00547_c 11954053
367
M00550_x 167650
368
M00561_c 440568
369
M00565_c 11954073
370
M00566_c 11954072
371
M00568_c 6576
372
M00571_c 11954067
373
M00579_c 6453841
374
M00579_m 6453841
375
M00603_c 92827
376
M00606_m 167685
377
M00636_c 105131
378
M00646_c 1122
379
M00647_c 641637
380
M00658_m 50986248
381

alaglylys_e 14299171
703
alahisala_c 7019972
704
alahisala_e 7019972
705
Not match found:amet_e ['34755', '34756', '24762165']
706
and19one_c 252379
707
andrstandn_c 222865
708
andrstandn_e 222865
709
andrstndn_e 6128
710
anser_c 112072
711
aps_n 49852317
712
arach_l 10467
713
arach_r 10467
714
arachcoa_r 3081439
715
arachcrn_r 53477833
716
arachd_n 444899
717
aracheth_e 5712057
718
Not match found:arg_D_c ['1549104', '59667378']
719
Not match found:arg_D_e ['1549104', '59667378']
720
argalaala_c 102389965
721
argalaala_e 102389965
722
argalathr_c 11244912
723
argalathr_e 11244912
724
Not match found:argarg_c ['151956', '333441']
725
Not match found:argarg_e ['151956', '333441']
726
argarglys_c 52940699
727
argarglys_e 52940699
728
argglupro_c 145453815
729
argglupro_e 145453815
730
argglygly_c 15660735
731
argglygly_e 15660735
732
arghisthr_c 145453841
733
arghisthr_e 145453841
734
argleuphe_c 145453872
735
argleuphe_e 145453872
736
argpromet_c 145453923
737
argpromet_e 145453923
738


KeyboardInterrupt: 

In [11]:
sheet.update_google_sheet(sheet_met, metabolites)
print("Google Sheet updated.")

Google Sheet updated.


### 2.3 Identification of duplicated metabolites
The idea here is to add the metabolites from the reactions imported from Recon3D without overwriting data in our own Metabolites dataset.

In [None]:
from google_sheet import GoogleSheet

KEY_FILE_PATH = 'credentials.json'
SPREADSHEET_ID = '1MlBXeHIKw8k8fZyXm-sN__AHTRSunJxar_-bqvukZws'

# Initialize the GoogleSheet object
sheet = GoogleSheet(SPREADSHEET_ID, KEY_FILE_PATH)

# Read data from the Google Sheet
sheet_met = 'Metabolites'
sheet_rxns = 'Rxns'
shee_attributes = 'Attributes'

met = sheet.read_google_sheet(sheet_met)
rxns = sheet.read_google_sheet(sheet_rxns)
attributes = sheet.read_google_sheet(shee_attributes)

In [None]:
# Convert metabolites names to lower case and remove the compartment
met['Name'] = met['Name'].str.lower()
met_copy = met.copy()
met_copy['BiGG ID'] = met_copy['BiGG ID'].str[:-2]
met_copy

In [None]:
# Generate a list with duplicated metabolites

grouped = met_copy.groupby(['Name', 'Formula'])

# Initialize an empty list to store the results
duplicated_metabolites = []

# Iterate over the grouped DataFrame
for (Name, Formula), group in grouped:
    # Check if the group has more than one element (i.e., duplicate) and filter out those metabolites whose names are unknown
    if group['BiGG ID'].nunique() > 1 and Name != 'bigg id not found in bigg':
        unique_ids = group['BiGG ID'].unique()
        duplicated_metabolites.append((Name, Formula, unique_ids))
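
The grouping above flags a metabolite as duplicated when the same (Name, Formula) pair appears under more than one compartment-free BiGG ID. The sketch below shows the same idea with the standard library only. `find_duplicates` and the toy rows are invented for illustration.

```python
from collections import defaultdict

def find_duplicates(rows, skip_names=("bigg id not found in bigg",)):
    """Map (name, formula) to the sorted IDs sharing it, keeping only pairs with more than one ID."""
    groups = defaultdict(set)
    for name, formula, bigg_id in rows:
        if name in skip_names:
            continue
        groups[(name, formula)].add(bigg_id)
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}

# Toy rows (name, formula, compartment-free ID), invented for illustration
rows = [
    ("glucose", "C6H12O6", "glc__D"),
    ("glucose", "C6H12O6", "glc_D"),
    ("pyruvate", "C3H3O3", "pyr"),
    ("pyruvate", "C3H3O3", "pyr"),  # same ID in two compartments: not a duplicate
]
print(find_duplicates(rows))
```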

        


In [None]:
len(duplicated_metabolites)

In [None]:
# Generate empty dict to store the existence of each duplicated metabolite in BiGG
duplicated_dict = {}


for metabolite in tqdm(duplicated_metabolites):
    duplicated_dict[metabolite[0]] = {}
    for big_id in metabolite[2]:
        time.sleep(1)
        # Check if the metabolite is in BiGG "OK" or not "NO"
        response = requests.get(f"http://bigg.ucsd.edu/universal/metabolites/{big_id}")
        if response.status_code == 200:
            duplicated_dict[metabolite[0]][big_id] = 'OK'
        else:
            duplicated_dict[metabolite[0]][big_id] ='NO'
        


In [None]:
duplicated_dict.pop('proton')
duplicated_dict

In [None]:
duplicated_dict

In [None]:
# Create a dictionary to store the 'OK' subkey for each key in duplicated_dict
ok_dict = {}

# Iterate over keys in duplicated_dict
for key in duplicated_dict:
    # Create an empty list to store 'NO' subkeys for this key
    no_list = []
    # Iterate over subkeys and values in sub-dictionary
    for subkey, value in duplicated_dict[key].items():
        # If the value is 'OK', save the subkey to a variable
        if value == 'OK':
            ok_dict[key] = subkey
        # If the value is 'NO', add the subkey to the list
        elif value == 'NO':
            no_list.append(subkey)
    # Replace all 'NO' subkeys with the 'OK' subkey for this key
    if key in ok_dict:
        ok_subkey = ok_dict[key]
        for no_subkey in no_list:
            # regex=False: treat the IDs as literal text, not as regular expressions
            met['BiGG ID'] = met['BiGG ID'].str.replace(no_subkey, ok_subkey, regex=False)
            rxns['Reaction Formula'] = rxns['Reaction Formula'].str.replace(no_subkey, ok_subkey, regex=False)
            attributes['Reaction Formula'] = attributes['Reaction Formula'].str.replace(no_subkey, ok_subkey, regex=False)
    # Clear the stored 'OK' subkey for this key once the replacement is done
    ok_dict[key] = None
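
Because the replacement above works on substrings, an ID that happens to be a prefix of a longer ID would also be rewritten inside that longer ID. A more conservative variant only replaces the ID when it stands alone and is followed by a compartment suffix. The sketch below is one way to do that. `replace_met_id` and the sample formula are hypothetical, and the pattern assumes compartment suffixes of one or two lowercase letters.

```python
import re

def replace_met_id(formula: str, old_id: str, new_id: str) -> str:
    """Replace old_id only where it is a whole metabolite ID followed by "_<compartment>"."""
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])"                # not preceded by another ID character
        + re.escape(old_id)
        + r"(?=_[a-z]{1,2}(?![A-Za-z0-9_]))"  # followed by a compartment suffix that ends the ID
    )
    return pattern.sub(new_id, formula)

# Invented formula: "M0005" is a prefix of "M00056" and must not alter it
formula = "h2o_c + M0005_m --> M00056_m"
print(formula.replace("M0005", "abc"))          # naive substring replacement
print(replace_met_id(formula, "M0005", "abc"))  # whole-ID replacement
```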

In [None]:
# Store the original column order
column_order = met.columns.tolist()

# Group by 'BiGG ID' and keep the first non-null value in each group, then reset the index
met = met.groupby('BiGG ID').first().reset_index()

# Rearrange the columns to the original order
met = met[column_order]

met

In [None]:
# Update the Google Sheet with the modified DataFrame
sheet.update_google_sheet(sheet_rxns, rxns)
sheet.update_google_sheet(shee_attributes, attributes)
sheet.update_google_sheet(sheet_met, met)
print("Google Sheet updated.")

In [None]:
# Check for differences between the metabolites in the "Rxns" and "Metabolites" Sheets

model = Model("iCHO")
lr = []
for _, row in rxns.iterrows():
    r = cobra.Reaction(row['Reaction'])  # "Reaction" is not imported on its own; use the cobra namespace
    lr.append(r)    
model.add_reactions(lr)

for i,r in enumerate(tqdm(model.reactions)):
    print(r.id)
    r.build_reaction_from_string(rxns['Reaction Formula'][i]) 
    
# Collect the metabolite IDs found in the model and in the Metabolites sheet
model_met_list = [m.id for m in model.metabolites]
sheet_met_list = list(met['BiGG ID'])

# Use dedicated names so the cobra `model` and the GoogleSheet `sheet` objects are not overwritten
model_mets = set(model_met_list)
sheet_mets = set(sheet_met_list)

In [None]:
diff1 = model_mets - sheet_mets
print(f'Metabolites in the Rxns Sheet not present in the Metabolites Sheet: {list(diff1)}\n')

diff2 = sheet_mets - model_mets
print(f'Metabolites in the Metabolites Sheet not present in the Rxns Sheet: {list(diff2)}\n')

equal = (sheet_mets == model_mets)
if equal:
    print('Both sheets contain exactly the same metabolites')
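
The same comparison can be sketched without building a cobra model, by tokenizing the formula strings directly. This is only an illustration under stated assumptions: tokens are separated by whitespace, the arrow spellings listed in `ARROWS` cover the ones used in the sheet, and the function names and example formulas are hypothetical.

```python
import re

# Assumed arrow spellings; adjust to the ones actually used in the sheet
ARROWS = {'-->', '<--', '<=>', '->', '<-', '<->'}
# Plain numeric stoichiometric coefficients such as '2', '0.5' or '1e-3'
NUMBER = re.compile(r'\d+(?:\.\d+)?(?:[eE][-+]?\d+)?')


def ids_in_formula(formula):
    """Return the set of metabolite IDs mentioned in one formula string."""
    ids = set()
    for token in formula.split():
        if token == '+' or token in ARROWS or NUMBER.fullmatch(token):
            continue
        ids.add(token)
    return ids


def ids_in_formulas(formulas):
    """Union of the metabolite IDs over an iterable of formula strings."""
    found = set()
    for formula in formulas:
        found |= ids_in_formula(formula)
    return found


# Made-up formulas and sheet contents
used = ids_in_formulas([
    'atp_c + h2o_c --> adp_c + pi_c + h_c',
    '2.0 h_c + 0.5 o2_c <=> h2o_c',
])
listed = {'atp_c', 'h2o_c', 'adp_c', 'pi_c', 'h_c', 'glc__D_c'}

missing_from_sheet = used - listed
unused_in_rxns = listed - used
print(missing_from_sheet, unused_in_rxns)
```

The cobra-based check remains the reference, since it applies cobra's own formula parser; the tokenizer is just a quick cross-check.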

In [None]:
# Pandas AI

In [None]:
import os

import pandas as pd
from pandasai import PandasAI

# Instantiate a LLM
from pandasai.llm.openai import OpenAI
# Read the API key from the environment instead of hard-coding it in the notebook
llm = OpenAI(api_token=os.environ['OPENAI_API_KEY'])

pandas_ai = PandasAI(llm, conversational=True)
pandas_ai.run(met, prompt='Plot a pie chart of the number of metabolites in each compartment, using a different color for each slice')

In [None]:
pandas_ai = PandasAI(llm, conversational=True)
pandas_ai.run(met, prompt='How many metabolites are in the nucleus compartment?')
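
The answers returned by the LLM can be cross-checked with plain pandas. The sketch below uses made-up rows; the `Compartment` column follows the labels seen in this dataset (`c - cytosol`, `m - mitochondria`), while the `n - nucleus` label is an assumption.

```python
import pandas as pd

# Toy stand-in for the metabolites dataset
toy = pd.DataFrame({
    'BiGG ID': ['atp_c', 'adp_c', 'atp_m', 'atp_n'],
    'Compartment': ['c - cytosol', 'c - cytosol', 'm - mitochondria', 'n - nucleus'],
})

# Number of metabolites per compartment (the data behind the pie chart)
per_compartment = toy['Compartment'].value_counts()

# Number of metabolites whose compartment label mentions the nucleus
n_nucleus = int(toy['Compartment'].str.contains('nucleus', case=False, na=False).sum())

print(per_compartment)
print(n_nucleus)
```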

In [None]:
# Convert metabolite names to lower case and remove the compartment suffix from the BiGG IDs
met['Name'] = met['Name'].str.lower()
met_copy = met.copy()
met_copy['BiGG ID'] = met_copy['BiGG ID'].str[:-2]
met = met_copy.groupby('BiGG ID').first().reset_index()
met
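
Slicing with `.str[:-2]` removes the last two characters of every ID, whether or not they are a compartment suffix. A slightly more defensive sketch, assuming that compartments are always a single lowercase letter preceded by an underscore (as in the `_c`, `_m`, `_x` and `_l` IDs of this dataset):

```python
import pandas as pd

# Example IDs; 'h2o' is a made-up ID without a compartment suffix
ids = pd.Series(['M00056_m', 'xolest226_hs_l', 'leuktrB4_c', 'h2o'])

# Remove a trailing '_<letter>' only when it is present
base_ids = ids.str.replace(r'_[a-z]$', '', regex=True)

print(base_ids.tolist())
```

IDs without a suffix are left unchanged, whereas the slice would truncate them.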

In [None]:
pandas_ai = PandasAI(llm, conversational=False)
pandas_ai.run(met, prompt='Which metabolites correlate best?')

In [None]:
met

In [None]:
import pandas as pd

data = '''
Curated         BiGG ID   \n176                 M00056_m  \\\n193                 M00071_m   \n1014                CE2038_x   \n1352                CE4799_m   \n1360                CE4806_m   \n1361                CE4807_m   \n1876                CE5938_x   \n1982              leuktrB4_c   \n2531                M00056_m   \n2540                M00071_m   \n2916                M01191_m   \n2918                M01191_x   \n3019          xolest226_hs_l   \n3023          xolest205_hs_l   \n5636                M01191_x   \n5794                M01191_m   \n5795                M01191_x   \n6078              leuktrB4_c   \n7439                CE4799_m   \n7440                CE4807_m   \n7441                CE2038_x   \n7442                CE4806_m   \n7443                CE5938_x   \n8036    Than  xolest205_hs_l   \n8039    Than  xolest226_hs_l   \n\n                                                   Name         Formula   \n176                                   (2e)-nonenoyl-coa  C30H46N7O17P3S  \\\n193                                 (2e)-undecenoyl-coa  C32H50N7O17P3S   \n1014             trans-2,3-dehydropristanoyl coenzyme a  C40H66N7O17P3S   \n1352          2,6-dimethyl-trans-2-heptenoyl coenzyme a  C30H46N7O17P3S   \n1360        4(r),8-dimethyl-trans-2-nonenoyl coenzyme a  C32H50N7O17P3S   \n1361              4-methyl-trans-2-pentenoyl coenzyme a  C27H40N7O17P3S   \n1876    (4r,8r,12r)-trimethyl-2e-tridecenoyl coenzyme a  C37H60N7O17P3S   \n1982     5,12-dihydroxy-6,8,10,14-eicosatetraenoic acid        C20H31O4   \n2531                           (2e)-nonenoyl coenzyme a  C30H46N7O17P3S   \n2540                         (2e)-undecenoyl coenzyme a  C32H50N7O17P3S   \n2916                         7z-hexadecenoyl coenzyme a  C37H60N7O17P3S   \n2918                         7z-hexadecenoyl coenzyme a  C37H60N7O17P3S   \n3019  cholesteryl docosahexanoate, cholesterol-ester...        C49H76O2   \n3023  1-timnodnoyl-cholesterol, cholesterol-ester (2...        
C47H74O2   \n5636                         7z-hexadecenoyl coenzyme a  C37H60N7O17P3S   \n5794                         7z-hexadecenoyl coenzyme a  C37H60N7O17P3S   \n5795                         7z-hexadecenoyl coenzyme a  C37H60N7O17P3S   \n6078                                 leukotriene b4(1-)        C20H31O4   \n7439                 2,6-dimethyl-trans-2-heptenoyl-coa  C30H46N7O17P3S   \n7440                     4-methyl-trans-2-pentenoyl-coa  C27H40N7O17P3S   \n7441                    trans-2,3-dehydropristanoyl-coa  C40H66N7O17P3S   \n7442               4(r),8-dimethyl-trans-2-nonenoyl-coa  C32H50N7O17P3S   \n7443         (4r,8r,12r)-trimethyl-(2e)-tridecenoyl-coa  C37H60N7O17P3S   \n8036  1-timnodnoyl-cholesterol, cholesterol-ester (2...        C47H74O2   \n8039  cholesteryl docosahexanoate, cholesterol-ester...        C49H76O2   \n\n                    Compartment  KEGG  CHEBI   PubChem   \n176            m - mitochondria  None   None      None  \\\n193            m - mitochondria                          \n1014  x - peroxisome/glyoxysome        63803  56927963   \n1352           m - mitochondria                          \n1360           m - mitochondria                          \n1361           m - mitochondria                          \n1876  x - peroxisome/glyoxysome               53481434   \n1982                c - cytosol  None   None      None   \n2531           m - mitochondria  None   None      None   \n2540           m - mitochondria                          \n2916           m - mitochondria  None   None      None   \n2918  x - peroxisome/glyoxysome  None   None      None   \n3019               l - lysosome  None   None      None   \n3023               l - lysosome  None   None      None   \n5636  x - peroxisome/glyoxysome  None   None      None   \n5794           m - mitochondria  None   None      None   \n5795  x - peroxisome/glyoxysome  None   None      None   \n6078                c - cytosol        15647   5280492   \n7439           m - 
mitochondria                          \n7440           m - mitochondria                          \n7441  x - peroxisome/glyoxysome  None   None      None   \n7442           m - mitochondria                          \n7443  x - peroxisome/glyoxysome  None   None      None   \n8036               l - lysosome               53477889   \n8039               l - lysosome               14274978   \n\n                                                  
...'''

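import re

# Split the data into lines and drop the empty ones
lines = [line for line in data.split('\n') if line.strip()]

# The dump is a wrapped DataFrame printout whose names contain single spaces,
# so split the fields on runs of two or more spaces instead of on every space
rows = [re.split(r'\s{2,}', line.strip()) for line in lines]

# Rows differ in length across the wrapped blocks, so let pandas pad them
# instead of forcing the eight column names onto every row
df = pd.DataFrame(rows)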


In [None]:
df
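
Parsing a printed DataFrame is fragile, because the printout is wrapped into column blocks and long names are truncated with `...`. When the original DataFrame is still available, a CSV round trip is one way to move it around as text. A minimal sketch on made-up rows:

```python
import io

import pandas as pd

# Toy rows; the first name contains a comma to show that quoting is handled
toy = pd.DataFrame({
    'BiGG ID': ['xolest226_hs_l', 'leuktrB4_c'],
    'Name': ['cholesteryl docosahexanoate, cholesterol-ester', 'leukotriene b4(1-)'],
    'Formula': ['C49H76O2', 'C20H31O4'],
})

# Serialize to CSV text and read it back
text = toy.to_csv(index=False)
restored = pd.read_csv(io.StringIO(text))

print(restored)
```

Missing values come back as `NaN` after the round trip, so cells that were `None` or empty may need to be normalized afterwards.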