<div style="
    border-left: 4px solid #2196f3;
    background: #e3f2fd;
    padding: 16px;
    margin: 20px 0;
    border-radius: 6px;
">
  <h3 style="margin:0; color:#2196f3;">🎯 Aim</h3>
  <p style="margin:12px 0 0 0;">
    The primary goal of this notebook is to perform a comprehensive, multi‐layered exploratory analysis of colorectal cancer (CRC) metabolomics data—integrating cheminformatic annotations, physicochemical and quantum‐mechanical property calculations, KEGG enrichment, SMARTS‐guided enzyme and pathway inference, and transcriptomic/genomic correlations—to identify and prioritize candidate metabolic biomarkers and their biological contexts. The workflow will:
  </p>
  <ol style="margin:12px 0 0 20px; padding:0; list-style-type: decimal;">
    <li>Curate and preprocess raw metabolite measurements.</li>
    <li>Annotate each metabolite with canonical SMILES and compute physicochemical and quantum‐mechanical properties.</li>
    <li>Perform KEGG compound annotation and pathway enrichment analysis.</li>
    <li>Leverage CRC‐enriched SMARTS motifs to infer EC numbers and associated pathways.</li>
    <li>Layer transcriptomic and genomic correlations to contextualize metabolic shifts.</li>
    <li>Consolidate all enriched features—chemical, statistical, enzymatic, pathway, and transcriptomic/genomic annotations—into a single, analysis‐ready DataFrame for downstream modeling or pathway enrichment studies.</li>
  </ol>
</div>


<div style="
    border-left: 4px solid #2196f3;
    background: #feffea;
    padding: 16px;
    margin: 20px 0;
    border-radius: 6px;
">
  <h3 style="margin:0; color:#2196f3;">🎯 Objectives</h3>
  <ul style="margin:12px 0 0 20px; padding:0; list-style-type: disc;">
    <li>
      <strong>Data Curation and Preprocessing</strong>
      <ul style="margin:6px 0 0 20px; padding:0; list-style-type: disc;">
        <li>Import and clean raw CRC metabolomics measurements, ensuring completeness of sample metadata and statistical columns (VIP, p-value, fold change).</li>
      </ul>
    </li>
    <li>
      <strong>Chemical Annotation and Descriptor Calculation</strong>
      <ul style="margin:6px 0 0 20px; padding:0; list-style-type: disc;">
        <li>Retrieve canonical SMILES for each metabolite.</li>
        <li>Compute physicochemical descriptors (molecular weight, logP, TPSA, rotatable bonds, H-bond donors/acceptors).</li>
        <li>Calculate quantum-mechanical properties (HOMO/LUMO energies, dipole moment) via xTB.</li>
      </ul>
    </li>
    <li>
      <strong>Metabolite Annotation, KEGG Enrichment &amp; SMARTS-Guided EC and Pathway Inference</strong>
      <ul style="margin:6px 0 0 20px; padding:0; list-style-type: disc;">
        <li>Map SMILES to KEGG Compound IDs and perform pathway enrichment analysis.</li>
        <li>Load CRC-enriched SMARTS motifs and infer EC numbers &amp; pathways via substructure matching.</li>
        <li>Integrate enzyme and pathway annotations for prioritized metabolites.</li>
      </ul>
    </li>
    <li>
      <strong>Transcriptomic and Genomic Layering</strong>
      <ul style="margin:6px 0 0 20px; padding:0; list-style-type: disc;">
        <li>Load normalized CRC RNA-Seq expression data (e.g., TCGA-CRC) along with relevant clinical metadata.</li>
        <li>Compute PROGENy signaling pathway activity scores via single-sample GSEA and infer transcription factor activity using VIPER with DoRothEA priors.</li>
      </ul>
    </li>
    <li>
      <strong>Data Consolidation for Downstream Analysis</strong>
      <ul style="margin:6px 0 0 20px; padding:0; list-style-type: disc;">
        <li>Compile all annotations—chemical (SMILES), physicochemical, quantum, SMARTS-guided EC/pathway, and transcriptomic correlations—into a single master DataFrame.</li>
        <li>Export the enriched dataset for downstream modeling or pathway enrichment studies.</li>
      </ul>
    </li>
  </ul>
</div>


<div style="
    border-left: 4px solid #2196f3;
    background: #ffe6eb;
    padding: 16px;
    margin: 20px 0;
    border-radius: 6px;
">
  <h3 style="margin:0; color:#2196f3;">🔎 Research Question</h3>
  <p style="margin:12px 0 0 0;">
    To what extent does integrating chemical (SMILES), physicochemical, quantum-mechanical, and SMARTS-guided enzyme/pathway annotations—alongside transcriptomic and genomic correlations—enhance the identification and prioritization of metabolic biomarkers associated with colorectal cancer?
  </p>
</div>


<div style="
    border-left: 4px solid #2e7d32;
    background: #f7faf7;
    padding: 20px;
    margin: 25px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    line-height: 1.5;
">
  <h3 style="margin: 0 0 12px 0; color: #2e7d32;">📋 Methodology</h3>
  <p style="margin: 0 0 16px 0;">
    The analysis pipeline is structured into seven major steps. Each step corresponds to a notebook section with clear subsections.
  </p>
  <ol style="margin: 0; padding: 0 0 0 20px; list-style-type: decimal;">
    <li style="margin-bottom: 12px;">
      <strong>Data Loading &amp; Initial Inspection</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Import files.</li>
        <li>Acquire sample identifiers, metabolite IDs, and statistical columns (VIP, p-value, fold change).</li>
      </ul>
    </li>
    <li style="margin-bottom: 12px;">
      <strong>SMILES Representation &amp; Annotation</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Retrieve canonical SMILES via public APIs (e.g., PubChem, HMDB).</li>
        <li>Save a “SMILES lookup table” mapping raw names → SMILES.</li>
        <li>Flag unmatched or ambiguous entries for manual curation.</li>
      </ul>
    </li>
    <li style="margin-bottom: 12px;">
      <strong>Physicochemical Descriptor Calculation</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Load SMILES into RDKit Mol objects.</li>
        <li>Compute descriptors (RDKit):
          <ul style="margin: 4px 0 0 20px; padding: 0; list-style-type: circle;">
            <li>Molecular weight</li>
            <li>LogP (Crippen)</li>
            <li>Topological polar surface area (TPSA)</li>
            <li>Rotatable bond count</li>
            <li>H-bond donors/acceptors</li>
          </ul>
        </li>
        <li>Assemble descriptors into a DataFrame keyed by metabolite name or SMILES.</li>
      </ul>
    </li>
    <li style="margin-bottom: 12px;">
      <strong>Quantum-Mechanical Property Calculation (xTB)</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Run xTB batch jobs to compute:
          <ul style="margin: 4px 0 0 20px; padding: 0; list-style-type: circle;">
            <li>HOMO &amp; LUMO energies</li>
            <li>HOMO–LUMO gap</li>
            <li>Dipole moment</li>
          </ul>
        </li>
      </ul>
    </li>
    <li style="margin-bottom: 12px;">
      <strong>Metabolomic Property Integration and KEGG Pathway Enrichment</strong>
      <h5 style="margin: 12px 0 6px 0; color: #2e7d32;">5.1 Map Metabolites to KEGG IDs for Enzyme, Pathway &amp; Reaction Retrieval</h5>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Lookup each metabolite’s SMILES in KEGG (or an in-memory reference) to retrieve the KEGG Compound ID.</li>
        <li>Using that KEGG ID, fetch linked EC numbers, pathway identifiers/names, and reaction details.</li>
      </ul>
      <h5 style="margin: 12px 0 6px 0; color: #2e7d32;">5.2 Handle Metabolites without KEGG Information</h5>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>For any metabolite lacking a KEGG Compound ID, attempt SMARTS-based inference of EC numbers and associated pathways.</li>
      </ul>
    </li>
    <li style="margin-bottom: 12px;">
      <strong>Genomic &amp; Transcriptomic Layering</strong>
      <h5 style="margin: 12px 0 6px 0; color: #2e7d32;">6.1 Prior-Based Gene Annotation</h5>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li><strong>Gene → KEGG Pathway Mapping:</strong> Map each gene (via its EC number) to one or more KEGG metabolic pathways, placing gene expression within metabolic contexts.</li>
      </ul>
      <h5 style="margin: 12px 0 6px 0; color: #2e7d32;">6.2 Transcriptomic Inference &amp; Mechanistic Integration</h5>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li><strong>Import RNA-Seq Data &amp; Clinical Metadata:</strong> Load normalized CRC tumor and normal expression matrices (e.g., TCGA) alongside sample annotations (tumor type, barcode).</li>
        <li><strong>PROGENy Pathway Activity Scoring:</strong> Compute single-sample GSEA scores for PROGENy’s 14 signaling pathways (e.g., TGFB, WNT, PI3K) using gseapy, generating tumor-normal contrasts.</li>
        <li><strong>TF Activity Inference via VIPER:</strong> Employ VIPER (via decoupler) with DoRothEA and TRRUST prior networks to infer transcription factor activity per sample, yielding TF enrichment scores.</li>
        <li><strong>Final Table Population:</strong> For each metabolite–gene pair, annotate with its regulating TF and TF activity score; link the gene’s KEGG pathway to a PROGENy pathway and its activity; compile into a unified mechanistic table.</li>
      </ul>
    </li>
    <li>
      <strong>Data Cleaning &amp; Consolidation</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Perform final data cleaning to ensure consistency and resolve any remaining missing values across all annotations.</li>
        <li>Merge all feature tables—SMILES, physicochemical, quantum, statistical rankings, SMARTS-guided EC/pathway, and transcriptomic correlations—into a unified master DataFrame.</li>
        <li>Prepare the final enriched dataset for downstream analyses and summarize key findings: top metabolites by composite score, their EC enzymes, and associated pathways.</li>
      </ul>
    </li>
  </ol>
</div>


<div style="
    border-left: 4px solid #4caf50;
    background: #f1f8e5;
    padding: 16px;
    margin: 20px 0;
    border-radius: 6px;
">
  <h3 style="margin:0; color:#4caf50;">📊1. Data Loading & Initial Inspection </h3>
  <ul style="margin:12px 0 0 20px; padding:0; list-style-type: disc;">
    <li><strong>ALL_sample_data.xlsx</strong>: full metabolite intensity matrix (features × samples).</li>
    <li><strong>sample_info.xlsx</strong>: sample metadata (CRC vs Normal labels, patient IDs, batch, etc.).</li>
    <li><strong>NC_vs_CRC_filter.xlsx</strong>: filtered differential metabolites (VIP > 1 &amp; |FC| > 1.5) for feature selection.</li>
    <li><strong>NC_vs_CRC_info.xlsx</strong>: complete statistics for all metabolites (VIP, p-value, FDR, FC, log₂FC).</li>
    <li><strong>NC_vs_CRC_pca_eigenval.xlsx</strong> &amp; <strong>NC_vs_CRC_pca_eigenvec.xlsx</strong>: PCA variance explained (eigenvalues) and sample scores (eigenvectors).</li>
    <li><strong>NC_vs_CRC.AUC.xls</strong>: per-metabolite AUC statistics from ROC analysis.</li>
    <li><strong>NC_vs_CRC_filter_anno.xlsx</strong>: filtered-feature annotations (metabolite names, KEGG IDs, pathway mappings).</li>
    <li><strong>hmdb_anno.xlsx</strong>: comprehensive HMDB‐based compound annotations for the full dataset.</li>
  </ul>
</div>


In [2]:
import pandas as pd
# Jupyter notebook / JupyterLab:
%matplotlib inline

# Or if you want pop-up windows (and have a GUI backend):
%matplotlib qt


In [92]:
X_df = pd.read_excel('Original Study Data/ALL_sample_data.xlsx')
y_df = pd.read_excel('Original Study Data/sample_info.xlsx')  
filter_df = pd.read_excel('Original Study Data/NC_vs_CRC_filter.xlsx')
info_df = pd.read_excel('Original Study Data/NC_vs_CRC_info.xlsx')
pca_eigenval_df = pd.read_excel('Original Study Data/NC_vs_CRC_pca_eigenval.xlsx')
pca_eigenvec_df = pd.read_excel('Original Study Data/NC_vs_CRC_pca_eigenvec.xlsx')
auc_df = pd.read_csv('Original Study Data/NC_vs_CRC.AUC.xls', sep='\t')
filter_anno_df = pd.read_excel('Original Study Data/NC_vs_CRC_filter_anno.xlsx')
hmdb_anno_df = pd.read_excel('Original Study Data/hmdb_anno.xlsx')

In [4]:
# Features matrix
print("=== X_df shape ===")
print(X_df.shape, "\n")

# Labels and metadata
print("=== y_df shape ===")
print(y_df.shape, "\n")

# Pre-filtered important features
print("=== filter_df shape ===")
print(filter_df.shape, "\n")

# Full VIP / p-value / FC table
print("=== info_df shape ===")
print(info_df.shape, "\n")

# PCA eigenvalues
print("=== pca_eigenval_df shape ===")
print(pca_eigenval_df.shape, "\n")

# PCA eigenvectors
print("=== pca_eigenvec_df shape ===")
print(pca_eigenvec_df.shape, "\n")

# Metabolite AUC scores
print("=== auc_df shape ===")
print(auc_df.shape, "\n")

# Annotation for filtered metabolites
print("=== filter_anno_df shape ===")
print(filter_anno_df.shape, "\n")

# Full HMDB annotation
print("=== hmdb_anno_df shape ===")
print(hmdb_anno_df.shape, "\n")


=== X_df shape ===
(927, 96) 

=== y_df shape ===
(70, 4) 

=== filter_df shape ===
(42, 84) 

=== info_df shape ===
(927, 84) 

=== pca_eigenval_df shape ===
(3, 71) 

=== pca_eigenvec_df shape ===
(70, 71) 

=== auc_df shape ===
(42, 4) 

=== filter_anno_df shape ===
(42, 86) 

=== hmdb_anno_df shape ===
(927, 91) 



In [5]:
# Features matrix
print("=== X_df ===")
print(X_df.head(), "\n")

# Labels and metadata
print("=== y_df ===")
print(y_df.head(), "\n")

# Pre-filtered important features
print("=== filter_df ===")
print(filter_df.head(), "\n")

# Full VIP / p-value / FC table
print("=== info_df ===")
print(info_df.head(), "\n")

# PCA eigenvalues
print("=== pca_eigenval_df ===")
print(pca_eigenval_df.head(), "\n")

# PCA eigenvectors
print("=== pca_eigenvec_df ===")
print(pca_eigenvec_df.head(), "\n")

# Metabolite AUC scores
print("=== auc_df ===")
print(auc_df.head(), "\n")

# Annotation for filtered metabolites
print("=== filter_anno_df ===")
print(filter_anno_df.head(), "\n")

# Full HMDB annotation
print("=== hmdb_anno_df ===")
print(hmdb_anno_df.head(), "\n")


=== X_df ===
      Index                                 Compounds            物质  \
0  MADN0001                                   Glycine           甘氨酸   
1  MADN0003                           L-Glutamic Acid         L-谷氨酸   
2  MADN0004                              L-Isoleucine        L-异亮氨酸   
3  MADN0006                              L-Tryptophan         L-色氨酸   
4  MADN0007  3-Hydroxy-3-Methylpentane-1,5-Dioic Acid  3-羟基-3-甲基谷氨酸   

                          Class I    物质一级分类                Class II  物质二级分类  \
0  Amino acid and Its metabolites  氨基酸及其代谢物             Amino acids     氨基酸   
1  Amino acid and Its metabolites  氨基酸及其代谢物             Amino acids     氨基酸   
2  Amino acid and Its metabolites  氨基酸及其代谢物             Amino acids     氨基酸   
3  Amino acid and Its metabolites  氨基酸及其代谢物             Amino acids     氨基酸   
4  Amino acid and Its metabolites  氨基酸及其代谢物  Amino acid derivatives  氨基酸衍生物   

    Q1 (Da) Molecular Weight (Da) Ionization model  ...       mix06  \
0   74.0247   

In [6]:
# Features matrix
print("=== X_df columns ===")
print(X_df.columns.tolist(), "\n")

# Labels and metadata
print("=== y_df columns ===")
print(y_df.columns.tolist(), "\n")

# Pre-filtered important features
print("=== filter_df columns ===")
print(filter_df.columns.tolist(), "\n")

# Full VIP / p-value / FC table
print("=== info_df columns ===")
print(info_df.columns.tolist(), "\n")

# PCA eigenvalues
print("=== pca_eigenval_df columns ===")
print(pca_eigenval_df.columns.tolist(), "\n")

# PCA eigenvectors
print("=== pca_eigenvec_df columns ===")
print(pca_eigenvec_df.columns.tolist(), "\n")

# Metabolite AUC scores
print("=== auc_df columns ===")
print(auc_df.columns.tolist(), "\n")

# Annotation for filtered metabolites
print("=== filter_anno_df columns ===")
print(filter_anno_df.columns.tolist(), "\n")

# Full HMDB annotation
print("=== hmdb_anno_df columns ===")
print(hmdb_anno_df.columns.tolist(), "\n")


=== X_df columns ===
['Index', 'Compounds', '物质', 'Class I', '物质一级分类', 'Class II', '物质二级分类', 'Q1 (Da)', 'Molecular Weight (Da)', 'Ionization model', 'Formula', 'C01', 'D01', 'E01', 'F01', 'G01', 'H01', 'K01', 'L01', 'P01', 'S01', 'T01', 'U01', 'AB01', 'AC01', 'AD01', 'AG01', 'AH01', 'AK01', 'AL01', 'AP01', 'AR01', 'AS01', 'AT01', 'AU01', 'AX01', 'BD01', 'BE01', 'BG01', 'BK01', 'BL01', 'BL01_1', 'BN01', 'BP01', 'BR01', 'BS01', 'BT01', 'C01_1', 'D01_1', 'E01_1', 'F01_1', 'G01_1', 'H01_1', 'K01_1', 'L01_1', 'P01_1', 'S01_1', 'T01_1', 'U01_1', 'AB01_1', 'AC01_1', 'AD01_1', 'AG01_1', 'AH01_1', 'AK01_1', 'AL01_1', 'AP01_1', 'AR01_1', 'AS01_1', 'AT01_1', 'AU01_1', 'AX01_1', 'BD01_1', 'BE01_1', 'BG01_1', 'BK01_1', 'BN01_1', 'BP01_1', 'BR01_1', 'BS01_1', 'BT01_1', 'mix01', 'mix02', 'mix03', 'mix04', 'mix05', 'mix06', 'mix07', 'mix08', 'cpd_ID', 'HMDB', 'Pubchem CID', 'CAS', 'ChEBI', 'Metlin', 'kegg_map'] 

=== y_df columns ===
['组织部位', '处理描述', '样本名称', 'Group'] 

=== filter_df columns ===
['Inde

<div style="
    border-left: 4px solid #2e7d32;
    background: #f7faf7;
    padding: 20px;
    margin: 25px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    line-height: 1.5;
">
  <h3 style="margin: 0 0 12px 0; color: #2e7d32;">2. SMILES Representation and Annotation</h3>
  <ol style="margin: 0; padding: 0 0 0 20px; list-style-type: decimal;">
    <li style="margin-bottom: 12px;">
      <strong>Curate Existing Annotations</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Assemble any preexisting identifiers (e.g., HMDB IDs, PubChem CIDs, CAS numbers) into an in‐memory lookup structure.</li>
      </ul>
    </li>
    <li style="margin-bottom: 12px;">
      <strong>Retrieve Canonical SMILES</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>For each differential metabolite, attempt HMDB‐based SMILES extraction first; if unavailable, fall back to PubChem name lookup, then CAS‐based lookup as needed.</li>
        <li>Introduce a brief pause between external queries to avoid API throttling.</li>
      </ul>
    </li>
    <li style="margin-bottom: 12px;">
      <strong>Supplement Missing SMILES</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Where automated lookup fails, incorporate a small, manually curated mapping of compound names to SMILES.</li>
      </ul>
    </li>
    <li>
      <strong>Confirm Coverage</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Verify in a summary table that every differential metabolite now has an associated SMILES string (or flag any remaining gaps for manual curation).</li>
      </ul>
    </li>
  </ol>
</div>


In [None]:
# 📌 Key Selection Criteria
#
# Before diving into the data, we want to make sure we only focus on the most impactful metabolites.
# Here’s how we decide which ones really matter:
#
# • VIP score of at least 1.0: 
#   We only keep metabolites that the model ranks as important.
#
# • |Log₂ Fold Change| of at least 1.0: 
#   That means the metabolite has doubled (or halved) in abundance—no small movements here.
#
# • p-value below 0.05: 
#   We want to be confident that these differences aren’t just random noise.
#
# By sticking to these cutoffs (as defined in the paper), we zero in on 24 standout metabolites
# that truly capture the metabolic shifts in CRC.


In [29]:
import pandas as pd

# 1) Read in your two sheets (make sure the index_col is correct)
X_df      = pd.read_excel('ALL_sample_data.xlsx',   index_col='Index')
filter_df = pd.read_excel('NC_vs_CRC_filter.xlsx',  index_col='Index')

# 2) Apply the original paper’s filters
mask24 = (
    (filter_df['VIP']          >= 1.0) &
    (filter_df['Log2FC'].abs() >= 1.0) &
    (filter_df['p_value']       < 0.05)
)
filt24 = filter_df.loc[mask24, ['Compounds','Type']].copy()
print(f"Found {len(filt24)} filtered metabolites (should be 24)")

# 3) Pick the metadata columns *only* from X_df (no overlap with filt24)
meta_cols = [
    'cpd_ID', 'HMDB', 'Pubchem CID', 'CAS', 'ChEBI', 'Metlin', 'kegg_map',
    'Class I', 'Class II', 'Molecular Weight (Da)', 'Formula'
]

# 4) Join those metadata columns onto your 24‐row table
annot24 = filt24.join(
    X_df[meta_cols],
    how='left'
)

# 5) Reorder so Compounds/Type come first
annot24 = annot24.loc[:, ['Compounds','Type'] + meta_cols]

# 6) Reset index to make “Index” a column again
annot24 = annot24.reset_index().rename(columns={'index':'Index'})

# 7) Display
print("\nAnnotations for the 24 differential metabolites:")
annot24


Found 24 filtered metabolites (should be 24)

Annotations for the 24 differential metabolites:


Unnamed: 0,Index,Compounds,Type,cpd_ID,HMDB,Pubchem CID,CAS,ChEBI,Metlin,kegg_map,Class I,Class II,Molecular Weight (Da),Formula
0,MADN0053,D-Erythronolactone,down,-,HMDB0000349,5325915,15667-21-7,87625,5338,--,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,118.027,C4H6O4
1,MADN0166,"1,6-anhydro-β-D-glucose",down,-,HMDB0000640,724705,498-07-7,30997,5613,--,Carbohydrates and Its metabolites,Sugars,162.05283,C6H10O5
2,MADN0220,Deoxyribose 5-phosphate,down,-,HMDB0001031,45934311,-,16132,5956,--,Nucleotide And Its metabolites,Nucleotide And Its metabolites,214.02423,C5H11O7P
3,MADN0329,2-Aminobenzenesulfonic acid,down,C06333,-,6926,88-21-1,-,-,--,Benzene and substituted derivatives,Benzene and substituted derivatives,173.19,C6H7NO3S
4,MADN0333,Quinoline-4-carboxylic acid,down,C06414,-,10243,486-74-8,-,-,--,Heterocyclic compounds,Pteridines and derivatives,173.047678,C10H7NO2
5,MADN0466,cyclo(glu-glu),up,-,-,7408481,16691-00-2,-,-,--,Amino acid and Its metabolites,Small Peptide,258.08573,C10H14N2O6
6,MADN0498,P-sulfanilic acid,down,-,-,8479,121-57-3,-,-,--,Amino acid and Its metabolites,Amino acid derivatives,173.19,C6H7NO3S
7,MADP0119,Methylcysteine,down,-,HMDB0002108,24417,1187-84-4,45658,6490,--,Amino acid and Its metabolites,Amino acids,135.0354,C4H9NO2S
8,MADP0548,Asp-Arg,down,-,-,16122509,-,-,-,--,Amino acid and Its metabolites,Small Peptide,289.13807,C10H19N5O5
9,MEDN1253,LPI(16:2/0:0),down,-,-,-,-,-,-,--,GP,PI,568.26541,C25H45O12P


In [7]:
annot24["Compounds"]


0                             D-Erythronolactone
1                        1,6-anhydro-β-D-glucose
2                        Deoxyribose 5-phosphate
3                    2-Aminobenzenesulfonic acid
4                    Quinoline-4-carboxylic acid
5                                 cyclo(glu-glu)
6                              P-sulfanilic acid
7                                 Methylcysteine
8                                        Asp-Arg
9                                  LPI(16:2/0:0)
10            5'-Deoxy-5'-(Methylthio) Adenosine
11                        Thiamine Monophosphate
12                                    Cytarabine
13                                 LPC(18:3/0:0)
14                                 LPC(16:1/0:0)
15                               Carnitine C7:DC
16                                 17a-Estradiol
17                                 17β-Estradiol
18                                 LPC(13:0/0:0)
19                   Quinoline-2-carboxylic acid
20                  

In [30]:

from pubchempy import get_compounds, PubChemHTTPError
import pandas as pd
import time

# Which column holds your compound names?
compound_col = 'Compound' if 'Compound' in annot24.columns else 'Compounds'

# Prepare output
smiles_list = []
success_count = 0
total = len(annot24)

for i, row in enumerate(annot24.itertuples(), start=1):
    name    = getattr(row, compound_col)
    cid     = str(getattr(row, 'Pubchem CID', '')).strip()
    hmdb    = str(getattr(row, 'HMDB', '')).strip()
    cas     = str(getattr(row, 'CAS', '')).strip()
    smiles  = None

    print(f"[{i}/{total}] 🔄 Looking up SMILES for '{name}'...")

    # 1) Try PubChem by CID
    if cid.isdigit():
        try:
            recs = get_compounds(cid, 'cid')
        except PubChemHTTPError:
            recs = []
        if recs:
            smiles = recs[0].isomeric_smiles
            print(f"   ✅ from CID {cid}")
    
    # 2) Fallback: HMDB ID
    if smiles is None and hmdb.upper().startswith('HMDB'):
        try:
            recs = get_compounds(hmdb, 'name')
        except PubChemHTTPError:
            recs = []
        if recs:
            smiles = recs[0].isomeric_smiles
            print(f"   ✅ from HMDB {hmdb}")

    # 3) Fallback: CAS
    if smiles is None and cas:
        try:
            recs = get_compounds(cas, 'name')
        except PubChemHTTPError:
            recs = []
        if recs:
            smiles = recs[0].isomeric_smiles
            print(f"   ✅ from CAS {cas}")

    # 4) Give up
    if smiles is None:
        print("   ❌ no SMILES found")
    else:
        success_count += 1

    smiles_list.append(smiles)
    time.sleep(0.3)  # polite pause

# Attach back to your DataFrame
annot24['SMILES'] = smiles_list

# Summary
print(f"\n✅ Done. Retrieved SMILES for {success_count} of {total} compounds.\n")
display(annot24[[compound_col, 'Pubchem CID', 'HMDB', 'CAS', 'SMILES']])


[1/24] 🔄 Looking up SMILES for 'D-Erythronolactone'...
   ✅ from CAS 15667-21-7
[2/24] 🔄 Looking up SMILES for '1,6-anhydro-β-D-glucose'...
   ✅ from CAS 498-07-7
[3/24] 🔄 Looking up SMILES for 'Deoxyribose 5-phosphate'...
   ❌ no SMILES found
[4/24] 🔄 Looking up SMILES for '2-Aminobenzenesulfonic acid'...
   ✅ from CAS 88-21-1
[5/24] 🔄 Looking up SMILES for 'Quinoline-4-carboxylic acid'...
   ✅ from CAS 486-74-8
[6/24] 🔄 Looking up SMILES for 'cyclo(glu-glu)'...
   ✅ from CAS 16691-00-2
[7/24] 🔄 Looking up SMILES for 'P-sulfanilic acid'...
   ✅ from CAS 121-57-3
[8/24] 🔄 Looking up SMILES for 'Methylcysteine'...
   ✅ from CAS 1187-84-4
[9/24] 🔄 Looking up SMILES for 'Asp-Arg'...
   ❌ no SMILES found
[10/24] 🔄 Looking up SMILES for 'LPI(16:2/0:0)'...
   ❌ no SMILES found
[11/24] 🔄 Looking up SMILES for '5'-Deoxy-5'-(Methylthio) Adenosine'...
   ✅ from CAS 2457-80-9
[12/24] 🔄 Looking up SMILES for 'Thiamine Monophosphate'...
   ✅ from CAS 273724-21-3
[13/24] 🔄 Looking up SMILES for 'Cyt

Unnamed: 0,Compounds,Pubchem CID,HMDB,CAS,SMILES
0,D-Erythronolactone,5325915,HMDB0000349,15667-21-7,C1[C@H]([C@H](C(=O)O1)O)O
1,"1,6-anhydro-β-D-glucose",724705,HMDB0000640,498-07-7,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O
2,Deoxyribose 5-phosphate,45934311,HMDB0001031,-,
3,2-Aminobenzenesulfonic acid,6926,-,88-21-1,C1=CC=C(C(=C1)N)S(=O)(=O)O
4,Quinoline-4-carboxylic acid,10243,-,486-74-8,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O
5,cyclo(glu-glu),7408481,-,16691-00-2,C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O
6,P-sulfanilic acid,8479,-,121-57-3,C1=CC(=CC=C1N)S(=O)(=O)O
7,Methylcysteine,24417,HMDB0002108,1187-84-4,CSC[C@@H](C(=O)O)N
8,Asp-Arg,16122509,-,-,
9,LPI(16:2/0:0),-,-,-,


In [18]:
display(annot24)

Unnamed: 0,Index,Compounds,Type,cpd_ID,HMDB,Pubchem CID,CAS,ChEBI,Metlin,kegg_map,Class I,Class II,Molecular Weight (Da),Formula,SMILES,LIPID MAPS ID
0,MADN0053,D-Erythronolactone,down,-,HMDB0000349,,15667-21-7,87625,5338,--,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,118.027,C4H6O4,C1[C@H]([C@H](C(=O)O1)O)O,
1,MADN0166,"1,6-anhydro-β-D-glucose",down,-,HMDB0000640,,498-07-7,30997,5613,--,Carbohydrates and Its metabolites,Sugars,162.05283,C6H10O5,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O,
2,MADN0220,Deoxyribose 5-phosphate,down,-,HMDB0001031,,-,16132,5956,--,Nucleotide And Its metabolites,Nucleotide And Its metabolites,214.02423,C5H11O7P,,
3,MADN0329,2-Aminobenzenesulfonic acid,down,C06333,-,,88-21-1,-,-,--,Benzene and substituted derivatives,Benzene and substituted derivatives,173.19,C6H7NO3S,C1=CC=C(C(=C1)N)S(=O)(=O)O,
4,MADN0333,Quinoline-4-carboxylic acid,down,C06414,-,,486-74-8,-,-,--,Heterocyclic compounds,Pteridines and derivatives,173.047678,C10H7NO2,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O,
5,MADN0466,cyclo(glu-glu),up,-,-,,16691-00-2,-,-,--,Amino acid and Its metabolites,Small Peptide,258.08573,C10H14N2O6,C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O,
6,MADN0498,P-sulfanilic acid,down,-,-,,121-57-3,-,-,--,Amino acid and Its metabolites,Amino acid derivatives,173.19,C6H7NO3S,C1=CC(=CC=C1N)S(=O)(=O)O,
7,MADP0119,Methylcysteine,down,-,HMDB0002108,,1187-84-4,45658,6490,--,Amino acid and Its metabolites,Amino acids,135.0354,C4H9NO2S,CSC[C@@H](C(=O)O)N,
8,MADP0548,Asp-Arg,down,-,-,,-,-,-,--,Amino acid and Its metabolites,Small Peptide,289.13807,C10H19N5O5,,
9,MEDN1253,LPI(16:2/0:0),down,-,-,,-,-,-,--,GP,PI,568.26541,C25H45O12P,,LMPK03010B0B


In [22]:
print(annot24.columns)


Index(['Index', 'Compounds', 'Type', 'cpd_ID', 'HMDB', 'Pubchem CID', 'CAS',
       'ChEBI', 'Metlink', 'egg_map', 'Class I', 'Class II',
       'Molecular Weight (Da)', 'Formula', 'SMILES', 'LIPID MAPS ID'],
      dtype='object')


In [402]:
# Reasoning for SMILES substitutions (LPC(18:3/0:0), Cyclo(Phe-Glu), LPI(16:2/0:0)):
# 
# The shorthand notations “18:3” and “16:2” can refer to several different fatty acid isomers, which makes accurate
# descriptor calculation tricky. To make sure our computations reflect real biology, we picked specific isomers that
# are common in humans, follow standard stereochemistry, and match what hte original paper provided.
# 
# • LPC(18:3/0:0) was replaced with the α-linolenic acid form (18:3(9Z,12Z,15Z)) attached to sn-1 on an
#   L-glycerol backbone. This choice feels natural since α-linolenic acid is a well-known omega-3 in our diets.
#   SMILES: CCCCC=CCC=CCC=CCCCCCCCC(=O)OC[C@H](COP(=O)([O-])OCC[N+](C)(C)C)O  (PubChem CID: 5288112)
# 
# • Cyclo(Phe-Glu) became Cyclo(L-Phenylalanyl-L-Glutamyl), a classic 2,5-diketopiperazine formed from L-amino acids.
#   It’s a reasonable assumption given the peptide names, and it’s easy to find in PubChem.
#   SMILES: C1=CC=C(C=C1)C[C@H]2C(=O)N[C@@H](C(=O)N2)CCC(=O)O  (PubChem CID: 11355008)
# 
# • LPI(16:2/0:0) was substituted with the palmitolinoleic acid isomer (16:2(9Z,12Z)) on an sn-1 L-glycerol
#   backbone plus 1D-myo-inositol. That way, we capture the biologically relevant 16:2 species in a known lipid
#   context. SMILES: CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O-])O[C@H]1[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)[C@@H]1O
#   (PubChem CID: 151036169)
# 
# In each case, the chosen isomer reflects what’s actually found in human tissues, follows standard stereochemical
# conventions (L-amino acids, sn-glycerol configuration), and matches database entries so our downstream
# physicochemical calculations are as accurate as possible.


In [80]:
display(df_retrieved_smiles)

Unnamed: 0,Compound Name,PubChem CID Used,Isomeric SMILES,Retrieved IUPAC Name,Status
0,"LPC(18:3(9Z,12Z,15Z)/0:0)",5288112,CCCCC=CCC=CCC=CCCCCCCCC(=O)OC[C@H](COP(=O)([O-...,"Mismatch - PubChem returned: 2,6-diaminoquinaz...",Mismatch: PubChem CID 5288112 returned unexpec...
1,Cyclo(L-Phe-L-Glu),11355008,C1=CC=C(C=C1)C[C@H]2C(=O)N[C@@H](C(=O)N2)CCC(=O)O,Mismatch - PubChem returned: (2R)-2-[[(2S)-6-a...,Mismatch: PubChem CID 11355008 returned unexpe...
2,"LPI(16:2(9Z,12Z)/0:0)",151036169,CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...,"Mismatch - PubChem returned: 5-chloro-2-(1,4-d...",Mismatch: PubChem CID 151036169 returned unexp...


In [78]:
import pubchempy as pcp # Still import it in case you want to add lookups later

def get_curated_smiles_data():
    """
    Returns a dictionary of curated compound names and their SMILES strings.
    The SMILES strings are based on previous manual verification from databases.
    """
    curated_compounds = {
        'LPC(18:3(9Z,12Z,15Z)/0:0)': {
            # SMILES for 1-(9Z,12Z,15Z-octadecatrienoyl)-sn-glycero-3-phosphocholine
            # Source: PubChem CID 5288112
            'smiles': r"CCCCC=CCC=CCC=CCCCCCCCC(=O)OC[C@H](COP(=O)([O-])OCC[N+](C)(C)C)O",
            'assumed_cid': "5288112",
            'source_info': "PubChem CID 5288112"
        },
        'Cyclo(L-Phe-L-Glu)': {
            # SMILES for (3S,8S)-3-(2-carboxyethyl)-8-(phenylmethyl)piperazine-2,5-dione
            # Source: PubChem CID 11355008
            'smiles': r"C1=CC=C(C=C1)C[C@H]2C(=O)N[C@@H](C(=O)N2)CCC(=O)O",
            'assumed_cid': "11355008",
            'source_info': "PubChem CID 11355008"
        },
        'LPI(16:2(9Z,12Z)/0:0)': {
            # Assumed SMILES from LIPID MAPS (LMPK03010B0B) for
            # 1-(9Z,12Z-hexadecadienoyl)-sn-glycero-3-phospho-1D-myo-inositol
            'smiles': r"CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O-])O[C@H]1[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)[C@@H]1O",
            'assumed_lipidmaps_id': "LMPK03010B0B",
            'source_info': "LIPID MAPS (LMPK03010B0B) / Curated"
        }
    }
    return curated_compounds

if __name__ == '__main__':
    smiles_to_use = get_curated_smiles_data()
    print("Using Curated SMILES Strings:")
    for compound_name, data in smiles_to_use.items():
        print(f"Compound: {compound_name}")
        print(f"  SMILES: {data['smiles']}")
        print(f"  Source of SMILES: {data['source_info']}")
        if data.get('assumed_cid'):
            print(f"  Associated PubChem CID (manually verified): {data['assumed_cid']}")
        if data.get('assumed_lipidmaps_id'):
            print(f"  Associated LIPID MAPS ID (manually verified): {data['assumed_lipidmaps_id']}")
        print("-" * 30)

        # Example: If you wanted to try a PubChem lookup by CID for one of them again for testing
        # if compound_name == 'LPC(18:3(9Z,12Z,15Z)/0:0)':
        #     try:
        #         cid_to_test = data['assumed_cid']
        #         compound = pcp.Compound.from_cid(cid_to_test)
        #         print(f"  Test Lookup for CID {cid_to_test}:")
        #         print(f"    PubChem Name: {compound.iupac_name or (compound.synonyms[0] if compound.synonyms else 'N/A')}")
        #         print(f"    PubChem SMILES: {compound.isomeric_smiles}")
        #         if compound.isomeric_smiles == data['smiles']:
        #             print("    Lookup SMILES matches curated SMILES.")
        #         else:
        #             print("    WARNING: Lookup SMILES DOES NOT match curated SMILES.")
        #     except Exception as e:
        #         print(f"  Test Lookup for CID {cid_to_test} failed: {e}")
        #     print("-" * 30)

Using Curated SMILES Strings:
Compound: LPC(18:3(9Z,12Z,15Z)/0:0)
  SMILES: CCCCC=CCC=CCC=CCCCCCCCC(=O)OC[C@H](COP(=O)([O-])OCC[N+](C)(C)C)O
  Source of SMILES: PubChem CID 5288112
  Associated PubChem CID (manually verified): 5288112
------------------------------
Compound: Cyclo(L-Phe-L-Glu)
  SMILES: C1=CC=C(C=C1)C[C@H]2C(=O)N[C@@H](C(=O)N2)CCC(=O)O
  Source of SMILES: PubChem CID 11355008
  Associated PubChem CID (manually verified): 11355008
------------------------------
Compound: LPI(16:2(9Z,12Z)/0:0)
  SMILES: CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O-])O[C@H]1[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)[C@@H]1O
  Source of SMILES: LIPID MAPS (LMPK03010B0B) / Curated
  Associated LIPID MAPS ID (manually verified): LMPK03010B0B
------------------------------


In [31]:
# Ensure the “LIPID MAPS ID” column exists (or recreate it)
if "LIPID MAPS ID" not in annot24.columns:
    annot24["LIPID MAPS ID"] = ""
else:
    annot24["LIPID MAPS ID"] = ""

# 1. LPI(16:2/0:0)
annot24.loc[
    annot24["Compounds"] == "LPI(16:2/0:0)",
    ["SMILES", "Pubchem CID", "LIPID MAPS ID"]
] = [
    "CCCC/C=C\\C/C=C\\CCCCCCC(=O)OC[C@H](O)COP(=O)([O-])O[C@H]1[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)[C@@H]1O",
    "",                 # leave PubChem CID blank
    "LMPK03010B0B"      # curated LIPID MAPS ID
]

# 2. LPC(18:3/0:0)
annot24.loc[
    annot24["Compounds"] == "LPC(18:3/0:0)",
    ["SMILES", "Pubchem CID", "LIPID MAPS ID"]
] = [
    "CCCCC=CCC=CCC=CCCCCCCCC(=O)OC[C@H](COP(=O)([O-])OCC[N+](C)(C)C)O",
    "5288112",          # curated PubChem CID
    ""                  # no LIPID MAPS ID
]

# 3. Cyclo(Phe-Glu)
annot24.loc[
    annot24["Compounds"] == "Cyclo(Phe-Glu)",
    ["SMILES", "Pubchem CID", "LIPID MAPS ID"]
] = [
    "C1=CC=C(C=C1)C[C@H]2C(=O)N[C@@H](C(=O)N2)CCC(=O)O",
    "11355008",         # curated PubChem CID
    ""                  # no LIPID MAPS ID
]

# Optional: verify updates
print(
    annot24.loc[
        annot24["Compounds"].isin([
            "LPI(16:2/0:0)",
            "LPC(18:3/0:0)",
            "Cyclo(Phe-Glu)"
        ])
    ]
)


       Index       Compounds  Type cpd_ID         HMDB Pubchem CID CAS  ChEBI  \
9   MEDN1253   LPI(16:2/0:0)  down      -            -               -      -   
13  MEDP1345   LPC(18:3/0:0)  down      -  HMDB0010387     5288112   -  88698   
22  MEDP1923  Cyclo(Phe-Glu)    up      -            -    11355008   -      -   

   Metlin kegg_map                         Class I       Class II  \
9       -       --                              GP             PI   
13      -       --                              GP            LPC   
22      -       --  Amino acid and Its metabolites  Small Peptide   

   Molecular Weight (Da)     Formula  \
9              568.26541  C25H45O12P   
13            517.316841  C26H48NO7P   
22            276.112102  C14H16N2O4   

                                               SMILES LIPID MAPS ID  
9   CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...  LMPK03010B0B  
13  CCCCC=CCC=CCC=CCCCCCCCC(=O)OC[C@H](COP(=O)([O-...                
22  C1=CC=C(C=C1)C[C@H]2C(=

In [32]:
display(annot24)

Unnamed: 0,Index,Compounds,Type,cpd_ID,HMDB,Pubchem CID,CAS,ChEBI,Metlin,kegg_map,Class I,Class II,Molecular Weight (Da),Formula,SMILES,LIPID MAPS ID
0,MADN0053,D-Erythronolactone,down,-,HMDB0000349,5325915,15667-21-7,87625,5338,--,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,118.027,C4H6O4,C1[C@H]([C@H](C(=O)O1)O)O,
1,MADN0166,"1,6-anhydro-β-D-glucose",down,-,HMDB0000640,724705,498-07-7,30997,5613,--,Carbohydrates and Its metabolites,Sugars,162.05283,C6H10O5,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O,
2,MADN0220,Deoxyribose 5-phosphate,down,-,HMDB0001031,45934311,-,16132,5956,--,Nucleotide And Its metabolites,Nucleotide And Its metabolites,214.02423,C5H11O7P,,
3,MADN0329,2-Aminobenzenesulfonic acid,down,C06333,-,6926,88-21-1,-,-,--,Benzene and substituted derivatives,Benzene and substituted derivatives,173.19,C6H7NO3S,C1=CC=C(C(=C1)N)S(=O)(=O)O,
4,MADN0333,Quinoline-4-carboxylic acid,down,C06414,-,10243,486-74-8,-,-,--,Heterocyclic compounds,Pteridines and derivatives,173.047678,C10H7NO2,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O,
5,MADN0466,cyclo(glu-glu),up,-,-,7408481,16691-00-2,-,-,--,Amino acid and Its metabolites,Small Peptide,258.08573,C10H14N2O6,C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O,
6,MADN0498,P-sulfanilic acid,down,-,-,8479,121-57-3,-,-,--,Amino acid and Its metabolites,Amino acid derivatives,173.19,C6H7NO3S,C1=CC(=CC=C1N)S(=O)(=O)O,
7,MADP0119,Methylcysteine,down,-,HMDB0002108,24417,1187-84-4,45658,6490,--,Amino acid and Its metabolites,Amino acids,135.0354,C4H9NO2S,CSC[C@@H](C(=O)O)N,
8,MADP0548,Asp-Arg,down,-,-,16122509,-,-,-,--,Amino acid and Its metabolites,Small Peptide,289.13807,C10H19N5O5,,
9,MEDN1253,LPI(16:2/0:0),down,-,-,,-,-,-,--,GP,PI,568.26541,C25H45O12P,CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...,LMPK03010B0B


In [33]:
import pandas as pd

# List of the compounds you care about
targets = [
    'Deoxyribose 5-phosphate',
    'Asp-Arg',
    'Carnitine C7:DC',
    'Quinoline-2-carboxylic acid',
    'LPE(17:1/0:0)'
]

# Columns you want to see
cols = ['Compounds', 'cpd_ID', 'HMDB', 'Pubchem CID', 'CAS', 'ChEBI', 'Metlin']

# Subset & display
subset = annot24.loc[annot24['Compounds'].isin(targets), cols].copy()

# Optionally sort by your target list order
subset['__order'] = pd.Categorical(subset['Compounds'], categories=targets, ordered=True)
subset = subset.sort_values('__order').drop(columns='__order')

print("Selected annotations:")
display(subset.reset_index(drop=True))


Selected annotations:


Unnamed: 0,Compounds,cpd_ID,HMDB,Pubchem CID,CAS,ChEBI,Metlin
0,Deoxyribose 5-phosphate,-,HMDB0001031,45934311,-,16132,5956
1,Asp-Arg,-,-,16122509,-,-,-
2,Carnitine C7:DC,-,-,-,-,-,-
3,Quinoline-2-carboxylic acid,-,HMDB0000842,7124,-,18386,5805
4,LPE(17:1/0:0),-,-,42607464,-,-,-


In [37]:
import requests
import xml.etree.ElementTree as ET
from pubchempy import get_compounds
import pandas as pd

# ────────────────────────────────────────────────────────────────────────────────
# 1. Extended dictionary of “Selected annotations,” now including LPC(16:1/0:0).
annotations = {
    "Deoxyribose 5-phosphate": {
        "cpd_ID":      None,
        "HMDB":        "HMDB0001031",
        "Pubchem CID": "45934311",
        "CAS":         None,
        "ChEBI":       "16132",
        "Metlin":      "5956"
    },
    "Asp-Arg": {
        "cpd_ID":      None,
        "HMDB":        None,
        "Pubchem CID": "16122509",
        "CAS":         None,
        "ChEBI":       None,
        "Metlin":      None
    },
    "Carnitine C7:DC": {
        "cpd_ID":      None,
        "HMDB":        None,
        "Pubchem CID": None,
        "CAS":         None,
        "ChEBI":       None,
        "Metlin":      None
    },
    "Quinoline-2-carboxylic acid": {
        "cpd_ID":      None,
        "HMDB":        "HMDB0000842",
        "Pubchem CID": "7124",
        "CAS":         None,
        "ChEBI":       "18386",
        "Metlin":      "5805"
    },
    "LPE(17:1/0:0)": {
        "cpd_ID":      None,
        "HMDB":        None,
        "Pubchem CID": "42607464",
        "CAS":         None,
        "ChEBI":       None,
        "Metlin":      None
    },
    "LPC(16:1/0:0)": {
        "cpd_ID":      None,
        "HMDB":        "HMDB0010383",
        "Pubchem CID": "24779461",
        "CAS":         None,
        "ChEBI":       None,
        "Metlin":      None
    }
}

# ────────────────────────────────────────────────────────────────────────────────
# 2. Convert that dictionary into a pandas DataFrame named `subset`.
#
#    The DataFrame must have these columns: 'Compound', 'HMDB', 'PubCID', 'ChEBI'.
rows = []
for compound_name, info in annotations.items():
    rows.append({
        "Compound": compound_name,
        "HMDB":      info.get("HMDB"),             # e.g. "HMDB0010383" or None
        "PubCID":    info.get("Pubchem CID"),       # e.g. "24779461" or None
        "ChEBI":     info.get("ChEBI")              # e.g. "18386" or None
    })

subset = pd.DataFrame(rows)

print("=== subset DataFrame ===")
print(subset)
print("\n")

# ────────────────────────────────────────────────────────────────────────────────
# 3. XML lookup helper functions (unchanged).
def fetch_hmdb_smiles(hmdb_id):
    url = f"https://www.hmdb.ca/metabolites/{hmdb_id}.xml"
    r = requests.get(url)
    if not r.ok:
        return None
    root = ET.fromstring(r.text)
    node = root.find(".//structure/smiles")
    return node.text.strip() if node is not None else None

def fetch_chebi_smiles(chebi_id):
    url = (
        "https://www.ebi.ac.uk/webservices/chebi/2.0/test/getCompleteEntity"
        f"?chebiId={chebi_id}"
    )
    r = requests.get(url)
    if not r.ok:
        return None
    root = ET.fromstring(r.text)
    node = root.find(".//SMILES")
    return node.text.strip() if node is not None else None

# ────────────────────────────────────────────────────────────────────────────────
# 4. Loop through `subset` and attempt to fetch SMILES from PubChem, then HMDB, then ChEBI.
results = []
for i, row in enumerate(subset.itertuples(index=False), start=1):
    name   = row.Compound
    hmdb   = row.HMDB
    pubcid = row.PubCID
    chebi  = row.ChEBI

    smiles = None
    source = None
    print(f"🔄 [{i}/{len(subset)}] {name}")

    # 4a) Try PubChem by CID
    if pd.notna(pubcid):  # not NaN or None
        try:
            rec = get_compounds(pubcid, 'cid')
            if rec:
                smiles = rec[0].isomeric_smiles
                source = f"PubChem CID {pubcid}"
        except Exception:
            smiles = None

    # 4b) If PubChem failed, try HMDB XML
    if not smiles and pd.notna(hmdb):
        smi = fetch_hmdb_smiles(hmdb)
        if smi:
            smiles = smi
            source = f"HMDB {hmdb}"

    # 4c) If still no SMILES, try ChEBI REST
    if not smiles and pd.notna(chebi):
        smi = fetch_chebi_smiles(chebi)
        if smi:
            smiles = smi
            source = f"ChEBI {chebi}"

    # 4d) Log what we found (or didn’t find)
    print("   →", smiles or "❌ none", f"({source or 'no source'})")
    results.append({
        "Compound": name,
        "SMILES": smiles
    })

# 5. Build a lookup table (DataFrame) of newly‐found SMILES
lookup = pd.DataFrame(results).set_index('Compound')['SMILES']

# ────────────────────────────────────────────────────────────────────────────────
# 6. (Optional) Merge into annot24 if you have it loaded. This will fill any blanks.
try:
    # Example: annot24 = pd.read_csv("annot24.csv")
    annot24['SMILES'] = annot24['SMILES'].combine_first(
        annot24['Compounds'].map(lookup)
    )
    from IPython.display import display
    display(annot24[['Compounds', 'SMILES']])
except NameError:
    print("\nNote: annot24 is not defined here. If you already have annot24, "
          "run the combine_first(...) step above to merge these SMILES.")


=== subset DataFrame ===
                      Compound         HMDB    PubCID  ChEBI
0      Deoxyribose 5-phosphate  HMDB0001031  45934311  16132
1                      Asp-Arg         None  16122509   None
2              Carnitine C7:DC         None      None   None
3  Quinoline-2-carboxylic acid  HMDB0000842      7124  18386
4                LPE(17:1/0:0)         None  42607464   None
5                LPC(16:1/0:0)  HMDB0010383  24779461   None


🔄 [1/6] Deoxyribose 5-phosphate
   → C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O (PubChem CID 45934311)
🔄 [2/6] Asp-Arg
   → C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N (PubChem CID 16122509)
🔄 [3/6] Carnitine C7:DC
   → ❌ none (no source)
🔄 [4/6] Quinoline-2-carboxylic acid
   → C1=CC=C2C(=C1)C=CC(=N2)C(=O)O (PubChem CID 7124)
🔄 [5/6] LPE(17:1/0:0)
   → CCCCCCC/C=C\CCCCCCCC(=O)OC[C@H](COP(=O)(O)OCCN)O (PubChem CID 42607464)
🔄 [6/6] LPC(16:1/0:0)
   → CCCCCC/C=C\CCCCCCCC(=O)OC[C@H](COP(=O)([O-])OCC[N+](C)(C)C)O (PubChem CID 24779461)


Unnamed: 0,Compounds,SMILES
0,D-Erythronolactone,C1[C@H]([C@H](C(=O)O1)O)O
1,"1,6-anhydro-β-D-glucose",C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O
2,Deoxyribose 5-phosphate,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O
3,2-Aminobenzenesulfonic acid,C1=CC=C(C(=C1)N)S(=O)(=O)O
4,Quinoline-4-carboxylic acid,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O
5,cyclo(glu-glu),C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O
6,P-sulfanilic acid,C1=CC(=CC=C1N)S(=O)(=O)O
7,Methylcysteine,CSC[C@@H](C(=O)O)N
8,Asp-Arg,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N
9,LPI(16:2/0:0),CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...


In [38]:
# list the two compounds you want
targets = ['Carnitine C7:DC']

# select just those rows and columns
cols = ['cpd_ID', 'HMDB', 'Pubchem CID', 'CAS', 'ChEBI', 'Metlin']
result = annot24.loc[annot24['Compounds'].isin(targets), ['Compounds'] + cols]

# Optional: reset the index for a clean display
result = result.reset_index(drop=True)

display(result)


Unnamed: 0,Compounds,cpd_ID,HMDB,Pubchem CID,CAS,ChEBI,Metlin
0,Carnitine C7:DC,-,-,-,-,-,-


In [39]:

from pubchempy import get_compounds
import requests
import xml.etree.ElementTree as ET

# 1) PubChem lookup by the “official” name octanoylcarnitine
pubchem_hits = get_compounds("octanoylcarnitine", "name")
if pubchem_hits:
    c = pubchem_hits[0]
    print("✅ Found PubChem CID:", c.cid)
    print("   Some PubChem synonyms:", c.synonyms[:10])
    print("   PubChem SMILES:       ", c.isomeric_smiles)
else:
    print("❌ No PubChem hit for octanoylcarnitine")

# 2) HMDB cross-reference (if you know its HMDB ID, here we try HMDB0007304 as an example)
def fetch_hmdb_smiles(hmdb_id):
    url = f"https://www.hmdb.ca/metabolites/{hmdb_id}.xml"
    r = requests.get(url)
    if not r.ok:
        return None
    root = ET.fromstring(r.text)
    node = root.find(".//structure/smiles")
    return node.text.strip() if node is not None else None

for hmdb_id in ["HMDB0007304", "HMDB0011436"]:  # try a couple candidates
    s = fetch_hmdb_smiles(hmdb_id)
    if s:
        print(f"✅ HMDB {hmdb_id} SMILES:", s)
    else:
        print(f"❌ HMDB {hmdb_id} not found or no SMILES")

# 3) ChEBI cross-reference (CHEBI:73039)
def fetch_chebi_smiles(chebi_id):
    url = ("https://www.ebi.ac.uk/webservices/chebi/2.0/test/getCompleteEntity"
           f"?chebiId={chebi_id}")
    r = requests.get(url)
    if not r.ok:
        return None
    root = ET.fromstring(r.text)
    node = root.find(".//SMILES")
    return node.text.strip() if node is not None else None

chebi_smiles = fetch_chebi_smiles("CHEBI:73039")
print("ChEBI CHEBI:73039 SMILES:", chebi_smiles or "none")


✅ Found PubChem CID: 123701
   Some PubChem synonyms: ['octanoylcarnitine', 'O-octanoylcarnitine', '3671-77-0', 'S1HB7P0O16', 'CHEBI:73039', '3-(octanoyloxy)-4-(trimethylammonio)butanoate', 'UNII-S1HB7P0O16', '3-octanoyloxy-4-(trimethylazaniumyl)butanoate', '1-Propanaminium, 3-carboxy-N,N,N-trimethyl-2-((1-oxooctyl)oxy)-, inner salt', '3-(Octanoyloxy)-4-(trimethylammonio)butanoic acid']
   PubChem SMILES:        CCCCCCCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C
❌ HMDB HMDB0007304 not found or no SMILES
❌ HMDB HMDB0011436 not found or no SMILES
ChEBI CHEBI:73039 SMILES: none


In [40]:
rec = get_compounds('123701', 'cid')[0]
pub_smiles = rec.isomeric_smiles
# inject into annot24
annot24.loc[
    annot24['Compounds']=='Carnitine C7:DC',
    'SMILES'
] = pub_smiles

# verify
display(
    annot24.loc[
        annot24['Compounds']=='Carnitine C7:DC',
        ['Compounds','SMILES']
    ]
)


Unnamed: 0,Compounds,SMILES
15,Carnitine C7:DC,CCCCCCCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C


In [41]:
display(annot24[['Compounds', 'SMILES']])

Unnamed: 0,Compounds,SMILES
0,D-Erythronolactone,C1[C@H]([C@H](C(=O)O1)O)O
1,"1,6-anhydro-β-D-glucose",C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O
2,Deoxyribose 5-phosphate,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O
3,2-Aminobenzenesulfonic acid,C1=CC=C(C(=C1)N)S(=O)(=O)O
4,Quinoline-4-carboxylic acid,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O
5,cyclo(glu-glu),C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O
6,P-sulfanilic acid,C1=CC(=CC=C1N)S(=O)(=O)O
7,Methylcysteine,CSC[C@@H](C(=O)O)N
8,Asp-Arg,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N
9,LPI(16:2/0:0),CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...


In [42]:
annot24.to_excel("annot24.xlsx", index=False)

In [43]:
annot24 = pd.read_excel('annot24.xlsx')
display(annot24)

Unnamed: 0,Index,Compounds,Type,cpd_ID,HMDB,Pubchem CID,CAS,ChEBI,Metlin,kegg_map,Class I,Class II,Molecular Weight (Da),Formula,SMILES,LIPID MAPS ID
0,MADN0053,D-Erythronolactone,down,-,HMDB0000349,5325915,15667-21-7,87625,5338,--,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,118.027,C4H6O4,C1[C@H]([C@H](C(=O)O1)O)O,
1,MADN0166,"1,6-anhydro-β-D-glucose",down,-,HMDB0000640,724705,498-07-7,30997,5613,--,Carbohydrates and Its metabolites,Sugars,162.05283,C6H10O5,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O,
2,MADN0220,Deoxyribose 5-phosphate,down,-,HMDB0001031,45934311,-,16132,5956,--,Nucleotide And Its metabolites,Nucleotide And Its metabolites,214.02423,C5H11O7P,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O,
3,MADN0329,2-Aminobenzenesulfonic acid,down,C06333,-,6926,88-21-1,-,-,--,Benzene and substituted derivatives,Benzene and substituted derivatives,173.19,C6H7NO3S,C1=CC=C(C(=C1)N)S(=O)(=O)O,
4,MADN0333,Quinoline-4-carboxylic acid,down,C06414,-,10243,486-74-8,-,-,--,Heterocyclic compounds,Pteridines and derivatives,173.047678,C10H7NO2,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O,
5,MADN0466,cyclo(glu-glu),up,-,-,7408481,16691-00-2,-,-,--,Amino acid and Its metabolites,Small Peptide,258.08573,C10H14N2O6,C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O,
6,MADN0498,P-sulfanilic acid,down,-,-,8479,121-57-3,-,-,--,Amino acid and Its metabolites,Amino acid derivatives,173.19,C6H7NO3S,C1=CC(=CC=C1N)S(=O)(=O)O,
7,MADP0119,Methylcysteine,down,-,HMDB0002108,24417,1187-84-4,45658,6490,--,Amino acid and Its metabolites,Amino acids,135.0354,C4H9NO2S,CSC[C@@H](C(=O)O)N,
8,MADP0548,Asp-Arg,down,-,-,16122509,-,-,-,--,Amino acid and Its metabolites,Small Peptide,289.13807,C10H19N5O5,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N,
9,MEDN1253,LPI(16:2/0:0),down,-,-,,-,-,-,--,GP,PI,568.26541,C25H45O12P,CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...,LMPK03010B0B


In [404]:
# Why SMILES Matter (and why Section 2 is all about them)
#
# In our metabolomics workflow, SMILES strings act like a universal “barcode” for each compound’s structure.
# Without a correct SMILES, we can’t generate accurate 3D conformers, compute physicochemical or quantum
# descriptors, or reliably map metabolites to external databases. In other words, SMILES are the key that
# lets us translate a metabolite’s name into something a computer (and chemistry software) can understand.
# 
# That’s why we devoted an entire section to curating and validating SMILES—getting this step right
# ensures every downstream calculation (from LogP to enzyme inference) rests on a solid foundation.


<div style="
    border-left: 4px solid #2e7d32;
    background: #f7faf7;
    padding: 20px;
    margin: 25px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    line-height: 1.5;
">
  <h3 style="margin: 0 0 12px 0; color: #2e7d32;">3. Physicochemical Property Calculation</h3>
  <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
    <li>Convert each SMILES to an RDKit Mol object.</li>
    <li>Calculate descriptors:
      <ul style="margin: 4px 0 0 20px; padding: 0; list-style-type: circle;">
        <li>XLogP</li>
        <li>FSP3</li>
        <li>Complexity</li>
        <li>HBondDonors</li>
        <li>HBondAcceptors</li>
        <li>TPSA</li>
        <li>RotatableBonds</li>
      </ul>
    </li>
  </ul>
</div>


In [None]:
import time
import pandas as pd
from pubchempy import get_compounds

# Filter out rows where SMILES is missing or empty:
annot24_filtered = annot24.dropna(subset=['SMILES'])
annot24_filtered = annot24_filtered[annot24_filtered['SMILES'].str.strip() != ""]

records = []
failed = []

for idx, row in annot24_filtered.iterrows():
    name = row["Compounds"]
    smiles = row["SMILES"]  # we'll overwrite this if PubChem returns a hit

    xlogp = None
    charge = None
    hbond_donors = None
    hbond_acceptors = None
    tpsa = None
    rot_bonds = None

    try:
        result = get_compounds(smiles, 'smiles')
        if not result:
            raise ValueError("No PubChem hits found")
        compound = result[0]

        # Use PubChem's isomeric_smiles if available; otherwise keep original
        smiles = compound.isomeric_smiles or smiles

        xlogp = compound.xlogp
        charge = compound.charge
        hbond_donors = compound.h_bond_donor_count
        hbond_acceptors = compound.h_bond_acceptor_count
        tpsa = compound.tpsa
        rot_bonds = compound.rotatable_bond_count

        print(f"✅ Retrieved properties for: {name}")
    except Exception as e:
        # On failure, keep original SMILES and leave properties as None
        print(f"⚠️  Failed to retrieve properties for {name}: {e}")
        failed.append(name)

    # Append a single 'SMILES' column containing either retrieved or original string
    records.append({
        "Compound":       name,
        "SMILES":         smiles,
        "XLogP":          xlogp,
        "FormalCharge":   charge,
        "HBondDonors":    hbond_donors,
        "HBondAcceptors": hbond_acceptors,
        "TPSA":           tpsa,
        "RotatableBonds": rot_bonds
    })

    time.sleep(0.4)

# Create DataFrame of results
chem_props_df = pd.DataFrame(records).set_index("Compound")

# Write to CSV
chem_props_df.to_csv("physiochemical_properties.csv", index=True)

print(f"\n✅ Final chemical properties table: {chem_props_df.shape[0]} compounds")
print(chem_props_df)

print("\n📄 Written full table to physiochemical_properties.csv")


In [100]:
import time
import pandas as pd
from pubchempy import get_compounds
from rdkit import Chem
from rdkit.Chem import Descriptors


# Filter out rows where SMILES is missing or empty:
annot24_filtered = annot24.dropna(subset=['SMILES'])
annot24_filtered = annot24_filtered[annot24_filtered['SMILES'].str.strip() != ""]

records = []
failed = []

for idx, row in annot24_filtered.iterrows():
    name = row["Compounds"]
    smiles = row["SMILES"]

    xlogp = None
    fsp3 = None
    complexity = None
    hbond_donors = None
    hbond_acceptors = None
    tpsa = None
    rot_bonds = None

    final_smiles = smiles  # will overwrite with PubChem's if available

    try:
        result = get_compounds(smiles, 'smiles')
        if not result:
            raise ValueError("No PubChem hits found")
        compound = result[0]

        # Use PubChem's isomeric_smiles if available
        final_smiles = compound.isomeric_smiles or smiles

        xlogp = compound.xlogp
        hbond_donors = compound.h_bond_donor_count
        hbond_acceptors = compound.h_bond_acceptor_count
        tpsa = compound.tpsa
        rot_bonds = compound.rotatable_bond_count

        print(f"✅ Retrieved PubChem properties for: {name}")
    except Exception as e:
        # Keep original SMILES if PubChem lookup fails
        final_smiles = smiles
        print(f"⚠️  PubChem lookup failed for {name}: {e}")
        failed.append(name)

    # Compute FSP3 and Complexity locally using RDKit
    mol = Chem.MolFromSmiles(final_smiles)
    if mol:
        try:
            fsp3 = Descriptors.FractionCSP3(mol)
            complexity = Descriptors.BertzCT(mol)
        except Exception as e:
            print(f"⚠️  RDKit descriptor calculation failed for {name}: {e}")
    else:
        print(f"❌ RDKit failed to parse SMILES for {name}: {final_smiles}")

    records.append({
        "Compound":       name,
        "SMILES":         final_smiles,
        "XLogP":          xlogp,
        "FSP3":           fsp3,
        "Complexity":     complexity,
        "HBondDonors":    hbond_donors,
        "HBondAcceptors": hbond_acceptors,
        "TPSA":           tpsa,
        "RotatableBonds": rot_bonds
    })

    time.sleep(0.4)

chem_props_df = pd.DataFrame(records).set_index("Compound")

# Write to CSV
chem_props_df.to_csv("physiochemical_properties.csv", index=True)

print(f"\n✅ Final chemical properties table: {chem_props_df.shape[0]} compounds")
print(chem_props_df)

print("\n📄 Written full table to physiochemical_properties.csv")


✅ Retrieved PubChem properties for: D-Erythronolactone
✅ Retrieved PubChem properties for: 1,6-anhydro-β-D-glucose
✅ Retrieved PubChem properties for: Deoxyribose 5-phosphate
✅ Retrieved PubChem properties for: 2-Aminobenzenesulfonic acid
✅ Retrieved PubChem properties for: Quinoline-4-carboxylic acid
✅ Retrieved PubChem properties for: cyclo(glu-glu)
✅ Retrieved PubChem properties for: P-sulfanilic acid
✅ Retrieved PubChem properties for: Methylcysteine
✅ Retrieved PubChem properties for: Asp-Arg
✅ Retrieved PubChem properties for: LPI(16:2/0:0)
✅ Retrieved PubChem properties for: 5'-Deoxy-5'-(Methylthio) Adenosine
✅ Retrieved PubChem properties for: Thiamine Monophosphate
✅ Retrieved PubChem properties for: Cytarabine
✅ Retrieved PubChem properties for: LPC(18:3/0:0)
✅ Retrieved PubChem properties for: LPC(16:1/0:0)
✅ Retrieved PubChem properties for: Carnitine C7:DC
✅ Retrieved PubChem properties for: 17a-Estradiol
✅ Retrieved PubChem properties for: 17β-Estradiol
✅ Retrieved PubChe

In [101]:
### import pandas as pd
# 1. Read the CSV from the notebook’s working directory
phys_props_df = pd.read_csv("physiochemical_properties.csv", index_col="Compound")

# 2. Peek at the first few rows to confirm it loaded correctly
phys_props_df.head(23)

Unnamed: 0_level_0,SMILES,XLogP,FSP3,Complexity,HBondDonors,HBondAcceptors,TPSA,RotatableBonds
Compound,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
D-Erythronolactone,C1[C@H]([C@H](C(=O)O1)O)O,-1.3,0.75,110.605938,2.0,4.0,66.8,0.0
"1,6-anhydro-β-D-glucose",C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O,-2.1,1.0,141.051632,3.0,5.0,79.2,0.0
Deoxyribose 5-phosphate,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O,-2.6,1.0,212.928441,4.0,7.0,116.0,3.0
2-Aminobenzenesulfonic acid,C1=CC=C(C(=C1)N)S(=O)(=O)O,0.4,0.0,357.845074,2.0,4.0,88.8,1.0
Quinoline-4-carboxylic acid,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O,0.5,0.0,459.936329,1.0,3.0,50.2,1.0
cyclo(glu-glu),C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O,-1.5,0.6,344.564194,4.0,6.0,133.0,6.0
P-sulfanilic acid,C1=CC(=CC=C1N)S(=O)(=O)O,-0.6,0.0,340.595074,2.0,4.0,88.8,1.0
Methylcysteine,CSC[C@@H](C(=O)O)N,-2.7,0.75,86.107496,2.0,4.0,88.6,3.0
Asp-Arg,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N,-5.3,0.6,393.499549,6.0,7.0,194.0,9.0
LPI(16:2/0:0),CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...,,0.8,744.625633,,,,


In [76]:
import pubchempy as pcp
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski, rdMolDescriptors

# Define the compounds with their names, curated SMILES,
# and manually verified PubChem CIDs (for attempted lookup and record-keeping).
compounds_data = {
    'LPC(18:3(9Z,12Z,15Z)/0:0)': {
        'curated_smiles': r"CCCCC=CCC=CCC=CCCCCCCCC(=O)OC[C@H](COP(=O)([O-])OCC[N+](C)(C)C)O",
        'verified_cid': 5288112,
        'expected_name_context': '1-(9Z,12Z,15Z-octadecatrienoyl)-sn-glycero-3-phosphocholine'
    },
    'Cyclo(L-Phe-L-Glu)': {
        'curated_smiles': r"C1=CC=C(C=C1)C[C@H]2C(=O)N[C@@H](C(=O)N2)CCC(=O)O",
        'verified_cid': 11355008,
        'expected_name_context': '(3S,8S)-3-(2-carboxyethyl)-8-(phenylmethyl)piperazine-2,5-dione'
    },
    'LPI(16:2(9Z,12Z)/0:0)': {
        'curated_smiles': r"CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O-])O[C@H]1[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)[C@@H]1O",
        'verified_cid': 151036169, 
        'expected_name_context': '1-(9Z,12Z-hexadecadienoyl)-sn-glycero-3-phospho-1D-myo-inositol'
    }
}

def get_rdkit_properties(smiles_string):
    """Calculates physicochemical properties using RDKit from a SMILES string."""
    mol = Chem.MolFromSmiles(smiles_string)
    if not mol:
        return {
            'RDKit_MolLogP': None, 'RDKit_FractionCSP3': None, 'RDKit_BertzCT': None,
            'RDKit_HBondDonors': None, 'RDKit_HBondAcceptors': None,
            'RDKit_TPSA': None, 'RDKit_RotatableBonds': None,
            'RDKit_Status': 'Failed to create RDKit molecule'
        }
    
    # Add hydrogens if not explicitly defined, can affect some descriptors
    # mol = Chem.AddHs(mol) # Optional: can sometimes improve accuracy but also change results slightly

    return {
        'RDKit_MolLogP': Descriptors.MolLogP(mol),
        'RDKit_FractionCSP3': Descriptors.FractionCSP3(mol),
        'RDKit_BertzCT': Descriptors.BertzCT(mol), # RDKit's complexity index
        'RDKit_HBondDonors': Lipinski.NumHDonors(mol), # Using Lipinski for HDonors
        'RDKit_HBondAcceptors': Lipinski.NumHAcceptors(mol), # Using Lipinski for HAcceptors
        'RDKit_TPSA': Descriptors.TPSA(mol), # Or rdMolDescriptors.CalcTPSA(mol)
        'RDKit_RotatableBonds': Descriptors.NumRotatableBonds(mol), # Or Lipinski.NumRotatableBonds(mol)
        'RDKit_Status': 'Success'
    }

def get_properties_combined(compound_name, curated_smiles, verified_cid, expected_context):
    """
    Attempts PubChem lookup by CID for record-keeping and verification,
    but calculates physicochemical properties using RDKit from the curated SMILES.
    """
    output_data = {
        'Name': compound_name,
        'Curated SMILES': curated_smiles,
        'PubChem CID Attempted': verified_cid,
        'PubChem Fetched SMILES': None,
        'PubChem Fetched Name': None,
        'PubChem Lookup Status': 'Not Processed',
        # RDKit properties will be merged here
    }

    # Attempt PubChem lookup for record-keeping and to see what it returns
    if verified_cid:
        try:
            print(f"Attempting PubChem lookup for CID: {verified_cid} ({compound_name})")
            c = pcp.Compound.from_cid(verified_cid)
            if c:
                output_data['PubChem Fetched SMILES'] = getattr(c, 'isomeric_smiles', 'N/A')
                fetched_name = getattr(c, 'iupac_name', None)
                if not fetched_name and c.synonyms:
                    fetched_name = c.synonyms[0]
                output_data['PubChem Fetched Name'] = fetched_name if fetched_name else 'N/A'
                
                # Verification
                is_verified_match = False
                if output_data['PubChem Fetched SMILES'] == curated_smiles:
                    is_verified_match = True
                elif fetched_name and isinstance(fetched_name, str) and \
                     (expected_context.lower() in fetched_name.lower() or \
                      compound_name.split('(')[0].lower() in fetched_name.lower()):
                    is_verified_match = True
                elif c.synonyms:
                    for syn in c.synonyms:
                        if expected_context.lower() in syn.lower() or \
                           compound_name.split('(')[0].lower() in syn.lower():
                            is_verified_match = True
                            break
                
                if is_verified_match:
                    output_data['PubChem Lookup Status'] = 'Success (Fetched data matches expected context)'
                else:
                    output_data['PubChem Lookup Status'] = (f"Mismatch Error: CID {verified_cid} returned "
                                                            f"'{output_data['PubChem Fetched Name']}' "
                                                            f"(SMILES: {output_data['PubChem Fetched SMILES']}), "
                                                            f"not matching '{expected_context}'.")
            else:
                output_data['PubChem Lookup Status'] = f"PubChem: Compound not found for CID {verified_cid}"
        except pcp.PubChemHTTPError as e:
            output_data['PubChem Lookup Status'] = f"PubChem API Error for CID {verified_cid}: {e}"
        except Exception as e:
            output_data['PubChem Lookup Status'] = f"General Error during PubChem lookup for CID {verified_cid}: {e}"
    else:
        output_data['PubChem Lookup Status'] = 'No PubChem CID provided for lookup'

    # Calculate properties using RDKit from the curated SMILES
    print(f"Calculating RDKit properties for: {compound_name}")
    rdkit_props = get_rdkit_properties(curated_smiles)
    output_data.update(rdkit_props) # Merge RDKit properties into the main output
        
    return output_data

# --- Main execution ---
all_compound_data_with_props = []
print("Fetching/Calculating physicochemical properties...\n")

for name, data in compounds_data.items():
    props_summary = get_properties_combined(name, data['curated_smiles'], data['verified_cid'], data['expected_name_context'])
    all_compound_data_with_props.append(props_summary)
    print(f"Processed: {name}")
    print(f"  PubChem Lookup Status: {props_summary['PubChem Lookup Status']}")
    print(f"  RDKit Properties Status: {props_summary['RDKit_Status']}")
    if props_summary.get('PubChem Fetched Name') and "Mismatch Error" not in props_summary['PubChem Lookup Status']:
         print(f"  PubChem Fetched Name: {props_summary['PubChem Fetched Name']}")
    print(f"  Curated SMILES: {props_summary['Curated SMILES']}")
    if props_summary.get('PubChem Fetched SMILES') and props_summary['PubChem Fetched SMILES'] != props_summary['Curated SMILES'] and props_summary['PubChem Fetched SMILES'] != 'N/A':
        print(f"  PubChem Fetched SMILES: {props_summary['PubChem Fetched SMILES']}")
    print("-" * 50)

# Display results in a pandas DataFrame
df_final_properties = pd.DataFrame(all_compound_data_with_props)

column_order = [
    'Name', 'Curated SMILES', 'PubChem CID Attempted', 
    'PubChem Fetched SMILES', 'PubChem Fetched Name', 'PubChem Lookup Status',
    'RDKit_MolLogP', 'RDKit_FractionCSP3', 'RDKit_BertzCT', 
    'RDKit_HBondDonors', 'RDKit_HBondAcceptors', 'RDKit_TPSA', 
    'RDKit_RotatableBonds', 'RDKit_Status'
]
# Ensure all columns in column_order exist
existing_columns_final = [col for col in column_order if col in df_final_properties.columns]
df_final_properties = df_final_properties[existing_columns_final]

print("\n--- Physicochemical Properties Summary (RDKit Calculations) ---")
print(df_final_properties)


Fetching/Calculating physicochemical properties...

Attempting PubChem lookup for CID: 5288112 (LPC(18:3(9Z,12Z,15Z)/0:0))
Calculating RDKit properties for: LPC(18:3(9Z,12Z,15Z)/0:0)
Processed: LPC(18:3(9Z,12Z,15Z)/0:0)
  PubChem Lookup Status: Mismatch Error: CID 5288112 returned '2,6-diaminoquinazolin-4-ol' (SMILES: C1=CC2=C(C=C1N)C(=NC(=N2)N)O), not matching '1-(9Z,12Z,15Z-octadecatrienoyl)-sn-glycero-3-phosphocholine'.
  RDKit Properties Status: Success
  Curated SMILES: CCCCC=CCC=CCC=CCCCCCCCC(=O)OC[C@H](COP(=O)([O-])OCC[N+](C)(C)C)O
  PubChem Fetched SMILES: C1=CC2=C(C=C1N)C(=NC(=N2)N)O
--------------------------------------------------
Attempting PubChem lookup for CID: 11355008 (Cyclo(L-Phe-L-Glu))
Calculating RDKit properties for: Cyclo(L-Phe-L-Glu)
Processed: Cyclo(L-Phe-L-Glu)
  PubChem Lookup Status: Mismatch Error: CID 11355008 returned '(2R)-2-[[(2S)-6-amino-1-[[(2S)-1-[[(2S)-1-[[(2S)-1-[[(1S)-1-carboxy-2-phenylethyl]amino]-3-(6-chloro-5-hydroxy-1H-indol-3-yl)-1-oxopropan

In [77]:
display(df_final_properties)

Unnamed: 0,Name,Curated SMILES,PubChem CID Attempted,PubChem Fetched SMILES,PubChem Fetched Name,PubChem Lookup Status,RDKit_MolLogP,RDKit_FractionCSP3,RDKit_BertzCT,RDKit_HBondDonors,RDKit_HBondAcceptors,RDKit_TPSA,RDKit_RotatableBonds,RDKit_Status
0,"LPC(18:3(9Z,12Z,15Z)/0:0)",CCCCC=CCC=CCC=CCCCCCCCC(=O)OC[C@H](COP(=O)([O-...,5288112,C1=CC2=C(C=C1N)C(=NC(=N2)N)O,"2,6-diaminoquinazolin-4-ol","Mismatch Error: CID 5288112 returned '2,6-diam...",5.468,0.75,701.134457,1,7,105.12,24,Success
1,Cyclo(L-Phe-L-Glu),C1=CC=C(C=C1)C[C@H]2C(=O)N[C@@H](C(=O)N2)CCC(=O)O,11355008,CCC(C)[C@H](C(=O)O)NC(=O)N[C@@H](CCCCN)C(=O)N[...,(2R)-2-[[(2S)-6-amino-1-[[(2S)-1-[[(2S)-1-[[(2...,Mismatch Error: CID 11355008 returned '(2R)-2-...,0.0771,0.357143,515.165005,3,3,95.5,5,Success
2,"LPI(16:2(9Z,12Z)/0:0)",CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...,151036169,CC1=CC(=CC2=C1NC(=C2)N3CCCNCC3)Cl,"5-chloro-2-(1,4-diazepan-1-yl)-7-methyl-1H-indole",Mismatch Error: CID 151036169 returned '5-chlo...,0.6121,0.8,744.625633,6,12,206.27,19,Success


In [1]:
# The “Mismatch Error” is a built-in safety check that prevents us from using the wrong molecule.
# When we ask PubChem for a CID, we might get back a completely different compound (e.g., PubChem returned 
# “2,6-diaminoquinazolin-4-ol” instead of our LPC structure). Instead of blindly trusting that response, 
# the script compares the returned SMILES/name to what we expected. If they don’t match, it raises an error 
# and falls back on our manually confirmed SMILES (“curated_smiles”). 
#
# In everyday terms: we gave the script a library call number (the CID), but the librarian (PubChem API) 
# handed us the wrong book. The script noticed the title didn’t match and refused to use that data. 
# By catching mismatches early, we ensure all downstream physicochemical calculations are based on the exact 
# molecules we intended to study.


In [102]:
# 1. Your CID map (only include the ones you fetched)
cid_map = {
    'Thiamine Monophosphate': 3382778
}

from pubchempy import get_compounds

# 2. Fetch & update
for name, cid in cid_map.items():
    comp = get_compounds(cid, 'cid')
    if comp:
        phys_props_df.at[name, 'XLogP'] = comp[0].xlogp

# 3. Confirm the updates
print(phys_props_df.loc[list(cid_map.keys()), ['XLogP']])

                        XLogP
Compound                     
Thiamine Monophosphate   -0.2


In [103]:
# 1) First, define the mapping from short → long names:
short_to_long = {
    "LPC(18:3/0:0)":       "LPC(18:3(9Z,12Z,15Z)/0:0)",
    "Cyclo(Phe-Glu)":      "Cyclo(L-Phe-L-Glu)",
    "LPI(16:2/0:0)":        "LPI(16:2(9Z,12Z)/0:0)"
}
# Invert it:
long_to_short = {long: short for short, long in short_to_long.items()}

# 2) Make a copy of df_final_properties and set 'Name' as its index,
#    then rename that index from long → short. Any row whose 'Name' isn't
#    in long_to_short will be dropped by rename.
df_fp_short = df_final_properties.set_index("Name").rename(index=long_to_short)

# Now df_fp_short.index should be exactly ["LPC(18:3/0:0)", "Cyclo(Phe-Glu)", "LPI(16:2/0:0)"]
# (assuming all three existed in the original.)

# 3) The actual RDKit columns present:
print("Available columns in df_fp_short:", df_fp_short.columns.tolist())

# 4) Build a mapping from those RDKit names → phys_props_df column names:
rdkit_to_phys = {
    "RDKit_MolLogP":           "XLogP",
    "RDKit_FractionCSP3":      "FSP3",
    "RDKit_BertzCT":           "Complexity",
    "RDKit_HBondDonors":       "HBondDonors",
    "RDKit_HBondAcceptors":    "HBondAcceptors",
    "RDKit_TPSA":              "TPSA",
    "RDKit_RotatableBonds":    "RotatableBonds"
}

# Identify which of those keys actually exist:
available_columns = [col for col in rdkit_to_phys if col in df_fp_short.columns]
missing = [col for col in rdkit_to_phys if col not in df_fp_short.columns]
print("Columns requested but not found in df_fp_short:", missing)
print("Columns that will be merged:", available_columns)

# 5) Create df_to_merge by selecting only the available RDKit columns, then renaming:
df_to_merge = df_fp_short[available_columns].rename(columns=rdkit_to_phys)

# Now df_to_merge is indexed by the short names, with columns:
# ["XLogP", "FSP3", "Complexity", "HBondDonors", "HBondAcceptors", "TPSA", "RotatableBonds"]

# 6) Ensure phys_props_df has all of those columns (even if currently missing):
for col in df_to_merge.columns:
    if col not in phys_props_df.columns:
        phys_props_df[col] = pd.NA

# 7) Overwrite phys_props_df’s cells with df_to_merge’s values for each index:
for idx in df_to_merge.index:
    for col in df_to_merge.columns:
        phys_props_df.loc[idx, col] = df_to_merge.loc[idx, col]

# 8) Finally, inspect the three compounds to confirm:
print("\n=== Updated phys_props_df for the three compounds ===")
print(phys_props_df.loc[
    ["LPC(18:3/0:0)", "Cyclo(Phe-Glu)", "LPI(16:2/0:0)"],
    ["SMILES", "XLogP", "FSP3", "Complexity", "HBondDonors",
     "HBondAcceptors", "TPSA", "RotatableBonds"]
])


Available columns in df_fp_short: ['Curated SMILES', 'PubChem CID Attempted', 'PubChem Fetched SMILES', 'PubChem Fetched Name', 'PubChem Lookup Status', 'RDKit_MolLogP', 'RDKit_FractionCSP3', 'RDKit_BertzCT', 'RDKit_HBondDonors', 'RDKit_HBondAcceptors', 'RDKit_TPSA', 'RDKit_RotatableBonds', 'RDKit_Status']
Columns requested but not found in df_fp_short: []
Columns that will be merged: ['RDKit_MolLogP', 'RDKit_FractionCSP3', 'RDKit_BertzCT', 'RDKit_HBondDonors', 'RDKit_HBondAcceptors', 'RDKit_TPSA', 'RDKit_RotatableBonds']

=== Updated phys_props_df for the three compounds ===
                                                           SMILES   XLogP  \
Compound                                                                    
LPC(18:3/0:0)   CCCCC=CCC=CCC=CCCCCCCCC(=O)OC[C@H](COP(=O)([O-...  5.4680   
Cyclo(Phe-Glu)  C1=CC=C(C=C1)C[C@H]2C(=O)N[C@@H](C(=O)N2)CCC(=O)O  0.0771   
LPI(16:2/0:0)   CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...  0.6121   

                    FSP3  Compl

In [105]:
phys_props_df.head(27)

Unnamed: 0_level_0,SMILES,XLogP,FSP3,Complexity,HBondDonors,HBondAcceptors,TPSA,RotatableBonds
Compound,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
D-Erythronolactone,C1[C@H]([C@H](C(=O)O1)O)O,-1.3,0.75,110.605938,2.0,4.0,66.8,0.0
"1,6-anhydro-β-D-glucose",C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O,-2.1,1.0,141.051632,3.0,5.0,79.2,0.0
Deoxyribose 5-phosphate,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O,-2.6,1.0,212.928441,4.0,7.0,116.0,3.0
2-Aminobenzenesulfonic acid,C1=CC=C(C(=C1)N)S(=O)(=O)O,0.4,0.0,357.845074,2.0,4.0,88.8,1.0
Quinoline-4-carboxylic acid,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O,0.5,0.0,459.936329,1.0,3.0,50.2,1.0
cyclo(glu-glu),C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O,-1.5,0.6,344.564194,4.0,6.0,133.0,6.0
P-sulfanilic acid,C1=CC(=CC=C1N)S(=O)(=O)O,-0.6,0.0,340.595074,2.0,4.0,88.8,1.0
Methylcysteine,CSC[C@@H](C(=O)O)N,-2.7,0.75,86.107496,2.0,4.0,88.6,3.0
Asp-Arg,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N,-5.3,0.6,393.499549,6.0,7.0,194.0,9.0
LPI(16:2/0:0),CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...,0.6121,0.8,744.625633,6.0,12.0,206.27,19.0


In [91]:
phys_props_df.to_csv('physiochemical_properties.csv', index=True)


In [106]:
# Assume you already have three DataFrames in memory:
# 1) filter_df with columns:
#      ['Compounds', 'Class I', 'Class II', 'VIP', 'p_value',
#       'Fold_Change', 'Log2FC', 'Type', …]
#
# 2) X_df with columns:
#      ['Compounds', 'Molecular Weight (Da)']
#
# 3) phys_props_df with index or column ‘Compound’ matching the same names in filter_df['Compounds'] and X_df['Compounds'].
#
# We want to bring the columns from filter_df and X_df into phys_props_df, matching on the compound name.

# Step 1: If phys_props_df uses “Compound” as its index, leave it. If “Compound” is just a column, set it as the index:
if 'Compound' in phys_props_df.columns:
    phys_props_df = phys_props_df.set_index('Compound')

# Step 2: Prepare filter_df by setting its index to ‘Compounds’
filter_indexed = filter_df.set_index('Compounds')

# Step 3: Prepare X_df by setting its index to ‘Compounds’
X_indexed = X_df.set_index('Compounds')

# Step 4: Select only the columns we want from filter_df and X_df, and join them into phys_props_df
# From filter_df we need these columns: ['Class I', 'Class II', 'VIP', 'p_value', 'Fold_Change', 'Log2FC', 'Type']
phys_props_df = phys_props_df.join(
    filter_indexed[['Class I', 'Class II', 'VIP', 'p_value', 'Fold_Change', 'Log2FC', 'Type']],
    how='left'
)

# From X_df we need ['Molecular Weight (Da)']
phys_props_df = phys_props_df.join(
    X_indexed[['Molecular Weight (Da)']],
    how='left'
)

# Step 5: If you prefer to have “Compound” back as a column rather than the index, do:
phys_props_df = phys_props_df.reset_index().rename(columns={'index': 'Compound'})

# Now phys_props_df contains all its original columns plus:
#   'Class I', 'Class II', 'VIP', 'p_value', 'Fold_Change', 'Log2FC', 'Type', 'Molecular Weight (Da)'
#
# You can inspect it:
print(phys_props_df.head())


                      Compound                                         SMILES  \
0           D-Erythronolactone                      C1[C@H]([C@H](C(=O)O1)O)O   
1      1,6-anhydro-β-D-glucose  C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O   
2      Deoxyribose 5-phosphate          C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O   
3  2-Aminobenzenesulfonic acid                     C1=CC=C(C(=C1)N)S(=O)(=O)O   
4  Quinoline-4-carboxylic acid                  C1=CC=C2C(=C1)C(=CC=N2)C(=O)O   

   XLogP  FSP3  Complexity  HBondDonors  HBondAcceptors   TPSA  \
0   -1.3  0.75  110.605938          2.0             4.0   66.8   
1   -2.1  1.00  141.051632          3.0             5.0   79.2   
2   -2.6  1.00  212.928441          4.0             7.0  116.0   
3    0.4  0.00  357.845074          2.0             4.0   88.8   
4    0.5  0.00  459.936329          1.0             3.0   50.2   

   RotatableBonds                              Class I  \
0             0.0    Carbohydrates and Its metabolites   


In [107]:
phys_props_df.head(27)

Unnamed: 0,Compound,SMILES,XLogP,FSP3,Complexity,HBondDonors,HBondAcceptors,TPSA,RotatableBonds,Class I,Class II,VIP,p_value,Fold_Change,Log2FC,Type,Molecular Weight (Da)
0,D-Erythronolactone,C1[C@H]([C@H](C(=O)O1)O)O,-1.3,0.75,110.605938,2.0,4.0,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027
1,"1,6-anhydro-β-D-glucose",C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O,-2.1,1.0,141.051632,3.0,5.0,79.2,0.0,Carbohydrates and Its metabolites,Sugars,1.832149,0.005326,0.388313,-1.364709,down,162.05283
2,Deoxyribose 5-phosphate,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O,-2.6,1.0,212.928441,4.0,7.0,116.0,3.0,Nucleotide And Its metabolites,Nucleotide And Its metabolites,1.448724,0.019611,0.314954,-1.666788,down,214.02423
3,2-Aminobenzenesulfonic acid,C1=CC=C(C(=C1)N)S(=O)(=O)O,0.4,0.0,357.845074,2.0,4.0,88.8,1.0,Benzene and substituted derivatives,Benzene and substituted derivatives,1.988007,9.6e-05,0.481644,-1.05396,down,173.19
4,Quinoline-4-carboxylic acid,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O,0.5,0.0,459.936329,1.0,3.0,50.2,1.0,Heterocyclic compounds,Pteridines and derivatives,1.988007,9.6e-05,0.481644,-1.05396,down,173.047678
5,cyclo(glu-glu),C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O,-1.5,0.6,344.564194,4.0,6.0,133.0,6.0,Amino acid and Its metabolites,Small Peptide,1.137473,0.041106,2.235048,1.160306,up,258.08573
6,P-sulfanilic acid,C1=CC(=CC=C1N)S(=O)(=O)O,-0.6,0.0,340.595074,2.0,4.0,88.8,1.0,Amino acid and Its metabolites,Amino acid derivatives,1.988007,9.6e-05,0.481644,-1.05396,down,173.19
7,Methylcysteine,CSC[C@@H](C(=O)O)N,-2.7,0.75,86.107496,2.0,4.0,88.6,3.0,Amino acid and Its metabolites,Amino acids,1.603143,0.021896,0.438727,-1.188603,down,135.0354
8,Asp-Arg,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N,-5.3,0.6,393.499549,6.0,7.0,194.0,9.0,Amino acid and Its metabolites,Small Peptide,1.30826,0.012759,0.47731,-1.067001,down,289.13807
9,LPI(16:2/0:0),CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...,0.6121,0.8,744.625633,6.0,12.0,206.27,19.0,GP,PI,2.600966,0.005109,0.466713,-1.099393,down,568.26541


In [108]:
phys_props_df.to_csv('physiochemical_properties.csv', index=True)


In [407]:
# We convert each SMILES string into an RDKit Mol object so we can compute chemical properties programmatically.
# These properties — called molecular descriptors — give us quick, intuitive insights into how each compound behaves.
#
# Here's what each descriptor tells us:
#
# • XLogP            → Measures how hydrophobic (fat-loving) the molecule is. Important for predicting solubility and membrane permeability.
# • Fsp³             → Fraction of sp³-hybridized carbons. High values suggest a 3D, drug-like shape; low values mean the molecule is flat and aromatic.
# • Complexity       → An estimate of structural intricacy. More rings, branches, and stereocenters = higher complexity.
# • H-bond Donors    → Number of groups like –OH or –NH that can donate hydrogen bonds. Influences water solubility and protein interactions.
# • H-bond Acceptors → Number of atoms (like O or N) that can accept hydrogen bonds. Also key for binding and solubility.
# • TPSA             → Topological Polar Surface Area — total surface area available for hydrogen bonding. Strongly linked to oral absorption and permeability.
# • Rotatable Bonds  → Counts how many single bonds can rotate freely. More rotatable bonds = more flexible molecules, which can affect bioavailability.
#
# Calculating these descriptors gives us a fast, interpretable profile of each metabolite’s chemical behavior —
# before diving into more complex cheminformatics or biological analysis.


<div style="
    border-left: 4px solid #2e7d32;
    background: #eec7b7;
    padding: 20px;
    margin: 25px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    line-height: 1.5;
">
  <h3 style="margin: 0 0 12px 0; color: #d5744b;">4. Quantum‐Mechanical Property Calculation (xTB)</h3>
  <ol style="margin: 0; padding: 0 0 0 20px; list-style-type: decimal;">
    <li style="margin-bottom: 12px;">
      <strong>Convert SMILES to 3D Coordinates</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Create a minimal‐energy 3D conformer via RDKit embedding and force‐field optimization for each SMILES.</li>
      </ul>
    </li>
    <li style="margin-bottom: 12px;">
      <strong>Batch xTB Execution</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Run xTB geometry optimizations and single‐point calculations for every conformer, specifying the appropriate net charge.</li>
        <li>Capture output in memory (standard out or log files) for parsing.</li>
      </ul>
    </li>
    <li style="margin-bottom: 12px;">
      <strong>Parse xTB Outputs</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Extract HOMO and LUMO energies, HOMO–LUMO gap, and dipole moment using regular expressions or simple text parsing.</li>
      </ul>
    </li>
    <li>
      <strong>Assemble Quantum Properties</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Combine these quantum properties into a DataFrame keyed by compound name.</li>
      </ul>
    </li>
    <li style="margin-top: 12px;">
      <strong>Quality Control</strong>
      <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
        <li>Report any molecules for which xTB failed or produced missing values.</li>
        <li>Confirm that the final count matches the expected number of valid SMILES.</li>
      </ul>
    </li>
  </ol>
</div>


In [109]:
phys_props_df = pd.read_csv("physiochemical_properties.csv")
display(phys_props_df)

Unnamed: 0.1,Unnamed: 0,Compound,SMILES,XLogP,FSP3,Complexity,HBondDonors,HBondAcceptors,TPSA,RotatableBonds,Class I,Class II,VIP,p_value,Fold_Change,Log2FC,Type,Molecular Weight (Da)
0,0,D-Erythronolactone,C1[C@H]([C@H](C(=O)O1)O)O,-1.3,0.75,110.605938,2.0,4.0,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027
1,1,"1,6-anhydro-β-D-glucose",C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O,-2.1,1.0,141.051632,3.0,5.0,79.2,0.0,Carbohydrates and Its metabolites,Sugars,1.832149,0.005326,0.388313,-1.364709,down,162.05283
2,2,Deoxyribose 5-phosphate,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O,-2.6,1.0,212.928441,4.0,7.0,116.0,3.0,Nucleotide And Its metabolites,Nucleotide And Its metabolites,1.448724,0.019611,0.314954,-1.666788,down,214.02423
3,3,2-Aminobenzenesulfonic acid,C1=CC=C(C(=C1)N)S(=O)(=O)O,0.4,0.0,357.845074,2.0,4.0,88.8,1.0,Benzene and substituted derivatives,Benzene and substituted derivatives,1.988007,9.6e-05,0.481644,-1.05396,down,173.19
4,4,Quinoline-4-carboxylic acid,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O,0.5,0.0,459.936329,1.0,3.0,50.2,1.0,Heterocyclic compounds,Pteridines and derivatives,1.988007,9.6e-05,0.481644,-1.05396,down,173.047678
5,5,cyclo(glu-glu),C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O,-1.5,0.6,344.564194,4.0,6.0,133.0,6.0,Amino acid and Its metabolites,Small Peptide,1.137473,0.041106,2.235048,1.160306,up,258.08573
6,6,P-sulfanilic acid,C1=CC(=CC=C1N)S(=O)(=O)O,-0.6,0.0,340.595074,2.0,4.0,88.8,1.0,Amino acid and Its metabolites,Amino acid derivatives,1.988007,9.6e-05,0.481644,-1.05396,down,173.19
7,7,Methylcysteine,CSC[C@@H](C(=O)O)N,-2.7,0.75,86.107496,2.0,4.0,88.6,3.0,Amino acid and Its metabolites,Amino acids,1.603143,0.021896,0.438727,-1.188603,down,135.0354
8,8,Asp-Arg,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N,-5.3,0.6,393.499549,6.0,7.0,194.0,9.0,Amino acid and Its metabolites,Small Peptide,1.30826,0.012759,0.47731,-1.067001,down,289.13807
9,9,LPI(16:2/0:0),CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...,0.6121,0.8,744.625633,6.0,12.0,206.27,19.0,GP,PI,2.600966,0.005109,0.466713,-1.099393,down,568.26541


In [119]:
#!/usr/bin/env python3
import os
import re
import subprocess
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_xyz(smiles: str, name: str) -> tuple[str | None, int]:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print(f"❌ RDKit failed to parse SMILES: {smiles}")
        return None, 0 # Return None for content, 0 for charge

    # Separate fragments (like counterions)
    fragments = Chem.GetMolFrags(mol, asMols=True)

    if not fragments:
        print(f"❌ RDKit found no fragments for SMILES: {smiles}")
        return None, 0

    # Assume the largest fragment (by atom count) is the one we want for geometry
    # This is a common convention but might need adjustment for specific edge cases.
    target_mol = max(fragments, key=lambda m: m.GetNumAtoms())

    # Calculate charge *of the target fragment*
    charge = AllChem.GetFormalCharge(target_mol)

    target_mol = Chem.AddHs(target_mol)
    valid_symbols = set(Chem.GetPeriodicTable().GetElementSymbol(i) for i in range(1, 119))
    for atom in target_mol.GetAtoms():
        symbol = atom.GetSymbol()
        if symbol not in valid_symbols:
            print(f"❌ Invalid atom symbol '{symbol}' found in molecule derived from SMILES: {smiles}")
            return None, charge # Return charge even on error
    try:
        AllChem.EmbedMolecule(target_mol, AllChem.ETKDGv2())
        AllChem.UFFOptimizeMolecule(target_mol)
    except Exception as e:
        print(f"⚠️ RDKit conformer generation or optimization failed for {name}: {e}")

    if target_mol.GetNumConformers() == 0: # Check the molecule we actually embedded
        print(f"❌ RDKit failed to generate conformer for {name}")
        return None, charge

    conf = target_mol.GetConformer()
    atoms = target_mol.GetAtoms()
    lines = [str(len(atoms)), name]
    for atom in atoms:
        pos = conf.GetAtomPosition(atom.GetIdx())
        lines.append(f"{atom.GetSymbol():<2} {pos.x:.6f} {pos.y:.6f} {pos.z:.6f}")
    return "\n".join(lines), charge

def parse_properties(fn: str) -> dict:
    props = {}
    # Look for "TOTAL ENERGY" followed by the value in Eh (Hartree)
    pat_energy = re.compile(r'\s+TOTAL\s+ENERGY\s+([-\d\.]+)\s+Eh', re.IGNORECASE)
    # Try to capture HOMO, LUMO, and optionally the Gap from the summary line
    pat_homo_lumo_gap = re.compile(r'\s+\(HOMO\)\s+([-\d\.]+)\s+eV\s+\(LUMO\)\s+([-\d\.]+)\s+eV(?:\s+\(GAP\)\s+([-\d\.]+)\s+eV)?', re.IGNORECASE)
    # Note: xTB often prints HOMO/LUMO/Gap on the same line in eV

    found_props = set()
    expected_props = {'TotalEnergy', 'HOMO', 'LUMO', 'Gap'}

    try:
        with open(fn, 'r', errors='ignore', encoding='utf-8') as f:
            lines = f.readlines()
        for line in lines:
            if 'TotalEnergy' not in found_props:
                m = pat_energy.search(line)
                if m:
                    props['TotalEnergy'] = float(m.group(1))
                    found_props.add('TotalEnergy')

            # Try to find HOMO, LUMO, and Gap on the same line
            if not {'HOMO', 'LUMO', 'Gap'}.issubset(found_props):
                m = pat_homo_lumo_gap.search(line)
                if m:
                    props['HOMO'] = float(m.group(1)) # Value in eV
                    props['LUMO'] = float(m.group(2)) # Value in eV
                    props['Gap'] = float(m.group(3))  # Value in eV
                    found_props.add('HOMO')
                    found_props.add('LUMO')
                    # Group 3 captures the Gap, might be None if not found on the same line
                    if m.group(3):
                        props['Gap'] = float(m.group(3))  # Value in eV
                        found_props.add('Gap')

        # Fallback search if properties are still missing
        if not expected_props.issubset(found_props):
            with open(fn, 'r', errors='ignore', encoding='utf-8') as f:
                content = f.read() # Read the file content here
                if 'HOMO' not in found_props:
                    m = re.search(r'HOMO.*?([-+]?[0-9]*\.?[0-9]+)', content, re.IGNORECASE)
                    if m:
                        props['HOMO'] = float(m.group(1))
                        found_props.add('HOMO')
                if 'LUMO' not in found_props:
                    m = re.search(r'LUMO.*?([-+]?[0-9]*\.?[0-9]+)', content, re.IGNORECASE)
                    if m:
                        props['LUMO'] = float(m.group(1))
                        found_props.add('LUMO')
                if 'Gap' not in found_props:
                    m = re.search(r'gap.*?([-+]?[0-9]*\.?[0-9]+)', content, re.IGNORECASE)
                    if m:
                        props['Gap'] = float(m.group(1))
                        found_props.add('Gap')

    except FileNotFoundError:
        print(f"❌ Output file not found: {fn}")
    except Exception as e:
        print(f"❌ Error parsing properties from {fn}: {e}")

    missing = expected_props - found_props
    if missing:
        if os.path.exists(fn):
            print(f"⚠️ Missing properties in {fn}: {', '.join(missing)}")

    return props

def parse_dipole(fn: str) -> float | None:
    # Look for the total dipole moment value, often formatted like:
    # molecular dipole:
    #    full:        X.XXX       Y.YYY       Z.ZZZ       T.TTT  (Debye)
    # or in a summary:
    # DIPOLE MOMENT            4.359 D
    # or: total dipole moment:          4.129 D

    # Ordered list of regex patterns to try. More specific/reliable ones first.
    # Each dictionary contains the regex string and the capturing group index for the value.
    patterns_config = [
        # P1: "full: ... TotalValue (Debye)" (e.g., from GFN2-xTB verbose output)
        # Example: "   full:        X.XXX       Y.YYY       Z.ZZZ       T.TTT  (Debye)"
        {"regex": r"^\s*full\s*:\s*.*\s+([-+]?[\d\.]+)\s*(?:\(Debye\)|D|Debye)?\s*$", "value_group": 1},

        # P2: "-> total dipole moment: VALUE D" or "molecular dipole moment: VALUE D"
        {"regex": r"^\s*(?:->\s*)?(?:total|molecular)\s+dipole\s+moment\s*:\s*([-+]?[\d\.]+)\s*(?:D|Debye)?\s*$", "value_group": 1},

        # P3: "DIPOLE MOMENT VALUE D" (e.g., GFN1-xTB summary)
        {"regex": r"^\s*DIPOLE MOMENT\s+([-+]?[\d\.]+)\s*(?:D|Debye)?\s*$", "value_group": 1},

        # P4: Lines like "  |d| / Debye :  VALUE" or "  total / Debye : VALUE" or "  Norm: VALUE Debye"
        # Example: Dipole          :    0.000    0.000    0.000    Norm:    0.000 Debye
        {"regex": r"^\s*(?:\|d\||total|norm)\s*(?:/\s*Debye)?\s*:\s*([-+]?[\d\.]+)\s*(?:D|Debye)?\s*$", "value_group": 1},
        
        # P5: For lines like "Dipole moment / Debye      VALUE" or "Dipole moment (Debye) VALUE"
        {"regex": r"^\s*Dipole\s+moment\s*(?:(?:/\s*Debye)|(?:\(Debye\)))?\s*([-+]?[\d\.]+)\s*(?:D|Debye)?\s*$", "value_group": 1},

        # P6: A line with 3 vector components and a total, often follows "molecular dipole:" header
        # Example: " -2.03310    -0.00000     0.00000     2.03310 "
        # This pattern is less anchored to keywords on the same line, so it's later in priority.
        # It assumes the 4th number is the total dipole.
        {"regex": r"^\s*([-+]?[\d\.]+)\s+([-+]?[\d\.]+)\s+([-+]?[\d\.]+)\s+([-+]?[\d\.]+)\s*(?:D|Debye|\(Debye\))?\s*$", "value_group": 4},
    ]

    compiled_patterns = []
    for p_conf in patterns_config:
        compiled_patterns.append({
            "pattern_re": re.compile(p_conf["regex"], re.MULTILINE | re.IGNORECASE),
            "value_group": p_conf["value_group"]
        })

    try:
        with open(fn, 'r', errors='ignore', encoding='utf-8') as f:
            content = f.read()

        for pat_config in compiled_patterns:
            pat_re = pat_config["pattern_re"]
            value_group_idx = pat_config["value_group"]
            
            last_match_val = None
            # Iterate through all matches for the current pattern in the content
            for match in pat_re.finditer(content):
                # Ensure the specified capturing group exists and has a value
                if len(match.groups()) >= value_group_idx and match.group(value_group_idx) is not None:
                    try:
                        current_val = float(match.group(value_group_idx))
                        last_match_val = current_val # Keep the last valid float found by this pattern
                    except ValueError:
                        pass # Ignore if conversion fails, though regex should ensure it's a number
            
            if last_match_val is not None: # If this pattern yielded any valid float
                return last_match_val # Return the last one found by this pattern

    except FileNotFoundError:
        pass
    except Exception as e:
        print(f"❌ Error parsing dipole from {fn}: {e}")
    return None

def run_xtb_job(xyz_fn: str, compound_name_for_special_parsing: str, charge: int = 0) -> dict:
    props = {}
    out_fn = 'xtb.out'

    for level in ['2', '1', '0']:
        if os.path.exists(out_fn):
            os.remove(out_fn)

        cmd = [
            'xtb', xyz_fn,
            '--opt', '--property', '--verbose', '--dipole',
            '--gfn', level, '--charge', str(charge), '--uhf', '0' # Use the passed charge
        ]
        print(f"▶ {' '.join(cmd)}")

        try:
            result = subprocess.run(cmd, capture_output=True, text=True, check=False, encoding='utf-8')
            with open(out_fn, 'w', encoding='utf-8') as f:
                f.write(result.stdout)
        except FileNotFoundError:
            print(f"❌ Error: 'xtb' command not found. Is xTB installed and in your PATH?")
            return {}
        except Exception as e:
            print(f"❌ Error running xTB command: {e}")
            return {}

        props = parse_properties(out_fn)
        dipole = parse_dipole(out_fn)
        if dipole is not None:
            props['DipoleMoment'] = dipole

        # Special targeted parsing for Carnitine C7:DC if dipole is still missing
        if 'DipoleMoment' not in props and compound_name_for_special_parsing == "Carnitine C7:DC":
            print(f"    ℹ️ Attempting special dipole parsing for Carnitine C7:DC...")
            try:
                with open(out_fn, 'r', errors='ignore', encoding='utf-8') as f_special:
                    content_special = f_special.read()
                # A very specific pattern, possibly looking for a line unique to Carnitine C7:DC's output
                # Example: looking for "molecular dipole moment" then the last number
                pat_carnitine_dipole = re.compile(r'molecular\s+dipole\s+moment.*\s+([-+]?[\d\.]+)\s*(?:D|Debye)?\s*$', re.MULTILINE | re.IGNORECASE)
                match_special = pat_carnitine_dipole.search(content_special)
                if match_special and match_special.group(1) is not None:
                    props['DipoleMoment'] = float(match_special.group(1))
                    print(f"      ✅ Special dipole parsing for Carnitine C7:DC successful: {props['DipoleMoment']}")
            except Exception as e_special:
                print(f"      ⚠️ Special dipole parsing for Carnitine C7:DC failed: {e_special}")

        if 'TotalEnergy' in props:
            props['GFN'] = level
            if 'DipoleMoment' not in props:
                print(f"⚠️ Could not parse dipole moment from {out_fn}")
            print(f"✅ Success with GFN{level}")
            break
        else:
            print(f"    no total energy found in output from GFN{level}, trying next")
    else:
        print(f"❌ xTB failed to produce properties for {xyz_fn} with GFN2, GFN1, and GFN0.")
        return {}

    return props

def main(input_data_df: pd.DataFrame) -> pd.DataFrame:
    """
    Runs xTB calculations for compounds specified in the input DataFrame.

    Args:
        input_data_df: A pandas DataFrame with 'Compound' and 'SMILES' columns.

    Returns:
        A pandas DataFrame containing the xTB calculation results.
    """
    if not isinstance(input_data_df, pd.DataFrame) or input_data_df.empty:
        print("INFO: Input DataFrame is empty or not a DataFrame. No calculations will be performed.")
        return pd.DataFrame()

    workdir = 'xtb_test'
    os.makedirs(workdir, exist_ok=True)
    original_dir = os.getcwd()
    os.chdir(workdir)

    results = []
    for index, row in input_data_df.iterrows():
        name = str(row["Compound"])
        smiles = str(row["SMILES"])
        print(f"\n--- Processing {name} (from DataFrame index {index}) ---")
        safe_name = re.sub(r'[^a-zA-Z0-9_-]', '_', name)
        xyz_file = safe_name + '.xyz'
        xyz_content, charge = smiles_to_xyz(smiles, name) # Get charge here
        if xyz_content is None:
            print(f"❌ Skipping {name} due to invalid SMILES or generated structure.")
            results.append({'Compound': name, 'Error': 'Invalid SMILES or structure generation failed', 'Charge': charge})
            if os.path.exists(xyz_file):
                os.remove(xyz_file)
            continue

        try:
            with open(xyz_file, 'w', encoding='utf-8') as f:
                f.write(xyz_content)
            print(f"✍️ Created {xyz_file}")
        except IOError as e:
            print(f"❌ Error writing {xyz_file}: {e}")
            results.append({'Compound': name, 'Error': f'File writing failed: {e}', 'Charge': charge})
            if os.path.exists(xyz_file):
                os.remove(xyz_file)
            continue

        print(f"⚛ Running xTB on {name} ({xyz_file})")
        # Pass the determined charge to the xTB job
        res = run_xtb_job(xyz_file, compound_name_for_special_parsing=name, charge=charge)
        res['Compound'] = name
        results.append(res)

        if os.path.exists(xyz_file):
            os.remove(xyz_file)

    os.chdir(original_dir)
    if results:
        df = pd.DataFrame(results).set_index('Compound')
        cols = ['TotalEnergy', 'HOMO', 'LUMO', 'Gap', 'DipoleMoment', 'GFN', 'Error']
        df = df.reindex(columns=cols)
        print("\n✅ xTB properties summary:")
        return df
    else:
        # This message will be printed by the caller if needed
        return pd.DataFrame()

if __name__ == '__main__':
    input_df_to_process = None
    # Assume 'phys_props_df' is loaded in the global scope, similar to 'annot24' in other scripts.
    if 'phys_props_df' in globals() and isinstance(phys_props_df, pd.DataFrame) and not phys_props_df.empty:
        required_cols = ['Compound', 'SMILES']
        missing_cols = [col for col in required_cols if col not in phys_props_df.columns]
        if not missing_cols:
            input_df_to_process = phys_props_df.copy() # Use a copy
            print(f"INFO: Using 'phys_props_df' DataFrame with {len(input_df_to_process)} entries for quantum calculations.")
        else:
            print(f"ERROR: 'phys_props_df' is missing required columns: {', '.join(missing_cols)}")
            input_df_to_process = pd.DataFrame() # Create an empty DataFrame to prevent errors
    else:
        print("ERROR: 'phys_props_df' DataFrame not found in the global scope, is not a DataFrame, or is empty.")
        print("Please ensure 'phys_props_df' is loaded with 'Compound' and 'SMILES' columns before running this script.")
        input_df_to_process = pd.DataFrame() # Create an empty DataFrame

    if not input_df_to_process.empty:
        results_summary_df = main(input_df_to_process)
        if results_summary_df is not None and not results_summary_df.empty:
            # The summary is already printed within main if successful,
            # but we can print the DataFrame string here as well.
            print(results_summary_df.to_string())
        else:
            print("\n⚠️ No results were generated or the main function returned an empty DataFrame.")
    else:
        print("\n❌ No data to process from 'phys_props_df'. Exiting quantum calculations.")

INFO: Using 'phys_props_df' DataFrame with 24 entries for quantum calculations.

--- Processing D-Erythronolactone (from DataFrame index 0) ---
✍️ Created D-Erythronolactone.xyz
⚛ Running xTB on D-Erythronolactone (D-Erythronolactone.xyz)
▶ xtb D-Erythronolactone.xyz --opt --property --verbose --dipole --gfn 2 --charge 0 --uhf 0
✅ Success with GFN2

--- Processing 1,6-anhydro-β-D-glucose (from DataFrame index 1) ---
✍️ Created 1_6-anhydro-_-D-glucose.xyz
⚛ Running xTB on 1,6-anhydro-β-D-glucose (1_6-anhydro-_-D-glucose.xyz)
▶ xtb 1_6-anhydro-_-D-glucose.xyz --opt --property --verbose --dipole --gfn 2 --charge 0 --uhf 0
✅ Success with GFN2

--- Processing Deoxyribose 5-phosphate (from DataFrame index 2) ---
✍️ Created Deoxyribose_5-phosphate.xyz
⚛ Running xTB on Deoxyribose 5-phosphate (Deoxyribose_5-phosphate.xyz)
▶ xtb Deoxyribose_5-phosphate.xyz --opt --property --verbose --dipole --gfn 2 --charge 0 --uhf 0
✅ Success with GFN2

--- Processing 2-Aminobenzenesulfonic acid (from DataFra

<div style="
    border-left: 4px solid #2e7d32;
    background: #f7faf7;
    padding: 20px;
    margin: 25px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    line-height: 1.5;
">
  <h3 style="margin: 0 0 12px 0; color: #2e7d32;">Quantum-Mechanical Property Calculation Pipeline</h3>
  <p style="margin: 0 0 16px 0;">
    This script is a full end-to-end quantum chemistry pipeline written in Python. It takes molecules represented as SMILES strings (like <code>"C=O"</code> for formaldehyde), turns them into 3D structures, runs semi-empirical quantum chemistry calculations using the xTB package, and extracts important molecular properties like total electronic energy, HOMO/LUMO energies, band gap, and dipole moment. Below, each part is explained intuitively yet in full technical detail, highlighting the underlying chemistry and how these calculations work.
  </p>
  
  <!-- Part 1 -->
  <h4 style="margin: 0 0 8px 0; color: #2e7d32;">🌱 Part 1: <em>smiles_to_xyz</em> — Turning 1D into 3D</h4>
  <p style="margin: 0 0 8px 0;">
    <strong>What it does:</strong><br>
    Takes a SMILES string (a text-encoded 2D molecule) and builds a 3D model, then outputs an <code>.xyz</code> file, which is the format quantum chemistry programs like xTB expect.
  </p>
  <h5 style="margin: 8px 0 4px 0; color: #2e7d32;">🧠 Underlying Chemistry</h5>
  <p style="margin: 0 0 8px 0;">
    <strong>What is a SMILES string?</strong><br>
    It’s like a chemical sentence: for example, <code>"C(=O)O"</code> means carbon double-bonded to oxygen and single-bonded to another oxygen. But that representation is flat—like a recipe that lists ingredients without telling you how to arrange them in 3D space.
  </p>
  <h5 style="margin: 8px 0 4px 0; color: #2e7d32;">🔬 RDKit Chemistry Steps</h5>
  <ul style="margin: 0 0 12px 20px; padding: 0; list-style-type: disc;">
    <li><code>Chem.MolFromSmiles(smiles)</code><br>
      Parses the SMILES into a molecular graph (atoms and bonds) but still no 3D geometry.</li>
    <li><code>Chem.GetMolFrags</code><br>
      Separates any fragments (e.g., salts or counterions) into distinct pieces.</li>
    <li><code>max(fragments, key=GetNumAtoms)</code><br>
      Picks the largest fragment—usually the part we care about.</li>
    <li><code>AddHs()</code><br>
      Adds explicit hydrogen atoms, since quantum chemistry needs every atom in the model.</li>
    <li><code>AllChem.EmbedMolecule(...)</code><br>
      “Puffs” the molecule into 3D space based on known bond lengths and angles using distance geometry.</li>
    <li><code>AllChem.UFFOptimizeMolecule()</code><br>
      Applies a force field (like springs and repulsions) to relax the geometry into a realistic conformation.</li>
  </ul>
  <p style="margin: 0 0 16px 0;">
    <strong>Output:</strong> An <code>.xyz</code> file listing each atom and its (x, y, z) coordinates. This 3D structure is what xTB will use for quantum calculations.
  </p>
  
  <!-- Part 2 -->
  <h4 style="margin: 0 0 8px 0; color: #2e7d32;">⚛️ Part 2: <em>run_xtb_job</em> — Running the Quantum Chemistry</h4>
  <p style="margin: 0 0 8px 0;">
    <strong>What it does:</strong><br>
    Takes the <code>.xyz</code> file and runs xTB (extended Tight-Binding), a semi-empirical quantum chemistry tool developed by Grimme’s group.
  </p>
  <h5 style="margin: 8px 0 4px 0; color: #2e7d32;">🧠 What xTB Does Under the Hood</h5>
  <p style="margin: 0 0 8px 0;">
    xTB uses an approximation to the Schrödinger equation, making it fast yet chemically meaningful. It calculates:
  </p>
  <ul style="margin: 0 0 12px 20px; padding: 0; list-style-type: disc;">
    <li><strong>Total electronic energy</strong> (in Hartree): the sum of kinetic and potential energy of all electrons and nuclei, a measure of molecular stability.</li>
    <li><strong>HOMO and LUMO energies</strong> (in eV): highest occupied and lowest unoccupied molecular orbital energies, which hint at how easily the molecule can donate or accept electrons.</li>
    <li><strong>Dipole moment</strong> (in Debye): indicates how unevenly the electron cloud is distributed, relating to polarity and solubility.</li>
  </ul>
  <h5 style="margin: 8px 0 4px 0; color: #2e7d32;">⚙️ xTB Input and Control</h5>
  <p style="margin: 0 0 16px 0;">
    xTB needs:
  </p>
  <ul style="margin: 0 0 12px 20px; padding: 0; list-style-type: disc;">
    <li>An <code>.xyz</code> structure file.</li>
    <li>The net charge of the molecule.</li>
    <li>Options like <code>--opt</code> (optimize geometry), <code>--property</code> (compute properties), and <code>--gfn 2</code> (use the GFN2-xTB model).</li>
  </ul>
  <p style="margin: 0 0 16px 0;">
    The script builds a command such as:
    <pre style="background: #eef5ee; padding: 8px; border-radius: 4px; overflow-x: auto;">
xtb molecule.xyz --opt --property --gfn 2 --charge 0
    </pre>
    It then runs this via <code>subprocess.run()</code> and captures the output text for parsing.
  </p>
  
  <!-- Part 3 -->
  <h4 style="margin: 0 0 8px 0; color: #2e7d32;">🛠️ Part 3: <em>parse_properties</em> &amp; <em>parse_dipole</em></h4>
  <p style="margin: 0 0 8px 0;">
    Once xTB finishes, it spits out text logs. We need to pull out the numerical values from those logs.
  </p>
  <h5 style="margin: 8px 0 4px 0; color: #2e7d32;">🔍 Parsing Details</h5>
  <ul style="margin: 0 0 12px 20px; padding: 0; list-style-type: disc;">
    <li><strong>Total Energy:</strong> Extracted with a regex pattern matching “TOTAL ENERGY = …” from xTB output.</li>
    <li><strong>HOMO/LUMO:</strong> Found by searching for lines like “HOMO = … eV” and “LUMO = … eV.”</li>
    <li><strong>Gap:</strong> Computed as the numeric difference between the parsed HOMO and LUMO values.</li>
    <li><strong>Dipole Moment:</strong> Parsed via multiple patterns (because xTB formats can vary). We look for “Dipole (Debye) …” or other similar lines.</li>
  </ul>
  <p style="margin: 0 0 16px 0;">
    The dipole parser tries several patterns so that if one format changes between xTB versions, we still catch the value correctly.
  </p>
  
  <!-- Summary Table -->
  <h4 style="margin: 0 0 8px 0; color: #2e7d32;">🧪 Chemistry Summary of Outputs</h4>
  <table style="width: 100%; border-collapse: collapse; margin-bottom: 0;">
    <thead>
      <tr style="background: #e0f2e0;">
        <th style="border: 1px solid #c7e0c7; padding: 8px; text-align: left;">Property</th>
        <th style="border: 1px solid #c7e0c7; padding: 8px; text-align: left;">Unit</th>
        <th style="border: 1px solid #c7e0c7; padding: 8px; text-align: left;">Chemistry Intuition</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">TotalElectronicEnergy</td>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">Hartree</td>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">How stable is the molecule? Lower = more stable (electronic + nuclear contributions).</td>
      </tr>
      <tr style="background: #f7f7f7;">
        <td style="border: 1px solid #c7e0c7; padding: 8px;">HOMO</td>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">eV</td>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">How easily can the molecule donate electrons? Higher HOMO = better donor.</td>
      </tr>
      <tr>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">LUMO</td>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">eV</td>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">How easily can the molecule accept electrons? Lower LUMO = better acceptor.</td>
      </tr>
      <tr style="background: #f7f7f7;">
        <td style="border: 1px solid #c7e0c7; padding: 8px;">BandGap (HOMO–LUMO)</td>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">eV</td>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">Like a band gap in solids: smaller gap = more reactive / conducting.</td>
      </tr>
      <tr>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">DipoleMoment</td>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">Debye</td>
        <td style="border: 1px solid #c7e0c7; padding: 8px;">How polar is the molecule? More polar = greater charge separation and likely higher solubility in water.</td>
      </tr>
    </tbody>
  </table>
</div>


In [120]:
display(results_summary_df)

Unnamed: 0_level_0,TotalEnergy,HOMO,LUMO,Gap,DipoleMoment,GFN,Error
Compound,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
D-Erythronolactone,-27.863108,4.01588,4.01588,0.147581,4.347,2,
"1,6-anhydro-β-D-glucose",-38.265851,10.163291,10.163291,0.373494,-6.845,2,
Deoxyribose 5-phosphate,-47.412065,6.484645,6.484645,0.238306,-2.751,2,
2-Aminobenzenesulfonic acid,-34.751298,3.30511,3.30511,0.12146,-7.745,2,
Quinoline-4-carboxylic acid,-36.013343,1.74371,1.74371,0.06408,0.227,2,
cyclo(glu-glu),-58.8091,3.919771,3.919771,0.144049,-5.833,2,
P-sulfanilic acid,-34.744605,3.62175,3.62175,0.133097,-0.91,2,
Methylcysteine,-27.418815,2.695181,2.695181,0.099046,-1.311,2,
Asp-Arg,51.177561,1.897123,1.897123,0.069718,-9.2964,2,
LPI(16:2/0:0),-127.573976,0.989721,0.989721,0.036372,74.54,2,


In [123]:
results_summary_df.to_excel("xtb_properties.xlsx", index=True)

In [124]:
xtb_df = pd.read_excel('xtb_properties.xlsx')
print(xtb_df.shape)
xtb_df.head(24)

(24, 8)


Unnamed: 0,Compound,TotalEnergy,HOMO,LUMO,Gap,DipoleMoment,GFN,Error
0,D-Erythronolactone,-27.863108,4.01588,4.01588,0.147581,4.347,2,
1,"1,6-anhydro-β-D-glucose",-38.265851,10.163291,10.163291,0.373494,-6.845,2,
2,Deoxyribose 5-phosphate,-47.412065,6.484645,6.484645,0.238306,-2.751,2,
3,2-Aminobenzenesulfonic acid,-34.751298,3.30511,3.30511,0.12146,-7.745,2,
4,Quinoline-4-carboxylic acid,-36.013343,1.74371,1.74371,0.06408,0.227,2,
5,cyclo(glu-glu),-58.8091,3.919771,3.919771,0.144049,-5.833,2,
6,P-sulfanilic acid,-34.744605,3.62175,3.62175,0.133097,-0.91,2,
7,Methylcysteine,-27.418815,2.695181,2.695181,0.099046,-1.311,2,
8,Asp-Arg,51.177561,1.897123,1.897123,0.069718,-9.2964,2,
9,LPI(16:2/0:0),-127.573976,0.989721,0.989721,0.036372,74.54,2,


<div style="
    border-left: 4px solid #2e7d32;
    background: #f7faf7;
    padding: 20px;
    margin: 25px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    line-height: 1.5;
">
  <div>
    <strong>5. Metabolomic Property Integration and KEGG Pathway Enrichment</strong>
    <h5 style="margin: 12px 0 6px 0; color: #2e7d32;">5.1 Map Metabolites to KEGG IDs for Enzyme, Pathway &amp; Reaction Retrieval</h5>
    <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
      <li>Lookup each metabolite’s SMILES in KEGG (or an in-memory reference) to retrieve the KEGG Compound ID.</li>
      <li>Using that KEGG ID, fetch linked EC numbers, pathway identifiers/names, and reaction details.</li>
    </ul>
    <h5 style="margin: 12px 0 6px 0; color: #2e7d32;">5.2 Handle Metabolites without KEGG Information</h5>
    <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
      <li>For any metabolite lacking a KEGG Compound ID, attempt SMARTS-based inference of EC numbers and associated pathways.</li>
    </ul>
  </div>
</div>


In [126]:
annot24 = pd.read_excel('annot24.xlsx')
display(annot24)

Unnamed: 0,Index,Compounds,Type,cpd_ID,HMDB,Pubchem CID,CAS,ChEBI,Metlin,kegg_map,Class I,Class II,Molecular Weight (Da),Formula,SMILES,LIPID MAPS ID
0,MADN0053,D-Erythronolactone,down,-,HMDB0000349,5325915,15667-21-7,87625,5338,--,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,118.027,C4H6O4,C1[C@H]([C@H](C(=O)O1)O)O,
1,MADN0166,"1,6-anhydro-β-D-glucose",down,-,HMDB0000640,724705,498-07-7,30997,5613,--,Carbohydrates and Its metabolites,Sugars,162.05283,C6H10O5,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O,
2,MADN0220,Deoxyribose 5-phosphate,down,-,HMDB0001031,45934311,-,16132,5956,--,Nucleotide And Its metabolites,Nucleotide And Its metabolites,214.02423,C5H11O7P,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O,
3,MADN0329,2-Aminobenzenesulfonic acid,down,C06333,-,6926,88-21-1,-,-,--,Benzene and substituted derivatives,Benzene and substituted derivatives,173.19,C6H7NO3S,C1=CC=C(C(=C1)N)S(=O)(=O)O,
4,MADN0333,Quinoline-4-carboxylic acid,down,C06414,-,10243,486-74-8,-,-,--,Heterocyclic compounds,Pteridines and derivatives,173.047678,C10H7NO2,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O,
5,MADN0466,cyclo(glu-glu),up,-,-,7408481,16691-00-2,-,-,--,Amino acid and Its metabolites,Small Peptide,258.08573,C10H14N2O6,C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O,
6,MADN0498,P-sulfanilic acid,down,-,-,8479,121-57-3,-,-,--,Amino acid and Its metabolites,Amino acid derivatives,173.19,C6H7NO3S,C1=CC(=CC=C1N)S(=O)(=O)O,
7,MADP0119,Methylcysteine,down,-,HMDB0002108,24417,1187-84-4,45658,6490,--,Amino acid and Its metabolites,Amino acids,135.0354,C4H9NO2S,CSC[C@@H](C(=O)O)N,
8,MADP0548,Asp-Arg,down,-,-,16122509,-,-,-,--,Amino acid and Its metabolites,Small Peptide,289.13807,C10H19N5O5,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N,
9,MEDN1253,LPI(16:2/0:0),down,-,-,,-,-,-,--,GP,PI,568.26541,C25H45O12P,CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...,LMPK03010B0B


#### 5.1. KEGG Compound ID Retrieval and Annotation
This subsection details the process of finding KEGG Compound IDs using HMDB, PubChem, and CAS numbers, and then fetching associated KEGG pathways, reactions, and enzymes.

In [127]:
import requests
import xml.etree.ElementTree as ET
import json
import re
import time
import urllib.parse # Needed for URL encoding names
import pandas as pd # Added for DataFrame

# --- Constants for API URLs ---
HMDB_URL = 'https://hmdb.ca/metabolites/{}.xml'
KEGG_REST_URL = 'http://rest.kegg.jp'
PUBCHEM_PUG_VIEW_URL = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound'
TARGET_ORGANISM_CODE = "hsa" # For human-specific pathways


# --- Helper for safe requests ---
def safe_request(url, params=None, headers=None, method='get', data_type='text'):
    """Performs a request and handles common errors."""
    try:
        if method.lower() == 'get':
            req_headers = {'User-Agent': 'Python-MetaboliteAnnotator/0.9 (DataFrameLoop_PUGView_HSA)'} 
            if headers:
                req_headers.update(headers)
            resp = requests.get(url, params=params, headers=req_headers, timeout=45)
        else:
            # This function currently only supports GET requests.
            print(f"Unsupported method: {method}")
            return None, f"Unsupported method: {method}"

        resp.raise_for_status() # Raises an HTTPError if the HTTP request returned an unsuccessful status code

        if data_type == 'json':
            return resp.json(), None
        elif data_type == 'xml':
            if not resp.text or resp.text.isspace(): # Check for empty XML response
                return None, "Empty XML response"
            return ET.fromstring(resp.text), None
        else: # Default to text
            return resp.text, None

    except requests.exceptions.Timeout as e:
        print(f"Request timed out for {url}: {e}")
        return None, f"Timeout: {e}"
    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 404:
            pass # Silently handle 404, function will return None
        elif e.response.status_code == 429: # HTTP 429 Too Many Requests
            print(f"HTTP 429 Too Many Requests for {url}. Consider increasing api_delay.")
        else:
            print(f"HTTP error for {url}: {e} (Status code: {e.response.status_code})")
        return None, f"HTTP Error: {e.response.status_code}" 
    except requests.exceptions.RequestException as e:
        print(f"Request failed for {url}: {e}")
        return None, f"Request Failed: {e}"
    except ET.ParseError as e: # For XML parsing errors
        response_text_snippet = "N/A"
        if 'resp' in locals() and hasattr(resp, 'text'): response_text_snippet = resp.text[:500]
        print(f"Failed to parse XML from {url}: {e}\nResponse text snippet: {response_text_snippet}...")
        return None, f"XML Parse Error: {e}"
    except json.JSONDecodeError as e: # For JSON parsing errors
        response_text_snippet = "N/A"
        if 'resp' in locals() and hasattr(resp, 'text'): response_text_snippet = resp.text[:500]
        print(f"Failed to parse JSON from {url}: {e}\nResponse text snippet: {response_text_snippet}...")
        return None, f"JSON Parse Error: {e}"
    except Exception as e: # Catch any other unexpected errors
        print(f"An unexpected error occurred for {url}: {e}")
        return None, f"Unexpected Error: {e}"

# --- Functions for fetching and parsing data ---

def get_hmdb_crossrefs(hmdb_id):
    if not hmdb_id or not isinstance(hmdb_id, str) or hmdb_id.isspace():
        return {}, "Invalid HMDB ID"
    url = HMDB_URL.format(urllib.parse.quote(hmdb_id))
    root, error = safe_request(url, data_type='xml')
    if error: 
        return {}, error 
    if root is None:
        return {}, "Failed to parse HMDB XML (root is None or XML empty)"

    xrefs = {}
    tag_map = {
        'kegg_id': 'KEGG Compound', 'chebi_id': 'ChEBI', 'drugbank_id': 'DrugBank',
        'pubchem_compound_id': 'PubChem CID', 'chemspider_id': 'ChemSpider',
        'name': 'Name', 'metlin_id': 'METLIN', 'biocyc_id': 'BioCyc',
    }
    for tag_name, db_name in tag_map.items():
        element = root.find(f"./{tag_name}")
        if element is not None and element.text and element.text.strip():
            xrefs.setdefault(db_name, []).append(element.text.strip())

    if 'KEGG Compound' in xrefs:
        processed_kegg_ids = []
        for kid_text in xrefs['KEGG Compound']:
            kid = kid_text.strip()
            if re.match(r'^C\d{5}$', kid): 
                processed_kegg_ids.append(kid)
            elif re.match(r'^\d{5}$', kid): 
                processed_kegg_ids.append(f"C{kid}")
        if processed_kegg_ids:
            xrefs['KEGG Compound'] = list(set(processed_kegg_ids))
        else:
            del xrefs['KEGG Compound'] 
    return xrefs, None


def get_kegg_id_from_pubchem_pug_rest_xref(pubchem_cid_str):
    if not pubchem_cid_str or not isinstance(pubchem_cid_str, str) or not pubchem_cid_str.isdigit():
        return None, "Invalid PubChem CID format for PUG REST XRef"

    source_name_encoded = urllib.parse.quote("KEGG Compound", safe='')
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{pubchem_cid_str}/xrefs/{source_name_encoded}/JSON"
    
    data, error = safe_request(url, data_type='json')

    if error:
        if "404" in str(error): 
            print(f"      PUG REST XRef: 'KEGG Compound' source not found for CID {pubchem_cid_str}, trying 'KEGG' source.")
            source_name_encoded_alt = urllib.parse.quote("KEGG", safe='')
            url_alt = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{pubchem_cid_str}/xrefs/{source_name_encoded_alt}/JSON"
            data, error = safe_request(url_alt, data_type='json')
            if error:
                return None, f"PUG REST XRef error (after trying 'KEGG' source): {error}"
        else:
            return None, f"PUG REST XRef error: {error}"

    if not data:
        return None, "No data from PubChem PUG REST XRef"

    try:
        if (data and "InformationList" in data and "Information" in data["InformationList"] and
                data["InformationList"]["Information"]):
            for info_item in data["InformationList"]["Information"]:
                kegg_id_val = info_item.get('KEGG ID') or (info_item.get("XRef")[0] if isinstance(info_item.get("XRef"), list) and info_item.get("XRef") else None)
                if isinstance(kegg_id_val, str) and re.match(r'^C\d{5}$', kegg_id_val.strip()):
                    return kegg_id_val.strip(), None
        return None, "KEGG ID not found in PUG REST XRef response structure"
    except Exception as e:
        return None, f"PUG REST XRef Parse Error: {e}"

def get_kegg_ids_from_pubchem_pug_view(pubchem_cid):
    if not pubchem_cid or not isinstance(pubchem_cid, str) or not pubchem_cid.isdigit():
        return [], "Invalid PubChem CID format"
    url = f"{PUBCHEM_PUG_VIEW_URL}/{urllib.parse.quote(pubchem_cid)}/JSON/"
    data, error = safe_request(url, data_type='json')

    if error:
        if "404" in str(error):
            return [], f"CID {pubchem_cid} not found in PubChem (404)"
        return [], str(error) 
    if not data: 
        return [], "No data received from PubChem PUG View"

    kegg_ids_found = []
    try:
        if "Record" in data and "Section" in data["Record"]:
            for section in data["Record"]["Section"]:
                if section.get("TOCHeading") == "Names and Identifiers":
                    for subsection in section.get("Section", []):
                        if subsection.get("TOCHeading") == "Other Identifiers":
                            info_items = subsection.get("Information", [])
                            for item_group in info_items:
                                identifier_list = []
                                if "Value" in item_group and "StringWithMarkup" in item_group["Value"] and isinstance(item_group["Value"]["StringWithMarkup"], list):
                                    for swm_item in item_group["Value"]["StringWithMarkup"]:
                                        identifier_list.append(swm_item) 
                                elif "Name" in item_group and "StringValue" in item_group: 
                                    identifier_list.append(item_group)
                                else: 
                                    identifier_list.append(item_group)

                                for identifier_info in identifier_list:
                                    if not isinstance(identifier_info, dict): continue
                                    source_name = identifier_info.get("SourceName")
                                    string_value = identifier_info.get("String") or identifier_info.get("StringValue")
                                    url_val = identifier_info.get("URL")
                                    kegg_id_to_add = None

                                    if source_name == "KEGG" and string_value:
                                        potential_id = string_value.strip()
                                        if re.match(r'^C\d{5}$', potential_id): kegg_id_to_add = potential_id
                                        elif re.match(r'^\d{5}$', potential_id): kegg_id_to_add = f"C{potential_id}"
                                    
                                    if not kegg_id_to_add and string_value: 
                                        potential_id = string_value.strip()
                                        if re.match(r'^C\d{5}$', potential_id): kegg_id_to_add = potential_id
                                        elif re.match(r'^\d{5}$', potential_id): kegg_id_to_add = f"C{potential_id}"

                                    if not kegg_id_to_add and url_val and "genome.jp" in url_val and "cpd:" in url_val:
                                        match = re.search(r'cpd:(C\d{5})', url_val)
                                        if match: kegg_id_to_add = match.group(1)
                                    
                                    if kegg_id_to_add and kegg_id_to_add not in kegg_ids_found:
                                        kegg_ids_found.append(kegg_id_to_add)
            if kegg_ids_found:
                return list(set(kegg_ids_found)), None
            else:
                return [], None 
        else:
            return [], "Unexpected JSON structure from PUG View (No Record/Section)"
    except Exception as e:
        return [], f"PUG View Parse Error: {e}"


def fetch_links(endpoint, cpd_id):
    """
    Fetches linked KEGG entries (pathways, reactions, enzymes) for a given KEGG compound ID.
    For pathways, it converts 'map' IDs to 'hsa' (human) IDs.
    """
    if not cpd_id or not isinstance(cpd_id, str) or not re.match(r'^C\d{5}$', cpd_id):
        return [] # Return empty list for invalid compound ID
    
    url = f'{KEGG_REST_URL}/link/{endpoint}/{urllib.parse.quote(cpd_id)}'
    text_data, error = safe_request(url, data_type='text')

    if error:
        # Only print error if it's not a 404, as 404 often means no links exist.
        if not (isinstance(error, str) and "404" in str(error)):
              print(f"        KEGG link error for {endpoint}/{cpd_id}: {error}")
        return []
    
    out = []
    if text_data:
        for line in text_data.strip().splitlines():
            if '\t' not in line: continue # Ensure the line has a tab separator
            _, target_raw = line.split('\t',1) # Split only on the first tab
            target_raw = target_raw.strip() # Clean whitespace

            prefix_map = {'reaction': 'rn:', 'enzyme': 'ec:'} # Pathway handled separately

            if endpoint == 'pathway':
                if target_raw.startswith("path:"):
                    path_id_full = target_raw.split(':',1)[1]
                    if path_id_full.startswith("map"):
                        # Convert mapXXXXX to hsaXXXXX
                        out.append(TARGET_ORGANISM_CODE + path_id_full[3:])
                    elif path_id_full.startswith(TARGET_ORGANISM_CODE):
                        # Already a human pathway ID
                        out.append(path_id_full)
                    # else: # Optionally handle/log other pathway types like koXXXXX if needed
                    # print(f"        Skipping non-map/non-{TARGET_ORGANISM_CODE} pathway: {path_id_full}")
            elif endpoint in prefix_map and target_raw.startswith(prefix_map[endpoint]):
                # For reactions and enzymes, extract the ID after the prefix
                out.append(target_raw.split(':',1)[1])
            elif endpoint == 'enzyme' and re.match(r'^\d+\.\d+\.\d+\.((n)?\d+|(-|([a-zA-Z]\d*)))$', target_raw):
                # Handle EC numbers that might not have the 'ec:' prefix in some KEGG responses
                out.append(target_raw)
                
    return list(set(out)) # Return unique IDs


def get_kegg_id_from_cas_kegg_api(cas_number):
    if not cas_number or not isinstance(cas_number, str) or cas_number.isspace():
        return None, "Invalid CAS number"
    url = f"{KEGG_REST_URL}/find/compound/{urllib.parse.quote(cas_number)}"
    text_data, error = safe_request(url, data_type='text')

    if error:
        if not (isinstance(error, str) and "404" in str(error)):
              print(f"        KEGG find (CAS {cas_number}) error: {error}")
        return None, error 
    
    kegg_ids_found = []
    if text_data:
        lines = text_data.strip().splitlines()
        for line in lines:
            parts = line.split('\t')
            if len(parts) > 0 and parts[0].strip().startswith('cpd:'):
                kegg_id_potential = parts[0].strip().split(':',1)[1]
                if re.match(r'^C\d{5}$', kegg_id_potential):
                    kegg_ids_found.append(kegg_id_potential)
    if kegg_ids_found:
        unique_kegg_ids = list(dict.fromkeys(kegg_ids_found)) 
        return unique_kegg_ids[0], None # Return the first one found
    else:
        return None, None # No KEGG ID found, but no error (e.g., empty response)

# --- Main processing logic ---
def process_metabolites(df_annot):
    results_data = []
    api_delay = 0.2 # Slightly reduced delay, adjust if rate limited

    for i, (index_label, row) in enumerate(df_annot.iterrows()):
        print(f"\n🔄 Processing iteration {i} (DataFrame Index: {index_label}, Compound: {row.get('Compounds', 'N/A')}, Original Index: {row.get('Index', 'N/A')}):")
        hmdb_id_input = str(row.get('HMDB')).strip() if pd.notna(row.get('HMDB')) else None
        pubchem_cid_input_raw = str(row.get('Pubchem CID')).strip() if pd.notna(row.get('Pubchem CID')) else None
        cas_number_input = str(row.get('CAS')).strip() if pd.notna(row.get('CAS')) else None

        hmdb_id = None if hmdb_id_input in ['None', 'nan', ''] else hmdb_id_input
        pubchem_cid_raw = None if pubchem_cid_input_raw in ['None', 'nan', ''] else pubchem_cid_input_raw
        cas_number = None if cas_number_input in ['None', 'nan', ''] else cas_number_input

        pubchem_cid_cleaned = None
        if pubchem_cid_raw:
            if '.' in pubchem_cid_raw:
                try: pubchem_cid_cleaned = str(int(float(pubchem_cid_raw)))
                except ValueError:
                    if pubchem_cid_raw.isdigit(): pubchem_cid_cleaned = pubchem_cid_raw
                    else: print(f"      Warning: PubChem CID '{pubchem_cid_raw}' has non-digits after decimal point processing attempt.")
            elif pubchem_cid_raw.isdigit():
                pubchem_cid_cleaned = pubchem_cid_raw
            else:
                print(f"      Warning: PubChem CID '{pubchem_cid_raw}' is not purely digits and not a simple float.")
        
        print(f"    Inputs: HMDB='{hmdb_id}', PubChem CID (cleaned)='{pubchem_cid_cleaned}', CAS='{cas_number}'")
        kegg_compound_id, source_of_kegg_id, intermediate_id_used = None, None, None

        if hmdb_id:
            print(f"    ➡️ Attempting HMDB ID: {hmdb_id}")
            xrefs_hmdb, error_hmdb = get_hmdb_crossrefs(hmdb_id)
            if not error_hmdb and 'KEGG Compound' in xrefs_hmdb and xrefs_hmdb['KEGG Compound']:
                kegg_compound_id = xrefs_hmdb['KEGG Compound'][0]
                source_of_kegg_id, intermediate_id_used = "HMDB", hmdb_id
                print(f"      ✅ SUCCESS: Found KEGG ID '{kegg_compound_id}' from HMDB ID '{hmdb_id}'.")
            elif error_hmdb:
                print(f"      ❌ HMDB Error for {hmdb_id}: {error_hmdb}")
            else: 
                print(f"      ℹ️ HMDB: No KEGG Compound ID found for {hmdb_id} (or no error reported).")
            time.sleep(api_delay)

        if not kegg_compound_id and pubchem_cid_cleaned:
            print(f"    ➡️ Attempting PubChem CID (via PUG REST XRef then PUG View): {pubchem_cid_cleaned}")
            kegg_id_from_rest, error_pc_rest = get_kegg_id_from_pubchem_pug_rest_xref(pubchem_cid_cleaned)
            if kegg_id_from_rest:
                kegg_compound_id = kegg_id_from_rest
                source_of_kegg_id, intermediate_id_used = "PubChem CID (PUG REST XRef)", pubchem_cid_cleaned
                print(f"      ✅ SUCCESS (PUG REST): Found KEGG ID '{kegg_compound_id}' from PubChem CID '{pubchem_cid_cleaned}'.")
            else:
                print(f"      ℹ️ PUG REST XRef for CID {pubchem_cid_cleaned} failed or no KEGG ID found. Error: {error_pc_rest}. Falling back to PUG View.")
                kegg_ids_from_view, error_pc_view = get_kegg_ids_from_pubchem_pug_view(pubchem_cid_cleaned)
                if kegg_ids_from_view: 
                    kegg_compound_id = kegg_ids_from_view[0] 
                    source_of_kegg_id, intermediate_id_used = "PubChem CID (PUG View)", pubchem_cid_cleaned
                    print(f"      ✅ SUCCESS (PUG View): Found KEGG ID '{kegg_compound_id}' from PubChem CID '{pubchem_cid_cleaned}'. Found: {kegg_ids_from_view}")
                elif error_pc_view: print(f"      ❌ PubChem PUG View Error for CID {pubchem_cid_cleaned}: {error_pc_view}")
                else: print(f"      ℹ️ PubChem PUG View: No KEGG Compound ID found for CID {pubchem_cid_cleaned} (or no error reported).")
            time.sleep(api_delay)

        if not kegg_compound_id and cas_number:
            print(f"    ➡️ Attempting CAS Number (via KEGG find): {cas_number}")
            kegg_id_from_cas, error_cas_kegg = get_kegg_id_from_cas_kegg_api(cas_number)
            if kegg_id_from_cas: 
                kegg_compound_id = kegg_id_from_cas
                source_of_kegg_id, intermediate_id_used = "CAS (KEGG find)", cas_number
                print(f"      ✅ SUCCESS: Found KEGG ID '{kegg_compound_id}' from CAS '{cas_number}' via KEGG API.")
            elif error_cas_kegg: 
                print(f"      ❌ KEGG Find (CAS {cas_number}) Error: {error_cas_kegg}")
            else: 
                print(f"      ℹ️ KEGG Find (CAS {cas_number}): No KEGG Compound ID found (or no error reported).")
            time.sleep(api_delay)
        
        current_result = {
            'input_original_index': row.get('Index', 'N/A'),
            'input_compound_name': row.get('Compounds', 'N/A'),
            'input_hmdb': hmdb_id_input,
            'input_pubchem_cid': pubchem_cid_input_raw,
            'input_cas': cas_number_input,
            'derived_kegg_compound_id': None,
            'kegg_id_source_identifier': None,
            'kegg_id_source_type': None,
            'kegg_reactions': [],
            'kegg_enzymes': [],
            'kegg_pathways': [] # This will now store hsaXXXXX pathways
        }

        if kegg_compound_id:
            print(f"    ℹ️ Fetching KEGG details for Compound ID: {kegg_compound_id} (derived from {source_of_kegg_id}: {intermediate_id_used})")
            current_result.update({
                'derived_kegg_compound_id': kegg_compound_id,
                'kegg_id_source_identifier': intermediate_id_used,
                'kegg_id_source_type': source_of_kegg_id
            })
            # fetch_links for 'pathway' will now return 'hsaXXXXX' IDs
            current_result['kegg_pathways'] = fetch_links('pathway', kegg_compound_id); time.sleep(api_delay)
            current_result['kegg_reactions'] = fetch_links('reaction', kegg_compound_id); time.sleep(api_delay)
            current_result['kegg_enzymes'] = fetch_links('enzyme', kegg_compound_id); 
            print(f"      Kegg Pathways (hsa): {current_result['kegg_pathways'] if current_result['kegg_pathways'] else 'None'}")
            print(f"      Kegg Reactions: {current_result['kegg_reactions'] if current_result['kegg_reactions'] else 'None'}")
            print(f"      Kegg Enzymes: {current_result['kegg_enzymes'] if current_result['kegg_enzymes'] else 'None'}")
        else:
            print(f"    ℹ️ Could not determine KEGG Compound ID for this row. Skipping KEGG detail retrieval.")
        
        results_data.append(current_result)
        if i < len(df_annot) - 1 : 
              print("-" * 70) # Separator line
    return pd.DataFrame(results_data)

if __name__ == '__main__':
    # This script assumes 'annot24' is a pandas DataFrame already loaded 
    # in your notebook's global scope.
    # It will use that 'annot24' DataFrame directly.

    annot24_to_process = None
    if 'annot24' in globals() and isinstance(annot24, pd.DataFrame) and not annot24.empty:
        annot24_to_process = annot24.copy() 
        print(f"INFO: Using existing 'annot24' DataFrame with {len(annot24_to_process)} entries.")
        required_cols = ['Compounds', 'HMDB', 'Pubchem CID', 'CAS', 'Index'] 
        for col in required_cols:
            if col not in annot24_to_process.columns:
                print(f"INFO: Column '{col}' not found in 'annot24'. Adding it with None values.")
                annot24_to_process[col] = None
                if col == 'Compounds': 
                    annot24_to_process[col] = [f"Unnamed Compound {i+1}" for i in range(len(annot24_to_process))]
                if col == 'Index': 
                    annot24_to_process[col] = [f"InputIDX{i+1}" for i in range(len(annot24_to_process))]
    else:
        print("ERROR: 'annot24' DataFrame not found in the global scope or is empty.")
        print("Please ensure 'annot24' is loaded as a pandas DataFrame before running this script.")
        annot24_to_process = pd.DataFrame() 

    if annot24_to_process.empty:
        print("\n❌ No data to process. Exiting.")
    else:
        print("\n🚀 Starting metabolite annotation processing...")
        results_df = process_metabolites(annot24_to_process) 
        
        print("\n\n--- ✅ Processing Complete ---")
        print("📊 Final Results DataFrame:")
        pd.set_option('display.max_columns', None)
        pd.set_option('display.max_colwidth', None) 
        pd.set_option('display.width', 1000) 
        print(results_df)

        if not results_df.empty:
            results_df['KEGG_Hits_Count'] = results_df.apply(
                lambda row: sum([1 for col in ['kegg_pathways', 'kegg_reactions', 'kegg_enzymes'] if row[col]]), axis=1
            )

            print("\n\n--- 📊 Summary of KEGG Annotation Hits ---")
            
            hits_summary_dict = {
                '3/3 hits': [], '2/3 hits': [], '1/3 hits': [], '0/3 hits': []
            }

            if 'KEGG_Hits_Count' in results_df.columns and 'input_compound_name' in results_df.columns:
                grouped_hits = results_df.groupby('KEGG_Hits_Count')['input_compound_name'].apply(list)

                for count, compounds_list in grouped_hits.items():
                    if count == 3: hits_summary_dict['3/3 hits'].extend(compounds_list)
                    elif count == 2: hits_summary_dict['2/3 hits'].extend(compounds_list)
                    elif count == 1: hits_summary_dict['1/3 hits'].extend(compounds_list)
                    elif count == 0: hits_summary_dict['0/3 hits'].extend(compounds_list)
            else:
                print("      ⚠️ Could not generate hits summary due to missing columns in results_df.")

            for hit_category, compounds in hits_summary_dict.items():
                print(f"\n{hit_category}:")
                if compounds:
                    for compound_name in sorted(list(set(compounds))): 
                        print(f"  - {compound_name}")
                else:
                    print("  (None)")
        else:
            print("\n--- No results to summarize for KEGG Annotation Hits ---")

INFO: Using existing 'annot24' DataFrame with 24 entries.

🚀 Starting metabolite annotation processing...

🔄 Processing iteration 0 (DataFrame Index: 0, Compound: D-Erythronolactone, Original Index: MADN0053):
    Inputs: HMDB='HMDB0000349', PubChem CID (cleaned)='5325915', CAS='15667-21-7'
    ➡️ Attempting HMDB ID: HMDB0000349
      ℹ️ HMDB: No KEGG Compound ID found for HMDB0000349 (or no error reported).
    ➡️ Attempting PubChem CID (via PUG REST XRef then PUG View): 5325915
HTTP error for https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5325915/xrefs/KEGG%20Compound/JSON: 400 Client Error: PUGREST.BadRequest for url: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5325915/xrefs/KEGG%20Compound/JSON (Status code: 400)
      ℹ️ PUG REST XRef for CID 5325915 failed or no KEGG ID found. Error: PUG REST XRef error: HTTP Error: 400. Falling back to PUG View.
      ℹ️ PubChem PUG View: No KEGG Compound ID found for CID 5325915 (or no error reported).
    ➡️ Attempting CAS N

<div style="
    background-color: #fff;
    padding: 20px;
    border-radius: 8px;
    box-shadow: 0 0 10px rgba(0,0,0,0.1);
    margin: 20px 0;
">
  <h2 style="
      color: #2c3e50;
      border-bottom: 2px solid #3498db;
      padding-bottom: 5px;
      margin-top: 0;
  ">
    🐍 Metabolite Annotation &amp; KEGG Enrichment Script
  </h2>
  <p style="margin-bottom: 15px; color: #333; line-height: 1.6;">
    This Python script automates the process of fetching and integrating biological data for a list of metabolites. 
    It primarily aims to find <strong>KEGG Compound IDs</strong> and then uses these IDs to retrieve associated 
    <strong>pathways, reactions, and enzymes</strong>. It queries several public databases including HMDB, PubChem, and KEGG.
  </p>

  <div style="
      background-color: #e8f4fd;
      border-left: 4px solid #3498db;
      padding: 10px;
      margin-bottom: 15px;
  ">
    <p style="margin: 0; color: #333; line-height: 1.6;">
      <strong>Input:</strong> The script expects a pandas DataFrame (named <code>annot24</code>) containing metabolite information, 
      ideally with columns like <code>Compounds</code> (name), <code>HMDB</code> (ID), <code>Pubchem CID</code>, and <code>CAS</code> (number).
    </p>
  </div>

  <h3 style="
      color: #3498db;
      margin-top: 20px;
      margin-bottom: 10px;
  ">
    🔑 Key Functionalities:
  </h3>
  <ul style="list-style-type: disc; margin-left: 20px; color: #333; line-height: 1.6;">
    <li><strong>Robust API Requests:</strong>
      <ul style="list-style-type: circle; margin-left: 20px; color: #333;">
        <li>A <code>safe_request</code> function handles API calls to HMDB, KEGG, and PubChem.</li>
        <li>Includes error handling for timeouts, HTTP errors (404, 429), and XML/JSON parsing issues.</li>
        <li>Sets a custom <code>User-Agent</code> for each request.</li>
      </ul>
    </li>
    <li><strong>KEGG ID Retrieval Strategy:</strong> The script uses a hierarchical approach:
      <ol style="margin-left: 20px; color: #333; line-height: 1.6;">
        <li><strong>HMDB First:</strong> If an HMDB ID is provided, it fetches HMDB XML (<code>https://hmdb.ca/metabolites/{hmdb_id}.xml</code>) to find cross-references, including KEGG IDs. Formats KEGG IDs to <code>CXXXXX</code>.</li>
        <li><strong>PubChem Fallback:</strong> If no KEGG ID from HMDB and a PubChem CID is available:
          <ul style="list-style-type: circle; margin-left: 20px; color: #333;">
            <li>Tries PubChem PUG REST XRef service with source “KEGG Compound” or “KEGG.”</li>
            <li>If that fails, uses PubChem PUG View API to parse the “Names and Identifiers” section for KEGG IDs.</li>
            <li>Cleans PubChem CIDs (e.g., converts floats like <code>1234.0</code> to string “1234”).</li>
          </ul>
        </li>
        <li><strong>CAS Number Fallback:</strong> If still no KEGG ID and a CAS number is provided, uses KEGG REST API: 
          <code>http://rest.kegg.jp/find/compound/{cas_number}</code>.
        </li>
      </ol>
    </li>
    <li><strong>KEGG Data Fetching:</strong>
      <ul style="list-style-type: circle; margin-left: 20px; color: #333;">
        <li>Once a KEGG Compound ID (<code>CXXXXX</code>) is obtained, <code>fetch_links</code> retrieves linked entries:
          <ul style="list-style-type: square; margin-left: 20px; color: #333;">
            <li>Pathways (<code>/link/pathway/{cpd_id}</code>): Converts “map” IDs to human‐specific “hsa” IDs (e.g., <code>map00010</code> → <code>hsa00010</code>).</li>
            <li>Reactions (<code>/link/reaction/{cpd_id}</code>).</li>
            <li>Enzymes (<code>/link/enzyme/{cpd_id}</code>).</li>
          </ul>
        </li>
        <li>Handles cases where no links are found (404 responses from KEGG).</li>
      </ul>
    </li>
    <li><strong>Main Processing Loop:</strong>
      <ul style="list-style-type: circle; margin-left: 20px; color: #333;">
        <li><code>process_metabolites</code> iterates through each row of <code>annot24</code>.</li>
        <li>Applies the KEGG ID retrieval strategy with an <code>api_delay</code> between calls to prevent rate‐limit issues.</li>
        <li>Collects input details, derived KEGG ID &amp; source, plus linked pathways, reactions, and enzymes.</li>
      </ul>
    </li>
    <li><strong>Output &amp; Summary:</strong>
      <ul style="list-style-type: circle; margin-left: 20px; color: #333;">
        <li>Compiles results into a new pandas DataFrame.</li>
        <li>Logs progress details during processing.</li>
        <li>Displays the final DataFrame and a summary of how many compounds received each type of KEGG link (pathway, reaction, enzyme).</li>
      </ul>
    </li>
    <li><strong>Execution Guard:</strong> Wrapped in <code>if __name__ == '__main__':</code> so it can be run directly or imported. Verifies that <code>annot24</code> exists and is valid.</li>
  </ul>

  <h3 style="
      color: #3498db;
      margin-top: 20px;
      margin-bottom: 10px;
  ">
    ⚙️ Core Functions:
  </h3>
  <ul style="list-style-type: disc; margin-left: 20px; color: #333; line-height: 1.6;">
    <li><code>safe_request(url, ...)</code>: Performs HTTP GET with robust error handling.</li>
    <li><code>get_hmdb_crossrefs(hmdb_id)</code>: Parses HMDB XML to extract KEGG cross‐references.</li>
    <li><code>get_kegg_id_from_pubchem_pug_rest_xref(pubchem_cid_str)</code>: Uses PubChem PUG REST to find KEGG IDs.</li>
    <li><code>get_kegg_ids_from_pubchem_pug_view(pubchem_cid)</code>: Uses PubChem PUG View API to parse KEGG IDs.</li>
    <li><code>fetch_links(endpoint, cpd_id)</code>: Retrieves KEGG pathways, reactions, or enzymes for a given compound.</li>
    <li><code>get_kegg_id_from_cas_kegg_api(cas_number)</code>: Finds KEGG Compound ID via KEGG API by CAS.</li>
    <li><code>process_metabolites(df_annot)</code>: Coordinates the entire annotation workflow for <code>annot24</code>.</li>
  </ul>

  <div style="
      background-color: #e6f7e9;
      border-left: 4px solid #2ecc71;
      padding: 10px;
      margin-top: 20px;
  ">
    <p style="margin: 0; color: #333; line-height: 1.6;">
      <strong>Outcome:</strong> The primary output is a pandas DataFrame containing the original metabolite identifiers 
      alongside any successfully retrieved KEGG Compound ID, and its associated human pathways, reactions, and enzymes. 
      This enriched dataset is valuable for downstream biological interpretation and pathway analysis.
    </p>
  </div>
</div>


In [128]:
display(results_df)

Unnamed: 0,input_original_index,input_compound_name,input_hmdb,input_pubchem_cid,input_cas,derived_kegg_compound_id,kegg_id_source_identifier,kegg_id_source_type,kegg_reactions,kegg_enzymes,kegg_pathways,KEGG_Hits_Count
0,MADN0053,D-Erythronolactone,HMDB0000349,5325915,15667-21-7,,,,[],[],[],0
1,MADN0166,"1,6-anhydro-β-D-glucose",HMDB0000640,724705,498-07-7,,,,[],[],[],0
2,MADN0220,Deoxyribose 5-phosphate,HMDB0001031,45934311,-,C00673,HMDB0001031,HMDB,"[R02750, R02749, R01066]","[4.1.2.4, 2.7.1.229, 2.7.1.15, 5.4.2.7]","[hsa00030, hsa01100]",3
3,MADN0329,2-Aminobenzenesulfonic acid,-,6926,88-21-1,,,,[],[],[],0
4,MADN0333,Quinoline-4-carboxylic acid,-,10243,486-74-8,,,,[],[],[],0
5,MADN0466,cyclo(glu-glu),-,7408481,16691-00-2,,,,[],[],[],0
6,MADN0498,P-sulfanilic acid,-,8479,121-57-3,,,,[],[],[],0
7,MADP0119,Methylcysteine,HMDB0002108,24417,1187-84-4,,,,[],[],[],0
8,MADP0548,Asp-Arg,-,16122509,-,,,,[],[],[],0
9,MEDN1253,LPI(16:2/0:0),-,,-,,,,[],[],[],0


In [132]:
import pandas as pd

# (Assume annot24 and ReacEnzyPath are already loaded as pandas DataFrames.
#  annot24 has columns ['Compounds', 'SMILES', …]
#  ReacEnzyPath has a column 'input_compound_name' whose values match annot24['Compounds'])

# Option A: Use a simple map() based on setting annot24’s index to 'Compounds'
smiles_map = annot24.set_index('Compounds')['SMILES']

# Create a new column in ReacEnzyPath by mapping input_compound_name → SMILES
ReacEnzyPath['SMILES'] = ReacEnzyPath['input_compound_name'].map(smiles_map)

# Now ReacEnzyPath contains a 'SMILES' column with the correct strings for each metabolite
print(ReacEnzyPath.head())

# Option B (equivalent): merge the two DataFrames on the matching column names
# merged = ReacEnzyPath.merge(
#     annot24[['Compounds', 'SMILES']],
#     left_on='input_compound_name',
#     right_on='Compounds',
#     how='left'
# )
# merged = merged.drop(columns=['Compounds'])  # drop the extra column if you like
# ReacEnzyPath = merged


  input_original_index          input_compound_name   input_hmdb input_pubchem_cid   input_cas derived_kegg_compound_id kegg_id_source_identifier kegg_id_source_type                  kegg_reactions                                     kegg_enzymes             kegg_pathways  KEGG_Hits_Count                                         SMILES
0             MADN0053           D-Erythronolactone  HMDB0000349           5325915  15667-21-7                      NaN                       NaN                 NaN                              []                                               []                        []                0                      C1[C@H]([C@H](C(=O)O1)O)O
1             MADN0166      1,6-anhydro-β-D-glucose  HMDB0000640            724705    498-07-7                      NaN                       NaN                 NaN                              []                                               []                        []                0  C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O

In [133]:
display(ReacEnzyPath)

Unnamed: 0,input_original_index,input_compound_name,input_hmdb,input_pubchem_cid,input_cas,derived_kegg_compound_id,kegg_id_source_identifier,kegg_id_source_type,kegg_reactions,kegg_enzymes,kegg_pathways,KEGG_Hits_Count,SMILES
0,MADN0053,D-Erythronolactone,HMDB0000349,5325915,15667-21-7,,,,[],[],[],0,C1[C@H]([C@H](C(=O)O1)O)O
1,MADN0166,"1,6-anhydro-β-D-glucose",HMDB0000640,724705,498-07-7,,,,[],[],[],0,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O
2,MADN0220,Deoxyribose 5-phosphate,HMDB0001031,45934311,-,C00673,HMDB0001031,HMDB,"['R02750', 'R02749', 'R01066']","['4.1.2.4', '2.7.1.229', '2.7.1.15', '5.4.2.7']","['hsa00030', 'hsa01100']",3,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O
3,MADN0329,2-Aminobenzenesulfonic acid,-,6926,88-21-1,,,,[],[],[],0,C1=CC=C(C(=C1)N)S(=O)(=O)O
4,MADN0333,Quinoline-4-carboxylic acid,-,10243,486-74-8,,,,[],[],[],0,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O
5,MADN0466,cyclo(glu-glu),-,7408481,16691-00-2,,,,[],[],[],0,C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O
6,MADN0498,P-sulfanilic acid,-,8479,121-57-3,,,,[],[],[],0,C1=CC(=CC=C1N)S(=O)(=O)O
7,MADP0119,Methylcysteine,HMDB0002108,24417,1187-84-4,,,,[],[],[],0,CSC[C@@H](C(=O)O)N
8,MADP0548,Asp-Arg,-,16122509,-,,,,[],[],[],0,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N
9,MEDN1253,LPI(16:2/0:0),-,,-,,,,[],[],[],0,CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O-])O[C@H]1[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)[C@@H]1O


In [136]:
ReacEnzyPath.to_csv("ReacEnzyPath.csv", index=False)

In [3]:
## This following code block grabs KEGG’s entire pathway list and filters for anything mentioning “cancer.”
## 1) Fetch all pathways from KEGG.
## 2) For each line, split into the map ID (e.g., “map05210”) and its description (e.g., “Colorectal cancer”).
## 3) Keep only those where the description contains “cancer” (case-insensitive).
## 4) Print how many cancer-related pathways were found and list their IDs with descriptions.

## Example output (17 maps):
##   map05200 → Pathways in cancer
##   map05210 → Colorectal cancer
##   map05224 → Breast cancer
##   … and so on.


In [36]:
import requests

# 1) Download the full pathway list
url = "http://rest.kegg.jp/list/pathway"
resp = requests.get(url, timeout=10)
resp.raise_for_status()

# 2) Parse & filter for “cancer”
cancer_maps = []
for line in resp.text.strip().splitlines():
    if not line.strip():
        continue
    # split only on the first tab
    parts = line.split('\t', 1)
    if len(parts) != 2:
        continue
    code, description = parts
    # code might be "path:map05210" or just "map05210"; take the last part
    map_id = code.split(':')[-1]
    if "cancer" in description.lower():
        cancer_maps.append((map_id, description))

# 3) Report
print(f"Found {len(cancer_maps)} cancer-related maps:\n")
for mid, desc in cancer_maps:
    print(f"  {mid:<8} → {desc}")


Found 17 cancer-related maps:

  map05200 → Pathways in cancer
  map05202 → Transcriptional misregulation in cancer
  map05206 → MicroRNAs in cancer
  map05205 → Proteoglycans in cancer
  map05230 → Central carbon metabolism in cancer
  map05231 → Choline metabolism in cancer
  map05235 → PD-L1 expression and PD-1 checkpoint pathway in cancer
  map05210 → Colorectal cancer
  map05212 → Pancreatic cancer
  map05226 → Gastric cancer
  map05216 → Thyroid cancer
  map05219 → Bladder cancer
  map05215 → Prostate cancer
  map05213 → Endometrial cancer
  map05224 → Breast cancer
  map05222 → Small cell lung cancer
  map05223 → Non-small cell lung cancer


In [2]:
import os
import requests

# --------------------------------------------------------------------------
# CONFIGURATION:
# The list below has been populated with the filenames from your image.
# --------------------------------------------------------------------------
PATHWAY_IDS = [
    "hsa05231",
    "hsa05230",
    "hsa04923",
    "hsa04978",
    "hsa00010",
    "hsa05200",
    "hsa05210",
]

# Set the name of the folder where the files will be saved.
OUTPUT_DIR = "kegg_kgml_files"
# --------------------------------------------------------------------------


def download_kegg_kgml_files(pathway_list, output_directory):
    """
    Downloads raw KGML files for a given list of KEGG pathway IDs.

    Args:
        pathway_list (list): A list of strings, where each string is a
                             KEGG pathway ID (e.g., "hsa01100").
        output_directory (str): The name of the directory to save files to.
    """
    print(f"Starting download process...")
    print(f"Files will be saved in the '{output_directory}' folder.")

    # Create the output directory if it doesn't already exist
    os.makedirs(output_directory, exist_ok=True)

    # Loop through each pathway ID provided in the list
    for pid in pathway_list:
        file_path = os.path.join(output_directory, f"{pid}.xml")

        # Check if the file has already been downloaded
        if os.path.exists(file_path):
            print(f"- Skipping {pid}: Already exists.")
            continue

        # Construct the specific URL for the KEGG API
        url = f"http://rest.kegg.jp/get/{pid}/kgml"
        print(f"- Requesting {pid} from {url}...")

        try:
            # Make the HTTP request to the KEGG API
            response = requests.get(url)
            
            # Raise an error if the request was unsuccessful (e.g., 404 Not Found)
            response.raise_for_status()

            # Write the raw text content to the specified file
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(response.text)
            
            print(f"  -> Successfully saved {file_path}")

        except requests.exceptions.RequestException as e:
            # Handle potential network errors or bad responses
            print(f"  -> ERROR: Failed to download {pid}. Reason: {e}")

    print("\nDownload process finished.")


# --- Main execution block ---
if __name__ == "__main__":
    download_kegg_kgml_files(PATHWAY_IDS, OUTPUT_DIR)

Starting download process...
Files will be saved in the 'kegg_kgml_files' folder.
- Requesting hsa05231 from http://rest.kegg.jp/get/hsa05231/kgml...
  -> Successfully saved kegg_kgml_files\hsa05231.xml
- Requesting hsa05230 from http://rest.kegg.jp/get/hsa05230/kgml...
  -> Successfully saved kegg_kgml_files\hsa05230.xml
- Requesting hsa04923 from http://rest.kegg.jp/get/hsa04923/kgml...
  -> Successfully saved kegg_kgml_files\hsa04923.xml
- Requesting hsa04978 from http://rest.kegg.jp/get/hsa04978/kgml...
  -> Successfully saved kegg_kgml_files\hsa04978.xml
- Requesting hsa00010 from http://rest.kegg.jp/get/hsa00010/kgml...
  -> Successfully saved kegg_kgml_files\hsa00010.xml
- Requesting hsa05200 from http://rest.kegg.jp/get/hsa05200/kgml...
  -> Successfully saved kegg_kgml_files\hsa05200.xml
- Requesting hsa05210 from http://rest.kegg.jp/get/hsa05210/kgml...
  -> Successfully saved kegg_kgml_files\hsa05210.xml

Download process finished.


In [22]:
import os
import requests
from tqdm import tqdm
import time

# List of the 28 unique pathway IDs from your script's output
pids_to_download = [
    'hsa00040', 'hsa00053', 'hsa00140', 'hsa00220', 'hsa00280', 'hsa00330',
    'hsa00340', 'hsa00350', 'hsa00380', 'hsa00410', 'hsa00450', 'hsa00480',
    'hsa00590', 'hsa00750', 'hsa00760', 'hsa00790', 'hsa00830', 'hsa00860',
    'hsa00960', 'hsa00980', 'hsa00982', 'hsa00983', 'hsa01100', 'hsa01110',
    'hsa01120', 'hsa01210', 'hsa01230', 'hsa01240'
]

# The directory where your files are located
output_dir = r"kegg_kgml_files"

# Ensure the directory exists
os.makedirs(output_dir, exist_ok=True)

print(f"Starting download of {len(pids_to_download)} KGML files to '{output_dir}'...")

for pid in tqdm(pids_to_download, desc="Downloading KGML"):
    # Check if the file already exists to avoid re-downloading
    filepath = os.path.join(output_dir, f"{pid}.xml")
    if os.path.exists(filepath):
        # print(f"Skipping {pid}, file already exists.")
        continue

    # The URL to get the KGML file from the KEGG API
    url = f"http://rest.kegg.jp/get/{pid}/kgml"

    try:
        # Be nice to the KEGG server by waiting a bit between requests
        time.sleep(0.2)
        response = requests.get(url, timeout=30)
        response.raise_for_status() # Raise an exception for bad status codes (like 404)

        # KEGG sometimes returns an empty response instead of an error for invalid IDs
        if response.text.strip() == "":
            print(f"Warning: No KGML data returned from API for {pid}. Skipping.")
            continue

        # Save the content to the .xml file
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(response.text)

    except requests.exceptions.HTTPError as e:
        # The same IDs that failed to provide a name might also fail here
        print(f"Warning: Failed to download {pid}. Status code: {e.response.status_code}. Skipping.")
    except Exception as e:
        print(f"An error occurred while downloading {pid}: {e}")

print("Download complete.")
print(f"Your directory '{output_dir}' should now contain the new files.")

Starting download of 28 KGML files to 'C:\Users\vince\Coursework\Masters Project\kegg_kgml_files'...


Downloading KGML:  68%|██████▊   | 19/28 [00:59<00:25,  2.84s/it]



Downloading KGML:  86%|████████▌ | 24/28 [01:31<00:25,  6.28s/it]



Downloading KGML:  89%|████████▉ | 25/28 [01:33<00:15,  5.18s/it]



Downloading KGML: 100%|██████████| 28/28 [01:44<00:00,  3.73s/it]

Download complete.
Your directory 'C:\Users\vince\Coursework\Masters Project\kegg_kgml_files' should now contain the new files.





### 5.2 SMARTS-Guided EC and Pathway Inference from CRC-Enriched Motifs

This section implements a motif-level strategy using CRC-relevant metabolites to infer enzyme classes, KEGG reactions, and pathways by mining substructures and matching them via SMARTS patterns.

In [122]:
import os
import requests
import json
from lxml import etree
from collections import defaultdict

def get_master_reaction_ec_map():
    """
    Downloads the master KEGG file that links all Reaction IDs to EC numbers.
    This is the definitive source of truth.
    """
    print("--- Stage 1: Downloading KEGG's master Reaction-to-EC map ---")
    reaction_to_ec = defaultdict(set)
    url = "http://rest.kegg.jp/link/reaction/ec"
    
    try:
        response = requests.get(url)
        response.raise_for_status()
        
        # The file is a simple two-column (ec -> reaction) text file
        for line in response.text.strip().split('\n'):
            # Line format: "ec:1.1.1.1\trn:R00001"
            parts = line.strip().split('\t')
            if len(parts) == 2:
                ec_id = parts[0].replace('ec:', '')
                reaction_id = parts[1].replace('rn:', '')
                reaction_to_ec[reaction_id].add(ec_id)
        
        print(f"✅ Successfully built map for {len(reaction_to_ec)} reactions with known enzymes.")
        return reaction_to_ec
        
    except requests.exceptions.RequestException as e:
        print(f"FATAL: Could not download the master map from KEGG. Error: {e}")
        return None

def parse_local_files(kegg_dir):
    """
    Stage 2: Parses local KGML files to map Compound IDs to Reaction IDs.
    """
    print("\n--- Stage 2: Parsing local KGML files to map compounds to reactions ---")
    compound_to_reactions = defaultdict(set)
    for filename in sorted(os.listdir(kegg_dir)):
        if not filename.endswith(".xml"): continue
        filepath = os.path.join(kegg_dir, filename)
        try:
            tree = etree.parse(filepath)
            reactions = tree.xpath('//reaction')
            for reaction in reactions:
                reaction_ids = {rn.strip() for rn in reaction.get('name', '').replace('rn:', '').split()}
                compounds = reaction.xpath('.//substrate | .//product')
                for cpd_element in compounds:
                    compound_ids = {cpd.strip() for cpd in cpd_element.get('name', '').replace('cpd:', '').split()}
                    for cpd_id in compound_ids:
                        if cpd_id and '...' not in cpd_id:
                            compound_to_reactions[cpd_id].update(reaction_ids)
        except Exception as e:
            print(f"Warning: Could not process {filename}. Reason: {e}")
            
    print(f"✅ Found {len(compound_to_reactions)} compounds in your local pathway files.")
    return compound_to_reactions

# --- Main Execution ---

# Stage 1: Get the ground truth map from the KEGG server
reaction_to_ec_map = get_master_reaction_ec_map()

if reaction_to_ec_map:
    # Stage 2: Parse local files
    kegg_dir = r"kegg_kgml_files"
    compound_to_reactions_map = parse_local_files(kegg_dir)

    # Stage 3: Combine the maps
    print("\n--- Stage 3: Combining results ---")
    kegg_cpid_to_ec_final = defaultdict(set)
    for cpd_id, reaction_ids in compound_to_reactions_map.items():
        for rn_id in reaction_ids:
            if rn_id in reaction_to_ec_map:
                kegg_cpid_to_ec_final[cpd_id].update(reaction_to_ec_map[rn_id])

    # Final conversion and save
    kegg_cpid_to_ec_final = {k: sorted(list(v)) for k, v in kegg_cpid_to_ec_final.items()}
    output_path = 'kegg_compound_to_ec_map.json'
    with open(output_path, 'w') as f:
        json.dump(kegg_cpid_to_ec_final, f, indent=4)

    print(f"\n--- FINAL RESULTS ---")
    print(f"✅ Done. Found enzyme associations for {len(kegg_cpid_to_ec_final)} unique compounds.")
    print(f"Results saved to '{output_path}'.")

    print("\n--- Sanity Check ---")
    print(f"🔍 Arginine (C00062): {kegg_cpid_to_ec_final.get('C00062', 'Not found')}")
    print(f"🔍 Pyruvate (C00022): {kegg_cpid_to_ec_final.get('C00022', 'Not found')}")
    print(f"🔍 ATP (C00002): {kegg_cpid_to_ec_final.get('C00002', 'Not found')}")

--- Stage 1: Downloading KEGG's master Reaction-to-EC map ---
✅ Successfully built map for 9613 reactions with known enzymes.

--- Stage 2: Parsing local KGML files to map compounds to reactions ---
✅ Found 1516 compounds in your local pathway files.

--- Stage 3: Combining results ---

--- FINAL RESULTS ---
✅ Done. Found enzyme associations for 1442 unique compounds.
Results saved to 'kegg_compound_to_ec_map.json'.

--- Sanity Check ---
🔍 Arginine (C00062): ['1.14.13.39', '1.14.14.47', '2.1.4.1', '3.5.3.1', '3.5.3.6', '4.1.1.19', '4.3.2.1']
🔍 Pyruvate (C00022): ['1.1.1.27', '1.1.1.28', '1.1.1.38', '1.1.1.39', '1.1.1.40', '1.1.1.436', '1.1.2.3', '1.1.2.4', '1.1.2.5', '1.1.5.12', '1.1.99.40', '1.1.99.6', '1.1.99.7', '1.2.1.104', '1.2.1.51', '1.2.4.1', '1.2.7.1', '1.2.7.11', '1.4.1.1', '2.2.1.6', '2.3.1.54', '2.6.1.12', '2.6.1.2', '2.6.1.44', '2.7.1.40', '2.7.9.1', '2.7.9.2', '2.8.1.2', '3.7.1.20', '3.7.1.5', '4.1.1.1', '4.1.1.112', '4.1.3.16', '4.3.1.17', '4.3.1.19', '4.4.1.1', '4.4.1.1

In [None]:
# 🧬 What the above script does:
#
# This script maps KEGG compound IDs (like "C00022" for pyruvate) to the EC numbers of enzymes 
# that catalyze reactions involving them. It does this in **three stages**:
#
# ─────────────────────────────────────────────────────────────────────────────
# Stage 1️⃣: Fetch the global KEGG reaction-to-enzyme (EC number) map
#   - Downloads a flat file from KEGG's REST API.
#   - Each line maps a reaction (e.g., "R00001") to an EC number (e.g., "1.1.1.1").
#   - Builds a dictionary: reaction ID → set of EC numbers.
#
# Stage 2️⃣: Parse your local KEGG pathway XML (KGML) files
#   - For each file, reads all reaction entries.
#   - For each reaction, finds substrates and products (the compounds involved).
#   - Builds a dictionary: compound ID → set of reaction IDs it appears in.
#
# Stage 3️⃣: Join the data to map compounds to enzymes
#   - For each compound, get its reactions (from local files).
#   - For each reaction, get the EC numbers (from KEGG's master list).
#   - Result: compound ID → list of enzyme EC numbers.
#
# The final mapping is saved as a JSON file (`kegg_compound_to_ec_map.json`) and printed for sanity checks
# on well-known metabolites like Arginine, Pyruvate, and ATP.


In [None]:
from lxml import etree
import pandas as pd

# Define the path to your large HMDB XML file
xml_path = r"path\hmdb_metabolites.xml"
ns = {'hmdb': 'http://www.hmdb.ca'}
crc_keywords = {"colorectal cancer", "colon cancer", "rectal cancer", "bowel cancer"}

# This process can take a while, as the file is very large
print(f"🔄 Parsing {xml_path} to find CRC metabolites WITH their KEGG IDs...")
print("This may take several minutes...")

context = etree.iterparse(xml_path, events=("end",), tag="{http://www.hmdb.ca}metabolite")

records = []
count = 0

for event, metabolite in context:
    hmdb_id = metabolite.findtext('hmdb:accession', namespaces=ns)
    smiles = metabolite.findtext('hmdb:smiles', namespaces=ns)
    kegg_id = metabolite.findtext('hmdb:kegg_id', namespaces=ns)

    # We only care about metabolites that have both SMILES and a KEGG ID
    if not (hmdb_id and smiles and kegg_id):
        metabolite.clear()
        continue

    # Look for disease associations
    is_crc = False
    for disease in metabolite.findall('.//hmdb:disease', namespaces=ns):
        name = disease.findtext('hmdb:name', namespaces=ns)
        if name and any(kw in name.lower() for kw in crc_keywords):
            is_crc = True
            break
    
    if is_crc:
        records.append({
            "id": hmdb_id, 
            "smiles": smiles,
            "kegg_id": kegg_id
        })

    # Memory cleanup
    metabolite.clear()
    while metabolite.getprevious() is not None:
        del metabolite.getparent()[0]

    count += 1
    if count % 25000 == 0:
        print(f"  ...parsed {count} metabolites.")

# Final export
crc_df_with_kegg = pd.DataFrame(records).drop_duplicates()
output_csv = 'crc_metabolites_with_kegg_id.csv'
crc_df_with_kegg.to_csv(output_csv, index=False)

print(f"\n🎉 Done. Found {len(crc_df_with_kegg)} CRC-relevant metabolites.")
print(f"Results saved to '{output_csv}'.")

In [2]:
import pandas as pd
import json
from collections import defaultdict

# --- Load the input files we've already created ---

# 1. Your list of CRC-relevant metabolites with their KEGG IDs and SMILES
try:
    crc_df = pd.read_csv('crc_metabolites_with_kegg_id.csv')
    print("✅ Successfully loaded 'crc_metabolites_with_kegg_id.csv'.")
except FileNotFoundError:
    print("❌ FATAL: 'crc_metabolites_with_kegg_id.csv' not found. This should not happen. Please re-run the previous script.")
    exit()

# 2. The master map of compounds to the enzymes that act on them
try:
    with open('kegg_compound_to_ec_map.json', 'r') as f:
        cpd_to_ec_map = json.load(f)
    print("✅ Successfully loaded 'kegg_compound_to_ec_map.json'.")
except FileNotFoundError:
    print("❌ FATAL: 'kegg_compound_to_ec_map.json' not found. Please re-run the script that generates this file.")
    exit()

# --- Main Logic: Group metabolites by EC number ---

# This will be our final dictionary: { "ec_number": [ {metabolite_info}, ... ] }
ec_to_metabolites = defaultdict(list)

print("\n🔄 Grouping your CRC metabolites by the enzymes that act on them...")

for _, row in crc_df.iterrows():
    kegg_id = row.get('kegg_id')
    hmdb_id = row.get('id')
    smiles = row.get('smiles')
    
    if not kegg_id or not pd.notna(kegg_id):
        continue
        
    # Find the EC numbers associated with this metabolite's KEGG ID
    if kegg_id in cpd_to_ec_map:
        ec_numbers = cpd_to_ec_map[kegg_id]
        for ec in ec_numbers:
            metabolite_info = {
                "hmdb_id": hmdb_id,
                "kegg_id": kegg_id,
                "smiles": smiles
            }
            # Add this metabolite's info to the list for this EC number
            ec_to_metabolites[ec].append(metabolite_info)

# --- Save the results to a new JSON file for analysis ---
output_path = 'ec_to_crc_metabolites.json'
with open(output_path, 'w') as f:
    # Sort the dictionary by EC number for consistent ordering
    sorted_ec_to_metabolites = dict(sorted(ec_to_metabolites.items()))
    json.dump(sorted_ec_to_metabolites, f, indent=4)

print(f"\n🎉 Success! Analysis complete.")
print(f"A total of {len(ec_to_metabolites)} enzyme groups were created.")
print(f"Results are saved in '{output_path}'. This file is your guide for the next step.")

✅ Successfully loaded 'crc_metabolites_with_kegg_id.csv'.
✅ Successfully loaded 'kegg_compound_to_ec_map.json'.

🔄 Grouping your CRC metabolites by the enzymes that act on them...

🎉 Success! Analysis complete.
A total of 571 enzyme groups were created.
Results are saved in 'ec_to_crc_metabolites.json'. This file is your guide for the next step.


<h1>🧬 CRC Metabolite-Enzyme Mapping Workflow</h1>

<h2>🎯 Goal</h2>
<p>
Link CRC-associated metabolites to the enzymes (EC numbers) that act on them by combining:
<ul>
    <li>Disease-annotated metabolite data from <strong>HMDB</strong></li>
    <li>Reaction-to-enzyme relationships from <strong>KEGG</strong></li>
</ul>
</p>

---

<h2>📦 Part 1: Extract CRC Metabolites</h2>
<p>
Parse the large <code>hmdb_metabolites.xml</code> file to extract only those metabolites that:
<ul>
    <li>Have a valid <strong>SMILES</strong> structure</li>
    <li>Include a <strong>KEGG compound ID</strong></li>
    <li>Mention a <strong>colorectal cancer-related disease</strong></li>
</ul>
</p>

<p>
These are saved to a filtered file: <code>crc_metabolites_with_kegg_id.csv</code>
</p>

---

<h2>🔗 Part 2: Map to Enzymes via KEGG</h2>
<p>
Using the precomputed <code>kegg_compound_to_ec_map.json</code>, for each CRC metabolite:
<ul>
    <li>Look up its <strong>KEGG ID</strong></li>
    <li>Get the <strong>EC numbers</strong> of enzymes acting on it</li>
    <li>Group the data: <code>EC → list of CRC-relevant metabolites</code></li>
</ul>
</p>

<p>
Output saved to: <code>ec_to_crc_metabolites.json</code>
</p>

---

<h2>📊 Output Format Example</h2>

```json
{
  "1.1.1.1": [
    {
      "hmdb_id": "HMDB00001",
      "kegg_id": "C00022",
      "smiles": "CC(=O)O"
    }
  ]
}


In [3]:
import json
from rdkit import Chem
from rdkit.Chem import rdFMCS
import pandas as pd

# Load the dictionary we just created
try:
    with open('ec_to_crc_metabolites.json', 'r') as f:
        ec_to_metabolites = json.load(f)
except FileNotFoundError:
    print("❌ FATAL: 'ec_to_crc_metabolites.json' not found. Please run the previous script first.")
    exit()

# --- Step 1: Find a promising group to analyze ---
# We'll look for an enzyme group with 3-5 different molecules, which is ideal for MCS.
target_ec = None
promising_groups = []

for ec, metabolites in ec_to_metabolites.items():
    # Get unique SMILES strings to avoid duplicates
    unique_smiles = {met['smiles'] for met in metabolites}
    if 3 <= len(unique_smiles) <= 5:
        promising_groups.append(ec)

if promising_groups:
    target_ec = promising_groups[0] # Just pick the first promising group
    print(f"✅ Found a promising group to analyze under EC Number: {target_ec}")
else:
    print("⚠️ No ideal groups of 3-5 found. Picking the first group with > 1 molecule for demonstration.")
    for ec, metabolites in ec_to_metabolites.items():
        if len({met['smiles'] for met in metabolites}) > 1:
            target_ec = ec
            break

if not target_ec:
    print("❌ Could not find any suitable groups to analyze programmatically.")
    exit()

# --- Step 2: Prepare the molecules for MCS ---
smiles_list = list({met['smiles'] for met in ec_to_metabolites[target_ec]})
mols = [Chem.MolFromSmiles(s) for s in smiles_list]
# Filter out any invalid SMILES
mols = [m for m in mols if m is not None]

print("\n--- Molecules in this group ---")
display_df = pd.DataFrame([
    {"KEGG_ID": met['kegg_id'], "SMILES": met['smiles']}
    for met in ec_to_metabolites[target_ec]
]).drop_duplicates()
print(display_df)

# --- Step 3: Run the MCS Algorithm ---
# We add a timeout in case the calculation is too complex
print("\n🔄 Running Maximum Common Substructure (MCS) algorithm...")
mcs_result = rdFMCS.FindMCS(mols, timeout=30)

print("\n--- MCS Results ---")
if mcs_result.numAtoms > 0:
    mcs_smarts = mcs_result.smartsString
    print(f"✅ MCS Found ({mcs_result.numAtoms} atoms, {mcs_result.numBonds} bonds):")
    print(f"   -> Raw SMARTS Pattern: {mcs_smarts}")
    
    # --- Step 4: Human Interpretation ---
    print("\n--- Analysis of the Result ---")
    print("This raw SMARTS pattern is the largest chemical graph present in all molecules.")
    print("Your task now is to decide if this is a useful 'rule'.")
    print("  - Is it too specific? (e.g., includes parts of the carbon skeleton)")
    print("  - Is it too generic? (e.g., just a single atom)")
    print("Often, this raw pattern needs to be simplified by a human to create a more general and useful motif for your dictionary.")
    
else:
    print("❌ No significant common substructure was found for this group.")

✅ Found a promising group to analyze under EC Number: 1.1.1.14

--- Molecules in this group ---
  KEGG_ID                                 SMILES
0  C00794  OC[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO
1  C00310         OC[C@@]1(O)OC[C@@H](O)[C@@H]1O
2  C00379          OC[C@H](O)[C@@H](O)[C@H](O)CO

🔄 Running Maximum Common Substructure (MCS) algorithm...

--- MCS Results ---
✅ MCS Found (10 atoms, 9 bonds):
   -> Raw SMARTS Pattern: [#8]-[#6]-[#6](-[#8])-[#6](-[#8])-[#6](-[#8])-[#6]-[#8]

--- Analysis of the Result ---
This raw SMARTS pattern is the largest chemical graph present in all molecules.
Your task now is to decide if this is a useful 'rule'.
  - Is it too specific? (e.g., includes parts of the carbon skeleton)
  - Is it too generic? (e.g., just a single atom)
Often, this raw pattern needs to be simplified by a human to create a more general and useful motif for your dictionary.


<h2>🧪 Maximum Common Substructure (MCS) Analysis</h2>

<h3>🎯 Objective</h3>
<p>
Identify the largest shared substructure across CRC-relevant metabolites that are substrates or products of the same enzyme.
This helps uncover conserved chemical motifs that enzymes in a pathway tend to recognize or act on.
</p>

---

<h3>🔍 What This Script Does</h3>
<ol>
    <li><strong>Loads</strong> the enzyme → metabolites mapping from <code>ec_to_crc_metabolites.json</code></li>
    <li><strong>Selects</strong> an enzyme group (<code>EC number</code>) with 3–5 distinct molecules — ideal for MCS comparison</li>
    <li><strong>Prepares</strong> the molecules by:
        <ul>
            <li>Extracting unique <code>SMILES</code> strings</li>
            <li>Converting to RDKit molecular graphs</li>
        </ul>
    </li>
    <li><strong>Runs</strong> RDKit’s <code>FindMCS</code> algorithm to find the largest common chemical pattern (subgraph)</li>
    <li><strong>Outputs</strong> the result as a SMARTS pattern and explains what to consider when interpreting it</li>
</ol>

---

<h3>🧠 Example Interpretation</h3>
<p>
If the output SMARTS is:
<pre><code>[#8]-[#6]-[#6](-[#8])-[#6](-[#8])-[#6](-[#8])-[#6]-[#8]</code></pre>
then this is likely a conserved sugar-like backbone with hydroxyl groups:
<ul>
    <li><code>[#6]</code> = carbon atoms</li>
    <li><code>[#8]</code> = oxygen atoms</li>
    <li>Chains like <code>COH-CHOH-CHOH...</code> suggest polyols (e.g., sugars or sugar alcohols)</li>
</ul>
</p>

---

<h3>🧬 Why This Matters</h3>
<ul>
    <li>Highlights structural features common to enzyme substrates → useful for QSAR, screening, or motif learning</li>
    <li>Lets you define enzyme-specific chemical "signatures"</li>
    <li>Can be used to group or filter unknown metabolites based on structural similarity</li>
</ul>

---

<h3>🚦 Next Step: Use SMARTS for CRC-Relevant Enzyme Prediction</h3>
<p>
Now that we’ve extracted meaningful SMARTS motifs, the next step is to apply them to all CRC-relevant metabolites and 
<strong>predict potential enzyme and pathway associations</strong> using substructure matching.
This turns motifs into functional classifiers that can label unannotated molecules based on chemical similarity.
</p>

<ul>
    <li>Use these patterns in Phase 2 for systematic motif scanning</li>
    <li>Evaluate motif coverage and specificity across the CRC metabolome</li>
    <li>Optionally, refine motifs or apply SMARTS simplification based on overfitting or redundancy</li>
</ul>


In [4]:
import json
import time
import requests
from rdkit import Chem
from rdkit.Chem import rdFMCS
from collections import defaultdict
from tqdm import tqdm

# --- Helper Function to get Pathway info for an EC number ---
# We use a cache to avoid asking the API for the same EC number multiple times
ec_pathway_cache = {}

def get_pathways_for_ec(ec_number):
    """Queries the KEGG API to find pathways associated with an EC number."""
    if ec_number in ec_pathway_cache:
        return ec_pathway_cache[ec_number]

    url = f"http://rest.kegg.jp/get/ec:{ec_number}"
    try:
        response = requests.get(url)
        response.raise_for_status()
        pathways = []
        for line in response.text.strip().split('\n'):
            if line.startswith("PATHWAY"):
                # Pathway line format: "PATHWAY     map00010  Glycolysis / Gluconeogenesis"
                pathway_id = line.split()[1].replace("map", "hsa")
                pathways.append(pathway_id)
        
        ec_pathway_cache[ec_number] = pathways
        time.sleep(0.2) # Be polite to the server
        return pathways
    except requests.exceptions.RequestException:
        return [] # Return empty list on error

# --- Main Script ---
print("--- Starting Fully Programmatic Rule Generation ---")

# 1. Load the grouped metabolites file
try:
    with open('ec_to_crc_metabolites.json', 'r') as f:
        ec_to_metabolites = json.load(f)
    print(f"✅ Loaded {len(ec_to_metabolites)} enzyme groups for analysis.")
except FileNotFoundError:
    print("❌ FATAL: 'ec_to_crc_metabolites.json' not found. Please run the grouping script first.")
    exit()

# This will be our final, programmatically generated dictionary
final_motifs = {}

# 2. Iterate through all enzyme groups
for ec, metabolites in tqdm(ec_to_metabolites.items(), desc="Analyzing Enzyme Groups"):
    
    # --- Filter for promising groups ---
    unique_smiles = list({met['smiles'] for met in metabolites})
    if not (2 <= len(unique_smiles) <= 10):
        continue # Skip groups that are too small or too large

    # --- Prepare molecules and run MCS ---
    mols = [Chem.MolFromSmiles(s) for s in unique_smiles]
    mols = [m for m in mols if m is not None]

    if len(mols) < 2:
        continue

    # Find the Maximum Common Substructure
    mcs_result = rdFMCS.FindMCS(mols, timeout=10)

    # --- Filter for meaningful results and build the rule ---
    # We'll require the common substructure to have at least 3 atoms
    if mcs_result.numAtoms > 2:
        
        # Get the associated pathways
        pathways = get_pathways_for_ec(ec)
        
        if pathways: # Only create a rule if we have pathway information
            motif_name = f"ec_{ec}_motif"
            raw_smarts = mcs_result.smartsString
            
            final_motifs[motif_name] = {
                'smarts': raw_smarts,
                'ec': [ec],
                'pathways': pathways
            }

print(f"\n✅ Successfully generated {len(final_motifs)} automated rules.")

# 3. Save the dictionary to a Python file
output_file = 'smart_motifs_programmatic.py'
with open(output_file, 'w') as f:
    f.write("# This file was generated programmatically.\n")
    f.write("# It contains raw SMARTS patterns derived from Maximum Common Substructure analysis.\n\n")
    f.write("smart_motifs = ")
    # Use json.dumps for pretty printing the dictionary
    f.write(json.dumps(final_motifs, indent=4))

print(f"🎉 Success! Your new rulebook is saved as '{output_file}'.")

--- Starting Fully Programmatic Rule Generation ---
✅ Loaded 571 enzyme groups for analysis.


Analyzing Enzyme Groups: 100%|██████████| 571/571 [13:42<00:00,  1.44s/it]


✅ Successfully generated 245 automated rules.
🎉 Success! Your new rulebook is saved as 'smart_motifs_programmatic.py'.





<h2>🧠 Phase 1: Programmatic Rule Generation from CRC Metabolites</h2>

<h3>🎯 Objective</h3>
<p>
Automatically generate SMARTS-based chemical motifs for enzyme groups that act on colorectal cancer (CRC)-associated metabolites,
by combining <strong>maximum common substructure (MCS)</strong> analysis with <strong>KEGG pathway context</strong>.
</p>

---

<h3>🔬 What This Script Does</h3>

<ol>
    <li><strong>Loads</strong> the <code>ec_to_crc_metabolites.json</code> file containing grouped metabolites by enzyme (EC number).</li>
    <li><strong>Iterates</strong> over each enzyme group and filters for groups that have between 2 and 10 structurally distinct molecules.</li>
    <li><strong>Runs</strong> RDKit's <code>FindMCS</code> to extract a shared chemical substructure (SMARTS pattern) from each group.</li>
    <li><strong>Looks up</strong> KEGG pathways associated with each EC number using the KEGG REST API.</li>
    <li><strong>Stores</strong> only motifs that:
        <ul>
            <li>Are at least 3 atoms large (i.e., non-trivial)</li>
            <li>Have pathway information available from KEGG</li>
        </ul>
    </li>
    <li><strong>Outputs</strong> a dictionary of motifs to a file called <code>smart_motifs_programmatic.py</code></li>
</ol>

---

<h3>📦 Output Structure</h3>
<p>The saved file contains a Python dictionary called <code>smart_motifs</code> with entries like:</p>

```python
smart_motifs = {
    "ec_1.1.1.14_motif": {
        "smarts": "[#8]-[#6]-[#6](-[#8])-[#6](-[#8])-[#6](-[#8])-[#6]-[#8]",
        "ec": ["1.1.1.14"],
        "pathways": ["hsa00010", "hsa00500"]
    },
    ...
}


In [5]:
import pandas as pd
from rdkit import Chem
# *** NOTE: We are importing from your new, programmatically-generated file ***
from smart_motifs_programmatic import smart_motifs

print("--- Phase 2: Applying Your Automated Rules to Make Predictions ---")

# --- Step 1: Load your data and rules ---
try:
    # This is the list of 460 metabolites we need to analyze
    crc_df = pd.read_csv('crc_metabolites_with_kegg_id.csv')
    print(f"✅ Successfully loaded {len(crc_df)} CRC metabolites.")
except FileNotFoundError:
    print("❌ FATAL: 'crc_metabolites_with_kegg_id.csv' not found. Please re-run the script to generate it.")
    exit()

# This assumes your smart_motifs_programmatic.py file is in the same directory
print(f"✅ Successfully loaded {len(smart_motifs)} automated rules.")


# --- Step 2: Define the prediction function ---
def predict_from_smiles(smiles_string):
    """
    Takes a SMILES string and checks it against all rules in the smart_motifs dictionary.
    Returns a list of all hits.
    """
    hits = []
    mol = Chem.MolFromSmiles(smiles_string)
    if mol is None:
        return hits # Return empty list if SMILES is invalid

    for motif_name, rule_data in smart_motifs.items():
        # Ensure the SMARTS pattern is valid before using it
        pattern = Chem.MolFromSmarts(rule_data['smarts'])
        if pattern is None:
            continue
            
        if mol.HasSubstructMatch(pattern):
            # We have a hit! Record the information.
            hits.append({
                'matched_motif': motif_name,
                'predicted_ec': rule_data['ec'],
                'predicted_pathways': rule_data['pathways']
            })
    return hits

# --- Step 3: Apply the function to all 460 metabolites ---
print("\n🔄 Scanning all metabolites against your new rulebook...")

results = []
for _, row in crc_df.iterrows():
    smiles = row['smiles']
    hmdb_id = row['id']
    
    # Get all the predicted hits for the current molecule
    predictions = predict_from_smiles(smiles)
    
    # Format the results for our final DataFrame
    for pred in predictions:
        results.append({
            'HMDB_ID': hmdb_id,
            'SMILES': smiles,
            'Matched_Motif': pred['matched_motif'],
            'Predicted_EC': ", ".join(pred['predicted_ec']), # Join lists for cleaner output
            'Predicted_Pathways': ", ".join(pred['predicted_pathways'])
        })

# Create the final DataFrame of all predictions
predictions_df = pd.DataFrame(results)

print("\n🎉 Prediction Complete! 🎉")
print(f"Found a total of {len(predictions_df)} potential enzyme/pathway associations.")
print("The results are now available in the 'predictions_df' DataFrame.")


# Display the top 20 predictions
print("\n--- Top 20 Predictions ---")
print(predictions_df.head(20).to_string())

# The 'predictions_df' variable now holds your results for any further analysis.

--- Phase 2: Applying Your Automated Rules to Make Predictions ---
✅ Successfully loaded 460 CRC metabolites.
✅ Successfully loaded 245 automated rules.

🔄 Scanning all metabolites against your new rulebook...

🎉 Prediction Complete! 🎉
Found a total of 17096 potential enzyme/pathway associations.
The results are now available in the 'predictions_df' DataFrame.

--- Top 20 Predictions ---
        HMDB_ID            SMILES       Matched_Motif Predicted_EC Predicted_Pathways
0   HMDB0000002             NCCCN  ec_1.14.14.1_motif    1.14.14.1            ec00071
1   HMDB0000002             NCCCN   ec_1.4.3.21_motif     1.4.3.21            ec00260
2   HMDB0000002             NCCCN   ec_1.4.3.22_motif     1.4.3.22            ec00330
3   HMDB0000002             NCCCN   ec_2.3.1.65_motif     2.3.1.65            ec00120
4   HMDB0000002             NCCCN   ec_4.1.1.15_motif     4.1.1.15            ec00250
5   HMDB0000008  CC[C@H](O)C(O)=O   ec_1.1.1.27_motif     1.1.1.27            ec00010
6   HMD

<h2>🚀 Phase 2: Applying SMARTS-Based Rules to CRC Metabolites</h2>

<h3>🎯 Objective</h3>
<p>
Use the chemical motifs (SMARTS patterns) previously derived from enzyme groups to <strong>predict which enzymes and pathways</strong> 
each CRC-associated metabolite may participate in, based on substructure matching.
</p>

---

<h3>🧬 What This Script Does</h3>

<ol>
  <li><strong>Loads</strong> two key files:
    <ul>
      <li><code>crc_metabolites_with_kegg_id.csv</code> — 460 CRC-relevant metabolites with SMILES</li>
      <li><code>smart_motifs_programmatic.py</code> — 245 enzyme/pathway-specific SMARTS rules</li>
    </ul>
  </li>
  
  <li><strong>Defines</strong> a function <code>predict_from_smiles</code> that:
    <ul>
      <li>Converts a SMILES string to an RDKit molecule</li>
      <li>Checks for substructure matches against all motifs</li>
      <li>Returns a list of EC/pathway predictions if matches are found</li>
    </ul>
  </li>
  
  <li><strong>Applies</strong> the rule set to each CRC metabolite and aggregates the matches</li>
</ol>

---

<h3>📊 Result Summary</h3>
<ul>
  <li>Total CRC metabolites processed: <strong>460</strong></li>
  <li>SMARTS rules applied: <strong>245</strong></li>
  <li>Substructure matches found: <strong>17,096</strong> associations</li>
</ul>

---

<h3>🧠 Example Prediction</h3>

Each row in the resulting DataFrame <code>predictions_df</code> represents a predicted match between:
- A CRC metabolite (via <code>HMDB ID</code> and <code>SMILES</code>)
- A known enzyme motif (e.g., <code>ec_1.1.1.27_motif</code>)
- The corresponding <strong>EC number</strong> and associated <strong>KEGG pathway</strong>

```text
| HMDB_ID      | SMILES             | Matched_Motif       | Predicted_EC | Predicted_Pathways |
|--------------|--------------------|----------------------|--------------|---------------------|
| HMDB0000008  | CC[C@H](O)C(O)=O   | ec_1.1.1.27_motif    | 1.1.1.27     | ec00010             |


In [138]:
ReacEnzyPath.head()

Unnamed: 0,input_original_index,input_compound_name,input_hmdb,input_pubchem_cid,input_cas,derived_kegg_compound_id,kegg_id_source_identifier,kegg_id_source_type,kegg_reactions,kegg_enzymes,kegg_pathways,KEGG_Hits_Count,SMILES
0,MADN0053,D-Erythronolactone,HMDB0000349,5325915,15667-21-7,,,,[],[],[],0,C1[C@H]([C@H](C(=O)O1)O)O
1,MADN0166,"1,6-anhydro-β-D-glucose",HMDB0000640,724705,498-07-7,,,,[],[],[],0,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O
2,MADN0220,Deoxyribose 5-phosphate,HMDB0001031,45934311,-,C00673,HMDB0001031,HMDB,"['R02750', 'R02749', 'R01066']","['4.1.2.4', '2.7.1.229', '2.7.1.15', '5.4.2.7']","['hsa00030', 'hsa01100']",3,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O
3,MADN0329,2-Aminobenzenesulfonic acid,-,6926,88-21-1,,,,[],[],[],0,C1=CC=C(C(=C1)N)S(=O)(=O)O
4,MADN0333,Quinoline-4-carboxylic acid,-,10243,486-74-8,,,,[],[],[],0,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O


In [139]:
import pandas as pd

# Assuming you already have ReacEnzyPath loaded as a DataFrame and it contains at least:
#   - 'input_compound_name'   (the compound name)
#   - 'SMILES'                (the corresponding SMILES string)
#   - 'kegg_reactions', 'kegg_enzymes', 'kegg_pathways'  (each stored as a string, e.g. "[]" when empty)

# 1) Create a boolean mask for rows where all three KEGG columns equal the string "[]"
mask_empty_kegg = (
    (ReacEnzyPath['kegg_reactions'] == '[]') &
    (ReacEnzyPath['kegg_enzymes'] == '[]') &
    (ReacEnzyPath['kegg_pathways'] == '[]')
)

# 2) Filter ReacEnzyPath to only those rows
empty_kegg_df = ReacEnzyPath.loc[mask_empty_kegg, ['input_compound_name', 'SMILES']].copy()

# 3) Optionally rename columns for clarity
empty_kegg_df.columns = ['Compound', 'SMILES']

# Now empty_kegg_df contains only the compounds (and their SMILES) for which
# all three KEGG fields were "[]".

print(f"Found {len(empty_kegg_df)} compounds with no KEGG hits:")
print(empty_kegg_df)


Found 15 compounds with no KEGG hits:
                                      Compound                                                                                              SMILES
0                           D-Erythronolactone                                                                           C1[C@H]([C@H](C(=O)O1)O)O
1                      1,6-anhydro-β-D-glucose                                                       C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O
3                  2-Aminobenzenesulfonic acid                                                                          C1=CC=C(C(=C1)N)S(=O)(=O)O
4                  Quinoline-4-carboxylic acid                                                                       C1=CC=C2C(=C1)C(=CC=N2)C(=O)O
5                               cyclo(glu-glu)                                                        C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O
6                            P-sulfanilic acid                                  

In [142]:
empty_df = empty_kegg_df.rename(columns={'Compound': 'input_compound_name'})

In [141]:
# Block 1: Define the Prediction Function

import pandas as pd
from rdkit import Chem
from collections import defaultdict

def run_smarts_predictions(target_df):
    """
    Takes an input DataFrame and runs SMARTS-based predictions on it.

    Args:
        target_df (pd.DataFrame): The input DataFrame. Must contain 'SMILES'
                                  and 'input_compound_name' columns.

    Returns:
        pd.DataFrame: A new DataFrame containing the predictions.
    """
    # --- Step 1: Load the rules ---
    try:
        from smart_motifs_programmatic import smart_motifs
        print(f"✅ Successfully loaded {len(smart_motifs)} automated rules.")
    except ImportError:
        print("❌ FATAL: Could not find 'smart_motifs_programmatic.py'. Please ensure it's in the same directory.")
        return pd.DataFrame() # Return empty DataFrame on error

    print(f"✅ Received {len(target_df)} compounds to analyze.")
    
    # --- Step 2: Prediction Logic ---
    results = []
    print("\n🔄 Scanning compounds against the rulebook...")
    
    for _, row in target_df.iterrows():
        smiles = row['SMILES']
        compound_name = row['input_compound_name']
        mol = Chem.MolFromSmiles(smiles)

        if mol is None:
            # Handle invalid SMILES strings
            results.append({
                'Compound': compound_name, 'SMILES': smiles, 'Matched_Motif': 'Invalid SMILES',
                'Predicted_EC': '', 'Predicted_Pathways': ''
            })
            continue

        hits = []
        for motif_name, rule_data in smart_motifs.items():
            pattern = Chem.MolFromSmarts(rule_data['smarts'])
            if pattern and mol.HasSubstructMatch(pattern):
                hits.append({
                    'matched_motif': motif_name,
                    'predicted_ec': ", ".join(rule_data['ec']),
                    'predicted_pathways': ", ".join(rule_data['pathways'])
                })
        
        if not hits:
            # Handle cases with no hits
            results.append({
                'Compound': compound_name, 'SMILES': smiles, 'Matched_Motif': 'No Hit',
                'Predicted_EC': '', 'Predicted_Pathways': ''
            })
        else:
            for pred in hits:
                results.append({
                    'Compound': compound_name, 'SMILES': smiles,
                    'Matched_Motif': pred['matched_motif'],
                    'Predicted_EC': pred['predicted_ec'],
                    'Predicted_Pathways': pred['predicted_pathways']
                })
    
    # --- Step 3: Create and return the final DataFrame ---
    predictions_df = pd.DataFrame(results)
    print("\n🎉 Prediction Complete! 🎉")
    return predictions_df

In [143]:
# Block 2: Call the function with your DataFrame

# We are assuming your 'empty_df' DataFrame already exists in your environment.
# Now, we call the function we just defined and pass your DataFrame to it.

predictions_for_empty_df = run_smarts_predictions(empty_df)


# --- Display the full results ---
print("\n--- Prediction Results for your DataFrame ---")
print(predictions_for_empty_df.to_string())

✅ Successfully loaded 245 automated rules.
✅ Received 15 compounds to analyze.

🔄 Scanning compounds against the rulebook...

🎉 Prediction Complete! 🎉

--- Prediction Results for your DataFrame ---
                                       Compound                                                                                              SMILES        Matched_Motif Predicted_EC Predicted_Pathways
0                            D-Erythronolactone                                                                           C1[C@H]([C@H](C(=O)O1)O)O    ec_1.1.1.26_motif     1.1.1.26            ec00630
1                            D-Erythronolactone                                                                           C1[C@H]([C@H](C(=O)O1)O)O    ec_1.1.1.27_motif     1.1.1.27            ec00010
2                            D-Erythronolactone                                                                           C1[C@H]([C@H](C(=O)O1)O)O    ec_1.1.1.29_motif     1.1.1.29            ec0026

In [144]:
predictions_for_empty_df.shape

(743, 5)

In [145]:
import pandas as pd
from rdkit import Chem

# We assume your 'predictions_for_empty_df' DataFrame already exists
# and your 'smart_motifs_programmatic.py' file is in the same directory.
try:
    from smart_motifs_programmatic import smart_motifs
except ImportError:
    print("❌ FATAL: Could not find 'smart_motifs_programmatic.py'.")
    exit()

# 1. Define the scoring function
def calculate_plausibility(motif_name):
    """
    Calculates a complexity score for a motif based on its atoms and bonds.
    """
    if motif_name not in smart_motifs:
        return 0
    
    smarts_string = smart_motifs[motif_name]['smarts']
    mol_from_smarts = Chem.MolFromSmarts(smarts_string)
    
    if mol_from_smarts is None:
        return 0
        
    # Score = number of atoms + number of bonds
    score = mol_from_smarts.GetNumAtoms() + mol_from_smarts.GetNumBonds()
    return score

# 2. Apply the scoring function to your DataFrame
print("🔄 Calculating plausibility scores for all 261 predictions...")
predictions_for_empty_df['Plausibility_Score'] = predictions_for_empty_df['Matched_Motif'].apply(calculate_plausibility)
print("✅ Scoring complete.")

# 3. Rank the predictions and show the best ones
print("\n--- Top 3 Most Plausible Predictions Per Compound ---")

# Group by compound, sort by score, and take the top 3 from each group
top_predictions = (
    predictions_for_empty_df
    .sort_values(by='Plausibility_Score', ascending=False)
    .groupby('Compound')
    .head(3)
    .reset_index(drop=True)
)

# Display the final ranked list
print(top_predictions.to_string())

🔄 Calculating plausibility scores for all 261 predictions...
✅ Scoring complete.

--- Top 3 Most Plausible Predictions Per Compound ---
                                      Compound                                                                                              SMILES        Matched_Motif Predicted_EC Predicted_Pathways  Plausibility_Score
0                                   Cytarabine                                                 C1=CN(C(=O)N=C1N)[C@H]2[C@H]([C@@H]([C@H](O2)CO)O)O    ec_3.1.3.91_motif     3.1.3.91            ec00240                  35
1                                   Cytarabine                                                 C1=CN(C(=O)N=C1N)[C@H]2[C@H]([C@@H]([C@H](O2)CO)O)O   ec_2.7.1.213_motif    2.7.1.213            ec00240                  33
2                                   Cytarabine                                                 C1=CN(C(=O)N=C1N)[C@H]2[C@H]([C@@H]([C@H](O2)CO)O)O    ec_2.7.1.74_motif     2.7.1.74            ec00240     

In [5]:
# This above code assigns a “plausibility” score to each SMARTS-based prediction, allowing us to identify
# which predicted motifs are most chemically precise. The steps are as follows:
#
# 1) Import necessary libraries:
#    - pandas for DataFrame operations
#    - RDKit’s Chem module for parsing and analyzing SMARTS patterns
#    - smart_motifs from a local file, which provides each motif’s SMARTS, associated EC numbers, and pathways
#
# 2) Define calculate_plausibility(motif_name):
#    - Retrieve the SMARTS string corresponding to motif_name from the smart_motifs dictionary
#    - Convert that SMARTS string into an RDKit Mol object; if the pattern is invalid or missing, return 0
#    - Compute a score as the sum of the number of atoms and bonds in the motif’s substructure
#    - Return this integer score, which reflects how structurally detailed the motif is
#
# 3) Compute scores for all predictions:
#    - Add a new column, “Plausibility_Score,” to the existing predictions_for_empty_df DataFrame
#    - For each row, use calculate_plausibility to evaluate the matched motif’s complexity
#
# 4) Identify top predictions per compound:
#    - Sort the DataFrame in descending order by Plausibility_Score
#    - Group entries by “Compound” and select the top three rows for each group
#    - Print this subset so that, for each compound, the three highest-scoring SMARTS matches are displayed
#
# In summary, by quantifying each matched motif’s complexity, we can rank the predictions and highlight
# the most chemically plausible enzyme and pathway inferences for each compound.


In [146]:
import pandas as pd
import requests
from collections import defaultdict
from tqdm import tqdm # Import the tqdm library

def enrich_and_collapse_predictions(predictions_df):
    """
    Takes a DataFrame of predictions, enriches it with KEGG Reaction IDs,
    and collapses it to one unique row per compound. Now includes a progress bar.
    """
    
    # --- Helper sub-function to get the master EC-to-Reaction map ---
    def get_ec_to_reaction_map():
        print("--- Step 1: Downloading KEGG's master EC-to-Reaction map ---")
        ec_to_reaction = defaultdict(set)
        url = "http://rest.kegg.jp/link/reaction/ec"
        try:
            response = requests.get(url)
            response.raise_for_status()
            for line in response.text.strip().split('\n'):
                parts = line.strip().split('\t')
                if len(parts) == 2:
                    ec_id = parts[0].replace('ec:', '')
                    reaction_id = parts[1].replace('rn:', '')
                    ec_to_reaction[ec_id].add(reaction_id)
            print(f"✅ Successfully built map for {len(ec_to_reaction)} enzymes.")
            return ec_to_reaction
        except requests.exceptions.RequestException as e:
            print(f"FATAL: Could not download the master map from KEGG. Error: {e}")
            return None

    # --- Main function logic starts here ---
    
    # 1. Get the enrichment data from KEGG
    ec_reaction_map = get_ec_to_reaction_map()
    if not ec_reaction_map:
        return None 

    print("\n--- Step 2: Enriching predictions with Reaction IDs and formatting pathways ---")
    enriched_data = []
    
    # *** ADDED PROGRESS BAR aroud the loop ***
    for _, row in tqdm(predictions_df.iterrows(), total=len(predictions_df), desc="Enriching Predictions"):
        ec_number = row['Predicted_EC']
        reaction_ids = ec_reaction_map.get(ec_number, set())
        
        enriched_data.append({
            'Compound': row['Compound'],
            'SMILES': row['SMILES'],
            'Predicted_EC': ec_number,
            'Predicted_Pathway': row.get('Predicted_Pathways', '').replace('ec', 'hsa'),
            'KEGG_Reaction_IDs': ", ".join(sorted(list(reaction_ids)))
        })

    enriched_df = pd.DataFrame(enriched_data)

    # 3. Group by Compound and SMILES, then aggregate the results
    print("\n--- Step 3: Collapsing data into one row per compound ---")
    aggregation_rules = {
        'Predicted_EC': lambda x: sorted(list(x.unique())),
        'Predicted_Pathway': lambda x: sorted(list(x.unique())),
        'KEGG_Reaction_IDs': lambda x: sorted(list(set(r_id for s in x if s for r_id in s.split(', '))))
    }
    final_collapsed_df = (
        enriched_df
        .groupby(['Compound', 'SMILES'])
        .agg(aggregation_rules)
        .reset_index()
    )
    
    print("\n🎉 Process Complete! 🎉")
    return final_collapsed_df

In [147]:
# Assuming your 'top_predictions' DataFrame already exists...
final_summary_df = enrich_and_collapse_predictions(top_predictions)

# Display the final, collapsed results
if final_summary_df is not None:
    print("\n--- Final Collapsed Results ---")
    print(final_summary_df.to_string())

--- Step 1: Downloading KEGG's master EC-to-Reaction map ---
✅ Successfully built map for 5992 enzymes.

--- Step 2: Enriching predictions with Reaction IDs and formatting pathways ---


Enriching Predictions: 100%|██████████| 41/41 [00:00<00:00, 20491.71it/s]


--- Step 3: Collapsing data into one row per compound ---

🎉 Process Complete! 🎉

--- Final Collapsed Results ---
                                      Compound                                                                                              SMILES                     Predicted_EC               Predicted_Pathway                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 KEGG_Reaction_IDs
0                      1,6-anhydro-β-D-glucose                                                       C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)




In [7]:
# The above script takes a table of SMARTS-based predictions (one row per matched motif) and produces
# a cleaned-up summary with one row per compound. In plain terms, it does the following:
#
# 1) Downloads KEGG’s master map of EC numbers to reaction IDs:
#    - We call KEGG’s “link/reaction/ec” endpoint and build a dictionary: EC_number → set of reaction IDs.
#
# 2) Walks through each prediction and “enriches” it by looking up which reactions that EC number participates in:
#    - For each row (compound, SMILES, predicted EC, predicted pathways), we replace the EC with its linked
#      KEGG reaction IDs (comma-separated). We also adjust pathway IDs (e.g., “ec00010” → “hsa00010”) to match KEGG style.
#
# 3) Collapses everything so we end up with one row per (Compound, SMILES):
#    - Combine all EC numbers, pathways, and reaction IDs seen for that compound into sorted, unique lists.
#    - The result is a DataFrame where each compound appears only once, and each column holds all
#      the ECs, pathways, and reactions that were predicted.
#
# The final DataFrame is saved to “kegg_compound_to_ec_map.json” (via JSON dump) and printed so
# you can see, for example, which enzymes (EC numbers) and reactions are associated with Arginine, Pyruvate, ATP, etc.


In [148]:
display(final_summary_df)

Unnamed: 0,Compound,SMILES,Predicted_EC,Predicted_Pathway,KEGG_Reaction_IDs
0,"1,6-anhydro-β-D-glucose",C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O,"[1.1.1.307, 1.1.1.9, 1.1.3.41]",[hsa00040],"[R01431, R01896, R07152, R09477, R11618, R11620]"
1,2-Aminobenzenesulfonic acid,C1=CC=C(C(=C1)N)S(=O)(=O)O,[1.14.14.1],[hsa00071],"[R01842, R02354, R02355, R02356, R02503, R03088, R03089, R03090, R03408, R03629, R03697, R04121, R04122, R05259, R07000, R07001, R07021, R07022, R07042, R07043, R07044, R07045, R07046, R07048, R07050, R07051, R07052, R07054, R07055, R07056, R07079, R07080, R07081, R07085, R07087, R07098, R07099, R07939, R07943, R07945, R08265, R08267, R08270, R08286, R08287, R08293, R08294, R08312, R08343, R08344, R08345, R08390, R08391, R08392, R09404, R09405, R09406, R09407, R09408, R09416, R09418, R09421, R09423, R09424, R09425, R09442]"
2,Asp-Arg,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N,"[2.1.3.3, 3.5.1.38, 3.5.3.7]","[hsa00220, hsa00330]","[R00256, R00485, R01398, R01579, R01990, R06134]"
3,Carnitine C7:DC,CCCCCCCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C,"[2.7.1.32, 3.1.2.21, 3.5.1.12]","[hsa00061, hsa00552, hsa00780]","[R01021, R01077, R04014, R08157, R08158, R12329]"
4,Cyclo(Phe-Glu),C1=CC=C(C=C1)C[C@H]2C(=O)N[C@@H](C(=O)N2)CCC(=O)O,"[1.4.1.13, 1.4.1.14, 4.1.1.28]","[hsa00250, hsa00350]","[R00093, R00114, R00243, R00248, R00256, R00685, R00699, R00736, R02080, R02701, R04909]"
5,Cytarabine,C1=CN(C(=O)N=C1N)[C@H]2[C@H]([C@@H]([C@H](O2)CO)O)O,"[2.7.1.213, 2.7.1.74, 3.1.3.91]",[hsa00240],"[R00511, R00513, R00964, R01666, R02321, R10546]"
6,D-Erythronolactone,C1[C@H]([C@H](C(=O)O1)O)O,"[1.1.1.29, 1.1.1.30, 1.4.1.9]","[hsa00260, hsa00280, hsa00650]","[R00145, R00146, R00717, R01088, R01361, R01388, R01434, R02196]"
7,LPC(13:0/0:0),CCCCCCCCCCCCC(=O)OC[C@H](COP(=O)([O-])OCC[N+](C)(C)C)O,"[3.1.2.14, 3.1.2.21, 3.5.1.12]","[hsa00061, hsa00780]","[R01077, R01706, R02814, R04014, R08157, R08158, R08159, R08162, R08163, R12329]"
8,LPE(17:1/0:0),CCCCCCC/C=C\CCCCCCCC(=O)OC[C@H](COP(=O)(O)OCCN)O,"[2.7.1.30, 3.1.2.21, 3.5.1.12]","[hsa00061, hsa00561, hsa00780]","[R00847, R01077, R04014, R08157, R08158, R12329]"
9,LPI(16:2/0:0),CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O-])O[C@H]1[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)[C@@H]1O,"[1.13.11.33, 3.1.1.4, 3.1.3.25]","[hsa00521, hsa00564, hsa00590]","[R01185, R01186, R01187, R01313, R01315, R01317, R01593, R02053, R03107, R03626, R07064, R07343, R07379, R07387, R07859]"


In [149]:
import pandas as pd
import numpy as np

# We assume 'final_summary_df' and 'ReacEnzyPath' are already loaded.

# --- Step 1: Prepare the new data for the update ---
# Select the source columns and set the compound name as the index
update_data = final_summary_df[['Compound', 'KEGG_Reaction_IDs', 'Predicted_EC', 'Predicted_Pathway']].set_index('Compound')

# Rename the columns to exactly match the destination columns in ReacEnzyPath
update_data.columns = ['kegg_reactions', 'kegg_enzymes', 'kegg_pathways']


# --- Step 2: Prepare the target DataFrame and perform the update ---
# Set the index of your main DataFrame to allow for a name-based update
ReacEnzyPath.set_index('input_compound_name', inplace=True)

# The .update() method modifies the DataFrame in place.
# It finds rows in ReacEnzyPath with the same index as update_data
# and fills in the values from update_data into the specified columns.
ReacEnzyPath.update(update_data)

# --- Step 3: Reset the index to return the DataFrame to its original structure ---
ReacEnzyPath.reset_index(inplace=True)


# --- Display the updated results for verification ---
print("✅ 'ReacEnzyPath' has been updated.")
print("\n--- Verifying the updated rows: ---")

# Get the list of names from your summary to filter and view the updated rows
updated_names = final_summary_df['Compound'].tolist()
print(ReacEnzyPath[ReacEnzyPath['input_compound_name'].isin(updated_names)].to_string())

✅ 'ReacEnzyPath' has been updated.

--- Verifying the updated rows: ---
                           input_compound_name input_original_index   input_hmdb input_pubchem_cid   input_cas derived_kegg_compound_id kegg_id_source_identifier kegg_id_source_type                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    kegg_reactions                     kegg_enzymes                   kegg_pathways  KEGG_Hits_Count                                                                                              SMILES
0                           D-Eryt

In [150]:
ReacEnzyPath

Unnamed: 0,input_compound_name,input_original_index,input_hmdb,input_pubchem_cid,input_cas,derived_kegg_compound_id,kegg_id_source_identifier,kegg_id_source_type,kegg_reactions,kegg_enzymes,kegg_pathways,KEGG_Hits_Count,SMILES
0,D-Erythronolactone,MADN0053,HMDB0000349,5325915,15667-21-7,,,,"[R00145, R00146, R00717, R01088, R01361, R01388, R01434, R02196]","[1.1.1.29, 1.1.1.30, 1.4.1.9]","[hsa00260, hsa00280, hsa00650]",0,C1[C@H]([C@H](C(=O)O1)O)O
1,"1,6-anhydro-β-D-glucose",MADN0166,HMDB0000640,724705,498-07-7,,,,"[R01431, R01896, R07152, R09477, R11618, R11620]","[1.1.1.307, 1.1.1.9, 1.1.3.41]",[hsa00040],0,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O
2,Deoxyribose 5-phosphate,MADN0220,HMDB0001031,45934311,-,C00673,HMDB0001031,HMDB,"['R02750', 'R02749', 'R01066']","['4.1.2.4', '2.7.1.229', '2.7.1.15', '5.4.2.7']","['hsa00030', 'hsa01100']",3,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O
3,2-Aminobenzenesulfonic acid,MADN0329,-,6926,88-21-1,,,,"[R01842, R02354, R02355, R02356, R02503, R03088, R03089, R03090, R03408, R03629, R03697, R04121, R04122, R05259, R07000, R07001, R07021, R07022, R07042, R07043, R07044, R07045, R07046, R07048, R07050, R07051, R07052, R07054, R07055, R07056, R07079, R07080, R07081, R07085, R07087, R07098, R07099, R07939, R07943, R07945, R08265, R08267, R08270, R08286, R08287, R08293, R08294, R08312, R08343, R08344, R08345, R08390, R08391, R08392, R09404, R09405, R09406, R09407, R09408, R09416, R09418, R09421, R09423, R09424, R09425, R09442]",[1.14.14.1],[hsa00071],0,C1=CC=C(C(=C1)N)S(=O)(=O)O
4,Quinoline-4-carboxylic acid,MADN0333,-,10243,486-74-8,,,,"[R00366, R00372, R00635, R01340, R01709, R02125, R02457, R02657, R02894, R02923, R03871, R04085, R04221, R04904, R05861, R06124, R07400, R08349, R08384, R08408]","[1.2.3.1, 1.4.3.3, 2.6.1.4]","[hsa00260, hsa00280]",0,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O
5,cyclo(glu-glu),MADN0466,-,7408481,16691-00-2,,,,"[R00093, R00114, R00243, R00248, R00256, R01990]","[1.4.1.13, 1.4.1.14, 3.5.3.7]","[hsa00250, hsa00330]",0,C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O
6,P-sulfanilic acid,MADN0498,-,8479,121-57-3,,,,"[R01842, R02354, R02355, R02356, R02503, R03088, R03089, R03090, R03408, R03629, R03697, R04121, R04122, R05259, R07000, R07001, R07021, R07022, R07042, R07043, R07044, R07045, R07046, R07048, R07050, R07051, R07052, R07054, R07055, R07056, R07079, R07080, R07081, R07085, R07087, R07098, R07099, R07939, R07943, R07945, R08265, R08267, R08270, R08286, R08287, R08293, R08294, R08312, R08343, R08344, R08345, R08390, R08391, R08392, R09404, R09405, R09406, R09407, R09408, R09416, R09418, R09421, R09423, R09424, R09425, R09442]",[1.14.14.1],[hsa00071],0,C1=CC(=CC=C1N)S(=O)(=O)O
7,Methylcysteine,MADP0119,HMDB0002108,24417,1187-84-4,,,,"[R00891, R00894, R01289, R01290, R02743, R03749, R04942, R10993]","[4.2.1.22, 4.3.2.9, 6.3.2.2]","[hsa00260, hsa00270, hsa00480]",0,CSC[C@@H](C(=O)O)N
8,Asp-Arg,MADP0548,-,16122509,-,,,,"[R00256, R00485, R01398, R01579, R01990, R06134]","[2.1.3.3, 3.5.1.38, 3.5.3.7]","[hsa00220, hsa00330]",0,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N
9,LPI(16:2/0:0),MEDN1253,-,,-,,,,"[R01185, R01186, R01187, R01313, R01315, R01317, R01593, R02053, R03107, R03626, R07064, R07343, R07379, R07387, R07859]","[1.13.11.33, 3.1.1.4, 3.1.3.25]","[hsa00521, hsa00564, hsa00590]",0,CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O-])O[C@H]1[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)[C@@H]1O


In [151]:
ReacEnzyPath.to_csv('ReacEnzyPath.csv', index=False)

In [152]:
import pandas as pd
ReacEnzyPath = pd.read_csv('ReacEnzyPath.csv')
ReacEnzyPath.head(24)


Unnamed: 0,input_compound_name,input_original_index,input_hmdb,input_pubchem_cid,input_cas,derived_kegg_compound_id,kegg_id_source_identifier,kegg_id_source_type,kegg_reactions,kegg_enzymes,kegg_pathways,KEGG_Hits_Count,SMILES
0,D-Erythronolactone,MADN0053,HMDB0000349,5325915,15667-21-7,,,,"['R00145', 'R00146', 'R00717', 'R01088', 'R01361', 'R01388', 'R01434', 'R02196']","['1.1.1.29', '1.1.1.30', '1.4.1.9']","['hsa00260', 'hsa00280', 'hsa00650']",0,C1[C@H]([C@H](C(=O)O1)O)O
1,"1,6-anhydro-β-D-glucose",MADN0166,HMDB0000640,724705,498-07-7,,,,"['R01431', 'R01896', 'R07152', 'R09477', 'R11618', 'R11620']","['1.1.1.307', '1.1.1.9', '1.1.3.41']",['hsa00040'],0,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O
2,Deoxyribose 5-phosphate,MADN0220,HMDB0001031,45934311,-,C00673,HMDB0001031,HMDB,"['R02750', 'R02749', 'R01066']","['4.1.2.4', '2.7.1.229', '2.7.1.15', '5.4.2.7']","['hsa00030', 'hsa01100']",3,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O
3,2-Aminobenzenesulfonic acid,MADN0329,-,6926,88-21-1,,,,"['R01842', 'R02354', 'R02355', 'R02356', 'R02503', 'R03088', 'R03089', 'R03090', 'R03408', 'R03629', 'R03697', 'R04121', 'R04122', 'R05259', 'R07000', 'R07001', 'R07021', 'R07022', 'R07042', 'R07043', 'R07044', 'R07045', 'R07046', 'R07048', 'R07050', 'R07051', 'R07052', 'R07054', 'R07055', 'R07056', 'R07079', 'R07080', 'R07081', 'R07085', 'R07087', 'R07098', 'R07099', 'R07939', 'R07943', 'R07945', 'R08265', 'R08267', 'R08270', 'R08286', 'R08287', 'R08293', 'R08294', 'R08312', 'R08343', 'R08344', 'R08345', 'R08390', 'R08391', 'R08392', 'R09404', 'R09405', 'R09406', 'R09407', 'R09408', 'R09416', 'R09418', 'R09421', 'R09423', 'R09424', 'R09425', 'R09442']",['1.14.14.1'],['hsa00071'],0,C1=CC=C(C(=C1)N)S(=O)(=O)O
4,Quinoline-4-carboxylic acid,MADN0333,-,10243,486-74-8,,,,"['R00366', 'R00372', 'R00635', 'R01340', 'R01709', 'R02125', 'R02457', 'R02657', 'R02894', 'R02923', 'R03871', 'R04085', 'R04221', 'R04904', 'R05861', 'R06124', 'R07400', 'R08349', 'R08384', 'R08408']","['1.2.3.1', '1.4.3.3', '2.6.1.4']","['hsa00260', 'hsa00280']",0,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O
5,cyclo(glu-glu),MADN0466,-,7408481,16691-00-2,,,,"['R00093', 'R00114', 'R00243', 'R00248', 'R00256', 'R01990']","['1.4.1.13', '1.4.1.14', '3.5.3.7']","['hsa00250', 'hsa00330']",0,C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O
6,P-sulfanilic acid,MADN0498,-,8479,121-57-3,,,,"['R01842', 'R02354', 'R02355', 'R02356', 'R02503', 'R03088', 'R03089', 'R03090', 'R03408', 'R03629', 'R03697', 'R04121', 'R04122', 'R05259', 'R07000', 'R07001', 'R07021', 'R07022', 'R07042', 'R07043', 'R07044', 'R07045', 'R07046', 'R07048', 'R07050', 'R07051', 'R07052', 'R07054', 'R07055', 'R07056', 'R07079', 'R07080', 'R07081', 'R07085', 'R07087', 'R07098', 'R07099', 'R07939', 'R07943', 'R07945', 'R08265', 'R08267', 'R08270', 'R08286', 'R08287', 'R08293', 'R08294', 'R08312', 'R08343', 'R08344', 'R08345', 'R08390', 'R08391', 'R08392', 'R09404', 'R09405', 'R09406', 'R09407', 'R09408', 'R09416', 'R09418', 'R09421', 'R09423', 'R09424', 'R09425', 'R09442']",['1.14.14.1'],['hsa00071'],0,C1=CC(=CC=C1N)S(=O)(=O)O
7,Methylcysteine,MADP0119,HMDB0002108,24417,1187-84-4,,,,"['R00891', 'R00894', 'R01289', 'R01290', 'R02743', 'R03749', 'R04942', 'R10993']","['4.2.1.22', '4.3.2.9', '6.3.2.2']","['hsa00260', 'hsa00270', 'hsa00480']",0,CSC[C@@H](C(=O)O)N
8,Asp-Arg,MADP0548,-,16122509,-,,,,"['R00256', 'R00485', 'R01398', 'R01579', 'R01990', 'R06134']","['2.1.3.3', '3.5.1.38', '3.5.3.7']","['hsa00220', 'hsa00330']",0,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N
9,LPI(16:2/0:0),MEDN1253,-,,-,,,,"['R01185', 'R01186', 'R01187', 'R01313', 'R01315', 'R01317', 'R01593', 'R02053', 'R03107', 'R03626', 'R07064', 'R07343', 'R07379', 'R07387', 'R07859']","['1.13.11.33', '3.1.1.4', '3.1.3.25']","['hsa00521', 'hsa00564', 'hsa00590']",0,CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O-])O[C@H]1[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)[C@@H]1O


In [9]:
# In this following section, we perform three different inference strategies to predict KEGG pathways
# for the compound “Quinoline-2-carboxylic acid” (SMILES: C1=CC=C2C(=C1)C=CC(=N2)C(=O)O) and enzyme EC number 1.3.99.18:
#
# 1) infer_pathways_from_structure:
#    • Load our SMARTS-based motif rules (smart_motifs_programmatic.py).
#    • Convert the input SMILES into an RDKit Mol object.
#    • For each motif, check if the molecule’s substructure matches the SMARTS pattern.
#    • Collect any associated KEGG pathway IDs (replace “ec” prefixes with “hsa”).
#
# 2) infer_pathways_from_enzyme_family:
#    • Take the first three levels of the EC number (“1.3.99”) to define its enzyme sub-subclass.
#    • Query KEGG’s REST API (link/pathway/ec) for all pathways linked to enzymes in that subclass.
#
# 3) infer_pathways_from_similar_compounds:
#    • Perform a KEGG similarity search (simcomp) at 85% similarity for the input SMILES.
#    • If similar compounds are found, request their pathway mappings (link/pathway/compound).
#    • Collect any unique KEGG pathway IDs from those compounds.
#
# We then print an “INFERENCE SUMMARY” showing:
#    – The compound name and enzyme of interest.
#    – Lists of pathways predicted by each strategy.
#    – A consensus list of pathways appearing in more than one strategy (if any).
#
# Finally, we update an existing DataFrame ReacEnzyPath by:
#    • Locating the row where 'input_compound_name' == "Quinoline-2-carboxylic acid".
#    • Writing our inferred pathway list (e.g., ['hsa00830']) into its 'kegg_pathways' column.
#    • Printing the entire DataFrame to verify that only the target row has been modified.


In [153]:
import pandas as pd
import requests
from rdkit import Chem
from collections import defaultdict

# --- INPUTS for the compound being investigated ---
COMPOUND_SMILES = "C1=CC=C2C(=C1)C=CC(=N2)C(=O)O"
EC_NUMBER = "1.3.99.18"
# ---------------------------------------------------

def infer_pathways_from_structure(smiles_string):
    """Uses our programmatically generated SMARTS rules to predict pathways."""
    print("--- Strategy 1: Inferring from Compound Structure ---")
    try:
        from smart_motifs_programmatic import smart_motifs
    except ImportError:
        print("❌ Could not find 'smart_motifs_programmatic.py'. Aborting this strategy.")
        return []
    mol = Chem.MolFromSmiles(smiles_string)
    if mol is None: return []
    predicted_pathways = set()
    for motif_name, rule_data in smart_motifs.items():
        pattern = Chem.MolFromSmarts(rule_data['smarts'])
        if pattern and mol.HasSubstructMatch(pattern):
            for path in rule_data.get('pathways', []):
                predicted_pathways.add(path.replace('ec', 'hsa'))
    return sorted(list(predicted_pathways))

def infer_pathways_from_enzyme_family(ec_number):
    """Finds pathways associated with other enzymes in the same sub-subclass."""
    print("\n--- Strategy 2: Inferring from Enzyme Family ---")
    if not ec_number or len(ec_number.split('.')) != 4: return []
    ec_class = ".".join(ec_number.split('.')[:3])
    print(f"Searching for pathways linked to other enzymes in class {ec_class}...")
    found_pathways = set()
    url = f"http://rest.kegg.jp/link/pathway/ec"
    try:
        response = requests.get(url)
        response.raise_for_status()
        for line in response.text.strip().split('\n'):
            parts = line.strip().split('\t')
            if len(parts) == 2 and parts[0].startswith(f"ec:{ec_class}"):
                pathway_id = parts[1].replace('path:', '').replace('map', 'hsa')
                found_pathways.add(pathway_id)
        return sorted(list(found_pathways))
    except requests.RequestException as e:
        print(f"❌ API call failed: {e}")
        return []

def infer_pathways_from_similar_compounds(smiles_string):
    """Finds pathways of compounds structurally similar to the input SMILES."""
    print("\n--- Strategy 3: Inferring from Similar Compounds ---")
    similar_compounds = []
    url_sim = f"http://rest.kegg.jp/find/compound/{smiles_string}/simcomp/0.85" # Using 85% similarity
    try:
        response_sim = requests.get(url_sim)
        response_sim.raise_for_status()
        for line in response_sim.text.strip().split('\n'):
            parts = line.strip().split('\t')
            if len(parts) > 0:
                similar_compounds.append(parts[0])
    except requests.RequestException as e:
        print(f"❌ Similarity search failed: {e}")
        return []
    if not similar_compounds: return []
    print(f"Found {len(similar_compounds)} similar compounds. Checking their pathways...")
    found_pathways = set()
    url_path = f"http://rest.kegg.jp/link/pathway/{'+'.join(similar_compounds)}"
    try:
        response_path = requests.get(url_path)
        response_path.raise_for_status()
        for line in response_path.text.strip().split('\n'):
            parts = line.strip().split('\t')
            if len(parts) == 2:
                pathway_id = parts[1].replace('path:', '').replace('map', 'hsa')
                found_pathways.add(pathway_id)
        return sorted(list(found_pathways))
    except requests.RequestException as e:
        print(f"❌ Pathway lookup for similar compounds failed: {e}")
        return []

# --- Run All Three Inference Strategies ---
struct_preds = infer_pathways_from_structure(COMPOUND_SMILES)
enzyme_preds = infer_pathways_from_enzyme_family(EC_NUMBER)
sim_preds = infer_pathways_from_similar_compounds(COMPOUND_SMILES)

# --- Synthesize the Final Results ---
print("\n=============================================")
print("          INFERENCE SUMMARY")
print("=============================================")
print(f"Compound: Quinoline-2-carboxylic acid")
print(f"Enzyme: {EC_NUMBER} | Reaction: R03687")
print("---------------------------------------------")
print(f"Pathways inferred from Structure: {struct_preds}")
print(f"Pathways inferred from Enzyme Family: {enzyme_preds}")
print(f"Pathways inferred from Similar Compounds: {sim_preds}")
print("---------------------------------------------")

all_predictions = struct_preds + enzyme_preds + sim_preds
consensus = sorted([p for p in set(all_predictions) if all_predictions.count(p) > 1])
if consensus:
    print(f"🔥 Consensus Prediction (appears in multiple results): {consensus}")
else:
    print("ℹ️ No consensus found across different methods. Review individual results.")

--- Strategy 1: Inferring from Compound Structure ---

--- Strategy 2: Inferring from Enzyme Family ---
Searching for pathways linked to other enzymes in class 1.3.99...

--- Strategy 3: Inferring from Similar Compounds ---
❌ Similarity search failed: 400 Client Error: Bad Request for url: https://rest.kegg.jp/find/compound/C1=CC=C2C(=C1)C=CC(=N2)C(=O)O/simcomp/0.85

          INFERENCE SUMMARY
Compound: Quinoline-2-carboxylic acid
Enzyme: 1.3.99.18 | Reaction: R03687
---------------------------------------------
Pathways inferred from Structure: ['hsa00010', 'hsa00020', 'hsa00071', 'hsa00220', 'hsa00250', 'hsa00260', 'hsa00350', 'hsa00620', 'hsa00720', 'hsa00730']
Pathways inferred from Enzyme Family: ['ec00140', 'ec00340', 'ec00362', 'ec00365', 'ec00830', 'ec00903', 'ec00906', 'ec00920', 'ec00984', 'ec01100', 'ec01110', 'ec01120', 'hsa00140', 'hsa00340', 'hsa00362', 'hsa00365', 'hsa00830', 'hsa00903', 'hsa00906', 'hsa00920', 'hsa00984', 'hsa01100', 'hsa01110', 'hsa01120']
Pathways in

In [154]:
# --- Step 2: Define the target compound and the new pathway information ---
target_compound_name = 'Quinoline-2-carboxylic acid'
inferred_pathway = ['hsa00830'] # The pathway we inferred

# --- Step 3: Locate the row and update the 'kegg_pathways' column ---
# Create a boolean condition to find the correct row(s)
condition = ReacEnzyPath['input_compound_name'] == target_compound_name

# Update the 'kegg_pathways' column for the matched row(s)
# We'll store it as a string representation of a list to be consistent with '[]'
ReacEnzyPath.loc[condition, 'kegg_pathways'] = str(inferred_pathway)

print("\n--- DataFrame AFTER update ---")
print(ReacEnzyPath)

# Verify the specific change
print("\n--- Updated row for Quinoline-2-carboxylic acid ---")
print(ReacEnzyPath[ReacEnzyPath['input_compound_name'] == target_compound_name].to_string())


--- DataFrame AFTER update ---
                           input_compound_name input_original_index   input_hmdb input_pubchem_cid    input_cas derived_kegg_compound_id kegg_id_source_identifier kegg_id_source_type                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        kegg_reactions  \
0                           D-Erythronolactone             MADN0053  HMDB0000349           5325915   15667-21-7         

In [156]:
ReacEnzyPath.to_csv('ReacEnzyPath.csv', index=False)

<div style="
    border-left: 6px solid #FF9800;
    background-color: #FFF3E0;
    padding: 16px 20px;
    margin: 24px 0;
    border-radius: 8px;
    box-shadow: 0 2px 4px rgba(0,0,0,0.05);
">
  <h3 style="margin-top: 0; margin-bottom: 8px; color: #FF9800;">🔬 Pathway &amp; Enzyme Reaction Validations</h3>
  <p style="margin: 0; color: #333;">
    This section checks that all metabolites are correctly mapped to their KEGG/Reactome pathways and validates enzyme-catalyzed reaction assignments against curated databases.
  </p>
</div>


<h2>🧪 Phase 3: CRC Pathway Relevance Validation</h2>

<h3>🎯 Objective</h3>
<p>
Validate whether enzyme/reaction/pathway annotations for CRC-associated metabolites connect to a curated list of 
<code>gold_standard_crc_pathways</code> known to be biologically relevant in colorectal cancer (CRC).
</p>

---

<h3>🔬 What This Script Does</h3>

<ol>
  <li><strong>Defines</strong> a gold standard list of CRC-relevant KEGG pathway IDs, including:
    <ul>
      <li><code>hsa05210</code>: Colorectal cancer</li>
      <li>Core energy metabolism: glycolysis, TCA cycle, pentose phosphate</li>
      <li>Nucleotide, amino acid, lipid, and xenobiotic metabolism</li>
    </ul>
  </li>

  <li><strong>Loads</strong> a DataFrame <code>ReacEnzyPath</code> that contains, per compound:
    <ul>
      <li><code>kegg_pathways</code></li>
      <li><code>kegg_enzymes</code> (EC numbers)</li>
      <li><code>kegg_reactions</code> (reaction IDs)</li>
    </ul>
  </li>

  <li><strong>Validates</strong> each compound through three routes:
    <ul>
      <li>💠 <strong>Direct Pathway Match</strong>: Are any listed pathways in the gold standard?</li>
      <li>🧬 <strong>Enzyme-Pathway Match</strong>: Do the EC numbers map to gold standard pathways via KEGG?</li>
      <li>⚙️ <strong>Reaction-Pathway Match</strong>: Do the reaction IDs map to gold standard pathways?</li>
    </ul>
  </li>

  <li><strong>Annotates</strong> the validated results in a new DataFrame: <code>ReacEnzyPath_validated</code></li>

  <li><strong>Flags</strong> all compounds that are:
    <ul>
      <li>✅ Validated by at least one route</li>
      <li>❌ Not validated by any route (fully unlinked to CRC gold standard)</li>
    </ul>
  </li>
</ol>

---

<h3>📊 Output Overview</h3>

<p>The final DataFrame includes:</p>

<ul>
  <li><code>is_crc_pathway_relevant</code>: based on direct pathway listing</li>
  <li><code>is_enzyme_crc_relevant</code>: based on EC-to-pathway mapping</li>
  <li><code>is_reaction_crc_relevant</code>: based on Reaction-to-pathway mapping</li>
  <li><code>crc_pathway_hits</code>, <code>enzyme_crc_pathway_links</code>, <code>reaction_crc_pathway_links</code> for detailed tracking</li>
</ul>

---

<h3>🔎 Why This Matters</h3>

<ul>
  <li>Helps prioritize high-confidence metabolites for CRC pathway modeling</li>
  <li>Enables filtering of noise or off-pathway annotations</li>
  <li>Supports downstream enrichment analysis or network refinement</li>
</ul>


In [31]:
ReacEnzyPath.to_csv('ReacEnzyPath.csv', index=False)

In [157]:
gold_standard_crc_pathways = [
    # --- KEGG's Direct CRC Pathway ---
    'hsa05210',  # Colorectal cancer

    # --- Core Energy & Carbon Metabolism (often dysregulated - Warburg effect, etc.) ---
    'hsa00010',  # Glycolysis / Gluconeogenesis
    'hsa00020',  # Citrate cycle (TCA cycle)
    'hsa00030',  # Pentose phosphate pathway (provides NADPH and nucleotide precursors)
    'hsa00620',  # Pyruvate metabolism

    # --- Nucleotide Metabolism (for DNA/RNA synthesis in proliferating cells) ---
    'hsa00230',  # Purine metabolism
    'hsa00240',  # Pyrimidine metabolism

    # --- Amino Acid Metabolism (cancer cells often "addicted" to certain amino acids) ---
    'hsa00250',  # Alanine, aspartate and glutamate metabolism
    'hsa00260',  # Glycine, serine and threonine metabolism (key for one-carbon metabolism)
    'hsa00220',  # Arginine biosynthesis
    'hsa00330',  # Arginine and proline metabolism
    'hsa00280',  # Valine, leucine and isoleucine degradation (BCAAs)
    'hsa00380',  # Tryptophan metabolism (can lead to kynurenine, impacting immunity)
    'hsa00480',  # Glutathione metabolism (redox balance, drug resistance)

    # --- Lipid Metabolism (for new membranes, signaling lipids) ---
    'hsa00061',  # Fatty acid biosynthesis
    'hsa00062',  # Fatty acid elongation
    'hsa00071',  # Fatty acid degradation (beta-oxidation)
    'hsa00100',  # Steroid biosynthesis (relevant if hormonal aspects are considered)

    # --- Metabolism of Cofactors and Vitamins (can be linked to enzyme function) ---
    'hsa00790',  # Folate biosynthesis (links to one-carbon metabolism and nucleotide synthesis)
    'hsa00830',  # Retinol metabolism (Vitamin A, differentiation, relevant to our previous inference)

    # --- Xenobiotic/Drug Metabolism (relevant for therapy and carcinogen processing) ---
    'hsa00980',  # Metabolism of xenobiotics by cytochrome P450
    'hsa00982',  # Drug metabolism - cytochrome P450

    # --- Broader Metabolic Overview Maps (can catch things missed by specifics) ---
    # 'hsa01100',  # Metabolic pathways (use with caution, very broad)
    # 'hsa01200',  # Carbon metabolism
    # 'hsa01212',  # Fatty acid metabolism (overview)
    # 'hsa01230',  # Biosynthesis of amino acids
    # 'hsa01232',  # Nucleotide metabolism (overview in some KEGG versions)
]
# Remove duplicates just in case
gold_standard_crc_pathways = sorted(list(set(gold_standard_crc_pathways)))

print(f"✅ Defined 'gold_standard_crc_pathways' list with {len(gold_standard_crc_pathways)} entries.")

✅ Defined 'gold_standard_crc_pathways' list with 22 entries.


In [158]:
# PREREQUISITES:
# 1. A pandas DataFrame named 'ReacEnzyPath' must already be loaded in your environment.
#    It should contain columns like 'input_compound_name', 'kegg_reactions', 
#    'kegg_enzymes', and 'kegg_pathways' (where the KEGG columns store string 
#    representations of lists, e.g., "['R00001', 'R00002']").
# 2. A Python list named 'gold_standard_crc_pathways' must already be defined,
#    containing your reference list of KEGG pathway IDs (e.g., ['hsa05210', 'hsa00010', ...]).

import pandas as pd
import ast # For safely evaluating string representations of lists
import requests
from collections import defaultdict

# --- Safety Check for Prerequisites (Optional but Recommended) ---
if 'ReacEnzyPath' not in locals() or not isinstance(ReacEnzyPath, pd.DataFrame):
    print("❌ ERROR: DataFrame 'ReacEnzyPath' is not defined or is not a pandas DataFrame.")
    print("Please ensure 'ReacEnzyPath' is loaded correctly before running this script.")
    # Consider exiting or raising an error if you want to stop execution:
    # exit() 
elif 'gold_standard_crc_pathways' not in locals() or not isinstance(gold_standard_crc_pathways, list):
    print("❌ ERROR: List 'gold_standard_crc_pathways' is not defined or is not a list.")
    print("Please ensure 'gold_standard_crc_pathways' is defined correctly before running this script.")
    # exit()
else:
    print("✅ Prerequisites 'ReacEnzyPath' and 'gold_standard_crc_pathways' found. Starting validation...")

    # --- Create a new DataFrame for validation results ---
    # This ensures your original ReacEnzyPath is not modified.
    ReacEnzyPath_validated = ReacEnzyPath.copy()
    print("✅ Created 'ReacEnzyPath_validated' DataFrame to store validation results.")

    # --- Main Validation Logic (operates on ReacEnzyPath_validated) ---

    def safe_literal_eval(val):
        """Safely evaluate a string representation of a list, or return empty list."""
        try:
            if not isinstance(val, str):
                if isinstance(val, list): 
                    return val
                return [] 
            return ast.literal_eval(val)
        except (ValueError, SyntaxError):
            return []

    print("\n--- Validating Pathways against Gold Standard ---")
    ReacEnzyPath_validated['parsed_pathways'] = ReacEnzyPath_validated['kegg_pathways'].apply(safe_literal_eval)
    ReacEnzyPath_validated['crc_pathway_hits'] = ReacEnzyPath_validated['parsed_pathways'].apply(
        lambda pathways: [p for p in pathways if p in gold_standard_crc_pathways]
    )
    ReacEnzyPath_validated['is_crc_pathway_relevant'] = ReacEnzyPath_validated['crc_pathway_hits'].apply(lambda x: len(x) > 0)

    print("\n--- Downloading EC-to-Pathway map from KEGG (for enzyme/reaction validation) ---")
    ec_to_pathway_map = defaultdict(set)
    url_ec_path = "http://rest.kegg.jp/link/pathway/ec" 
    try:
        response = requests.get(url_ec_path)
        response.raise_for_status()
        for line in response.text.strip().split('\n'):
            parts = line.strip().split('\t')
            if len(parts) == 2:
                ec_id = parts[0].replace('ec:', '')
                pathway_id = parts[1].replace('path:', '').replace('map', 'hsa')
                ec_to_pathway_map[ec_id].add(pathway_id)
        print(f"✅ Successfully built EC-to-Pathway map for {len(ec_to_pathway_map)} enzymes.")
    except requests.exceptions.RequestException as e:
        print(f"Warning: Could not download EC-to-Pathway map from KEGG. Error: {e}")

    print("\n--- Validating Enzymes against Gold Standard Pathways ---")
    ReacEnzyPath_validated['parsed_enzymes'] = ReacEnzyPath_validated['kegg_enzymes'].apply(safe_literal_eval)
    def check_enzyme_relevance(ec_list):
        relevant_pathways_for_enzymes = set()
        for ec in ec_list:
            pathways_for_this_ec = ec_to_pathway_map.get(ec, set())
            for p_ec in pathways_for_this_ec:
                if p_ec in gold_standard_crc_pathways:
                    relevant_pathways_for_enzymes.add(p_ec)
        return sorted(list(relevant_pathways_for_enzymes))

    ReacEnzyPath_validated['enzyme_crc_pathway_links'] = ReacEnzyPath_validated['parsed_enzymes'].apply(check_enzyme_relevance)
    ReacEnzyPath_validated['is_enzyme_crc_relevant'] = ReacEnzyPath_validated['enzyme_crc_pathway_links'].apply(lambda x: len(x) > 0)

    print("\n--- Downloading Reaction-to-Pathway map from KEGG ---")
    reaction_to_pathway_map = defaultdict(set)
    url_rn_path = "http://rest.kegg.jp/link/pathway/reaction"
    try:
        response = requests.get(url_rn_path)
        response.raise_for_status()
        for line in response.text.strip().split('\n'):
            parts = line.strip().split('\t')
            if len(parts) == 2:
                reaction_id = parts[0].replace('rn:', '')
                pathway_id = parts[1].replace('path:', '').replace('map', 'hsa')
                reaction_to_pathway_map[reaction_id].add(pathway_id)
        print(f"✅ Successfully built Reaction-to-Pathway map for {len(reaction_to_pathway_map)} reactions.")
    except requests.exceptions.RequestException as e:
        print(f"Warning: Could not download Reaction-to-Pathway map from KEGG. Error: {e}")

    print("\n--- Validating Reactions against Gold Standard Pathways ---")
    ReacEnzyPath_validated['parsed_reactions'] = ReacEnzyPath_validated['kegg_reactions'].apply(safe_literal_eval)
    def check_reaction_relevance(reaction_list):
        relevant_pathways_for_reactions = set()
        for rn in reaction_list:
            pathways_for_this_reaction = reaction_to_pathway_map.get(rn, set())
            for p_rn in pathways_for_this_reaction:
                if p_rn in gold_standard_crc_pathways:
                    relevant_pathways_for_reactions.add(p_rn)
        return sorted(list(relevant_pathways_for_reactions))

    ReacEnzyPath_validated['reaction_crc_pathway_links'] = ReacEnzyPath_validated['parsed_reactions'].apply(check_reaction_relevance)
    ReacEnzyPath_validated['is_reaction_crc_relevant'] = ReacEnzyPath_validated['reaction_crc_pathway_links'].apply(lambda x: len(x) > 0)

    # --- Display Main Results from the new DataFrame ---
    print("\n\n--- 'ReacEnzyPath_validated' DataFrame with CRC Relevance Flags (Main Display) ---")
    display_cols = [
        'input_compound_name', 'kegg_pathways', 'is_crc_pathway_relevant', 'crc_pathway_hits',
        'kegg_enzymes', 'is_enzyme_crc_relevant', 'enzyme_crc_pathway_links',
        'kegg_reactions', 'is_reaction_crc_relevant', 'reaction_crc_pathway_links'
    ]
    existing_display_cols = [col for col in display_cols if col in ReacEnzyPath_validated.columns]
    if existing_display_cols:
        print(ReacEnzyPath_validated[existing_display_cols].to_string())
    else:
        print("Could not display main results, some expected columns might be missing from ReacEnzyPath_validated if API calls failed.")

    # --- Integrated Filtering for Unvalidated Entries (operates on ReacEnzyPath_validated) ---
    print("\n\n--- Explicitly Listing Unvalidated Entries from 'ReacEnzyPath_validated' ---")

    filter_check_cols = ['is_crc_pathway_relevant', 'is_enzyme_crc_relevant', 'is_reaction_crc_relevant']
    if not all(col in ReacEnzyPath_validated.columns for col in filter_check_cols):
        print("\n❌ ERROR: Your 'ReacEnzyPath_validated' DataFrame is missing some of the boolean 'is_..._relevant' columns.")
        print("Cannot proceed with filtering for unvalidated entries.")
    else:
        # 1. Compounds whose DIRECTLY listed pathways are NOT in the gold standard
        unvalidated_by_direct_pathway = ReacEnzyPath_validated[ReacEnzyPath_validated['is_crc_pathway_relevant'] == False]
        print(f"\nFound {len(unvalidated_by_direct_pathway)} compounds whose directly listed pathways are NOT in the gold standard:")
        if not unvalidated_by_direct_pathway.empty:
            print(unvalidated_by_direct_pathway[['input_compound_name', 'kegg_pathways', 'crc_pathway_hits']].to_string())
        else:
            print("All compounds have at least one directly listed pathway in the gold standard (or this check wasn't applicable).")

        # 2. Compounds whose ENZYMES do NOT link to any gold standard pathways
        unvalidated_by_enzyme_pathways = ReacEnzyPath_validated[ReacEnzyPath_validated['is_enzyme_crc_relevant'] == False]
        print(f"\nFound {len(unvalidated_by_enzyme_pathways)} compounds whose enzymes do NOT link to gold standard pathways:")
        if not unvalidated_by_enzyme_pathways.empty:
            print(unvalidated_by_enzyme_pathways[['input_compound_name', 'kegg_enzymes', 'enzyme_crc_pathway_links']].to_string())
        else:
            print("All compounds have enzymes linked to gold standard pathways (or this check wasn't applicable).")

        # 3. Compounds whose REACTIONS do NOT link to any gold standard pathways
        unvalidated_by_reaction_pathways = ReacEnzyPath_validated[ReacEnzyPath_validated['is_reaction_crc_relevant'] == False]
        print(f"\nFound {len(unvalidated_by_reaction_pathways)} compounds whose reactions do NOT link to gold standard pathways:")
        if not unvalidated_by_reaction_pathways.empty:
            print(unvalidated_by_reaction_pathways[['input_compound_name', 'kegg_reactions', 'reaction_crc_pathway_links']].to_string())
        else:
            print("All compounds have reactions linked to gold standard pathways (or this check wasn't applicable).")

        # 4. Compounds that are "fully unvalidated" by any of these pathway-based checks
        fully_unvalidated_by_pathways = ReacEnzyPath_validated[
            (ReacEnzyPath_validated['is_crc_pathway_relevant'] == False) &
            (ReacEnzyPath_validated['is_enzyme_crc_relevant'] == False) &
            (ReacEnzyPath_validated['is_reaction_crc_relevant'] == False)
        ]
        print(f"\nFound {len(fully_unvalidated_by_pathways)} compounds with NO links to gold standard pathways via direct, enzyme, or reaction routes:")
        if not fully_unvalidated_by_pathways.empty:
            print(fully_unvalidated_by_pathways[['input_compound_name', 'kegg_pathways', 'kegg_enzymes', 'kegg_reactions']].to_string())
        else:
            print("All compounds have at least one pathway link (direct, enzyme, or reaction) to the gold standard (or this check wasn't applicable).")

    print("\n--- Original 'ReacEnzyPath' DataFrame remains unchanged. ---")
    # You can verify this by printing ReacEnzyPath here if you wish, e.g.:
    # print(ReacEnzyPath.head().to_string())

✅ Prerequisites 'ReacEnzyPath' and 'gold_standard_crc_pathways' found. Starting validation...
✅ Created 'ReacEnzyPath_validated' DataFrame to store validation results.

--- Validating Pathways against Gold Standard ---

--- Downloading EC-to-Pathway map from KEGG (for enzyme/reaction validation) ---
✅ Successfully built EC-to-Pathway map for 3901 enzymes.

--- Validating Enzymes against Gold Standard Pathways ---

--- Downloading Reaction-to-Pathway map from KEGG ---
✅ Successfully built Reaction-to-Pathway map for 8086 reactions.

--- Validating Reactions against Gold Standard Pathways ---


--- 'ReacEnzyPath_validated' DataFrame with CRC Relevance Flags (Main Display) ---
                           input_compound_name                                                                                                                                                                                     kegg_pathways  is_crc_pathway_relevant      crc_pathway_hits                              

In [159]:
display(ReacEnzyPath_validated)

Unnamed: 0,input_compound_name,input_original_index,input_hmdb,input_pubchem_cid,input_cas,derived_kegg_compound_id,kegg_id_source_identifier,kegg_id_source_type,kegg_reactions,kegg_enzymes,kegg_pathways,KEGG_Hits_Count,SMILES,parsed_pathways,crc_pathway_hits,is_crc_pathway_relevant,parsed_enzymes,enzyme_crc_pathway_links,is_enzyme_crc_relevant,parsed_reactions,reaction_crc_pathway_links,is_reaction_crc_relevant
0,D-Erythronolactone,MADN0053,HMDB0000349,5325915,15667-21-7,,,,"['R00145', 'R00146', 'R00717', 'R01088', 'R01361', 'R01388', 'R01434', 'R02196']","['1.1.1.29', '1.1.1.30', '1.4.1.9']","['hsa00260', 'hsa00280', 'hsa00650']",0,C1[C@H]([C@H](C(=O)O1)O)O,"[hsa00260, hsa00280, hsa00650]","[hsa00260, hsa00280]",True,"[1.1.1.29, 1.1.1.30, 1.4.1.9]","[hsa00260, hsa00280]",True,"[R00145, R00146, R00717, R01088, R01361, R01388, R01434, R02196]","[hsa00260, hsa00280]",True
1,"1,6-anhydro-β-D-glucose",MADN0166,HMDB0000640,724705,498-07-7,,,,"['R01431', 'R01896', 'R07152', 'R09477', 'R11618', 'R11620']","['1.1.1.307', '1.1.1.9', '1.1.3.41']",['hsa00040'],0,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O,[hsa00040],[],False,"[1.1.1.307, 1.1.1.9, 1.1.3.41]",[],False,"[R01431, R01896, R07152, R09477, R11618, R11620]",[],False
2,Deoxyribose 5-phosphate,MADN0220,HMDB0001031,45934311,-,C00673,HMDB0001031,HMDB,"['R02750', 'R02749', 'R01066']","['4.1.2.4', '2.7.1.229', '2.7.1.15', '5.4.2.7']","['hsa00030', 'hsa01100']",3,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O,"[hsa00030, hsa01100]",[hsa00030],True,"[4.1.2.4, 2.7.1.229, 2.7.1.15, 5.4.2.7]","[hsa00030, hsa00230]",True,"[R02750, R02749, R01066]",[hsa00030],True
3,2-Aminobenzenesulfonic acid,MADN0329,-,6926,88-21-1,,,,"['R01842', 'R02354', 'R02355', 'R02356', 'R02503', 'R03088', 'R03089', 'R03090', 'R03408', 'R03629', 'R03697', 'R04121', 'R04122', 'R05259', 'R07000', 'R07001', 'R07021', 'R07022', 'R07042', 'R07043', 'R07044', 'R07045', 'R07046', 'R07048', 'R07050', 'R07051', 'R07052', 'R07054', 'R07055', 'R07056', 'R07079', 'R07080', 'R07081', 'R07085', 'R07087', 'R07098', 'R07099', 'R07939', 'R07943', 'R07945', 'R08265', 'R08267', 'R08270', 'R08286', 'R08287', 'R08293', 'R08294', 'R08312', 'R08343', 'R08344', 'R08345', 'R08390', 'R08391', 'R08392', 'R09404', 'R09405', 'R09406', 'R09407', 'R09408', 'R09416', 'R09418', 'R09421', 'R09423', 'R09424', 'R09425', 'R09442']",['1.14.14.1'],['hsa00071'],0,C1=CC=C(C(=C1)N)S(=O)(=O)O,[hsa00071],[hsa00071],True,[1.14.14.1],"[hsa00071, hsa00380, hsa00830, hsa00980, hsa00982]",True,"[R01842, R02354, R02355, R02356, R02503, R03088, R03089, R03090, R03408, R03629, R03697, R04121, R04122, R05259, R07000, R07001, R07021, R07022, R07042, R07043, R07044, R07045, R07046, R07048, R07050, R07051, R07052, R07054, R07055, R07056, R07079, R07080, R07081, R07085, R07087, R07098, R07099, R07939, R07943, R07945, R08265, R08267, R08270, R08286, R08287, R08293, R08294, R08312, R08343, R08344, R08345, R08390, R08391, R08392, R09404, R09405, R09406, R09407, R09408, R09416, R09418, R09421, R09423, R09424, R09425, R09442]","[hsa00071, hsa00380, hsa00830, hsa00980, hsa00982]",True
4,Quinoline-4-carboxylic acid,MADN0333,-,10243,486-74-8,,,,"['R00366', 'R00372', 'R00635', 'R01340', 'R01709', 'R02125', 'R02457', 'R02657', 'R02894', 'R02923', 'R03871', 'R04085', 'R04221', 'R04904', 'R05861', 'R06124', 'R07400', 'R08349', 'R08384', 'R08408']","['1.2.3.1', '1.4.3.3', '2.6.1.4']","['hsa00260', 'hsa00280']",0,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O,"[hsa00260, hsa00280]","[hsa00260, hsa00280]",True,"[1.2.3.1, 1.4.3.3, 2.6.1.4]","[hsa00260, hsa00280, hsa00330, hsa00380, hsa00830, hsa00982]",True,"[R00366, R00372, R00635, R01340, R01709, R02125, R02457, R02657, R02894, R02923, R03871, R04085, R04221, R04904, R05861, R06124, R07400, R08349, R08384, R08408]","[hsa00260, hsa00280, hsa00330, hsa00380, hsa00830, hsa00982]",True
5,cyclo(glu-glu),MADN0466,-,7408481,16691-00-2,,,,"['R00093', 'R00114', 'R00243', 'R00248', 'R00256', 'R01990']","['1.4.1.13', '1.4.1.14', '3.5.3.7']","['hsa00250', 'hsa00330']",0,C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O,"[hsa00250, hsa00330]","[hsa00250, hsa00330]",True,"[1.4.1.13, 1.4.1.14, 3.5.3.7]","[hsa00250, hsa00330]",True,"[R00093, R00114, R00243, R00248, R00256, R01990]","[hsa00220, hsa00250, hsa00330]",True
6,P-sulfanilic acid,MADN0498,-,8479,121-57-3,,,,"['R01842', 'R02354', 'R02355', 'R02356', 'R02503', 'R03088', 'R03089', 'R03090', 'R03408', 'R03629', 'R03697', 'R04121', 'R04122', 'R05259', 'R07000', 'R07001', 'R07021', 'R07022', 'R07042', 'R07043', 'R07044', 'R07045', 'R07046', 'R07048', 'R07050', 'R07051', 'R07052', 'R07054', 'R07055', 'R07056', 'R07079', 'R07080', 'R07081', 'R07085', 'R07087', 'R07098', 'R07099', 'R07939', 'R07943', 'R07945', 'R08265', 'R08267', 'R08270', 'R08286', 'R08287', 'R08293', 'R08294', 'R08312', 'R08343', 'R08344', 'R08345', 'R08390', 'R08391', 'R08392', 'R09404', 'R09405', 'R09406', 'R09407', 'R09408', 'R09416', 'R09418', 'R09421', 'R09423', 'R09424', 'R09425', 'R09442']",['1.14.14.1'],['hsa00071'],0,C1=CC(=CC=C1N)S(=O)(=O)O,[hsa00071],[hsa00071],True,[1.14.14.1],"[hsa00071, hsa00380, hsa00830, hsa00980, hsa00982]",True,"[R01842, R02354, R02355, R02356, R02503, R03088, R03089, R03090, R03408, R03629, R03697, R04121, R04122, R05259, R07000, R07001, R07021, R07022, R07042, R07043, R07044, R07045, R07046, R07048, R07050, R07051, R07052, R07054, R07055, R07056, R07079, R07080, R07081, R07085, R07087, R07098, R07099, R07939, R07943, R07945, R08265, R08267, R08270, R08286, R08287, R08293, R08294, R08312, R08343, R08344, R08345, R08390, R08391, R08392, R09404, R09405, R09406, R09407, R09408, R09416, R09418, R09421, R09423, R09424, R09425, R09442]","[hsa00071, hsa00380, hsa00830, hsa00980, hsa00982]",True
7,Methylcysteine,MADP0119,HMDB0002108,24417,1187-84-4,,,,"['R00891', 'R00894', 'R01289', 'R01290', 'R02743', 'R03749', 'R04942', 'R10993']","['4.2.1.22', '4.3.2.9', '6.3.2.2']","['hsa00260', 'hsa00270', 'hsa00480']",0,CSC[C@@H](C(=O)O)N,"[hsa00260, hsa00270, hsa00480]","[hsa00260, hsa00480]",True,"[4.2.1.22, 4.3.2.9, 6.3.2.2]","[hsa00260, hsa00480]",True,"[R00891, R00894, R01289, R01290, R02743, R03749, R04942, R10993]","[hsa00260, hsa00480]",True
8,Asp-Arg,MADP0548,-,16122509,-,,,,"['R00256', 'R00485', 'R01398', 'R01579', 'R01990', 'R06134']","['2.1.3.3', '3.5.1.38', '3.5.3.7']","['hsa00220', 'hsa00330']",0,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N,"[hsa00220, hsa00330]","[hsa00220, hsa00330]",True,"[2.1.3.3, 3.5.1.38, 3.5.3.7]","[hsa00220, hsa00250, hsa00330]",True,"[R00256, R00485, R01398, R01579, R01990, R06134]","[hsa00220, hsa00250, hsa00330]",True
9,LPI(16:2/0:0),MEDN1253,-,,-,,,,"['R01185', 'R01186', 'R01187', 'R01313', 'R01315', 'R01317', 'R01593', 'R02053', 'R03107', 'R03626', 'R07064', 'R07343', 'R07379', 'R07387', 'R07859']","['1.13.11.33', '3.1.1.4', '3.1.3.25']","['hsa00521', 'hsa00564', 'hsa00590']",0,CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O-])O[C@H]1[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)[C@@H]1O,"[hsa00521, hsa00564, hsa00590]",[],False,"[1.13.11.33, 3.1.1.4, 3.1.3.25]",[],False,"[R01185, R01186, R01187, R01313, R01315, R01317, R01593, R02053, R03107, R03626, R07064, R07343, R07379, R07387, R07859]",[],False


In [160]:
import pandas as pd

# This script assumes 'ReacEnzyPath_validated' is your fully processed DataFrame
# from the PREVIOUS script, and it already contains columns like:
# 'is_crc_pathway_relevant', 'is_enzyme_crc_relevant', 'is_reaction_crc_relevant'

# --- Safety Check for Prerequisites ---
if 'ReacEnzyPath_validated' not in locals() or not isinstance(ReacEnzyPath_validated, pd.DataFrame):
    print("❌ ERROR: DataFrame 'ReacEnzyPath_validated' is not defined or is not a pandas DataFrame.")
    print("Please ensure it's loaded correctly (output of the previous script) before running this snippet.")
else:
    print("✅ 'ReacEnzyPath_validated' DataFrame found. Proceeding with categorization...")

    validation_columns = ['is_crc_pathway_relevant', 'is_enzyme_crc_relevant', 'is_reaction_crc_relevant']
    
    # Ensure these columns exist before trying to sum them
    missing_cols = [col for col in validation_columns if col not in ReacEnzyPath_validated.columns]
    if missing_cols:
        print(f"❌ ERROR: The following required validation columns are missing: {missing_cols}")
        print("Please ensure the previous validation script ran successfully.")
    else:
        # --- Step 1: Calculate the number of TRUE validations for each compound ---
        # Boolean columns are summed as True=1, False=0
        ReacEnzyPath_validated['num_crc_validations'] = ReacEnzyPath_validated[validation_columns].sum(axis=1)
        print("✅ Added 'num_crc_validations' column to the DataFrame.")

        # --- Step 2: Filter and display for each category ---
        # Define which columns to show in the summary for brevity
        display_cols_summary = ['input_compound_name'] + validation_columns + ['num_crc_validations']
        # If you want to see the actual pathway/enzyme/reaction hits that led to True, add:
        # 'crc_pathway_hits', 'enzyme_crc_pathway_links', 'reaction_crc_pathway_links'
        # For example:
        # display_cols_detailed = ['input_compound_name'] + validation_columns + \
        #                         ['crc_pathway_hits', 'enzyme_crc_pathway_links', 'reaction_crc_pathway_links', 'num_crc_validations']


        print("\n\n--- Compounds with 3/3 Validations (Pathway, Enzyme, AND Reaction linked to CRC Gold Standard) ---")
        df_3_of_3 = ReacEnzyPath_validated[ReacEnzyPath_validated['num_crc_validations'] == 3]
        print(f"Found {len(df_3_of_3)} compounds.")
        if not df_3_of_3.empty:
            print(df_3_of_3[display_cols_summary].to_string())

        print("\n\n--- Compounds with Exactly 2/3 Validations ---")
        df_2_of_3 = ReacEnzyPath_validated[ReacEnzyPath_validated['num_crc_validations'] == 2]
        print(f"Found {len(df_2_of_3)} compounds.")
        if not df_2_of_3.empty:
            print(df_2_of_3[display_cols_summary].to_string())

        print("\n\n--- Compounds with Exactly 1/3 Validations ---")
        df_1_of_3 = ReacEnzyPath_validated[ReacEnzyPath_validated['num_crc_validations'] == 1]
        print(f"Found {len(df_1_of_3)} compounds.")
        if not df_1_of_3.empty:
            print(df_1_of_3[display_cols_summary].to_string())

        print("\n\n--- Compounds with 0/3 Validations (No direct pathway, enzyme, or reaction links to Gold Standard) ---")
        df_0_of_3 = ReacEnzyPath_validated[ReacEnzyPath_validated['num_crc_validations'] == 0]
        print(f"Found {len(df_0_of_3)} compounds.")
        if not df_0_of_3.empty:
            print(df_0_of_3[display_cols_summary].to_string())
        
        # You can drop the temporary sum column if you want, or keep it for further sorting
        # ReacEnzyPath_validated.drop(columns=['num_crc_validations'], inplace=True)

✅ 'ReacEnzyPath_validated' DataFrame found. Proceeding with categorization...
✅ Added 'num_crc_validations' column to the DataFrame.


--- Compounds with 3/3 Validations (Pathway, Enzyme, AND Reaction linked to CRC Gold Standard) ---
Found 14 compounds.
                 input_compound_name  is_crc_pathway_relevant  is_enzyme_crc_relevant  is_reaction_crc_relevant  num_crc_validations
0                 D-Erythronolactone                     True                    True                      True                    3
2            Deoxyribose 5-phosphate                     True                    True                      True                    3
3        2-Aminobenzenesulfonic acid                     True                    True                      True                    3
4        Quinoline-4-carboxylic acid                     True                    True                      True                    3
5                     cyclo(glu-glu)                     True                    

<h2>🧪 Phase 3 Validation Results: CRC-Relevant Metabolite Categorization</h2>

<h3>✅ Summary of Results</h3>
<p>
After validating 24 CRC-associated metabolites against a curated list of <code>gold_standard_crc_pathways</code>, 
compounds were categorized based on the number of supporting links (out of 3 possible routes):
</p>

<ul>
  <li><strong>Pathway Match</strong>: The compound is directly annotated to a CRC-relevant KEGG pathway.</li>
  <li><strong>Enzyme Match</strong>: Its catalyzing enzyme(s) map to CRC pathways.</li>
  <li><strong>Reaction Match</strong>: The biochemical reactions it participates in link to CRC pathways.</li>
</ul>

<table>
  <thead>
    <tr>
      <th>Validation Tier</th>
      <th>Compound Count</th>
      <th>Interpretation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>✅ 3/3 Validations</strong></td>
      <td>14 compounds</td>
      <td>High confidence — supported by pathway, enzyme, and reaction annotations.</td>
    </tr>
    <tr>
      <td><strong>⚠️ 2/3 Validations</strong></td>
      <td>1 compound</td>
      <td>Likely biologically relevant, but missing direct KEGG pathway annotation.</td>
    </tr>
    <tr>
      <td><strong>❓ 1/3 Validations</strong></td>
      <td>3 compounds</td>
      <td>Weakly supported — potential annotation gaps or indirect roles in CRC.</td>
    </tr>
    <tr>
      <td><strong>❌ 0/3 Validations</strong></td>
      <td>6 compounds</td>
      <td>Unvalidated — either peripheral to core metabolism or affected by data/model limitations.</td>
    </tr>
  </tbody>
</table>

---

<h3>⚠️ Methodological Shortcomings</h3>
<p>
The inability to validate 10 out of 24 metabolites through all three routes likely reflects a combination of the following limitations:
</p>

<ul>
  <li><strong>KEGG incompleteness:</strong> KEGG pathways, especially for lipids and secondary metabolites, may be outdated or sparse.</li>
  <li><strong>Non-metabolic CRC roles:</strong> Some compounds (e.g., hormones, signaling lipids) affect cancer via gene regulation, not core metabolism.</li>
  <li><strong>Missing enzyme annotations:</strong> Enzymes for certain metabolites may be experimentally known but unlisted in KEGG (e.g., orphan reactions).</li>
  <li><strong>Ambiguity in metabolite identity:</strong> Some entries (like isomers or modifications) may not match canonical KEGG IDs precisely.</li>
</ul>

---

<h3>🔗 Importance of Layering Multi-Omics Data</h3>
<p>
While metabolite-centric validation provides strong biochemical context, it does not capture the full picture of CRC relevance. 
Layering this with <strong>genomic and transcriptomic data</strong> significantly increases biological resolution:
</p>

<ul>
  <li>🧬 <strong>Genomics:</strong> Identifies somatic mutations in genes encoding enzymes that act on these metabolites (e.g., <code>IDH1</code>, <code>TP53</code>).</li>
  <li>📈 <strong>Transcriptomics:</strong> Reveals up- or down-regulation of enzyme-coding genes in CRC vs. healthy tissue, confirming activity changes.</li>
  <li>🔍 <strong>Contextualization:</strong> Helps distinguish metabolites that are CRC drivers vs. downstream byproducts or noise.</li>
</ul>

<p>
By combining pathway validation with gene-level expression and mutation data, we can construct <strong>multi-layered CRC-specific metabolic networks</strong> 
that are both mechanistically grounded and data-driven. This is essential for prioritizing biomarkers, drug targets, and metabolic interventions.
</p>


<div style="
    border-left: 4px solid #2e7d32;
    background: #EEE3B7;
    padding: 20px;
    margin: 25px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    line-height: 1.5;
">
  <strong style="color: #2e7d32; font-size: 18px;">6. Genomic &amp; Transcriptomic Layering</strong>
  
  <h5 style="margin: 16px 0 8px 0; color: #2e7d32;">6.1 Prior-Based Gene Annotation</h5>
  <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
    <li><strong>Gene → KEGG Pathway Mapping:</strong> Map each gene (via its EC number) to one or more KEGG metabolic pathways, placing gene expression within metabolic contexts.</li>
  </ul>

In [8]:
import pandas as pd
ReacEnzyPath = pd.read_csv('ReacEnzyPath.csv')
ReacEnzyPath.head(24)


Unnamed: 0,input_compound_name,input_original_index,input_hmdb,input_pubchem_cid,input_cas,derived_kegg_compound_id,kegg_id_source_identifier,kegg_id_source_type,kegg_reactions,kegg_enzymes,kegg_pathways,KEGG_Hits_Count,SMILES
0,D-Erythronolactone,MADN0053,HMDB0000349,5325915,15667-21-7,,,,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...","['1.1.1.29', '1.1.1.30', '1.4.1.9']","['hsa00260', 'hsa00280', 'hsa00650']",0,C1[C@H]([C@H](C(=O)O1)O)O
1,"1,6-anhydro-β-D-glucose",MADN0166,HMDB0000640,724705,498-07-7,,,,"['R01431', 'R01896', 'R07152', 'R09477', 'R116...","['1.1.1.307', '1.1.1.9', '1.1.3.41']",['hsa00040'],0,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O
2,Deoxyribose 5-phosphate,MADN0220,HMDB0001031,45934311,-,C00673,HMDB0001031,HMDB,"['R02750', 'R02749', 'R01066']","['4.1.2.4', '2.7.1.229', '2.7.1.15', '5.4.2.7']","['hsa00030', 'hsa01100']",3,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O
3,2-Aminobenzenesulfonic acid,MADN0329,-,6926,88-21-1,,,,"['R01842', 'R02354', 'R02355', 'R02356', 'R025...",['1.14.14.1'],['hsa00071'],0,C1=CC=C(C(=C1)N)S(=O)(=O)O
4,Quinoline-4-carboxylic acid,MADN0333,-,10243,486-74-8,,,,"['R00366', 'R00372', 'R00635', 'R01340', 'R017...","['1.2.3.1', '1.4.3.3', '2.6.1.4']","['hsa00260', 'hsa00280']",0,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O
5,cyclo(glu-glu),MADN0466,-,7408481,16691-00-2,,,,"['R00093', 'R00114', 'R00243', 'R00248', 'R002...","['1.4.1.13', '1.4.1.14', '3.5.3.7']","['hsa00250', 'hsa00330']",0,C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O
6,P-sulfanilic acid,MADN0498,-,8479,121-57-3,,,,"['R01842', 'R02354', 'R02355', 'R02356', 'R025...",['1.14.14.1'],['hsa00071'],0,C1=CC(=CC=C1N)S(=O)(=O)O
7,Methylcysteine,MADP0119,HMDB0002108,24417,1187-84-4,,,,"['R00891', 'R00894', 'R01289', 'R01290', 'R027...","['4.2.1.22', '4.3.2.9', '6.3.2.2']","['hsa00260', 'hsa00270', 'hsa00480']",0,CSC[C@@H](C(=O)O)N
8,Asp-Arg,MADP0548,-,16122509,-,,,,"['R00256', 'R00485', 'R01398', 'R01579', 'R019...","['2.1.3.3', '3.5.1.38', '3.5.3.7']","['hsa00220', 'hsa00330']",0,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N
9,LPI(16:2/0:0),MEDN1253,-,,-,,,,"['R01185', 'R01186', 'R01187', 'R01313', 'R013...","['1.13.11.33', '3.1.1.4', '3.1.3.25']","['hsa00521', 'hsa00564', 'hsa00590']",0,CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...


<h2>🧬 Mapping Enzymes to Genes via KEGG Orthology</h2>

<h3>🎯 Objective</h3>
<p>
To establish gene-level associations for each enzyme that acts on CRC-relevant metabolites by querying KEGG Orthology (KO) data.
</p>

---

<h3>🔄 What This Script Does</h3>
<ol>
    <li><strong>Loads</strong> the <code>ReacEnzyPath.csv</code> dataset that includes enzyme (EC) and reaction annotations.</li>
    <li><strong>Filters</strong> entries to those that include both enzyme and reaction information.</li>
    <li><strong>Uses KEGG’s REST API</strong> (via <code>bioservices</code>) to:
        <ul>
            <li>Fetch orthology mappings (<code>KO</code> IDs) for each enzyme.</li>
            <li>Parse the associated <code>GENES</code> section to extract gene IDs across species.</li>
        </ul>
    </li>
    <li><strong>Reshapes</strong> the dataset so each enzyme entry maps to a list of gene identifiers.</li>
    <li><strong>Saves</strong> the output as <code>final_df_with_genes.csv</code>, linking enzymes to genes.</li>
</ol>

---

<h3>📁 Output Snapshot</h3>
<p>The resulting <code>final_df</code> links each KEGG enzyme (EC number) to:</p>
<ul>
    <li>The reactions it catalyzes</li>
    <li>A curated list of genes associated with that enzyme across organisms</li>
</ul>

<pre><code>
kegg_enzyme     | kegg_reactions         | gene_list
--------------- | ---------------------- | --------------------------
'1.1.1.14'      | ['R01070']             | ['hsa:847', 'mmu:12345', ...]
'2.7.1.1'       | ['R00100', 'R00500']   | ['hsa:111', 'eco:b0012', ...]
</code></pre>


In [2]:
import pandas as pd
from bioservices import KEGG # Make sure KEGG is imported
import re # Make sure re is imported
import ast # For ast.literal_eval

# Initialize KEGG service (needs to be done once per session)
print("Initializing KEGG service...")
k = KEGG()
print("✅ KEGG service initialized.")

# --- Load your initial data ---
# Make sure 'ReacEnzyPath.csv' is uploaded to your Colab environment
try:
    print("\nLoading 'ReacEnzyPath.csv'...")
    ReacEnzyPath = pd.read_csv('ReacEnzyPath.csv')
    print("✅ ReacEnzyPath.csv loaded successfully.")
except FileNotFoundError:
    print("❌ ERROR: 'ReacEnzyPath.csv' not found. Please upload it to your Colab environment.")
    exit() # Stop if the base file isn't there

# --- Create 'filtered_df' ---
print("\nCreating 'filtered_df'...")
filtered_df = ReacEnzyPath[
    (ReacEnzyPath['kegg_enzymes'] != '[]') &
    (ReacEnzyPath['kegg_reactions'] != '[]')
].copy()

def extract_first_id(kegg_list_str): # Not strictly needed for final_df but was part of original filtered_df creation
    if not isinstance(kegg_list_str, str): return None
    ids = kegg_list_str.strip('[]').split(',')
    if ids:
        first_id = ids[0].strip().strip("'").strip('"')
        if first_id: return first_id
    return None
filtered_df['first_kegg_enzyme'] = filtered_df['kegg_enzymes'].apply(extract_first_id)
filtered_df['first_kegg_reaction'] = filtered_df['kegg_reactions'].apply(extract_first_id)
filtered_df.dropna(subset=['first_kegg_enzyme', 'first_kegg_reaction'], inplace=True) # Also not strictly needed if not using these columns later
print("✅ 'filtered_df' created.")

# --- Define get_genes_manual_parse function ---
print("\nDefining 'get_genes_manual_parse' function...")
def get_genes_manual_parse(enzyme_id, reaction_id): # reaction_id is a placeholder
    enzyme_query = f"ec:{enzyme_id}"
    ko_id_for_error_msg = "N/A" # Initialize for error message

    # --- ADDED DEBUG PRINT ---
    # print(f"DEBUG: Processing EC: {enzyme_id}") # You can uncomment this for verbose debugging
    try:
        # --- ADDED DEBUG PRINT ---
        # print(f"  DEBUG: Fetching and parsing enzyme entry: {enzyme_query}")
        parsed_enzyme = k.parse(k.get(enzyme_query)) # This line might trigger bioservices warnings

        if 'ORTHOLOGY' not in parsed_enzyme or not parsed_enzyme['ORTHOLOGY']:
            # --- ADDED DEBUG PRINT ---
            # print(f"  DEBUG: No ORTHOLOGY found or ORTHOLOGY is empty for EC: {enzyme_id}")
            return []
        
        ko_id = list(parsed_enzyme['ORTHOLOGY'].keys())[0]
        ko_id_for_error_msg = ko_id # Store for potential error message
        
        # --- ADDED DEBUG PRINT ---
        # print(f"  DEBUG: Fetching KO entry for: ko:{ko_id} (derived from EC: {enzyme_id})")
        ko_entry_text = k.get(f"ko:{ko_id}") # This line might also trigger bioservices "Not Found"

        if not ko_entry_text: 
            # --- ADDED DEBUG PRINT ---
            # print(f"  DEBUG: No entry text found for ko:{ko_id}")
            return []
            
        all_genes = []
        in_genes_section = False
        for line in ko_entry_text.strip().split('\n'):
            if not line.startswith(' ') and in_genes_section: break
            if line.startswith('GENES'):
                in_genes_section = True
                line = line[5:].strip()
            if in_genes_section and ':' in line:
                parts = line.split(':', 1)
                org_code = parts[0].strip().lower()
                gene_details = parts[1].strip()
                potential_genes = gene_details.split(' ')
                for item in potential_genes:
                    if not item: continue
                    gene_id_match = re.match(r"([a-zA-Z0-9_.-]+)(\(|$)", item) 
                    if gene_id_match:
                        gene_id = gene_id_match.group(1)
                        all_genes.append(f"{org_code}:{gene_id}")
        
        # if not all_genes and in_genes_section:
            # --- ADDED DEBUG PRINT ---
            # print(f"  DEBUG: GENES section was found for ko:{ko_id}, but no genes were extracted.")
        # elif not all_genes and not in_genes_section and ko_entry_text:
            # --- ADDED DEBUG PRINT ---
            # print(f"  DEBUG: No GENES section found or no genes extracted for ko:{ko_id}. Entry text snippet:\n'''\n{ko_entry_text[:300]}...\n'''")

        return list(set(all_genes))
    except Exception as e:
        # Enhanced error print now includes the KO ID if it was determined
        # print(f"  ERROR_CAUGHT in get_genes_manual_parse for EC {enzyme_id} (KO: {ko_id_for_error_msg}): {type(e).__name__} - {e}")
        return []
print("✅ 'get_genes_manual_parse' function defined.")

# --- Reshape 'filtered_df' to create 'final_df' ---
print("\nReshaping DataFrame to create 'final_df'...")
df_to_reshape = filtered_df.copy()

print("Parsing list-like columns into actual lists...")
for col in ['kegg_enzymes', 'kegg_reactions']:
    df_to_reshape[col] = df_to_reshape[col].apply(ast.literal_eval)

print("Reshaping DataFrame to have one enzyme per row...")
df_exploded = df_to_reshape.explode('kegg_enzymes')

print("Fetching specific gene list for each enzyme (this may take a while)...")
# Using tqdm for progress bar if you have many rows
from tqdm.auto import tqdm
tqdm.pandas(desc="Fetching genes")
df_exploded['gene_list'] = df_exploded['kegg_enzymes'].progress_apply(
    lambda enzyme_id: get_genes_manual_parse(enzyme_id, reaction_id=None)
)

final_df = df_exploded[['kegg_enzymes', 'kegg_reactions', 'gene_list']].copy()
final_df.rename(columns={'kegg_enzymes': 'kegg_enzyme'}, inplace=True)
final_df = final_df[final_df['gene_list'].apply(lambda x: len(x) > 0)] # Remove rows with empty gene lists

print("✅ 'final_df' created successfully.")
print("\n--- Head of final_df ---")
# Assuming you are in an environment where display works (like Jupyter/Colab)
# If not, use print(final_df.head().to_string())
try:
    display(final_df.head())
except NameError:
    print(final_df.head().to_string())

# --- Save final_df to a CSV file ---
output_csv_path = 'final_df_with_genes.csv'
final_df.to_csv(output_csv_path, index=False)
print(f"\n✅ 'final_df' successfully saved to '{output_csv_path}'")

Initializing KEGG service...
✅ KEGG service initialized.

Loading 'ReacEnzyPath.csv'...
✅ ReacEnzyPath.csv loaded successfully.

Creating 'filtered_df'...
✅ 'filtered_df' created.

Defining 'get_genes_manual_parse' function...
✅ 'get_genes_manual_parse' function defined.

Reshaping DataFrame to create 'final_df'...
Parsing list-like columns into actual lists...
Reshaping DataFrame to have one enzyme per row...
Fetching specific gene list for each enzyme (this may take a while)...


Fetching genes:   0%|          | 0/102 [00:00<?, ?it/s]

✅ 'final_df' created successfully.

--- Head of final_df ---


Unnamed: 0,kegg_enzyme,kegg_reactions,gene_list
0,1.1.1.29,"[R00145, R00146, R00717, R01088, R01361, R0138...","[prhz:CRX69_22360, ccag:SR908_00095, pae:PA462..."
0,1.1.1.30,"[R00145, R00146, R00717, R01088, R01361, R0138...","[sufl:FIL70_21435, aus:IPK37_09295, ofa:NF556_..."
0,1.4.1.9,"[R00145, R00146, R00717, R01088, R01361, R0138...","[csee:C10C_0892, alic:GI364_13075, rpod:E0E05_..."
1,1.1.1.307,"[R01431, R01896, R07152, R09477, R11618, R11620]","[pte:PTT_08729, ttt:THITE_2121064, mbe:MBM_097..."
1,1.1.1.9,"[R01431, R01896, R07152, R09477, R11618, R11620]","[psx:DR96_2471, abe:ARB_04704, psta:BGK56_1982..."



✅ 'final_df' successfully saved to 'final_df_with_genes.csv'


<h2>🧬 Human Gene Symbol Extraction from Enzyme–Gene Associations</h2>

<h3>🎯 Objective</h3>
<p>
To isolate human gene identifiers from the KEGG-derived enzyme–gene mappings and convert them into standard gene symbols.
</p>

---

<h3>🔍 What This Script Does</h3>

<ol>
    <li><strong>Loads</strong> the <code>final_df_with_genes.csv</code> which links KEGG enzymes to gene lists.</li>
    <li><strong>Filters</strong> those gene lists to retain only human genes (i.e., entries starting with <code>hsa:</code>).</li>
    <li><strong>Queries KEGG</strong> to convert each human KEGG gene ID to a standard HGNC gene symbol (e.g., <code>hsa:5290 → G6PD</code>).</li>
    <li><strong>Stores</strong> the resulting gene symbol list alongside each enzyme for downstream use.</li>
    <li><strong>Exports</strong> the results into <code>analysis_ready_df.csv</code>.</li>
</ol>

---

<h3>📦 Output Structure</h3>

<p>The final output is a filtered DataFrame with the following structure:</p>

<pre><code>
kegg_enzyme    | kegg_reactions         | gene_symbols
-------------- | ---------------------- | -----------------------------
1.1.1.14       | ['R01070']             | ['ADH1A', 'ADH1B', 'ADH1C']
2.7.1.1        | ['R00100', 'R00500']   | ['HK1', 'HK2']
</code></pre>

<p>Only rows with non-empty gene symbol lists are retained.</p>


In [1]:
from bioservices import KEGG
import pandas as pd
import ast # For ast.literal_eval

# --- Step 0: Load your final_df and convert 'gene_list' ---
print("Loading 'final_df_with_genes.csv'...")
try:
    final_df = pd.read_csv('final_df_with_genes.csv')
    print("✅ 'final_df_with_genes.csv' loaded successfully.")
    
    # *** THIS IS THE CRUCIAL FIX ***
    # Convert the 'gene_list' column from string representation to actual lists
    # Also, handle potential NaN values if any rows had no genes initially
    print("Converting 'gene_list' column to actual lists...")
    final_df['gene_list'] = final_df['gene_list'].apply(
        lambda x: ast.literal_eval(x) if isinstance(x, str) else ([] if pd.isna(x) else x)
    )
    print("✅ 'gene_list' column converted.")
    
except FileNotFoundError:
    print("❌ ERROR: 'final_df_with_genes.csv' not found. Please ensure it exists.")
    exit()
except Exception as e:
    print(f"❌ ERROR during loading or ast.literal_eval: {e}")
    exit()

# Initialize KEGG service if needed
print("\nInitializing KEGG service...")
k = KEGG()
print("✅ KEGG service initialized.")

# --- Step 1: Function to isolate human ('hsa') genes ---
def filter_human_genes(gene_list_val):
    """Takes a list of KEGG gene IDs and returns only the human ones."""
    # Ensure gene_list_val is actually a list
    if not isinstance(gene_list_val, list):
        return []
    human_genes = [gene for gene in gene_list_val if isinstance(gene, str) and gene.startswith('hsa:')]
    return human_genes

# --- Step 2: Function to convert human KEGG IDs to official Gene Symbols ---
gene_id_to_symbol_cache = {}
def convert_kegg_ids_to_symbols(kegg_ids):
    """
    Takes a list of human KEGG IDs (e.g., ['hsa:5199']) and converts them
    to gene symbols (e.g., ['PEPD']), using a cache to improve speed.
    """
    symbols = []
    if not isinstance(kegg_ids, list): # Ensure input is a list
        return symbols

    for kegg_id in kegg_ids:
        if not isinstance(kegg_id, str): # Ensure item in list is a string
            continue
        if kegg_id in gene_id_to_symbol_cache:
            symbols.append(gene_id_to_symbol_cache[kegg_id])
            continue
        try:
            gene_info = k.get(kegg_id)
            if not gene_info: # Handle empty response from k.get()
                # print(f"DEBUG: No gene_info returned for {kegg_id}")
                continue
            parsed_gene = k.parse(gene_info)
            if 'SYMBOL' in parsed_gene:
                symbol = parsed_gene['SYMBOL'].split(',')[0].strip()
                if symbol:
                    symbols.append(symbol)
                    gene_id_to_symbol_cache[kegg_id] = symbol
            # else:
                # print(f"DEBUG: No 'SYMBOL' field in parsed_gene for {kegg_id}")
        except Exception as e:
            # print(f"DEBUG: Error converting {kegg_id} to symbol: {e}")
            continue
    return symbols

# --- Step 3: Apply these functions to your DataFrame ---
print("\nCleaning data for expression analysis...")

print("1. Filtering for human genes...")
final_df['human_gene_ids'] = final_df['gene_list'].apply(filter_human_genes)

print("2. Converting human gene IDs to symbols (this may take a moment)...")
# Using tqdm for progress bar if you have many rows
from tqdm.auto import tqdm
tqdm.pandas(desc="Converting to Symbols")
final_df['gene_symbols'] = final_df['human_gene_ids'].progress_apply(convert_kegg_ids_to_symbols)

# --- Display the final, analysis-ready DataFrame ---
# We create a new DataFrame, keeping only the useful columns and non-empty lists.
analysis_ready_df = final_df[final_df['gene_symbols'].apply(lambda x: len(x) > 0)].copy()

print("\n--- Analysis-Ready DataFrame ---")
# Assuming you are in an environment where display works (like Jupyter/Colab)
# If not, use print(analysis_ready_df[['kegg_enzyme', 'kegg_reactions', 'gene_symbols']].to_string())
try:
    display(analysis_ready_df[['kegg_enzyme', 'kegg_reactions', 'gene_symbols']])
except NameError:
    print(analysis_ready_df[['kegg_enzyme', 'kegg_reactions', 'gene_symbols']].to_string())

if not analysis_ready_df.empty:
    output_csv_path_analysis_ready = 'analysis_ready_df.csv'
    analysis_ready_df.to_csv(output_csv_path_analysis_ready, index=False)
    print(f"\n✅ 'analysis_ready_df' successfully saved to '{output_csv_path_analysis_ready}'")
else:
    print("\n⚠️ 'analysis_ready_df' is empty. No CSV file saved.")

Loading 'final_df_with_genes.csv'...
✅ 'final_df_with_genes.csv' loaded successfully.
Converting 'gene_list' column to actual lists...
✅ 'gene_list' column converted.

Initializing KEGG service...
✅ KEGG service initialized.

Cleaning data for expression analysis...
1. Filtering for human genes...
2. Converting human gene IDs to symbols (this may take a moment)...


Converting to Symbols:   0%|          | 0/88 [00:00<?, ?it/s]


--- Analysis-Ready DataFrame ---


Unnamed: 0,kegg_enzyme,kegg_reactions,gene_symbols
1,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...",[BDH1]
6,4.1.2.4,"['R02750', 'R02749', 'R01066']",[DERA]
7,2.7.1.15,"['R02750', 'R02749', 'R01066']",[RBKS]
10,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017...",[AOX1]
11,1.4.3.3,"['R00366', 'R00372', 'R00635', 'R01340', 'R017...",[DAO]
17,4.2.1.22,"['R00891', 'R00894', 'R01289', 'R01290', 'R027...","[CBS, CBS]"
18,4.3.2.9,"['R00891', 'R00894', 'R01289', 'R01290', 'R027...",[GGCT]
20,2.1.3.3,"['R00256', 'R00485', 'R01398', 'R01579', 'R019...",[OTC]
23,1.13.11.33,"['R01185', 'R01186', 'R01187', 'R01313', 'R013...",[ALOX15]
24,3.1.1.4,"['R01185', 'R01186', 'R01187', 'R01313', 'R013...","[PLA2G3, PLA2G2E, PLA2G2F, PLA2G5, PLA2G12B, P..."



✅ 'analysis_ready_df' successfully saved to 'analysis_ready_df.csv'


In [168]:
display(analysis_ready_df)

Unnamed: 0,kegg_enzyme,kegg_reactions,gene_list,human_gene_ids,gene_symbols
1,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R01361', 'R01388', 'R01434', 'R02196']","[snz:DC008_03210, babc:DO78_1636, vde:111254904, dtl:H8F01_10525, tup:102502144, sdub:R1T39_03790, spf:SpyM50444, rpb:RPB_1408, rfv:RFYW14_02492, mpuf:101670579, sala:ESZ53_00620, canu:128165098, egz:104127126, anu:117717787, rfr:Rfer_3285, boj:CBF45_05335, sphs:ETR14_06525, sauo:BV401_07595, mamm:ABNF92_01235, cvn:111100326, rre:MCC_00235, bdf:WI26_10680, vde:111243348, cpra:CPter91_3560, mrm:A7982_06988, hda:BB347_02440, amye:MJQ72_44515, baml:BAM5036_1757, psnc:C0058_19215, dsu:Dsui_2312, hdi:HDIA_4200, xom:XOO2083, psf:PSE_0086, ppun:PP4_24260, xct:J151_02204, pmaj:107208857, pdk:PADK2_15635, parh:I5S86_15085, pcri:104029833, nfe:HUT17_00330, pbi:103054298, mju:123872216, hew:HBDW_10940, kpp:A79E_2194, bel:BE61_89060, pht:BLM14_18365, mdo:100032119, xva:C7V42_11020, sbj:CF168_03465, paca:ID47_02340, tala:104362023, ppg:PputGB1_2793, bxh:BAXH7_01614, acur:JZ785_17385, bof:FQV39_25535, cglo:123273635, phas:123827211, sdeg:GOM96_07345, xgl:120789671, mecl:CLZ_23735, blup:AB3L03_13590, spca:GL174_07630, fmu:J7337_010154, sko:100369271, bpsa:BBU_3616, rpi:Rpic_1981, pdec:H1Q58_04430, tbr:Tb10.389.1850, caq:IM40_07690, bgp:BGL_2c00220, scam:104153220, jov:P8627_12790, xla:100127326, clec:106667552, bthe:BTN_4441, sdur:M4V62_24435, vem:105559323, phey:O4N75_07690, rcv:PFY06_10415, pln:Plano_0698, afil:140158910, acyn:ORV05_24505, msut:LC048_03300, hail:ASB7_03780, bant:A16_42500, spa:M6_Spy1393, hcra:139468715, nwh:119426707, maea:128202956, udv:129226384, snh:120038350, gmu:124857863, sasa:106588253, hqn:M0220_16170, rhib:I8E17_19855, llp:GH975_02270, ahad:FRZ06_07540, ftc:DA46_1491, pbl:PAAG_04142, otc:121341177, ...]",[hsa:622],[BDH1]
6,4.1.2.4,"['R02750', 'R02749', 'R01066']","[kmi:VW41_22690, aste:118514563, mbk:K60_005050, seo:STM14_5487, eav:EH197_09590, prj:NCTC6933_00687, pdes:FE840_005695, ccoe:CETAM_01785, pra:PALO_00765, eei:NX720_09840, ppru:FDP22_20045, otu:111424818, rle:RL2744, rsi:Runsl_0373, lah:LA20533_05135, mlut:JET14_03415, lmr:LMR479A_2105, adc:FOC79_00550, nbr:O3I_039020, spac:B1H29_13900, ecoj:P423_24885, smon:AWR27_23450, btrh:F543_4180, ccae:111928508, bves:QO058_16695, sapi:SAPIS_v1c05450, sik:K710_0989, sdd:D9753_13300, vvl:VV93_v1c24000, nann:O0S08_39625, mcar:MCCPILRI181_00884, selo:AXE86_06275, mde:101894936, ncb:C0V82_15415, rro:104666401, sbn:Sbal195_3366, oki:109882147, mcep:124999789, acir:OG979_01200, con:TQ29_14550, pcx:LPB68_15445, mrm:A7982_04317, gej:A0V43_10385, arm:ART_4057, scl:sce5008, yfr:AW19_1747, mgz:GCW_03310, tpal:117643508, stii:DRB96_39430, agk:NYR60_04165, apr:Apre_1538, tsu:Tresu_1875, mci:Mesci_0700, pfq:QQ39_16870, pals:PAF20_06770, fok:KK2020170_10200, bwe:BcerKBAB4_1752, ptd:PTET_a0237, pkp:SK3146_01526, psyb:KD050_02310, aur:HMPREF9243_0259, cpsl:KBP54_08700, psew:JHW44_18160, laci:CW733_06980, fpe:Ferpe_0063, dot:EZY12_13725, pseg:D3H65_18210, bmet:BMMGA3_12000, ssun:H9Q77_07505, sbu:SpiBuddy_2882, sagm:BSA_20580, yet:CH48_170, ssu:SSU05_1085, omm:135536387, mcit:NCTC10181_00336, ock:EXM22_06170, mfu:LILAB_36070, mhas:MHAS_04846, mtd:UDA_0478, plul:FOB45_05630, cet:B8281_16170, tph:TPChic_0264, bma:BMAA0110, cgt:cgR_0457, kpq:KPR0928_13165, snt:SPT_1360, scap:AYP1020_1327, gln:F1C58_05850, spei:EHW89_04855, cter:A606_10215, erm:EYR00_03975, bose:RMR04_26395, pcub:JR316_0007723, sacg:FDZ84_07805, tmol:138131687, saui:AZ30_11280, trc:DYE49_03150, eol:Emtol_3254, btf:YBT020_09855, apa:APP7_1072, ...]",[hsa:51071],[DERA]
7,2.7.1.15,"['R02750', 'R02749', 'R01066']","[sapl:U0355_01780, pmq:PM3016_2063, pvd:CFBP1590__2212, gbc:GbCGDNIH3_1810, edj:ECDH1ME8569_3640, fga:104079114, rhi:NGR_c00540, mcoc:116080927, abao:BEQ56_11105, orp:MOP44_00870, tco:Theco_3928, sfh:SFHH103_05765, elim:B2M23_12685, pcib:F9282_10935, eun:UMNK88_4564, spoc:NCTC10925_00519, aho:Ahos_2238, btv:BTHA_3642, fmi:F0R74_08400, lsq:119600727, scap:AYP1020_1689, cgb:cg2554, pvo:PVOR_23464, erh:ERH_1481, sacd:HS1genome_1373, esn:127007485, cbp:EB354_02475, cam:101510967, cpoi:OE229_04470, ppsc:EHS13_10180, paeu:BN889_02112, lna:RIN67_07825, hdi:HDIA_1783, burm:GJG85_30965, ecg:E2348C_0066, bli:BL02439, day:FV141_10315, aou:ACTOB_007691, spad:DVK44_01585, pcs:N7525_008093, lpz:Lp16_1768, pgh:FH974_17305, palo:E6C60_0586, abao:BEQ56_00880, atl:Athai_17130, cfu:CFU_4098, talb:FTW19_12800, llut:K1X41_15060, pfq:QQ39_16110, babo:DK55_3114, plon:Pla110_26750, bhk:B4U37_20020, emt:CPZ25_017840, acij:JS278_00018, tle:Tlet_1905, baa:BAA13334_II00221, lne:FZC33_27210, marr:BKP56_00170, ptl:AOT13_01815, ecd:ECDH10B_3940, smeo:124389642, kos:KORDIASMS9_01856, kno:V6K52_15640, ers:K210_05865, eto:RIN69_22125, rmk:RAN89_11750, pacn:TIA1EST1_05985, vir:X953_18360, kut:JJ691_78020, coq:D9V35_00510, sgp:SpiGrapes_1098, pben:V6W80_11925, strr:EKD16_02435, cbol:CGC65_02725, sauo:BV401_17455, gly:K3N28_09940, pcg:AXG94_05815, tln:139821747, rlt:Rleg2_0053, ggn:109304351, lkn:FGL74_07990, csu:CSUB_C0883, dpte:113788994, ypd:YPD4_3099, supe:P0H77_22950, acac:EYQ97_13190, ajc:117111686, yre:HEC60_18050, bpus:UP12_16890, bie:RFF05_02595, amz:B737_4140, senb:BN855_38850, rao:DSD31_12470, fgi:OP10G_4377, rat:M949_1669, kok:KONIH1_14745, ppog:QPK24_14700, seq:SZO_15150, ecul:PVA46_06505, sud:ST398NM01_0283, ...]",[hsa:64080],[RBKS]
10,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R01709', 'R02125', 'R02457', 'R02657', 'R02894', 'R02923', 'R03871', 'R04085', 'R04221', 'R04904', 'R05861', 'R06124', 'R07400', 'R08349', 'R08384', 'R08408']","[elk:111142784, ccri:104168783, arow:112969295, oor:101285123, mthb:126932306, iti:101965594, cbai:105082310, ggn:109301889, pprm:120486809, clud:112656803, pcao:104053673, mthb:126932129, lroh:127153727, mcf:102134825, asn:102383614, pviv:125172692, vko:123024423, cgib:127942713, nvs:122903471, pfor:103149483, pmei:106906675, mmur:105856981, muo:115474518, asn:102377274, mrv:120375097, bbuf:121007452, dtc:141054219, mcc:701024, acs:100565481, hsa:316, anu:117705354, vpc:102539743, tgu:100217503, csum:138075916, rfq:117026362, mcoc:116088456, csum:138076312, etl:114055687, pdab:135726365, lav:100662726, mmu:213043, ccar:109090992, pcoc:116233971, amj:102558282, cimi:108315708, aml:100474367, xco:114140277, tge:112636356, tben:117499028, gste:104251026, hmh:116810476, afb:129111600, hcq:109520227, pmaj:107207425, ptg:102949513, ocu:100008601, ray:107503175, omc:131529646, eny:140364181, oro:101369776, cge:100760212, scam:104150187, vlg:121480828, mcoc:116088453, sfum:130021027, vvp:112922804, ajm:119065518, gas:123242136, csab:103217611, apla:101795737, pvp:105298171, cvf:104291038, hcg:128339747, myum:139014904, pcoc:116233962, apri:131200566, vlg:121480502, trn:134304379, mgp:100549268, cel:CELE_B0222.9, maua:101823275, cdk:105099168, nsu:110591947, cmac:104483254, mamb:125254213, oke:118363697, mcal:110291452, hle:104829954, pov:109639051, anan:135273577, xtr:100379890, afor:103903466, hgl:101710340, rro:104670351, afz:127557273, shab:115607863, gcl:127018248, maua:101823809, prob:127226326, npr:108799582, ...]",[hsa:316],[AOX1]
11,1.4.3.3,"['R00366', 'R00372', 'R00635', 'R01340', 'R01709', 'R02125', 'R02457', 'R02657', 'R02894', 'R02923', 'R03871', 'R04085', 'R04221', 'R04904', 'R05861', 'R06124', 'R07400', 'R08349', 'R08384', 'R08408']","[achc:115346439, caua:113048893, mtm:MYCTH_60632, mlan:JY572_08485, pja:122249508, loc:102688406, ccrn:123292112, sanl:KZO11_32750, plop:125349017, aalb:115268049, ppyr:116172507, mia:OCU_26140, msim:MSIM_50640, nfi:NFIA_075980, muo:115479103, grl:LPB144_05280, dpx:DAPPUDRAFT_213708, der:6542841, hhar:128146308, pcan:112573874, cqu:CpipJ_CPIJ007273, tben:117496805, zvi:118095871, pfy:PFICI_08522, ngi:103746689, etl:114062674, fox:FOXG_10896, sla:SERLADRAFT_458189, spil:134089314, scae:IHE65_05825, arow:112963342, aswu:HUW51_09195, shs:STEHIDRAFT_127642, mbrn:26247253, asp:AOR13_1084, dpa:109541460, seng:OJ254_27515, nca:Noca_1780, tru:101077517, shaw:CEB94_35480, bcom:BAUCODRAFT_302218, pif:PITG_08125, aste:118517591, pecq:AD017_16400, amd:AMED_2680, ecra:117944970, sanh:107696048, ang:An06g02340, pat:Patl_3809, mmi:MMAR_2477, ccad:122422138, nfi:NFIA_073030, alz:AV940_11820, pou:POX_d06014, bod:106622550, oha:104331919, kdj:28968585, dros:Drose_07775, pvt:110083399, cdet:87948429, tml:GSTUM_00002749001, dros:Drose_00720, ksh:87958508, cot:CORT_0E06250, caur:CJI96_0004484, lav:100658336, shr:100929799, pbg:122470339, acoe:138119425, mly:CJ228_001315, ker:91099515, epa:110233699, pfp:PFL1_02372, glz:GLAREA_00037, som:SOMG_02970, cal:CAALFM_C500450CA, sauh:SU9_029305, schf:IPT68_31595, bfre:138627274, bfu:BCIN_12g04240, nsu:110592535, pdic:114510022, ctp:CTRG_04578, ker:91098832, dmat:Dmats_05555, clum:117736257, ker:91100917, tasa:A1Q1_03993, bbel:109466174, nrn:SHK17_06825, zro:ZYRO0B01892g, bom:102288350, dvc:Dvina_08015, mte:CCDC5079_1761, tsd:MTP03_39660, mul:MUL_1659, stsi:A4E84_34650, agen:126041210, xma:102218661, mcra:ID554_09910, ...]",[hsa:1610],[DAO]
17,4.2.1.22,"['R00891', 'R00894', 'R01289', 'R01290', 'R02743', 'R03749', 'R04942', 'R10993']","[fsf:AABD74_12985, faz:M0M57_02450, bbre:B12L_0456, kfa:Q73A0000_03375, marm:YQ22_17170, mlx:118026969, arto:PJ267_15910, pago:IRJ34_15185, pseh:XF36_20675, cfa:611071, plal:FXN65_01435, ccau:EG346_13215, uli:ETAA1_47060, pecq:AD017_24075, kno:V6K52_15800, pfuc:122517607, gsh:117359439, ppel:H6H00_04060, vcd:124538385, sspn:LXH13_24300, ccic:134058111, lfx:LU699_11890, ccay:125643160, fcu:NOX80_00250, fcg:LNP23_09810, obb:114877262, sinn:ABB07_22495, opi:101516453, hfl:PUV54_10000, nfa:NFA_48640, aoq:129238563, pqu:IG609_018465, cmaq:H0S70_06125, ecli:ECNIH5_09110, abre:pbN1_27460, ldc:111506556, pbro:HOP40_03570, shun:DWB77_04595, kdi:Krodi_2944, xcb:XC_3636, edl:AAZ33_09855, dye:EO087_08555, mcx:BN42_20925, ifu:128601987, abe:ARB_07681, rhi:NGR_a00730, rfs:C1I64_11905, gms:SOIL9_23980, spin:KV203_14085, spun:BFF78_24940, nob:CW736_10230, fbm:MQE35_16675, mgi:Mflv_2043, ade:Adeh_3300, sscv:125987927, fbi:L0669_13240, foh:FORMA_16570, btax:128045376, pbf:CFX0092_A0187, scb:SCAB_5301, rhd:R2APBS1_0565, umr:103680763, sma:SAVERM_3510, lmb:C9I47_0687, cpu:CPFRC_03115, cop:CP31_03470, lpal:LDL79_03265, tatv:25775430, bbis:104984942, cly:Celly_2204, ppae:LDL65_26450, nmf:NMS_2520, spir:CWM47_15965, sasa:106562750, mlw:MJO58_21560, clud:112651104, pfll:139116215, agg:C1N71_03000, wch:wcw_1144, aten:116305218, miz:BAB75_06520, mle:ML2396, pmoa:120511828, ned:HUN01_13785, salp:111959713, fcb:LOS89_06985, lse:F1C12_21180, zvi:118085215, cgar:128506706, smau:118311295, pct:PC1_3903, mabl:MMASJCM_1221, cira:LFM56_13715, prho:PZB74_21770, nyu:D7D52_02275, tpre:106652234, ako:N9A08_12700, plai:106958598, kbu:Q4V64_20580, smac:SMDB11_0903, ...]","[hsa:102724560, hsa:875]","[CBS, CBS]"
18,4.3.2.9,"['R00891', 'R00894', 'R01289', 'R01290', 'R02743', 'R03749', 'R04942', 'R10993']","[hcra:139453113, nwh:119425359, ccae:111923889, aang:118212318, dro:112306962, amex:103033278, ptg:102972301, cbr:CBG_11175, lalb:132524661, mna:107544417, dpol:127845240, stru:115173880, emc:129337838, phas:123811697, obb:114876242, svg:106854490, atra:106140073, pxu:106124195, dpl:KGM_215638, pmei:106912476, tng:GSTEN00021982G001, pxu:106124283, ptru:123498548, sanh:107673341, lsax:138972038, nvg:124309443, cfr:102512131, shab:115613484, pgeo:117455238, mmur:105855110, zca:113911457, prap:111003918, abru:129963705, ray:107499554, aqu:100640069, amer:121598538, pret:103472371, csyr:103270036, nvi:100122156, omy:118938772, stru:115177047, ccrc:123694109, haw:110377784, scac:106094106, hze:124634747, one:115128635, slal:111664184, ppap:129799979, vps:122628419, hvi:124355419, achc:115339007, oga:100956580, pmur:107294022, gmw:113518297, gmw:113518182, tge:112621323, nasi:112406298, amj:102567732, nai:NECAME_13395, arot:135259530, plet:104616011, hhal:112211193, cvs:136990277, otu:111425266, epz:103552750, gmh:115545208, omy:110511884, gsh:117353761, asua:134225391, rmp:119181366, apln:108734187, ocla:139380163, ipu:100528263, bmor:101743698, aju:106968672, maea:128209168, xco:114152740, eja:138211856, ofu:114351739, cang:105520308, pov:109631991, breg:104631158, anu:117720627, hhip:117777335, dsr:110178952, asn:102385482, rmp:119181365, hrf:124147279, manu:129448901, plep:121965390, ipu:108256920, pame:138699454, eju:114221157, ola:101162615, bacu:102997196, eai:106831060, lroh:127181797, npt:124219955, vvp:112935405, lgf:123606649, ...]",[hsa:79017],[GGCT]
20,2.1.3.3,"['R00256', 'R00485', 'R01398', 'R01579', 'R01990', 'R06134']","[led:BBK82_24610, pgis:I6I06_15380, pdf:CD630DERM_20300, beba:BWI17_08480, ncx:Nocox_24360, marc:AR505_1401, hix:NTHI723_01089, cma:Cmaq_0092, ele:Elen_1948, eclc:ECR091_02725, fpb:NLJ00_11370, kpb:FH42_19695, ssch:LH95_00415, fpla:A4U99_11020, oca:OCAR_4655, sax:USA300HOU_0067, casp:NQ535_02670, mpak:MIU77_08820, axy:AXYL_01973, tml:GSTUM_00003888001, nli:G3M70_07840, llx:NCDO2118_2143, vcv:GJV44_00392, hbc:AEM38_02735, prq:CYG50_18400, ppai:E1956_06255, aes:C2U30_12980, huw:FPZ11_00845, casp:NQ535_20085, hara:AArcS_3066, egd:GS424_015355, mesg:MLAUSG7_1366, clc:Calla_0960, cid:P73_3734, mhd:Marky_1757, agar:OAG1_36780, dch:SY84_04190, pba:PSEBR_a4400, ceec:P3F56_01740, bvit:JIP62_11355, set:SEN4216, slia:HA039_07325, ecq:ECED1_5107, bthy:AQ980_09205, amav:GCM10025877_30580, prop:QQ658_14495, sdiz:WCY31_03070, alr:DS731_18720, hya:HY04AAS1_1002, mtuu:HKBT2_1753, cld:CLSPO_c26650, mmb:Mmol_1369, bku:QIA22_04820, mok:Metok_1035, mcra:ID554_01870, emv:HQR01_06430, cpoo:109305927, nih:NitYY0810_C0444, serh:GF111_10540, abau:IX87_21695, bsan:CHH28_18580, brhi:104489781, rpus:CFBP5875_00430, aang:118214522, fwa:DCMF_08690, synp:Syn7502_01014, clu:CLUG_05305, tas:TASI_0924, sulr:B649_10235, dct:110095612, pfk:PFAS1_05570, sen:SACE_4292, cib:HF677_020615, ssif:AL483_11265, camh:LCW13_12755, fal:FRAAL5204, sins:PW252_03760, ntm:BTDUT50_04915, pgf:J0G10_24875, edu:LIU_09830, pmam:KSS90_20150, mou:OU421_05640, tci:A7K98_01400, ssed:H9L14_10645, paen:P40081_16685, actn:L083_5870, pbz:GN234_24975, asu:Asuc_1509, sclo:SCLO_1016020, cko:CKO_03551, svi:Svir_25590, nad:NCTC11293_03064, staa:LDH80_10285, dden:KI615_15415, kvr:CIB50_0001545, mfa:Mfla_1709, etb:N7L95_07230, fpr:FP2_23850, nlk:NIES25_37540, pdec:H1Q58_00300, ...]",[hsa:5009],[OTC]
23,1.13.11.33,"['R01185', 'R01186', 'R01187', 'R01313', 'R01315', 'R01317', 'R01593', 'R02053', 'R03107', 'R03626', 'R07064', 'R07343', 'R07379', 'R07387', 'R07859']","[mlf:102427209, oor:101281563, sbq:101031385, hai:109394595, cimi:108306442, ocu:100009114, mnp:132004425, lww:102750374, cdk:105106001, lav:104845446, etf:101654335, ssyn:129470266, cfr:116669221, ccad:122446330, pkl:118726749, myum:138994085, tup:102478167, csum:138084752, opi:101518498, mbez:129542456, pcad:102986457, tvp:118856898, pkl:118726751, chx:102179132, cang:105523680, ppam:129072206, oas:101116587, dsp:122100448, pcoq:105822846, ppyg:129018111, umr:103660453, puc:125926176, mlf:106694362, tge:112609437, anan:105714417, biu:109574003, nle:100598410, mree:136148954, lve:103073785, myum:138993692, phas:123809317, mcc:709212, pleu:114693249, pdic:114515339, tge:112610174, vvp:112924802, ecb:100072900, sara:101547759, bom:102285661, ptr:748294, ssyn:134731306, ncar:124980592, tup:102488255, ovr:110150179, myum:138994127, mdt:132218115, pkl:118726750, oga:100953458, mna:107528023, hsa:246, mcal:110305643, anu:117710957, aml:100479002, eai:106844159, mlf:102421821, panu:101014249, cang:105523522, rno:81639, cbai:105066404, bbub:102397311, etf:101649462, myum:138994149, mlx:117999977, mmur:105879207, mlf:102427827, mdo:100015976, gas:123247523, dsp:122108619, tod:119232067, psiu:116745707, efus:103299846, dord:105983343, pcw:110194258, dnm:101445760, mfot:126514012, iti:101973674, bta:282139, mun:110542750, lalb:132511482, npo:129497273, uar:123792173, npo:129497891, uah:113271093, ssc:396971, llv:125087857, ngi:103752179, morg:121456877, cfa:489458, efus:129147367, mpah:110332116, ...]",[hsa:246],[ALOX15]
24,3.1.1.4,"['R01185', 'R01186', 'R01187', 'R01313', 'R01315', 'R01317', 'R01593', 'R02053', 'R03107', 'R03626', 'R07064', 'R07343', 'R07379', 'R07387', 'R07859']","[plop:125354236, pxb:103959387, jre:109007453, pyu:121022670, cmao:118810793, pmec:138938729, cbri:134830991, agb:108910181, ccin:107263376, cbai:105070899, mcal:110293160, lsq:119614332, bfo:118428251, scan:103827085, cboy:140661598, char:105899085, dsv:119466612, cvs:136980294, ssoe:131446065, rfq:117036414, iel:124155131, pmax:117345037, hrt:120759654, ctul:119785360, cgib:127952397, asyl:127696622, otc:121339293, praf:128401962, bfre:138624670, sanh:107674667, oep:134015405, lak:106164109, lbd:127289117, ptep:107453256, dsp:122099052, mlx:118021504, stru:115204156, ajj:139975955, bman:114250385, api:100569970, acos:131264619, pcoo:112848836, umr:103666808, zvi:118092865, cdk:105088846, hsa:64600, gfs:119633680, cabi:116824403, dcc:119845018, tmol:138130351, cpla:122562836, ajm:119045395, csab:103239247, ovi:T265_05468, dne:112984133, ray:107521019, csec:111863777, bom:102276506, ipu:108263807, llv:125082321, tst:117888347, bgar:122931693, sluc:116048457, osn:115232337, ipu:108278205, alim:106529961, sjo:128369248, eai:106845081, mpah:110323483, bbub:123332249, cimi:108287739, gat:120821017, nasi:112402738, csum:138076385, opi:101523511, bgt:106072631, sbq:101034469, sasa:106593742, mrj:136846752, dse:6605838, dmo:Dmoj_GI14370, mbez:129561895, pbrv:138153968, lsr:110472283, cjo:107323345, bpec:110165394, ags:114121615, hvi:124358555, ptr:470183, pdam:113677597, ajc:117120535, pov:109638354, ppot:106107748, hcg:128327182, schu:122866099, ssyn:129472254, lco:104939254, vem:105560078, acat:141111737, anu:117709810, ...]","[hsa:64600, hsa:5320, hsa:26279, hsa:391013, hsa:8399, hsa:5322, hsa:30814, hsa:50487, hsa:84647, hsa:5319, hsa:81579]","[PLA2G2F, PLA2G2A, PLA2G2D, PLA2G2C, PLA2G10, PLA2G5, PLA2G2E, PLA2G3, PLA2G12B, PLA2G1B, PLA2G12A]"


<h2>🧬 Linking Enzymes to Human Pathways via KEGG</h2>

<h3>🎯 Objective</h3>
<p>
To annotate each CRC-relevant enzyme (EC number) with its associated human pathways using KEGG's REST API. 
This enables downstream integration with pathway-level omics analyses (e.g., enrichment, dysregulation).
</p>

---

<h3>🔍 What This Script Does</h3>
<ol>
    <li><strong>Takes</strong> each <code>EC number</code> from the <code>analysis_ready_df</code> dataset.</li>
    <li><strong>Queries</strong> the KEGG API to find linked pathways via <code>ec → pathway</code> mapping.</li>
    <li><strong>Filters</strong> and converts each pathway ID to <code>hsaXXXXX</code> (human-specific format).</li>
    <li><strong>Adds</strong> a new column <code>kegg_pathways</code> to the DataFrame for each enzyme.</li>
</ol>

---

<h3>📦 Output Structure</h3>

<p>The updated <code>analysis_ready_df</code> will now include:</p>

<pre><code>
kegg_enzyme | kegg_reactions       | gene_symbols         | kegg_pathways
------------|----------------------|----------------------|------------------------
1.1.1.1     | ['R00268']           | ['ADH1A', 'ADH1B']   | ['hsa00010', 'hsa00040']
2.7.1.1     | ['R00100', 'R00500'] | ['HK1', 'HK2']       | ['hsa00010']
</code></pre>

<p>This mapping makes it possible to cross-reference predicted enzyme activity with known cancer or metabolic pathways.</p>


In [2]:
import pandas as pd
import requests
import time
import re # For potentially more complex parsing if needed, though not strictly used in this version

# Assume 'analysis_ready_df' is your existing DataFrame with the 'kegg_enzyme' column.

TARGET_ORGANISM_CODE = "hsa"
KEGG_API_DELAY = 0.1 # Seconds to wait between API calls

def get_hsa_pathways_for_ec(ec_number_str):
    """
    Fetches KEGG pathways for a single EC number and converts them to HSA format.
    Returns a sorted list of unique hsaXXXXX pathway IDs.
    """
    if not isinstance(ec_number_str, str) or not ec_number_str:
        return []

    processed_pathways = set()
    time.sleep(KEGG_API_DELAY) # Be polite to the KEGG API

    try:
        # Query KEGG to link EC number to pathways
        url = f"http://rest.kegg.jp/link/pathway/ec:{ec_number_str}"
        response = requests.get(url, timeout=10)
        response.raise_for_status() # Raise an exception for HTTP errors

        if response.text:
            for line in response.text.strip().splitlines():
                parts = line.split("\t")
                # Expected format: "ec:1.1.1.1\tpath:map00010"
                if len(parts) == 2 and parts[1].startswith("path:"):
                    pid_full = parts[1].split(":", 1)[1] # e.g., "map00010" or "hsa00010"

                    if pid_full.startswith("map"):
                        # Convert mapXXXXX to hsaXXXXX
                        processed_pathways.add(TARGET_ORGANISM_CODE + pid_full[3:])
                    elif pid_full.startswith(TARGET_ORGANISM_CODE):
                        # Already a target organism pathway ID
                        processed_pathways.add(pid_full)
                    # Other pathway types (ko, other organisms, ec-pathways) are ignored
    except requests.exceptions.RequestException as e:
        print(f"    Warning: Could not fetch pathways for EC {ec_number_str}: {e}")
    
    return sorted(list(processed_pathways))

# --- Apply the function to your DataFrame ---
print(f"\nFetching KEGG pathways for each enzyme in 'analysis_ready_df' (targeting {TARGET_ORGANISM_CODE})...")

# Ensure 'analysis_ready_df' and 'kegg_enzyme' column exist
if 'analysis_ready_df' in locals() or 'analysis_ready_df' in globals():
    if 'kegg_enzyme' in analysis_ready_df.columns:
        from tqdm.auto import tqdm
        tqdm.pandas(desc="Fetching Pathways") # Initialize tqdm for pandas
        
        # Create the new 'kegg_pathways' column
        analysis_ready_df['kegg_pathways'] = analysis_ready_df['kegg_enzyme'].progress_apply(get_hsa_pathways_for_ec)
        
        print("\n✅ 'kegg_pathways' column added/updated in analysis_ready_df.")
        print("\n--- analysis_ready_df with kegg_pathways (head) ---")
        
        # Display relevant columns
        display_cols = ['kegg_enzyme', 'kegg_reactions', 'gene_symbols', 'kegg_pathways']
        existing_display_cols = [col for col in display_cols if col in analysis_ready_df.columns]

        with pd.option_context('display.max_colwidth', None, 
                               'display.max_rows', 10, # Show more rows if needed 
                               'display.max_columns', None, 
                               'display.width', 1000):
            print(analysis_ready_df[existing_display_cols].head().to_string())
            if len(analysis_ready_df) > 5:
                print("...")
                print(analysis_ready_df[existing_display_cols].tail().to_string())


    else:
        print("ERROR: 'kegg_enzyme' column not found in analysis_ready_df.")
else:
    print("ERROR: 'analysis_ready_df' DataFrame not found. Please ensure it is loaded.")


Fetching KEGG pathways for each enzyme in 'analysis_ready_df' (targeting hsa)...


Fetching Pathways:   0%|          | 0/43 [00:00<?, ?it/s]


✅ 'kegg_pathways' column added/updated in analysis_ready_df.

--- analysis_ready_df with kegg_pathways (head) ---
   kegg_enzyme                                                                                                                                                                                            kegg_reactions gene_symbols                                                                               kegg_pathways
1     1.1.1.30                                                                                                                          ['R00145', 'R00146', 'R00717', 'R01088', 'R01361', 'R01388', 'R01434', 'R02196']       [BDH1]                                                                        [hsa00650, hsa01100]
6      4.1.2.4                                                                                                                                                                            ['R02750', 'R02749', 'R01066']       [DERA]            

In [170]:
display(analysis_ready_df)

Unnamed: 0,kegg_enzyme,kegg_reactions,gene_list,human_gene_ids,gene_symbols,kegg_pathways
1,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R01361', 'R01388', 'R01434', 'R02196']","[snz:DC008_03210, babc:DO78_1636, vde:111254904, dtl:H8F01_10525, tup:102502144, sdub:R1T39_03790, spf:SpyM50444, rpb:RPB_1408, rfv:RFYW14_02492, mpuf:101670579, sala:ESZ53_00620, canu:128165098, egz:104127126, anu:117717787, rfr:Rfer_3285, boj:CBF45_05335, sphs:ETR14_06525, sauo:BV401_07595, mamm:ABNF92_01235, cvn:111100326, rre:MCC_00235, bdf:WI26_10680, vde:111243348, cpra:CPter91_3560, mrm:A7982_06988, hda:BB347_02440, amye:MJQ72_44515, baml:BAM5036_1757, psnc:C0058_19215, dsu:Dsui_2312, hdi:HDIA_4200, xom:XOO2083, psf:PSE_0086, ppun:PP4_24260, xct:J151_02204, pmaj:107208857, pdk:PADK2_15635, parh:I5S86_15085, pcri:104029833, nfe:HUT17_00330, pbi:103054298, mju:123872216, hew:HBDW_10940, kpp:A79E_2194, bel:BE61_89060, pht:BLM14_18365, mdo:100032119, xva:C7V42_11020, sbj:CF168_03465, paca:ID47_02340, tala:104362023, ppg:PputGB1_2793, bxh:BAXH7_01614, acur:JZ785_17385, bof:FQV39_25535, cglo:123273635, phas:123827211, sdeg:GOM96_07345, xgl:120789671, mecl:CLZ_23735, blup:AB3L03_13590, spca:GL174_07630, fmu:J7337_010154, sko:100369271, bpsa:BBU_3616, rpi:Rpic_1981, pdec:H1Q58_04430, tbr:Tb10.389.1850, caq:IM40_07690, bgp:BGL_2c00220, scam:104153220, jov:P8627_12790, xla:100127326, clec:106667552, bthe:BTN_4441, sdur:M4V62_24435, vem:105559323, phey:O4N75_07690, rcv:PFY06_10415, pln:Plano_0698, afil:140158910, acyn:ORV05_24505, msut:LC048_03300, hail:ASB7_03780, bant:A16_42500, spa:M6_Spy1393, hcra:139468715, nwh:119426707, maea:128202956, udv:129226384, snh:120038350, gmu:124857863, sasa:106588253, hqn:M0220_16170, rhib:I8E17_19855, llp:GH975_02270, ahad:FRZ06_07540, ftc:DA46_1491, pbl:PAAG_04142, otc:121341177, ...]",[hsa:622],[BDH1],"[hsa00650, hsa01100]"
6,4.1.2.4,"['R02750', 'R02749', 'R01066']","[kmi:VW41_22690, aste:118514563, mbk:K60_005050, seo:STM14_5487, eav:EH197_09590, prj:NCTC6933_00687, pdes:FE840_005695, ccoe:CETAM_01785, pra:PALO_00765, eei:NX720_09840, ppru:FDP22_20045, otu:111424818, rle:RL2744, rsi:Runsl_0373, lah:LA20533_05135, mlut:JET14_03415, lmr:LMR479A_2105, adc:FOC79_00550, nbr:O3I_039020, spac:B1H29_13900, ecoj:P423_24885, smon:AWR27_23450, btrh:F543_4180, ccae:111928508, bves:QO058_16695, sapi:SAPIS_v1c05450, sik:K710_0989, sdd:D9753_13300, vvl:VV93_v1c24000, nann:O0S08_39625, mcar:MCCPILRI181_00884, selo:AXE86_06275, mde:101894936, ncb:C0V82_15415, rro:104666401, sbn:Sbal195_3366, oki:109882147, mcep:124999789, acir:OG979_01200, con:TQ29_14550, pcx:LPB68_15445, mrm:A7982_04317, gej:A0V43_10385, arm:ART_4057, scl:sce5008, yfr:AW19_1747, mgz:GCW_03310, tpal:117643508, stii:DRB96_39430, agk:NYR60_04165, apr:Apre_1538, tsu:Tresu_1875, mci:Mesci_0700, pfq:QQ39_16870, pals:PAF20_06770, fok:KK2020170_10200, bwe:BcerKBAB4_1752, ptd:PTET_a0237, pkp:SK3146_01526, psyb:KD050_02310, aur:HMPREF9243_0259, cpsl:KBP54_08700, psew:JHW44_18160, laci:CW733_06980, fpe:Ferpe_0063, dot:EZY12_13725, pseg:D3H65_18210, bmet:BMMGA3_12000, ssun:H9Q77_07505, sbu:SpiBuddy_2882, sagm:BSA_20580, yet:CH48_170, ssu:SSU05_1085, omm:135536387, mcit:NCTC10181_00336, ock:EXM22_06170, mfu:LILAB_36070, mhas:MHAS_04846, mtd:UDA_0478, plul:FOB45_05630, cet:B8281_16170, tph:TPChic_0264, bma:BMAA0110, cgt:cgR_0457, kpq:KPR0928_13165, snt:SPT_1360, scap:AYP1020_1327, gln:F1C58_05850, spei:EHW89_04855, cter:A606_10215, erm:EYR00_03975, bose:RMR04_26395, pcub:JR316_0007723, sacg:FDZ84_07805, tmol:138131687, saui:AZ30_11280, trc:DYE49_03150, eol:Emtol_3254, btf:YBT020_09855, apa:APP7_1072, ...]",[hsa:51071],[DERA],"[hsa00030, hsa01100]"
7,2.7.1.15,"['R02750', 'R02749', 'R01066']","[sapl:U0355_01780, pmq:PM3016_2063, pvd:CFBP1590__2212, gbc:GbCGDNIH3_1810, edj:ECDH1ME8569_3640, fga:104079114, rhi:NGR_c00540, mcoc:116080927, abao:BEQ56_11105, orp:MOP44_00870, tco:Theco_3928, sfh:SFHH103_05765, elim:B2M23_12685, pcib:F9282_10935, eun:UMNK88_4564, spoc:NCTC10925_00519, aho:Ahos_2238, btv:BTHA_3642, fmi:F0R74_08400, lsq:119600727, scap:AYP1020_1689, cgb:cg2554, pvo:PVOR_23464, erh:ERH_1481, sacd:HS1genome_1373, esn:127007485, cbp:EB354_02475, cam:101510967, cpoi:OE229_04470, ppsc:EHS13_10180, paeu:BN889_02112, lna:RIN67_07825, hdi:HDIA_1783, burm:GJG85_30965, ecg:E2348C_0066, bli:BL02439, day:FV141_10315, aou:ACTOB_007691, spad:DVK44_01585, pcs:N7525_008093, lpz:Lp16_1768, pgh:FH974_17305, palo:E6C60_0586, abao:BEQ56_00880, atl:Athai_17130, cfu:CFU_4098, talb:FTW19_12800, llut:K1X41_15060, pfq:QQ39_16110, babo:DK55_3114, plon:Pla110_26750, bhk:B4U37_20020, emt:CPZ25_017840, acij:JS278_00018, tle:Tlet_1905, baa:BAA13334_II00221, lne:FZC33_27210, marr:BKP56_00170, ptl:AOT13_01815, ecd:ECDH10B_3940, smeo:124389642, kos:KORDIASMS9_01856, kno:V6K52_15640, ers:K210_05865, eto:RIN69_22125, rmk:RAN89_11750, pacn:TIA1EST1_05985, vir:X953_18360, kut:JJ691_78020, coq:D9V35_00510, sgp:SpiGrapes_1098, pben:V6W80_11925, strr:EKD16_02435, cbol:CGC65_02725, sauo:BV401_17455, gly:K3N28_09940, pcg:AXG94_05815, tln:139821747, rlt:Rleg2_0053, ggn:109304351, lkn:FGL74_07990, csu:CSUB_C0883, dpte:113788994, ypd:YPD4_3099, supe:P0H77_22950, acac:EYQ97_13190, ajc:117111686, yre:HEC60_18050, bpus:UP12_16890, bie:RFF05_02595, amz:B737_4140, senb:BN855_38850, rao:DSD31_12470, fgi:OP10G_4377, rat:M949_1669, kok:KONIH1_14745, ppog:QPK24_14700, seq:SZO_15150, ecul:PVA46_06505, sud:ST398NM01_0283, ...]",[hsa:64080],[RBKS],"[hsa00030, hsa01100]"
10,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R01709', 'R02125', 'R02457', 'R02657', 'R02894', 'R02923', 'R03871', 'R04085', 'R04221', 'R04904', 'R05861', 'R06124', 'R07400', 'R08349', 'R08384', 'R08408']","[elk:111142784, ccri:104168783, arow:112969295, oor:101285123, mthb:126932306, iti:101965594, cbai:105082310, ggn:109301889, pprm:120486809, clud:112656803, pcao:104053673, mthb:126932129, lroh:127153727, mcf:102134825, asn:102383614, pviv:125172692, vko:123024423, cgib:127942713, nvs:122903471, pfor:103149483, pmei:106906675, mmur:105856981, muo:115474518, asn:102377274, mrv:120375097, bbuf:121007452, dtc:141054219, mcc:701024, acs:100565481, hsa:316, anu:117705354, vpc:102539743, tgu:100217503, csum:138075916, rfq:117026362, mcoc:116088456, csum:138076312, etl:114055687, pdab:135726365, lav:100662726, mmu:213043, ccar:109090992, pcoc:116233971, amj:102558282, cimi:108315708, aml:100474367, xco:114140277, tge:112636356, tben:117499028, gste:104251026, hmh:116810476, afb:129111600, hcq:109520227, pmaj:107207425, ptg:102949513, ocu:100008601, ray:107503175, omc:131529646, eny:140364181, oro:101369776, cge:100760212, scam:104150187, vlg:121480828, mcoc:116088453, sfum:130021027, vvp:112922804, ajm:119065518, gas:123242136, csab:103217611, apla:101795737, pvp:105298171, cvf:104291038, hcg:128339747, myum:139014904, pcoc:116233962, apri:131200566, vlg:121480502, trn:134304379, mgp:100549268, cel:CELE_B0222.9, maua:101823275, cdk:105099168, nsu:110591947, cmac:104483254, mamb:125254213, oke:118363697, mcal:110291452, hle:104829954, pov:109639051, anan:135273577, xtr:100379890, afor:103903466, hgl:101710340, rro:104670351, afz:127557273, shab:115607863, gcl:127018248, maua:101823809, prob:127226326, npr:108799582, ...]",[hsa:316],[AOX1],"[hsa00280, hsa00350, hsa00380, hsa00750, hsa00760, hsa00830, hsa00982, hsa01100, hsa01120]"
11,1.4.3.3,"['R00366', 'R00372', 'R00635', 'R01340', 'R01709', 'R02125', 'R02457', 'R02657', 'R02894', 'R02923', 'R03871', 'R04085', 'R04221', 'R04904', 'R05861', 'R06124', 'R07400', 'R08349', 'R08384', 'R08408']","[achc:115346439, caua:113048893, mtm:MYCTH_60632, mlan:JY572_08485, pja:122249508, loc:102688406, ccrn:123292112, sanl:KZO11_32750, plop:125349017, aalb:115268049, ppyr:116172507, mia:OCU_26140, msim:MSIM_50640, nfi:NFIA_075980, muo:115479103, grl:LPB144_05280, dpx:DAPPUDRAFT_213708, der:6542841, hhar:128146308, pcan:112573874, cqu:CpipJ_CPIJ007273, tben:117496805, zvi:118095871, pfy:PFICI_08522, ngi:103746689, etl:114062674, fox:FOXG_10896, sla:SERLADRAFT_458189, spil:134089314, scae:IHE65_05825, arow:112963342, aswu:HUW51_09195, shs:STEHIDRAFT_127642, mbrn:26247253, asp:AOR13_1084, dpa:109541460, seng:OJ254_27515, nca:Noca_1780, tru:101077517, shaw:CEB94_35480, bcom:BAUCODRAFT_302218, pif:PITG_08125, aste:118517591, pecq:AD017_16400, amd:AMED_2680, ecra:117944970, sanh:107696048, ang:An06g02340, pat:Patl_3809, mmi:MMAR_2477, ccad:122422138, nfi:NFIA_073030, alz:AV940_11820, pou:POX_d06014, bod:106622550, oha:104331919, kdj:28968585, dros:Drose_07775, pvt:110083399, cdet:87948429, tml:GSTUM_00002749001, dros:Drose_00720, ksh:87958508, cot:CORT_0E06250, caur:CJI96_0004484, lav:100658336, shr:100929799, pbg:122470339, acoe:138119425, mly:CJ228_001315, ker:91099515, epa:110233699, pfp:PFL1_02372, glz:GLAREA_00037, som:SOMG_02970, cal:CAALFM_C500450CA, sauh:SU9_029305, schf:IPT68_31595, bfre:138627274, bfu:BCIN_12g04240, nsu:110592535, pdic:114510022, ctp:CTRG_04578, ker:91098832, dmat:Dmats_05555, clum:117736257, ker:91100917, tasa:A1Q1_03993, bbel:109466174, nrn:SHK17_06825, zro:ZYRO0B01892g, bom:102288350, dvc:Dvina_08015, mte:CCDC5079_1761, tsd:MTP03_39660, mul:MUL_1659, stsi:A4E84_34650, agen:126041210, xma:102218661, mcra:ID554_09910, ...]",[hsa:1610],[DAO],"[hsa00260, hsa00311, hsa00330, hsa00470, hsa01100, hsa01110]"
17,4.2.1.22,"['R00891', 'R00894', 'R01289', 'R01290', 'R02743', 'R03749', 'R04942', 'R10993']","[fsf:AABD74_12985, faz:M0M57_02450, bbre:B12L_0456, kfa:Q73A0000_03375, marm:YQ22_17170, mlx:118026969, arto:PJ267_15910, pago:IRJ34_15185, pseh:XF36_20675, cfa:611071, plal:FXN65_01435, ccau:EG346_13215, uli:ETAA1_47060, pecq:AD017_24075, kno:V6K52_15800, pfuc:122517607, gsh:117359439, ppel:H6H00_04060, vcd:124538385, sspn:LXH13_24300, ccic:134058111, lfx:LU699_11890, ccay:125643160, fcu:NOX80_00250, fcg:LNP23_09810, obb:114877262, sinn:ABB07_22495, opi:101516453, hfl:PUV54_10000, nfa:NFA_48640, aoq:129238563, pqu:IG609_018465, cmaq:H0S70_06125, ecli:ECNIH5_09110, abre:pbN1_27460, ldc:111506556, pbro:HOP40_03570, shun:DWB77_04595, kdi:Krodi_2944, xcb:XC_3636, edl:AAZ33_09855, dye:EO087_08555, mcx:BN42_20925, ifu:128601987, abe:ARB_07681, rhi:NGR_a00730, rfs:C1I64_11905, gms:SOIL9_23980, spin:KV203_14085, spun:BFF78_24940, nob:CW736_10230, fbm:MQE35_16675, mgi:Mflv_2043, ade:Adeh_3300, sscv:125987927, fbi:L0669_13240, foh:FORMA_16570, btax:128045376, pbf:CFX0092_A0187, scb:SCAB_5301, rhd:R2APBS1_0565, umr:103680763, sma:SAVERM_3510, lmb:C9I47_0687, cpu:CPFRC_03115, cop:CP31_03470, lpal:LDL79_03265, tatv:25775430, bbis:104984942, cly:Celly_2204, ppae:LDL65_26450, nmf:NMS_2520, spir:CWM47_15965, sasa:106562750, mlw:MJO58_21560, clud:112651104, pfll:139116215, agg:C1N71_03000, wch:wcw_1144, aten:116305218, miz:BAB75_06520, mle:ML2396, pmoa:120511828, ned:HUN01_13785, salp:111959713, fcb:LOS89_06985, lse:F1C12_21180, zvi:118085215, cgar:128506706, smau:118311295, pct:PC1_3903, mabl:MMASJCM_1221, cira:LFM56_13715, prho:PZB74_21770, nyu:D7D52_02275, tpre:106652234, ako:N9A08_12700, plai:106958598, kbu:Q4V64_20580, smac:SMDB11_0903, ...]","[hsa:102724560, hsa:875]","[CBS, CBS]","[hsa00260, hsa00270, hsa00670, hsa01100, hsa01110]"
18,4.3.2.9,"['R00891', 'R00894', 'R01289', 'R01290', 'R02743', 'R03749', 'R04942', 'R10993']","[hcra:139453113, nwh:119425359, ccae:111923889, aang:118212318, dro:112306962, amex:103033278, ptg:102972301, cbr:CBG_11175, lalb:132524661, mna:107544417, dpol:127845240, stru:115173880, emc:129337838, phas:123811697, obb:114876242, svg:106854490, atra:106140073, pxu:106124195, dpl:KGM_215638, pmei:106912476, tng:GSTEN00021982G001, pxu:106124283, ptru:123498548, sanh:107673341, lsax:138972038, nvg:124309443, cfr:102512131, shab:115613484, pgeo:117455238, mmur:105855110, zca:113911457, prap:111003918, abru:129963705, ray:107499554, aqu:100640069, amer:121598538, pret:103472371, csyr:103270036, nvi:100122156, omy:118938772, stru:115177047, ccrc:123694109, haw:110377784, scac:106094106, hze:124634747, one:115128635, slal:111664184, ppap:129799979, vps:122628419, hvi:124355419, achc:115339007, oga:100956580, pmur:107294022, gmw:113518297, gmw:113518182, tge:112621323, nasi:112406298, amj:102567732, nai:NECAME_13395, arot:135259530, plet:104616011, hhal:112211193, cvs:136990277, otu:111425266, epz:103552750, gmh:115545208, omy:110511884, gsh:117353761, asua:134225391, rmp:119181366, apln:108734187, ocla:139380163, ipu:100528263, bmor:101743698, aju:106968672, maea:128209168, xco:114152740, eja:138211856, ofu:114351739, cang:105520308, pov:109631991, breg:104631158, anu:117720627, hhip:117777335, dsr:110178952, asn:102385482, rmp:119181365, hrf:124147279, manu:129448901, plep:121965390, ipu:108256920, pame:138699454, eju:114221157, ola:101162615, bacu:102997196, eai:106831060, lroh:127181797, npt:124219955, vvp:112935405, lgf:123606649, ...]",[hsa:79017],[GGCT],"[hsa00480, hsa01100]"
20,2.1.3.3,"['R00256', 'R00485', 'R01398', 'R01579', 'R01990', 'R06134']","[led:BBK82_24610, pgis:I6I06_15380, pdf:CD630DERM_20300, beba:BWI17_08480, ncx:Nocox_24360, marc:AR505_1401, hix:NTHI723_01089, cma:Cmaq_0092, ele:Elen_1948, eclc:ECR091_02725, fpb:NLJ00_11370, kpb:FH42_19695, ssch:LH95_00415, fpla:A4U99_11020, oca:OCAR_4655, sax:USA300HOU_0067, casp:NQ535_02670, mpak:MIU77_08820, axy:AXYL_01973, tml:GSTUM_00003888001, nli:G3M70_07840, llx:NCDO2118_2143, vcv:GJV44_00392, hbc:AEM38_02735, prq:CYG50_18400, ppai:E1956_06255, aes:C2U30_12980, huw:FPZ11_00845, casp:NQ535_20085, hara:AArcS_3066, egd:GS424_015355, mesg:MLAUSG7_1366, clc:Calla_0960, cid:P73_3734, mhd:Marky_1757, agar:OAG1_36780, dch:SY84_04190, pba:PSEBR_a4400, ceec:P3F56_01740, bvit:JIP62_11355, set:SEN4216, slia:HA039_07325, ecq:ECED1_5107, bthy:AQ980_09205, amav:GCM10025877_30580, prop:QQ658_14495, sdiz:WCY31_03070, alr:DS731_18720, hya:HY04AAS1_1002, mtuu:HKBT2_1753, cld:CLSPO_c26650, mmb:Mmol_1369, bku:QIA22_04820, mok:Metok_1035, mcra:ID554_01870, emv:HQR01_06430, cpoo:109305927, nih:NitYY0810_C0444, serh:GF111_10540, abau:IX87_21695, bsan:CHH28_18580, brhi:104489781, rpus:CFBP5875_00430, aang:118214522, fwa:DCMF_08690, synp:Syn7502_01014, clu:CLUG_05305, tas:TASI_0924, sulr:B649_10235, dct:110095612, pfk:PFAS1_05570, sen:SACE_4292, cib:HF677_020615, ssif:AL483_11265, camh:LCW13_12755, fal:FRAAL5204, sins:PW252_03760, ntm:BTDUT50_04915, pgf:J0G10_24875, edu:LIU_09830, pmam:KSS90_20150, mou:OU421_05640, tci:A7K98_01400, ssed:H9L14_10645, paen:P40081_16685, actn:L083_5870, pbz:GN234_24975, asu:Asuc_1509, sclo:SCLO_1016020, cko:CKO_03551, svi:Svir_25590, nad:NCTC11293_03064, staa:LDH80_10285, dden:KI615_15415, kvr:CIB50_0001545, mfa:Mfla_1709, etb:N7L95_07230, fpr:FP2_23850, nlk:NIES25_37540, pdec:H1Q58_00300, ...]",[hsa:5009],[OTC],"[hsa00220, hsa01100, hsa01110]"
23,1.13.11.33,"['R01185', 'R01186', 'R01187', 'R01313', 'R01315', 'R01317', 'R01593', 'R02053', 'R03107', 'R03626', 'R07064', 'R07343', 'R07379', 'R07387', 'R07859']","[mlf:102427209, oor:101281563, sbq:101031385, hai:109394595, cimi:108306442, ocu:100009114, mnp:132004425, lww:102750374, cdk:105106001, lav:104845446, etf:101654335, ssyn:129470266, cfr:116669221, ccad:122446330, pkl:118726749, myum:138994085, tup:102478167, csum:138084752, opi:101518498, mbez:129542456, pcad:102986457, tvp:118856898, pkl:118726751, chx:102179132, cang:105523680, ppam:129072206, oas:101116587, dsp:122100448, pcoq:105822846, ppyg:129018111, umr:103660453, puc:125926176, mlf:106694362, tge:112609437, anan:105714417, biu:109574003, nle:100598410, mree:136148954, lve:103073785, myum:138993692, phas:123809317, mcc:709212, pleu:114693249, pdic:114515339, tge:112610174, vvp:112924802, ecb:100072900, sara:101547759, bom:102285661, ptr:748294, ssyn:134731306, ncar:124980592, tup:102488255, ovr:110150179, myum:138994127, mdt:132218115, pkl:118726750, oga:100953458, mna:107528023, hsa:246, mcal:110305643, anu:117710957, aml:100479002, eai:106844159, mlf:102421821, panu:101014249, cang:105523522, rno:81639, cbai:105066404, bbub:102397311, etf:101649462, myum:138994149, mlx:117999977, mmur:105879207, mlf:102427827, mdo:100015976, gas:123247523, dsp:122108619, tod:119232067, psiu:116745707, efus:103299846, dord:105983343, pcw:110194258, dnm:101445760, mfot:126514012, iti:101973674, bta:282139, mun:110542750, lalb:132511482, npo:129497273, uar:123792173, npo:129497891, uah:113271093, ssc:396971, llv:125087857, ngi:103752179, morg:121456877, cfa:489458, efus:129147367, mpah:110332116, ...]",[hsa:246],[ALOX15],"[hsa00590, hsa00591, hsa01100]"
24,3.1.1.4,"['R01185', 'R01186', 'R01187', 'R01313', 'R01315', 'R01317', 'R01593', 'R02053', 'R03107', 'R03626', 'R07064', 'R07343', 'R07379', 'R07387', 'R07859']","[plop:125354236, pxb:103959387, jre:109007453, pyu:121022670, cmao:118810793, pmec:138938729, cbri:134830991, agb:108910181, ccin:107263376, cbai:105070899, mcal:110293160, lsq:119614332, bfo:118428251, scan:103827085, cboy:140661598, char:105899085, dsv:119466612, cvs:136980294, ssoe:131446065, rfq:117036414, iel:124155131, pmax:117345037, hrt:120759654, ctul:119785360, cgib:127952397, asyl:127696622, otc:121339293, praf:128401962, bfre:138624670, sanh:107674667, oep:134015405, lak:106164109, lbd:127289117, ptep:107453256, dsp:122099052, mlx:118021504, stru:115204156, ajj:139975955, bman:114250385, api:100569970, acos:131264619, pcoo:112848836, umr:103666808, zvi:118092865, cdk:105088846, hsa:64600, gfs:119633680, cabi:116824403, dcc:119845018, tmol:138130351, cpla:122562836, ajm:119045395, csab:103239247, ovi:T265_05468, dne:112984133, ray:107521019, csec:111863777, bom:102276506, ipu:108263807, llv:125082321, tst:117888347, bgar:122931693, sluc:116048457, osn:115232337, ipu:108278205, alim:106529961, sjo:128369248, eai:106845081, mpah:110323483, bbub:123332249, cimi:108287739, gat:120821017, nasi:112402738, csum:138076385, opi:101523511, bgt:106072631, sbq:101034469, sasa:106593742, mrj:136846752, dse:6605838, dmo:Dmoj_GI14370, mbez:129561895, pbrv:138153968, lsr:110472283, cjo:107323345, bpec:110165394, ags:114121615, hvi:124358555, ptr:470183, pdam:113677597, ajc:117120535, pov:109638354, ppot:106107748, hcg:128327182, schu:122866099, ssyn:129472254, lco:104939254, vem:105560078, acat:141111737, anu:117709810, ...]","[hsa:64600, hsa:5320, hsa:26279, hsa:391013, hsa:8399, hsa:5322, hsa:30814, hsa:50487, hsa:84647, hsa:5319, hsa:81579]","[PLA2G2F, PLA2G2A, PLA2G2D, PLA2G2C, PLA2G10, PLA2G5, PLA2G2E, PLA2G3, PLA2G12B, PLA2G1B, PLA2G12A]","[hsa00564, hsa00565, hsa00590, hsa00591, hsa00592, hsa01100, hsa01110]"


In [3]:
import ast

# 1) Parse the 'kegg_pathways' column from strings to lists
analysis_ready_df['kegg_pathways'] = (
    analysis_ready_df['kegg_pathways']
    .apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
)

# 2) Identify rows where kegg_pathways is an empty list
mask_empty = analysis_ready_df['kegg_pathways'].apply(
    lambda x: isinstance(x, list) and len(x) == 0
)

# 3) Print out the corresponding kegg_enzymes for those empty-pathway rows
dropped_enzymes = analysis_ready_df.loc[mask_empty, 'kegg_enzyme'].tolist()
print(f"Dropping {len(dropped_enzymes)} rows with no pathways. Corresponding kegg_enzymes:")
for enz in dropped_enzymes:
    print(" -", enz)

# 4) Drop those rows from the DataFrame
analysis_ready_df = analysis_ready_df.loc[~mask_empty].reset_index(drop=True)

# 5) (Optional) report how many remain
print(f"{len(dropped_enzymes)} rows removed; {len(analysis_ready_df)} rows remain.")


Dropping 3 rows with no pathways. Corresponding kegg_enzymes:
 - 2.5.1.108
 - 2.5.1.114
 - 2.5.1.25
3 rows removed; 40 rows remain.


In [5]:
analysis_ready_df.to_csv("analysis_ready_df.csv", index=False)

In [6]:
import pandas as pd
analysis_ready_df = pd.read_csv('analysis_ready_df.csv')
analysis_ready_df.head(24)

Unnamed: 0,kegg_enzyme,kegg_reactions,gene_list,human_gene_ids,gene_symbols,kegg_pathways
0,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...","['sufl:FIL70_21435', 'aus:IPK37_09295', 'ofa:N...",['hsa:622'],['BDH1'],"['hsa00650', 'hsa01100']"
1,4.1.2.4,"['R02750', 'R02749', 'R01066']","['svb:NCTC12167_00820', 'sfp:QUY26_14155', 'ah...",['hsa:51071'],['DERA'],"['hsa00030', 'hsa01100']"
2,2.7.1.15,"['R02750', 'R02749', 'R01066']","['shf:CEQ32_12405', 'baml:BAM5036_3234', 'rsu:...",['hsa:64080'],['RBKS'],"['hsa00030', 'hsa01100']"
3,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017...","['plop:125349936', 'puc:125911786', 'bbis:1050...",['hsa:316'],['AOX1'],"['hsa00280', 'hsa00350', 'hsa00380', 'hsa00750..."
4,1.4.3.3,"['R00366', 'R00372', 'R00635', 'R01340', 'R017...","['sbro:GQF42_37510', 'amyd:K1T34_31600', 'mamb...",['hsa:1610'],['DAO'],"['hsa00260', 'hsa00311', 'hsa00330', 'hsa00470..."
5,4.2.1.22,"['R00891', 'R00894', 'R01289', 'R01290', 'R027...","['aue:C5O00_01525', 'nsd:BST91_05810', 'mgen:1...","['hsa:102724560', 'hsa:875']","['CBS', 'CBS']","['hsa00260', 'hsa00270', 'hsa00670', 'hsa01100..."
6,4.3.2.9,"['R00891', 'R00894', 'R01289', 'R01290', 'R027...","['pcoq:105807354', 'tda:119673143', 'cpea:1043...",['hsa:79017'],['GGCT'],"['hsa00480', 'hsa01100']"
7,2.1.3.3,"['R00256', 'R00485', 'R01398', 'R01579', 'R019...","['bham:ABN702_08535', 'sttn:BSL84_25755', 'sam...",['hsa:5009'],['OTC'],"['hsa00220', 'hsa01100', 'hsa01110']"
8,1.13.11.33,"['R01185', 'R01186', 'R01187', 'R01313', 'R013...","['tvp:118856898', 'umr:103660453', 'zca:113908...",['hsa:246'],['ALOX15'],"['hsa00590', 'hsa00591', 'hsa01100']"
9,3.1.1.4,"['R01185', 'R01186', 'R01187', 'R01313', 'R013...","['psiu:116753850', 'pleu:114697183', 'ipu:1082...","['hsa:50487', 'hsa:30814', 'hsa:64600', 'hsa:5...","['PLA2G3', 'PLA2G2E', 'PLA2G2F', 'PLA2G5', 'PL...","['hsa00564', 'hsa00565', 'hsa00590', 'hsa00591..."


In [9]:
import pandas as pd
import ast # For safely evaluating string representations of lists

# Assume 'ReacEnzyPath' and 'analysis_ready_df' are already loaded pandas DataFrames.

print("\n--- Finding compounds in ReacEnzyPath sharing (or not sharing) enzymes with analysis_ready_df ---")

# Ensure DataFrames are loaded
if not ('ReacEnzyPath' in locals() or 'ReacEnzyPath' in globals()) or \
   not ('analysis_ready_df' in locals() or 'analysis_ready_df' in globals()):
    print("ERROR: 'ReacEnzyPath' or 'analysis_ready_df' DataFrame not found. Please ensure they are loaded.")
else:
    if 'kegg_enzyme' not in analysis_ready_df.columns:
        print("ERROR: 'kegg_enzyme' column not found in analysis_ready_df.")
    elif 'kegg_enzymes' not in ReacEnzyPath.columns:
        print("ERROR: 'kegg_enzymes' column not found in ReacEnzyPath.")
    elif 'input_compound_name' not in ReacEnzyPath.columns:
        print("ERROR: 'input_compound_name' column not found in ReacEnzyPath.")
    else:
        # 1. Get unique enzymes from analysis_ready_df
        enzymes_in_analysis_df = set(analysis_ready_df['kegg_enzyme'].dropna().unique())
        print(f"Unique enzymes in analysis_ready_df: {enzymes_in_analysis_df if enzymes_in_analysis_df else 'None'}")

        # Get all unique compound names from ReacEnzyPath
        all_reac_enzypath_compound_names = set(ReacEnzyPath['input_compound_name'].dropna().unique())
        
        matching_compound_names = set()

        # 2. Iterate through ReacEnzyPath
        for index, row in ReacEnzyPath.iterrows():
            compound_name = row['input_compound_name']
            enzymes_cell = row['kegg_enzymes']
            
            current_compound_enzymes = set()
            if isinstance(enzymes_cell, str):
                try:
                    evaluated_list = ast.literal_eval(enzymes_cell)
                    if isinstance(evaluated_list, list):
                        current_compound_enzymes.update(item for item in evaluated_list if pd.notna(item))
                except (ValueError, SyntaxError):
                    pass 
            elif isinstance(enzymes_cell, list):
                current_compound_enzymes.update(item for item in enzymes_cell if pd.notna(item))
            
            if current_compound_enzymes and not enzymes_in_analysis_df.isdisjoint(current_compound_enzymes):
                matching_compound_names.add(compound_name)

        # 3. Determine non-matching compounds
        non_matching_compound_names = all_reac_enzypath_compound_names - matching_compound_names

        # 4. Print results
        if matching_compound_names:
            print(f"\nCompounds in 'ReacEnzyPath' that SHARE enzymes with 'analysis_ready_df':")
            for name in sorted(list(matching_compound_names)):
                print(f"- {name}")
        else:
            print("\nNo compounds in 'ReacEnzyPath' were found to share common enzymes with 'analysis_ready_df'.")

        if non_matching_compound_names:
            print(f"\nCompounds in 'ReacEnzyPath' that DO NOT SHARE enzymes with 'analysis_ready_df':")
            for name in sorted(list(non_matching_compound_names)):
                print(f"- {name}")
        else:
            # This case means all compounds in ReacEnzyPath had at least one shared enzyme,
            # or ReacEnzyPath was empty to begin with.
            if all_reac_enzypath_compound_names: # Check if there were any compounds to begin with
                 print("\nAll compounds in 'ReacEnzyPath' had at least one enzyme shared with 'analysis_ready_df'.")
            # If all_reac_enzypath_compound_names is empty, the "No compounds found to share" message already covers it.


--- Finding compounds in ReacEnzyPath sharing (or not sharing) enzymes with analysis_ready_df ---
Unique enzymes in analysis_ready_df: {'1.13.11.33', '2.7.1.30', '2.3.1.23', '2.4.1.17', '3.1.3.5', '3.1.3.25', '2.7.1.74', '1.1.1.62', '1.5.1.8', '1.1.1.30', '4.1.2.4', '3.1.1.4', '3.1.3.91', '4.1.1.28', '4.2.1.22', '3.1.2.21', '2.7.1.15', '3.6.1.64', '2.1.3.3', '2.5.1.22', '2.4.2.28', '1.2.3.1', '3.6.1.66', '3.1.3.1', '2.3.1.43', '1.4.3.3', '4.3.2.9', '2.5.1.16', '1.14.14.14', '1.1.1.51', '3.5.1.12'}

Compounds in 'ReacEnzyPath' that SHARE enzymes with 'analysis_ready_df':
- 17β-Estradiol
- 2'-Deoxyinosine-5'-monophosphate
- 5'-Deoxy-5'-(Methylthio) Adenosine
- Asp-Arg
- Carnitine C7:DC
- Cyclo(Phe-Glu)
- Cytarabine
- D-Erythronolactone
- Deoxyribose 5-phosphate
- LPC(13:0/0:0)
- LPC(16:1/0:0)
- LPC(18:3/0:0)
- LPE(17:1/0:0)
- LPI(16:2/0:0)
- Methylcysteine
- N(Alpha)-Acetyl-Epsilon-(2-Propenal)Lysine
- Quinoline-4-carboxylic acid
- Thiamine Monophosphate

Compounds in 'ReacEnzyPath' tha

<h2>🔍 Enzyme-Based Compound Filtering: ReacEnzyPath vs. CRC-Enriched Enzymes</h2>

<h3>🧪 What This Script Did</h3>
<p>
This script compares two datasets:
</p>
<ul>
  <li><code>ReacEnzyPath</code>: Full annotation of CRC-relevant metabolites with enzyme and pathway links.</li>
  <li><code>analysis_ready_df</code>: Curated list of metabolites with enzymes and genes potentially relevant to colorectal cancer.</li>
</ul>
<p>
The goal is to identify which compounds in <code>ReacEnzyPath</code> are catalyzed by enzymes also found in the CRC-enriched <code>analysis_ready_df</code>.
</p>

<h3>🔬 Key Steps</h3>
<ol>
  <li>Extract unique EC (enzyme) numbers from <code>analysis_ready_df</code>.</li>
  <li>Parse the enzyme list for each metabolite in <code>ReacEnzyPath</code>.</li>
  <li>Check for overlap with ECs in the CRC-enriched list.</li>
  <li>Label metabolites as either <strong>Matching</strong> or <strong>Non-Matching</strong>.</li>
</ol>

<hr>

<h3>📊 Results Overview</h3>

<h4>✅ Compounds Sharing Enzymes with CRC-Enriched Set (18 total)</h4>
<p>
These metabolites are catalyzed by enzymes also linked to CRC activity. They may serve as:
</p>
<ul>
  <li>Biomarkers</li>
  <li>Drug targets</li>
  <li>Readouts of pathway dysregulation</li>
</ul>

<p>Examples include:</p>
<ul>
  <li><strong>Cytarabine</strong>, <strong>Deoxyribose 5-phosphate</strong>: nucleoside metabolism</li>
  <li><strong>17β-Estradiol</strong>, <strong>Methylcysteine</strong>: hormone/redox signaling</li>
  <li><strong>Carnitine C7:DC</strong>, <strong>LPC(13:0/0:0)</strong>: lipid metabolism</li>
</ul>

<h4>❌ Compounds Without Shared Enzymes (6 total)</h4>
<p>
These metabolites are not catalyzed by any enzyme present in the CRC-enriched list:
</p>
<table border="1" cellspacing="0" cellpadding="6">
  <tr>
    <th>Compound</th>
    <th>Potential Reason</th>
  </tr>
  <tr>
    <td>1,6-anhydro-β-D-glucose</td>
    <td>Rare sugar; limited enzyme annotation</td>
  </tr>
  <tr>
    <td>17a-Estradiol</td>
    <td>Epimer of estradiol; may lack known enzymatic links</td>
  </tr>
  <tr>
    <td>cyclo(glu-glu)</td>
    <td>Cyclic dipeptide; lacks curated KEGG enzyme links</td>
  </tr>
  <tr>
    <td>2-Aminobenzenesulfonic acid</td>
    <td>Likely xenobiotic; not classically metabolized</td>
  </tr>
  <tr>
    <td>P-sulfanilic acid</td>
    <td>Limited enzymatic annotation; possibly non-metabolic</td>
  </tr>
  <tr>
    <td>Quinoline-2-carboxylic acid</td>
    <td>Aromatic heterocycle; obscure or partial annotation</td>
  </tr>
</table>

<hr>

<h3>🧠 Why This Matters</h3>
<ul>
  <li>This filtering step focuses attention on <strong>metabolites with direct biochemical roles</strong> in CRC via enzyme catalysis.</li>
  <li>It helps distinguish between <strong>structural matches</strong> vs. <strong>functional relevance</strong>.</li>
  <li>It flags <strong>gaps in annotation</strong> or cases of potential novel biology.</li>
</ul>

<hr>



In [14]:
import pandas as pd
analysis_ready_df = pd.read_excel('analysis_ready_df.xlsx')
analysis_ready_df.head(24)

Unnamed: 0,kegg_enzyme,kegg_reactions,gene_list,human_gene_ids,gene_symbols,kegg_pathways
0,3.1.1.13,"['R00048', 'R00053', 'R00054', 'R00327', 'R006...","['lmix:132956016', 'atra:106135161', 'otc:1213...",['hsa:3988'],['LIPA'],['hsa00100']
1,3.1.1.17,"['R00048', 'R00053', 'R00054', 'R00327', 'R006...","['xva:C7V42_21680', 'bvz:BRAD3257_5897', 'mens...",['hsa:9104'],['RGN'],"['hsa00030', 'hsa00053', 'hsa00930', 'hsa01100..."
2,3.1.1.2,"['R00048', 'R00053', 'R00054', 'R00327', 'R006...","['nni:104012814', 'mgp:100303695', 'tng:GSTEN0...","['hsa:5446', 'hsa:5445', 'hsa:5444']","['PON3', 'PON2', 'PON1']","['hsa00363', 'hsa01100', 'hsa01120']"
3,3.1.1.4,"['R00048', 'R00053', 'R00054', 'R00327', 'R006...","['tsr:106551534', 'cvg:107086825', 'cgib:12797...","['hsa:8399', 'hsa:5320', 'hsa:84647', 'hsa:308...","['PLA2G10', 'PLA2G2A', 'PLA2G12B', 'PLA2G2E', ...","['hsa00564', 'hsa00565', 'hsa00590', 'hsa00591..."
4,3.1.1.47,"['R00048', 'R00053', 'R00054', 'R00327', 'R006...","['fkr:NCS57_01155700', 'dvt:126903004', 'alat:...","['hsa:5051', 'hsa:7941']","['PAFAH2', 'PLA2G7']","['hsa00565', 'hsa01100']"
5,3.1.1.64,"['R00048', 'R00053', 'R00054', 'R00327', 'R006...","['cgob:115022075', 'srx:107752442', 'tup:10249...",['hsa:6121'],['RPE65'],"['hsa00830', 'hsa01100']"
6,1.1.1.14,"['R00875', 'R01431', 'R01896', 'R07145', 'R094...","['ccav:112522796', 'aba:Acid345_1743', 'mavm:M...",['hsa:6652'],['SORD'],"['hsa00040', 'hsa00051', 'hsa01100']"
7,2.7.1.15,"['R01066', 'R02749', 'R02750']","['plf:PANA5342_0088', 'thug:KNN16_07020', 'atx...",['hsa:64080'],['RBKS'],"['hsa00030', 'hsa01100']"
8,4.1.2.4,"['R01066', 'R02749', 'R02750']","['eto:RIN69_03330', 'sam:MW2061', 'cgrn:441266...",['hsa:51071'],['DERA'],"['hsa00030', 'hsa01100']"
9,1.2.3.1,"['R00229', 'R00235', 'R00236', 'R00316', 'R006...","['tup:102490052', 'hgl:101722324', 'rno:493909...",['hsa:316'],['AOX1'],"['hsa00280', 'hsa00350', 'hsa00380', 'hsa00750..."


In [12]:
import pandas as pd
import ast  # For safely evaluating string representations of lists

# --- Assume 'ReacEnzyPath' and 'analysis_ready_df' are already loaded pandas DataFrames. ---

# Example sanity check (remove these lines if your data is already in memory):
# ReacEnzyPath = pd.read_csv("ReacEnzyPath.csv")
# analysis_ready_df = pd.read_csv("analysis_ready_df.csv")

# 0) Verify that required columns exist
missing_cols = []
for col in ['kegg_enzymes', 'input_compound_name']:
    if col not in ReacEnzyPath.columns:
        missing_cols.append(col)
if 'kegg_enzyme' not in analysis_ready_df.columns:
    missing_cols.append('kegg_enzyme')
if missing_cols:
    raise KeyError(f"Missing required column(s) in DataFrames: {missing_cols}")

print("\n--- Building enzyme-to-compound mapping (no 'Type' column required) ---")

# 1) Build an Enzyme → [CompoundName, CompoundName, ...] mapping
enzyme_to_compounds_map = {}

for _, row in ReacEnzyPath.iterrows():
    compound_name = row['input_compound_name']
    enzymes_cell = row['kegg_enzymes']
    
    # Safely parse the cell into a Python list if it's a string
    enzyme_list = []
    if isinstance(enzymes_cell, str):
        try:
            evaluated = ast.literal_eval(enzymes_cell)
            if isinstance(evaluated, list):
                enzyme_list = evaluated
        except (ValueError, SyntaxError):
            pass
    elif isinstance(enzymes_cell, list):
        enzyme_list = enzymes_cell
    
    # For each enzyme ID, add the compound_name
    for enzyme_id in enzyme_list:
        if pd.notna(enzyme_id) and isinstance(enzyme_id, str) and enzyme_id.strip():
            key = enzyme_id.strip()
            if key not in enzyme_to_compounds_map:
                enzyme_to_compounds_map[key] = set()
            enzyme_to_compounds_map[key].add(compound_name)

# Convert each set of names into a sorted list
for enz in enzyme_to_compounds_map:
    enzyme_to_compounds_map[enz] = sorted(enzyme_to_compounds_map[enz])

print(f"  Completed map for {len(enzyme_to_compounds_map)} unique enzyme IDs.")

# 2) Define a helper to look up associated compound names for a given enzyme
def get_associated_compound_names(enzyme_id):
    """
    Returns a list of compound names associated with the given KEGG enzyme ID.
    If the enzyme ID is not found or is null/empty, returns an empty list.
    """
    if pd.isna(enzyme_id):
        return []
    enzyme_key = str(enzyme_id).strip()
    return enzyme_to_compounds_map.get(enzyme_key, [])

print("\n--- Populating 'associated_compound_names' in analysis_ready_df ---")

# 3) Apply the helper to each row in analysis_ready_df['kegg_enzyme']
analysis_ready_df['associated_compound_names'] = analysis_ready_df['kegg_enzyme'].apply(
    get_associated_compound_names
)

print("\n✅ Added 'associated_compound_names' column.")

# 4) Display the updated DataFrame (showing relevant columns)
cols_to_show = ['kegg_enzyme', 'kegg_reactions', 'gene_symbols', 'associated_compound_names']
existing_cols = [c for c in cols_to_show if c in analysis_ready_df.columns]

with pd.option_context('display.max_colwidth', None, 
                       'display.max_rows', 10,
                       'display.max_columns', None,
                       'display.width', 1000):
    print("\n--- Updated analysis_ready_df (first 5 rows) ---")
    print(analysis_ready_df[existing_cols].head().to_string())
    if len(analysis_ready_df) > 5:
        print("...\n")
        print(analysis_ready_df[existing_cols].tail().to_string())



--- Building enzyme-to-compound mapping (no 'Type' column required) ---
  Completed map for 85 unique enzyme IDs.

--- Populating 'associated_compound_names' in analysis_ready_df ---

✅ Added 'associated_compound_names' column.

--- Updated analysis_ready_df (first 5 rows) ---
  kegg_enzyme                                                                                                                                                                                            kegg_reactions gene_symbols      associated_compound_names
0    1.1.1.30                                                                                                                          ['R00145', 'R00146', 'R00717', 'R01088', 'R01361', 'R01388', 'R01434', 'R02196']     ['BDH1']           [D-Erythronolactone]
1     4.1.2.4                                                                                                                                                                            ['R02750', 'R027

In [13]:
display(analysis_ready_df)

Unnamed: 0,kegg_enzyme,kegg_reactions,gene_list,human_gene_ids,gene_symbols,kegg_pathways,associated_compound_names
0,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...","['sufl:FIL70_21435', 'aus:IPK37_09295', 'ofa:N...",['hsa:622'],['BDH1'],"['hsa00650', 'hsa01100']",[D-Erythronolactone]
1,4.1.2.4,"['R02750', 'R02749', 'R01066']","['svb:NCTC12167_00820', 'sfp:QUY26_14155', 'ah...",['hsa:51071'],['DERA'],"['hsa00030', 'hsa01100']",[Deoxyribose 5-phosphate]
2,2.7.1.15,"['R02750', 'R02749', 'R01066']","['shf:CEQ32_12405', 'baml:BAM5036_3234', 'rsu:...",['hsa:64080'],['RBKS'],"['hsa00030', 'hsa01100']",[Deoxyribose 5-phosphate]
3,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017...","['plop:125349936', 'puc:125911786', 'bbis:1050...",['hsa:316'],['AOX1'],"['hsa00280', 'hsa00350', 'hsa00380', 'hsa00750...",[Quinoline-4-carboxylic acid]
4,1.4.3.3,"['R00366', 'R00372', 'R00635', 'R01340', 'R017...","['sbro:GQF42_37510', 'amyd:K1T34_31600', 'mamb...",['hsa:1610'],['DAO'],"['hsa00260', 'hsa00311', 'hsa00330', 'hsa00470...",[Quinoline-4-carboxylic acid]
5,4.2.1.22,"['R00891', 'R00894', 'R01289', 'R01290', 'R027...","['aue:C5O00_01525', 'nsd:BST91_05810', 'mgen:1...","['hsa:102724560', 'hsa:875']","['CBS', 'CBS']","['hsa00260', 'hsa00270', 'hsa00670', 'hsa01100...",[Methylcysteine]
6,4.3.2.9,"['R00891', 'R00894', 'R01289', 'R01290', 'R027...","['pcoq:105807354', 'tda:119673143', 'cpea:1043...",['hsa:79017'],['GGCT'],"['hsa00480', 'hsa01100']",[Methylcysteine]
7,2.1.3.3,"['R00256', 'R00485', 'R01398', 'R01579', 'R019...","['bham:ABN702_08535', 'sttn:BSL84_25755', 'sam...",['hsa:5009'],['OTC'],"['hsa00220', 'hsa01100', 'hsa01110']",[Asp-Arg]
8,1.13.11.33,"['R01185', 'R01186', 'R01187', 'R01313', 'R013...","['tvp:118856898', 'umr:103660453', 'zca:113908...",['hsa:246'],['ALOX15'],"['hsa00590', 'hsa00591', 'hsa01100']",[LPI(16:2/0:0)]
9,3.1.1.4,"['R01185', 'R01186', 'R01187', 'R01313', 'R013...","['psiu:116753850', 'pleu:114697183', 'ipu:1082...","['hsa:50487', 'hsa:30814', 'hsa:64600', 'hsa:5...","['PLA2G3', 'PLA2G2E', 'PLA2G2F', 'PLA2G5', 'PL...","['hsa00564', 'hsa00565', 'hsa00590', 'hsa00591...","[LPC(16:1/0:0), LPC(18:3/0:0), LPI(16:2/0:0)]"


In [14]:

analysis_ready_df.to_excel('analysis_ready_df.xlsx', index=False)


<div style="
    border-left: 4px solid #2e7d32;
    background: #EEE3B7;
    padding: 20px;
    margin: 25px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    line-height: 1.5;
">
  
  <h5 style="margin: 16px 0 8px 0; color: #2e7d32;">6.2 Transcriptomic Inference &amp; Mechanistic Integration</h5>
  <ul style="margin: 6px 0 0 20px; padding: 0; list-style-type: disc;">
    <li><strong>Import RNA-Seq Data &amp; Clinical Metadata:</strong> Load normalized CRC tumor and normal expression matrices (e.g., TCGA) alongside sample annotations (tumor type, barcode).</li>
    <li><strong>PROGENy Pathway Activity Scoring:</strong> Compute single-sample GSEA scores for PROGENy’s 14 signaling pathways (e.g., TGFB, WNT, PI3K) using gseapy, generating tumor-normal contrasts.</li>
    <li><strong>TF Activity Inference via VIPER:</strong> Employ VIPER (via decoupler) with DoRothEA and TRRUST prior networks to infer transcription factor activity per sample, yielding TF enrichment scores.</li>
    <li><strong>Final Table Population:</strong> For each metabolite–gene pair, annotate with its regulating TF and TF activity score; link the gene’s KEGG pathway to a PROGENy pathway and its activity; compile into a unified mechanistic table.</li>
  </ul>
</div>

In [255]:
import pandas as pd
analysis_ready_df = pd.read_excel('analysis_ready_df.xlsx')
analysis_ready_df.head(24)

Unnamed: 0,kegg_enzyme,kegg_reactions,gene_list,human_gene_ids,gene_symbols,kegg_pathways,associated_compound_names
0,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...","['sufl:FIL70_21435', 'aus:IPK37_09295', 'ofa:N...",['hsa:622'],['BDH1'],"['hsa00650', 'hsa01100']",['D-Erythronolactone']
1,4.1.2.4,"['R02750', 'R02749', 'R01066']","['svb:NCTC12167_00820', 'sfp:QUY26_14155', 'ah...",['hsa:51071'],['DERA'],"['hsa00030', 'hsa01100']",['Deoxyribose 5-phosphate']
2,2.7.1.15,"['R02750', 'R02749', 'R01066']","['shf:CEQ32_12405', 'baml:BAM5036_3234', 'rsu:...",['hsa:64080'],['RBKS'],"['hsa00030', 'hsa01100']",['Deoxyribose 5-phosphate']
3,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017...","['plop:125349936', 'puc:125911786', 'bbis:1050...",['hsa:316'],['AOX1'],"['hsa00280', 'hsa00350', 'hsa00380', 'hsa00750...",['Quinoline-4-carboxylic acid']
4,1.4.3.3,"['R00366', 'R00372', 'R00635', 'R01340', 'R017...","['sbro:GQF42_37510', 'amyd:K1T34_31600', 'mamb...",['hsa:1610'],['DAO'],"['hsa00260', 'hsa00311', 'hsa00330', 'hsa00470...",['Quinoline-4-carboxylic acid']
5,4.2.1.22,"['R00891', 'R00894', 'R01289', 'R01290', 'R027...","['aue:C5O00_01525', 'nsd:BST91_05810', 'mgen:1...","['hsa:102724560', 'hsa:875']","['CBS', 'CBS']","['hsa00260', 'hsa00270', 'hsa00670', 'hsa01100...",['Methylcysteine']
6,4.3.2.9,"['R00891', 'R00894', 'R01289', 'R01290', 'R027...","['pcoq:105807354', 'tda:119673143', 'cpea:1043...",['hsa:79017'],['GGCT'],"['hsa00480', 'hsa01100']",['Methylcysteine']
7,2.1.3.3,"['R00256', 'R00485', 'R01398', 'R01579', 'R019...","['bham:ABN702_08535', 'sttn:BSL84_25755', 'sam...",['hsa:5009'],['OTC'],"['hsa00220', 'hsa01100', 'hsa01110']",['Asp-Arg']
8,1.13.11.33,"['R01185', 'R01186', 'R01187', 'R01313', 'R013...","['tvp:118856898', 'umr:103660453', 'zca:113908...",['hsa:246'],['ALOX15'],"['hsa00590', 'hsa00591', 'hsa01100']",['LPI(16:2/0:0)']
9,3.1.1.4,"['R01185', 'R01186', 'R01187', 'R01313', 'R013...","['psiu:116753850', 'pleu:114697183', 'ipu:1082...","['hsa:50487', 'hsa:30814', 'hsa:64600', 'hsa:5...","['PLA2G3', 'PLA2G2E', 'PLA2G2F', 'PLA2G5', 'PL...","['hsa00564', 'hsa00565', 'hsa00590', 'hsa00591...","['LPC(16:1/0:0)', 'LPC(18:3/0:0)', 'LPI(16:2/0..."


In [36]:
import requests
import json

# common filters
base_filters = [
    {"op":"in","content":{"field":"cases.project.project_id","value":["TCGA-COAD"]}},
    {"op":"in","content":{"field":"files.data_category","value":["Transcriptome Profiling"]}},
    {"op":"in","content":{"field":"files.data_type","value":["Gene Expression Quantification"]}}
]

def count_for_workflow(workflow):
    payload = {
        "filters":{
            "op":"and",
            "content": base_filters + [
                {"op":"in","content":{"field":"files.analysis.workflow_type","value":[workflow]}}
            ]
        },
        "format":"JSON",
        "size":0
    }
    r = requests.post("https://api.gdc.cancer.gov/files", json=payload)
    r.raise_for_status()
    return r.json()["data"]["pagination"]["total"]

for wf in ["STAR - Counts","HTSeq - Counts"]:
    total = count_for_workflow(wf)
    print(f"{wf}: {total} files")


STAR - Counts: 524 files
HTSeq - Counts: 0 files


In [41]:
import json
import requests
import pandas as pd
from tqdm.auto import tqdm # Not used in this diagnostic version, but keep for your main script
from io import StringIO   # Not used in this diagnostic version, but keep for your main script

# 1) Query for STAR - Counts files
payload = {
    "filters": {
        "op": "and",
        "content": [
            {"op":"in","content":{"field":"cases.project.project_id","value":["TCGA-COAD"]}},
            {"op":"in","content":{"field":"files.data_category","value":["Transcriptome Profiling"]}},
            {"op":"in","content":{"field":"files.data_type","value":["Gene Expression Quantification"]}},
            {"op":"in","content":{"field":"files.analysis.workflow_type","value":["STAR - Counts"]}}
        ]
    },
    "format": "JSON",
    "size": 1000, 
    "fields": "file_id,cases.samples.submitter_id"
}

print("Querying GDC API for file list...")
r = requests.post("https://api.gdc.cancer.gov/files", json=payload)
r.raise_for_status()
files_metadata = r.json()["data"]["hits"]
print(f"Found {len(files_metadata)} STAR – Counts files.")

# --- Download and inspect only the FIRST file ---
if files_metadata:
    first_file_info = files_metadata[0]
    fid = first_file_info["file_id"]
    barcode = "N/A" # Initialize barcode
    # Check if cases and samples path exists before accessing
    if "cases" in first_file_info and first_file_info["cases"] and \
       "samples" in first_file_info["cases"][0] and first_file_info["cases"][0]["samples"]:
        barcode = first_file_info["cases"][0]["samples"][0]["submitter_id"]
    
    print(f"\nDownloading first file (ID: {fid}, Barcode: {barcode}) for inspection...")
    
    # --- THIS IS THE CORRECTED PAYLOAD ---
    download_payload = {"ids": [fid]}
    # ------------------------------------
    
    resp = requests.post(
        "https://api.gdc.cancer.gov/data",
        headers={"Content-Type": "application/json"},
        data=json.dumps(download_payload) # Send the corrected payload
    )
    resp.raise_for_status() # This will raise an error if download fails
    
    print("\n--- First ~30 lines of the downloaded file content ---")
    file_content = resp.text
    for i, line in enumerate(file_content.splitlines()):
        if i < 30: # Print first 30 lines
            print(line)
        else:
            break
    print("----------------------------------------------------")
    
else:
    print("No files found based on the initial query.")

Querying GDC API for file list...
Found 524 STAR – Counts files.

Downloading first file (ID: 2c73d8ac-84b9-4094-9f5e-fbd6c2e8eabf, Barcode: TCGA-AA-3688-01A) for inspection...

--- First ~30 lines of the downloaded file content ---
# gene-model: GENCODE v36
gene_id	gene_name	gene_type	unstranded	stranded_first	stranded_second	tpm_unstranded	fpkm_unstranded	fpkm_uq_unstranded
N_unmapped			394222	394222	394222			
N_multimapping			3444304	3444304	3444304			
N_noFeature			1268386	14581111	14519831			
N_ambiguous			2454255	653998	608338			
ENSG00000000003.15	TSPAN6	protein_coding	11714	5774	5940	318.9535	105.1931	99.1317
ENSG00000000005.6	TNMD	protein_coding	26	13	13	2.1756	0.7175	0.6762
ENSG00000000419.13	DPM1	protein_coding	2490	1265	1225	254.7929	84.0325	79.1904
ENSG00000000457.14	SCYL3	protein_coding	482	465	390	8.6490	2.8525	2.6881
ENSG00000000460.17	C1orf112	protein_coding	602	465	522	12.4542	4.1075	3.8708
ENSG00000000938.13	FGR	protein_coding	85	40	45	3.1041	1.0238	0.9648
ENSG000000

In [66]:
import pandas as pd
import numpy as np
# mygene is not needed if your CSV already has gene symbols

# --- Step 1: Load your locally saved TCGA data ---

# Use the full paths to your CSV files
expression_csv_path = r"TCGA_COAD_RNASeq2Gene_counts.csv"
sample_info_csv_path = r"TCGA_COAD_RNASeq2Gene_sampleInfo.csv"

print("Step 1: Loading local TCGA datasets from CSV files...")

# Initialize df_expression and df_sample_info
df_expression = pd.DataFrame()
df_sample_info = pd.DataFrame()

try:
    # Load the expression data.
    # The 'gene_id' column (which contains gene symbols) will be the index.
    df_expression = pd.read_csv(expression_csv_path, index_col='gene_id')
    
    # Load the sample information.
    # The 'barcode' column should be the index.
    df_sample_info = pd.read_csv(sample_info_csv_path, index_col='barcode')
    
    print("✅ Local CSV files loaded successfully.")
    print(f"Shape of loaded df_expression: {df_expression.shape}")

    # --- Ensure Gene Symbols in index are UPPERCASE and handle duplicates ---
    if not df_expression.empty:
        print("\nProcessing df_expression index (gene symbols)...")
        original_index_name = df_expression.index.name # Preserve index name if any
        
        # Convert index to string, strip whitespace, and uppercase
        df_expression.index = df_expression.index.astype(str).str.strip().str.upper()
        df_expression.index.name = original_index_name # Restore index name
        
        # Handle potential duplicate gene symbols by taking the mean
        # Check for duplicates first
        if df_expression.index.duplicated().any():
            print(f"  Found duplicate gene symbols. Aggregating by taking the mean...")
            df_expression = df_expression.groupby(df_expression.index).mean()
            print(f"  Shape after aggregating duplicates: {df_expression.shape}")
        else:
            print("  No duplicate gene symbols found in index.")
        print("✅ df_expression index standardized to uppercase gene symbols.")

    # Convert expression data to numeric, coercing errors (e.g., if some non-numeric data exists)
    # This is important before any mathematical operations like log transform or ssGSEA
    if not df_expression.empty:
        print("\nConverting expression data to numeric...")
        for col in df_expression.columns:
            df_expression[col] = pd.to_numeric(df_expression[col], errors='coerce')
        # Decide how to handle NaNs that might be created if 'coerce' was used.
        # For many downstream analyses like ssGSEA, NaNs need to be handled (e.g., fill with 0 or mean, or drop genes/samples).
        # For now, let's fill with 0, but this is a critical data quality step.
        original_nan_count = df_expression.isna().sum().sum()
        if original_nan_count > 0:
            print(f"  Found {original_nan_count} NaN values after to_numeric. Filling with 0.")
            df_expression = df_expression.fillna(0)
        print("✅ Expression data converted to numeric.")


except FileNotFoundError as e:
    print(f"❌ ERROR: A file was not found. {e}")
    print("     Please double-check that the file paths above are correct and the files exist in that location.")
except Exception as ex:
    print(f"❌ An unexpected error occurred during data loading or initial processing: {ex}")


# --- Display a quick check of the loaded and processed data ---
if not df_expression.empty and not df_sample_info.empty:
    print("\n--- First 5x5 of Processed Expression Data (df_expression) ---")
    # Gene symbols (UPPERCASE) should be rows, samples as columns
    try:
        display(df_expression.iloc[:5, :5])
    except NameError:
        print(df_expression.iloc[:5, :5].to_string())
    print("\nFirst 5 gene identifiers in df_expression.index:")
    print(list(df_expression.index[:5]))

    print("\n--- First 5 rows of Sample Info (df_sample_info) ---")
    try:
        display(df_sample_info.head())
    except NameError:
        print(df_sample_info.head().to_string())
    
    print("\nSample type counts from loaded sample_info:")
    print(df_sample_info['sample_type'].value_counts())
else:
    print("\n⚠️ df_expression or df_sample_info was not loaded/processed successfully.")

Step 1: Loading local TCGA datasets from CSV files...
✅ Local CSV files loaded successfully.
Shape of loaded df_expression: (20501, 326)

Processing df_expression index (gene symbols)...
  No duplicate gene symbols found in index.
✅ df_expression index standardized to uppercase gene symbols.

Converting expression data to numeric...
✅ Expression data converted to numeric.

--- First 5x5 of Processed Expression Data (df_expression) ---


Unnamed: 0_level_0,TCGA-3L-AA1B-01A-11R-A37K-07,TCGA-4N-A93T-01A-11R-A37K-07,TCGA-4T-AA8H-01A-11R-A41B-07,TCGA-5M-AAT4-01A-11R-A41B-07,TCGA-5M-AAT6-01A-11R-A41B-07
gene_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A1BG,45.8,354.01,28.72,15.01,75.89
A1CF,457.0,208.0,238.0,352.0,0.0
A2BP1,5.0,21.0,1.0,4.0,5.0
A2LD1,366.88,767.61,404.41,296.27,137.34
A2ML1,1.0,1.0,50.0,16.0,4.0



First 5 gene identifiers in df_expression.index:
['A1BG', 'A1CF', 'A2BP1', 'A2LD1', 'A2ML1']

--- First 5 rows of Sample Info (df_sample_info) ---


Unnamed: 0_level_0,sample_type
barcode,Unnamed: 1_level_1
TCGA-3L-AA1B-01A-11R-A37K-07,Primary Tumor
TCGA-4N-A93T-01A-11R-A37K-07,Primary Tumor
TCGA-4T-AA8H-01A-11R-A41B-07,Primary Tumor
TCGA-5M-AAT4-01A-11R-A41B-07,Primary Tumor
TCGA-5M-AAT6-01A-11R-A41B-07,Primary Tumor



Sample type counts from loaded sample_info:
sample_type
Primary Tumor          285
Solid Tissue Normal     41
Name: count, dtype: int64


<h2>🧬 TCGA COAD RNA-Seq Dataset: Expression Matrix & Sample Metadata Load</h2>

<h3>🔍 Objective</h3>
<p>
The above code aims to load and preprocess RNA-Seq gene expression data for colorectal cancer (COAD) from the TCGA, along with matched sample annotations.
This dataset will later be used to:
<ul>
  <li>Test whether CRC-linked metabolic genes are <strong>differentially expressed</strong></li>
  <li>Layer structural predictions (e.g. SMARTS motifs) with real biological activity</li>
</ul>
</p>

<hr>

<h3>📂 What This Script Does</h3>

<ol>
  <li><strong>Loads</strong> two CSVs:
    <ul>
      <li><code>TCGA_COAD_RNASeq2Gene_counts.csv</code> → raw expression matrix (gene symbols × samples)</li>
      <li><code>TCGA_COAD_RNASeq2Gene_sampleInfo.csv</code> → sample type metadata (e.g., tumor vs. normal)</li>
    </ul>
  </li>
  <li><strong>Standardizes</strong> gene symbols to <code>UPPERCASE</code>, handles duplicates by averaging</li>
  <li><strong>Converts</strong> all expression values to numeric, coercing non-numeric entries</li>
  <li><strong>Fills</strong> any resulting <code>NaN</code> values with 0 to ensure a complete matrix for downstream use</li>
</ol>

<hr>

<h3>📊 Results Snapshot</h3>

<h4>✅ Expression Matrix Summary (<code>df_expression</code>)</h4>
<ul>
  <li>🧬 <strong>Genes:</strong> 20,501</li>
  <li>🧪 <strong>Samples:</strong> 326</li>
  <li>✅ All gene symbols standardized and numeric</li>
</ul>

<h4>📑 Sample Metadata (<code>df_sample_info</code>)</h4>
<ul>
  <li>🔬 <strong>Primary Tumor samples:</strong> 285</li>
  <li>🧪 <strong>Solid Tissue Normal samples:</strong> 41</li>
</ul>

<h4>🧾 Example Expression Snapshot (First 5 genes × 5 samples)</h4>
<table border="1" cellpadding="6">
  <tr><th>Gene</th><th>Sample 1</th><th>Sample 2</th><th>Sample 3</th><th>Sample 4</th><th>Sample 5</th></tr>
  <tr><td>A1BG</td><td>45.80</td><td>354.01</td><td>28.72</td><td>15.01</td><td>75.89</td></tr>
  <tr><td>A1CF</td><td>457.00</td><td>208.00</td><td>238.00</td><td>352.00</td><td>0.00</td></tr>
  <tr><td>A2BP1</td><td>5.00</td><td>21.00</td><td>1.00</td><td>4.00</td><td>5.00</td></tr>
  <tr><td>A2LD1</td><td>366.88</td><td>767.61</td><td>404.41</td><td>296.27</td><td>137.34</td></tr>
  <tr><td>A2ML1</td><td>1.00</td><td>1.00</td><td>50.00</td><td>16.00</td><td>4.00</td></tr>
</table>


In [67]:
import pandas as pd
import ast # For safely evaluating string representations if 'gene_symbols' was read as string

# --- Ensure 'analysis_ready_df' is defined ---
# If you haven't run the cells that create analysis_ready_df in your current session,
# you'll need to do that first.
# For this example, I'll create a placeholder analysis_ready_df.
# REPLACE THIS WITH YOUR ACTUAL analysis_ready_df if it's not already in memory.
if 'analysis_ready_df' not in locals() and 'analysis_ready_df' not in globals():
    print("INFO: 'analysis_ready_df' not found. Creating a placeholder for demonstration.")
    data_for_ardf = {
        'kegg_enzyme': ['2.7.1.15', '3.4.13.18', '3.5.1.14', '1.1.1.1'],
        'kegg_reactions': [['R02750', 'R02749'], ['R00669', 'R01166'], ['R00669', 'R01166'], ['R00001']],
        'gene_symbols': [['RBKS', 'GENE_X'], ['CNDP2'], ['ACY1', 'ABHD14A-ACY1'], ['ADH1A', 'NON_EXISTENT_GENE']]
    }
    analysis_ready_df = pd.DataFrame(data_for_ardf)
    print("Placeholder 'analysis_ready_df' created. Please ensure your actual one is loaded for accurate results.")
else:
    print("Using existing 'analysis_ready_df'.")

# --- Ensure 'df_expression' is defined (loaded from your CSV) ---
# The code you provided in your last message loads df_expression.
# If it's not loaded, the following lines will cause an error.
if 'df_expression' not in locals() and 'df_expression' not in globals():
    print("❌ ERROR: 'df_expression' (from TCGA_COAD_RNASeq2Gene_counts.csv) is not loaded.")
    print("          Please run the cell that loads this CSV file first.")
    # exit() # Or handle as appropriate
else:
    print("✅ 'df_expression' (TCGA data) is loaded.")

    # --- Step 1: Extract unique gene symbols from your analysis_ready_df ---
    if 'gene_symbols' not in analysis_ready_df.columns:
        print("❌ ERROR: 'analysis_ready_df' does not have a 'gene_symbols' column.")
    else:
        # The 'gene_symbols' column contains lists of symbols. We need to flatten them.
        # First, ensure all entries are lists (in case some are strings due to how it was made or loaded)
        def ensure_list(item):
            if isinstance(item, str):
                try:
                    return ast.literal_eval(item) # Safely evaluate string like "['GENE1', 'GENE2']"
                except (ValueError, SyntaxError):
                    return [item] # If it's a single gene symbol string, put it in a list
            elif isinstance(item, list):
                return item
            return [] # Default to empty list for other types or NaNs

        analysis_ready_df['gene_symbols_list'] = analysis_ready_df['gene_symbols'].apply(ensure_list)
        
        all_user_gene_symbols_flat = [symbol for sublist in analysis_ready_df['gene_symbols_list'] for symbol in sublist]
        unique_user_gene_symbols = sorted(list(set(all_user_gene_symbols_flat)))
        print(f"\nFound {len(unique_user_gene_symbols)} unique gene symbols in your 'analysis_ready_df'.")
        if len(unique_user_gene_symbols) < 20: # Print if list is short
            print(f"   Your gene symbols: {unique_user_gene_symbols}")


        # --- Step 2: Get the list of gene symbols available in the TCGA df_expression ---
        # The gene symbols are the index of df_expression
        tcga_gene_symbols = df_expression.index.tolist()
        unique_tcga_gene_symbols = sorted(list(set(tcga_gene_symbols))) # Should already be unique from index
        print(f"Found {len(unique_tcga_gene_symbols)} unique gene symbols in the loaded TCGA expression data.")


        # --- Step 3: Find the intersection (genes present in both lists) ---
        user_genes_set = set(unique_user_gene_symbols)
        tcga_genes_set = set(unique_tcga_gene_symbols)

        common_genes = sorted(list(user_genes_set.intersection(tcga_genes_set)))
        genes_in_user_list_not_in_tcga = sorted(list(user_genes_set - tcga_genes_set))

        print(f"\n--- Gene List Comparison ---")
        print(f"Number of unique genes from your analysis_ready_df: {len(unique_user_gene_symbols)}")
        print(f"Number of unique genes in the TCGA COAD expression data: {len(unique_tcga_gene_symbols)}")
        print(f"Number of your genes FOUND in the TCGA expression data: {len(common_genes)}")
        
        if len(common_genes) < 20 and len(common_genes) > 0: # Print if list is short and not empty
            print(f"   Common genes: {common_genes}")

        print(f"Number of your genes NOT FOUND in the TCGA expression data: {len(genes_in_user_list_not_in_tcga)}")
        if genes_in_user_list_not_in_tcga:
            print(f"   Genes from your list not found in TCGA data: {genes_in_user_list_not_in_tcga}")
        else:
            print("   All genes from your list were found in the TCGA data!")

Using existing 'analysis_ready_df'.
✅ 'df_expression' (TCGA data) is loaded.

Found 74 unique gene symbols in your 'analysis_ready_df'.
Found 20501 unique gene symbols in the loaded TCGA expression data.

--- Gene List Comparison ---
Number of unique genes from your analysis_ready_df: 74
Number of unique genes in the TCGA COAD expression data: 20501
Number of your genes FOUND in the TCGA expression data: 65
Number of your genes NOT FOUND in the TCGA expression data: 9
   Genes from your list not found in TCGA data: ['AKR1C8', 'ALPG', 'GK3', 'NT5C1B-RDH14', 'NT5C3A', 'NT5C3B', 'NT5DC4', 'UGT2A2', 'UGT2B17']


<h2>🧬 Gene Symbol Cross-Mapping: Metabolic Targets vs. TCGA Expression Matrix</h2>

<h3>🔍 Purpose</h3>
<p>
This above analysis compares gene symbols derived from KEGG metabolic enzymes (via <code>analysis_ready_df</code>) to those available in the TCGA-COAD RNA-seq expression dataset. 
It ensures that genes identified from metabolite/enzyme mapping can actually be evaluated in tumor vs. normal transcriptomic data.
</p>

<hr>

<h3>📦 Summary of Data Loaded</h3>
<ul>
  <li>✅ <strong>Metabolic Targets (from KEGG):</strong> 74 unique gene symbols</li>
  <li>✅ <strong>TCGA Expression Dataset:</strong> 20,501 unique gene symbols (rows of expression matrix)</li>
</ul>

<hr>

<h3>🧪 Intersection Results</h3>
<ul>
  <li>🔎 <strong>Common Genes (Present in Both):</strong> 65</li>
  <li>⚠️ <strong>Missing Genes (Not Found in TCGA COAD):</strong> 9</li>
</ul>

<h4>❌ Genes <u>Not Found</u> in TCGA COAD Expression Matrix</h4>
<ul>
  <li>AKR1C8</li>
  <li>ALPG</li>
  <li>GK3</li>
  <li>NT5C1B-RDH14</li>
  <li>NT5C3A</li>
  <li>NT5C3B</li>
  <li>NT5DC4</li>
  <li>UGT2A2</li>
  <li>UGT2B17</li>
</ul>

<hr>

<h3>🧠 Interpretation & Biological Considerations</h3>
<ul>
  <li>These 9 genes may represent:
    <ul>
      <li>🧬 Genes with very low or no expression in colorectal tissues</li>
      <li>💡 Non-standard or alias symbols not captured in TCGA’s normalization pipeline</li>
      <li>❓ Fusion or predicted genes (e.g. <code>NT5C1B-RDH14</code>) that don’t match official gene annotations</li>
    </ul>
  </li>
  <li>For downstream expression analysis, only the 65 matched genes can be analyzed reliably</li>
</ul>




In [68]:
import pandas as pd # Should already be imported
import numpy as np  # Should already be imported

# This cell assumes 'df_expression' and 'df_sample_info' are already loaded from your CSVs
# and 'common_genes' list is available from your last script execution.

if 'df_expression' not in locals() or 'df_sample_info' not in locals():
    print("❌ ERROR: 'df_expression' or 'df_sample_info' not found. Please load them from your CSVs first.")
else:
    print("\nStep A: Preparing and processing TCGA data for analysis...")

    # Transpose the expression DataFrame so samples are rows and genes are columns
    df_expression_T = df_expression.T

    # Merge with sample information
    # The index of df_expression_T (sample barcodes) should match the index of df_sample_info
    df_merged_tcga = pd.merge(df_sample_info, df_expression_T, left_index=True, right_index=True)

    # This df_merged_tcga now contains the 'sample_type' column and all gene expression columns
    # It is now our 'valid_samples' DataFrame
    valid_samples = df_merged_tcga.copy()

    # Log-transform the expression data
    # (curatedTCGAData RNASeq2Gene is often RSEM/TPM-like, log2(x+1) is a common next step)
    if 'sample_type' in valid_samples.columns:
        gene_columns_for_log = valid_samples.columns.drop('sample_type')
    else:
        gene_columns_for_log = valid_samples.columns.copy()
        print("⚠️ WARNING: 'sample_type' column not found directly after merge. Check df_sample_info structure.")

    # Ensure gene columns are numeric before log transformation
    for col in gene_columns_for_log:
        valid_samples[col] = pd.to_numeric(valid_samples[col], errors='coerce')

    valid_samples[gene_columns_for_log] = np.log2(valid_samples[gene_columns_for_log] + 1)

    print(f"✅ Data prepared. 'valid_samples' DataFrame now contains {valid_samples.shape[0]} samples.")
    print("   It includes columns for your gene symbols and a 'sample_type' column.")
    print("Sample type counts in 'valid_samples':")
    print(valid_samples['sample_type'].value_counts())
    print("\n--- Head of 'valid_samples' (some columns) ---")
    # Display sample_type and the first few common genes if available
    display_cols = ['sample_type'] + common_genes[:4] if 'common_genes' in locals() and common_genes else ['sample_type']
    existing_display_cols = [col for col in display_cols if col in valid_samples.columns]
    if existing_display_cols:
        display(valid_samples[existing_display_cols].head())
    else:
        display(valid_samples.iloc[:5,:5])


Step A: Preparing and processing TCGA data for analysis...
✅ Data prepared. 'valid_samples' DataFrame now contains 326 samples.
   It includes columns for your gene symbols and a 'sample_type' column.
Sample type counts in 'valid_samples':
sample_type
Primary Tumor          285
Solid Tissue Normal     41
Name: count, dtype: int64

--- Head of 'valid_samples' (some columns) ---


Unnamed: 0,sample_type,AASS,AKR1C3,ALOX15,ALPI
TCGA-3L-AA1B-01A-11R-A37K-07,Primary Tumor,9.579316,10.461479,5.044394,4.523562
TCGA-4N-A93T-01A-11R-A37K-07,Primary Tumor,8.787903,12.154185,7.523562,5.754888
TCGA-4T-AA8H-01A-11R-A41B-07,Primary Tumor,8.247928,9.748193,6.672425,2.584963
TCGA-5M-AAT4-01A-11R-A41B-07,Primary Tumor,5.523562,8.797662,5.523562,5.321928
TCGA-5M-AAT6-01A-11R-A41B-07,Primary Tumor,8.228819,9.739781,5.754888,0.0


<h2>🧪 Preparing TCGA Expression Data for Differential Analysis</h2>

<h3>🎯 Objective</h3>
<p>
To create a clean, biologically interpretable gene expression matrix from TCGA-COAD RNA-seq data, where:
<ul>
  <li>🧬 Rows represent individual tumor or normal tissue samples</li>
  <li>🧪 Columns represent log<sub>2</sub>-transformed expression values of matched metabolic genes</li>
</ul>
This matrix enables differential expression comparisons between tumor and normal tissue for metabolically linked genes.
</p>

---

<h3>🔍 What This Step Does</h3>
<ol>
  <li><strong>Transposes</strong> the gene expression matrix so samples become rows and genes become columns.</li>
  <li><strong>Merges</strong> sample metadata (e.g. tumor vs. normal) with the expression matrix via sample barcodes.</li>
  <li><strong>Applies</strong> <code>log2(x + 1)</code> transformation to all gene expression values — a standard step for RNA-seq normalization and variance stabilization.</li>
  <li><strong>Validates</strong> gene identity by retaining only the common genes between the metabolite-inferred set and TCGA's dataset.</li>
</ol>

---

<h3>📊 Output Overview</h3>
<ul>
  <li><strong>Samples Processed:</strong> 326 total TCGA COAD samples</li>
  <li><strong>Sample Type Breakdown:</strong>
    <ul>
      <li>🧪 <strong>Primary Tumor:</strong> 285 samples</li>
      <li>🧫 <strong>Solid Tissue Normal:</strong> 41 samples</li>
    </ul>
  </li>
</ul>

---

<h3>✅ Example Preview of <code>valid_samples</code> DataFrame</h3>
<p>Below are log-transformed expression values for the first few common genes:</p>

<pre><code>
sample_type         AASS     AKR1C3    ALOX15    ALPI
TCGA-3L-AA1B...    9.58     10.46     5.04      4.52
TCGA-4N-A93T...    8.79     12.15     7.52      5.75
TCGA-4T-AA8H...    8.25     9.75      6.67      2.58
...                ...      ...       ...       ...
</code></pre>

---

<h3>💡 Why This Step Matters</h3>
<ul>
  <li>Ensures gene expression values are on a consistent and interpretable scale (log<sub>2</sub>).</li>
  <li>Aligns omics data (RNA expression) with metabolomic predictions, enabling:
    <ul>



In [73]:
from scipy.stats import ttest_ind
import pandas as pd
import numpy as np
import ast # For ast.literal_eval

# This cell assumes 'analysis_ready_df' (with your original metabolite/pathway info and 'gene_symbols' column),
# 'valid_samples' (from Code Cell 1), and 'common_genes' (the list of 105 genes) are defined.

if 'analysis_ready_df' not in locals() or 'valid_samples' not in locals() or 'common_genes' not in locals():
    print("❌ ERROR: 'analysis_ready_df', 'valid_samples', or 'common_genes' not found. Please run previous cells.")
else:
    gene_expression_summary_data = []
    print(f"\nGenerating gene expression summary for {len(common_genes)} common genes...")

    for gene in common_genes: # Iterate only through genes known to be in valid_samples
        if gene in valid_samples.columns:
            tumor_expr = valid_samples[valid_samples['sample_type'] == 'Primary Tumor'][gene].dropna()
            normal_expr = valid_samples[valid_samples['sample_type'] == 'Solid Tissue Normal'][gene].dropna()

            mean_tumor = tumor_expr.mean()
            mean_normal = normal_expr.mean()
            log2fc = mean_tumor - mean_normal

            p_value = np.nan
            if len(tumor_expr) >= 2 and len(normal_expr) >= 2:
                stat, p_value = ttest_ind(tumor_expr, normal_expr, equal_var=False, nan_policy='omit')

            gene_expression_summary_data.append({
                'Gene_Symbol': gene, # This will be a string
                'Mean_Tumor_Expr_log2': mean_tumor,
                'Mean_Normal_Expr_log2': mean_normal,
                'Log2FC_GeneExpr': log2fc,
                'P_value_GeneExpr': p_value
            })
        # Removed the 'else' block that appended NaNs, as common_genes should all be in valid_samples columns
        # If a gene from common_genes is NOT in valid_samples.columns, it indicates an issue with common_genes creation.

    Table1_GeneCentric_summary = pd.DataFrame(gene_expression_summary_data)

    # --- Crucial fix for analysis_ready_df before exploding and merging ---
    print("\nCleaning 'analysis_ready_df' for merging...")
    
    # Define the function to ensure 'gene_symbols' cells are lists of strings
    def clean_gene_symbols_column(item):
        if isinstance(item, str): # If it's a string like "['GENE1', 'GENE2']" or "GENE1"
            try:
                evaluated_item = ast.literal_eval(item) # Try to evaluate as a list
                if isinstance(evaluated_item, list):
                    return [str(s).strip() for s in evaluated_item] # Ensure elements are strings
                return [str(evaluated_item).strip()] # If it evaluates to a single item
            except (ValueError, SyntaxError):
                return [item.strip()] # If it's a plain string, put it in a list
        elif isinstance(item, list):
            return [str(s).strip() for s in item] # Ensure elements are strings
        return [] # Default for other types or NaNs

    # Apply the cleaning function to the 'gene_symbols' column
    # Create a new column for the cleaned lists to avoid modifying original df if not desired
    analysis_ready_df_cleaned = analysis_ready_df.copy()
    analysis_ready_df_cleaned['gene_symbols_for_explode'] = analysis_ready_df_cleaned['gene_symbols'].apply(clean_gene_symbols_column)
    
    # Explode analysis_ready_df on the cleaned list column
    analysis_exploded_df = analysis_ready_df_cleaned.explode('gene_symbols_for_explode')
    analysis_exploded_df.rename(columns={'gene_symbols_for_explode': 'Gene_Symbol'}, inplace=True)
    
    # Ensure 'Gene_Symbol' in analysis_exploded_df is string type and not empty
    analysis_exploded_df = analysis_exploded_df[analysis_exploded_df['Gene_Symbol'].apply(lambda x: isinstance(x, str) and x != '')]

    print("--- Head of analysis_exploded_df (left side of merge, 'Gene_Symbol' should be string) ---")
    display(analysis_exploded_df[['kegg_enzyme', 'Gene_Symbol']].head())
    print("--- Head of Table1_GeneCentric_summary (right side of merge, 'Gene_Symbol' is string) ---")
    display(Table1_GeneCentric_summary[['Gene_Symbol']].head())
    # --- End of crucial fix ---

    Table1_GeneCentric_final = pd.merge(analysis_exploded_df, Table1_GeneCentric_summary, on='Gene_Symbol', how='left')
    
    print("\n--- Table1_GeneCentric (Your data merged with Gene Expression Info) ---")
    display_cols_t1 = ['Metabolite', 'Metabolite_KEGG_ID', 'kegg_enzyme', 'Original_Pathway_Name', 
                       'Gene_Symbol', 'Mean_Tumor_Expr_log2', 'Mean_Normal_Expr_log2', 
                       'Log2FC_GeneExpr', 'P_value_GeneExpr']
    existing_cols_t1 = [col for col in display_cols_t1 if col in Table1_GeneCentric_final.columns]
    
    # Drop the temporary column if it exists
    if 'gene_symbols_list' in Table1_GeneCentric_final.columns: # From previous ensure_list if analysis_ready_df was reused
        Table1_GeneCentric_final = Table1_GeneCentric_final.drop(columns=['gene_symbols_list'])
    if 'gene_symbols' in Table1_GeneCentric_final.columns and 'Gene_Symbol' in Table1_GeneCentric_final.columns and 'gene_symbols' != 'Gene_Symbol':
        Table1_GeneCentric_final = Table1_GeneCentric_final.drop(columns=['gene_symbols'])


    display(Table1_GeneCentric_final[existing_cols_t1].head())


Generating gene expression summary for 65 common genes...

Cleaning 'analysis_ready_df' for merging...
--- Head of analysis_exploded_df (left side of merge, 'Gene_Symbol' should be string) ---


Unnamed: 0,kegg_enzyme,Gene_Symbol
0,1.1.1.30,BDH1
1,4.1.2.4,DERA
2,2.7.1.15,RBKS
3,1.2.3.1,AOX1
4,1.4.3.3,DAO


--- Head of Table1_GeneCentric_summary (right side of merge, 'Gene_Symbol' is string) ---


Unnamed: 0,Gene_Symbol
0,AASS
1,AKR1C3
2,ALOX15
3,ALPI
4,ALPL



--- Table1_GeneCentric (Your data merged with Gene Expression Info) ---


Unnamed: 0,kegg_enzyme,Gene_Symbol,Mean_Tumor_Expr_log2,Mean_Normal_Expr_log2,Log2FC_GeneExpr,P_value_GeneExpr
0,1.1.1.30,BDH1,11.435665,12.463213,-1.027547,1.224089e-11
1,4.1.2.4,DERA,10.709742,11.434309,-0.724568,2.791374e-09
2,2.7.1.15,RBKS,7.193134,8.619459,-1.426326,9.665781e-15
3,1.2.3.1,AOX1,5.818059,8.536311,-2.718253,3.681753e-23
4,1.4.3.3,DAO,0.891153,6.657734,-5.766581,3.764261e-22


<h2>🧬 Table 1: Differential Expression of CRC-Relevant Metabolic Genes</h2>

<h3>🎯 Objective</h3>
<p>
This analysis connects metabolite-driven enzyme predictions to gene-level transcriptional changes in colorectal cancer (CRC).
We assess whether genes linked to CRC-relevant metabolites are <strong>differentially expressed</strong> in tumor versus normal tissue samples in the TCGA-COAD RNA-seq dataset.
</p>

---

<h3>🧪 What This Table Contains</h3>
<ul>
  <li><strong>Gene Symbol:</strong> Gene encoding the enzyme that acts on a CRC-associated metabolite.</li>
  <li><strong>Mean_Tumor_Expr_log2:</strong> Mean log<sub>2</sub>-transformed expression across all primary tumor samples.</li>
  <li><strong>Mean_Normal_Expr_log2:</strong> Mean log<sub>2</sub>-transformed expression across all matched solid tissue normal samples.</li>
  <li><strong>Log2FC_GeneExpr:</strong> Log<sub>2</sub> fold change: Tumor - Normal (positive = overexpressed in tumor).</li>
  <li><strong>P_value_GeneExpr:</strong> Welch's t-test p-value testing differential expression between tumor and normal.</li>
</ul>

---

<h3>📊 Example Results</h3>
<table border="1" cellspacing="0" cellpadding="4">
<thead>
  <tr>
    <th>Gene Symbol</th>
    <th>Log<sub>2</sub> Fold Change</th>
    <th>P-value</th>
    <th>Interpretation</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><code>BDH1</code></td>
    <td>-1.03</td>
    <td>1.22e-11</td>
    <td>Significantly downregulated in tumors</td>
  </tr>
  <tr>
    <td><code>RBKS</code></td>
    <td>-1.43</td>
    <td>9.67e-15</td>
    <td>Repressed pentose phosphate entry via ribokinase</td>
  </tr>
  <tr>
    <td><code>AOX1</code></td>
    <td>-2.72</td>
    <td>3.68e-23</td>
    <td>Strong loss of oxidative metabolism signature</td>
  </tr>
  <tr>
    <td><code>DAO</code></td>
    <td>-5.77</td>
    <td>3.76e-22</td>
    <td>Major downregulation — potentially silenced</td>
  </tr>
</tbody>
</table>

---

<h3>🔬 Why This Matters</h3>
<ul>
  <li>This table provides direct transcriptomic evidence for the activity of specific metabolic pathways inferred from CRC metabolites.</li>
  <li>Helps prioritize enzymes not just by structure-based predictions but also by <strong>tumor-associated expression shifts</strong>.</li>
  <li>Identifies potential targets for inhibition (if overexpressed) or reactivation (if silenced) in CRC metabolic reprogramming.</li>
</ul>

---

<h3>⚠️ Considerations</h3>
<ul>
  <li>Gene expression is only one layer — future integration with mutation or epigenetic data is needed.</li>
  <li>Some metabolic genes may show subtle expression changes but dramatic flux differences.</li>
</ul>


In [70]:
import pandas as pd
import numpy as np

# This cell assumes 'valid_samples' DataFrame is available from your earlier data preparation
# (i.e., after loading TCGA data from your R-generated CSVs and processing it into 'valid_samples')

if 'valid_samples' not in locals() or valid_samples.empty:
    print("❌ ERROR: 'valid_samples' DataFrame is not defined or empty. \nPlease ensure the cell that creates 'valid_samples' (from merging TCGA expression and sample info) has been run successfully.")
else:
    print("Preparing 'expression_linear_for_decoupler' from 'valid_samples'...")
    
    # Ensure 'valid_samples' has samples as rows and genes as columns,
    # and a 'sample_type' column.
    if 'sample_type' in valid_samples.columns:
        # Get the expression data part, transpose so genes are rows and samples are columns
        expression_data_log_transformed = valid_samples.drop(columns=['sample_type']).T 
    else:
        # If 'sample_type' column was somehow dropped or not present in valid_samples
        print("⚠️ WARNING: 'sample_type' column not found in 'valid_samples'. Transposing all columns.")
        expression_data_log_transformed = valid_samples.T 
    
    # Inverse of log2(x+1) transformation to get back to a pseudo-linear scale
    expression_linear_for_decoupler = np.power(2, expression_data_log_transformed) - 1
    
    # Handle potential NaNs that might have been introduced if original values were -1 after log before inverse
    # or if some genes had all NaNs before transformation.
    expression_linear_for_decoupler = expression_linear_for_decoupler.fillna(0).astype(float) 
    
    print("✅ 'expression_linear_for_decoupler' (genes x samples, linear scale) created successfully.")
    print("Shape:", expression_linear_for_decoupler.shape)
    print("\n--- Head of 'expression_linear_for_decoupler' ---")
    display(expression_linear_for_decoupler.iloc[:5, :5])

Preparing 'expression_linear_for_decoupler' from 'valid_samples'...
✅ 'expression_linear_for_decoupler' (genes x samples, linear scale) created successfully.
Shape: (20501, 326)

--- Head of 'expression_linear_for_decoupler' ---


Unnamed: 0,TCGA-3L-AA1B-01A-11R-A37K-07,TCGA-4N-A93T-01A-11R-A37K-07,TCGA-4T-AA8H-01A-11R-A41B-07,TCGA-5M-AAT4-01A-11R-A41B-07,TCGA-5M-AAT6-01A-11R-A41B-07
A1BG,45.8,354.01,28.72,15.01,75.89
A1CF,457.0,208.0,238.0,352.0,0.0
A2BP1,5.0,21.0,1.0,4.0,5.0
A2LD1,366.88,767.61,404.41,296.27,137.34
A2ML1,1.0,1.0,50.0,16.0,4.0


<h2>🧬 Preparing Expression Matrix for Activity Inference</h2>

<h3>📌 What is <code>expression_linear_for_decoupler</code>?</h3>
<p>
This is a matrix of gene expression values (rows = genes, columns = TCGA-COAD samples) that has been:
</p>
<ul>
  <li><strong>Log-transformed → reversed</strong> via <code>2<sup>x</sup> - 1</code> to return to a pseudo-linear scale (from prior <code>log2(x + 1)</code>).</li>
  <li><strong>Converted to numeric</strong> and cleaned of missing values by filling NaNs with <code>0</code>.</li>
</ul>

<p>
It is now suitable as input for downstream tools that require <strong>linear-scale expression data</strong>, such as:
</p>
<ul>
  <li><strong><code>decoupler-py</code></strong> or <strong><code>viper</code></strong> (to infer transcription factor or kinase activity)</li>
  <li>Gene Set Enrichment Analysis (GSEA) using raw counts</li>
  <li>Metabolic activity scoring (e.g. single-sample GSEA)</li>
</ul>

---

<h3>🧪 Why Transform It Back from Log Space?</h3>
<p>
Some inference tools (especially those that apply matrix factorization or enrichment) assume additive behavior, which aligns more naturally with raw (or pseudo-linear) expression levels. 
Operating in log space would distort such additivity-based models.
</p>

<p>
For instance:
<code>log2(100 + 1) - log2(10 + 1) ≠ log2(100/10)</code>  
but  
<code>(100 - 10) = 90</code> does match additive expectations in linear scale.
</p>

---

<h3>📊 Example Snapshot (First 5 Genes × 5 Samples)</h3>
<p>The first few rows show typical linearized expression values:</p>
<table border="1" cellpadding="4" cellspacing="0">
  <thead>
    <tr>
      <th>Gene</th>
      <th>TCGA-3L-AA1B</th>
      <th>TCGA-4N-A93T</th>
      <th>TCGA-4T-AA8H</th>
      <th>TCGA-5M-AAT4</th>
      <th>TCGA-5M-AAT6</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>A1BG</td><td>45.80</td><td>354.01</td><td>28.72</td><td>15.01</td><td>75.89</td></tr>
    <tr><td>A1CF</td><td>457.00</td><td>208.00</td><td>238.00</td><td>352.00</td><td>0.00</td></tr>
    <tr><td>A2BP1</td><td>5.00</td><td>21.00</td><td>1.00</td><td>4.00</td><td>5.00</td></tr>
    <tr><td>A2LD1</td><td>366.88</td><td>767.61</td><td>404.41</td><td>296.27</td><td>137.34</td></tr>
    <tr><td>A2ML1</td><td>1.00</td><td>1.00</td><td>50.00</td><td>16.00</td><td>4.00</td></tr>
  </tbody>
</table>

---

<h3>✅ Next Step</h3>
<p>
We can now use <code>expression_linear_for_decoupler</code> as input to construct a regulon or activity matrix for transcription factors, kinases, or enzyme-centric gene sets inferred from your CRC metabolic rule pipeline.
</p>


In [58]:
expression_linear_for_gseapy = expression_linear_for_decoupler.copy()
print("Assigned 'expression_linear_for_decoupler' to 'expression_linear_for_gseapy' for the next cell.")
print(f"Head of expression_linear_for_gseapy index: {list(expression_linear_for_gseapy.index[:5])}")

Assigned 'expression_linear_for_decoupler' to 'expression_linear_for_gseapy' for the next cell.
Head of expression_linear_for_gseapy index: ['A1BG', 'A1CF', 'A2BP1', 'A2LD1', 'A2ML1']


In [59]:
import pandas as pd
import numpy as np
import gseapy as gp
import decoupler as dc # For loading PROGENy
from scipy.stats import ttest_ind # For summarization
from tqdm.auto import tqdm # For progress bars (optional, but good for ssGSEA if many samples)

# --- Prerequisites ---
# This script assumes the following DataFrames are ALREADY LOADED and defined
# in your Python environment/notebook:
#
# 1. expression_linear_for_gseapy:
#    Your gene expression data.
#    Format: pandas DataFrame, genes as index (rows), samples as columns.
#    Values should be on a linear scale (e.g., de-logged TPM or FPKM).
#
# 2. valid_samples:
#    Your sample information DataFrame.
#    Format: pandas DataFrame, TCGA barcodes (or your sample IDs) as index.
#    MUST contain a column named 'sample_type' with values like
#    'Primary Tumor' and 'Solid Tissue Normal'.

# Example placeholders if they are not already defined (replace with your actual loading):
if 'expression_linear_for_gseapy' not in locals() or not isinstance(expression_linear_for_gseapy, pd.DataFrame) or expression_linear_for_gseapy.empty:
    print("⚠️ 'expression_linear_for_gseapy' not defined or empty. Creating a dummy DataFrame for demonstration.")
    expression_linear_for_gseapy = pd.DataFrame(
        np.abs(np.random.rand(500, 50) * 1000), # Ensure non-negative for linear scale
        columns=[f'Sample{i+1}' for i in range(50)], 
        index=[f'GENE{j+1}' for j in range(500)] # Start with uppercase for dummy
    )
    # In case dummy genes are not uppercase:
    # expression_linear_for_gseapy.index = expression_linear_for_gseapy.index.astype(str).str.lower() 
    # print(expression_linear_for_gseapy.head())


if 'valid_samples' not in locals() or not isinstance(valid_samples, pd.DataFrame) or valid_samples.empty:
    print("⚠️ 'valid_samples' not defined or empty. Creating a dummy DataFrame for demonstration.")
    sample_ids_demo = expression_linear_for_gseapy.columns
    sample_types_demo = ['Primary Tumor'] * (len(sample_ids_demo) // 2) + ['Solid Tissue Normal'] * (len(sample_ids_demo) - (len(sample_ids_demo) // 2))
    if len(sample_types_demo) < len(sample_ids_demo): # Adjust if odd number of samples
        sample_types_demo.append('Primary Tumor')
    valid_samples = pd.DataFrame({'sample_type': sample_types_demo[:len(sample_ids_demo)]}, index=sample_ids_demo)
# --- End Prerequisites ---


# --- Step 1: Load PROGENy Gene Sets using DECOUPLER ---
print("\n--- Loading PROGENy Gene Sets via decoupler ---")
progeny_gene_sets_for_gseapy = {} 

try:
    progeny_model_df = dc.op.progeny(organism='human', top=100) 
    progeny_model_df['target'] = progeny_model_df['target'].astype(str).str.strip().str.upper()
    progeny_model_df['source'] = progeny_model_df['source'].astype(str).str.strip().str.upper()
    print(f"✅ PROGENy model loaded from decoupler. Shape: {progeny_model_df.shape}")

    for pathway_name, group_df in progeny_model_df.groupby('source'):
        progeny_gene_sets_for_gseapy[pathway_name] = group_df['target'].unique().tolist()
    
    if progeny_gene_sets_for_gseapy:
        print(f"✅ Successfully converted PROGENy model to gseapy format. Found {len(progeny_gene_sets_for_gseapy)} pathways.")
    else:
        print("❌ PROGENy gene sets dictionary is empty after conversion from decoupler model.")
        
except Exception as e:
    print(f"❌ Error loading or processing PROGENy gene sets via decoupler: {e}")
    progeny_gene_sets_for_gseapy = None


# --- Step 2: Prepare Expression Data (Ensure Uppercase Gene IDs) ---
if 'expression_linear_for_gseapy' in locals() and isinstance(expression_linear_for_gseapy, pd.DataFrame) and not expression_linear_for_gseapy.empty:
    print("\n--- Preparing Expression Data ---")
    # *** CRUCIAL FIX: Ensure gene identifiers in expression_matrix index are UPPERCASE ***
    original_index_name = expression_linear_for_gseapy.index.name # Preserve index name if any
    expression_linear_for_gseapy.index = expression_linear_for_gseapy.index.astype(str).str.upper()
    expression_linear_for_gseapy.index.name = original_index_name
    print("✅ Gene identifiers in expression_linear_for_gseapy index converted to UPPERCASE.")
    # print(expression_linear_for_gseapy.head()) # Optional: check
else:
    print("⚠️ 'expression_linear_for_gseapy' not ready for case conversion.")


# --- Step 3: Run ssGSEA for PROGENy Pathway Activity ---
Table2_PathwayActivitySummary = pd.DataFrame() # Initialize

prerequisites_ok_for_progeny = True
if not ('expression_linear_for_gseapy' in locals() and isinstance(expression_linear_for_gseapy, pd.DataFrame) and not expression_linear_for_gseapy.empty):
    print("❌ ERROR: 'expression_linear_for_gseapy' DataFrame is not defined or empty.")
    prerequisites_ok_for_progeny = False
if not progeny_gene_sets_for_gseapy: 
    print("❌ ERROR: PROGENy gene sets (from decoupler) not loaded. Cannot run ssGSEA for Table2.")
    prerequisites_ok_for_progeny = False

if prerequisites_ok_for_progeny:
    print("\nRunning gseapy.ssgsea for PROGENy pathway activity...")
    try:
        expression_data_for_ssgsea = expression_linear_for_gseapy.select_dtypes(include=[np.number])
        if expression_data_for_ssgsea.empty and not expression_linear_for_gseapy.empty :
            print("❌ ERROR: Expression data is empty after selecting only numeric types.")
            raise ValueError("Numeric expression data became empty before ssGSEA.")

        ssgsea_progeny_obj = gp.ssgsea(
            data=expression_data_for_ssgsea,
            gene_sets=progeny_gene_sets_for_gseapy,
            scale=True,    
            min_size=5,    # PROGENy pathways are small, ensure min_size is appropriate
            permutation_num=0, 
            outdir=None,   
            verbose=True 
        )
        
        print("Extracting and processing ssGSEA results...")
        ssgsea_progeny_results_df = ssgsea_progeny_obj.res2d.copy()
        ssgsea_progeny_results_df['ES'] = pd.to_numeric(ssgsea_progeny_results_df['ES'], errors='coerce')
        ssgsea_progeny_results_df.dropna(subset=['ES'], inplace=True)

        if not ssgsea_progeny_results_df.empty:
            pathway_activities_df = ssgsea_progeny_results_df.pivot_table(
                index='Name', columns='Term', values='ES'
            )
            print(f"  Pivoted ssGSEA results. Shape: {pathway_activities_df.shape}")
            
            if 'valid_samples' in locals() and isinstance(valid_samples, pd.DataFrame) and 'sample_type' in valid_samples.columns:
                pathway_activities_merged = pd.merge(
                    valid_samples[['sample_type']], 
                    pathway_activities_df,
                    left_index=True, 
                    right_index=True,
                    how='inner' 
                )
                print(f"  Merged with sample type information. Shape: {pathway_activities_merged.shape}")

                if not pathway_activities_merged.empty:
                    pathway_summary_data = []
                    for pathway in pathway_activities_merged.columns.drop('sample_type'):
                        tumor_activity = pathway_activities_merged[pathway_activities_merged['sample_type'] == 'Primary Tumor'][pathway].dropna()
                        normal_activity = pathway_activities_merged[pathway_activities_merged['sample_type'] == 'Solid Tissue Normal'][pathway].dropna()

                        mean_tumor_act = tumor_activity.mean()
                        mean_normal_act = normal_activity.mean()
                        diff_act = mean_tumor_act - mean_normal_act
                        
                        p_value_act = np.nan
                        if len(tumor_activity) >= 2 and len(normal_activity) >= 2:
                            stat_act, p_value_act = ttest_ind(tumor_activity, normal_activity, equal_var=False, nan_policy='omit')
                        
                        pathway_summary_data.append({
                            'PROGENy_Pathway': pathway, 
                            'Mean_Tumor_Activity_ES': mean_tumor_act,
                            'Mean_Normal_Activity_ES': mean_normal_act,
                            'Activity_Difference_ES': diff_act,
                            'P_value_Activity': p_value_act
                        })
                    
                    Table2_PathwayActivitySummary = pd.DataFrame(pathway_summary_data)
                    if Table2_PathwayActivitySummary.empty and not pathway_activities_df.empty:
                         print("⚠️ WARNING: Table2_PathwayActivitySummary is empty after summarization, but pivot table had data. Check summarization logic or if enough tumor/normal samples exist for all pathways.")
                else:
                    print("❌ ERROR: Merged pathway activities DataFrame is empty. Check 'valid_samples' and sample IDs in expression data (index matching).")
            else:
                print("❌ ERROR: 'valid_samples' DataFrame with 'sample_type' column not found or empty. Cannot summarize PROGENy results.")
        else:
            print("❌ ERROR: PROGENy ssGSEA result (res2d) was empty or became empty after processing ES scores. This likely means no gene sets passed the min_size filter after intersection with your (now uppercased) expression data's genes.")
            
    except ValueError as ve: 
        print(f"❌ ValueError during ssGSEA for PROGENy: {ve}")
    except Exception as e:
        print(f"❌ An unexpected ERROR occurred during ssGSEA for PROGENy: {type(e).__name__} - {e}")
else:
    print("--- ssGSEA for PROGENy skipped due to missing prerequisites. ---")

# --- Display Final Table2 ---
if 'Table2_PathwayActivitySummary' in locals() and not Table2_PathwayActivitySummary.empty:
    print("\n--- Table2_PathwayActivitySummary (PROGENy pathways via gseapy ssGSEA using decoupler gene sets) ---")
    try:
        display(Table2_PathwayActivitySummary.sort_values(by='P_value_Activity'))
    except NameError:
        print(Table2_PathwayActivitySummary.sort_values(by='P_value_Activity').to_string())
else:
    print("\n--- Table2_PathwayActivitySummary is empty or was not generated. ---")


--- Loading PROGENy Gene Sets via decoupler ---


2025-06-03 08:49:31,759 [INFO] Parsing data files for ssGSEA...........................


✅ PROGENy model loaded from decoupler. Shape: (1399, 4)
✅ Successfully converted PROGENy model to gseapy format. Found 14 pathways.

--- Preparing Expression Data ---
✅ Gene identifiers in expression_linear_for_gseapy index converted to UPPERCASE.

Running gseapy.ssgsea for PROGENy pathway activity...


2025-06-03 08:49:33,229 [INFO] 0000 gene_sets have been filtered out when max_size=500 and min_size=5
2025-06-03 08:49:33,231 [INFO] 0014 gene_sets used for further statistical testing.....
2025-06-03 08:49:33,232 [INFO] Start to run ssGSEA...Might take a while................


Extracting and processing ssGSEA results...
  Pivoted ssGSEA results. Shape: (326, 14)
  Merged with sample type information. Shape: (326, 15)

--- Table2_PathwayActivitySummary (PROGENy pathways via gseapy ssGSEA using decoupler gene sets) ---


Unnamed: 0,PROGENy_Pathway,Mean_Tumor_Activity_ES,Mean_Normal_Activity_ES,Activity_Difference_ES,P_value_Activity
12,VEGF,-1297.894026,-514.229959,-783.664066,7.555829e-42
0,ANDROGEN,1827.109852,2544.896194,-717.786342,1.602557e-27
5,MAPK,5261.357795,4669.906196,591.4516,1.2769e-22
2,ESTROGEN,2790.723288,3058.468945,-267.745657,2.7734650000000004e-17
10,TNFA,3345.756582,2825.223854,520.532729,3.812655e-09
3,HYPOXIA,4812.095643,4612.989069,199.106574,1.526125e-07
1,EGFR,3588.274751,3342.534745,245.740006,1.636409e-07
6,NFKB,2732.019591,2348.27974,383.739852,1.082017e-05
9,TGFB,1786.627922,2143.214684,-356.586762,5.640181e-05
7,P53,1577.603415,1450.610282,126.993134,6.180324e-05


<h2>Table 2: PROGENy Pathway Activity Summary (via ssGSEA)</h2>

<p>✅ Single-sample GSEA (ssGSEA) was performed on TCGA-COAD expression data using <strong>PROGENy pathway gene sets</strong> derived via <code>decoupler</code> and executed through <code>gseapy</code>. This provides an interpretable summary of pathway-level dysregulation between tumor and normal tissues.</p>

<hr>

<h3>🧬 Analytical Steps</h3>
<ol>
  <li><strong>Expression → Pathway Activity</strong><br>
    Per-sample activity scores for 14 curated signaling pathways (e.g., MAPK, VEGF, ESTROGEN) were calculated from the gene expression matrix (<code>expression_linear_for_gseapy</code>) using PROGENy-derived weights.
  </li>
  <li><strong>Tumor vs Normal Comparison</strong><br>
    Mean pathway activities were compared between <strong>Primary Tumor</strong> and <strong>Solid Tissue Normal</strong> samples using Welch’s t-test. The difference in activity (Δ) and associated p-values were computed.
  </li>
</ol>

<hr>

<h3>📊 Summary of Results</h3>

<table border="1" cellpadding="6" cellspacing="0">
  <thead>
    <tr>
      <th>PROGENy Pathway</th>
      <th>Mean Tumor Activity</th>
      <th>Mean Normal Activity</th>
      <th>Δ Activity (Tumor - Normal)</th>
      <th>p-value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>VEGF</strong></td>
      <td>-1297.89</td>
      <td>-514.23</td>
      <td><strong>-784</strong></td>
      <td>7.5×10⁻⁴² ✅</td>
    </tr>
    <tr>
      <td><strong>ANDROGEN</strong></td>
      <td>1827.11</td>
      <td>2544.90</td>
      <td><strong>-718</strong></td>
      <td>1.6×10⁻²⁷ ✅</td>
    </tr>
    <tr>
      <td><strong>MAPK</strong></td>
      <td>5261.36</td>
      <td>4669.91</td>
      <td><strong>+591</strong></td>
      <td>1.3×10⁻²² ✅</td>
    </tr>
    <tr>
      <td><strong>ESTROGEN</strong></td>
      <td>2790.72</td>
      <td>3058.47</td>
      <td><strong>-268</strong></td>
      <td>2.8×10⁻¹⁷ ✅</td>
    </tr>
  </tbody>
</table>

<hr>

<h3>🔍 Biological Interpretation</h3>
<ul>
  <li><strong>VEGF activity is significantly reduced</strong> in tumor tissue. Despite VEGF's role in angiogenesis, this downregulation may reflect compensatory feedback mechanisms or post-transcriptional regulation.</li>
  <li><strong>MAPK pathway activity is significantly increased</strong>, consistent with its well-characterized role in colorectal cancer via activation of the RAS/RAF/MEK/ERK cascade.</li>
  <li><strong>ANDROGEN and ESTROGEN signaling pathways are suppressed</strong> in tumor samples, potentially indicating reduced hormone sensitivity or pathway silencing in colorectal malignancies.</li>
</ul>


In [60]:
import pandas as pd
import numpy as np
import gseapy as gp
import decoupler as dc # For DoRothEA
from scipy.stats import ttest_ind # For summarization

# --- Prerequisites ---
# This script assumes the following DataFrames are ALREADY LOADED and defined
# in your Python environment/notebook:
#
# 1. expression_linear_for_gseapy:
#    Your gene expression data.
#    Format: pandas DataFrame, genes as index (rows), samples as columns.
#    Values should be on a linear scale.
#
# 2. valid_samples:
#    Your sample information DataFrame.
#    Format: pandas DataFrame, TCGA barcodes (or your sample IDs) as index.
#    MUST contain a column named 'sample_type' with values like
#    'Primary Tumor' and 'Solid Tissue Normal'.

# Example placeholders if they are not already defined (replace with your actual data loading)
if 'expression_linear_for_gseapy' not in locals() or not isinstance(expression_linear_for_gseapy, pd.DataFrame) or expression_linear_for_gseapy.empty:
    print("⚠️ 'expression_linear_for_gseapy' not defined or empty. Please load/generate it for ssGSEA.")
    expression_linear_for_gseapy = pd.DataFrame(
        np.random.rand(500, 50), 
        columns=[f'Sample{i+1}' for i in range(50)], 
        index=[f'Gene{j+1}' for j in range(500)]
    )
if 'valid_samples' not in locals() or not isinstance(valid_samples, pd.DataFrame) or valid_samples.empty:
    print("⚠️ 'valid_samples' not defined or empty. Please load/generate it for summarization.")
    sample_ids_demo = expression_linear_for_gseapy.columns
    sample_types_demo = ['Primary Tumor'] * (len(sample_ids_demo) // 2) + ['Solid Tissue Normal'] * (len(sample_ids_demo) - (len(sample_ids_demo) // 2))
    valid_samples = pd.DataFrame({'sample_type': sample_types_demo}, index=sample_ids_demo)
# --- End Prerequisites ---


# --- Step 1: Load DoRothEA Regulons and Prepare Gene Sets ---
print("\n--- Loading DoRothEA Regulons and Preparing TF Gene Sets ---")
tf_gene_sets_dict = {} # This will be the input for gseapy.ssgsea

try:
    # Load DoRothEA regulons using decoupler
    print("Loading DoRothEA regulons via decoupler...")
    dorothea_regulon_df = dc.op.dorothea(organism='human', levels=['A', 'B', 'C', 'D']) # Using A-D levels
    dorothea_regulon_df['target'] = dorothea_regulon_df['target'].astype(str).str.strip().str.upper()
    dorothea_regulon_df['source'] = dorothea_regulon_df['source'].astype(str).str.strip().str.upper() # TF names
    print("✅ DoRothEA regulons loaded.")

    # Convert DoRothEA regulons to a dictionary format suitable for gseapy
    for tf_symbol, group_df in dorothea_regulon_df.groupby('source'): # 'source' is the TF
        tf_gene_sets_dict[tf_symbol] = group_df['target'].unique().tolist()
    
    if not tf_gene_sets_dict:
        print("❌ ERROR: DoRothEA TF gene sets dictionary is empty after processing decoupler model.")
    else:
        print(f"✅ Prepared {len(tf_gene_sets_dict)} TF-target gene sets for gseapy.")
        # print("   Example TF:", list(tf_gene_sets_dict.keys())[0])
        # print("   Example genes for this TF:", tf_gene_sets_dict[list(tf_gene_sets_dict.keys())[0]][:5])
        
except Exception as e:
    print(f"❌ ERROR loading or processing DoRothEA regulons: {e}")
    tf_gene_sets_dict = None # Ensure it's None if loading failed


# --- Step 2: Run ssGSEA for DoRothEA TF Activity ---
Table3_TFActivitySummary = pd.DataFrame() # Initialize

prerequisites_ok_for_tf = True
if not ('expression_linear_for_gseapy' in locals() and isinstance(expression_linear_for_gseapy, pd.DataFrame) and not expression_linear_for_gseapy.empty):
    print("❌ ERROR: 'expression_linear_for_gseapy' DataFrame is not defined or empty for TF ssGSEA.")
    prerequisites_ok_for_tf = False
if not tf_gene_sets_dict: 
    print("❌ ERROR: DoRothEA TF gene sets dictionary not ready. Cannot run ssGSEA for Table3.")
    prerequisites_ok_for_tf = False

if prerequisites_ok_for_tf:
    print("\nRunning gseapy.ssgsea for DoRothEA TF activity...")
    try:
        expression_data_for_ssgsea_tf = expression_linear_for_gseapy.select_dtypes(include=[np.number])
        if expression_data_for_ssgsea_tf.empty and not expression_linear_for_gseapy.empty:
            print("❌ ERROR: Expression data is empty after selecting only numeric types for TF ssGSEA.")
            raise ValueError("Numeric expression data became empty before TF ssGSEA.")

        ssgsea_tf_obj = gp.ssgsea(
            data=expression_data_for_ssgsea_tf,
            gene_sets=tf_gene_sets_dict,
            scale=True,    
            min_size=5,  # Adjust if needed, DoRothEA sets can be small
            permutation_num=0, 
            outdir=None,   
            verbose=True 
        )
        
        print("Extracting and processing ssGSEA TF results...")
        ssgsea_tf_results_df = ssgsea_tf_obj.res2d.copy()
        ssgsea_tf_results_df['ES'] = pd.to_numeric(ssgsea_tf_results_df['ES'], errors='coerce')
        ssgsea_tf_results_df.dropna(subset=['ES'], inplace=True)

        if not ssgsea_tf_results_df.empty:
            tf_activities_df = ssgsea_tf_results_df.pivot_table(
                index='Name', columns='Term', values='ES' # 'Term' is the TF
            )
            print(f"  Pivoted TF ssGSEA results. Shape: {tf_activities_df.shape}")
            
            if 'valid_samples' in locals() and isinstance(valid_samples, pd.DataFrame) and 'sample_type' in valid_samples.columns:
                tf_activities_merged = pd.merge(
                    valid_samples[['sample_type']], 
                    tf_activities_df,
                    left_index=True, 
                    right_index=True,
                    how='inner'
                )
                print(f"  Merged TF activities with sample type. Shape: {tf_activities_merged.shape}")

                if not tf_activities_merged.empty:
                    tf_summary_data = []
                    for tf_symbol in tf_activities_merged.columns.drop('sample_type'):
                        tumor_activity = tf_activities_merged[tf_activities_merged['sample_type'] == 'Primary Tumor'][tf_symbol].dropna()
                        normal_activity = tf_activities_merged[tf_activities_merged['sample_type'] == 'Solid Tissue Normal'][tf_symbol].dropna()

                        mean_tumor_act = tumor_activity.mean()
                        mean_normal_act = normal_activity.mean()
                        diff_act = mean_tumor_act - mean_normal_act
                        
                        p_value_act = np.nan
                        if len(tumor_activity) >= 2 and len(normal_activity) >= 2:
                            stat_act, p_value_act = ttest_ind(tumor_activity, normal_activity, equal_var=False, nan_policy='omit')
                        
                        tf_summary_data.append({
                            'TF_Symbol': tf_symbol, # Changed for consistency
                            'Mean_Tumor_Activity_ES': mean_tumor_act,
                            'Mean_Normal_Activity_ES': mean_normal_act,
                            'Activity_Difference_ES': diff_act,
                            'P_value_Activity': p_value_act
                        })
                    
                    Table3_TFActivitySummary = pd.DataFrame(tf_summary_data)
                    if Table3_TFActivitySummary.empty and not tf_activities_df.empty:
                         print("⚠️ WARNING: Table3_TFActivitySummary is empty after summarization, but pivot table had data. Check summarization logic or if enough tumor/normal samples exist for all TFs.")
                else:
                    print("❌ ERROR: Merged TF activities DataFrame is empty. Check 'valid_samples' and sample IDs in expression data.")
            else:
                print("❌ ERROR: 'valid_samples' DataFrame with 'sample_type' column not found or empty. Cannot summarize DoRothEA TF results.")
        else:
            print("❌ ERROR: DoRothEA TF ssGSEA result (res2d) was empty or became empty after processing ES scores. This often means no TF gene sets passed the min_size filter after intersection with your expression data's genes.")
            
    except ValueError as ve: 
        print(f"❌ ValueError during ssGSEA for DoRothEA TFs: {ve}")
    except Exception as e:
        print(f"❌ An unexpected ERROR occurred during ssGSEA for DoRothEA TFs: {e}")
else:
    print("--- ssGSEA for DoRothEA TFs skipped due to missing prerequisites. ---")

# --- Display Final Table3 ---
if 'Table3_TFActivitySummary' in locals() and not Table3_TFActivitySummary.empty:
    print("\n--- Table3_TFActivitySummary (DoRothEA TFs via gseapy ssGSEA) ---")
    try:
        display(Table3_TFActivitySummary.sort_values(by='P_value_Activity').head(20)) # Show top 20
    except NameError:
        print(Table3_TFActivitySummary.sort_values(by='P_value_Activity').head(20).to_string())
else:
    print("\n--- Table3_TFActivitySummary is empty or was not generated. ---")


--- Loading DoRothEA Regulons and Preparing TF Gene Sets ---
Loading DoRothEA regulons via decoupler...


2025-06-03 08:50:02,866 [INFO] Parsing data files for ssGSEA...........................


✅ DoRothEA regulons loaded.
✅ Prepared 643 TF-target gene sets for gseapy.

Running gseapy.ssgsea for DoRothEA TF activity...


2025-06-03 08:50:04,249 [INFO] 0278 gene_sets have been filtered out when max_size=500 and min_size=5
2025-06-03 08:50:04,251 [INFO] 0365 gene_sets used for further statistical testing.....
2025-06-03 08:50:04,251 [INFO] Start to run ssGSEA...Might take a while................


Extracting and processing ssGSEA TF results...
  Pivoted TF ssGSEA results. Shape: (326, 365)
  Merged TF activities with sample type. Shape: (326, 366)

--- Table3_TFActivitySummary (DoRothEA TFs via gseapy ssGSEA) ---


Unnamed: 0,TF_Symbol,Mean_Tumor_Activity_ES,Mean_Normal_Activity_ES,Activity_Difference_ES,P_value_Activity
39,DDIT3,5222.38491,4194.323749,1028.061162,4.014668e-47
193,NFYC,3773.449692,4945.080887,-1171.631195,9.239093e-47
38,DBP,-861.551538,2351.565261,-3213.116799,1.024184e-43
77,FOXO3,4901.636959,5703.574978,-801.938018,3.6541820000000005e-43
205,NR2C2,5070.05623,4728.052735,342.003496,9.884424e-36
273,SNAPC4,3661.611659,3446.525858,215.085801,1.199403e-34
337,ZFX,6224.654385,5911.303062,313.351323,1.439863e-32
219,PAX2,3057.537123,5468.161283,-2410.62416,3.000919e-32
312,TFCP2,1638.601774,555.728509,1082.873265,1.070533e-31
212,NR4A2,-293.892378,-3703.639412,3409.747034,1.3815099999999999e-30


<h2>Table 3: DoRothEA Transcription Factor Activity Summary (via ssGSEA)</h2>

<p>✅ Transcription factor (TF) activity estimation was performed using <strong>single-sample GSEA (ssGSEA)</strong> on TCGA-COAD gene expression data. The <strong>DoRothEA regulons</strong>, which define TF–target relationships, were loaded via <code>decoupler</code> and processed through <code>gseapy</code> to infer regulatory activity patterns.</p>

<hr>

<h3>⚙️ Analytical Workflow</h3>
<ol>
  <li><strong>Gene Expression → TF Activity</strong><br>
    A matrix of gene expression values was used to compute enrichment scores (ES) for 365 transcription factors based on their downstream targets as defined by the DoRothEA knowledgebase (confidence levels A–D).
  </li>
  <li><strong>Tumor vs Normal Comparison</strong><br>
    Mean TF activity scores were computed separately for <strong>Primary Tumor</strong> and <strong>Solid Tissue Normal</strong> samples. Activity differences were quantified using Welch’s t-test.
  </li>
</ol>

<hr>

<h3>📊 Summary of Top Differentially Active TFs</h3>

<table border="1" cellpadding="6" cellspacing="0">
  <thead>
    <tr>
      <th>TF Symbol</th>
      <th>Mean Tumor Activity</th>
      <th>Mean Normal Activity</th>
      <th>Δ Activity (Tumor - Normal)</th>
      <th>p-value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>DDIT3</strong></td>
      <td>5222.38</td>
      <td>4194.32</td>
      <td><strong>+1028</strong></td>
      <td>4.0×10⁻⁴⁷ ✅</td>
    </tr>
    <tr>
      <td><strong>NFYC</strong></td>
      <td>3773.45</td>
      <td>4945.08</td>
      <td><strong>-1172</strong></td>
      <td>9.2×10⁻⁴⁷ ✅</td>
    </tr>
    <tr>
      <td><strong>DBP</strong></td>
      <td>-861.55</td>
      <td>2351.57</td>
      <td><strong>-3213</strong></td>
      <td>1.0×10⁻⁴³ ✅</td>
    </tr>
    <tr>
      <td><strong>FOXO3</strong></td>
      <td>4901.64</td>
      <td>5703.57</td>
      <td><strong>-802</strong></td>
      <td>3.7×10⁻⁴³ ✅</td>
    </tr>
    <tr>
      <td><strong>NR2C2</strong></td>
      <td>5070.06</td>
      <td>4728.05</td>
      <td><strong>+342</strong></td>
      <td>9.9×10⁻³⁶ ✅</td>
    </tr>
    <tr>
      <td><strong>SNAPC4</strong></td>
      <td>3661.61</td>
      <td>3446.53</td>
      <td><strong>+215</strong></td>
      <td>1.2×10⁻³⁴ ✅</td>
    </tr>
  </tbody>
</table>

<hr>

<h3>🔬 Biological Observations</h3>
<ul>
  <li><strong>DDIT3</strong> (a stress-responsive transcription factor) is strongly upregulated in tumors, possibly reflecting ER stress or unfolded protein response in colorectal cancer cells.</li>
  <li><strong>NFYC</strong> and <strong>FOXO3</strong> activities are significantly reduced, suggesting loss of transcriptional control mechanisms tied to cell cycle regulation and apoptosis.</li>
  <li><strong>DBP</strong>, a circadian and metabolic regulator, shows a strong decrease in tumor samples, indicating possible circadian deregulation.</li>
  <li><strong>NR2C2</strong> and <strong>SNAPC4</strong> show moderate increases in tumor tissues, which may reflect changes in nuclear receptor signaling and snRNA biogenesis, respectively.</li>
</ul>


In [61]:
import decoupler as dc
import pandas as pd

# --- Step 0: Load and Prepare DoRothEA Regulon for VIPER ---
print("Loading DoRothEA regulons for VIPER...")

# Initialize to prevent errors if loading fails
dorothea_regulon_viper = pd.DataFrame()
dorothea_regulon_raw = pd.DataFrame()

try:
    # Load DoRothEA regulons using decoupler.op.dorothea
    # Select appropriate confidence levels (e.g., A, B, C)
    dorothea_regulon_raw = dc.op.dorothea(organism='human', levels=['A', 'B', 'C']) # FIX: Use dc.op.dorothea

    if not dorothea_regulon_raw.empty:
        # VIPER in decoupler often expects columns: 'source', 'target', 'mor' (mode of regulation)
        # The 'weight' column from get_dorothea contains the interaction sign (1 or -1)
        # Rename 'weight' to 'mor'
        if 'weight' in dorothea_regulon_raw.columns:
            dorothea_regulon_viper = dorothea_regulon_raw.rename(columns={'weight': 'mor'})
            print("✅ DoRothEA regulon loaded and 'weight' column renamed to 'mor'.")
        # If it's already 'mor' (less likely for dc.op.dorothea direct output but good to check)
        elif 'mor' in dorothea_regulon_raw.columns:
             dorothea_regulon_viper = dorothea_regulon_raw.copy()
             print("✅ DoRothEA regulon loaded with existing 'mor' column.")
        else:
            print("❌ ERROR: DoRothEA regulon loaded from dc.op.dorothea does not contain a 'weight' or 'mor' column.")
            # dc.op.dorothea should return 'source', 'target', 'weight'. If 'weight' is missing, there's an issue.

        if not dorothea_regulon_viper.empty:
            print(f"Shape of dorothea_regulon_viper: {dorothea_regulon_viper.shape}")
            # Ensure TF names ('source') and target gene symbols ('target') are uppercase
            dorothea_regulon_viper['source'] = dorothea_regulon_viper['source'].astype(str).str.upper()
            dorothea_regulon_viper['target'] = dorothea_regulon_viper['target'].astype(str).str.upper()
            print("   'source' (TFs) and 'target' (genes) converted to uppercase.")
            # display(dorothea_regulon_viper.head())
        else:
            print("   ⚠️ dorothea_regulon_viper is empty after attempting to prepare columns.")
    else:
        print("❌ ERROR: dc.op.dorothea returned an empty DataFrame.")

except AttributeError as ae:
    print(f"❌ AttributeError: {ae}. This might mean 'dc.op.dorothea' is not available. Check your decoupler version or installation.")
except Exception as e:
    print(f"❌ An unexpected error occurred while loading DoRothEA: {e}")


# --- Your existing script assumptions ---
# (Ensure expression_matrix and valid_samples are defined as in your previous attempts)

if 'expression_linear_for_decoupler' in locals():
    expression_matrix = expression_linear_for_decoupler.copy()
    expression_matrix.index = expression_matrix.index.astype(str).str.upper()
    print(f"\nUsing 'expression_linear_for_decoupler' as expression_matrix. Shape: {expression_matrix.shape}")
else:
    print("\n❌ ERROR: 'expression_linear_for_decoupler' (for expression_matrix) not found. Please define it.")
    expression_matrix = pd.DataFrame() # Avoid NameError later

if 'valid_samples' not in locals():
    print("❌ ERROR: 'valid_samples' DataFrame not found. Please define it.")
    valid_samples = pd.DataFrame() # Avoid NameError later
else:
    print(f"Using 'valid_samples'. Shape: {valid_samples.shape}")


# 1) Prune your regulon against the genes in your data
if not dorothea_regulon_viper.empty and not expression_matrix.empty:
    print("\nPruning regulon...")
    pruned_net = dc.pp.net.prune(
        features=expression_matrix.index,
        net=dorothea_regulon_viper,
        tmin=1,
        verbose=True
    )
    print(f"✅ Pruned network created. Shape: {pruned_net.shape}")

    # 2) Run VIPER
    print("\nRunning VIPER...")
    try:
        viper_scores_df, viper_pvals_df = dc.mt.viper(
            expression_matrix.T,
            pruned_net,
            verbose=True
        )
        print(f"✅ VIPER scores computed. Shape: {viper_scores_df.shape}")

        # 3) Merge with sample types
        if not valid_samples.empty:
            viper_annot = valid_samples[['sample_type']].join(viper_scores_df, how='inner')
            print(f"✅ VIPER scores merged with sample types. Shape: {viper_annot.shape}")

            # 4) Summarize tumor vs. normal for each TF
            print("\nSummarizing TF activities...")
            summary = []
            for tf in viper_scores_df.columns:
                tum = viper_annot.loc[viper_annot['sample_type'] == "Primary Tumor", tf].dropna()
                nor = viper_annot.loc[viper_annot['sample_type'] == "Solid Tissue Normal", tf].dropna()
                m_t, m_n = tum.mean(), nor.mean()
                p_val = np.nan
                if len(tum) >= 2 and len(nor) >= 2:
                    _, p_val = ttest_ind(tum, nor, equal_var=False, nan_policy='omit')
                summary.append({
                    'TF_Symbol': tf,
                    'Mean_Tumor_Activity': m_t,
                    'Mean_Normal_Activity': m_n,
                    'Activity_Difference': m_t - m_n if pd.notna(m_t) and pd.notna(m_n) else np.nan,
                    'P_value': p_val
                })

            Table3_TFActivitySummary = pd.DataFrame(summary).sort_values('P_value')
            print("✅ TF activity summarization complete.")
            print("\n--- Table3_TFActivitySummary (DoRothEA + VIPER) ---")
            display(Table3_TFActivitySummary.head())
        else:
            print("❌ Could not merge VIPER scores: 'valid_samples' is missing or empty.")
    except Exception as e:
        print(f"❌ An error occurred during VIPER or summarization: {e}")
else:
    print("\n⚠️ Pruning or VIPER run skipped due to missing 'dorothea_regulon_viper' or 'expression_matrix'.")

Loading DoRothEA regulons for VIPER...


net['weight'] = 1
2025-06-03 08:51:14 | [INFO] viper - Running viper
2025-06-03 08:51:14 | [INFO] Extracted omics mat with 326 rows (observations) and 20501 columns (features)


✅ DoRothEA regulon loaded and 'weight' column renamed to 'mor'.
Shape of dorothea_regulon_viper: (32275, 4)
   'source' (TFs) and 'target' (genes) converted to uppercase.

Using 'expression_linear_for_decoupler' as expression_matrix. Shape: (20501, 326)
Using 'valid_samples'. Shape: (326, 20502)

Pruning regulon...
✅ Pruned network created. Shape: (30054, 3)

Running VIPER...


2025-06-03 08:51:14 | [INFO] Network adjacency matrix has 8327 unique features and 298 unique sources
2025-06-03 08:51:14 | [INFO] viper - calculating 298 scores across 326 observations
2025-06-03 08:51:16 | [INFO] viper - refining scores based on pleiotropy


  0%|          | 0/326 [00:00<?, ?it/s]

  pval1 = np.log10(tmp[:, 1]) - np.log10(tmp[:, 0])
2025-06-03 09:03:21 | [INFO] viper - adjusting p-values by FDR
2025-06-03 09:03:21 | [INFO] viper - done


✅ VIPER scores computed. Shape: (326, 298)
✅ VIPER scores merged with sample types. Shape: (326, 299)

Summarizing TF activities...
✅ TF activity summarization complete.

--- Table3_TFActivitySummary (DoRothEA + VIPER) ---


Unnamed: 0,TF_Symbol,Mean_Tumor_Activity,Mean_Normal_Activity,Activity_Difference,P_value
49,ELK1,8.623249,7.906978,0.716271,1.4727729999999998e-50
74,FOXO3,3.697429,5.890467,-2.193038,8.198847e-49
83,GATA6,3.689836,6.017266,-2.32743,2.653629e-46
1,AR,7.871972,9.982428,-2.110456,3.470692e-46
252,TCF7L1,1.803922,0.85084,0.953082,3.762908e-46


<h2>Table 3: DoRothEA Transcription Factor Activity (VIPER)</h2>

<p>✅ Transcription factor (TF) activity was inferred using <strong>VIPER</strong> (Virtual Inference of Protein activity by Enriched Regulon analysis) based on the <strong>DoRothEA regulons</strong>. This method estimates regulator activity directly from expression data by considering the strength and direction of TF-target interactions.</p>

<hr>

<h3>⚙️ Analytical Workflow</h3>
<ol>
  <li><strong>Regulon Preparation</strong><br>
    A curated regulatory network was loaded using <code>decoupler.op.dorothea</code> with confidence levels A–C, and formatted to match VIPER’s expected input structure (<code>source</code>, <code>target</code>, <code>mor</code>).
  </li>
  <li><strong>Network Pruning</strong><br>
    The regulon was filtered to retain only interactions relevant to genes present in the expression matrix.
  </li>
  <li><strong>TF Activity Inference</strong><br>
    VIPER computed enrichment scores for 298 transcription factors across 326 samples using a model of probabilistic pleiotropy correction.
  </li>
  <li><strong>Summarization</strong><br>
    Mean TF activity in <strong>Primary Tumor</strong> versus <strong>Solid Tissue Normal</strong> samples was compared using Welch’s t-test.
  </li>
</ol>

<hr>

<h3>📊 Top Differentially Active TFs (VIPER-based)</h3>

<table border="1" cellpadding="6" cellspacing="0">
  <thead>
    <tr>
      <th>TF Symbol</th>
      <th>Mean Tumor Activity</th>
      <th>Mean Normal Activity</th>
      <th>Δ Activity (Tumor - Normal)</th>
      <th>p-value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>ELK1</strong></td>
      <td>8.62</td>
      <td>7.91</td>
      <td><strong>+0.72</strong></td>
      <td>1.5×10⁻⁵⁰ ✅</td>
    </tr>
    <tr>
      <td><strong>FOXO3</strong></td>
      <td>3.70</td>
      <td>5.89</td>
      <td><strong>-2.19</strong></td>
      <td>8.2×10⁻⁴⁹ ✅</td>
    </tr>
    <tr>
      <td><strong>GATA6</strong></td>
      <td>3.69</td>
      <td>6.02</td>
      <td><strong>-2.33</strong></td>
      <td>2.7×10⁻⁴⁶ ✅</td>
    </tr>
    <tr>
      <td><strong>AR</strong></td>
      <td>7.87</td>
      <td>9.98</td>
      <td><strong>-2.11</strong></td>
      <td>3.5×10⁻⁴⁶ ✅</td>
    </tr>
    <tr>
      <td><strong>TCF7L1</strong></td>
      <td>1.80</td>
      <td>0.85</td>
      <td><strong>+0.95</strong></td>
      <td>3.8×10⁻⁴⁶ ✅</td>
    </tr>
  </tbody>
</table>

<hr>

<h3>🔬 Biological Observations</h3>
<ul>
  <li><strong>ELK1</strong> is significantly activated in tumor samples. This ETS-domain TF is known to integrate MAPK signaling and promote tumorigenesis.</li>
  <li><strong>FOXO3</strong> is strongly suppressed, consistent with reduced pro-apoptotic signaling in colorectal cancer.</li>
  <li><strong>GATA6</strong> and <strong>AR</strong> activities are downregulated, suggesting loss of differentiation and hormone responsiveness.</li>
  <li><strong>TCF7L1</strong> shows increased activity, potentially reinforcing Wnt pathway-driven transcription, a hallmark of colorectal tumorigenesis.</li>
</ul>


In [246]:
# --- 0) Prerequisites (Ensure these are loaded from previous cells) ---
# Make sure 'expression_linear_for_decoupler' and 'valid_samples' exist.

if 'expression_linear_for_decoupler' in locals() and isinstance(expression_linear_for_decoupler, pd.DataFrame) and not expression_linear_for_decoupler.empty:
    expression_matrix = expression_linear_for_decoupler.copy()
    expression_matrix.index = expression_matrix.index.astype(str).str.upper()
    print(f"✅ Using 'expression_linear_for_decoupler' as 'expression_matrix'. Shape: {expression_matrix.shape}")
else:
    print("❌ ERROR: 'expression_linear_for_decoupler' not found or empty. Please run the cell that creates it first.")
    # Fallback to dummy for script structure demonstration - remove in actual run
    expression_matrix = pd.DataFrame(np.random.rand(100,10), columns=[f'Sample{i+1}' for i in range(10)], index=[f'GENE{j+1}' for j in range(100)])
    expression_matrix.index = expression_matrix.index.str.upper()
    print("    ⚠️ Created a DUMMY expression_matrix for script execution.")

if 'valid_samples' not in locals() or not isinstance(valid_samples, pd.DataFrame) or valid_samples.empty:
    print("❌ ERROR: 'valid_samples' not found or empty. Please run the cell that creates it first.")
    # Fallback to dummy - remove in actual run
    valid_samples = pd.DataFrame({'sample_type': ['Primary Tumor']*5 + ['Solid Tissue Normal']*5}, index=expression_matrix.columns)
    print("    ⚠️ Created a DUMMY valid_samples for script execution.")
# --- End Prerequisites ---


# --- 1) Load & prepare TRRUST regulon ---
print("\n--- Loading & preparing TRRUST regulon ---")
trrust = (
    pd.read_csv(
        r"trrust_rawdata.human.tsv",
        sep="\t", header=None,
        names=["source","target","effect","pmid"],
        comment="#"
    )
    .assign(mor=lambda df_assign:
            df_assign["effect"]
            .map({"Activation":1, "Repression":-1})
            .fillna(0)
            .astype(int)
    )
    .loc[:, ["source","target","mor"]]
)
trrust["source"] = trrust["source"].astype(str).str.upper()
trrust["target"] = trrust["target"].astype(str).str.upper()
trrust = trrust.drop_duplicates(subset=["source","target"])
print(f"✅ TRRUST loaded and prepared: {len(trrust):,d} edges → "
      f"{trrust['source'].nunique():,d} TFs, {trrust['target'].nunique():,d} targets. Weight column is 'mor'.")

# --- 2) Prune regulon to your measured genes ---
print("\n--- Pruning TRRUST regulon ---")
genes_in_data = set(expression_matrix.index)
pruned_net = dc.pp.net.prune(
    features=genes_in_data,
    net=trrust,
    # weight_col='mor', # <-- REMOVE THIS LINE. 'prune' does not take this argument.
    tmin=1,
    verbose=True
)
print(f"✅ Pruned regulon: {pruned_net['source'].nunique():,d} TFs retained.")

# --- 3) Build adjacency matrix (genes × TFs) - This step is NOT needed for dc.mt.viper ---
# (Skipped as before)

# --- 4) Call the VIPER function from decoupler.mt ---
print("\n⏳ Running VIPER with TRRUST (using dc.mt.viper)...")

# Initialize in case of error
Table3_TFActivitySummary = pd.DataFrame()
scores_df = pd.DataFrame() # Ensure scores_df is initialized

if not pruned_net.empty and not expression_matrix.empty:
    try:
        # dc.mt.viper will find the 'mor' column in pruned_net by default
        raw_viper_scores_df, raw_viper_pvals_df = dc.mt.viper(
            data=expression_matrix.T,
            net=pruned_net,
            # min_n=1, # This was correctly removed in the previous step
            verbose=True
        )
        scores_df = raw_viper_scores_df
        print(f"✅ VIPER scores computed: {scores_df.shape[0]} samples × {scores_df.shape[1]} TFs")

    except AttributeError as ae:
        print(f"❌ AttributeError running VIPER: {ae}. This might mean 'dc.mt.viper' is not found or an input is wrong.")
    except Exception as e:
        print(f"❌ An unexpected error occurred during VIPER: {type(e).__name__} - {e}")
else:
    print("⚠️ VIPER run skipped due to empty pruned_net or expression_matrix.")


# --- 5) Summarize into Table3_TFActivitySummary ---
if not scores_df.empty and not valid_samples.empty:
    print("\n--- Summarizing VIPER scores into Table3_TFActivitySummary ---")
    merged = valid_samples[["sample_type"]].join(scores_df, how="inner")

    out = []
    for tf_col_name in scores_df.columns:
        tum = merged.loc[merged["sample_type"]=="Primary Tumor", tf_col_name].dropna()
        nor = merged.loc[merged["sample_type"]=="Solid Tissue Normal", tf_col_name].dropna()
        
        mu, mn = tum.mean(), nor.mean()
        diff = mu - mn if pd.notna(mu) and pd.notna(mn) else np.nan
        
        pval = np.nan
        if len(tum)>=2 and len(nor)>=2:
            _, pval = ttest_ind(tum, nor, equal_var=False, nan_policy="omit")
        
        out.append({
            "TF_Symbol": tf_col_name,
            "Mean_Tumor_Activity": mu,
            "Mean_Normal_Activity": mn,
            "Activity_Difference": diff,
            "P_value": pval
        })

    Table3_TFActivitySummary = pd.DataFrame(out)
    if not Table3_TFActivitySummary.empty:
        Table3_TFActivitySummary = Table3_TFActivitySummary.sort_values("P_value")
    print("✅ Summarization complete.")
else:
    print("⚠️ Summarization skipped as VIPER scores or valid_samples are empty.")


print("\n--- Table3_TFActivitySummary (VIPER + TRRUST) ---")
if not Table3_TFActivitySummary.empty:
    try:
        display(Table3_TFActivitySummary.head(10))
    except NameError: # In case display is not defined (e.g. non-notebook environment)
        print(Table3_TFActivitySummary.head(10).to_string())
else:
    print("Table3_TFActivitySummary is empty.")

# Set the metric for the final integration script
tf_activity_metric_to_use = 'Activity_Difference'
print(f"\nSet 'tf_activity_metric_to_use' to '{tf_activity_metric_to_use}' for the final integration script.")

net['weight'] = 1
2025-06-03 13:28:02 | [INFO] viper - Running viper
2025-06-03 13:28:02 | [INFO] Extracted omics mat with 326 rows (observations) and 20501 columns (features)


✅ Using 'expression_linear_for_decoupler' as 'expression_matrix'. Shape: (20501, 326)

--- Loading & preparing TRRUST regulon ---
✅ TRRUST loaded and prepared: 8,427 edges → 795 TFs, 2,492 targets. Weight column is 'mor'.

--- Pruning TRRUST regulon ---
✅ Pruned regulon: 791 TFs retained.

⏳ Running VIPER with TRRUST (using dc.mt.viper)...


2025-06-03 13:28:02 | [INFO] Network adjacency matrix has 2291 unique features and 321 unique sources
2025-06-03 13:28:02 | [INFO] viper - calculating 321 scores across 326 observations
2025-06-03 13:28:03 | [INFO] viper - refining scores based on pleiotropy


  0%|          | 0/326 [00:00<?, ?it/s]

2025-06-03 13:29:28 | [INFO] viper - adjusting p-values by FDR
2025-06-03 13:29:28 | [INFO] viper - done


✅ VIPER scores computed: 326 samples × 321 TFs

--- Summarizing VIPER scores into Table3_TFActivitySummary ---
✅ Summarization complete.

--- Table3_TFActivitySummary (VIPER + TRRUST) ---


Unnamed: 0,TF_Symbol,Mean_Tumor_Activity,Mean_Normal_Activity,Activity_Difference,P_value
28,CEBPZ,2.349178,1.242655,1.106524,2.877345e-58
53,EPAS1,1.601178,0.368878,1.2323,5.2023559999999995e-42
319,ZNF24,3.358676,2.130537,1.22814,8.857749000000001e-40
261,SOX6,3.825503,3.109188,0.716314,1.445234e-36
149,MBD2,0.886892,0.301037,0.585856,3.215969e-36
14,ATF4,2.736682,1.282507,1.454175,1.331589e-35
163,MYBL2,3.296137,2.015262,1.280876,3.201993e-35
164,MYC,7.449066,5.250232,2.198834,6.050754e-35
205,PAX3,1.894633,1.276004,0.618629,6.232878999999999e-34
38,CTNNB1,3.781227,2.934444,0.846784,1.836822e-33



Set 'tf_activity_metric_to_use' to 'Activity_Difference' for the final integration script.


<h2>Table 3: Transcription Factor Activity Inference (VIPER + TRRUST)</h2>

<p>✅ Transcription factor (TF) activity was inferred from gene expression using <strong>VIPER</strong> with the <strong>TRRUST v2 human regulon</strong>. This approach evaluates regulator influence over their downstream targets, converting static transcript abundance into dynamic regulatory activity scores.</p>

<hr>

<h3>⚙️ Workflow Summary</h3>
<ol>
  <li><strong>Regulon Load & Format</strong><br>
    The TRRUST database was loaded and formatted to match VIPER input (<code>source</code>, <code>target</code>, <code>mor</code>), indicating activation (+1) or repression (-1).
  </li>
  <li><strong>Network Pruning</strong><br>
    The regulon was filtered to include only TF-target interactions supported by genes in the expression dataset.
  </li>
  <li><strong>VIPER Inference</strong><br>
    Regulatory influence scores were computed for each transcription factor across all samples.
  </li>
  <li><strong>Tumor vs Normal Comparison</strong><br>
    Welch’s t-test was used to assess differences in TF activity between <strong>Primary Tumor</strong> and <strong>Solid Tissue Normal</strong> samples.
  </li>
</ol>

<hr>

<h3>📊 Top Differentially Active TFs (TRRUST + VIPER)</h3>

<table border="1" cellpadding="6" cellspacing="0">
  <thead>
    <tr>
      <th>TF Symbol</th>
      <th>Mean Tumor Activity</th>
      <th>Mean Normal Activity</th>
      <th>Δ Activity (Tumor - Normal)</th>
      <th>p-value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>CEBPZ</strong></td>
      <td>2.35</td>
      <td>1.24</td>
      <td><strong>+1.11</strong></td>
      <td>2.9×10⁻⁵⁸ ✅</td>
    </tr>
    <tr>
      <td><strong>EPAS1</strong></td>
      <td>1.60</td>
      <td>0.37</td>
      <td><strong>+1.23</strong></td>
      <td>5.2×10⁻⁴² ✅</td>
    </tr>
    <tr>
      <td><strong>ZNF24</strong></td>
      <td>3.36</td>
      <td>2.13</td>
      <td><strong>+1.23</strong></td>
      <td>8.9×10⁻⁴⁰ ✅</td>
    </tr>
    <tr>
      <td><strong>SOX6</strong></td>
      <td>3.83</td>
      <td>3.11</td>
      <td><strong>+0.72</strong></td>
      <td>1.4×10⁻³⁶ ✅</td>
    </tr>
    <tr>
      <td><strong>MBD2</strong></td>
      <td>0.89</td>
      <td>0.30</td>
      <td><strong>+0.59</strong></td>
      <td>3.2×10⁻³⁶ ✅</td>
    </tr>
    <tr>
      <td><strong>ATF4</strong></td>
      <td>2.74</td>
      <td>1.28</td>
      <td><strong>+1.45</strong></td>
      <td>1.3×10⁻³⁵ ✅</td>
    </tr>
    <tr>
      <td><strong>MYBL2</strong></td>
      <td>3.30</td>
      <td>2.02</td>
      <td><strong>+1.28</strong></td>
      <td>3.2×10⁻³⁵ ✅</td>
    </tr>
  </tbody>
</table>

<hr>

<h3>🔬 Biological Interpretation</h3>
<ul>
  <li><strong>CEBPZ</strong> is highly activated in tumor samples, supporting its emerging role in chromatin remodeling and cell cycle regulation in cancer.</li>
  <li><strong>EPAS1</strong> (also known as HIF-2α) shows increased activity, consistent with hypoxic signaling in the tumor microenvironment.</li>
  <li><strong>ATF4</strong> and <strong>MYBL2</strong> are associated with stress responses and cell proliferation, respectively, both relevant to tumor biology.</li>
</ul>

<p><em>Metric used for differential ranking: <code>Activity_Difference</code> (Tumor − Normal)</em></p>


In [93]:
display(Table1_GeneCentric_final)

Unnamed: 0,kegg_enzyme,kegg_reactions,gene_list,human_gene_ids,kegg_pathways,associated_compound_names,Gene_Symbol,Mean_Tumor_Expr_log2,Mean_Normal_Expr_log2,Log2FC_GeneExpr,P_value_GeneExpr
0,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...","['sufl:FIL70_21435', 'aus:IPK37_09295', 'ofa:N...",['hsa:622'],"['hsa00650', 'hsa01100']",['D-Erythronolactone'],BDH1,11.435665,12.463213,-1.027547,1.224089e-11
1,4.1.2.4,"['R02750', 'R02749', 'R01066']","['svb:NCTC12167_00820', 'sfp:QUY26_14155', 'ah...",['hsa:51071'],"['hsa00030', 'hsa01100']",['Deoxyribose 5-phosphate'],DERA,10.709742,11.434309,-0.724568,2.791374e-09
2,2.7.1.15,"['R02750', 'R02749', 'R01066']","['shf:CEQ32_12405', 'baml:BAM5036_3234', 'rsu:...",['hsa:64080'],"['hsa00030', 'hsa01100']",['Deoxyribose 5-phosphate'],RBKS,7.193134,8.619459,-1.426326,9.665781e-15
3,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017...","['plop:125349936', 'puc:125911786', 'bbis:1050...",['hsa:316'],"['hsa00280', 'hsa00350', 'hsa00380', 'hsa00750...",['Quinoline-4-carboxylic acid'],AOX1,5.818059,8.536311,-2.718253,3.681753e-23
4,1.4.3.3,"['R00366', 'R00372', 'R00635', 'R01340', 'R017...","['sbro:GQF42_37510', 'amyd:K1T34_31600', 'mamb...",['hsa:1610'],"['hsa00260', 'hsa00311', 'hsa00330', 'hsa00470...",['Quinoline-4-carboxylic acid'],DAO,0.891153,6.657734,-5.766581,3.764261e-22
...,...,...,...,...,...,...,...,...,...,...,...
100,3.1.3.5,"['R10235', 'R03531', 'R12958']","['zne:110830860', 'pgig:120600358', 'mrr:Moror...","['hsa:93034', 'hsa:284958', 'hsa:100526794', '...","['hsa00230', 'hsa00240', 'hsa00760', 'hsa01100...","[""2'-Deoxyinosine-5'-monophosphate""]",NT5C1A,0.182462,0.282560,-0.100098,2.158014e-01
101,3.1.3.5,"['R10235', 'R03531', 'R12958']","['zne:110830860', 'pgig:120600358', 'mrr:Moror...","['hsa:93034', 'hsa:284958', 'hsa:100526794', '...","['hsa00230', 'hsa00240', 'hsa00760', 'hsa01100...","[""2'-Deoxyinosine-5'-monophosphate""]",NT5M,5.658360,6.001406,-0.343046,5.356342e-03
102,3.1.3.5,"['R10235', 'R03531', 'R12958']","['zne:110830860', 'pgig:120600358', 'mrr:Moror...","['hsa:93034', 'hsa:284958', 'hsa:100526794', '...","['hsa00230', 'hsa00240', 'hsa00760', 'hsa01100...","[""2'-Deoxyinosine-5'-monophosphate""]",NT5C2,11.697450,11.829017,-0.131567,1.171201e-01
103,3.1.3.5,"['R10235', 'R03531', 'R12958']","['zne:110830860', 'pgig:120600358', 'mrr:Moror...","['hsa:93034', 'hsa:284958', 'hsa:100526794', '...","['hsa00230', 'hsa00240', 'hsa00760', 'hsa01100...","[""2'-Deoxyinosine-5'-monophosphate""]",NT5C,10.344593,10.562209,-0.217616,1.798923e-02


In [94]:
import pandas as pd
import ast # For safely evaluating string representations of lists
import numpy as np # For a complete script context, though not directly used in this snippet

# --- This script assumes 'Table1_GeneCentric_final' is already defined ---
# It should have columns like 'associated_compound_names', 'kegg_pathways', 
# 'kegg_enzyme', 'kegg_reactions', 'Gene_Symbol', 'Mean_Tumor_Expr_log2', etc.

if 'Table1_GeneCentric_final' not in locals() or Table1_GeneCentric_final.empty:
    print("❌ ERROR: 'Table1_GeneCentric_final' is not defined or is empty.")
    print("          Please ensure the cell that creates 'Table1_GeneCentric_final' has been run successfully.")
else:
    print("Processing 'Table1_GeneCentric_final' to explode 'kegg_pathways'...")
    
    # Make a copy to work on, to preserve the original if needed
    df_for_pathway_explode = Table1_GeneCentric_final.copy()

    # Step 1: Ensure the 'kegg_pathways' column contains actual lists of strings
    # This is important if 'kegg_pathways' might have been read as a string
    # representation of a list, e.g., "['hsa00010', 'hsa00020']"
    
    def clean_and_ensure_list(item):
        if isinstance(item, str):
            try:
                # Try to evaluate string as a Python literal (e.g., "['path1', 'path2']")
                evaluated_item = ast.literal_eval(item)
                if isinstance(evaluated_item, list):
                    return [str(s).strip() for s in evaluated_item] # Ensure elements are strings
                else:
                    return [str(evaluated_item).strip()] # If it evaluates to a single item not in a list
            except (ValueError, SyntaxError):
                # If it's not a list-like string, treat it as a single pathway string in a list
                # or handle it based on expected format (e.g. if it's already path1;path2)
                # For now, assuming if it's a string not evaluating to list, it's a single item
                return [item.strip()] 
        elif isinstance(item, list):
            return [str(s).strip() for s in item] # Already a list, ensure elements are strings
        elif pd.isna(item):
            return [] # Return empty list for NaN values to avoid errors
        else: # For other types, try to convert to string and put in a list
            return [str(item).strip()]

    if 'kegg_pathways' in df_for_pathway_explode.columns:
        print("Cleaning 'kegg_pathways' column to ensure it contains lists of strings...")
        df_for_pathway_explode['kegg_pathways_list'] = df_for_pathway_explode['kegg_pathways'].apply(clean_and_ensure_list)
        
        # Step 2: Explode the 'kegg_pathways_list' column
        print("Exploding 'kegg_pathways_list' column...")
        df_pathways_exploded = df_for_pathway_explode.explode('kegg_pathways_list')
        
        # Step 3: Rename the exploded column to 'Pathway Name'
        df_pathways_exploded.rename(columns={'kegg_pathways_list': 'Pathway Name'}, inplace=True)
        
        # Also rename 'associated_compound_names' to 'Metabolite' if it exists
        if 'associated_compound_names' in df_pathways_exploded.columns:
            df_pathways_exploded.rename(columns={'associated_compound_names': 'Metabolite'}, inplace=True)
        
        # This DataFrame is now further refined and ready for the final assembly
        Table1_Gene_Pathway_Centric = df_pathways_exploded.copy()

        # Drop the original 'kegg_pathways' if 'kegg_pathways_list' was successfully made and exploded from it
        if 'kegg_pathways' in Table1_Gene_Pathway_Centric.columns and 'Pathway Name' in Table1_Gene_Pathway_Centric.columns :
            Table1_Gene_Pathway_Centric = Table1_Gene_Pathway_Centric.drop(columns=['kegg_pathways'])

        print("\n✅ 'kegg_pathways' exploded and renamed to 'Pathway Name'.")
        print("--- Head of the further exploded table (Table1_Gene_Pathway_Centric) ---")
        
        # Define relevant columns to display
        # Ensure 'Metabolite' and 'Pathway Name' are now the correct column names
        display_cols_exploded = ['Metabolite', 'kegg_enzyme', 'Gene_Symbol', 'Pathway Name', 
                                 'Mean_Tumor_Expr_log2', 'Mean_Normal_Expr_log2', 
                                 'Log2FC_GeneExpr', 'P_value_GeneExpr', 'kegg_reactions']
        
        existing_display_cols_exploded = [col for col in display_cols_exploded if col in Table1_Gene_Pathway_Centric.columns]
        display(Table1_Gene_Pathway_Centric[existing_display_cols_exploded].head())

    else:
        print("❌ ERROR: 'kegg_pathways' column not found in Table1_GeneCentric_final.")

Processing 'Table1_GeneCentric_final' to explode 'kegg_pathways'...
Cleaning 'kegg_pathways' column to ensure it contains lists of strings...
Exploding 'kegg_pathways_list' column...

✅ 'kegg_pathways' exploded and renamed to 'Pathway Name'.
--- Head of the further exploded table (Table1_Gene_Pathway_Centric) ---


Unnamed: 0,Metabolite,kegg_enzyme,Gene_Symbol,Pathway Name,Mean_Tumor_Expr_log2,Mean_Normal_Expr_log2,Log2FC_GeneExpr,P_value_GeneExpr,kegg_reactions
0,['D-Erythronolactone'],1.1.1.30,BDH1,hsa00650,11.435665,12.463213,-1.027547,1.224089e-11,"['R00145', 'R00146', 'R00717', 'R01088', 'R013..."
0,['D-Erythronolactone'],1.1.1.30,BDH1,hsa01100,11.435665,12.463213,-1.027547,1.224089e-11,"['R00145', 'R00146', 'R00717', 'R01088', 'R013..."
1,['Deoxyribose 5-phosphate'],4.1.2.4,DERA,hsa00030,10.709742,11.434309,-0.724568,2.791374e-09,"['R02750', 'R02749', 'R01066']"
1,['Deoxyribose 5-phosphate'],4.1.2.4,DERA,hsa01100,10.709742,11.434309,-0.724568,2.791374e-09,"['R02750', 'R02749', 'R01066']"
2,['Deoxyribose 5-phosphate'],2.7.1.15,RBKS,hsa00030,7.193134,8.619459,-1.426326,9.665781e-15,"['R02750', 'R02749', 'R01066']"


In [29]:
display(Table1_Gene_Pathway_Centric)

Unnamed: 0,kegg_enzyme,kegg_reactions,gene_list,human_gene_ids,Metabolite,Gene_Symbol,Mean_Tumor_Expr_log2,Mean_Normal_Expr_log2,Log2FC_GeneExpr,P_value_GeneExpr,Pathway Name
0,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...","['sufl:FIL70_21435', 'aus:IPK37_09295', 'ofa:N...",['hsa:622'],['D-Erythronolactone'],BDH1,11.435665,12.463213,-1.027547,1.224089e-11,hsa00650
0,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...","['sufl:FIL70_21435', 'aus:IPK37_09295', 'ofa:N...",['hsa:622'],['D-Erythronolactone'],BDH1,11.435665,12.463213,-1.027547,1.224089e-11,hsa01100
1,4.1.2.4,"['R02750', 'R02749', 'R01066']","['svb:NCTC12167_00820', 'sfp:QUY26_14155', 'ah...",['hsa:51071'],['Deoxyribose 5-phosphate'],DERA,10.709742,11.434309,-0.724568,2.791374e-09,hsa00030
1,4.1.2.4,"['R02750', 'R02749', 'R01066']","['svb:NCTC12167_00820', 'sfp:QUY26_14155', 'ah...",['hsa:51071'],['Deoxyribose 5-phosphate'],DERA,10.709742,11.434309,-0.724568,2.791374e-09,hsa01100
2,2.7.1.15,"['R02750', 'R02749', 'R01066']","['shf:CEQ32_12405', 'baml:BAM5036_3234', 'rsu:...",['hsa:64080'],['Deoxyribose 5-phosphate'],RBKS,7.193134,8.619459,-1.426326,9.665781e-15,hsa00030
...,...,...,...,...,...,...,...,...,...,...,...
103,3.1.3.5,"['R10235', 'R03531', 'R12958']","['zne:110830860', 'pgig:120600358', 'mrr:Moror...","['hsa:93034', 'hsa:284958', 'hsa:100526794', '...","[""2'-Deoxyinosine-5'-monophosphate""]",NT5C,10.344593,10.562209,-0.217616,1.798923e-02,hsa00760
103,3.1.3.5,"['R10235', 'R03531', 'R12958']","['zne:110830860', 'pgig:120600358', 'mrr:Moror...","['hsa:93034', 'hsa:284958', 'hsa:100526794', '...","[""2'-Deoxyinosine-5'-monophosphate""]",NT5C,10.344593,10.562209,-0.217616,1.798923e-02,hsa01100
103,3.1.3.5,"['R10235', 'R03531', 'R12958']","['zne:110830860', 'pgig:120600358', 'mrr:Moror...","['hsa:93034', 'hsa:284958', 'hsa:100526794', '...","[""2'-Deoxyinosine-5'-monophosphate""]",NT5C,10.344593,10.562209,-0.217616,1.798923e-02,hsa01110
104,3.6.1.64,"['R10235', 'R03531', 'R12958']","['oda:120857024', 'cin:100175348', 'amj:109283...",['hsa:131870'],"[""2'-Deoxyinosine-5'-monophosphate""]",NUDT16,10.605166,11.392866,-0.787699,2.511496e-13,hsa00230


In [30]:
# 1) Extract unique pathway names (dropping any NaNs)
unique_pathways = Table1_Gene_Pathway_Centric['Pathway Name'] \
    .dropna() \
    .unique()

# 2) Print them out
print("Unique Pathway Names:")
for pw in unique_pathways:
    print(pw)


Unique Pathway Names:
hsa00650
hsa01100
hsa00030
hsa00280
hsa00350
hsa00380
hsa00750
hsa00760
hsa00830
hsa00982
hsa01120
hsa00260
hsa00311
hsa00330
hsa00470
hsa01110
hsa00270
hsa00670
hsa00480
hsa00220
hsa00590
hsa00591
hsa00564
hsa00565
hsa00592
hsa00521
hsa00562
hsa00730
hsa00790
hsa00240
hsa00061
hsa00780
hsa00140
hsa00984
hsa00040
hsa00053
hsa00860
hsa00980
hsa00983
hsa00561
hsa00310
hsa00360
hsa00950
hsa00965
hsa00230


In [31]:
# 1) Get the unique gene symbols (dropping any NaNs)
unique_genes = Table1_Gene_Pathway_Centric['Gene_Symbol'].dropna().unique()

# 2) Print them out
print("Unique Gene Symbols:")
for gene in unique_genes:
    print(gene)


Unique Gene Symbols:
BDH1
DERA
RBKS
AOX1
DAO
CBS
GGCT
OTC
ALOX15
PLA2G3
PLA2G2E
PLA2G2F
PLA2G5
PLA2G12B
PLA2G10
PLA2G1B
PLA2G2A
PLA2G12A
PLA2G2C
PLA2G2D
IMPA2
IMPA1
SMS
SRM
MTAP
ALPG
ALPP
ALPI
ALPL
DCK
NT5C3A
NT5C3B
LPCAT2
LPCAT1
LCAT
OLAH
BTD
CYP19A1
AKR1C8
AKR1C3
UGT2A2
UGT1A3
UGT1A5
UGT2B7
UGT1A1
UGT1A4
UGT2B15
UGT2B28
UGT2B4
UGT1A9
UGT2B17
UGT1A6
UGT1A7
UGT1A10
UGT2B10
UGT2A3
UGT2B11
UGT1A8
UGT2A1
HSD17B1
GK2
GK
GK3
AASS
DDC
ITPA
NT5C1B
NT5DC4
NT5C1B-RDH14
NT5C1A
NT5M
NT5C2
NT5C
NUDT16


In [32]:
import decoupler as dc
import pandas as pd
import re # For consistency, though not directly used here for critical logic

print("Loading DoRothEA regulons using dc.op.dorothea...")
try:
    # CORRECTED: Removed 'E' from the levels list
    dorothea_df = dc.op.dorothea(organism='human', levels=['A', 'B', 'C', 'D'])
    print("✅ DoRothEA regulons loaded successfully.")
    print(f"Shape of dorothea_df: {dorothea_df.shape}")
    print("Head of dorothea_df:")
    print(dorothea_df.head())

    # --- The rest of your original script ---

    # 2) Suppose unique_genes is your list of all genes
    #    For this example, let's create a dummy unique_genes list.
    #    In your actual code, replace this with your real unique_genes list.
    #    unique_genes = list(your_dataframe['Gene_Symbol'].dropna().unique())
    #    Example:
    if not dorothea_df.empty:
         # Get some example genes from the loaded regulon for demonstration
        example_targets = dorothea_df['target'].unique()[:1000]
        example_sources = dorothea_df['source'].unique()[:10]
        unique_genes = list(set(list(example_targets) + list(example_sources)))
        print(f"\nUsing an example unique_genes list with {len(unique_genes)} genes.")
    else:
        print("\nCould not create example unique_genes list as dorothea_df is empty.")
        unique_genes = []

    if unique_genes:
        # 3) Filter for mapped genes
        # Ensure 'target' and 'source' columns in dorothea_df and unique_genes are uppercase for consistent matching
        dorothea_df['target'] = dorothea_df['target'].astype(str).str.upper()
        dorothea_df['source'] = dorothea_df['source'].astype(str).str.upper() # TFs are in 'source'
        unique_genes_upper = {str(gene).upper() for gene in unique_genes}

        # Genes in DoRothEA are the 'target' genes regulated by 'source' TFs
        mapped_targets = set(dorothea_df['target']) & unique_genes_upper
        # Also consider if any of your unique_genes are TFs themselves (which are in 'source')
        mapped_tfs_as_targets_in_unique = set(dorothea_df['source']) & unique_genes_upper
        
        mapped = mapped_targets | mapped_tfs_as_targets_in_unique

        # 4) Compute unmapped
        unmapped = unique_genes_upper - mapped

        # 5) Print results
        print(f"\nMapped genes:   {len(mapped)}")
        print(f"Unmapped genes: {len(unmapped)}\n")

        if unmapped:
            print("Sample of Unmapped genes (up to 20):")
            for i, g in enumerate(sorted(list(unmapped))):
                if i < 20:
                    print(" -", g)
                else:
                    print(f" ... and {len(unmapped) - 20} more.")
                    break
        else:
            print("All genes were mapped!")
    else:
        print("Skipping mapping as unique_genes list is empty.")

except Exception as e:
    print(f"❌ An error occurred: {e}")

Loading DoRothEA regulons using dc.op.dorothea...
✅ DoRothEA regulons loaded successfully.
Shape of dorothea_df: (276731, 4)
Head of dorothea_df:
  source  target  weight confidence
0    MYC    TERT     1.0          A
1   ZEB1      AR    -1.0          A
2  STAT1  IGFBP3     1.0          A
3   ETS1     CD4     1.0          A
4  FOXA1    RPRM    -1.0          A

Using an example unique_genes list with 1003 genes.

Mapped genes:   1003
Unmapped genes: 0

All genes were mapped!


In [33]:
# 1) Get your real gene list
unique_genes_real = (
    Table1_Gene_Pathway_Centric['Gene_Symbol']
    .dropna()
    .str.upper()
    .unique()
)

# 2) Slice & copy the Dorothea df
filtered_dorothea = dorothea_df.loc[
    dorothea_df['target'].isin(unique_genes_real)
].copy()

# 3) Now it’s safe to add a column
filtered_dorothea['abs_weight'] = filtered_dorothea['weight'].abs()

# 4) Pick the top TF per gene by absolute weight
top_tf_per_gene = (
    filtered_dorothea
    .sort_values('abs_weight', ascending=False)
    .groupby('target')['source']
    .first()
    .to_dict()
)

# 5) Preview
print("✅ Auto-generated Gene → TF map (first 10):")
for gene, tf in list(top_tf_per_gene.items())[:10]:
    print(f"{gene} → {tf}")


✅ Auto-generated Gene → TF map (first 10):
AASS → CEBPA
AKR1C3 → AR
AKR1C8 → CEBPA
ALOX15 → STAT6
ALPG → AR
ALPI → FOXO1
ALPL → SP3
ALPP → CTCF
AOX1 → STAT1
BDH1 → E2F1


In [34]:
import pandas as pd
# Ensure decoupler is imported if you need to reload dorothea_df
# import decoupler as dc

# PREREQUISITE:
# Ensure 'Table1_Gene_Pathway_Centric' is loaded and contains your gene symbols.
# Example:
# if 'Table1_Gene_Pathway_Centric' not in locals():
#     print("ERROR: Table1_Gene_Pathway_Centric is not loaded!")
#     # You would load it here, e.g.,
#     # Table1_Gene_Pathway_Centric = pd.read_csv('path_to_your_table1.csv')

# PREREQUISITE:
# Ensure 'dorothea_df' (DoRothEA regulons) is loaded.
# This DataFrame is typically loaded using decoupler:
# if 'dorothea_df' not in locals():
#     print("Loading DoRothEA regulons via decoupler...")
#     dorothea_df = dc.op.dorothea(organism='human', levels=['A', 'B', 'C', 'D'])
#     dorothea_df['target'] = dorothea_df['target'].astype(str).str.upper()
#     dorothea_df['source'] = dorothea_df['source'].astype(str).str.upper() # TF names
#     print("✅ DoRothEA regulons loaded for TF mapping.")


print("--- Defining 'your_gene_to_tf_map' automatically ---")

# 1) Get your real gene list from Table1_Gene_Pathway_Centric
#    Adjust 'Gene_Symbol' if your column name is different (e.g., 'Gene')
if 'Gene_Symbol' in Table1_Gene_Pathway_Centric.columns:
    gene_column_name = 'Gene_Symbol'
elif 'Gene' in Table1_Gene_Pathway_Centric.columns:
    gene_column_name = 'Gene'
else:
    print("❌ ERROR: Could not find a 'Gene_Symbol' or 'Gene' column in Table1_Gene_Pathway_Centric.")
    print("         Please ensure your gene column is named correctly.")
    top_tf_per_gene = {} # Set to empty to avoid further errors

if gene_column_name in Table1_Gene_Pathway_Centric.columns:
    unique_genes_real = (
        Table1_Gene_Pathway_Centric[gene_column_name]
        .dropna()
        .str.upper()
        .unique()
    )
    print(f"Found {len(unique_genes_real)} unique genes in Table1 to map TFs for.")

    # 2) Slice & copy the Dorothea df based on your genes
    #    This assumes 'dorothea_df' is already loaded and processed (target/source are upper case)
    if 'dorothea_df' in locals() or 'dorothea_df' in globals():
        filtered_dorothea = dorothea_df.loc[
            dorothea_df['target'].isin(unique_genes_real)
        ].copy()

        # 3) Add absolute weight column
        if not filtered_dorothea.empty:
            filtered_dorothea['abs_weight'] = filtered_dorothea['weight'].abs()

            # 4) Pick the top TF per gene by absolute weight
            top_tf_per_gene = (
                filtered_dorothea
                .sort_values('abs_weight', ascending=False)
                .groupby('target')['source']  # 'target' is the gene, 'source' is the TF
                .first()
                .to_dict()
            )
        else:
            print("⚠️ No target genes from Table1 were found in the loaded DoRothEA regulons. 'top_tf_per_gene' will be empty.")
            top_tf_per_gene = {}
    else:
        print("❌ ERROR: 'dorothea_df' (DoRothEA regulons) is not loaded. Cannot generate TF map.")
        top_tf_per_gene = {} # Ensure it's defined if dorothea_df is missing

# 5) Assign the auto-generated map to 'your_gene_to_tf_map'
your_gene_to_tf_map = top_tf_per_gene

print(f"\n✅ 'your_gene_to_tf_map' has been defined with {len(your_gene_to_tf_map)} gene-TF pairings.")
print("   This map was generated by finding the TF with the strongest DoRothEA weight for each gene.")

print("\nPreview of 'your_gene_to_tf_map' (first 5 entries):")
if your_gene_to_tf_map:
    for i, (gene, tf) in enumerate(your_gene_to_tf_map.items()):
        if i < 5:
            print(f"  '{gene}' → '{tf}'")
        else:
            break
else:
    print("  The 'your_gene_to_tf_map' is empty.")

--- Defining 'your_gene_to_tf_map' automatically ---
Found 74 unique genes in Table1 to map TFs for.

✅ 'your_gene_to_tf_map' has been defined with 70 gene-TF pairings.
   This map was generated by finding the TF with the strongest DoRothEA weight for each gene.

Preview of 'your_gene_to_tf_map' (first 5 entries):
  'AASS' → 'CEBPA'
  'AKR1C3' → 'AR'
  'AKR1C8' → 'CEBPA'
  'ALOX15' → 'STAT6'
  'ALPG' → 'AR'


In [244]:
# Assume 'your_gene_to_tf_map' already exists and is populated by 'top_tf_per_gene'
# Assume Table3_TFActivitySummary is loaded to verify TF presence (optional here, but good practice)

print("--- Manually updating 'your_gene_to_tf_map' based on diagnostic suggestions ---")

# For LPCAT1: Choose STAT5A (example, as it's in Table3 from your diagnostic)
your_gene_to_tf_map['LPCAT1'.upper()] = 'STAT5A' 
print("Updated LPCAT1 mapping to STAT5A.")

# For UGT1A1: Choose HNF1A (example, as it's in Table3 from your diagnostic)
your_gene_to_tf_map['UGT1A1'.upper()] = 'HNF1A'
print("Updated UGT1A1 mapping to HNF1A.")

# For ABHD14A-ACY1: Map to TGIF2 (as it was a regulator of ACY1 and in Table3)
# Ensure 'TGIF2' is in the correct case as it appears in Table3_TFActivitySummary['TF_Symbol']
your_gene_to_tf_map['ABHD14A-ACY1'.upper()] = 'TGIF2' 
print("Updated ABHD14A-ACY1 mapping to TGIF2.")

# For NT5C1B-RDH14: The script output said it mapped it to GATA2 (from NT5C1B)
# and GATA2 has a score. So this should already be done by the script if it ran that part.
# If not, and 'NT5C1B' maps to 'GATA2' in your_gene_to_tf_map:
if 'NT5C1B' in your_gene_to_tf_map and your_gene_to_tf_map['NT5C1B'] == 'GATA2': # Assuming GATA2 is the TF for NT5C1B
    your_gene_to_tf_map['NT5C1B-RDH14'.upper()] = 'GATA2'
    print("Ensured NT5C1B-RDH14 is mapped to GATA2.")
else:
    # If NT5C1B mapped to something else (like CEBPA which had no score), 
    # you might need to check NT5C1B alternatives from DoRothEA that are in Table3
    # For now, let's assume the previous script already set it based on NT5C1B's mapping from top_tf_per_gene.
    # If NT5C1B's mapping was 'CEBPA' (no score), NT5C1B-RDH14 would also map to 'CEBPA'.
    if 'NT5C1B' in your_gene_to_tf_map:
        print(f"NT5C1B-RDH14 will be mapped to {your_gene_to_tf_map['NT5C1B']} (TF for NT5C1B). Check if this TF has a score.")
        your_gene_to_tf_map['NT5C1B-RDH14'.upper()] = your_gene_to_tf_map['NT5C1B']


# For UGT2A2 and UGT1A5: DoRothEA had no targets for them.
# So, they will remain unmapped unless you find TFs from other databases AND those TFs are in Table3.
# For now, we make no changes for them based on DoRothEA.
print("UGT2A2 and UGT1A5 remain unmapped by DoRothEA data as no targets were found.")

print("\n--- 'your_gene_to_tf_map' has been manually updated. ---")
print("Preview of updated/relevant entries:")
genes_to_show_final_map = ['LPCAT1', 'UGT1A1', 'ABHD14A-ACY1', 'NT5C1B-RDH14', 'UGT2A2', 'UGT1A5']
for gene_key in genes_to_show_final_map:
    gene_key_upper = gene_key.upper()
    if gene_key_upper in your_gene_to_tf_map:
        print(f"  '{gene_key_upper}' → '{your_gene_to_tf_map[gene_key_upper]}'")
    else:
        print(f"  '{gene_key_upper}' is not in your_gene_to_tf_map.")

--- Manually updating 'your_gene_to_tf_map' based on diagnostic suggestions ---
Updated LPCAT1 mapping to STAT5A.
Updated UGT1A1 mapping to HNF1A.
Updated ABHD14A-ACY1 mapping to TGIF2.
Ensured NT5C1B-RDH14 is mapped to GATA2.
UGT2A2 and UGT1A5 remain unmapped by DoRothEA data as no targets were found.

--- 'your_gene_to_tf_map' has been manually updated. ---
Preview of updated/relevant entries:
  'LPCAT1' → 'STAT5A'
  'UGT1A1' → 'HNF1A'
  'ABHD14A-ACY1' → 'TGIF2'
  'NT5C1B-RDH14' → 'GATA2'
  'UGT2A2' is not in your_gene_to_tf_map.
  'UGT1A5' is not in your_gene_to_tf_map.


<h2>Gene → Transcription Factor Map (DoRothEA-weighted)</h2>

<p>✅ For each gene in the curated CRC-metabolite-centric list (<code>Table1_Gene_Pathway_Centric</code>), a corresponding transcription factor (TF) was assigned using <strong>DoRothEA confidence-weighted regulatory data</strong>.</p>

<hr>

<h3>🔁 Mapping Procedure</h3>
<ol>
  <li><strong>Input Genes:</strong> Extracted from the <code>Gene_Symbol</code> column of <code>Table1_Gene_Pathway_Centric</code> (n = 74 unique genes).</li>
  <li><strong>TF-Gene Matching:</strong> The DoRothEA regulon (levels A–D) was filtered for entries whose <code>target</code> matched any input gene symbol.</li>
  <li><strong>TF Selection:</strong> For each gene, the transcription factor with the highest absolute regulatory weight (|weight|) was selected.</li>
  <li><strong>Output:</strong> A one-to-one mapping of gene → TF based on strongest interaction confidence.</li>
</ol>

<hr>

<h3>📌 Summary</h3>
<ul>
  <li>Total input genes in Table 1: <strong>74</strong></li>
  <li>Genes successfully mapped to a TF via DoRothEA: <strong>70</strong></li>
  <li>Genes not mapped: <strong>4</strong> (likely due to no known TFs in DoRothEA for those targets)</li>
</ul>

<hr>

<h3>🔍 Preview of Mappings (First 10)</h3>
<table border="1" cellpadding="6" cellspacing="0">
  <thead>
    <tr>
      <th>Gene Symbol</th>
      <th>Top TF (DoRothEA)</th>
    </tr>
  </thead>
  <tbody>
    <tr><td><strong>AASS</strong></td><td>CEBPA</td></tr>
    <tr><td><strong>AKR1C3</strong></td><td>AR</td></tr>
    <tr><td><strong>AKR1C8</strong></td><td>CEBPA</td></tr>
    <tr><td><strong>ALOX15</strong></td><td>STAT6</td></tr>
    <tr><td><strong>ALPG</strong></td><td>AR</td></tr>
    <tr><td><strong>ALPI</strong></td><td>FOXO1</td></tr>
    <tr><td><strong>ALPL</strong></td><td>SP3</td></tr>
    <tr><td><strong>ALPP</strong></td><td>CTCF</td></tr>
    <tr><td><strong>AOX1</strong></td><td>STAT1</td></tr>
    <tr><td><strong>BDH1</strong></td><td>E2F1</td></tr>
  </tbody>
</table>

<hr>

<h3>⚠️ Considerations</h3>
<ul>
  <li>DoRothEA mappings are derived from literature-curated TF-target relationships with signed confidence scores.</li>
  <li>Genes with no match may either lack annotated regulators or may be regulated post-transcriptionally (e.g., via metabolites or RNA-binding proteins).</li>
</ul>


In [47]:
# Lists of pathway IDs from your two “Unique Pathway Names” outputs
list1 = [
    "hsa00650", "hsa01100", "hsa00030", "hsa00280", "hsa00350", "hsa00380", "hsa00750",
    "hsa00760", "hsa00830", "hsa00982", "hsa01120", "hsa00260", "hsa00311", "hsa00330",
    "hsa00470", "hsa01110", "hsa00270", "hsa00670", "hsa00480", "hsa00220", "hsa00590",
    "hsa00591", "hsa00564", "hsa00565", "hsa00592", "hsa00521", "hsa00562", "hsa00730",
    "hsa00790", "hsa00240", "hsa00061", "hsa00780", "hsa00140", "hsa00984", "hsa00040",
    "hsa00053", "hsa00860", "hsa00980", "hsa00983", "hsa00561", "hsa00310", "hsa00360",
    "hsa00950", "hsa00965", "hsa00230"
]

list2 = [
    "hsa00100", "hsa00030", "hsa00053", "hsa00930", "hsa01100", "hsa01110", "hsa01120",
    "hsa00363", "hsa00564", "hsa00565", "hsa00590", "hsa00591", "hsa00592", "hsa00830",
    "hsa00040", "hsa00051", "hsa00280", "hsa00350", "hsa00380", "hsa00750", "hsa00760",
    "hsa00982", "hsa00010", "hsa00620", "hsa00630", "hsa00640", "hsa00680", "hsa00720",
    "hsa00260", "hsa00270", "hsa00670", "hsa00480", "hsa00330", "hsa00340", "hsa00410",
    "hsa00220", "hsa00360", "hsa00950", "hsa00965", "hsa00730", "hsa00790", "hsa00240",
    "hsa00052", "hsa00520", "hsa00900", "hsa00500", "hsa00521", "hsa00524", "hsa00562",
    "hsa00983", "hsa00140", "hsa00984", "hsa00860", "hsa00980", "hsa00230"
]

# Your mapping from pathway IDs to PROGENy targets
your_pathway_to_progeny_map = {
    "hsa00010": "Hypoxia",
    "hsa00030": "Hypoxia",
    "hsa00040": "MAPK",
    "hsa00051": "PI3K",
    "hsa00052": "EGFR",
    "hsa00053": "MAPK",
    "hsa00100": "Hypoxia",
    "hsa00140": "PI3K",
    "hsa00220": "NFkB",
    "hsa00230": "NFkB",
    "hsa00240": "NFkB",
    "hsa00260": "TNFa",
    "hsa00270": "TNFa",
    "hsa00280": "PI3K",
    "hsa00330": "NFkB",
    "hsa00340": "NFkB",
    "hsa00350": "TGFb",
    "hsa00360": "EGFR",
    "hsa00363": "MAPK",
    "hsa00380": "NFkB",
    "hsa00410": "NFkB",
    "hsa00480": "WNT",
    "hsa00500": "PI3K",
    "hsa00520": "NFkB",
    "hsa00521": "PI3K",
    "hsa00524": "NFkB",
    "hsa00562": "PI3K",
    "hsa00564": "PI3K",
    "hsa00565": "PI3K",
    "hsa00590": "PI3K",
    "hsa00591": "NFkB",
    "hsa00592": "NFkB",
    "hsa00620": "TGFb",
    "hsa00630": "Hypoxia",
    "hsa00640": "Hypoxia",
    "hsa00650": "Hypoxia",
    "hsa00670": "TGFb",
    "hsa00680": "WNT",
    "hsa00720": "MAPK",
    "hsa00730": "Hypoxia",
    "hsa00750": "TGFb",
    "hsa00760": "PI3K",
    "hsa00790": "NFkB",
    "hsa00830": "MAPK",
    "hsa00860": "WNT",
    "hsa00900": "Estrogen",
    "hsa00930": "NFkB",
    "hsa00950": "NFkB",
    "hsa00965": "NFkB",
    "hsa00982": "NFkB",
    "hsa00983": "NFkB",
    "hsa00984": "Estrogen",
    "hsa01100": None,
    "hsa01110": None,
    "hsa01120": None
}

# Convert lists to sets for comparison
set_all_pathways = set(list1) | set(list2)
map_keys = set(your_pathway_to_progeny_map.keys())

# Find which pathway IDs are not in your mapping keys
not_in_map = sorted(set_all_pathways - map_keys)

print("Pathway IDs not found in your_pathway_to_progeny_map:")
for pid in not_in_map:
    print("  -", pid)


Pathway IDs not found in your_pathway_to_progeny_map:
  - hsa00061
  - hsa00310
  - hsa00311
  - hsa00470
  - hsa00561
  - hsa00780
  - hsa00980


In [245]:
import pandas as pd
import numpy as np
import decoupler as dc # For DoRothEA regulons

# --- Prerequisites - Ensure these are correctly loaded or defined ---

# Placeholder: Load your actual DataFrames if they are not already in memory
# If you have them in CSVs, load them. If they are outputs of previous cells,
# ensure those cells have been run in this session.

# Example of how you might ensure they exist (replace with your actual loading if needed):
if 'Table1_Gene_Pathway_Centric' not in locals():
    print("⚠️ Table1_Gene_Pathway_Centric not found. Please load or generate it.")
    # Example: Table1_Gene_Pathway_Centric = pd.read_csv('path_to_table1.csv')
    # For now, creating a dummy one if it's missing, so the script doesn't break immediately,
    # BUT YOU SHOULD HAVE YOUR ACTUAL Table1.
    '''Table1_Gene_Pathway_Centric = pd.DataFrame({
        'Gene': ['LIPA', 'RGN', 'PON1', 'PPARA'], 
        'Pathway Name': ['hsa00100', 'hsa00030', 'hsa00363', 'hsa00071']
    })'''


if 'Table2_PathwayActivitySummary' not in locals():
    print("⚠️ Table2_PathwayActivitySummary (PROGENy results) not found. Please load or generate it.")
    # Example: Table2_PathwayActivitySummary = pd.read_csv('path_to_table2.csv')
    # Creating a dummy for script structure
    '''Table2_PathwayActivitySummary = pd.DataFrame({
        'PROGENy_Pathway': ['Hypoxia', 'PI3K', 'Estrogen', 'NFkB', 'TGFb', 'EGFR', 'MAPK', 'TNFa', 'WNT'],
        'Activity_Difference_ES': np.random.rand(9),
        'Mean_Tumor_Activity_ES': np.random.rand(9),
        'P_value_Activity': np.random.rand(9)
    })'''

if 'Table3_TFActivitySummary' not in locals():
    print("⚠️ Table3_TFActivitySummary (DoRothEA results) not found. Please load or generate it.")
    # Example: Table3_TFActivitySummary = pd.read_csv('path_to_table3.csv')
    # Creating a dummy for script structure
    '''Table3_TFActivitySummary = pd.DataFrame({
        'TF_Symbol': ['ESR1', 'ESRRA', 'FOS', 'AR', 'GATA3', 'PPARA', 'SP1', 'MYC', 'SREBF1', 'NFE2L2'],
        'Activity_Difference_ES': np.random.rand(10),
        'Mean_Tumor_Activity': np.random.rand(10),
        'P_value_Activity': np.random.rand(10)
    })'''

if 'dorothea_df' not in locals():
    print("Loading DoRothEA regulons via decoupler...")
    dorothea_df = dc.op.dorothea(organism='human', levels=['A', 'B', 'C', 'D'])
    dorothea_df['target'] = dorothea_df['target'].astype(str).str.upper()
    dorothea_df['source'] = dorothea_df['source'].astype(str).str.upper() # TF names
    print("✅ DoRothEA regulons loaded.")

# --- DECISIONS YOU HAVE MADE ---

# 1. Pathway Score Metric
pathway_score_metric_to_use = 'Activity_Difference_ES'
print(f"✅ pathway_score_metric_to_use set to: '{pathway_score_metric_to_use}'")

# 2. Your KEGG Pathway to PROGENy Pathway Map
your_pathway_to_progeny_map = {
    # Energy & Carbon Metabolism
    "hsa00010": "HYPOXIA",    # Glycolysis / Gluconeogenesis
    "hsa00030": "HYPOXIA",    # Pentose phosphate pathway
    "hsa00040": "MAPK",       # Pentose and glucuronate interconversions
    "hsa00051": "PI3K",       # Fructose and mannose metabolism
    "hsa00052": "EGFR",       # Galactose metabolism
    "hsa00053": "MAPK",       # Ascorbate and aldarate metabolism
    "hsa00061": "PI3K",       # Fatty acid biosynthesis
    "hsa00100": "PI3K",       # Steroid biosynthesis
    "hsa00140": "PI3K",       # Steroid hormone biosynthesis

    # Nucleotide, Cofactor, Redox & Amino Acid Metabolism
    "hsa00220": "NFKB",       # Arginine biosynthesis
    "hsa00230": "NFKB",       # Purine metabolism
    "hsa00240": "NFKB",       # Pyrimidine metabolism
    "hsa00250": "PI3K",       # Alanine, aspartate and glutamate metabolism
    "hsa00260": "TNFA",       # Glycine, serine and threonine metabolism
    "hsa00270": "TNFA",       # Cysteine and methionine metabolism
    "hsa00280": "PI3K",       # Valine, leucine and isoleucine degradation
    "hsa00310": "PI3K",       # Lysine degradation
    "hsa00330": "NFKB",       # Arginine and proline metabolism
    "hsa00340": "NFKB",       # Histidine metabolism
    "hsa00350": "TGFB",       # Tyrosine metabolism
    "hsa00360": "EGFR",       # Phenylalanine metabolism
    "hsa00380": "NFKB",       # Tryptophan metabolism
    "hsa00410": "NFKB",       # Beta-Alanine metabolism
    "hsa00470": "NFKB",       # D-Amino acid metabolism
    "hsa00480": "WNT",        # Glutathione metabolism
    "hsa00790": "NFKB",       # Folate biosynthesis

    # Other Carbohydrate Metabolism & Sugar Metabolism
    "hsa00500": "PI3K",       # Starch and sucrose metabolism
    "hsa00520": "NFKB",       # Amino sugar and nucleotide sugar metabolism
    "hsa00521": "PI3K",       # Keratan sulfate biosynthesis / Chondroitin sulfate biosynthesis (Glycosaminoglycan group)
    "hsa00524": "NFKB",       # Neomycin, kanamycin, and gentamicin biosynthesis

    # Lipid Metabolism
    "hsa00561": "PI3K",       # Glycerolipid metabolism
    "hsa00562": "PI3K",       # Inositol phosphate metabolism
    "hsa00564": "PI3K",       # Glycerophospholipid metabolism
    "hsa00565": "PI3K",       # Ether lipid metabolism
    "hsa00590": "PI3K",       # Arachidonic acid metabolism
    "hsa00591": "NFKB",       # Linoleic acid metabolism
    "hsa00592": "NFKB",       # Alpha-Linolenic acid metabolism
    "hsa00900": "ESTROGEN",   # Terpenoid backbone biosynthesis
    
    # Other Central/Energy Metabolism
    "hsa00620": "TGFB",       # Pyruvate metabolism
    "hsa00630": "HYPOXIA",    # Glyoxylate and dicarboxylate metabolism
    "hsa00640": "HYPOXIA",    # Propanoate metabolism
    "hsa00670": "TGFB",       # One carbon pool by folate
    "hsa00680": "WNT",        # Methane metabolism

    # Metabolism of Cofactors and Vitamins
    "hsa00730": "HYPOXIA",    # Thiamine metabolism
    "hsa00750": "TGFB",       # Vitamin B6 metabolism
    "hsa00780": "PI3K",       # Biotin metabolism
    "hsa00830": "MAPK",       # Retinol metabolism
    "hsa00860": "WNT",        # Porphyrin and chlorophyll metabolism
    
    # Xenobiotic / Drug Metabolism
    "hsa00930": "NFKB",       # Caprolactam degradation
    "hsa00950": "NFKB",       # Isoquinoline alkaloid biosynthesis (KEGG name)
    "hsa00965": "NFKB",       # Betalain biosynthesis
    "hsa00980": "PI3K",       # Metabolism of xenobiotics by cytochrome P450
    "hsa00982": "NFKB",       # Drug metabolism - cytochrome P450
    "hsa00983": "NFKB",       # Drug metabolism - other enzymes
    "hsa00984": "ESTROGEN",   # Steroid degradation

    # Specific Pathways 
    "hsa00311": "NFKB",       # Penicillinate and cephalosporin biosynthesis (host response if human genes linked)
    "hsa00362": "NFKB",       # Benzoate degradation (xenobiotic-like)
    "hsa00363": "MAPK",       # Non-homologous end-joining (This is DNA repair. If you meant Tau, KEGG ID might be different or not exist as a map)
    "hsa00720": "PI3K",       # Reductive carboxylate cycle (CO2 fixation)
    
    # Overview Pathways 
    "hsa01100": "PI3K",       # Metabolic pathways (broad oncogenic metabolic drive)
    "hsa01110": "PI3K",       # Biosynthesis of secondary metabolites (if linked to endogenous anabolism)
    "hsa01120": "NFKB",       # Microbial metabolism in diverse environments (host response context)
    "hsa00650": "P53" ,       # Butanoate metabolism
    "hsa00760": "PI3K"        # Naphthalene degradation
}

print(f"✅ 'your_pathway_to_progeny_map' defined with {len(your_pathway_to_progeny_map)} entries.")

# 3. TF Activity Score Metric
tf_activity_metric_to_use = 'Activity_Difference_ES'
print(f"✅ tf_activity_metric_to_use set to: '{tf_activity_metric_to_use}'")

# 4. Your Gene to TF Map (Automated generation)
print("\nAuto-generating Gene-to-TF map using DoRothEA weights...")
gene_column_for_tf_map = None
if 'Gene' in Table1_Gene_Pathway_Centric.columns:
    gene_column_for_tf_map = 'Gene'
elif 'Gene_Symbol' in Table1_Gene_Pathway_Centric.columns:
    gene_column_for_tf_map = 'Gene_Symbol'

top_tf_per_gene = {}
if gene_column_for_tf_map:
    unique_genes_real = (
        Table1_Gene_Pathway_Centric[gene_column_for_tf_map]
        .dropna()
        .str.upper()
        .unique()
    )
    filtered_dorothea = dorothea_df.loc[
        dorothea_df['target'].isin(unique_genes_real)
    ].copy()
    if not filtered_dorothea.empty:
        filtered_dorothea['abs_weight'] = filtered_dorothea['weight'].abs()
        top_tf_per_gene = (
            filtered_dorothea
            .sort_values('abs_weight', ascending=False)
            .groupby('target')['source']
            .first()
            .to_dict()
        )
your_gene_to_tf_map = top_tf_per_gene
print(f"✅ 'your_gene_to_tf_map' defined with {len(your_gene_to_tf_map)} gene-TF pairings.")

print("\n--- All prerequisites and custom mappings should now be ready. ---")

✅ pathway_score_metric_to_use set to: 'Activity_Difference_ES'
✅ 'your_pathway_to_progeny_map' defined with 64 entries.
✅ tf_activity_metric_to_use set to: 'Activity_Difference_ES'

Auto-generating Gene-to-TF map using DoRothEA weights...
✅ 'your_gene_to_tf_map' defined with 70 gene-TF pairings.

--- All prerequisites and custom mappings should now be ready. ---


<h2> Mapping of KEGG Metabolic Pathways to PROGENy Signaling Pathways</h2>

<p>This mapping aligns curated KEGG metabolic pathways with corresponding <strong>PROGENy signaling pathways</strong> based on known metabolic-control relationships and literature-based inference.</p>

<hr>

<table border="1" cellpadding="6" cellspacing="0">
  <thead>
    <tr>
      <th>KEGG Pathway ID</th>
      <th>KEGG Pathway Description</th>
      <th>Mapped PROGENy Pathway</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>hsa00010</td><td>Glycolysis / Gluconeogenesis</td><td>HYPOXIA</td></tr>
    <tr><td>hsa00030</td><td>Pentose phosphate pathway</td><td>HYPOXIA</td></tr>
    <tr><td>hsa00040</td><td>Pentose and glucuronate interconversions</td><td>MAPK</td></tr>
    <tr><td>hsa00051</td><td>Fructose and mannose metabolism</td><td>PI3K</td></tr>
    <tr><td>hsa00052</td><td>Galactose metabolism</td><td>EGFR</td></tr>
    <tr><td>hsa00053</td><td>Ascorbate and aldarate metabolism</td><td>MAPK</td></tr>
    <tr><td>hsa00061</td><td>Fatty acid biosynthesis</td><td>PI3K</td></tr>
    <tr><td>hsa00100</td><td>Steroid biosynthesis</td><td>PI3K</td></tr>
    <tr><td>hsa00140</td><td>Steroid hormone biosynthesis</td><td>PI3K</td></tr>
    <tr><td>hsa00220</td><td>Arginine biosynthesis</td><td>NFKB</td></tr>
    <tr><td>hsa00230</td><td>Purine metabolism</td><td>NFKB</td></tr>
    <tr><td>hsa00240</td><td>Pyrimidine metabolism</td><td>NFKB</td></tr>
    <tr><td>hsa00250</td><td>Alanine, aspartate and glutamate metabolism</td><td>PI3K</td></tr>
    <tr><td>hsa00260</td><td>Glycine, serine and threonine metabolism</td><td>TNFA</td></tr>
    <tr><td>hsa00270</td><td>Cysteine and methionine metabolism</td><td>TNFA</td></tr>
    <tr><td>hsa00280</td><td>Valine, leucine and isoleucine degradation</td><td>PI3K</td></tr>
    <tr><td>hsa00310</td><td>Lysine degradation</td><td>PI3K</td></tr>
    <tr><td>hsa00330</td><td>Arginine and proline metabolism</td><td>NFKB</td></tr>
    <tr><td>hsa00340</td><td>Histidine metabolism</td><td>NFKB</td></tr>
    <tr><td>hsa00350</td><td>Tyrosine metabolism</td><td>TGFB</td></tr>
    <tr><td>hsa00360</td><td>Phenylalanine metabolism</td><td>EGFR</td></tr>
    <tr><td>hsa00380</td><td>Tryptophan metabolism</td><td>NFKB</td></tr>
    <tr><td>hsa00410</td><td>Beta-Alanine metabolism</td><td>NFKB</td></tr>
    <tr><td>hsa00470</td><td>D-Amino acid metabolism</td><td>NFKB</td></tr>
    <tr><td>hsa00480</td><td>Glutathione metabolism</td><td>WNT</td></tr>
    <tr><td>hsa00790</td><td>Folate biosynthesis</td><td>NFKB</td></tr>
    <tr><td>hsa00500</td><td>Starch and sucrose metabolism</td><td>PI3K</td></tr>
    <tr><td>hsa00520</td><td>Amino sugar and nucleotide sugar metabolism</td><td>NFKB</td></tr>
    <tr><td>hsa00521</td><td>Keratan sulfate / Chondroitin sulfate biosynthesis</td><td>PI3K</td></tr>
    <tr><td>hsa00524</td><td>Antibiotic biosynthesis (e.g., neomycin, gentamicin)</td><td>NFKB</td></tr>
    <tr><td>hsa00561</td><td>Glycerolipid metabolism</td><td>PI3K</td></tr>
    <tr><td>hsa00562</td><td>Inositol phosphate metabolism</td><td>PI3K</td></tr>
    <tr><td>hsa00564</td><td>Glycerophospholipid metabolism</td><td>PI3K</td></tr>
    <tr><td>hsa00565</td><td>Ether lipid metabolism</td><td>PI3K</td></tr>
    <tr><td>hsa00590</td><td>Arachidonic acid metabolism</td><td>PI3K</td></tr>
    <tr><td>hsa00591</td><td>Linoleic acid metabolism</td><td>NFKB</td></tr>
    <tr><td>hsa00592</td><td>Alpha-Linolenic acid metabolism</td><td>NFKB</td></tr>
    <tr><td>hsa00900</td><td>Terpenoid backbone biosynthesis</td><td>ESTROGEN</td></tr>
    <tr><td>hsa00620</td><td>Pyruvate metabolism</td><td>TGFB</td></tr>
    <tr><td>hsa00630</td><td>Glyoxylate and dicarboxylate metabolism</td><td>HYPOXIA</td></tr>
    <tr><td>hsa00640</td><td>Propanoate metabolism</td><td>HYPOXIA</td></tr>
    <tr><td>hsa00670</td><td>One carbon pool by folate</td><td>TGFB</td></tr>
    <tr><td>hsa00680</td><td>Methane metabolism</td><td>WNT</td></tr>
    <tr><td>hsa00730</td><td>Thiamine metabolism</td><td>HYPOXIA</td></tr>
    <tr><td>hsa00750</td><td>Vitamin B6 metabolism</td><td>TGFB</td></tr>
    <tr><td>hsa00780</td><td>Biotin metabolism</td><td>PI3K</td></tr>
    <tr><td>hsa00830</td><td>Retinol metabolism</td><td>MAPK</td></tr>
    <tr><td>hsa00860</td><td>Porphyrin and chlorophyll metabolism</td><td>WNT</td></tr>
    <tr><td>hsa00930</td><td>Caprolactam degradation</td><td>NFKB</td></tr>
    <tr><td>hsa00950</td><td>Isoquinoline alkaloid biosynthesis</td><td>NFKB</td></tr>
    <tr><td>hsa00965</td><td>Betalain biosynthesis</td><td>NFKB</td></tr>
    <tr><td>hsa00980</td><td>Xenobiotic metabolism (CYP450)</td><td>PI3K</td></tr>
    <tr><td>hsa00982</td><td>Drug metabolism - CYP450</td><td>NFKB</td></tr>
    <tr><td>hsa00983</td><td>Drug metabolism - other enzymes</td><td>NFKB</td></tr>
    <tr><td>hsa00984</td><td>Steroid degradation</td><td>ESTROGEN</td></tr>
    <tr><td>hsa00311</td><td>Penicillin/cephalosporin biosynthesis</td><td>NFKB</td></tr>
    <tr><td>hsa00362</td><td>Benzoate degradation</td><td>NFKB</td></tr>
    <tr><td>hsa00363</td><td>Non-homologous end-joining</td><td>MAPK</td></tr>
    <tr><td>hsa00720</td><td>Reductive carboxylate cycle</td><td>PI3K</td></tr>
    <tr><td>hsa01100</td><td>Metabolic pathways (overview)</td><td>PI3K</td></tr>
    <tr><td>hsa01110</td><td>Secondary metabolite biosynthesis</td><td>PI3K</td></tr>
    <tr><td>hsa01120</td><td>Microbial metabolism</td><td>NFKB</td></tr>
    <tr><td>hsa00650</td><td>Butanoate metabolism (P53 proxy)</td><td>P53</td></tr>
    <tr><td>hsa00760</td><td>Naphthalene degradation</td><td>PI3K</td></tr>
  </tbody>
</table>

<hr>

<h3>🧠 Interpretation</h3>
<p>This mapping enables cross-modal integration of KEGG-centric metabolic pathway information with PROGENy-inferred pathway activity scores, helping relate tumor metabolism to oncogenic signaling patterns.</p>


In [356]:
# THIS SCRIPT ASSUMES THE FOLLOWING ARE ALREADY DEFINED IN YOUR ENVIRONMENT:
# - Table1_Gene_Pathway_Centric (pandas DataFrame)
# - Table2_PathwayActivitySummary (pandas DataFrame, now with PROGENy data)
# - Table3_TFActivitySummary (pandas DataFrame, now with DoRothEA TF data)
# - your_pathway_to_progeny_map (dictionary)
# - your_gene_to_tf_map (dictionary)
# - pathway_score_metric_to_use (string, e.g., 'Activity_Difference_ES')
# - tf_activity_metric_to_use (string, e.g., 'Activity_Difference_ES')

import pandas as pd
import numpy as np

print("Step 0: Verifying prerequisite DataFrames and custom mappings are present...")

# Basic checks
required_dfs = ["Table1_Gene_Pathway_Centric", "Table2_PathwayActivitySummary", "Table3_TFActivitySummary"]
required_vars = ["your_pathway_to_progeny_map", "your_gene_to_tf_map", "pathway_score_metric_to_use", "tf_activity_metric_to_use"]
all_prerequisites_met = True

for df_name in required_dfs:
    if df_name not in locals() or not isinstance(locals()[df_name], pd.DataFrame) or locals()[df_name].empty:
        print(f"❌ ERROR: Prerequisite DataFrame '{df_name}' is not defined or is empty.")
        all_prerequisites_met = False
    else:
        print(f"✅ Prerequisite DataFrame '{df_name}' found. Shape: {locals()[df_name].shape}")


for var_name in required_vars:
    if var_name not in locals():
        print(f"❌ ERROR: Prerequisite variable '{var_name}' is not defined.")
        all_prerequisites_met = False
    else:
        print(f"✅ Prerequisite variable '{var_name}' found.")


if not all_prerequisites_met:
    print("\nStopping script. Please ensure all prerequisite DataFrames and variables are correctly defined in your session from previous steps.")
else:
    print("\n✅ All prerequisites seem to be present. Proceeding with table population...")

    # Make a copy of your main table to work on
    final_table = Table1_Gene_Pathway_Centric.copy()
    print(f"✅ Created 'final_table' as a copy of 'Table1_Gene_Pathway_Centric' (shape: {final_table.shape}).")

    # --- 1. Populate 'PathwayScore' (with new case-insensitive logic) ---
    print("\n--- Populating 'PathwayScore' ---")
    print(f"Using '{pathway_score_metric_to_use}' from Table2_PathwayActivitySummary as the PathwayScore.")

    progeny_scores_lookup = {}
    if not Table2_PathwayActivitySummary.empty and \
       'PROGENy_Pathway' in Table2_PathwayActivitySummary.columns and \
       pathway_score_metric_to_use in Table2_PathwayActivitySummary.columns:
        
        # Create a temporary Table2 with uppercase PROGENy_Pathway names for lookup
        Table2_temp_for_lookup = Table2_PathwayActivitySummary.copy()
        Table2_temp_for_lookup['PROGENy_Pathway_UPPER'] = Table2_temp_for_lookup['PROGENy_Pathway'].astype(str).str.upper()
        progeny_scores_lookup = Table2_temp_for_lookup.set_index('PROGENy_Pathway_UPPER')[pathway_score_metric_to_use].to_dict()
        print("    (progeny_scores_lookup created using UPPERCASE PROGENy pathway names for matching)")
    else:
        print(f"⚠️ WARNING: Table2_PathwayActivitySummary is empty or missing required columns ('PROGENy_Pathway', '{pathway_score_metric_to_use}') for PROGENy score lookup. 'PathwayScore' will be NaN.")
        
    def map_pathway_score_logic(original_pathway_name_from_table):
        path_id_to_lookup = str(original_pathway_name_from_table).strip()
        if ' ' in path_id_to_lookup: 
            path_id_to_lookup = path_id_to_lookup.split(' ')[0]
        
        # Get the PROGENy pathway name from your custom map (this name might have mixed case)
        chosen_progeny_pathway_original_case = your_pathway_to_progeny_map.get(path_id_to_lookup)
        
        if chosen_progeny_pathway_original_case:
            # Convert the mapped PROGENy pathway to UPPERCASE for lookup in progeny_scores_lookup
            chosen_progeny_pathway_upper = str(chosen_progeny_pathway_original_case).upper()
            return progeny_scores_lookup.get(chosen_progeny_pathway_upper, np.nan)
        return np.nan

    final_table['PathwayScore'] = np.nan # Initialize column
    if 'Pathway Name' in final_table.columns:
        final_table['PathwayScore'] = final_table['Pathway Name'].apply(map_pathway_score_logic)
        print(f"✅ 'PathwayScore' column populated. Number of non-NaN scores: {final_table['PathwayScore'].notna().sum()}")
    else:
        print("⚠️ WARNING: 'Pathway Name' column not found in 'final_table'. 'PathwayScore' will be NaN.")

    # --- 2. Populate 'TF' and 'TFActivity' ---
    print("\n--- Populating 'TF' and 'TFActivity' ---")
    print(f"Using '{tf_activity_metric_to_use}' from Table3_TFActivitySummary as the TFActivity.")

    tf_scores_lookup = {}
    if not Table3_TFActivitySummary.empty and \
       'TF_Symbol' in Table3_TFActivitySummary.columns and \
       tf_activity_metric_to_use in Table3_TFActivitySummary.columns:
        Table3_TFActivitySummary_for_lookup = Table3_TFActivitySummary.copy()
        # Ensure TF_Symbol in the lookup table is also uppercase for consistent matching
        Table3_TFActivitySummary_for_lookup['TF_Symbol'] = Table3_TFActivitySummary_for_lookup['TF_Symbol'].astype(str).str.upper()
        tf_scores_lookup = Table3_TFActivitySummary_for_lookup.set_index('TF_Symbol')[tf_activity_metric_to_use].to_dict()
        print("    (tf_scores_lookup created using UPPERCASE TF_Symbol for matching)")

    else:
        print(f"⚠️ WARNING: Table3_TFActivitySummary is empty or missing required columns ('TF_Symbol', '{tf_activity_metric_to_use}') for TF score lookup. 'TFActivity' will be NaN.")

    def map_tf_name_logic(gene_symbol):
        # your_gene_to_tf_map should ideally yield TF names that match the case (or lack thereof)
        # of keys in tf_scores_lookup (which are now uppercase).
        # For robustness, ensure TF name from map is also uppercased if tf_scores_lookup keys are uppercase.
        tf_name = your_gene_to_tf_map.get(str(gene_symbol).strip().upper(), "")
        return str(tf_name).upper() if tf_name else "" # Return uppercase TF name

    def map_tf_activity_logic(gene_symbol):
        # Get TF name, ensure it's uppercase for lookup
        tf_to_use_for_lookup = map_tf_name_logic(gene_symbol) # This will be uppercase
        if tf_to_use_for_lookup:
            return tf_scores_lookup.get(tf_to_use_for_lookup, np.nan) # Lookup with uppercase TF name
        return np.nan

    final_table['TF'] = ""
    final_table['TFActivity'] = np.nan

    gene_column_for_tf_map = None
    if 'Gene' in final_table.columns:
        gene_column_for_tf_map = 'Gene'
    elif 'Gene_Symbol' in final_table.columns:
        gene_column_for_tf_map = 'Gene_Symbol'

    if gene_column_for_tf_map:
        final_table[gene_column_for_tf_map] = final_table[gene_column_for_tf_map].astype(str)
        final_table['TF'] = final_table[gene_column_for_tf_map].apply(map_tf_name_logic)
        final_table['TFActivity'] = final_table[gene_column_for_tf_map].apply(map_tf_activity_logic)
        print(f"✅ 'TF' column populated. Number of mapped TFs: {(final_table['TF'] != '').sum()}")
        print(f"✅ 'TFActivity' column populated. Number of non-NaN scores: {final_table['TFActivity'].notna().sum()}")
    else:
        print("⚠️ WARNING: Neither 'Gene' nor 'Gene_Symbol' column found in 'final_table'. 'TF' and 'TFActivity' will be empty/NaN.")

    # --- 3. Finalize column names and order for display ---
    actual_gene_column_name = None
    if 'Gene' in final_table.columns:
        actual_gene_column_name = 'Gene'
    elif 'Gene_Symbol' in final_table.columns:
        actual_gene_column_name = 'Gene_Symbol'
    else:
        print("⚠️ WARNING: Could not determine the primary gene column ('Gene' or 'Gene_Symbol') for display.")
        actual_gene_column_name = 'Gene_Fallback' # Placeholder
        if actual_gene_column_name not in final_table.columns: # If 'Gene_Fallback' also doesn't exist
             final_table[actual_gene_column_name] = "N/A"


    # Desired columns in the specified order
    desired_display_columns = [
        'Metabolite',
        # 'associated_compound_types', # Uncomment if this column exists in final_table
        actual_gene_column_name,
        'Expr (log₂ TPM)',
        'TF',
        'TFActivity',
        'PathwayScore',        # New order
        'Pathway Name',        # New order
        'kegg_enzyme',         # New order
        'kegg_reactions'       # New order
    ]

    # Filter for columns that actually exist in the final_table to avoid errors
    final_columns_to_display = [col for col in desired_display_columns if col in final_table.columns]

    # Create the final output table with selected and reordered columns
    final_output_table = final_table[final_columns_to_display].copy() 

    print("\n--- Final Curated Table (First 20 Rows, Custom Columns) ---")
    try:
        # Assuming 'display' function is available (e.g., in Jupyter/Colab)
        display(final_output_table.head(20))
    except NameError: 
        print(final_output_table.head(20).to_string())

    print("\n✅ Script finished processing.")


Step 0: Verifying prerequisite DataFrames and custom mappings are present...
✅ Prerequisite DataFrame 'Table1_Gene_Pathway_Centric' found. Shape: (585, 11)
✅ Prerequisite DataFrame 'Table2_PathwayActivitySummary' found. Shape: (14, 5)
✅ Prerequisite DataFrame 'Table3_TFActivitySummary' found. Shape: (321, 5)
✅ Prerequisite variable 'your_pathway_to_progeny_map' found.
✅ Prerequisite variable 'your_gene_to_tf_map' found.
✅ Prerequisite variable 'pathway_score_metric_to_use' found.
✅ Prerequisite variable 'tf_activity_metric_to_use' found.

✅ All prerequisites seem to be present. Proceeding with table population...
✅ Created 'final_table' as a copy of 'Table1_Gene_Pathway_Centric' (shape: (585, 11)).

--- Populating 'PathwayScore' ---
Using 'Activity_Difference_ES' from Table2_PathwayActivitySummary as the PathwayScore.
    (progeny_scores_lookup created using UPPERCASE PROGENy pathway names for matching)
✅ 'PathwayScore' column populated. Number of non-NaN scores: 585

--- Populating 'T

Unnamed: 0,Metabolite,Gene_Symbol,TF,TFActivity,PathwayScore,Pathway Name,kegg_enzyme,kegg_reactions
0,['D-Erythronolactone'],BDH1,E2F1,0.980517,126.993134,hsa00650,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013..."
0,['D-Erythronolactone'],BDH1,E2F1,0.980517,-141.148347,hsa01100,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013..."
1,['Deoxyribose 5-phosphate'],DERA,TP53,1.779543,199.106574,hsa00030,4.1.2.4,"['R02750', 'R02749', 'R01066']"
1,['Deoxyribose 5-phosphate'],DERA,TP53,1.779543,-141.148347,hsa01100,4.1.2.4,"['R02750', 'R02749', 'R01066']"
2,['Deoxyribose 5-phosphate'],RBKS,GATA2,-0.424943,199.106574,hsa00030,2.7.1.15,"['R02750', 'R02749', 'R01066']"
2,['Deoxyribose 5-phosphate'],RBKS,GATA2,-0.424943,-141.148347,hsa01100,2.7.1.15,"['R02750', 'R02749', 'R01066']"
3,['Quinoline-4-carboxylic acid'],AOX1,STAT1,-0.43692,-141.148347,hsa00280,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017..."
3,['Quinoline-4-carboxylic acid'],AOX1,STAT1,-0.43692,-356.586762,hsa00350,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017..."
3,['Quinoline-4-carboxylic acid'],AOX1,STAT1,-0.43692,383.739852,hsa00380,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017..."
3,['Quinoline-4-carboxylic acid'],AOX1,STAT1,-0.43692,-356.586762,hsa00750,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017..."



✅ Script finished processing.


<h2>Integrated Metabolite–Gene–TF–Pathway Summary</h2>

<p>This table integrates <strong>metabolite-centric enzyme data</strong> with <strong>inferred TF activity (from VIPER)</strong> and <strong>signaling pathway dysregulation (from PROGENy)</strong> for colorectal cancer (CRC) samples in TCGA-COAD. Each row connects a metabolite-linked gene to its dominant transcription factor and the dysregulation score of the associated pathway.</p>

<table border="1" cellpadding="6" cellspacing="0">
  <thead>
    <tr>
      <th>Metabolite</th>
      <th>Gene</th>
      <th>Transcription Factor</th>
      <th>TF Activity Δ (Tumor - Normal)</th>
      <th>PROGENy Pathway Score</th>
      <th>KEGG Pathway</th>
      <th>Enzyme EC</th>
      <th>Reactions</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>D-Erythronolactone</td>
      <td>BDH1</td>
      <td>E2F1</td>
      <td>0.9805</td>
      <td>126.99</td>
      <td>hsa00650</td>
      <td>1.1.1.30</td>
      <td>R00145, R00146, R00717...</td>
    </tr>
    <tr>
      <td>D-Erythronolactone</td>
      <td>BDH1</td>
      <td>E2F1</td>
      <td>0.9805</td>
      <td>-141.15</td>
      <td>hsa01100</td>
      <td>1.1.1.30</td>
      <td>R00145, R00146, R00717...</td>
    </tr>
    <tr>
      <td>Deoxyribose 5-phosphate</td>
      <td>DERA</td>
      <td>TP53</td>
      <td>1.7795</td>
      <td>199.11</td>
      <td>hsa00030</td>
      <td>4.1.2.4</td>
      <td>R02750, R02749...</td>
    </tr>
    <tr>
      <td>Deoxyribose 5-phosphate</td>
      <td>DERA</td>
      <td>TP53</td>
      <td>1.7795</td>
      <td>-141.15</td>
      <td>hsa01100</td>
      <td>4.1.2.4</td>
      <td>R02750, R02749...</td>
    </tr>
    <tr>
      <td>Deoxyribose 5-phosphate</td>
      <td>RBKS</td>
      <td>GATA2</td>
      <td>-0.4249</td>
      <td>199.11</td>
      <td>hsa00030</td>
      <td>2.7.1.15</td>
      <td>R02750, R02749...</td>
    </tr>
    <tr>
      <td>Deoxyribose 5-phosphate</td>
      <td>RBKS</td>
      <td>GATA2</td>
      <td>-0.4249</td>
      <td>-141.15</td>
      <td>hsa01100</td>
      <td>2.7.1.15</td>
      <td>R02750, R02749...</td>
    </tr>
    <tr>
      <td>Quinoline-4-carboxylic acid</td>
      <td>AOX1</td>
      <td>STAT1</td>
      <td>-0.4369</td>
      <td>-141.15</td>
      <td>hsa00280</td>
      <td>1.2.3.1</td>
      <td>R00366, R00372...</td>
    </tr>
    <tr>
      <td>Quinoline-4-carboxylic acid</td>
      <td>AOX1</td>
      <td>STAT1</td>
      <td>-0.4369</td>
      <td>-356.59</td>
      <td>hsa00350</td>
      <td>1.2.3.1</td>
      <td>R00366, R00372...</td>
    </tr>
    <tr>
      <td>Quinoline-4-carboxylic acid</td>
      <td>AOX1</td>
      <td>STAT1</td>
      <td>-0.4369</td>
      <td>383.74</td>
      <td>hsa00380</td>
      <td>1.2.3.1</td>
      <td>R00366, R00372...</td>
    </tr>
    <tr>
      <td>Quinoline-4-carboxylic acid</td>
      <td>AOX1</td>
      <td>STAT1</td>
      <td>-0.4369</td>
      <td>-356.59</td>
      <td>hsa00750</td>
      <td>1.2.3.1</td>
      <td>R00366, R00372...</td>
    </tr>
  </tbody>
</table>

<hr>

<h3>🧠 Why This Table Matters</h3>
<p>This table enables metabolic genes to be functionally contextualized within two layers of cancer-specific regulatory change:</p>
<ul>
  <li><strong>Transcription factor activity</strong> (VIPER): Reflects regulatory pressure on the gene</li>
  <li><strong>Signaling pathway activation</strong> (PROGENy): Indicates higher-order oncogenic context</li>
</ul>
<p>These linkages help prioritize metabolite-enzyme pairs for further study or therapeutic targeting, especially when both <strong>TF and pathway are dysregulated</strong>.</p>


<div style="
    border-left: 6px solid #3F51B5;
    background-color: #E8EAF6;
    padding: 16px 20px;
    margin: 24px 0;
    border-radius: 8px;
    box-shadow: 0 2px 4px rgba(0,0,0,0.05);
">
  <h3 style="margin-top: 0; margin-bottom: 8px; color: #3F51B5;">🧹 Data Cleaning</h3>
  <p style="margin: 0; color: #333;">
    In this step, you systematically remove or correct erroneous entries (e.g., missing SMILES, invalid pathway IDs, or duplicated metabolite names) from your dataset. By enforcing consistent formatting, handling NaNs, and validating that each entry maps to a known KEGG ID or HMDB identifier, you ensure that downstream analyses (e.g., pathway enrichment, transcriptomic integration) are built upon a reliable, high‐quality foundation.
  </p>
</div>


In [362]:
import pandas as pd
import requests
import time # For a small delay between API calls

# Assume your 'final_output_table' DataFrame is already loaded.
# For demonstration, I'll create a snippet if it's not loaded.
if 'final_output_table' not in locals() or not isinstance(final_output_table, pd.DataFrame):
    print("⚠️ 'final_output_table' not found. Creating a sample DataFrame for demonstration.")
    data_final_output = {
        'Pathway Name': ['hsa00240', 'hsa00260', 'hsa00270', 'hsa00280', 'hsa00330', 
                         'hsa00340', 'hsa00350', 'hsa00360', 'hsa00363', 'hsa00380', 
                         'hsa00010', 'hsa00020', 'hsa00030'] # Sample IDs
        # Add other necessary columns for context if needed for your actual df
    }
    for i in range(42): # Add more dummy hsa IDs to reach ~55 unique
        data_final_output['Pathway Name'].append(f"hsa{1000+i:05d}")

    data_final_output['Metabolite'] = [f'M{i}' for i in range(len(data_final_output['Pathway Name']))]
    data_final_output['Gene_Symbol'] = [f'G{i}' for i in range(len(data_final_output['Pathway Name']))]
    final_output_table = pd.DataFrame(data_final_output)
# --- End of sample data ---

def get_kegg_pathway_common_names_robust(pathway_id_list_with_optional_desc, chunk_size=10):
    """
    Fetches common names for KEGG pathway IDs in smaller chunks for robustness.
    Handles IDs that might have descriptions appended.
    """
    pathway_names_map = {}
    if not pathway_id_list_with_optional_desc:
        return pathway_names_map

    cleaned_pathway_ids = set()
    for entry in pathway_id_list_with_optional_desc:
        if pd.isna(entry):
            continue
        path_id = str(entry).strip()
        if ' ' in path_id: 
            path_id = path_id.split(' ')[0]
        # Ensure it's a valid hsa ID format before adding
        if path_id.startswith("hsa") and len(path_id) == 8 and path_id[3:].isdigit():
             cleaned_pathway_ids.add(path_id)
    
    unique_cleaned_ids = sorted(list(cleaned_pathway_ids))

    if not unique_cleaned_ids:
        print("No valid hsaXXXXX pathway IDs found to query.")
        return pathway_names_map
        
    print(f"Querying KEGG for common names of {len(unique_cleaned_ids)} unique pathway IDs in chunks of {chunk_size}...")
    
    for i in range(0, len(unique_cleaned_ids), chunk_size):
        chunk_ids = unique_cleaned_ids[i:i+chunk_size]
        if not chunk_ids:
            continue
            
        query_string = "+".join(chunk_ids)
        url = f"http://rest.kegg.jp/get/{query_string}"
        print(f"  Querying chunk: {chunk_ids} (URL: {url})")
        
        try:
            response = requests.get(url, timeout=45) # Slightly longer timeout for API
            response.raise_for_status()
            time.sleep(0.2) # Be polite between batch calls

            entries_text = response.text.strip().split('///\n')
            
            for entry_text in entries_text:
                if not entry_text.strip():
                    continue
                
                current_entry_id_from_record = None
                common_name = None
                
                for line in entry_text.split('\n'):
                    if line.startswith('ENTRY'):
                        parts = line.split()
                        if len(parts) > 1:
                            entry_id_map_or_hsa = parts[1]
                            if entry_id_map_or_hsa.startswith("map"):
                                current_entry_id_from_record = "hsa" + entry_id_map_or_hsa[3:]
                            elif entry_id_map_or_hsa.startswith("hsa"):
                                current_entry_id_from_record = entry_id_map_or_hsa
                                
                    elif line.startswith('NAME        '): 
                        common_name = line.replace('NAME        ', '').strip()
                    
                    if current_entry_id_from_record and common_name:
                        if current_entry_id_from_record in chunk_ids: # Ensure we only map IDs we queried in this chunk
                             pathway_names_map[current_entry_id_from_record] = common_name
                        break 
        except requests.exceptions.RequestException as e:
            print(f"  Warning: KEGG API call for chunk {chunk_ids} failed: {e}")
        except Exception as e_parse:
            print(f"  Warning: Error parsing KEGG response for chunk {chunk_ids}: {e_parse}")
            
    print(f"\n✅ Successfully fetched common names for {len(pathway_names_map)} pathways after chunking.")
    return pathway_names_map

# --- Main part of the script ---

if 'final_output_table' in locals() and isinstance(final_output_table, pd.DataFrame):
    if 'Pathway Name' in final_output_table.columns:
        unique_pathway_ids_in_df = final_output_table['Pathway Name'].dropna().unique().tolist()

        pathway_id_to_common_name_map_robust = get_kegg_pathway_common_names_robust(unique_pathway_ids_in_df, chunk_size=10)

        def map_common_name_robust(pathway_name_entry):
            if pd.isna(pathway_name_entry):
                return "N/A"
            path_id = str(pathway_name_entry).strip()
            if ' ' in path_id:
                path_id = path_id.split(' ')[0]
            return pathway_id_to_common_name_map_robust.get(path_id, "N/A - Name not found")

        final_output_table['Pathway_Common_Name'] = final_output_table['Pathway Name'].apply(map_common_name_robust)

        print("\n--- 'final_output_table' with Pathway Common Names (using robust fetching) ---")
        display_cols_final = ['Metabolite', 'Gene_Symbol', 'Pathway Name', 'Pathway_Common_Name', 
                              'PathwayScore', 'TF', 'TFActivity', 'kegg_enzyme', 'kegg_reactions']
        
        if 'Gene_Symbol' not in final_output_table.columns and 'Gene' in final_output_table.columns:
            display_cols_final[1] = 'Gene'
            
        existing_display_cols_final = [col for col in display_cols_final if col in final_output_table.columns]
        
        try:
            display(final_output_table[existing_display_cols_final].head(20))
        except NameError:
            print(final_output_table[existing_display_cols_final].head(20).to_string())
    else:
        print("❌ ERROR: 'Pathway Name' column not found in final_output_table.")
else:
    print("❌ ERROR: 'final_output_table' is not defined or is not a DataFrame.")

Querying KEGG for common names of 45 unique pathway IDs in chunks of 10...
  Querying chunk: ['hsa00030', 'hsa00040', 'hsa00053', 'hsa00061', 'hsa00140', 'hsa00220', 'hsa00230', 'hsa00240', 'hsa00260', 'hsa00270'] (URL: http://rest.kegg.jp/get/hsa00030+hsa00040+hsa00053+hsa00061+hsa00140+hsa00220+hsa00230+hsa00240+hsa00260+hsa00270)
  Querying chunk: ['hsa00280', 'hsa00310', 'hsa00311', 'hsa00330', 'hsa00350', 'hsa00360', 'hsa00380', 'hsa00470', 'hsa00480', 'hsa00521'] (URL: http://rest.kegg.jp/get/hsa00280+hsa00310+hsa00311+hsa00330+hsa00350+hsa00360+hsa00380+hsa00470+hsa00480+hsa00521)
  Querying chunk: ['hsa00561', 'hsa00562', 'hsa00564', 'hsa00565', 'hsa00590', 'hsa00591', 'hsa00592', 'hsa00650', 'hsa00670', 'hsa00730'] (URL: http://rest.kegg.jp/get/hsa00561+hsa00562+hsa00564+hsa00565+hsa00590+hsa00591+hsa00592+hsa00650+hsa00670+hsa00730)
  Querying chunk: ['hsa00750', 'hsa00760', 'hsa00780', 'hsa00790', 'hsa00830', 'hsa00860', 'hsa00950', 'hsa00965', 'hsa00980', 'hsa00982'] (URL: 

Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,PathwayScore,TF,TFActivity,kegg_enzyme,kegg_reactions
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),126.993134,E2F1,0.980517,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013..."
1,D-Erythronolactone,BDH1,hsa01100,Metabolic pathways - Homo sapiens (human),-141.148347,E2F1,0.980517,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013..."
2,Deoxyribose 5-phosphate,DERA,hsa00030,Pentose phosphate pathway - Homo sapiens (human),199.106574,TP53,1.779543,4.1.2.4,"['R02750', 'R02749', 'R01066']"
3,Deoxyribose 5-phosphate,DERA,hsa01100,Metabolic pathways - Homo sapiens (human),-141.148347,TP53,1.779543,4.1.2.4,"['R02750', 'R02749', 'R01066']"
4,Deoxyribose 5-phosphate,RBKS,hsa00030,Pentose phosphate pathway - Homo sapiens (human),199.106574,GATA2,-0.424943,2.7.1.15,"['R02750', 'R02749', 'R01066']"
5,Deoxyribose 5-phosphate,RBKS,hsa01100,Metabolic pathways - Homo sapiens (human),-141.148347,GATA2,-0.424943,2.7.1.15,"['R02750', 'R02749', 'R01066']"
6,Quinoline-4-carboxylic acid,AOX1,hsa00280,"Valine, leucine and isoleucine degradation - H...",-141.148347,STAT1,-0.43692,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017..."
7,Quinoline-4-carboxylic acid,AOX1,hsa00350,Tyrosine metabolism - Homo sapiens (human),-356.586762,STAT1,-0.43692,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017..."
8,Quinoline-4-carboxylic acid,AOX1,hsa00380,Tryptophan metabolism - Homo sapiens (human),383.739852,STAT1,-0.43692,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017..."
9,Quinoline-4-carboxylic acid,AOX1,hsa00750,Vitamin B6 metabolism - Homo sapiens (human),-356.586762,STAT1,-0.43692,1.2.3.1,"['R00366', 'R00372', 'R00635', 'R01340', 'R017..."


In [363]:
import pandas as pd
import requests
import time
import numpy as np # For np.nan if needed in Pathway_Common_Name or for fillna

# --- THIS SCRIPT ASSUMES 'final_output_table' IS ALREADY LOADED AND DEFINED ---
# It must have 'Pathway Name' and 'Pathway_Common_Name' columns.

# --- Step 1: Define the Robust Pathway Name Fetching Function ---
def fetch_kegg_pathway_name_final(pathway_id_hsa): # Expects a cleaned hsaID like "hsa01120"
    pathway_name_found = None
    url_hsa = f"http://rest.kegg.jp/get/{pathway_id_hsa}"
    # print(f"    DEBUG Querying (attempt 1): {url_hsa}") # Uncomment for verbose debugging
    try:
        time.sleep(0.2) 
        r = requests.get(url_hsa, timeout=15)
        # print(f"    DEBUG Status Code for {pathway_id_hsa}: {r.status_code}") # Uncomment for verbose debugging
        if r.status_code == 200 and r.text.strip():
            for line in r.text.splitlines():
                if line.startswith("NAME"): 
                    parts = line.split("NAME", 1) # Split only on the first occurrence of "NAME"
                    if len(parts) > 1:
                        pathway_name_found = parts[1].strip()
                        break 
        if pathway_name_found:
            # print(f"    DEBUG SUCCESS (using {pathway_id_hsa}): Found NAME: '{pathway_name_found}'") # Uncomment for verbose debugging
            return pathway_name_found
        
        # If hsaID query was successful (200) but no NAME field, don't try mapID unless it was a 404 specifically.
        # If it's a 404, then mapID is a good fallback.
        if r.status_code != 404 and r.status_code == 200 : # Successful query but no NAME field
             print(f"    Info (with {pathway_id_hsa}): Status 200 but no NAME field found. Not trying mapID for this.")
             return None
        elif r.status_code != 200 and r.status_code != 404 : # Other HTTP errors
             print(f"    Warning (with {pathway_id_hsa}): HTTP Status {r.status_code}. Response: {r.text[:100]}")
             return None 
        # If 404, proceed to mapID attempt below

    except requests.exceptions.RequestException as e:
        print(f"    Warning (with {pathway_id_hsa}): RequestException: {e}")
        # Fall through to try mapID

    # Fallback: Try with "map" prefix if hsaID query failed (e.g. 404) or returned no name AND wasn't a non-404 error
    pathway_digits = pathway_id_hsa[3:]
    pathway_id_map = f"map{pathway_digits}"
    url_map = f"http://rest.kegg.jp/get/{pathway_id_map}"
    # print(f"    DEBUG Querying (attempt 2 - fallback): {url_map}") # Uncomment for verbose debugging
    try:
        time.sleep(0.2)
        r_map = requests.get(url_map, timeout=15)
        # print(f"    DEBUG Status Code for {pathway_id_map}: {r_map.status_code}") # Uncomment for verbose debugging
        r_map.raise_for_status() 
        if r_map.text.strip():
            for line in r_map.text.splitlines():
                if line.startswith("NAME"):
                    pathway_name_found = line.split("NAME", 1)[1].strip()
                    break
        if pathway_name_found:
            # print(f"    DEBUG SUCCESS (using {pathway_id_map}): Found NAME: '{pathway_name_found}'") # Uncomment for verbose debugging
            return pathway_name_found
    except requests.exceptions.RequestException as e:
        print(f"    Warning (with {pathway_id_map}): RequestException: {e}")
    # print(f"    DEBUG: Ultimately, no NAME field found for {pathway_id_hsa} (also tried as {pathway_id_map}).") # Uncomment for verbose debugging
    return None


# --- Step 2: Check prerequisites and identify rows needing update in final_output_table ---
if 'final_output_table' not in locals() or not isinstance(final_output_table, pd.DataFrame):
    print("❌ ERROR: DataFrame 'final_output_table' is not defined or is not a pandas DataFrame.")
    print("         Please ensure your main DataFrame is loaded and named 'final_output_table'.")
elif not ('Pathway_Common_Name' in final_output_table.columns and 'Pathway Name' in final_output_table.columns):
    print("❌ ERROR: 'final_output_table' is missing 'Pathway Name' or 'Pathway_Common_Name' column.")
else:
    print(f"✅ 'final_output_table' found (Shape: {final_output_table.shape}). Proceeding to update Pathway Common Names.")
    print("\n--- Re-fetching Missing Pathway Common Names for 'final_output_table' ---")

    missing_name_placeholders = [
        'N/A - Name not found', 'N/A - Still not found', 'None_Returned',
        'Invalid ID for lookup', 'N/A - KEGG NAME not found after re-fetch',
        'Invalid ID (format error)', 'Invalid ID for lookup', 
        'N/A - KEGG NAME not found', # Common default
        str(np.nan) # Catch string 'nan'
    ]
    
    # Ensure Pathway_Common_Name is string for .isin() and handle actual NaNs
    # Create a temporary column for mask creation to avoid issues with mixed types if original had NaNs
    temp_common_name_col = final_output_table['Pathway_Common_Name'].fillna("NAN_PLACEHOLDER_FOR_MASK").astype(str)
    
    mask_needs_refetch = temp_common_name_col.isin(missing_name_placeholders) | \
                         (temp_common_name_col == "NAN_PLACEHOLDER_FOR_MASK") | \
                         (temp_common_name_col.str.lower() == 'nan')
    
    rows_to_refetch_fot = final_output_table[mask_needs_refetch]
    
    if rows_to_refetch_fot.empty:
        print("✅ No rows found in 'final_output_table' currently needing pathway common name re-fetch.")
    else:
        pathway_ids_to_query_fot = rows_to_refetch_fot['Pathway Name'].dropna().astype(str).unique().tolist()
        print(f"Found {len(pathway_ids_to_query_fot)} unique 'Pathway Name' entries in 'final_output_table' to re-fetch names for.")
        
        refetched_names_map_fot = {} 
        
        for pid_original_from_df_column_val in pathway_ids_to_query_fot:
            pid_cleaned_hsa = str(pid_original_from_df_column_val).strip()
            if ' ' in pid_cleaned_hsa: 
                pid_cleaned_hsa = pid_cleaned_hsa.split(' ')[0]
            
            fetched_name_for_this_id = "N/A - KEGG NAME not found after robust re-fetch" 
            
            if (pid_cleaned_hsa.startswith("hsa") and len(pid_cleaned_hsa) == 8 and pid_cleaned_hsa[3:].isdigit()):
                # print(f"  Fetching name for cleaned ID: {pid_cleaned_hsa} (from original DF value: '{pid_original_from_df_column_val}')") # Optional
                retrieved_name = fetch_kegg_pathway_name_final(pid_cleaned_hsa)
                if retrieved_name:
                    fetched_name_for_this_id = retrieved_name
            else:
                # print(f"  Skipping invalid/unclean hsaID for API call: Original='{pid_original_from_df_column_val}', Cleaned='{pid_cleaned_hsa}'") # Optional
                fetched_name_for_this_id = "Invalid ID format in source"
            
            refetched_names_map_fot[pid_original_from_df_column_val] = fetched_name_for_this_id

        print("\nUpdating 'Pathway_Common_Name' in 'final_output_table'...")
        update_count_fot = 0
        
        for index_to_update, row_data in rows_to_refetch_fot.iterrows():
            original_path_name_in_row = row_data['Pathway Name'] 
            newly_fetched_name = refetched_names_map_fot.get(original_path_name_in_row)
            
            if newly_fetched_name: 
                is_valid_new_name = (newly_fetched_name not in missing_name_placeholders and
                                     "Invalid ID" not in newly_fetched_name and
                                     "not found" not in newly_fetched_name.lower())
                
                final_output_table.loc[index_to_update, 'Pathway_Common_Name'] = newly_fetched_name
                if is_valid_new_name:
                    update_count_fot += 1
        
        print(f"✅ Attempted to update 'Pathway_Common_Name'. {update_count_fot} rows potentially received new valid common names.")

        print("\n--- 'final_output_table' (rows that needed re-fetch) with Updated Pathway Common Names ---")
        display_cols = ['Metabolite', 'Gene_Symbol', 'Pathway Name', 'Pathway_Common_Name', 'PathwayScore', 'TF', 'TFActivity']
        
        # Adjust Gene_Symbol if needed
        actual_gene_column = 'Gene_Symbol' # Default
        if 'Gene_Symbol' not in final_output_table.columns and 'Gene' in final_output_table.columns:
            actual_gene_column = 'Gene'
        
        # Rebuild display_cols with the actual gene column name
        display_cols = ['Metabolite', actual_gene_column, 'Pathway Name', 'Pathway_Common_Name', 
                        'PathwayScore', 'TF', 'TFActivity']
        
        existing_display_cols = [col for col in display_cols if col in final_output_table.columns]
        
        if not rows_to_refetch_fot.empty:
            try:
                display(final_output_table.loc[rows_to_refetch_fot.index, existing_display_cols].head(20))
            except NameError:
                print(final_output_table.loc[rows_to_refetch_fot.index, existing_display_cols].head(20).to_string())
        else:
            print("No rows were identified in final_output_table for re-fetching common names (based on current placeholders).")

print("\n--- Script Complete ---")

✅ 'final_output_table' found (Shape: (1089, 9)). Proceeding to update Pathway Common Names.

--- Re-fetching Missing Pathway Common Names for 'final_output_table' ---
Found 7 unique 'Pathway Name' entries in 'final_output_table' to re-fetch names for.

Updating 'Pathway_Common_Name' in 'final_output_table'...
✅ Attempted to update 'Pathway_Common_Name'. 144 rows potentially received new valid common names.

--- 'final_output_table' (rows that needed re-fetch) with Updated Pathway Common Names ---


Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,PathwayScore,TF,TFActivity
14,Quinoline-4-carboxylic acid,AOX1,hsa01120,Microbial metabolism in diverse environments,383.739852,STAT1,-0.43692
16,Quinoline-4-carboxylic acid,DAO,hsa00311,Penicillin and cephalosporin biosynthesis,383.739852,CTCF,0.809617
20,Quinoline-4-carboxylic acid,DAO,hsa01110,Biosynthesis of secondary metabolites,-141.148347,CTCF,0.809617
25,Methylcysteine,CBS,hsa01110,Biosynthesis of secondary metabolites,-141.148347,MYC,2.198834
30,Methylcysteine,CBS,hsa01110,Biosynthesis of secondary metabolites,-141.148347,MYC,2.198834
35,Asp-Arg,OTC,hsa01110,Biosynthesis of secondary metabolites,-141.148347,CEBPB,0.179903
57,LPC(16:1/0:0),PLA2G3,hsa01110,Biosynthesis of secondary metabolites,-141.148347,STAT1,-0.43692
58,LPC(18:3/0:0),PLA2G3,hsa01110,Biosynthesis of secondary metabolites,-141.148347,STAT1,-0.43692
59,LPI(16:2/0:0),PLA2G3,hsa01110,Biosynthesis of secondary metabolites,-141.148347,STAT1,-0.43692
78,LPC(16:1/0:0),PLA2G2E,hsa01110,Biosynthesis of secondary metabolites,-141.148347,HNF4A,-0.496992



--- Script Complete ---


In [364]:
final_output_table.to_csv("final_output_table.csv", index=False)

In [368]:
import pandas as pd

# Adjust the file path if 'final_output_table.cs' is not in your current working directory
file_path = "final_output_table.csv"

# Read the CSV file into a DataFrame
try:
    final_output_table = pd.read_csv(file_path)
    print("Successfully loaded 'final_output_table.cs'.")
    print("\nPreview of the DataFrame:")
    display(final_output_table.head())
except FileNotFoundError:
    print(f"File not found: {file_path}. Please check the path and filename.")
except pd.errors.EmptyDataError:
    print(f"The file {file_path} is empty.")
except pd.errors.ParserError as e:
    print(f"Parsing error while reading {file_path}: {e}")


Successfully loaded 'final_output_table.cs'.

Preview of the DataFrame:


Unnamed: 0,Metabolite,Gene_Symbol,TF,TFActivity,PathwayScore,Pathway Name,kegg_enzyme,kegg_reactions,Pathway_Common_Name
0,D-Erythronolactone,BDH1,E2F1,0.980517,126.993134,hsa00650,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...",Butanoate metabolism - Homo sapiens (human)
1,D-Erythronolactone,BDH1,E2F1,0.980517,-141.148347,hsa01100,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...",Metabolic pathways - Homo sapiens (human)
2,Deoxyribose 5-phosphate,DERA,TP53,1.779543,199.106574,hsa00030,4.1.2.4,"['R02750', 'R02749', 'R01066']",Pentose phosphate pathway - Homo sapiens (human)
3,Deoxyribose 5-phosphate,DERA,TP53,1.779543,-141.148347,hsa01100,4.1.2.4,"['R02750', 'R02749', 'R01066']",Metabolic pathways - Homo sapiens (human)
4,Deoxyribose 5-phosphate,RBKS,GATA2,-0.424943,199.106574,hsa00030,2.7.1.15,"['R02750', 'R02749', 'R01066']",Pentose phosphate pathway - Homo sapiens (human)


In [369]:
display(final_output_table)

Unnamed: 0,Metabolite,Gene_Symbol,TF,TFActivity,PathwayScore,Pathway Name,kegg_enzyme,kegg_reactions,Pathway_Common_Name
0,D-Erythronolactone,BDH1,E2F1,0.980517,126.993134,hsa00650,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...",Butanoate metabolism - Homo sapiens (human)
1,D-Erythronolactone,BDH1,E2F1,0.980517,-141.148347,hsa01100,1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...",Metabolic pathways - Homo sapiens (human)
2,Deoxyribose 5-phosphate,DERA,TP53,1.779543,199.106574,hsa00030,4.1.2.4,"['R02750', 'R02749', 'R01066']",Pentose phosphate pathway - Homo sapiens (human)
3,Deoxyribose 5-phosphate,DERA,TP53,1.779543,-141.148347,hsa01100,4.1.2.4,"['R02750', 'R02749', 'R01066']",Metabolic pathways - Homo sapiens (human)
4,Deoxyribose 5-phosphate,RBKS,GATA2,-0.424943,199.106574,hsa00030,2.7.1.15,"['R02750', 'R02749', 'R01066']",Pentose phosphate pathway - Homo sapiens (human)
...,...,...,...,...,...,...,...,...,...
1084,2'-Deoxyinosine-5'-monophosphate,NT5C,ETS1,-0.214890,-141.148347,hsa00760,3.1.3.5,"['R10235', 'R03531', 'R12958']",Nicotinate and nicotinamide metabolism - Homo ...
1085,2'-Deoxyinosine-5'-monophosphate,NT5C,ETS1,-0.214890,-141.148347,hsa01100,3.1.3.5,"['R10235', 'R03531', 'R12958']",Metabolic pathways - Homo sapiens (human)
1086,2'-Deoxyinosine-5'-monophosphate,NT5C,ETS1,-0.214890,-141.148347,hsa01110,3.1.3.5,"['R10235', 'R03531', 'R12958']",Biosynthesis of secondary metabolites
1087,2'-Deoxyinosine-5'-monophosphate,NUDT16,HNF4A,-0.496992,383.739852,hsa00230,3.6.1.64,"['R10235', 'R03531', 'R12958']",Purine metabolism - Homo sapiens (human)


In [370]:
# 1) Get every non‐NA metabolite and find the unique values
unique_mets = final_output_table["Metabolite"].dropna().unique()

# 2) See how many unique metabolites there are
print("Number of unique metabolites:", len(unique_mets))

# 3) Print each unique metabolite, one per line
print("Unique metabolites:")
for m in unique_mets:
    print(m)


Number of unique metabolites: 18
Unique metabolites:
D-Erythronolactone
Deoxyribose 5-phosphate
Quinoline-4-carboxylic acid
Methylcysteine
Asp-Arg
LPI(16:2/0:0)
LPC(16:1/0:0)
LPC(18:3/0:0)
5'-Deoxy-5'-(Methylthio) Adenosine
Thiamine Monophosphate
Cytarabine
Carnitine C7:DC
LPC(13:0/0:0)
LPE(17:1/0:0)
N(Alpha)-Acetyl-Epsilon-(2-Propenal)Lysine
17β-Estradiol
Cyclo(Phe-Glu)
2'-Deoxyinosine-5'-monophosphate


In [371]:
import pandas as pd
import numpy as np
import ast # For safely evaluating string representations of lists

# This script assumes 'final_output_table' (or 'final_table') is your DataFrame
# that has been populated with 'PathwayScore', 'TF', 'TFActivity', AND 'Pathway_Common_Name'

# --- Safety Check for the input DataFrame ---
df_source_name = None
if 'final_output_table' in locals() and isinstance(final_output_table, pd.DataFrame):
    print("✅ 'final_output_table' found. Proceeding with categorization.")
    df_to_categorize = final_output_table.copy()
    df_source_name = 'final_output_table'
elif 'final_table' in locals() and isinstance(final_table, pd.DataFrame):
    print("✅ 'final_table' found. Proceeding with categorization (using 'final_table').")
    df_to_categorize = final_table.copy()
    df_source_name = 'final_table'
else:
    print("❌ ERROR: Neither 'final_output_table' nor 'final_table' DataFrame found.")
    print("         Please ensure it's loaded and populated from the previous script.")
    # For the script to run if above fails in a test, create a dummy df_to_categorize
    # In your actual run, this else block should not be hit if previous steps were successful.
    df_to_categorize = pd.DataFrame({
        'Metabolite': ['M1', 'M2', 'M3', 'M4', 'M5'],
        'Gene_Symbol': ['G1', 'G2', 'G3', 'G4', 'G5'],
        'Pathway Name': ['P1', 'P2', 'P3', 'P4', 'P5'],
        'Pathway_Common_Name': ['Name1', 'Name2', 'Name3', 'Name4', 'Name5'], # Added for dummy
        'kegg_enzyme': ['E1', 'E2', 'E3', 'E4', 'E5'],
        'kegg_reactions': [['R1'], ['R2'], ['R3'], ['R4'], ['R5']],
        'PathwayScore': [1.0, np.nan, 0.5, np.nan, np.nan],
        'TF': ['TF1', 'TF2', '', 'TF4', ''],
        'TFActivity': [0.5, np.nan, np.nan, 0.3, np.nan]
    })
    df_source_name = 'dummy_df_to_categorize'
    print(f"⚠️ Using a dummy DataFrame named '{df_source_name}' for demonstration.")

# Ensure the 'Pathway_Common_Name' column exists from the previous step
if 'Pathway_Common_Name' not in df_to_categorize.columns:
    print(f"⚠️ WARNING: 'Pathway_Common_Name' column not found in '{df_source_name}'.")
    print("         It will be missing from the categorized tables.")
    # Optionally, create it with default values if it's absolutely required later
    # df_to_categorize['Pathway_Common_Name'] = "N/A - Not Found"


# --- Step 1: Create boolean indicator columns for non-missing scores/TFs ---
print("\nStep 1: Creating indicator columns for score completeness...")

if 'PathwayScore' in df_to_categorize.columns:
    df_to_categorize['has_PathwayScore'] = df_to_categorize['PathwayScore'].notna()
else:
    print("⚠️ 'PathwayScore' column missing. Cannot create 'has_PathwayScore'.")
    df_to_categorize['has_PathwayScore'] = False 

if 'TF' in df_to_categorize.columns:
    df_to_categorize['has_TF'] = df_to_categorize['TF'].apply(lambda x: isinstance(x, str) and x != "")
else:
    print("⚠️ 'TF' column missing. Cannot create 'has_TF'.")
    df_to_categorize['has_TF'] = False

if 'TFActivity' in df_to_categorize.columns:
    df_to_categorize['has_TFActivity'] = df_to_categorize['TFActivity'].notna()
else:
    print("⚠️ 'TFActivity' column missing. Cannot create 'has_TFActivity'.")
    df_to_categorize['has_TFActivity'] = False

# --- Step 2: Count the number of valid scores/annotations for each row ---
print("\nStep 2: Counting number of valid scores/annotations per row...")
score_indicator_cols = []
if 'has_PathwayScore' in df_to_categorize.columns: score_indicator_cols.append('has_PathwayScore')
if 'has_TF' in df_to_categorize.columns: score_indicator_cols.append('has_TF')
if 'has_TFActivity' in df_to_categorize.columns: score_indicator_cols.append('has_TFActivity')

if score_indicator_cols and all(col in df_to_categorize.columns for col in score_indicator_cols) : 
    df_to_categorize['num_valid_annotations'] = df_to_categorize[score_indicator_cols].sum(axis=1)
    print("✅ 'num_valid_annotations' column added.")
else:
    print("❌ ERROR: Could not create 'num_valid_annotations' as key indicator columns are missing.")
    # To allow script to proceed if this step fails, we'll define it with zeros
    # but this indicates an issue with the input df_to_categorize structure
    if 'num_valid_annotations' not in df_to_categorize.columns:
         df_to_categorize['num_valid_annotations'] = 0


# --- Step 3: Filter and create the new DataFrames ---
print("\nStep 3: Creating categorized DataFrames...")

actual_gene_column_name = None
if 'Gene' in df_to_categorize.columns:
    actual_gene_column_name = 'Gene'
elif 'Gene_Symbol' in df_to_categorize.columns:
    actual_gene_column_name = 'Gene_Symbol'
else:
    actual_gene_column_name = 'Gene_Symbol' 
    if actual_gene_column_name not in df_to_categorize.columns:
         print(f"⚠️ WARNING: Gene column ('{actual_gene_column_name}') not found.")

# *** MODIFIED constant_cols TO INCLUDE 'Pathway_Common_Name' ***
constant_cols = ['Metabolite', actual_gene_column_name, 
                 'Pathway Name', 'Pathway_Common_Name', # Added Pathway_Common_Name here
                 'kegg_enzyme', 'kegg_reactions']
score_cols = ['PathwayScore', 'TF', 'TFActivity']

# Select only columns that actually exist in df_to_categorize to build final_display_cols
final_display_cols = [col for col in constant_cols if col in df_to_categorize.columns] + \
                     [col for col in score_cols if col in df_to_categorize.columns]

# Ensure 'num_valid_annotations' exists before filtering
if 'num_valid_annotations' in df_to_categorize.columns:
    df_3_of_3 = df_to_categorize[df_to_categorize['num_valid_annotations'] == 3][final_display_cols].copy()
    df_2_of_3 = df_to_categorize[df_to_categorize['num_valid_annotations'] == 2][final_display_cols].copy()
    df_1_of_3 = df_to_categorize[df_to_categorize['num_valid_annotations'] == 1][final_display_cols].copy()
    df_0_of_3 = df_to_categorize[df_to_categorize['num_valid_annotations'] == 0][final_display_cols].copy()

    print("✅ Categorized DataFrames created.")

    # --- Step 4: Display the shapes and heads of the new DataFrames ---
    print("\n\n--- Compounds with 3/3 Valid Annotations (PathwayScore, TF, AND TFActivity) ---")
    print(f"Shape: {df_3_of_3.shape}")
    if not df_3_of_3.empty:
        try:
            display(df_3_of_3.head())
        except NameError:
            print(df_3_of_3.head().to_string())

    print("\n\n--- Compounds with Exactly 2/3 Valid Annotations ---")
    print(f"Shape: {df_2_of_3.shape}")
    if not df_2_of_3.empty:
        try:
            display(df_2_of_3.head())
        except NameError:
            print(df_2_of_3.head().to_string())

    print("\n\n--- Compounds with Exactly 1/3 Valid Annotations ---")
    print(f"Shape: {df_1_of_3.shape}")
    if not df_1_of_3.empty:
        try:
            display(df_1_of_3.head())
        except NameError:
            print(df_1_of_3.head().to_string())

    print("\n\n--- Compounds with 0/3 Valid Annotations (All scores/TF are NaN/empty) ---")
    print(f"Shape: {df_0_of_3.shape}")
    if not df_0_of_3.empty:
        try:
            display(df_0_of_3.head())
        except NameError:
            print(df_0_of_3.head().to_string())
else:
    print("❌ Could not perform categorization because 'num_valid_annotations' column was not created (likely due to missing score/TF columns).")

# The DataFrames df_3_of_3, df_2_of_3, df_1_of_3, df_0_of_3 are now available.
# Helper columns are in 'df_to_categorize'.

✅ 'final_output_table' found. Proceeding with categorization.

Step 1: Creating indicator columns for score completeness...

Step 2: Counting number of valid scores/annotations per row...
✅ 'num_valid_annotations' column added.

Step 3: Creating categorized DataFrames...
✅ Categorized DataFrames created.


--- Compounds with 3/3 Valid Annotations (PathwayScore, TF, AND TFActivity) ---
Shape: (1048, 9)


Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...",126.993134,E2F1,0.980517
1,D-Erythronolactone,BDH1,hsa01100,Metabolic pathways - Homo sapiens (human),1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...",-141.148347,E2F1,0.980517
2,Deoxyribose 5-phosphate,DERA,hsa00030,Pentose phosphate pathway - Homo sapiens (human),4.1.2.4,"['R02750', 'R02749', 'R01066']",199.106574,TP53,1.779543
3,Deoxyribose 5-phosphate,DERA,hsa01100,Metabolic pathways - Homo sapiens (human),4.1.2.4,"['R02750', 'R02749', 'R01066']",-141.148347,TP53,1.779543
4,Deoxyribose 5-phosphate,RBKS,hsa00030,Pentose phosphate pathway - Homo sapiens (human),2.7.1.15,"['R02750', 'R02749', 'R01066']",199.106574,GATA2,-0.424943




--- Compounds with Exactly 2/3 Valid Annotations ---
Shape: (14, 9)


Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity
540,LPC(16:1/0:0),LPCAT1,hsa00564,Glycerophospholipid metabolism - Homo sapiens ...,2.3.1.23,"['R07859', 'R01318', 'R07291', 'R02746', 'R013...",-141.148347,ZNF263,
541,LPC(18:3/0:0),LPCAT1,hsa00564,Glycerophospholipid metabolism - Homo sapiens ...,2.3.1.23,"['R07859', 'R01318', 'R07291', 'R02746', 'R013...",-141.148347,ZNF263,
777,LPC(16:1/0:0),LPCAT1,hsa00564,Glycerophospholipid metabolism - Homo sapiens ...,2.3.1.23,"['R07859', 'R01318', 'R07291', 'R02746', 'R013...",-141.148347,ZNF263,
778,LPC(18:3/0:0),LPCAT1,hsa00564,Glycerophospholipid metabolism - Homo sapiens ...,2.3.1.23,"['R07859', 'R01318', 'R07291', 'R02746', 'R013...",-141.148347,ZNF263,
845,17β-Estradiol,UGT1A1,hsa00040,Pentose and glucuronate interconversions - Hom...,2.4.1.17,"['R00535', 'R03090', 'R03091', 'R02353', 'R030...",591.4516,NR1I3,




--- Compounds with Exactly 1/3 Valid Annotations ---
Shape: (27, 9)


Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity
805,17β-Estradiol,UGT2A2,hsa00040,Pentose and glucuronate interconversions - Hom...,2.4.1.17,"['R00535', 'R03090', 'R03091', 'R02353', 'R030...",591.4516,,
806,17β-Estradiol,UGT2A2,hsa00053,Ascorbate and aldarate metabolism - Homo sapie...,2.4.1.17,"['R00535', 'R03090', 'R03091', 'R02353', 'R030...",591.4516,,
807,17β-Estradiol,UGT2A2,hsa00140,Steroid hormone biosynthesis - Homo sapiens (h...,2.4.1.17,"['R00535', 'R03090', 'R03091', 'R02353', 'R030...",-141.148347,,
808,17β-Estradiol,UGT2A2,hsa00830,Retinol metabolism - Homo sapiens (human),2.4.1.17,"['R00535', 'R03090', 'R03091', 'R02353', 'R030...",591.4516,,
809,17β-Estradiol,UGT2A2,hsa00860,Porphyrin metabolism - Homo sapiens (human),2.4.1.17,"['R00535', 'R03090', 'R03091', 'R02353', 'R030...",110.769497,,




--- Compounds with 0/3 Valid Annotations (All scores/TF are NaN/empty) ---
Shape: (0, 9)


In [372]:
display(df_3_of_3)

Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...",126.993134,E2F1,0.980517
1,D-Erythronolactone,BDH1,hsa01100,Metabolic pathways - Homo sapiens (human),1.1.1.30,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...",-141.148347,E2F1,0.980517
2,Deoxyribose 5-phosphate,DERA,hsa00030,Pentose phosphate pathway - Homo sapiens (human),4.1.2.4,"['R02750', 'R02749', 'R01066']",199.106574,TP53,1.779543
3,Deoxyribose 5-phosphate,DERA,hsa01100,Metabolic pathways - Homo sapiens (human),4.1.2.4,"['R02750', 'R02749', 'R01066']",-141.148347,TP53,1.779543
4,Deoxyribose 5-phosphate,RBKS,hsa00030,Pentose phosphate pathway - Homo sapiens (human),2.7.1.15,"['R02750', 'R02749', 'R01066']",199.106574,GATA2,-0.424943
...,...,...,...,...,...,...,...,...,...
1084,2'-Deoxyinosine-5'-monophosphate,NT5C,hsa00760,Nicotinate and nicotinamide metabolism - Homo ...,3.1.3.5,"['R10235', 'R03531', 'R12958']",-141.148347,ETS1,-0.214890
1085,2'-Deoxyinosine-5'-monophosphate,NT5C,hsa01100,Metabolic pathways - Homo sapiens (human),3.1.3.5,"['R10235', 'R03531', 'R12958']",-141.148347,ETS1,-0.214890
1086,2'-Deoxyinosine-5'-monophosphate,NT5C,hsa01110,Biosynthesis of secondary metabolites,3.1.3.5,"['R10235', 'R03531', 'R12958']",-141.148347,ETS1,-0.214890
1087,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa00230,Purine metabolism - Homo sapiens (human),3.6.1.64,"['R10235', 'R03531', 'R12958']",383.739852,HNF4A,-0.496992


In [373]:
# Assuming df_3_of_3 is already defined and has a column named 'Metabolite'

# 1. Extract unique metabolite names (ignoring NaNs)
unique_metabs = df_3_of_3['Metabolite'].dropna().unique()

# 2. (Optional) Sort them for readability
unique_metabs_sorted = sorted(unique_metabs)

# 3. Print the unique names and the total count
print("Unique metabolites in df_3_of_3['Metabolite']:")
for m in unique_metabs_sorted:
    print(m)

print(f"\nTotal number of unique metabolites: {len(unique_metabs_sorted)}")


Unique metabolites in df_3_of_3['Metabolite']:
17β-Estradiol
2'-Deoxyinosine-5'-monophosphate
5'-Deoxy-5'-(Methylthio) Adenosine
Asp-Arg
Carnitine C7:DC
Cyclo(Phe-Glu)
Cytarabine
D-Erythronolactone
Deoxyribose 5-phosphate
LPC(13:0/0:0)
LPC(16:1/0:0)
LPC(18:3/0:0)
LPE(17:1/0:0)
LPI(16:2/0:0)
Methylcysteine
N(Alpha)-Acetyl-Epsilon-(2-Propenal)Lysine
Quinoline-4-carboxylic acid
Thiamine Monophosphate

Total number of unique metabolites: 18


In [311]:
import pandas as pd
import ast

# --- Step 0: (Optional) Dummy‐Data Creation for Testing ---
# Remove or comment out this block if your real DataFrames already exist.

if 'df_3_of_3' not in locals() or not isinstance(df_3_of_3, pd.DataFrame):
    print("⚠️ 'df_3_of_3' not found. Creating a small dummy for demonstration.")
    df_3_of_3 = pd.DataFrame({
        'Metabolite': ["['MetaboliteA']", "['MetaboliteB']", "['MetaboliteC']"],
        'OtherData': [1, 2, 3]
    })

if 'df_2_of_3' not in locals() or not isinstance(df_2_of_3, pd.DataFrame):
    print("⚠️ 'df_2_of_3' not found. Creating a small dummy for demonstration.")
    df_2_of_3 = pd.DataFrame({
        'Metabolite': ["['MetaboliteB']", "['MetaboliteD']", "['MetaboliteE']"],
        'OtherData': [4, 5, 6]
    })

if 'df_1_of_3' not in locals() or not isinstance(df_1_of_3, pd.DataFrame):
    print("⚠️ 'df_1_of_3' not found. Creating a small dummy for demonstration.")
    df_1_of_3 = pd.DataFrame({
        'Metabolite': ["['MetaboliteE']", "['MetaboliteF']", "['MetaboliteG']"],
        'OtherData': [7, 8, 9]
    })
# --- End of Dummy Data Block ---


print("\n--- Identifying compounds in df_1_of_3 or df_2_of_3 but NOT in df_3_of_3 ---\n")

# --- Step 1: Define a helper to clean entries like "['XYZ']" → "XYZ" ---
def clean_metabolite_column_entry_for_set(entry):
    """
    Take a raw cell from the 'Metabolite' column, which might be:
      - a Python list in string form, e.g. "['MetaboliteA']"
      - an actual Python list like ['MetaboliteA']
      - a standalone string "MetaboliteA"
      - NaN or something else
    Return the first element if it's a list, or the string itself if not,
    or None if it can't be parsed.
    """
    if pd.isna(entry):
        return None

    s_entry = str(entry).strip()

    # 1a) Try to literal_eval it (handles "['A']" → ['A'])
    try:
        evaluated = ast.literal_eval(s_entry)
        # If that returns a list/tuple, take its first element (if it exists)
        if isinstance(evaluated, (list, tuple)):
            if len(evaluated) > 0:
                return str(evaluated[0]).strip()
            return None
        # If literal_eval returned a single string or number, just cast to str
        return str(evaluated).strip()
    except (ValueError, SyntaxError, TypeError):
        # 1b) If literal_eval fails, manually strip off brackets/quotes if present
        cleaned = s_entry
        if cleaned.startswith("['") and cleaned.endswith("']"):
            cleaned = cleaned[2:-2]
        elif cleaned.startswith("[") and cleaned.endswith("]"):
            cleaned = cleaned[1:-1]
        return cleaned.strip("'\" ")

# --- Step 2: Verify that each DataFrame exists and has a 'Metabolite' column ---
dfs_to_check = {
    "df_3_of_3": df_3_of_3,
    "df_2_of_3": df_2_of_3,
    "df_1_of_3": df_1_of_3
}
all_valid = True

for name, df_obj in dfs_to_check.items():
    if not isinstance(df_obj, pd.DataFrame):
        print(f"❌ ERROR: '{name}' is not a DataFrame.")
        all_valid = False
    else:
        if 'Metabolite' not in df_obj.columns:
            print(f"❌ ERROR: '{name}' does not have a 'Metabolite' column.")
            all_valid = False
        elif df_obj.empty:
            print(f"ℹ️  NOTE: '{name}' is defined but empty. (No rows to process.)")

if not all_valid:
    print("\nAborting because one or more DataFrames are invalid or missing 'Metabolite'.")
else:
    # --- Step 3: Extract unique, cleaned metabolite names from each DataFrame ---
    print("✔️ All DataFrames valid. Proceeding with cleaning/extraction...\n")

    # Apply the cleaning helper, drop any None results, then convert to set
    metabolites_df3 = set(
        df_3_of_3['Metabolite']
        .dropna()
        .map(clean_metabolite_column_entry_for_set)
        .dropna()
    )
    metabolites_df2 = set(
        df_2_of_3['Metabolite']
        .dropna()
        .map(clean_metabolite_column_entry_for_set)
        .dropna()
    )
    metabolites_df1 = set(
        df_1_of_3['Metabolite']
        .dropna()
        .map(clean_metabolite_column_entry_for_set)
        .dropna()
    )

    print(f"  Unique cleaned metabolites in df_3_of_3: {len(metabolites_df3)}")
    print(f"  Unique cleaned metabolites in df_2_of_3: {len(metabolites_df2)}")
    print(f"  Unique cleaned metabolites in df_1_of_3: {len(metabolites_df1)}\n")

    # --- Step 4: Combine df_1_of_3 ∪ df_2_of_3, then subtract df_3_of_3 ---
    metabolites_1_or_2 = metabolites_df1.union(metabolites_df2)
    compounds_not_in_df3 = metabolites_1_or_2 - metabolites_df3

    print(f"🔍 Metabolites in (df_1_of_3 ∪ df_2_of_3) but not in df_3_of_3: {len(compounds_not_in_df3)} total\n")

    if compounds_not_in_df3:
        print("Here are the first 30 (if more exist):")
        for i, name in enumerate(sorted(compounds_not_in_df3)):
            print(f"  - {name}")
            if i == 29 and len(compounds_not_in_df3) > 30:
                print(f"    ... and {len(compounds_not_in_df3) - 30} more.")
                break
    else:
        print("All metabolites from df_1_of_3 and df_2_of_3 also appear in df_3_of_3.\n")

    print("\n--- Script Complete ---")



--- Identifying compounds in df_1_of_3 or df_2_of_3 but NOT in df_3_of_3 ---

✔️ All DataFrames valid. Proceeding with cleaning/extraction...

  Unique cleaned metabolites in df_3_of_3: 18
  Unique cleaned metabolites in df_2_of_3: 2
  Unique cleaned metabolites in df_1_of_3: 3

🔍 Metabolites in (df_1_of_3 ∪ df_2_of_3) but not in df_3_of_3: 0 total

All metabolites from df_1_of_3 and df_2_of_3 also appear in df_3_of_3.


--- Script Complete ---


In [5]:
import pandas as pd
ReacEnzyPath = pd.read_csv('ReacEnzyPath.csv')
ReacEnzyPath.head(24)

Unnamed: 0,input_compound_name,input_original_index,input_hmdb,input_pubchem_cid,input_cas,derived_kegg_compound_id,kegg_id_source_identifier,kegg_id_source_type,kegg_reactions,kegg_enzymes,kegg_pathways,KEGG_Hits_Count,SMILES
0,D-Erythronolactone,MADN0053,HMDB0000349,5325915,15667-21-7,,,,"['R00145', 'R00146', 'R00717', 'R01088', 'R013...","['1.1.1.29', '1.1.1.30', '1.4.1.9']","['hsa00260', 'hsa00280', 'hsa00650']",0,C1[C@H]([C@H](C(=O)O1)O)O
1,"1,6-anhydro-β-D-glucose",MADN0166,HMDB0000640,724705,498-07-7,,,,"['R01431', 'R01896', 'R07152', 'R09477', 'R116...","['1.1.1.307', '1.1.1.9', '1.1.3.41']",['hsa00040'],0,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O
2,Deoxyribose 5-phosphate,MADN0220,HMDB0001031,45934311,-,C00673,HMDB0001031,HMDB,"['R02750', 'R02749', 'R01066']","['4.1.2.4', '2.7.1.229', '2.7.1.15', '5.4.2.7']","['hsa00030', 'hsa01100']",3,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O
3,2-Aminobenzenesulfonic acid,MADN0329,-,6926,88-21-1,,,,"['R01842', 'R02354', 'R02355', 'R02356', 'R025...",['1.14.14.1'],['hsa00071'],0,C1=CC=C(C(=C1)N)S(=O)(=O)O
4,Quinoline-4-carboxylic acid,MADN0333,-,10243,486-74-8,,,,"['R00366', 'R00372', 'R00635', 'R01340', 'R017...","['1.2.3.1', '1.4.3.3', '2.6.1.4']","['hsa00260', 'hsa00280']",0,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O
5,cyclo(glu-glu),MADN0466,-,7408481,16691-00-2,,,,"['R00093', 'R00114', 'R00243', 'R00248', 'R002...","['1.4.1.13', '1.4.1.14', '3.5.3.7']","['hsa00250', 'hsa00330']",0,C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O
6,P-sulfanilic acid,MADN0498,-,8479,121-57-3,,,,"['R01842', 'R02354', 'R02355', 'R02356', 'R025...",['1.14.14.1'],['hsa00071'],0,C1=CC(=CC=C1N)S(=O)(=O)O
7,Methylcysteine,MADP0119,HMDB0002108,24417,1187-84-4,,,,"['R00891', 'R00894', 'R01289', 'R01290', 'R027...","['4.2.1.22', '4.3.2.9', '6.3.2.2']","['hsa00260', 'hsa00270', 'hsa00480']",0,CSC[C@@H](C(=O)O)N
8,Asp-Arg,MADP0548,-,16122509,-,,,,"['R00256', 'R00485', 'R01398', 'R01579', 'R019...","['2.1.3.3', '3.5.1.38', '3.5.3.7']","['hsa00220', 'hsa00330']",0,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N
9,LPI(16:2/0:0),MEDN1253,-,,-,,,,"['R01185', 'R01186', 'R01187', 'R01313', 'R013...","['1.13.11.33', '3.1.1.4', '3.1.3.25']","['hsa00521', 'hsa00564', 'hsa00590']",0,CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...


In [374]:
import pandas as pd

# (Assuming df_3_of_3 is already defined as a pandas DataFrame)

# Columns to check for NaNs in df_3_of_3
cols_to_check = [
    'Metabolite',
    'Gene_Symbol',
    'Pathway Name',
    'Pathway_Common_Name',
    'PathwayScore',
    'TF',
    'TFActivity'
]

# 1) Count NaNs in each specified column
nan_counts = df_3_of_3[cols_to_check].isna().sum()
print("NaN counts per column in df_3_of_3:")
print(nan_counts)

# 2) Show rows where any of these columns is NaN
mask_any_nan = df_3_of_3[cols_to_check].isna().any(axis=1)
rows_with_na = df_3_of_3.loc[mask_any_nan, cols_to_check]
print(f"\nRows in df_3_of_3 with at least one NaN in the specified columns (up to 10 shown):")
print(rows_with_na.head(10).to_string(index=False))

# 3) Detailed NaN flags for each row (first 10 rows)
nan_flags = df_3_of_3[cols_to_check].isna()
detailed_flags = pd.concat([df_3_of_3[['Metabolite', 'Gene_Symbol']], nan_flags], axis=1)
print(f"\nDetailed NaN flags for each row in df_3_of_3 (first 10 rows):")
print(detailed_flags.head(10).to_string(index=False))


NaN counts per column in df_3_of_3:
Metabolite             0
Gene_Symbol            0
Pathway Name           0
Pathway_Common_Name    0
PathwayScore           0
TF                     0
TFActivity             0
dtype: int64

Rows in df_3_of_3 with at least one NaN in the specified columns (up to 10 shown):
Empty DataFrame
Columns: [Metabolite, Gene_Symbol, Pathway Name, Pathway_Common_Name, PathwayScore, TF, TFActivity]
Index: []

Detailed NaN flags for each row in df_3_of_3 (first 10 rows):
                 Metabolite  Gene_Symbol  Metabolite  Gene_Symbol  Pathway Name  Pathway_Common_Name  PathwayScore    TF  TFActivity
         D-Erythronolactone         BDH1       False        False         False                False         False False       False
         D-Erythronolactone         BDH1       False        False         False                False         False False       False
    Deoxyribose 5-phosphate         DERA       False        False         False                False    

In [261]:
df_3_of_3.shape

(546, 9)

In [262]:
# 1) Select the columns you care about
cols = ['TF', 'PathwayScore', 'TFActivity']

# 2) Compute how many NaNs are in each
nan_counts = df_2_of_3[cols].isna().sum()

# 3) (Optional) Also compute percentage of the total
total = len(df_2_of_3)
nan_pct  = (nan_counts / total * 100).round(2)

# 4) Print out the results
print("❓ Missing (NaN) counts:")
for c in cols:
    print(f"  • {c:12s}: {nan_counts[c]:4d} / {total}  ({nan_pct[c]:5.2f}%)")


❓ Missing (NaN) counts:
  • TF          :    0 / 12  ( 0.00%)
  • PathwayScore:    0 / 12  ( 0.00%)
  • TFActivity  :   12 / 12  (100.00%)


In [375]:
import pandas as pd
import ast # For safely evaluating string representations of lists

# Assuming 'df_3_of_3_filtered' is already loaded and looks like your example.
# If 'kegg_reactions' is stored as strings that look like lists (e.g., "['R00048', 'R00053']"),
# you first need to convert them into actual list objects.

# --- Optional: Convert string representation of lists to actual lists ---
# Apply this step ONLY if 'kegg_reactions' are strings.
# If they are already lists, you can skip this.
def safe_literal_eval(val):
    try:
        # Safely evaluate string to a Python literal (e.g., list)
        return ast.literal_eval(val)
    except (ValueError, SyntaxError, TypeError):
        # Handle cases where val is not a string representation of a list,
        # or is already a list, or is NaN/None
        if isinstance(val, list):
            return val # Already a list
        return [] # Default to empty list if conversion fails or not a list-like string

# Check the type of the first element to see if conversion is needed
if not df_3_of_3.empty and isinstance(df_3_of_3['kegg_reactions'].iloc[0], str):
    print("Converting 'kegg_reactions' from string to list...")
    df_3_of_3['kegg_reactions'] = df_3_of_3['kegg_reactions'].apply(safe_literal_eval)
    print("Conversion complete.")
# --- End of optional conversion ---

# Explode the 'kegg_reactions' column
df_exploded = df_3_of_3.explode('kegg_reactions')

# Display the head of the new exploded DataFrame
print("\n--- Head of df_exploded (with 'kegg_reactions' exploded) ---")
try:
    display(df_exploded.head())
except NameError: # If 'display' is not defined (e.g., not in a Jupyter environment)
    print(df_exploded.head().to_string())

# You can also check the shape before and after to see the effect
print(f"\nShape of original DataFrame: {df_3_of_3.shape}")
print(f"Shape of exploded DataFrame: {df_exploded.shape}")

Converting 'kegg_reactions' from string to list...
Conversion complete.

--- Head of df_exploded (with 'kegg_reactions' exploded) ---


Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00145,126.993134,E2F1,0.980517
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00146,126.993134,E2F1,0.980517
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00717,126.993134,E2F1,0.980517
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01088,126.993134,E2F1,0.980517
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01361,126.993134,E2F1,0.980517



Shape of original DataFrame: (1048, 9)
Shape of exploded DataFrame: (11959, 9)


In [376]:
display(df_exploded)

Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00145,126.993134,E2F1,0.980517
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00146,126.993134,E2F1,0.980517
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00717,126.993134,E2F1,0.980517
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01088,126.993134,E2F1,0.980517
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01361,126.993134,E2F1,0.980517
...,...,...,...,...,...,...,...,...,...
1087,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa00230,Purine metabolism - Homo sapiens (human),3.6.1.64,R03531,383.739852,HNF4A,-0.496992
1087,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa00230,Purine metabolism - Homo sapiens (human),3.6.1.64,R12958,383.739852,HNF4A,-0.496992
1088,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa01100,Metabolic pathways - Homo sapiens (human),3.6.1.64,R10235,-141.148347,HNF4A,-0.496992
1088,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa01100,Metabolic pathways - Homo sapiens (human),3.6.1.64,R03531,-141.148347,HNF4A,-0.496992


In [377]:
import pandas as pd
import ast # For safely evaluating string representations of lists if necessary

# --- Helper function to safely convert string representations of lists to actual lists ---
def string_to_list_parser(series_column):
    """
    Converts a pandas Series containing string representations of lists,
    or single strings, or actual lists, into a Series of actual lists.
    Handles NaNs by converting them to empty lists.
    """
    processed_items = []
    for item in series_column:
        if isinstance(item, str):
            try:
                evaluated = ast.literal_eval(item)
                if isinstance(evaluated, list):
                    processed_items.append(evaluated)
                else:
                    # It was a string that evaluated to something not a list (e.g. "R00001")
                    processed_items.append([str(evaluated)]) # Wrap single items in a list
            except (ValueError, SyntaxError):
                # Plain string not evaluatable as a list (e.g. "R00001" or "MetaboliteName")
                processed_items.append([item])
        elif isinstance(item, list):
            processed_items.append(item) # Already a list
        elif pd.isna(item):
            processed_items.append([]) # Convert NaNs to empty list
        else:
            # For other types, wrap them in a list (e.g. numbers if they appear)
            processed_items.append([str(item)]) 
    return pd.Series(processed_items, index=series_column.index)

# --- 1. Preprocess df_exploded ---
# Create a copy to avoid modifying the original df_exploded
df_exploded_processed = df_exploded.copy()

# Your 'Metabolite' column in df_exploded (derived from df_3_of_3_filtered)
# is already a list (e.g., ['D-Erythronolactone']).
# We need to explode it so each metabolite has its own row.
# Ensure 'Metabolite' column is actually list type if it was read as string.
if 'Metabolite' in df_exploded_processed.columns and not df_exploded_processed.empty:
    if isinstance(df_exploded_processed['Metabolite'].iloc[0], str):
        print("Converting 'Metabolite' in df_exploded from string-list to list...")
        df_exploded_processed['Metabolite'] = string_to_list_parser(df_exploded_processed['Metabolite'])
    
    print("Exploding 'Metabolite' column in df_exploded...")
    df_exploded_std = df_exploded_processed.explode('Metabolite')
    # After exploding, 'Metabolite' column now contains single string metabolite names.
else:
    print("⚠️ 'Metabolite' column not in df_exploded or df_exploded is empty. Using df_exploded as is for structure.")
    df_exploded_std = df_exploded_processed.copy()


# --- 2. Preprocess ReacEnzyPath ---
rep_processed = ReacEnzyPath.copy()
print("\nPreprocessing ReacEnzyPath...")

# Columns in ReacEnzyPath that are lists and need to be exploded
list_cols_in_rep = ['kegg_reactions', 'kegg_enzymes', 'kegg_pathways']

for col_name in list_cols_in_rep:
    if col_name in rep_processed.columns:
        # Ensure the column contains actual list objects if they are strings
        if not rep_processed.empty and isinstance(rep_processed[col_name].iloc[0], str):
            print(f"Converting '{col_name}' in ReacEnzyPath from string-list to list...")
            rep_processed[col_name] = string_to_list_parser(rep_processed[col_name])
        
        print(f"Exploding '{col_name}' in ReacEnzyPath...")
        rep_processed = rep_processed.explode(col_name)
    else:
        print(f"⚠️ Column '{col_name}' not found in ReacEnzyPath. It will not be used for matching.")

# Rename columns in the processed ReacEnzyPath to match those in df_exploded_std for the merge.
# After exploding, 'kegg_enzymes' will hold single enzyme IDs, similar to 'kegg_enzyme' in df_exploded.
# Similarly for 'kegg_pathways' vs 'Pathway Name'.
rename_map_rep = {
    'input_compound_name': 'Metabolite',    # Matches df_exploded_std['Metabolite']
    'kegg_enzymes': 'kegg_enzyme',        # Matches df_exploded_std['kegg_enzyme']
    'kegg_pathways': 'Pathway Name'      # Matches df_exploded_std['Pathway Name']
    # 'kegg_reactions' column name is assumed to be the same in both after their respective explosions.
}
rep_to_merge = rep_processed.rename(columns=rename_map_rep)
print("Renamed columns in processed ReacEnzyPath for merging.")


# --- 3. Merge to find common entries ---
# Define the columns that must match between the two DataFrames
merge_on_columns = ['Metabolite', 'kegg_reactions', 'kegg_enzyme', 'Pathway Name']
print(f"\nAttempting to merge on columns: {merge_on_columns}")

# Ensure all merge_on_columns exist in both DataFrames before attempting the merge
all_keys_present_df_exploded = all(col in df_exploded_std.columns for col in merge_on_columns)
all_keys_present_rep_to_merge = all(col in rep_to_merge.columns for col in merge_on_columns)

if all_keys_present_df_exploded and all_keys_present_rep_to_merge:
    # Select only the key columns from rep_to_merge to act as a unique set of filters.
    # This prevents duplication of other columns from ReacEnzyPath into the final result.
    filter_keys_from_rep = rep_to_merge[merge_on_columns].drop_duplicates()

    # Perform an inner merge. This operation effectively filters df_exploded_std,
    # keeping only rows where the combination of merge_on_columns values
    # exists in filter_keys_from_rep. The columns will be those of df_exploded_std.
    df_common_entries = pd.merge(
        df_exploded_std,
        filter_keys_from_rep,
        on=merge_on_columns,
        how='inner'
    )
    print("✅ Merge successful.")
else:
    print("❌ ERROR: One or more merge key columns are missing from the DataFrames.")
    if not all_keys_present_df_exploded:
        print(f"   Missing in df_exploded_std: {[col for col in merge_on_columns if col not in df_exploded_std.columns]}")
    if not all_keys_present_rep_to_merge:
        print(f"   Missing in rep_to_merge (derived from ReacEnzyPath): {[col for col in merge_on_columns if col not in rep_to_merge.columns]}")
    df_common_entries = pd.DataFrame(columns=df_exploded_std.columns) # Return an empty df with original columns


# --- Display results ---
print("\n--- Head of the new DataFrame with common entries (df_common_entries) ---")
try:
    display(df_common_entries.head())
except NameError:
    print(df_common_entries.head().to_string())

print(f"\nShape of original df_exploded: {df_exploded.shape}")
if 'df_exploded_std' in locals():
    print(f"Shape of metabolite-exploded df_exploded_std: {df_exploded_std.shape}")
print(f"Shape of ReacEnzyPath (original): {ReacEnzyPath.shape}")
if 'rep_to_merge' in locals():
    print(f"Shape of fully exploded and renamed ReacEnzyPath (rep_to_merge): {rep_to_merge.shape}")
print(f"Shape of the new DataFrame (df_common_entries): {df_common_entries.shape}")


Converting 'Metabolite' in df_exploded from string-list to list...
Exploding 'Metabolite' column in df_exploded...

Preprocessing ReacEnzyPath...
Converting 'kegg_reactions' in ReacEnzyPath from string-list to list...
Exploding 'kegg_reactions' in ReacEnzyPath...
Converting 'kegg_enzymes' in ReacEnzyPath from string-list to list...
Exploding 'kegg_enzymes' in ReacEnzyPath...
Converting 'kegg_pathways' in ReacEnzyPath from string-list to list...
Exploding 'kegg_pathways' in ReacEnzyPath...
Renamed columns in processed ReacEnzyPath for merging.

Attempting to merge on columns: ['Metabolite', 'kegg_reactions', 'kegg_enzyme', 'Pathway Name']
✅ Merge successful.

--- Head of the new DataFrame with common entries (df_common_entries) ---


Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00145,126.993134,E2F1,0.980517
1,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00146,126.993134,E2F1,0.980517
2,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00717,126.993134,E2F1,0.980517
3,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01088,126.993134,E2F1,0.980517
4,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01361,126.993134,E2F1,0.980517



Shape of original df_exploded: (11959, 9)
Shape of metabolite-exploded df_exploded_std: (11959, 9)
Shape of ReacEnzyPath (original): (24, 13)
Shape of fully exploded and renamed ReacEnzyPath (rep_to_merge): (5614, 13)
Shape of the new DataFrame (df_common_entries): (2200, 9)


In [378]:
display(df_common_entries)

Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00145,126.993134,E2F1,0.980517
1,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00146,126.993134,E2F1,0.980517
2,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00717,126.993134,E2F1,0.980517
3,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01088,126.993134,E2F1,0.980517
4,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01361,126.993134,E2F1,0.980517
...,...,...,...,...,...,...,...,...,...
2195,2'-Deoxyinosine-5'-monophosphate,NT5C,hsa00230,Purine metabolism - Homo sapiens (human),3.1.3.5,R03531,383.739852,ETS1,-0.214890
2196,2'-Deoxyinosine-5'-monophosphate,NT5C,hsa00230,Purine metabolism - Homo sapiens (human),3.1.3.5,R12958,383.739852,ETS1,-0.214890
2197,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa00230,Purine metabolism - Homo sapiens (human),3.6.1.64,R10235,383.739852,HNF4A,-0.496992
2198,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa00230,Purine metabolism - Homo sapiens (human),3.6.1.64,R03531,383.739852,HNF4A,-0.496992


In [379]:
import pandas as pd
import numpy as np  # Usually imported alongside pandas

# Check for existence of df_exploded in the current namespace:
if 'df_exploded' not in locals():
    print("⚠️ 'df_exploded' not found. Creating a small sample DataFrame for demonstration.")
    # (In your real workflow, df_exploded should already exist, so you can skip this block.)
    data_sample = {
        'Metabolite': ['MetA', 'MetB', 'MetC', 'MetA', 'MetB'],
        'Gene_Symbol': ['Gene1', 'Gene2', None, 'Gene1', 'Gene3'],  # None will become NaN
        'PathwayScore': [10.5, np.nan, 8.0, 10.5, 9.1],              # np.nan is a NaN
        'TF': ['TF1', 'TF2', 'TF1', 'TF3', None],
        'Log2FC_GeneExpr': [1.5, -0.5, 1.2, np.nan, 0.8],
        'P_value_GeneExpr': [0.01, 0.05, 0.001, 0.02, np.nan]
    }
    df_exploded = pd.DataFrame(data_sample)
    print("Sample df_exploded created:")
    try:
        display(df_exploded)
    except NameError:
        print(df_exploded.to_string())
else:
    print("Using existing 'df_exploded'.")

# --- 1. Check if any NaN values exist in the entire DataFrame ---
any_nan_in_df = df_exploded.isnull().values.any()
print(f"\n1. Are there any NaN values in df_exploded? {any_nan_in_df}")

# --- 2. Count NaN values per column ---
nan_counts_per_column = df_exploded.isnull().sum()
print("\n2. Count of NaN values per column in df_exploded:")
print(nan_counts_per_column)

# Filter to show only columns that have at least one NaN
columns_with_nans = nan_counts_per_column[nan_counts_per_column > 0]
if not columns_with_nans.empty:
    print("\n   Columns in df_exploded that contain NaN values:")
    print(columns_with_nans)
else:
    print("\n   No columns in df_exploded contain NaN values.")

# --- 3. Count the total number of NaN values in the DataFrame ---
total_nan_count = df_exploded.isnull().sum().sum()
print(f"\n3. Total number of NaN values in df_exploded: {total_nan_count}")

# --- 4. Get a boolean DataFrame indicating the location of NaNs ---
print("\n4. Boolean DataFrame showing NaN locations (first 5 rows if large):")
nan_locations_df = df_exploded.isnull()
try:
    display(nan_locations_df.head())
except NameError:
    print(nan_locations_df.head().to_string())

# --- 5. List rows that contain any NaN values ---
rows_with_any_nan = df_exploded[df_exploded.isnull().any(axis=1)]
if not rows_with_any_nan.empty:
    print(f"\n5. Rows in df_exploded containing at least one NaN (showing first 5):")
    try:
        display(rows_with_any_nan.head())
    except NameError:
        print(rows_with_any_nan.head().to_string())
    print(f"   Total number of rows with at least one NaN: {len(rows_with_any_nan)}")
else:
    print("\n5. No rows in df_exploded contain any NaN values.")

# --- 6. Percentage of NaN values per column ---
nan_percentage_per_column = (df_exploded.isnull().sum() / len(df_exploded)) * 100
print("\n6. Percentage of NaN values per column in df_exploded:")
print(nan_percentage_per_column)


Using existing 'df_exploded'.

1. Are there any NaN values in df_exploded? False

2. Count of NaN values per column in df_exploded:
Metabolite             0
Gene_Symbol            0
Pathway Name           0
Pathway_Common_Name    0
kegg_enzyme            0
kegg_reactions         0
PathwayScore           0
TF                     0
TFActivity             0
dtype: int64

   No columns in df_exploded contain NaN values.

3. Total number of NaN values in df_exploded: 0

4. Boolean DataFrame showing NaN locations (first 5 rows if large):


Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity
0,False,False,False,False,False,False,False,False,False
0,False,False,False,False,False,False,False,False,False
0,False,False,False,False,False,False,False,False,False
0,False,False,False,False,False,False,False,False,False
0,False,False,False,False,False,False,False,False,False



5. No rows in df_exploded contain any NaN values.

6. Percentage of NaN values per column in df_exploded:
Metabolite             0.0
Gene_Symbol            0.0
Pathway Name           0.0
Pathway_Common_Name    0.0
kegg_enzyme            0.0
kegg_reactions         0.0
PathwayScore           0.0
TF                     0.0
TFActivity             0.0
dtype: float64


In [380]:
import pandas as pd

# Assume df_exploded, Table1_GeneCentric_summary,
# Table2_PathwayActivitySummary, and Table3_TFActivitySummary are pre-loaded.

# --- 1. Merge with Table1_GeneCentric_summary (Gene Expression) ---
if 'Table1_GeneCentric_summary' in locals():
    print("Merging df_exploded with Table1_GeneCentric_summary (Gene Expression)...")
    # Select only the new columns we want to add, plus the key
    table1_to_merge = Table1_GeneCentric_summary[['Gene_Symbol', 'Log2FC_GeneExpr', 'P_value_GeneExpr']].copy()
    
    df_exploded = pd.merge(
        df_exploded,
        table1_to_merge,
        on='Gene_Symbol',
        how='left'  # Keep all rows from df_exploded
    )
    print("Done merging gene expression columns.")
else:
    print("⚠️ Table1_GeneCentric_summary not found. Skipping merge.")

# --- 2. Merge with Table2_PathwayActivitySummary (PROGENy Pathway Activity) ---
if 'Table2_PathwayActivitySummary' in locals():
    print("\nMerging df_exploded with Table2_PathwayActivitySummary (PROGENy Pathway Activity)...")
    # Select and rename columns from Table2 for clarity and to prepare the merge key
    table2_to_merge = Table2_PathwayActivitySummary[[
        'PROGENy_Pathway',
        'Activity_Difference_ES',  # Key that matches df_exploded['PathwayScore']
        'P_value_Activity'
    ]].copy()
    table2_to_merge.rename(columns={'P_value_Activity': 'P_value_PROGENyActivity'}, inplace=True)
    
    df_exploded = pd.merge(
        df_exploded,
        table2_to_merge,
        left_on='PathwayScore',          # Column in df_exploded
        right_on='Activity_Difference_ES',  # Column in Table2
        how='left'
    )
    # Drop the redundant key column from Table2 if it was added
    if 'Activity_Difference_ES' in df_exploded.columns:
        df_exploded.drop(columns=['Activity_Difference_ES'], inplace=True)
    print("Done merging PROGENy pathway activity columns.")
else:
    print("⚠️ Table2_PathwayActivitySummary not found. Skipping merge.")

# --- 3. Merge with Table3_TFActivitySummary (TF Activity) ---
if 'Table3_TFActivitySummary' in locals():
    print("\nMerging df_exploded with Table3_TFActivitySummary (TF Activity)...")
    # Select columns and rename for clarity
    table3_to_merge = Table3_TFActivitySummary[[
        'TF_Symbol',
        'Mean_Tumor_Activity',
        'Mean_Normal_Activity',
        'P_value'  # TF activity p-value
    ]].copy()
    table3_to_merge.rename(columns={
        'Mean_Tumor_Activity': 'Mean_Tumor_TFActivity',
        'Mean_Normal_Activity': 'Mean_Normal_TFActivity',
        'P_value': 'P_value_TFActivity'
    }, inplace=True)
    
    df_exploded = pd.merge(
        df_exploded,
        table3_to_merge,
        left_on='TF',       # Column in df_exploded
        right_on='TF_Symbol',
        how='left'
    )
    # Drop the redundant key column from Table3 if it was added
    if 'TF_Symbol' in df_exploded.columns and 'TF' in df_exploded.columns and 'TF_Symbol' != 'TF':
        df_exploded.drop(columns=['TF_Symbol'], inplace=True)
    print("Done merging TF activity columns.")
else:
    print("⚠️ Table3_TFActivitySummary not found. Skipping merge.")

# --- Display the enriched DataFrame ---
print("\n--- Head of enriched df_exploded ---")
try:
    display(df_exploded.head())
except NameError:
    print(df_exploded.head().to_string())

print(f"\nShape of enriched df_exploded: {df_exploded.shape}")
print("\nColumns in enriched df_exploded:")
print(list(df_exploded.columns))


Merging df_exploded with Table1_GeneCentric_summary (Gene Expression)...
Done merging gene expression columns.

Merging df_exploded with Table2_PathwayActivitySummary (PROGENy Pathway Activity)...
Done merging PROGENy pathway activity columns.

Merging df_exploded with Table3_TFActivitySummary (TF Activity)...
Done merging TF activity columns.

--- Head of enriched df_exploded ---


Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity,Log2FC_GeneExpr,P_value_GeneExpr,PROGENy_Pathway,P_value_PROGENyActivity,Mean_Tumor_TFActivity,Mean_Normal_TFActivity,P_value_TFActivity
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00145,126.993134,E2F1,0.980517,-1.027547,1.224089e-11,P53,6.2e-05,3.985033,3.004516,4.3487100000000003e-29
1,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00146,126.993134,E2F1,0.980517,-1.027547,1.224089e-11,P53,6.2e-05,3.985033,3.004516,4.3487100000000003e-29
2,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00717,126.993134,E2F1,0.980517,-1.027547,1.224089e-11,P53,6.2e-05,3.985033,3.004516,4.3487100000000003e-29
3,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01088,126.993134,E2F1,0.980517,-1.027547,1.224089e-11,P53,6.2e-05,3.985033,3.004516,4.3487100000000003e-29
4,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01361,126.993134,E2F1,0.980517,-1.027547,1.224089e-11,P53,6.2e-05,3.985033,3.004516,4.3487100000000003e-29



Shape of enriched df_exploded: (11959, 16)

Columns in enriched df_exploded:
['Metabolite', 'Gene_Symbol', 'Pathway Name', 'Pathway_Common_Name', 'kegg_enzyme', 'kegg_reactions', 'PathwayScore', 'TF', 'TFActivity', 'Log2FC_GeneExpr', 'P_value_GeneExpr', 'PROGENy_Pathway', 'P_value_PROGENyActivity', 'Mean_Tumor_TFActivity', 'Mean_Normal_TFActivity', 'P_value_TFActivity']


In [381]:
import pandas as pd
import numpy as np

def count_special_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    For each column in df, count occurrences of:
      - NaN
      - numeric 0
      - string "0"
      - empty string ""
      - empty list []
    
    Returns a DataFrame with those counts per column.
    """
    summary = {}
    for col in df.columns:
        ser = df[col]

        # 1. Count NaNs
        num_nan = ser.isnull().sum()

        # 2. Count numeric 0 (only if dtype is numeric; but comparing to 0 will return False for non‐numeric)
        #    We also explicitly exclude NaNs so they are not double‐counted:
        num_zero_numeric = ((ser == 0) & ~ser.isnull()).sum()

        # 3. Count string "0"
        num_str_zero = ser.apply(lambda x: isinstance(x, str) and x == "0").sum()

        # 4. Count empty string ""
        num_empty_string = ser.apply(lambda x: isinstance(x, str) and x == "").sum()

        # 5. Count empty list [] (only if some cells might actually be lists)
        num_empty_list = ser.apply(lambda x: isinstance(x, list) and len(x) == 0).sum()

        summary[col] = {
            "NaN_count": num_nan,
            "Zero_numeric_count": num_zero_numeric,
            "String_'0'_count": num_str_zero,
            "Empty_string_count": num_empty_string,
            "Empty_list_count": num_empty_list
        }

    return pd.DataFrame(summary).T

# === Usage ===
# (Assuming df_exploded is already defined in your notebook)

special_counts = count_special_values(df_exploded)

print("Counts of special values per column:\n")
try:
    display(special_counts)
except NameError:
    print(special_counts.to_string())


Counts of special values per column:



Unnamed: 0,NaN_count,Zero_numeric_count,String_'0'_count,Empty_string_count,Empty_list_count
Metabolite,0,0,0,0,0
Gene_Symbol,0,0,0,0,0
Pathway Name,0,0,0,0,0
Pathway_Common_Name,0,0,0,0,0
kegg_enzyme,0,0,0,0,0
kegg_reactions,0,0,0,0,0
PathwayScore,0,0,0,0,0
TF,0,0,0,0,0
TFActivity,0,0,0,0,0
Log2FC_GeneExpr,200,0,0,0,0


In [382]:
import pandas as pd
import numpy as np
# Ensure scipy.stats is imported if you run the ttest part
# from scipy.stats import ttest_ind

# --- Assume these are loaded from your previous steps ---
# common_genes: list of unique gene symbols you are analyzing
# valid_samples: DataFrame with expression data, indexed by sample ID,
#                with a 'sample_type' column, and gene symbols as other columns.
# Table1_GeneCentric_summary: The DataFrame that contains Log2FC_GeneExpr and P_value_GeneExpr

print("--- Investigating NaNs in Gene Expression Stats ---")

if 'Table1_GeneCentric_summary' not in locals() or not isinstance(Table1_GeneCentric_summary, pd.DataFrame):
    print("❌ ERROR: 'Table1_GeneCentric_summary' not found. Please ensure it's loaded/generated.")
elif 'P_value_GeneExpr' not in Table1_GeneCentric_summary.columns or 'Gene_Symbol' not in Table1_GeneCentric_summary.columns:
    print("❌ ERROR: 'Table1_GeneCentric_summary' is missing 'P_value_GeneExpr' or 'Gene_Symbol' column.")
else:
    genes_with_nan_p_value = Table1_GeneCentric_summary[
        Table1_GeneCentric_summary['P_value_GeneExpr'].isna()
    ]['Gene_Symbol'].unique()

    print(f"\nFound {len(genes_with_nan_p_value)} genes with NaN P_value_GeneExpr.")
    
    if 'valid_samples' in locals() and isinstance(valid_samples, pd.DataFrame) and 'sample_type' in valid_samples.columns:
        print("Checking sample counts for these genes in 'valid_samples':")
        for gene in list(genes_with_nan_p_value)[:10]: # Check first 10
            if gene in valid_samples.columns:
                tumor_expr_count = valid_samples[valid_samples['sample_type'] == 'Primary Tumor'][gene].dropna().shape[0]
                normal_expr_count = valid_samples[valid_samples['sample_type'] == 'Solid Tissue Normal'][gene].dropna().shape[0]
                print(f"  - Gene: {gene}, Valid Tumor Samples: {tumor_expr_count}, Valid Normal Samples: {normal_expr_count}")
                if tumor_expr_count < 2 or normal_expr_count < 2:
                    print("    -> Insufficient samples for t-test, hence NaN P-value is expected.")
                mean_tumor = valid_samples[valid_samples['sample_type'] == 'Primary Tumor'][gene].dropna().mean()
                mean_normal = valid_samples[valid_samples['sample_type'] == 'Solid Tissue Normal'][gene].dropna().mean()
                if pd.isna(mean_tumor) or pd.isna(mean_normal):
                    print("    -> Mean tumor or normal expression is NaN, hence NaN Log2FC is expected.")
            else:
                print(f"  - Gene: {gene} - Not found as a column in 'valid_samples' (check gene ID consistency).")
    else:
        print("⚠️ 'valid_samples' DataFrame not found or 'sample_type' column missing. Cannot check sample counts.")

--- Investigating NaNs in Gene Expression Stats ---

Found 0 genes with NaN P_value_GeneExpr.
Checking sample counts for these genes in 'valid_samples':


In [383]:
import pandas as pd
import numpy as np
# Assume ast is imported if your 'Pathway Name' in df_exploded needs literal_eval initially

print("\n--- Investigating NaNs in PROGENy Pathway Details ---")

if 'df_exploded' not in locals() or not isinstance(df_exploded, pd.DataFrame):
    print("❌ ERROR: 'df_exploded' not found. Please ensure it's loaded/generated.")
elif not all(col in df_exploded.columns for col in ['Pathway Name', 'PROGENy_Pathway', 'P_value_PROGENyActivity']):
    print("❌ ERROR: 'df_exploded' is missing 'Pathway Name', 'PROGENy_Pathway', or 'P_value_PROGENyActivity' column.")
else:
    # Filter for rows where these are NaN
    nan_progeny_path_details = df_exploded[
        df_exploded['PROGENy_Pathway'].isna() | df_exploded['P_value_PROGENyActivity'].isna()
    ]
    print(f"Found {len(nan_progeny_path_details)} rows in df_exploded with NaN PROGENy_Pathway or NaN P_value_PROGENyActivity.")

    if not nan_progeny_path_details.empty:
        print("\nChecking unique KEGG Pathway Names from these rows against 'your_pathway_to_progeny_map':")
        
        unique_kegg_paths_with_nan_progeny = nan_progeny_path_details['Pathway Name'].astype(str).unique()
        
        available_progeny_in_table2_upper = set()
        progeny_pvalue_lookup = {}

        if 'Table2_PathwayActivitySummary' in locals() and \
           'PROGENy_Pathway' in Table2_PathwayActivitySummary.columns and \
           'P_value_Activity' in Table2_PathwayActivitySummary.columns:
            available_progeny_in_table2_upper = {str(p).upper() for p in Table2_PathwayActivitySummary['PROGENy_Pathway'].unique()}
            # Create lookup for p-values, ensuring uppercase keys for PROGENy names
            Table2_temp = Table2_PathwayActivitySummary.copy()
            Table2_temp['PROGENy_Pathway_UPPER'] = Table2_temp['PROGENy_Pathway'].astype(str).str.upper()
            progeny_pvalue_lookup = Table2_temp.set_index('PROGENy_Pathway_UPPER')['P_value_Activity'].to_dict()

        else:
            print("⚠️ WARNING: Table2_PathwayActivitySummary or required columns not found. Cannot fully diagnose PROGENy NaNs.")

        for kegg_path_name_full in list(unique_kegg_paths_with_nan_progeny)[:15]: # Check first 15
            kegg_path_id = str(kegg_path_name_full).strip()
            if ' ' in kegg_path_id: # Extract hsaID if full name is present
                kegg_path_id = kegg_path_id.split(' ')[0]
            
            print(f"\n  KEGG Pathway from df_exploded: '{kegg_path_name_full}' (cleaned ID: '{kegg_path_id}')")
            
            if not (kegg_path_id.startswith("hsa") and len(kegg_path_id) == 8 and kegg_path_id[3:].isdigit()):
                print(f"    -> This does not look like a standard hsaID. Check data source.")
                continue

            mapped_progeny_original_case = your_pathway_to_progeny_map.get(kegg_path_id)
            if mapped_progeny_original_case is None or not mapped_progeny_original_case: # Maps to None or empty string
                print(f"    -> Mapped to '{mapped_progeny_original_case}' in 'your_pathway_to_progeny_map'. This will result in NaN for PROGENy_Pathway and its P-value.")
            elif mapped_progeny_original_case:
                mapped_progeny_upper = str(mapped_progeny_original_case).upper()
                if mapped_progeny_upper not in available_progeny_in_table2_upper:
                    print(f"    -> Maps to PROGENy pathway '{mapped_progeny_original_case}' (as {mapped_progeny_upper}).")
                    print(f"       BUT '{mapped_progeny_upper}' IS NOT FOUND as a PROGENy_Pathway in Table2_PathwayActivitySummary (available: {sorted(list(available_progeny_in_table2_upper))[:5]}...).")
                    print(f"       Check spelling/casing in 'your_pathway_to_progeny_map' or ensure it's in Table2.")
                else:
                    # It maps to a valid PROGENy pathway in Table2, check if that pathway has a NaN P-value
                    p_value_for_mapped_progeny = progeny_pvalue_lookup.get(mapped_progeny_upper)
                    if pd.isna(p_value_for_mapped_progeny):
                        print(f"    -> Maps to PROGENy pathway '{mapped_progeny_original_case}' (as {mapped_progeny_upper}).")
                        print(f"       This PROGENy pathway IS IN Table2, but its 'P_value_Activity' IS NaN in Table2.")
                    else:
                        print(f"    -> Maps to PROGENy pathway '{mapped_progeny_original_case}' (P-value in Table2: {p_value_for_mapped_progeny}).")
                        print(f"       This case should NOT have NaN for P_value_PROGENyActivity if PROGENy_Pathway column was also populated. Check 'PROGENy_Pathway' column content for this row in df_exploded.")


--- Investigating NaNs in PROGENy Pathway Details ---
Found 329 rows in df_exploded with NaN PROGENy_Pathway or NaN P_value_PROGENyActivity.

Checking unique KEGG Pathway Names from these rows against 'your_pathway_to_progeny_map':

  KEGG Pathway from df_exploded: 'hsa00030' (cleaned ID: 'hsa00030')
    -> Maps to PROGENy pathway 'HYPOXIA' (P-value in Table2: 1.526125149054707e-07).
       This case should NOT have NaN for P_value_PROGENyActivity if PROGENy_Pathway column was also populated. Check 'PROGENy_Pathway' column content for this row in df_exploded.

  KEGG Pathway from df_exploded: 'hsa00350' (cleaned ID: 'hsa00350')
    -> Maps to PROGENy pathway 'TGFB' (P-value in Table2: 5.640181448724774e-05).
       This case should NOT have NaN for P_value_PROGENyActivity if PROGENy_Pathway column was also populated. Check 'PROGENy_Pathway' column content for this row in df_exploded.

  KEGG Pathway from df_exploded: 'hsa00750' (cleaned ID: 'hsa00750')
    -> Maps to PROGENy pathway '

In [384]:
import pandas as pd
import numpy as np
# Ensure your_pathway_to_progeny_map and Table2_PathwayActivitySummary are loaded and correct

print("--- Attempting to re-populate PROGENy details in df_exploded ---")

# Ensure prerequisite variables exist
if 'df_exploded' not in locals() or not isinstance(df_exploded, pd.DataFrame):
    print("❌ ERROR: 'df_exploded' is not defined.")
elif 'your_pathway_to_progeny_map' not in locals() or not isinstance(your_pathway_to_progeny_map, dict):
    print("❌ ERROR: 'your_pathway_to_progeny_map' is not defined.")
elif 'Table2_PathwayActivitySummary' not in locals() or not isinstance(Table2_PathwayActivitySummary, pd.DataFrame):
    print("❌ ERROR: 'Table2_PathwayActivitySummary' is not defined.")
elif not all(col in Table2_PathwayActivitySummary.columns for col in ['PROGENy_Pathway', 'P_value_Activity']):
    print("❌ ERROR: 'Table2_PathwayActivitySummary' is missing 'PROGENy_Pathway' or 'P_value_Activity' column.")
elif 'Pathway Name' not in df_exploded.columns:
    print("❌ ERROR: 'df_exploded' is missing 'Pathway Name' column.")
else:
    # 1. Prepare the lookup dictionary from Table2 (ensuring uppercase keys for robustness)
    progeny_details_lookup_repop = {}
    Table2_temp_for_lookup_repop = Table2_PathwayActivitySummary.copy()
    Table2_temp_for_lookup_repop['PROGENy_Pathway_UPPER'] = Table2_temp_for_lookup_repop['PROGENy_Pathway'].astype(str).str.upper()
    for _, row_t2 in Table2_temp_for_lookup_repop.iterrows():
        progeny_details_lookup_repop[row_t2['PROGENy_Pathway_UPPER']] = {
            'name_original_case': row_t2['PROGENy_Pathway'], # Store original case for the name column
            'p_value': row_t2['P_value_Activity']
        }
    print("✅ PROGENy details lookup dictionary created for re-population.")

    # 2. Define the mapping function again
    def get_progeny_details_repop(original_kegg_pathway_name_from_df):
        path_id_to_lookup = str(original_kegg_pathway_name_from_df).strip()
        if ' ' in path_id_to_lookup:
            path_id_to_lookup = path_id_to_lookup.split(' ')[0]
        
        # Get the PROGENy pathway name from your custom map (e.g., "Hypoxia", "MAPK")
        chosen_progeny_pathway_mapped_case = your_pathway_to_progeny_map.get(path_id_to_lookup)
        
        if chosen_progeny_pathway_mapped_case:
            # Convert to UPPERCASE for lookup in progeny_details_lookup_repop
            chosen_progeny_pathway_upper = str(chosen_progeny_pathway_mapped_case).upper()
            details = progeny_details_lookup_repop.get(chosen_progeny_pathway_upper)
            if details:
                # Return the original case name and its p-value
                return pd.Series([details['name_original_case'], details['p_value']])
        return pd.Series([None, np.nan]) # Return None for name, NaN for p-value if any step fails

    # 3. Apply the function to create/update the columns in df_exploded
    # Assuming your target columns in df_exploded are 'PROGENy_Pathway' and 'P_value_PROGENyActivity'
    target_progeny_name_col = 'PROGENy_Pathway'
    target_progeny_pval_col = 'P_value_PROGENyActivity'

    print(f"Re-populating '{target_progeny_name_col}' and '{target_progeny_pval_col}' in 'df_exploded'...")
    
    # This will create a temporary DataFrame with two columns from the Series returned by apply
    temp_progeny_cols = df_exploded['Pathway Name'].apply(get_progeny_details_repop)
    
    # Assign these to df_exploded
    df_exploded[target_progeny_name_col] = temp_progeny_cols[0]
    df_exploded[target_progeny_pval_col] = temp_progeny_cols[1]

    print("✅ Re-population attempt complete.")

    # 4. Verify the counts again
    print("\n--- Verification after re-population attempt ---")
    if target_progeny_name_col in df_exploded.columns and target_progeny_pval_col in df_exploded.columns:
        nan_progeny_path_count = df_exploded[target_progeny_name_col].isna().sum()
        nan_progeny_pval_count = df_exploded[target_progeny_pval_col].isna().sum()
        print(f"Number of NaN entries in '{target_progeny_name_col}': {nan_progeny_path_count}")
        print(f"Number of NaN entries in '{target_progeny_pval_col}': {nan_progeny_pval_count}")

        print("\nHead of df_exploded with re-populated PROGENy details:")
        display_cols_check = ['Pathway Name', target_progeny_name_col, target_progeny_pval_col]
        existing_cols_check = [col for col in display_cols_check if col in df_exploded.columns]
        try:
            display(df_exploded[existing_cols_check].head(15)) # Show more rows
        except NameError:
            print(df_exploded[existing_cols_check].head(15).to_string())

    else:
        print("Target PROGENy columns not found after re-population attempt.")

--- Attempting to re-populate PROGENy details in df_exploded ---
✅ PROGENy details lookup dictionary created for re-population.
Re-populating 'PROGENy_Pathway' and 'P_value_PROGENyActivity' in 'df_exploded'...
✅ Re-population attempt complete.

--- Verification after re-population attempt ---
Number of NaN entries in 'PROGENy_Pathway': 0
Number of NaN entries in 'P_value_PROGENyActivity': 0

Head of df_exploded with re-populated PROGENy details:


Unnamed: 0,Pathway Name,PROGENy_Pathway,P_value_PROGENyActivity
0,hsa00650,P53,6.2e-05
1,hsa00650,P53,6.2e-05
2,hsa00650,P53,6.2e-05
3,hsa00650,P53,6.2e-05
4,hsa00650,P53,6.2e-05
5,hsa00650,P53,6.2e-05
6,hsa00650,P53,6.2e-05
7,hsa00650,P53,6.2e-05
8,hsa01100,PI3K,0.003193
9,hsa01100,PI3K,0.003193


In [385]:
import pandas as pd
import numpy as np

def count_special_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    For each column in df, count occurrences of:
      - NaN
      - numeric 0
      - string "0"
      - empty string ""
      - empty list []
    
    Returns a DataFrame with those counts per column.
    """
    summary = {}
    for col in df.columns:
        ser = df[col]

        # 1. Count NaNs
        num_nan = ser.isnull().sum()

        # 2. Count numeric 0 (only if dtype is numeric; but comparing to 0 will return False for non‐numeric)
        #    We also explicitly exclude NaNs so they are not double‐counted:
        num_zero_numeric = ((ser == 0) & ~ser.isnull()).sum()

        # 3. Count string "0"
        num_str_zero = ser.apply(lambda x: isinstance(x, str) and x == "0").sum()

        # 4. Count empty string ""
        num_empty_string = ser.apply(lambda x: isinstance(x, str) and x == "").sum()

        # 5. Count empty list [] (only if some cells might actually be lists)
        num_empty_list = ser.apply(lambda x: isinstance(x, list) and len(x) == 0).sum()

        summary[col] = {
            "NaN_count": num_nan,
            "Zero_numeric_count": num_zero_numeric,
            "String_'0'_count": num_str_zero,
            "Empty_string_count": num_empty_string,
            "Empty_list_count": num_empty_list
        }

    return pd.DataFrame(summary).T

# === Usage ===
# (Assuming df_exploded is already defined in your notebook)

special_counts = count_special_values(df_exploded)

print("Counts of special values per column:\n")
try:
    display(special_counts)
except NameError:
    print(special_counts.to_string())


Counts of special values per column:



Unnamed: 0,NaN_count,Zero_numeric_count,String_'0'_count,Empty_string_count,Empty_list_count
Metabolite,0,0,0,0,0
Gene_Symbol,0,0,0,0,0
Pathway Name,0,0,0,0,0
Pathway_Common_Name,0,0,0,0,0
kegg_enzyme,0,0,0,0,0
kegg_reactions,0,0,0,0,0
PathwayScore,0,0,0,0,0
TF,0,0,0,0,0
TFActivity,0,0,0,0,0
Log2FC_GeneExpr,200,0,0,0,0


In [386]:
# Assuming df_exploded is already defined and has a column named 'Metabolite'

# 1. Extract unique metabolite names (ignoring NaNs)
unique_metabs = df_exploded['Metabolite'].dropna().unique()

# 2. (Optional) Sort them for readability
unique_metabs_sorted = sorted(unique_metabs)

# 3. Print the unique names and the total count
print("Unique metabolites in 'Metabolite':")
for m in unique_metabs_sorted:
    print(m)

print(f"\nTotal number of unique metabolites: {len(unique_metabs_sorted)}")


Unique metabolites in 'Metabolite':
17β-Estradiol
2'-Deoxyinosine-5'-monophosphate
5'-Deoxy-5'-(Methylthio) Adenosine
Asp-Arg
Carnitine C7:DC
Cyclo(Phe-Glu)
Cytarabine
D-Erythronolactone
Deoxyribose 5-phosphate
LPC(13:0/0:0)
LPC(16:1/0:0)
LPC(18:3/0:0)
LPE(17:1/0:0)
LPI(16:2/0:0)
Methylcysteine
N(Alpha)-Acetyl-Epsilon-(2-Propenal)Lysine
Quinoline-4-carboxylic acid
Thiamine Monophosphate

Total number of unique metabolites: 18


In [275]:
import pandas as pd
import numpy as np # For np.nan

# This script assumes 'df_exploded' and 'valid_samples' DataFrames are already loaded.

# --- Safety Check for Prerequisites ---
if 'df_exploded' not in locals() or not isinstance(df_exploded, pd.DataFrame):
    print("❌ ERROR: 'df_exploded' DataFrame not found. Please ensure it's loaded.")
    # exit() # Or handle as appropriate
elif not all(col in df_exploded.columns for col in ['Gene_Symbol', 'P_value_GeneExpr', 'Log2FC_GeneExpr']):
    print("❌ ERROR: 'df_exploded' is missing one or more required columns: 'Gene_Symbol', 'P_value_GeneExpr', 'Log2FC_GeneExpr'.")
    # exit()
elif 'valid_samples' not in locals() or not isinstance(valid_samples, pd.DataFrame):
    print("❌ ERROR: 'valid_samples' DataFrame not found. This is needed to check raw sample counts.")
    # exit()
elif 'sample_type' not in valid_samples.columns:
    print("❌ ERROR: 'valid_samples' DataFrame is missing the 'sample_type' column.")
    # exit()
else:
    print("✅ Prerequisites 'df_exploded' and 'valid_samples' seem to be loaded correctly.")

    # --- Step 1: Identify genes with NaN P_value_GeneExpr in df_exploded ---
    # We look at unique Gene_Symbol values that have NaN P_value.
    # It's possible a gene appears multiple times in df_exploded if it's in multiple pathways,
    # but its P_value_GeneExpr (an attribute of the gene itself) would be the same.
    
    genes_with_nan_pvalue = df_exploded[df_exploded['P_value_GeneExpr'].isna()]['Gene_Symbol'].dropna().unique()
    
    print(f"\n--- Investigating {len(genes_with_nan_pvalue)} Unique Genes with NaN P_value_GeneExpr ---")
    
    if len(genes_with_nan_pvalue) == 0:
        print("No genes found with NaN P_value_GeneExpr in df_exploded according to this check.")
    else:
        print("For each gene, checking valid sample counts in 'valid_samples' (showing first 15 problematic genes if many):")
        
        count_insufficient_samples = 0
        count_gene_not_in_valid_samples = 0

        for i, gene_symbol in enumerate(list(genes_with_nan_pvalue)):
            if i < 15 or len(genes_with_nan_pvalue) <= 15 : # Print details for first 15 or all if fewer
                print(f"\n  Gene: {gene_symbol}")
                if gene_symbol not in valid_samples.columns:
                    print("    ❌ This gene symbol is NOT a column in 'valid_samples'. Check consistency of gene IDs.")
                    count_gene_not_in_valid_samples +=1
                    continue

                tumor_expr_for_gene = valid_samples[valid_samples['sample_type'] == 'Primary Tumor'][gene_symbol].dropna()
                normal_expr_for_gene = valid_samples[valid_samples['sample_type'] == 'Solid Tissue Normal'][gene_symbol].dropna()
                
                num_tumor_samples = len(tumor_expr_for_gene)
                num_normal_samples = len(normal_expr_for_gene)
                
                mean_tumor = tumor_expr_for_gene.mean() # Will be NaN if num_tumor_samples is 0
                mean_normal = normal_expr_for_gene.mean() # Will be NaN if num_normal_samples is 0

                print(f"    Valid Tumor Samples: {num_tumor_samples} (Mean Expr: {mean_tumor:.2f})")
                print(f"    Valid Normal Samples: {num_normal_samples} (Mean Expr: {mean_normal:.2f})")

                if num_tumor_samples < 2 or num_normal_samples < 2:
                    print("    -> REASON FOR NaN P-VALUE: Insufficient samples in one or both groups for t-test (requires at least 2 per group).")
                    count_insufficient_samples +=1
                if pd.isna(mean_tumor) or pd.isna(mean_normal):
                    print("    -> REASON FOR NaN Log2FC: Mean expression is NaN for tumor or normal group (due to 0 valid samples).")
            elif i == 15:
                 print(f"\n  ... and {len(genes_with_nan_pvalue) - 15} more genes with NaN P_value_GeneExpr not detailed here.")
                 break # Stop printing details after 15

        print("\n--- Summary of Investigation ---")
        if count_gene_not_in_valid_samples > 0:
            print(f"⚠️ {count_gene_not_in_valid_samples} gene(s) with NaN P-values were not found as columns in 'valid_samples'. Gene ID consistency needs checking!")
        print(f"Identified {count_insufficient_samples} (out of the detailed ones) where NaN P-value is likely due to <2 samples in tumor or normal groups.")
        print("If Log2FC is also NaN for these, it's because the mean could not be calculated for one or both groups.")

print("\n--- Diagnostic Script Complete ---")

✅ Prerequisites 'df_exploded' and 'valid_samples' seem to be loaded correctly.

--- Investigating 6 Unique Genes with NaN P_value_GeneExpr ---
For each gene, checking valid sample counts in 'valid_samples' (showing first 15 problematic genes if many):

  Gene: ALPG
    ❌ This gene symbol is NOT a column in 'valid_samples'. Check consistency of gene IDs.

  Gene: NT5C3A
    ❌ This gene symbol is NOT a column in 'valid_samples'. Check consistency of gene IDs.

  Gene: NT5C3B
    ❌ This gene symbol is NOT a column in 'valid_samples'. Check consistency of gene IDs.

  Gene: AKR1C8
    ❌ This gene symbol is NOT a column in 'valid_samples'. Check consistency of gene IDs.

  Gene: UGT2B17
    ❌ This gene symbol is NOT a column in 'valid_samples'. Check consistency of gene IDs.

  Gene: NT5DC4
    ❌ This gene symbol is NOT a column in 'valid_samples'. Check consistency of gene IDs.

--- Summary of Investigation ---
⚠️ 6 gene(s) with NaN P-values were not found as columns in 'valid_samples'. Ge

In [387]:
import pandas as pd
import numpy as np # For creating sample NaN values

# This script assumes 'df_exploded' (or your main integrated table
# that contains 'Log2FC_GeneExpr' and 'P_value_GeneExpr') is already loaded.

# --- For demonstration, let's create a sample df_exploded DataFrame ---
# In your actual notebook, you'll use your existing df_exploded.
if 'df_exploded' not in locals() or not isinstance(df_exploded, pd.DataFrame):
    print("⚠️ 'df_exploded' not found. Creating a sample DataFrame for demonstration.")
    data_for_df_exploded = {
        'Metabolite': ['M1', 'M2', 'M3', 'M4', 'M5', 'M6_no_expr'],
        'Gene_Symbol': ['GENE_A', 'GENE_B', 'GENE_C', 'GENE_D', 'GENE_E', 'ALPG'], # ALPG was one of the problematic ones
        'Pathway Name': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
        'Log2FC_GeneExpr': [1.5, -0.5, np.nan, 2.0, -1.0, np.nan], # Includes NaNs
        'P_value_GeneExpr': [0.01, 0.04, np.nan, 0.001, 0.03, np.nan], # Includes NaNs
        # Add other columns that would be in df_exploded for completeness of example
        'Pathway_Common_Name': ['Pathway A', 'Pathway B', 'Pathway C', 'Pathway D', 'Pathway E', 'Pathway F'],
        'kegg_enzyme': ['E1', 'E2', 'E3', 'E4', 'E5', 'E6'],
        'kegg_reactions': [['R1'], ['R2'], ['R3'], ['R4'], ['R5'], ['R6']],
        'PathwayScore': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'TF': ['TF1', 'TF2', 'TF3', 'TF4', 'TF5', 'TF6'],
        'TFActivity': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    }
    df_exploded = pd.DataFrame(data_for_df_exploded)
    print(f"Sample 'df_exploded' created. Shape: {df_exploded.shape}")
# --- End of sample data setup ---


print("\n--- Dropping rows with NaN in Gene Expression Statistics ---")

# Check if the DataFrame and necessary columns exist
if 'df_exploded' in locals() and isinstance(df_exploded, pd.DataFrame):
    required_cols_for_drop = ['Log2FC_GeneExpr', 'P_value_GeneExpr']
    if all(col in df_exploded.columns for col in required_cols_for_drop):
        
        print(f"Shape of df_exploded BEFORE dropping NaNs: {df_exploded.shape}")
        
        # Drop rows where ANY of the specified columns have NaN
        # This creates a new DataFrame; df_exploded itself is not modified unless you reassign
        df_exploded_cleaned = df_exploded.dropna(subset=required_cols_for_drop, how='any')
        
        print(f"Shape of df_exploded AFTER dropping NaNs: {df_exploded_cleaned.shape}")
        num_dropped = len(df_exploded) - len(df_exploded_cleaned)
        print(f"Number of rows dropped: {num_dropped}")

        if not df_exploded_cleaned.empty:
            print("\n--- Head of 'df_exploded_cleaned' (with NaNs in expression stats removed) ---")
            try:
                display(df_exploded_cleaned.head())
            except NameError: # If 'display' is not defined (e.g., not in Jupyter)
                print(df_exploded_cleaned.head().to_string())
        else:
            print("⚠️ The DataFrame became empty after dropping rows with NaN expression stats.")
            print("   This would mean all rows had NaN in either 'Log2FC_GeneExpr' or 'P_value_GeneExpr'.")
            
    else:
        print(f"❌ ERROR: 'df_exploded' is missing one or more required columns for dropping: {required_cols_for_drop}")
else:
    print("❌ ERROR: 'df_exploded' DataFrame not found. Please ensure it is loaded.")

# Now, 'df_exploded_cleaned' contains only the rows with valid Log2FC and P-value for gene expression.
# If you want to replace your original df_exploded, you would do:
# df_exploded = df_exploded_cleaned


--- Dropping rows with NaN in Gene Expression Statistics ---
Shape of df_exploded BEFORE dropping NaNs: (11959, 16)
Shape of df_exploded AFTER dropping NaNs: (11759, 16)
Number of rows dropped: 200

--- Head of 'df_exploded_cleaned' (with NaNs in expression stats removed) ---


Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity,Log2FC_GeneExpr,P_value_GeneExpr,PROGENy_Pathway,P_value_PROGENyActivity,Mean_Tumor_TFActivity,Mean_Normal_TFActivity,P_value_TFActivity
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00145,126.993134,E2F1,0.980517,-1.027547,1.224089e-11,P53,6.2e-05,3.985033,3.004516,4.3487100000000003e-29
1,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00146,126.993134,E2F1,0.980517,-1.027547,1.224089e-11,P53,6.2e-05,3.985033,3.004516,4.3487100000000003e-29
2,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00717,126.993134,E2F1,0.980517,-1.027547,1.224089e-11,P53,6.2e-05,3.985033,3.004516,4.3487100000000003e-29
3,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01088,126.993134,E2F1,0.980517,-1.027547,1.224089e-11,P53,6.2e-05,3.985033,3.004516,4.3487100000000003e-29
4,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01361,126.993134,E2F1,0.980517,-1.027547,1.224089e-11,P53,6.2e-05,3.985033,3.004516,4.3487100000000003e-29


In [388]:
# Assuming df_exploded is already defined and has a column named 'Metabolite'

# 1. Extract unique metabolite names (ignoring NaNs)
unique_metabs = df_exploded_cleaned['Metabolite'].dropna().unique()

# 2. (Optional) Sort them for readability
unique_metabs_sorted = sorted(unique_metabs)

# 3. Print the unique names and the total count
print("Unique metabolites in 'Metabolite':")
for m in unique_metabs_sorted:
    print(m)

print(f"\nTotal number of unique metabolites: {len(unique_metabs_sorted)}")


Unique metabolites in 'Metabolite':
17β-Estradiol
2'-Deoxyinosine-5'-monophosphate
5'-Deoxy-5'-(Methylthio) Adenosine
Asp-Arg
Carnitine C7:DC
Cyclo(Phe-Glu)
Cytarabine
D-Erythronolactone
Deoxyribose 5-phosphate
LPC(13:0/0:0)
LPC(16:1/0:0)
LPC(18:3/0:0)
LPE(17:1/0:0)
LPI(16:2/0:0)
Methylcysteine
N(Alpha)-Acetyl-Epsilon-(2-Propenal)Lysine
Quinoline-4-carboxylic acid
Thiamine Monophosphate

Total number of unique metabolites: 18


In [389]:
# Apply the same special‐value counting function to df_exploded_cleaned

# === Re-define or reuse the function ===
import pandas as pd
import numpy as np

def count_special_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    For each column in df, count occurrences of:
      - NaN
      - numeric 0
      - string "0"
      - empty string ""
      - empty list []
    
    Returns a DataFrame with those counts per column.
    """
    summary = {}
    for col in df.columns:
        ser = df[col]

        # 1. Count NaNs
        num_nan = ser.isnull().sum()

        # 2. Count numeric 0 (only if dtype is numeric; exclude NaNs so they’re not double-counted)
        num_zero_numeric = ((ser == 0) & ~ser.isnull()).sum()

        # 3. Count string "0"
        num_str_zero = ser.apply(lambda x: isinstance(x, str) and x == "0").sum()

        # 4. Count empty string ""
        num_empty_string = ser.apply(lambda x: isinstance(x, str) and x == "").sum()

        # 5. Count empty list [] (if cells are actual lists)
        num_empty_list = ser.apply(lambda x: isinstance(x, list) and len(x) == 0).sum()

        summary[col] = {
            "NaN_count": num_nan,
            "Zero_numeric_count": num_zero_numeric,
            "String_'0'_count": num_str_zero,
            "Empty_string_count": num_empty_string,
            "Empty_list_count": num_empty_list
        }

    return pd.DataFrame(summary).T


# === Usage on df_exploded_cleaned ===
# (Assuming df_exploded_cleaned already exists in your notebook)

special_counts_cleaned = count_special_values(df_exploded_cleaned)

print("Counts of special values per column in df_exploded_cleaned:\n")
try:
    display(special_counts_cleaned)
except NameError:
    print(special_counts_cleaned.to_string())


Counts of special values per column in df_exploded_cleaned:



Unnamed: 0,NaN_count,Zero_numeric_count,String_'0'_count,Empty_string_count,Empty_list_count
Metabolite,0,0,0,0,0
Gene_Symbol,0,0,0,0,0
Pathway Name,0,0,0,0,0
Pathway_Common_Name,0,0,0,0,0
kegg_enzyme,0,0,0,0,0
kegg_reactions,0,0,0,0,0
PathwayScore,0,0,0,0,0
TF,0,0,0,0,0
TFActivity,0,0,0,0,0
Log2FC_GeneExpr,0,0,0,0,0


In [106]:
xtb_df = pd.read_excel('xtb_properties.xlsx')
print(xtb_df.shape)
xtb_df.head(24)

(24, 8)


Unnamed: 0,Compound,TotalEnergy,HOMO,LUMO,Gap,DipoleMoment,GFN,Error
0,D-Erythronolactone,-27.863108,4.01588,4.01588,0.147581,4.347,2,
1,"1,6-anhydro-β-D-glucose",-38.265851,10.163291,10.163291,0.373494,-6.845,2,
2,Deoxyribose 5-phosphate,-47.412065,6.484645,6.484645,0.238306,-2.751,2,
3,2-Aminobenzenesulfonic acid,-34.751298,3.30511,3.30511,0.12146,-7.745,2,
4,Quinoline-4-carboxylic acid,-36.013343,1.74371,1.74371,0.06408,0.227,2,
5,cyclo(glu-glu),-58.8091,3.919771,3.919771,0.144049,-5.833,2,
6,P-sulfanilic acid,-34.744605,3.62175,3.62175,0.133097,-0.91,2,
7,Methylcysteine,-27.418815,2.695181,2.695181,0.099046,-1.311,2,
8,Asp-Arg,51.177561,1.897123,1.897123,0.069718,-9.2964,2,
9,LPI(16:2/0:0),-127.573976,0.989721,0.989721,0.036372,74.54,2,


In [390]:
import pandas as pd
import numpy as np # For sample NaN
import ast         # For ast.literal_eval if metabolite names are strings like "['Name']"

# This script assumes 'df_exploded_cleaned' and 'xtb_df' are already loaded
# pandas DataFrames in your environment.

# --- For demonstration: Create sample DataFrames ---
# In your actual notebook, you'll use your existing DataFrames.
if 'df_exploded_cleaned' not in locals() or not isinstance(df_exploded_cleaned, pd.DataFrame):
    print("⚠️ 'df_exploded_cleaned' not found. Creating a sample DataFrame for demonstration.")
    data_exploded_cleaned = {
        'Metabolite': ["['D-Erythronolactone']", "['LPC(16:1/0:0)']", "['Cytarabine']", "['UnknownMetabolite']"],
        'Type': ['[down]', '[down]', '[up]', '[unknown]'],
        'Gene_Symbol': ['LIPA', 'PLA2G10', 'DCK', 'GENEX'],
        # Add other columns present in your df_exploded_cleaned
        'Pathway Name': ['hsa00100', 'hsa00564', 'hsa00240', 'hsa00000'],
        'Pathway_Common_Name': ['Steroid biosynthesis', 'Glycerophospholipid metabolism', 'Pyrimidine metabolism', 'Unknown Pathway'],
        'kegg_enzyme': ['3.1.1.13', '3.1.1.4', '2.7.1.148', '1.1.1.1'],
        'kegg_reactions': [["R1"],["R2"],["R3"],["R4"]],
        'PathwayScore': [200, -150, 300, np.nan],
        'TF': ['STAT1', 'FOXP1', 'E2F1', 'TF_Y'],
        'TFActivity': [-0.01, 0.23, 0.5, np.nan]
    }
    df_exploded_cleaned = pd.DataFrame(data_exploded_cleaned)

if 'xtb_df' not in locals() or not isinstance(xtb_df, pd.DataFrame):
    print("⚠️ 'xtb_df' not found. Creating a sample DataFrame for demonstration.")
    data_xtb = {
        'Compound': ['D-Erythronolactone', 'LPC(16:1/0:0)', 'Cytarabine', 'AnotherMetabolite'],
        'TotalEnergy': [-100.1, -200.2, -150.5, -50.0],
        'HOMO': [-0.1, -0.2, -0.15, -0.05],
        'LUMO': [0.01, 0.02, 0.015, 0.005],
        'Gap': [0.11, 0.22, 0.165, 0.055],
        'DipoleMoment': [1.0, 2.5, 1.8, 0.5]
    }
    xtb_df = pd.DataFrame(data_xtb)
# --- End of sample data setup ---


print("\n--- Adding XTB physicochemical descriptors to df_exploded_cleaned ---")

# Check if DataFrames and necessary columns exist
if not ('df_exploded_cleaned' in locals() and isinstance(df_exploded_cleaned, pd.DataFrame) and \
      'Metabolite' in df_exploded_cleaned.columns):
    print("❌ ERROR: 'df_exploded_cleaned' DataFrame (with 'Metabolite' column) not found.")
elif not ('xtb_df' in locals() and isinstance(xtb_df, pd.DataFrame) and \
        'Compound' in xtb_df.columns and \
        all(col in xtb_df.columns for col in ['TotalEnergy', 'HOMO', 'LUMO', 'Gap', 'DipoleMoment'])):
    print("❌ ERROR: 'xtb_df' DataFrame (with 'Compound' and descriptor columns) not found or incomplete.")
else:
    # 1. Prepare 'df_exploded_cleaned' for merging:
    #    Extract the clean metabolite name from the 'Metabolite' column.
    #    This handles cases like "['Metabolite Name']" or just "Metabolite Name".
    def clean_metabolite_name(met_val):
        if isinstance(met_val, str):
            try:
                # If it's a string representation of a list, like "['Metabolite']"
                eval_list = ast.literal_eval(met_val)
                if isinstance(eval_list, list) and len(eval_list) > 0:
                    return str(eval_list[0]).strip() # Take the first element
                # If ast.literal_eval results in a non-list (e.g. just a string that wasn't list-like)
                return str(eval_list).strip()
            except (ValueError, SyntaxError):
                # If it's a plain string not representing a list, e.g., "Metabolite"
                return met_val.strip("[]'") # General cleaning for simple strings too
        elif isinstance(met_val, list) and len(met_val) > 0: # If it's already a list
            return str(met_val[0]).strip()
        elif pd.notna(met_val): # For other non-string types, convert to string
            return str(met_val).strip()
        return None # For NaNs or unhandled types

    df_exploded_cleaned['Metabolite_for_Merge'] = df_exploded_cleaned['Metabolite'].apply(clean_metabolite_name)
    print("✅ Created 'Metabolite_for_Merge' column in df_exploded_cleaned.")

    # 2. Prepare 'xtb_df' for merging:
    #    Select only necessary columns and ensure 'Compound' is suitable for merging.
    xtb_cols_to_merge = ['Compound', 'TotalEnergy', 'HOMO', 'LUMO', 'Gap', 'DipoleMoment']
    xtb_df_subset = xtb_df[xtb_cols_to_merge].copy()
    xtb_df_subset['Compound_for_Merge'] = xtb_df_subset['Compound'].astype(str).str.strip()
    print("✅ Prepared subset of 'xtb_df' for merging.")

    # 3. Perform the merge
    print("\nMerging df_exploded_cleaned with xtb_df_subset...")
    df_exploded_cleaned_augmented = pd.merge(
        df_exploded_cleaned,
        xtb_df_subset,
        left_on='Metabolite_for_Merge',
        right_on='Compound_for_Merge',
        how='left' # Keep all rows from df_exploded_cleaned
    )
    print(f"✅ Merge complete. Shape of augmented table: {df_exploded_cleaned_augmented.shape}")

    # 4. Clean up temporary merge columns
    cols_to_drop = ['Metabolite_for_Merge']
    if 'Compound_for_Merge' in df_exploded_cleaned_augmented.columns: # This would be from xtb_df
        cols_to_drop.append('Compound_for_Merge')
    if 'Compound' in df_exploded_cleaned_augmented.columns and 'Compound' not in df_exploded_cleaned.columns: # If 'Compound' from xtb_df was added
         cols_to_drop.append('Compound')


    df_exploded_cleaned_augmented.drop(columns=cols_to_drop, inplace=True, errors='ignore')
    print("✅ Temporary merge columns dropped.")
    
    # Optional: You can rename df_exploded_cleaned_augmented back to df_exploded_cleaned if you want to overwrite
    # df_exploded_cleaned = df_exploded_cleaned_augmented

    # --- 5. Display the results ---
    print("\n--- Head of 'df_exploded_cleaned_augmented' with XTB descriptors ---")
    
    # Define columns to display, including new ones
    display_cols_xtb = ['Metabolite', 'Gene_Symbol', 'Pathway_Common_Name', # Key identifiers
                        'TotalEnergy', 'HOMO', 'LUMO', 'Gap', 'DipoleMoment', # New XTB columns
                        'PathwayScore', 'TF', 'TFActivity'] # Existing scores
    
    existing_display_cols_xtb = [col for col in display_cols_xtb if col in df_exploded_cleaned_augmented.columns]
    
    if not df_exploded_cleaned_augmented.empty:
        try:
            display(df_exploded_cleaned_augmented[existing_display_cols_xtb].head())
        except NameError:
            print(df_exploded_cleaned_augmented[existing_display_cols_xtb].head().to_string())
    else:
        print("Augmented DataFrame is empty.")

    print("\n--- Summary of NaNs in new XTB columns ---")
    xtb_descriptor_cols = ['TotalEnergy', 'HOMO', 'LUMO', 'Gap', 'DipoleMoment']
    for col in xtb_descriptor_cols:
        if col in df_exploded_cleaned_augmented.columns:
            nan_count = df_exploded_cleaned_augmented[col].isna().sum()
            print(f"  NaNs in '{col}': {nan_count} out of {len(df_exploded_cleaned_augmented)} rows.")
        else:
            print(f"  Column '{col}' not found in the augmented DataFrame.")


--- Adding XTB physicochemical descriptors to df_exploded_cleaned ---
✅ Created 'Metabolite_for_Merge' column in df_exploded_cleaned.
✅ Prepared subset of 'xtb_df' for merging.

Merging df_exploded_cleaned with xtb_df_subset...
✅ Merge complete. Shape of augmented table: (11759, 24)
✅ Temporary merge columns dropped.

--- Head of 'df_exploded_cleaned_augmented' with XTB descriptors ---


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_exploded_cleaned['Metabolite_for_Merge'] = df_exploded_cleaned['Metabolite'].apply(clean_metabolite_name)


Unnamed: 0,Metabolite,Gene_Symbol,Pathway_Common_Name,TotalEnergy,HOMO,LUMO,Gap,DipoleMoment,PathwayScore,TF,TFActivity
0,D-Erythronolactone,BDH1,Butanoate metabolism - Homo sapiens (human),-27.863108,4.01588,4.01588,0.147581,4.347,126.993134,E2F1,0.980517
1,D-Erythronolactone,BDH1,Butanoate metabolism - Homo sapiens (human),-27.863108,4.01588,4.01588,0.147581,4.347,126.993134,E2F1,0.980517
2,D-Erythronolactone,BDH1,Butanoate metabolism - Homo sapiens (human),-27.863108,4.01588,4.01588,0.147581,4.347,126.993134,E2F1,0.980517
3,D-Erythronolactone,BDH1,Butanoate metabolism - Homo sapiens (human),-27.863108,4.01588,4.01588,0.147581,4.347,126.993134,E2F1,0.980517
4,D-Erythronolactone,BDH1,Butanoate metabolism - Homo sapiens (human),-27.863108,4.01588,4.01588,0.147581,4.347,126.993134,E2F1,0.980517



--- Summary of NaNs in new XTB columns ---
  NaNs in 'TotalEnergy': 0 out of 11759 rows.
  NaNs in 'HOMO': 0 out of 11759 rows.
  NaNs in 'LUMO': 0 out of 11759 rows.
  NaNs in 'Gap': 0 out of 11759 rows.
  NaNs in 'DipoleMoment': 0 out of 11759 rows.


In [391]:
print(df_exploded_cleaned_augmented.columns)

Index(['Metabolite', 'Gene_Symbol', 'Pathway Name', 'Pathway_Common_Name',
       'kegg_enzyme', 'kegg_reactions', 'PathwayScore', 'TF', 'TFActivity',
       'Log2FC_GeneExpr', 'P_value_GeneExpr', 'PROGENy_Pathway',
       'P_value_PROGENyActivity', 'Mean_Tumor_TFActivity',
       'Mean_Normal_TFActivity', 'P_value_TFActivity', 'TotalEnergy', 'HOMO',
       'LUMO', 'Gap', 'DipoleMoment'],
      dtype='object')


In [327]:
display(df_exploded_cleaned_augmented)

Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity,Log2FC_GeneExpr,...,PROGENy_Pathway,P_value_PROGENyActivity,Mean_Tumor_TFActivity,Mean_Normal_TFActivity,P_value_TFActivity,TotalEnergy,HOMO,LUMO,Gap,DipoleMoment
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00145,126.993134,E2F1,0.980517,-1.027547,...,P53,0.000062,3.985033,3.004516,4.348710e-29,-27.863108,4.015880,4.015880,0.147581,4.347
1,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00146,126.993134,E2F1,0.980517,-1.027547,...,P53,0.000062,3.985033,3.004516,4.348710e-29,-27.863108,4.015880,4.015880,0.147581,4.347
2,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00717,126.993134,E2F1,0.980517,-1.027547,...,P53,0.000062,3.985033,3.004516,4.348710e-29,-27.863108,4.015880,4.015880,0.147581,4.347
3,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01088,126.993134,E2F1,0.980517,-1.027547,...,P53,0.000062,3.985033,3.004516,4.348710e-29,-27.863108,4.015880,4.015880,0.147581,4.347
4,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01361,126.993134,E2F1,0.980517,-1.027547,...,P53,0.000062,3.985033,3.004516,4.348710e-29,-27.863108,4.015880,4.015880,0.147581,4.347
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5774,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa00230,Purine metabolism - Homo sapiens (human),3.6.1.64,R03531,383.739852,HNF4A,-0.496992,-0.787699,...,NFKB,0.000011,-1.283298,-0.786305,4.171697e-03,-70.859728,2.724434,2.724434,0.100121,16.111
5775,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa00230,Purine metabolism - Homo sapiens (human),3.6.1.64,R12958,383.739852,HNF4A,-0.496992,-0.787699,...,NFKB,0.000011,-1.283298,-0.786305,4.171697e-03,-70.859728,2.724434,2.724434,0.100121,16.111
5776,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa01100,Metabolic pathways - Homo sapiens (human),3.6.1.64,R10235,-141.148347,HNF4A,-0.496992,-0.787699,...,PI3K,0.003193,-1.283298,-0.786305,4.171697e-03,-70.859728,2.724434,2.724434,0.100121,16.111
5777,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa01100,Metabolic pathways - Homo sapiens (human),3.6.1.64,R03531,-141.148347,HNF4A,-0.496992,-0.787699,...,PI3K,0.003193,-1.283298,-0.786305,4.171697e-03,-70.859728,2.724434,2.724434,0.100121,16.111


In [107]:
phys_props_df = pd.read_csv("physiochemical_properties.csv", index_col="Compound")
# 2. Peek at the first few rows to confirm it loaded correctly
phys_props_df.head(23)

Unnamed: 0_level_0,Unnamed: 0,SMILES,XLogP,FSP3,Complexity,HBondDonors,HBondAcceptors,TPSA,RotatableBonds,Class I,Class II,VIP,p_value,Fold_Change,Log2FC,Type,Molecular Weight (Da)
Compound,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
D-Erythronolactone,0,C1[C@H]([C@H](C(=O)O1)O)O,-1.3,0.75,110.605938,2.0,4.0,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027
"1,6-anhydro-β-D-glucose",1,C1[C@@H]2[C@H]([C@@H]([C@H]([C@H](O1)O2)O)O)O,-2.1,1.0,141.051632,3.0,5.0,79.2,0.0,Carbohydrates and Its metabolites,Sugars,1.832149,0.005326,0.388313,-1.364709,down,162.05283
Deoxyribose 5-phosphate,2,C1[C@@H]([C@H](O[C@H]1O)COP(=O)(O)O)O,-2.6,1.0,212.928441,4.0,7.0,116.0,3.0,Nucleotide And Its metabolites,Nucleotide And Its metabolites,1.448724,0.019611,0.314954,-1.666788,down,214.02423
2-Aminobenzenesulfonic acid,3,C1=CC=C(C(=C1)N)S(=O)(=O)O,0.4,0.0,357.845074,2.0,4.0,88.8,1.0,Benzene and substituted derivatives,Benzene and substituted derivatives,1.988007,9.6e-05,0.481644,-1.05396,down,173.19
Quinoline-4-carboxylic acid,4,C1=CC=C2C(=C1)C(=CC=N2)C(=O)O,0.5,0.0,459.936329,1.0,3.0,50.2,1.0,Heterocyclic compounds,Pteridines and derivatives,1.988007,9.6e-05,0.481644,-1.05396,down,173.047678
cyclo(glu-glu),5,C(CC(=O)O)[C@H]1C(=O)N[C@H](C(=O)N1)CCC(=O)O,-1.5,0.6,344.564194,4.0,6.0,133.0,6.0,Amino acid and Its metabolites,Small Peptide,1.137473,0.041106,2.235048,1.160306,up,258.08573
P-sulfanilic acid,6,C1=CC(=CC=C1N)S(=O)(=O)O,-0.6,0.0,340.595074,2.0,4.0,88.8,1.0,Amino acid and Its metabolites,Amino acid derivatives,1.988007,9.6e-05,0.481644,-1.05396,down,173.19
Methylcysteine,7,CSC[C@@H](C(=O)O)N,-2.7,0.75,86.107496,2.0,4.0,88.6,3.0,Amino acid and Its metabolites,Amino acids,1.603143,0.021896,0.438727,-1.188603,down,135.0354
Asp-Arg,8,C(C[C@@H](C(=O)O)NC(=O)[C@H](CC(=O)O)N)CN=C(N)N,-5.3,0.6,393.499549,6.0,7.0,194.0,9.0,Amino acid and Its metabolites,Small Peptide,1.30826,0.012759,0.47731,-1.067001,down,289.13807
LPI(16:2/0:0),9,CCCC/C=C\C/C=C\CCCCCCC(=O)OC[C@H](O)COP(=O)([O...,0.6121,0.8,744.625633,6.0,12.0,206.27,19.0,GP,PI,2.600966,0.005109,0.466713,-1.099393,down,568.26541


In [392]:
import pandas as pd
import numpy as np
import ast

# This script assumes 'df_exploded_cleaned_augmented' (your table with XTB data)
# and 'phys_props_df' are already loaded pandas DataFrames.

# --- Safety Check for input DataFrames ---
if 'df_exploded_cleaned_augmented' not in locals() or not isinstance(df_exploded_cleaned_augmented, pd.DataFrame):
    print("❌ ERROR: 'df_exploded_cleaned_augmented' DataFrame not found. Please ensure it's loaded from the XTB merge step.")
    # For demonstration, if it's missing, create a sample. In your real run, this should not happen.
    df_exploded_cleaned_augmented = pd.DataFrame({
        'Metabolite': ["['D-Erythronolactone']", "['LPC(16:1/0:0)']"],
        'Gene_Symbol': ['LIPA', 'PLA2G10'],
        'TotalEnergy': [-100.0, -200.0] # Example of existing XTB column
        # Add other columns that df_exploded_cleaned_augmented is expected to have
    })
    print("   ⚠️ Created a DUMMY 'df_exploded_cleaned_augmented' for script execution.")


if 'phys_props_df' not in locals() or not isinstance(phys_props_df, pd.DataFrame):
    print("❌ ERROR: 'phys_props_df' not found. Please ensure it's loaded.")
    # For demonstration, if it's missing, create a sample.
    phys_props_df = pd.DataFrame({
        'Compound': ['D-Erythronolactone', 'LPC(16:1/0:0)'],
        'XLogP': [0.1, 5.2], 'FSP3': [0.8, 0.5], 'Complexity': [100,500],
        'HBondDonors': [4,2], 'HBondAcceptors': [5,6], 'TPSA': [90,120],
        'RotatableBonds': [2,15], 'Class I': ['Polyol', 'Lipid'], 'Class II': ['SA', 'LPL'],
        'VIP': [1.5,2.1], 'p_value': [0.01,0.001], 'Fold_Change': [2.0,0.5],
        'Log2FC': [1.0,-1.0], 'Type': ['Endo_Phys', 'Endo_Phys'], 'Molecular Weight (Da)': [148.11,495.68]
    })
    print("   ⚠️ Created a DUMMY 'phys_props_df' for script execution.")
# --- End Safety Checks ---


print("\n--- Adding further physicochemical properties from 'phys_props_df' to 'df_exploded_cleaned_augmented' ---")

cols_to_add_from_phys_props = [
    'XLogP', 'FSP3', 'Complexity', 'HBondDonors', 'HBondAcceptors',
    'TPSA', 'RotatableBonds', 'Class I', 'Class II', 'VIP', 'p_value',
    'Fold_Change', 'Log2FC', 'Type', # This 'Type' is from phys_props_df
    'Molecular Weight (Da)'
]

# Check if necessary columns exist in the input DataFrames
prerequisites_ok_for_merge = True
if 'Metabolite' not in df_exploded_cleaned_augmented.columns:
    print("❌ ERROR: 'df_exploded_cleaned_augmented' is missing the 'Metabolite' column for merging.")
    prerequisites_ok_for_merge = False
if 'Compound' not in phys_props_df.columns:
    print("❌ ERROR: 'phys_props_df' is missing the 'Compound' column for merging.")
    prerequisites_ok_for_merge = False
missing_phys_cols = [col for col in cols_to_add_from_phys_props if col not in phys_props_df.columns]
if missing_phys_cols:
    print(f"❌ ERROR: 'phys_props_df' is missing the following descriptor columns: {missing_phys_cols}")
    prerequisites_ok_for_merge = False

if prerequisites_ok_for_merge:
    # 1. Prepare 'df_exploded_cleaned_augmented' for merging by cleaning its 'Metabolite' column
    def clean_metabolite_name_for_merge(met_val):
        if isinstance(met_val, str):
            try:
                eval_list = ast.literal_eval(met_val)
                if isinstance(eval_list, list) and len(eval_list) > 0:
                    return str(eval_list[0]).strip()
                return str(eval_list).strip()
            except (ValueError, SyntaxError):
                return met_val.strip("[]'")
        elif isinstance(met_val, list) and len(met_val) > 0:
            return str(met_val[0]).strip()
        elif pd.notna(met_val):
            return str(met_val).strip()
        return None

    # Create the merge key on a temporary variable or directly on a copy
    df_target_with_merge_key = df_exploded_cleaned_augmented.copy()
    df_target_with_merge_key['Metabolite_for_Merge'] = df_target_with_merge_key['Metabolite'].apply(clean_metabolite_name_for_merge)
    print("✅ Created 'Metabolite_for_Merge' column in the target DataFrame copy.")

    # 2. Prepare 'phys_props_df' for merging
    phys_props_df_subset = phys_props_df[['Compound'] + cols_to_add_from_phys_props].copy()
    phys_props_df_subset['Compound_for_Merge'] = phys_props_df_subset['Compound'].astype(str).str.strip()
    print("✅ Prepared subset of 'phys_props_df' for merging.")

    # 3. Perform the merge
    print("\nMerging with phys_props_df_subset...")
    # The DataFrame being merged into is df_target_with_merge_key
    df_fully_augmented = pd.merge(
        df_target_with_merge_key,
        phys_props_df_subset,
        left_on='Metabolite_for_Merge',
        right_on='Compound_for_Merge',
        how='left',
        suffixes=('_df_orig', '_phys_props') # Suffixes for ALL overlapping column names
    )
    print(f"✅ Merge complete. Shape of new augmented table: {df_fully_augmented.shape}")

    # 4. Clean up temporary merge columns and handle potential suffixed columns
    cols_to_drop_after_merge = ['Metabolite_for_Merge']
    if 'Compound_for_Merge' in df_fully_augmented.columns:
        cols_to_drop_after_merge.append('Compound_for_Merge')
    # If 'Compound' was brought in from phys_props_df and isn't one of your original key columns
    if 'Compound' in df_fully_augmented.columns and 'Compound' not in df_exploded_cleaned_augmented.columns:
         cols_to_drop_after_merge.append('Compound')

    df_fully_augmented.drop(columns=cols_to_drop_after_merge, inplace=True, errors='ignore')
    print("✅ Temporary merge columns dropped.")

    # Handle the 'Type' column specifically if it was duplicated
    # Your df_exploded_cleaned_augmented already has a 'Type' column (from your earlier image_eadef4.png)
    # phys_props_df also has a 'Type' column.
    if 'Type_df_orig' in df_fully_augmented.columns and 'Type_phys_props' in df_fully_augmented.columns:
        print("  Handling duplicated 'Type' column from merge ('Type_df_orig' and 'Type_phys_props').")
        # Decide which to keep or how to combine. Let's keep the original one.
        df_fully_augmented.rename(columns={'Type_df_orig': 'Type'}, inplace=True)
        df_fully_augmented.drop(columns=['Type_phys_props'], inplace=True, errors='ignore')
        print("    Kept original 'Type' column (was 'Type_df_orig'), removed 'Type_phys_props'.")
    elif 'Type_phys_props' in df_fully_augmented.columns and 'Type' not in df_fully_augmented.columns:
        # If original df_exploded_cleaned_augmented didn't have 'Type' for some reason
        df_fully_augmented.rename(columns={'Type_phys_props': 'Type'}, inplace=True)
        print("    Renamed 'Type_phys_props' to 'Type'.")
    # If only 'Type_df_orig' exists, it will be renamed to 'Type' by the suffixes logic if no original 'Type'
    # If 'Type' column from df_exploded_cleaned_augmented didn't get a suffix, it's already fine.

    # Assign the fully augmented DataFrame back to the name you want to use moving forward
    df_exploded_cleaned_augmented = df_fully_augmented # Overwrite with the new data

    # --- 5. Display the results ---
    print("\n--- Head of 'df_exploded_cleaned_augmented' with all physicochemical properties ---")
    
    # Define all columns you might want to see
    all_possible_display_cols = [
        'Metabolite', 'Type', 'Gene_Symbol', 'Pathway_Common_Name', 
        'TotalEnergy', 'HOMO', 'LUMO', 'Gap', 'DipoleMoment', # XTB
        'XLogP', 'FSP3', 'Complexity', 'HBondDonors', 'HBondAcceptors', 
        'TPSA', 'RotatableBonds', 'Class I', 'Class II', 'VIP', 'p_value', 
        'Fold_Change', 'Log2FC', 'Molecular Weight (Da)', # From phys_props_df
        'PathwayScore', 'TF', 'TFActivity' # Existing scores
    ]
    
    existing_display_cols = [col for col in all_possible_display_cols if col in df_exploded_cleaned_augmented.columns]
    
    if not df_exploded_cleaned_augmented.empty:
        try:
            display(df_exploded_cleaned_augmented[existing_display_cols].head())
        except NameError:
            print(df_exploded_cleaned_augmented[existing_display_cols].head().to_string())
    else:
        print("Augmented DataFrame is empty.")

    print("\n--- Summary of NaNs in physicochemical columns from phys_props_df ---")
    for col in cols_to_add_from_phys_props:
        # Handle if 'Type' from phys_props_df was the one kept and renamed
        actual_col_name_to_check_nans = col
        if col == 'Type' and 'Type_df_orig' in df_fully_augmented.columns: # This means original Type was kept
             pass # NaN check for 'Type' (orig) will be correct
        elif col == 'Type' and 'Type_phys_props' in df_merged_with_phys_props.columns and 'Type_df_orig' not in df_merged_with_phys_props.columns:
             # This means 'Type_phys_props' was renamed to 'Type'
             pass


        if actual_col_name_to_check_nans in df_exploded_cleaned_augmented.columns:
            nan_count = df_exploded_cleaned_augmented[actual_col_name_to_check_nans].isna().sum()
            print(f"  NaNs in '{actual_col_name_to_check_nans}': {nan_count} out of {len(df_exploded_cleaned_augmented)} rows.")
        # else:
            # print(f"  Column '{col}' (expected as '{actual_col_name_to_check_nans}') not found in the augmented DataFrame for NaN check.")
else:
    print("\n--- Merge with phys_props_df skipped due to missing prerequisites. ---")


--- Adding further physicochemical properties from 'phys_props_df' to 'df_exploded_cleaned_augmented' ---
✅ Created 'Metabolite_for_Merge' column in the target DataFrame copy.
✅ Prepared subset of 'phys_props_df' for merging.

Merging with phys_props_df_subset...
✅ Merge complete. Shape of new augmented table: (11759, 39)
✅ Temporary merge columns dropped.

--- Head of 'df_exploded_cleaned_augmented' with all physicochemical properties ---


Unnamed: 0,Metabolite,Type,Gene_Symbol,Pathway_Common_Name,TotalEnergy,HOMO,LUMO,Gap,DipoleMoment,XLogP,...,Class I,Class II,VIP,p_value,Fold_Change,Log2FC,Molecular Weight (Da),PathwayScore,TF,TFActivity
0,D-Erythronolactone,down,BDH1,Butanoate metabolism - Homo sapiens (human),-27.863108,4.01588,4.01588,0.147581,4.347,-1.3,...,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,118.027,126.993134,E2F1,0.980517
1,D-Erythronolactone,down,BDH1,Butanoate metabolism - Homo sapiens (human),-27.863108,4.01588,4.01588,0.147581,4.347,-1.3,...,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,118.027,126.993134,E2F1,0.980517
2,D-Erythronolactone,down,BDH1,Butanoate metabolism - Homo sapiens (human),-27.863108,4.01588,4.01588,0.147581,4.347,-1.3,...,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,118.027,126.993134,E2F1,0.980517
3,D-Erythronolactone,down,BDH1,Butanoate metabolism - Homo sapiens (human),-27.863108,4.01588,4.01588,0.147581,4.347,-1.3,...,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,118.027,126.993134,E2F1,0.980517
4,D-Erythronolactone,down,BDH1,Butanoate metabolism - Homo sapiens (human),-27.863108,4.01588,4.01588,0.147581,4.347,-1.3,...,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,118.027,126.993134,E2F1,0.980517



--- Summary of NaNs in physicochemical columns from phys_props_df ---
  NaNs in 'XLogP': 0 out of 11759 rows.
  NaNs in 'FSP3': 0 out of 11759 rows.
  NaNs in 'Complexity': 0 out of 11759 rows.
  NaNs in 'HBondDonors': 0 out of 11759 rows.
  NaNs in 'HBondAcceptors': 0 out of 11759 rows.
  NaNs in 'TPSA': 0 out of 11759 rows.
  NaNs in 'RotatableBonds': 0 out of 11759 rows.
  NaNs in 'Class I': 0 out of 11759 rows.
  NaNs in 'Class II': 0 out of 11759 rows.
  NaNs in 'VIP': 0 out of 11759 rows.
  NaNs in 'p_value': 0 out of 11759 rows.
  NaNs in 'Fold_Change': 0 out of 11759 rows.
  NaNs in 'Log2FC': 0 out of 11759 rows.
  NaNs in 'Type': 0 out of 11759 rows.
  NaNs in 'Molecular Weight (Da)': 0 out of 11759 rows.


In [393]:
display(df_exploded_cleaned_augmented)

Unnamed: 0,Metabolite,Gene_Symbol,Pathway Name,Pathway_Common_Name,kegg_enzyme,kegg_reactions,PathwayScore,TF,TFActivity,Log2FC_GeneExpr,...,TPSA,RotatableBonds,Class I,Class II,VIP,p_value,Fold_Change,Log2FC,Type,Molecular Weight (Da)
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00145,126.993134,E2F1,0.980517,-1.027547,...,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027000
1,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00146,126.993134,E2F1,0.980517,-1.027547,...,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027000
2,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00717,126.993134,E2F1,0.980517,-1.027547,...,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027000
3,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01088,126.993134,E2F1,0.980517,-1.027547,...,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027000
4,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01361,126.993134,E2F1,0.980517,-1.027547,...,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11754,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa00230,Purine metabolism - Homo sapiens (human),3.6.1.64,R03531,383.739852,HNF4A,-0.496992,-0.787699,...,156.0,4.0,Nucleotide And Its metabolites,Nucleotide And Its metabolites,1.27660,0.044854,0.322635,-1.632023,down,332.052185
11755,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa00230,Purine metabolism - Homo sapiens (human),3.6.1.64,R12958,383.739852,HNF4A,-0.496992,-0.787699,...,156.0,4.0,Nucleotide And Its metabolites,Nucleotide And Its metabolites,1.27660,0.044854,0.322635,-1.632023,down,332.052185
11756,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa01100,Metabolic pathways - Homo sapiens (human),3.6.1.64,R10235,-141.148347,HNF4A,-0.496992,-0.787699,...,156.0,4.0,Nucleotide And Its metabolites,Nucleotide And Its metabolites,1.27660,0.044854,0.322635,-1.632023,down,332.052185
11757,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa01100,Metabolic pathways - Homo sapiens (human),3.6.1.64,R03531,-141.148347,HNF4A,-0.496992,-0.787699,...,156.0,4.0,Nucleotide And Its metabolites,Nucleotide And Its metabolites,1.27660,0.044854,0.322635,-1.632023,down,332.052185


In [285]:
print(df_exploded_cleaned_augmented.columns)

Index(['Metabolite', 'Gene_Symbol', 'Pathway Name', 'Pathway_Common_Name',
       'kegg_enzyme', 'kegg_reactions', 'PathwayScore', 'TF', 'TFActivity',
       'Log2FC_GeneExpr', 'P_value_GeneExpr', 'PROGENy_Pathway',
       'P_value_PROGENyActivity', 'Mean_Tumor_TFActivity',
       'Mean_Normal_TFActivity', 'P_value_TFActivity', 'TotalEnergy', 'HOMO',
       'LUMO', 'Gap', 'DipoleMoment', 'XLogP', 'FSP3', 'Complexity',
       'HBondDonors', 'HBondAcceptors', 'TPSA', 'RotatableBonds', 'Class I',
       'Class II', 'VIP', 'p_value', 'Fold_Change', 'Log2FC', 'Type',
       'Molecular Weight (Da)'],
      dtype='object')


In [394]:
import pandas as pd
import numpy as np

def count_special_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    For each column in df, count occurrences of:
      - NaN
      - numeric 0
      - string "0"
      - empty string ""
      - actual empty list [] 
      - string representation of empty list "[]"
    
    Returns a DataFrame with those counts per column.
    """
    summary = {}
    for col in df.columns:
        ser = df[col]

        # 1. Count NaNs
        num_nan = ser.isnull().sum()

        # 2. Count numeric 0 (exclude NaNs so they’re not double‐counted)
        num_zero_numeric = ((ser == 0) & ~ser.isnull()).sum()

        # 3. Count string "0"
        num_str_zero = ser.apply(lambda x: isinstance(x, str) and x == "0").sum()

        # 4. Count empty string ""
        num_empty_string = ser.apply(lambda x: isinstance(x, str) and x == "").sum()

        # 5. Count actual empty list [] (only if some cells are lists)
        num_empty_list = ser.apply(lambda x: isinstance(x, list) and len(x) == 0).sum()

        # 6. Count string "[]" (literal string with square brackets)
        num_str_empty_list = ser.apply(lambda x: isinstance(x, str) and x == "[]").sum()

        summary[col] = {
            "NaN_count": num_nan,
            "Zero_numeric_count": num_zero_numeric,
            "String_'0'_count": num_str_zero,
            "Empty_string_count": num_empty_string,
            "Empty_list_count": num_empty_list,
            "String_'[]'_count": num_str_empty_list
        }

    return pd.DataFrame(summary).T

# === Usage on df_exploded_cleaned ===
special_counts_cleaned = count_special_values(df_exploded_cleaned)

print("Counts of special values per column in df_exploded_cleaned:\n")
try:
    display(special_counts_cleaned)
except NameError:
    print(special_counts_cleaned.to_string())


Counts of special values per column in df_exploded_cleaned:



Unnamed: 0,NaN_count,Zero_numeric_count,String_'0'_count,Empty_string_count,Empty_list_count,String_'[]'_count
Metabolite,0,0,0,0,0,0
Gene_Symbol,0,0,0,0,0,0
Pathway Name,0,0,0,0,0,0
Pathway_Common_Name,0,0,0,0,0,0
kegg_enzyme,0,0,0,0,0,0
kegg_reactions,0,0,0,0,0,0
PathwayScore,0,0,0,0,0,0
TF,0,0,0,0,0,0
TFActivity,0,0,0,0,0,0
Log2FC_GeneExpr,0,0,0,0,0,0


In [395]:
# Define a mapping from existing column names to the desired new names
rename_mapping = {
    'Metabolite':               'metabolite_name',
    'Gene_Symbol':              'gene_symbol',
    'Pathway Name':             'kegg_pathway_id',
    'Pathway_Common_Name':      'kegg_pathway_name',
    'kegg_enzyme':              'kegg_enzyme_id',
    'kegg_reactions':           'kegg_reaction_id',
    'PathwayScore':             'pathway_score',
    'TF':                       'transcription_factor',
    'TFActivity':               'tf_activity_score',
    'Log2FC_GeneExpr':          'gene_log2fc',
    'P_value_GeneExpr':         'gene_expr_pval',
    'PROGENy_Pathway':          'progeny_pathway',
    'P_value_PROGENyActivity':  'progeny_pathway_pval',
    'Mean_Tumor_TFActivity':    'tf_activity_tumor_mean',
    'Mean_Normal_TFActivity':   'tf_activity_normal_mean',
    'P_value_TFActivity':       'tf_activity_pval',
    'TotalEnergy':              'qm_total_energy',
    'HOMO':                     'qm_homo',
    'LUMO':                     'qm_lumo',
    'Gap':                      'qm_gap',
    'DipoleMoment':             'qm_dipole_moment',
    'XLogP':                    'xlogp',
    'FSP3':                     'fsp3',
    'Complexity':               'complexity',
    'HBondDonors':              'hbond_donors',
    'HBondAcceptors':           'hbond_acceptors',
    'TPSA':                     'tpsa',
    'RotatableBonds':           'rotatable_bonds',
    'Class I':                  'class_i_flag',
    'Class II':                 'class_ii_flag',
    'VIP':                      'vip_score',
    'p_value':                  'metabolite_pval',
    'Fold_Change':              'metabolite_fold_change',
    'Log2FC':                   'metabolite_log2fc',
    'Type':                     'direction_flag',
    'Molecular Weight (Da)':    'mol_weight_da'
}

# Apply the renaming to df_exploded_cleaned_augmented
df_exploded_cleaned_augmented = df_exploded_cleaned_augmented.rename(columns=rename_mapping)

# Verify the new column names
print("Renamed columns:")
print(list(df_exploded_cleaned_augmented.columns))


Renamed columns:
['metabolite_name', 'gene_symbol', 'kegg_pathway_id', 'kegg_pathway_name', 'kegg_enzyme_id', 'kegg_reaction_id', 'pathway_score', 'transcription_factor', 'tf_activity_score', 'gene_log2fc', 'gene_expr_pval', 'progeny_pathway', 'progeny_pathway_pval', 'tf_activity_tumor_mean', 'tf_activity_normal_mean', 'tf_activity_pval', 'qm_total_energy', 'qm_homo', 'qm_lumo', 'qm_gap', 'qm_dipole_moment', 'xlogp', 'fsp3', 'complexity', 'hbond_donors', 'hbond_acceptors', 'tpsa', 'rotatable_bonds', 'class_i_flag', 'class_ii_flag', 'vip_score', 'metabolite_pval', 'metabolite_fold_change', 'metabolite_log2fc', 'direction_flag', 'mol_weight_da']


In [396]:
df_exploded_cleaned_augmented.shape

(11759, 36)

In [397]:
display(df_exploded_cleaned_augmented)

Unnamed: 0,metabolite_name,gene_symbol,kegg_pathway_id,kegg_pathway_name,kegg_enzyme_id,kegg_reaction_id,pathway_score,transcription_factor,tf_activity_score,gene_log2fc,...,tpsa,rotatable_bonds,class_i_flag,class_ii_flag,vip_score,metabolite_pval,metabolite_fold_change,metabolite_log2fc,direction_flag,mol_weight_da
0,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00145,126.993134,E2F1,0.980517,-1.027547,...,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027000
1,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00146,126.993134,E2F1,0.980517,-1.027547,...,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027000
2,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R00717,126.993134,E2F1,0.980517,-1.027547,...,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027000
3,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01088,126.993134,E2F1,0.980517,-1.027547,...,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027000
4,D-Erythronolactone,BDH1,hsa00650,Butanoate metabolism - Homo sapiens (human),1.1.1.30,R01361,126.993134,E2F1,0.980517,-1.027547,...,66.8,0.0,Carbohydrates and Its metabolites,Carboxylic acids and derivatives,3.20052,0.000679,0.276351,-1.855428,down,118.027000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11754,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa00230,Purine metabolism - Homo sapiens (human),3.6.1.64,R03531,383.739852,HNF4A,-0.496992,-0.787699,...,156.0,4.0,Nucleotide And Its metabolites,Nucleotide And Its metabolites,1.27660,0.044854,0.322635,-1.632023,down,332.052185
11755,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa00230,Purine metabolism - Homo sapiens (human),3.6.1.64,R12958,383.739852,HNF4A,-0.496992,-0.787699,...,156.0,4.0,Nucleotide And Its metabolites,Nucleotide And Its metabolites,1.27660,0.044854,0.322635,-1.632023,down,332.052185
11756,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa01100,Metabolic pathways - Homo sapiens (human),3.6.1.64,R10235,-141.148347,HNF4A,-0.496992,-0.787699,...,156.0,4.0,Nucleotide And Its metabolites,Nucleotide And Its metabolites,1.27660,0.044854,0.322635,-1.632023,down,332.052185
11757,2'-Deoxyinosine-5'-monophosphate,NUDT16,hsa01100,Metabolic pathways - Homo sapiens (human),3.6.1.64,R03531,-141.148347,HNF4A,-0.496992,-0.787699,...,156.0,4.0,Nucleotide And Its metabolites,Nucleotide And Its metabolites,1.27660,0.044854,0.322635,-1.632023,down,332.052185


In [398]:
# Assuming df_exploded_cleaned is defined and has a column 'metabolite_name'

# 1. Get an array of unique metabolite names
unique_metabolites = df_exploded_cleaned_augmented['metabolite_name'].dropna().unique()

# 2. (Optional) Sort them for readability
unique_metabolites_sorted = sorted(unique_metabolites)

# 3. Print the unique names and the total count
print("Unique metabolites in 'metabolite_name':")
for metab in unique_metabolites_sorted:
    print(metab)

print(f"\nTotal number of unique metabolites: {len(unique_metabolites_sorted)}")


Unique metabolites in 'metabolite_name':
17β-Estradiol
2'-Deoxyinosine-5'-monophosphate
5'-Deoxy-5'-(Methylthio) Adenosine
Asp-Arg
Carnitine C7:DC
Cyclo(Phe-Glu)
Cytarabine
D-Erythronolactone
Deoxyribose 5-phosphate
LPC(13:0/0:0)
LPC(16:1/0:0)
LPC(18:3/0:0)
LPE(17:1/0:0)
LPI(16:2/0:0)
Methylcysteine
N(Alpha)-Acetyl-Epsilon-(2-Propenal)Lysine
Quinoline-4-carboxylic acid
Thiamine Monophosphate

Total number of unique metabolites: 18


In [399]:
# choose a descriptive filename
output_path = "enriched_metabolite_data.csv"

# save to CSV without the pandas index column
df_exploded_cleaned_augmented.to_csv(output_path, index=False)

print(f"✅ Saved df_enriched to '{output_path}'")


✅ Saved df_enriched to 'enriched_metabolite_data.csv'
