# Flux comparison
In FIG. 4 , the authors compared subsystems by selecting the most de-regulated reactions across cell types (LG and HG). I think They optimized to max biomass each of the 7 models  (they generated 7 FBA solutions) and then they computed the average of each flux across cell lines of the same family.

However, they selected one of the many solutions that respect max biomass growth. I suggest to:

1. Impose biomass at its upper bound for each model (LB = UB*0.90 to avoid solver numerical issues)
2. Run flux sampling with OPTGP (cobrapy) with thinning = 100 and 1k samples per cell line
3. DO not 'summarize' flux probability distributions with a simple average, but use more advanced methods. The objective here is to identify the most different reactions across the two cancer families. You could run non parametric statistical tests such as mann-whitney to check if two probability distributions are significantly different or not. You have 3 cells vs 2 cells (all pair combinations), so you could perform this test only on reactions belonging to core subsystems such as glycolysis, TCA cycle pentophosphate etc.. in order to redure the number of compared distribution per cell couple.
4. Once you identified the top-n most different probability distributions (reaction fluxes) across cells of different type, you could plot them with boxplots as the authors did 
5. It might be interesting to check if we have 'less differences' in distributions of cells belonging to the same family.

In [17]:

from cobra.io import load_model
from cobra.sampling import sample
import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

import pandas as pd
from cobra.io.json import load_json_model
import numpy as np
from cobra.flux_analysis import flux_variability_analysis
from cobra.flux_analysis import single_reaction_deletion
import math
from tqdm import tqdm

from cobra.io import read_sbml_model
from mewpy.simulation import get_simulator
from mewpy.simulation import set_default_solver
set_default_solver('gurobi')

from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests
import os


# 1. Load Models

In [3]:
cell_lines = {
    "LG_59M": pd.read_csv("data/fva_rnaseq_ACH-000520_LG.csv", index_col=0),
    "LG_HEYA8":  pd.read_csv("data/fva_rnaseq_ACH-000542_LG.csv", index_col=0),
    "LG_OV56": pd.read_csv("data/fva_rnaseq_ACH-000091_LG.csv", index_col=0),
    "HG_CAOV3": pd.read_csv("data/fva_rnaseq_ACH-000713_HG.csv", index_col=0),
    "HG_COV318": pd.read_csv("data/fva_rnaseq_ACH-000256_HG.csv", index_col=0),
    "HG_OAW28": pd.read_csv("data/fva_rnaseq_ACH-000116_HG.csv", index_col=0)
}

In [6]:
models = {}

for name in cell_lines.keys():
    model = load_json_model('./data/Recon3D.json')
    model.solver = 'gurobi'
    print("hi")
    df = cell_lines[name]
    oopsie_df= df.loc[df['minimum'] > df['maximum']]
    for index, row in oopsie_df.iterrows():
        df.loc[index, "minimum"]= 0
        df.loc[index, "maximum"]= 0
    for index, row in df.iterrows():
        model.reactions.get_by_id(index).bounds = (row['minimum'], row['maximum'])
    models[name] = model 
    sol= model.optimize()
    print(name, ": ", sol.objective_value)


hi
LG_59M :  178.90834242799173
hi
LG_HEYA8 :  167.3424025396205
hi
LG_OV56 :  177.6725358833259
hi
HG_CAOV3 :  187.23800345109214
hi
HG_COV318 :  168.3898619215588
hi
HG_OAW28 :  193.22163603973084


## 2. Map reactions to their subsystem

## Pathways
- Glycolysis
- TCA cycle
- PPP
- Amino acid metabolism
- Nucleotide metabolism
- Fatty acid metabolism
- Lipid metabolism

### 1. Impose biomass UB and LB=UP x 0.9

In [21]:
for id,model in models.items():   
    print(id, model.objective.expression)

LG_59M 1.0*BIOMASS_maintenance - 1.0*BIOMASS_maintenance_reverse_5b3f9
LG_HEYA8 1.0*BIOMASS_maintenance - 1.0*BIOMASS_maintenance_reverse_5b3f9
LG_OV56 1.0*BIOMASS_maintenance - 1.0*BIOMASS_maintenance_reverse_5b3f9
HG_CAOV3 1.0*BIOMASS_maintenance - 1.0*BIOMASS_maintenance_reverse_5b3f9
HG_COV318 1.0*BIOMASS_maintenance - 1.0*BIOMASS_maintenance_reverse_5b3f9
HG_OAW28 1.0*BIOMASS_maintenance - 1.0*BIOMASS_maintenance_reverse_5b3f9


In [8]:
def constrain_to_near_opt(model):
    solution = model.optimize()
    v_opt = solution.fluxes["BIOMASS_reaction"]
    rxn = model.reactions.get_by_id("BIOMASS_reaction")
    rxn.lower_bound = 0.9 * v_opt   # 90% of optimal growth
    rxn.upper_bound = v_opt         # at most optimal growth


### 2. Run flux sampling with OPTGP (cobrapy) with thinning = 100 and 1k samples per cell line

In [9]:
all_samples = {}

for id,model in models.items():
    constrain_to_near_opt(model)
    S= sample(model, n=1000, method="optgp", thinning=100, seed=42)
    all_samples[id]=S 

all_samples

Read LP format model from file /var/folders/0f/7pcrwybx3_zfsnzs2qjgxp8h0000gn/T/tmpvw_whdvj.lp
Reading time = 0.03 seconds
: 5835 rows, 21200 columns, 80850 nonzeros
Read LP format model from file /var/folders/0f/7pcrwybx3_zfsnzs2qjgxp8h0000gn/T/tmphg3fqpz5.lp
Reading time = 0.04 seconds
: 5835 rows, 21200 columns, 80850 nonzeros
Read LP format model from file /var/folders/0f/7pcrwybx3_zfsnzs2qjgxp8h0000gn/T/tmpfcdlrk9h.lp
Reading time = 0.04 seconds
: 5835 rows, 21200 columns, 80850 nonzeros
Read LP format model from file /var/folders/0f/7pcrwybx3_zfsnzs2qjgxp8h0000gn/T/tmpi_cuuj8t.lp
Reading time = 0.03 seconds
: 5835 rows, 21200 columns, 80850 nonzeros
Read LP format model from file /var/folders/0f/7pcrwybx3_zfsnzs2qjgxp8h0000gn/T/tmp5xew14uv.lp
Reading time = 0.03 seconds
: 5835 rows, 21200 columns, 80850 nonzeros
Read LP format model from file /var/folders/0f/7pcrwybx3_zfsnzs2qjgxp8h0000gn/T/tmpqnqvwafi.lp
Reading time = 0.03 seconds
: 5835 rows, 21200 columns, 80850 nonzeros


{'LG_59M':      24_25DHVITD3tm  25HVITD3t      COAtl  EX_5adtststerone_e  \
 0          0.036521   0.709319  45.260521            0.072090   
 1          0.039121   0.707917  39.600044            0.071947   
 2          0.039089   0.707323  37.903778            0.065803   
 3          0.035923   0.709030  37.991038            0.062751   
 4          0.033207   0.682011  37.825996            0.078666   
 ..              ...        ...        ...                 ...   
 995        0.120440   0.740407   3.084707            0.120115   
 996        0.122628   0.685514   3.122992            0.118145   
 997        0.122451   0.684662   3.070970            0.115416   
 998        0.119052   0.787824   2.835021            0.114874   
 999        0.116832   0.764886   2.646207            0.114877   
 
      EX_5adtststerones_e    EX_5fthf_e  EX_5htrp_e  EX_5mthf_e     EX_5thf_e  \
 0               0.050733 -2.405421e-08    0.105986   51.659498  3.653280e-09   
 1               0.050632 -2.40218

In [19]:
os.makedirs("data/flux_sampling_data", exist_ok=True)

In [20]:
for id,data in cell_lines.items():
    all_samples[id].to_csv(f"data/flux_sampling_data/flux_sampling_{id}.csv")

In [15]:
all_samples["LG_59M"]

Unnamed: 0,24_25DHVITD3tm,25HVITD3t,COAtl,EX_5adtststerone_e,EX_5adtststerones_e,EX_5fthf_e,EX_5htrp_e,EX_5mthf_e,EX_5thf_e,EX_6dhf_e,...,PVSitr,RSVLACitr,TLACFVSitr,TMDM1itr,TMDM5itr,ACMPGLUTTRsc,FVSCOAhc,MDZGLChr,TMACMPhr,CYSACMPitr
0,0.036521,0.709319,45.260521,0.072090,0.050733,-2.405421e-08,0.105986,51.659498,3.653280e-09,-4.952430e-10,...,4.234605e-08,-0.166020,-2.869360e-08,0.0,-2.522106e-08,4.225813e-08,6.794126e-09,0.0,2.163343e-08,4.240940e-08
1,0.039121,0.707917,39.600044,0.071947,0.050632,-2.402187e-08,0.105765,49.190133,2.800261e-09,-6.420394e-10,...,4.191323e-08,-0.176727,-2.942347e-08,0.0,-2.359806e-08,4.467662e-08,6.770609e-09,0.0,2.374358e-08,4.479220e-08
2,0.039089,0.707323,37.903778,0.065803,0.050590,-2.575946e-08,0.105669,48.870735,2.163628e-09,-1.352484e-09,...,4.008261e-08,-0.220126,-2.615998e-08,0.0,-2.237338e-08,4.597571e-08,6.867840e-09,0.0,2.464084e-08,4.638519e-08
3,0.035923,0.709030,37.991038,0.062751,0.051901,-2.615411e-08,0.105957,47.890061,2.568187e-09,-1.270401e-09,...,3.991410e-08,-0.213648,-2.692078e-08,0.0,-2.329486e-08,4.649615e-08,6.298212e-09,0.0,2.422644e-08,4.682154e-08
4,0.033207,0.682011,37.825996,0.078666,0.050365,-2.636702e-08,0.105490,47.116150,2.831308e-09,-8.499674e-10,...,3.996257e-08,-0.212549,-2.844249e-08,0.0,-2.303829e-08,4.707668e-08,6.272063e-09,0.0,2.423163e-08,4.740116e-08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.120440,0.740407,3.084707,0.120115,0.205931,-2.292785e-08,0.277961,59.853474,5.738637e-10,-1.351722e-09,...,2.925442e-08,-0.216319,-2.501225e-08,0.0,-2.552544e-08,4.900079e-08,5.985081e-09,0.0,1.921420e-08,4.908673e-08
996,0.122628,0.685514,3.122992,0.118145,0.194442,-2.342642e-08,0.307978,60.464724,1.638639e-10,-8.855577e-10,...,3.005385e-08,-0.216163,-2.571597e-08,0.0,-2.558048e-08,4.909821e-08,6.144080e-09,0.0,1.949323e-08,4.918446e-08
997,0.122451,0.684662,3.070970,0.115416,0.193966,-2.343943e-08,0.302722,60.023990,1.030879e-10,-1.229499e-09,...,3.028667e-08,-0.221368,-2.514983e-08,0.0,-2.562073e-08,4.916840e-08,6.086442e-09,0.0,1.971674e-08,4.925180e-08
998,0.119052,0.787824,2.835021,0.114874,0.194778,-2.253846e-08,0.231407,59.800374,4.713387e-11,-1.416559e-09,...,2.992897e-08,-0.216522,-2.590258e-08,0.0,-2.547648e-08,4.887954e-08,6.005396e-09,0.0,2.001028e-08,4.895366e-08


### 3. DO not 'summarize' flux probability distributions with a simple average, but use more advanced methods. The objective here is to identify the most different reactions across the two cancer families. 

In [16]:
subsystem_dict = {}

for r in model.reactions:
    # get subsystem safely
    s = (getattr(r, "subsystem", "") or "").strip()

    # if it contains "/", keep only the first part
    if "/" in s:
        s = s.split("/")[0].strip()   # take text before "/" and remove spaces

    s = s.lower()

    # skip empty subsystems
    if not s:
        continue

    # add reaction to that subsystem list
    subsystem_dict.setdefault(s, []).append(r.id)

In [20]:
category_keywords = {
    "Glycolysis": [
        "glycolysis", "gluconeogenesis"
    ],
    
    "TCA": [
        "tca", "citric acid", "krebs"
    ],
    
    "Pentose Phosphate Pathway": [
        "pentose phosphate", "ppp"
    ],
    
    "Amino acid metabolism": [
        "amino acid",
        "tyrosine", "phenylalanine", "tryptophan",
        "lysine", "leucine", "isoleucine", "valine",
        "methionine", "cysteine", "serine", "threonine",
        "histidine", "arginine", "ornithine", "proline",
        "glutamate", "glutamine", "aspartate", "asparagine",
        "alanine", "glycine"
    ],
    
    "Nucleotide metabolism": [
        "nucleotide", "purine", "pyrimidine", "deoxynucleotide", "dntp"
    ],
    
    "Fatty acid metabolism": [
        "fatty acid", "beta oxidation", "beta-oxidation",
        "acyl", "acyl-coa"
    ],
    
    "Lipid metabolism": [
        "lipid", "phospholipid", "sphingolipid", "glycerolipid"
    ],
}

In [23]:
keys = list(subsystem_dict.keys())
print(keys)

['transport, mitochondrial', 'transport, extracellular', 'transport, lysosomal', 'extracellular exchange', 'vitamin d metabolism', 'transport, endoplasmic reticular', 'beta-alanine metabolism', 'glycine, serine, alanine, and threonine metabolism', 'methionine and cysteine metabolism', 'lysine metabolism', 'tryptophan metabolism', 'tyrosine metabolism', 'ubiquinone synthesis', 'taurine and hypotaurine metabolism', 'cytochrome metabolism', 'steroid metabolism', 'sphingolipid metabolism', 'o-glycan metabolism', 'blood group synthesis', 'glutamate metabolism', 'valine, leucine, and isoleucine metabolism', 'fatty acid oxidation', 'transport, peroxisomal', 'propanoate metabolism', 'transport, golgi apparatus', 'aminosugar metabolism', 'transport, nuclear', 'urea cycle', 'citric acid cycle', 'vitamin b2 metabolism', 'nucleotide interconversion', 'arginine and proline metabolism', 'purine synthesis', 'keratan sulfate synthesis', 'alanine and aspartate metabolism', 'n-glycan degradation', 'bile

In [24]:
selected = {cat: [] for cat in category_keywords}

In [25]:
for subsystem_name, rxns in subsystem_dict.items():
    name_lower = subsystem_name.lower()

    for category, keywords in category_keywords.items():
        if any(kw in name_lower for kw in keywords):
            selected[category].extend(rxns)

In [26]:
for cat, rxns in selected.items():
    print(f"{cat}: {len(set(rxns))} reactions")

Glycolysis: 42 reactions
TCA: 20 reactions
Pentose Phosphate Pathway: 41 reactions
Amino acid metabolism: 439 reactions
Nucleotide metabolism: 273 reactions
Fatty acid metabolism: 1200 reactions
Lipid metabolism: 369 reactions


In [27]:
for rxn_id in selected["Lipid metabolism"]:
    rxn = model.reactions.get_by_id(rxn_id)
    print(rxn_id, " → ", rxn.subsystem)

A4GALTc  →  Sphingolipid metabolism
A4GALTg  →  Sphingolipid metabolism
B3GALT3g  →  Sphingolipid metabolism
B3GALT42g  →  Sphingolipid metabolism
B3GNT31g  →  Sphingolipid metabolism
B3GNT34g  →  Sphingolipid metabolism
B3GNT37g  →  Sphingolipid metabolism
B3GNT39g  →  Sphingolipid metabolism
DHCRD2  →  Sphingolipid metabolism
DSAT  →  Sphingolipid metabolism
GALGT2  →  Sphingolipid metabolism
GALGT3  →  Sphingolipid metabolism
GAO1g  →  Sphingolipid metabolism
GBA  →  Sphingolipid metabolism
GBGT1  →  Sphingolipid metabolism
GLAl  →  Sphingolipid metabolism
GLB1  →  Sphingolipid metabolism
NAGAlby  →  Sphingolipid metabolism
SBPP3er  →  Sphingolipid metabolism
SGPL12r  →  Sphingolipid metabolism
SLCBK1  →  Sphingolipid metabolism
SMPD3g  →  Sphingolipid metabolism
SMS  →  Sphingolipid metabolism
SPHK21c  →  Sphingolipid metabolism
ST3GAL21g  →  Sphingolipid metabolism
ST3GAL22g  →  Sphingolipid metabolism
ST3GAL23g  →  Sphingolipid metabolism
ST6GALNAC21  →  Sphingolipid metabolism
S

In [34]:
fva_result_gly = flux_variability_analysis(
    model,
    fraction_of_optimum=0.9  ,  # 90% of optimal biomass,
    reaction_list=selected["Glycolysis"]
)

Set parameter Username
Set parameter LicenseID to value 2732828
Academic license - for non-commercial use only - expires 2026-11-04
Read LP format model from file /var/folders/0f/7pcrwybx3_zfsnzs2qjgxp8h0000gn/T/tmpqwm49o43.lp
Reading time = 0.02 seconds
: 5836 rows, 21201 columns, 80853 nonzeros
Set parameter Username
Set parameter LicenseID to value 2732828
Academic license - for non-commercial use only - expires 2026-11-04
Read LP format model from file /var/folders/0f/7pcrwybx3_zfsnzs2qjgxp8h0000gn/T/tmplqmwriou.lp
Reading time = 0.03 seconds
: 5836 rows, 21201 columns, 80853 nonzeros
Set parameter Username
Set parameter LicenseID to value 2732828
Academic license - for non-commercial use only - expires 2026-11-04
Read LP format model from file /var/folders/0f/7pcrwybx3_zfsnzs2qjgxp8h0000gn/T/tmpex877pey.lp
Reading time = 0.03 seconds
: 5836 rows, 21201 columns, 80853 nonzeros
Set parameter Username
Set parameter LicenseID to value 2732828
Academic license - for non-commercial use 

In [36]:
fva_result_gly

Unnamed: 0,minimum,maximum
ALCD2y,0.0,1000.0
ALDD2xm,0.0,1000.0
ALDD2y,0.0,1000.0
CBPPer,0.0,1000.0
DPGase,0.0,1000.0
G6PPer,0.0,1000.0
PDHm,0.0,1000.0
PYK2,0.0,1000.0
r0202,-1000.0,1000.0
r0354,0.0,1000.0


### 4. Use non-parametric test: You could run non parametric statistical tests such as mann-whitney to check if two probability distributions are significantly different or not. You have 3 cells vs 2 cells (all pair combinations), so you could perform this test only on reactions belonging to core subsystems such as glycolysis, TCA cycle pentophosphate etc.. in order to redure the number of compared distribution per cell couple.

In [None]:
reactions_by_category = {
    "glycolysis": selected["Glycolysis"],
    "tca": selected["TCA"],
    "ppp": selected["Pentose Phosphate Pathway"],
    "amino_acid": selected["Amino acid metabolism"],
    "nucleotide": selected["Nucleotide metabolism"],
    "fatty_acid": selected["Fatty acid metabolism"],
    "lipid": selected["Lipid metabolism"],
}

- index = rxn_id
- columns = cell_id

In [42]:
def subsystem_df(models, subsystem, N=4):
    # 1. 

SyntaxError: incomplete input (869462133.py, line 2)