### Mounting Google Drive in Colab  
We first need to mount our Google Drive to access datasets, scripts, and saved outputs.  
The `drive.mount('/content/drive')` command creates a link between Colab and your Google Drive account, so that all files stored there can be read and written just like a local directory.  
After running this cell, you’ll be prompted to authorize access by logging into your Google account.  
____

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#!rm -rf /content/scFBApy

### Install and set up scFBApy  
We install the required Python packages (`cobra`, `scanpy`, `progressbar2`), then clone the scFBApy repository from GitHub.  
Next, we add the repo to the Python path so its utilities can be imported, and finally load the key libraries needed for flux balance analysis.  
____

In [None]:
# Install required packages
!pip install cobra scanpy progressbar2

# Clone the scFBApy repo if you haven't yet
!git clone https://github.com/CompBtBs/scFBApy.git

# Add scFBApy folder to Python path so you can import utils_scFBApy
import sys
sys.path.append('/content/scFBApy')

# Now import libraries
import cobra as cb
import scanpy as sc
from utils_scFBApy import scFBApy, repairNeg

print("All packages installed and imported successfully.")

Cloning into 'scFBApy'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 39 (delta 14), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (39/39), 1.46 MiB | 11.86 MiB/s, done.
Resolving deltas: 100% (14/14), done.
All packages installed and imported successfully.
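
Re-running the setup cell as written makes `git clone` fail once `/content/scFBApy` exists, and it appends the same entry to `sys.path` again. A minimal sketch of guards for both steps follows; the helper names `ensure_repo` and `ensure_on_path` are illustrative and not part of scFBApy.

```python
import os
import subprocess
import sys


def ensure_repo(url, dest):
    """Clone `url` into `dest` unless `dest` already exists.

    Returns True if a clone was run, False if the directory was already there.
    """
    if os.path.isdir(dest):
        return False
    subprocess.run(["git", "clone", url, dest], check=True)
    return True


def ensure_on_path(path):
    """Append `path` to sys.path only if it is not already present."""
    if path not in sys.path:
        sys.path.append(path)


# Usage in Colab (URL and path taken from the cell above):
# ensure_repo("https://github.com/CompBtBs/scFBApy.git", "/content/scFBApy")
# ensure_on_path("/content/scFBApy")
```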


### Import core libraries  
We load `numpy` for numerical operations, `pandas` for data handling, `time` for tracking runtimes, and `os` for file management.  
____

In [None]:
import numpy as np
import pandas as pd
import time
import os

### Inspect repairNeg function  
We use Python’s `inspect` module to check the input arguments of the `repairNeg` function from scFBApy.  
This helps us understand how to call the function correctly.  
The `repairNeg` function accepts `adata`, `bulk_el`, and optional parameters `filter_bulk` (default `True`) and `epsilon` (default `0.0001`).  
____

In [None]:
import inspect
from utils_scFBApy import repairNeg

print(inspect.getfullargspec(repairNeg))

FullArgSpec(args=['adata', 'bulk_el', 'filter_bulk', 'epsilon'], varargs=None, varkw=None, defaults=(True, 0.0001), kwonlyargs=[], kwonlydefaults=None, annotations={})
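
`inspect.signature` gives the same information in a form that pairs each parameter with its default. The sketch below uses a stand-in function: `repair_like` is hypothetical and only copies the argument list printed above, it is not the scFBApy implementation.

```python
import inspect


def repair_like(adata, bulk_el, filter_bulk=True, epsilon=0.0001):
    """Stand-in with the same argument list as the FullArgSpec shown above."""
    return adata


sig = inspect.signature(repair_like)

# Parameters without a default are required; the rest are optional.
required = [name for name, p in sig.parameters.items()
            if p.default is inspect.Parameter.empty]
optional = {name: p.default for name, p in sig.parameters.items()
            if p.default is not inspect.Parameter.empty}

print("required:", required)
print("optional:", optional)
```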


### Load processed single-cell data  
We load the preprocessed AnnData file and inspect its metadata columns using `adata.obs.columns`.

In [None]:
import scanpy as sc

# Load your processed AnnData file
adata = sc.read_h5ad('/content/drive/MyDrive/MultimodalCSVs/gene_expression_processed.h5ad')

# Now you can print obs columns
print(adata.obs.columns)

Index(['sample_id', 'patient_id', 'response'], dtype='object')


In [None]:
print(adata.obs_names[:10])


Index(['AAACCTGAGAAGGGTA-1', 'AAACCTGAGACTGTAA-1', 'AAACCTGAGCAGCGTA-1',
       'AAACCTGAGCCAACAG-1', 'AAACCTGAGCGTGAAC-1', 'AAACCTGAGCTACCTA-1',
       'AAACCTGAGCTGTTCA-1', 'AAACCTGAGGATTCGG-1', 'AAACCTGAGGCATTGG-1',
       'AAACCTGAGTGAATTG-1'],
      dtype='object')


____
### Create pseudo-bulk and repair negative fluxes  
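We build a pseudo-bulk profile by averaging each gene's expression across all cells, then append it to the original AnnData as one extra observation named `pseudo_bulk`.  
`repairNeg` uses this observation as its reference (`bulk_el`) when repairing values in the expression matrix; no fluxes exist at this stage, so the correction applies to gene expression rather than to fluxes.  
Finally, we print how many genes were fixed and compare the dataset shapes before and after repair. With `filter_bulk=True` the repaired object has one row fewer than the input, consistent with the pseudo-bulk row being removed again.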
____

In [None]:
import anndata

# Prepare pseudo-bulk expression vector (mean per gene)
X_dense = adata.X.toarray() if not isinstance(adata.X, np.ndarray) else adata.X
pseudo_bulk = np.mean(X_dense, axis=0)

# Create new AnnData for pseudo-bulk cell
pseudo_bulk_adata = anndata.AnnData(
    X=pseudo_bulk.reshape(1, -1),
    var=adata.var.copy(),
    obs=pd.DataFrame(index=['pseudo_bulk'])
)

# Concatenate using anndata.concat (recommended)
adata_with_bulk = anndata.concat([adata, pseudo_bulk_adata], join='outer', label=None, index_unique=None)

# Check the obs_names to find the exact name of the pseudo-bulk cell
print("Last 5 obs_names:", adata_with_bulk.obs_names[-5:])

# Now run repairNeg using the pseudo-bulk cell name
bulk_el_name = 'pseudo_bulk'  # should match the printed name exactly

adata_repaired, cont = repairNeg(adata_with_bulk, bulk_el=bulk_el_name, filter_bulk=True, epsilon=1e-4)

print(f"Total genes fixed: {cont}")
print(f"Shape before repair: {adata_with_bulk.shape}")
print(f"Shape after repair: {adata_repaired.shape}")


Last 5 obs_names: Index(['TTTGTCATCACATGCA-1', 'TTTGTCATCACGATGT-1', 'TTTGTCATCACTTCAT-1',
       'TTTGTCATCTGTTTGT-1', 'pseudo_bulk'],
      dtype='object')
Total genes changes: 0


Total genes fixed: 0
Shape before repair: (58178, 2000)
Shape after repair: (58177, 2000)
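
The averaging and appending step can be illustrated on a toy matrix. This NumPy sketch uses invented numbers in place of the expression matrix and covers only the pseudo-bulk construction, not what `repairNeg` does with it.

```python
import numpy as np

# Toy expression matrix: 3 cells x 4 genes (values are invented).
X = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
])

# Pseudo-bulk profile: mean of each gene (column) over all cells.
pseudo_bulk = X.mean(axis=0)

# Append it as one extra row, as the notebook does with anndata.concat.
X_with_bulk = np.vstack([X, pseudo_bulk.reshape(1, -1)])

print(pseudo_bulk)                 # per-gene means
print(X.shape, X_with_bulk.shape)  # one extra row after appending
```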


____
### Load metabolic model with COBRA  
We read the SBML model file using `cobra.io.read_sbml_model`, which loads the metabolic network into Python.  
`model.summary()` gives a quick overview of the model’s objective function, uptake, and secretion fluxes, helping us inspect metabolic constraints before running flux balance analysis.
____

In [None]:
import cobra

model_path = "/content/scFBApy/models/model.xml"  # adjust path if needed
model = cobra.io.read_sbml_model(model_path)

print(model.summary())

Objective
1.0 Biomass = 95.20142307836949

Uptake
------
Metabolite     Reaction  Flux  C-Number C-Flux
 Lcystin_e EX_Lcystin_e 518.3         0  0.00%
  arg__L_e  EX_arg__L_e  34.2         6  1.43%
     fol_e     EX_fol_e 142.5        19 18.85%
  glc__D_e  EX_glc__D_e 544.3         6 22.73%
  gln__L_e  EX_gln__L_e  1000         5 34.79%
     h2o_e     EX_h2o_e  1000         0  0.00%
  his__L_e  EX_his__L_e 113.3         6  4.73%
  ile__L_e  EX_ile__L_e 27.24         6  1.14%
  leu__L_e  EX_leu__L_e 63.59         6  2.66%
  lys__L_e  EX_lys__L_e 56.37         6  2.35%
  met__L_e  EX_met__L_e 14.57         5  0.51%
      o2_e      EX_o2_e 554.1         0  0.00%
  phe__L_e  EX_phe__L_e  24.7         9  1.55%
      pi_c      EX_pi_e 91.27         0  0.00%
  thr__L_e  EX_thr__L_e 41.42         4  1.15%
  trp__L_e  EX_trp__L_e 1.267        11  0.10%
  tyr__L_e  EX_tyr__L_e  15.2         9  0.95%
  val__L_e  EX_val__L_e 203.2         5  7.07%

Secretion
---------
Metabolite    Reaction   Flux

In [None]:
print(model.objective) # Show the model's objective function (what FBA will optimize)

Maximize
1.0*Biomass - 1.0*Biomass_reverse_57a34
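
The `Biomass_reverse_57a34` term appears because cobrapy represents each reaction with two non-negative solver variables, one for the forward direction and one for the reverse, and the reported flux is their difference. Maximizing `Biomass - Biomass_reverse_57a34` therefore maximizes the net biomass flux. A plain-Python sketch of that decomposition (the helper names are illustrative, not cobrapy API):

```python
def split_flux(net):
    """Split a net flux into non-negative (forward, reverse) parts."""
    forward = max(0.0, net)
    reverse = max(0.0, -net)
    return forward, reverse


def net_flux(forward, reverse):
    """Net flux through a reaction is forward minus reverse."""
    return forward - reverse


for v in (95.2, -118.4, 0.0):
    fwd, rev = split_flux(v)
    print(v, "->", (fwd, rev), "->", net_flux(fwd, rev))
```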


____
### Initialize scFBApy (stub version)  
This stub class sets up a simplified scFBApy pipeline: it matches genes between the model and the data, creates a placeholder table in place of reaction activity scores (RAS) with one column per matched gene, and looks up the objective reaction in the model.  
We test it on a small subset of 10 cells to verify that the setup works without running the full flux computation.
____

In [None]:
class scFBApy_stub:
    def __init__(self, model, adata, objective="Biomass", val_nan=0.0):
        t_start = time.time()
        print(" Starting scFBApy_stub initialization...")

        self.model = model
        self.adata = adata
        self.val_nan = val_nan
        self.objective = objective

        print(" Step 1: Model and data assigned.")

        # Step 2: Match genes
        t0 = time.time()
        try:
            self.genes = list(set(adata.var_names).intersection([g.id for g in model.genes]))
            print(f"Step 2: Matched {len(self.genes)} genes.  ({time.time() - t0:.2f}s)")
        except Exception as e:
            print(" Error during gene matching:", e)
            return

        # Step 3: Prepare placeholder reaction activity scores (RAS)
        t0 = time.time()
        try:
            # Fill the table with the placeholder value at construction (gives a numeric dtype)
            ras_df = pd.DataFrame(self.val_nan, index=adata.obs_names, columns=self.genes)
            self.ras_df = ras_df
            print(f" Step 3: Reaction activity scores prepared.  ({time.time() - t0:.2f}s)")
        except Exception as e:
            print(" Error preparing RAS:", e)
            return

        # Step 4: Placeholder for flux setup
        t0 = time.time()
        try:
            self.objective_reaction = model.reactions.get_by_id(objective)
            print(f" Step 4: Objective '{objective}' loaded.  ({time.time() - t0:.2f}s)")
        except Exception as e:
            print(f" Error finding objective '{objective}':", e)
            return

        print(f" scFBApy_stub fully initialized in {time.time() - t_start:.2f} seconds.")

# Run with very small adata to verify it works
adata_tiny = adata_repaired[:10].copy()  # just 10 cells to test
sf = scFBApy_stub(model, adata_tiny, objective="Biomass")

 Starting scFBApy_stub initialization...
 Step 1: Model and data assigned.
Step 2: Matched 29 genes.  (0.00s)
 Step 3: Reaction activity scores prepared.  (0.00s)
 Step 4: Objective 'Biomass' loaded.  (0.00s)
 scFBApy_stub fully initialized in 0.00 seconds.
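
For context on what the placeholder stands in for: a reaction activity score is typically derived from a reaction's gene-protein-reaction (GPR) rule by combining gene expression values, using one function for `and` and another for `or`. The batch cell later in this notebook passes `np.nanmin` and `np.nansum` for these. The toy evaluator below is a sketch of that idea only. The nested-tuple rule format and the function `ras_from_rule` are assumptions for illustration, not scFBApy's internals.

```python
import numpy as np


def ras_from_rule(rule, expr, and_fn=np.nanmin, or_fn=np.nansum):
    """Evaluate a GPR rule given as a gene id or an ("and"/"or", [sub-rules]) tuple."""
    if isinstance(rule, str):
        # Genes missing from the data become NaN, which the nan-aware
        # functions skip as long as at least one value is present.
        return expr.get(rule, np.nan)
    op, parts = rule
    values = [ras_from_rule(p, expr, and_fn, or_fn) for p in parts]
    return float(and_fn(values)) if op == "and" else float(or_fn(values))


# (g1 and g2) or g3, with invented expression values
rule = ("or", [("and", ["g1", "g2"]), "g3"])
expr = {"g1": 2.0, "g2": 5.0, "g3": 3.0}

print(ras_from_rule(rule, expr))
```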


____
### Compute and save fluxes in batches  
We divide the repaired AnnData into batches of 100 cells to avoid memory overload.  
For each batch, scFBApy computes fluxes based on the biomass objective, preserves metadata, and saves the results as an H5AD file.  
Existing batch files are skipped to prevent re-computation, and memory is cleared after each batch.  
____

In [None]:
save_dir = "/content/drive/MyDrive/flux_batches"
os.makedirs(save_dir, exist_ok=True)

batch_size = 100
all_obs = list(adata_repaired.obs_names)
n_total = len(all_obs)

for i in range(0, n_total, batch_size):
    batch_file = f"{save_dir}/flux_batch_{i}_{min(i+batch_size, n_total)}.h5ad"
    if os.path.exists(batch_file):
        print(f" Skipping batch {i}-{min(i+batch_size, n_total)} (already saved)")
        continue

    print(f" Processing batch {i} to {min(i+batch_size, n_total)}")

    batch_obs = all_obs[i:i+batch_size]
    adata_batch = adata_repaired[batch_obs].copy()

    flux_batch = scFBApy(
        model,
        adata_batch,
        objective="Biomass",
        eps=0.001,
        compute_fva=True,
        npop_fva=5,
        type_ras_normalization="max",
        and_expression=np.nanmin,
        or_expression=np.nansum,
        fraction_of_optimum=0,
        processes=1,
        round_c=10
    )

    # Preserve metadata in flux batch
    flux_batch.obs = adata_batch.obs.copy()

    flux_batch.write(batch_file)
    print(f" Saved: {batch_file}")

    # Free memory explicitly
    del flux_batch, adata_batch
    import gc; gc.collect()

 Skipping batch 0-100 (already saved)
 Skipping batch 100-200 (already saved)
 Skipping batch 200-300 (already saved)
 Skipping batch 300-400 (already saved)
 Skipping batch 400-500 (already saved)
 Skipping batch 500-600 (already saved)
 Skipping batch 600-700 (already saved)
 Skipping batch 700-800 (already saved)
 Skipping batch 800-900 (already saved)
 Skipping batch 900-1000 (already saved)
 Skipping batch 1000-1100 (already saved)
 Skipping batch 1100-1200 (already saved)
 Skipping batch 1200-1300 (already saved)
 Skipping batch 1300-1400 (already saved)
 Skipping batch 1400-1500 (already saved)
 Skipping batch 1500-1600 (already saved)
 Skipping batch 1600-1700 (already saved)
 Skipping batch 1700-1800 (already saved)
 Skipping batch 1800-1900 (already saved)
 Skipping batch 1900-2000 (already saved)
 Skipping batch 2000-2100 (already saved)
 Skipping 
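
The skip-if-exists pattern is what makes the loop resumable after a Colab disconnect. Below is a stripped-down sketch of the same control flow using only the standard library. `compute` is a hypothetical stand-in for the scFBApy call, and the files are plain text rather than H5AD.

```python
import os
import tempfile


def batch_ranges(n_total, batch_size):
    """Yield (start, stop) pairs covering range(n_total) in steps of batch_size."""
    for start in range(0, n_total, batch_size):
        yield start, min(start + batch_size, n_total)


def run_batches(n_total, batch_size, save_dir, compute):
    """Process each batch once; skip batches whose output file already exists."""
    done, skipped = [], []
    for start, stop in batch_ranges(n_total, batch_size):
        path = os.path.join(save_dir, f"flux_batch_{start}_{stop}.txt")
        if os.path.exists(path):
            skipped.append((start, stop))
            continue
        with open(path, "w") as fh:
            fh.write(compute(start, stop))
        done.append((start, stop))
    return done, skipped


with tempfile.TemporaryDirectory() as tmp:
    first = run_batches(250, 100, tmp, lambda a, b: f"{a}-{b}")
    second = run_batches(250, 100, tmp, lambda a, b: f"{a}-{b}")

print(first)   # every batch computed on the first pass
print(second)  # every batch skipped on the second pass
```

One caveat with this pattern: a file left half-written by an interrupted run would also be treated as done, so writing to a temporary name and renaming on success may be worth considering.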

____
### Load and inspect a saved flux batch  
We read a previously computed flux batch and preview its `.obs` metadata.  
This lets us verify that cell-level identifiers and response labels are correctly preserved after flux computation.
____

In [None]:
# Load final saved batch
flux_batch_path = "/content/drive/MyDrive/flux_batches/flux_batch_58100_58177.h5ad"
adata_flux = sc.read_h5ad(flux_batch_path)

# Print .obs head
print(adata_flux.obs.head())
print("\n Metadata columns:", list(adata_flux.obs.columns))


                         sample_id patient_id       response
TTTCCTCGTAAGTAGT-1  GSM9061674_S10        PT5  Non-responder
TTTCCTCGTCTCCACT-1  GSM9061674_S10        PT5  Non-responder
TTTCCTCGTGATAAGT-1  GSM9061674_S10        PT5  Non-responder
TTTCCTCGTGGAAAGA-1  GSM9061674_S10        PT5  Non-responder
TTTCCTCGTTAAAGTG-1  GSM9061674_S10        PT5  Non-responder

 Metadata columns: ['sample_id', 'patient_id', 'response']
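
Looking at `head()` confirms that the columns exist, but not that every label survived. A stricter check compares a batch's metadata against the source table by barcode. The sketch below uses toy frames, and `metadata_preserved` is a hypothetical helper.

```python
import pandas as pd


def metadata_preserved(source_obs, batch_obs):
    """True if every batch barcode exists in the source with identical metadata."""
    if not batch_obs.index.isin(source_obs.index).all():
        return False
    expected = source_obs.loc[batch_obs.index, batch_obs.columns]
    return expected.equals(batch_obs)


source = pd.DataFrame(
    {"patient_id": ["PT1", "PT1", "PT5"],
     "response": ["Responder", "Responder", "Non-responder"]},
    index=["AAA-1", "CCC-1", "TTT-1"],
)
batch = source.loc[["CCC-1", "TTT-1"]].copy()

print(metadata_preserved(source, batch))   # metadata identical to the source

batch.loc["TTT-1", "response"] = "Responder"
print(metadata_preserved(source, batch))   # one label was changed
```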


____
### Merge small flux batches  
We scan the flux batch directory and collect all batches with ≤100 cells.  
These small batches are concatenated into a single AnnData object and saved, simplifying downstream analysis while keeping memory usage manageable.
____

In [None]:
# Directory containing the individual batch files
flux_dir = "/content/drive/MyDrive/flux_batches"
all_files = sorted([f for f in os.listdir(flux_dir) if f.endswith(".h5ad") and "merged" not in f])

# Initialize a list to hold small batches (≤100 cells)
small_batches = []

for file in all_files:
    file_path = os.path.join(flux_dir, file)
    # Use a separate name so the expression AnnData loaded earlier as `adata` is not overwritten
    batch = sc.read_h5ad(file_path)

    if batch.n_obs <= 100:
        print(f" Including {file}: {batch.n_obs} cells")
        small_batches.append(batch)
    else:
        print(f"⏭ Skipping {file}: {batch.n_obs} cells")

# Merge all small batches together
if small_batches:
    print(f"\n Merging {len(small_batches)} batches...")
    # anndata.concat replaces the deprecated AnnData.concatenate method
    merged_small = anndata.concat(
        small_batches,
        join='outer',
        index_unique=None
    )

    # Save merged result
    merged_path = os.path.join(flux_dir, "merged_small_batches_flux.h5ad")
    merged_small.write(merged_path)
    print(f"\n Merged small batches saved to: {merged_path}")
    print(f" Final shape: {merged_small.shape}")
else:
    print("\n No batches with ≤100 cells found.")

 Including flux_batch_0_100.h5ad: 100 cells
 Including flux_batch_10000_10100.h5ad: 100 cells
 Including flux_batch_1000_1100.h5ad: 100 cells
 Including flux_batch_100_200.h5ad: 100 cells
 Including flux_batch_10100_10200.h5ad: 100 cells
 Including flux_batch_10200_10300.h5ad: 100 cells
 Including flux_batch_10300_10400.h5ad: 100 cells
 Including flux_batch_10400_10500.h5ad: 100 cells
 Including flux_batch_10500_10600.h5ad: 100 cells
 Including flux_batch_10600_10700.h5ad: 100 cells
 Including flux_batch_10700_10800.h5ad: 100 cells
 Including flux_batch_10800_10900.h5ad: 100 cells
 Including flux_batch_10900_11000.h5ad: 100 cells
 Including flux_batch_11000_11100.h5ad: 100 cells
 Including flux_batch_1100_1200.h5ad: 100 cells
 Including flux_batch_11100_11200.h5ad: 100 cells
 Including flux_batch_11200_11300.h5ad: 100 cells
 Including flux_batch_11300_11400.h5ad: 100 cells
 Including flux_batch_11400_11500.h5ad: 100 cells
 Including flux_batch_11500_11600.h5ad: 100 cells
 Including flu

  utils.warn_names_duplicates("obs")
  utils.warn_names_duplicates("obs")



 Merged small batches saved to: /content/drive/MyDrive/flux_batches/merged_small_batches_flux.h5ad
 Final shape: (58254, 400)
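
Two things in the output above deserve a check. First, `sorted()` orders file names lexicographically, so `flux_batch_10000_10100.h5ad` comes before `flux_batch_1000_1100.h5ad` and the merged rows no longer follow the original cell order. Second, the merged object has 58254 rows, 77 more than the 58177 repaired cells, and anndata warns about duplicate `obs` names, which suggests that some cells were read from more than one file. The sketch below sorts batch files by their numeric start index and flags overlapping ranges. It assumes the `flux_batch_<start>_<stop>.h5ad` naming used above, and the file list is a made-up example.

```python
import re

BATCH_RE = re.compile(r"flux_batch_(\d+)_(\d+)\.h5ad$")


def batch_span(filename):
    """Return (start, stop) parsed from a batch file name, or None if it does not match."""
    m = BATCH_RE.search(filename)
    return (int(m.group(1)), int(m.group(2))) if m else None


def sort_batches(filenames):
    """Keep only batch files and order them by numeric start index."""
    return sorted((f for f in filenames if batch_span(f)), key=batch_span)


def overlapping(filenames):
    """Return pairs of consecutive batch files whose index ranges overlap."""
    ordered = sort_batches(filenames)
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if batch_span(b)[0] < batch_span(a)[1]]


files = [
    "flux_batch_0_100.h5ad",
    "flux_batch_10000_10100.h5ad",
    "flux_batch_1000_1100.h5ad",
    "flux_batch_100_200.h5ad",
    "flux_batch_150_200.h5ad",        # hypothetical leftover from an earlier run
    "merged_small_batches_flux.h5ad",
]

print(sort_batches(files))
print(overlapping(files))
```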


____
### Convert flux data to a single DataFrame  
We transform the flux values from the AnnData object into a pandas DataFrame and combine it with cell metadata.  
This creates a complete table where each row is a cell and columns include both metadata and fluxes, ready for downstream analysis.
____

In [None]:
# Create DataFrame directly
flux_df = pd.DataFrame(adata_flux.X, index=adata_flux.obs_names, columns=adata_flux.var_names)

# Combine with metadata
final_flux_df = pd.concat([adata_flux.obs, flux_df], axis=1)

# Inspect result
print(final_flux_df.shape)
print(final_flux_df.head())

(77, 403)
                         sample_id patient_id       response       PYRt2  \
TTTCCTCGTAAGTAGT-1  GSM9061674_S10        PT5  Non-responder -118.447496   
TTTCCTCGTCTCCACT-1  GSM9061674_S10        PT5  Non-responder   11.857397   
TTTCCTCGTGATAAGT-1  GSM9061674_S10        PT5  Non-responder    6.120228   
TTTCCTCGTGGAAAGA-1  GSM9061674_S10        PT5  Non-responder    0.000000   
TTTCCTCGTTAAAGTG-1  GSM9061674_S10        PT5  Non-responder   13.402617   

                          HEX1  G6PP         PGI         PFK  FBP         FBA  \
TTTCCTCGTAAGTAGT-1  547.796430   0.0  519.513524  503.902705  0.0  503.902705   
TTTCCTCGTCTCCACT-1  540.513190   0.0  516.540045  503.308009  0.0  503.308009   
TTTCCTCGTGATAAGT-1  540.082315   0.0  516.364135  503.272827  0.0  503.272827   
TTTCCTCGTGGAAAGA-1  545.373687   0.0  518.524407  503.704881  0.0  503.704881   
TTTCCTCGTTAAAGTG-1  547.744789   0.0  519.492441  503.898488  0.0  503.898488   

                    ...  Transport_ala_B_c_e  

In [None]:
save_dir = "/content/drive/MyDrive/flux_batches" # Directory containing flux batch files

batch_files = [f for f in os.listdir(save_dir) if f.endswith(".h5ad")] # List all .h5ad batch files

for bf in batch_files:
    path = os.path.join(save_dir, bf) # Full path to batch file
    adata = sc.read_h5ad(path)
    n_cells = adata.shape[0] # Number of cells in batch
    if n_cells > 100:
        print(f"{bf}: {adata.shape}  (cells, features)")

  utils.warn_names_duplicates("obs")


merged_flux_data.h5ad: (58177, 400)  (cells, features)
merged_flux_data_unique_obs.h5ad: (58177, 400)  (cells, features)
merged_flux_data_cleaned.h5ad: (58254, 400)  (cells, features)
merged_small_batches_flux.h5ad: (58254, 400)  (cells, features)


  utils.warn_names_duplicates("obs")


In [None]:
# Load the merged AnnData
adata = sc.read_h5ad("/content/drive/MyDrive/flux_batches/merged_flux_data.h5ad")

# Print shape
print(f" Shape of merged_flux_data.h5ad: {adata.shape}")

# Convert to DataFrame (no .toarray() needed)
flux_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)

# Print head of the flux values
print("\n Head of flux values:")
print(flux_df.head())

 Shape of merged_flux_data.h5ad: (58177, 400)

 Head of flux values:
             PYRt2        HEX1  G6PP         PGI         PFK  FBP         FBA  \
cell_0   19.003712  555.130865   0.0  522.507904  504.501581  0.0  504.501581   
cell_1   10.141017  540.123923   0.0  516.381122  503.276224  0.0  503.276224   
cell_2 -558.010609  548.619770   0.0  519.849663  503.969933  0.0  503.969933   
cell_3   99.899583  549.261167   0.0  520.111522  504.022304  0.0  504.022304   
cell_4 -135.671707  542.006910   0.0  517.149876  503.429975  0.0  503.429975   

               TPI    GAPD     PGK  ...  Transport_ala_B_c_e  TMDK1  THYMDt1  \
cell_0  504.501581  1000.0  1000.0  ...                  0.0    0.0      0.0   
cell_1  503.276224  1000.0  1000.0  ...                  0.0    0.0      0.0   
cell_2  503.969933  1000.0  1000.0  ...                  0.0    0.0      0.0   
cell_3  504.022304  1000.0  1000.0  ...                  0.0    0.0      0.0   
cell_4  503.429975  1000.0  1000.0  ...     

  utils.warn_names_duplicates("obs")


In [None]:
# Step 1: Load flux data from h5ad
flux_adata = sc.read_h5ad("/content/drive/MyDrive/flux_batches/merged_flux_data.h5ad")
flux_df = pd.DataFrame(flux_adata.X, columns=flux_adata.var_names)

# Step 2: Load gene expression CSV
gene_path = "/content/drive/MyDrive/MultimodalCSVs/gene_expression_with_metadata.csv"
gene_df = pd.read_csv(gene_path, index_col=0)

# Step 3: Compare row counts (this does not verify that the cells are in the same order)
print(f"Flux shape: {flux_df.shape}")
print(f"Gene shape: {gene_df.shape}")

if flux_df.shape[0] != gene_df.shape[0]:
    raise ValueError(" Row count mismatch: cannot concatenate the two tables row by row.")

# Step 4: Concatenate column-wise by position. The barcodes are dropped here, so the
# result is only correct if both tables list the cells in the same order.
combined_df = pd.concat([flux_df.reset_index(drop=True), gene_df.reset_index(drop=True)], axis=1)

# Step 5: Save to Google Drive
combined_path = "/content/drive/MyDrive/MultimodalCSVs/combined_flux_gene.csv"
combined_df.to_csv(combined_path, index=False)
print(f" Combined dataset saved to:\n{combined_path}")

  utils.warn_names_duplicates("obs")


Flux shape: (58177, 400)
Gene shape: (58177, 2002)
 Combined dataset saved to:
/content/drive/MyDrive/MultimodalCSVs/combined_flux_gene.csv
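
Matching row counts do not guarantee that row *i* of the flux table and row *i* of the expression table describe the same cell, and `reset_index(drop=True)` discards the barcodes that could confirm it. Where both tables carry barcodes, joining on the index is safer. A minimal sketch with toy data follows. `combine_by_barcode` is a hypothetical helper, and it assumes both frames are indexed by unique cell barcodes, which is not the case for `merged_flux_data.h5ad` as loaded above, whose rows are named `cell_0`, `cell_1`, and so on.

```python
import pandas as pd


def combine_by_barcode(flux_df, gene_df):
    """Column-bind two per-cell tables after aligning them on their barcode index."""
    for name, df in (("flux", flux_df), ("gene", gene_df)):
        if not df.index.is_unique:
            raise ValueError(f"{name} table has duplicate barcodes")
    if set(flux_df.index) != set(gene_df.index):
        raise ValueError("the two tables do not contain the same barcodes")
    # Reorder the flux rows to follow the expression table, then bind columns.
    return pd.concat([flux_df.loc[gene_df.index], gene_df], axis=1)


gene = pd.DataFrame({"response": ["R", "NR", "R"]}, index=["A-1", "B-1", "C-1"])
flux = pd.DataFrame({"HEX1": [3.0, 1.0, 2.0]}, index=["C-1", "A-1", "B-1"])

print(combine_by_barcode(flux, gene))
```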


In [None]:
# Display head of the combined dataset
print(" Head of combined flux + gene expression data:")
display(combined_df.head())

 Head of combined flux + gene expression data:


Unnamed: 0,PYRt2,HEX1,G6PP,PGI,PFK,FBP,FBA,TPI,GAPD,PGK,...,ENSG00000198712,ENSG00000228253,ENSG00000198899,ENSG00000198840,ENSG00000212907,ENSG00000198786,ENSG00000198727,ENSG00000276256,ENSG00000277856,ENSG00000275063
0,19.003712,555.130865,0.0,522.507904,504.501581,0.0,504.501581,504.501581,1000.0,1000.0,...,1.765706,2.43839,2.329344,2.054127,2.181029,2.719458,1.961893,-0.027983,-0.015876,-0.02753
1,10.141017,540.123923,0.0,516.381122,503.276224,0.0,503.276224,503.276224,1000.0,1000.0,...,-0.216031,0.30059,-0.860992,-0.160776,-0.052397,-0.184155,0.048195,-0.027983,-0.015876,-0.02753
2,-558.010609,548.61977,0.0,519.849663,503.969933,0.0,503.969933,503.969933,1000.0,1000.0,...,0.931615,0.417104,-0.129276,0.75976,-1.443564,0.27722,0.157571,-0.027983,-0.015876,-0.02753
3,99.899583,549.261167,0.0,520.111522,504.022304,0.0,504.022304,504.022304,1000.0,1000.0,...,0.324596,0.623172,0.39944,0.875306,0.765933,0.503159,0.436196,-0.027983,-0.015876,-0.02753
4,-135.671707,542.00691,0.0,517.149876,503.429975,0.0,503.429975,503.429975,1000.0,1000.0,...,-0.194472,0.643405,0.206952,0.203505,0.814024,0.129373,0.151363,-0.027983,-0.015876,-0.02753


In [None]:
# Show head of flux data
print(" Flux data head:")
display(flux_df.head())

# Show head of gene expression data
print(" Gene expression data head:")
display(gene_df.head())

 Flux data head:


Unnamed: 0,PYRt2,HEX1,G6PP,PGI,PFK,FBP,FBA,TPI,GAPD,PGK,...,Transport_ala_B_c_e,TMDK1,THYMDt1,Transport_HC00576_c_e,Transport_4abut_c_e,GLUVESSEC,t_Lcystin_ala__L,t_Lcystin_glu__L,t_Lcystin_leu__L,t_Lcystin_ser__L
0,19.003712,555.130865,0.0,522.507904,504.501581,0.0,504.501581,504.501581,1000.0,1000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,506.23867,0.0,0.0
1,10.141017,540.123923,0.0,516.381122,503.276224,0.0,503.276224,503.276224,1000.0,1000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,517.654206,0.0,0.0
2,-558.010609,548.61977,0.0,519.849663,503.969933,0.0,503.969933,503.969933,1000.0,1000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,561.568172,0.0,0.0
3,99.899583,549.261167,0.0,520.111522,504.022304,0.0,504.022304,504.022304,1000.0,1000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,152.80336,360.437134,0.0,0.0
4,-135.671707,542.00691,0.0,517.149876,503.429975,0.0,503.429975,503.429975,1000.0,1000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,561.286482,0.0,39.4635


 Gene expression data head:


Unnamed: 0,response,sample_id,ENSG00000272512,ENSG00000230415,ENSG00000169885,ENSG00000142609,ENSG00000187730,ENSG00000287727,ENSG00000287384,ENSG00000162458,...,ENSG00000198712,ENSG00000228253,ENSG00000198899,ENSG00000198840,ENSG00000212907,ENSG00000198786,ENSG00000198727,ENSG00000276256,ENSG00000277856,ENSG00000275063
AAACCTGAGAAGGGTA-1,Responder,GSM9061665_S1,-0.013603,-0.010721,-0.010086,-0.008178,-0.01729,-0.023815,-0.00749,-0.005667,...,1.765706,2.43839,2.329344,2.054127,2.181029,2.719458,1.961893,-0.027983,-0.015876,-0.02753
AAACCTGAGACTGTAA-1,Responder,GSM9061665_S1,-0.013603,-0.010721,-0.010086,-0.008178,-0.01729,-0.023815,-0.00749,-0.005667,...,-0.216031,0.30059,-0.860992,-0.160776,-0.052397,-0.184155,0.048195,-0.027983,-0.015876,-0.02753
AAACCTGAGCAGCGTA-1,Responder,GSM9061665_S1,-0.013603,-0.010721,-0.010086,-0.008178,-0.01729,-0.023815,-0.00749,-0.005667,...,0.931615,0.417104,-0.129276,0.75976,-1.443564,0.27722,0.157571,-0.027983,-0.015876,-0.02753
AAACCTGAGCCAACAG-1,Responder,GSM9061665_S1,-0.013603,-0.010721,-0.010086,-0.008178,-0.01729,-0.023815,-0.00749,-0.005667,...,0.324596,0.623172,0.39944,0.875306,0.765933,0.503159,0.436196,-0.027983,-0.015876,-0.02753
AAACCTGAGCGTGAAC-1,Responder,GSM9061665_S1,-0.013603,-0.010721,-0.010086,-0.008178,-0.01729,-0.023815,-0.00749,-0.005667,...,-0.194472,0.643405,0.206952,0.203505,0.814024,0.129373,0.151363,-0.027983,-0.015876,-0.02753


In [None]:
# Path to your TCR features CSV file
tcr_csv_path = '/content/drive/MyDrive/MultimodalCSVs/tcr_features_only.csv'

# Load the CSV
tcr_df = pd.read_csv(tcr_csv_path, index_col=0)  # Assuming first column is the index (cell barcodes)

# Print shape and head
print(f"TCR features shape: {tcr_df.shape}")
print("First 5 rows:")
print(tcr_df.head())

TCR features shape: (124447, 11)
First 5 rows:
                   chain      v_gene d_gene   j_gene c_gene             cdr3  \
barcode                                                                        
AAACCTGAGACTGTAA-1   TRB     TRBV3-1  TRBD1  TRBJ1-1  TRBC1    CASGTGLNTEAFF   
AAACCTGAGACTGTAA-1   TRA  TRAV36/DV7    NaN   TRAJ53   TRAC     CAVEARNYKLTF   
AAACCTGAGCGTGAAC-1   TRB      TRBV30    NaN  TRBJ1-2  TRBC1  CAWSALLGTVNGYTF   
AAACCTGAGCGTGAAC-1   TRA  TRAV29/DV5    NaN   TRAJ48   TRAC    CAASAVGNEKLTF   
AAACCTGAGCTACCTA-1   TRA      TRAV19    NaN   TRAJ31   TRAC   CALSEAWGNARLMF   

                                                          cdr3_nt  reads  \
barcode                                                                    
AAACCTGAGACTGTAA-1        TGTGCCAGCGGGACAGGGTTGAACACTGAAGCTTTCTTT  23844   
AAACCTGAGACTGTAA-1           TGTGCTGTGGAGGCCAGGAACTATAAACTGACATTT   7520   
AAACCTGAGCGTGAAC-1  TGTGCCTGGAGTGCCCTATTAGGGACAGTAAATGGCTACACCTTC  17060   
AAACCTGAGCGT

In [None]:
# Paths (update accordingly)
combined_path = '/content/drive/MyDrive/MultimodalCSVs/combined_flux_gene.csv'
tcr_path = '/content/drive/MyDrive/MultimodalCSVs/tcr_features_only.csv'
output_path = '/content/drive/MyDrive/MultimodalCSVs/combined_with_tcr.csv'

# Load combined dataframe. The file was written with index=False, so it has no
# index column; index_col=0 would turn the first flux column (PYRt2) into the index.
combined_df = pd.read_csv(combined_path)

# Load TCR dataframe
tcr_df = pd.read_csv(tcr_path, index_col=0)

# Check initial shapes
print(f"Combined DF shape: {combined_df.shape}")
print(f"TCR DF shape: {tcr_df.shape}")

Combined DF shape: (58177, 2402)
TCR DF shape: (124447, 11)


In [None]:
# gene_df: your gene expression data (barcode index)
# flux_df: your flux data (no index shown)
# tcr_df: your TCR data (indexed by barcode)

# Step 1: Extract response and sample_id from gene data
gene_meta = gene_df[['response', 'sample_id']]

# Step 2: Attach barcodes to the flux table. flux_df has no barcode index, so they are
# assigned by position; this is only correct if both tables share the same row order.
flux_df['barcode'] = gene_df.index

# Step 3: Merge flux with metadata
flux_merged = pd.merge(flux_df, gene_meta, left_on='barcode', right_index=True)

# Step 4: Collapse the TCR table to one row per cell (it has one row per chain)
tcr_features = tcr_df.groupby(tcr_df.index).agg({
    'chain': 'count',  # total chains per cell
    'umis': 'sum',     # total umis
    'reads': 'sum',    # total reads
    'cdr3': 'nunique', # unique CDR3 sequences
}).rename(columns={
    'chain': 'num_chains',
    'umis': 'total_umis',
    'reads': 'total_reads',
    'cdr3': 'unique_cdr3s'
})

# Step 5: Merge TCR with gene_meta to get response
tcr_merged = pd.merge(tcr_features, gene_meta, left_index=True, right_index=True)

# Now you have:
# gene_df → original with response
# flux_merged → with response
# tcr_merged → with response
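
To make Step 4 concrete: the TCR table has one row per chain, so a cell with both a TRA and a TRB chain appears twice. The sketch below runs the same aggregation on a toy table using pandas named aggregation, which sets the output column names in one step instead of `agg` followed by `rename`. The values are invented.

```python
import pandas as pd

tcr = pd.DataFrame(
    {
        "chain": ["TRB", "TRA", "TRB", "TRA", "TRA"],
        "cdr3": ["CASG", "CAVE", "CAWS", "CAAS", "CAAS"],
        "reads": [100, 50, 70, 30, 20],
        "umis": [10, 5, 7, 3, 2],
    },
    index=pd.Index(["c1", "c1", "c2", "c2", "c2"], name="barcode"),
)

# One row per barcode: chain count, summed UMIs and reads, number of distinct CDR3s.
tcr_features = tcr.groupby(level="barcode").agg(
    num_chains=("chain", "count"),
    total_umis=("umis", "sum"),
    total_reads=("reads", "sum"),
    unique_cdr3s=("cdr3", "nunique"),
)

print(tcr_features)
```

When merging a table like `tcr_features` with per-cell metadata, passing `validate="one_to_one"` to `pd.merge` raises an error if either side still contains repeated barcodes.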

In [None]:
# Save gene_df (including response and features)
gene_df.to_csv('/content/drive/MyDrive/tri_modality_gene.csv')  # or local path if not using Drive

# Save flux_merged
flux_merged.to_csv('/content/drive/MyDrive/tri_modality_flux.csv', index=False)

# Save tcr_merged
tcr_merged.to_csv('/content/drive/MyDrive/tri_modality_tcr.csv')

In [None]:
# Load the data (adjust paths if not using Google Drive)
gene_df = pd.read_csv('/content/drive/MyDrive/tri_modality_gene.csv', index_col=0)
flux_df = pd.read_csv('/content/drive/MyDrive/tri_modality_flux.csv')
tcr_df = pd.read_csv('/content/drive/MyDrive/tri_modality_tcr.csv', index_col=0)

# Print heads
print(" Gene expression data (head):")
print(gene_df.head(), '\n')

print(" Fluxomics data (head):")
print(flux_df.head(), '\n')

print(" TCR data (head):")
print(tcr_df.head(), '\n')

# Check for NaNs
print(" NaN counts in gene_df:")
print(gene_df.isna().sum().sum(), "NaNs\n")

print(" NaN counts in flux_df:")
print(flux_df.isna().sum().sum(), "NaNs\n")

print(" NaN counts in tcr_df:")
print(tcr_df.isna().sum().sum(), "NaNs\n")

 Gene expression data (head):
                     response      sample_id  ENSG00000272512  \
AAACCTGAGAAGGGTA-1  Responder  GSM9061665_S1        -0.013603   
AAACCTGAGACTGTAA-1  Responder  GSM9061665_S1        -0.013603   
AAACCTGAGCAGCGTA-1  Responder  GSM9061665_S1        -0.013603   
AAACCTGAGCCAACAG-1  Responder  GSM9061665_S1        -0.013603   
AAACCTGAGCGTGAAC-1  Responder  GSM9061665_S1        -0.013603   

                    ENSG00000230415  ENSG00000169885  ENSG00000142609  \
AAACCTGAGAAGGGTA-1        -0.010721        -0.010086        -0.008178   
AAACCTGAGACTGTAA-1        -0.010721        -0.010086        -0.008178   
AAACCTGAGCAGCGTA-1        -0.010721        -0.010086        -0.008178   
AAACCTGAGCCAACAG-1        -0.010721        -0.010086        -0.008178   
AAACCTGAGCGTGAAC-1        -0.010721        -0.010086        -0.008178   

                    ENSG00000187730  ENSG00000287727  ENSG00000287384  \
AAACCTGAGAAGGGTA-1         -0.01729        -0.023815         -0.007