# 1_Create_MultiModulon_object

This notebook demonstrates the first step for multi-species/strain/modality analysis using the MultiModulon package.

In [1]:
# Import required libraries
from multimodulon import MultiModulon
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

## Step 1: Initialize MultiModulon

Load data from the Input_Data directory containing expression matrices, gene annotations, and sample metadata for all strains.

In [2]:
# Path to the Input_Data folder
input_data_path = './Input_Data'

# Initialize MultiModulon object
multiModulon = MultiModulon(input_data_path)


Initializing MultiModulon...

Loading from Input_Data:

Nissle_1917:
  - Number of genes: 4678
  - Number of samples: 183
  - Data validation: PASSED

O157_H7:
  - Number of genes: 5176
  - Number of samples: 252
  - Data validation: PASSED

BL21:
  - Number of genes: 4132
  - Number of samples: 209
  - Data validation: PASSED

Successfully loaded 3 species/strains/modalities


## Step 2: Create Gene Tables

Parse GFF files to create gene annotation tables for each strain.

In [3]:
# Create gene tables from GFF files
print("Creating gene tables from GFF files...")
multiModulon.create_gene_table()

Creating gene tables from GFF files...

Creating gene tables for all species...

Processing Nissle_1917...
  ✓ Created gene table with 4678 genes

Processing O157_H7...
  ✓ Created gene table with 5137 genes

Processing BL21...
  ✓ Created gene table with 4132 genes

Gene table creation completed!
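For orientation, the sketch below shows how a single GFF3 feature line maps to the kind of fields seen in a gene table (accession, start, end, strand, locus tag). It relies only on the standard nine-column GFF3 layout. The function name `parse_gff3_gene_line` and the example line are hypothetical, and this is not claimed to be how `create_gene_table` is implemented.

```python
def parse_gff3_gene_line(line):
    """Parse one tab-separated GFF3 feature line into a dict of gene fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has exactly 9 columns")
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
    # Column 9 holds semicolon-separated key=value attributes.
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
    )
    return {
        "accession": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "locus_tag": attributes.get("locus_tag"),
        "gene_name": attributes.get("gene"),
    }


# Hypothetical example line (made-up coordinates and identifiers).
example = (
    "chr1\tGenbank\tgene\t100\t400\t.\t+\t.\t"
    "ID=gene-X_0001;gene=abcA;locus_tag=X_0001"
)
record = parse_gff3_gene_line(example)
print(record)
```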


In [4]:
# Add eggNOG-mapper functional annotations to each strain's gene table
multiModulon.add_eggnog_annotation("./Output_eggnog_mapper")


Adding eggNOG annotations from ./Output_eggnog_mapper

Processing Nissle_1917...
  - Reading Nissle_1917.emapper.annotations
  ✓ Added eggNOG annotations to 4329/4678 genes

Processing O157_H7...
  - Reading O157_H7.emapper.annotations
  ✓ Added eggNOG annotations to 4681/5137 genes

Processing BL21...
  - Reading BL21.emapper.annotations
  ✓ Added eggNOG annotations to 3993/4132 genes

eggNOG annotation addition completed!
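As a rough illustration of what reading an `.emapper.annotations` file involves, the sketch below parses a tab-separated table in which lines starting with `##` are metadata and the column header starts with `#query`. That layout is an assumption about the eggNOG-mapper output, the toy rows are made up, and `read_emapper_annotations` is a hypothetical helper rather than part of MultiModulon.

```python
import csv
import io


def read_emapper_annotations(handle):
    """Return {query_id: row_dict} from an eggNOG-mapper style table."""
    header = None
    rows = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue
        first = fields[0]
        if first.startswith("##"):
            continue  # metadata or summary line
        if first.startswith("#"):
            # Header line: strip the leading '#' from the first column name.
            header = [first.lstrip("#")] + fields[1:]
            continue
        if header is None:
            raise ValueError("no header line found before data rows")
        rows[first] = dict(zip(header, fields))
    return rows


# Toy input with made-up values, mimicking the assumed layout.
toy = (
    "## toy metadata line\n"
    "#query\tseed_ortholog\tevalue\tscore\tCOG_category\n"
    "X_0001\t0000.toy_gene1\t1e-50\t200.0\tK\n"
    "## 1 queries scanned\n"
)
annotations = read_emapper_annotations(io.StringIO(toy))
print(annotations["X_0001"]["COG_category"])
```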


In [5]:
# Inspect the annotated gene table for one strain
multiModulon['O157_H7'].gene_table

locus_tag,accession,start,end,strand,gene_name,old_locus_tag,gene_product,ncbi_protein,seed_ortholog,evalue,score,eggNOG_OGs,max_annot_lvl,COG_category,Description,Preferred_name,GOs,EC,KEGG_ko,KEGG_Pathway,KEGG_Module,KEGG_Reaction,KEGG_rclass,BRITE,KEGG_TC,CAZy,BiGG_Reaction,PFAMs
mobA,AB011548.2,413,736,+,mobA,,plasmid mobilization,BAA31754.1,1219375.CM002140_gene4213,6.810000e-13,67.4,"2E8NX@1|root,33304@2|Bacteria,1NEFP@1224|Prote...",2|Bacteria,-,-,-,-,-,-,-,-,-,-,-,-,-,-,RHH_1
etpC,AB011549.2,2589,3464,+,etpC,,Type II secretion pathway related protein,BAA31758.1,155864.EDL933_p0030,7.900000e-186,518.0,"COG3031@1|root,COG3031@2|Bacteria,1RD3I@1224|P...",2|Bacteria,U,General secretion pathway protein C,gspC,"GO:0002790,GO:0006810,GO:0008104,GO:0008150,GO...",-,ko:K02452,"ko03070,ko05111,map03070,map05111",M00331,-,-,"ko00000,ko00001,ko00002,ko02044",3.A.15,-,-,"PDZ_2,T2SSC"
etpD,AB011549.2,3675,5432,+,etpD,,Type II secretion pathway related protein,BAA31759.1,155864.EDL933_p0031,0.000000e+00,1100.0,"COG1450@1|root,COG1450@2|Bacteria,1MUUA@1224|P...",2|Bacteria,NU,General secretion pathway protein,gspD,-,-,"ko:K02453,ko:K03219","ko03070,ko05111,map03070,map05111","M00331,M00332,M00542",-,-,"ko00000,ko00001,ko00002,ko02044","3.A.15,3.A.6.1,3.A.6.3",-,-,"Secretin,Secretin_N"
etpE,AB011549.2,5432,6937,+,etpE,,Type II secretion pathway related protein,BAA31760.1,155864.EDL933_p0032,0.000000e+00,890.0,"COG2804@1|root,COG2804@2|Bacteria,1MU7V@1224|P...",2|Bacteria,NU,"Type II secretory pathway, ATPase PulE Tfp pil...",gspE,-,-,ko:K02454,"ko03070,ko05111,map03070,map05111",M00331,-,-,"ko00000,ko00001,ko00002,ko02044",3.A.15,-,-,"T2SSE,T2SSE_N"
etpF,AB011549.2,6939,8162,+,etpF,,Type II secretion pathway related protein,BAA31761.1,155864.EDL933_p0033,1.280000e-275,756.0,"COG1459@1|root,COG1459@2|Bacteria,1MV4U@1224|P...",2|Bacteria,U,General secretion pathway,gspF,"GO:0002790,GO:0005575,GO:0005623,GO:0005886,GO...",-,"ko:K02455,ko:K02505,ko:K02653","ko03070,ko05111,map03070,map05111",M00331,-,-,"ko00000,ko00001,ko00002,ko02035,ko02044","3.A.15,3.A.15.2",-,-,T2SSF
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ECs_5356,BA000007.3,5492933,5493622,+,creB,,two-component regulatory system response regul...,BAB38779.1,155864.EDL933_5742,4.320000e-164,459.0,"COG0745@1|root,COG0745@2|Bacteria,1MVCB@1224|P...",2|Bacteria,K,"Transcriptional regulatory protein, C terminal",creB,"GO:0000976,GO:0000984,GO:0000986,GO:0000987,GO...",-,"ko:K07663,ko:K07664","ko02020,map02020","M00449,M00450,M00645,M00646,M00648",-,-,"ko00000,ko00001,ko00002,ko02022",-,-,-,"Response_reg,Trans_reg_C"
ECs_5357,BA000007.3,5493622,5495046,+,creC,,two-component system sensor histidine kinase CreC,BAB38780.1,155864.EDL933_5743,0.000000e+00,912.0,"COG0642@1|root,COG2205@2|Bacteria,1N17V@1224|P...",2|Bacteria,T,Member of the two-component regulatory system ...,creC,"GO:0000155,GO:0000160,GO:0003674,GO:0003824,GO...",2.7.13.3,"ko:K07641,ko:K07642,ko:K07711,ko:K14980","ko02020,ko02024,map02020,map02024","M00449,M00450,M00502,M00520,M00645,M00646,M00648",-,-,"ko00000,ko00001,ko00002,ko01000,ko01001,ko02022",-,-,-,"HAMP,HATPase_c,HisKA,dCache_3,sCache_3_2"
ECs_5358,BA000007.3,5495104,5496456,+,creD,,inner membrane protein,BAB38781.1,155864.EDL933_5744,0.000000e+00,877.0,"COG4452@1|root,COG4452@2|Bacteria,1MVVR@1224|P...",2|Bacteria,V,Inner membrane protein CreD,creD,"GO:0005575,GO:0005623,GO:0005886,GO:0016020,GO...",-,ko:K06143,-,-,-,-,ko00000,-,-,-,CreD
ECs_5359,BA000007.3,5496516,5497232,-,arcA,,two-component regulatory system response regul...,BAB38782.1,316407.85677140,7.170000e-172,479.0,"COG0745@1|root,COG0745@2|Bacteria,1MWJG@1224|P...",2|Bacteria,K,It also may be involved in the osmoregulation ...,arcA,"GO:0000156,GO:0000160,GO:0000976,GO:0001067,GO...",-,"ko:K07772,ko:K07773","ko02020,ko02026,map02020,map02026","M00455,M00456",-,-,"ko00000,ko00001,ko00002,ko02022",-,-,-,"Response_reg,Trans_reg_C"


## Step 3: Generate BBH Files

Generate Bidirectional Best Hits (BBH) files for ortholog detection between all strain pairs.

In [None]:
# Generate BBH files using multiple threads for faster computation
output_bbh_path = './Output_BBH'

multiModulon.generate_BBH(output_bbh_path, threads=16)
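The idea behind bidirectional best hits can be sketched in a few lines: gene `a` in one strain and gene `b` in another form a BBH pair when `b` is the top-scoring hit for `a` and `a` is the top-scoring hit for `b`. The example below uses toy identifiers and made-up scores; the helper names are hypothetical, and the search tool and scoring that `generate_BBH` actually uses are not assumed here.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _score) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return {gene_a: gene_b} for pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}


# Toy hit tables (made-up identifiers and scores).
a_vs_b = [("a1", "b1", 500), ("a1", "b2", 300), ("a2", "b2", 250), ("a3", "b1", 100)]
b_vs_a = [("b1", "a1", 480), ("b2", "a1", 310), ("b2", "a2", 200)]

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here `a2` picks `b2` as its best hit, but `b2` prefers `a1`, so that pair is not reciprocal and is excluded.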

## Step 4: Align Genes Across Strains

Create a unified gene database by aligning genes across all strains using the BBH results.

In [7]:
# Align genes across all strains
output_gene_info_path = './Output_Gene_Info'

combined_gene_db = multiModulon.align_genes(
    input_bbh_dir=output_bbh_path,
    output_dir=output_gene_info_path,
    reference_order=['BL21', 'Nissle_1917', 'O157_H7'],  # optional: specify order
    # bbh_threshold=90  # optional: minimum percent identity threshold
)

combined_gene_db.head()


Gene counts in combined_gene_db:
  ✓ BL21: 4132 genes (complete)
  ✓ Nissle_1917: 4678 genes (complete)
  ✓ O157_H7: 5176 genes (complete)

✓ All species have complete gene sets in combined_gene_db!

Combined gene database shape: (6467, 3)
Number of gene groups: 6467


,BL21,Nissle_1917,O157_H7,row_label
0,ECD_00001,HW372_19965,,ECD_00001
1,ECD_00002,HW372_19960,ECs_0002,ECD_00002
2,ECD_00003,HW372_19955,ECs_0003,ECD_00003
3,ECD_00004,HW372_19950,ECs_0004,ECD_00004
4,ECD_00005,HW372_19945,ECs_0005,ECD_00005
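To make the structure of the combined table concrete, the sketch below takes the five preview rows shown above and separates gene groups that have a member in every strain from those with a missing ortholog. Representing a missing entry as `None` is an assumption made for this illustration; in a pandas DataFrame it would typically be `NaN`.

```python
# Rows copied from the combined_gene_db preview; None marks a missing ortholog.
rows = [
    {"BL21": "ECD_00001", "Nissle_1917": "HW372_19965", "O157_H7": None},
    {"BL21": "ECD_00002", "Nissle_1917": "HW372_19960", "O157_H7": "ECs_0002"},
    {"BL21": "ECD_00003", "Nissle_1917": "HW372_19955", "O157_H7": "ECs_0003"},
    {"BL21": "ECD_00004", "Nissle_1917": "HW372_19950", "O157_H7": "ECs_0004"},
    {"BL21": "ECD_00005", "Nissle_1917": "HW372_19945", "O157_H7": "ECs_0005"},
]
strains = ["BL21", "Nissle_1917", "O157_H7"]

# A group is "shared" when every strain contributes a gene to it.
shared = [row for row in rows if all(row[s] is not None for s in strains)]
partial = [row for row in rows if any(row[s] is None for s in strains)]

print(len(shared), "shared groups;", len(partial), "group with a missing ortholog")
```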


## Step 5: Generate Aligned Expression Matrices

Create expression matrices with consistent gene indexing across all strains for multi-view ICA.

In [8]:
# Generate aligned expression matrices
print("Generating aligned expression matrices...")
multiModulon.generate_X(output_gene_info_path)

# The output shows aligned X matrices and dimension recommendations

Generating aligned expression matrices...

Generated aligned X matrices:
BL21: (6467, 209) (4132 non-zero gene groups)
Nissle_1917: (6467, 183) (4677 non-zero gene groups)
O157_H7: (6467, 252) (5175 non-zero gene groups)
Maximum dimension recommendation: 180
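The reported shapes share the same 6467 rows, while the "non-zero gene groups" counts differ per strain. One reading, assumed here and not confirmed by the package, is that every strain's matrix is re-indexed by gene group and groups without a gene in that strain are filled with zeros. A minimal sketch of that idea with toy data and a hypothetical helper `align_expression`:

```python
def align_expression(group_genes, expression, n_samples):
    """Build a (n_groups x n_samples) matrix for one strain.

    group_genes: this strain's gene ID for each gene group, or None if absent.
    expression:  {gene_id: list of expression values, one per sample}.
    """
    aligned = []
    for gene in group_genes:
        if gene is not None and gene in expression:
            aligned.append(list(expression[gene]))
        else:
            # Assumption: groups with no gene in this strain become zero rows.
            aligned.append([0.0] * n_samples)
    return aligned


# Toy data: four gene groups, two samples, one group missing in this strain.
group_genes = ["g1", None, "g2", "g3"]
expression = {"g1": [1.5, -0.2], "g2": [0.3, 0.8], "g3": [-1.1, 0.4]}

X = align_expression(group_genes, expression, n_samples=2)
non_zero_rows = sum(any(value != 0.0 for value in row) for row in X)
print((len(X), len(X[0])), non_zero_rows)
```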


## Step 6: Optimize Number of Core Components

Use Cohen's d effect size metric to automatically determine the optimal number of core components.

In [9]:
# Optimize number of core components
print("Optimizing number of core components...")
print("This will test different values of k and find the optimal number.")

optimal_num_core_components = multiModulon.optimize_number_of_core_components(
    step=10,                        # Test k = 10, 20, 30, ...
    save_path='./Output_Optimization_Figures', # Save plots to directory
    fig_size=(7, 5),              # Figure size
    num_runs_per_dimension=10,
    seed=10
)

Optimizing number of core components...
This will test different values of k and find the optimal number.
Optimizing core components for 3 species/strains: ['Nissle_1917', 'O157_H7', 'BL21']
Auto-determined max_k = 180 based on minimum samples (183)
Using GPU: NVIDIA GeForce RTX 3090



Optimal k = 90 (robust components passing filter = 31.0)

Plot saved to: Output_Optimization_Figures/num_core_optimization.svg
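For readers unfamiliar with the metric, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below computes it for two toy samples; how MultiModulon applies the metric to ICA components (which groups are compared and what cutoff counts as robust) is not shown in this notebook and is not assumed here.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Toy samples: the means differ by 1.0 and both groups have the same spread.
d = cohens_d([2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 5.0, 7.0])
print(round(d, 4))
```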


## Step 7: Optimize Number of Unique Components

Determine the optimal number of unique (species-specific) components for each strain.

In [10]:
# Optimize unique components for each species
print("Optimizing unique components per species...")
print("This will test different numbers of unique components for each species.\n")

optimal_unique, optimal_total = multiModulon.optimize_number_of_unique_components(
    optimal_num_core_components=optimal_num_core_components,
    step=10,
    save_path='./Output_Optimization_Figures',
    fig_size=(7, 5),
    num_runs_per_dimension=10,
    seed=10
)

Optimizing unique components per species...
This will test different numbers of unique components for each species.


Optimizing unique components with core k = 90

Optimizing unique components for Nissle_1917
Testing a values: [90, 100, 110, 120, 130, 140, 150, 160, 170]


Plot saved to: Output_Optimization_Figures/num_unique_Nissle_1917_optimization.svg

Optimal a for Nissle_1917: 110 (16 robust unique components)

Optimizing unique components for O157_H7
Testing a values: [90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240]


Plot saved to: Output_Optimization_Figures/num_unique_O157_H7_optimization.svg

Optimal a for O157_H7: 230 (49 robust unique components)

Optimizing unique components for BL21
Testing a values: [90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200]


Plot saved to: Output_Optimization_Figures/num_unique_BL21_optimization.svg

Optimal a for BL21: 200 (37 robust unique components)

Optimization Summary
Core components (c): 90
Nissle_1917: a = 110 (unique components: 20)
O157_H7: a = 230 (unique components: 140)
BL21: a = 200 (unique components: 110)


In [11]:
optimal_num_core_components

90

In [12]:
optimal_total

{'Nissle_1917': 110, 'O157_H7': 230, 'BL21': 200}
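The optimization summary is consistent with each strain's total `a` being the sum of the shared core count `c` and that strain's unique count. Assuming that relationship, the unique counts can be recovered from the two values printed above:

```python
# Values copied from the outputs above.
optimal_num_core_components = 90
optimal_total = {"Nissle_1917": 110, "O157_H7": 230, "BL21": 200}

# Assumed relationship: a (total) = c (core) + unique components.
unique_components = {
    strain: total - optimal_num_core_components
    for strain, total in optimal_total.items()
}
print(unique_components)
```

These are the dimensions requested from the ICA step, not the number of components that survive the later robustness filtering.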

## Step 8: Run Robust Multi-view ICA

Perform robust multi-view ICA with multiple runs and clustering to identify consistent components.

In [13]:
# Run robust multi-view ICA
print("Running robust multi-view ICA with clustering...")
print("This performs multiple ICA runs and clusters the results for robustness.\n")

M_matrices, A_matrices = multiModulon.run_robust_multiview_ica(
    a=optimal_total,                 # Dictionary of total components per species
    c=optimal_num_core_components,   # Number of core components
    num_runs=10,                     # Number of runs for robustness
    seed=100                         # Random seed for reproducibility
)

Running robust multi-view ICA with clustering...
This performs multiple ICA runs and clusters the results for robustness.


Running robust multi-view ICA with 10 runs
Species: ['Nissle_1917', 'O157_H7', 'BL21']
Total components (a): {'Nissle_1917': 110, 'O157_H7': 230, 'BL21': 200}
Core components (c): 90

Collecting components from 10 runs...



Clustering components...

Creating final M matrices with robust components...

Saving robust M matrices to species objects...
✓ Nissle_1917: (6467, 39) (24 core, 15 unique components)
✓ O157_H7: (6467, 63) (24 core, 39 unique components)
✓ BL21: (6467, 48) (24 core, 24 unique components)

Generating A matrices from robust M matrices...
✓ Generated A matrix for Nissle_1917: (39, 183)
✓ Generated A matrix for O157_H7: (63, 252)
✓ Generated A matrix for BL21: (48, 209)

Robust multi-view ICA completed!
Total core components retained: 24
Nissle_1917: 15 unique components
O157_H7: 39 unique components
BL21: 24 unique components
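The gap between the requested dimensions (for example 90 core components) and the retained ones (24) reflects the robustness filter: only components that recur across runs are kept. The sketch below illustrates one simple way to express that idea, keeping a component when a close match (by absolute cosine similarity, since ICA components have arbitrary sign) appears in most runs. The thresholds, the helper names, and the matching rule are assumptions for illustration and not the clustering procedure used by `run_robust_multiview_ica`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recurring_components(runs, min_similarity=0.9, min_fraction=0.8):
    """Keep components of the first run that recur in enough runs.

    runs: list of runs, each a list of component vectors.
    """
    kept = []
    for component in runs[0]:
        hits = sum(
            any(abs(cosine(component, other)) >= min_similarity for other in run)
            for run in runs
        )
        if hits / len(runs) >= min_fraction:
            kept.append(component)
    return kept


# Toy runs: the first component recurs (once sign-flipped), the second does not.
runs = [
    [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0]],
    [[-1.0, 0.0, -0.1], [0.0, 0.0, 1.0]],
    [[0.9, 0.1, 0.1], [0.5, 0.5, 0.7]],
]
robust = recurring_components(runs)
print(robust)
```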


## Step 9: Optimize Thresholds to Binarize the M Matrices

Use Otsu's method to calculate a threshold for each component of the M matrices across all species.

In [14]:
multiModulon.optimize_M_thresholds(method="Otsu's method", quantile_threshold=95)


Optimizing thresholds for Nissle_1917...
  Gene mapping: 4678/4678 genes have expression data
✓ Optimized thresholds for 39 components
  Average genes per component: 25.5

Optimizing thresholds for O157_H7...
  Gene mapping: 5137/5137 genes have expression data
✓ Optimized thresholds for 63 components
  Average genes per component: 23.0

Optimizing thresholds for BL21...
  Gene mapping: 4132/4132 genes have expression data
✓ Optimized thresholds for 48 components
  Average genes per component: 27.5

Threshold optimization completed!
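Otsu's method picks the cutoff that maximizes the between-class variance of the two groups it creates. The sketch below applies an exact (non-histogram) version to the absolute gene weights of one toy component, so genes above the cutoff would count as members. The toy weights are made up, `otsu_threshold` is a hypothetical helper, and the role of `quantile_threshold` in `optimize_M_thresholds` is not modelled here.

```python
def otsu_threshold(values):
    """Return the cutoff that maximizes between-class variance."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_cutoff, best_variance = xs[0], -1.0
    running_sum = 0.0
    for i in range(1, n):
        running_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # no gap between equal values, so no cutoff here
        w_low, w_high = i / n, (n - i) / n
        mean_low = running_sum / i
        mean_high = (total - running_sum) / (n - i)
        variance = w_low * w_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_cutoff = (xs[i - 1] + xs[i]) / 2
    return best_cutoff


# Toy component: most gene weights are near zero, three are large in magnitude.
weights = [0.01, 0.02, 0.03, 0.02, 0.01, 0.9, 1.1, -1.0]
cutoff = otsu_threshold([abs(w) for w in weights])
members = [w for w in weights if abs(w) > cutoff]
print(cutoff, members)
```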


## Step 10: Save the MultiModulon Object to JSON

Save the multiModulon object to a JSON file at the given path and file name.

In [15]:
multiModulon.save_to_json_multimodulon("./multiModulon_E_coli_comparison_iMM_2_paper.json.gz")
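The `.json.gz` extension suggests gzip-compressed JSON. Independent of MultiModulon's own schema, which is not shown here, such a file can be written and read back with the Python standard library. The payload below is a toy dictionary, not the real object layout.

```python
import gzip
import json
import os
import tempfile

# Toy payload standing in for serialized results (not the real schema).
payload = {"species": ["Nissle_1917", "O157_H7", "BL21"], "core_components": 24}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.json.gz")

    # Text mode ("wt"/"rt") lets json read and write str through gzip.
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle)

    with gzip.open(path, "rt", encoding="utf-8") as handle:
        restored = json.load(handle)

print(restored == payload)
```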

In [16]:
# Confirm the shape of each strain's M matrix
for species in multiModulon.species:
    print(species, " : ", multiModulon[species].M.shape)

Nissle_1917  :  (6467, 39)
O157_H7  :  (6467, 63)
BL21  :  (6467, 48)
