# 1_Create_MultiModulon_object

This notebook demonstrates the first step of a multi-species/strain/modality analysis with the MultiModulon package: building a MultiModulon object, here from three *E. coli* strains (W3110, BL21, and MG1655).

In [1]:
# Import required libraries
from multimodulon import MultiModulon
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

## Step 1: Initialize MultiModulon

Load data from the Input_Data directory containing expression matrices, gene annotations, and sample metadata for all strains.

In [2]:
# Path to the Input_Data folder
input_data_path = './Input_Data'

# Initialize MultiModulon object
multiModulon = MultiModulon(input_data_path)


Initializing MultiModulon...

Loading from Input_Data:

W3110:
  - Number of genes: 4229
  - Number of samples: 71
  - Data validation: PASSED

BL21:
  - Number of genes: 4196
  - Number of samples: 75
  - Data validation: PASSED

MG1655:
  - Number of genes: 4305
  - Number of samples: 69
  - Data validation: PASSED

Successfully loaded 3 species/strains/modalities


## Step 2: Create Gene Tables

Parse GFF files to create gene annotation tables for each strain.

In [3]:
# Create gene tables from GFF files
print("Creating gene tables from GFF files...")
multiModulon.create_gene_table()

Creating gene tables from GFF files...

Creating gene tables for all species...

Processing W3110...
  ✓ Created gene table with 4229 genes

Processing BL21...
  ✓ Created gene table with 4196 genes

Processing MG1655...
  ✓ Created gene table with 4305 genes

Gene table creation completed!


In [4]:
# Add eggNOG-mapper annotations to the gene tables of each strain
multiModulon.add_eggnog_annotation("./Output_eggnog_mapper")


Adding eggNOG annotations from ./Output_eggnog_mapper

Processing W3110...
  - Reading W3110.emapper.annotations
  ✓ Added eggNOG annotations to 4065/4229 genes

Processing BL21...
  - Reading BL21.emapper.annotations
  ✓ Added eggNOG annotations to 4057/4196 genes

Processing MG1655...
  - Reading MG1655.emapper.annotations
  ✓ Added eggNOG annotations to 4097/4305 genes

eggNOG annotation addition completed!


## Step 3: Generate BBH Files

Generate Bidirectional Best Hits (BBH) files for ortholog detection between all strain pairs.

In [5]:
# Generate BBH files using multiple threads for faster computation
output_bbh_path = './Output_BBH'

multiModulon.generate_BBH(output_bbh_path, threads=16)

INFO: Species 'W3110' - Using Note as fallback IDs
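
To make the idea of a bidirectional best hit concrete, here is a minimal sketch on toy alignment tables. The column names (`query`, `subject`, `bitscore`), the scores, and the helper `best_hits` are illustrative assumptions; they are not the file format or the internals of `generate_BBH`.

```python
import pandas as pd

def best_hits(hits: pd.DataFrame) -> pd.DataFrame:
    """Keep the highest-scoring subject for every query gene."""
    idx = hits.groupby("query")["bitscore"].idxmax()
    return hits.loc[idx, ["query", "subject"]].reset_index(drop=True)

# Toy alignment hits, MG1655 -> BL21 (made-up scores)
a_vs_b = pd.DataFrame({
    "query":    ["b0001", "b0001", "b0002", "b0003"],
    "subject":  ["ECD_00001", "ECD_00009", "ECD_00002", "ECD_00002"],
    "bitscore": [200, 50, 300, 120],
})
# Toy alignment hits, BL21 -> MG1655 (made-up scores)
b_vs_a = pd.DataFrame({
    "query":    ["ECD_00001", "ECD_00002", "ECD_00009"],
    "subject":  ["b0001", "b0002", "b0001"],
    "bitscore": [200, 300, 50],
})

forward = best_hits(a_vs_b)
# Swap the column names so both tables read "MG1655 gene, BL21 gene"
reverse = best_hits(b_vs_a).rename(columns={"query": "subject", "subject": "query"})

# A pair is a BBH only if each gene is the other's best hit
bbh = forward.merge(reverse, on=["query", "subject"])
bbh_pairs = sorted(zip(bbh["query"], bbh["subject"]))
print(bbh_pairs)
```

In this toy data, `b0003` has a best hit in BL21, but that BL21 gene prefers `b0002`, so `b0003` is left without an ortholog.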


## Step 4: Align Genes Across Strains

Create a unified gene database by aligning genes across all strains using the BBH results.

In [6]:
# Align genes across all strains
output_gene_info_path = './Output_Gene_Info'

combined_gene_db = multiModulon.align_genes(
    input_bbh_dir=output_bbh_path,
    output_dir=output_gene_info_path,
    reference_order=['MG1655', 'BL21', 'W3110'],  # optional: specify order
    # bbh_threshold=90  # optional: minimum percent identity threshold
)

combined_gene_db.head()


Gene counts in combined_gene_db:
  ✓ MG1655: 4305 genes (complete)
  ✓ BL21: 4196 genes (complete)
  ✓ W3110: 4229 genes (complete)

✓ All species have complete gene sets in combined_gene_db!

Combined gene database shape: (4815, 3)
Number of gene groups: 4815


| | MG1655 | BL21 | W3110 | row_label |
|---|---|---|---|---|
| 0 | b0001 | ECD_00001 | JW4367 | b0001 |
| 1 | b0002 | ECD_00002 | JW0001 | b0002 |
| 2 | b0003 | ECD_00003 | JW0002 | b0003 |
| 3 | b0004 | ECD_00004 | JW0003 | b0004 |
| 4 | b0005 | ECD_00005 | JW0004 | b0005 |
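
The combined table has more gene groups (4815) than any single strain has genes, because genes without an ortholog get a row of their own. The following sketch shows one way such a table could be assembled for two strains with pandas. The gene lists, the BBH pairs, and the rule for `row_label` (first available ID in reference order) are assumptions for illustration, not the logic of `align_genes`.

```python
import pandas as pd

# Hypothetical gene lists for two strains
mg1655_genes = ["b0001", "b0002", "b0003"]
bl21_genes = ["ECD_00001", "ECD_00002", "ECD_00099"]

# Hypothetical BBH pairs (MG1655 gene, BL21 gene)
bbh = pd.DataFrame({"MG1655": ["b0001", "b0002"],
                    "BL21": ["ECD_00001", "ECD_00002"]})

combined = (
    pd.DataFrame({"MG1655": mg1655_genes})
    # Attach the BL21 ortholog of every MG1655 gene, if it has one
    .merge(bbh, on="MG1655", how="left")
    # Add BL21 genes without an MG1655 ortholog as separate gene groups
    .merge(pd.DataFrame({"BL21": bl21_genes}), on="BL21", how="outer")
)

# Assumed labeling rule: use the MG1655 ID, falling back to the BL21 ID
combined["row_label"] = combined["MG1655"].fillna(combined["BL21"])
print(combined)
```

Three genes per strain with two ortholog pairs yield four gene groups: two shared, one specific to each strain.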


## Step 5: Generate Aligned Expression Matrices

Create expression matrices with consistent gene indexing across all strains for multi-view ICA.

In [7]:
# Generate aligned expression matrices
print("Generating aligned expression matrices...")
multiModulon.generate_X(output_gene_info_path)

# The output shows aligned X matrices and dimension recommendations

Generating aligned expression matrices...

Generated aligned X matrices:
MG1655: (4815, 69) (4300 non-zero gene groups)
BL21: (4815, 75) (4195 non-zero gene groups)
W3110: (4815, 71) (4225 non-zero gene groups)
Maximum dimension recommendation: 65
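
Every aligned matrix has 4815 rows, one per gene group, while the "non-zero gene groups" count stays close to the gene count of each strain. A plausible reading is that groups without a gene in a given strain are filled with zeros. The sketch below illustrates that idea with `DataFrame.reindex`; the zero fill and all names in it are assumptions, not the confirmed behavior of `generate_X`.

```python
import pandas as pd

# Hypothetical gene-group table: one row per group, one column per strain
combined = pd.DataFrame(
    {"MG1655": ["b0001", "b0002", None], "BL21": ["ECD_00001", None, "ECD_00099"]},
    index=["b0001", "b0002", "ECD_00099"],  # row labels shared by all strains
)

# Hypothetical BL21 expression matrix (genes x samples)
x_bl21 = pd.DataFrame(
    [[5.0, 6.0], [1.0, 2.0]],
    index=["ECD_00001", "ECD_00099"],
    columns=["sample_1", "sample_2"],
)

# Map each BL21 gene ID to its group label, then expand to all gene groups
gene_to_group = {gene: label for label, gene in combined["BL21"].dropna().items()}
aligned = x_bl21.rename(index=gene_to_group).reindex(combined.index, fill_value=0.0)

print(aligned.shape)
print(int((aligned != 0).any(axis=1).sum()), "non-zero gene groups")
```

After this step every strain shares the same row index, which is what allows the matrices to be decomposed jointly.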


## Step 6: Optimize Number of Core Components

Determine the optimal number of core components (k), the components shared across all strains, by testing a range of k values.

In [8]:
# Optimize number of core components
print("Optimizing number of core components...")
print("This will test different values of k and find the optimal number.")

optimal_num_core_components = multiModulon.optimize_number_of_core_components(
    step=10,                        # Test k in increments of 10 (10, 20, 30, ...)
    save_path='./Output_Optimization_Figures', # Save plots to directory
    fig_size=(7, 5),              # Figure size
    num_runs_per_dimension=10,
    seed=10
)

Optimizing number of core components...
This will test different values of k and find the optimal number.
Optimizing core components for 3 species/strains: ['W3110', 'BL21', 'MG1655']
Auto-determined max_k = 60 based on minimum samples (69)
Using GPU: NVIDIA GeForce RTX 3090




Optimal k = 30 (robust non-single gene components = 21.0)

Plot saved to: Output_Optimization_Figures/num_core_optimization.svg


## Step 7: Optimize Number of Unique Components

Determine the optimal number of unique (species-specific) components for each strain.

In [9]:
# Optimize unique components for each species
print("Optimizing unique components per species...")
print("This will test different numbers of unique components for each species.\n")

optimal_unique, optimal_total = multiModulon.optimize_number_of_unique_components(
    optimal_num_core_components=optimal_num_core_components,
    step=10,
    save_path='./Output_Optimization_Figures',
    fig_size=(7, 5),
    num_runs_per_dimension=10,
    seed=10
)

Optimizing unique components per species...
This will test different numbers of unique components for each species.


Optimizing unique components with core k = 30

Optimizing unique components for W3110
Optimizing unique components: [30, 40, 50, 60]



Plot saved to: Output_Optimization_Figures/num_unique_W3110_optimization.svg

Optimal a for W3110: 60 (10 robust unique components)

Optimizing unique components for BL21
Optimizing unique components: [30, 40, 50, 60, 70]



Plot saved to: Output_Optimization_Figures/num_unique_BL21_optimization.svg

Optimal a for BL21: 70 (23 robust unique components)

Optimizing unique components for MG1655
Optimizing unique components: [30, 40, 50, 60]



Plot saved to: Output_Optimization_Figures/num_unique_MG1655_optimization.svg

Optimal a for MG1655: 60 (18 robust unique components)

Optimization Summary
Core components (c): 30
W3110: a = 60 (unique components: 30)
BL21: a = 70 (unique components: 40)
MG1655: a = 60 (unique components: 30)


In [10]:
optimal_num_core_components

30

In [11]:
optimal_total

{'W3110': 60, 'BL21': 70, 'MG1655': 60}

## Step 8: Run Robust Multi-view ICA

Perform robust multi-view ICA with multiple runs and clustering to identify consistent components.

In [12]:
# Run robust multi-view ICA
print("Running robust multi-view ICA with clustering...")
print("This performs multiple ICA runs and clusters the results for robustness.\n")

M_matrices, A_matrices = multiModulon.run_robust_multiview_ica(
    a=optimal_total,                 # Dictionary of total components per species
    c=optimal_num_core_components,   # Number of core components
    num_runs=10,                     # Number of runs for robustness
    seed=100                         # Random seed for reproducibility
)

Running robust multi-view ICA with clustering...
This performs multiple ICA runs and clusters the results for robustness.


Running robust multi-view ICA with 10 runs
Species: ['W3110', 'BL21', 'MG1655']
Total components (a): {'W3110': 60, 'BL21': 70, 'MG1655': 60}
Core components (c): 30

Collecting components from 10 runs...




Clustering components...

Creating final M matrices with robust components...

Saving robust M matrices to species objects...
✓ W3110: (4815, 28) (23 core, 5 unique components)
✓ BL21: (4815, 42) (23 core, 19 unique components)
✓ MG1655: (4815, 34) (23 core, 11 unique components)

Generating A matrices from robust M matrices...
✓ Generated A matrix for W3110: (28, 71)
✓ Generated A matrix for BL21: (42, 75)
✓ Generated A matrix for MG1655: (34, 69)

Robust multi-view ICA completed!
Total core components retained: 23
W3110: 5 unique components
BL21: 19 unique components
MG1655: 11 unique components
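
The idea behind "multiple runs plus clustering" can be sketched with NumPy alone: components that reappear in every run (up to sign and noise) form tight clusters, while unstable ones do not. The example below simulates the components instead of running ICA, and it uses a simple greedy clustering on absolute cosine similarity with a 0.9 cutoff. Both choices are assumptions for illustration; the clustering used inside `run_robust_multiview_ica` is not described in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_runs = 200, 10

# Two "true" sparse components that every simulated run recovers
true_components = np.zeros((2, n_genes))
true_components[0, :10] = 5.0
true_components[1, 10:20] = 5.0

collected = []
for _ in range(n_runs):
    for comp in true_components:
        sign = rng.choice([-1.0, 1.0])  # ICA components have arbitrary sign
        collected.append(sign * comp + rng.normal(scale=0.1, size=n_genes))
    collected.append(rng.normal(size=n_genes))  # one unstable component per run
collected = np.array(collected)

# Absolute cosine similarity ignores the sign ambiguity
unit = collected / np.linalg.norm(collected, axis=1, keepdims=True)
similarity = np.abs(unit @ unit.T)

# Greedy clustering: group everything that is similar to a seed component
unassigned = set(range(len(collected)))
clusters = []
while unassigned:
    seed = min(unassigned)
    members = [i for i in unassigned if similarity[seed, i] >= 0.9]
    clusters.append(members)
    unassigned -= set(members)

# Keep only clusters with a member from every run
robust = [c for c in clusters if len(c) >= n_runs]
print(len(clusters), "clusters,", len(robust), "robust components")
```

This filtering is why the number of retained components (for example 23 core components) can be lower than the number requested (c = 30).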


## Step 9: Optimize Thresholds to Binarize the M Matrices

Use Otsu's method to calculate a threshold for each component of the M matrices across all species.

In [13]:
multiModulon.optimize_M_thresholds(method="Otsu's method", quantile_threshold=95)


Optimizing thresholds for W3110...
  Gene mapping: 4229/4229 genes have expression data
✓ Optimized thresholds for 28 components
  Average genes per component: 32.9

Optimizing thresholds for BL21...
  Gene mapping: 4196/4196 genes have expression data
✓ Optimized thresholds for 42 components
  Average genes per component: 31.5

Optimizing thresholds for MG1655...
  Gene mapping: 4305/4305 genes have expression data
✓ Optimized thresholds for 34 components
  Average genes per component: 28.4

Threshold optimization completed!
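
For readers unfamiliar with Otsu's method, here is a minimal NumPy sketch applied to simulated gene weights of a single component. It picks the cut that maximizes the between-class variance of a histogram of absolute weights. The function `otsu_threshold`, the bin count, and the simulated weights are assumptions for illustration; the sketch does not reproduce `optimize_M_thresholds` and ignores its `quantile_threshold` argument, whose role is not described here.

```python
import numpy as np

def otsu_threshold(values: np.ndarray, bins: int = 100) -> float:
    """Return the cut that maximizes the between-class variance of a histogram."""
    counts, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2

    weight_low = np.cumsum(counts)             # points in bins up to each cut
    weight_high = weight_low[-1] - weight_low  # points in bins above each cut
    sum_low = np.cumsum(counts * centers)
    mean_low = sum_low / np.maximum(weight_low, 1)
    mean_high = (sum_low[-1] - sum_low) / np.maximum(weight_high, 1)

    between_var = weight_low * weight_high * (mean_low - mean_high) ** 2
    best = np.argmax(between_var[:-1])  # skip the last bin: nothing lies above it
    return float(edges[best + 1])       # upper edge of the last "low" bin

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.01, size=4000)              # most genes: near zero
weights[:30] = rng.normal(loc=0.3, scale=0.02, size=30)  # a few strong genes

threshold = otsu_threshold(np.abs(weights))
in_component = np.abs(weights) > threshold
print(round(threshold, 3), int(in_component.sum()))
```

Genes whose absolute weight exceeds the threshold are treated as members of the component, which turns each column of M into a gene set.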


## Step 10: Save the MultiModulon Object to JSON

Save the multiModulon object as a JSON file at the given path and file name.

In [14]:
multiModulon.save_to_json_multimodulon("./multiModulon_E_coli_comparison_demo.json.gz")
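
The `.json.gz` extension suggests gzip-compressed JSON. As a rough illustration of that file type, the sketch below writes and reads a small dictionary with the standard library. The `payload` contents are a made-up stand-in and do not reflect the schema written by `save_to_json_multimodulon`.

```python
import gzip
import json
import os
import tempfile

# Hypothetical stand-in for serialized content; not the package's schema
payload = {"species": ["W3110", "BL21", "MG1655"], "core_components": 23}

path = os.path.join(tempfile.mkdtemp(), "demo.json.gz")

# Text mode ("wt"/"rt") lets json write and read through the gzip stream
with gzip.open(path, "wt", encoding="utf-8") as handle:
    json.dump(payload, handle)

with gzip.open(path, "rt", encoding="utf-8") as handle:
    restored = json.load(handle)

print(restored == payload)
```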

In [15]:
# Confirm the shape of each strain's M matrix (gene groups x components)
for species in multiModulon.species:
    print(species, " : ", multiModulon[species].M.shape)

W3110  :  (4815, 28)
BL21  :  (4815, 42)
MG1655  :  (4815, 34)
