# 1_Create_MultiModulon_object

This notebook demonstrates the first step for multi-species/strain/modality analysis using the MultiModulon package.

In [169]:
# Import required libraries
from multimodulon import MultiModulon
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

## Step 1: Initialize MultiModulon

Load data from the Input_Data directory containing expression matrices, gene annotations, and sample metadata for all strains.

In [10]:
# Path to the Input_Data folder
input_data_path = './Input_Data'

# Initialize MultiModulon object
multiModulon = MultiModulon(input_data_path)


Initializing MultiModulon...

Loading from Input_Data:

Enterococcus_faecium:
  - Number of genes: 2833
  - Number of samples: 138
  - Data validation: PASSED

Enterococcus_faecalis:
  - Number of genes: 2680
  - Number of samples: 463
  - Data validation: PASSED

Successfully loaded 2 species/strains/modalities


## Step 2: Generate BBH Files

Generate Bidirectional Best Hits (BBH) files for ortholog detection between all strain pairs.

In [11]:
# Generate BBH files using multiple threads for faster computation
output_bbh_path = './Output_BBH'

multiModulon.generate_BBH(output_bbh_path, threads=16)
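
To make the BBH idea concrete: two genes form a bidirectional best hit when each is the other's top-scoring match. The sketch below illustrates that rule on toy alignment scores. It is not the package's implementation, and the function names, gene IDs, and scores are invented for illustration.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: dict of {(query, subject): alignment_score}.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scores; the gene IDs are invented for illustration.
a_vs_b = {("a1", "b1"): 900, ("a1", "b2"): 300, ("a2", "b2"): 500, ("a3", "b2"): 450}
b_vs_a = {("b1", "a1"): 880, ("b2", "a2"): 510, ("b2", "a3"): 440}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)  # [('a1', 'b1'), ('a2', 'b2')]
```

Gene `a3` also hits `b2` best, but `b2` prefers `a2`, so `a3` is left without an ortholog in this toy example.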

## Step 3: Align Genes Across Strains

Create a unified gene database by aligning genes across all strains using the BBH results.

In [12]:
# Align genes across all strains
output_gene_info_path = './Output_Gene_Info'

combined_gene_db = multiModulon.align_genes(
    input_bbh_dir=output_bbh_path,
    output_dir=output_gene_info_path
)

combined_gene_db.head()


Gene counts in combined_gene_db:
  ✓ Enterococcus_faecium: 2833 genes (complete)
  ✓ Enterococcus_faecalis: 2680 genes (complete)

✓ All species have complete gene sets in combined_gene_db!

Combined gene database shape: (3884, 2)
Number of gene groups: 3884


,Enterococcus_faecium,Enterococcus_faecalis,row_label
0,E6A31_00005,WMS_00242,E6A31_00005
1,E6A31_00010,WMS_00243,E6A31_00010
2,E6A31_00015,WMS_00244,E6A31_00015
3,E6A31_00020,WMS_00245,E6A31_00020
4,E6A31_00025,WMS_00246,E6A31_00025
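
The counts in the log above can be cross-checked with a little arithmetic. The sketch assumes that every gene belongs to exactly one gene group and that a group holds at most one gene per species. That assumption is ours and is not confirmed by the notebook. Under it, the number of groups shared by both species is the sum of the gene counts minus the number of groups.

```python
def shared_pair_count(n_genes_a, n_genes_b, n_groups):
    """Number of two-species gene groups under a one-to-one grouping.

    Assumes each gene is in exactly one group and each group holds at most
    one gene per species (an assumption, not confirmed by the notebook).
    """
    shared = n_genes_a + n_genes_b - n_groups
    if shared < 0 or shared > min(n_genes_a, n_genes_b):
        raise ValueError("counts are inconsistent with one-to-one grouping")
    return shared


shared = shared_pair_count(2833, 2680, 3884)
print(shared)                        # 1629 groups with a gene in both species
print(2833 - shared, 2680 - shared)  # 1204 1051 groups found in one species only
```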


## Step 4: Create Gene Tables

Parse GFF files to create gene annotation tables for each strain.

In [13]:
# Create gene tables from GFF files
print("Creating gene tables from GFF files...")
multiModulon.create_gene_table()

Creating gene tables from GFF files...

Creating gene tables for all species...

Processing Enterococcus_faecium...
  ✓ Created gene table with 2833 genes

Processing Enterococcus_faecalis...
  ✓ Created gene table with 2680 genes

Gene table creation completed!
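
For readers unfamiliar with GFF, the sketch below shows how one tab-separated GFF3 feature line maps onto gene-table columns such as `accession`, `start`, `end`, and `strand`. It is an illustration only. `parse_gff_gene_line` is a hypothetical helper, and the attribute keys (`locus_tag`, `Name`) follow common NCBI GFF3 files, which is an assumption about the input.

```python
def parse_gff_gene_line(line):
    """Parse one GFF3 feature line into a small dict, or return None.

    Only 'gene' features are kept; comment lines start with '#'.
    """
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "gene":
        return None
    seqid, _source, _type, start, end, _score, strand, _phase, attrs = fields
    # Column 9 is a ';'-separated list of key=value pairs.
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if "=" in item)
    return {
        "locus_tag": attributes.get("locus_tag"),
        "accession": seqid,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "gene_name": attributes.get("Name"),
    }


# Example line built to mirror the dnaA row of the Enterococcus_faecium gene
# table; the source column and attribute layout are assumptions.
example = (
    "CP038996.1\tGenbank\tgene\t1\t1335\t.\t+\t.\t"
    "ID=gene-E6A31_00005;Name=dnaA;locus_tag=E6A31_00005"
)
record = parse_gff_gene_line(example)
```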


In [14]:
# Add eggNOG-mapper functional annotations to each species' gene table
multiModulon.add_eggnog_annotation("./Output_eggnog_mapper")


Adding eggNOG annotations from ./Output_eggnog_mapper

Processing Enterococcus_faecium...
  - Reading Enterococcus_faecium.emapper.annotations
  ✓ Added eggNOG annotations to 2438/2833 genes

Processing Enterococcus_faecalis...
  - Reading Enterococcus_faecalis.emapper.annotations
  ✓ Added eggNOG annotations to 2528/2680 genes

eggNOG annotation addition completed!


In [15]:
# Inspect the annotated gene table for one species
multiModulon['Enterococcus_faecium'].gene_table

locus_tag,accession,start,end,strand,gene_name,old_locus_tag,gene_product,ncbi_protein,seed_ortholog,evalue,score,eggNOG_OGs,max_annot_lvl,COG_category,Description,Preferred_name,GOs,EC,KEGG_ko,KEGG_Pathway,KEGG_Module,KEGG_Reaction,KEGG_rclass,BRITE,KEGG_TC,CAZy,BiGG_Reaction,PFAMs
E6A31_00005,CP038996.1,1,1335,+,dnaA,,chromosomal replication initiator protein DnaA,QHU86234.1,1104325.M7W_223,9.360000e-317,863.0,"COG0593@1|root,COG0593@2|Bacteria,1TPV7@1239|F...",2|Bacteria,L,it binds specifically double-stranded DNA at a...,dnaA,"GO:0003674,GO:0003676,GO:0003677,GO:0003688,GO...",-,ko:K02313,"ko02020,ko04112,map02020,map04112",-,-,-,"ko00000,ko00001,ko03032,ko03036",-,-,-,"Bac_DnaA,Bac_DnaA_C,DnaA_N"
E6A31_00010,CP038996.1,1533,2663,+,dnaN,,DNA polymerase III subunit beta,QHU86235.1,1104325.M7W_224,4.890000e-262,719.0,"COG0592@1|root,COG0592@2|Bacteria,1TQ7J@1239|F...",2|Bacteria,L,Confers DNA tethering and processivity to DNA ...,dnaN,-,2.7.7.7,ko:K02338,"ko00230,ko00240,ko01100,ko03030,ko03430,ko0344...",M00260,"R00375,R00376,R00377,R00378",RC02795,"ko00000,ko00001,ko00002,ko01000,ko03032,ko03400",-,-,-,"DNA_pol3_beta,DNA_pol3_beta_2,DNA_pol3_beta_3"
E6A31_00015,CP038996.1,2892,3134,+,yaaA,,S4 domain-containing protein YaaA,QHU86236.1,565664.EFXG_00020,3.220000e-51,162.0,"COG2501@1|root,COG2501@2|Bacteria,1VEJ2@1239|F...",2|Bacteria,S,S4 domain,yaaA,"GO:0003674,GO:0003676,GO:0003723,GO:0005488,GO...",-,ko:K14761,-,-,-,-,"ko00000,ko03009",-,-,-,S4_2
E6A31_00020,CP038996.1,3121,4245,+,recF,,DNA replication/repair protein RecF,QHU86237.1,565664.EFXG_00021,4.880000e-261,716.0,"COG1195@1|root,COG1195@2|Bacteria,1TP9U@1239|F...",2|Bacteria,L,it is required for DNA replication and normal ...,recF,"GO:0000731,GO:0005575,GO:0005622,GO:0005623,GO...",-,ko:K03629,"ko03440,map03440",-,-,-,"ko00000,ko00001,ko03400",-,-,-,SMC_N
E6A31_00025,CP038996.1,4242,6188,+,gyrB,,DNA topoisomerase (ATP-hydrolyzing) subunit B,QHU86238.1,1104325.M7W_227,0.000000e+00,1273.0,"COG0187@1|root,COG0187@2|Bacteria,1TQ0R@1239|F...",2|Bacteria,L,A type II topoisomerase that negatively superc...,gyrB,"GO:0000166,GO:0000287,GO:0003674,GO:0003824,GO...",5.99.1.3,"ko:K02470,ko:K02622",-,-,-,-,"ko00000,ko01000,ko02048,ko03032,ko03036,ko03400",-,-,-,"DNA_gyraseB,DNA_gyraseB_C,HATPase_c,Toprim"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
E6A31_15120,CP038997.1,118661,119254,+,,,XRE family transcriptional regulator,QHU89020.1,1140002.I570_03545,1.200000e-127,364.0,"COG1476@1|root,COG1476@2|Bacteria,1V3Q1@1239|F...",2|Bacteria,K,Cro/C1-type HTH DNA-binding domain,-,-,-,-,-,-,-,-,-,-,-,-,HTH_3
E6A31_15125,CP038997.1,119363,119677,+,,,hypothetical protein,,,,,,,,,,,,,,,,,,,,,
E6A31_15130,CP038997.1,119831,120184,-,,,hypothetical protein,QHU89021.1,,,,,,,,,,,,,,,,,,,,
E6A31_15135,CP038997.1,120204,121259,-,,,ParM/StbA family protein,QHU89022.1,,,,,,,,,,,,,,,,,,,,


## Step 5: Generate Aligned Expression Matrices

Create expression matrices with consistent gene indexing across all strains for multi-view ICA.

In [16]:
# Generate aligned expression matrices
print("Generating aligned expression matrices...")
multiModulon.generate_X(output_gene_info_path)

# The output shows aligned X matrices and dimension recommendations

Generating aligned expression matrices...

Generated aligned X matrices:
Enterococcus_faecium: (3884, 138) (2792 non-zero gene groups)
Enterococcus_faecalis: (3884, 463) (2653 non-zero gene groups)
Maximum dimension recommendation: 135
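
Both matrices above have 3884 rows, one per gene group, although each species has fewer genes. The "non-zero gene groups" wording suggests that groups without a gene in a given species are filled with zeros. The sketch below illustrates that re-indexing under that assumption. `align_expression` and the toy IDs are hypothetical and not the package's code.

```python
def align_expression(expression, gene_to_group, group_order):
    """Re-index a per-gene expression table onto shared gene groups.

    expression:    {gene_id: [value per sample]}
    gene_to_group: {gene_id: group_label} for this species
    group_order:   list of all group labels (same order for every species)
    Groups with no gene in this species get an all-zero row (assumption).
    """
    if not expression:
        raise ValueError("expression table is empty")
    n_samples = len(next(iter(expression.values())))
    by_group = {
        gene_to_group[gene]: row
        for gene, row in expression.items()
        if gene in gene_to_group
    }
    return [list(by_group.get(group, [0.0] * n_samples)) for group in group_order]


# Toy data: three shared gene groups, two samples; IDs are invented.
group_order = ["grp1", "grp2", "grp3"]
expression = {"x_001": [5.2, 4.8], "x_003": [1.1, 0.9]}
gene_to_group = {"x_001": "grp1", "x_003": "grp3"}

aligned = align_expression(expression, gene_to_group, group_order)
n_nonzero = sum(any(value != 0 for value in row) for row in aligned)
print(len(aligned), n_nonzero)  # 3 2
```

Because every species shares the same row order, row *i* refers to the same gene group in each matrix, which is what multi-view ICA needs.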


## Step 6: Optimize Number of Core Components

Use Cohen's d effect size metric to automatically determine the optimal number of core components.

In [17]:
# Optimize number of core components
print("Optimizing number of core components...")
print("This will test different values of k and find the optimal number.")

optimal_num_core_components = multiModulon.optimize_number_of_core_components(
    metric='effect_size',          # Use Cohen's d effect size metric
    step=10,                       # Test k = 10, 20, 30, ...
    save_path='./Output_Optimization_Figures', # Save plots to directory
    fig_size=(7, 5),               # Figure size
    num_runs_per_dimension=10,
    seed=100                        # Random seed for reproducibility
)

Optimizing number of core components...
This will test different values of k and find the optimal number.
Optimizing core components for 2 species/strains: ['Enterococcus_faecium', 'Enterococcus_faecalis']
Auto-determined max_k = 130 based on minimum samples (138)
Using GPU: NVIDIA GeForce RTX 3090




Optimal k = 120 (Number of robust components above threshold = 19.0)

Plot saved to: Output_Optimization_Figures/num_core_optimization.svg
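
For reference, Cohen's d measures how far apart two group means are in units of their pooled standard deviation. The sketch below is the textbook formula using only the standard library. How the package applies the metric to ICA components is not shown in this notebook, so this is background rather than a description of its internals.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Means differ by 1 and the pooled standard deviation is 2, so d = 0.5.
d = cohens_d([2, 4, 6], [1, 3, 5])
```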


## Step 7: Optimize Number of Unique Components

Determine the optimal number of unique (species-specific) components for each strain.

In [18]:
# Optimize unique components for each species
print("Optimizing unique components per species...")
print("This will test different numbers of unique components for each species.\n")

optimal_unique, optimal_total = multiModulon.optimize_number_of_unique_components(
    optimal_num_core_components=optimal_num_core_components,
    step=10,
    save_path='./Output_Optimization_Figures',
    fig_size=(7, 5),
    num_runs_per_dimension=10,
    seed=100                          # Random seed for reproducibility
)

Optimizing unique components per species...
This will test different numbers of unique components for each species.


Optimizing unique components with core k = 120

Optimizing unique components for Enterococcus_faecium
Testing a values: [120, 130]



Plot saved to: Output_Optimization_Figures/num_unique_Enterococcus_faecium_optimization.svg

Optimal a for Enterococcus_faecium: 130 (5 robust unique components)

Optimizing unique components for Enterococcus_faecalis
Testing a values: [120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450]



Plot saved to: Output_Optimization_Figures/num_unique_Enterococcus_faecalis_optimization.svg

Optimal a for Enterococcus_faecalis: 370 (182 robust unique components)

Optimization Summary
Core components (c): 120
Enterococcus_faecium: a = 130 (unique components: 10)
Enterococcus_faecalis: a = 370 (unique components: 250)
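
In the summary, the unique-component count for each species is the total `a` minus the shared core count `c`. The short check below reproduces those numbers from the values printed in the log. The variable names are ours.

```python
core_components = 120
total_components = {"Enterococcus_faecium": 130, "Enterococcus_faecalis": 370}

# Unique slots per species = total components (a) - core components (c).
unique_components = {
    species: total - core_components
    for species, total in total_components.items()
}
print(unique_components)
# {'Enterococcus_faecium': 10, 'Enterococcus_faecalis': 250}
```

These are the slots allotted to species-specific signals. The smaller "robust unique components" figures reported during optimization count only those that met the robustness threshold.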


## Step 8: Run Robust Multi-view ICA

Perform robust multi-view ICA with multiple runs and clustering to identify consistent components.

In [21]:
# Run robust multi-view ICA
print("Running robust multi-view ICA with clustering...")
print("This performs multiple ICA runs and clusters the results for robustness.\n")

M_matrices, A_matrices = multiModulon.run_robust_multiview_ica(
    a=optimal_total,                 # Dictionary of total components per species
    c=optimal_num_core_components,   # Number of core components
    num_runs=20,                     # Number of runs for robustness
    seed=100                         # Random seed for reproducibility
)

Running robust multi-view ICA with clustering...
This performs multiple ICA runs and clusters the results for robustness.


Running robust multi-view ICA with 20 runs
Species: ['Enterococcus_faecium', 'Enterococcus_faecalis']
Total components (a): {'Enterococcus_faecium': 130, 'Enterococcus_faecalis': 370}
Core components (c): 120

Collecting components from 20 runs...




Clustering components...

Creating final M matrices with robust components...

Saving robust M matrices to species objects...
✓ Enterococcus_faecium: (3884, 25) (20 core, 5 unique components)
✓ Enterococcus_faecalis: (3884, 170) (20 core, 150 unique components)

Generating A matrices from robust M matrices...
✓ Generated A matrix for Enterococcus_faecium: (25, 138)
✓ Generated A matrix for Enterococcus_faecalis: (170, 463)

Robust multi-view ICA completed!
Total core clusters identified: 20
Enterococcus_faecium: 5 unique components
Enterococcus_faecalis: 150 unique components
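
The intuition behind "robust" components is that a real signal should reappear across independent ICA runs, while noise-driven components should not. The sketch below is a deliberately simplified illustration of that idea, using cosine similarity against the first run. It is not the clustering procedure used by the package, which this notebook does not show, and the thresholds and function names are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def robust_components(runs, min_similarity=0.9, min_fraction=0.5):
    """Keep components of the first run that recur in enough other runs.

    runs: list of runs; each run is a list of component vectors.
    A component recurs in a run if some vector there has
    |cosine| >= min_similarity (ICA components are defined only up to sign).
    """
    reference, others = runs[0], runs[1:]
    kept = []
    for comp in reference:
        hits = sum(
            any(abs(cosine(comp, other)) >= min_similarity for other in run)
            for run in others
        )
        if others and hits / len(others) >= min_fraction:
            kept.append(comp)
    return kept


# Toy runs: the first component recurs (once sign-flipped), the second does not.
runs = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[-1.0, 0.05, 0.0], [0.0, 0.0, 1.0]],
    [[0.99, 0.0, 0.1], [0.0, 0.6, 0.8]],
]
kept = robust_components(runs)
print(kept)  # [[1.0, 0.0, 0.0]]
```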


## Step 9: Optimize Thresholds to Binarize the M Matrices

Use Otsu's method to calculate a threshold for each component in the M matrices across all species.

In [50]:
multiModulon.optimize_M_thresholds(method="Otsu's method", quantile_threshold=95)


Optimizing thresholds for Enterococcus_faecium...
  Gene mapping: 2833/2833 genes have expression data
✓ Optimized thresholds for 25 components
  Average genes per component: 14.7

Optimizing thresholds for Enterococcus_faecalis...
  Gene mapping: 2680/2680 genes have expression data
✓ Optimized thresholds for 170 components
  Average genes per component: 3.9

Threshold optimization completed!
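
Otsu's method picks the cut that best separates values into two classes by maximizing the between-class variance. The sketch below applies that rule directly to sorted 1-D values instead of a histogram. It is a generic illustration: how the package feeds gene weights into the method, and what `quantile_threshold` does, are not shown in this notebook.

```python
def otsu_threshold(values):
    """Return the cut that maximizes between-class variance for 1-D data.

    The threshold is the midpoint between the two neighbouring sorted values
    at the best split.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(data)
    best_score, best_cut = -1.0, None
    left_sum = 0.0
    for i in range(1, n):  # split into data[:i] and data[i:]
        left_sum += data[i - 1]
        if data[i] == data[i - 1]:
            continue  # no cut between equal values
        w0, w1 = i / n, (n - i) / n
        mu0, mu1 = left_sum / i, (total - left_sum) / (n - i)
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_score, best_cut = score, (data[i - 1] + data[i]) / 2
    if best_cut is None:
        raise ValueError("all values are identical")
    return best_cut


# Toy absolute gene weights: five near zero, two clearly large.
weights = [0.01, 0.02, 0.03, 0.02, 0.01, 0.9, 1.1]
threshold = otsu_threshold(weights)
members = [w for w in weights if w > threshold]
```

With these toy weights the cut falls between 0.03 and 0.9, so only the two large weights count as component members.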


## Step 10: Save the MultiModulon Object to JSON

Save the multiModulon object to a JSON file at the given path and file name.

In [168]:
multiModulon.save_to_json_multimodulon("./multiModulon_Enterococcus.json.gz")