# 1_Create_MultiModulon_object

This notebook demonstrates the first step for multi-species/strain/modality analysis using the MultiModulon package.

In [1]:
# Import required libraries
from multimodulon import MultiModulon
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

## Step 1: Initialize MultiModulon

Load data from the Input_Data directory containing expression matrices, gene annotations, and sample metadata for all strains.

In [2]:
# Path to the Input_Data folder
input_data_path = './Input_Data'

# Initialize MultiModulon object
multiModulon = MultiModulon(input_data_path)


Initializing MultiModulon...

Loading from Input_Data:

Streptomyces_albidoflavus:
  - Number of genes: 6140
  - Number of samples: 234
  - Data validation: PASSED

Streptomyces_venezuelae:
  - Number of genes: 7377
  - Number of samples: 200
  - Data validation: PASSED

Successfully loaded 2 species/strains/modalities


## Step 2: Generate BBH Files

Generate Bidirectional Best Hits (BBH) files for ortholog detection between all strain pairs.

In [3]:
# Generate BBH files using multiple threads for faster computation
output_bbh_path = './Output_BBH'

multiModulon.generate_BBH(output_bbh_path, threads=4)
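
To make the idea of a bidirectional best hit concrete, here is a minimal sketch that is independent of the package. It assumes the best hit of every gene has already been computed in both search directions; the function name and gene IDs are hypothetical, and this is not necessarily how `generate_BBH` works internally.

```python
def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Return (gene_a, gene_b) pairs that are each other's best hit."""
    pairs = []
    for gene_a, gene_b in best_a_to_b.items():
        # Keep the pair only if the reverse search points back to gene_a
        if best_b_to_a.get(gene_b) == gene_a:
            pairs.append((gene_a, gene_b))
    return pairs


# Hypothetical best hits in each direction (IDs are made up for illustration)
a_to_b = {"A_001": "B_010", "A_002": "B_020", "A_003": "B_020"}
b_to_a = {"B_010": "A_001", "B_020": "A_003"}

bbh_pairs = bidirectional_best_hits(a_to_b, b_to_a)
print(bbh_pairs)
```

In this toy input, `A_002` is dropped because its best hit `B_020` points back to `A_003` instead.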

## Step 3: Align Genes Across Strains

Create a unified gene database by aligning genes across all strains using the BBH results.

In [4]:
# Align genes across all strains
output_gene_info_path = './Output_Gene_Info'

combined_gene_db = multiModulon.align_genes(
    input_bbh_dir=output_bbh_path,
    output_dir=output_gene_info_path
)

combined_gene_db.head()


Gene counts in combined_gene_db:
  ✓ Streptomyces_albidoflavus: 6140 genes (complete)
  ✓ Streptomyces_venezuelae: 7377 genes (complete)

✓ All species have complete gene sets in combined_gene_db!

Combined gene database shape: (9532, 2)
Number of gene groups: 9532


,Streptomyces_albidoflavus,Streptomyces_venezuelae,row_label
0,OG330_00005,,OG330_00005
1,OG330_00010,,OG330_00010
2,OG330_00015,,OG330_00015
3,OG330_00020,DEJ43_30400,OG330_00020
4,OG330_00025,,OG330_00025
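
The layout of the table above (one row per gene group, one column per strain, empty cells where no ortholog was found) can be reproduced with a small sketch. The gene IDs, column names, and pairing logic are assumptions for illustration; the actual `align_genes` implementation may differ.

```python
import pandas as pd

# Hypothetical gene lists and ortholog pairs
genes_a = ["A_001", "A_002", "A_003"]
genes_b = ["B_010", "B_020"]
ortholog_pairs = [("A_001", "B_010")]

partner = dict(ortholog_pairs)

# One row per gene of species A, with its ortholog in species B if any
rows = [{"species_A": a, "species_B": partner.get(a)} for a in genes_a]

# Genes of species B without an ortholog get their own rows
matched_b = set(partner.values())
rows += [{"species_A": None, "species_B": b} for b in genes_b if b not in matched_b]

combined = pd.DataFrame(rows, columns=["species_A", "species_B"])

# Label each gene group by the first gene ID that is present
combined["row_label"] = combined["species_A"].fillna(combined["species_B"])
print(combined)
```

If each group holds at most one gene per strain, the counts reported above imply 6140 + 7377 - 9532 = 3985 groups that contain a gene from both strains.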


## Step 4: Create Gene Tables

Parse GFF files to create gene annotation tables for each strain.

In [5]:
# Create gene tables from GFF files
print("Creating gene tables from GFF files...")
multiModulon.create_gene_table()

Creating gene tables from GFF files...

Creating gene tables for all species...

Processing Streptomyces_albidoflavus...
  ✓ Created gene table with 6140 genes

Processing Streptomyces_venezuelae...
  ✓ Created gene table with 7377 genes

Gene table creation completed!
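
For readers unfamiliar with GFF, the sketch below parses a single GFF3 feature line into the kind of fields that appear in a gene table (accession, start, end, strand, locus tag). The helper name and the example line are hypothetical; `create_gene_table` may extract more fields or handle edge cases differently.

```python
def parse_gff_line(line):
    """Parse one tab-separated GFF3 feature line into a flat dict."""
    fields = line.rstrip("\n").split("\t")
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = fields
    record = {
        "accession": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
    }
    # Column 9 holds semicolon-separated key=value attributes
    for item in attrs.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            record[key] = value
    return record


# Made-up example line in GFF3 layout
example = "CP000000.1\tGenbank\tgene\t664\t1248\t.\t-\t.\tID=gene-X_00005;locus_tag=X_00005"
gene = parse_gff_line(example)
print(gene)
```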


In [6]:
# Add eggNOG-mapper functional annotations to each strain's gene table
multiModulon.add_eggnog_annotation("./Output_eggnog_mapper")


Adding eggNOG annotations from ./Output_eggnog_mapper

Processing Streptomyces_albidoflavus...
  - Reading Streptomyces_albidoflavus.emapper.annotations
  ✓ Added eggNOG annotations to 5794/6140 genes

Processing Streptomyces_venezuelae...
  - Reading Streptomyces_venezuelae.emapper.annotations
  ✓ Added eggNOG annotations to 6893/7377 genes

eggNOG annotation addition completed!


In [7]:
# Inspect the annotated gene table for one strain
multiModulon['Streptomyces_albidoflavus'].gene_table

locus_tag,accession,start,end,strand,gene_name,old_locus_tag,gene_product,ncbi_protein,seed_ortholog,evalue,score,eggNOG_OGs,max_annot_lvl,COG_category,Description,Preferred_name,GOs,EC,KEGG_ko,KEGG_Pathway,KEGG_Module,KEGG_Reaction,KEGG_rclass,BRITE,KEGG_TC,CAZy,BiGG_Reaction,PFAMs
OG330_00005,CP108647.1,664,1248,-,,,hypothetical protein,WSU13568.1,457425.XNR_5935,1.210000e-130,371.0,"2AKX1@1|root,31BQI@2|Bacteria,2GNMY@201174|Act...",2|Bacteria,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
OG330_00010,CP108647.1,1245,2162,-,,,dihydrolipoamide S-succinyltransferase,WSU13569.1,457425.XNR_5934,1.570000e-209,579.0,"COG2944@1|root,COG2944@2|Bacteria,2GM0F@201174...",2|Bacteria,K,sequence-specific DNA binding,-,-,-,-,-,-,-,-,-,-,-,-,"DnaB_C,HTH_3,HTH_31"
OG330_00015,CP108647.1,2774,3118,+,,,IS5 family transposase,WSU13570.1,1205910.B005_1687,1.200000e-171,481.0,"COG3293@1|root,COG3293@2|Bacteria,2IIAY@201174...",2|Bacteria,L,Transposase DDE domain,-,-,-,-,-,-,-,-,-,-,-,-,"DDE_Tnp_1,DDE_Tnp_1_2,DUF4096"
OG330_00020,CP108647.1,3651,4421,-,,,IS5/IS1182 family transposase,WSU13571.1,591167.Sfla_6473,5.550000e-169,473.0,"28PSB@1|root,2ZCDV@2|Bacteria,2GNDZ@201174|Act...",2|Bacteria,S,DDE superfamily endonuclease,-,-,-,-,-,-,-,-,-,-,-,-,"DDE_Tnp_1,DDE_Tnp_4,HTH_Tnp_4"
OG330_00025,CP108647.1,4639,5427,-,,,hypothetical protein,WSU13572.1,73044.JNXP01000011_gene5545,3.360000e-30,120.0,"2B7QF@1|root,320WD@2|Bacteria,2ISID@201174|Act...",2|Bacteria,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
OG330_31200,CP108648.1,75948,76877,+,,,hypothetical protein,WSU19571.1,644283.Micau_3171,2.600000e-08,62.8,"COG4974@1|root,COG4974@2|Bacteria,2ID5W@201174...",2|Bacteria,L,Belongs to the 'phage' integrase family,-,-,-,ko:K04763,-,-,-,-,"ko00000,ko03036",-,-,-,"Phage_int_SAM_1,Phage_integrase"
OG330_31205,CP108648.1,77051,78076,+,,,competence protein CoiA,WSU19572.1,1206737.BAGF01000148_gene5919,3.830000e-52,185.0,"2BM88@1|root,32FRY@2|Bacteria,2HJYN@201174|Act...",2|Bacteria,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
OG330_31210,CP108648.1,78153,79427,-,,,pentapeptide repeat-containing protein,WSU19573.1,455632.SGR_77t,2.910000e-133,400.0,"COG1340@1|root,COG1357@1|root,COG1340@2|Bacter...",2|Bacteria,S,Pentapeptide repeats (9 copies),potC,-,"2.1.1.172,2.1.1.80,3.1.1.61","ko:K00564,ko:K02026,ko:K02057,ko:K03201,ko:K10...","ko02010,ko02020,ko02030,ko03070,map02010,map02...","M00207,M00221,M00299,M00333,M00506",R07234,RC00003,"ko00000,ko00001,ko00002,ko01000,ko02000,ko0202...","1.A.1.1,1.A.1.13,1.A.1.17,1.A.1.24,1.A.1.25,1....",-,-,"Abhydrolase_8,Ion_trans_2,Pentapeptide_3"
OG330_31215,CP108648.1,79763,80095,-,,,hypothetical protein,WSU19574.1,1157635.KB892024_gene350,1.360000e-38,132.0,"2BNF9@1|root,32H33@2|Bacteria,2H54K@201174|Act...",2|Bacteria,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-


## Step 5: Generate Aligned Expression Matrices

Create expression matrices with consistent gene indexing across all strains for multi-view ICA.

In [8]:
# Generate aligned expression matrices
print("Generating aligned expression matrices...")
multiModulon.generate_X(output_gene_info_path)

# The output shows aligned X matrices and dimension recommendations

Generating aligned expression matrices...

Generated aligned X matrices:
Streptomyces_albidoflavus: (9532, 234) (5629 non-zero gene groups)
Streptomyces_venezuelae: (9532, 200) (7377 non-zero gene groups)
Maximum dimension recommendation: 200
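
The sketch below shows one way an expression matrix can be placed on a shared gene-group index. It assumes that gene groups without a gene in a given strain are filled with zeros, which is what the "non-zero gene groups" counts above suggest; all names and values are hypothetical and this is not the package's own code.

```python
import pandas as pd

# Hypothetical combined gene database: index = gene group label,
# column = gene ID of species A in that group (None if absent)
combined = pd.DataFrame(
    {"species_A": ["A_001", "A_002", None]},
    index=["A_001", "A_002", "B_020"],
)

# Hypothetical expression matrix of species A (genes x samples)
expr = pd.DataFrame(
    [[1.5, 2.0], [0.3, 0.1]],
    index=["A_001", "A_002"],
    columns=["s1", "s2"],
)

# Map each gene ID to its gene group label
gene_to_group = {gene: group for group, gene in combined["species_A"].dropna().items()}

# Relabel rows by gene group, then expand to the full shared index with zeros
aligned = expr.rename(index=gene_to_group).reindex(combined.index, fill_value=0.0)

# Count gene groups that carry any non-zero expression value
n_nonzero_groups = int((aligned != 0).any(axis=1).sum())
print(aligned.shape, n_nonzero_groups)
```

Every strain then shares the same row index, so the matrices differ only in their number of samples.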


## Step 6: Optimize Number of Core Components

Use Cohen's d effect size metric to automatically determine the optimal number of core components.

In [9]:
# Optimize number of core components
print("Optimizing number of core components...")
print("This will test different values of k and find the optimal number.")

optimal_num_core_components = multiModulon.optimize_number_of_core_components(
    step=10,                                    # Test k = 10, 20, 30, ... up to max_k
    save_path='./Output_Optimization_Figures',  # Save plots to this directory
    fig_size=(7, 5),                            # Figure size
    num_runs_per_dimension=5,                   # Number of runs per tested k
    seed=100                                    # Random seed for reproducibility
)

Optimizing number of core components...
This will test different values of k and find the optimal number.
Optimizing core components for 2 species/strains: ['Streptomyces_albidoflavus', 'Streptomyces_venezuelae']
Auto-determined max_k = 190 based on minimum samples (200)
Using GPU: NVIDIA GeForce RTX 3090




Optimal k = 30 (robust components passing filter = 7.0)

Plot saved to: Output_Optimization_Figures/num_core_optimization.svg
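
As a reminder of what Cohen's d measures, here is the textbook two-sample definition with a pooled standard deviation. How the package applies the metric to ICA components is not shown in this notebook, so this is only the generic formula, with a hypothetical helper name.

```python
import numpy as np


def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    # Pooled variance from the unbiased (ddof=1) sample variances
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)


# Both samples have variance 4 and their means differ by 1, so d = 1 / 2
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d)
```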


## Step 7: Optimize Number of Unique Components

Determine the optimal number of unique (species-specific) components for each strain.

In [10]:
# Optimize unique components for each species
print("Optimizing unique components per species...")
print("This will test different numbers of unique components for each species.\n")

optimal_unique, optimal_total = multiModulon.optimize_number_of_unique_components(
    optimal_num_core_components=optimal_num_core_components,
    step=10,
    save_path='./Output_Optimization_Figures',
    fig_size=(7, 5),
    num_runs_per_dimension=5,
    seed=100                          # Random seed for reproducibility
)

Optimizing unique components per species...
This will test different numbers of unique components for each species.


Optimizing unique components with core k = 30

Optimizing unique components for Streptomyces_albidoflavus
Testing a values: [30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220]



Plot saved to: Output_Optimization_Figures/num_unique_Streptomyces_albidoflavus_optimization.svg

Optimal a for Streptomyces_albidoflavus: 220 (194 robust unique components)

Optimizing unique components for Streptomyces_venezuelae
Testing a values: [30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190]



Plot saved to: Output_Optimization_Figures/num_unique_Streptomyces_venezuelae_optimization.svg

Optimal a for Streptomyces_venezuelae: 190 (167 robust unique components)

Optimization Summary
Core components (c): 30
Streptomyces_albidoflavus: a = 220 (unique components: 190)
Streptomyces_venezuelae: a = 190 (unique components: 160)
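
The summary above follows from a simple relation: the total number of components `a` for a strain is the number of core components `c` plus that strain's unique components. A short check using the values reported in the output (the dictionary is typed in by hand here, not read from the object):

```python
# Values copied from the optimization summary above
c = 30
optimal_total = {
    "Streptomyces_albidoflavus": 220,
    "Streptomyces_venezuelae": 190,
}

# Unique components per strain = total components minus shared core components
unique_counts = {species: a - c for species, a in optimal_total.items()}
print(unique_counts)
```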


## Step 8: Run Robust Multi-view ICA

Perform robust multi-view ICA with multiple runs and clustering to identify consistent components.

In [11]:
# Run robust multi-view ICA
print("Running robust multi-view ICA with clustering...")
print("This performs multiple ICA runs and clusters the results for robustness.\n")

M_matrices, A_matrices = multiModulon.run_robust_multiview_ica(
    a=optimal_total,                 # Dictionary of total components per species
    c=optimal_num_core_components,   # Number of core components
    num_runs=10,                     # Number of runs for robustness
    seed=100                         # Random seed for reproducibility
)

Running robust multi-view ICA with clustering...
This performs multiple ICA runs and clusters the results for robustness.


Running robust multi-view ICA with 10 runs
Species: ['Streptomyces_albidoflavus', 'Streptomyces_venezuelae']
Total components (a): {'Streptomyces_albidoflavus': 220, 'Streptomyces_venezuelae': 190}
Core components (c): 30

Collecting components from 10 runs...




Clustering components...

Creating final M matrices with robust components...

Saving robust M matrices to species objects...
✓ Streptomyces_albidoflavus: (9532, 82) (7 core, 75 unique components)
✓ Streptomyces_venezuelae: (9532, 76) (7 core, 69 unique components)

Generating A matrices from robust M matrices...
✓ Generated A matrix for Streptomyces_albidoflavus: (82, 234)
✓ Generated A matrix for Streptomyces_venezuelae: (76, 200)

Robust multi-view ICA completed!
Total core components retained: 7
Streptomyces_albidoflavus: 75 unique components
Streptomyces_venezuelae: 69 unique components
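
The shapes reported above fit a decomposition X ≈ M · A, with M of shape (genes, components) and A of shape (components, samples). One common way to obtain A from a fixed M is a least-squares fit through the pseudo-inverse. The sketch below uses random toy data and assumes that approach; the package may compute A differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_components, n_samples = 50, 4, 10

# Toy gene-weight matrix M and activity matrix A_true, so that X = M @ A_true
M = rng.normal(size=(n_genes, n_components))
A_true = rng.normal(size=(n_components, n_samples))
X = M @ A_true

# Least-squares estimate of A given M (assumed approach, for illustration)
A = np.linalg.pinv(M) @ X
print(A.shape)
```

Because this toy X is exactly M @ A_true and M has full column rank, the estimate recovers A_true up to floating-point error. With real data the fit is only approximate.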


## Step 9: Optimize Thresholds to Binarize the M Matrices

Use Otsu's method to calculate a threshold for each component in the M matrices of all species.

In [12]:
multiModulon.optimize_M_thresholds(method="Otsu's method", quantile_threshold=95)


Optimizing thresholds for Streptomyces_albidoflavus...
  Gene mapping: 6140/6140 genes have expression data
✓ Optimized thresholds for 82 components
  Average genes per component: 21.2

Optimizing thresholds for Streptomyces_venezuelae...
  Gene mapping: 7377/7377 genes have expression data
✓ Optimized thresholds for 76 components
  Average genes per component: 35.7

Threshold optimization completed!
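
Otsu's method picks the threshold that maximizes the between-class variance of a histogram split into two classes. The sketch below is a generic NumPy version applied to toy gene weights. Applying it to absolute weights is an assumption, and it does not reproduce the package's `quantile_threshold` handling.

```python
import numpy as np


def otsu_threshold(values, bins=256):
    """Return the histogram bin center that maximizes between-class variance."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2

    w0 = np.cumsum(counts)            # number of values at or below each bin
    w1 = w0[-1] - w0                  # number of values above each bin
    cum_sum = np.cumsum(counts * centers)

    valid = (w0 > 0) & (w1 > 0)       # both classes must be non-empty
    mu0 = cum_sum[valid] / w0[valid]
    mu1 = (cum_sum[-1] - cum_sum[valid]) / w1[valid]
    between = w0[valid] * w1[valid] * (mu0 - mu1) ** 2
    return centers[valid][np.argmax(between)]


# Toy component: most gene weights are zero, a few genes carry large weights
weights = np.array([0.0] * 95 + [1.0] * 5)
threshold = otsu_threshold(np.abs(weights))
n_selected = int((np.abs(weights) > threshold).sum())
print(threshold, n_selected)
```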


## Step 10: Save the MultiModulon Object to JSON

Save the multiModulon object to a JSON file at the given path and file name.

In [57]:
multiModulon.save_to_json_multimodulon("./multiModulon_Streptomyces.json.gz")
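
The `.json.gz` extension suggests gzip-compressed JSON. As a generic illustration of that format, using only the standard library and a hypothetical payload (this is not the package's own save routine), a round trip looks like this:

```python
import gzip
import json
import os
import tempfile

# Hypothetical payload; the real object stores far more than this
payload = {
    "species": ["Streptomyces_albidoflavus", "Streptomyces_venezuelae"],
    "core_components": 7,
}

path = os.path.join(tempfile.mkdtemp(), "example.json.gz")

# Write JSON as gzip-compressed text
with gzip.open(path, "wt", encoding="utf-8") as fh:
    json.dump(payload, fh)

# Read it back and confirm the round trip
with gzip.open(path, "rt", encoding="utf-8") as fh:
    restored = json.load(fh)

print(restored == payload)
```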