# MetDeeCINE Interactive Tutorial
This notebook provides an interactive run of MetDeeCINE on the *in vivo* dataset used in the original paper.

## Overview
MetDeeCINE is an explainable deep learning framework designed to learn the principles of metabolic control from multi-omics data. It integrates foundational stoichiometric knowledge into a graph neural network (GNN) to predict quantitative relationships between enzymes and metabolites.

The primary output of MetDeeCINE is the prediction of **Concentration Control Coefficients (CCCs)**, which quantify how a change in the activity of a single enzyme affects the concentration of each metabolite across the entire network.
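
For reference, metabolic control analysis conventionally defines a CCC as the steady-state log-log sensitivity of a metabolite concentration to an enzyme activity. The symbols $X_j$ (concentration of metabolite $j$) and $e_i$ (activity of enzyme $i$) are notation introduced here for illustration and do not appear in the notebook code.

$$
C^{X_j}_{e_i} = \frac{\partial \ln X_j}{\partial \ln e_i}
$$

Under this definition, a CCC of 0.5 means that a 1% increase in the enzyme's activity raises the steady-state metabolite concentration by approximately 0.5%.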

### Expected Runtime
~3-5 minutes including installation

## Step 0: Clone repository and change directory


In [None]:
!git clone https://github.com/Takumi110/MetDeeCINE
%cd /content/MetDeeCINE/

!ls -la

## Step 1: Import Required Libraries and Initialize

In [None]:
!pip install optuna

In [None]:
# Import required libraries
import sys
import os
from datetime import datetime
import numpy as np
import pandas as pd
import torch
import copy
import math
import shutil

# Add src directory to path
# (this absolute path assumes Google Colab, where Step 0 clones the repository under /content)
path = "/content/MetDeeCINE/src"
sys.path.append(path)

# Import MetDeeCINE modules
from config import config
from preprocessing_new import Preprocessing
from mignn import MiGNN
from train import *
from create_dics import *
from metdeecine import *

fix_seeds()

## Step 2: Configuration Setup

Here we load the configuration settings from `config.py` and create a timestamped output directory for saving results.

The configuration includes:
- **Data settings**: Input data path (`exp_root`, `exp_name`) and experimental conditions (`exp_strains`)
- **Preprocessing settings**: How to generate fold changes and perform cross-validation
- **Model hyperparameters**: Learning rate, loss function, activation function, GNN-specific parameters
- **Training settings**: Number of epochs, batch size, early stopping patience

**Please review and modify these settings as needed.**

The default settings are optimized for the Uematsu_2022 dataset used in the original paper.

In [134]:
conf = config()
output_dir = f"./results/{datetime.now():%Y%m%d_%H%M%S}/"
os.makedirs(output_dir, exist_ok=True)
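
To review the settings before moving on, it can help to print every public attribute of the configuration object and override individual values in place. The sketch below uses a stand-in class, `DemoConfig`, because the actual contents of `config.py` are not shown here. The attribute names `exp_root`, `exp_name`, and `exp_strains` come from the list above, but their values and the helper `public_settings` are hypothetical.

```python
# Stand-in for the object returned by config(). The attribute names come
# from the settings list above; the values are placeholders, not the
# repository defaults.
class DemoConfig:
    exp_root = "./input/"
    exp_name = "demo_dataset"
    exp_strains = ["strain_a", "strain_b"]


def public_settings(conf):
    """Return the non-callable public attributes of a config object."""
    return {
        name: getattr(conf, name)
        for name in dir(conf)
        if not name.startswith("_") and not callable(getattr(conf, name))
    }


demo_conf = DemoConfig()
# Override one setting on the instance before it is passed on.
demo_conf.exp_name = "my_dataset"

settings = public_settings(demo_conf)
for name, value in settings.items():
    print(f"{name} = {value!r}")
```

The same helper should work on `conf` itself, assuming its settings are stored as plain attributes.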

## Step 3: Save Configuration for Reproducibility

We save a copy of the configuration file to the output directory. This documents the exact parameters used in this run so that the results can be reproduced later.

In [135]:
config_source = "src/config.py"
config_destination = output_dir+"config.py"
shutil.copy2(config_source, config_destination)

'./results/20250908_184741/config.py'

## Step 4: Initialize MetDeeCINE

Here we create an instance of the MetDeeCINE class. This will:
- Load and validate the input data files (CSV files starting with 'tbl')
- Parse the stoichiometry file to understand the metabolic network structure
- Identify enzymes and metabolites from the data
- Set up the preprocessing parameters

The system will print the number of enzymes and metabolites detected from your input data.

In [136]:
metdeecine = MetDeeCINE(conf)

Number of enzymes: 15
Number of metabolites: 27


## Step 5: Data Preprocessing

This is a crucial step that prepares the data for machine learning. The `data_preprocessing()` method performs several operations:

1. **Fold Change Calculation**: Computes log fold changes between different experimental conditions
2. **Data Splitting**: Creates training and validation sets using the specified cross-validation method (default: leave-one-strain-out)
3. **Data Standardization**: Applies standardization to input features if specified
4. **Network Matrices**: Constructs enzyme-metabolite (EM) and metabolite-metabolite (MM) adjacency matrices from stoichiometry.txt
5. **DataLoader Creation**: Prepares PyTorch DataLoaders for efficient batch processing during training
6. **Inference Input**: Prepares input vectors for CCC inference (vectors with single enzyme perturbations)

The output includes all the data structures needed for training and inference.
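
Operations 1 and 2 can be illustrated on toy data. This is a minimal sketch, not the repository's implementation: the strain names, feature names, values, and both helper functions are invented, and the real preprocessing may pair conditions differently.

```python
import math

# Toy measurements: strain -> {feature: abundance}. All names and values
# are invented for illustration.
samples = {
    "WT": {"enzyme_1": 2.0, "metabolite_1": 4.0},
    "mutA": {"enzyme_1": 4.0, "metabolite_1": 2.0},
    "mutB": {"enzyme_1": 1.0, "metabolite_1": 8.0},
}


def log_fold_change(target, reference):
    """Natural-log fold change of every feature, target relative to reference."""
    return {name: math.log(target[name] / reference[name]) for name in target}


def leave_one_strain_out(strains):
    """Yield (training_strains, held_out_strain) pairs, one per strain."""
    for held_out in strains:
        yield [s for s in strains if s != held_out], held_out


lfc = log_fold_change(samples["mutA"], samples["WT"])
folds = list(leave_one_strain_out(list(samples)))

print(lfc)
for train_strains, held_out in folds:
    print(f"train on {train_strains}, validate on {held_out}")
```

Each strain is held out exactly once, so the number of folds equals the number of strains.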

In [137]:
train_loader_list, val_loader_list, train_loader_all, inference_input, model_settings = metdeecine.data_preprocessing()



## Step 6: Hyperparameter Tuning (Optional)

This step uses Optuna, a hyperparameter optimization framework whose default sampler (TPE) is a Bayesian-style method, to automatically find the best hyperparameters for your dataset. The optimization process:

1. **Search Space**: Explores different combinations of:
   - Loss function (MSE, L1)
   - Regularization type (L1, L2)
   - Activation function (tanh, relu, elu, swish, mish)
   - GNN layers and regularization weights

2. **Cross-Validation**: Each hyperparameter combination is evaluated using cross-validation on the training data

3. **Optimization Metric**: Maximizes Pearson correlation coefficient (PCC) between predicted and true metabolite changes

4. **Trial Storage**: Results are saved in an SQLite database and a CSV file for later analysis

**Note**: The number of trials is set to 2 in the default configuration for demonstration purposes. For real applications, consider increasing this to 50-100 trials.

You can optimize other hyperparameters as well, such as learning rate and batch size.

In [None]:
# Hyperparameter tuning using MetDeeCINE integrated method
best_params = metdeecine.hyperparameter_tuning(train_loader_list, val_loader_list, output_dir)
print("Best hyperparameters:", best_params)

## Step 7: Update Configuration with Optimized Parameters

**IMPORTANT**: If hyperparameter tuning was performed, we must update the configuration with the optimized parameters and reinitialize MetDeeCINE.
If no hyperparameter tuning was performed (best_params is empty), we proceed with the default parameters.

In [None]:
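# IMPORTANT: Update config with best parameters and reinitialize MetDeeCINE
# Treat a skipped Step 6 (best_params never defined) like an empty result
best_params = globals().get("best_params")
if best_params: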
    # Update config with best parameters
    for param_name, param_value in best_params.items():
        setattr(conf, param_name, param_value)
    
    # Reinitialize MetDeeCINE and rerun preprocessing
    metdeecine = MetDeeCINE(conf)
    train_loader_list, val_loader_list, train_loader_all, inference_input, model_settings = metdeecine.data_preprocessing()
    print("MetDeeCINE reinitialized with optimized hyperparameters")
else:
    print("No hyperparameter tuning was performed, using default parameters")

## Step 8: Model Training

Now we train the MiGNN (Metabolism-informed Graph Neural Network) model. The training process involves:

1. **Cross-Validation Training**: The model is trained using the specified cross-validation method (leave-one-strain-out by default)

2. **For Each Fold**:
   - Initialize model parameters using Xavier initialization
   - Train the model using the AdamW optimizer
   - Apply early stopping based on validation loss
   - Calculate performance metrics (PCC, SCC, R²)

3. **Final Model Training**: After determining the optimal number of epochs from cross-validation, train a final model on the entire dataset

4. **Model Saving**: Save the trained model weights to a .pth file for later use

The training loss combines:
- Data fitting loss (L1 or MSE)
- Regularization terms that encourage the model to respect the metabolic network structure
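
The early-stopping logic and the choice of epoch count for the final model can be sketched without any deep learning code. The validation curves below are invented, `early_stopping_epoch` is a hypothetical helper, and averaging the per-fold best epochs is only one plausible rule. The notebook does not state how `fc_training` derives the final epoch count.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; best_epoch is where the minimum occurred.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)


# Invented validation-loss curves for three cross-validation folds.
fold_curves = [
    [1.0, 0.8, 0.7, 0.75, 0.72, 0.74],
    [1.2, 0.9, 0.95, 0.93, 0.97],
    [0.9, 0.7, 0.6, 0.55, 0.58, 0.6, 0.61],
]
best_epochs = [early_stopping_epoch(curve, patience=2)[0] for curve in fold_curves]

# Assumed rule: train the final model for the rounded mean of the
# per-fold best epochs.
final_epochs = round(sum(best_epochs) / len(best_epochs))
print(best_epochs, final_epochs)
```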

In [None]:
metdeecine.fc_training(model_settings, train_loader_list, val_loader_list, train_loader_all, parameter_save_path=output_dir+'model_params.pth')
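
The three per-fold metrics named above can be computed as in the sketch below, using only the standard library. The example vectors are invented. R² is taken here to be the coefficient of determination, which is an assumption because the notebook does not say whether it uses that definition or the squared Pearson correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (PCC)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation coefficient (SCC): Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented true and predicted metabolite log fold changes.
y_true = [0.1, -0.4, 0.8, 0.0, -0.2]
y_pred = [0.2, -0.3, 0.6, 0.1, -0.4]
pcc = pearson(y_true, y_pred)
scc = spearman(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(f"PCC={pcc:.3f}  SCC={scc:.3f}  R2={r2:.3f}")
```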

## Step 9: CCC Inference

Finally, we use the trained model to infer Concentration Control Coefficients (CCCs). This process:

1. **Load Model**: Loads the trained model weights (you can use either the newly trained model or the provided pre-trained weights)

2. **Generate Input Vectors**: Creates input vectors where only one enzyme at a time has a non-zero log fold change, while all others are set to zero

3. **Model Prediction**: Feeds these vectors through the trained model to predict the resulting metabolite fold changes

4. **CCC Calculation**: Converts the predictions into CCCs by dividing the predicted metabolite changes by the enzyme changes

5. **Output Generation**: Saves the CCC matrix as `meanCCC.csv`, where:
   - **Rows**: Enzymes (KEGG IDs)
   - **Columns**: Metabolites (KEGG IDs)
   - **Values**: CCC values indicating the quantitative effect of each enzyme on each metabolite

**Note**: We're using pre-trained weights here for demonstration, but you can replace the path with `output_dir+'model_params.pth'` to use your newly trained model.

In [81]:
metdeecine.ccc_inference(model_settings, inference_input, output_dir, parameter_load_path='./input/Uematsu_2022/pretrained_weight.pth')
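
The perturb-and-divide procedure in items 2 to 4 can be shown on a toy model. The sketch below replaces the trained network with a fixed linear map, so the recovered CCCs are simply the map's weights. `WEIGHTS`, `toy_model`, `infer_ccc`, and the perturbation size `delta` are all hypothetical and do not reflect the internals of `ccc_inference`.

```python
# Toy stand-in for the trained model: a fixed linear map from enzyme log
# fold changes to metabolite log fold changes. Rows are metabolites,
# columns are enzymes; all values are invented.
WEIGHTS = [
    [0.5, -0.2],
    [0.0, 0.3],
    [-0.4, 0.1],
]


def toy_model(enzyme_lfc):
    """Predict metabolite log fold changes from enzyme log fold changes."""
    return [sum(w * x for w, x in zip(row, enzyme_lfc)) for row in WEIGHTS]


def infer_ccc(model, n_enzymes, delta=0.1):
    """Perturb one enzyme at a time and divide the response by the perturbation."""
    ccc = []
    for i in range(n_enzymes):
        perturbation = [0.0] * n_enzymes
        perturbation[i] = delta  # only enzyme i has a non-zero log fold change
        response = model(perturbation)
        ccc.append([r / delta for r in response])
    return ccc  # rows: enzymes, columns: metabolites


ccc = infer_ccc(toy_model, n_enzymes=2)
for i, row in enumerate(ccc):
    print(f"enzyme {i}: {[round(v, 3) for v in row]}")
```

For a nonlinear model the result generally depends on the size of `delta`, which is why the perturbation is treated as a parameter here.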

## Results Interpretation

### Output Files
The results are saved in the timestamped directory and include:

- **`meanCCC.csv`**: The main output containing the CCC matrix
- **`model_params.pth`**: Trained model weights for future use
- **`config.py`**: Copy of configuration for reproducibility
- **`optuna_study.db` and `optuna_study.csv`**: Hyperparameter optimization results (if performed)

### Understanding CCC Values
A positive value indicates that increasing the enzyme raises the metabolite's steady-state concentration, and a negative value indicates that it lowers it. Larger absolute values indicate stronger regulatory effects.
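
As a starting point for exploring the matrix, the sketch below ranks metabolites by the absolute CCC for one enzyme. It parses a small in-memory table rather than the real `meanCCC.csv`. The IDs `E1`, `M1`, and so on are placeholders for KEGG IDs, the values are invented, and the header name `enzyme` for the first column is an assumption about the file layout.

```python
import csv
import io

# Toy CCC table in the layout described above (rows: enzymes, columns:
# metabolites). IDs and values are placeholders, not real results.
toy_csv = """enzyme,M1,M2,M3
E1,0.50,-0.10,0.02
E2,-0.30,0.80,-0.05
"""

rows = list(csv.DictReader(io.StringIO(toy_csv)))
ccc = {
    row["enzyme"]: {m: float(v) for m, v in row.items() if m != "enzyme"}
    for row in rows
}


def strongest_effects(ccc, enzyme, top=2):
    """Metabolites ranked by |CCC| for one enzyme, largest first."""
    effects = ccc[enzyme]
    return sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]


top_e2 = strongest_effects(ccc, "E2")
for metabolite, value in top_e2:
    direction = "increase" if value > 0 else "decrease"
    print(f"E2 -> {metabolite}: CCC={value:+.2f} ({direction})")
```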