# Integrating Machine Learning with Metabolic Models for Precision Trauma Care: Personalized ENDOTYPE Stratification and Metabolic Target Identification

Authors:
- Igor Marin de Mas (Copenhagen University Hospital, Rigshospitalet)
- Lincoln Moura (Universidade Federal do Ceará)
- Fernando Luiz Marcelo Antunes (Universidade Federal do Ceará)
- Josep Maria Guerrero (Aalborg University)
- Pär Ingemar Johansson (Copenhagen University Hospital, Rigshospitalet)

# Introduction

This notebook preprocesses datasets stored in MATLAB `.mat` files so that they can be used in downstream analysis. It validates the input, removes rows (reactions) whose values are all zero, sets near-zero values to zero for numerical stability, replaces outliers and missing values with column means, and reduces dimensionality with Principal Component Analysis (PCA). The preprocessed data is then saved in CSV format.

## Workflow Overview
1. **Loading MATLAB files**: Reading each `.mat` file with `scipy.io.loadmat` and converting the extracted array into a pandas `DataFrame`.
2. **Preprocessing**: Removing rows (reactions) whose values are all zero, setting values with magnitude below 0.001 to zero, rounding to three decimals, and replacing outliers (values more than 1.5 × IQR outside the quartiles) and missing data with column means.
3. **Dimensionality Reduction**: Standardizing the features and applying PCA to reduce their number while retaining most of the dataset’s variance (reported as "explainability").
4. **Saving Results**: Writing the transformed dataset for each patient and the PCA explained-variance ratios to CSV files for later use.
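
The cleaning part of step 2 can be illustrated on a toy table. The following is a minimal sketch, not the notebook's own function: the name `stabilize_and_replace_outliers`, its parameters, and the toy values are assumptions chosen for illustration. The 0.001 threshold and the 1.5 × IQR rule are the ones used in the code below.

```python
import pandas as pd


def stabilize_and_replace_outliers(frame, threshold=0.001, whisker=1.5):
    """Hypothetical helper mirroring the cleaning steps on a small DataFrame."""
    # Set values with magnitude below the threshold to zero, then round
    stabilized = frame.mask(frame.abs() < threshold, 0.0).round(3)

    # Per-column quartiles and the 1.5 x IQR step
    q1 = stabilized.quantile(0.25)
    q3 = stabilized.quantile(0.75)
    step = whisker * (q3 - q1)

    # Values outside the whiskers become NaN
    inside = (stabilized >= q1 - step) & (stabilized <= q3 + step)
    masked = stabilized.where(inside)

    # NaN values (outliers and missing data) are filled with column means
    return masked.fillna(masked.mean())


# Toy data (illustrative only): one large outlier in "a", one near-zero value in "b"
toy = pd.DataFrame({
    "a": [1.0, 1.1, 0.9, 1.0, 50.0],
    "b": [0.0004, 2.0, 2.1, 1.9, 2.0],
})
cleaned = stabilize_and_replace_outliers(toy)
print(cleaned)
```

Note that the column mean used for filling is computed after the outliers have been masked, so an extreme value does not influence its own replacement.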

## Libraries Used
The code uses the following Python libraries:
- `pandas` for data manipulation and saving datasets in CSV format.
- `scipy` for loading MATLAB `.mat` files.
- `numpy` for numerical operations.
- `scikit-learn` for data scaling (`StandardScaler`) and dimensionality reduction (`PCA`).
- `os` and `glob` for navigating directories and handling files.

## Key Features
- **Batch Processing**: The code processes multiple patient directories, and all `.mat` files within each, one after another in a single run.
- **Data Cleaning**: It addresses all-zero rows, missing values, outliers, and near-zero values that could affect numerical stability.
- **Dimensionality Reduction**: PCA is applied to reduce the number of features while retaining most of the variance, which makes the output suitable for machine learning models or statistical analysis.
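
The code below fixes the number of principal components at 600. As an alternative, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` and then keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below shows this variant on synthetic data; the function name `reduce_to_variance` and the toy matrix are assumptions, and this is not the configuration used in the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def reduce_to_variance(matrix, target=0.99):
    """Hypothetical helper: standardize, then keep enough components for `target` variance."""
    scaled = StandardScaler().fit_transform(matrix)
    # A float n_components requires the full SVD solver
    pca = PCA(n_components=target, svd_solver="full")
    scores = pca.fit_transform(scaled)
    return scores, pca.explained_variance_ratio_.sum()


# Synthetic data (illustrative only): 20 features driven by 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
toy = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

scores, retained = reduce_to_variance(toy)
print("Components kept:", scores.shape[1])
print("Explained variance retained:", retained)
```

With a variance target, the number of retained components can differ between patients, whereas a fixed count keeps the output shape identical across files. Which behaviour is preferable depends on how the downstream model consumes the data.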

The output is intended as input for downstream computational analysis or machine learning workflows.


In [1]:
# Import necessary libraries
import pandas as pd
import scipy.io as sio
import numpy as np
import glob
import os
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Function to preprocess the data
def preprocess(data_frame):
    """
    Preprocess the input dataset by cleaning, handling outliers, ensuring numerical stability,
    and applying PCA for dimensionality reduction.

    Args:
        data_frame (pd.DataFrame): Input dataset to preprocess.

    Returns:
        tuple: Processed dataset (pd.DataFrame) and explainability score (float).
    """
    # Check if the input DataFrame is valid
    if data_frame.empty:
        raise ValueError("Input DataFrame is empty. Please provide a valid dataset.")

    print("Shape before preprocess: ", data_frame.shape)

    # Remove rows with specific indices where all values are zero
    # The indices representing rows with all zero values were carefully analyzed and hardcoded to ensure their removal across all patient files.
    removal_indices = [
        7, 8, 9, 10, 11, 12, 13, 14, 20, 39, 67, 91, 143, 145, 146, 177, 189, 205, 240, 241, 242, 
        251, 298, 302, 335, 373, 383, 399, 460, 466, 467, 468, 480, 493, 510, 511, 519, 520, 536, 
        537, 538, 539, 540, 541, 542, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 
        651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 666, 667, 668, 669, 
        670, 671, 672, 673, 676, 678, 679, 681, 682, 688, 690, 691, 692, 695, 697, 698, 699, 701, 
        702, 704, 706, 707, 708, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 
        723, 724, 725, 726, 727, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 
        744, 746, 747, 749, 750, 753, 755, 762, 763, 766, 768, 770, 772, 774, 775, 776, 777, 778, 
        779, 782, 783, 785, 790, 792, 793, 797, 798, 799, 800, 802, 804, 806, 807, 809, 810, 811, 
        814, 815, 816, 817, 818, 820, 823, 825, 826, 828, 829, 830, 831, 833, 836, 838, 840, 854, 
        868, 870, 871, 872, 873, 900, 903, 912, 1030, 1146, 1161, 1165, 1193, 1208, 1282, 1378, 
        1402, 1416, 1434, 1458, 1471, 1542, 1543, 1545, 1548, 1601, 1609, 1610, 1613, 1614, 1632, 
        1640, 1645, 1646, 1648, 1649, 1749, 1755, 1769, 1771, 1778, 1806, 1839, 1931, 2008, 2121, 
        2122, 2123, 2124, 2155, 2156, 2171, 2180, 2207, 2236, 2255, 2261, 2286, 2350, 2351, 2357, 
        2366, 2375, 2404, 2434, 2512, 2519, 2590, 2596, 2611, 2612, 2616, 2617, 2618, 2619, 2620, 
        2621, 2622, 2623, 2624, 2627, 2628, 2629, 2630, 2631, 2634, 2638, 2686, 2687, 2688, 2689, 
        2691, 2692, 2696, 2697, 2702, 2703, 2704, 2711, 2712, 2713, 2714, 2715, 2716, 2719, 2720, 
        2721, 2722, 2723, 2724, 2725, 2726, 2727, 2731, 2732, 2733, 2734, 2736, 2738, 2739, 2742, 
        2744, 2747, 2748, 2749, 2753, 2755, 2756, 2757, 2758, 2759, 2760, 2763, 2765, 2767, 2771, 
        2772, 2773, 2774, 2776, 2782, 2783, 2785, 2787, 2788, 2789, 2794, 2795, 2796, 2797, 2798, 
        2799, 2805, 2906, 2909, 2911, 2986, 2994, 2997, 2998, 2999, 3000, 3001, 3002, 3003, 3004, 
        3005
    ]

    # Check that the all-zero rows match the hardcoded indices exactly
    all_zero_rows = (data_frame == 0).all(axis=1)
    zero_indices = data_frame.index[all_zero_rows].tolist()
    if zero_indices == removal_indices:
        print("All-zero reactions match the expected indices")
        data_frame = data_frame[~all_zero_rows]
    else:
        # Raise ValueError so that process_directories can catch it and skip the directory
        raise ValueError("All-zero reactions do not match the expected indices.")

    # Ensure numerical stability: set values with magnitude below 0.001 to zero
    # (vectorized; DataFrame.applymap is deprecated in recent pandas versions)
    data_frame = data_frame.mask(data_frame.abs() < 0.001, 0)
    data_frame = data_frame.round(3)  # Round values to 3 decimal places

    # Handle outliers: values outside 1.5 * IQR of their column become NaN,
    # and NaN values are then replaced with the column mean
    first_quartile = data_frame.quantile(0.25)
    third_quartile = data_frame.quantile(0.75)
    interquartile_step = 1.5 * (third_quartile - first_quartile)
    data_frame = data_frame[
        (data_frame >= (first_quartile - interquartile_step)) &
        (data_frame <= (third_quartile + interquartile_step))
    ]
    data_frame = data_frame.fillna(data_frame.mean())  # Replace NaN values with column means

    # Apply PCA with a fixed 600 components, chosen to keep explainability > 0.99
    # (the data must have at least 600 rows and 600 columns for this to work)
    scaler = StandardScaler()
    pca = PCA(n_components=600)
    transformed_data = scaler.fit_transform(data_frame)
    data_frame = pd.DataFrame(pca.fit_transform(transformed_data), index=data_frame.index)

    # Display PCA explainability and final shape of the dataset
    explained_variance = pca.explained_variance_ratio_.sum()
    print("Explainability: ", explained_variance)
    print("Shape after preprocess: ", data_frame.shape)

    return data_frame, explained_variance


# Function to load and convert MATLAB files to DataFrame
def load_mat_file(file_path):
    """
    Load a MATLAB .mat file and extract its contents into a DataFrame.

    Args:
        file_path (str): Path to the MATLAB file.

    Returns:
        pd.DataFrame: Extracted data as a DataFrame.
    """
    try:
        # Load MATLAB file into a Python dictionary
        matlab_data = sio.loadmat(file_path)
    except Exception as e:
        # Handle any errors during file loading
        print(f"[ERROR] Failed to load MATLAB file: {file_path}. Error: {e}")
        return pd.DataFrame()
    
    # Extract data from MATLAB file and convert it to DataFrame
    try:
        new_data_frame = pd.DataFrame(matlab_data['sampleMetaOutC'][0][0][-1])
    except KeyError as e:
        print(f"[ERROR] Key 'sampleMetaOutC' not found in the MATLAB file: {file_path}. Error: {e}")
        return pd.DataFrame()

    # Return the extracted DataFrame
    return new_data_frame

# Main loop for processing directories
def process_directories(directory_list):
    """
    Process multiple directories containing MATLAB files, preprocess the data, and save
    the results in CSV format.

    Args:
        directory_list (list): List of directory paths to process.

    Returns:
        None
    """
    explainability_data = pd.DataFrame()

    for directory in directory_list:
        # Extract the patient identifier from the directory name
        # (assumes the path ends with a 10-character ID followed by one trailing character, e.g. "/")
        patient_id = directory[-11:-1]
        data_frame = pd.DataFrame()

        # Skip directories that do not exist or contain no files
        if not os.path.exists(directory):
            print(f"[WARNING] Directory does not exist: {directory}")
            continue
        if not os.listdir(directory):
            print(f"[INFO] No files found in the directory: {directory}")
            continue

        # Process MATLAB files in the directory
        for file_path in glob.glob(f"{directory}/*.mat"):
            print("[INFO] Processing MATLAB file: ", file_path)
            new_data_frame = load_mat_file(file_path)

            # Skip if the DataFrame is empty due to errors
            if new_data_frame.empty:
                print(f"[WARNING] Skipping data from file: {file_path}")
                continue

            # Append data to the main DataFrame
            data_frame = pd.concat([data_frame, new_data_frame], axis=1)

        # Preprocess the data
        print("[INFO] Preprocessing data.")
        try:
            processed_data, explained_variance = preprocess(data_frame)
        except ValueError as e:
            print(f"[ERROR] Preprocessing failed for directory {directory}: {e}")
            continue

        # Save the preprocessed data (create the output directory if it does not exist)
        output_path = "/preprocess_PC/"
        os.makedirs(output_path, exist_ok=True)
        processed_data.to_csv(f"{output_path}{patient_id}.csv")

        # Save explainability metrics
        explainability_data[patient_id] = [explained_variance]
        explainability_data.to_csv("/explainability.csv")
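
The expression `matlab_data['sampleMetaOutC'][0][0][-1]` in `load_mat_file` relies on how `scipy.io.loadmat` represents a MATLAB struct: as a 1×1 structured array whose single record can be indexed by field position. The sketch below reproduces this on a toy struct written with `scipy.io.savemat`. The field names `info` and `points` and the helper `extract_last_field` are assumptions for illustration; the real field layout of `sampleMetaOutC` is not shown in this notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import scipy.io as sio


def extract_last_field(mat_path, variable="sampleMetaOutC"):
    """Hypothetical helper: return the last field of a MATLAB struct as a DataFrame."""
    contents = sio.loadmat(mat_path)
    struct = contents[variable]  # structured array of shape (1, 1)
    record = struct[0][0]        # the single struct record
    return pd.DataFrame(record[-1])  # last field, by position


# Toy stand-in for a real file (field names are illustrative only)
toy = {
    "sampleMetaOutC": {
        "info": np.array([1.0]),
        "points": np.arange(6.0).reshape(3, 2),
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.mat")
    sio.savemat(path, toy)  # a nested dict is written as a MATLAB struct
    frame = extract_last_field(path)

print(frame)
```

Indexing by position works as long as the field order in the `.mat` files is stable. Accessing the field by name, for example `record["points"]` in this toy case, would be more robust if the field name is known.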

