# Script Search flux species
Version GitHub 01

  
Previously:  
You simulated a community for two groups of samples (sick vs. healthy) in Micom.
You used compare_groups() to look for differential metabolite production between sick and healthy.
You found a differential flux and now want to trace it back to the taxa that produce these
differentially produced metabolites.

Prepare:
- res[1] from Micom's grow(), saved as CSV
- Gapseq/ModelSEED CSV file with metabolite IDs (column id) and the corresponding metabolite names (column name)
- Condition CSV file with sample names (column sample) and conditions A or B (column condition)
- Make sure all CSV files are stored and loaded with the correct separator (";" or ",")
- Later on, make sure conditionpartA and conditionpartB are assigned to the correct conditions (sick_cond, control_cond)
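
The separator check in the list above can be automated. A minimal sketch, assuming a hypothetical helper `detect_delimiter` (not part of this script) and using only `csv.Sniffer` from the standard library; the sample strings are made-up stand-ins for your files:

```python
import csv

def detect_delimiter(text, candidates=";,"):
    """Guess the delimiter of CSV text (hypothetical helper, not in the original script)."""
    # Sniffer inspects the sample and picks the candidate that splits lines consistently.
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    return dialect.delimiter

# Made-up condition files, once with ";" and once with ","
sample_semicolon = "sample;condition\nsample45;healthy\nsample46;healthy\nsample47;sick\n"
sample_comma = "sample,condition\nsample45,healthy\nsample46,healthy\nsample47,sick\n"

print(detect_delimiter(sample_semicolon))
print(detect_delimiter(sample_comma))
```

If the detected separator is not `;`, either re-export the file or adjust the `sep=`/`delimiter=` arguments in the cells of this notebook.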

Procedure:
- Extract all metabolite IDs (cpd***) present in your community and store them in a list
- For each *exported* metabolite, save all genera that produced it, including flux, abundance, flux\*abundance and sample ID, in a separate CSV file
- Add each sample's condition in an additional column, based on your condition file
- Sum flux, abundance and flux\*abundance per taxon and condition (e.g. Alistipes in healthy samples has a total flux of 66)
- Translate Gapseq/ModelSEED metabolite IDs to metabolite names
- Merge condition A (e.g. healthy) with condition B (e.g. disease) for each metabolite, so that one file contains both (mIDs_metabolite_fluxsummerge.csv)



**Output files:**
Aside from many intermediate files, you will get 9 CSV files:
- All flux*abundances from all taxa, in condition A and in condition B in separate columns (project_genusfluxmeta_abundflux.csv)
- All flux*abundances from all taxa, with the control condition subtracted from the sick condition (project_genusfluxmeta_abundflux_diff.csv)
- All flux*abundances from all taxa, with simplified column headers (project_genusfluxmeta_abundflux_diffreduced.csv)
- The same three files for flux and for abundance.
  
**Value interpretation:**  
The difference SICK (sick_cond) MINUS CONTROL (control_cond) is calculated (there are checkpoints for this assignment later on).  
A +POSITIVE value in _diff.csv and _diffreduced.csv means higher flux/abundance/flux*abundance in DISEASE  
A -NEGATIVE value in _diff.csv and _diffreduced.csv means lower flux/abundance/flux*abundance in DISEASE  
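
As a worked example of the sign convention, here is a small sketch with made-up numbers. The column names follow the `flux_abundance_sum_norm_<condition>` pattern used later in this notebook; the condition names `sick` and `healthy` and the `diff` column are assumptions for illustration only:

```python
import pandas as pd

# Illustrative condition names; in the notebook they come from the condition file.
sick_cond, control_cond = "sick", "healthy"

# Toy values only.
toy = pd.DataFrame({
    "taxon": ["Alistipes", "Bacteroides"],
    f"flux_abundance_sum_norm_{sick_cond}": [3.0, 0.5],
    f"flux_abundance_sum_norm_{control_cond}": [1.0, 2.0],
})

# SICK minus CONTROL: positive means higher in disease, negative means lower.
toy["diff"] = (toy[f"flux_abundance_sum_norm_{sick_cond}"]
               - toy[f"flux_abundance_sum_norm_{control_cond}"])
print(toy[["taxon", "diff"]])
```

Here Alistipes ends up at +2.0 (higher in disease) and Bacteroides at -1.5 (lower in disease).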
  
########################################################  
By Torben Kuehnast, torben.kuehnast@gmail.com, 2024

In [1]:
import os
import pandas as pd
import re
import csv
from bs4 import BeautifulSoup


In [None]:
### INSERT THE CORRECT FILE NAMES HERE ####



# INSERT: Where files are created
working_dir = "/home/working_directory"

# INSERT: The folder that contains the results of build and grow (the res[1] file!)
# ... i.e. its base path / working directory
source_path = "/home/micoms_res1_directory"

# Split res_file into the <project name part> and <_res1.csv> ... (needed to be able to apply proper naming of newly created files)
condition_folder_part = "projectnamepart_withoutending"

# ... Automatically, this should lead to the name of the RES file. It should precisely look like this.
target_res1 = f"{condition_folder_part}_res1.csv"

# INSERT: Choose a name for the automatically created file where only the list of metabolites found in res1 are listed:
only_metab_list = "all_metabolites_project.csv"


# INSERT: Path and filename of the condition file (A vs. B) with sample names and condition names.
# It is read with ";" as separator below and should look like this:
#sample;condition
#sample45;healthy
#sample46;healthy
#sample47;sick
condition_file = "/home/condition_directory/projectname_conditionfileAvsB.csv"


# INSERT: Define column headers for the new CSVs. Should be like this. 
headers = ['taxon', 'sample_id', 'tolerance', 'reaction', 'flux', 'abundance', 'metabolite', 'direction', 'origin']


# INSERT: File where ALL gapseq metabolites are listed with cpd ID and real metabolite name
filepath_allmetab = '/home/all_gapseq_metabolites_756_compgr.csv'


# INSERT: Name of the final genus_flux_metabolite output file; this can usually stay as it is.
genus_flux_metabolites_var = f"{condition_folder_part}_genusfluxmeta"


# CHECK: Make sure the separators (";" or ",") of your CSV files match the read/write commands below

### insert END ###


# Part 1 - finding all metabolites involved 

This script is based on the metabolite ID nomenclature from ModelSEED,  
as found in gapseq, the ssniff diet file and the Micom output.  
https://modelseed.org/biochem/compounds   
https://github.com/jotech/gapseq  
The script collects all cpd**** metabolite IDs that occur in the simulation output in  
one metabolite file, a simple list:  

    metabolite  
    cpd00001  
    cpd00002  
    cpd00009  
    cpd00013  
    ...  

INPUT: Output of Micom's grow() function (res[1]), stored as CSV  
  
OUTPUT: the metabolite list file named in only_metab_list (e.g. all_metabolites_project.csv)  
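
The collection step can be sketched on a toy list of IDs. The suffix pattern is the one used by `clean_metabolite_name` in the next cell; the input IDs are made up:

```python
import re

def clean_metabolite_name(name):
    # Strip a trailing "_e0" or "_m" suffix so each compound is listed only once.
    return re.sub(r'(_e0|_m)$', '', name)

# Toy input, standing in for the 'metabolite' column of the res[1] file
raw_ids = ["cpd00001_e0", "cpd00009_m", "cpd00001_m", "cpd00013_e0"]

# A set removes duplicates; sorting gives the simple list shown above.
unique_ids = sorted({clean_metabolite_name(x) for x in raw_ids})
print(unique_ids)
```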


In [None]:
# Extract the conditions from the condition file automatically.
df = pd.read_csv(condition_file, delimiter=';')
# Extract the unique condition names
unique_conditions = df['condition'].unique()
print("unique_conditions:", unique_conditions)
# Ensure there are exactly two unique conditions
if len(unique_conditions) != 2:
    raise ValueError(f"Expected exactly two unique condition names, found {len(unique_conditions)}: {list(unique_conditions)}")
# processing: Store the unique condition names in variables
conditionpartA, conditionpartB = unique_conditions
# processing: Create DataFrames based on the condition parts
condition_filter_A = df[df['condition'] == conditionpartA]
condition_filter_B = df[df['condition'] == conditionpartB]
# processing: Display the results
print("Condition Filter A:")
display(condition_filter_A)
print("Condition Filter B:")
display(condition_filter_B)


# processing
csv_path = os.path.join(working_dir, only_metab_list)
if not os.path.exists(csv_path):
    pd.DataFrame(columns=['metabolite']).to_csv(csv_path, sep=';', index=False)

#processing
condition_folder_core = condition_folder_part

#conditionABfolder
conditionAfolder = f"{condition_folder_part}_{conditionpartA}"
conditionBfolder = f"{condition_folder_part}_{conditionpartB}"

# The folders where the files for conditions A and B are placed
condition_folder_partA = conditionAfolder
condition_folder_partB = conditionBfolder

# The folders for conditions A and B with metabolite IDs replaced by names (suffix _mIDs)
condition_folder_partAm = f"{conditionAfolder}_mIDs"
condition_folder_partBm = f"{conditionBfolder}_mIDs"
merged_conditions = f"{condition_folder_part}_merged"
    
# Read existing metabolites to avoid duplicates
metabolites_df = pd.read_csv(csv_path, sep=';')

# Set to store all unique metabolites
unique_metabolites = set(metabolites_df['metabolite'])

# Function to clean metabolite names
def clean_metabolite_name(name):
    return re.sub(r'(_e0|_m)$', '', name)

# Search and update CSV
for root, dirs, files in os.walk(source_path):
    if condition_folder_part in root:
        print("condition_folder_part found:", condition_folder_part)
        print("root is:", root)
        for file in files:
            if file == target_res1:
                print("---> found it!", file, " looking for ", target_res1)
                file_path = os.path.join(root, file)
                df = pd.read_csv(file_path, sep=';')
                metabolites_in_file = df['metabolite'].unique()
                for metabolite in metabolites_in_file:
                    clean_metabolite = clean_metabolite_name(metabolite)
                    if clean_metabolite not in unique_metabolites:
                        unique_metabolites.add(clean_metabolite)
                        print("Added the following metabolite: ", clean_metabolite)

# Convert set to DataFrame
new_metabolites_df = pd.DataFrame(list(unique_metabolites), columns=['metabolite'])

# Sort and save metabolite list
new_metabolites_df.sort_values(by='metabolite', inplace=True)
new_metabolites_df.to_csv(csv_path, sep=';', index=False)


# Save the separated conditions A and B
condApath = os.path.join(working_dir, f"{conditionpartA}.csv")
condition_filter_A.to_csv(condApath, sep=';', index=False)
condBpath = os.path.join(working_dir, f"{conditionpartB}.csv")
# Write the samples of condition B (not A) to the condition B file
condition_filter_B.to_csv(condBpath, sep=';', index=False)

print("Finished.")

In [None]:
conditionpartA

In [None]:
conditionpartB

In [None]:
# IMPORTANT: AFTER running the condition A/B extraction, check which condition is A and which is B.
# Look at conditionpartA and its samples, and define below whether it is sick_cond or control_cond.
#

# This matters later on when subtracting the control's flux*abundance (e.g. methane, value = 1)
# from the flux*abundance in cachexia (e.g. methane, value = 3),
# leading to (3 - 1 =) +2 flux*abundance in cachexia.

# SICK condition:

#sick_cond = conditionpartA
sick_cond = conditionpartB



# HEALTHY condition:

control_cond = conditionpartA
#control_cond = conditionpartB


print("sick condition is:", sick_cond)
print("control condition is:", control_cond)

# PART 2A - collecting fluxes to the respective metabolites
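
The row filter used in Parts 2A and 2B can be sketched on an in-memory CSV. The toy rows, the reduced set of columns and the names `res1_text`, `samples_A` and `matches` are assumptions for illustration; the real cells read the res[1] file from disk:

```python
import csv
import io

# Made-up excerpt of a res[1] file with ";" as separator
res1_text = (
    "taxon;sample_id;flux;metabolite\n"
    "Alistipes;sample45;2.0;cpd00029_e0\n"
    "Alistipes;sample47;3.0;cpd00029_e0\n"
    "Bacteroides;sample45;1.0;cpd00047_e0\n"
)
samples_A = {"sample45", "sample46"}  # samples belonging to condition A
metabolite = "cpd00029"               # metabolite currently being screened

reader = csv.DictReader(io.StringIO(res1_text), delimiter=";")
# Keep a row only if its sample is in condition A and it mentions the metabolite.
matches = [row for row in reader
           if row["sample_id"] in samples_A and metabolite in row["metabolite"]]
print(matches)
```

Only the Alistipes row from sample45 passes both checks; sample47 belongs to the other condition and the Bacteroides row concerns a different metabolite.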

In [None]:
### For conditionA <-- (all healthy or disease samples)

# Define full path to the subfolder
full_subfolder_path = os.path.join(working_dir, conditionAfolder)

# Ensure the subfolder exists
if not os.path.exists(full_subfolder_path):
    os.makedirs(full_subfolder_path)

# Load the list of metabolites
metabolites_df = pd.read_csv(csv_path, sep=';')

# Iterate through each metabolite
for metabolite in metabolites_df['metabolite']:
    print("Processing:", metabolite)
    all_rows = []  # Initialize a list to collect rows
    
    for root, dirs, files in os.walk(source_path):
        if condition_folder_part in root:
            for file in files:
                if file == target_res1:
                    with open(os.path.join(root, file), mode='r', encoding='utf-8') as f:
                        reader = csv.DictReader(f, delimiter=';')
                        for row in reader:
                            # Check if the row's sample_id matches any sample in the condition_filter_A
                            if row['sample_id'] in condition_filter_A['sample'].values:
                                if metabolite in row['metabolite']:
                                    row['origin'] = os.path.basename(root)
                                    all_rows.append(row)  # Append the row to the list if it matches the filter
    
    # Create a DataFrame from the collected rows if there are any
    if all_rows:
        metabolite_df = pd.DataFrame(all_rows, columns=headers)
        # Save the DataFrame for the current metabolite to a CSV file
        metabolite_df.to_csv(os.path.join(full_subfolder_path, f'{metabolite}_{conditionpartA}.csv'), sep=';', index=False)
        print(f'{metabolite}_{conditionpartA}.csv created in {full_subfolder_path}')

        
print("Finished.")


# PART 2B - collecting fluxes to the respective metabolites

In [None]:
### For conditionB <-- (all disease or healthy samples)

# Define full path to the subfolder
full_subfolder_path = os.path.join(working_dir, conditionBfolder)

# Ensure the subfolder exists
if not os.path.exists(full_subfolder_path):
    os.makedirs(full_subfolder_path)

# Load the list of metabolites
metabolites_df = pd.read_csv(csv_path, sep=';')

# Iterate through each metabolite
for metabolite in metabolites_df['metabolite']:
    print("Processing:", metabolite)
    all_rows = []  # Initialize a list to collect rows
    
    for root, dirs, files in os.walk(source_path):
        if condition_folder_part in root:
            for file in files:
                if file == target_res1:
                    with open(os.path.join(root, file), mode='r', encoding='utf-8') as f:
                        reader = csv.DictReader(f, delimiter=';')
                        for row in reader:
                            # Check if the row's sample_id matches any sample in the condition_filter_B
                            if row['sample_id'] in condition_filter_B['sample'].values:
                                if metabolite in row['metabolite']:
                                    row['origin'] = os.path.basename(root)
                                    all_rows.append(row)  # Append the row to the list if it matches the filter
    
    # Create a DataFrame from the collected rows if there are any
    if all_rows:
        metabolite_df = pd.DataFrame(all_rows, columns=headers)
        # Save the DataFrame for the current metabolite to a CSV file
        metabolite_df.to_csv(os.path.join(full_subfolder_path, f'{metabolite}_{conditionpartB}.csv'), sep=';', index=False)
        print(f'{metabolite}_{conditionpartB}.csv created in {full_subfolder_path}')

        
print("Finished.")


#  Part 3 - polish flux file, add conditions to each row
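
What the polishing step does to a single per-metabolite file can be sketched on a toy table. All values and sample names here are made up; the column names follow the `headers` list defined above:

```python
import pandas as pd

# Toy per-metabolite rows
rows = pd.DataFrame({
    "taxon": ["Alistipes", "medium", "Bacteroides", "Bacteroides"],
    "sample_id": ["sample45", "sample45", "sample47", "sample47"],
    "flux": [2.0, 5.0, 4.0, 3.0],
    "abundance": [0.5, 1.0, 0.25, 0.5],
    "direction": ["export", "export", "import", "export"],
})
# Toy condition table
conditions = pd.DataFrame({"sample": ["sample45", "sample47"],
                           "condition": ["healthy", "sick"]})

# Drop medium rows and import rows, keeping only taxon exports.
kept = rows[(rows["taxon"] != "medium") & (rows["direction"] != "import")].copy()
# Look up each sample's condition and compute flux*abundance.
kept["condition"] = kept["sample_id"].map(conditions.set_index("sample")["condition"])
kept["flux_abundance"] = kept["flux"] * kept["abundance"]
print(kept)
```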

In [None]:
# Remove medium rows
# Remove import rows
# Add condition based on a CSV file showing sample and conditions
# Calculate flux*abundance
# For conditionA, then conditionB



# Load the condition file
condition_df = pd.read_csv(condition_file, sep=";")

# Set the working directory for the screening files
screening_folderA = os.path.join(working_dir, condition_folder_partA)

# Iterate over all CSV files in the screening folder
for filename in os.listdir(screening_folderA):
    if filename.endswith(".csv"):
        file_path = os.path.join(screening_folderA, filename)

        # Load the screening file
        df = pd.read_csv(file_path, sep=";")

        # Drop rows that have "medium" in the "taxon" column
        df = df[df['taxon'] != 'medium']

        # Drop rows that have "import" in the "direction" column
        df = df[df['direction'] != 'import']

        # Create the new column "condition" with the values from the condition file
        df['condition'] = df['sample_id'].map(condition_df.set_index('sample')['condition'])

        # Create the new column "flux_abundance" and calculate its values
        df['flux_abundance'] = df['flux'] * df['abundance']

        # Save the file
        df.to_csv(file_path, sep=";", index=False)

print(f"Processing of the files for condition {conditionpartA} is complete.")







# Set the working directory for the screening files
screening_folderB = os.path.join(working_dir, condition_folder_partB)

# Iterate over all CSV files in the screening folder
for filename in os.listdir(screening_folderB):
    if filename.endswith(".csv"):
        file_path = os.path.join(screening_folderB, filename)

        # Load the screening file
        df = pd.read_csv(file_path, sep=";")

        # Drop rows that have "medium" in the "taxon" column
        df = df[df['taxon'] != 'medium']

        # Drop rows that have "import" in the "direction" column
        df = df[df['direction'] != 'import']

        # Create the new column "condition" with the values from the condition file
        df['condition'] = df['sample_id'].map(condition_df.set_index('sample')['condition'])

        # Create the new column "flux_abundance" and calculate its values
        df['flux_abundance'] = df['flux'] * df['abundance']

        # Save the file
        df.to_csv(file_path, sep=";", index=False)

print(f"Processing of the files for condition {conditionpartB} is complete.")



print("Finished.")


# Part 4 - calculations
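
A small worked example of the normalization, assuming 4 healthy and 5 sick samples. The flux sum of 66 for healthy is the example from the introduction; the value of 50 for sick is made up:

```python
# Number of samples per condition (assumed for this example)
counts = {"healthy": 4, "sick": 5}

# Scale every condition down to the size of the smaller group.
min_count = min(counts.values())
factors = {cond: min_count / n for cond, n in counts.items()}

# Summed fluxes of one taxon per condition (66 from the introduction, 50 made up)
flux_sums = {"healthy": 66.0, "sick": 50.0}
normalized = {cond: flux_sums[cond] * factors[cond] for cond in counts}

print(factors)
print(normalized)
```

The smaller group keeps a factor of 1, while the larger group is scaled by 4/5, so its summed flux of 50 is compared as roughly 40.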

In [None]:
### Part 4 

# Add sums of flux, abundance and flux_abundance

# Calculate a factor to normalize uneven condition sample sizes (like 4 healthy and 5 sick)
# Multiply sums with factor, to get normalized values.

# Reorder the columns so that the result can be loaded into a statistics program more easily


# Load the condition file
condition_df = pd.read_csv(condition_file, sep=";")

# Count the entries per condition in the "condition" column
conditionpartA_count = condition_df['condition'].value_counts().get(conditionpartA, 0)
conditionpartB_count = condition_df['condition'].value_counts().get(conditionpartB, 0)

# Find the smaller sample count and calculate the normalization factors
condition_min_count = min(conditionpartA_count, conditionpartB_count)
condA_factor = condition_min_count / conditionpartA_count if conditionpartA_count else 0
condB_factor = condition_min_count / conditionpartB_count if conditionpartB_count else 0

def process_files(screening_folder, factor, condition_suffix):
    for filename in os.listdir(screening_folder):
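        # Skip summary files from an earlier run so they are not aggregated again
        if filename.endswith(".csv") and not filename.endswith("_fluxsum.csv"):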
            file_path = os.path.join(screening_folder, filename)
            
            # Load the screening file
            df = pd.read_csv(file_path, sep=";")
            
            # Group and aggregate
            grouped = df.groupby(['taxon', 'reaction', 'metabolite', 'origin', 'condition', 'direction']).agg({
                'flux': 'sum',
                'abundance': 'sum',
                'flux_abundance': 'sum'
            }).reset_index()
            
            # Apply the normalization factor and store the result in condition-specific columns
            grouped[f'flux_sum_norm_{condition_suffix}'] = grouped['flux'] * factor
            grouped[f'abundance_sum_norm_{condition_suffix}'] = grouped['abundance'] * factor
            grouped[f'flux_abundance_sum_norm_{condition_suffix}'] = grouped['flux_abundance'] * factor
            
            # Rename the original sum columns
            grouped.rename(columns={
                'flux': 'flux_sum',
                'abundance': 'abundance_sum',
                'flux_abundance': 'flux_abundance_sum'
            }, inplace=True)
            
            # Add the overall totals
            total_flux_sum = grouped['flux_sum'].sum()
            total_abundance_sum = grouped['abundance_sum'].sum()
            total_flux_abundance_sum = grouped['flux_abundance_sum'].sum()
            
            total_flux_sum_norm = grouped[f'flux_sum_norm_{condition_suffix}'].sum()
            total_abundance_sum_norm = grouped[f'abundance_sum_norm_{condition_suffix}'].sum()
            total_flux_abundance_sum_norm = grouped[f'flux_abundance_sum_norm_{condition_suffix}'].sum()
            
            total_row = pd.DataFrame({
                'taxon': ['Total'],
                'flux_sum': [total_flux_sum],
                'abundance_sum': [total_abundance_sum],
                'flux_abundance_sum': [total_flux_abundance_sum],
                f'flux_sum_norm_{condition_suffix}': [total_flux_sum_norm],
                f'abundance_sum_norm_{condition_suffix}': [total_abundance_sum_norm],
                f'flux_abundance_sum_norm_{condition_suffix}': [total_flux_abundance_sum_norm],
                'reaction': [''],
                'metabolite': [''],
                'origin': [''],
                'condition': [''],
                'direction': ['']
            })
            
            result = pd.concat([grouped, total_row], ignore_index=True)
            
            # Adjust the column order
            result = result[['taxon', 
                             f'flux_sum_norm_{condition_suffix}', 
                             f'abundance_sum_norm_{condition_suffix}', 
                             f'flux_abundance_sum_norm_{condition_suffix}', 
                             'reaction', 
                             'metabolite', 
                             'origin', 
                             'condition', 
                             'direction', 
                             'flux_sum', 
                             'abundance_sum', 
                             'flux_abundance_sum']]
            
            # Write the new CSV with the suffix "_fluxsum.csv"
            new_filename = filename.replace(".csv", "_fluxsum.csv")
            new_file_path = os.path.join(screening_folder, new_filename)
            result.to_csv(new_file_path, sep=";", index=False)

# Process the files in both folders
process_files(os.path.join(working_dir, condition_folder_partA), condA_factor, conditionpartA)
print(f"Processing of the files for condition {conditionpartA} is complete.")

process_files(os.path.join(working_dir, condition_folder_partB), condB_factor, conditionpartB)
print(f"Processing of the files for condition {conditionpartB} is complete.")
print("Finished.")


# Part 5A - replacing of IDs of "cpd..." to actual names.
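
The replacement can be illustrated on a toy mapping. In the notebook the mapping is read from the Gapseq/ModelSEED file; the two entries, the file name and the table below are illustrative assumptions:

```python
import pandas as pd

# Toy mapping from metabolite ID to metabolite name
replacement_dict = {"cpd00001": "H2O", "cpd00029": "Acetate"}

def replace_ids_in_filename(filename, mapping):
    # Replace every known metabolite ID that occurs in the file name.
    for id_metab, name in mapping.items():
        if id_metab in filename:
            filename = filename.replace(id_metab, name)
    return filename

new_filename = "mIDs_" + replace_ids_in_filename("cpd00029_healthy_fluxsum.csv", replacement_dict)
print(new_filename)

# The same mapping applied to the cell contents; regex=True also matches IDs
# inside longer strings such as "cpd00029_e0".
toy = pd.DataFrame({"taxon": ["Alistipes"], "metabolite": ["cpd00029_e0"], "flux": [2.0]})
renamed = toy.replace(replacement_dict, regex=True)
print(renamed)
```

Note that a metabolite name containing characters that are not allowed in file names would need extra handling, which this sketch does not cover.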

In [None]:
# Condition A

# Replace the cpd IDs with the real metabolite names. This also changes the file names.

# INPUT: <conditionAfolder>/<cpd ID>_<conditionpartA>.csv and <cpd ID>_<conditionpartA>_fluxsum.csv

# OUTPUT: <conditionAfolder>_mIDs/mIDs_<REAL_metabolite_name>_*.csv

def replace_ids_in_filename(filename, replacement_dict):
    # For each metabolite ID in the dictionary, check if it's in the filename
    for id_metab, metabolite in replacement_dict.items():
        # If the ID is found in the filename, replace it with the metabolite name
        if id_metab in filename:
            filename = filename.replace(id_metab, metabolite)
    return filename


main_folder_path = os.path.join(working_dir, conditionAfolder)

target_folder = condition_folder_partAm  # Change this to your specific subfolder name

target_folder_path = os.path.join(working_dir, target_folder)

# Ensure target directory exists
if not os.path.exists(target_folder_path):
    os.makedirs(target_folder_path)
    print(f"Created directory {target_folder_path}")
    

# Read the mapping from 'id' to 'name' into a DataFrame
metab_df = pd.read_csv(filepath_allmetab, sep=";")
# Create a dictionary for replacements
replacement_dict = pd.Series(metab_df.name.values, index=metab_df.id).to_dict()

for dirpath, dirnames, filenames in os.walk(main_folder_path):
    for filename in filenames:
        #if "_bothdif.csv" in filename:  # Filter to consider only specific files
        if f"_{conditionpartA}.csv" in filename or f"_{conditionpartA}_fluxsum.csv" in filename:
            # Generate a new filename by replacing metabolite IDs with names
            new_filename = replace_ids_in_filename(filename, replacement_dict)
            
            # Construct the full filepath and the modified filepath
            filepath = os.path.join(dirpath, filename)
            mod_filepath = os.path.join(target_folder_path, 'mIDs_' + new_filename)

            if filename.endswith('.html'):
                with open(filepath, 'r') as f:
                    contents = f.read()

                soup = BeautifulSoup(contents, 'lxml')
                # The mapping file has the columns 'id' and 'name'
                for item in metab_df.itertuples():
                    for tag in soup.find_all(string=re.compile(re.escape(item.id))):
                        updated_string = tag.replace(item.id, item.name)
                        tag.replace_with(updated_string)

                # Save the modified HTML content to a new file
                with open(mod_filepath, 'w') as f:
                    f.write(str(soup))

            elif filename.endswith('.csv'):
                # The input files were written with ";" as separator
                df = pd.read_csv(filepath, sep=";")
                df.replace(replacement_dict, regex=True, inplace=True)
                # Save with ";" as well, because Part 6 reads these files with sep=";"
                df.to_csv(mod_filepath, sep=";", index=False)

print("Processing complete.")
print("Finished.")


# Part 5B - replacing of IDs of "cpd..." to actual names.

In [None]:
# Condition B

# Replace the cpd IDs with the real metabolite names. This also changes the file names.

# INPUT: <conditionBfolder>/<cpd ID>_<conditionpartB>.csv and <cpd ID>_<conditionpartB>_fluxsum.csv

# OUTPUT: <conditionBfolder>_mIDs/mIDs_<REAL_metabolite_name>_*.csv

def replace_ids_in_filename(filename, replacement_dict):
    # For each metabolite ID in the dictionary, check if it's in the filename
    for id_metab, metabolite in replacement_dict.items():
        # If the ID is found in the filename, replace it with the metabolite name
        if id_metab in filename:
            filename = filename.replace(id_metab, metabolite)
    return filename


main_folder_path = os.path.join(working_dir, conditionBfolder)

target_folder = condition_folder_partBm  # Change this to your specific subfolder name

target_folder_path = os.path.join(working_dir, target_folder)

# Ensure target directory exists
if not os.path.exists(target_folder_path):
    os.makedirs(target_folder_path)
    print(f"Created directory {target_folder_path}")
    
# Read the mapping from 'id' to 'name' into a DataFrame
metab_df = pd.read_csv(filepath_allmetab, sep=";")
# Create a dictionary for replacements
replacement_dict = pd.Series(metab_df.name.values, index=metab_df.id).to_dict()

for dirpath, dirnames, filenames in os.walk(main_folder_path):
    for filename in filenames:
        #if "_bothdif.csv" in filename:  # Filter to consider only specific files
        if f"_{conditionpartB}.csv" in filename or f"_{conditionpartB}_fluxsum.csv" in filename:
            # Generate a new filename by replacing metabolite IDs with names
            new_filename = replace_ids_in_filename(filename, replacement_dict)
            
            # Construct the full filepath and the modified filepath
            filepath = os.path.join(dirpath, filename)
            mod_filepath = os.path.join(target_folder_path, 'mIDs_' + new_filename)

            if filename.endswith('.html'):
                with open(filepath, 'r') as f:
                    contents = f.read()

                soup = BeautifulSoup(contents, 'lxml')
                # The mapping file has the columns 'id' and 'name'
                for item in metab_df.itertuples():
                    for tag in soup.find_all(string=re.compile(re.escape(item.id))):
                        updated_string = tag.replace(item.id, item.name)
                        tag.replace_with(updated_string)

                # Save the modified HTML content to a new file
                with open(mod_filepath, 'w') as f:
                    f.write(str(soup))

            elif filename.endswith('.csv'):
                # The input files were written with ";" as separator
                df = pd.read_csv(filepath, sep=";")
                df.replace(replacement_dict, regex=True, inplace=True)
                # Save with ";" as well, because Part 6 reads these files with sep=";"
                df.to_csv(mod_filepath, sep=";", index=False)

print("Processing complete.")
print("Finished.")


In [None]:
print(condition_folder_core)
print(conditionpartA)
print(conditionpartB)

# Part 6 - merging conditions for each metabolite
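
The merge logic can be sketched with two toy tables. Taxa, values and the condition names `healthy` and `sick` are made up; only the flux column is shown:

```python
import pandas as pd

# Toy per-condition summaries for one metabolite
df_A = pd.DataFrame({"taxon": ["Alistipes", "Total"], "flux_sum_norm_healthy": [66.0, 66.0]})
df_B = pd.DataFrame({"taxon": ["Veillonella", "Total"], "flux_sum_norm_sick": [40.0, 40.0]})

# Outer merge keeps taxa that occur in only one condition; missing values become 0.
merged = pd.merge(df_A, df_B, on="taxon", how="outer").fillna(0)

# Sort in reverse alphabetical order, then move the "Total" row to the top.
merged = merged.sort_values(by="taxon", ascending=False).reset_index(drop=True)
total_row = merged[merged["taxon"] == "Total"]
other_rows = merged[merged["taxon"] != "Total"]
merged = pd.concat([total_row, other_rows]).reset_index(drop=True)
print(merged)
```

A taxon that exports the metabolite in only one condition therefore shows up with 0 in the other condition rather than being dropped.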

In [None]:
# Combine the condition A and condition B metabolite files (e.g. CHX and control) from the folders ending in _mIDs
# You will get a merged folder with files ending in _fluxsummerge.csv

# OUTPUT example: mIDs_Formate_MNXM39_fluxsummerge.csv
#taxon	flux_abundance_sum_norm_d13_CHX207	flux_abundance_sum_norm_d13_control	flux_sum_norm_d13_CHX207	flux_sum_norm_d13_control	abundance_sum_norm_d13_CHX207	abundance_sum_norm_d13_control
#Total	0.357715421	0.050616287	34.69583001	43.20324925	1.474192454	1.79060995
#Stenotrophomonas	0.008991778	0.002472753	32.15788288	8.871002326	0.000919222	0.000222997
#Parvibacter	0	0.00299169	0	2.852755745	0	0.000838961
#Oscillibacter	0	0.001131594	0	0.139606914	0	0.022365999
#...

# Be aware that in Part 4 the values were normalized to the condition with fewer samples!

# Create the new folder for the merged files
os.makedirs(os.path.join(working_dir, merged_conditions), exist_ok=True)

def process_and_merge_files(partA_folder, partB_folder, output_folder, conditionA, conditionB):
    # Search the folder of condition A
    for filename in os.listdir(partA_folder):
        if filename.startswith("mIDs_") and filename.endswith(f"{conditionA}_fluxsum.csv"):
            # Extract the metabolite ID
            metabolite_id = filename[len("mIDs_"):-len(f"_{conditionA}_fluxsum.csv")]

            # Look for the corresponding file in the folder of condition B
            corresponding_file_B = f"mIDs_{metabolite_id}_{conditionB}_fluxsum.csv"
            partB_filepath = os.path.join(partB_folder, corresponding_file_B)
            
            if os.path.exists(partB_filepath):
                # Read the files
                df_A = pd.read_csv(os.path.join(partA_folder, filename), sep=";")
                df_B = pd.read_csv(partB_filepath, sep=";")

                # Keep only the taxon and the normalized sum columns
                df_A = df_A[['taxon', f'flux_sum_norm_{conditionA}', f'abundance_sum_norm_{conditionA}', f'flux_abundance_sum_norm_{conditionA}']]
                df_B = df_B[['taxon', f'flux_sum_norm_{conditionB}', f'abundance_sum_norm_{conditionB}', f'flux_abundance_sum_norm_{conditionB}']]
                
                # Merge the DataFrames
                merged_df = pd.merge(df_A, df_B, on='taxon', how='outer').fillna(0)

                # Adjust the column order
                merged_df = merged_df[['taxon', 
                                       f'flux_abundance_sum_norm_{conditionA}', f'flux_abundance_sum_norm_{conditionB}',
                                       f'flux_sum_norm_{conditionA}', f'flux_sum_norm_{conditionB}',
                                       f'abundance_sum_norm_{conditionA}', f'abundance_sum_norm_{conditionB}']]
                
                # Sort by taxon in reverse alphabetical order
                merged_df = merged_df.sort_values(by='taxon', ascending=False).reset_index(drop=True)

                # Move the row with taxon "Total" to the top (second line of the file, directly below the header)
                total_row = merged_df[merged_df['taxon'] == 'Total']
                other_rows = merged_df[merged_df['taxon'] != 'Total']
                merged_df = pd.concat([total_row, other_rows]).reset_index(drop=True)

                # Save the merged and sorted file
                output_filename = f"mIDs_{metabolite_id}_fluxsummerge.csv"
                merged_df.to_csv(os.path.join(output_folder, output_filename), sep=";", index=False)

# Call the function to process and merge the files
process_and_merge_files(
    os.path.join(working_dir, condition_folder_partAm), 
    os.path.join(working_dir, condition_folder_partBm), 
    os.path.join(working_dir, merged_conditions), 
    conditionpartA, 
    conditionpartB
)

print("Processing, merging and sorting of the files is complete.")
print("Finished.")


# Part 7 - SORTING the metabolites according to the TAXA and separately the flux, flux*abundance and abundance.
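
The wide tables in this part are built by left-merging one renamed set of columns per metabolite onto the taxon list. A minimal sketch with made-up taxa, metabolites and values, showing only one condition and the flux columns:

```python
import pandas as pd

# Toy taxon list and toy per-metabolite tables
taxa = pd.DataFrame({"taxon": ["Alistipes", "Bacteroides"]})
per_metabolite = {
    "Acetate": pd.DataFrame({"taxon": ["Alistipes"], "flux_sum_norm_sick": [1.5]}),
    "Formate": pd.DataFrame({"taxon": ["Bacteroides"], "flux_sum_norm_sick": [2.5]}),
}

wide = taxa.copy()
for name, df in per_metabolite.items():
    # Append the metabolite name so the columns stay unique in the wide table.
    renamed = df.rename(columns={"flux_sum_norm_sick": f"flux_sum_norm_sick_{name}"})
    # A left merge keeps every taxon; taxa without this metabolite get NaN.
    wide = pd.merge(wide, renamed, on="taxon", how="left")
print(wide)
```

Each taxon keeps one row, and every metabolite contributes its own column, with empty cells where a taxon did not export that metabolite.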

In [None]:
# Build the folder path
merged_folder_path = os.path.join(working_dir, merged_conditions)

# Search the files and build the lists
file_list_csv = []
mIDs_list = []

for filename in os.listdir(merged_folder_path):
    if filename.endswith("_fluxsummerge.csv"):
        file_list_csv.append(filename)
        metabolite_id = filename[len("mIDs_"):-len("_fluxsummerge.csv")]
        mIDs_list.append(metabolite_id)

# Sort the lists alphabetically
file_list_csv.sort()
mIDs_list.sort()

# Save mIDs_list as a DataFrame
mIDs_df = pd.DataFrame(mIDs_list, columns=["metabolite"])
mIDs_df.to_csv(os.path.join(working_dir, "mIDs_list.csv"), index=False, sep=";")

# Build the taxon list
taxon_set = set()
for filename in file_list_csv:
    df = pd.read_csv(os.path.join(merged_folder_path, filename), sep=";")
    taxon_set.update(df["taxon"].unique())

# Sort the taxon list and save it as a DataFrame
taxon_list = sorted(taxon_set)
taxon_df = pd.DataFrame(taxon_list, columns=["taxon"])
taxon_df.to_csv(os.path.join(working_dir, "taxon_list.csv"), index=False, sep=";")

# Create the DataFrames for the new CSVs (taxon column only; the data columns are added below)
genus_flux_metabolites_flux_df = pd.DataFrame({"taxon": taxon_list})
genus_flux_metabolites_abundance_df = pd.DataFrame({"taxon": taxon_list})
genus_flux_metabolites_abundflux_df = pd.DataFrame({"taxon": taxon_list})

# Iterate over the files and transfer the data into the new DataFrames
for filename in file_list_csv:
    metabolite_id = filename[len("mIDs_"):-len("_fluxsummerge.csv")]
    df = pd.read_csv(os.path.join(merged_folder_path, filename), sep=";")
    
    # Transfer the flux_abundance data
    genus_flux_metabolites_abundflux_df = pd.merge(
        genus_flux_metabolites_abundflux_df, 
        df[["taxon", f"flux_abundance_sum_norm_{conditionpartB}", f"flux_abundance_sum_norm_{conditionpartA}"]].rename(
            columns={
                f"flux_abundance_sum_norm_{conditionpartB}": f"flux_abundance_sum_norm_{conditionpartB}_{metabolite_id}",
                f"flux_abundance_sum_norm_{conditionpartA}": f"flux_abundance_sum_norm_{conditionpartA}_{metabolite_id}"
            }
        ), 
        on="taxon", how="left"
    )

    # Transfer the flux data
    genus_flux_metabolites_flux_df = pd.merge(
        genus_flux_metabolites_flux_df, 
        df[["taxon", f"flux_sum_norm_{conditionpartB}", f"flux_sum_norm_{conditionpartA}"]].rename(
            columns={
                f"flux_sum_norm_{conditionpartB}": f"flux_sum_norm_{conditionpartB}_{metabolite_id}",
                f"flux_sum_norm_{conditionpartA}": f"flux_sum_norm_{conditionpartA}_{metabolite_id}"
            }
        ), 
        on="taxon", how="left"
    )

    # Transfer the abundance data
    genus_flux_metabolites_abundance_df = pd.merge(
        genus_flux_metabolites_abundance_df, 
        df[["taxon", f"abundance_sum_norm_{conditionpartB}", f"abundance_sum_norm_{conditionpartA}"]].rename(
            columns={
                f"abundance_sum_norm_{conditionpartB}": f"abundance_sum_norm_{conditionpartB}_{metabolite_id}",
                f"abundance_sum_norm_{conditionpartA}": f"abundance_sum_norm_{conditionpartA}_{metabolite_id}"
            }
        ), 
        on="taxon", how="left"
    )

# Sort the DataFrames by taxon
genus_flux_metabolites_flux_df = genus_flux_metabolites_flux_df.sort_values(by="taxon")
genus_flux_metabolites_abundance_df = genus_flux_metabolites_abundance_df.sort_values(by="taxon")
genus_flux_metabolites_abundflux_df = genus_flux_metabolites_abundflux_df.sort_values(by="taxon")

# Save the DataFrames as CSVs
genus_flux_metabolites_flux_df.to_csv(os.path.join(working_dir, f"{genus_flux_metabolites_var}_flux.csv"), index=False, sep=";")
genus_flux_metabolites_abundance_df.to_csv(os.path.join(working_dir, f"{genus_flux_metabolites_var}_abundance.csv"), index=False, sep=";")
genus_flux_metabolites_abundflux_df.to_csv(os.path.join(working_dir, f"{genus_flux_metabolites_var}_abundflux.csv"), index=False, sep=";")
print("Finished. TK")
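
The merge loop of Part 7 can be tried on in-memory data before running it on real files. The following is a minimal sketch, not the notebook's own code: the file names, the condition labels `ctrl` and `sick`, and all values are invented for illustration, and only the flux table is built. It shows how the metabolite name is sliced out of the file name and how each left merge adds one pair of columns per metabolite.

```python
import pandas as pd

# Hypothetical condition labels, file names and values (illustration only)
cond_a, cond_b = "ctrl", "sick"
files = {
    "mIDs_acetate_fluxsummerge.csv": pd.DataFrame({
        "taxon": ["Akkermansia", "Alistipes"],
        f"flux_sum_norm_{cond_b}": [1.0, 2.0],
        f"flux_sum_norm_{cond_a}": [0.5, 1.5],
    }),
    "mIDs_butyrate_fluxsummerge.csv": pd.DataFrame({
        "taxon": ["Alistipes"],
        f"flux_sum_norm_{cond_b}": [3.0],
        f"flux_sum_norm_{cond_a}": [4.0],
    }),
}

# One row per taxon seen in any file, sorted alphabetically
taxa = sorted(set().union(*(table["taxon"] for table in files.values())))
wide = pd.DataFrame({"taxon": taxa})

for filename in sorted(files):
    # Strip the fixed prefix and suffix to recover the metabolite name
    metabolite_id = filename[len("mIDs_"):-len("_fluxsummerge.csv")]
    table = files[filename]
    # Append the metabolite name to every value column, then merge on taxon
    renamed = table.rename(columns={
        col: f"{col}_{metabolite_id}" for col in table.columns if col != "taxon"
    })
    wide = wide.merge(renamed, on="taxon", how="left")

print(wide)
```

Because the merge uses `how="left"`, a taxon that never exported a given metabolite keeps an empty (NaN) cell for it. Part 9 later fills these cells with 0 before subtracting.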


# Part 8 - re-checking conditions are correctly attributed

In [None]:
# Again, make sure that the sick and healthy conditions are in the correct order

# Remember, you determined sick condition to be A or B, and control to be the other one:

print("sick condition is:", sick_cond)
print("control condition is:", control_cond)

#Check again that this is correct.

# The difference of SICK (sick_cond) MINUS CONTROL (control_cond) will be calculated.
# A +POSITIVE value in _diff.csv and _diffreduced.csv means higher flux in DISEASE
# A -NEGATIVE value in _diff.csv and _diffreduced.csv means a reduced flux in DISEASE
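
The visual check above can be backed by a small automated one. The sketch below is an assumption-laden illustration, not part of the original script: `check_condition_labels` is a hypothetical helper, and the column names are invented. It flags two situations that would silently break the column matching in Part 9, which selects columns with `sick_cond in col`: labels that are identical or contained in each other, and labels that do not occur in any column header.

```python
def check_condition_labels(columns, sick_cond, control_cond):
    """Return a list of problems found; an empty list means the labels look usable."""
    problems = []
    if sick_cond == control_cond:
        problems.append("sick_cond and control_cond are identical")
    elif sick_cond in control_cond or control_cond in sick_cond:
        # Substring tests such as `sick_cond in col` would then match both conditions
        problems.append("one condition label is a substring of the other")
    for label in (sick_cond, control_cond):
        if not any(f"_{label}_" in col for col in columns):
            problems.append(f"no column header contains '{label}'")
    return problems


# Invented column headers following the naming scheme of Part 7
columns = ["taxon", "flux_sum_norm_sick_acetate", "flux_sum_norm_ctrl_acetate"]
print(check_condition_labels(columns, "sick", "ctrl"))
```

An empty list means both labels were found and cannot be confused with each other. Any message in the list is a reason to revisit the assignment of `sick_cond` and `control_cond` before computing differences.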

In [None]:
# Compute the difference: sick (here cachexia) MINUS control.
# A POSITIVE difference means elevated in disease,
# a NEGATIVE difference means reduced in disease.

# Example 
# Input: E123_lateCHXvCTR_M754_MGdiet_v02_t10_genusfluxmeta_abundflux.csv
# taxon	flux_abundance_sum_norm_d12d13_chx_1_2_Propanediol_MNXM1118	flux_abundance_sum_norm_d12d13_ctrlmca_1_2_Propanediol_MNXM1118
#Acetatifactor	0.150308456	0.064411408
#Akkermansia	0.225251359	0.143899104

# Output ONE, contains more information, E123_lateCHXvCTR_M754_MGdiet_v02_t10_genusfluxmeta_abundflux_diff.csv
#taxon	flux_abundance_sum_norm_d12d13_chx_minus_d12d13_ctrlmca_1_2_Propanediol_MNXM1118
#Acetatifactor	0.085897048
#Akkermansia	0.081352255

# Output TWO, use directly for R and heatmap. E123_lateCHXvCTR_M754_MGdiet_v02_t10_genusfluxmeta_abundflux_diffreduced.csv
#metabolite	1_2_Propanediol
#Acetatifactor	0.085897048
#Akkermansia	0.081352255
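
The worked example above can be reproduced in a few lines. This is a minimal sketch that uses the two rows and the condition labels from the example. It builds the table in memory instead of reading the CSV, and the variable names are chosen for illustration only.

```python
import pandas as pd

# Labels and values taken from the example above
sick_cond, control_cond = "d12d13_chx", "d12d13_ctrlmca"
metabolite = "1_2_Propanediol_MNXM1118"
sick_col = f"flux_abundance_sum_norm_{sick_cond}_{metabolite}"
control_col = f"flux_abundance_sum_norm_{control_cond}_{metabolite}"

df = pd.DataFrame({
    "taxon": ["Acetatifactor", "Akkermansia"],
    sick_col: [0.150308456, 0.225251359],
    control_col: [0.064411408, 0.143899104],
})

# Output ONE: sick minus control, keeping the full header
diff_col = f"flux_abundance_sum_norm_{sick_cond}_minus_{control_cond}_{metabolite}"
diff = pd.DataFrame({"taxon": df["taxon"], diff_col: (df[sick_col] - df[control_col]).round(9)})

# Output TWO: same values, header reduced to the metabolite name without the MNX ID
reduced = diff.rename(columns={"taxon": "metabolite", diff_col: metabolite.rsplit("_MNX", 1)[0]})

print(diff)
print(reduced)
```

Both taxa yield positive differences here, which under the sign convention of Part 8 means a higher flux\*abundance in the disease condition.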

# Part 9A - Creating flux\*abundance files

In [None]:

def process_csv(genus_flux_metabolites_var, sick_cond, control_cond, working_dir):
    # Load the CSV
    file_path = os.path.join(working_dir, f"{genus_flux_metabolites_var}_abundflux.csv")
    df = pd.read_csv(file_path, delimiter=';')
    
    # Handle empty fields by filling them with 0
    df.fillna(0, inplace=True)
    
    # Create a new DataFrame for the difference
    diff_df = df[['taxon']].copy()

    # Collect the new columns in a dict, then add them to diff_df after the loop
    new_columns = {}

    # Process each pair of columns
    for col in df.columns[1:]:
        if sick_cond in col:
            try:
                if "_MNX" in col:
                    metabolite_name = col.split(f"{sick_cond}_")[1].rsplit("_MNX", 1)[0]
                    metabolite_id = col.rsplit("_MNX", 1)[1]
                    control_col = f"flux_abundance_sum_norm_{control_cond}_{metabolite_name}_MNX{metabolite_id}"
                else:
                    metabolite_name = col.split(f"{sick_cond}_")[1]
                    control_col = f"flux_abundance_sum_norm_{control_cond}_{metabolite_name}"
                
                if control_col in df.columns:
                    new_col_name = f"flux_abundance_sum_norm_{sick_cond}_minus_{control_cond}_{metabolite_name}"
                    if "_MNX" in col:
                        new_col_name += f"_MNX{metabolite_id}"
                    
                    new_columns[new_col_name] = df[col] - df[control_col]
                else:
                    print(f"Control column not found for: {col}")
            except IndexError:
                print(f"Error processing column: {col}")

    # Add new columns to the DataFrame at once
    for col_name, col_data in new_columns.items():
        diff_df[col_name] = col_data

    # Save the difference DataFrame to CSV
    diff_file_path = os.path.join(working_dir, f"{genus_flux_metabolites_var}_abundflux_diff.csv")
    diff_df.to_csv(diff_file_path, sep=';', index=False)

    # Create the reduced CSV with only metabolite names in the headers
    reduced_df = diff_df.copy()
    reduced_df.columns = ['metabolite' if col == 'taxon' else col.split(f"{sick_cond}_minus_{control_cond}_")[1].rsplit("_MNX", 1)[0] for col in reduced_df.columns]
    
    # Save the reduced DataFrame to CSV
    reduced_file_path = os.path.join(working_dir, f"{genus_flux_metabolites_var}_abundflux_diffreduced.csv")
    reduced_df.to_csv(reduced_file_path, sep=';', index=False)


process_csv(genus_flux_metabolites_var, sick_cond, control_cond, working_dir)

print("Finished. TK")
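
The header handling inside `process_csv` is easier to follow in isolation. The sketch below is a hypothetical helper (`split_header` is not part of the script) that assumes the header starts with `<prefix>_<condition>_` and splits the remainder into the metabolite name and an optional MNX identifier, using the last `_MNX` in the header as the split point.

```python
def split_header(col, prefix, cond):
    """Split '<prefix>_<cond>_<name>[_MNX<id>]' into (name, mnx_id or None)."""
    lead = f"{prefix}_{cond}_"
    if not col.startswith(lead):
        raise ValueError(f"header does not start with '{lead}': {col}")
    rest = col[len(lead):]
    if "_MNX" in rest:
        # Split at the LAST '_MNX' so names containing that text stay intact
        name, mnx_tail = rest.rsplit("_MNX", 1)
        return name, "MNX" + mnx_tail
    return rest, None


# Header from the example in Part 8, and an invented header without an MNX ID
print(split_header(
    "flux_abundance_sum_norm_d12d13_chx_1_2_Propanediol_MNXM1118",
    "flux_abundance_sum_norm", "d12d13_chx"))
print(split_header("flux_abundance_sum_norm_d12d13_chx_Acetate",
                   "flux_abundance_sum_norm", "d12d13_chx"))
```

Matching on the full prefix rather than on the condition label alone avoids accidental matches when the label also occurs inside a metabolite name.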


# Part 9B - Creating abundance files

In [None]:
import pandas as pd
import os

def process_csv2(genus_flux_metabolites_var, sick_cond, control_cond, working_dir):
    # Load the CSV
    file_path = os.path.join(working_dir, f"{genus_flux_metabolites_var}_abundance.csv")
    df = pd.read_csv(file_path, delimiter=';')
    
    # Handle empty fields by filling them with 0
    df.fillna(0, inplace=True)
    
    # Create a new DataFrame for the difference
    diff_df = df[['taxon']].copy()

    # Collect the new columns in a dict, then add them to diff_df after the loop
    new_columns = {}

    # Process each pair of columns
    for col in df.columns[1:]:
        if sick_cond in col:
            try:
                if "_MNX" in col:
                    metabolite_name = col.split(f"{sick_cond}_")[1].rsplit("_MNX", 1)[0]
                    metabolite_id = col.rsplit("_MNX", 1)[1]
                    control_col = f"abundance_sum_norm_{control_cond}_{metabolite_name}_MNX{metabolite_id}"
                else:
                    metabolite_name = col.split(f"{sick_cond}_")[1]
                    control_col = f"abundance_sum_norm_{control_cond}_{metabolite_name}"
                
                if control_col in df.columns:
                    new_col_name = f"abundance_sum_norm_{sick_cond}_minus_{control_cond}_{metabolite_name}"
                    if "_MNX" in col:
                        new_col_name += f"_MNX{metabolite_id}"
                    
                    new_columns[new_col_name] = df[col] - df[control_col]
                else:
                    print(f"Control column not found for: {col}")
            except IndexError:
                print(f"Error processing column: {col}")

    # Add new columns to the DataFrame at once
    for col_name, col_data in new_columns.items():
        diff_df[col_name] = col_data

    # Save the difference DataFrame to CSV
    diff_file_path = os.path.join(working_dir, f"{genus_flux_metabolites_var}_abundance_diff.csv")
    diff_df.to_csv(diff_file_path, sep=';', index=False)

    # Create the reduced CSV with only metabolite names in the headers
    reduced_df = diff_df.copy()
    reduced_df.columns = ['metabolite' if col == 'taxon' else col.split(f"{sick_cond}_minus_{control_cond}_")[1].rsplit("_MNX", 1)[0] for col in reduced_df.columns]
    
    # Save the reduced DataFrame to CSV
    reduced_file_path = os.path.join(working_dir, f"{genus_flux_metabolites_var}_abundance_diffreduced.csv")
    reduced_df.to_csv(reduced_file_path, sep=';', index=False)


process_csv2(genus_flux_metabolites_var, sick_cond, control_cond, working_dir)


# Part 9C - Creating flux files

In [None]:
import pandas as pd
import os

def process_csv3(genus_flux_metabolites_var, sick_cond, control_cond, working_dir):
    # Load the CSV
    file_path = os.path.join(working_dir, f"{genus_flux_metabolites_var}_flux.csv")
    df = pd.read_csv(file_path, delimiter=';')
    
    # Handle empty fields by filling them with 0
    df.fillna(0, inplace=True)
    
    # Create a new DataFrame for the difference
    diff_df = df[['taxon']].copy()

    # Collect the new columns in a dict, then add them to diff_df after the loop
    new_columns = {}

    # Process each pair of columns
    for col in df.columns[1:]:
        if sick_cond in col:
            try:
                if "_MNX" in col:
                    metabolite_name = col.split(f"{sick_cond}_")[1].rsplit("_MNX", 1)[0]
                    metabolite_id = col.rsplit("_MNX", 1)[1]
                    control_col = f"flux_sum_norm_{control_cond}_{metabolite_name}_MNX{metabolite_id}"
                else:
                    metabolite_name = col.split(f"{sick_cond}_")[1]
                    control_col = f"flux_sum_norm_{control_cond}_{metabolite_name}"
                
                if control_col in df.columns:
                    new_col_name = f"flux_sum_norm_{sick_cond}_minus_{control_cond}_{metabolite_name}"
                    if "_MNX" in col:
                        new_col_name += f"_MNX{metabolite_id}"
                    
                    new_columns[new_col_name] = df[col] - df[control_col]
                else:
                    print(f"Control column not found for: {col}")
            except IndexError:
                print(f"Error processing column: {col}")

    # Add new columns to the DataFrame at once
    for col_name, col_data in new_columns.items():
        diff_df[col_name] = col_data

    # Save the difference DataFrame to CSV
    diff_file_path = os.path.join(working_dir, f"{genus_flux_metabolites_var}_flux_diff.csv")
    diff_df.to_csv(diff_file_path, sep=';', index=False)

    # Create the reduced CSV with only metabolite names in the headers
    reduced_df = diff_df.copy()
    reduced_df.columns = ['metabolite' if col == 'taxon' else col.split(f"{sick_cond}_minus_{control_cond}_")[1].rsplit("_MNX", 1)[0] for col in reduced_df.columns]
    
    # Save the reduced DataFrame to CSV
    reduced_file_path = os.path.join(working_dir, f"{genus_flux_metabolites_var}_flux_diffreduced.csv")
    reduced_df.to_csv(reduced_file_path, sep=';', index=False)

# Example usage
#genus_flux_metabolites_var = 'example_genus_flux_metabolites_var'
#sick_cond = 'sick_cond'
#control_cond = 'control_cond'
#working_dir = 'path/to/working_dir'
process_csv3(genus_flux_metabolites_var, sick_cond, control_cond, working_dir)
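
The functions in Parts 9A, 9B and 9C differ only in the column prefix and the file suffix. As a sketch of how they could be folded into one, the hypothetical helper below (`diff_table`, not part of the script) takes the measure name as a parameter and works on a DataFrame that is already loaded. File reading and writing are left out, and the labels and values are invented.

```python
import pandas as pd


def diff_table(df, measure, sick_cond, control_cond):
    """Return sick-minus-control columns for one measure.

    measure is assumed to be 'flux', 'abundance' or 'flux_abundance',
    matching the '<measure>_sum_norm_<condition>_<metabolite>' headers.
    """
    sick_prefix = f"{measure}_sum_norm_{sick_cond}_"
    control_prefix = f"{measure}_sum_norm_{control_cond}_"
    df = df.fillna(0)  # empty cells count as zero, as in Parts 9A-9C
    out = {"taxon": df["taxon"]}
    for col in df.columns:
        if not col.startswith(sick_prefix):
            continue
        metabolite = col[len(sick_prefix):]
        control_col = control_prefix + metabolite
        if control_col in df.columns:
            new_name = f"{measure}_sum_norm_{sick_cond}_minus_{control_cond}_{metabolite}"
            out[new_name] = df[col] - df[control_col]
        else:
            print(f"Control column not found for: {col}")
    return pd.DataFrame(out)


# Invented example: one metabolite, one missing value in the sick condition
example = pd.DataFrame({
    "taxon": ["Akkermansia", "Alistipes"],
    "flux_sum_norm_sick_acetate": [3.0, None],
    "flux_sum_norm_ctrl_acetate": [1.0, 2.0],
})
result = diff_table(example, "flux", "sick", "ctrl")
print(result)
```

Calling the helper once per measure would yield the same three difference tables that the separate functions write to disk.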


# Script Search flux species

Finished!
