# Sequence Analysis and Functional Prediction Pipeline

## 1. Introduction
This notebook analyzes the functional and sequence relationships between newly identified bacteria and known corrosion-influencing microorganisms. The analysis builds upon previous findings where:
- Statistical significance was established between the selected bacteria and corrosion risk (Notebook 3)
- Literature validation confirmed corrosion influence for many bacteria (Notebook 4)
- Evolutionary relationships were mapped through phylogenetic analysis (Notebook 5)

The study focuses on bacteria from operational heating and cooling water systems, primarily in Germany. Using 16S rRNA data (bootstrap-validated from Notebook 5), this analysis employs PICRUSt2 to predict metabolic functions and compare functional profiles between different bacterial groups.

### Analysis Approaches
We implement two classification strategies:

1. Simple Classification:
   - Known corrosion-causing bacteria (usual_taxa)
   - Other bacteria (combining checked_taxa and core_taxa)

2. Detailed Classification:
   - Known corrosion-causing bacteria (usual_taxa)
   - Pure checked bacteria (exclusive to checked_taxa)
   - Pure core bacteria (exclusive to core_taxa)
   - Checked-core bacteria (overlap between checked and core taxa)

This detailed approach allows for more nuanced analysis of functional profiles and better understanding of potential corrosion mechanisms across different bacterial groups.

### Analysis Goals:
- Predict metabolic functions from 16S sequences
- Focus on corrosion-relevant pathways (sulfur/iron metabolism)
- Compare functional profiles between known corrosion-causing bacteria and newly identified candidates
- Validate whether statistical correlations reflect genuine metabolic capabilities associated with corrosion processes

### Directory Structure:
 Following is the structure of the notebook data named data_picrus  
data_tree  
 ├── sequences/  
 │   ├── known.fasta : sequences of known corrosion-causing bacteria  
 │   ├── candidate.fasta : sequences of potential new corrosion-causing bacteria  
 |   └── other files  
 data_picrus  
 └── picrust_results/  
      ├── known_bacteria/  
      |               ├── EC_predictions/       : enzyme predictions  
      |               ├── pathway_predictions/  : metabolic pathway abundance  
      |               ├── KO_predictions/       : KEGG ortholog predictions  
      |               └── other_picrust_files/  
      ├── candidate_bacteria/  
      |               ├── EC_predictions/       : enzyme predictions  
      |               ├── pathway_predictions/  : metabolic pathway abundance  
      |               ├── KO_predictions/       : KEGG ortholog predictions  
      |               └── other_picrust_files/  : final comparison summary 
      ├── core_bacteria/ 
      |               ├── EC_predictions/       : enzyme predictions  
      |               ├── pathway_predictions/  : metabolic pathway abundance  
      |               ├── KO_predictions/       : KEGG ortholog predictions  
      |               └── other_picrust_files/  
      │      
      └── functional_comparison.xlsx  

# 2. Loading and Preparing the Data

## 2.1 Imports, Directories, Loading and preparing the Abundance DataFrame
The abundance DataFrame (Integrated) was carefully prepared to meet PICRUSt2 input requirements, including proper taxonomic level organization and removal of unnamed or missing data. The sequence data is sourced directly from aligned_sequences_integrated.fasta, which contains the phylogenetically aligned sequences generated in notebook 5. This integration ensures consistency between abundance data and sequence information.

In [20]:
import pandas as pd
import numpy as np
from Bio import SeqIO, Entrez
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import subprocess
import os
from pathlib import Path
import ast
from io import StringIO
import openpyxl

In [21]:
# Directory Structure Definitions
SIMPLE_BASE = {
    'known': 'simple_known_mic',
    'other': 'simple_candidate_mic'
}

DETAILED_BASE = {
    'known': 'detailed_known_mic',
    'pure_checked': 'detailed_pure_checked_mic',
    'pure_core': 'detailed_pure_core_mic',
    'checked_core': 'detailed_checked_core_mic'
}

SUBDIRS = [
    'EC_predictions',
    'pathway_predictions', 
    'KO_predictions',
    'other_picrust_files'
]

# Base Paths
base_dir = Path("/home/beatriz/MIC/2_Micro/data_picrust")
aligned_file = Path("/home/beatriz/MIC/2_Micro/data_tree/aligned_sequences_integrate.fasta")
abundance_excel = Path("/home/beatriz/MIC/2_Micro/data_Ref/merged_to_sequence.xlsx")
results_file = base_dir / "functional_comparison.xlsx"

In [22]:
# Read fasta file
#aligned_sequences = list(SeqIO.parse(aligned_file, "fasta"))

In [23]:
# Integrated taxa from origin genus as headers with levels 6 for the genera, 7 for the GID, muss be cleaned
Integrated_T = pd.read_excel(abundance_excel, sheet_name='core_check_usual_taxa', header=[0,1,2,3,4,5,6,7])
# Drop first row (index 0) and first column in one chain
Integrated_T = Integrated_T.drop(index=0).drop(Integrated_T.columns[0], axis=1)
# Remove 'Unnamed' level names
Integrated_T.columns = Integrated_T.columns.map(lambda x: tuple('' if 'Unnamed' in str(level) else level for level in x))
# If the dataframe has Nan in sites it will replace it with Source
Integrated_T['Sites'] = Integrated_T['Sites'].fillna('Source')
# Fill the other index with nothing
Integrated_T =  Integrated_T.fillna(' ')
Integrated_T= Integrated_T.set_index("Sites")
Integrated = Integrated_T.T
Integrated.shape
# sources are  array([' ', 'chk-core', 'chk', 'chk-core-us', 'chk-us', 'core-us', 'core', 'us'], dtype=object)

(85, 71)

## 2.2. Sequence Identity to Clean up the Sequences
checking if the imput file comply with the requirements for picrust2 to run. First we make sure no to take away the identify of what makes a bacteria.

In [24]:
def analyze_sequence_identity(input_fasta):
    """
    Analyze diagnostic regions and sequence identity
    """
    sequences = {}
    current_header = ""
    
    print("Reading sequences...")
    with open(input_fasta) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                current_header = line
                sequences[current_header] = []
            elif line:
                sequences[current_header].append(line)
    
    # Join sequences
    for header in sequences:
        sequences[header] = ''.join(sequences[header])
    
    # Analyze variable regions
    seq_length = len(next(iter(sequences.values())))
    window_size = 50
    variability = []
    
    print("\nAnalyzing sequence variability...")
    for i in range(0, seq_length - window_size):
        window_sequences = [seq[i:i+window_size] for seq in sequences.values()]
        # Calculate variability excluding gaps
        bases_at_position = [set(seq[j] for seq in window_sequences if seq[j] != '-') 
                           for j in range(window_size)]
        variability.append(sum(len(bases) for bases in bases_at_position) / window_size)
    
    # Find highly variable regions (potential diagnostic regions)
    diagnostic_regions = []
    current_region = []
    
    for i, var in enumerate(variability):
        if var > 2.0:  # High variability threshold
            current_region.append(i)
        elif current_region:
            if len(current_region) >= 20:  # Minimum region size
                diagnostic_regions.append((min(current_region), max(current_region)))
            current_region = []
    
    print("\nDiagnostic regions found:")
    for start, end in diagnostic_regions:
        print(f"Region {start}-{end} (length: {end-start+1})")
        
    print("\nSequence identity summary:")
    print(f"Total sequences: {len(sequences)}")
    print(f"Diagnostic regions preserved: {len(diagnostic_regions)}")
    print(f"Average variability: {sum(variability)/len(variability):.2f}")

# Run the analysis
aligned_file = Path("/home/beatriz/MIC/2_Micro/data_tree/aligned_sequences_integrate.fasta")
analyze_sequence_identity(aligned_file)

Reading sequences...

Analyzing sequence variability...

Diagnostic regions found:
Region 2-149 (length: 148)
Region 249-572 (length: 324)
Region 622-776 (length: 155)
Region 806-884 (length: 79)
Region 934-1653 (length: 720)
Region 1662-1860 (length: 199)
Region 1866-2081 (length: 216)
Region 2128-2181 (length: 54)
Region 2211-2305 (length: 95)
Region 2344-2846 (length: 503)
Region 2878-3276 (length: 399)

Sequence identity summary:
Total sequences: 79
Diagnostic regions preserved: 11
Average variability: 2.93


Diagnostic Regions Found:

We found 11 distinct variable regions
The largest regions are:
|Region 934-1653 (720 bases)|Region 2344-2846 (503 bases)|Region 2878-3276 (399 bases)|Region 249-572 (324 bases)|
|-|-|-|-|

Sequence Coverage:
The regions span from position 2 to 3276, covering most of the original sequence length. The most variable regions are substantial in length (>100 bases) and we have good distribution of regions throughout the sequence.
Quality Indicators: Average variability of 2.93 indicates good sequence diversity. Having 11 diagnostic regions is excellent for species identification. The regions are well-distributed, not clustered
Based on this, the trimming strategy will focus on preserving these regions. Next function optimizes cleaning and trimmind around these diagnostic regions.

## 2.3. Optimising the Sequences by Trimming and Cleaning

The focus is to preserve the most informative diagnostic regions, maintain alignment within these regions. Care is taken on keeping the phylogenetic relationships intact so that the picrust analysis be of better quality, mantaining the biological significance.

In [25]:
def optimize_diagnostic_sequences(input_fasta, output_fasta):
    """
    Optimize sequences preserving key diagnostic regions
    """
    # Key diagnostic regions we want to preserve
    key_regions = [
        (249, 572),   # Large region 1
        (934, 1653),  # Largest region
        (2344, 2846)  # Large region 2
    ]
    
    sequences = {}
    current_header = ""
    
    print("Reading sequences...")
    with open(input_fasta) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                current_header = line
                sequences[current_header] = []
            elif line:
                sequences[current_header].append(line)
    
    # Join sequences
    for header in sequences:
        sequences[header] = ''.join(sequences[header])
    
    # Find optimal boundaries that include key regions
    start_pos = min(region[0] for region in key_regions)
    end_pos = max(region[1] for region in key_regions)
    
    print(f"\nOptimized boundaries:")
    print(f"Start: {start_pos}")
    print(f"End: {end_pos}")
    
    # Write optimized sequences
    print("\nWriting optimized sequences...")
    with open(output_fasta, 'w') as out:
        for header, seq in sequences.items():
            trimmed_seq = seq[start_pos:end_pos]
            non_gaps = sum(1 for c in trimmed_seq if c != '-')
            content_ratio = non_gaps / len(trimmed_seq)
            
            out.write(f"{header}\n")
            for i in range(0, len(trimmed_seq), 60):
                out.write(trimmed_seq[i:i+60] + '\n')
            
            print(f"Sequence {header.split()[0]} content ratio: {content_ratio:.2%}")
    
    print(f"\nProcessing complete:")
    print(f"Original length: {len(next(iter(sequences.values())))}")
    print(f"Optimized length: {end_pos - start_pos}")
    print(f"Sequences processed: {len(sequences)}")

# Run the optimization
aligned_file = Path("/home/beatriz/MIC/2_Micro/data_tree/aligned_sequences_integrate.fasta")
output_file = aligned_file.parent / "diagnostic_optimized_sequences.fasta"
optimize_diagnostic_sequences(aligned_file, output_file)

Reading sequences...

Optimized boundaries:
Start: 249
End: 2846

Writing optimized sequences...
Sequence >Nitrospira content ratio: 45.86%
Sequence >Oerskovia content ratio: 41.86%
Sequence >Propionivibrio content ratio: 43.86%
Sequence >Cutibacterium content ratio: 41.86%
Sequence >Silanimonas content ratio: 39.08%
Sequence >Opitutus content ratio: 50.29%
Sequence >Corynebacterium content ratio: 16.67%
Sequence >Treponema content ratio: 24.53%
Sequence >Phreatobacter content ratio: 34.73%
Sequence >Propionibacterium content ratio: 33.38%
Sequence >Bradyrhizobium content ratio: 41.20%
Sequence >Aestuariimicrobium content ratio: 32.38%
Sequence >Azospira content ratio: 34.69%
Sequence >Mycoplana content ratio: 37.31%
Sequence >Hydrogenophaga content ratio: 53.60%
Sequence >Mycobacterium content ratio: 18.68%
Sequence >Tepidimonas content ratio: 36.35%
Sequence >Blastomonas content ratio: 59.38%
Sequence >Paracoccus content ratio: 19.29%
Sequence >Phenylobacterium content ratio: 52.87%


There are high quality (>50%): Hydrogenophaga (53.60%), Blastomonas (59.38%), Phenylobacterium (52.87%), Afipia (57.91%), Neisseria (55.99%), Desulfovibrio (60.95%), Acetobacterium (57.87%), Bulleidia (51.75%). The moderate quality (35-50%): About 35 sequences, including Nitrospira, Oerskovia, most Proteobacteria. Also we found low quality (<25%): About 20 sequences, including Corynebacterium (16.67%), Treponema (24.53%), Variovorax (16.90%), Desulfobulbus (16.71%).
Regarding sequence Length, the original sequences have 3471 bases and by optimising they are left about 2597 bases. That makes a 75% of the original length, and this regions are quality diagnostic regions. Base on this realities two approach will be taken, run picrust2 on high >50% quality qusequences and second compare result s with low quality sequences. However this approach will sacrify some of the bacteria that may have no quality sequences but are relevant for out study.  Therefore it is important to check the quality quality distribution within our groups. We make consider to use different quality threshold so that we can barging on the results. 
## 2.4. Analysing the Quality of Sequences by Group

In [26]:
def process_integrated_data(df):
    """
    Process the integrated DataFrame to create a new DataFrame with clear column names
    and preserve all values including source information.
    
    Parameters:
    df (pandas.DataFrame): Input DataFrame with MultiIndex index and site columns
    
    Returns:
    pandas.DataFrame: Processed DataFrame with clear structure
    """
    # Create a copy of the DataFrame to avoid modifying the original
    processed_df = df.copy()
    
    # Extract genera and GIDs from the index MultiIndex
    genera = df.index.get_level_values(6)[1:]  # Skip first row
    gids = pd.to_numeric(df.index.get_level_values(7)[1:], errors='coerce')
    
    # Create a new DataFrame with the extracted information
    result_df = pd.DataFrame({
        'Genus': genera,
        'GID': gids
    })
    
    # Add the site values from the original DataFrame
    for col in df.columns:
        result_df[col] = df.iloc[1:][col].values
    
    # Clean up the DataFrame
    result_df['GID'] = pd.to_numeric(result_df['GID'], errors='coerce')
    result_df = result_df.dropna(subset=['GID'])
    result_df['GID'] = result_df['GID'].astype(int)
    
    return result_df

def get_taxa_groups(df):
    """
    Separate the processed DataFrame into different taxa groups based on Source column
    
    Parameters:
    df (pandas.DataFrame): Processed DataFrame from process_integrated_data()
    
    Returns:
    dict: Dictionary containing DataFrames for different taxa groups
    """
    # Split the data into groups based on 'Source' column patterns
    
    # Known corrosion bacteria (any pattern with 'us')
    known_bacteria = df[df['Source'].str.contains('us', case=False, na=False)]
    
    # Pure checked bacteria (only 'chk' without 'core' or 'us')
    pure_checked = df[
        df['Source'].str.contains('chk', case=False, na=False) & 
        ~df['Source'].str.contains('core|us', case=False, na=False)
    ]
    
    # Pure core bacteria (only 'core' without 'chk' or 'us')
    pure_core = df[
        df['Source'].str.contains('core', case=False, na=False) & 
        ~df['Source'].str.contains('chk|us', case=False, na=False)
    ]
    
    # Checked-core bacteria (contains both 'core' and 'chk' but no 'us')
    checked_core = df[
        df['Source'].str.contains('chk.*core|core.*chk', case=False, na=False) & 
        ~df['Source'].str.contains('us', case=False, na=False)
    ]

    # Create groups dictionary
    taxa_groups = {
        'known_bacteria': known_bacteria,
        'pure_checked': pure_checked,
        'pure_core': pure_core,
        'checked_core': checked_core
    }
    
    # Print summary statistics
    print("\nDetailed Classification Results:")
    print(f"Known corrosion bacteria: {len(known_bacteria)}")
    print(f"Pure checked bacteria: {len(pure_checked)}")
    print(f"Pure core bacteria: {len(pure_core)}")
    print(f"Checked-core bacteria: {len(checked_core)}")
    
    # Verify total matches expected
    total_classified = len(known_bacteria) + len(pure_checked) + len(pure_core) + len(checked_core)
    print(f"\nTotal classified taxa: {total_classified}")
    print(f"Total in dataset: {len(df)}")
    
    return taxa_groups

# Usage example:
processed_df = process_integrated_data(Integrated)
# Get the groups
taxa_groups = get_taxa_groups(processed_df)

# Access individual groups - 
known_bacteria = taxa_groups['known_bacteria']    
pure_core = taxa_groups['pure_core']             
pure_checked = taxa_groups['pure_checked']        
checked_core = taxa_groups['checked_core']        


Detailed Classification Results:
Known corrosion bacteria: 17
Pure checked bacteria: 19
Pure core bacteria: 45
Checked-core bacteria: 3

Total classified taxa: 84
Total in dataset: 84


In [27]:
def integrate_abundance_and_sequences(sequence_fasta, abundance_df, output_file):
    """
    Integrate sequence quality metrics with abundance data
    
    Args:
        sequence_fasta: Path to FASTA file with sequences
        abundance_df: DataFrame with abundance data (Genus, GID, sites, Source)
        output_file: Path to save integrated analysis
    """
    # Read sequences and calculate quality metrics
    sequences = {}
    current_header = ""
    
    print("Reading sequences...")
    with open(sequence_fasta) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                current_header = line.split()[0][1:]  # Get genus name without '>'
                sequences[current_header] = []
            elif line:
                sequences[current_header].append(line)
    
    # Calculate sequence metrics
    sequence_metrics = {}
    for genus, seq_lines in sequences.items():
        seq = ''.join(seq_lines)
        non_gaps = sum(1 for c in seq if c != '-')
        metrics = {
            'sequence_length': len(seq),
            'non_gap_bases': non_gaps,
            'quality_ratio': non_gaps / len(seq),
        }
        sequence_metrics[genus] = metrics
    
    # Create integrated DataFrame
    metrics_df = pd.DataFrame.from_dict(sequence_metrics, orient='index')
    
    # Merge with abundance data
    integrated_df = abundance_df.merge(
        metrics_df, 
        left_on='Genus', 
        right_index=True, 
        how='left'
    )
    
    # Group analysis by Source
    print("\nAnalyzing sequence quality by source group:")
    source_analysis = integrated_df.groupby('Source').agg({
        'quality_ratio': ['mean', 'min', 'max', 'count'],
        'Genus': 'count'
    }).round(3)
    
    print("\nSource group summary:")
    print(source_analysis)
    
    # Identify potential issues
    print("\nPotential quality issues by source:")
    for source in integrated_df['Source'].unique():
        group_df = integrated_df[integrated_df['Source'] == source]
        low_quality = group_df[group_df['quality_ratio'] < 0.25]
        if not low_quality.empty:
            print(f"\n{source}:")
            for _, row in low_quality.iterrows():
                print(f"  {row['Genus']}: {row['quality_ratio']:.2%}")
    
    # Save integrated analysis
    integrated_df.to_excel(output_file, index=False)
    print(f"\nIntegrated analysis saved to: {output_file}")
    
    return integrated_df

In [28]:
# Analyze each group separately
aligned_file = Path("/home/beatriz/MIC/2_Micro/data_tree/diagnostic_optimized_sequences.fasta")

# For known bacteria
known_output = aligned_file.parent / "known_bacteria_quality.xlsx"
known_quality = integrate_abundance_and_sequences(str(aligned_file), known_bacteria,  known_output)

# For pure core
core_output = aligned_file.parent / "pure_core_quality.xlsx"
core_quality = integrate_abundance_and_sequences(str(aligned_file),  pure_core, core_output)
    
# For pure checked
checked_output = aligned_file.parent / "pure_checked_quality.xlsx"
checked_quality = integrate_abundance_and_sequences(str(aligned_file), pure_checked, checked_output)

# For checked-core
checked_core_output = aligned_file.parent / "checked_core_quality.xlsx"
checked_core_quality = integrate_abundance_and_sequences(str(aligned_file), checked_core, checked_core_output)

Reading sequences...

Analyzing sequence quality by source group:

Source group summary:
            quality_ratio                     Genus
                     mean    min    max count count
Source                                             
chk-core-us         0.308  0.167  0.427     4     4
chk-us              0.422  0.422  0.422     1     1
core-us             0.427  0.196  0.610     8     8
us                  0.286  0.167  0.424     4     4

Potential quality issues by source:

chk-core-us:
  Clostridium: 22.91%
  Corynebacterium: 16.67%

core-us:
  Desulfotomaculum: 19.56%

us:
  Desulfobulbus: 16.71%
  Gallionella: 22.60%

Integrated analysis saved to: /home/beatriz/MIC/2_Micro/data_tree/known_bacteria_quality.xlsx
Reading sequences...

Analyzing sequence quality by source group:

Source group summary:
       quality_ratio                     Genus
                mean    min    max count count
Source                                        
core           0.356  0.169  0.594 

In [29]:
def verify_cleaned_sequences(fasta_file):
    """
    Verify the quality of cleaned sequences
    """
    sequences = {}
    current_header = ""
    
    print("Analyzing cleaned sequences...")
    with open(fasta_file) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                current_header = line
                sequences[current_header] = []
            elif line:
                sequences[current_header].append(line)
    
    # Join sequences and analyze
    for header in sequences:
        sequences[header] = ''.join(sequences[header])
    
    # Calculate statistics
    lengths = []
    base_counts = []
    
    for header, seq in sequences.items():
        lengths.append(len(seq))
        base_counts.append(sum(1 for c in seq if c != '-'))
    
    print(f"\nSequence Statistics:")
    print(f"Total sequences: {len(sequences)}")
    print(f"Sequence length: {lengths[0]} (all sequences same length)")
    print(f"Average non-gap bases: {sum(base_counts)/len(base_counts):.1f}")
    print(f"Min non-gap bases: {min(base_counts)}")
    print(f"Max non-gap bases: {max(base_counts)}")

# Verify the cleaned sequences
output_file = aligned_file.parent / "picrust_ready_sequences.fasta"
verify_cleaned_sequences(output_file)

Analyzing cleaned sequences...

Sequence Statistics:
Total sequences: 79
Sequence length: 890 (all sequences same length)
Average non-gap bases: 314.0
Min non-gap bases: 96
Max non-gap bases: 588


## 2.3. Integrating data from Sequences and Abundances
Two distinct classification approaches are implemented to categorize bacteria. The simple approach (get_bacteria_sources_simple) divides bacteria into known corrosion-causers (usual_taxa) and candidates (all others). The detailed approach (get_bacteria_sources_detailed) provides finer categorization by separating bacteria into known corrosion-causers, pure checked taxa, pure core taxa, and those present in both checked and core datasets.

In [30]:
def get_bacteria_sources_simple(df):
    """
    Simple classification:
    1. Known (anything with 'us')
    2. All others (combined chk, core, chk-core)
    """
    # Get genera and gids from column levels 6 and 7
    genera = df.index.get_level_values(6)[1:]
    gids = df.index.get_level_values(7)[1:]
    # Look for Source in the data, not index
    sources = df['Source'] if 'Source' in df.columns else None
    
    known_bacteria = {}     # usual_taxa
    other_bacteria = {}     # everything else
    
    sources_found = set()
    source ={}
    patterns = ['us', 'core-us', 'chk-us', 'chk-core-us']
    
    for i, (genus, gid) in enumerate (zip(genera, gids)):
        if source is not None:  # Check if source exists for this genus
            source = str(sources.iloc[i]).strip().lower()
            sources_found.add(source)
            
            if source in patterns:
                known_bacteria[genus] = int(gid) if str(gid).isdigit() else gid
            else:
                other_bacteria[genus] = int(gid) if str(gid).isdigit() else gid
                    
    print("\nSimple Classification Results:")
    print(f"Known corrosion bacteria: {len(known_bacteria)}")
    print(f"Other bacteria: {len(other_bacteria)}")
    print("\nSources found:", sources_found)
    
    return {
        'known_bacteria': known_bacteria,
        'other_bacteria': other_bacteria
    }

def get_bacteria_sources_detailed(df):
    """
    Detailed classification with all possible combinations:
    1. Known (usual_taxa)
    2. Pure checked (only 'chk')
    3. Pure core (only 'core')
    4. Checked-core (overlap 'chk-core')
    """
    # Get genera and gids from column levels 6 and 7
    genera = df.index.get_level_values(6)[1:]
    gids = df.index.get_level_values(7)[1:]

    sources = df['Source'] if 'Source' in df.columns else None
    
    known_bacteria = {}      # usual_taxa
    pure_checked = {}        # only 'chk' checked_taxa
    pure_core = {}          # only 'core' core_taxa
    checked_core = {}       # 'chk-core' checked and core taxa
    source ={}
    sources_found = set()
    patterns = ['us', 'core-us', 'chk-us', 'chk-core-us']
    
    for i, (genus, gid) in enumerate (zip(genera, gids)):
        if source is not None:  # Check if source exists for this genus
            source = str(sources.iloc[i]).strip().lower()
            sources_found.add(source)
            
            if source in patterns:
                known_bacteria[genus] = int(gid) if str(gid).isdigit() else gid
                continue
                    
            # Then handle other combinations
            if source == 'chk':
                pure_checked[genus] = gid
            elif source == 'core':
                pure_core[genus] = gid
            elif 'chk-core' in source:
                checked_core[genus] = gid
    
    print("\nDetailed Classification Results:")
    print(f"Known corrosion bacteria: {len(known_bacteria)}")
    print(f"Pure checked bacteria: {len(pure_checked)}")
    print(f"Pure core bacteria: {len(pure_core)}")
    print(f"Checked-core bacteria: {len(checked_core)}")
    print("\nSources found:", sources_found)
    
    return {
        'known_bacteria': known_bacteria,
        'pure_checked': pure_checked,
        'pure_core': pure_core,
        'checked_core': checked_core
    }

In [31]:
Groups = get_bacteria_sources_detailed(Integrated)


Detailed Classification Results:
Known corrosion bacteria: 16
Pure checked bacteria: 19
Pure core bacteria: 45
Checked-core bacteria: 3

Sources found: {'', 'core', 'chk-core-us', 'chk-core', 'core-us', 'chk-us', 'us', 'chk'}


## 2.4. Preparing Data for PICRUSt2 Input
The prepare_picrust_data function processes the integrated data and handles data quality control. Known problematic genera (e.g., 'Clostridium_sensu_stricto_12', 'Oxalobacteraceae_unclassified') are flagged for exclusion to prevent analysis errors. The function also creates an organized directory structure as outlined in the introduction, with separate paths for different bacterial classifications (known_mic, candidate_mic, etc.) and their respective analysis outputs (EC_predictions, pathway_predictions, KO_predictions).

In [32]:
def prepare_picrust_data(df, aligned_file, function_type='simple'):
    """
    Prepare data for PICRUSt analysis with choice of  function_type method
    
    Args:
        df: Input DataFrame
        aligned_file: Path to aligned sequences
        function_type: 'simple' or 'detailed'
    """
    # Get bacteria source_groups based on chosen  function_type
    if  function_type == 'simple':
        source_groups = get_bacteria_sources_simple(df)
    else:
        source_groups= get_bacteria_sources_detailed(df)
    
    # Remove missing genera if needed
    missing_genera = {'Clostridium_sensu_stricto_12', 'Oxalobacteraceae_unclassified',
                   'Psb-m-3', 'Ruminiclostridium_1', 'Wchb1-05'}
    
    print("\nMissing genera that will be excluded:")
    for genus in missing_genera:
        print(f"- {genus}")
    
    # Create appropriate directory structure
    create_directory_structure(function_type)
    
    return source_groups

def create_directory_structure(function_type='simple'):
    """Create directory structure for PICRUSt analysis"""
    base_dir = Path("/home/beatriz/MIC/2_Micro/data_picrust")
    
    if function_type == 'simple':
        directories = SIMPLE_BASE
    else:
        directories = DETAILED_BASE
    
    # Create all required directories
    for dir_name in directories.values():
        for subdir in SUBDIRS:
            (base_dir / dir_name / subdir).mkdir(parents=True, exist_ok=True)

# 3. Analysis of Pathways
The analysis focuses on metabolic pathways known to be involved in microbially influenced corrosion, including sulfur metabolism, organic acid production, iron metabolism, and biofilm formation. These pathways were selected based on documented mechanisms of known corrosion-inducing bacteria. Separate pipeline runs for simple and detailed classifications ensure proper pathway analysis for each bacterial group.

In [33]:
def analyze_functional_profiles(picrust_output_dir, bacteria_list):
    """
    Analyze functional profiles with focus on corrosion-relevant pathways
    
    Parameters:
    picrust_output_dir: directory containing PICRUSt2 output
    bacteria_list: list of bacteria names to analyze
    """
    # Define corrosion-relevant pathways
    relevant_pathways = [
        'Sulfur metabolism',
        'Iron metabolism', 
        'Energy metabolism',
        'Biofilm formation',
        'Metal transport',
        'ochre formation',
        'iron oxide deposits',
        'iron precipitation',
        'rust formation',
        'organic acid production',
        'acetate production',
        'lactate metabolism',
        'formate production',
    ]
    
    try:
        # Read PICRUSt2 output
        pathway_file = os.path.join(picrust_output_dir, 'pathways_with_descriptions.tsv')
        pathways_df = pd.read_csv(pathway_file, sep='\t')
        
        # Filter for relevant pathways
        filtered_pathways = pathways_df[
            pathways_df['description'].str.contains('|'.join(relevant_pathways), 
                                                  case=False, 
                                                  na=False)
        ]
        
        # Calculate pathway abundances per bacteria
        pathway_abundances = filtered_pathways.groupby('description').sum()
        
        # Calculate pathway similarities between bacteria
        pathway_similarities = {}
        for bacteria in bacteria_list:
            if bacteria in pathways_df.columns:
                similarities = pathways_df[bacteria].corr(pathways_df[list(bacteria_list)])
                pathway_similarities[bacteria] = similarities
        
        # Predict functional potential
        functional_predictions = {}
        for pathway in relevant_pathways:
            pathway_presence = filtered_pathways[
                filtered_pathways['description'].str.contains(pathway, case=False)
            ]
            if not pathway_presence.empty:
                functional_predictions[pathway] = {
                    'presence': len(pathway_presence),
                    'mean_abundance': pathway_presence.mean().mean(),
                    'max_abundance': pathway_presence.max().max()
                }
        
        # Calculate correlation scores
        correlation_scores = {}
        for bacteria in bacteria_list:
            if bacteria in pathways_df.columns:
                correlations = pathways_df[bacteria].corr(
                    pathways_df[filtered_pathways.index]
                )
                correlation_scores[bacteria] = {
                    'mean_correlation': correlations.mean(),
                    'max_correlation': correlations.max(),
                    'key_pathways': correlations.nlargest(5).index.tolist()
                }
        
        comparison_results = {
            'pathway_similarities': pathway_similarities,
            'functional_predictions': functional_predictions,
            'correlation_scores': correlation_scores,
            'pathway_abundances': pathway_abundances.to_dict()
        }
        
        return filtered_pathways, comparison_results
        
    except Exception as e:
        print(f"Error in pathway analysis: {str(e)}")
        return None, None

# 4. PICRUSt Pipeline Definition
The pipeline processes the aligned sequence data from notebook 5, which has been integrated with abundance information. Using either simple or detailed classification approaches, it queries the PICRUSt database to predict potential metabolic pathways for each genus. This prediction is based on evolutionary relationships and known genomic capabilities of related organisms.

In [34]:
def run_picrust2_pipeline(fasta_file, output_dir):
    """
    Run PICRUSt2 pipeline on the input sequences
    
    Args:
        fasta_file: Path to aligned sequences fasta file
        output_dir: Directory for PICRUSt2 output
    """
    try:
        # Run main PICRUSt2 pipeline 
        cmd = [
            'picrust2_pipeline.py',
            '-s', fasta_file, # Input sequences
            '-i', fasta_file, # Use same file for -i since we have aligned seqs
            '-o', output_dir,
            '--processes', '1',
            '--verbose'
        ]
        subprocess.run(cmd, check=True)
        
        # Add pathway descriptions if the pathway file exists
        pathway_file = os.path.join(output_dir, 'pathways_out/path_abun_unstrat.tsv.gz')
        if os.path.exists(pathway_file):
            cmd_desc = [
                'add_descriptions.py',
                '-i', pathway_file,
                '-m', 'PATHWAY',
                '-o', os.path.join(output_dir, 'pathways_with_descriptions.tsv')
            ]
            subprocess.run(cmd_desc, check=True)
            
        print(f"PICRUSt2 pipeline completed successfully for {output_dir}")
        return True
        
    except subprocess.CalledProcessError as e:
        print(f"Error running PICRUSt2: {e}")
        return False

# 5. Functional Analysis
The analysis workflow begins by categorizing bacteria into source groups using the classification functions. These categorized data are then processed through the PICRUSt pipeline to predict metabolic capabilities. The functional analysis examines pathway presence, abundance, and correlations between different bacterial groups to identify potential corrosion-related metabolic patterns.

In [35]:
def run_functional_analysis(df, aligned_file, analysis_type='simple'):
    """
    Run complete functional analysis pipeline for either simple or detailed classification
    
    Parameters:
    df: Input DataFrame
    aligned_file: Path to aligned sequences file
    analysis_type: 'simple' or 'detailed'
    """
    try:
        print(f"\n{'='*50}")
        print(f"Starting {analysis_type} classification analysis")
        print(f"{'='*50}")
        
        # Prepare data and get source groups
        print("\nStep 1: Preparing data...")

        source_groups = prepare_picrust_data(df, aligned_file, function_type=analysis_type)
    
        if not source_groups:
            raise ValueError("Failed to prepare data: No source groups returned")
        
        # Base directory for PICRUSt output
        base_dir = Path("/home/beatriz/MIC/2_Micro/data_picrust")

        results = {}
        
        if analysis_type == 'simple':
            # Run analysis for simple classification
            # Known bacteria
            known_output_dir = base_dir /SIMPLE_BASE['known']
            success_known = run_picrust2_pipeline(aligned_file, str(known_output_dir))
            if success_known:
                results_known = analyze_functional_profiles(str(known_output_dir), 
                                                        source_groups['known_bacteria'].keys())
            
            # Other bacteria
            other_output_dir = base_dir / SIMPLE_BASE['other']
            success_other = run_picrust2_pipeline(aligned_file, str(other_output_dir))
            if success_other:
                results_other = analyze_functional_profiles(str(other_output_dir), 
                                                        source_groups['other_bacteria'].keys())
                
        else:
            # Run analysis for detailed classification
            for group, dir_name in DETAILED_BASE.items():

                # Known bacteria
                known_output_dir = base_dir / DETAILED_BASE['known']
                success_known = run_picrust2_pipeline(aligned_file, str(known_output_dir))
                if success_known:
                    results_known = analyze_functional_profiles(str(known_output_dir), 
                                                            source_groups['known_bacteria'].keys())
            
                # Pure checked bacteria
                checked_output_dir = base_dir /  DETAILED_BASE['pure_checked']
                success_checked = run_picrust2_pipeline(aligned_file, str(checked_output_dir))
                if success_checked:
                    results_checked = analyze_functional_profiles(str(checked_output_dir), 
                                                            source_groups['pure_checked'].keys())
                
                # Pure core bacteria
                core_output_dir = base_dir /DETAILED_BASE['pure_core']
                success_core = run_picrust2_pipeline(aligned_file, str(core_output_dir))
                if success_core:
                    results_core = analyze_functional_profiles(str(core_output_dir), 
                                                            source_groups['pure_core'].keys())
                
                # Checked-core bacteria
                checked_core_output_dir = base_dir /DETAILED_BASE['checked_core']
                success_checked_core = run_picrust2_pipeline(aligned_file, str(checked_core_output_dir))
                if success_checked_core:
                    results_checked_core = analyze_functional_profiles(str(checked_core_output_dir), 
                                                                    source_groups['checked_core'].keys())
    except subprocess.CalledProcessError as e:
        print(f"Error running PICRUSt2: {e}")

        return "Analysis completed successfully"


In [43]:
# Read fasta file
aligned_sequences = list(SeqIO.parse("/home/beatriz/MIC/2_Micro/data_tree/diagnostic_optimized_sequences.fasta", "fasta"))

diagnostic_optimized_sequences.fasta, picrust_ready_sequences.fasta

In [44]:
# Run the analysis for both types
# Simple source classification
simple_results = run_functional_analysis(Integrated, aligned_sequences, analysis_type='simple')

# Detailed source classification
detailed_results = run_functional_analysis(Integrated, aligned_sequences, analysis_type='detailed')


Starting simple classification analysis

Step 1: Preparing data...

Simple Classification Results:
Known corrosion bacteria: 16
Other bacteria: 68

Sources found: {'', 'core', 'chk-core-us', 'chk-core', 'core-us', 'chk-us', 'us', 'chk'}

Missing genera that will be excluded:
- Psb-m-3
- Wchb1-05
- Ruminiclostridium_1
- Oxalobacteraceae_unclassified
- Clostridium_sensu_stricto_12


TypeError: expected str, bytes or os.PathLike object, not list

# 6. Findings and Discusion