# Sequence Analysis and Functional Prediction Pipeline

This notebook analyzes the functional and sequence relationships between newly identified bacteria and known corrosion-influencing microorganisms. The analysis builds on previous findings:
- Statistical significance was established between the selected bacteria and corrosion risk (Notebook 3).
- Most of the bacteria have previously been reported as influencing corrosion, as seen in the literature search (Notebook 4).
- The evolutionary relationships of the candidates to be assigned as MIC were mapped through phylogenetic analysis (Notebook 5).

The study focuses on bacteria from operational heating and cooling water systems, primarily in Germany. Using 16S rRNA data (bootstrap-validated in Notebook 5), this analysis employs PICRUSt2 to:
- Predict metabolic functions from 16S sequences.
- Focus on pathways relevant to corrosion, such as sulfur and iron metabolism.
- Compare functional profiles between the known corrosion-causing bacteria on the selected list (validated through literature) and the newly identified candidates showing statistical correlation with corrosion.

This functional comparison aims to test whether the statistical correlations reflect genuine metabolic capabilities associated with corrosion processes.

In [1]:
import pandas as pd
import numpy as np
from Bio import SeqIO, Entrez
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import subprocess
import os
from pathlib import Path

The notebook data is organized in two directories, data_tree and data_picrus:  
data_tree  
 └── sequences/  
     ├── known.fasta : sequences of known corrosion-causing bacteria  
     ├── candidate.fasta : sequences of potential new corrosion-causing bacteria  
     └── other files  
data_picrus  
 └── picrust_results/  
      ├── known_bacteria/  
      │               ├── EC_predictions/       : enzyme predictions  
      │               ├── pathway_predictions/  : metabolic pathway abundance  
      │               ├── KO_predictions/       : KEGG ortholog predictions  
      │               └── other_picrust_files/  
      ├── candidate_bacteria/  
      │               ├── EC_predictions/       : enzyme predictions  
      │               ├── pathway_predictions/  : metabolic pathway abundance  
      │               ├── KO_predictions/       : KEGG ortholog predictions  
      │               └── other_picrust_files/  
      │  
      └── functional_comparison.xlsx : final comparison summary  

In [2]:
# For VSCode
base_dir = Path("/home/beatriz/MIC/2_Micro/data_picrus")
known_dir = base_dir / "known_bacteria"
candidate_dir = base_dir / "candidate_bacteria"
results_file = base_dir / "functional_comparison.xlsx" 

In [3]:
# Read aligned sequences
aligned_file = Path("/home/beatriz/MIC/2_Micro/data_tree/aligned_sequences.fasta")

# Define known corrosive bacteria
# (names must match the FASTA record IDs exactly, so no stray whitespace)
known_bacteria = ['Aquamicrobium', 'Azospira', 'Brachybacterium', 'Brevibacterium', 'Cellulosimicrobium', 'Clavibacter',
                  'Clostridium', 'Cohnella', 'Corynebacterium', 'Enterococcus', 'Halomonas', 'Legionella', 'Methyloversatilis',
                  'Mycobacterium', 'Neisseria', 'Novosphingobium', 'Opitutus', 'Paracoccus', 'Prevotella', 'Psb-m-3', 'Pseudarthrobacter',
                  'Pseudoalteromonas', 'Roseateles', 'Streptococcus', 'Thiobacillus']

# Split sequences
known_seqs = []
candidate_seqs = []

for record in SeqIO.parse(aligned_file, "fasta"):
    if record.id in known_bacteria:
        known_seqs.append(record)
    else:
        candidate_seqs.append(record)

# Save split files
SeqIO.write(known_seqs, "data_picrus/known.fasta", "fasta")
SeqIO.write(candidate_seqs, "data_picrus/candidate.fasta", "fasta")

6
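
The records split above come from an alignment, while the `-s` option of PICRUSt2 is documented as taking unaligned study sequences. A minimal sketch of removing gap characters before the FASTA files are written, assuming gaps are encoded as `-` or `.`. The helper names `strip_gaps` and `ungap_records` are hypothetical, and they work on plain strings, so with Biopython they would be applied to `str(record.seq)`:

```python
def strip_gaps(aligned_seq, gap_chars="-."):
    """Return the sequence with alignment gap characters removed."""
    return "".join(base for base in aligned_seq if base not in gap_chars)


def ungap_records(records):
    """Map (identifier, aligned sequence) pairs to (identifier, ungapped sequence)."""
    return [(seq_id, strip_gaps(seq)) for seq_id, seq in records]


# Toy aligned records; the sequences are made up for illustration.
example = [("Thiobacillus", "AC-GT--A"), ("Candidate_1", "..ACGTTA")]
ungapped = ungap_records(example)
print(ungapped)
```

Keeping the helper string-based leaves the choice of FASTA reader open; the record IDs are passed through unchanged so the known/candidate split still applies.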

In [4]:
def prepare_sequences_for_picrust(sequences, output_dir):
    """
    Prepare sequences for PICRUSt2 analysis
    
    Parameters:
    sequences: list of SeqRecord objects or path to FASTA file
    output_dir: directory to save prepared files
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Save sequences in FASTA format if they're not already in a file
    if isinstance(sequences, list):
        output_fasta = os.path.join(output_dir, 'sequences.fasta')
        SeqIO.write(sequences, output_fasta, 'fasta')
    else:
        output_fasta = sequences
    
    return output_fasta

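def run_picrust2_pipeline(fasta_file, output_dir, table_file=None):
    """
    Run PICRUSt2 analysis pipeline
    
    Parameters:
    fasta_file: path to input FASTA file
    output_dir: directory for PICRUSt2 output
    table_file: path to the sequence abundance table passed as -i
                (listed as required in the picrust2_pipeline.py usage message)
    """
    try:
        # Run PICRUSt2 pipeline
        cmd = [
            'picrust2_pipeline.py',
            '-s', str(fasta_file),
            '-o', str(output_dir),
            '--processes', '1',  # Adjust based on available CPU
            '--verbose'
        ]
        # Without -i the command exits with status 2 (missing required argument)
        if table_file is not None:
            cmd += ['-i', str(table_file)]
        subprocess.run(cmd, check=True)
        
        # Add pathway descriptions (METACYC is the mapping type for pathway abundances)
        pathway_file = os.path.join(output_dir, 'pathways_out/path_abun_unstrat.tsv.gz')
        if os.path.exists(pathway_file):
            cmd_desc = [
                'add_descriptions.py',
                '-i', pathway_file,
                '-m', 'METACYC',
                '-o', os.path.join(output_dir, 'pathways_with_descriptions.tsv')
            ]
            subprocess.run(cmd_desc, check=True)
            
        return True
    except (subprocess.CalledProcessError, FileNotFoundError) as e:
        # FileNotFoundError is raised when the PICRUSt2 scripts are not on PATH
        print(f"Error running PICRUSt2: {e}")
        return False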

def analyze_functional_profiles(picrust_output_dir, known_corrosive_bacteria):
    """
    Analyze functional profiles to compare with known corrosive bacteria
    
    Parameters:
    picrust_output_dir: directory containing PICRUSt2 output
    known_corrosive_bacteria: list of known corrosive bacteria names
    """
    # Read PICRUSt2 output
    pathway_file = os.path.join(picrust_output_dir, 'pathways_with_descriptions.tsv')
    pathways_df = pd.read_csv(pathway_file, sep='\t')
    
    # Focus on relevant pathways
    relevant_pathways = [
        'Sulfur metabolism',
        'Iron metabolism',
        'Energy metabolism',
        'Biofilm formation',
        'Metal transport'
    ]
    
    # Keep pathways whose description matches any of the relevant terms
    pattern = '|'.join(relevant_pathways)
    matches = pathways_df['description'].str.contains(pattern, case=False, na=False)
    filtered_pathways = pathways_df[matches]

    # Placeholder for the known-vs-candidate comparison; entries are not yet populated
    comparison_results = {
        'pathway_similarities': {},
        'functional_predictions': {},
        'correlation_scores': {}
    }
    
    return filtered_pathways, comparison_results

def main_analysis_pipeline(input_sequences, output_dir, known_corrosive_bacteria):
    """
    Main pipeline for functional analysis
    """
    # Prepare sequences
    fasta_file = prepare_sequences_for_picrust(input_sequences, output_dir)
    
    # Run PICRUSt2
    success = run_picrust2_pipeline(fasta_file, output_dir)
    if not success:
        return None
    
    # Analyze results
    pathways, results = analyze_functional_profiles(output_dir, known_corrosive_bacteria)
    
    # Save results
    timestamp = pd.Timestamp.now().strftime('%Y%m%d_%H%M')
    results_file = os.path.join(output_dir, f'functional_analysis_{timestamp}.xlsx')
    
    with pd.ExcelWriter(results_file) as writer:
        pathways.to_excel(writer, sheet_name='Pathway_Analysis', index=False)
        pd.DataFrame(results['pathway_similarities']).to_excel(writer, sheet_name='Similarities')
        pd.DataFrame(results['functional_predictions']).to_excel(writer, sheet_name='Predictions')
    
    return results_file
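
The `comparison_results` dictionaries in `analyze_functional_profiles` are returned empty. One way the `pathway_similarities` entry could be filled is sketched here: cosine similarity between each candidate's pathway-abundance vector and the mean vector of the known group. The function names, the choice of cosine similarity, and the toy abundances are all assumptions for illustration, not the notebook's established method:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length abundance vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_profile(profiles):
    """Element-wise mean across a list of equal-length abundance vectors."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]


def compare_to_known(known_profiles, candidate_profiles):
    """Score each candidate against the mean profile of the known group."""
    reference = mean_profile(list(known_profiles.values()))
    return {name: cosine_similarity(profile, reference)
            for name, profile in candidate_profiles.items()}


# Hypothetical abundances for three pathways, one vector per genus.
known = {"Thiobacillus": [10.0, 0.0, 5.0], "Clostridium": [8.0, 2.0, 5.0]}
candidates = {"Candidate_A": [9.0, 1.0, 5.0], "Candidate_B": [0.0, 7.0, 0.0]}
scores = compare_to_known(known, candidates)
print(scores)
```

A score near 1 means the candidate's predicted pathway profile has the same shape as the known group's average; the measure ignores total abundance, which may or may not be desirable for predicted profiles.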

3. Calling the Function

In [5]:
input_seqs = "/home/beatriz/MIC/2_Micro/data_tree/aligned_sequences.fasta"
output_directory = "/home/beatriz/MIC/2_Micro/data_picrus"

results = main_analysis_pipeline(input_seqs, output_directory, known_bacteria)

usage: picrust2_pipeline.py [-h] -s PATH -i PATH -o PATH [-p PROCESSES]
                            [-t epa-ng|sepp] [-r PATH] [--in_traits IN_TRAITS]
                            [--custom_trait_tables PATH]
                            [--marker_gene_table PATH] [--pathway_map MAP]
                            [--reaction_func MAP] [--no_pathways]
                            [--regroup_map ID_MAP] [--no_regroup]
                            [--stratified] [--max_nsti FLOAT]
                            [--min_reads INT] [--min_samples INT]
                            [-m {mp,emp_prob,pic,scp,subtree_average}]
                            [-e EDGE_EXPONENT] [--min_align MIN_ALIGN]
                            [--skip_nsti] [--skip_minpath] [--no_gap_fill]
                            [--coverage] [--per_sequence_contrib]
                            [--wide_table] [--skip_norm]
                            [--remove_intermediate] [--verbose] [-v]
picrust2_pipeline.py: error: the following argum

Error running PICRUSt2: Command '['picrust2_pipeline.py', '-s', '/home/beatriz/MIC/2_Micro/data_tree/aligned_sequences.fasta', '-o', '/home/beatriz/MIC/2_Micro/data_picrus', '--processes', '1', '--verbose']' returned non-zero exit status 2.
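
The usage message above lists `-i PATH` among the required arguments, and the command that failed did not pass it. A hedged sketch of how a minimal abundance table and a complete argument list could be assembled follows. The table layout (one pseudo-sample column per group, count 1 for each member sequence), the helper names, and the file names are assumptions; the table format PICRUSt2 accepts should be checked against its documentation:

```python
import csv
import shutil


def write_abundance_table(path, known_ids, candidate_ids):
    """Write a TSV with one row per sequence and one pseudo-sample column per group."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sequence", "known", "candidate"])
        for seq_id in known_ids:
            writer.writerow([seq_id, 1, 0])
        for seq_id in candidate_ids:
            writer.writerow([seq_id, 0, 1])


def build_picrust2_command(fasta_file, table_file, output_dir, processes=1):
    """Assemble the argument list; -s, -i and -o are the required options in the usage text."""
    return [
        "picrust2_pipeline.py",
        "-s", str(fasta_file),
        "-i", str(table_file),
        "-o", str(output_dir),
        "-p", str(processes),
        "--verbose",
    ]


# Placeholder file names, not paths from the notebook.
cmd = build_picrust2_command("study_seqs.fasta", "abundance.tsv", "picrust_out")

# Only report availability here; execution is left to subprocess.run(cmd, check=True).
picrust2_available = shutil.which("picrust2_pipeline.py") is not None
print(cmd, picrust2_available)
```

Treating the known and candidate groups as two pseudo-samples is one possible design: it would let a single run produce per-group columns in the pathway table, at the cost of giving every sequence equal weight.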
