# Sequence Analysis and Functional Prediction Pipeline
This notebook is run entirely in Colab.

## 1. Introduction
This notebook analyzes the functional and sequence relationships between newly identified bacteria and known corrosion-influencing microorganisms. The analysis builds upon previous findings where:
- Statistical significance was established between the selected bacteria and corrosion risk (Notebook 3)
- Literature validation confirmed corrosion influence for many bacteria (Notebook 4)
- Evolutionary relationships were mapped through phylogenetic analysis (Notebook 5)

The study focuses on bacteria from operational heating and cooling water systems, primarily in Germany. Using 16S rRNA data (bootstrap-validated from Notebook 5), this analysis employs PICRUSt2 to predict metabolic functions and compare functional profiles between different bacterial groups.

### Analysis Approaches
We implement two classification strategies:

1. Simple Classification:
   - Known corrosion-causing bacteria (usual_taxa)
   - Other bacteria (combining checked_taxa and core_taxa)

2. Detailed Classification:
   - Known corrosion-causing bacteria (usual_taxa)
   - Pure checked bacteria (exclusive to checked_taxa)
   - Pure core bacteria (exclusive to core_taxa)
   - Checked-core bacteria (overlap between checked and core taxa)

This detailed approach allows for more nuanced analysis of functional profiles and better understanding of potential corrosion mechanisms across different bacterial groups.
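
The two strategies can be sketched as plain functions. This is a minimal illustration, assuming each taxon carries a hyphen-joined source label such as `us`, `chk`, `core`, or `chk-core-us` (the labels used in the abundance table of this notebook, where `us` marks usual_taxa). The function names `classify_simple` and `classify_detailed` are hypothetical and not part of the pipeline.

```python
def classify_simple(source: str) -> str:
    """Two-group scheme: known corrosion-causing bacteria vs. everything else."""
    tags = set(source.split("-"))
    return "known" if "us" in tags else "other"


def classify_detailed(source: str) -> str:
    """Four-group scheme that keeps checked, core, and their overlap apart."""
    tags = set(source.split("-"))
    if "us" in tags:
        return "known"
    if {"chk", "core"} <= tags:
        return "checked_core"
    if "chk" in tags:
        return "pure_checked"
    if "core" in tags:
        return "pure_core"
    return "unclassified"  # label with none of the expected tags


for label in ["us", "chk-core-us", "chk", "core", "chk-core"]:
    print(f"{label:12s} -> {classify_simple(label):6s} | {classify_detailed(label)}")
```

Splitting the label into tags avoids accidental substring matches, which matters if a label ever contains a tag as part of a longer word.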

### Analysis Goals:
- Predict metabolic functions from 16S sequences
- Focus on corrosion-relevant pathways (sulfur/iron metabolism)
- Compare functional profiles between known corrosion-causing bacteria and newly identified candidates
- Validate whether statistical correlations reflect genuine metabolic capabilities associated with corrosion processes

### Directory Structure:
 The notebook data, named data_picrus, is structured as follows:  
data_tree  
 ├── sequences/  
 │   ├── known.fasta : sequences of known corrosion-causing bacteria  
 │   ├── candidate.fasta : sequences of potential new corrosion-causing bacteria  
 |   └── other files  
 data_picrus  
 └── picrust_results/  
      ├── known_bacteria/  
      |               ├── EC_predictions/       : enzyme predictions  
      |               ├── pathway_predictions/  : metabolic pathway abundance  
      |               ├── KO_predictions/       : KEGG ortholog predictions  
      |               └── other_picrust_files/  
      ├── candidate_bacteria/  
      |               ├── EC_predictions/       : enzyme predictions  
      |               ├── pathway_predictions/  : metabolic pathway abundance  
      |               ├── KO_predictions/       : KEGG ortholog predictions  
      |               └── other_picrust_files/  : final comparison summary
      ├── core_bacteria/
      |               ├── EC_predictions/       : enzyme predictions  
      |               ├── pathway_predictions/  : metabolic pathway abundance  
      |               ├── KO_predictions/       : KEGG ortholog predictions  
      |               └── other_picrust_files/  
      │      
      └── functional_comparison.xlsx  
PICRUSt2 relies on the reference database that is installed with the package, located at /home/beatriz/miniconda3/envs/picrust2/lib/python3.9/site-packages/picrust2/default_files/prokaryotic/pro_ref

# 2. Loading and Preparing the Data

## 2.1 Imports, Directories, Loading and preparing the Abundance DataFrame
The abundance DataFrame (Integrated) was carefully prepared to meet PICRUSt2 input requirements, including proper taxonomic level organization and removal of unnamed or missing data. The sequence data is sourced directly from aligned_sequences_integrated.fasta, which contains the phylogenetically aligned sequences generated in notebook 5. This integration ensures consistency between abundance data and sequence information.

Installing QIIME 2 and PICRUSt2 in Colab

In [1]:
# Install miniconda and initialize
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!bash Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local/miniconda3
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge

--2025-01-20 16:23:21--  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.191.158, 104.16.32.241, 2606:4700::6810:20f1, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.191.158|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 147784736 (141M) [application/octet-stream]
Saving to: ‘Miniconda3-latest-Linux-x86_64.sh’


2025-01-20 16:23:22 (105 MB/s) - ‘Miniconda3-latest-Linux-x86_64.sh’ saved [147784736/147784736]

PREFIX=/usr/local/miniconda3
Unpacking payload ...

Installing base environment...

Preparing transaction: ...working... done
Executing transaction: ...working... done
installation finished.
    You currently have a PYTHONPATH environment variable set. This may cause
    unexpected behavior when running the Python interpreter in Miniconda3.
    For best results, please verify that your PYTHONPATH only points to
    directories of packages that are compatible wi

In [None]:
import sys
sys.path.append('/usr/local/miniconda3/lib/python3.7/site-packages/')

# Create environment with QIIME2-2020.8 (stable version known to work with PICRUSt2)
!conda create -n qiime2-2020.8 python=3.7 -y
!conda activate qiime2-2020.8

# Install QIIME2
!wget https://data.qiime2.org/distro/core/qiime2-2020.8-py36-linux-conda.yml
!conda env update -n qiime2-2020.8 --file qiime2-2020.8-py36-linux-conda.yml

# Install PICRUSt2 and its dependencies
!conda install -c bioconda -c conda-forge picrust2=2.4.1 -y

In [4]:


# Verify installations
!conda list | grep qiime2
!conda list | grep picrust2

# Function to check if the installations were successful
import shutil

def check_installations():
    try:
        import qiime2
        print("QIIME2 installation successful")
        print(f"QIIME2 version: {qiime2.__version__}")
    except ImportError:
        print("QIIME2 installation failed")

    # `!which` never raises, so look the script up on PATH explicitly
    picrust_path = shutil.which("picrust2_pipeline.py")
    if picrust_path:
        print(f"PICRUSt2 installation successful: {picrust_path}")
    else:
        print("PICRUSt2 installation failed")

check_installations()


✨🍰✨ Everything looks OK!
Channels:
 - conda-forge
 - bioconda
Platform: linux-64
Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - done
Solving environment: | failed

SpecsConfigurationConflictError: Requested specs conflict with configured specs.
  requested specs: 
    - python=3.6
  pinned specs: 
    - cuda-version=12
    - python=3.11
    - python_abi=3.11[build=*cp311*]
Use 'conda config --show-sources' to look for 'pinned_specs' and 'track_features'
configuration parameters.  Pinned specs may also be defined in the file
/usr/local/conda-meta/pinned.


Channels:
 - bioconda
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): - \ | / - \ | / - \ | /

ModuleNotFoundError: No module named 'qiime2'

In [9]:

# Example usage with test data
# Download test data
!wget http://kronos.pharmacology.dal.ca/public_files/picrust/picrust2_tutorial_files/mammal_biom.qza
!wget http://kronos.pharmacology.dal.ca/public_files/picrust/picrust2_tutorial_files/mammal_seqs.qza

# Run PICRUSt2 through QIIME2
!qiime picrust2 full-pipeline \
    --i-table mammal_biom.qza \
    --i-seq mammal_seqs.qza \
    --output-dir q2-picrust2_output \
    --p-placement-tool sepp \
    --p-threads 1 \
    --p-hsp-method pic \
    --p-max-nsti 2 \
    --verbose

# Function to analyze the output
def check_output():
    import os
    output_files = os.listdir('q2-picrust2_output')
    print("Generated output files:")
    for file in output_files:
        print(f"- {file}")

check_output()

"""
Instructions for using this notebook:

1. Create a new Colab notebook
2. Copy this entire code into the notebook
3. Run the cells in order
4. The installation may take 5-10 minutes
5. After installation, you can use QIIME2 and PICRUSt2 commands

Common troubleshooting:
- If you get memory errors, try restarting the runtime
- Make sure to run cells in order
- Check that all installations completed successfully
- If you get path errors, make sure conda environment is activated

To use your own data:
1. Upload your feature table (.qza format)
2. Upload your sequence file (.qza format)
3. Modify the PICRUSt2 command with your file names
"""

/bin/bash: line 1: qiime: command not found
/bin/bash: line 1: qiime: command not found
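
The `qiime: command not found` message is consistent with how notebooks run shell commands: each `!` line starts its own shell, so a `!conda activate ...` issued in one line does not carry over to the next. One possible workaround, sketched below under the assumption that the `qiime2-2020.8` environment was created successfully, is to prefix commands with `conda run -n <env>`. The helper name `conda_run` is hypothetical, and the sketch only builds the command without executing it.

```python
import shlex


def conda_run(env_name, *command):
    """Build an argv list that runs `command` inside the named conda environment."""
    return ["conda", "run", "-n", env_name, *command]


argv = conda_run("qiime2-2020.8", "qiime", "--version")
print(shlex.join(argv))

# In a notebook cell the same command could be issued as:
#   !conda run -n qiime2-2020.8 qiime --version
# or from Python with subprocess.run(argv, check=True)
```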


In [3]:
import os
from google.colab import drive
drive.mount('/content/drive')

# Change to the project directory
os.chdir('/content/drive/My Drive/MIC/picrust')

Mounted at /content/drive

In [11]:
# Verify QIIME2 installation
import qiime2
print(qiime2.__version__)

ModuleNotFoundError: No module named 'qiime2'

In [10]:
# Colab helper
import condacolab

# Standard library imports
import os
import ast
from io import StringIO
from pathlib import Path

# Biopython (SeqRecord is needed when rewriting FASTA headers)
from Bio import SeqIO, Phylo
from Bio.SeqRecord import SeqRecord

# Data processing imports
import pandas as pd
import numpy as np
import openpyxl
import matplotlib.pyplot as plt

# BIOM handling
from biom import Table, load_table
from biom.util import biom_open

# QIIME2-specific imports (alignment is needed for masking)
import qiime2
from qiime2.plugins import feature_table, alignment

ModuleNotFoundError: No module named 'Bio'

In [None]:
# Directory Structure Definitions
SIMPLE_BASE = {
    'known': 'simple_known_mic',
    'other': 'simple_candidate_mic'
}

DETAILED_BASE = {
    'known': 'detailed_known_mic',
    'pure_checked': 'detailed_pure_checked_mic',
    'pure_core': 'detailed_pure_core_mic',
    'checked_core': 'detailed_checked_core_mic'
}

SUBDIRS = [
    'EC_predictions',
    'pathway_predictions',
    'KO_predictions',
    'other_picrust_files'
]

# Base Paths
base_dir = Path("/home/beatriz/MIC/2_Micro/data_picrust")
# Create output directory if it doesn't exist
base_dir.mkdir(parents=True, exist_ok=True)
aligned_file = Path("/home/beatriz/MIC/2_Micro/data_qiime/qiime_aligned_sequences.fasta/aligned-dna-sequences.fasta")
abundance_excel = Path("/home/beatriz/MIC/2_Micro/data_Ref/merged_to_sequence.xlsx")
results_file = base_dir / "functional_comparison.xlsx"
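
The dictionaries above name the result folders but do not create them. A minimal sketch of how the tree could be built is shown here; `create_result_tree` is a hypothetical helper, and a temporary directory stands in for `base_dir` so that the example does not touch the real data.

```python
import tempfile
from pathlib import Path

SIMPLE_BASE = {"known": "simple_known_mic", "other": "simple_candidate_mic"}
DETAILED_BASE = {
    "known": "detailed_known_mic",
    "pure_checked": "detailed_pure_checked_mic",
    "pure_core": "detailed_pure_core_mic",
    "checked_core": "detailed_checked_core_mic",
}
SUBDIRS = ["EC_predictions", "pathway_predictions", "KO_predictions", "other_picrust_files"]


def create_result_tree(root, folders, subdirs):
    """Create root/<folder>/<subdir> for every combination and return the paths."""
    created = []
    for folder in folders:
        for sub in subdirs:
            path = Path(root) / folder / sub
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


# Both dictionaries share the key 'known', so combine the folder names, not the dicts
all_folders = [*SIMPLE_BASE.values(), *DETAILED_BASE.values()]

with tempfile.TemporaryDirectory() as tmp:  # stand-in for base_dir
    paths = create_result_tree(tmp, all_folders, SUBDIRS)
    print(f"Created {len(paths)} directories under {tmp}")
```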

In [None]:
# Read fasta file
aligned_sequences = list(SeqIO.parse(aligned_file, "fasta"))

The main DataFrame comes from the sheet 'core_check_usual_taxa' of /home/beatriz/MIC/2_Micro/data_Ref/merged_to_sequence.xlsx (sheet_name='core_check_usual_taxa'). After cleaning, it was grouped into DataFrames that reflect the source each taxon came from:
- Known bacteria: grouped from the sources 'chk-core-us', 'chk-us', 'core-us', and 'us'.
- pure_core: comes from core_taxa, column core.
- pure_checked: comes from checked_taxa, column chck.
- check_core: the combination of core_taxa and checked_genera, column chck_core.

The final processed DataFrame is called Integrated. The taxonomic levels are stripped from it, and the Source and Category columns are kept apart. Integrated then has only the GID identifiers as index, the sites as headers, and the abundances as float values.

In [None]:
# Integrated taxa table: the header has 8 levels (level 6 = genus, level 7 = GID) and must be cleaned
Integrated_T = pd.read_excel(abundance_excel, sheet_name='core_check_usual_taxa', header=[0,1,2,3,4,5,6,7])
# Drop the first row (index 0) and the first column in one chain
Integrated_T = Integrated_T.drop(index=0).drop(Integrated_T.columns[0], axis=1)
# Remove 'Unnamed' level names
Integrated_T.columns = Integrated_T.columns.map(lambda x: tuple('' if 'Unnamed' in str(level) else level for level in x))
# Replace NaN in the 'Sites' column with 'Source'
Integrated_T['Sites'] = Integrated_T['Sites'].fillna('Source')
# Fill the remaining missing values with a blank
Integrated_T = Integrated_T.fillna(' ')
Integrated_T = Integrated_T.set_index("Sites")
pre_Integrated = Integrated_T.T
# Sources are: ' ', 'chk-core', 'chk', 'chk-core-us', 'chk-us', 'core-us', 'core', 'us'

In [None]:
def process_integrated_data(df):
    """
    Process the integrated DataFrame to create a new DataFrame with clear column names
    and preserve all values including source information.

    Parameters:
    df (pandas.DataFrame): Input DataFrame with MultiIndex index and site columns

    Returns:
    pandas.DataFrame: Processed DataFrame with clear structure
    """

    # Extract genera and GIDs from the index MultiIndex
    genera = df.index.get_level_values(6)[1:]  # Skip first row
    gids = pd.to_numeric(df.index.get_level_values(7)[1:], errors='coerce')

    # Create a new DataFrame with the extracted information
    result_df = pd.DataFrame({
        'Genus': genera,
        'GID': gids
    })

    # Add the site values from the original DataFrame
    for col in df.columns:
        result_df[col] = df.iloc[1:][col].values

    # Clean up the DataFrame
    result_df['GID'] = pd.to_numeric(result_df['GID'], errors='coerce')
    result_df = result_df.dropna(subset=['GID'])
    result_df['GID'] = result_df['GID'].astype(int)

    return result_df

def get_taxa_groups(df):
    """
    Separate the processed DataFrame into different taxa groups based on Source column

    Parameters:
    df (pandas.DataFrame): Processed DataFrame from process_integrated_data()

    Returns:
    dict: Dictionary containing DataFrames for different taxa groups
    """
    # Split the data into groups based on 'Source' column patterns

    # Known corrosion bacteria (any pattern with 'us')
    known_bacteria = df[df['Source'].str.contains('us', case=False, na=False)]

    # Pure checked bacteria (only 'chk' without 'core' or 'us')
    pure_checked = df[
        df['Source'].str.contains('chk', case=False, na=False) &
        ~df['Source'].str.contains('core|us', case=False, na=False)
    ]

    # Pure core bacteria (only 'core' without 'chk' or 'us')
    pure_core = df[
        df['Source'].str.contains('core', case=False, na=False) &
        ~df['Source'].str.contains('chk|us', case=False, na=False)
    ]

    # Checked-core bacteria (contains both 'core' and 'chk' but no 'us')
    checked_core = df[
        df['Source'].str.contains('chk.*core|core.*chk', case=False, na=False) &
        ~df['Source'].str.contains('us', case=False, na=False)
    ]

    # Create groups dictionary
    taxa_groups = {
        'known_bacteria': known_bacteria,
        'pure_checked': pure_checked,
        'pure_core': pure_core,
        'checked_core': checked_core
    }

    # Print summary statistics
    print("\nDetailed Classification Results:")
    print(f"Known corrosion bacteria: {len(known_bacteria)}")
    print(f"Pure checked bacteria: {len(pure_checked)}")
    print(f"Pure core bacteria: {len(pure_core)}")
    print(f"Checked-core bacteria: {len(checked_core)}")

    # Verify total matches expected
    total_classified = len(known_bacteria) + len(pure_checked) + len(pure_core) + len(checked_core)
    print(f"\nTotal classified taxa: {total_classified}")
    print(f"Total in dataset: {len(df)}")

    return taxa_groups

# Usage example:
Integrated = process_integrated_data(pre_Integrated)

# Get the groups
taxa_groups = get_taxa_groups(Integrated)

# Access individual groups
known_bacteria = taxa_groups['known_bacteria']
pure_core = taxa_groups['pure_core']
pure_checked = taxa_groups['pure_checked']
checked_core = taxa_groups['checked_core']

Some bacterial genera were excluded from the analysis due to unavailable reference sequences, primarily affecting rare species. The following genera were removed: Clostridium_sensu_stricto_12, Oxalobacteraceae_unclassified, Psb-m-3, Ruminiclostridium_1, and Wchb1-05. As demonstrated in Section 2.3, the statistical analysis of the BIOM-formatted data confirmed that the removal of these genera did not significantly impact the overall results of this study.

In [None]:
# List of genera to remove
genera_to_remove = {'Clostridium_sensu_stricto_12', 'Oxalobacteraceae_unclassified',
                   'Psb-m-3', 'Ruminiclostridium_1', 'Wchb1-05'}

# Filter out the rows where Genus column matches any of the genera in the list
Integrated = Integrated[~Integrated['Genus'].isin(genera_to_remove)]

Optional: verify that the genera were removed

In [None]:
# Ensure the genera_to_remove set is correctly defined
genera_to_remove = {'Clostridium_sensu_stricto_12', 'Oxalobacteraceae_unclassified',
                    'Psb-m-3', 'Ruminiclostridium_1', 'Wchb1-05'}

# Convert genera_to_remove to a set of strings
genera_to_remove = set(str(genus) for genus in genera_to_remove)

# Now try the filtering again
Integrated = Integrated[~Integrated['Genus'].isin(genera_to_remove)]

# Check if any rows were removed
print(f"Rows in dataframe: {len(Integrated)}")

# Check if any of the genera to remove are still present
remaining_genera = set(Integrated['Genus']) & genera_to_remove
if remaining_genera:
    print(f"These genera are still present: {remaining_genera}")
else:
    print("All specified genera have been removed successfully.")

In [None]:
# Drop Source and GID, and use Genus as the index
pre_biom = Integrated.drop(columns=["Source", "GID"])
pre_biom = pre_biom.set_index("Genus")

In [None]:
pre_biom.shape

With the cleaned structure in place, the table can be formatted for BIOM conversion.
## 2.2. Formatting the Integrated DataFrame as a BIOM Table and QIIME Artifact
This step creates a table with the observation IDs as index (GID/OTUs; genus names at this stage, since pre_biom is indexed by Genus), sites as headers, and abundance values. It saves the table as abundance.biom and finally imports it as a QIIME 2 artifact.

In [None]:
# Create BIOM table
biom_table = Table(
    data= pre_biom.values,
    observation_ids=pre_biom.index.astype(str),  # Genus names as observation IDs
    sample_ids=pre_biom.columns.astype(str),  # Sites as sample IDs
)

# Write to file
output_biom = "/home/beatriz/MIC/2_Micro/data_picrust/abundance.biom"
with biom_open(output_biom, 'w') as f:
    biom_table.to_hdf5(f, "Abundance data in BIOM format")

# Verify BIOM file
print(f"BIOM file created: {output_biom}")
print(f"Number of observations: {biom_table.shape[0]}")
print(f"Number of samples: {biom_table.shape[1]}")

# Convert BIOM to QIIME2 artifact
table_artifact = qiime2.Artifact.import_data(
    'FeatureTable[Frequency]',
    output_biom
)
# Verify QIIME2 artifact
print("\nQIIME2 Artifact Info:")
print(f"Type: {table_artifact.type}")
print(f"UUID: {table_artifact.uuid}")

Inspecting how the table is structured

In [None]:
# Load and check the BIOM file
from biom import load_table
biom_table = load_table("/home/beatriz/MIC/2_Micro/data_picrust/abundance.biom")
print(biom_table)

In [None]:
!biom summarize-table -i /home/beatriz/MIC/2_Micro/data_picrust/abundance.biom

The previous cell summarizes the counts per sample (sites) and per observation (genera), including the minimum, maximum, mean, and median counts per sample. As noted elsewhere, the raw data provided by the study are relative abundances. The majority of the samples (~98%) are normalized so that their total abundance sums to 99-100%, as expected for a dataset of relative abundances: 70 samples sum to 99-100%, 10 of them are below 99%, and two of them are at 89% and 87%. These differences could be due to normalization artifacts, rounding, or truncation. If the technicians filtered out rare or low-abundance taxa to clean the dataset, those exclusions may also account for totals below 100%, and samples with higher proportions of the filtered taxa would show a bigger drop.

That is the picture for the raw percentages. The BIOM statistics give a different view of the data; the following statistics were computed over all 84 features (observations, i.e. genera):

Num samples: 70
Num observations: 84
Total count: 5630
Table density (fraction of non-zero values): 0.406

Counts/sample summary:
 Min: 18.180
 Max: 99.058
 Median: 84.819
 Mean: 80.439
 Std. dev.: 16.000
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Counts/sample detail:
site_69: 18.180
site_67: 21.790
site_70: 27.060
site_13: 54.982
site_26: 58.300
site_21: 58.973
site_5: 60.650

The statistics of this BIOM table show a low table density (0.406): roughly 59% of the table entries are zero, which is consistent with a dataset dominated by a few taxa. The counts/sample summary is calculated from relative abundances. site_69 shows a very low count, which may be explained by an uneven distribution of taxa (highly skewed abundances, a few dominant taxa) and/or technical issues during sample preparation or sequencing. Another possible explanation for the low counts of samples 69, 67, and 70 is that these are the sites most affected by the genera that were removed for lack of sequences. Close inspection of the sites, however, shows where the removals actually fall:
- site_40 had 77% of its sequences removed along with Clostridium_sensu_stricto_12, because no sequence could be obtained from NCBI or elsewhere, yet this site shows a count ratio of 91.21%. When the missing genera are removed from the data, the same site shows a very low relative abundance, which is expected since 73% was removed for lack of a Clostridium_sensu_stricto_12 sequence.
- site_31 had 55.35% Oxalobacteraceae_unclassified, which was also removed.
- Sites 12, 38, and 65 lost between 8% and 14% through the removal of Psb-m-3.
- Sites 20 and 41 lost between 11% and 18% of their sequences through the removal of Ruminiclostridium_1.
- Site 12 lost Wchb1-05, which accounted for 20% of the abundance of the site.

The fact that these removals are not reflected in the statistical summary is a good sign that those genera were not relevant to the community, as they do not belong to any of the groups studied here (core_taxa, checked_genera, or usual_taxa). Instead, the percentages reflect that site_69 contains few of our selected genera, so its representation is very low.
In conclusion, site_69, site_67, and site_70 have different community compositions than the others, with fewer of the target bacteria present, which is not surprising since those samples come from UK sites.
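
The two quantities discussed above, table density and per-sample totals, can be reproduced by hand. The sketch below uses plain Python on a toy table with hypothetical values (3 genera by 3 sites, in percent); it mirrors what `biom summarize-table` reports but is not how that tool is implemented. The function names are illustrative.

```python
def table_density(rows):
    """Fraction of non-zero cells in a features x samples table."""
    cells = [value for row in rows for value in row]
    return sum(1 for value in cells if value != 0) / len(cells)


def sample_totals(rows, sample_ids):
    """Sum each column: the total relative abundance retained per site."""
    return {sid: sum(row[j] for row in rows) for j, sid in enumerate(sample_ids)}


# Toy data (hypothetical values): rows are genera, columns are sites
sample_ids = ["site_A", "site_B", "site_C"]
rows = [
    [60.0, 0.0, 10.0],
    [30.0, 0.0, 0.0],
    [0.0, 20.0, 5.0],
]

print(f"Table density: {table_density(rows):.3f}")
totals = sample_totals(rows, sample_ids)
for sid, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{sid}: {total:.1f}")
```

A site whose total is far below 100 after filtering is one where most of the original abundance belonged to genera outside the studied groups or to genera that were removed.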

# Removing the Genera and Replacing the Accession Numbers for the PICRUSt2 Database

In [None]:
# Input file from previous QIIME2 alignment
input_file = Path("data_qiime/qiime_aligned_sequences.fasta/aligned-dna-sequences.fasta")

# Intermediate file with cleaned headers
clean_headers_file = Path('data_qiime/clean_headers.fasta')

# Create output directory for masked alignment
masked_output_dir = Path("data_qiime/masked_sequences")
masked_output_dir.mkdir(parents=True, exist_ok=True)

# Clean the headers
cleaned_records = []
for record in SeqIO.parse(input_file, "fasta"):
    accession = record.description.split('Accession:')[1].strip()
    new_record = SeqRecord(
        seq=record.seq,
        id=accession,
        description=""
    )
    cleaned_records.append(new_record)

# Write cleaned sequences
SeqIO.write(cleaned_records, clean_headers_file, "fasta")

# Import cleaned sequences into QIIME2
aligned_artifact = qiime2.Artifact.import_data(
    'FeatureData[AlignedSequence]',
    str(clean_headers_file)
)

# Apply masking
masked_alignment = alignment.methods.mask(
    alignment=aligned_artifact,
    max_gap_frequency=0.5,
    min_conservation=0.4
)

# Export masked alignment to directory
masked_alignment.masked_alignment.export_data(str(masked_output_dir))

# The resulting file will be in a new directory with QIIME2's default name
print(f"Pipeline steps:")
print(f"1. Input aligned sequences: {input_file}")
print(f"2. Cleaned headers file: {clean_headers_file}")
print(f"3. Masked alignment output: {masked_output_dir}/aligned-dna-sequences.fasta")

## 2.4. Optimising the Sequences by Trimming and Cleaning

The aim is to preserve the most informative diagnostic regions and to maintain the alignment within them. Care is taken to keep the phylogenetic relationships intact, so that the PICRUSt2 analysis is of better quality and retains its biological significance.

In [None]:
def optimize_diagnostic_sequences(input_fasta, output_fasta):
    """
    Optimize sequences preserving key diagnostic regions
    """
    # Key diagnostic regions we want to preserve
    key_regions = [
        (249, 572),   # Large region 1
        (934, 1653),  # Largest region
        (2344, 2846)  # Large region 2
    ]

    sequences = {}
    current_header = ""

    print("Reading sequences...")
    with open(input_fasta) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                current_header = line
                sequences[current_header] = []
            elif line:
                sequences[current_header].append(line)

    # Join sequences
    for header in sequences:
        sequences[header] = ''.join(sequences[header])

    # Find optimal boundaries that include key regions
    start_pos = min(region[0] for region in key_regions)
    end_pos = max(region[1] for region in key_regions)

    print(f"\nOptimized boundaries:")
    print(f"Start: {start_pos}")
    print(f"End: {end_pos}")

    # Write optimized sequences
    print("\nWriting optimized sequences...")
    with open(output_fasta, 'w') as out:
        for header, seq in sequences.items():
            trimmed_seq = seq[start_pos:end_pos]
            non_gaps = sum(1 for c in trimmed_seq if c != '-')
            content_ratio = non_gaps / len(trimmed_seq)

            out.write(f"{header}\n")
            for i in range(0, len(trimmed_seq), 60):
                out.write(trimmed_seq[i:i+60] + '\n')

            print(f"Sequence {header.split()[0]} content ratio: {content_ratio:.2%}")

    print(f"\nProcessing complete:")
    print(f"Original length: {len(next(iter(sequences.values())))}")
    print(f"Optimized length: {end_pos - start_pos}")
    print(f"Sequences processed: {len(sequences)}")

# Run the optimization
aligned_file = Path("/home/beatriz/MIC/2_Micro/data_tree/aligned_sequences_integrate.fasta")
output_file = aligned_file.parent / "diagnostic_optimized_sequences.fasta"
optimize_diagnostic_sequences(aligned_file, output_file)

The content ratios fall into three quality tiers:
- High quality (>50%): Hydrogenophaga (53.60%), Blastomonas (59.38%), Phenylobacterium (52.87%), Afipia (57.91%), Neisseria (55.99%), Desulfovibrio (60.95%), Acetobacterium (57.87%), Bulleidia (51.75%).
- Moderate quality (35-50%): about 35 sequences, including Nitrospira, Oerskovia, and most Proteobacteria.
- Low quality (<25%): about 20 sequences, including Corynebacterium (16.67%), Treponema (24.53%), Variovorax (16.90%), Desulfobulbus (16.71%).

Regarding sequence length, the original sequences have 3471 bases and the optimized ones about 2597 bases. That is 75% of the original length, and the retained regions are the informative diagnostic regions. Based on this, two approaches will be taken: first, run PICRUSt2 on the high-quality (>50%) sequences; second, compare the results with those of the low-quality sequences. However, this approach sacrifices some bacteria that lack high-quality sequences but are relevant to our study. It is therefore important to check the quality distribution within our groups. We may consider using different quality thresholds to find a workable trade-off in the results.
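
To explore how different thresholds affect which sequences are kept, the content ratio (non-gap characters divided by alignment length) can be computed per sequence and compared against a cutoff. The following is a minimal sketch on toy sequences; the names `content_ratio` and `split_by_threshold` and the example data are hypothetical, and the cutoffs simply reuse the tier boundaries mentioned above.

```python
def content_ratio(aligned_seq, gap_char="-"):
    """Share of alignment positions that hold a base rather than a gap."""
    if not aligned_seq:
        return 0.0
    return sum(1 for c in aligned_seq if c != gap_char) / len(aligned_seq)


def split_by_threshold(sequences, threshold):
    """Partition {name: aligned sequence} into kept and dropped dictionaries."""
    kept, dropped = {}, {}
    for name, seq in sequences.items():
        target = kept if content_ratio(seq) >= threshold else dropped
        target[name] = seq
    return kept, dropped


# Toy alignment (hypothetical sequences)
toy = {"genus_a": "ACGT--AC", "genus_b": "A-------", "genus_c": "ACG-----"}

for threshold in (0.25, 0.35, 0.50):
    kept, dropped = split_by_threshold(toy, threshold)
    print(f"threshold {threshold:.2f}: kept {sorted(kept)}, dropped {sorted(dropped)}")
```

Running the split per taxa group would show directly how many known, checked, and core genera each threshold costs.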


In [None]:
def verify_cleaned_sequences(fasta_file):
    """
    Verify the quality of cleaned sequences
    """
    sequences = {}
    current_header = ""

    print("Analyzing cleaned sequences...")
    with open(fasta_file) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                current_header = line
                sequences[current_header] = []
            elif line:
                sequences[current_header].append(line)

    # Join sequences and analyze
    for header in sequences:
        sequences[header] = ''.join(sequences[header])

    # Calculate statistics
    lengths = []
    base_counts = []

    for header, seq in sequences.items():
        lengths.append(len(seq))
        base_counts.append(sum(1 for c in seq if c != '-'))

    print(f"\nSequence Statistics:")
    print(f"Total sequences: {len(sequences)}")
    print(f"Sequence length: {lengths[0]} (all sequences same length)")
    print(f"Average non-gap bases: {sum(base_counts)/len(base_counts):.1f}")
    print(f"Min non-gap bases: {min(base_counts)}")
    print(f"Max non-gap bases: {max(base_counts)}")

# Verify the cleaned sequences
output_file = aligned_file.parent / "picrust_ready_sequences.fasta"  # alternative: "diagnostic_optimized_sequences.fasta"
verify_cleaned_sequences(output_file)

Running verify_cleaned_sequences on **"diagnostic_optimized_sequences.fasta"**, which is 2597 bp long (better content), gave: Total sequences: 79, Sequence length: 2597 (all sequences same length), Average non-gap bases: 937.5, Min non-gap bases: 433, Max non-gap bases: 1583. Running it on **"picrust_ready_sequences.fasta"**, which is 890 bp long (more aggressive trimming), gave: Total sequences: 79, Sequence length: 890 (all sequences same length), Average non-gap bases: 314.0, Min non-gap bases: 96, Max non-gap bases: 588. On average, the diagnostic-optimized version has better content, with 937.5 non-gap bases against 314.0 for the second trimming, which appears to be too aggressive. The first trimming preserves more sequence content and removes unnecessary gaps while maintaining the important diagnostic regions. The per-group analysis shows similar quality patterns for all groups.
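
The "all sequences same length" statement in the report is printed from the first sequence only. A small check such as the one below could confirm it before the file is passed on. This is a sketch on an in-memory dictionary; `check_alignment_lengths` is a hypothetical helper, not part of the notebook.

```python
def check_alignment_lengths(sequences):
    """Return the common alignment length, or raise ValueError if lengths differ."""
    lengths = {name: len(seq) for name, seq in sequences.items()}
    distinct = set(lengths.values())
    if len(distinct) != 1:
        raise ValueError(f"Expected one common length, found: {lengths}")
    return distinct.pop()


# Toy example (hypothetical sequences)
aligned = {"seq_1": "AC-T", "seq_2": "A--T", "seq_3": "ACGT"}
print(f"Common alignment length: {check_alignment_lengths(aligned)}")
```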

## Biom Data Replacing Genera with Accession Numbers

In [None]:
# First recreate the mapping to make sure we have it
# (aligned_file is used because input_fasta is only defined in a later cell)
fasta_mapping = {}
with open(aligned_file) as f:
    for record in SeqIO.parse(f, "fasta"):
        genus = record.description.split()[0]
        accession = record.description.split('Accession:')[1].strip()
        fasta_mapping[genus] = accession

# Load current BIOM table
biom_table = load_table("/home/beatriz/MIC/2_Micro/data_picrust/abundance.biom")

# Get the observation IDs (currently genera)
obs_ids = biom_table.ids(axis='observation')

# Create new IDs using the mapping
new_ids = [fasta_mapping[obs_id] for obs_id in obs_ids]

# Create new BIOM table with accession numbers
acce_biom = Table(
    data=biom_table.matrix_data,
    observation_ids=new_ids,
    sample_ids=biom_table.ids()
)

# Save new BIOM file
with biom_open('/home/beatriz/MIC/2_Micro/data_picrust/abundance_accession.biom', 'w') as f:
    acce_biom.to_hdf5(f, "Abundance data with accession numbers")

print("New BIOM file created with accession numbers as IDs")

# FASTA Mapping and Accession as ID
It appears that PICRUSt2 takes neither genus names nor GID numbers, only accession numbers. To keep the results comparable, the accession numbers must be mapped to the GIDs and the genera, and the FASTA data must carry only the accession identifiers, which are the ones the PICRUSt2 database uses.

In [None]:
# First create the mapping
fasta_mapping = {}
with open(aligned_file) as f:
    for line in f:
        if line.startswith('>'):
            # Parse header like "Nitrospira Accession:1197011011"
            genus = line.split()[0][1:]  # Remove '>' and get genus
            accession = line.split('Accession:')[1].strip()
            fasta_mapping[genus] = accession

# Now we can use this mapping to update both files
print("Sample of genus to accession mapping:")
for genus, accession in list(fasta_mapping.items())[:5]:
    print(f"{genus}: {accession}")
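
Once PICRUSt2 reports results by accession number, they have to be translated back to genera. A reverse lookup can be derived from the genus-to-accession mapping, as sketched here. `invert_mapping` is a hypothetical helper, and the second entry of the example dictionary is invented for illustration. The check for duplicates guards against two genera sharing one accession, which would make the reverse lookup ambiguous.

```python
def invert_mapping(genus_to_accession):
    """Build accession -> genus, refusing ambiguous (duplicated) accessions."""
    accession_to_genus = {}
    for genus, accession in genus_to_accession.items():
        if accession in accession_to_genus:
            other = accession_to_genus[accession]
            raise ValueError(f"Accession {accession} maps to both {other} and {genus}")
        accession_to_genus[accession] = genus
    return accession_to_genus


# Example mapping; 'Genus_X' and its accession are made up for this sketch
example_mapping = {"Nitrospira": "1197011011", "Genus_X": "0000000001"}
accession_to_genus = invert_mapping(example_mapping)
print(accession_to_genus["1197011011"])
```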

In [None]:
from pathlib import Path
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

def clean_fasta_with_accessions(input_fasta, output_fasta):
    """
    Clean FASTA headers to use accession numbers as IDs
    """
    cleaned_records = []
    with open(input_fasta) as f:
        for record in SeqIO.parse(f, "fasta"):
            # Get accession from description
            accession = record.description.split('Accession:')[1].strip()
            # Create new record with accession as ID
            new_record = SeqRecord(
                seq=record.seq,
                id=accession,
                name=accession,
                description=""
            )
            cleaned_records.append(new_record)

    # Write cleaned sequences
    SeqIO.write(cleaned_records, output_fasta, "fasta")
    print(f"Created clean FASTA file with {len(cleaned_records)} sequences")

    # Show first few headers to verify
    print("\nFirst few headers in cleaned file:")
    for record in cleaned_records[:3]:
        print(f">{record.id}")

# Input and output paths
input_fasta = Path("/home/beatriz/MIC/2_Micro/data_tree/diagnostic_optimized_sequences.fasta")
output_fasta = Path("/home/beatriz/MIC/2_Micro/data_tree/accession_sequences.fasta")

# Call the function only after it has been defined
clean_fasta_with_accessions(input_fasta, output_fasta)


Reverse-complementing the sequences
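
For reference, `seqtk seq -r` writes the reverse complement of each sequence rather than a plain reversal. A minimal pure-Python sketch of the same operation is shown here. The function name is hypothetical, and it handles only A, C, G and T; gaps and other symbols such as N are left unchanged, which is simpler than what seqtk does with IUPAC codes.

```python
# Translation table for the four DNA bases (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

example = reverse_complement("ACG-TN")
print(example)
```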

In [None]:
input_fasta = Path("/home/beatriz/MIC/2_Micro/data_tree/accession_sequences.fasta")
output_fasta = Path("/home/beatriz/MIC/2_Micro/data_tree/accession_revers_seq.fasta")

# Run seqtk without a shell: "seq -r" writes the reverse complement of
# every sequence to stdout, which is redirected into the output file
import subprocess
with open(output_fasta, "w") as out:
    subprocess.run(["seqtk", "seq", "-r", str(input_fasta)], stdout=out, check=True)


Comparing our data with the database data

In [None]:
# Our data: print the header and the first 150 bases of each sequence
for record in SeqIO.parse("/home/beatriz/MIC/2_Micro/data_tree/accession_sequences.fasta", "fasta"):
    print(f">{record.id}")
    print(record.seq[:150])

# Database Data from picrust

In [None]:
# Check first few entries of PICRUSt2's reference database
with open('/home/beatriz/miniconda3/envs/picrust2/lib/python3.9/site-packages/picrust2/default_files/prokaryotic/pro_ref/pro_ref.fna') as f:
    print("First few lines of PICRUSt2 reference database:")
    for i, line in enumerate(f):
        print(line.strip())
        if i > 10:  # Print first few lines only
            break

## 2.6. Classifying Bacteria by their Source DataFrame
Two distinct classification approaches are implemented to categorize bacteria. The simple approach (get_bacteria_sources_simple) divides bacteria into known corrosion-causers (usual_taxa) and candidates (all others). The detailed approach (get_bacteria_sources_detailed) provides finer categorization by separating bacteria into known corrosion-causers, pure checked taxa, pure core taxa, and those present in both the checked and core datasets. Note that these functions use the Integrated DataFrame for source classification, not abundance.biom, which is used for the PICRUSt2 pipeline.
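
The classification rules reduce to a lookup on the normalized Source label. The sketch below restates them for a single label. `classify_source` is a hypothetical helper, and the `"unclassified"` fallback is an addition made here for illustration; the notebook functions do not return that category.

```python
# Source labels that mark a genus as a known corrosion-causer (usual_taxa)
KNOWN_LABELS = {"us", "core-us", "chk-us", "chk-core-us"}

def classify_source(label, detailed=False):
    """Return the group name for one Source label."""
    label = str(label).strip().lower()
    if label in KNOWN_LABELS:
        return "known_bacteria"
    if not detailed:
        return "other_bacteria"       # simple scheme: everything else
    if label == "chk":
        return "pure_checked"
    if label == "core":
        return "pure_core"
    if "chk-core" in label:
        return "checked_core"
    return "unclassified"             # assumption: label matched no rule

for raw in ["us", " Chk-Core-US ", "chk", "core", "chk-core"]:
    print(raw, "->", classify_source(raw, detailed=True))
```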

In [None]:
def get_bacteria_sources_simple(Integrated_df):
    """
    Simple classification:
    1. Known (anything with 'us')
    2. All others (combined chk, core, chk-core)
    """
    # Get genera and GIDs from the "Genus" and "GID" columns
    genera = Integrated_df["Genus"]
    gids = Integrated_df["GID"]
    # Look for Source in the columns, not the index
    if 'Source' not in Integrated_df.columns:
        raise KeyError("Integrated_df needs a 'Source' column for classification")
    sources = Integrated_df['Source']

    known_bacteria = {}     # usual_taxa
    other_bacteria = {}     # everything else

    sources_found = set()
    patterns = ['us', 'core-us', 'chk-us', 'chk-core-us']

    for genus, gid, raw_source in zip(genera, gids, sources):
        source = str(raw_source).strip().lower()
        sources_found.add(source)

        if source in patterns:
            known_bacteria[genus] = int(gid) if str(gid).isdigit() else gid
        else:
            other_bacteria[genus] = int(gid) if str(gid).isdigit() else gid

    print("\nSimple Classification Results:")
    print(f"Known corrosion bacteria: {len(known_bacteria)}")
    print(f"Other bacteria: {len(other_bacteria)}")
    print("\nSources found:", sources_found)

    return {
        'known_bacteria': known_bacteria,
        'other_bacteria': other_bacteria
    }

def get_bacteria_sources_detailed(Integrated_df):
    """
    Detailed classification with all possible combinations:
    1. Known (usual_taxa)
    2. Pure checked (only 'chk')
    3. Pure core (only 'core')
    4. Checked-core (overlap 'chk-core')
    """
    # Get genera and GIDs from the "Genus" and "GID" columns, as in the
    # simple classification, so that each row stays aligned with its Source
    genera = Integrated_df["Genus"]
    gids = Integrated_df["GID"]

    if 'Source' not in Integrated_df.columns:
        raise KeyError("Integrated_df needs a 'Source' column for classification")
    sources = Integrated_df['Source']

    known_bacteria = {}      # usual_taxa
    pure_checked = {}        # only 'chk' checked_taxa
    pure_core = {}          # only 'core' core_taxa
    checked_core = {}       # 'chk-core' checked and core taxa
    sources_found = set()
    patterns = ['us', 'core-us', 'chk-us', 'chk-core-us']

    for genus, gid, raw_source in zip(genera, gids, sources):
        source = str(raw_source).strip().lower()
        sources_found.add(source)

        if source in patterns:
            known_bacteria[genus] = int(gid) if str(gid).isdigit() else gid
            continue

        # Then handle other combinations
        if source == 'chk':
            pure_checked[genus] = gid
        elif source == 'core':
            pure_core[genus] = gid
        elif 'chk-core' in source:
            checked_core[genus] = gid

    print("\nDetailed Classification Results:")
    print(f"Known corrosion bacteria: {len(known_bacteria)}")
    print(f"Pure checked bacteria: {len(pure_checked)}")
    print(f"Pure core bacteria: {len(pure_core)}")
    print(f"Checked-core bacteria: {len(checked_core)}")
    print("\nSources found:", sources_found)

    return {
        'known_bacteria': known_bacteria,
        'pure_checked': pure_checked,
        'pure_core': pure_core,
        'checked_core': checked_core
    }

## 2.7. Preparing PICRUSt2 Data and Creating Directories for PICRUSt2 Input
The check_missing_genera function processes the integrated data and handles data quality control. Known problematic genera (e.g., 'Clostridium_sensu_stricto_12', 'Oxalobacteraceae_unclassified') are flagged for exclusion to prevent analysis errors. The function also creates an organized directory structure as outlined in the introduction, with separate paths for the different bacterial classifications (known_mic, candidate_mic, etc.) and their respective analysis outputs (EC_predictions, pathway_predictions, KO_predictions). The following functions prepare the data for the PICRUSt2 analysis. Both the abundance.biom table and the Integrated DataFrame contain some bacteria that were not sequenced, mostly because they are not known specimens, so the same procedure must be applied to both.
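
The directory layout described above can be sketched as follows. The constants here are assumptions for illustration: `EXAMPLE_SIMPLE_BASE` and `EXAMPLE_SUBDIRS` only reuse the directory names mentioned in the text and may differ from the notebook's own SIMPLE_BASE and SUBDIRS. The tree is built in a temporary directory so the example does not touch the project folders.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the notebook's SIMPLE_BASE and SUBDIRS
EXAMPLE_SIMPLE_BASE = {"known": "known_mic", "other": "candidate_mic"}
EXAMPLE_SUBDIRS = ["EC_predictions", "pathway_predictions", "KO_predictions"]

def make_tree(base_dir, groups, subdirs):
    """Create base_dir/<group dir>/<subdir> for every combination."""
    created = []
    for dir_name in groups.values():
        for subdir in subdirs:
            path = Path(base_dir) / dir_name / subdir
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created

with tempfile.TemporaryDirectory() as tmp:
    created = make_tree(tmp, EXAMPLE_SIMPLE_BASE, EXAMPLE_SUBDIRS)
    relative = sorted(p.relative_to(tmp).as_posix() for p in created)
    all_exist = all(p.is_dir() for p in created)

print(relative)
```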

In [None]:
def prepare_picrust_data(Integrated_df, aligned_file, function_type='simple'):
    """
    Prepare data for PICRUSt analysis with a choice of function_type method

    Args:
        Integrated_df: Input DataFrame
        aligned_file: Path to aligned sequences (currently unused)
        function_type: 'simple' or 'detailed'
    """
    # Get bacteria source_groups based on the chosen function_type
    if function_type == 'simple':
        source_groups = get_bacteria_sources_simple(Integrated_df)
    else:
        source_groups = get_bacteria_sources_detailed(Integrated_df)

    # Create appropriate directory structure
    create_directory_structure(function_type)

    return source_groups

def create_directory_structure(function_type='simple'):
    """Create directory structure for PICRUSt analysis"""
    base_dir = Path("/home/beatriz/MIC/2_Micro/data_picrust")

    if function_type == 'simple':
        directories = SIMPLE_BASE
    else:
        directories = DETAILED_BASE

    # Create all required directories
    for dir_name in directories.values():
        for subdir in SUBDIRS:
            (base_dir / dir_name / subdir).mkdir(parents=True, exist_ok=True)

# 3. PICRUSt Pipeline Definition
The pipeline processes the aligned sequence data from Notebook 5, with or without the sequence cleaning described in Section 2. It also processes the biom_table so that abundance is accounted for in the analysis. It queries the PICRUSt2 database to predict potential metabolic pathways for each genus. This prediction is based on evolutionary relationships and the known genomic capabilities of related organisms.
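
One way to keep the PICRUSt2 call easy to inspect is to build the argument list separately from running it. The sketch below uses only the options that appear in this notebook's own call (`-s`, `-i`, `-o`, `--processes`, `--min_align`). `build_picrust2_command` is a hypothetical helper, and the file names in the example are placeholders.

```python
def build_picrust2_command(fasta_file, biom_file, output_dir,
                           processes=4, min_align=None):
    """Return the argument list for picrust2_pipeline.py without running it."""
    cmd = [
        "picrust2_pipeline.py",
        "-s", str(fasta_file),    # study sequences (FASTA)
        "-i", str(biom_file),     # abundance table (BIOM)
        "-o", str(output_dir),    # output directory
        "--processes", str(processes),
    ]
    if min_align is not None:
        if not 0 <= min_align <= 1:
            raise ValueError("min_align must be a proportion between 0 and 1")
        cmd += ["--min_align", str(min_align)]
    return cmd

# Placeholder file names for illustration
cmd = build_picrust2_command("seqs.fasta", "table.biom", "out", min_align=0.25)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with paths that contain spaces.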

In [None]:
def run_picrust2_pipeline(fasta_file, biom_file, output_dir):
    """
    Run the main PICRUSt2 pipeline on input sequences and BIOM table.

    Args:
        fasta_file: Path to the aligned sequences FASTA file.
        biom_file: Path to the BIOM table (without extra columns).
        output_dir: Directory for PICRUSt2 output.
    """
    try:
        # Run main PICRUSt2 pipeline
        cmd = [
            'picrust2_pipeline.py',
            '-s', str(fasta_file),   # Input FASTA file with the study sequences
            '-i', str(biom_file),    # BIOM table with abundance data
            '-o', str(output_dir),   # Output directory
            '--processes', '4',      # Parallel processes
            '--verbose',
            '--min_align', '0.25'    # Minimum proportion of a query that must align
        ]
        subprocess.run(cmd, check=True)

        # Add pathway descriptions if the pathway file exists
        pathway_file = os.path.join(output_dir, 'pathways_out/path_abun_unstrat.tsv.gz')
        if os.path.exists(pathway_file):
            cmd_desc = [
                'add_descriptions.py',
                '-i', pathway_file,
                '-m', 'METACYC',   # mapping type for MetaCyc pathway abundances
                '-o', os.path.join(output_dir, 'pathways_with_descriptions.tsv')
            ]
            subprocess.run(cmd_desc, check=True)

        print(f"PICRUSt2 pipeline completed successfully for {output_dir}")
        return True

    except subprocess.CalledProcessError as e:
        print(f"Error running PICRUSt2: {e}")
        return False

# 4. Analysis of Pathways
The analysis focuses on metabolic pathways known to be involved in microbially influenced corrosion, including sulfur metabolism, organic acid production, iron metabolism, and biofilm formation. These pathways were selected based on documented mechanisms of known corrosion-inducing bacteria. Separate pipeline runs for simple and detailed classifications ensure proper pathway analysis for each bacterial group.
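
The keyword filter at the heart of this step can be sketched with the standard library alone. The table below is toy data invented for illustration (the IDs, descriptions and values are not PICRUSt2 output), and `filter_pathways` and `total_abundance` are hypothetical helper names.

```python
import csv
import io
import re

# Toy pathway table: ID, description, then one abundance column per sample
TOY_TSV = "\n".join([
    "\t".join(["pathway", "description", "sample_1", "sample_2"]),
    "\t".join(["TOY-1", "Sulfur oxidation (toy)", "1.5", "0.5"]),
    "\t".join(["TOY-2", "amino acid biosynthesis (toy)", "3.0", "4.0"]),
    "\t".join(["TOY-3", "iron uptake (toy)", "0.25", "0.75"]),
])

def filter_pathways(tsv_text, keywords):
    """Keep rows whose description contains any keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if pattern.search(row["description"])]

def total_abundance(rows, sample_columns):
    """Sum the abundance of each kept pathway across the given samples."""
    return {row["pathway"]: sum(float(row[c]) for c in sample_columns)
            for row in rows}

hits = filter_pathways(TOY_TSV, ["sulfur", "iron", "biofilm"])
totals = total_abundance(hits, ["sample_1", "sample_2"])
print(totals)
```

Escaping each keyword with `re.escape` keeps characters such as parentheses in a pathway name from being read as regular-expression syntax.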

In [None]:
def analyze_functional_profiles(picrust_output_dir, bacteria_list):
    """
    Analyze functional profiles with focus on corrosion-relevant pathways

    Parameters:
    picrust_output_dir: directory containing PICRUSt2 output
    bacteria_list: list of bacteria names to analyze
    """
    # Define corrosion-relevant pathways
    relevant_pathways = [
        'Sulfur metabolism',
        'Iron metabolism',
        'Energy metabolism',
        'Biofilm formation',
        'Metal transport',
        'ochre formation',
        'iron oxide deposits',
        'iron precipitation',
        'rust formation',
        'organic acid production',
        'acetate production',
        'lactate metabolism',
        'formate production',
    ]

    try:
        # Read PICRUSt2 output: pathway ID, description, then one abundance
        # column per sample ID of the BIOM table
        pathway_file = os.path.join(picrust_output_dir, 'pathways_with_descriptions.tsv')
        pathways_df = pd.read_csv(pathway_file, sep='\t')

        # Abundance columns are the numeric ones; requested names that are
        # not columns of the table are skipped
        numeric_cols = list(pathways_df.select_dtypes(include='number').columns)
        present = [b for b in bacteria_list if b in numeric_cols]

        # Filter for relevant pathways
        filtered_pathways = pathways_df[
            pathways_df['description'].str.contains('|'.join(relevant_pathways),
                                                  case=False,
                                                  na=False)]

        # Total abundance per pathway description
        pathway_abundances = filtered_pathways.groupby('description')[numeric_cols].sum()

        # Pathway similarities: correlate each requested column with the others
        pathway_similarities = {}
        for bacteria in present:
            pathway_similarities[bacteria] = pathways_df[present].corrwith(pathways_df[bacteria])

        # Predict functional potential
        functional_predictions = {}
        for pathway in relevant_pathways:
            pathway_presence = filtered_pathways[
                filtered_pathways['description'].str.contains(pathway, case=False, na=False)
            ]
            if not pathway_presence.empty:
                values = pathway_presence[numeric_cols]
                functional_predictions[pathway] = {
                    'presence': len(pathway_presence),
                    'mean_abundance': values.mean().mean(),
                    'max_abundance': values.max().max()
                }

        # Correlation scores restricted to the relevant pathways: each requested
        # column is correlated with all other abundance columns, and the key
        # pathways are the five most abundant relevant pathways in that column
        relevant = filtered_pathways.set_index('description')[numeric_cols]
        correlation_scores = {}
        for bacteria in present:
            correlations = relevant.drop(columns=bacteria).corrwith(relevant[bacteria])
            correlation_scores[bacteria] = {
                'mean_correlation': correlations.mean(),
                'max_correlation': correlations.max(),
                'key_pathways': relevant[bacteria].nlargest(5).index.tolist()
            }

        comparison_results = {
            'pathway_similarities': pathway_similarities,
            'functional_predictions': functional_predictions,
            'correlation_scores': correlation_scores,
            'pathway_abundances': pathway_abundances.to_dict()
        }

        return filtered_pathways, comparison_results

    except Exception as e:
        print(f"Error in pathway analysis: {str(e)}")
        return None, None

# Testing the pipeline

In [None]:
# ---- RUNNING THE PIPELINE ----

# Set paths
aligned_fasta_file = Path('/home/beatriz/MIC/2_Micro/data_tree/accession_sequences.fasta') #'data_tree/aligned_sequences_integrate.fasta')
abundance_biom_file =  Path('/home/beatriz/MIC/2_Micro/data_picrust/abundance_accession.biom')
output_dir = 'picrust9_output'

# List of bacteria to analyze
bacteria_of_interest = ['Azospira', 'Brachybacterium', 'Bulleidia']

# Run PICRUSt2
if run_picrust2_pipeline(aligned_fasta_file,
                         abundance_biom_file,
                         output_dir
                        ):
    # Analyze functional profiles if the pipeline completes successfully
    filtered_pathways, abundances = analyze_functional_profiles(output_dir, bacteria_of_interest)

# 5. Functional Analysis
The analysis workflow begins by categorizing bacteria into source groups using the classification functions. These categorized data are then processed through the PICRUSt pipeline to predict metabolic capabilities. The functional analysis examines pathway presence, abundance, and correlations between different bacterial groups to identify potential corrosion-related metabolic patterns.
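
The correlation between two functional profiles mentioned above is a Pearson correlation over matching pathways. The sketch below computes it from first principles on made-up abundance vectors; `pearson` is a hypothetical helper and the numbers are illustrative only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long abundance vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")   # a constant profile has no defined correlation
    return cov / sqrt(var_x * var_y)

# Made-up pathway abundances for two groups, in the same pathway order
known_profile = [1.0, 2.0, 3.0, 4.0]
candidate_profile = [2.0, 4.0, 6.0, 8.0]
r = pearson(known_profile, candidate_profile)
print(r)
```

A value close to 1 means the two groups have proportionally similar predicted pathway abundances; it does not by itself show that the pathways are corrosion-related.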

In [None]:
def run_functional_analysis(biom_file, Integrated_df, aligned_file, analysis_type='simple'):
    """
    Run complete functional analysis pipeline for either simple or detailed classification

    Parameters:
    biom_file: Path to the BIOM abundance table passed to PICRUSt2
    Integrated_df: DataFrame used for the source classification
    aligned_file: Path to aligned sequences file
    analysis_type: 'simple' or 'detailed'

    Returns a dict mapping each group to the output of analyze_functional_profiles.
    """
    print(f"\n{'='*50}")
    print(f"Starting {analysis_type} classification analysis")
    print(f"{'='*50}")

    # Prepare data and get source groups
    print("\nStep 1: Preparing data...")
    source_groups = prepare_picrust_data(Integrated_df, aligned_file, function_type=analysis_type)

    if not source_groups:
        raise ValueError("Failed to prepare data: No source groups returned")

    # Base directory for PICRUSt output
    base_dir = Path("/home/beatriz/MIC/2_Micro/data_picrust")
    directories = SIMPLE_BASE if analysis_type == 'simple' else DETAILED_BASE

    # Directory keys that differ from the keys returned by the classifiers
    group_keys = {'known': 'known_bacteria', 'other': 'other_bacteria'}

    results = {}
    # Run PICRUSt2 once per group, then analyze that group's bacteria
    for group, dir_name in directories.items():
        group_output_dir = base_dir / dir_name
        if run_picrust2_pipeline(aligned_file, biom_file, str(group_output_dir)):
            bacteria = source_groups[group_keys.get(group, group)].keys()
            results[group] = analyze_functional_profiles(str(group_output_dir), bacteria)
        else:
            print(f"PICRUSt2 failed for group '{group}'; skipping its analysis")

    return results


Candidate input files: diagnostic_optimized_sequences.fasta, picrust_ready_sequences.fasta

In [None]:
# Run the analysis for both types
# The first argument is the path to the BIOM table, the second the Integrated DataFrame
# Simple source classification
simple_results = run_functional_analysis(abundance_biom_file, Integrated_df, aligned_file, analysis_type='simple')

# Detailed source classification
detailed_results = run_functional_analysis(abundance_biom_file, Integrated_df, aligned_file, analysis_type='detailed')

# 6. Findings and Discussion

In [None]:
def run_picrust2_pipeline(fasta_file, biom_file, output_dir, min_align=0.5):
    """
    Run PICRUSt2 pipeline with improved error handling and path management

    Args:
        fasta_file: Path to aligned sequences fasta file (str or Path)
        biom_file: Path to the BIOM abundance table (str or Path)
        output_dir: Directory for PICRUSt2 output (str or Path)
        min_align: Minimum proportion of a query sequence that must align
    """
    import subprocess
    import shutil
    import os

    # Convert paths to strings
    fasta_file = str(fasta_file)
    biom_file = str(biom_file)
    output_dir = str(output_dir)

    try:
        # Verify picrust2 is available
        if shutil.which('picrust2_pipeline.py') is None:
            raise RuntimeError("picrust2_pipeline.py not found. Please ensure PICRUSt2 is properly installed.")

        # Create only the parent directory: PICRUSt2 creates the output
        # directory itself and stops if it already exists
        os.makedirs(os.path.dirname(output_dir) or '.', exist_ok=True)

        # Construct the command as an argument list (no shell needed);
        # -i takes the abundance table, not the FASTA file
        cmd = ['picrust2_pipeline.py',
               '-s', fasta_file,
               '-i', biom_file,
               '-o', output_dir,
               '--processes', '1',
               '--min_align', str(min_align),
               '--verbose']

        # Run pipeline
        print(f"Running command: {' '.join(cmd)}")
        process = subprocess.run(cmd, check=True, capture_output=True, text=True)

        print("PICRUSt2 Output:")
        print(process.stdout)

        if process.stderr:
            print("Warnings/Errors:")
            print(process.stderr)

        # Add descriptions if pathway file exists
        pathway_file = os.path.join(output_dir, 'pathways_out/path_abun_unstrat.tsv.gz')
        if os.path.exists(pathway_file):
            desc_cmd = ['add_descriptions.py', '-i', pathway_file, '-m', 'METACYC',
                        '-o', os.path.join(output_dir, 'pathways_with_descriptions.tsv')]
            subprocess.run(desc_cmd, check=True)

        print(f"PICRUSt2 pipeline completed successfully for {output_dir}")
        return True

    except subprocess.CalledProcessError as e:
        print(f"Error running PICRUSt2 command: {e}")
        print(f"Command output: {e.stdout}")
        print(f"Command errors: {e.stderr}")
        return False
    except Exception as e:
        print(f"Error in pipeline: {str(e)}")
        return False

In [None]:
# BIOM table; its observation IDs must match the FASTA record IDs
biom_file = Path("/home/beatriz/MIC/2_Micro/data_picrust/abundance.biom")

# For original sequences
aligned_file = Path("/home/beatriz/MIC/2_Micro/data_tree/aligned_sequences_integrate.fasta")
output_dir = Path("/home/beatriz/MIC/2_Micro/data_picrust/original_results")
success = run_picrust2_pipeline(aligned_file, biom_file, output_dir)

# For improved sequences
optimized_file = Path("/home/beatriz/MIC/2_Micro/data_tree/picrust_optimized_sequences.fasta")
optimized_output = Path("/home/beatriz/MIC/2_Micro/data_picrust/optimized_results")
success_opt = run_picrust2_pipeline(optimized_file, biom_file, optimized_output)