# 1. Bacteria Influencing Corrosion
This notebook aims to determine which of the microorganisms in the data have a recognised influence on corrosion damage. From this pool of bacteria, we compare the novel bacteria found here against the sequences of known corrosion-related genes associated with microbiologically influenced corrosion (MIC), and against metabolic pathways that are, or could be, related to corrosion. The resulting list then serves as a set of "anchors" for finding associated bacteria, by looking for similar metabolic patterns in other species not yet linked to MIC.
## Aims
- Compare specific functional genes known to be involved in corrosion processes, focusing particularly on sulfate reduction pathways (dsrAB, aprAB genes), metal reduction genes, and cytochrome c3 complexes.
- Perform a targeted comparative genomic analysis between known corrosion-causing bacteria and the newly identified bacterial specimens from this research.

The databases used in this notebook are:

- BacMet: 'https://bacmet.biomedicine.gu.se/download.html'
- KEGG (Kyoto Encyclopedia of Genes and Genomes): 'https://www.genome.jp/kegg/pathway.html', used to find metabolic pathways and identify functional gene annotations
- IMG/M: 'https://img.jgi.doe.gov/', for detailed metabolic pathways
- BRENDA: 'https://www.brenda-enzymes.org/'

1. Initial Computational Screening  --> 1a. search_all_databases -->1b. analyze_metabolic_pathways 
   ↓                                                                 ↓
2. Literature Validation                           <--- 1c. Literature Analysis
   ↓  
3. Metabolic pathway analysis and mapping with PICRUSt2, which can predict metabolic functions from 16S data
   ↓
4. Find functional similarities between known and candidate bacteria; compare taxonomic groups with similar functional profiles
   ↓
5. Sequence analysis for sulfate reduction genes, iron metabolism genes, and biofilm formation genes
   ↓
6. Identify gene clusters associated with iron metabolism

Directory structure used by the notebook: /home/beatriz/MIC/2_Micro/data_MIC/
├── bacteria_corrosion_summary.xlsx  # Main results file with multiple sheets
├── Original_data/                   # For future bacteria lists
└── References/                      # For literature results
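
A minimal sketch of creating this layout with `pathlib`. It runs in a temporary directory rather than under the real base path, and `make_layout` is a hypothetical helper name that does not appear elsewhere in the notebook.

```python
from pathlib import Path
import tempfile


def make_layout(base_dir):
    """Create the two sub-folders and return the path of the results workbook."""
    base = Path(base_dir)
    for sub in ("Original_data", "References"):
        # parents=True also creates base_dir itself; exist_ok=True makes re-runs safe
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base / "bacteria_corrosion_summary.xlsx"


# Demonstration in a throwaway directory (assumption: the real base path is not used here)
with tempfile.TemporaryDirectory() as tmp:
    results_path = make_layout(Path(tmp) / "data_MIC")
    created = sorted(p.name for p in results_path.parent.iterdir())
    print(created)
```

The workbook itself is not created here; only its path is returned, so the first save can decide whether the file already exists.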

In [2]:
'''import os
from google.colab import drive  #silence for vscode
drive.mount('/content/drive')
#change the path
os.chdir('/content/drive/My Drive/MIC')
# For colab
!pip install pandas numpy biopython
!pip install requests beautifulsoup4
!pip install Bio'''

# 2. Preparing data

In [3]:
import os
from pathlib import Path
from Bio import Entrez
import pandas as pd
from functools import partial
import requests
from bs4 import BeautifulSoup
import time
import urllib3
from datetime import datetime
import logging
import numpy as np
import matplotlib.pyplot as plt
import openpyxl
from openpyxl.styles import Alignment

In [4]:
# For VSCode
base_dir = Path("/home/beatriz/MIC/2_Micro/data_MIC")
original_dir = base_dir / "Original_data"
Literatur_dir = base_dir / "References"
results_file = base_dir / "bacteria_corrosion_summary.xlsx" 

# For Colab
'''
from google.colab import drive
drive.mount('/content/drive')
base_dir = Path('/content/drive/My Drive/MIC/data')
original_dir = base_dir / "original"
original_dir.mkdir(exist_ok=True)
'''

In [5]:
# Read the Excel file for the whole data
Jointax = pd.read_excel("data/Jointax.xlsx", sheet_name='Biotot_jointax', header=[0,1,2,3,4,5,6,7])
# Drop the first two columns
Jointax = Jointax.drop(Jointax.columns[0:2], axis=1)

In [6]:
# Read the Excel file for the checked genera
selected = pd.read_excel("/home/beatriz/MIC/2_Micro/data/finalist_dfs.xlsx", sheet_name='checked_genera', header=[0,1,2,3,4,5,6,7])
# Drop first row specifically (index 0 which contains NaNs)
selected = selected.drop(index=0)
# Drop the first three columns (including the index column with Level1, Level2, etc)
selected = selected.drop(selected.columns[0:3], axis=1)

In [7]:
selected_list = selected.columns.get_level_values(6)

In [8]:
# Extract genus names and IDs from the multi-index columns, for the selected genera
selected_GID = dict(zip(selected.columns.get_level_values(6), selected.columns.get_level_values(7)))
# For all genera 
all_GID = dict(zip(Jointax.columns.get_level_values(6), Jointax.columns.get_level_values(7)))
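
A minimal sketch of how such a genus-to-ID mapping is built from multi-level column headers. The toy table below is an assumption made for illustration: it has three header levels instead of eight, and the taxon names and IDs are invented.

```python
import pandas as pd

# Hypothetical stand-in for the real 8-level header: only family, genus and ID,
# so the levels are addressed by name instead of by position 6 and 7.
columns = pd.MultiIndex.from_tuples(
    [
        ("Desulfovibrionaceae", "Desulfovibrio", "G001"),
        ("Shewanellaceae", "Shewanella", "G002"),
    ],
    names=["Family", "Genus", "GID"],
)
toy = pd.DataFrame([[10, 3], [7, 0]], columns=columns)

# Same pattern as in the cell above: pair one header level with another
genus_to_gid = dict(
    zip(toy.columns.get_level_values("Genus"), toy.columns.get_level_values("GID"))
)
print(genus_to_gid)
```

Because the result is a plain dictionary, a genus that appears in more than one column keeps only the ID of its last occurrence.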

# 3. Reference Formatting Function
The following function takes the references returned by the search and formats them as an APA-style list.

In [9]:
def format_apa_reference(article):
    """Format article data into APA style reference"""
    try:
        # Get authors (group authors carry 'CollectiveName' instead of 'LastName')
        authors = article.get('AuthorList', [])
        if authors:
            author_list = []
            for author in authors:
                last_name = author.get('LastName') or author.get('CollectiveName', 'Unknown')
                fore_name = author.get('ForeName', '')
                if fore_name:
                    author_list.append(f"{last_name}, {fore_name[0]}.")
                else:
                    author_list.append(f"{last_name}")
            if len(authors) > 6:
                # Long author lists are shortened to the first author plus "et al."
                author_text = f"{author_list[0]}, et al."
            else:
                author_text = ", ".join(author_list[:-1]) + " & " + author_list[-1] if len(author_list) > 1 else author_list[0]
        else:
            author_text = "No author"

        # Get year
        pub_date = article['Journal']['JournalIssue']['PubDate']
        year = pub_date.get('Year', 'n.d.')

        # Get title
        title = article.get('ArticleTitle', 'No title')
        
        # Get journal info
        journal = article['Journal']
        journal_title = journal.get('Title', journal.get('ISOAbbreviation', 'No journal'))
        
        # Get volume, issue, pages
        volume = journal['JournalIssue'].get('Volume', '')
        issue = journal['JournalIssue'].get('Issue', '')
        pagination = article.get('Pagination', {}).get('MedlinePgn', '')

        # Format the reference
        reference = f"{author_text} ({year}). {title}. {journal_title}"
        if volume:
            reference += f", {volume}"
        if issue:
            reference += f"({issue})"
        if pagination:
            reference += f", {pagination}"
        reference += "."

        return reference
    except Exception as e:
        return f"Error formatting reference: {str(e)}"
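
The year lookup in the function above falls back to 'n.d.' whenever 'Year' is missing. Some PubMed records carry the date only as a free-text `MedlineDate` field, so a fallback along the following lines may recover the year. This is a sketch under that assumption; `extract_year` is a hypothetical helper and is not part of the notebook.

```python
import re


def extract_year(pub_date):
    """Return the publication year from a PubMed-style PubDate mapping."""
    if pub_date.get("Year"):
        return str(pub_date["Year"])
    # Free-text dates look like "1998 Dec-1999 Jan"; take the first 4-digit number
    match = re.search(r"\d{4}", str(pub_date.get("MedlineDate", "")))
    return match.group(0) if match else "n.d."


print(extract_year({"Year": "2021"}))
print(extract_year({"MedlineDate": "1998 Dec-1999 Jan"}))
print(extract_year({}))
```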

# 4. Query DB: Search Multiple Databases
We connect to and search multiple databases for MIC-related terms. In phase 1 (search_mic_databases), the code establishes the initial database connections using the provided email, searches BacMet, KEGG, IMG/M, and BRENDA for MIC-related keywords, and uses the defined metabolic pathways to categorize the results. It returns a DataFrame with the columns [Bacteria, Database, Evidence, Pathway]. The bacteria are searched in batches: each batch passes through all functions in sequence (search_all_databases, analyze_metabolic_pathways, literature_analysis), the completed batch is saved, and processing moves on to the next batch.
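
The batch flow described above can be sketched as follows. This is an illustration only: `batched` and `process_in_batches` are hypothetical helper names, and the search and save steps are replaced by simple stand-ins so that the example runs without network access.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(names, search_fn, save_fn, batch_size=5):
    """Run `search_fn` on every name, calling `save_fn` once per finished batch."""
    all_results = []
    for batch in batched(list(names), batch_size):
        batch_results = [search_fn(name) for name in batch]
        save_fn(batch_results)  # persist the batch before moving on to the next one
        all_results.extend(batch_results)
    return all_results


# Stand-ins for the real search and save steps (assumptions for the demo)
saved_batches = []
results = process_in_batches(
    ["Desulfovibrio", "Shewanella", "Geobacter"],
    search_fn=lambda name: {"bacteria": name},
    save_fn=saved_batches.append,
    batch_size=2,
)
print(len(saved_batches), "batches saved")
```

Saving after each batch means that an interruption loses at most the batch in progress.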

In [10]:
def search_corrosion_genes(bacteria_name, base_dir, Literatur_dir, gid_dict):
    """
    Search for specific corrosion-related genes and pathways for a given bacteria
    
    Parameters:
    bacteria_name: str - name of the bacteria to search
    base_dir: Path - base directory for saving results
    Literatur_dir: Path - directory for literature results
    gid_dict: dict - mapping of bacteria names to their GIDs
    """
    # Get GID for this bacteria
    bacteria_gid = gid_dict.get(bacteria_name, f"NEW_{bacteria_name}")  # Use NEW_ prefix for new bacteria
    
    # Defining the results file within base_dir
    results_file = base_dir / "bacteria_corrosion_summary.xlsx"

    # Add timing for individual bacteria
    bacteria_start_time = time.time()
    print(f"Starting search for {bacteria_name} at: {datetime.now().strftime('%H:%M:%S')}")
    
    results = {
        'bacteria': bacteria_name,
        'sulfate_reduction': False,
        'metal_reduction': False,
        'corrosion_associated': False,
        'cytochrome_c3': False,
        'acid_production': False,
        'biofilm_formation': False,
        'h2s_production': False,
        'literature_count': 0,
        'evidence': [],
        'processing_time': 0
    }
    
    # Creating a structured record for each bacteria
    bacteria_record = {
        'Name': bacteria_name,
        'Metabolism': [],
        'Terms': [],
        'Hits': 0,
        'Last_Reference': '',
        'Abstract': ''
    }
    
    try:
        # 1. Check KEGG for pathways and genes
        base_url = "https://rest.kegg.jp/"
        
        # Look for pathway modules (keyword search; the timeout stops a stalled request from hanging the batch)
        pathway_response = requests.get(f"{base_url}find/module/{bacteria_name}", timeout=30)
        pathway_text = pathway_response.text.lower()
        
        # Define search terms
        sulfate_terms = [
        'sulfate', 'sulphate', 
        'dsrab', 'dsra', 'dsrb',  # Breaking down dsrAB into individual components
        'aprab', 'apra', 'aprb',  # Breaking down aprAB into individual components
        'sulfite', 'sulphite',
        'sat',  # Sulfate adenylyltransferase
        'sox',  # Sulfur oxidation
        'sir',  # Sulfite reductase
        'aps'   # Adenosine phosphosulfate
        ]           
               
        # Terms are matched as substrings of the lowercased KEGG text, so they must be lowercase.
        # 'MIC' was dropped: in uppercase it could never match, and 'mic' would match
        # inside unrelated words such as 'microbial'.
        metal_terms = [
                    'metal', 'iron', 'fe(iii)', 'metal deterioration',
                    'cytochrome', 'corrosion', 'biocorrosion',
                    'methane corrosion', 'methanogenesis corrosion',
                    'bacteria corrosion', 'anaerobic corrosion',
                    'biofilm corrosion', 'manganese corrosion',
                    'denitrification corrosion',
                    'mtr',  # Metal reduction
                    'omc',  # Outer membrane cytochromes
                    'pil',  # Pili genes involved in metal reduction
                    'cyma',  # cymA, cytoplasmic membrane protein
                    'hyda',  # hydA, hydrogenase
                    'feo',  # Ferrous iron transport
                    'nrf',   # Nitrite reduction
                    # The entries below are PubMed-style boolean queries; with their uppercase
                    # operators they never match the lowercased KEGG text
                    'organic acid AND corrosion',
                    'acid metabolite AND metal deterioration',
                    'fermentation AND corrosion',
                    'biofilm AND (corrosion OR MIC)',
                    'hydrogen sulfide AND corrosion',
                    'thiosulfate AND corrosion'
                ]

        # Check pathway text
        if any(term in pathway_text for term in sulfate_terms):
            results['sulfate_reduction'] = True
            results['evidence'].append(f"Found sulfate pathway: {[term for term in sulfate_terms if term in pathway_text]}")
        
        if any(term in pathway_text for term in metal_terms):
            results['metal_reduction'] = True
            results['evidence'].append(f"Found metal pathway: {[term for term in metal_terms if term in pathway_text]}")
        
        # Look for genes
        genes_response = requests.get(f"{base_url}find/genes/{bacteria_name}", timeout=30)
        genes_text = genes_response.text.lower()
        
        if "cytochrome c3" in genes_text:
            results['cytochrome_c3'] = True
            results['evidence'].append("Found cytochrome c3 gene")
        
        if any(gene in genes_text for gene in ['dsr', 'apr', 'sat']):
            results['sulfate_reduction'] = True
            results['evidence'].append(f"Found sulfate genes: {[gene for gene in ['dsr', 'apr', 'sat'] if gene in genes_text]}")
            
    except Exception as e:
        print(f"KEGG API error for {bacteria_name}: {str(e)}")
    
    # 2. Check literature
    try:
        Entrez.email = "beatrizamandawatts@gmail.com"
        papers_details = []
        
        search_terms = [
            f"{bacteria_name}[Organism] AND (sulfate reduction OR dsrAB OR aprAB)",
            f"{bacteria_name}[Organism] AND (metal reduction OR iron reduction)",
            f"{bacteria_name}[Organism] AND cytochrome c3",
            f"{bacteria_name}[Organism] AND corrosion",
            f"{bacteria_name}[Organism] AND biocorrosion",
            f"{bacteria_name}[Organism] AND (MIC OR 'microbiologically influenced corrosion')",
            f"{bacteria_name}[Organism] AND 'material deterioration'",
            f"{bacteria_name}[Organism] AND ('metal deterioration' OR 'metallic corrosion')",
            f"{bacteria_name}[Organism] AND (acid production) AND (corrosion OR 'metal deterioration' OR MIC)",
            f"{bacteria_name}[Organism] AND biofilm AND (corrosion OR MIC)",
            f"{bacteria_name}[Organism] AND (hydrogen sulfide OR H2S) AND (corrosion OR 'metal deterioration')"
        ]
        
        for term in search_terms:
            handle = Entrez.esearch(db="pubmed", term=term)
            record = Entrez.read(handle)
            count = int(record["Count"])
            results['literature_count'] += count
            
            if count > 0:
                results['evidence'].append(f"Found {count} papers for: {term}")
                paper_ids = record["IdList"]
                
                try:
                    papers_handle = Entrez.efetch(db="pubmed", id=paper_ids, rettype="medline", retmode="xml")
                    papers = Entrez.read(papers_handle)
                    
                    # Update metabolism flags
                    if "sulfate" in term.lower():
                        results['sulfate_reduction'] = True
                        if 'Sulfate Reduction' not in bacteria_record['Metabolism']:
                            bacteria_record['Metabolism'].append('Sulfate Reduction')
                    
                    if "metal" in term.lower():
                        results['metal_reduction'] = True
                        if 'Metal Reduction' not in bacteria_record['Metabolism']:
                            bacteria_record['Metabolism'].append('Metal Reduction')
                    
                    if "cytochrome" in term.lower():
                        results['cytochrome_c3'] = True
                        if 'Cytochrome c3' not in bacteria_record['Metabolism']:
                            bacteria_record['Metabolism'].append('Cytochrome c3')
                    
                    bacteria_record['Hits'] += count
                    bacteria_record['Terms'].append(f"{term}: {count} hits")
                    
                    # Process paper details
                    if papers.get('PubmedArticle'):
                        latest_paper = papers['PubmedArticle'][0]
                        article = latest_paper['MedlineCitation']['Article']

                        # Format reference in APA style
                        bacteria_record['Last_Reference'] = format_apa_reference(article)
                        
                        # Get full abstract (structured abstracts have several sections)
                        if 'Abstract' in article:
                            abstract_text = " ".join(str(part) for part in article['Abstract']['AbstractText'])
                            bacteria_record['Abstract'] = abstract_text  # Store full abstract

                    time.sleep(1)  # Being nice to the APIs
                    
                except Exception as e:
                    print(f"Error processing PubMed data for {bacteria_name}: {e}")       
        # Save to Excel
        try:
            if results_file.exists():
                df = pd.read_excel(results_file, index_col= 0)
            else:
                df = pd.DataFrame(columns=['Name', 'Metabolism', 'Terms', 'Hits', 'Last_Reference', 'Abstract'],
                                                      index =pd.Index([], name ='GID'))
            
            # Convert lists to strings and ensure all fields exist
            new_row = pd.DataFrame({
                'Name': [bacteria_name],
                'Metabolism': ['; '.join(bacteria_record['Metabolism']) if bacteria_record['Metabolism'] else ''],
                'Terms': ['; '.join(bacteria_record['Terms']) if bacteria_record['Terms'] else ''],
                'Hits': [bacteria_record['Hits']],
                'Last_Reference': [bacteria_record.get('Last_Reference', '')],
                'Abstract': [bacteria_record.get('Abstract', '')]
            }, index=[bacteria_gid])
            # Update or append: drop any earlier row for this bacteria, then add the new one
            existing = df[df['Name'] != bacteria_name]
            df = new_row if existing.empty else pd.concat([existing, new_row])
            df.index.name = 'GID'
                
            # Generate timestamp for sheet name
            sheet_name = f"Analysis_{datetime.now().strftime('%Y%m%d_%H%M')}"

            # Write the table, including the last reference and its abstract
            with pd.ExcelWriter(results_file, engine='openpyxl') as writer:
                # Keep the GID index in the file so that read_excel(index_col=0) restores it
                df.to_excel(writer, sheet_name=sheet_name, index=True)

                # Get the worksheet
                worksheet = writer.sheets[sheet_name]
                
                # Format the Abstract column for wrapping
                for idx, col in enumerate(df.columns):
                    if col == 'Abstract':
                        # Column A holds the GID index, so data columns start at B (idx + 2)
                        col_letter = openpyxl.utils.get_column_letter(idx + 2)
                        # Make column wider and enable text wrapping
                        worksheet.column_dimensions[col_letter].width = 50
                        for cell in worksheet[col_letter]:
                            cell.alignment = openpyxl.styles.Alignment(wrap_text=True)

        except Exception as e:
            print(f"Error saving to Excel for {bacteria_name}: {e}")

    except Exception as e:
        print(f"Error in literature processing for {bacteria_name}: {e}")

    finally:
        results['processing_time'] = time.time() - bacteria_start_time
        print(f"Finished {bacteria_name} in {results['processing_time']:.2f} seconds")

    # Return outside `finally` so that an unexpected exception is not silently swallowed
    return results
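
The function above tests each term with a plain substring check, so short gene symbols such as 'sat' also match inside longer words. A possible alternative is whole-word matching, sketched below. `find_terms` is a hypothetical helper and the sample text is invented; it is not real KEGG output.

```python
import re


def find_terms(text, terms):
    """Return the terms that occur in `text` as whole words, ignoring case."""
    hits = []
    for term in terms:
        # Lookarounds instead of \b so that terms ending in punctuation,
        # such as 'fe(iii)', are still delimited correctly.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(term)
    return hits


# Invented sample text (assumption for the demo)
sample = "dissimilatory sulfate reduction; saturated fatty acid biosynthesis; CymA-like cytochrome"
print(find_terms(sample, ["sulfate", "sat", "cymA", "fe(iii)"]))
print("sat" in sample.lower())  # the plain substring check matches inside 'saturated'
```

Matching with `re.IGNORECASE` also removes the need to keep the term lists in lowercase.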

# 5. Searching Corrosion Genes
This function searches the PubMed database for each bacterium in the list against several corrosion-related criteria, in order to find which of the bacteria have previously been identified as causing corrosion damage.
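
One way to keep each result flag tied to the query that supports it is to key the queries by flag name. This is a sketch, not the notebook's method: `build_queries` and `flags_from_counts` are hypothetical helpers, the query list is abbreviated, and the hit counts in the usage lines are made up. In practice each count would come from the `Count` field of an `Entrez.esearch` result.

```python
def build_queries(bacteria_name):
    """Map each result flag to the PubMed query string that supports it."""
    organism = f"{bacteria_name}[Organism]"
    return {
        "sulfate_reduction": f"{organism} AND (sulfate reduction OR dsrAB OR aprAB)",
        "metal_reduction": f"{organism} AND (metal reduction OR iron reduction)",
        "cytochrome_c3": f"{organism} AND cytochrome c3",
        "corrosion_associated": f"{organism} AND corrosion",
        "biofilm_formation": f"{organism} AND biofilm AND (corrosion OR MIC)",
        "h2s_production": f"{organism} AND (hydrogen sulfide OR H2S) AND corrosion",
    }


def flags_from_counts(counts):
    """Turn {flag: hit count} into {flag: bool}; any hit sets the flag."""
    return {flag: count > 0 for flag, count in counts.items()}


queries = build_queries("Desulfovibrio")
# Made-up hit counts standing in for the PubMed responses
example_counts = {flag: 0 for flag in queries}
example_counts["sulfate_reduction"] = 12
flags = flags_from_counts(example_counts)
print(flags)
```

Keying by flag avoids inferring the category from words that happen to appear in the query text.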

In [15]:
def search_corrosion_genes(bacteria_name, base_dir, Literatur_dir, gid_dict):
    """
    Search for specific corrosion-related genes and pathways for a given bacteria
    
    Parameters:
    bacteria_name: str - name of the bacteria to search
    base_dir: Path - base directory for saving results
    Literatur_dir: Path - directory for literature results
    gid_dict: dict - mapping of bacteria names to their GIDs
    """
    # Get GID for this bacteria
    bacteria_gid = gid_dict.get(bacteria_name, f"NEW_{bacteria_name}")  # Use NEW_ prefix for new bacteria
    
    # Defining the results file within base_dir
    results_file = base_dir / "bacteria_corrosion_summary.xlsx"

    # Add timing for individual bacteria
    bacteria_start_time = time.time()
    print(f"Starting search for {bacteria_name} at: {datetime.now().strftime('%H:%M:%S')}")
    
    results = {
        'bacteria': bacteria_name,
        'sulfate_reduction': False,
        'metal_reduction': False,
        'corrosion_associated': False,
        'cytochrome_c3': False,
        'acid_production': False,
        'biofilm_formation': False,
        'h2s_production': False,
        'literature_count': 0,
        'evidence': [],
        'processing_time': 0
    }
    
    # Creating a structured record for each bacteria
    bacteria_record = {
        'Name': bacteria_name,
        'Metabolism': [],
        'Terms': [],
        'Hits': 0,
        'Last_Reference': '',
        'Abstract': ''
    }
    
    try:
        # 1. Check KEGG for pathways and genes
        base_url = "https://rest.kegg.jp/"
        
        # Look for pathway modules (keyword search; the timeout stops a stalled request from hanging the batch)
        pathway_response = requests.get(f"{base_url}find/module/{bacteria_name}", timeout=30)
        pathway_text = pathway_response.text.lower()
        
        # Define search terms
        sulfate_terms = [
        'sulfate', 'sulphate', 
        'dsrab', 'dsra', 'dsrb',  # Breaking down dsrAB into individual components
        'aprab', 'apra', 'aprb',  # Breaking down aprAB into individual components
        'sulfite', 'sulphite',
        'sat',  # Sulfate adenylyltransferase
        'sox',  # Sulfur oxidation
        'sir',  # Sulfite reductase
        'aps'   # Adenosine phosphosulfate
        ]           
               
        # Terms are matched as substrings of the lowercased KEGG text, so they must be lowercase.
        # 'MIC' was dropped: in uppercase it could never match, and 'mic' would match
        # inside unrelated words such as 'microbial'.
        metal_terms = [
                    'metal', 'iron', 'fe(iii)', 'metal deterioration',
                    'cytochrome', 'corrosion', 'biocorrosion',
                    'methane corrosion', 'methanogenesis corrosion',
                    'bacteria corrosion', 'anaerobic corrosion',
                    'biofilm corrosion', 'manganese corrosion',
                    'denitrification corrosion',
                    'mtr',  # Metal reduction
                    'omc',  # Outer membrane cytochromes
                    'pil',  # Pili genes involved in metal reduction
                    'cyma',  # cymA, cytoplasmic membrane protein
                    'hyda',  # hydA, hydrogenase
                    'feo',  # Ferrous iron transport
                    'nrf',   # Nitrite reduction
                    # The entries below are PubMed-style boolean queries; with their uppercase
                    # operators they never match the lowercased KEGG text
                    'organic acid AND corrosion',
                    'acid metabolite AND metal deterioration',
                    'fermentation AND corrosion',
                    'biofilm AND (corrosion OR MIC)',
                    'hydrogen sulfide AND corrosion',
                    'thiosulfate AND corrosion'
                ]

        # Check pathway text
        if any(term in pathway_text for term in sulfate_terms):
            results['sulfate_reduction'] = True
            results['evidence'].append(f"Found sulfate pathway: {[term for term in sulfate_terms if term in pathway_text]}")
        
        if any(term in pathway_text for term in metal_terms):
            results['metal_reduction'] = True
            results['evidence'].append(f"Found metal pathway: {[term for term in metal_terms if term in pathway_text]}")
        
        # Look for genes
        genes_response = requests.get(f"{base_url}find/genes/{bacteria_name}", timeout=30)
        genes_text = genes_response.text.lower()
        
        if "cytochrome c3" in genes_text:
            results['cytochrome_c3'] = True
            results['evidence'].append("Found cytochrome c3 gene")
        
        if any(gene in genes_text for gene in ['dsr', 'apr', 'sat']):
            results['sulfate_reduction'] = True
            results['evidence'].append(f"Found sulfate genes: {[gene for gene in ['dsr', 'apr', 'sat'] if gene in genes_text]}")
            
    except Exception as e:
        print(f"KEGG API error for {bacteria_name}: {str(e)}")
    
    # 2. Check literature
    try:
        Entrez.email = "beatrizamandawatts@gmail.com"
        papers_details = []
        
        search_terms = [
            f"{bacteria_name}[Organism] AND (sulfate reduction OR dsrAB OR aprAB)",
            f"{bacteria_name}[Organism] AND (metal reduction OR iron reduction)",
            f"{bacteria_name}[Organism] AND cytochrome c3",
            f"{bacteria_name}[Organism] AND corrosion",
            f"{bacteria_name}[Organism] AND biocorrosion",
            f"{bacteria_name}[Organism] AND (MIC OR 'microbiologically influenced corrosion')",
            f"{bacteria_name}[Organism] AND 'material deterioration'",
            f"{bacteria_name}[Organism] AND ('metal deterioration' OR 'metallic corrosion')",
            f"{bacteria_name}[Organism] AND (acid production) AND (corrosion OR 'metal deterioration' OR MIC)",
            f"{bacteria_name}[Organism] AND biofilm AND (corrosion OR MIC)",
            f"{bacteria_name}[Organism] AND (hydrogen sulfide OR H2S) AND (corrosion OR 'metal deterioration')"
        ]
        
        for term in search_terms:
            handle = Entrez.esearch(db="pubmed", term=term)
            record = Entrez.read(handle)
            count = int(record["Count"])
            results['literature_count'] += count
            
            if count > 0:
                results['evidence'].append(f"Found {count} papers for: {term}")
                paper_ids = record["IdList"]
                
                try:
                    papers_handle = Entrez.efetch(db="pubmed", id=paper_ids, rettype="medline", retmode="xml")
                    papers = Entrez.read(papers_handle)
                    
                    # Update metabolism flags
                    if "sulfate" in term.lower():
                        results['sulfate_reduction'] = True
                        if 'Sulfate Reduction' not in bacteria_record['Metabolism']:
                            bacteria_record['Metabolism'].append('Sulfate Reduction')
                    
                    if "metal" in term.lower():
                        results['metal_reduction'] = True
                        if 'Metal Reduction' not in bacteria_record['Metabolism']:
                            bacteria_record['Metabolism'].append('Metal Reduction')
                    
                    if "cytochrome" in term.lower():
                        results['cytochrome_c3'] = True
                        if 'Cytochrome c3' not in bacteria_record['Metabolism']:
                            bacteria_record['Metabolism'].append('Cytochrome c3')
                    
                    bacteria_record['Hits'] += count
                    bacteria_record['Terms'].append(f"{term}: {count} hits")
                    
                    # Process paper details
                    if papers.get('PubmedArticle'):
                        latest_paper = papers['PubmedArticle'][0]
                        article = latest_paper['MedlineCitation']['Article']

                        # Format reference in APA style
                        bacteria_record['Last_Reference'] = format_apa_reference(article)
                        
                        # Get full abstract (structured abstracts come as several AbstractText parts)
                        if 'Abstract' in article:
                            abstract_parts = article['Abstract']['AbstractText']
                            bacteria_record['Abstract'] = ' '.join(str(part) for part in abstract_parts)

                    time.sleep(1)  # Being nice to the APIs
                    
                except Exception as e:
                    print(f"Error processing PubMed data for {bacteria_name}: {e}")       
        # Save to Excel
        try:
            if results_file.exists():
                df = pd.read_excel(results_file, index_col= 0)
            else:
                df = pd.DataFrame(columns=['Name', 'Metabolism', 'Terms', 'Hits', 'Last_Reference', 'Abstract'],
                                                      index =pd.Index([], name ='GID'))
            
            # Convert lists to strings and ensure all fields exist
            new_row = pd.DataFrame({
                'Name': [bacteria_name],
                'Metabolism': ['; '.join(bacteria_record['Metabolism']) if bacteria_record['Metabolism'] else ''],
                'Terms': ['; '.join(bacteria_record['Terms']) if bacteria_record['Terms'] else ''],
                'Hits': [bacteria_record['Hits']],
                'Last_Reference': [bacteria_record.get('Last_Reference', '')],
                'Abstract': [bacteria_record.get('Abstract', '')]
            }, index=[bacteria_gid])
            # Update or append
            if bacteria_name in df['Name'].values:
                df.loc[df['Name'] == bacteria_name] = new_row.iloc[0]
            else:
                df = pd.concat([df, new_row])
                
            # Generate timestamp for sheet name
            sheet_name = f"Analysis_{datetime.now().strftime('%Y%m%d_%H%M')}"

            # Adding the reference from the first function and the Abstract
            with pd.ExcelWriter(results_file, engine='openpyxl', mode='w') as writer:
                # Write the DataFrame to the main analysis sheet. Reuse sheet_name so the
                # worksheet lookup below cannot fail if the minute changes between two now() calls.
                df.to_excel(writer, sheet_name=sheet_name, index=True)
    
                # Get the worksheet for formatting
                worksheet = writer.sheets[sheet_name]
                
                # Format the Abstract column for wrapping
                for idx, col in enumerate(df.columns):
                    if col == 'Abstract':
                        # Make column wider and enable text wrapping
                        worksheet.column_dimensions[openpyxl.utils.get_column_letter(idx +2)].width = 50
                        for cell in worksheet[openpyxl.utils.get_column_letter(idx + 2)]:
                            cell.alignment = openpyxl.styles.Alignment(wrap_text=True)
                # References sheet
                if 'Last_Reference' in df.columns:
                    references_df = df[['Name', 'Last_Reference']].copy()
                    references_df.to_excel(writer, sheet_name='References', index=True)
                
                # Abstracts sheet
                if 'Abstract' in df.columns:
                    abstracts_df = df[['Name', 'Abstract']].copy()
                    abstracts_df.to_excel(writer, sheet_name='Abstracts', index=True)

                    # Format after writing all sheets
                    for sheet in writer.sheets.values():
                        # Format the Abstract column if it exists in this sheet
                        for idx, col in enumerate(sheet.iter_cols(1, sheet.max_column)):
                            header = col[0].value
                            if header == 'Abstract':
                                sheet.column_dimensions[openpyxl.utils.get_column_letter(idx + 1)].width = 50
                                for cell in col[1:]:  # Skip header
                                    cell.alignment = openpyxl.styles.Alignment(wrap_text=True)
        except Exception as e:
            print(f"Error saving to Excel for {bacteria_name}: {e}")

    except Exception as e:
        print(f"Error in literature processing for {bacteria_name}: {e}")

    finally:
        results['processing_time'] = time.time() - bacteria_start_time
        print(f"Finished {bacteria_name} in {results['processing_time']:.2f} seconds")

    # Return outside `finally` so that an unexpected exception is not silently swallowed
    return results
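
The KEGG checks in the function above use plain substring tests, so short gene prefixes such as 'sat' or 'apr' can also match unrelated words. A minimal sketch of a stricter alternative follows. It assumes gene symbols look like a prefix plus at most two extra letters (dsrA, dsrAB, aprB); the helper name `find_gene_hits`, the `max_suffix` heuristic, and the sample text are all hypothetical and not part of the original function.

```python
import re


def find_gene_hits(text, prefixes, max_suffix=2):
    """Return the prefixes that occur as stand-alone gene symbols in text.

    A symbol is the prefix followed by at most `max_suffix` letters and then
    a word boundary, so 'sat' and 'dsrAB' count but 'saturated' does not.
    """
    hits = []
    for prefix in prefixes:
        pattern = rf"\b{re.escape(prefix)}[a-z]{{0,{max_suffix}}}\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(prefix)
    return hits


# Made-up text in a "identifier <tab> symbol; description" layout
sample = (
    "org:GENE_0001\tdsrA; sulfite reductase\n"
    "org:GENE_0002\tsaturated fatty acid desaturase"
)
print(find_gene_hits(sample, ["dsr", "apr", "sat"]))
```

Inside `search_corrosion_genes` this could replace the `any(gene in genes_text ...)` test, at the cost of missing symbols that do not follow the assumed naming pattern.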

In [16]:
# For your selected bacteria
for bacteria in selected_list:
    result = search_corrosion_genes(bacteria, base_dir, Literatur_dir, selected_GID)
    print(f"\nResults for {bacteria}:")
    print(f"Sulfate reduction: {result['sulfate_reduction']}")
    print(f"Metal reduction: {result['metal_reduction']}")
    print(f"Cytochrome c3: {result['cytochrome_c3']}")
    print(f"Literature count: {result['literature_count']}")
    print("Evidence:", "\n- ".join([''] + result['evidence']))

Starting search for Anaerococcus at: 21:32:53
Error saving to Excel for Anaerococcus: 'Name'
Finished Anaerococcus in 21.81 seconds

Results for Anaerococcus:
Sulfate reduction: False
Metal reduction: False
Cytochrome c3: False
Literature count: 7
Evidence: 
- Found 7 papers for: Anaerococcus[Organism] AND (MIC OR 'microbiologically influenced corrosion')
Starting search for Aquamicrobium at: 21:33:15
Error saving to Excel for Aquamicrobium: 'Name'
Finished Aquamicrobium in 16.66 seconds

Results for Aquamicrobium:
Sulfate reduction: True
Metal reduction: False
Cytochrome c3: False
Literature count: 1
Evidence: 
- Found 1 papers for: Aquamicrobium[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
Starting search for Azospira at: 21:33:32
Error saving to Excel for Azospira: 'Name'
Finished Azospira in 27.30 seconds

Results for Azospira:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 34
Evidence: 
- Found 11 papers for: Azospira[Organism] AND (sul

# 6. Analysing the Search Results

In summary, most of the bacteria found to influence the corrosion label have already been linked to corrosion in one way or another. The results table lists each bacterium's name, the mechanism by which it is known to influence corrosion, and the number of literature hits supporting that claim. The reference and abstract stored for each bacterium are on the sheets that follow. The structure is:
Main file: bacteria_corrosion_summary.xlsx with sheets:

Sheet 1: Analysis results (metabolism, hits, etc.) --> from search_corrosion_genes function
Sheet 2: References in APA format--> from format_apa_reference and search_corrosion_genes functions
Sheet 3: Abstracts--> from format_apa_reference and search_corrosion_genes functions

Specialised databases:
MicrobeDB
GOLD (Genomes Online Database)
PATRIC Bacterial Bioinformatics Resource

Analysis tools:
QIIME2 (Microbiome analysis)
MetaPhlAn (Metagenomic profiling)
MG-RAST (Metagenome analysis)
Prokka (Genome annotation)



# Biomarkers Refinement
Prioritize bacteria with known corrosion-related activities
Consider biofilm formation capabilities
Look for known metal-oxidizing/reducing bacteria
Factor in pH tolerance and oxygen requirements
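
One way to turn the criteria above into a ranking is a simple weighted score over the flags returned by `search_corrosion_genes`. The sketch below is illustrative only: the weights, the helper names `priority_score` and `rank_candidates`, and the example records are assumptions, not values from this study. pH tolerance and oxygen requirements are not part of the current results dictionary and would need extra fields.

```python
import math

# Hypothetical weights: adjust to reflect how much each trait should count
WEIGHTS = {
    "sulfate_reduction": 3,
    "metal_reduction": 3,
    "cytochrome_c3": 2,
    "biofilm_formation": 1,
}


def priority_score(result):
    """Weighted sum of the boolean flags plus a damped literature term."""
    flag_score = sum(w for key, w in WEIGHTS.items() if result.get(key))
    # log10 keeps a genus with many papers from dominating the ranking
    literature_score = math.log10(1 + result.get("literature_count", 0))
    return flag_score + literature_score


def rank_candidates(results):
    """Sort result dictionaries from highest to lowest priority."""
    return sorted(results, key=priority_score, reverse=True)


# Made-up records with the same keys as the results dictionary
example = [
    {"bacteria": "genus_A", "sulfate_reduction": True, "literature_count": 1},
    {"bacteria": "genus_B", "sulfate_reduction": True, "metal_reduction": True,
     "literature_count": 34},
    {"bacteria": "genus_C", "literature_count": 7},
]

for record in rank_candidates(example):
    print(record["bacteria"], round(priority_score(record), 2))
```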


Functional annotation analysis

3. Metabolic Pathway Analysis and mapping
PICRUSt2 - Can predict metabolic functions from 16S data

In [14]:
# Shell commands: run these in a terminal, not in a Python cell
# (executed as Python they raise a SyntaxError).

# Install PICRUSt2 (if not already installed)
conda create -n picrust2 -c bioconda -c conda-forge picrust2

# Activate the environment
conda activate picrust2

# Run full pipeline
picrust2_pipeline.py -s your_sequences.fasta -i your_abundance.biom -o picrust2_output_folder

# Add descriptions to the predicted EC numbers for more specific pathway analysis
add_descriptions.py -i picrust2_output_folder/EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                   -o picrust2_output_folder/EC_metagenome_out/pred_metagenome_unstrat_described.tsv.gz

# Requirements:
#   - Your sequences should be properly quality filtered
#   - Sequences should be aligned and trimmed to the same length
#   - ASVs/OTUs should be properly clustered

# Network analysis

Ecological Networks:


Bacteria that appear "neutral" alone might be critical support species
They could be enabling or moderating the effects of the corrosion-significant species
In microbial communities, some species act as "keystone" species not through abundance but through their metabolic interactions


Stability Indicators:


Species present across all conditions might be:

Buffer species that maintain community stability
Indicators of baseline environmental conditions
Part of the core microbiome that enables other species to thrive

Think of it like a metal alloy - some elements might not directly affect corrosion resistance, but their presence maintains the overall structure that makes the protective elements effective.
However, if data size/processing is a significant concern, you could:

Keep full bacterial data initially
Run your analysis
Check if removing the "uniform" species significantly changes your results
Document which removals affect the model and which don't
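
The check described above can be prototyped before any model is refit: first list the genera present in every sample, then build a reduced table without them for the comparison run. This is a minimal sketch, assuming abundances are held as a dictionary of per-sample dictionaries; the function names and the toy counts are hypothetical.

```python
def uniform_genera(abundance, min_count=1):
    """Return the genera whose count is >= min_count in every sample."""
    samples = list(abundance.values())
    if not samples:
        return set()
    all_genera = set().union(*(counts.keys() for counts in samples))
    return {
        genus
        for genus in all_genera
        if all(counts.get(genus, 0) >= min_count for counts in samples)
    }


def drop_genera(abundance, to_drop):
    """Return a copy of the table without the genera listed in to_drop."""
    return {
        sample: {g: c for g, c in counts.items() if g not in to_drop}
        for sample, counts in abundance.items()
    }


# Made-up counts: genus_A is present everywhere, the others are not
toy = {
    "sample_1": {"genus_A": 12, "genus_B": 3, "genus_C": 0},
    "sample_2": {"genus_A": 8, "genus_B": 0, "genus_C": 5},
    "sample_3": {"genus_A": 20, "genus_B": 1, "genus_C": 2},
}

core = uniform_genera(toy)
reduced = drop_genera(toy, core)
print(core)
print(reduced["sample_1"])
```

Running the same analysis on `toy` and on `reduced` and comparing the outcome is what shows whether a removal matters.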

To understand genus interactions:
Group bacteria by their typical ecological roles (e.g., primary degraders, secondary degraders)
Add known syntrophic relationships between genera
Map carbon/nitrogen cycling capabilities
Identify potential metabolic handoffs between community members


Map each genus to known electron acceptor preferences (Fe, Mn, S, etc.)
Create functional groups based on these metabolic capabilities
Compare distribution of these functional groups across your categories
Look for enrichment patterns of specific metabolic types
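
The mapping and comparison steps above can be sketched as a lookup table followed by a per-category tally. The three genus entries below are well-known textbook examples used only for illustration; a real run would need a curated annotation for every genus in the data. The category names, `genus_X`, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Illustrative lookup only: replace with curated annotations
ELECTRON_ACCEPTOR = {
    "Desulfovibrio": "sulfate",
    "Geobacter": "Fe(III)",
    "Shewanella": "Fe(III)",
}


def group_counts_by_category(genera_by_category, mapping):
    """Count genera per functional group within each category."""
    table = defaultdict(Counter)
    for category, genera in genera_by_category.items():
        for genus in genera:
            table[category][mapping.get(genus, "unassigned")] += 1
    return table


def group_fractions(counts):
    """Convert the counts of one category into fractions of its total."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()} if total else {}


# Made-up category membership
toy = {
    "category_1": ["Desulfovibrio", "Geobacter", "genus_X"],
    "category_2": ["Shewanella", "Geobacter"],
}

table = group_counts_by_category(toy, ELECTRON_ACCEPTOR)
for category, counts in table.items():
    print(category, group_fractions(counts))
```

Comparing the fractions between categories gives a first look at enrichment; a statistical test would still be needed before drawing conclusions.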


# QIIME2 (Microbiome analysis)

In [None]:
# Shell commands: run these in a terminal, not in a Python cell.

# Import FASTA into QIIME 2
qiime tools import \
  --input-path your_sequences.fasta \
  --output-path sequences.qza \
  --type 'FeatureData[Sequence]'

# Run DADA2 or Deblur for ASV generation.
# DADA2 needs reads with quality scores (FASTQ imported as
# 'SampleData[SequencesWithQuality]'), not the FASTA artifact above;
# demux.qza is a placeholder name for that artifact.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-denoising-stats denoising-stats.qza

# Export to BIOM format
qiime tools export \
  --input-path table.qza \
  --output-path exported-table

# Convert to TSV if needed
biom convert \
  -i exported-table/feature-table.biom \
  -o feature-table.tsv \
  --to-tsv


# Alternative when only FASTA is available: OTU clustering with vsearch

# Dereplicate sequences
vsearch --derep_fulllength your_sequences.fasta \
        --output unique_sequences.fasta \
        --sizeout

# Cluster at 97% similarity (for OTUs)
vsearch --cluster_size unique_sequences.fasta \
        --id 0.97 \
        --centroids clustered_sequences.fasta

# Create OTU table
vsearch --usearch_global your_sequences.fasta \
        --db clustered_sequences.fasta \
        --id 0.97 \
        --otutabout otu_table.txt
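
Once `otu_table.txt` exists, it can be loaded back into Python for the downstream analysis. This is a minimal sketch, assuming the table is tab-separated with a `#OTU ID` header row followed by one column per sample; the function name `read_otu_table` and the inline table are made up for illustration.

```python
import csv
import io


def read_otu_table(handle):
    """Parse a tab-separated OTU table into {otu_id: {sample: count}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)   # first row: '#OTU ID' followed by sample names
    samples = header[1:]
    table = {}
    for row in reader:
        if not row:         # skip blank lines
            continue
        table[row[0]] = {s: int(v) for s, v in zip(samples, row[1:])}
    return table


# Made-up table standing in for otu_table.txt
toy_file = io.StringIO(
    "#OTU ID\tsample_1\tsample_2\n"
    "OTU_1\t10\t0\n"
    "OTU_2\t3\t7\n"
)
otus = read_otu_table(toy_file)
print(otus["OTU_2"]["sample_2"])
```

For the real file, the same function can be called inside `with open("otu_table.txt", newline="") as fh:`.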

