# 1. Bacteria Influencing Corrosion
This notebook aims to identify microorganisms that have a recognized influence on corrosion damage. The analysis involves comparing bacteria against known corrosion-related gene sequences and metabolic pathways associated with Microbiologically Influenced Corrosion (MIC).
__Aims__
Search literature on the study selected genera that have been reported as causing corrosion damage, using different terms. Make a comprehensive tabel of the results. Comprehensive search on specific functional genes involved in corrosion processes, focusing on: Sulfate reduction pathways (dsrAB, aprAB genes),metal reduction genes and cytochrome c3 complexes. 
Perform targeted analysis between known corrosion-causing bacteria and newly identified bacterial specimens

__Databases Used__:
    * KEGG (Kyoto Encyclopedia of Genes and Genomes): https://www.genome.jp/kegg/pathway.html.Used for metabolic pathway identification and functional gene annotations  
    * PubMed: Used for literature analysis and validation. 
__Analysis Workflow__    
1. Initial Computational Screening → Search KEGG database for pathways and genes →Literature validation through PubMed 
2. Results Analysis and Documentation → Compilation of findings in Excel sheets → Documentation of references and abstracts
Notebook files
Copy/home/beatriz/MIC/2_Micro/data_Ref/
├── bacteria_corrosion_summary_{timestamp}.xlsx    # Results file for each run
│   ├── Analysis_{timestamp}    # Main results sheet
│   ├── References             # APA formatted references
│   └── Abstracts             # Related paper abstracts
└── Original_data/            # Raw data storage

In [1]:
'''import os
from google.colab import drive  #silence for vscode
drive.mount('/content/drive')
#change the path
os.chdir('/content/drive/My Drive/MIC')
# For colab
!pip install pandas numpy biopython
!pip install requests beautifulsoup4
!pip install Bio'''

"import os\nfrom google.colab import drive  #silence for vscode\ndrive.mount('/content/drive')\n#change the path\nos.chdir('/content/drive/My Drive/MIC')\n# For colab\n!pip install pandas numpy biopython\n!pip install requests beautifulsoup4\n!pip install Bio"

# 2. Preparing data

In [2]:
import os
from pathlib import Path
from Bio import Entrez
import pandas as pd
from functools import partial
import requests
from bs4 import BeautifulSoup
import time
import urllib3
from datetime import datetime
import logging
import numpy as np
import matplotlib.pyplot as plt
import openpyxl
from openpyxl.styles import Alignment
import gc #clutter
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from scholarly import scholarly  # For Google Scholar
from crossref.restful import Works  # For CrossRef

In [3]:
# For VSCode
base_dir = Path("/home/beatriz/MIC/2_Micro/data_Ref")
original_dir = base_dir / "Original_data"
results_file = base_dir / "bacteria_corrosion_summary.xlsx" 

# For Colab
'''
from google.colab import drive
drive.mount('/content/drive')
base_dir = Path('/content/drive/My Drive/MIC/data')
original_dir = base_dir / "original"
original_dir.mkdir(exist_ok=True)
'''

'\nfrom google.colab import drive\ndrive.mount(\'/content/drive\')\nbase_dir = Path(\'/content/drive/My Drive/MIC/data\')\noriginal_dir = base_dir / "original"\noriginal_dir.mkdir(exist_ok=True)\n'

In [14]:
# Read the Excel file for the whole data
Jointax = pd.read_excel("data/Jointax.xlsx", sheet_name='Biotot_jointax', header=[0,1,2,3,4,5,6,7])
# Drop 2 first columns
Jointax = Jointax.drop(Jointax.columns[0:2], axis=1)

In [15]:
# Read the Excel file for the checked genera
selected = pd.read_excel("/home/beatriz/MIC/2_Micro/data/finalist_dfs.xlsx", sheet_name='selected', header=[0,1,2,3,4,5,6,7])
# Drop first row specifically (index 0 which contains NaNs)
selected = selected.drop(index=0)
# Drop first column (the index column with Level1, Level2, etc)
selected = selected.drop(selected.columns[0:3], axis=1)

In [16]:
selected.head()

Unnamed: 0_level_0,Rhodocyclales_Rhodocyclaceae_Azospira,Actinomycetales_Dermabacteraceae_Brachybacterium,Erysipelotrichales_Erysipelotrichaceae_Bulleidia,Actinomycetales_Promicromonosporaceae_Cellulosimicrobium,Clostridiales_Clostridiaceae_Clostridium,Actinomycetales_Corynebacteriaceae_Corynebacterium,Oceanospirillales_Halomonadaceae_Halomonas,Legionellales_Legionellaceae_Legionella,Caulobacterales_Caulobacteraceae_Mycoplana,Actinomycetales_Cellulomonadaceae_Oerskovia,Clostridiales_Clostridiaceae_Oxobacter,Rhodobacterales_Rhodobacteraceae_Paracoccus,Erysipelotrichales_Erysipelotrichaceae_Psb-m-3,Vibrionales_Pseudoalteromonadaceae_Pseudoalteromonas
Unnamed: 0_level_1,Bacteria,Bacteria,Bacteria,Bacteria,Bacteria,Bacteria,Bacteria,Bacteria,Bacteria,Bacteria,Bacteria,Bacteria,Bacteria,Bacteria
Unnamed: 0_level_2,Proteobacteria,Actinobacteria,Firmicutes,Actinobacteria,Firmicutes,Actinobacteria,Proteobacteria,Proteobacteria,Proteobacteria,Actinobacteria,Firmicutes,Proteobacteria,Firmicutes,Proteobacteria
Unnamed: 0_level_3,Betaproteobacteria,Actinobacteria,Erysipelotrichi,Actinobacteria,Clostridia,Actinobacteria,Gammaproteobacteria,Gammaproteobacteria,Alphaproteobacteria,Actinobacteria,Clostridia,Alphaproteobacteria,Erysipelotrichi,Gammaproteobacteria
Unnamed: 0_level_4,Rhodocyclales,Actinomycetales,Erysipelotrichales,Actinomycetales,Clostridiales,Actinomycetales,Oceanospirillales,Legionellales,Caulobacterales,Actinomycetales,Clostridiales,Rhodobacterales,Erysipelotrichales,Vibrionales
Unnamed: 0_level_5,Rhodocyclaceae,Dermabacteraceae,Erysipelotrichaceae,Promicromonosporaceae,Clostridiaceae,Corynebacteriaceae,Halomonadaceae,Legionellaceae,Caulobacteraceae,Cellulomonadaceae,Clostridiaceae,Rhodobacteraceae,Erysipelotrichaceae,Pseudoalteromonadaceae
Unnamed: 0_level_6,Azospira,Brachybacterium,Bulleidia,Cellulosimicrobium,Clostridium,Corynebacterium,Halomonas,Legionella,Mycoplana,Oerskovia,Oxobacter,Paracoccus,Psb-m-3,Pseudoalteromonas
Unnamed: 0_level_7,110,140,154,201,214,229,354,408,471,497,512,526,581,584
1,26.928048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.150797,0.0,0.0
2,1.85923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.701954,0.0,0.0
3,3.093543,0.0,0.0,0.0,0.0,0.0,0.024552,0.0,0.0,0.0,0.0,0.220967,0.0,0.0
4,2.573991,0.0,0.0,0.0,0.0,0.0,0.0,0.004408,0.0,0.0,0.0,0.709611,0.0,0.0
5,2.709369,0.0,0.0,0.0,0.0,0.001841,0.0,0.007362,0.0,0.0,0.0,0.18222,0.0,0.0


In [17]:
selected_list = selected.columns.get_level_values(6)

In [18]:
# Extract Genera and ID from the multi-index, For selected genera
selected_GID = dict(zip(selected.columns.get_level_values(6), selected.columns.get_level_values(7)))
# For all genera 
all_GID = dict(zip(Jointax.columns.get_level_values(6), Jointax.columns.get_level_values(7)))

# 3. Reference Formating Function
Following function is to take the references given in the search and present them on APA style list

In [19]:
def format_apa_reference(article):
    """Format article data into APA style reference"""
    try:
        # Get authors
        if 'AuthorList' in article:
            authors = article['AuthorList']
            if len(authors) > 6:
                author_text = f"{authors[0]['LastName']}, {authors[0].get('ForeName', '')[0]}., et al."
            else:
                author_list = []
                for author in authors:
                    if 'ForeName' in author:
                        author_list.append(f"{author['LastName']}, {author['ForeName'][0]}.")
                    else:
                        author_list.append(f"{author['LastName']}")
                author_text = ", ".join(author_list[:-1]) + " & " + author_list[-1] if len(author_list) > 1 else author_list[0]
        else:
            author_text = "No author"

        # Get year
        pub_date = article['Journal']['JournalIssue']['PubDate']
        year = pub_date.get('Year', 'n.d.')

        # Get title
        title = article.get('ArticleTitle', 'No title')
        
        # Get journal info
        journal = article['Journal']
        journal_title = journal.get('Title', journal.get('ISOAbbreviation', 'No journal'))
        
        # Get volume, issue, pages
        volume = journal['JournalIssue'].get('Volume', '')
        issue = journal['JournalIssue'].get('Issue', '')
        pagination = article.get('Pagination', {}).get('MedlinePgn', '')

        # Format the reference
        reference = f"{author_text} ({year}). {title}. {journal_title}"
        if volume:
            reference += f", {volume}"
        if issue:
            reference += f"({issue})"
        if pagination:
            reference += f", {pagination}"
        reference += "."

        return reference
    except Exception as e:
        return f"Error formatting reference: {str(e)}"

# 4. Query DB: Searching Corrosion Genes
This function search on PubMed database the bacteria in the list for different criteria related to corrosion, in order to found which of the bacteria has been previouly identified as causing damage by corrosion. The funciton search various terms used in corrosion and metabolic pathways, then the literature_analysis is done.

In [20]:
def search_corrosion_genes(bacteria_name, base_dir, gid_dict):
    """
This function searches for bacteria's involvement in corrosion processes through:
Literature search using specific corrosion-related terms
Analysis of metabolic pathways and genes related to corrosion

    Parameters:
    bacteria_name: str - name of the bacteria to search
    base_dir: Path - directory where the Excel file will be stored containing:
        - Main Analysis sheet: Complete table with columns Name, Metabolism, Hits etc
        - References sheet: Citations in APA format
        - Abstracts sheet: Full paper abstracts
    gid_dict: dict - mapping of bacteria names to their GIDs
    """
    # Get GID for this bacteria, so that we can identify with name and ID
    bacteria_gid = gid_dict.get(bacteria_name, f"NEW_{bacteria_name}")  # Use NEW_ prefix for new bacteria
    # Create a timestamped filename for this run
    # timestamp = datetime.now().strftime('%Y%m%d_%H%M')   
    # Defining the results file within base_dir
    results_file = base_dir / f"bacteria_corrosion_summary.xlsx"
   
    # Add timing for individual bacteria
    bacteria_start_time = time.time()
    print(f"Starting search for {bacteria_name} at: {datetime.now().strftime('%H:%M:%S')}")
    
    results = {
        'bacteria': bacteria_name,
        'sulfate_reduction': False,
        'metal_reduction': False,
        'corrosion_associated': False,
        'cytochrome_c3': False,
        'acid_production': False,
        'biofilm_formation': False,
        'h2s_production': False,
        'literature_count': 0,
        'evidence': [],
        'processing_time': 0,
    }
    
    # Creating a structured record for each bacteria
    bacteria_record = {
        'Name': bacteria_name,      # Bacteria species/strain name
        'Metabolism': [],          # List of identified metabolic pathways
        'Terms': [],              # Search terms that yielded results
        'Hits': 0,               # Total number of relevant papers found
        'Best_Reference': '',    # Most relevant paper in APA format
        'Abstract': ''          # Abstract from key paper
    }
    
    try:
        # 1. Check KEGG for pathways and genes
        base_url = "http://rest.kegg.jp/" # Check for metabolic pathways and gene presence. Functional anotation.

    except Exception as e:
        print(f"KEGG API error for {bacteria_name}: {str(e)}")
        
        # retry strategy in case there is connectivity issues
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[500, 502, 503, 504]
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session = requests.Session()
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        # Using sessions no request
        pathway_response = session.get(f"{base_url}find/module/{bacteria_name}")
        pathway_text = pathway_response.text.lower()
        
        # Define search terms key metabolic processes related to corrosion
        sulfate_terms = [
                'sulfate', 'sulphate',    # Terms related to sulfate reduction pathway
                'dsrab', 'dsra', 'dsrb',  # Key genes in dissimilatory sulfate reduction
                'aprab', 'apra', 'aprb',  # Adenosine-5'-phosphosulfate reductase genes
                'sat',  # Sulfate adenylyltransferase
                'sox',  # Sulfur oxidation
                'sir',  # Sulfite reductase
                'aps'   # Adenosine phosphosulfate
        ]                     
        metal_terms = [
                'metal', 'iron', 'fe(iii)', 'metal deterioration', 'MIC',
                'cytochrome', 'corrosion', 'biocorrosion',
                'methane corrosion', 'methanogenesis corrosion',
                'bacteria corrosion', 'anaerobic corrosion',
                'biofilm corrosion', 'manganese corrosion',
                'denitrification corrosion',
                'mtr',  # Metal reduction
                'omc',  # Outer membrane cytochromes
                'pil',  # Pili genes involved in metal reduction
                'cymA',  # Cytoplasmic membrane protein
                'hydA',  # Hydrogenase
                'feo',  # Ferrous iron transport
                'nrf',   # Nitrite reduction
                'organic acid AND corrosion',
                'acid metabolite AND metal deterioration',
                'fermentation AND corrosion',
                'biofilm AND (corrosion OR MIC)',
                'hydrogen sulfide AND corrosion',
                'thiosulfate AND corrosion'
                ]

        # Check pathway text
        if any(term in pathway_text for term in sulfate_terms):
            # Verify pathway presence
            pathway_evidence = [term for term in sulfate_terms if term in pathway_text]
            if pathway_evidence:
                results['sulfate_reduction'] = True
                results['evidence'].append(f"Found sulfate pathway evidence: {pathway_evidence}")
            else:
                # Flag potential inconsistency
                print(f"Warning: {bacteria_name} shows sulfate reduction hits but no clear pathway evidence")
        
        if any(term in pathway_text for term in metal_terms):
            results['metal_reduction'] = True
            results['evidence'].append(f"Found metal pathway: {[term for term in metal_terms if term in pathway_text]}")
        
        # Look for genes
        genes_response = session.get(f"{base_url}find/genes/{bacteria_name}")
        genes_text = genes_response.text.lower()
        
        if "cytochrome c3" in genes_text:
            results['cytochrome_c3'] = True
            results['evidence'].append("Found cytochrome c3 gene")
        
        if any(gene in genes_text for gene in ['dsr', 'apr', 'sat']):
            results['sulfate_reduction'] = True
            results['evidence'].append(f"Found sulfate genes: {[gene for gene in ['dsr', 'apr', 'sat'] if gene in genes_text]}")
            
    except Exception as e:
        print(f"KEGG API error for {bacteria_name}: {str(e)}")
    
    # 3. Check literature
    try:
        Entrez.email = "beatrizamandawatts@gmail.com"
             
        search_terms = [
            f"{bacteria_name}[Organism] AND corrosion[Title]",
            f"{bacteria_name}[Organism] AND biocorrosion[Title]",
            f"{bacteria_name}[Organism] AND 'microbiologically influenced corrosion'[Title]",
            f"{bacteria_name}[Organism] AND (dsrAB OR aprAB) AND corrosion", # sulphate metabolism
            f"{bacteria_name}[Organism] AND (metal reduction OR iron reduction)",   # metal interaction
            f"{bacteria_name}[Organism] AND (cytochrome c3) AND corrosion",
            f"{bacteria_name}[Organism] AND corrosion",
            f"{bacteria_name}[Organism] AND biocorrosion",
            f"{bacteria_name}[Organism] AND (MIC OR 'microbiologically influenced corrosion')",
            f"{bacteria_name}[Organism] AND 'material deterioration'",
            f"{bacteria_name}[Organism] AND ('metal deterioration' OR 'metallic corrosion')",
            f"{bacteria_name}[Organism] AND (acid production) AND (corrosion OR 'metal deterioration' OR MIC)",
            f"{bacteria_name}[Organism] AND AND biofilm AND (corrosion OR MIC)",
            f"{bacteria_name}[Organism] AND (ochre formation OR iron oxide deposits OR rust formation)",
            f"{bacteria_name}[Organism] AND (hydrogen sulfide OR H2S) AND (corrosion OR 'metal deterioration')"
            f"{bacteria_name}[Organism] AND ('sulfate reducing bacteria'[Title/Abstract] AND corrosion)",
            f"{bacteria_name}[Organism] AND ('metal reducing bacteria'[Title/Abstract] AND corrosion)",
            ]
        
        for term in search_terms:
            handle = Entrez.esearch(db="pubmed", term=term)
            record = Entrez.read(handle)
            count = int(record["Count"])
            results['literature_count'] += count
            
            if count > 0:
                results['evidence'].append(f"Found {count} papers for: {term}")
                paper_ids = record["IdList"]
                
                try:
                    papers_handle = Entrez.efetch(db="pubmed", id=paper_ids, rettype="medline", retmode="xml")
                    papers = Entrez.read(papers_handle)
                    print(f"Found {len(papers.get('PubmedArticle', []))} papers for {bacteria_name}")  # Debug line
                    
                    # Update metabolism flags
                    if "sulfate" in term.lower():
                        results['sulfate_reduction'] = True
                        if 'Sulfate Reduction' not in bacteria_record['Metabolism']:
                            bacteria_record['Metabolism'].append('Sulfate Reduction')
                    
                    if "metal" in term.lower():
                        results['metal_reduction'] = True
                        if 'Metal Reduction' not in bacteria_record['Metabolism']:
                            bacteria_record['Metabolism'].append('Metal Reduction')
                    
                    if "cytochrome" in term.lower():
                        results['cytochrome_c3'] = True
                        if 'Cytochrome c3' not in bacteria_record['Metabolism']:
                            bacteria_record['Metabolism'].append('Cytochrome c3')
                    
                    bacteria_record['Hits'] += count
                    bacteria_record['Terms'].append(f"{term}: {count} hits")

                    if papers.get('PubmedArticle'):
                        latest_paper = papers['PubmedArticle'][0]
                        article = latest_paper['MedlineCitation']['Article']
                    
                    # Store reference if it's corrosion-related (broadened criteria)
                    if ('corrosion' in article['ArticleTitle'].lower() or 
                        'mic' in article['ArticleTitle'].lower() or
                        'metal' in article['ArticleTitle'].lower()):
                        bacteria_record['Best_Reference'] = format_apa_reference(article)
                        if 'Abstract' in article:
                            bacteria_record['Abstract'] = article['Abstract']['AbstractText'][0]

                    time.sleep(1)  # Being nice to the APIs
                    
                except Exception as e:
                    print(f"Error processing PubMed data for {bacteria_name}: {e}")       
                    # Add debug print before saving
                    print(f"Saving reference for {bacteria_name}: {bacteria_record.get('Best_Reference', 'No reference')}")
        # Save to Excel
        try:           
            # Load existing data for this run or create new DataFrame
            if results_file.exists():
                main_df = pd.read_excel(results_file, sheet_name='Analysis', index_col=0)
                refs_df= pd.read_excel(results_file, sheet_name ='References_Abstracts', index_col=0)
            else:
                # First bacteria in this tun creates new df
                main_df = pd.DataFrame(columns=['Name', 'Metabolism', 'Terms', 'Hits'],
                                            index=pd.Index([], name='GID'))
                refs_df=pd.DataFrame(columns=['Name', 'Reference', 'Abstract'],    
                                            index=pd.Index([], name='GID'))                      
            # Prepare new row for main sheet
            new_row = pd.DataFrame({
                'Name': [bacteria_name],
                'Metabolism': ['; '.join(bacteria_record['Metabolism']) if bacteria_record['Metabolism'] else ''],
                'Terms': ['; '.join(bacteria_record['Terms']) if bacteria_record['Terms'] else ''],
                'Hits': [bacteria_record['Hits']]
            }, index=[bacteria_gid])

            # New row for ref sheet
            new_refs_row = pd.DataFrame({
                'Name': [bacteria_name],
                'Reference': [bacteria_record.get('Best_Reference', '')],
                'Abstract': [bacteria_record.get('Abstract', '')]
             }, index=[bacteria_gid])

            # Update or append to DataFrames
            if bacteria_gid in main_df.index:
                main_df.loc[bacteria_gid] = new_row.iloc[0]
            else:
                main_df = pd.concat([main_df, new_row])
            # Update or append to DataFrames
            if bacteria_gid in refs_df.index:
                refs_df.loc[bacteria_gid] = new_refs_row.iloc[0]
            else:
                refs_df = pd.concat([refs_df, new_refs_row])    
                                    
            # Save both DataFrames to the same file
            with pd.ExcelWriter(results_file, engine='openpyxl', mode='w') as writer:
                main_df.to_excel(writer, sheet_name='Analysis')
                refs_df.to_excel(writer, sheet_name='References_Abstracts')
                
                # Format columns
                for sheet in writer.sheets.values():
                    sheet.column_dimensions['B'].width = 30
                    if sheet.title == 'References_Abstracts':
                        sheet.column_dimensions['C'].width = 50
                        sheet.column_dimensions['D'].width = 50
                        for row in sheet.iter_rows(min_row=2, min_col=3, max_col=4):
                            for cell in row:
                                cell.alignment = openpyxl.styles.Alignment(wrap_text=True)
        except Exception as e:
            print(f"Error saving to Excel for {bacteria_name}: {str(e)}")

    except Exception as e:
        print(f"Error in literature processing for {bacteria_name}: {e}")

    finally:
        results['processing_time'] = time.time() - bacteria_start_time
        print(f"Finished {bacteria_name} in {results['processing_time']:.2f} seconds")

        return results

In [21]:
# For your selected bacteria
for bacteria in selected_list:
    result = search_corrosion_genes(bacteria, base_dir, selected_GID)
    print(f"\nResults for {bacteria}:")
    print(f"Sulfate reduction: {result['sulfate_reduction']}")
    print(f"Metal reduction: {result['metal_reduction']}")
    print(f"Cytochrome c3: {result['cytochrome_c3']}")
    print(f"Literature count: {result['literature_count']}")
    print("Evidence:", "\n- ".join([''] + result['evidence']))
gc.collect()

Starting search for Azospira at: 18:21:05
Found 20 papers for Azospira
Found 1 papers for Azospira
Finished Azospira in 15.62 seconds

Results for Azospira:
Sulfate reduction: False
Metal reduction: True
Cytochrome c3: False
Literature count: 23
Evidence: 
- Found 22 papers for: Azospira[Organism] AND (metal reduction OR iron reduction)
- Found 1 papers for: Azospira[Organism] AND biocorrosion
Starting search for Brachybacterium at: 18:21:21
Found 2 papers for Brachybacterium
Found 1 papers for Brachybacterium
Found 2 papers for Brachybacterium
Found 1 papers for Brachybacterium
Finished Brachybacterium in 21.04 seconds

Results for Brachybacterium:
Sulfate reduction: False
Metal reduction: True
Cytochrome c3: False
Literature count: 6
Evidence: 
- Found 2 papers for: Brachybacterium[Organism] AND (metal reduction OR iron reduction)
- Found 1 papers for: Brachybacterium[Organism] AND corrosion
- Found 2 papers for: Brachybacterium[Organism] AND (MIC OR 'microbiologically influenced cor

6036

# 6.1.  Analysis of Search Results for checked DataFrame
The literature search validates our statistical selection of significant bacteria. Most of these 30 genera, chosen from 882 bacteria and archaea based on statistical significance, show evidence of corrosion-related activity in existing literature.
The results demonstrate varying levels of prior documentation:

Well-documented corrosion-causers (e.g., Thiobacillus, Streptococcus): These serve as positive controls, confirming our statistical approach
Moderately documented genera: Support our findings while suggesting areas for further research
Novel candidates with minimal documentation (e.g., Bulleidia, Mycoplana, Oxobacter): These represent potentially new corrosion-associated bacteria identified through our statistical analysis

The presence of well-known corrosion-causing bacteria in our statistically significant set validates our analytical approach. This gives more weight to our novel findings regarding the less-studied bacteria in our selection.
Next steps with PICRUSt functional analysis will help understand the metabolic capabilities of our newly identified bacteria, using the well-documented corrosion-causing bacteria as reference points for comparison.

Some improvement was done in the following function. The next function more tailored to corrosion influencing bacteria, which title has any word on corrosion exactly.

Prioritizing papers with "corrosion" in the title. Focusing on heating/cooling system contexts
Weighting hits based on relevance to industrial systems.

# 6.1.  Analysis of Search Results for high_loadings DataFrame

# 7. Improved function series

In [22]:
# Helper Functions
def format_apa_reference(article):
    """Format article data into APA style reference"""
    try:
        # Get authors
        if 'AuthorList' in article:
            authors = article['AuthorList']
            if len(authors) > 6:
                author_text = f"{authors[0]['LastName']}, {authors[0].get('ForeName', '')[0]}., et al."
            else:
                author_list = []
                for author in authors:
                    if 'ForeName' in author:
                        author_list.append(f"{author['LastName']}, {author['ForeName'][0]}.")
                    else:
                        author_list.append(f"{author['LastName']}")
                author_text = ", ".join(author_list[:-1]) + " & " + author_list[-1] if len(author_list) > 1 else author_list[0]
        else:
            author_text = "No author"

        # Get year
        pub_date = article['Journal']['JournalIssue']['PubDate']
        year = pub_date.get('Year', 'n.d.')

        # Get title
        title = article.get('ArticleTitle', 'No title')
        
        # Get journal info
        journal = article['Journal']
        journal_title = journal.get('Title', journal.get('ISOAbbreviation', 'No journal'))
        
        # Get volume, issue, pages
        volume = journal['JournalIssue'].get('Volume', '')
        issue = journal['JournalIssue'].get('Issue', '')
        pagination = article.get('Pagination', {}).get('MedlinePgn', '')

        # Format the reference
        reference = f"{author_text} ({year}). {title}. {journal_title}"
        if volume:
            reference += f", {volume}"
        if issue:
            reference += f"({issue})"
        if pagination:
            reference += f", {pagination}"
        reference += "."

        return reference
    except Exception as e:
        return f"Error formatting reference: {str(e)}"

def analyze_kegg_pathways(bacteria_name, session):
    """
    Analyzes KEGG pathways for corrosion-relevant processes with comprehensive term matching
    """
    base_url = "http://rest.kegg.jp/"
    pathway_data = {
        'sulfate_reduction': {
            'found': False,
            'terms': [
                'sulfate', 'sulphate',    # Terms related to sulfate reduction pathway
                'dsrab', 'dsra', 'dsrb',  # Key genes in dissimilatory sulfate reduction
                'aprab', 'apra', 'aprb',  # Adenosine-5'-phosphosulfate reductase genes
                'sat',  # Sulfate adenylyltransferase
                'sox',  # Sulfur oxidation
                'sir',  # Sulfite reductase
                'aps'   # Adenosine phosphosulfate
            ]
        },
        'metal_reduction': {
            'found': False,
            'terms': [
                'metal', 'iron', 'fe(iii)', 'metal deterioration',
                'cytochrome', 'corrosion', 'biocorrosion',
                'mtr',  # Metal reduction
                'omc',  # Outer membrane cytochromes
                'pil',  # Pili genes involved in metal reduction
                'cymA',  # Cytoplasmic membrane protein
                'hydA',  # Hydrogenase
                'feo',  # Ferrous iron transport
                'nrf'   # Nitrite reduction
            ]
        },
        'biofilm_formation': {
            'found': False,
            'terms': [
                'biofilm', 'eps', 'exopolysaccharide',
                'adhesin', 'fimbriae', 'pili'
            ]
        },
        'acid_production': {
            'found': False,
            'terms': [
                'organic acid', 'fermentation',
                'acid metabolite', 'acidogenic'
            ]
        }
    }

    try:
        # Look for pathway modules
        pathway_response = session.get(f"{base_url}find/module/{bacteria_name}")
        pathway_text = pathway_response.text.lower()
        
        # Check each pathway category
        for category, data in pathway_data.items():
            if any(term in pathway_text for term in data['terms']):
                pathway_data[category]['found'] = True
                
        # Check genes specifically
        genes_response = session.get(f"{base_url}find/genes/{bacteria_name}")
        genes_text = genes_response.text.lower()
        
        # Additional gene-specific checks
        if "cytochrome c3" in genes_text:
            pathway_data['metal_reduction']['found'] = True
        
        if any(gene in genes_text for gene in ['dsr', 'apr', 'sat']):
            pathway_data['sulfate_reduction']['found'] = True
            
    except Exception as e:
        print(f"Error in KEGG pathway analysis for {bacteria_name}: {str(e)}")
        
    return pathway_data

def search_corrosion_genes_improved(bacteria_name, base_dir, gid_dict):
    """
    Enhanced version of search_corrosion_genes with improved tracking of results and comprehensive search terms
    
    Parameters:
    -----------
    bacteria_name : str
        Name of the bacteria to search
    base_dir : Path
        Directory where the Excel file will be stored containing:
        - Main Analysis sheet: Complete table with columns Name, Mechanisms, Evidence_Quality, etc.
        - References_Abstracts sheet: Citations in APA format and their abstracts
    gid_dict : dict
        Mapping of bacteria names to their GIDs
    
    Returns:
    --------
    dict
        Results dictionary containing mechanisms found, evidence quality score, and hit counts
    """
    results = {
        'bacteria': bacteria_name,
        'Mechanisms': [],
        'Evidence_Quality': 0,
        'Total_Hits': 0,
        'Corrosion_Specific_Hits': 0
    }
    results_file = base_dir / "bacteria_corrosion_summary_improved.xlsx"
    bacteria_gid = gid_dict.get(bacteria_name, f"NEW_{bacteria_name}")

    # Initialize counters and collectors
    hit_counter = {
        'sulfate_reduction': 0,
        'metal_reduction': 0,
        'biofilm_formation': 0,
        'acid_production': 0,
        'h2s_production': 0,
        'total': 0
    }

    # Comprehensive search terms from original implementation
    search_terms = {
        'primary': [
            f'"{bacteria_name}"[Organism] AND "microbiologically influenced corrosion"[Title/Abstract]',
            f'"{bacteria_name}"[Organism] AND "biocorrosion"[Title/Abstract]',
            f'"{bacteria_name}"[Organism] AND "metal corrosion"[Title/Abstract]',
            f'"{bacteria_name}"[Organism] AND corrosion[Title]',
            f'"{bacteria_name}"[Organism] AND (dsrAB OR aprAB) AND corrosion',
            f'"{bacteria_name}"[Organism] AND (metal reduction OR iron reduction)',
            f'"{bacteria_name}"[Organism] AND (cytochrome c3) AND corrosion'
        ],
        'secondary': [
            f'"{bacteria_name}"[Organism] AND (MIC OR "microbiologically influenced corrosion")',
            f'"{bacteria_name}"[Organism] AND "material deterioration"',
            f'"{bacteria_name}"[Organism] AND ("metal deterioration" OR "metallic corrosion")',
            f'"{bacteria_name}"[Organism] AND (acid production) AND (corrosion OR "metal deterioration" OR MIC)',
            f'"{bacteria_name}"[Organism] AND biofilm AND (corrosion OR MIC)',
            f'"{bacteria_name}"[Organism] AND (ochre formation OR iron oxide deposits OR rust formation)',
            f'"{bacteria_name}"[Organism] AND (hydrogen sulfide OR H2S) AND (corrosion OR "metal deterioration")',
            f'"{bacteria_name}"[Organism] AND "sulfate reducing bacteria"[Title/Abstract] AND corrosion',
            f'"{bacteria_name}"[Organism] AND "metal reducing bacteria"[Title/Abstract] AND corrosion',
            f'"{bacteria_name}"[Organism] AND (dsrAB[Title/Abstract] OR aprAB[Title/Abstract])',
            f'"{bacteria_name}"[Organism] AND "biofilm formation"[Title/Abstract]',
            f'"{bacteria_name}"[Organism] AND "acid production"[Title/Abstract]',
            f'"{bacteria_name}"[Organism] AND "hydrogen sulfide"[Title/Abstract]'
        ],
        'context': [
            f'"{bacteria_name}"[Organism] AND "heating system"[Title/Abstract]',
            f'"{bacteria_name}"[Organism] AND "cooling system"[Title/Abstract]',
            f'"{bacteria_name}"[Organism] AND "industrial water"[Title/Abstract]',
            f'"{bacteria_name}"[Organism] AND "pipeline corrosion"[Title/Abstract]',
            f'"{bacteria_name}"[Organism] AND "methane corrosion"[Title/Abstract]',
            f'"{bacteria_name}"[Organism] AND "anaerobic corrosion"[Title/Abstract]',
            f'"{bacteria_name}"[Organism] AND "manganese corrosion"[Title/Abstract]',
            f'"{bacteria_name}"[Organism] AND "denitrification corrosion"[Title/Abstract]'
        ]
    }

    try:
        # Set up retry strategy for API calls
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[500, 502, 503, 504]
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session = requests.Session()
        session.mount("http://", adapter)
        session.mount("https://", adapter)

        # KEGG pathway analysis using helper function
        pathway_data = analyze_kegg_pathways(bacteria_name, session)
        
        # Update mechanisms based on pathway data
        for Mechanism, data in pathway_data.items():
            if data['found'] and Mechanism not in results['Mechanisms']:
                results['Mechanisms'].append(Mechanism)
                hit_counter[Mechanism] += 1

        # PubMed literature search
        Entrez.email = "beatrizamandawatts@gmail.com"
        
        # Search each category
        all_papers = []
        for category, terms in search_terms.items():
            for term in terms:
                try:
                    handle = Entrez.esearch(db="pubmed", term=term)
                    record = Entrez.read(handle)
                    count = int(record["Count"])
                    
                    if count > 0:
                        hit_counter['total'] += count
                        paper_ids = record["IdList"]
                        
                        papers_handle = Entrez.efetch(db="pubmed", id=paper_ids, 
                                                    rettype="medline", retmode="xml")
                        papers = Entrez.read(papers_handle)
                        
                        if papers.get('PubmedArticle'):
                            all_papers.extend(papers['PubmedArticle'])
                            
                            # Update mechanism counters based on paper content
                            for paper in papers['PubmedArticle']:
                                article = paper['MedlineCitation']['Article']
                                title_abstract = (article['ArticleTitle'] + 
                                                article.get('Abstract', {}).get('AbstractText', [''])[0]).lower()
                                
                                # Check for mechanisms in paper content
                                for Mechanism, data in pathway_data.items():
                                    if any(term in title_abstract for term in data['terms']):
                                        if Mechanism not in results['Mechanisms']:
                                            results['Mechanisms'].append(Mechanism)
                                        hit_counter[Mechanism] += 1
                    
                    time.sleep(1)  # Being nice to the APIs
                    
                except Exception as e:
                    print(f"Error in PubMed search for term '{term}': {str(e)}")

        # Calculate evidence quality score with weighted components
        results['Evidence_Quality'] = (
            hit_counter['total'] * 0.3 +  # Base hits
            len(results['Mechanisms']) * 0.4 +  # Diversity of mechanisms
            sum(hit_counter[m] for m in pathway_data.keys()) * 0.3  # Mechanism-specific hits
        )
        
        results['Total_Hits'] = hit_counter['total']
        
        # Save results to Excel with proper formatting
        try:
            # Load existing data or create new DataFrames
            if results_file.exists():
                main_df = pd.read_excel(results_file, sheet_name='Analysis', index_col=0)
                refs_df = pd.read_excel(results_file, sheet_name='References_Abstracts', index_col=0)
            else:
                main_df = pd.DataFrame(columns=['Name', 'Mechanisms', 'Evidence_Quality', 'Total_Hits'],
                                     index=pd.Index([], name='GID'))
                refs_df = pd.DataFrame(columns=['Name', 'Reference', 'Abstract'],
                                     index=pd.Index([], name='GID'))

            # Prepare new row for main sheet
            new_row = pd.DataFrame({
                'Name': [bacteria_name],
                'Mechanisms': ['; '.join(results['Mechanisms']) if results['Mechanisms'] else ''],
                'Evidence_Quality': [results['Evidence_Quality']],
                'Total_Hits': [results['Total_Hits']]
            }, index=[bacteria_gid])

            # Get the best corrosion-related paper
            best_paper = None
            if all_papers:
                # Sort papers by relevance (presence of corrosion terms in title)
                sorted_papers = sorted(
                    all_papers,
                    key=lambda x: 'corrosion' in x['MedlineCitation']['Article']['ArticleTitle'].lower(),
                    reverse=True
                )
                best_paper = sorted_papers[0]

            if best_paper:
                article = best_paper['MedlineCitation']['Article']
                new_refs_row = pd.DataFrame({
                    'Name': [bacteria_name],
                    'Reference': [format_apa_reference(article)],
                    'Abstract': [article.get('Abstract', {}).get('AbstractText', [''])[0]]
                }, index=[bacteria_gid])
            else:
                new_refs_row = pd.DataFrame({
                    'Name': [bacteria_name],
                    'Reference': [''],
                    'Abstract': ['']
                }, index=[bacteria_gid])

            # Update or append to DataFrames
            if bacteria_gid in main_df.index:
                main_df.loc[bacteria_gid] = new_row.iloc[0]
            else:
                main_df = pd.concat([main_df, new_row])

            if bacteria_gid in refs_df.index:
                refs_df.loc[bacteria_gid] = new_refs_row.iloc[0]
            else:
                refs_df = pd.concat([refs_df, new_refs_row])
           
            # Save both DataFrames to the same file
            with pd.ExcelWriter(results_file, engine='openpyxl', mode='w') as writer:
                main_df.to_excel(writer, sheet_name='Analysis')
                refs_df.to_excel(writer, sheet_name='References_Abstracts')
                
                # Format columns
                for sheet in writer.sheets.values():
                    sheet.column_dimensions['B'].width = 30
                    if sheet.title == 'References_Abstracts':
                        sheet.column_dimensions['C'].width = 50
                        sheet.column_dimensions['D'].width = 50
                        for row in sheet.iter_rows(min_row=2, min_col=3, max_col=4):
                            for cell in row:
                                cell.alignment = Alignment(wrap_text=True)
        except Exception as e:
            print(f"Error saving to Excel for {bacteria_name}: {e}")
            
    except Exception as e:
        logging.error(f"Error processing {bacteria_name}: {str(e)}")
        return None
        
    return results

In [23]:
for bacteria in selected_list:
    result = search_corrosion_genes_improved(bacteria, base_dir, selected_GID) 
    if result:
        print(f"\nResults for {bacteria}:")
        print(f"Evidence quality score: {result['Evidence_Quality']}")
        print(f"Total hits: {result['Total_Hits']}")
        print(f"Identified mechanisms: {result['Mechanisms']}")


Results for Azospira:
Evidence quality score: 20.5
Total hits: 25
Identified mechanisms: ['metal_reduction', 'sulfate_reduction', 'biofilm_formation', 'acid_production']

Results for Brachybacterium:
Evidence quality score: 5.1
Total hits: 6
Identified mechanisms: ['sulfate_reduction', 'metal_reduction', 'biofilm_formation']

Results for Bulleidia:
Evidence quality score: 1.0
Total hits: 1
Identified mechanisms: ['metal_reduction']

Results for Cellulosimicrobium:
Evidence quality score: 5.3
Total hits: 9
Identified mechanisms: ['metal_reduction', 'sulfate_reduction']

Results for Clostridium:
Evidence quality score: 636.1
Total hits: 1905
Identified mechanisms: ['sulfate_reduction', 'metal_reduction', 'biofilm_formation', 'acid_production']

Results for Corynebacterium:
Evidence quality score: 215.49999999999997
Total hits: 609
Identified mechanisms: ['metal_reduction', 'sulfate_reduction', 'biofilm_formation', 'acid_production']
Error in PubMed search for term '"Halomonas"[Organism]

Bacteria known to be corrosive and is found on the present systems are as follows

271 Desulfovibrio (Sulfate-reducing bacteria, well-documented MIC agent) 
727 Thiobacillus (Sulfur-oxidizing bacteria)  
332 Gallionella (Iron-oxidizing bacteria)   
587 Pseudomonas (Known for biofilm formation and acid production)  
656 Shewanella (Metal-reducing bacteria)  
214 Clostridium (Anaerobic, acid-producing bacteria)  
264 Desulfobacterium (Sulfate-reducing bacteria)  
265 Desulfobulbus (Sulfate-reducing bacteria)  
270 Desulfotomaculum (Thermophilic sulfate-reducing bacteria)  
264 Desulfobacterium (Sulfate-reducing bacteria)

The usual suspects of corrosion influencing bateria are no been taking on importance over the statistically significant bacteria as anchors, even if they're less well-known, for several reasons:

Statistical significance in compromised systems provides real-world evidence of correlation with corrosion events
This approach may reveal new mechanisms of microbially influenced corrosion (MIC)
The usual suspects might be present but not active in these specific environments
The statistical approach removes bias towards well-known organisms and allows discovery of new players
However moving forward to the Picrust analysis we taking an hybrid approach in order to use them as comparation features.
The statistically significant bacteria would be used as primary indicators since they show actual correlation with system compromise. But well-known corrosion-causing bacteria, would be taken as reference points to understand potential mechanisms.
Why certain "usual suspects" aren't showing statistical significance?:
They might be present but not active
They might be outcompeted in these specific environments
Their corrosion mechanisms might be less relevant in these systems
They might be acting as supporting organisms rather than primary corrosion agents

# 7.1. Analysis for checked DataFrame
Notable Traditional Corrosion-Causing Bacteria:

Thiobacillus (Evidence Quality: 180.1, Total Hits: 430)
Clostridium (Evidence Quality: 636.1, Total Hits: 1905)

Statistically Significant Bacteria with High Evidence Quality (in descending order):

Streptococcus (3014.8, 9850 hits)
Enterococcus (1534, 4974 hits)
Mycobacterium (1411.6, 4577 hits)
Neisseria (435.1, 1359 hits)
Corynebacterium (215.5, 609 hits)
Prevotella (188.5, 530 hits)
Paracoccus (121.6, 298 hits)
Pseudoalteromonas (69.4, 129 hits)
Halomonas (64.6, 91 hits)

This data presents a compelling case for focusing on statistically significant bacteria rather than just the traditional suspects. This is because most of these bacteria show multiple corrosion mechanisms, such as metal reduction, sulfate reduction, biofilm formation and acid production.
Additionally some non traditional corrosion bacteria (like Streptococcus and Enterococcus) show higher evidence quality scores than traditional corrosion-causing bacteria. The multiple mechanisms suggest these aren't false positives but rather previously underappreciated contributors to corrosion.
Candidate inducing corrosion such as Bulleidia, Mycoplana, and Oxobacter might represent new corrosion mechanisms. The presence of both anaerobic (Anaerococcus) and aerobic bacteria suggests complex corrosion environments.

# 7.2. Analysis for high_loadings DataFrame