This notebook has as objective to dilucidate which of the microorganisms in the data have recognised influence in corrosion damage. With this pool of bacteria we compare the nobel bacteria here found with the gene sequences of the known corrosion-related genes belonging to MIC or the metabolic pathways are related or can be related to corrosion. Then using the list here found as "anchors" to find associated bacteria
Looking for similar metabolic patterns in other species no yet related to MIC.
## Aims
 __ Compare specific functional genes known to be involved in corrosion processes, particularly        focusing on sulfate reduction pathways (dsrAB, aprAB genes),Metal reduction genes, Cytochrome c3 complexes
__Perform targeted comparative genomic analysis between known corrosion-causing bacteria
  newly identified bacterial specimens from this research

The databases uses on this notebook are:

Bacmet: 'https://bacmet.biomedicine.gu.se/download.html',
KEGG : 'https://www.genome.jp/kegg/pathway.html', which is the Kyoto Encyclopedia of Genes and Genomes. With this is possible to find metabolic pathways, identify functional gene annotations
IMG/M: 'https://img.jgi.doe.gov/',- For detailed metabolic pathways
BRENDA: 'https://www.brenda-enzymes.org/

1. Initial Computational Screening  --> 1a. search_all_databases -->1b. analyze_metabolic_pathways 
   ↓                                                                 ↓
2. Literature Validation                           <--- 1c. Literature Analysis
   ↓  
3. Metabolic Pathway Analysis and mapping- PICRUSt2 - Can predict metabolic functions from 16S data  
   ↓
4. Find functional similarities between known and candidate bacteria, compare taxonomic groups with similar functional profiles
   ↓
5. Sequence analysis for: Sulfate reduction genes, Iron metabolism genes, Biofilm    formation genes
   ↓
6. Identify gene clusters associated with iron metabolism


In [109]:
'''import os
from google.colab import drive  #silence for vscode
drive.mount('/content/drive')
#change the path
os.chdir('/content/drive/My Drive/MIC')
# For colab
!pip install pandas numpy biopython
!pip install requests beautifulsoup4
!pip install Bio'''

"import os\nfrom google.colab import drive  #silence for vscode\ndrive.mount('/content/drive')\n#change the path\nos.chdir('/content/drive/My Drive/MIC')\n# For colab\n!pip install pandas numpy biopython\n!pip install requests beautifulsoup4\n!pip install Bio"

# Preparing data

In [118]:
from pathlib import Path
from Bio import Entrez
import pandas as pd
from functools import partial
import requests
from bs4 import BeautifulSoup
import time
import urllib3
from datetime import datetime
import logging
import numpy as np
import matplotlib.pyplot as plt
import os

In [119]:
# For VSCode
base_dir = Path("/home/beatriz/MIC/2_Micro/data_MIC")
original_dir = base_dir / "Original_data"
Literatur_dir = base_dir / "References"
results_file = base_dir / "bacteria_corrosion_summary.xlsx" 

# For Colab
'''
from google.colab import drive
drive.mount('/content/drive')
base_dir = Path('/content/drive/My Drive/MIC/data')
original_dir = base_dir / "original"
original_dir.mkdir(exist_ok=True)
'''

'\nfrom google.colab import drive\ndrive.mount(\'/content/drive\')\nbase_dir = Path(\'/content/drive/My Drive/MIC/data\')\noriginal_dir = base_dir / "original"\noriginal_dir.mkdir(exist_ok=True)\n'

In [120]:
# Read the Excel file for the whole data
Jointax = pd.read_excel("data/Jointax.xlsx", sheet_name='Biotot_jointax', header=[0,1,2,3,4,5,6,7])
# Drop 2 first columns
Jointax = Jointax.drop(Jointax.columns[0:2], axis=1)

In [None]:
# Read the Excel file for the checked genera
selected = pd.read_excel("/home/beatriz/MIC/2_Micro/data/finalist_dfs.xlsx", sheet_name='checked_genera', header=[0,1,2,3,4,5,6,7])
# Drop first row specifically (index 0 which contains NaNs)
selected = selected.drop(index=0)
# Drop first column (the index column with Level1, Level2, etc)
selected = selected.drop(selected.columns[0], axis=1)

In [None]:
selected.head()

In [7]:
# Extract Genera and ID from the multi-index
bacteria_list = Jointax.columns.get_level_values(6).tolist()[1:]
# Extract Genera from the multi-index
bacteria_GID = list(zip(Jointax.columns.get_level_values(6), Jointax.columns.get_level_values(7)))

# 2. Query DB: Search Multiple Databases
we connect and search multiple databases for MIC-related terms. Phase 1 (search_mic_databases): The code establishes initial database connections using provided email. Searches BacMet, KEGG, IMG/M, and BRENDA for MIC-related keywords and uses defined metabolic pathways to categorize results. At the end it returns DataFrame with [Bacteria, Database, Evidence, Pathway] columns. The bacteria undergoes search in batches,through through all functions in sequence:search_all_databases,analyze_metabolic_pathways, literature_analysis. Complete batch is saved and move to the next batch

In [184]:
def search_corrosion_genes(bacteria_name, base_dir, Literatur_dir):
    """
    Search for specific corrosion-related genes and pathways for a given bacteria
    
    Parameters:
    bacteria_name: str - name of the bacteria to search
    base_dir: Path - base directory for saving results
    original_dir: Path - directory containing original data
    """
    # Defining the results file within base_dir
    results_file = base_dir / "bacteria_corrosion_summary.xlsx"

    # Add timing for individual bacteria
    bacteria_start_time = time.time()
    print(f"Starting search for {bacteria_name} at: {datetime.now().strftime('%H:%M:%S')}")
    
    results = {
        'bacteria': bacteria_name,
        'sulfate_reduction': False,
        'metal_reduction': False,
        'corrosion_associated': False,
        'cytochrome_c3': False,
        'acid_production': False,  # New
        'biofilm_formation': False,  # New
        'h2s_production': False,  # New
        'literature_count': 0,
        'evidence': []  #  tracking why marked something positive
    }
    # Creating a structured record for each bacteria
    bacteria_record = {
        'Name': bacteria_name,
        'Metabolism': [],  #  store sulfate_reduction, metal_reduction, etc.
        'Terms': [],      #  store which search terms got hits
        'Hits': 0,        # Total number of hits
        'Last_Reference': '',
        'Abstract': '',
        'Observations': []
    }
    
    # 1. Check KEGG for pathways and genes
    base_url = "http://rest.kegg.jp/"
    try:
        # Look for pathway modules
        pathway_response = requests.get(f"{base_url}find/module/{bacteria_name}")
        pathway_text = pathway_response.text.lower()
        
        # More specific search terms
        sulfate_terms = [
            'sulfate', 'sulphate', 
            'dsrab', 'dsra', 'dsrb',  # Breaking down dsrAB into individual components
            'aprab', 'apra', 'aprb',  # Breaking down aprAB into individual components
            'sulfite', 'sulphite',
            'sat',  # Sulfate adenylyltransferase
            'sox',  # Sulfur oxidation
            'sir',  # Sulfite reductase
            'aps'   # Adenosine phosphosulfate
        ]           
        metal_terms = [
            'metal', 'iron', 'fe(iii)', 'metal deterioration', 'MIC',
            'cytochrome', 'corrosion', 'biocorrosion',
            'methane corrosion', 'methanogenesis corrosion',
            'bacteria corrosion', 'anaerobic corrosion',
            'biofilm corrosion', 'manganese corrosion',
            'denitrification corrosion',
            'mtr',  # Metal reduction
            'omc',  # Outer membrane cytochromes
            'pil',  # Pili genes involved in metal reduction
            'cymA',  # Cytoplasmic membrane protein
            'hydA',  # Hydrogenase
            'feo',  # Ferrous iron transport
            'nrf',   # Nitrite reduction
            'organic acid AND corrosion',
            'acid metabolite AND metal deterioration',
            'fermentation AND corrosion',
            'biofilm AND (corrosion OR MIC)',
            'hydrogen sulfide AND corrosion',
            'thiosulfate AND corrosion'
        ]
        # Check pathway text
        if any(term in pathway_text for term in sulfate_terms):
            results['sulfate_reduction'] = True
            results['evidence'].append(f"Found sulfate pathway: {[term for term in sulfate_terms if term in pathway_text]}")
        
        if any(term in pathway_text for term in metal_terms):
            results['metal_reduction'] = True
            results['evidence'].append(f"Found metal pathway: {[term for term in metal_terms if term in pathway_text]}")
        
        # Look for genes specifically
        genes_response = requests.get(f"{base_url}find/genes/{bacteria_name}")
        genes_text = genes_response.text.lower()
        
        if "cytochrome c3" in genes_text:
            results['cytochrome_c3'] = True
            results['evidence'].append("Found cytochrome c3 gene")
            
        # Additional check for specific genes
        if any(gene in genes_text for gene in ['dsr', 'apr', 'sat']):
            results['sulfate_reduction'] = True
            results['evidence'].append(f"Found sulfate genes: {[gene for gene in ['dsr', 'apr', 'sat'] if gene in genes_text]}")
            
    except Exception as e:
        print(f"KEGG API error for {bacteria_name}: {str(e)}")
    
    # 2. Check literature with more specific terms
    try:
        Entrez.email = "beatrizamandawatts@gmail.com"  
        papers_details = []
        search_terms = [
            f"{bacteria_name}[Organism] AND (sulfate reduction OR dsrAB OR aprAB)",
            f"{bacteria_name}[Organism] AND (metal reduction OR iron reduction)",
            f"{bacteria_name}[Organism] AND cytochrome c3",
            f"{bacteria_name}[Organism] AND corrosion", 
            f"{bacteria_name}[Organism] AND biocorrosion",
            f"{bacteria_name}[Organism] AND (MIC OR 'microbiologically influenced corrosion')",
            f"{bacteria_name}[Organism] AND 'material deterioration'",
            f"{bacteria_name}[Organism] AND ('metal deterioration' OR 'metallic corrosion')",
            f"{bacteria_name}[Organism] AND (acid production) AND (corrosion OR 'metal deterioration' OR MIC)",
            f"{bacteria_name}[Organism] AND biofilm AND (corrosion OR MIC)",
            f"{bacteria_name}[Organism] AND (hydrogen sulfide OR H2S) AND (corrosion OR 'metal deterioration')"
        ]
               
        for term in search_terms:
            handle = Entrez.esearch(db="pubmed", term=term)
            record = Entrez.read(handle)
            count = int(record["Count"])
            results['literature_count'] += count
            if count > 0:
                results['evidence'].append(f"Found {count} papers for: {term}")
                
                # Update boolean flags based on literature evidence
                if "sulfate" in term.lower() and count > 0:
                    results['sulfate_reduction'] = True
                if "metal" in term.lower() and count > 0:
                    results['metal_reduction'] = True
                if "cytochrome" in term.lower() and count > 0:
                    results['cytochrome_c3'] = True
                if "evidence" in term.lower() and count > 0:
                    results['papers_details'] = papers_details

                # Update metabolism information
                if results['sulfate_reduction']:
                    bacteria_record['Metabolism'].append('Sulfate Reduction')
                if results['metal_reduction']:
                    bacteria_record['Metabolism'].append('Metal Reduction')
                if results['cytochrome_c3']:
                    bacteria_record['Metabolism'].append('Cytochrome c3')

            if count > 0:
                paper_ids = record["IdList"]
                try:
                    papers_handle = Entrez.efetch(db="pubmed", id=paper_ids, rettype="medline", retmode="xml")
                    papers = Entrez.read(papers_handle)

                    if papers.get('PubmedArticle'):  
                        latest_paper = papers['PubmedArticle'][0]
                        article = latest_paper['MedlineCitation']['Article']

                except Exception as e:
                    print(f"Error processing PubMed data for {bacteria_name}: {e}")
                    papers = {'PubmedArticle': []}  # Empty default if read fails   
                        
                bacteria_record['Hits'] += count
                bacteria_record['Terms'].append(f"{term}: {count} hits")  
                # Get latest paper's details
                if papers.get('PubmedArticle'):  # Use .get() to safely check
                    latest_paper = papers['PubmedArticle'][0]
                    try:
                        article = latest_paper['MedlineCitation']['Article']
                        authors = article.get('AuthorList', [{'LastName': 'et al.'}])[0].get('LastName', 'et al.')
                        year = article.get('Journal', {}).get('JournalIssue', {}).get('PubDate', {}).get('Year', 'N/A')
                        bacteria_record['Last_Reference'] = f"{authors} {year}"
                        
                        # Store reference
                        authors = article['AuthorList'][0]['LastName'] if 'AuthorList' in article else 'et al.'
                        year = article['Journal']['JournalIssue']['PubDate'].get('Year', 'N/A')
                        bacteria_record['Last_Reference'] = f"{authors} {year}"
                    
                        # Store abstract snippet (first 50 words)
                        if 'Abstract' in article:
                            abstract_text = article['Abstract']['AbstractText'][0]
                            bacteria_record['Abstract'] = ' '.join(abstract_text.split()[:50]) + '...'        
                            time.sleep(1)  # Being nice to the APIs      
                        if count > 0:
                            # Save paper details to a file, individual papers
                            output_file = Literatur_dir/  f"{bacteria_name}_papers.txt"
                            with open(output_file, 'a') as f:
                                f.write(f"\nSearch term: {term}\n")
                                f.write(f"Number of papers: {count}\n")
                                if papers_details:
                                    f.write(f"Paper details:\n{papers_details}\n")
                                f.write("-" * 50 + "\n")

                        if results_file.exists():
                            df = pd.read_csv(results_file)
                        else:
                            df = pd.DataFrame(columns=['Name', 'Metabolism', 'Observations', 'Terms', 'Hits', 'Last_Reference', 'Abstract'])
                    
                        # Update or append the record
                        bacteria_record['Metabolism'] = '; '.join(bacteria_record['Metabolism'])
                        bacteria_record['Terms'] = '; '.join(bacteria_record['Terms'])
                        bacteria_record['Observations'] = '; '.join(bacteria_record['Observations'])
                        
                        # Update if exists, append if new
                        if bacteria_name in df['Name'].values:
                            df.loc[df['Name'] == bacteria_name] = bacteria_record
                        else:
                            df = pd.concat([df, pd.DataFrame([bacteria_record])], ignore_index=True)
                        
                        # Save the updated DataFrame
                        df.to_excel(results_file, index=False)
                    except Exception as e:
                        print(f"Error processing article details: {e}") 
    except Exception as e:
            print(f"Error in paper processing: {e}")
            results['processing_time'] = time.time() - bacteria_start_time
            print(f"Finished {bacteria_name} in {results['processing_time']:.2f} seconds")

    except Exception as e:
        print(f"Major error processing {bacteria_name}: {e}")
        return {  # Return a default result structure instead of None
            'bacteria': bacteria_name,
            'sulfate_reduction': False,
            'metal_reduction': False,
            'cytochrome_c3': False,
            'literature_count': 0,
            'evidence': [f"Error during processing: {str(e)}"]
        }

    return results

# New code improved

In [186]:
def search_corrosion_genes(bacteria_name, base_dir, Literatur_dir):
    """
    Search for specific corrosion-related genes and pathways for a given bacteria
    
    Parameters:
    bacteria_name: str - name of the bacteria to search
    base_dir: Path - base directory for saving results
    Literatur_dir: Path - directory for literature results
    """
    # Defining the results file within base_dir
    results_file = base_dir / "bacteria_corrosion_summary.xlsx"

    # Add timing for individual bacteria
    bacteria_start_time = time.time()
    print(f"Starting search for {bacteria_name} at: {datetime.now().strftime('%H:%M:%S')}")
    
    results = {
        'bacteria': bacteria_name,
        'sulfate_reduction': False,
        'metal_reduction': False,
        'corrosion_associated': False,
        'cytochrome_c3': False,
        'acid_production': False,
        'biofilm_formation': False,
        'h2s_production': False,
        'literature_count': 0,
        'evidence': [],
        'processing_time': 0
    }
    
    # Creating a structured record for each bacteria
    bacteria_record = {
        'Name': bacteria_name,
        'Metabolism': [],
        'Terms': [],
        'Hits': 0,
        'Last_Reference': '',
        'Abstract': '',
        'Observations': []
    }
    
    try:
        # 1. Check KEGG for pathways and genes
        base_url = "http://rest.kegg.jp/"
        
        # Look for pathway modules
        pathway_response = requests.get(f"{base_url}find/module/{bacteria_name}")
        pathway_text = pathway_response.text.lower()
        
        # Define search terms
        sulfate_terms = [
            'sulfate', 'sulphate', 
            'dsrab', 'dsra', 'dsrb',
            'aprab', 'apra', 'aprb',
            'sulfite', 'sulphite',
            'sat', 'sox', 'sir', 'aps'
        ]
        
        metal_terms = [
            'metal', 'iron', 'fe(iii)', 'metal deterioration', 'MIC',
            'cytochrome', 'corrosion', 'biocorrosion',
            'methane corrosion', 'methanogenesis corrosion',
            'bacteria corrosion', 'anaerobic corrosion',
            'biofilm corrosion', 'manganese corrosion',
            'denitrification corrosion',
            'mtr', 'omc', 'pil', 'cymA', 'hydA', 'feo', 'nrf'
        ]
        
        # Check pathway text
        if any(term in pathway_text for term in sulfate_terms):
            results['sulfate_reduction'] = True
            results['evidence'].append(f"Found sulfate pathway: {[term for term in sulfate_terms if term in pathway_text]}")
        
        if any(term in pathway_text for term in metal_terms):
            results['metal_reduction'] = True
            results['evidence'].append(f"Found metal pathway: {[term for term in metal_terms if term in pathway_text]}")
        
        # Look for genes
        genes_response = requests.get(f"{base_url}find/genes/{bacteria_name}")
        genes_text = genes_response.text.lower()
        
        if "cytochrome c3" in genes_text:
            results['cytochrome_c3'] = True
            results['evidence'].append("Found cytochrome c3 gene")
        
        if any(gene in genes_text for gene in ['dsr', 'apr', 'sat']):
            results['sulfate_reduction'] = True
            results['evidence'].append(f"Found sulfate genes: {[gene for gene in ['dsr', 'apr', 'sat'] if gene in genes_text]}")
            
    except Exception as e:
        print(f"KEGG API error for {bacteria_name}: {str(e)}")
    
    # 2. Check literature
    try:
        Entrez.email = "beatrizamandawatts@gmail.com"
        papers_details = []
        
        search_terms = [
            f"{bacteria_name}[Organism] AND (sulfate reduction OR dsrAB OR aprAB)",
            f"{bacteria_name}[Organism] AND (metal reduction OR iron reduction)",
            f"{bacteria_name}[Organism] AND cytochrome c3",
            f"{bacteria_name}[Organism] AND corrosion",
            f"{bacteria_name}[Organism] AND biocorrosion",
            f"{bacteria_name}[Organism] AND (MIC OR 'microbiologically influenced corrosion')",
            f"{bacteria_name}[Organism] AND 'material deterioration'",
            f"{bacteria_name}[Organism] AND ('metal deterioration' OR 'metallic corrosion')",
            f"{bacteria_name}[Organism] AND (acid production) AND (corrosion OR 'metal deterioration' OR MIC)",
            f"{bacteria_name}[Organism] AND biofilm AND (corrosion OR MIC)",
            f"{bacteria_name}[Organism] AND (hydrogen sulfide OR H2S) AND (corrosion OR 'metal deterioration')"
        ]
        
        for term in search_terms:
            handle = Entrez.esearch(db="pubmed", term=term)
            record = Entrez.read(handle)
            count = int(record["Count"])
            results['literature_count'] += count
            
            if count > 0:
                results['evidence'].append(f"Found {count} papers for: {term}")
                paper_ids = record["IdList"]
                
                try:
                    papers_handle = Entrez.efetch(db="pubmed", id=paper_ids, rettype="medline", retmode="xml")
                    papers = Entrez.read(papers_handle)
                    
                    # Update metabolism flags
                    if "sulfate" in term.lower():
                        results['sulfate_reduction'] = True
                        if 'Sulfate Reduction' not in bacteria_record['Metabolism']:
                            bacteria_record['Metabolism'].append('Sulfate Reduction')
                    
                    if "metal" in term.lower():
                        results['metal_reduction'] = True
                        if 'Metal Reduction' not in bacteria_record['Metabolism']:
                            bacteria_record['Metabolism'].append('Metal Reduction')
                    
                    if "cytochrome" in term.lower():
                        results['cytochrome_c3'] = True
                        if 'Cytochrome c3' not in bacteria_record['Metabolism']:
                            bacteria_record['Metabolism'].append('Cytochrome c3')
                    
                    bacteria_record['Hits'] += count
                    bacteria_record['Terms'].append(f"{term}: {count} hits")
                    
                    # Process paper details
                    if papers.get('PubmedArticle'):
                        latest_paper = papers['PubmedArticle'][0]
                        article = latest_paper['MedlineCitation']['Article']
                        
                        # Get author and year
                        if 'AuthorList' in article:
                            authors = article['AuthorList'][0]['LastName']
                        else:
                            authors = 'et al.'
                            
                        year = article['Journal']['JournalIssue']['PubDate'].get('Year', 'N/A')
                        bacteria_record['Last_Reference'] = f"{authors} {year}"
                        
                        # Get abstract
                        if 'Abstract' in article:
                            abstract_text = article['Abstract']['AbstractText'][0]
                            bacteria_record['Abstract'] = ' '.join(abstract_text.split()[:50]) + '...'
                        
                        # Save paper details to file
                        output_file = Literatur_dir / f"{bacteria_name}_papers.txt"
                        with open(output_file, 'a') as f:
                            f.write(f"\nSearch term: {term}\n")
                            f.write(f"Number of papers: {count}\n")
                            if papers_details:
                                f.write(f"Paper details:\n{papers_details}\n")
                            f.write("-" * 50 + "\n")
                    
                    time.sleep(1)  # Being nice to the APIs
                    
                except Exception as e:
                    print(f"Error processing PubMed data for {bacteria_name}: {e}")
        
        # Save to Excel
        try:
            if results_file.exists():
                df = pd.read_excel(results_file)
            else:
                df = pd.DataFrame(columns=['Name', 'Metabolism', 'Observations', 'Terms', 'Hits', 'Last_Reference', 'Abstract'])
            
            # Convert lists to strings and ensure all fields exist
            record_dict = {
                'Name': bacteria_name,
                'Metabolism': '; '.join(bacteria_record['Metabolism']) if bacteria_record['Metabolism'] else '',
                'Observations': '; '.join(bacteria_record['Observations']) if bacteria_record['Observations'] else '',
                'Terms': '; '.join(bacteria_record['Terms']) if bacteria_record['Terms'] else '',
                'Hits': bacteria_record['Hits'],
                'Last_Reference': bacteria_record.get('Last_Reference', ''),
                'Abstract': bacteria_record.get('Abstract', '')
            }
            # Prepare record for saving
            bacteria_record['Metabolism'] = '; '.join(bacteria_record['Metabolism'])
            bacteria_record['Terms'] = '; '.join(bacteria_record['Terms'])
            bacteria_record['Observations'] = '; '.join(bacteria_record['Observations'])
            
            # Update or append
            if bacteria_name in df['Name'].values:
                df.loc[df['Name'] == bacteria_name] = bacteria_record
            else:
                df = pd.concat([df, pd.DataFrame([bacteria_record])], ignore_index=True)
            
            df.to_excel(results_file, index=False)
            
        except Exception as e:
            print(f"Error saving to Excel for {bacteria_name}: {e}")
    
    except Exception as e:
        print(f"Error in literature processing for {bacteria_name}: {e}")
    
    finally:
        results['processing_time'] = time.time() - bacteria_start_time
        print(f"Finished {bacteria_name} in {results['processing_time']:.2f} seconds")
        return results

In [187]:
# Let's test with known bacteria
checked_list = ['Anaerococcus', 'Aquamicrobium', 'Azospira', 'Brachybacterium', 'Brevibacterium', 'Bulleidia', 'Cellulosimicrobium', 'Clavibacter', 'Clostridium', 'Cohnella', 'Corynebacterium', 'Enterococcus', 'Halomonas', 'Legionella', 'Methyloversatilis', 'Mycobacterium', 'Mycoplana', 'Neisseria', 'Novosphingobium', 'Oerskovia', 'Opitutus', 'Oxobacter', 'Paracoccus', 'Prevotella', 'Psb-m-3', 'Pseudarthrobacter', 'Pseudoalteromonas', 'Roseateles', 'Streptococcus', 'Thiobacillus']
for bacteria in checked_list:
    result = search_corrosion_genes(bacteria, base_dir, Literatur_dir)
    print(f"\nResults for {bacteria}:")
    print(f"Sulfate reduction: {result['sulfate_reduction']}")
    print(f"Metal reduction: {result['metal_reduction']}")
    print(f"Cytochrome c3: {result['cytochrome_c3']}")
    print(f"Literature count: {result['literature_count']}")
    print("Evidence:", "\n- ".join([''] + result['evidence']))

Starting search for Anaerococcus at: 01:39:46
Finished Anaerococcus in 16.04 seconds

Results for Anaerococcus:
Sulfate reduction: False
Metal reduction: False
Cytochrome c3: False
Literature count: 7
Evidence: 
- Found 7 papers for: Anaerococcus[Organism] AND (MIC OR 'microbiologically influenced corrosion')
Starting search for Aquamicrobium at: 01:40:02
Error saving to Excel for Aquamicrobium: Must have equal len keys and value when setting with an iterable
Finished Aquamicrobium in 13.75 seconds

Results for Aquamicrobium:
Sulfate reduction: True
Metal reduction: False
Cytochrome c3: False
Literature count: 1
Evidence: 
- Found 1 papers for: Aquamicrobium[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
Starting search for Azospira at: 01:40:15
Finished Azospira in 22.98 seconds

Results for Azospira:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 34
Evidence: 
- Found 11 papers for: Azospira[Organism] AND (sulfate reduction OR dsrAB OR aprAB

"import os\nfrom google.colab import drive  #silence for vscode\ndrive.mount('/content/drive')\n\n#change the path\nos.chdir('/content/drive/My Drive/MIC')"
'# For colab\n!pip install pandas numpy biopython\n!pip install requests beautifulsoup4\n!pip install Bio'
Starting search for Anaerococcus at: 17:23:29
Finished Anaerococcus in 13.58 seconds

Results for Anaerococcus:
Sulfate reduction: False
Metal reduction: False
Cytochrome c3: False
Literature count: 0
Evidence: 
Starting search for Aquamicrobium at: 17:23:43
Finished Aquamicrobium in 10.41 seconds

Results for Aquamicrobium:
Sulfate reduction: True
Metal reduction: False
Cytochrome c3: False
Literature count: 1
Evidence: 
- Found 1 papers for: Aquamicrobium[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
Starting search for Azospira at: 17:23:53
Finished Azospira in 11.53 seconds

Results for Azospira:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 33
Evidence: 
- Found 11 papers for: Azospira[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 22 papers for: Azospira[Organism] AND (metal reduction OR iron reduction)
Starting search for Brachybacterium at: 17:24:05
Finished Brachybacterium in 11.41 seconds

Results for Brachybacterium:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 4
Evidence: 
- Found 1 papers for: Brachybacterium[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 2 papers for: Brachybacterium[Organism] AND (metal reduction OR iron reduction)
- Found 1 papers for: Brachybacterium[Organism] AND corrosion
Starting search for Brevibacterium at: 17:24:16
Finished Brevibacterium in 14.08 seconds

Results for Brevibacterium:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 31
Evidence: 
- Found 3 papers for: Brevibacterium[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 25 papers for: Brevibacterium[Organism] AND (metal reduction OR iron reduction)
- Found 3 papers for: Brevibacterium[Organism] AND corrosion
Starting search for Bulleidia at: 17:24:30
Finished Bulleidia in 12.61 seconds

Results for Bulleidia:
Sulfate reduction: False
Metal reduction: False
Cytochrome c3: False
Literature count: 0
Evidence: 
Starting search for Cellulosimicrobium at: 17:24:43
Finished Cellulosimicrobium in 11.41 seconds

Results for Cellulosimicrobium:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 5
Evidence: 
- Found 1 papers for: Cellulosimicrobium[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 4 papers for: Cellulosimicrobium[Organism] AND (metal reduction OR iron reduction)
Starting search for Clavibacter at: 17:24:54
Finished Clavibacter in 10.41 seconds

Results for Clavibacter:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 10
Evidence: 
- Found 1 papers for: Clavibacter[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 8 papers for: Clavibacter[Organism] AND (metal reduction OR iron reduction)
- Found 1 papers for: Clavibacter[Organism] AND corrosion
Starting search for Clostridium at: 17:25:04
Finished Clostridium in 22.53 seconds

Results for Clostridium:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: True
Literature count: 877
Evidence: 
- Found sulfate genes: ['sat']
- Found 214 papers for: Clostridium[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 589 papers for: Clostridium[Organism] AND (metal reduction OR iron reduction)
- Found 2 papers for: Clostridium[Organism] AND cytochrome c3
- Found 72 papers for: Clostridium[Organism] AND corrosion
Starting search for Cohnella at: 17:25:27
Finished Cohnella in 10.90 seconds

Results for Cohnella:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 2
Evidence: 
- Found 1 papers for: Cohnella[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 1 papers for: Cohnella[Organism] AND (metal reduction OR iron reduction)
Starting search for Corynebacterium at: 17:25:38
Finished Corynebacterium in 12.84 seconds

Results for Corynebacterium:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 107
Evidence: 
- Found 24 papers for: Corynebacterium[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 66 papers for: Corynebacterium[Organism] AND (metal reduction OR iron reduction)
- Found 17 papers for: Corynebacterium[Organism] AND corrosion
Starting search for Enterococcus at: 17:25:51
Finished Enterococcus in 33.12 seconds

Results for Enterococcus:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 338
Evidence: 
- Found 59 papers for: Enterococcus[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 239 papers for: Enterococcus[Organism] AND (metal reduction OR iron reduction)
- Found 40 papers for: Enterococcus[Organism] AND corrosion
Starting search for Halomonas at: 17:26:24
Finished Halomonas in 9.94 seconds

Results for Halomonas:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 60
Evidence: 
- Found 23 papers for: Halomonas[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 28 papers for: Halomonas[Organism] AND (metal reduction OR iron reduction)
- Found 9 papers for: Halomonas[Organism] AND corrosion
Starting search for Legionella at: 17:26:34
Finished Legionella in 11.81 seconds

Results for Legionella:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 104
Evidence: 
- Found 6 papers for: Legionella[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 51 papers for: Legionella[Organism] AND (metal reduction OR iron reduction)
- Found 47 papers for: Legionella[Organism] AND corrosion
Starting search for Methyloversatilis at: 17:26:46
Finished Methyloversatilis in 10.47 seconds

Results for Methyloversatilis:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 6
Evidence: 
- Found sulfate genes: ['sat']
- Found 2 papers for: Methyloversatilis[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 3 papers for: Methyloversatilis[Organism] AND (metal reduction OR iron reduction)
- Found 1 papers for: Methyloversatilis[Organism] AND corrosion
Starting search for Mycobacterium at: 17:26:56
Finished Mycobacterium in 31.46 seconds

Results for Mycobacterium:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 561
Evidence: 
- Found sulfate genes: ['sat']
- Found 67 papers for: Mycobacterium[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 318 papers for: Mycobacterium[Organism] AND (metal reduction OR iron reduction)
- Found 176 papers for: Mycobacterium[Organism] AND corrosion
Starting search for Mycoplana at: 17:27:28
Finished Mycoplana in 8.98 seconds

Results for Mycoplana:
Sulfate reduction: False
Metal reduction: False
Cytochrome c3: False
Literature count: 0
Evidence: 
Starting search for Neisseria at: 17:27:37
Finished Neisseria in 9.14 seconds

Results for Neisseria:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 81
Evidence: 
- Found 17 papers for: Neisseria[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 58 papers for: Neisseria[Organism] AND (metal reduction OR iron reduction)
- Found 6 papers for: Neisseria[Organism] AND corrosion
Starting search for Novosphingobium at: 17:27:46
Finished Novosphingobium in 10.17 seconds

Results for Novosphingobium:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 19
Evidence: 
- Found 4 papers for: Novosphingobium[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 9 papers for: Novosphingobium[Organism] AND (metal reduction OR iron reduction)
- Found 6 papers for: Novosphingobium[Organism] AND corrosion
Starting search for Oerskovia at: 17:27:56
Finished Oerskovia in 11.17 seconds

Results for Oerskovia:
Sulfate reduction: False
Metal reduction: False
Cytochrome c3: False
Literature count: 0
Evidence: 
Starting search for Opitutus at: 17:28:07
Finished Opitutus in 10.22 seconds

Results for Opitutus:
Sulfate reduction: False
Metal reduction: True
Cytochrome c3: False
Literature count: 4
Evidence: 
- Found 4 papers for: Opitutus[Organism] AND (metal reduction OR iron reduction)
Starting search for Oxobacter at: 17:28:18
Finished Oxobacter in 9.70 seconds

Results for Oxobacter:
Sulfate reduction: False
Metal reduction: False
Cytochrome c3: False
Literature count: 0
Evidence: 
Starting search for Paracoccus at: 17:28:27
Finished Paracoccus in 11.00 seconds

Results for Paracoccus:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: True
Literature count: 305
Evidence: 
- Found 55 papers for: Paracoccus[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 237 papers for: Paracoccus[Organism] AND (metal reduction OR iron reduction)
- Found 4 papers for: Paracoccus[Organism] AND cytochrome c3
- Found 9 papers for: Paracoccus[Organism] AND corrosion
Starting search for Prevotella at: 17:28:38
Finished Prevotella in 11.59 seconds

Results for Prevotella:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 42
Evidence: 
- Found 12 papers for: Prevotella[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 27 papers for: Prevotella[Organism] AND (metal reduction OR iron reduction)
- Found 3 papers for: Prevotella[Organism] AND corrosion
Starting search for Psb-m-3 at: 17:28:50
Finished Psb-m-3 in 11.45 seconds

Results for Psb-m-3:
Sulfate reduction: False
Metal reduction: True
Cytochrome c3: False
Literature count: 1
Evidence: 
- Found 1 papers for: Psb-m-3[Organism] AND (metal reduction OR iron reduction)
Starting search for Pseudarthrobacter at: 17:29:01
Finished Pseudarthrobacter in 10.09 seconds

Results for Pseudarthrobacter:
Sulfate reduction: False
Metal reduction: True
Cytochrome c3: False
Literature count: 1
Evidence: 
- Found 1 papers for: Pseudarthrobacter[Organism] AND (metal reduction OR iron reduction)
Starting search for Pseudoalteromonas at: 17:29:11
Finished Pseudoalteromonas in 13.43 seconds

Results for Pseudoalteromonas:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 27
Evidence: 
- Found 3 papers for: Pseudoalteromonas[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 11 papers for: Pseudoalteromonas[Organism] AND (metal reduction OR iron reduction)
- Found 13 papers for: Pseudoalteromonas[Organism] AND corrosion
Starting search for Roseateles at: 17:29:25
Finished Roseateles in 10.46 seconds

Results for Roseateles:
Sulfate reduction: False
Metal reduction: True
Cytochrome c3: False
Literature count: 2
Evidence: 
- Found 2 papers for: Roseateles[Organism] AND (metal reduction OR iron reduction)
Starting search for Streptococcus at: 17:29:35
Finished Streptococcus in 47.26 seconds

Results for Streptococcus:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 646
Evidence: 
- Found 81 papers for: Streptococcus[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 420 papers for: Streptococcus[Organism] AND (metal reduction OR iron reduction)
- Found 145 papers for: Streptococcus[Organism] AND corrosion
Starting search for Thiobacillus at: 17:30:23
Finished Thiobacillus in 8.55 seconds

Results for Thiobacillus:
Sulfate reduction: True
Metal reduction: True
Cytochrome c3: False
Literature count: 586
Evidence: 
- Found 249 papers for: Thiobacillus[Organism] AND (sulfate reduction OR dsrAB OR aprAB)
- Found 289 papers for: Thiobacillus[Organism] AND (metal reduction OR iron reduction)
- Found 48 papers for: Thiobacillus[Organism] AND corrosion
Running locally, skipping Colab mount
Starting analysis at: 17:50:30
'\nfrom google.colab import drive\ndrive.mount(\'/content/drive\')\nbase_dir = Path(\'/content/drive/My Drive/MIC/data\')\noriginal_dir = base_dir / "original"\noriginal_dir.mkdir(exist_ok=True)\n'

In [81]:
checked_list = ['Anaerococcus', 'Aquamicrobium', 'Azospira', 'Brachybacterium', 'Brevibacterium', 'Bulleidia', 'Cellulosimicrobium', 'Clavibacter', 'Clostridium', 'Cohnella', 'Corynebacterium', 'Enterococcus', 'Halomonas', 'Legionella', 'Methyloversatilis', 'Mycobacterium', 'Mycoplana', 'Neisseria', 'Novosphingobium', 'Oerskovia', 'Opitutus', 'Oxobacter', 'Paracoccus', 'Prevotella', 'Psb-m-3', 'Pseudarthrobacter', 'Pseudoalteromonas', 'Roseateles', 'Streptococcus', 'Thiobacillus']

## 3.6. Calculate MIC potential

In [82]:
def analyze_mic_potential(results_df):
    """Analyze bacteria for MIC potential based on database hits"""
    # Add score columns
    results_df['Score'] = results_df['Total_Hits'].apply(lambda x: min(x / 10, 1))

    # Classify bacteria based on evidence
    def classify_potential(row):
        if row['Score'] >= 0.8:
            return 'High'
        elif row['Score'] >= 0.5:
            return 'Medium'
        elif row['Score'] > 0:
            return 'Low'
        return 'Unknown'

    results_df['MIC_Potential'] = results_df.apply(classify_potential, axis=1)

    return results_df

In [83]:
# Analyze results
analyzed_results = analyze_mic_potential(MIC_df)

NameError: name 'MIC_df' is not defined

In [None]:
# Display summary
print("\nSummary of MIC Potential:")
print(analyzed_results['MIC_Potential'].value_counts())

In [None]:
Pseudo-code:
pythonCopydef comprehensive_corrosion_screening(genera_list):
    corrosion_database = {}

    for genus in genera_list:
        # Multiple validation steps
        computational_score = compute_corrosion_potential(genus)
        literature_score = mine_literature(genus)
        metabolic_score = analyze_metabolic_pathways(genus)

        total_score = (computational_score +
                       literature_score +
                       metabolic_score) / 3

        if total_score > threshold:
            corrosion_database[genus] = {
                'potential': total_score,
                'details': generate_detailed_report(genus)
            }

    return corrosion_database



Especialised db
MicrobeDB
GOLD (Genomes Online Database)
PATRIC Bacterial Bioinformatics Resource

QIIME2 (Microbiome analysis)
MetaPhlAn (Metagenomic profiling)
MG-RAST (Metagenome analysis)
Prokka (Genome annotation)



# Biomarkers Refinement
Prioritize bacteria with known corrosion-related activities
Consider biofilm formation capabilities
Look for known metal-oxidizing/reducing bacteria
Factor in pH tolerance and oxygen requirements


functional annotation analysis

3. Metabolic Pathway Analysis and mapping-
PICRUSt2 - Can predict metabolic functions from 16S data

In [None]:
bashCopy# Install PICRUSt2 (if not already installed)
conda create -n picrust2 -c bioconda -c conda-forge picrust2

# Activate the environment
conda activate picrust2

# Run full pipeline
picrust2_pipeline.py -s your_sequences.fasta -i your_abundance.biom -o picrust2_output_folder

# For more specific pathway analysis
add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                   -o EC_metagenome_out/pred_metagenome_unstrat_described.tsv.gz

Requirements:


Your sequences should be properly quality filtered
Sequences should be aligned and trimmed to the same length
ASVs/OTUs should be properly clustered



# Network analysis

Ecological Networks:


Bacteria that appear "neutral" alone might be critical support species
They could be enabling or moderating the effects of the corrosion-significant species
In microbial communities, some species act as "keystone" species not through abundance but through their metabolic interactions


Stability Indicators:


Species present across all conditions might be:

Buffer species that maintain community stability
Indicators of baseline environmental conditions
Part of the core microbiome that enables other species to thrive

Think of it like a metal alloy - some elements might not directly affect corrosion resistance, but their presence maintains the overall structure that makes the protective elements effective.
However, if data size/processing is a significant concern, you could:

Keep full bacterial data initially
Run your analysis
Check if removing the "uniform" species significantly changes your results
Document which removals affect the model and which don't
_________________________
This is to understand genus interactions
Group bacteria by their typical ecological roles (e.g., primary degraders, secondary degraders)
Add known syntrophic relationships between genera
Map carbon/nitrogen cycling capabilities
Identify potential metabolic handoffs between community members
__

Map each genus to known electron acceptor preferences (Fe, Mn, S, etc.)
Create functional groups based on these metabolic capabilities
Compare distribution of these functional groups across your categories
Look for enrichment patterns of specific metabolic types


# QIIME2 (Microbiome analysis)

In [None]:
# Import FASTA into QIIME 2
qiime tools import \
  --input-path your_sequences.fasta \
  --output-path sequences.qza \
  --type 'FeatureData[Sequence]'

# Run DADA2 or Deblur for ASV generation
qiime dada2 denoise-single \
  --i-demultiplexed-seqs sequences.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza

# Export to BIOM format
qiime tools export \
  --input-path table.qza \
  --output-path exported-table

# Convert to TSV if needed
biom convert \
  -i exported-table/feature-table.biom \
  -o feature-table.tsv \
  --to-tsv


# Dereplicate sequences
vsearch --derep_fulllength your_sequences.fasta \
        --output unique_sequences.fasta \
        --sizeout

# Cluster at 97% similarity (for OTUs)
vsearch --cluster_size unique_sequences.fasta \
        --id 0.97 \
        --centroids clustered_sequences.fasta

# Create OTU table
vsearch --usearch_global your_sequences.fasta \
        --db clustered_sequences.fasta \
        --id 0.97 \
        --otutabout otu_table.txt

