This notebook aims to elucidate which of the microorganisms in the data have a recognised influence on corrosion damage. With this pool of bacteria, we compare the novel bacteria found here against the sequences of known corrosion-related genes, that is, genes belonging to microbially influenced corrosion (MIC) or to metabolic pathways that are, or could be, related to corrosion. The resulting list is then used as a set of "anchors" to find associated bacteria,
looking for similar metabolic patterns in other species not yet related to MIC.

The databases used in this notebook are:

BacMet: 'https://bacmet.biomedicine.gu.se/download.html'
KEGG: 'https://www.genome.jp/kegg/pathway.html', the Kyoto Encyclopedia of Genes and Genomes, used to find metabolic pathways and identify functional gene annotations
IMG/M: 'https://img.jgi.doe.gov/', for detailed metabolic pathways
BRENDA: 'https://www.brenda-enzymes.org/'

1. Initial Computational Screening
   ↓
2. Literature Validation
   ↓
3. Metabolic Pathway Analysis and mapping (PICRUSt2 can predict metabolic functions from 16S data)
   ↓
4. Find functional similarities between known and candidate bacteria, compare taxonomic groups with similar functional profiles
   ↓
5. Sequence analysis for: Sulfate reduction genes, Iron metabolism genes, Biofilm formation genes
   ↓
6. Identify gene clusters associated with iron metabolism


# Preparing data

In [1]:
from Bio import Entrez
import pandas as pd
from functools import partial
import requests
from bs4 import BeautifulSoup
import time

In [12]:
# Read the Excel file
Jointax = pd.read_excel("data/Jointax.xlsx", sheet_name='Biotot_jointax', header=[0,1,2,3,4,5,6,7])

In [15]:
# Drop 2 first columns
Jointax = Jointax.drop(Jointax.columns[0:2], axis=1)

In [17]:
# Extract Genera and ID from the multi-index
genera_info = list(zip(Jointax.columns.get_level_values(6), Jointax.columns.get_level_values(7)))

In [22]:
bacteria_list= Jointax.columns.get_level_values(6).tolist()
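
Because several columns can share the same genus, `bacteria_list` may contain repeated names, and each repeat would trigger the same database queries again. A minimal sketch of deduplicating a column level, assuming a toy two-level header (the genus names and IDs below are illustrative and not taken from Jointax.xlsx):

```python
import pandas as pd

# Toy stand-in for the Jointax header: (genus, ID) pairs; values are illustrative only.
columns = pd.MultiIndex.from_tuples(
    [("Aerococcus", 101), ("Desulfovibrio", 102), ("Aerococcus", 103)],
    names=["Genus", "ID"],
)
toy = pd.DataFrame([[1, 0, 3], [0, 2, 5]], columns=columns)

# get_level_values returns one entry per column, so repeated genera appear repeatedly.
all_genera = toy.columns.get_level_values("Genus").tolist()

# dict.fromkeys keeps first-seen order while dropping duplicates.
unique_genera = list(dict.fromkeys(all_genera))
print(unique_genera)
```

Passing the deduplicated list to the search step would reduce the number of requests without changing which genera are covered.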

# 2. Search Multiple Databases 
We connect to and search multiple databases for MIC-related terms. Phase 1 (search_all_databases): the code sets the Entrez contact email, then searches BacMet, KEGG, IMG/M, and BRENDA for MIC-related keywords. A dictionary of metabolic pathways is defined for categorising results. At the end it returns a DataFrame with [Bacteria, Database, Keyword, Hits] columns.
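
The categorisation by metabolic pathway can be sketched as a lookup from marker gene to pathway. The dictionary mirrors the `metabolic_pathways` entries defined in `search_all_databases`; the helper name `categorize_gene` and the case-insensitive matching are assumptions made for illustration:

```python
# Marker genes per pathway, copied from the metabolic_pathways dictionary in this notebook.
metabolic_pathways = {
    'sulfate_reduction': ['sat', 'aprAB', 'dsrAB'],
    'iron_reduction': ['cymA', 'mtrCAB', 'omcA'],
    'metal_oxidation': ['cyc2', 'rusticyanin', 'cox genes'],
}

# Invert to gene -> pathway for direct lookup (lower-cased for case-insensitive matching).
gene_to_pathway = {
    gene.lower(): pathway
    for pathway, genes in metabolic_pathways.items()
    for gene in genes
}

def categorize_gene(gene_name):
    """Return the pathway for a marker gene, or 'unclassified' if it is not listed."""
    return gene_to_pathway.get(gene_name.strip().lower(), 'unclassified')

print(categorize_gene('dsrAB'))
print(categorize_gene('OmcA'))
print(categorize_gene('recA'))
```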

In [27]:
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def search_all_databases(bacteria_list):
    Entrez.email = "beatrizamandawatts@gmail.com"
    
    databases = {
        'BacMet': 'https://bacmet.biomedicine.gu.se/api/search/',
        'KEGG': 'https://rest.kegg.jp/find/',
        'IMG/M': 'https://img.jgi.doe.gov/cgi-bin/m/main.cgi',
        'BRENDA': 'https://www.brenda-enzymes.org/rest/'
    }
    
    mic_keywords = [
        "sulfate reduction", "metal reduction", "iron oxidation",
        "corrosion", "MIC", "biofilm", "EPS production",
        "microbially influenced corrosion", "sulfate reducing bacteria corrosion",
        "metal reducing bacteria", "biofilm corrosion",
        "Metal reduction", "Acid production", "MIC heating and cooling"
    ]
    
    metabolic_pathways = {
        'sulfate_reduction': ['sat', 'aprAB', 'dsrAB'],
        'iron_reduction': ['cymA', 'mtrCAB', 'omcA'],
        'metal_oxidation': ['cyc2', 'rusticyanin', 'cox genes']
    }
   
    rows = []  # Collect one dict per query; DataFrame.append was removed in pandas 2.0
    
    for bacteria in bacteria_list:
        for db_name, db_url in databases.items():
            for keyword in mic_keywords:
                try:
                    query = f"{bacteria} {keyword}"
                    hits = query_database(db_name, db_url, query)
                    
                    rows.append({
                        'Bacteria': bacteria,
                        'Database': db_name,
                        'Keyword': keyword,
                        'Hits': hits
                    })
                    
                except Exception as e:
                    print(f"Error querying {db_name} for {bacteria}: {str(e)}")
                
                time.sleep(2)  # Rate limiting
    
    # Build the DataFrame once, after all queries have finished
    return pd.DataFrame(rows, columns=['Bacteria', 'Database', 'Keyword', 'Hits'])

def query_database(db_name, url, query):
    if db_name == 'BacMet':
        response = requests.get(
            url, 
            params={'query': query, 'type': 'experimentally_verified'},
            verify=False  # Bypass SSL verification
        )
        return len(response.json().get('hits', [])) if response.ok else 0
    return 0  # Other databases are not handled in this cell; report no hits instead of None

In [28]:
Bacteria_MIC = search_all_databases(bacteria_list)

Error querying BacMet for Aerococcus: HTTPSConnectionPool(host='bacmet.biomedicine.gu.se', port=443): Max retries exceeded with url: /api/search/?query=Aerococcus+sulfate+reduction&type=experimentally_verified (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fc8bb109c60>: Failed to resolve 'bacmet.biomedicine.gu.se' ([Errno -3] Temporary failure in name resolution)"))
Error querying BacMet for Aerococcus: HTTPSConnectionPool(host='bacmet.biomedicine.gu.se', port=443): Max retries exceeded with url: /api/search/?query=Aerococcus+metal+reduction&type=experimentally_verified (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fc8bb109930>: Failed to resolve 'bacmet.biomedicine.gu.se' ([Errno -3] Temporary failure in name resolution)"))
Error querying BacMet for Aerococcus: HTTPSConnectionPool(host='bacmet.biomedicine.gu.se', port=443): Max retries exceeded with url: /api/search/?query=Aerococcus+iron+oxidation&type=experimentall

KeyboardInterrupt: 
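
The output above shows the same name-resolution failure repeated for every keyword, each followed by the 2-second pause. One way to avoid this is to check once per database whether its host resolves before entering the keyword loop. A sketch, assuming a hypothetical helper `host_resolves`; the `resolver` argument exists only so the check can be exercised without network access:

```python
import socket
from urllib.parse import urlparse

def host_resolves(url, resolver=socket.getaddrinfo):
    """Return True if the hostname in `url` can be resolved via DNS."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        resolver(host, 443)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False

# Offline demonstration with a stand-in resolver that always fails.
def failing_resolver(host, port):
    raise socket.gaierror(-3, "Temporary failure in name resolution")

print(host_resolves("https://bacmet.biomedicine.gu.se/api/search/", resolver=failing_resolver))
```

In `search_all_databases`, databases whose host does not resolve could then be skipped before the keyword loop starts.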

# Searching IMG/M and BRENDA Databases
This section performs detailed API queries against the database URLs to retrieve gene, metal, and function information. The code is documented here but was run in Colab. The query_database function below returns a hit count for each query, which search_all_databases collects into a DataFrame.

In [None]:
def query_database(db_name, url, query):
    """Query different microbial/corrosion databases"""
    if db_name == 'BacMet':
        response = requests.get(url, params={'query': query, 'type': 'experimentally_verified'})
        return len(response.json().get('hits', []))
    
    elif db_name == 'KEGG':
        response = requests.get(f"{url}genes/{query}")
        return len(response.text.split('\n')) - 1
        
    elif db_name == 'IMG/M':
        # IMG/M requires portal authentication
        portal_url = f"{url}/portal/ext-api/search/genome"
        params = {
            'term': query,
            'filters': ['metadata_type:MICROBIAL']
        }
        # Add portal auth token to header
        headers = {'Authorization': 'Bearer YOUR_IMG_TOKEN'} 
        response = requests.get(portal_url, params=params, headers=headers)
        return len(response.json().get('hits', []))
        
    elif db_name == 'BRENDA':
        # BRENDA SOAP API endpoint
        soap_url = f"{url}/soap"
        headers = {'content-type': 'text/xml'}
        # Construct SOAP query
        soap_request = f"""
        <soapenv:Envelope>
            <soapenv:Body>
                <getEcNumber xmlns="http://www.brenda.org/">
                    <ecNumber>{query}</ecNumber>
                </getEcNumber>
            </soapenv:Body>
        </soapenv:Envelope>
        """
        response = requests.post(soap_url, data=soap_request, headers=headers)
        # Parse XML response
        return len(response.text.split('<entry>')) - 1
        
    return 0
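
For the KEGG branch, the `find` operation returns plain text with one tab-separated entry per line (identifier, then description), which is why the code counts lines. A sketch of parsing that text into records rather than only counting; `parse_kegg_find` is a hypothetical helper, and the sample response is made up for illustration rather than copied from KEGG:

```python
def parse_kegg_find(text):
    """Parse KEGG `find` output into (entry_id, description) tuples, skipping blank lines."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        entry_id, _, description = line.partition('\t')
        records.append((entry_id, description))
    return records

# Illustrative response text (not real KEGG output).
sample = (
    "abc:GENE_0001\tdsrA; example sulfite reductase subunit\n"
    "abc:GENE_0002\tdsrB; example sulfite reductase subunit\n"
)

hits = parse_kegg_find(sample)
print(len(hits))
print(hits[0][0])
```

Unlike `len(text.split('\n')) - 1`, this count does not depend on whether the response ends with a trailing newline.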

In [None]:
# Pseudo-code: the scoring helpers called below are not defined in this notebook
def comprehensive_corrosion_screening(genera_list, threshold):
    corrosion_database = {}
    
    for genus in genera_list:
        # Multiple validation steps
        computational_score = compute_corrosion_potential(genus)
        literature_score = mine_literature(genus)
        metabolic_score = analyze_metabolic_pathways(genus)
        
        total_score = (computational_score + 
                       literature_score + 
                       metabolic_score) / 3
        
        if total_score > threshold:
            corrosion_database[genus] = {
                'potential': total_score,
                'details': generate_detailed_report(genus)
            }
    
    return corrosion_database
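
The pseudo-code above leaves the scoring functions undefined. A runnable sketch of the same averaging logic, assuming the scorers are passed in as callables that return values between 0 and 1; the function name `screen_genera`, the scorer lookup tables, the genus names, and the 0.5 threshold are placeholders, not results:

```python
def screen_genera(genera_list, scorers, threshold):
    """Average the scores from each scorer and keep genera above the threshold."""
    selected = {}
    for genus in genera_list:
        scores = [scorer(genus) for scorer in scorers]
        total_score = sum(scores) / len(scores)
        if total_score > threshold:
            selected[genus] = {'potential': total_score, 'scores': scores}
    return selected

# Placeholder scorers backed by made-up lookup tables (illustrative values only).
computational = {'GenusA': 0.9, 'GenusB': 0.2}.get
literature = {'GenusA': 0.6, 'GenusB': 0.1}.get
metabolic = {'GenusA': 0.9, 'GenusB': 0.3}.get

result = screen_genera(
    ['GenusA', 'GenusB'],
    [computational, literature, metabolic],
    threshold=0.5,
)
print(sorted(result))
```

Passing the scorers as arguments keeps the averaging step independent of how each score is produced, so individual scorers can be replaced as the validation steps mature.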



Specialised databases
MicrobeDB
GOLD (Genomes Online Database)
PATRIC Bacterial Bioinformatics Resource

QIIME2 (Microbiome analysis)
MetaPhlAn (Metagenomic profiling)
MG-RAST (Metagenome analysis)
Prokka (Genome annotation)



# Biomarkers Refinement
Prioritize bacteria with known corrosion-related activities
Consider biofilm formation capabilities
Look for known metal-oxidizing/reducing bacteria
Factor in pH tolerance and oxygen requirements
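
These criteria can be combined into a simple ranking. A sketch assuming a hypothetical trait table with one boolean flag per criterion; the genus names, flags, and equal weighting are placeholders for illustration and would need to come from literature curation:

```python
# Hypothetical trait flags per genus (placeholders, not curated data).
traits = {
    'GenusA': {'known_mic_activity': True, 'forms_biofilm': True, 'metal_redox': False},
    'GenusB': {'known_mic_activity': False, 'forms_biofilm': True, 'metal_redox': False},
    'GenusC': {'known_mic_activity': True, 'forms_biofilm': True, 'metal_redox': True},
}

def priority_score(flags):
    """Count how many criteria a genus satisfies (equal weights assumed)."""
    return sum(bool(value) for value in flags.values())

# Highest-scoring genera first.
ranked = sorted(traits, key=lambda genus: priority_score(traits[genus]), reverse=True)
print(ranked)
```

pH tolerance and oxygen requirements could be added as further flags, or as numeric ranges compared against the conditions of the sampled system.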


Functional annotation analysis

3. Metabolic Pathway Analysis and mapping
PICRUSt2 can predict metabolic functions from 16S data.

In [None]:
# Install PICRUSt2 (if not already installed)
conda create -n picrust2 -c bioconda -c conda-forge picrust2

# Activate the environment
conda activate picrust2

# Run full pipeline
picrust2_pipeline.py -s your_sequences.fasta -i your_abundance.biom -o picrust2_output_folder

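# For more specific pathway analysis (the EC table is written inside the pipeline output folder)
add_descriptions.py -i picrust2_output_folder/EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                   -o picrust2_output_folder/EC_metagenome_out/pred_metagenome_unstrat_described.tsv.gz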

Requirements:


Your sequences should be properly quality filtered
Sequences should be aligned and trimmed to the same length
ASVs/OTUs should be properly clustered
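
Once the pipeline has run, the EC table can be screened for enzymes linked to the pathways of interest. A sketch with pandas, assuming the unstratified table has a `function` column holding identifiers such as `EC:2.7.7.4` followed by one column per sample; the small table below is synthetic, and the two EC numbers (sulfate adenylyltransferase and adenylyl-sulfate reductase) are chosen here as examples of sulfate-reduction markers:

```python
import pandas as pd

# Synthetic stand-in for pred_metagenome_unstrat.tsv.gz (values are made up).
ec_table = pd.DataFrame({
    'function': ['EC:2.7.7.4', 'EC:1.8.99.2', 'EC:1.1.1.1'],
    'sample_1': [12.0, 3.0, 40.0],
    'sample_2': [0.0, 0.0, 35.0],
})

# Example marker enzymes for sulfate reduction (an assumption for this sketch).
sulfate_markers = ['EC:2.7.7.4', 'EC:1.8.99.2']

marker_rows = ec_table[ec_table['function'].isin(sulfate_markers)].set_index('function')

# Summed predicted abundance of the marker enzymes per sample.
per_sample = marker_rows.sum(axis=0)
print(per_sample)
```

With the real output, `pd.read_csv(path, sep='\t')` can read the gzipped table directly, since pandas infers the compression from the `.gz` extension.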



# Network analysis

Ecological Networks:


Bacteria that appear "neutral" alone might be critical support species
They could be enabling or moderating the effects of the corrosion-significant species
In microbial communities, some species act as "keystone" species not through abundance but through their metabolic interactions
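
One common way to look for such interaction-driven roles is a co-occurrence network built from rank correlations between genus abundances. A sketch assuming a small synthetic abundance table (rows are samples, columns are genera) and an arbitrary |rho| cutoff of 0.8; correlation alone does not establish a metabolic interaction:

```python
import pandas as pd

# Synthetic abundances: rows = samples, columns = genera (made-up values).
abundance = pd.DataFrame({
    'GenusA': [1, 2, 3, 4, 5],
    'GenusB': [2, 4, 5, 8, 9],
    'GenusC': [5, 1, 4, 2, 3],
})

# Pearson correlation of the ranks is the Spearman rank correlation.
rho = abundance.rank().corr()

# Keep each genus pair once (upper triangle) when |rho| passes the cutoff.
cutoff = 0.8
genera = list(rho.columns)
edges = [
    (a, b, rho.loc[a, b])
    for i, a in enumerate(genera)
    for b in genera[i + 1:]
    if abs(rho.loc[a, b]) >= cutoff
]
print(edges)
```

Genera with many edges but low abundance would be candidates for the "keystone" role described above, to be checked against known metabolism.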


Stability Indicators:


Species present across all conditions might be:

Buffer species that maintain community stability
Indicators of baseline environmental conditions
Part of the core microbiome that enables other species to thrive

Think of it like a metal alloy - some elements might not directly affect corrosion resistance, but their presence maintains the overall structure that makes the protective elements effective.
However, if data size/processing is a significant concern, you could:

Keep full bacterial data initially
Run your analysis
Check if removing the "uniform" species significantly changes your results
Document which removals affect the model and which don't
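
The check in the steps above needs a working definition of "uniform" species. A sketch that flags genera detected in every sample, assuming presence means a non-zero abundance; the table is synthetic:

```python
import pandas as pd

# Synthetic abundances: rows = samples, columns = genera (made-up values).
abundance = pd.DataFrame({
    'GenusA': [10, 12, 9, 11],   # detected in every sample
    'GenusB': [0, 5, 0, 7],
    'GenusC': [3, 0, 0, 0],
})

# Prevalence = fraction of samples in which the genus is detected.
prevalence = (abundance > 0).mean(axis=0)

# Genera present in all samples, and the table with them removed.
uniform = prevalence[prevalence == 1.0].index.tolist()
reduced = abundance.drop(columns=uniform)
print(uniform)
print(list(reduced.columns))
```

The analysis can then be run on both `abundance` and `reduced`, and the two sets of results compared to document which removals matter.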
Understanding genus interactions:
Group bacteria by their typical ecological roles (e.g., primary degraders, secondary degraders)
Add known syntrophic relationships between genera
Map carbon/nitrogen cycling capabilities
Identify potential metabolic handoffs between community members

Map each genus to known electron acceptor preferences (Fe, Mn, S, etc.)
Create functional groups based on these metabolic capabilities
Compare distribution of these functional groups across your categories
Look for enrichment patterns of specific metabolic types
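
The mapping and comparison described above can be sketched with a lookup table and a cross-tabulation. The genus-to-acceptor assignments below are simplified examples (Desulfovibrio as a sulfate reducer, Geobacter and Shewanella as iron reducers), and the sample categories and detections are made up:

```python
import pandas as pd

# Simplified electron-acceptor assignments (illustrative, one acceptor per genus).
acceptor_by_genus = {
    'Desulfovibrio': 'sulfate',
    'Geobacter': 'Fe(III)',
    'Shewanella': 'Fe(III)',
}

# Made-up observations: which genus was detected in which sample category.
observations = pd.DataFrame({
    'genus': ['Desulfovibrio', 'Geobacter', 'Shewanella', 'Desulfovibrio', 'Geobacter'],
    'category': ['high_corrosion', 'high_corrosion', 'high_corrosion',
                 'low_corrosion', 'low_corrosion'],
})

observations['functional_group'] = observations['genus'].map(acceptor_by_genus)

# Count detections of each functional group per category.
counts = pd.crosstab(observations['functional_group'], observations['category'])
print(counts)
```

A genus that uses several acceptors would need a one-to-many mapping instead of the single value per genus assumed here.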


# QIIME2 (Microbiome analysis)

In [None]:
# Import FASTA into QIIME 2
qiime tools import \
  --input-path your_sequences.fasta \
  --output-path sequences.qza \
  --type 'FeatureData[Sequence]'

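# Run DADA2 or Deblur for ASV generation
# Note: DADA2 expects demultiplexed FASTQ reads (SampleData[SequencesWithQuality]),
# not the FeatureData[Sequence] artifact imported above; demux.qza is assumed to
# come from a separate FASTQ import.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-denoising-stats denoising-stats.qza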

# Export to BIOM format
qiime tools export \
  --input-path table.qza \
  --output-path exported-table

# Convert to TSV if needed
biom convert \
  -i exported-table/feature-table.biom \
  -o feature-table.tsv \
  --to-tsv


# Dereplicate sequences
vsearch --derep_fulllength your_sequences.fasta \
        --output unique_sequences.fasta \
        --sizeout

# Cluster at 97% similarity (for OTUs)
vsearch --cluster_size unique_sequences.fasta \
        --id 0.97 \
        --centroids clustered_sequences.fasta

# Create OTU table
vsearch --usearch_global your_sequences.fasta \
        --db clustered_sequences.fasta \
        --id 0.97 \
        --otutabout otu_table.txt
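
The OTU table written by `--otutabout` is a tab-separated text file whose first column holds the OTU identifiers. A sketch of loading it for downstream analysis, using an in-memory example in place of otu_table.txt; the sample names and counts are made up:

```python
import io
import pandas as pd

# In-memory stand-in for otu_table.txt (made-up counts).
otu_text = "#OTU ID\tsample_1\tsample_2\nOTU_1\t10\t0\nOTU_2\t3\t7\n"

# First column becomes the index; remaining columns are per-sample counts.
otu_table = pd.read_csv(io.StringIO(otu_text), sep='\t', index_col=0)

# Relative abundance per sample (each column sums to 1).
relative = otu_table / otu_table.sum(axis=0)
print(relative.round(3))
```

For the real file, the `io.StringIO(...)` argument would be replaced by the path to otu_table.txt.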

