This notebook aims to identify which of the microorganisms in the data have a recognised influence on corrosion damage. Starting from this pool of bacteria, we compare the novel bacteria found here against the sequences of known corrosion-related genes, that is, genes implicated in microbiologically influenced corrosion (MIC) or belonging to metabolic pathways that are, or could be, related to corrosion. The resulting list then serves as a set of "anchors" for finding associated bacteria,
by looking for similar metabolic patterns in other species not yet linked to MIC.

The databases used in this notebook are:

BacMet: 'https://bacmet.biomedicine.gu.se/download.html'
KEGG: 'https://www.genome.jp/kegg/pathway.html', the Kyoto Encyclopedia of Genes and Genomes. It makes it possible to find metabolic pathways and identify functional gene annotations.
IMG/M: 'https://img.jgi.doe.gov/', for detailed metabolic pathways.
BRENDA: 'https://www.brenda-enzymes.org/'

1. Initial Computational Screening
   ↓
2. Literature Validation
   ↓
3. Metabolic pathway analysis and mapping (PICRUSt2 can predict metabolic functions from 16S data)
   ↓
4. Find functional similarities between known and candidate bacteria, compare taxonomic groups with similar functional profiles
   ↓
5. Sequence analysis for: sulfate reduction genes, iron metabolism genes, biofilm formation genes
   ↓
6. Identify gene clusters associated with iron metabolism


In [1]:
'''import os
from google.colab import drive  #silence for vscode
drive.mount('/content/drive')

#change the path
os.chdir('/content/drive/My Drive/MIC')'''

In [2]:
'''# For colab
!pip install pandas numpy biopython
!pip install requests beautifulsoup4
!pip install Bio'''


# Preparing data

In [3]:
from Bio import Entrez
import pandas as pd
from functools import partial
import requests
from bs4 import BeautifulSoup
import time
import urllib3

In [4]:
# Read the Excel file
Jointax = pd.read_excel("data/Jointax.xlsx", sheet_name='Biotot_jointax', header=[0,1,2,3,4,5,6,7])

In [5]:
# Drop the first two columns
Jointax = Jointax.drop(Jointax.columns[0:2], axis=1)

In [6]:
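# Extract the genus names (header level 6) from the column MultiIndex,
# skipping the first column and dropping empty or non-string labels
bacteria_list = [
    genus for genus in Jointax.columns.get_level_values(6).tolist()[1:]
    if isinstance(genus, str) and genus.strip()
]

In [7]:
# Quick check of the filtered list (re-assigning it here would undo the filtering above)
print(len(bacteria_list), bacteria_list[:5])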
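As a minimal sketch of how a genus list can be pulled out of a multi-level column header, the toy table below uses three header levels in place of the eight in Jointax.xlsx. The column names and values are invented for illustration, and the genus sits at level 1 here rather than level 6.

```python
import pandas as pd

# Toy table with a three-level column header (the real file has eight levels)
columns = pd.MultiIndex.from_tuples(
    [
        ("Sites", "", ""),                     # first column: not a taxon
        ("Bacteria", "Desulfovibrio", "101"),
        ("Bacteria", "Geobacter", "102"),
        ("Bacteria", "  ", "103"),             # blank genus label
    ]
)
toy = pd.DataFrame([["site_1", 5, 3, 1]], columns=columns)

# Take the genus level, skip the first column, drop blank or non-string labels
labels = toy.columns.get_level_values(1).tolist()[1:]
genera = [g for g in labels if isinstance(g, str) and g.strip()]
print(genera)
```

The `isinstance` check matters when a header cell is empty in Excel, because pandas may then supply a non-string placeholder on which `.strip()` would fail.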

# 2. Query DB: Search Multiple Databases
We connect to multiple databases and search them for MIC-related terms. Phase 1 (search_mic_databases): the code establishes the initial database connections using the provided email address, searches BacMet, KEGG, IMG/M, and BRENDA for MIC-related keywords, and uses the defined metabolic pathways to categorize the results. It returns a DataFrame with the columns [Bacteria, Database, Evidence, Pathway].

In [8]:
def query_database(db_name, url, query, timeout=30):
    """Query one database and return a list of raw hits (empty list on failure)"""
    try:
        session = requests.Session()
        # A Session has no session-wide timeout setting, so the timeout is
        # passed explicitly to every request below
        session.headers.update({
            'User-Agent': 'Research-Bot/1.0',
            'Accept': 'application/json, text/plain, */*'
        })

        if db_name == 'BacMet':
            # verify=False skips TLS certificate checks
            response = session.get(
                url,
                params={'query': query, 'type': 'experimentally_verified'},
                verify=False,
                timeout=timeout
            )
            return response.json().get('hits', []) if response.ok else []

        elif db_name == 'KEGG':
            response = session.get(f"{url}find/genes/{query}", timeout=timeout)

        elif db_name == 'IMG/M':
            # Simplified IMG/M query - consider adding proper authentication
            response = session.get(url, params={'keyword': query}, timeout=timeout)

        elif db_name == 'BRENDA':
            response = session.get(f"{url}?query={query}", timeout=timeout)

        else:
            print(f"Unknown database: {db_name}")
            return []

        # Text responses: one hit per non-empty line, so an empty reply gives []
        if not response.ok:
            return []
        return [line for line in response.text.splitlines() if line.strip()]

    except (requests.exceptions.RequestException, ValueError) as e:
        # ValueError covers a response body that is not valid JSON
        print(f"Error querying {db_name}: {e}")
        return []
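The KEGG branch returns raw text lines. KEGG's find operation replies with tab-separated `entry<TAB>description` lines, so a small parser makes the hits easier to inspect. This is only a sketch: `parse_kegg_find` is a hypothetical helper, and the sample text is invented, not a real KEGG response.

```python
def parse_kegg_find(lines):
    """Turn 'entry<TAB>description' lines into (entry, description) pairs."""
    parsed = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines, e.g. the trailing one after split('\n')
        entry, _, description = line.partition("\t")
        parsed.append((entry.strip(), description.strip()))
    return parsed


# Illustrative text only, not a real KEGG response
sample = (
    "org:GENE_0001\tdsrA; example description\n"
    "org:GENE_0002\tdsrB; example description\n"
)
hits = parse_kegg_find(sample.split("\n"))
print(len(hits), hits[0][0])
```

Counting parsed pairs rather than raw lines avoids treating an empty reply as a hit.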

# 3. Search all Databases:
This step processes all bacteria through all databases.
It performs detailed API queries using the database URLs and retrieves specific gene, metal, and function information. The code is documented here but was run in Colab in batches. It returns a detailed DataFrame with experimental evidence.
## 3.1. Chunk
Take the first 50 bacteria.
Search only the most important databases (KEGG and BacMet).
Use fewer keywords.
Save these initial results.
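The chunking plan can be sketched with a small generator that walks the genus list in fixed-size slices, so that each Colab session can resume from a known offset. `iter_batches` is a hypothetical helper, not part of the notebook.

```python
def iter_batches(items, batch_size):
    """Yield (start_index, batch) pairs covering items in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


genera = ["GenusA", "GenusB", "GenusC", "GenusD", "GenusE"]
batches = list(iter_batches(genera, batch_size=2))
for start, batch in batches:
    print(start, batch)
```

Including the start index in each result file name would keep successive batches from overwriting one another.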

In [9]:
def search_all_databases(bacteria_list, batch_size=50):
    """Search databases in smaller batches with progress tracking"""
    Entrez.email = "beatrizamandawatts@gmail.com"

    databases = {
        'BacMet': 'https://bacmet.biomedicine.gu.se/api/search/',
        'KEGG': 'https://rest.kegg.jp/',
        'IMG/M': 'https://img.jgi.doe.gov/cgi-bin/m/main.cgi',
        'BRENDA': 'https://www.brenda-enzymes.org/rest/'
    }

    mic_keywords = [
        "sulfate reduction", "metal reduction", "iron oxidation",
        "corrosion", "MIC", "biofilm", "EPS production"
    ]

    batch_bacteria = bacteria_list[:batch_size]
    results_list = []

    for bacteria in batch_bacteria:
        if not bacteria or pd.isna(bacteria):
            continue
        print(f"Processing: {bacteria}")

        for db_name, db_url in databases.items():
            for keyword in mic_keywords:
                try:
                    query = f"{bacteria} {keyword}"
                    hits = query_database(db_name, db_url, query)
                    results_list.append({
                        'Bacteria': bacteria,
                        'Database': db_name,
                        'Keyword': keyword,
                        # Store the number of hits so that numeric filters such
                        # as `Hits > 0` still work after a CSV round trip
                        'Hits': len(hits) if hits else 0
                    })
                except Exception as e:
                    print(f"Error querying {db_name} for {bacteria}: {str(e)}")
                time.sleep(2)

    batch_results = pd.DataFrame(results_list)
    batch_results.to_csv(f'batch_results_{len(batch_bacteria)}.csv', index=False)
    return batch_results

In [10]:
# Run a first batch of 10 genera
first_batch = search_all_databases(bacteria_list, batch_size=10)
# Save full list for future sessions
pd.Series(bacteria_list).to_csv('full_bacteria_list.csv', index=False)

Processing: 02d06
Processing: A17
Processing: Abiotrophia
Processing: Acetanaerobacterium

KeyboardInterrupt: 

## 3.2. analyze_metabolic_pathways function (second chunk)

In [None]:
def analyze_metabolic_pathways(high_priority_bacteria):
    """Analyze metabolic pathways for bacteria with high initial scores"""
    metabolic_results = []

    key_pathways = {
        'sulfate_reduction': ['sat', 'aprAB', 'dsrAB'],
        'iron_reduction': ['cymA', 'mtrCAB'],
        'biofilm_formation': ['eps', 'pgaC']
    }

    for bacteria in high_priority_bacteria:
        pathway_hits = {
            'Bacteria': bacteria,
            'Pathways_Found': 0
        }

        response = query_database('KEGG',
                                'https://rest.kegg.jp/',
                                f"{bacteria} pathway")

        for pathway, genes in key_pathways.items():
            hits = sum(1 for gene in genes if gene in str(response))
            pathway_hits[f"{pathway}_score"] = hits
            pathway_hits['Pathways_Found'] += hits > 0

        metabolic_results.append(pathway_hits)
        time.sleep(1)

    return pd.DataFrame(metabolic_results)
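The function above counts a gene as present when its symbol appears anywhere in the response text, so a short symbol such as `sat` also matches unrelated words. A possible refinement, sketched here with a hypothetical helper and invented text, is to match gene symbols as whole words only.

```python
import re


def count_gene_hits(text, genes):
    """Count how many gene symbols occur in text as whole words."""
    return sum(
        1 for gene in genes
        if re.search(rf"\b{re.escape(gene)}\b", text)
    )


# Illustrative text only, not a real database response
text = "entry1\tdsrA; sulfite reductase\nentry2\tsaturated fatty acid hydrolase"
genes = ["sat", "dsrA"]

substring_count = sum(1 for gene in genes if gene in text)  # approach used above
whole_word_count = count_gene_hits(text, genes)
print(substring_count, whole_word_count)
```

Here the substring test counts both symbols because `sat` occurs inside "saturated", while the whole-word test counts only `dsrA`.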

In [None]:
# Load first batch results and select high-scoring bacteria
first_results = pd.read_csv('batch_results_10.csv')
high_priority = first_results[first_results['Hits'] > 0]['Bacteria'].unique().tolist()

# Run metabolic analysis
metabolic_results = analyze_metabolic_pathways(high_priority)
metabolic_results.to_csv('metabolic_results.csv', index=False)

## 3.3. literature_analysis function (third chunk):

In [None]:
def literature_analysis(bacteria_list):
    """Analyze literature references for specific bacteria"""
    Entrez.email = "beatrizamandawatts@gmail.com"

    literature_results = []
    for bacteria in bacteria_list:
        citations = {
            'Bacteria': bacteria,
            'MIC_Citations': 0,
            'Keywords_Found': []
        }

        # PubMed phrase searches need double quotes
        search_term = f'{bacteria} AND (corrosion OR MIC OR "sulfate reducing")'
        try:
            handle = Entrez.esearch(db="pubmed", term=search_term)
            record = Entrez.read(handle)
            handle.close()
            citations['MIC_Citations'] = int(record["Count"])

            if citations['MIC_Citations'] > 0:
                id_list = record["IdList"][:3]
                for pmid in id_list:
                    # esummary (not esearch) returns the document summary for a PMID
                    summary = Entrez.esummary(db="pubmed", id=pmid)
                    summary_record = Entrez.read(summary)
                    summary.close()
                    # Summaries may lack a 'Keywords' field, in which case nothing is added
                    citations['Keywords_Found'].extend(summary_record[0].get('Keywords', []))
        except Exception as e:
            print(f"Error processing {bacteria}: {str(e)}")

        literature_results.append(citations)
        time.sleep(1)

    return pd.DataFrame(literature_results)
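The search term can also be built by a small helper so that multi-word topics are always wrapped in double quotes, which is how PubMed expects phrases. `build_pubmed_term` is a hypothetical helper; its default topics are taken from the search term used above.

```python
def build_pubmed_term(genus, topics=("corrosion", "MIC", "sulfate reducing")):
    """Build a PubMed query: genus AND (topic1 OR topic2 ...)."""
    # Multi-word topics are quoted so they are searched as phrases
    quoted = [f'"{topic}"' if " " in topic else topic for topic in topics]
    return f"{genus} AND ({' OR '.join(quoted)})"


term = build_pubmed_term("Desulfovibrio")
print(term)
```

The resulting string can be passed as the `term` argument of `Entrez.esearch`.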

In [None]:
# Load metabolic results and select bacteria for literature search
metabolic_df = pd.read_csv('metabolic_results.csv')
literature_candidates = metabolic_df[metabolic_df['Pathways_Found'] > 1]['Bacteria'].tolist()

# Run literature analysis
lit_results = literature_analysis(literature_candidates)
lit_results.to_csv('literature_results.csv', index=False)

## 3.4. combine_all_results function (final chunk):

In [None]:
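def combine_all_results():
    """Combine and score all previous results"""
    initial = pd.read_csv('batch_results_50.csv')
    metabolic = pd.read_csv('metabolic_results.csv')
    literature = pd.read_csv('literature_results.csv')

    # The screening file has one row per (bacteria, database, keyword), so
    # collapse it to one row per bacteria before merging to avoid duplicates
    initial['Hits'] = pd.to_numeric(initial['Hits'], errors='coerce').fillna(0)
    initial = initial.groupby('Bacteria', as_index=False)['Hits'].sum()

    combined = initial.merge(metabolic, on='Bacteria', how='outer')\
                     .merge(literature, on='Bacteria', how='outer')

    combined['Final_Score'] = (
        combined['Hits'].fillna(0) * 0.3 +
        combined['Pathways_Found'].fillna(0) * 0.4 +
        (combined['MIC_Citations'] > 0).astype(int) * 0.3
    )

    return combined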
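To make the weighting concrete, here is the same formula applied to a single made-up genus in plain Python. The input values are invented for illustration.

```python
def final_score(hits, pathways_found, mic_citations):
    """Weighted score: 0.3 * hits + 0.4 * pathways + 0.3 if any citation exists."""
    has_citation = 1 if mic_citations > 0 else 0
    return hits * 0.3 + pathways_found * 0.4 + has_citation * 0.3


# Made-up example: 2 database hits, 3 pathways found, 12 citations
score = final_score(hits=2, pathways_found=3, mic_citations=12)
print(round(score, 2))
```

Because the hit count is not capped, a genus with many database hits can dominate the score; normalising `hits` to a 0 to 1 range before weighting is one option if the three terms are meant to contribute comparably.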

In [None]:
# Load full list and get next batch
all_bacteria = pd.read_csv('full_bacteria_list.csv')['0'].tolist()
next_batch_start = 10
next_batch = search_all_databases(all_bacteria[next_batch_start:], batch_size=50)

In [None]:
# Combine everything and get final scores
final_results = combine_all_results()
final_results.to_csv('final_mic_analysis.csv', index=False)

# View top candidates
print("\nTop corrosion-influencing bacteria:")
print(final_results.nlargest(10, 'Final_Score')[['Bacteria', 'Final_Score']])

In [None]:
def analyze_mic_potential(results_df):
    """Analyze bacteria for MIC potential based on database hits"""
    # Add score columns
    results_df['Score'] = results_df['Total_Hits'].apply(lambda x: min(x / 10, 1))

    # Classify bacteria based on evidence
    def classify_potential(row):
        if row['Score'] >= 0.8:
            return 'High'
        elif row['Score'] >= 0.5:
            return 'Medium'
        elif row['Score'] > 0:
            return 'Low'
        return 'Unknown'

    results_df['MIC_Potential'] = results_df.apply(classify_potential, axis=1)

    return results_df
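The thresholds above translate into hit counts as follows. This is a stand-alone sketch of the same rule for a single value, using a hypothetical `classify_total_hits` helper.

```python
def classify_total_hits(total_hits):
    """Apply the score = min(hits / 10, 1) rule and the class thresholds."""
    score = min(total_hits / 10, 1)
    if score >= 0.8:
        return "High"
    elif score >= 0.5:
        return "Medium"
    elif score > 0:
        return "Low"
    return "Unknown"


labels = [classify_total_hits(n) for n in (0, 1, 5, 8, 25)]
print(labels)
```

In other words, 8 or more hits give "High", 5 to 7 give "Medium", 1 to 4 give "Low", and zero hits give "Unknown".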

In [None]:
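# Analyze results
# MIC_df is expected to hold one row per bacteria with a 'Total_Hits' column;
# it is not created earlier in this notebook
analyzed_results = analyze_mic_potential(MIC_df)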


In [None]:
# Display summary
print("\nSummary of MIC Potential:")
print(analyzed_results['MIC_Potential'].value_counts())

In [None]:
# Pseudo-code: the scoring helpers and `threshold` are not defined in this notebook
def comprehensive_corrosion_screening(genera_list):
    corrosion_database = {}

    for genus in genera_list:
        # Multiple validation steps
        computational_score = compute_corrosion_potential(genus)
        literature_score = mine_literature(genus)
        metabolic_score = analyze_metabolic_pathways(genus)

        total_score = (computational_score +
                       literature_score +
                       metabolic_score) / 3

        if total_score > threshold:
            corrosion_database[genus] = {
                'potential': total_score,
                'details': generate_detailed_report(genus)
            }

    return corrosion_database
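A runnable variant of the pseudo-code, assuming the three scoring steps are supplied as functions that return a value between 0 and 1. The scorers, the genus names, the scores, and the default threshold of 0.5 are all placeholders invented for this sketch.

```python
def comprehensive_corrosion_screening(genera_list, scorers, threshold=0.5):
    """Average the scores from all scorers and keep genera above the threshold."""
    corrosion_database = {}
    for genus in genera_list:
        scores = {name: scorer(genus) for name, scorer in scorers.items()}
        total_score = sum(scores.values()) / len(scores)
        if total_score > threshold:
            corrosion_database[genus] = {
                "potential": total_score,
                "details": scores,
            }
    return corrosion_database


# Placeholder scores: (computational, literature, metabolic)
toy_scores = {"GenusA": (0.9, 0.8, 0.7), "GenusB": (0.1, 0.2, 0.0)}
scorers = {
    "computational": lambda genus: toy_scores[genus][0],
    "literature": lambda genus: toy_scores[genus][1],
    "metabolic": lambda genus: toy_scores[genus][2],
}

screened = comprehensive_corrosion_screening(["GenusA", "GenusB"], scorers)
print(sorted(screened))
```

Passing the scorers in as arguments keeps the screening loop testable before the real database and literature steps are available.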



Specialised databases:
MicrobeDB
GOLD (Genomes Online Database)
PATRIC Bacterial Bioinformatics Resource

Analysis tools:
QIIME2 (Microbiome analysis)
MetaPhlAn (Metagenomic profiling)
MG-RAST (Metagenome analysis)
Prokka (Genome annotation)



# Biomarkers Refinement
Prioritize bacteria with known corrosion-related activities.
Consider biofilm formation capabilities.
Look for known metal-oxidizing/reducing bacteria.
Factor in pH tolerance and oxygen requirements.

Functional annotation analysis

3. Metabolic pathway analysis and mapping: PICRUSt2 can predict metabolic functions from 16S data.

In [None]:
# Install PICRUSt2 (if not already installed)
conda create -n picrust2 -c bioconda -c conda-forge picrust2

# Activate the environment
conda activate picrust2

# Run full pipeline
picrust2_pipeline.py -s your_sequences.fasta -i your_abundance.biom -o picrust2_output_folder

# For more specific pathway analysis, add descriptions to the predicted EC table
# (run from inside picrust2_output_folder)
add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                   -o EC_metagenome_out/pred_metagenome_unstrat_described.tsv.gz

Requirements:

Sequences should be properly quality filtered.
Sequences should be aligned and trimmed to the same length.
ASVs/OTUs should be properly clustered.
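The equal-length requirement can be checked before running the pipeline. The sketch below reads FASTA text and reports the distribution of sequence lengths; `read_fasta_lengths` is a hypothetical helper and the sequences are invented. In practice the lines would come from the real FASTA file.

```python
from collections import Counter


def read_fasta_lengths(lines):
    """Return {sequence_id: length} for FASTA-formatted lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # ID is the first word of the header
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)   # sequences may span several lines
    return lengths


# Invented example sequences
fasta_text = ">asv1\nACGTACGT\n>asv2\nACGT\nACGT\n>asv3\nACGTA\n"
lengths = read_fasta_lengths(fasta_text.splitlines())
length_counts = Counter(lengths.values())
print(length_counts)
```

More than one key in `length_counts` means the sequences have not been trimmed to a common length.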



# Network analysis

Ecological Networks:

Bacteria that appear "neutral" alone might be critical support species.
They could be enabling or moderating the effects of the corrosion-significant species.
In microbial communities, some species act as "keystone" species not through abundance but through their metabolic interactions.

Stability Indicators:

Species present across all conditions might be:

Buffer species that maintain community stability
Indicators of baseline environmental conditions
Part of the core microbiome that enables other species to thrive

Think of it like a metal alloy: some elements might not directly affect corrosion resistance, but their presence maintains the overall structure that makes the protective elements effective.

However, if data size or processing time is a significant concern, one option is to:

1. Keep the full bacterial data initially
2. Run the analysis
3. Check whether removing the "uniform" species significantly changes the results
4. Document which removals affect the model and which do not
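One way to flag candidate "uniform" genera for this check is the coefficient of variation (standard deviation divided by mean) of each genus across samples. This is a sketch under assumptions: the 0.1 cut-off, the helper name, and the abundance values are all invented.

```python
from statistics import mean, pstdev


def uniform_genera(abundance, max_cv=0.1):
    """Return genera whose coefficient of variation across samples is <= max_cv."""
    uniform = []
    for genus, values in abundance.items():
        average = mean(values)
        if average == 0:
            continue  # absent everywhere: no variation to measure
        if pstdev(values) / average <= max_cv:
            uniform.append(genus)
    return uniform


# Invented abundances across four samples
abundance = {
    "GenusA": [10, 10, 11, 10],   # nearly constant
    "GenusB": [1, 30, 2, 50],     # highly variable
    "GenusC": [0, 0, 0, 0],       # never observed
}
candidates = uniform_genera(abundance)
print(candidates)
```

The model can then be fitted with and without the flagged genera to see whether their removal changes the results.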
_________________________
To understand genus interactions:
Group bacteria by their typical ecological roles (e.g., primary degraders, secondary degraders)
Add known syntrophic relationships between genera
Map carbon/nitrogen cycling capabilities
Identify potential metabolic handoffs between community members

Map each genus to known electron acceptor preferences (Fe, Mn, S, etc.)
Create functional groups based on these metabolic capabilities
Compare the distribution of these functional groups across the categories
Look for enrichment patterns of specific metabolic types
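The electron-acceptor mapping can be sketched as a lookup table plus a per-category count. The three assignments below are deliberately simplified (Desulfovibrio as a sulfate reducer, Geobacter and Shewanella as Fe(III) reducers), and the category names and genus lists are invented for illustration.

```python
from collections import Counter

# Simplified, illustrative assignments; a real table would come from the literature
ELECTRON_ACCEPTOR = {
    "Desulfovibrio": "sulfate",
    "Geobacter": "Fe(III)",
    "Shewanella": "Fe(III)",
}


def functional_group_counts(genera_by_category, acceptor_map):
    """Count genera per electron-acceptor group within each category."""
    return {
        category: Counter(acceptor_map.get(genus, "unassigned") for genus in genera)
        for category, genera in genera_by_category.items()
    }


# Invented categories and membership
genera_by_category = {
    "high_corrosion": ["Desulfovibrio", "Geobacter", "Shewanella"],
    "low_corrosion": ["Geobacter", "UnknownGenus"],
}
counts = functional_group_counts(genera_by_category, ELECTRON_ACCEPTOR)
for category, group_counts in counts.items():
    print(category, dict(group_counts))
```

Comparing these counts between categories gives a first view of which metabolic types are enriched before any statistical test is applied.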


# QIIME2 (Microbiome analysis)

In [None]:
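# Import demultiplexed single-end FASTQ reads into QIIME 2.
# DADA2 needs per-base quality scores, so a FASTA import
# (FeatureData[Sequence]) cannot be denoised; manifest.tsv is a
# placeholder file listing sample IDs and FASTQ paths.
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest.tsv \
  --input-format SingleEndFastqManifestPhred33V2 \
  --output-path sequences.qza

# Run DADA2 for ASV generation (Deblur is an alternative)
qiime dada2 denoise-single \
  --i-demultiplexed-seqs sequences.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-denoising-stats denoising-stats.qza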

# Export to BIOM format
qiime tools export \
  --input-path table.qza \
  --output-path exported-table

# Convert to TSV if needed
biom convert \
  -i exported-table/feature-table.biom \
  -o feature-table.tsv \
  --to-tsv


# Alternative OTU workflow with VSEARCH: dereplicate sequences
vsearch --derep_fulllength your_sequences.fasta \
        --output unique_sequences.fasta \
        --sizeout

# Cluster at 97% similarity (for OTUs)
vsearch --cluster_size unique_sequences.fasta \
        --id 0.97 \
        --centroids clustered_sequences.fasta

# Create OTU table
vsearch --usearch_global your_sequences.fasta \
        --db clustered_sequences.fasta \
        --id 0.97 \
        --otutabout otu_table.txt

