# Antibiotic Resistance Gene Origin Predictor

A machine-learning-based bioinformatics tool for predicting the origin of antibiotic resistance genes using codon usage bias analysis.

## Project Overview

**Problem:** Antibiotic resistance genes spread rapidly through _horizontal gene transfer (HGT)_, making it crucial to track their origins and transmission routes.

**Solution:** Uses machine learning to analyze codon usage patterns and predict whether a resistance gene is native to its host organism or was acquired through horizontal transfer.

**Impact:** Helps understand resistance gene transmission, track outbreak sources, and inform antibiotic stewardship strategies.

---

## Scientific Background

### Codon Usage Bias

Different organisms preferentially use certain synonymous codons over others. These preferences are shaped by:

- tRNA abundance
- Mutation biases
- Selection for translation efficiency
- GC content constraints
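
One common way to quantify these preferences is Relative Synonymous Codon Usage (RSCU): the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below is illustrative only; the `rscu` helper is a hypothetical name, and this project may compute its codon usage features differently.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group synonymous codons by amino acid, leaving out stop codons.
SYNONYMOUS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        SYNONYMOUS[aa].append(codon)


def rscu(sequence):
    """Return {codon: RSCU} for an in-frame coding sequence (hypothetical helper)."""
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3
    counts = Counter(sequence[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for aa, family in SYNONYMOUS.items():
        total = sum(counts[c] for c in family)
        if len(family) == 1 or total == 0:
            continue  # Met/Trp have a single codon; unseen amino acids are skipped
        for c in family:
            # Observed count relative to uniform use of the synonymous codons
            values[c] = counts[c] * len(family) / total
    return values


toy = rscu("CTGCTGCTGCTA")  # four leucine codons: 3x CTG, 1x CTA
print(toy["CTG"], toy["CTA"])  # 4.5 1.5
```

An RSCU above 1 means the codon is used more often than expected under uniform usage, and a value below 1 means it is avoided.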

### Horizontal Gene Transfer Detection

When genes are transferred between organisms, they often retain the codon usage signature of their source organism. A gene whose codon usage deviates from that of its host genome can therefore be flagged through statistical analysis.
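
As a rough illustration of this idea, the sketch below compares a gene's codon frequency vector against a reference set of host genes using Euclidean distance. This is a deliberately simple stand-in, not the tool's model: the helper names and toy sequences are assumptions, and any real decision threshold would have to be calibrated on data.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so frequency vectors are comparable.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(sequence):
    """Return the 64-dimensional codon frequency vector of an in-frame sequence."""
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3
    counts = Counter(sequence[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[c] for c in CODONS)
    if total == 0:
        raise ValueError("sequence contains no complete A/C/G/T codons")
    return [counts[c] / total for c in CODONS]


def codon_distance(gene, reference):
    """Euclidean distance between the codon usage of a gene and a reference."""
    return math.dist(codon_frequencies(gene), codon_frequencies(reference))


# Toy sequences built from the codon sets used in the demo data (assumption).
reference = "CTGGCGCGC" * 50       # stands in for concatenated host genes
native_like = "CTGGCGCGCCTG"       # same codon preferences as the reference
foreign_like = "CTAGCACGACTA"      # different synonymous codons

print(codon_distance(native_like, reference) < codon_distance(foreign_like, reference))  # True
```

A larger distance suggests a codon usage signature unlike the host's, which is consistent with, but not proof of, horizontal acquisition.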


## Create Sample Data

```python
import os
import numpy as np

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
```

```python
def create_demo_data():
    """Create a demo dataset of synthetic genes and write it to data/ as FASTA."""

    # Create data directory
    os.makedirs('data', exist_ok=True)

    # Simplified codon usage biases for the demo
    # E. coli prefers: CTG (Leu), GCG (Ala), CGC (Arg), etc.
    # Transferred genes often have different biases
    native_codons = ['CTG', 'GCG', 'CGC', 'GAA', 'GTG', 'ACC', 'AAC', 'GAC']
    transferred_codons = ['CTA', 'GCA', 'CGA', 'GAG', 'GTA', 'ACA', 'AAT', 'GAT']

    def generate_gene(codon_preference, length=600):
        """Generate a synthetic gene of `length` nt drawn from the given codons."""
        gene = 'ATG'    # Start codon
        # Leave room for the 3-nt stop codon appended below
        while len(gene) < length - 3:
            codon = np.random.choice(codon_preference)
            gene += codon
        gene += 'TAA'   # Stop codon
        return gene

    # Generate native genes (100 samples)
    native_records = []
    for i in range(100):
        seq = generate_gene(native_codons)
        record = SeqRecord(
            Seq(seq),
            id=f"native_{i+1}",
            description=f"Native housekeeping gene {i+1} [E. coli]"
        )
        native_records.append(record)

    # Generate transferred genes (100 samples)
    transferred_records = []
    for i in range(100):
        seq = generate_gene(transferred_codons)
        record = SeqRecord(
            Seq(seq),
            id=f"resistance_{i+1}",
            description=f"Transferred resistance gene {i+1}",
        )
        transferred_records.append(record)

    # Save FASTA files
    SeqIO.write(native_records, 'data/native_genes.fasta', 'fasta')
    SeqIO.write(transferred_records, 'data/transferred_genes.fasta', 'fasta')

    # Report how many records of each class were written
    return len(native_records), len(transferred_records)
```

```python
print("\n[1/5] Creating demo dataset...")
create_demo_data()
```

Output:

```
[1/5] Creating demo dataset...
(100, 100)
```
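
Before using the synthetic sequences downstream, it can be worth checking that each one is a well-formed open reading frame. The following is a minimal sketch using only the standard library; `orf_problems` is a hypothetical helper and not part of the project.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_problems(sequence):
    """Return a list of problems found in a coding sequence (empty if none)."""
    sequence = sequence.upper()
    problems = []
    if len(sequence) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not sequence.startswith("ATG"):
        problems.append("missing ATG start codon")
    if sequence[-3:] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    # Codons between the start codon and the final codon
    internal = [sequence[i:i + 3] for i in range(3, len(sequence) - 3, 3)]
    if any(codon in STOP_CODONS for codon in internal):
        problems.append("in-frame stop codon before the end")
    return problems


# A 600-nt sequence shaped like the demo genes: ATG + 198 codons + TAA
print(orf_problems("ATG" + "CTG" * 198 + "TAA"))  # []
print(orf_problems("ATGCTGTAACTGTAA"))  # ['in-frame stop codon before the end']
```

With Biopython installed, the same check could be applied to the generated files by looping over `SeqIO.parse('data/native_genes.fasta', 'fasta')` and passing `str(record.seq)` to the helper.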