<a href="https://colab.research.google.com/github/SenseiBassa/Bioinformatics-Projects-HackBio-/blob/main/Detect_Antimicrobial_Resistance_Genes_in_a_Bacterial_Genome_Project_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

%%bash
#--------------------
# amr_detection.sh
# Task: Detect Antimicrobial Resistance Genes in a Bacterial Genome
# By: Bassa Joshua Samuel
# Date: 10/09/2025
# In brief: The task was to build a pipeline that detects antimicrobial resistance genes in bacterial genomes by performing quality control, trimming, genome assembly, and resistance gene screening.
# Pipeline: raw paired reads -> FastQC -> fastp trimming -> SPAdes assembly -> Abricate AMR detection
# Outputs:
#  - qc_raw/        : FastQC reports for raw reads
#  - qc_trim/       : fastp reports for trimmed reads
#  - spades_output/ : SPAdes assembly (contigs.fasta)
#  - abricate_out.tab: raw abricate hits
#  - resistance_report.txt : final, human-readable list of detected AMR genes
# Tools needed: fastqc, fastp, spades.py, abricate, samtools (not required but recommended), wget
#----------------------------------


Introduction

Remember, genome sequencing essentially chops the genome (over 3 billion nucleotide bases in the human case) into short chunks of about 30-100 bases. The data we actually get is therefore a collection of fragments (reads), and we want to reassemble the genome from them.
If a high-quality reference genome exists for the species (an approach covered later on), you can map the reads to it and be done. Often, however, there is no reference, either because the species has never been sequenced or because you expect major differences (new genes, structural rearrangements) that mapping might miss.
De novo assembly solves this problem by building the genome from scratch. It is the genetic equivalent of solving a jigsaw puzzle without knowing the final picture. This is critical for:

Novel species discovery — sequencing non-model organisms.

Pathogen tracking — rapidly assembling new viral or bacterial genomes during outbreaks.

Evolutionary studies — comparing fully assembled genomes across species.

Structural variant detection — finding large insertions, deletions, and rearrangements missed by mapping approaches.

In this module, we will cover specific applications in organism detection and pathogen tracking.

Data and Tools

To complete this tutorial, install the following:
SPAdes
QUAST
FastQC
fastp
Bandage (a desktop application, not installed from the terminal). See the installation instructions at https://rrwick.github.io/Bandage/

Dataset:

Use wget to retrieve the paired-end reads:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR135/059/SRR13554759/SRR13554759_1.fastq.gz

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR135/059/SRR13554759/SRR13554759_2.fastq.gz
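
Interrupted FTP downloads can leave truncated files behind, so it may be worth testing the archives before going further. The following is a minimal sketch, not part of the original tutorial: verify_gz is a hypothetical helper name, and it relies only on gzip -t, which tests compressed-file integrity without unpacking anything.

```bash
#!/usr/bin/env bash
# verify_gz FILE...  (hypothetical helper)
# Returns non-zero if any file is missing or fails the gzip integrity test.
verify_gz() {
  local f status=0
  for f in "$@"; do
    if [ -f "$f" ] && gzip -t "$f" 2>/dev/null; then
      echo "OK      $f"
    else
      echo "BROKEN  $f"
      status=1
    fi
  done
  return "$status"
}

# Only runs when the first read file is already present in the working directory.
if [ -f SRR13554759_1.fastq.gz ]; then
  verify_gz SRR13554759_1.fastq.gz SRR13554759_2.fastq.gz \
    || echo "At least one file is incomplete; consider downloading it again."
fi
```

A file reported as BROKEN is usually fixed by deleting it and running the wget command again.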

⇒ We believe you know how to install them by now!

QC and Preprocessing

As usual, we have to perform QC and trim our dataset.

First, QC
mkdir -p qc
fastqc SRR13554759_1.fastq.gz -O qc/
fastqc SRR13554759_2.fastq.gz -O qc/

Then trim with fastp

mkdir -p trim
fastp -i SRR13554759_1.fastq.gz -I SRR13554759_2.fastq.gz -o trim/SRR13554759_1.trim.fastq.gz -O trim/SRR13554759_2.trim.fastq.gz

Good Practice:

Remember to check the data post trimming.
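
One quick way to check the data after trimming is to compare read counts before and after, then rerun FastQC on the trimmed files. The sketch below assumes standard four-line FASTQ records. count_reads is a hypothetical helper name and qc_trimmed/ is an assumed output directory; neither comes from the tutorial.

```bash
#!/usr/bin/env bash
# count_reads FILE.fastq.gz  (hypothetical helper)
# Prints the number of reads, assuming 4 lines per FASTQ record.
count_reads() {
  gzip -cd "$1" | awk 'END { print int(NR / 4) }'
}

raw="SRR13554759_1.fastq.gz"
trimmed="trim/SRR13554759_1.trim.fastq.gz"

# Only runs when both files exist in the working directory.
if [ -f "$raw" ] && [ -f "$trimmed" ]; then
  echo "raw reads:     $(count_reads "$raw")"
  echo "trimmed reads: $(count_reads "$trimmed")"
  # Rerun FastQC on the trimmed pair if the tool is installed.
  if command -v fastqc >/dev/null 2>&1; then
    mkdir -p qc_trimmed   # assumed directory name
    fastqc trim/SRR13554759_1.trim.fastq.gz trim/SRR13554759_2.trim.fastq.gz -o qc_trimmed/
  fi
fi
```

A large drop in read count after trimming usually points at poor raw quality or heavy adapter content, which the FastQC reports should confirm.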

De novo Assembly with spades.py

This is a straightforward step.
To do this,
mkdir assembly
spades.py -1 trim/SRR13554759_1.trim.fastq.gz -2 trim/SRR13554759_2.trim.fastq.gz -o assembly

Here’s what’s happening in plain English:

1. spades.py

This is the SPAdes genome assembler.
Think of it as a puzzle solver for DNA: it takes lots of short DNA reads from your sequencing machine and figures out how they fit together to recreate the original genome.

2. -1 trim/SRR13554759_1.trim.fastq.gz

This is your first set of DNA reads from paired-end sequencing.
In paired-end sequencing, the machine reads from both ends of a DNA fragment, so you get two files: one for read 1, one for read 2.
Here, SRR13554759_1.trim.fastq.gz is read 1, stored in a compressed .gz file.

3. -2 trim/SRR13554759_2.trim.fastq.gz

This is your second set of reads — the other “end” of each DNA fragment.
SPAdes will use the information from both ends to assemble the genome more accurately.

4. -o assembly

This says: “Put all the results into a folder called assembly.”
After SPAdes finishes, you’ll open that folder to find your assembled genome files (and some extra logs and reports).

In short:

This command is telling SPAdes:
“Here are my two paired-end read files (read 1 and read 2). Please figure out the genome they came from and save the results in a folder called assembly.”

⇒ Task: Add the -k option to set the k-mer size. SPAdes only accepts odd values below 128, so an even value such as 40 is rejected. Try 41 and 85.
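
For the k-mer task, the sketch below checks a -k value before handing it to SPAdes. The SPAdes manual states that k-mer sizes must be odd, below 128 and listed in ascending order. valid_kmers is a hypothetical helper name, and the assembly_k41 and assembly_k85 output directories are assumed names chosen for illustration.

```bash
#!/usr/bin/env bash
# valid_kmers "21,33,55"  (hypothetical helper)
# Succeeds only if every value is an odd integer below 128
# and the comma-separated list is strictly ascending.
valid_kmers() {
  local prev=0 k
  local IFS=','
  for k in $1; do
    case "$k" in ''|*[!0-9]*) return 1 ;; esac      # digits only
    case "$k" in *[13579]) ;; *) return 1 ;; esac   # odd
    [ "$k" -lt 128 ] || return 1
    [ "$k" -gt "$prev" ] || return 1                # ascending
    prev=$k
  done
  [ "$prev" -gt 0 ]                                 # reject an empty list
}

r1="trim/SRR13554759_1.trim.fastq.gz"
r2="trim/SRR13554759_2.trim.fastq.gz"

# One assembly per k-mer size; skipped when SPAdes or the reads are unavailable.
for ks in 41 85; do
  if valid_kmers "$ks" && command -v spades.py >/dev/null 2>&1 && [ -f "$r1" ] && [ -f "$r2" ]; then
    spades.py -1 "$r1" -2 "$r2" -k "$ks" -o "assembly_k$ks"
  fi
done
```

Writing each run to its own directory lets you compare the resulting contigs side by side instead of overwriting the previous result.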

Visualization with Bandage

Download the software here: https://rrwick.github.io/Bandage/
⇒ Please watch in full-screen mode. Bandage does not support zoom and the text is tiny.
Use case: detecting the presence of plasmids.

Assessing your Assembly with quast.py

QC with quast.py
quast.py assembly/contigs.fasta -o quast_report

You’ll get an HTML report inside quast_report/. Open it in your browser. The main metrics you should care about:

Assembly Length

Should be close to your organism’s known genome size.
Too short → Missing regions.
Too long → Contamination or unresolved repeats.

Number of Contigs

For bacterial genomes: ideally under ~100 (often <10 for high-quality).

For eukaryotic genomes: depends on genome size and complexity, but lower is better.

Largest Contig

A single huge contig often means SPAdes resolved a large chunk well.

For small genomes, having the largest contig >50% of genome size is excellent.

N50

The contig length such that contigs of that length or longer contain at least 50% of the total assembly length.

Higher = more continuous assembly.

Bacteria: >50 kb is decent, >200 kb is strong.

Eukaryotes: >1 Mb is often considered good (varies by species).

GC Content

Compare to the expected GC% for your organism.
Big deviations suggest contamination.

Misassemblies

Count of suspicious joins in the assembly compared to a reference (if provided).
A low number is good; a high number means assembly errors.

Genome Fraction (% coverage)

If you used a reference genome in QUAST, this shows what % of it your assembly covers.
>95% is usually strong.
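
To see how the headline numbers are derived, the sketch below computes contig count, total length, largest contig, N50 and GC% straight from a FASTA file using awk and sort. assembly_stats is a hypothetical helper name. Unlike QUAST, which by default ignores contigs shorter than 500 bp, it counts every contig, so its values can differ slightly from the report.

```bash
#!/usr/bin/env bash
# assembly_stats FASTA  (hypothetical helper)
# Prints contigs, total_length, largest, N50 and GC as key=value lines.
assembly_stats() {
  awk '
    # New header: emit the length and G+C count of the previous contig.
    /^>/ { if (len > 0) print len "\t" gc; len = 0; gc = 0; next }
    {
      seq = toupper($0)
      len += length(seq)
      gc  += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C bases
    }
    END { if (len > 0) print len "\t" gc }
  ' "$1" | sort -k1,1nr | awk -F'\t' '
    { lens[NR] = $1; total += $1; gc += $2 }
    END {
      if (NR == 0) exit 1
      half = total / 2
      # Contigs arrive longest first: N50 is the length at which the
      # running sum first reaches half of the total assembly length.
      for (i = 1; i <= NR; i++) {
        cum += lens[i]
        if (cum >= half) { n50 = lens[i]; break }
      }
      printf "contigs=%d\ntotal_length=%d\nlargest=%d\nN50=%d\nGC=%.2f\n", NR, total, lens[1], n50, 100 * gc / total
    }
  '
}

# Only runs when the SPAdes output from the assembly step exists.
if [ -f assembly/contigs.fasta ]; then
  assembly_stats assembly/contigs.fasta
fi
```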

In Summary

Start
 ↓
Do you have a very high number of contigs (>500 for bacteria, >10,000 for eukaryotes)?
  ├── Yes → Assembly likely fragmented → Check coverage & contamination.
  └── No  → Good start.
 ↓
Is the N50 large (for bacteria >50 kb, for eukaryotes usually >1 Mb)?
  ├── No  → May be fragmented → Try different k-mers or higher coverage.
  └── Yes → Great continuity.
 ↓
Is total assembly length close to expected genome size?
  ├── No  → Too small → missing sequences; Too big → contamination/repeats.
  └── Yes → Length looks fine.
 ↓
Do most contigs have even coverage (within 20% of average)?
  ├── No  → Uneven coverage suggests contamination or misassembly.
  └── Yes → Consistent coverage = good.
 ↓
Does QUAST show high genome completeness (>95%) and low misassemblies?
  ├── No  → Investigate misassemblies or missing regions.
  └── Yes → Assembly is solid.
 ↓
Final Verdict:
  - Passed all → High-confidence assembly.
  - Failed some → Adjust pipeline (coverage, k-mers, error correction).
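
The first three questions of the flowchart can be sketched as a small shell function for bacterial assemblies. check_bacterial_assembly is a hypothetical name. The contig and N50 thresholds come from the flowchart, while the 10% tolerance on assembly length is an assumption made here, since the flowchart only says "close to".

```bash
#!/usr/bin/env bash
# check_bacterial_assembly CONTIGS N50 TOTAL_LEN EXPECTED_LEN  (hypothetical helper)
# All arguments are integers (lengths in bp). Returns non-zero if any check fails.
check_bacterial_assembly() {
  local contigs=$1 n50=$2 total=$3 expected=$4 failed=0

  if [ "$contigs" -gt 500 ]; then
    echo "FAIL contigs: $contigs (>500), assembly likely fragmented"; failed=1
  else
    echo "PASS contigs: $contigs"
  fi

  if [ "$n50" -gt 50000 ]; then
    echo "PASS N50: $n50"
  else
    echo "FAIL N50: $n50 (not above 50 kb), try other k-mers or higher coverage"; failed=1
  fi

  # Assumed tolerance: total length within +/- 10% of the expected genome size.
  local low=$((expected * 90 / 100)) high=$((expected * 110 / 100))
  if [ "$total" -lt "$low" ]; then
    echo "FAIL length: $total (too small, sequence may be missing)"; failed=1
  elif [ "$total" -gt "$high" ]; then
    echo "FAIL length: $total (too big, possible contamination or repeats)"; failed=1
  else
    echo "PASS length: $total"
  fi

  return "$failed"
}

# Example call with made-up numbers (not measured from this dataset):
#   check_bacterial_assembly 85 120000 2800000 2800000
```

The coverage and genome-fraction questions are left out because they need per-contig coverage and a reference genome, which this sketch does not assume you have.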

Antimicrobial Resistance Gene Detection

You can also use assembled genomes (contigs.fasta) to detect antimicrobial resistance genes (ARGs).

Note: Abricate screens for acquired resistance genes, so the ARGs you detect here are acquired genes rather than resistance-conferring point mutations.
To do this, install abricate:
conda install bioconda::abricate

Then,
mkdir AMR
abricate assembly/contigs.fasta > AMR/amr_tab.tab

In the current data, the organism carries LINCOSAMIDE;MACROLIDE;STREPTOGRAMIN resistance genes.
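
To pull the gene names and resistance classes out of the abricate table, a sketch like the one below can help. summarize_amr is a hypothetical helper name. It looks columns up by header name because column positions differ between abricate versions, and it falls back to NA when a RESISTANCE column is not present.

```bash
#!/usr/bin/env bash
# summarize_amr TABLE  (hypothetical helper)
# Prints one line per distinct hit: GENE <tab> RESISTANCE
summarize_amr() {
  awk -F'\t' '
    NR == 1 {
      # Map each header name to its column index.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("GENE" in col)) { print "no GENE column found" > "/dev/stderr"; exit 1 }
      next
    }
    {
      res = ("RESISTANCE" in col) ? $(col["RESISTANCE"]) : "NA"
      print $(col["GENE"]) "\t" res
    }
  ' "$1" | sort -u
}

# Only runs when the abricate output from the step above exists.
if [ -f AMR/amr_tab.tab ]; then
  summarize_amr AMR/amr_tab.tab
fi
```

Sorting with -u collapses the same gene found on several contigs into a single line, which keeps the summary short.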

CODE TASK

Task: Detect Antimicrobial Resistance Genes in a Bacterial Genome

Objective:

Using the dataset we have been working with, complete the analysis and generate a report containing the script and the final resistance genes found in this data.

Data Provided:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR135/059/SRR13554759/SRR13554759_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR135/059/SRR13554759/SRR13554759_2.fastq.gz

Steps to Complete:

1. Perform quality control on raw reads using FastQC.
2. Trim reads using fastp to remove adapters and low-quality bases.
3. Assemble the trimmed reads into contigs using SPAdes (spades.py).
4. Detect antimicrobial resistance genes using Abricate.
5. Generate a simple report (resistance_report.txt) that clearly lists detected AMR genes.


Submission:

A single Bash script (amr_detection.sh) that executes the full pipeline from raw reads to AMR report.
Include comments explaining each step.


%%bash
#!/usr/bin/env bash
set -euo pipefail
# ------------------ Install Tools ------------------
# Update package list and install necessary tools
apt-get update -y
apt-get install -y fastqc fastp spades abricate

# ------------------ Config ------------------
THREADS=4
RAW1="SRR13554759_1.fastq.gz"
RAW2="SRR13554759_2.fastq.gz"

# ------------------ Download ----------------
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR135/059/SRR13554759/${RAW1}
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR135/059/SRR13554759/${RAW2}

# ------------------ QC ----------------------
mkdir -p qc_raw qc_trim
fastqc -o qc_raw -t $THREADS $RAW1 $RAW2

# ------------------ Trimming ----------------
fastp -i $RAW1 -I $RAW2 \
      -o qc_trim/trimmed_R1.fastq.gz \
      -O qc_trim/trimmed_R2.fastq.gz \
      --html qc_trim/fastp_report.html \
      --thread $THREADS

# ------------------ Assembly ----------------
spades.py -1 qc_trim/trimmed_R1.fastq.gz \
          -2 qc_trim/trimmed_R2.fastq.gz \
          -o spades_output \
          -t $THREADS --only-assembler

# ------------------ AMR Detection -----------
abricate spades_output/contigs.fasta > abricate_out.tab

# ------------------ Report ------------------
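# Columns are looked up by header name because their order differs between abricate versions.
awk -F'\t' -v OFS='\t' '
  NR==1 { for (i=1; i<=NF; i++) c[$i]=i; print "CONTIG","GENE","%IDENTITY","%COVERAGE"; next }
  { print $(c["SEQUENCE"]), $(c["GENE"]), $(c["%IDENTITY"]), $(c["%COVERAGE"]) }
' abricate_out.tab > resistance_report.txt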

echo "Pipeline complete. See resistance_report.txt"