#Genome Mining and Pathogen Identification

This notebook guides students through the process of retrieving metagenomic sequencing reads from the NCBI Sequence Read Archive (SRA), processing those reads, and identifying plant pathogens, specifically begomoviruses, using tools like Kraken2 and Pavian.

#Install dependencies and tools

**Install miniconda**

In [None]:
# @title
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge

**Install Kraken2 and KrakenTools**

In [None]:
# @title
!conda install bioconda::kraken2 -y
!conda install bioconda::krakentools -y

#Plant Pathogen Detection in SRA-Submitted Metagenomic Sample Data
We will use Illumina sequencing data from a published study (SRP430024) to demonstrate how to classify reads and identify potential plant pathogens in metagenomic data. The study investigates soil samples from both greenhouse (SRR24008622) and field (SRR24008623) environments, which makes it a practical example of metagenomic pathogen detection.

**Fetch the raw data**

The following code downloads the paired-end reads for both samples.

In [None]:
# @title
##greenhouse sample
!wget http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR240/022/SRR24008622/SRR24008622_1.fastq.gz	http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR240/022/SRR24008622/SRR24008622_2.fastq.gz
##field sample
!wget http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR240/023/SRR24008623/SRR24008623_1.fastq.gz	http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR240/023/SRR24008623/SRR24008623_2.fastq.gz

Now, let’s focus on detecting begomoviruses. We will use Kraken2 to classify the reads against a custom-built database that contains all begomoviruses reported in the NCBI database. The code below downloads and unpacks that database.

In [None]:
# @title
!wget https://zenodo.org/record/13966270/files/begomavirus_kraken2_102124.tar.gz
!tar -xvf begomavirus_kraken2_102124.tar.gz
!rm begomavirus_kraken2_102124.tar.gz

**Run Kraken2** to classify the reads for each sample. The commands below search for hits in the begomovirus database for both the greenhouse and field samples. Each command writes a taxonomic classification report (report_greenhouse.txt and report_field.txt) and a per-read classification file (greenhouse and field):

In [None]:
# @title
# --paired tells Kraken2 that the two FASTQ files hold mates of the same fragments
!kraken2 --db ./bego --paired --use-names --report report_greenhouse.txt --output greenhouse SRR24008622_1.fastq.gz SRR24008622_2.fastq.gz
!kraken2 --db ./bego --paired --use-names --report report_field.txt --output field SRR24008623_1.fastq.gz SRR24008623_2.fastq.gz

**Run Pavian**

Download the Kraken reports to your computer and use Pavian to visualize the results. We will review the classification of the reads. Analyze the results using Sankey plots for a comprehensive overview, and compare the samples by the percentage of classified reads, taxonomic IDs (taxids), and Z-scores.
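
Before uploading the reports to Pavian, it can help to take a quick look at them inside the notebook. The sketch below assumes the standard six-column, tab-separated Kraken2 report layout (percentage, clade reads, direct reads, rank code, taxid, name). The helper names `read_kraken_report`, `classified_percent`, and `top_taxa` are hypothetical and not part of Kraken2 or Pavian.

```python
from pathlib import Path


def read_kraken_report(path):
    """Parse a standard six-column Kraken2 report into a list of dicts."""
    rows = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue  # skip blank or malformed lines
            rows.append({
                "percent": float(fields[0]),
                "clade_reads": int(fields[1]),
                "direct_reads": int(fields[2]),
                "rank": fields[3].strip(),
                "taxid": int(fields[4]),
                "name": fields[5].strip(),  # names are indented by depth
            })
    return rows


def classified_percent(rows):
    """Percentage of reads placed under root, out of all reads in the report."""
    unclassified = sum(r["clade_reads"] for r in rows if r["rank"] == "U")
    classified = sum(r["clade_reads"] for r in rows if r["rank"] == "R")
    total = unclassified + classified
    return 100.0 * classified / total if total else 0.0


def top_taxa(rows, rank="S", n=5):
    """Return the n taxa of the given rank code with the most clade reads."""
    hits = [r for r in rows if r["rank"] == rank]
    return sorted(hits, key=lambda r: r["clade_reads"], reverse=True)[:n]


# Summarize each report if it is present in the working directory
for report in ("report_greenhouse.txt", "report_field.txt"):
    if Path(report).exists():
        rows = read_kraken_report(report)
        print(f"{report}: {classified_percent(rows):.2f}% of reads classified")
        for r in top_taxa(rows):
            print(f"  taxid {r['taxid']:>8}  {r['clade_reads']:>8} reads  {r['name']}")
```

This only prints a text summary; Pavian remains the tool for Sankey plots and sample comparison.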

**Z-Scores**
In metagenomics, Z-scores are a useful statistical tool for comparing the abundance of specific microbes or viruses across different environments or conditions.

A Z-score tells us how far a particular value is from the average, measured in standard deviations. It helps you understand if a data point (like the abundance of a virus) is higher or lower than the average, and by how much.

**The formula for a Z-score is:**

$$Z = \frac{X - \mu}{\sigma}$$

**Where:**

X is the value you are measuring (e.g., abundance of a virus in a sample).
μ is the mean (average) value of that measurement across different samples (e.g., average abundance of the virus across greenhouse and field samples).
σ is the standard deviation, which measures how spread out the values are from the mean.

**Interpretation:**
Positive Z-score: Indicates that the virus is more abundant than the average.
Negative Z-score: Indicates that the virus is less abundant than the average.
Z = 0: The abundance is exactly at the average.
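
As a minimal sketch of the formula above, the following Python snippet computes Z-scores for a small set of made-up abundance values. The sample names and numbers are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption; Pavian's own calculation may differ in detail.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / standard deviation for every value in the list."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # All values are identical, so no value deviates from the mean
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical percentages of reads assigned to one virus in four samples
abundance = {"sample_A": 2.0, "sample_B": 4.0, "sample_C": 4.0, "sample_D": 6.0}

for name, z in zip(abundance, z_scores(list(abundance.values()))):
    print(f"{name}: Z = {z:+.2f}")
```

With only two samples, such as one greenhouse and one field sample, this definition always yields Z-scores of -1 and +1 whenever the two values differ, so Z-scores become informative mainly when more samples are compared.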

Download your reports

In [None]:
# @title
from google.colab import files
files.download('report_field.txt')
files.download('report_greenhouse.txt')

Run Pavian

In [None]:
# @title
#Visualizing Sample using Pavian
from IPython.display import IFrame
IFrame('https://fbreitwieser.shinyapps.io/pavian/', width='100%', height=600)

**Extract reads classified as begomovirus for downstream assembly**

After identifying the begomoviruses, the next step is to extract the reads classified to them from the original metagenomic data. These reads can then be used to assemble the viral genomes. Download the resulting FASTQ files for use in further analysis, for example genome assembly.



In [None]:
# @title
# Taxid 10814 is the genus Begomovirus; --include-children also extracts reads assigned to taxa below it
!extract_kraken_reads.py -k greenhouse -r report_greenhouse.txt -t 10814 --include-children -s SRR24008622_1.fastq.gz -s2 SRR24008622_2.fastq.gz --fastq-output -o greenhouse_1.fastq -o2 greenhouse_2.fastq
!extract_kraken_reads.py -k field -r report_field.txt -t 10814 --include-children -s SRR24008623_1.fastq.gz -s2 SRR24008623_2.fastq.gz --fastq-output -o field_1.fastq -o2 field_2.fastq
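
As a quick sanity check before assembly, the sketch below counts the reads in each extracted file and confirms that both mates of a pair contain the same number of records. It assumes plain four-line FASTQ records and the output file names used in the extraction step; `count_fastq_reads` is a hypothetical helper, not part of KrakenTools.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Compare forward and reverse read counts for each sample, if the files exist
for sample in ("greenhouse", "field"):
    r1 = Path(f"{sample}_1.fastq")
    r2 = Path(f"{sample}_2.fastq")
    if r1.exists() and r2.exists():
        n1, n2 = count_fastq_reads(r1), count_fastq_reads(r2)
        status = "OK" if n1 == n2 else "MISMATCH"
        print(f"{sample}: {n1} forward reads, {n2} reverse reads ({status})")
```

A mismatch between the two counts would suggest that the extraction did not complete cleanly, and most paired-end assemblers expect the two files to be in sync.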