# Metagenome analysis with QIIME 2

QIIME2 Tutorial
https://cap-lab.bio/q2-books/01-sra-data-access.html

SRA Data
https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJEB52339

## Setup
https://docs.qiime2.org/2024.2/install/native/#qiime-2-metagenome-distribution

Install the QIIME 2 metagenome (shotgun) distribution by following the instructions linked above. The commands below download the 2024.2 environment file, create the conda environment, and activate it:
```
wget https://data.qiime2.org/distro/shotgun/qiime2-shotgun-2024.2-py38-linux-conda.yml
conda env create -n qiime2-shotgun-2024.2 --file qiime2-shotgun-2024.2-py38-linux-conda.yml
conda activate qiime2-shotgun-2024.2
```

## Directory structure
Below you can see the directory structure that we will use throughout this tutorial:
```
<your working directory>
├── moshpit_tutorial
│   ├── cache
│   ├── results
```
Once you have decided on the location of your working directory, create the results subdirectory by running the following command:

In [None]:
!mkdir -p moshpit_tutorial/results
# Next, create the cache subdirectory (this is where the majority of the data will be written by QIIME 2).
# All artifacts are saved into the QIIME 2 cache; the final visualizations and tables go into the results directory.
!qiime tools cache-create --cache ./moshpit_tutorial/cache

## Required databases
To perform taxonomic and functional annotation, we need a couple of reference databases. Below are instructions for downloading them with the respective QIIME 2 actions.

### Kraken 2/Bracken database

In [None]:
!qiime moshpit build-kraken-db \
    --p-collection standard \
    --o-kraken2-database ./moshpit_tutorial/cache:kraken_standard \
    --o-bracken-database ./moshpit_tutorial/cache:bracken_standard \
    --verbose

### EggNOG databases

In [None]:
!qiime moshpit fetch-diamond-db \
    --o-diamond-db ./moshpit_tutorial/cache:eggnog_diamond_full \
    --verbose

!qiime moshpit fetch-eggnog-db \
    --o-eggnog-db ./moshpit_tutorial/cache:eggnog_annot_full \
    --verbose

## Data retrieval from SRA
Find the BioProject in the SRA Run Selector by entering the BioProject ID in the search box at the link below:
https://www.ncbi.nlm.nih.gov/Traces/study/

(e.g. https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJEB52339&o=acc_s%3Aa)

In the Total row, click the download buttons for the metadata and the accession list. This downloads SraRunTable.txt and SRR_Acc_List.txt.
### Generate ids.tsv file
The ids.tsv file has a single column named "id" that lists all accession IDs.

In [None]:
# Open the original file to read its contents
with open('SRR_Acc_List.txt', 'r') as file:
    # Read all lines from the file
    lines = file.readlines()

# Open the new file in write mode and add "id" at the first line
with open('ids.tsv', 'w') as file:
    # Write "id" followed by a newline character at the beginning
    file.write("id\n")
    
    # Write the original content after "id"
    file.writelines(lines)
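
The accession list may contain blank lines, repeated entries, or a last line without a trailing newline. A minimal sketch of a more defensive variant follows; the function name `build_ids_tsv` and the example accessions are hypothetical and not part of the original tutorial.

```python
def build_ids_tsv(accession_lines):
    """Return the text of an ids.tsv file: an "id" header plus one accession per line."""
    seen = set()
    ids = []
    for line in accession_lines:
        accession = line.strip()
        # Skip blank lines and accessions that were already listed
        if accession and accession not in seen:
            seen.add(accession)
            ids.append(accession)
    # Every row, including the last one, ends with a newline
    return "\n".join(["id"] + ids) + "\n"


# Hypothetical accession IDs, used only to illustrate the cleanup
example_text = build_ids_tsv(["ERR0000001\n", "\n", "ERR0000002", "ERR0000001\n"])
```

To use it on the real list, read `SRR_Acc_List.txt` with `readlines()` and write the returned text to `ids.tsv`.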

In [None]:
# Import the accession IDs into a QIIME 2 artifact
!qiime tools import \
  --type NCBIAccessionIDs \
  --input-path ids.tsv \
  --output-path ./moshpit_tutorial/ids.qza

Finally, we can use the get-all action to download the data.

For more detail, see the q2-fondue tutorial: https://github.com/bokulich-lab/q2-fondue/blob/main/tutorial/tutorial.md

In [None]:
!qiime fondue get-all \
      --i-accession-ids ./moshpit_tutorial/ids.qza \
      --p-email your.email@custom.com \
      --p-retries 3 \
      --verbose \
      --use-cache ./moshpit_tutorial/cache \
      --output-dir fondue-output

This will download all the sequences into the QIIME 2 cache. It is a lot of data, so keep in mind that depending on your network speed, this might take a while. Once the data is downloaded, you can proceed to one (or more) of the following steps:
 - Annotation of reads
 - Generation and annotation of contigs

## Annotation of reads

### Read Taxonomic Annotation
With metagenomic data, the first step of our analysis is to run Kraken 2.

This gives us taxonomic annotations for our reads, from which we can create the feature table that we will use for the rest of the analysis.

In this command, inputs that already live in the cache are read directly, which saves the time otherwise spent unzipping, reading, and writing them. We also write our outputs directly to the artifact cache, which similarly saves the time needed to write the files out and zip them into .qza archives.

We have found that it is most effective to keep your artifacts in the cache until you have a feature table, because of the size of these data.

In [None]:
!qiime moshpit classify-kraken2 \
    --i-seqs ./fondue-output/paired_reads.qza \
    --i-kraken2-db ./moshpit_tutorial/cache:kraken_standard \
    --p-threads 40 \
    --p-confidence 0.6 \
    --p-minimum-base-quality 20 \
    --o-hits ./moshpit_tutorial/cache:workshop_kraken_db_hits \
    --o-reports ./moshpit_tutorial/cache:workshop_kraken_db_reports \
    --p-report-minimizer-data \
    --use-cache ./moshpit_tutorial/cache \
    --verbose

At this point we have kraken reports and hits.

Hits are per-sample tab-separated files that contain read information on each line. Reports are per-sample tab-separated files that contain taxon information on each line.

Each line of a hits file describes one read: U or C, depending on whether the read was unclassified or classified; the read ID as seen in the FASTQ header; the taxonomic ID (or 0 if unclassified); the length of the sequence; and a list of LCA mappings for each k-mer (which indicates which k-mers mapped to which taxonomic annotations).
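
To make the five fields concrete, here is a minimal sketch of a parser for one line of Kraken 2 standard output. The names `KrakenHit` and `parse_kraken_hit` and the example line are hypothetical; the sketch assumes the default output format (no `--use-names`), in which paired-end reads carry a `|:|` token separating the k-mer mappings of the two mates.

```python
from typing import NamedTuple


class KrakenHit(NamedTuple):
    classified: bool  # True for "C", False for "U"
    read_id: str
    taxid: str        # "0" when the read is unclassified
    length: str       # e.g. "100", or "100|100" for paired reads
    kmer_lca: list    # (taxid_or_code, kmer_count) pairs, in read order


def parse_kraken_hit(line):
    """Split one tab-separated hit line into its five fields."""
    status, read_id, taxid, length, lca = line.rstrip("\n").split("\t")
    pairs = []
    for token in lca.split():
        key, _, count = token.partition(":")
        if key == "|":
            # "|:|" only marks the boundary between the two mates
            continue
        # key is a taxid, "0" (no database hit), or "A" (ambiguous nucleotide)
        pairs.append((key, int(count)))
    return KrakenHit(status == "C", read_id, taxid, length, pairs)


# Hypothetical example line for a classified paired-end read
hit = parse_kraken_hit("C\tread1\t562\t100|100\t562:13 561:4 A:31 0:1 |:| 562:20\n")
```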

Each line of a report describes one taxon: the percentage of fragments covered by the clade rooted at this taxon; the number of fragments covered by that clade; the number of fragments assigned directly to this taxon; a rank code indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies; the NCBI taxonomic ID number; and the taxonomic annotation. Change the taxonomic level to suit your data using the --p-level parameter of the Bracken step (see https://forum.qiime2.org/t/q2-shotgun-bracken-error/29171).
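
The report columns can be read with a few lines of Python. This is a sketch under stated assumptions rather than a reference parser: the function name `parse_kraken_report_line` and the example lines are hypothetical, the taxon name is assumed to be indented by two spaces per tree level, and reports written with minimizer data (as requested by `--p-report-minimizer-data` above) are assumed to carry two extra columns after the third one.

```python
def parse_kraken_report_line(line):
    """Parse one report line with 6 columns, or 8 when minimizer data is included."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (6, 8):
        raise ValueError(f"expected 6 or 8 columns, got {len(fields)}")
    record = {
        "percent": float(fields[0]),
        "clade_fragments": int(fields[1]),
        "taxon_fragments": int(fields[2]),
    }
    if len(fields) == 8:
        # Extra columns present only when minimizer data was requested
        record["minimizers"] = int(fields[3])
        record["distinct_minimizers"] = int(fields[4])
    rank, taxid, name = fields[-3:]
    record["rank"] = rank
    record["taxid"] = int(taxid)
    # Assumption: two leading spaces per level of the taxonomy tree
    record["depth"] = (len(name) - len(name.lstrip(" "))) // 2
    record["name"] = name.strip()
    return record


# Hypothetical example lines: a standard report row and one with minimizer data
genus = parse_kraken_report_line(" 45.20\t904\t12\tG\t561\t          Escherichia\n")
species = parse_kraken_report_line("  0.50\t10\t10\t120\t95\tS\t562\t            Escherichia coli\n")
```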

For more information on Kraken outputs, visit the Kraken Manual!

Bracken uses a Bracken database, the length of your reads, and the Kraken 2 reports to give you a FeatureTable[Frequency].

In [None]:
!qiime moshpit estimate-bracken \
    --i-bracken-db ./moshpit_tutorial/cache:bracken_standard \
    --p-read-len 100 \
    --i-kraken-reports ./moshpit_tutorial/cache:workshop_kraken_db_reports \
    --o-reports ./moshpit_tutorial/kraken-outputs/bracken-reports.qza \
    --o-taxonomy ./moshpit_tutorial/kraken-outputs/taxonomy-bracken.qza \
    --p-level P \
    --o-table ./moshpit_tutorial/kraken-outputs/table-bracken.qza

### Filtering Feature Table and Normalization
Once we have a feature table, the workflow becomes a lot more similar to the amplicon workflow of QIIME 2.

In this tutorial, we’re going to work specifically with samples that were included in the autoFMT randomized trial. Many of these subjects dropped out before randomization (placing the subject into FMT group or Control group) and therefore do not have a value in the autoFmtGroup.

We need to filter our feature table to contain samples that were in the autoFMT study by filtering out any samples that are null in the metadata column autoFmtGroup.

In [None]:
!qiime feature-table filter-samples \
  --i-table ./moshpit_tutorial/kraken-outputs/table-bracken.qza \
  --m-metadata-file ./new-sample-metadata.tsv \
  --o-filtered-table autofmt-table.qza

For this tutorial, we will normalize our data by generating a relative-frequency table.

In [None]:
!qiime feature-table relative-frequency \
    --i-table autofmt-table.qza \
    --o-relative-frequency-table autofmt-table-rf.qz
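
Relative-frequency normalization divides each count by the total count of its sample, so that every sample sums to 1. The sketch below illustrates the arithmetic on plain dictionaries; it is not the QIIME 2 implementation, and the function name, the table layout, and the counts are hypothetical.

```python
def relative_frequency(table):
    """table maps sample ID -> {feature ID: count}; returns per-sample proportions."""
    result = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        if total == 0:
            # Design choice of this sketch: refuse samples with no counts
            raise ValueError(f"sample {sample!r} has no counts")
        result[sample] = {feature: count / total for feature, count in counts.items()}
    return result


# Hypothetical phylum-level counts for a single sample
rf = relative_frequency({"sample1": {"Firmicutes": 30, "Bacteroidota": 10}})
```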

### Alpha diversity
First we’ll look for general patterns, by comparing different categorical groupings of samples to see if there is some relationship to richness.

To start, we’ll generate an ‘observed features’ vector from our relative-frequency table:

In [None]:
!qiime diversity alpha \
    --i-table autofmt-table-rf.qz.qza \
    --p-metric "observed_features" \
    --o-alpha-diversity obs-autofmt-bracken-rf
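
Observed features is a richness measure: the number of features with a non-zero abundance in a sample. Because it only asks whether a value is above zero, counts and relative frequencies give the same answer. A minimal illustration follows, with a hypothetical function name and hypothetical values:

```python
def observed_features(abundances):
    """Count the features whose abundance in one sample is greater than zero."""
    return sum(1 for value in abundances.values() if value > 0)


# Hypothetical sample, once as counts and once as relative frequencies
as_counts = {"Firmicutes": 30, "Bacteroidota": 10, "Proteobacteria": 0}
as_proportions = {"Firmicutes": 0.75, "Bacteroidota": 0.25, "Proteobacteria": 0.0}
richness_counts = observed_features(as_counts)
richness_proportions = observed_features(as_proportions)
```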

The first thing to notice is the high variability in each individual’s richness (PatientID). The centers and spreads of the individual distributions are likely to obscure other effects, so we will want to keep this in mind. Additionally, we have repeated measures of each individual, so we are violating independence assumptions when looking at other categories. (Kruskal-Wallis is a non-parametric test, but like most tests, it still requires samples to be independent.)

Keeping in mind that other categories are probably inconclusive, we notice that there are (amusingly, and somewhat reassuringly) differences in stool consistency (solid vs non-solid).

Because these data were derived from a study in which participants received an auto-fecal microbiota transplant, we may also be interested in whether there was a difference in richness between the control group and the auto-FMT group.

Looking at autoFmtGroup, we see that there is no apparent difference, but we also know that we are violating independence with our repeated measures, and all patients received a bone-marrow transplant, which may be a stronger effect. (The goal of the auto-FMT was to mitigate the impact of the marrow transplant.)

We will use a more advanced statistical model to explore this question.

In [None]:
!qiime diversity alpha-group-significance \
    --i-alpha-diversity obs-autofmt-bracken-rf.qza \
    --m-metadata-file ./new-sample-metadata.tsv \
    --o-visualization obs-table-bracken-rf-group-sig.qzv

### Beta Diversity
Now that we better understand community richness trends, let’s look at differences in microbial composition.

Let’s investigate this using Bray-Curtis dissimilarity:

In [None]:
!qiime diversity beta \
  --i-table autofmt-table-rf.qz.qza \
  --p-metric braycurtis \
  --o-distance-matrix braycurtis-autofmt
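
For two samples, Bray-Curtis dissimilarity is the sum of absolute abundance differences divided by the sum of all abundances: 0 means identical composition and 1 means no shared features. The sketch below shows the formula on plain dictionaries; the function name and the sample values are hypothetical, and this is not how QIIME 2 computes the full distance matrix internally.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two samples given as {feature: abundance}."""
    features = set(u) | set(v)
    # Features missing from a sample count as abundance 0
    numerator = sum(abs(u.get(f, 0) - v.get(f, 0)) for f in features)
    denominator = sum(u.get(f, 0) + v.get(f, 0) for f in features)
    if denominator == 0:
        raise ValueError("both samples are empty")
    return numerator / denominator


# Hypothetical relative-frequency profiles sharing one of their two features
sample_a = {"Firmicutes": 0.5, "Bacteroidota": 0.5}
sample_b = {"Firmicutes": 0.5, "Proteobacteria": 0.5}
distance_ab = bray_curtis(sample_a, sample_b)
```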

### Emperor Plot Creation
Now that we have our Bray-Curtis distance matrix, let’s visualize it using a PCoA plot.

In [None]:
!qiime diversity pcoa \
  --i-distance-matrix braycurtis-autofmt.qza \
  --o-pcoa pcoa-braycurtis-auto-fmt.qza \
  --verbose

!qiime emperor plot \
  --i-pcoa pcoa-braycurtis-auto-fmt.qza \
  --m-metadata-file ./new-sample-metadata.tsv \
  --o-visualization braycurtis-auto-fmt-emperor.qzv

### Taxa-bar Creation
Another way to look at microbial composition is the taxa bar plot. Note that these tend to be even more chaotic than those from amplicon data.

In [None]:
!qiime taxa barplot \
  --i-table ./moshpit_tutorial/kraken-outputs/table-bracken.qza \
  --i-taxonomy ./moshpit_tutorial/kraken-outputs/taxonomy-bracken.qza \
  --m-metadata-file ./new-sample-metadata.tsv \
  --o-visualization taxa-bar-plot.qzv

### Differential Abundance Analysis
ANCOM-BC does not allow for repeated measures, so we need to filter down to a time point that will give us one sample per subject. 

We will attempt to do that by filtering down to the “peri” timepoint. This will allow us to look at the timepoint directly following FMT.
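
Before running a test that assumes independent samples, it can help to confirm that the filtered table really has one sample per subject. The following sketch reports subjects that still appear more than once; the function name and the sample-to-subject mapping are hypothetical and would in practice come from a subject column of the sample metadata.

```python
from collections import Counter


def subjects_with_repeats(sample_to_subject):
    """Return the subject IDs that are represented by more than one sample."""
    counts = Counter(sample_to_subject.values())
    return sorted(subject for subject, n in counts.items() if n > 1)


# Hypothetical mapping of sample IDs to subject IDs after filtering
repeats = subjects_with_repeats({"s1": "p1", "s2": "p1", "s3": "p2"})
```

An empty list means every subject contributes exactly one sample.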

In [None]:
!qiime feature-table filter-samples \
  --i-table autofmt-table.qza \
  --m-metadata-file ./new-sample-metadata.tsv \
  --p-where "[disease]='atopic eczema'" \
  --o-filtered-table peri-fmt-table.qza

!qiime feature-table summarize \
  --i-table peri-fmt-table.qza \
  --m-sample-metadata-file ./new-sample-metadata.tsv \
  --o-visualization peri-fmt-table.qzv

# Contig Analysis
## Assemble Reads into Contigs with MEGAHIT
The first step in recovering metagenome-assembled genomes (MAGs) is genome assembly itself. There are many genome assemblers available, two of which you can use through our QIIME 2 plugin - here, we will use MEGAHIT. MEGAHIT takes short DNA sequencing reads, constructs a simplified De Bruijn graph, and generates longer contiguous sequences called contigs, providing valuable genetic information for the next steps of our analysis.
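
To give an intuition for graph-based assembly, the toy sketch below builds a de Bruijn graph from k-mers and extends a contig while the path is unambiguous. It is a deliberately simplified illustration, not MEGAHIT's algorithm: it ignores reverse complements, sequencing errors, coverage, and in-degree, and all names and reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Walk from start while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:
            break  # stop instead of looping around a cycle
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Two hypothetical overlapping reads
toy_graph = build_de_bruijn(["ATGGCGT", "GGCGTGC"], k=4)
toy_contig = extend_unambiguous(toy_graph, "ATG")
```

Here the two reads overlap by five bases, so the walk spells out a single sequence that is longer than either read.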

The --p-num-partitions parameter specifies the number of partitions to split the dataset into for parallel processing during assembly.

The --p-presets parameter specifies the preset mode for MEGAHIT. In this case, it’s set to “meta-sensitive” for metagenomic data.

The --p-num-cpu-threads parameter specifies the number of CPU threads to use during assembly.

In [None]:
!qiime assembly assemble-megahit \
    --i-seqs "./fondue-output/paired_reads.qza" \
    --p-presets "meta-sensitive" \
    --p-num-cpu-threads 64 \
    --p-num-partitions 4 \
    --o-contigs "./moshpit_tutorial/cache:contigs" \
    --verbose

## EggNOG search using diamond aligner
Searches for homologous sequences in the EggNOG database using the Diamond aligner for faster processing.

The --p-db-in-memory flag loads the database into memory for faster processing.

In [None]:
!qiime moshpit eggnog-diamond-search \
  --i-sequences "./moshpit_tutorial/cache:contigs" \
  --i-diamond-db "./moshpit_tutorial/cache:eggnog_diamond_full" \
  --p-num-cpus 14 \
  --p-db-in-memory \
  --o-eggnog-hits "./moshpit_tutorial/cache:diamond_hits_contigs" \
  --o-table "./moshpit_tutorial/cache:diamond_feature_table_contigs" \
  --verbose