<a href="https://colab.research.google.com/github/Gibbons-Lab/isb_course_2020/blob/master/16S.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🦠 Amplicon Sequencing Data Analysis with Qiime 2

This notebook will accompany the session of the ISB Microbiome course 2020. The presentation slides can be [found here](https://gibbons-lab.github.io/isb_course_2020/16S). 

You can save a local copy of this notebook by using `File > Save a copy in Drive`. You may be prompted to certify that the notebook is safe. We promise that it is 🤞

**Disclaimer:**

The Google Colab notebook environment interprets any command as Python code by default. To run bash commands we have to prefix them with `!`. So any command you see with a leading `!` is a bash command, and if you wanted to run it in your terminal you would omit the leading `!`. For example, if the notebook runs `!wget` you would just run `wget` in your terminal.
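For readers who know Python better than notebooks: the `!` prefix behaves roughly like handing the command to the system shell. Below is a minimal sketch of the same idea in plain Python, using the standard `subprocess` module. The command that is run is only a placeholder (the Python interpreter itself), not something from the course materials.

```python
import subprocess
import sys

# Roughly what `!some-command --flag` does in a notebook cell: run an external
# program and show its output. The Python interpreter is used as a stand-in
# command here so the example runs on any machine.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True,  # collect stdout/stderr instead of printing directly
    text=True,            # decode the output bytes to str
    check=True,           # raise an error if the command fails
)
print(result.stdout.strip())
```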

## Setup

Qiime 2 can usually be installed by following the [official installation instructions](https://docs.qiime2.org/2020.6/install/). Since we are using Google Colab and there are some caveats to using conda here, we will have to hack around those a little bit. But no worries, we will use a setup script that does all the work for us 😌 So let's start by getting a local copy of the project repository.

In [None]:
!git clone https://github.com/gibbons-lab/isb_course_2020 materials

Cloning into 'materials'...
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 682 (delta 8), reused 16 (delta 4), pack-reused 661
Receiving objects: 100% (682/682), 68.77 MiB | 33.41 MiB/s, done.
Resolving deltas: 100% (178/178), done.


Now we are ready to set up our environment. This will take about 10-15 minutes. 

**Note**: This setup is only relevant for Google Colaboratory and will not work on your local machine. Please follow the [official installation instructions](https://docs.qiime2.org/2020.6/install/) for that.

In [None]:
%run materials/setup_qiime2.py
!qiime dev refresh-cache

[17:38:33] 🐍 Downloading miniconda...                        setup_qiime2.py:39
[17:38:34] 🐍 Done.                                           setup_qiime2.py:45
           🐍 Installing miniconda...                         setup_qiime2.py:39
[17:39:01] 🐍 Installed miniconda to `/usr/local` 🐍          setup_qiime2.py:45
           🔍 Downloading Qiime 2 package list...             setup_qiime2.py:39
[17:39:02] 🔍 Done.                                           setup_qiime2.py:45
           🔍 Installing Qiime 2. This may take a little bit. setup_qiime2.py:39
            🕐                                                                  
[17:45:50] 🔍 Done.                                           setup_qiime2.py:45
           🔍 Fixed import paths to include Qiime 2.          setup_qiime2.py:93
           📊 Checking that Qiime 2 command line works...     setup_qiime2.py:39
[17:45:54] 📊 Qiime 2 command line looks good 🎉              setup_qiime2.py:45
           📊 Checking if Qiime 2 import wo

We will switch to the `materials` directory for the rest of the notebook.

In [None]:
%cd materials

/content/materials


## Our first Qiime 2 command

Let's remember our workflow for today.

![our workflow](https://github.com/Gibbons-Lab/isb_course_2020/raw/master/docs/16S/assets/steps.png)

The first thing we have to do is to get the data into an artifact.
We can import the data with the `import` action from the tools. For that we have to give
Qiime 2 a *manifest* (list of raw files) and tell it what *type of data* we
are importing and what *type of artifact* we want. 
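As a rough illustration of what such a manifest contains: for the `SingleEndFastqManifestPhred33V2` format it is a tab-separated table with a `sample-id` column and an `absolute-filepath` column. The sketch below writes one with Python's `csv` module. The sample names, the FASTQ paths, the helper `write_manifest`, and the file name `example_manifest.tsv` are all made up for illustration, so the course's own `manifest.tsv` is not touched.

```python
import csv
import os
import tempfile

def write_manifest(samples, path):
    """Write a single-end manifest: one tab-separated row per sample."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample-id", "absolute-filepath"])
        for sample_id, fastq in sorted(samples.items()):
            # The manifest expects absolute paths to the raw read files.
            writer.writerow([sample_id, os.path.abspath(fastq)])

# Hypothetical samples; these files are placeholders and need not exist.
samples = {
    "sample_1": "data/sample_1.fastq.gz",
    "sample_2": "data/sample_2.fastq.gz",
}
manifest_path = os.path.join(tempfile.mkdtemp(), "example_manifest.tsv")
write_manifest(samples, manifest_path)

with open(manifest_path) as handle:
    manifest_text = handle.read()
print(manifest_text)
```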

**QoL Tip:** Qiime 2 commands can get very long. To split them up over several lines we can use `\` which means "continue on the next line".

In [None]:
!qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest.tsv \
  --output-path cdiff.qza \
  --input-format SingleEndFastqManifestPhred33V2

Imported manifest.tsv as SingleEndFastqManifestPhred33V2 to cdiff.qza


Since we have quality information for the sequencing reads, let's also generate
our first visualization by inspecting those. 

---

Qiime 2 commands can become pretty long. Here are some pointers to remember the
structure of a command:

```
qiime plugin action --i-argument1 ... --o-argument2 ...
```

Argument types usually begin with a letter denoting their meaning:

- `--i-...` = input files
- `--o-...` = output files
- `--p-...` = parameters
- `--m-...` = metadata
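
To see how these prefixes combine into a full command, here is a small sketch that assembles a command line from named groups of arguments. The helper `build_command` and its keyword names are hypothetical and only serve as an illustration. They are not part of Qiime 2. The example reproduces the `demux summarize` call used in this notebook.

```python
import shlex

# Prefixes as listed above: inputs, outputs, parameters, metadata.
PREFIXES = {
    "inputs": "--i-",
    "outputs": "--o-",
    "parameters": "--p-",
    "metadata": "--m-",
}

def build_command(plugin, action, **groups):
    """Assemble a Qiime 2 command line as a list of arguments."""
    command = ["qiime", plugin, action]
    for group, prefix in PREFIXES.items():
        for name, value in groups.get(group, {}).items():
            command += [prefix + name, str(value)]
    return command

command = build_command(
    "demux", "summarize",
    inputs={"data": "cdiff.qza"},
    outputs={"visualization": "qualities.qzv"},
)
print(shlex.join(command))
```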

---

In this case we will use the `summarize` action from the `demux` plugin with the previously generated artifact as input and output the resulting visualization to the `qualities.qzv` file.

In [None]:
!qiime demux summarize --i-data cdiff.qza --o-visualization qualities.qzv

Saved Visualization to: qualities.qzv


You can open the visualization by downloading it and opening it at http://view.qiime2.org. To download, click on the folder symbol to the left and choose download from the dot menu next to the file. Alternatively, you can also have a look directly [here](https://gibbons-lab.github.io/isb_course_2020/16S/qualities).
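
Qiime 2 artifacts and visualizations are ordinary zip archives, so their contents can also be listed without the viewer. Below is a minimal sketch using Python's `zipfile` module. The helper name `list_archive` is made up, and a tiny stand-in archive is created first so the example runs without the real `qualities.qzv`. The real file has a similar but larger layout under a unique top-level folder.

```python
import os
import tempfile
import zipfile

def list_archive(path):
    """Return the names of all files stored in a .qza/.qzv (zip) archive."""
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())

# Build a small stand-in archive (the folder and file contents are placeholders).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.qzv")
with zipfile.ZipFile(demo_path, "w") as archive:
    archive.writestr("some-uuid/metadata.yaml", "type: Visualization\n")
    archive.writestr("some-uuid/data/index.html", "<html></html>\n")

for name in list_archive(demo_path):
    print(name)
```

On the real file this would be called as `list_archive("qualities.qzv")`.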

🤔 What do you observe across the read? Where would you truncate the reads?
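
One possible heuristic for the second question, shown only as a sketch and not as the rule used in this course, is to truncate at the first position where the median quality score drops below a chosen threshold. The quality values, the threshold of 25, and the helper `suggest_truncation` are all invented for illustration.

```python
from statistics import median

def suggest_truncation(qualities_by_read, min_median_quality=25):
    """Return how many leading positions have an acceptable median quality."""
    read_length = min(len(read) for read in qualities_by_read)
    for position in range(read_length):
        column = [read[position] for read in qualities_by_read]
        if median(column) < min_median_quality:
            return position  # keep positions 0 .. position - 1
    return read_length

# Toy Phred scores for three reads of length 8 (made up).
reads = [
    [38, 38, 37, 36, 34, 30, 22, 15],
    [37, 38, 36, 35, 33, 28, 20, 12],
    [38, 37, 37, 36, 30, 24, 18, 10],
]
suggested = suggest_truncation(reads)
print(suggested)
```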

## Analyzing sequence variants with DADA2

We will now run the DADA2 plugin which will do 4 things:

1. filter and trim the reads
2. find the most likely original sequences in the sample (ASVs)
3. remove chimeras
4. count the abundances


Since it takes a bit let's start the process and use the time to
understand what is happening:

In [None]:
!qiime dada2 denoise-single \
    --i-demultiplexed-seqs cdiff.qza \
    --p-trunc-len 150 \
    --output-dir dada2 --verbose

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /tmp/qiime2-archive-6rndforz/64af4989-9c6a-4f4b-ac4c-305fbea32574/data /tmp/tmp905lib_0/output.tsv.biom /tmp/tmp905lib_0/track.tsv /tmp/tmp905lib_0 150 0 2.0 2 Inf independent consensus 1.0 1 1000000 NULL 16

R version 3.5.1 (2018-07-02) 
Loading required package: Rcpp
DADA2: 1.10.0 / Rcpp: 1.0.4.6 / RcppParallel: 5.0.0 
1) Filtering ........
2) Learning Error Rates
79415850 total bases in 529439 reads from 8 samples will be used for learning the error rates.
3) Denoise samples ........
4) Remove chimeras (method = consensus)
5) Report read numbers through the pipeline
6) Write output
Saved FeatureTable[Frequency] to: dada2/table.qza
Saved FeatureData[Sequence] to: dada2/representative_sequences.qza
Saved S

This ran, but we should also make sure it actually worked. One good way to tell whether the identified ASVs are representative of the sample is to see how many reads were retained throughout the pipeline. The most common issues and their solutions are:

**Large fraction of reads is lost during merging**<br>
In order to merge ASVs DADA2 requires an overlap of 12 bases between forward and reverse reads by default. Thus, your reads must allow for sufficient overlap *after* trimming. For example, if your amplified region is 450bp long and you truncate your 2x250bp reads to a length of 230, the two reads together reach 2x230 = 460 bases into the amplicon and overlap by only 460 - 450 = 10 bases, which is less than the required 12. Trimming the first 10 bases of each read does not change this, because those bases lie at the outer ends of the amplicon. To solve this issue truncate the reads less or lower the `--p-min-overlap` parameter (but not too low). Note that merging only applies to paired-end data, so it does not affect the single-end run above.
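
The expected overlap can be estimated before running DADA2. Below is a minimal sketch under a simplifying assumption: both reads start at opposite ends of the amplicon and all amplicons have the same length, so only the truncation lengths determine how far the reads reach toward each other. The helper name `merge_overlap` is made up.

```python
def merge_overlap(amplicon_length, trunc_len_forward, trunc_len_reverse):
    """Expected overlap (in bases) between truncated forward and reverse reads.

    Simplified model (assumption): both reads start exactly at opposite ends
    of the amplicon, so a negative value means a gap between the reads.
    """
    return trunc_len_forward + trunc_len_reverse - amplicon_length

MIN_OVERLAP = 12  # DADA2 default mentioned in the text

# The example from the text: 450bp amplicon, reads truncated to 230 bases.
overlap = merge_overlap(450, 230, 230)
print(overlap, overlap >= MIN_OVERLAP)

# Truncating less (here to 240 bases) leaves enough overlap.
wider = merge_overlap(450, 240, 240)
print(wider, wider >= MIN_OVERLAP)
```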

**Most of the reads are lost as chimeric**<br>
This is usually an experimental issue, as chimeras are introduced during amplification of the amplicon. If you can adjust your PCR, try to run fewer cycles. Chimeras can also be introduced by incorrect merging: if your minimum overlap is too small, ASVs may be merged randomly. Possible fixes are to increase the `--p-min-overlap` parameter or to run the analysis on the forward reads only (in our empirical observation chimeras are more likely to be introduced in the joined reads). *However, losing between 5-25% of your reads to chimeras is normal and does not require any adjustments.*
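
The read retention check described above can be sketched in a few lines of Python. The counts below are invented, the column names (`input`, `filtered`, `denoised`, `non-chimeric`) are an assumption about what the denoising stats contain for single-end data, and measuring chimera loss relative to the denoised reads is one possible reading of the 5-25% range.

```python
def retention_report(stats):
    """Fraction of input reads remaining after each step, plus chimera loss."""
    total = stats["input"]
    report = {
        step: stats[step] / total
        for step in ("filtered", "denoised", "non-chimeric")
    }
    # Share of denoised reads that were removed as chimeras (assumed definition).
    report["chimera_loss"] = 1 - stats["non-chimeric"] / stats["denoised"]
    return report

# Hypothetical counts for a single sample.
sample = {"input": 10000, "filtered": 9000, "denoised": 8800, "non-chimeric": 7920}
report = retention_report(sample)
for step, fraction in report.items():
    print(f"{step}: {fraction:.1%}")

# The text treats a chimera loss of 5-25% as normal.
if report["chimera_loss"] > 0.25:
    print("More reads than usual were flagged as chimeric.")
```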

Our denoising stats are an artifact. To convert them to a visualization we can use `qiime metadata tabulate`.

In [None]:
!qiime metadata tabulate \
    --m-input-file dada2/denoising_stats.qza \
    --o-visualization dada2/denoising-stats.qzv

Saved Visualization to: denoising-stats.qzv


## Phylogeny and diversity

We can build a phylogenetic tree for our sequences using the following command:

In [None]:
!qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences dada2/representative_sequences.qza \
    --output-dir tree

Saved FeatureData[AlignedSequence] to: tree/alignment.qza
Saved FeatureData[AlignedSequence] to: tree/masked_alignment.qza
Saved Phylogeny[Unrooted] to: tree/tree.qza
Saved Phylogeny[Rooted] to: tree/rooted_tree.qza


You can visualize your tree using iTOL (https://itol.embl.de/). To do so, open iTOL and upload the artifact from `materials/tree/tree.qza`.

This looks nice but is not very informative on its own. Its real use is in complementing our diversity analyses: it tells us which ASVs are related to each other and which ones are not, which allows us to improve our diversity estimates.
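
To make this concrete, here is a toy sketch of Faith's phylogenetic diversity, one of the metrics that use the tree: the total length of all branches that connect the ASVs observed in a sample to the root. The tree, the branch lengths, and the ASV names are invented, and this simple loop is only meant to show the idea, not how Qiime 2 computes it.

```python
# Toy rooted tree stored as child -> (parent, branch length); "root" has no entry.
TREE = {
    "asv_a": ("node_1", 0.1),
    "asv_b": ("node_1", 0.1),
    "asv_c": ("root", 0.8),
    "node_1": ("root", 0.5),
}

def faith_pd(observed, tree=TREE):
    """Sum the lengths of all branches on the paths from observed tips to the root."""
    branches = set()
    for tip in observed:
        node = tip
        while node in tree:       # walk up until the root is reached
            branches.add(node)    # each branch is identified by its child node
            node = tree[node][0]
    return sum(tree[child][1] for child in branches)

close_relatives = faith_pd({"asv_a", "asv_b"})    # two closely related ASVs
distant_relatives = faith_pd({"asv_a", "asv_c"})  # two distantly related ASVs
print(round(close_relatives, 2), round(distant_relatives, 2))
```

Both toy samples contain two ASVs, but the one with distantly related ASVs covers more of the tree and therefore gets the higher value.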



In [None]:
!qiime diversity core-metrics-phylogenetic \
    --i-table dada2/table.qza \
    --i-phylogeny tree/rooted_tree.qza \
    --p-sampling-depth 8000 \
    --m-metadata-file metadata.tsv \
    --output-dir diversity

Saved FeatureTable[Frequency] to: diversity/rarefied_table.qza
Saved SampleData[AlphaDiversity] to: diversity/faith_pd_vector.qza
Saved SampleData[AlphaDiversity] to: diversity/observed_features_vector.qza
Saved SampleData[AlphaDiversity] to: diversity/shannon_vector.qza
Saved SampleData[AlphaDiversity] to: diversity/evenness_vector.qza
Saved DistanceMatrix to: diversity/unweighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: diversity/weighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: diversity/jaccard_distance_matrix.qza
Saved DistanceMatrix to: diversity/bray_curtis_distance_matrix.qza
Saved PCoAResults to: diversity/unweighted_unifrac_pcoa_results.qza
Saved PCoAResults to: diversity/weighted_unifrac_pcoa_results.qza
Saved PCoAResults to: diversity/jaccard_pcoa_results.qza
Saved PCoAResults to: diversity/bray_curtis_pcoa_results.qza
Saved Visua

We can now visualize the PCoA by downloading `materials/diversity/weighted_unifrac_emperor.qzv`. 

In [None]:
!qiime diversity alpha-group-significance \
    --i-alpha-diversity diversity/shannon_vector.qza \
    --m-metadata-file metadata.tsv \
    --o-visualization diversity/alpha_groups.qzv

Saved Visualization to: diversity/alpha_groups.qzv


## Taxonomy

We will use a classifier trained on the GreenGenes database which can be downloaded from https://docs.qiime2.org/2020.6/data-resources/.

In [None]:
!wget https://data.qiime2.org/2020.6/common/gg-13-8-99-515-806-nb-classifier.qza

--2020-08-11 18:36:57--  https://data.qiime2.org/2020.6/common/gg-13-8-99-515-806-nb-classifier.qza
Resolving data.qiime2.org (data.qiime2.org)... 52.35.38.247
Connecting to data.qiime2.org (data.qiime2.org)|52.35.38.247|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://s3-us-west-2.amazonaws.com/qiime2-data/2020.6/common/gg-13-8-99-515-806-nb-classifier.qza [following]
--2020-08-11 18:36:57--  https://s3-us-west-2.amazonaws.com/qiime2-data/2020.6/common/gg-13-8-99-515-806-nb-classifier.qza
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.245.112
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.245.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28285585 (27M) [application/x-www-form-urlencoded]
Saving to: ‘gg-13-8-99-515-806-nb-classifier.qza.1’


2020-08-11 18:36:59 (19.2 MB/s) - ‘gg-13-8-99-515-806-nb-classifier.qza.1’ saved [28285585/28285585]



In [None]:
!qiime feature-classifier classify-sklearn \
    --i-reads dada2/representative_sequences.qza \
    --i-classifier gg-13-8-99-515-806-nb-classifier.qza \
    --o-classification taxa.qza

Saved FeatureData[Taxonomy] to: taxa.qza


Now let's have a look at which bacteria we have in each sample and how abundant they are.
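
Conceptually, a taxonomic bar plot sums the ASV counts that share a taxonomic label at a chosen rank and converts them to fractions per sample. Below is a minimal sketch for one sample, with invented counts and Greengenes-style taxonomy strings. The helper `collapse` is hypothetical and not a Qiime 2 function.

```python
from collections import Counter

def collapse(counts, taxonomy, level):
    """Relative abundance per taxon, using the first `level` ranks of each label."""
    totals = Counter()
    for asv, count in counts.items():
        ranks = [rank.strip() for rank in taxonomy[asv].split(";")]
        totals["; ".join(ranks[:level])] += count
    grand_total = sum(totals.values())
    return {taxon: count / grand_total for taxon, count in totals.items()}

# Made-up taxonomy assignments and read counts for a single sample.
taxonomy = {
    "asv_1": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "asv_2": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "asv_3": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia",
}
counts = {"asv_1": 50, "asv_2": 30, "asv_3": 20}

phyla = collapse(counts, taxonomy, level=2)  # level 2 = phylum
for taxon, fraction in phyla.items():
    print(f"{taxon}: {fraction:.0%}")
```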

In [None]:
!qiime taxa barplot \
    --i-table dada2/table.qza \
    --i-taxonomy taxa.qza \
    --m-metadata-file metadata.tsv \
    --o-visualization taxa_barplot.qzv

Saved Visualization to: taxa_barplot.qzv


In [None]:
!qiime diversity adonis \
    --i-distance-matrix diversity/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file metadata.tsv \
    --p-formula "disease_stat" \
    --p-n-jobs 2 \
    --o-visualization permanova.qzv

Saved Visualization to: permanova.qzv


In [None]:
#!rm -rf *.qza *.qzv dada2 diversity tree 
#!git pull