### Assigning taxonomy, based on QIIME2's "Atacama soil microbiome", "Moving pictures", and "Training feature classifiers" tutorials

https://docs.qiime2.org/2017.12/tutorials/atacama-soils/
https://docs.qiime2.org/2017.12/tutorials/moving-pictures/
https://docs.qiime2.org/2017.12/tutorials/feature-classifier/

Today we will take the representative sequences for the OTUs in our OTU table and assign taxonomy to them. That is, we will compare our own sequences against a database of known organisms and their sequences.

To do this effectively, we will need to trim / "train" the database, based on the specific portion of the 16S gene that we sequenced, using our specific primers. We will then use this "trained" classifier to classify our own sequences. 

The database we will be using is the "Greengenes" database (http://greengenes.secondgenome.com/downloads), but many others exist out there - for example, the Silva database is another commonly used reference (https://www.arb-silva.de/). For fungi, the UNITE database is popular (https://unite.ut.ee/).

The next step is to train our classifier to be relevant for the reads that we actually sequenced. The Greengenes database contains full-length 16S sequences, but we only sequenced a short portion of this gene (150 bp). We will want to pull out the parts of the database that are relevant for our own amplicons.

To do this, we need to provide the reference sequences as input, the primers we used, the expected length of our amplicon, and a name for the output.
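To make the idea of "pulling out the relevant part" of a full-length reference concrete, here is a minimal Python sketch of an in-silico amplicon extraction. It is only an illustration under simplifying assumptions (exact primer matches, a single forward-strand search) and is not how QIIME implements this step. The function names and the toy sequences are hypothetical; in practice you would pass in the primers actually used for sequencing.

```python
import re

# IUPAC nucleotide codes mapped to the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement table that also handles degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_to_regex(primer):
    """Turn a (possibly degenerate) primer into a regular expression."""
    return "".join(
        IUPAC[base] if len(IUPAC[base]) == 1 else "[" + IUPAC[base] + "]"
        for base in primer.upper()
    )


def reverse_complement(seq):
    """Reverse-complement a sequence, keeping IUPAC codes valid."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def extract_amplicon(reference, f_primer, r_primer, trunc_len=None):
    """Return the region between the primers (primers removed), or None.

    The reverse primer is reverse-complemented so that both primers are
    searched for on the same strand of the reference.
    """
    pattern = (
        primer_to_regex(f_primer)
        + "(.*?)"
        + primer_to_regex(reverse_complement(r_primer))
    )
    match = re.search(pattern, reference.upper())
    if match is None:
        return None
    amplicon = match.group(1)
    # Optionally truncate to the read length that was actually sequenced.
    return amplicon[:trunc_len] if trunc_len else amplicon


# Toy example with made-up primers and a made-up reference sequence.
print(extract_amplicon("TTACATTGGGATCCAA", "ACM", "GGW"))
print(extract_amplicon("TTACATTGGGATCCAA", "ACM", "GGW", trunc_len=4))
```

Reference sequences in which the primers are not found return None, which mirrors the idea that only the references matching our primers are useful for training.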

However, training the classifier takes more memory than the virtual machine has, and it takes a long time. For this tutorial, we will instead download a pre-trained classifier from QIIME:

In [None]:
!wget -O "classifier.qza" "https://data.qiime2.org/2017.12/common/gg-13-8-99-515-806-nb-classifier.qza"

What does the classify command need in order to run?

In [None]:
!qiime feature-classifier classify-sklearn --help

We can use the classifier to assign taxonomy to our own sequences. Use the help file above, together with the names of your input sequences and classifier, to fill in the command below. (Hint: look for which Options are [required], and remember you want .qza objects, not .qzv objects.)

In [None]:
!qiime feature-classifier classify-sklearn --XXX XXX --XXX XXX --o-classification taxonomy.qza

So, we've got our sample metadata from the first week, our OTU table from last week, and taxonomy data from this week. Let's put it all together and take a look at these samples.

Explore the command qiime taxa barplot --help to figure out how to create a .qzv object summarizing these data.

Do you remember how we view .qzv objects? What did you call your output object from the barplot command?

In [None]:
!qiime tools view XXX.qzv

Next week we will look at how many of each OTU are in each sample, combined with the taxonomy, and begin considering distance matrices. We'll be doing those analyses in R, but using the R notebooks in Jupyter, so it will look similar. To prepare for that, we will save our data in formats that work outside of QIIME.

From the tutorial so far, we have the following files:

- An OTU table with our samples and the abundance of each OTU in each sample (table.qza)
- A file with the DNA sequence associated with each OTU (rep-seqs.qza)
- A file with the predicted taxonomic identity of each OTU (taxonomy.qza)

We will save them in a new directory called OTU_table, and copy ("cp") our sample metadata there too.

In [None]:
!mkdir OTU_table

In [None]:
!qiime tools export table.qza --output-dir OTU_table
!qiime tools export rep-seqs.qza --output-dir OTU_table
!qiime tools export taxonomy.qza --output-dir OTU_table

In [None]:
!cp sample_metadata.tsv OTU_table/

In [None]:
!ls OTU_table/

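If the above command shows two .tsv files, one .fasta file, and one .biom file, we should be ready to go for next week! The remaining cells in this notebook use the biom tool to attach the sample metadata and the taxonomy to the exported .biom table, and then print a summary of the result.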
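If you would rather check the directory listing programmatically than by eye, a small Python sketch like the following could do it. The helper names are hypothetical, and the file name dna-sequences.fasta in the example is an assumption about what the export step produces; the check itself only looks at file extensions.

```python
from collections import Counter
from pathlib import PurePath

# Expected number of files per extension after the export and copy steps.
EXPECTED = {".tsv": 2, ".fasta": 1, ".biom": 1}


def count_extensions(filenames):
    """Count file names by extension, e.g. {'.tsv': 2, '.biom': 1}."""
    return Counter(PurePath(name).suffix for name in filenames)


def export_looks_complete(filenames, expected=EXPECTED):
    """Return True if every expected extension appears the expected number of times."""
    counts = count_extensions(filenames)
    return all(counts[ext] == n for ext, n in expected.items())


# Example with the file names we would hope to see (names are assumptions).
example = ["feature-table.biom", "dna-sequences.fasta",
           "taxonomy.tsv", "sample_metadata.tsv"]
print(export_looks_complete(example))
```

In the notebook, the same check could be run on the real directory with export_looks_complete(os.listdir("OTU_table")) after importing os.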

In [None]:
!biom add-metadata -i OTU_table/feature-table.biom -o OTU_table/feature-table-metaD.biom --sample-metadata-fp OTU_table/sample_metadata.tsv

In [None]:
!biom add-metadata -i OTU_table/feature-table-metaD.biom -o OTU_table/feature-table-metaD-tax.biom --observation-metadata-fp OTU_table/taxonomy.tsv --sc-separated taxonomy --observation-header OTUID,taxonomy
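The --sc-separated taxonomy option tells biom to treat the taxonomy field as a semicolon-separated list of ranks. As a rough illustration of what that structure looks like, here is a minimal Python sketch that splits a Greengenes-style taxonomy string into named ranks. The function name and the example string are made up for illustration; the k__/p__/c__ prefixes follow the Greengenes naming convention.

```python
# Greengenes rank prefixes and the rank each one stands for.
RANKS = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_taxonomy(taxon):
    """Split a Greengenes-style string into {rank: name}, skipping empty ranks."""
    parsed = {}
    for field in taxon.split(";"):
        prefix, sep, name = field.strip().partition("__")
        # Keep only well-formed fields that actually carry a name.
        if sep and prefix in RANKS and name:
            parsed[RANKS[prefix]] = name
    return parsed


# Hypothetical example: classified to class level, lower ranks left empty.
example = "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__; f__; g__; s__"
print(parse_taxonomy(example))
```

Ranks with an empty name (such as o__ above) are dropped, so the length of the returned dictionary shows how deep the classification goes for a given OTU.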

In [None]:
!biom summarize-table -i OTU_table/feature-table-metaD-tax.biom -o OTU_table/feature-table-metaD-tax-summary.txt

In [None]:
!head -20 OTU_table/feature-table-metaD-tax-summary.txt