### Assigning taxonomy, based on QIIME2's "Atacama soil microbiome", "Moving pictures", and "Training feature classifiers" tutorials

https://docs.qiime2.org/2017.2/tutorials/atacama-soils/
https://docs.qiime2.org/2017.2/tutorials/moving-pictures/
https://docs.qiime2.org/2017.2/tutorials/feature-classifier/

Today we will take the representative sequences from our OTU table and assign taxonomy to them - i.e., we will compare our own sequences against a database of known organisms and their sequences.

To do this effectively, we first need to trim the database down to the specific portion of the 16S gene that we sequenced with our primers, and then "train" a classifier on those trimmed reference sequences. We will then use this trained classifier to classify our own sequences.

The database we will be using is the "Greengenes" database (http://greengenes.secondgenome.com/downloads), but many others exist out there - for example, the Silva database is another commonly used reference (https://www.arb-silva.de/). For fungi, the UNITE database is popular (https://unite.ut.ee/).

The first step is to download the reference database (the reference sequences and their taxonomy). This might be easiest using your browser - enter the following address: ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz


Once it's downloaded, extract the archive, go to the rep_set sub-folder, and copy or move the 99_otus.fasta file to your qiime2-atacama-tutorial folder. Do the same for the 99_otu_taxonomy.txt file in the taxonomy sub-folder. You should then see both files in your tutorial folder. (What do you need to type, to check, using the cell below?)

In [None]:
!

Next, we import the reference sequences and their taxonomic classifications into QIIME 2 as artifacts (.qza files):

In [None]:
!qiime tools import --type FeatureData[Sequence] --input-path 99_otus.fasta --output-path 99_otus.qza

!qiime tools import --type FeatureData[Taxonomy] --input-path 99_otu_taxonomy.txt --output-path ref-taxonomy.qza
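Before (or after) importing, it can be worth checking that every reference sequence actually has a taxonomy entry. The following is a minimal sketch in Python, not part of QIIME: it assumes the FASTA headers start with the sequence ID and that the taxonomy file is tab-separated with the ID in the first column and no header line. The function names are hypothetical.

```python
def fasta_ids(path):
    """Collect the sequence IDs (first word after '>') from a FASTA file."""
    with open(path) as handle:
        return {
            line[1:].split()[0]
            for line in handle
            if line.startswith(">") and line[1:].strip()
        }


def taxonomy_ids(path):
    """Collect the IDs from the first column of a tab-separated taxonomy file."""
    with open(path) as handle:
        return {line.rstrip("\n").split("\t", 1)[0] for line in handle if line.strip()}


def missing_taxonomy(fasta_path, taxonomy_path):
    """Return the sorted sequence IDs that have no taxonomy entry."""
    return sorted(fasta_ids(fasta_path) - taxonomy_ids(taxonomy_path))


# Example call with the files from this tutorial (an empty list means all IDs match):
# missing_taxonomy("99_otus.fasta", "99_otu_taxonomy.txt")
```

If the returned list is not empty, the two files probably come from different releases or clustering levels of the database.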

The next step is to train our classifier to be relevant for the reads that we actually sequenced. The database contains full-length 16S sequences, but we only sequenced a short portion. We will want to pull out the parts of the database that are relevant for our own amplicons.

To do this, we need to provide the reference sequences as input, the primers we used, the expected length of our reads, and a name for the output. Can you tell which parts of the command below represent each of these? (You will need to replace XXX with the input reference sequences that we just prepared.)

(This command might take a while.)

In [None]:
!qiime feature-classifier extract-reads --i-sequences XXX --p-f-primer GTGCCAGCMGCCGCGGTAA --p-r-primer GGACTACHVGGGTWTCTAAT --p-length 150 --o-reads ref-seqs.qza
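To make the idea behind this step more concrete, here is a minimal sketch in Python of how degenerate primers can be matched against a reference sequence to cut out the amplified region. It is an illustration only, not QIIME's implementation: the function names are hypothetical, it requires exact matches to the (degenerate) primers, and it returns the region between the primers, which may differ from what the plugin actually does.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement of every IUPAC code (e.g. R = A/G pairs with Y = C/T).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_to_regex(primer):
    """Turn a degenerate primer into a regular expression."""
    return "".join("[" + IUPAC[base] + "]" for base in primer)


def reverse_complement(seq):
    """Reverse-complement a sequence, keeping degenerate codes valid."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_amplicon(reference, f_primer, r_primer, length):
    """Return up to `length` bases found between the two primer sites, or None."""
    fwd = re.search(primer_to_regex(f_primer), reference)
    if fwd is None:
        return None
    downstream = reference[fwd.end():]
    # The reverse primer binds the opposite strand, so search for its reverse complement.
    rev = re.search(primer_to_regex(reverse_complement(r_primer)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()][:length]
```

Calling `extract_amplicon(seq, "GTGCCAGCMGCCGCGGTAA", "GGACTACHVGGGTWTCTAAT", 150)` on a full-length 16S sequence would mimic, in a simplified way, what the command above does for every sequence in the database.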

The next step is to actually train the classifier, combining the reference reads that we just pulled out of the database with their associated taxonomic identifications. On our VM this step will run out of memory, because the VM's memory is limited, so after trying it we will instead download an equivalent pre-trained classifier.

In [None]:
!qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads ref-seqs.qza --i-reference-taxonomy ref-taxonomy.qza --o-classifier classifier.qza --verbose

In [None]:
!wget -O "classifier.qza" "https://data.qiime2.org/2017.2/common/gg-13-8-99-515-806-nb-classifier.qza"
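What does "training" mean here? As a rough illustration, the sketch below implements a tiny naive Bayes classifier in plain Python: it counts short "words" (k-mers) in each reference read, and later assigns a query to the taxon whose word counts make the query most probable. This is a toy under stated assumptions, not the plugin's code: the word size of 4, the function names, and the smoothing choice are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=4):
    """Split a sequence into overlapping words of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def fit_naive_bayes(reads, taxa, k=4):
    """Count k-mers per taxon; this is the 'training' step."""
    counts = defaultdict(Counter)   # taxon -> k-mer counts
    class_sizes = Counter()         # taxon -> number of reference reads
    vocab = set()                   # all k-mers seen in any reference
    for seq, taxon in zip(reads, taxa):
        words = kmers(seq, k)
        counts[taxon].update(words)
        class_sizes[taxon] += 1
        vocab.update(words)
    return {"counts": counts, "class_sizes": class_sizes, "vocab": vocab, "k": k}


def classify(model, seq):
    """Return the taxon with the highest log-probability for this sequence."""
    total_refs = sum(model["class_sizes"].values())
    vocab_size = len(model["vocab"])
    best_taxon, best_score = None, -math.inf
    for taxon, kmer_counts in model["counts"].items():
        # Prior: how common this taxon is among the reference reads.
        score = math.log(model["class_sizes"][taxon] / total_refs)
        # Likelihood with add-one smoothing so unseen k-mers do not give log(0).
        denom = sum(kmer_counts.values()) + vocab_size
        for word in kmers(seq, model["k"]):
            score += math.log((kmer_counts[word] + 1) / denom)
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon
```

With hundreds of thousands of reference reads and longer words, the table of counts becomes very large, which gives some intuition for why training on the full database needs a lot of memory.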

Finally, we can use the classifier to assign taxonomy to our own sequences. Do you remember what our representative sequences file was called? (It is currently XXX in the command below.)

In [None]:
!qiime feature-classifier classify --i-classifier classifier.qza --i-reads XXX --o-classification taxonomy.qza

!qiime taxa tabulate --i-data taxonomy.qza --o-visualization taxonomy.qzv

Do you remember how we view .qzv objects? Run the command below - what did the previous command (qiime taxa tabulate) generate?

In [None]:
!qiime tools view taxonomy.qzv
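The taxonomy assignments from a Greengenes-based classifier are written as a single string per OTU, with rank prefixes such as `k__` (kingdom) and `p__` (phylum) separated by semicolons. When we work with these outside of QIIME, it is often handy to split them into ranks. Here is a minimal sketch in Python, assuming that prefix format; the function name is hypothetical and empty ranks (e.g. `g__`) are simply dropped.

```python
# Greengenes-style rank prefixes and the ranks they stand for.
RANKS = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_taxonomy(taxon):
    """Split a 'k__...; p__...' string into a {rank: name} dictionary."""
    parsed = {}
    for field in taxon.split(";"):
        prefix, _, name = field.strip().partition("__")
        # Skip unknown prefixes and ranks that were left unassigned.
        if prefix in RANKS and name:
            parsed[RANKS[prefix]] = name
    return parsed
```

For example, `parse_taxonomy("k__Bacteria; p__Proteobacteria; c__; o__")` keeps only the kingdom and phylum, because the lower ranks were not assigned.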

Next week we will look at how many of each OTU are in each sample, combined with the taxonomy, and begin considering distance matrices. We'll be doing those analyses in R, but using the R notebooks in Jupyter, so it will look similar. To prepare for that, we will export our data in formats that work outside of QIIME.

From the tutorial so far, we have the following files:

- An OTU table with our samples and the abundance of each OTU in each sample (table.qza)
- A file with the DNA sequence associated with each OTU (rep-seqs.qza)
- A file with the predicted taxonomic identity of each OTU (taxonomy.qza)

We will export them to a new directory called OTU_table, and copy ("cp") our sample metadata there too.

In [None]:
!mkdir OTU_table

In [None]:
!qiime tools export table.qza --output-dir OTU_table
!qiime tools export rep-seqs.qza --output-dir OTU_table
!qiime tools export taxonomy.qza --output-dir OTU_table

In [None]:
!cp sample-metadata.tsv OTU_table/

In [None]:
!ls OTU_table/

If the above command shows two .tsv files, one .fasta file, and one .biom file, we should be ready to go for next week!
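If you prefer to check this programmatically rather than by eye, the following Python sketch counts the files in a directory by extension and compares the counts with what we expect. The expected counts come from the sentence above; the function names are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Expected number of files per extension after the export and copy steps.
EXPECTED = {".tsv": 2, ".fasta": 1, ".biom": 1}


def extension_counts(directory):
    """Count the files in a directory by file extension."""
    return Counter(p.suffix for p in Path(directory).iterdir() if p.is_file())


def export_looks_complete(directory, expected=EXPECTED):
    """Return True if every expected extension appears the expected number of times."""
    counts = extension_counts(directory)
    return all(counts[ext] == n for ext, n in expected.items())


# Example call for this tutorial:
# export_looks_complete("OTU_table")
```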

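As a last step, the cells below use the biom tool to attach the sample metadata and then the taxonomy to the exported table, and to summarize the result.

In [None]: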
!biom add-metadata -i OTU_table/feature-table.biom -o OTU_table/feature-table-metaD.biom --sample-metadata-fp OTU_table/sample-metadata.tsv

In [None]:
!biom add-metadata -i OTU_table/feature-table-metaD.biom -o OTU_table/feature-table-metaD-tax.biom --observation-metadata-fp OTU_table/taxonomy.tsv --sc-separated taxonomy --observation-header OTUID,taxonomy

In [None]:
!biom summarize-table -i OTU_table/feature-table-metaD-tax.biom -o OTU_table/feature-table-metaD-tax-summary.txt

In [None]:
!head -20 OTU_table/feature-table-metaD-tax-summary.txt