# Parkinson's Mouse Tutorial - Taxonomy Assignment

Run this notebook in `qiime2-2022.11`.

This notebook continues the [pd-mouse tutorial](https://docs.qiime2.org/2022.11/tutorials/pd-mice/), specifically the [Taxonomy](https://docs.qiime2.org/2022.11/tutorials/pd-mice/#taxonomic-classification) and [Phylogeny](https://docs.qiime2.org/2022.11/tutorials/pd-mice/#generating-a-phylogenetic-tree-for-diversity-analysis) steps. Note: we'll use the *de novo* [align-to-tree-mafft-fasttree](https://docs.qiime2.org/2022.11/tutorials/phylogeny/#pipelines) pipeline so that we can run through this tutorial more quickly.

In [1]:
from os import getcwd, listdir, chdir, mkdir
import qiime2 as q2

In [2]:
getcwd()

'/Users/fatimamubeenshaik/IdeaProjects/ParkinsonMouseTrail/src/main'

In [3]:
chdir('./processed')
getcwd()

'/Users/fatimamubeenshaik/IdeaProjects/ParkinsonMouseTrail/src/main/processed'

## Download classifiers if running on your laptop:

We'll assign taxonomy using SILVA. You can obtain classifiers from the [Data Resource Page](https://docs.qiime2.org/2022.11/data-resources/).

In [4]:
import os
os.makedirs('silva-classifiers', exist_ok=True)  # does not fail if the directory already exists

In [5]:
! wget https://data.qiime2.org/2022.11/common/silva-138-99-515-806-nb-classifier.qza \
    -O ./silva-classifiers/silva-138-99-515-806-nb-classifier.qza

--2024-04-07 17:23:12--  https://data.qiime2.org/2022.11/common/silva-138-99-515-806-nb-classifier.qza
Resolving data.qiime2.org (data.qiime2.org)... 54.200.1.12
Connecting to data.qiime2.org (data.qiime2.org)|54.200.1.12|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://s3-us-west-2.amazonaws.com/qiime2-data/2022.11/common/silva-138-99-515-806-nb-classifier.qza [following]
--2024-04-07 17:23:13--  https://s3-us-west-2.amazonaws.com/qiime2-data/2022.11/common/silva-138-99-515-806-nb-classifier.qza
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.148.232, 52.218.236.144, 52.92.241.136, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.148.232|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 148294965 (141M) [binary/octet-stream]
Saving to: ‘./silva-classifiers/silva-138-99-515-806-nb-classifier.qza’


2024-04-07 17:23:18 (30.3 MB/s) - ‘./silva-classifiers/silva-138-99-51

## If you are running on the HPC, the classifiers are located at:
 - `/home/SE/BMIG-6202-MSR/RefDBs/q2-2022.11/silva-138-1-ssu-nr99-515f-806r-classifier.qza`
 - `/home/SE/BMIG-6202-MSR/RefDBs/q2-2022.11/silva-138-1-ssu-nr99-classifier.qza`
 
You can set up a shortcut like this:

V4:
`silva_classifier='/home/SE/BMIG-6202-MSR/RefDBs/q2-2023.9/silva-138-1-ssu-nr99-515f-806r-classifier.qza'`

V3V4:
`silva_classifier='/home/SE/BMIG-6202-MSR/RefDBs/q2-2023.9/silva-138-1-ssu-nr99-357f-785r-classifier.qza'`

In [6]:
silva_classifier='/Users/fatimamubeenshaik/IdeaProjects/ParkinsonMouseTrail/src/main/processed/silva-classifiers/silva-138-99-515-806-nb-classifier.qza'
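
A minimal sketch of one way to keep the shortcut portable between a laptop and the HPC: list the candidate locations and use the first one that exists. The helper name `first_existing` is hypothetical (it is not part of QIIME 2), and which paths you pass in is up to you.

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate path that exists on disk, or None if none do."""
    for candidate in candidates:
        path = Path(candidate).expanduser()
        if path.exists():
            return str(path)
    return None
```

For example, passing the HPC classifier path first and `./silva-classifiers/silva-138-99-515-806-nb-classifier.qza` second would pick the HPC copy when it is available and fall back to the downloaded file otherwise. A `None` result means neither file was found and the download step still needs to be run.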

## Classify sequences / reads

In the command below, I'll be running on the HPC using the shortcut `$silva_classifier`.

In [7]:
! qiime feature-classifier classify-sklearn \
    --i-reads ./dada2_rep_set.qza \
    --i-classifier $silva_classifier \
    --p-n-jobs 2 \
    --o-classification ./taxonomy.qza

Saved FeatureData[Taxonomy] to: ./taxonomy.qza

In [17]:
# View list of classifications
! qiime metadata tabulate \
    --m-input-file ./taxonomy.qza \
    --o-visualization ./taxonomy.qzv

Saved Visualization to: ./taxonomy.qzv

In [18]:
# View a taxonomy barplot
! qiime taxa barplot \
    --i-table ./dada2_table.qza \
    --i-taxonomy ./taxonomy.qza \
    --m-metadata-file ./metadata.tsv \
    --o-visualization ./taxa_barplot.qzv

Saved Visualization to: ./taxa_barplot.qzv
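
The barplot lets you switch between taxonomic levels. Conceptually, each level comes from truncating every feature's taxonomy string to that depth and summing the counts of features that end up with the same label. The sketch below illustrates that idea in plain Python for a single sample. The function names, feature IDs, counts, and taxonomy strings are all made up for illustration, and this is not the q2-taxa implementation, which handles further details such as features with missing ranks.

```python
from collections import defaultdict
from typing import Dict


def truncate_taxonomy(taxon: str, level: int) -> str:
    """Keep the first `level` ranks of a semicolon-delimited taxonomy string."""
    ranks = [rank.strip() for rank in taxon.split(";")]
    return "; ".join(ranks[:level])


def collapse_counts(counts: Dict[str, int], taxonomy: Dict[str, str], level: int) -> Dict[str, int]:
    """Sum per-feature counts (for one sample) by taxonomy truncated to `level`."""
    collapsed: Dict[str, int] = defaultdict(int)
    for feature_id, count in counts.items():
        collapsed[truncate_taxonomy(taxonomy[feature_id], level)] += count
    return dict(collapsed)


# Toy data: three hypothetical features observed in one sample.
taxonomy = {
    "f1": "d__Bacteria; p__Firmicutes; c__Clostridia",
    "f2": "d__Bacteria; p__Firmicutes; c__Bacilli",
    "f3": "d__Bacteria; p__Bacteroidota; c__Bacteroidia",
}
counts = {"f1": 10, "f2": 5, "f3": 7}

phylum_level = collapse_counts(counts, taxonomy, level=2)
print(phylum_level)
```

At level 2 the two Firmicutes features merge into one bar segment with a count of 15, while the Bacteroidota feature keeps its count of 7.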

## Remove poorly classified reads

[Filtering Documentation](https://docs.qiime2.org/2022.11/tutorials/filtering/)

In [19]:
! qiime taxa filter-table \
    --i-table ./dada2_table.qza \
    --i-taxonomy ./taxonomy.qza \
    --p-mode 'contains'  \
    --p-include 'p__' \
    --p-exclude 'p__;,Eukaryota,Chloroplast,Mitochondria' \
    --o-filtered-table ./table-no-ecmu.qza

Saved FeatureTable[Frequency] to: ./table-no-ecmu.qza
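
To make the include/exclude logic of the command above concrete, here is a rough sketch of how a single taxonomy string could be tested. It assumes that `contains` mode is a case-insensitive substring match, that a feature is kept when it matches at least one include term and no exclude term, and that the terms are comma-separated. The function name `keep_taxon` and the example taxonomy strings are hypothetical; this is an illustration, not the plugin's code.

```python
from typing import Iterable


def keep_taxon(taxon: str, include: Iterable[str], exclude: Iterable[str]) -> bool:
    """Return True if `taxon` matches any include term and no exclude term.

    Assumption: matching is a case-insensitive substring test.
    """
    lowered = taxon.lower()
    included = any(term.lower() in lowered for term in include)
    excluded = any(term.lower() in lowered for term in exclude)
    return included and not excluded


# Same terms as in the command above, split on commas.
include_terms = "p__".split(",")
exclude_terms = "p__;,Eukaryota,Chloroplast,Mitochondria".split(",")

# Hypothetical annotations, chosen to exercise each rule.
examples = [
    "d__Bacteria; p__Firmicutes; c__Clostridia",  # has a phylum: kept
    "d__Bacteria",                                # no phylum label: dropped
    "k__Bacteria; p__; c__",                      # empty phylum ('p__;'): dropped
    "d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast",  # dropped
    "Unassigned",                                 # no phylum label: dropped
]

for taxon in examples:
    print(keep_taxon(taxon, include_terms, exclude_terms), taxon)
```

Only the first example passes: the include term `p__` requires at least a phylum-level label, and the exclude terms then remove empty phylum labels, eukaryotes, chloroplasts, and mitochondria.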

In [23]:
# summarize ESV table
! qiime feature-table summarize \
    --i-table ./table-no-ecmu.qza \
    --m-sample-metadata-file ./metadata.tsv \
    --o-visualization ./table-no-ecmu.qzv

Saved Visualization to: ./table-no-ecmu.qzv

In [27]:
# Keep the sequence file in sync with the table!
! qiime feature-table filter-seqs \
    --i-data ./dada2_rep_set.qza \
    --i-table ./table-no-ecmu.qza \
    --o-filtered-data rep_set-no-ecmu.qza

Saved FeatureData[Sequence] to: rep_set-no-ecmu.qza

In [28]:
! qiime tools export \
    --input-path rep_set-no-ecmu.qza \
    --output-path rep_set-no-ecmu-export

Exported rep_set-no-ecmu.qza as DNASequencesDirectoryFormat to directory rep_set-no-ecmu-export

In [29]:
# View a taxonomy barplot
! qiime taxa barplot \
    --i-table ./table-no-ecmu.qza \
    --i-taxonomy ./taxonomy.qza \
    --m-metadata-file ./metadata.tsv \
    --o-visualization ./table-no-ecmu-taxa-barplot.qzv

Saved Visualization to: ./table-no-ecmu-taxa-barplot.qzv

#### krona plot

In [12]:
! conda install -c bioconda krona

Channels:
 - bioconda
 - defaults
 - conda-forge
 - https://packages.qiime2.org/qiime2/2024.5/amplicon/passed
Platform: osx-64
Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.



In [29]:
! qiime info

System versions
Python version: 3.8.18
QIIME 2 release: 2024.5
QIIME 2 version: 2024.5.0.dev0+1.g6306962
q2cli version: 2024.5.0.dev0

Installed plugins
alignment: 2024.5.0.dev0+1.g3a9c58b
composition: 2024.5.0.dev0
cutadapt: 2024.5.0.dev0
dada2: 2024.5.0.dev0
deblur: 2024.5.0.dev0
demux: 2024.5.0.dev0
diversity: 2024.5.0.dev0+1.g99a0cca
diversity-lib: 2024.5.0.dev0
emperor: 2024.5.0.dev0
feature-classifier: 2024.5.0.dev0
feature-table: 2024.5.0.dev0+2.g65222bd
fragment-insertion: 2024.5.0.dev0
longitudinal: 2024.5.0.dev0
metadata: 2024.5.0.dev0
phylogeny: 2024.5.0.dev0
quality-control: 2024.5.0.dev0
quality-filter: 2024.5.0.dev0
rescript: 2024.5.0.dev0+2.ga0df425
sample-classifier: 2024.5.0.dev0
taxa: 2024.5.0.dev0
types: 2024.5.0.dev0+4.g823b5a4
vsearch: 2024.5.0.dev0

Application config directory
/Users/fatimamubeenshaik/miniconda3/envs/qiime2-dev/var/q2cli

Getting help
To get help with QIIME 2, visit https://qiime2.org


In [14]:
! qiime krona collapse-and-plot \
    --i-table ./table-no-ecmu.qza \
    --i-taxonomy ./taxonomy.qza \
    --o-krona-plot ./table-no-ecmu-taxa-krona.qzv

Error: QIIME 2 has no plugin/command named 'krona'.


##  Other QA / QC Operations

See [q2-quality-control tutorial](https://docs.qiime2.org/2022.11/tutorials/quality-control/).

In [23]:
mkdir('references')

In [24]:
# Download the pre-made SILVA reference sequences
! wget https://data.qiime2.org/2022.11/common/silva-138-99-seqs-515-806.qza \
    -O ./references/silva-138-99-seqs-515-806.qza

--2024-04-07 17:40:24--  https://data.qiime2.org/2022.11/common/silva-138-99-seqs-515-806.qza
Resolving data.qiime2.org (data.qiime2.org)... 54.200.1.12
Connecting to data.qiime2.org (data.qiime2.org)|54.200.1.12|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://qiime2-data.s3-us-west-2.amazonaws.com/2022.11/common/silva-138-99-seqs-515-806.qza [following]
--2024-04-07 17:40:24--  https://qiime2-data.s3-us-west-2.amazonaws.com/2022.11/common/silva-138-99-seqs-515-806.qza
Resolving qiime2-data.s3-us-west-2.amazonaws.com (qiime2-data.s3-us-west-2.amazonaws.com)... 52.92.187.194, 52.92.164.90, 3.5.76.198, ...
Connecting to qiime2-data.s3-us-west-2.amazonaws.com (qiime2-data.s3-us-west-2.amazonaws.com)|52.92.187.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14620394 (14M) [binary/octet-stream]
Saving to: ‘./references/silva-138-99-seqs-515-806.qza’


2024-04-07 17:40:25 (24.7 MB/s) - ‘./references/silva-138-99-seqs-515-806.qza’ saved [14620394/14620394]



In [25]:
silva_ref_seq='/Users/fatimamubeenshaik/IdeaProjects/ParkinsonMouseTrail/src/main/processed/references/silva-138-99-seqs-515-806.qza'

In [26]:
# Remove poor-quality sequences that do not have a decent match to our curated reference database.
! qiime quality-control exclude-seqs \
    --i-query-sequences ./rep_set-no-ecmu.qza \
    --i-reference-sequences $silva_ref_seq \
    --p-method vsearch \
    --p-perc-identity 0.90 \
    --p-perc-query-aligned 0.90 \
    --p-threads 8 \
    --o-sequence-hits ./hits.qza \
    --o-sequence-misses ./misses.qza \
    --verbose

Running external command line application. This may print messages to stdout and/or stderr.
The commands to be run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /var/folders/2y/702nfmtx76sd29583bwt4gp00000gn/T/qiime2/fatimamubeenshaik/data/a53a5454-aba0-41ba-9b71-316920390fb3/data/dna-sequences.fasta --id 0.9 --strand both --maxaccepts 1 --maxrejects 0 --db /var/folders/2y/702nfmtx76sd29583bwt4gp00000gn/T/qiime2/fatimamubeenshaik/data/b41681fb-a4e7-4ef8-a23a-a26f1bcfd272/data/dna-sequences.fasta --threads 8 --userfields query+target+ql+qlo+qhi --userout /var/folders/2y/702nfmtx76sd29583bwt4gp00000gn/T/tmpz6z4mjd5

vsearch v2.22.1_macos_x86_64, 16.0GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file /var/folders/2y/702nfmtx76sd29583bwt4gp00000gn/T/qiime2/fatimamubeenshaik/data/b41681fb-a4e7-4ef8-a23a-a26f1bcfd272/data/dna-sequences.fasta 100%           
86453445 nt in 313
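
The two thresholds in the command above (`--p-perc-identity 0.90` and `--p-perc-query-aligned 0.90`) can be read as a two-part test on each query's best alignment. The sketch below shows that reading. It assumes the aligned fraction is computed from the query length and the alignment start and end on the query (the `ql`, `qlo`, and `qhi` fields requested in the logged vsearch command); that mapping is my assumption, and the function name `is_hit` is hypothetical.

```python
def is_hit(
    identity: float,
    query_len: int,
    query_start: int,
    query_end: int,
    perc_identity: float = 0.90,
    perc_query_aligned: float = 0.90,
) -> bool:
    """Decide whether one alignment would count as a hit under both thresholds.

    identity: fraction of matching positions in the alignment (0 to 1).
    query_start, query_end: 1-based, inclusive alignment bounds on the query.
    """
    aligned_fraction = (query_end - query_start + 1) / query_len
    return identity >= perc_identity and aligned_fraction >= perc_query_aligned


# Hypothetical 150 nt queries.
print(is_hit(0.97, 150, 1, 150))  # full-length, high identity
print(is_hit(0.97, 150, 1, 100))  # only two thirds of the query aligned
print(is_hit(0.85, 150, 1, 150))  # full-length, but identity too low
```

Under this reading, only the first query lands in `hits.qza`; the other two would be written to `misses.qza`.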

In [30]:
# filter table to match filtered sequence file
! qiime feature-table filter-features \
    --i-table ./table-no-ecmu.qza \
    --m-metadata-file ./hits.qza \
    --o-filtered-table ./table-no-ecmu-hits.qza

Saved FeatureTable[Frequency] to: ./table-no-ecmu-hits.qza

#### Given that we filtered our data again, you may want to regenerate the taxonomy plots. Use the earlier taxonomy visualization commands as a guide and run them below with the new table:

In [31]:
# updated taxonomy barplot
! qiime taxa barplot \
    --i-table ./table-no-ecmu-hits.qza \
    --i-taxonomy ./taxonomy.qza \
    --m-metadata-file ./metadata.tsv \
    --o-visualization ./table-no-ecmu-hits-taxa-barplot.qzv

Saved Visualization to: ./table-no-ecmu-hits-taxa-barplot.qzv

In [32]:
# updated krona plot
! qiime krona collapse-and-plot \
    --i-table ./table-no-ecmu-hits.qza \
    --i-taxonomy ./taxonomy.qza \
    --o-krona-plot ./table-no-ecmu-hits-taxa-krona.qzv

Error: QIIME 2 has no plugin/command named 'krona'.


In [33]:
! qiime feature-table group \
    --i-table ./table-no-ecmu-hits.qza \
    --m-metadata-file ./metadata.tsv \
    --m-metadata-column 'genotype' \
    --p-mode 'mean-ceiling' \
    --p-axis 'sample' \
    --o-grouped-table ./table-no-ecmu-hits-genotype.qza

Saved FeatureTable[Frequency] to: ./table-no-ecmu-hits-genotype.qza
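
The `mean-ceiling` mode used above combines all samples that share a metadata value into one column: for each feature it averages the counts across the group's samples and rounds the mean up, so the grouped table still holds whole-number frequencies. A plain-Python sketch of that calculation follows; the function name, sample IDs, group labels, and counts are invented for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List


def group_mean_ceiling(
    table: Dict[str, Dict[str, int]],
    sample_to_group: Dict[str, str],
) -> Dict[str, Dict[str, int]]:
    """Collapse samples into groups, taking ceil(mean) of each feature's counts.

    table maps sample ID -> {feature ID: count}; missing features count as 0.
    """
    members: Dict[str, List[Dict[str, int]]] = defaultdict(list)
    for sample_id, counts in table.items():
        members[sample_to_group[sample_id]].append(counts)

    feature_ids = sorted({fid for counts in table.values() for fid in counts})
    grouped: Dict[str, Dict[str, int]] = {}
    for group, samples in members.items():
        grouped[group] = {
            fid: math.ceil(sum(s.get(fid, 0) for s in samples) / len(samples))
            for fid in feature_ids
        }
    return grouped


# Toy table: two samples in groupA, one in groupB.
table = {
    "s1": {"f1": 3, "f2": 0},
    "s2": {"f1": 4, "f2": 1},
    "s3": {"f1": 10, "f2": 2},
}
sample_to_group = {"s1": "groupA", "s2": "groupA", "s3": "groupB"}

grouped = group_mean_ceiling(table, sample_to_group)
print(grouped)
```

Here groupA's means are 3.5 and 0.5, which round up to 4 and 1, while groupB has a single sample and keeps its counts unchanged.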

#### krona collapse by group 

In [34]:
! qiime feature-table group \
    --i-table ./table-no-ecmu-hits.qza \
    --p-axis sample \
    --m-metadata-file ./metadata.tsv \
    --m-metadata-column donor \
    --p-mode 'mean-ceiling' \
    --o-grouped-table ./table-no-ecmu-hits-donor.qza

Saved FeatureTable[Frequency] to: ./table-no-ecmu-hits-donor.qza

In [14]:
! qiime krona collapse-and-plot \
    --i-table ./table-no-ecmu-hits-donor.qza \
    --i-taxonomy ./taxonomy.qza \
    --o-krona-plot ./table-no-ecmu-hits-donor-taxa-krona.qzv

Saved Visualization to: ./table-no-ecmu-hits-donor-taxa-krona.qzv

## Construct phylogeny

See the [Inferring Phylogenies tutorial](https://docs.qiime2.org/2022.11/tutorials/phylogeny/) for more information.

We'll run [FastTree](https://docs.qiime2.org/2022.11/tutorials/phylogeny/#fasttree) to be quick, though I'd recommend [iqtree](https://docs.qiime2.org/2022.11/tutorials/phylogeny/#iqtree) or [fragment-insertion](https://library.qiime2.org/plugins/q2-fragment-insertion/16/).

We'll be using the [align-to-tree-mafft-fasttree](https://docs.qiime2.org/2022.11/tutorials/phylogeny/#pipelines) pipeline.

### *de novo phylogeny*

View with [iTOL](https://itol.embl.de/) or [Empress](https://github.com/biocore/empress).

In [35]:
# pipeline: alignment through phylogeny
! qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences ./hits.qza \
    --output-dir ./mafft-fasttree-output \
    --verbose

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: mafft --preservecase --inputorder --thread 1 /var/folders/2y/702nfmtx76sd29583bwt4gp00000gn/T/qiime2/fatimamubeenshaik/data/f1275e36-7957-4ba1-bd88-65fdfe10abd3/data/dna-sequences.fasta

inputfile = orig
263 x 150 - 150 d
nthread = 1
nthreadpair = 1
nthreadtb = 1
ppenalty_ex = 0
stacksize: 8192 kb
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00



Making a distance matrix ..
  201 / 263 (thread    0)
done.

Constructing a UPGMA tree (efffree=0) ... 
  260 / 263
done.

Progressive alignment 1/2... 
STEP   148 / 262 (thread    0)
Reallocating..done. *alloclen = 1301
STEP   262 / 262 (thread    0) h
done.

Making a distance matrix from msa.. 
  200 / 263 (thread    0)
done.

Constructing a UPGMA tree (efffree=1) 

### Another phylogenetic approach: Fragment Insertion

In [27]:
! wget https://data.qiime2.org/2022.11/common/sepp-refs-silva-128.qza -O ./references/sepp-refs-silva-128.qza 

--2024-04-07 17:44:04--  https://data.qiime2.org/2022.11/common/sepp-refs-silva-128.qza
Resolving data.qiime2.org (data.qiime2.org)... 54.200.1.12
Connecting to data.qiime2.org (data.qiime2.org)|54.200.1.12|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://s3-us-west-2.amazonaws.com/qiime2-data/2022.11/common/sepp-refs-silva-128.qza [following]
--2024-04-07 17:44:04--  https://s3-us-west-2.amazonaws.com/qiime2-data/2022.11/common/sepp-refs-silva-128.qza
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.250.248, 52.92.242.40, 52.92.136.104, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.250.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 181253322 (173M) [binary/octet-stream]
Saving to: ‘./references/sepp-refs-silva-128.qza’


2024-04-07 17:44:10 (33.0 MB/s) - ‘./references/sepp-refs-silva-128.qza’ saved [181253322/181253322]



In [36]:
sepp_ref='/Users/fatimamubeenshaik/IdeaProjects/ParkinsonMouseTrail/src/main/processed/references/sepp-refs-silva-128.qza'

In [37]:
! qiime fragment-insertion sepp \
    --i-representative-sequences ./hits.qza \
    --i-reference-database $sepp_ref \
    --o-tree ./tree.qza \
    --o-placements ./tree_placements.qza \
    --p-threads 8

Saved Phylogeny[Rooted] to: ./tree.qza
Saved Placements to: ./tree_placements.qza

In [38]:
! qiime fragment-insertion filter-features \
    --i-table ./table-no-ecmu-hits.qza \
    --i-tree ./tree.qza \
    --o-filtered-table ./table-no-ecmu-fi.qza \
    --o-removed-table ./table-no-ecmu-nofi.qza

Saved FeatureTable[Frequency] to: ./table-no-ecmu-fi.qza
Saved FeatureTable[Frequency] to: ./table-no-ecmu-nofi.qza
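
The two outputs of the command above split the table by whether each feature was placed in the insertion tree: features present in the tree go to the filtered table, and the rest go to the removed table. A minimal sketch of that split, treating the table as a dictionary keyed by feature ID, is shown below; the function name `split_by_tree` and the toy IDs are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Table = Dict[str, Dict[str, int]]  # feature ID -> {sample ID: count}


def split_by_tree(table: Table, tree_tip_ids: Iterable[str]) -> Tuple[Table, Table]:
    """Split a table into (features found in the tree, features not found)."""
    tips = set(tree_tip_ids)
    kept = {fid: counts for fid, counts in table.items() if fid in tips}
    removed = {fid: counts for fid, counts in table.items() if fid not in tips}
    return kept, removed


# Toy table with three features; only two of them appear among the tree tips.
toy_table = {
    "f1": {"s1": 5, "s2": 0},
    "f2": {"s1": 2, "s2": 9},
    "f3": {"s1": 1, "s2": 1},
}
kept, removed = split_by_tree(toy_table, tree_tip_ids=["f1", "f3", "reference_tip"])
print(sorted(kept), sorted(removed))
```

Checking the size of the removed table is a quick way to see how many features failed to insert; an empty removed table means every feature in the input is represented in the tree.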

In [39]:
! qiime feature-table filter-seqs \
    --i-data ./hits.qza \
    --i-table ./table-no-ecmu-fi.qza \
    --o-filtered-data repset-no-ecmu-fi.qza

Saved FeatureData[Sequence] to: repset-no-ecmu-fi.qza

## [Empress](https://github.com/biocore/empress)

In [40]:
!qiime empress tree-plot \
    --i-tree ./mafft-fasttree-output/rooted_tree.qza \
    --m-feature-metadata-file ./taxonomy.qza \
    --o-visualization ./tree-viz.qzv

Error: QIIME 2 has no plugin/command named 'empress'.


In [41]:
q2.Visualization.load('./tree-viz.qzv')

ValueError: tree-viz.qzv does not exist.

In [None]:
! qiime empress community-plot \
    --i-tree ./mafft-fasttree-output/rooted_tree.qza \
    --i-feature-table ./table-no-ecmu-hits.qza \
    --m-sample-metadata-file ./metadata.tsv \
    --m-feature-metadata-file ./taxonomy.qza \
    --o-visualization tree-tax-table.qzv

In [None]:
q2.Visualization.load('./tree-tax-table.qzv')