# 1. Import packages

In [13]:
# Importing all required packages at the start of the notebook
import IPython

from qiime2 import Visualization

import qiime2 as q2
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [14]:
# Directory that holds the taxonomy data
data_dir = "Project_data/Taxonomy"

# 2. Euler

The following steps were run on the Euler cluster, so the results of these steps are already in the taxonomy folder. A Miniconda environment had to be created in order to run QIIME 2 on Euler; the tutorial provided on Moodle was used for this purpose. Separate scripts were written to run the steps on Euler, and these are listed in the following cells.

## 2.1 Import files
The files required to run the scripts were created in the previous notebooks and uploaded to Polybox. The scripts on Euler then download these files.

## 2.2 Train a classifier on data from the UNITE database
A script was written to train a classifier on Euler. This script is based on a [tutorial](https://forum.qiime2.org/t/how-to-train-a-unite-classifier-using-rescript/28285?u=nicholas_bokulich) from the QIIME 2 forum.

```bash
#!/bin/bash
#SBATCH --job-name=train_evaluate_classifier
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=32G
#SBATCH --output=train_evaluate_classifier_%j.out
#SBATCH --error=train_evaluate_classifier_%j.err
#SBATCH --mail-type=END,FAIL

# Activate conda
source ~/miniconda3/etc/profile.d/conda.sh
conda activate qiime2-moshpit-2025.7

# Make data folder
mkdir -p "ProjectData"
data_dir="ProjectData/uniteDB"

# Download unite DB
module load eth_proxy

qiime rescript get-unite-data \
  --p-version '2025-02-19' \
  --p-taxon-group eukaryotes \
  --p-cluster-id 99 \
  --p-no-singletons \
  --verbose \
  --output-dir "$data_dir"

echo "Downloading UniteDB done!"


# Remove the species hypothesis (SH) suffix from the taxonomy strings
qiime rescript edit-taxonomy \
    --i-taxonomy "$data_dir/taxonomy.qza" \
    --o-edited-taxonomy "$data_dir/taxonomy-no-SH.qza" \
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex

echo "Editing done!"


# Train the classifier
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads "$data_dir/sequences.qza" \
    --i-reference-taxonomy "$data_dir/taxonomy-no-SH.qza" \
    --o-classifier "$data_dir/selftrained_classifier_eukaryote.qza"

echo "Training done!"


# Evaluate the classifier
qiime rescript evaluate-fit-classifier \
    --i-sequences "$data_dir/sequences.qza"  \
    --i-taxonomy "$data_dir/taxonomy-no-SH.qza" \
    --p-n-jobs ${SLURM_CPUS_PER_TASK} \
    --o-classifier "$data_dir/classifier_evaluation_classifier.qza" \
    --o-evaluation "$data_dir/classifier_evaluation.qzv" \
    --o-observed-taxonomy "$data_dir/classifier_evaluation_predicted_taxonomy.qza"

qiime rescript evaluate-taxonomy \
  --i-taxonomies "$data_dir/taxonomy-no-SH.qza" "$data_dir/classifier_evaluation_predicted_taxonomy.qza" \
  --p-labels ref-taxonomy predicted-taxonomy \
  --o-taxonomy-stats "$data_dir/classifier_taxonomy_evaluation.qzv"

echo "Evaluation done!"

conda deactivate
```
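
The `edit-taxonomy` step in the script above deletes everything from `;sh__` onwards in each taxonomy string. The following is a minimal Python sketch of that substitution, intended only to show what the pattern matches. The lineage and the SH accession are invented for illustration, and `strip_sh` is a hypothetical helper, not a RESCRIPt function.

```python
import re

# Same pattern as passed to --p-search-strings in the script above.
SH_PATTERN = re.compile(r";sh__.*")


def strip_sh(taxon: str) -> str:
    """Remove the ';sh__...' suffix from a UNITE-style taxonomy string."""
    return SH_PATTERN.sub("", taxon)


# Hypothetical lineage; the SH accession is made up for this example.
example = (
    "k__Fungi;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;"
    "f__Saccharomycetaceae;g__Saccharomyces;s__Saccharomyces_cerevisiae;"
    "sh__SH0000000.10FU"
)
print(strip_sh(example))
```

Strings without an SH field are returned unchanged, so the substitution is safe to apply to every row of the taxonomy.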

## 2.3 Taxonomy assignment with the self-trained classifier
The classifier created in the above script was then used in another script to assign the taxonomy to the project data.

```bash
#!/bin/bash
#SBATCH --job-name=apply_classifier_selftrained
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=32G
#SBATCH --output=apply_classifier_selftrained_%j.out
#SBATCH --error=apply_classifier_selftrained_%j.err
#SBATCH --mail-type=END,FAIL

# Activate conda
source ~/miniconda3/etc/profile.d/conda.sh
conda activate qiime2-moshpit-2025.7

# Data folder
data_dir="ProjectData"

# Download files (as in the other scripts, the ETH proxy is loaded before downloading)
module load eth_proxy

wget --content-disposition -nc --progress=dot:giga -P "$data_dir" "https://polybox.ethz.ch/index.php/s/3T2H9pFGBskcJ7e/download"

# Apply the classifier
qiime feature-classifier classify-sklearn \
        --i-classifier "$data_dir/uniteDB/selftrained_classifier_eukaryote.qza" \
        --i-reads "$data_dir/dada2_rep_set.qza" \
        --p-n-jobs "${SLURM_CPUS_PER_TASK}" \
        --o-classification "$data_dir/taxonomy_selftrained.qza"

echo "Classification done!"

conda deactivate
```


## 2.4 Taxonomy assignment with a pre-trained classifier
As a point of comparison for the self-trained classifier, a pre-trained classifier was also applied to the project data. The pre-trained [classifier](https://github.com/colinbrislawn/unite-train/releases), which is mentioned in the tutorial on training a classifier, was selected. Like the self-trained classifier, it was trained on the latest UNITE release at the time (2025-02-19), containing all eukaryotes with 99% identity clustering and no singletons.

```bash
#!/bin/bash
#SBATCH --job-name=apply_classifier_pretrained
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=32G
#SBATCH --output=apply_classifier_pretrained_%j.out
#SBATCH --error=apply_classifier_pretrained_%j.err
#SBATCH --mail-type=END,FAIL

# Activate conda
source ~/miniconda3/etc/profile.d/conda.sh
conda activate qiime2-moshpit-2025.7

# Data folder
data_dir="ProjectData"


# Download the pretrained classifier and reads
module load eth_proxy

wget -nc --progress=dot:giga -P $data_dir https://github.com/colinbrislawn/unite-train/releases/download/v10.0-2025-02-19-qiime2-2024.10/unite_ver10_99_all_19.02.2025-Q2-202
wget --content-disposition -nc --progress=dot:giga -P "$data_dir" "https://polybox.ethz.ch/index.php/s/3T2H9pFGBskcJ7e/download"

echo "Download done!"


# Apply the classifier
qiime feature-classifier classify-sklearn \
        --i-classifier "$data_dir/unite_ver10_99_all_19.02.2025-Q2-2024.10.qza" \
        --i-reads "$data_dir/dada2_rep_set.qza" \
        --p-n-jobs "${SLURM_CPUS_PER_TASK}" \
        --o-classification "$data_dir/taxonomy_pretrained.qza"

echo "Classification done!"

conda deactivate
```


# 3. Evaluate the obtained data
The taxonomies obtained by applying the classifiers, as well as the evaluation output for the self-trained classifier, were copied from Euler with the scp command and uploaded to Polybox.

In [15]:
%%bash -s $data_dir
mkdir -p "$1"

# Evaluation of the classifier
wget --content-disposition -nc --progress=dot:giga -P "$1" https://polybox.ethz.ch/index.php/s/te9Ww2cKMketCzZ/download

# Taxonomies from the pre-trained and self-trained classifiers
wget --content-disposition -nc --progress=dot:giga -P "$1" https://polybox.ethz.ch/index.php/s/fgkmx47cHwSMKxP/download
wget --content-disposition -nc --progress=dot:giga -P "$1" https://polybox.ethz.ch/index.php/s/t6nLxPNTBTEdxJk/download

chmod -R +rxw "$1"

--2025-11-12 10:23:31--  https://polybox.ethz.ch/index.php/s/te9Ww2cKMketCzZ/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
--2025-11-12 10:23:31--  https://polybox.ethz.ch/index.php/s/fgkmx47cHwSMKxP/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
--2025-11-12 10:23:31--  https://polybox.ethz.ch/index.php/s/t6nLxPNTBTEdxJk/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
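
Because `wget -nc` silently skips files that already exist and the log above does not show the saved file names, it can help to confirm that the expected files are present before loading them. The sketch below assumes the downloads are stored under the names used by the later cells of this notebook; `missing_files` is a hypothetical helper introduced for illustration.

```python
from pathlib import Path

# File names are taken from the cells that follow in this notebook.
EXPECTED_FILES = (
    "classifier_taxonomy_evaluation.qzv",
    "taxonomy_selftrained.qza",
    "taxonomy_pretrained.qza",
)


def missing_files(directory: str, names=EXPECTED_FILES) -> list[str]:
    """Return the expected file names that are absent from `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


# "Project_data/Taxonomy" mirrors data_dir defined at the top of the notebook.
missing = missing_files("Project_data/Taxonomy")
print("Missing files:", missing or "none")
```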


## 3.1 Self-trained classifier evaluation

In [16]:
Visualization.load(f"{data_dir}/classifier_taxonomy_evaluation.qzv")

## 3.2 Taxonomy from the self-trained classifier


In [17]:
! qiime metadata tabulate \
    --m-input-file $data_dir/taxonomy_selftrained.qza \
    --o-visualization $data_dir/taxonomy_selftrained.qzv

Saved Visualization to: Project_data/Taxonomy/taxonomy_selftrained.qzv

In [18]:
Visualization.load(f"{data_dir}/taxonomy_selftrained.qzv")

In [19]:
! qiime taxa barplot \
    --i-table $data_dir/../Import_and_Denoizing/dada2_table.qza \
    --i-taxonomy $data_dir/taxonomy_selftrained.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --o-visualization $data_dir/taxa_bar_plots_selftrained.qzv

Saved Visualization to: Project_data/Taxonomy/taxa_bar_plots_selftrained.qzv

In [20]:
Visualization.load(f"{data_dir}/taxa_bar_plots_selftrained.qzv")

## 3.3 Taxonomy from the pre-trained classifier


In [21]:
! qiime metadata tabulate \
    --m-input-file $data_dir/taxonomy_pretrained.qza \
    --o-visualization $data_dir/taxonomy_pretrained.qzv

Saved Visualization to: Project_data/Taxonomy/taxonomy_pretrained.qzv

In [22]:
Visualization.load(f"{data_dir}/taxonomy_pretrained.qzv")

In [23]:
! qiime taxa barplot \
    --i-table $data_dir/../Import_and_Denoizing/dada2_table.qza \
    --i-taxonomy $data_dir/taxonomy_pretrained.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --o-visualization $data_dir/taxa_bar_plots_pretrained.qzv

Saved Visualization to: Project_data/Taxonomy/taxa_bar_plots_pretrained.qzv

In [24]:
Visualization.load(f"{data_dir}/taxa_bar_plots_pretrained.qzv")
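
Beyond inspecting the two sets of visualizations side by side, the assignments can be compared numerically. The following is a minimal sketch, not the method used in this project: it computes the fraction of features whose lineages agree down to a chosen rank. The feature IDs and lineages are toy values, and `truncate_to_rank` and `agreement` are hypothetical helpers. In the notebook, the dictionaries could presumably be built from the `Taxon` column of each taxonomy artifact, for example via `q2.Artifact.load(...).view(pd.DataFrame)`.

```python
def truncate_to_rank(taxon: str, depth: int) -> str:
    """Keep the first `depth` semicolon-separated ranks of a taxonomy string."""
    ranks = [part.strip() for part in taxon.split(";")]
    return ";".join(ranks[:depth])


def agreement(tax_a: dict, tax_b: dict, depth: int) -> float:
    """Fraction of shared feature IDs whose lineages match down to `depth` ranks."""
    shared = sorted(set(tax_a) & set(tax_b))
    if not shared:
        raise ValueError("no feature IDs in common")
    matches = sum(
        truncate_to_rank(tax_a[f], depth) == truncate_to_rank(tax_b[f], depth)
        for f in shared
    )
    return matches / len(shared)


# Toy assignments with invented feature IDs, for illustration only.
selftrained = {
    "feat1": "k__Fungi;p__Ascomycota;c__Saccharomycetes",
    "feat2": "k__Fungi;p__Basidiomycota;c__Agaricomycetes",
}
pretrained = {
    "feat1": "k__Fungi;p__Ascomycota;c__Dothideomycetes",
    "feat2": "k__Fungi;p__Basidiomycota;c__Agaricomycetes",
}

for depth, rank in [(2, "phylum"), (3, "class")]:
    print(f"{rank}: {agreement(selftrained, pretrained, depth):.2f}")
```

With the toy data, the two assignments agree fully at phylum level and for half of the features at class level. A lineage that stops above the requested rank counts as a mismatch against a deeper one, which is a deliberate simplification.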