# Week 12 Supervised Machine Learning

In this tutorial we will look into the application of supervised machine learning (ML) methods in QIIME 2 (Q2) using data from the Human Microbiome Project.



We will be going through the following steps:         
       
[1. Import packages and download datasets](#sec1)                
[2. Training and evaluating classifiers](#sec2)          
[3. Training and evaluating regressors](#sec3)          
[4. Over- vs. underfitting](#sec4)                
[5. Creating predictions for all available samples](#sec5)          

<a id='sec1'></a> 

## 1. Import packages & download datasets

As always, let's first import all packages and assign the variables we need in this notebook:

In [2]:
# import all required packages
import os
import biom
import qiime2 as q2
import pandas as pd

from qiime2 import Visualization

In [3]:
# assigning variables used throughout the notebook

# location of this week's data
data_dir = 'w8_hmp_data'

In [4]:
%%bash -s $data_dir
# Please do NOT modify this cell - here we copy the required data into
# your personal Jupyter workspace.

mkdir -p "$1"
cp -rn /data/w8_hmp_data/* "$1"
chmod -R +rxw "$1"

<a id='sec2'></a>  

## 2. Training and evaluating classifiers

In supervised ML, the goal is to fit a model such that it correctly predicts a given target. If the target is discrete then we call the modelling task a classification.

### 2.1 Training a classifier to predict `sample_type` from microbial composition

First we will train a classifier to predict each sample's type (metadata column `sample_type`) given its microbial composition. For this we will use the `classify-samples` method from `q2-sample-classifier`. This method provides different modelling setups and types of classifiers. Depending on which type of classifier you use, you can specify or tune defined hyperparameters of the classifier. 


Let's start with a simple modelling setup where we use the default hyperparameters of a Random Forest classifier. If you are interested in learning more about Random Forests in general, here is [an intuitive article](https://towardsdatascience.com/understanding-random-forest-58381e0602d2) and a more detailed [Wikipedia page](https://en.wikipedia.org/wiki/Random_forest) on this modelling class.        
We aim to use `80%` of our `3'308` samples as a training set to fit the classifier and the remaining `20%` as a test set to evaluate its performance. In the Q2 CLI command below this is specified by setting `--p-test-size` to `0.2`. The train-test split is performed as a "stratified split": we aim for these split proportions while ensuring that each target class makes up the same fraction of the training set and of the test set. In other words, the class proportions are preserved across the splits.  
To make the training of the classifier reproducible, we further choose a specific random seed (`--p-random-state 22`).          
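To make the idea of a stratified split concrete, here is a minimal pure-Python sketch. It is a hand-rolled illustration and not the code Q2 runs internally; the function name `stratified_split` and the toy labels are assumptions made up for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=22):
    """Return (train_idx, test_idx) so that each class keeps roughly its share."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # split every class separately with the same test fraction
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# toy labels (hypothetical): 50% stool, 30% saliva, 20% sebum
labels = ["stool"] * 50 + ["saliva"] * 30 + ["sebum"] * 20
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both splits keep the 50/30/20 class ratio of the toy labels, which is what "stratified" means here. A plain random split would only match these ratios on average.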

As mentioned at the start, our goal is to predict the metadata column `sample_type` given the microbial composition (see the inputs to the parameters `--i-table`, `--m-metadata-file` and `--m-metadata-column` below). The microbial composition in our case is a `FeatureTable[Frequency]` artifact in which each microbial feature is identified by its actual nucleotide sequence.

In [4]:
! qiime sample-classifier classify-samples \
  --i-table $data_dir/feature-table.qza \
  --m-metadata-file $data_dir/metadata_proc.tsv \
  --m-metadata-column sample_type \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-random-state 22 \
  --p-n-jobs 3 \
  --output-dir $data_dir/small-RF-classifier

Saved SampleEstimator[Classifier] to: w8_hmp_data/small-RF-classifier/sample_estimator.qza
Saved FeatureData[Importance] to: w8_hmp_data/small-RF-classifier/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: w8_hmp_data/small-RF-classifier/predictions.qza
Saved Visualization to: w8_hmp_data/small-RF-classifier/model_summary.qzv
Saved Visualization to: w8_hmp_data/small-RF-classifier/accuracy_results.qzv
Saved SampleData[Probabilities] to: w8_hmp_data/small-RF-classifier/probabilities.qza
Saved Visualization to: w8_hmp_data/small-RF-classifier/heatmap.qzv
Saved SampleData[TrueTargets] to: w8_hmp_data/small-RF-classifier/training_targets.qza
Saved SampleData[TrueTargets] to: w8_hmp_data/small-RF-classifier/test_targets.qza

**a)** Run the command below and inspect the previously created output artifacts. Which artifact contains the trained classifier?

In [5]:
! qiime sample-classifier classify-samples --help

Usage: qiime sample-classifier classify-samples [OPTIONS]

  Predicts a categorical sample metadata column using a supervised learning
  classifier. Splits input data into training and test sets. The training set
  is used to train and test the estimator using a stratified k-fold cross-
  validation scheme. This includes optional steps for automated feature
  extraction and hyperparameter optimization. The test set validates
  classification accuracy of the optimized estimator. Outputs classification
  results for test set. For more details on the learning algorithm, see
  http://scikit-learn.org/stable/supervised_learning.html

Inputs:
  --i-table ARTIFACT FeatureTable[Frequency | RelativeFrequency |
    PresenceAbsence | Composition]
                          Feature table containing all features that should
                          be used for target prediction.            [required]
Parameters:
  --m-metadat

### 2.2 Evaluate trained classifier: Confusion matrix, accuracy & ROC

Running this command we obtain several output files. Let's first have a look at the produced visualisation `accuracy_results.qzv`. The first plot in this visualisation contains a so-called **confusion matrix**, which displays how frequently each class was predicted correctly and which classes it was confused with.

The actual values of the depicted proportions can be found in the table below the confusion matrix. This table additionally contains the overall **accuracy** metrics of the classifier. The overall accuracy is the fraction of test samples that were assigned the correct class by the trained model. The baseline accuracy is the accuracy we would reach by simply predicting the most frequent class for all samples.
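The relationship between the confusion matrix and these accuracy metrics can be sketched in a few lines. The 3-class matrix below is made up for illustration (it is not the output of our classifier), and the variable names are assumptions of this example.

```python
# rows = true class, columns = predicted class (hypothetical counts)
classes = ["A", "B", "C"]
confusion = [
    [8, 2, 0],   # true A
    [1, 15, 4],  # true B
    [0, 3, 7],   # true C
]

n_total = sum(sum(row) for row in confusion)
n_correct = sum(confusion[i][i] for i in range(len(classes)))

# overall accuracy: fraction of samples on the diagonal
overall_accuracy = n_correct / n_total

# baseline accuracy: always predict the most frequent true class
baseline_accuracy = max(sum(row) for row in confusion) / n_total

# accuracy ratio: how many times better than the baseline we are
accuracy_ratio = overall_accuracy / baseline_accuracy

# per-class accuracy: correctly predicted fraction of each true class
per_class = {c: confusion[i][i] / sum(confusion[i]) for i, c in enumerate(classes)}

print(overall_accuracy, baseline_accuracy, accuracy_ratio)
print(per_class)
```

In this toy case the classifier is 1.5 times as accurate as the "most frequent class" baseline, and class `C` is the one it predicts least reliably.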

**a)** Have a look at the confusion matrix. Sum up the true count of samples of the `sample_type` classes. Why does the resulting sum not equal the expected count of test samples, namely `20%` of the `3'308` samples in the metadata?   

In [20]:
assert 56 + 279 + 173 + 42 + 41 + 41 == 0.2 * 3308, f"{56 + 279 + 173 + 42 + 41 + 41} does not equal 20% of 3,308 ({round(0.2 * 3308, ndigits=0)})"

AssertionError: 632 does not equal 20% of 3,308 (662.0)

Maybe some of the `sample_type` values are NaN. Let's look at the original metadata:

In [22]:
meta = pd.read_csv(f"{data_dir}/metadata_proc.tsv", sep="\t")
meta.head()

Unnamed: 0,sampleid,host_subject_id,env,body_site,sample_type,env_material,elevation,latitude,longitude,geo_loc_name
0,1928.SRS063768.SRX020548.SRR049963,103092734,Skin,UBERON:skin of nose,sebum,sebum,97,38.98,-77.11,USA
1,1928.SRS064411.SRX020548.SRR049630,103092734,Skin,UBERON:skin of elbow,sebum,sebum,97,38.98,-77.11,USA
2,1928.SRS065595.SRX020548.SRR047332,103092734,Skin,UBERON:skin of elbow,sebum,sebum,97,38.98,-77.11,USA
3,1928.SRS045788.SRX020527.SRR049597,132902142,Skin,UBERON:skin of elbow,sebum,sebum,97,38.98,-77.11,USA
4,1928.SRS048971.SRX020527.SRR047153,132902142,Skin,UBERON:skin of external ear,sebum,sebum,97,38.98,-77.11,USA


In [26]:
len(meta)

3308

In [31]:
meta.isna().any()

sampleid           False
host_subject_id    False
env                False
body_site          False
sample_type        False
env_material       False
elevation          False
latitude           False
longitude          False
geo_loc_name       False
dtype: bool

In [25]:
feature = q2.Artifact.load(f"{data_dir}/feature-table.qza").view(pd.DataFrame)
feature.head()

Unnamed: 0,TTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTACGGCACTAAACCCCGGAAAGGGTCTAACACCTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGCC,TTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTGCAGCACTGATCTCTTATGAGACCAACACTTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTC,TTTAACCTTGCGGCCGTACTCCCCAGGCGGTCGATTTATCACGTTAGCTACGGGCGCCAAGCTTAAAGCTCAACCCCCAAATCGACATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATGA,TTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAACTGCAGCACTGAAGGGCGGAAACCCTCCAACACTTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGAGCC,TTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTGCGGCACTAAACCCCGGAAAGGGTCTAACACCTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGCC,TTCAACCTTGCGGTCGTACTCCCCAGGCGGGGTACTTATTGCGTTAACTCCGGCACAGAAGGGGTCGATACCTCCTACACCTAGTACCCATCGTTTACGGCCAGGACTACCGGGGTATCTAATCCCGTTCGCTCCCCTGGCTTTCGCGCC,TTTAGCCTTGCGGCCGTACTCCCCAGGCGGGGCACTTAATGCGTTAGCTACGGCGCGGAAAACGTGGAATGTTCCCCACACCTAGTGCCCAACGTTTACGGCATGGACTACCAGGGTATCTAATCCTGTTCGCTCCCCATGCTTTCGCTT,TTTAATCTTGCGACCGTACTCCCCAGGCGGTCGATTTCACGCGTTAGCTTCGCTACTAAGCAGTCATGCTGCCCAACAGCTAATCGACATCGTTTAGGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGGGCAT,TTTAACCTTGCGGCCGTACTCCCCAGGCGGTCGATTTATCACGTTAGCTACGGGCGCCAAACTTAAAGTTCAACCCCCAAATCGACATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATGA,TTTAACCTTGCGGCCGTACTCCCCAGGCGGTCGATTTATCACGTTAGCTACGGGCACCAAGCTTAAAGCCCAATCCCCAAATCGACAGCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATGA,...,TTCACCGTTGCCGGCGTACTCCCCAGGTGGGATGCTTAACGCTTTCGCTTAGCCGCGTACCATAATTGGCATACAGCGGGCATCCATCGTTTACTGTGCGGACTACCAGGGTATCTAATCCTGTTTGATACCCGCACCTTCGAGCTTAAG,TTCACCGTTGCCGGCGTACTCCCCAGGTGGGATGCTTAACGCTTTCGCTTGGCCGCTGAAATCAATATCCCAACGGCGGGCATCCATCGTTTACCGCGCGGACTACCAGGGTATCTAATCCTGTTCGATACCCGCGCTTTCGAGCCTCAG,TTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTGCGGCACTGAATCCCAGAAAGGATCCAACACCTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTCGCTCCCCACGCTTTCGAGCC,TTTAATCTTGCGACCGTACTCC
CCAGGCGGAGTGCTTAATGCGTTAGCTGCGATACTGATCCGAAGACCAACATCTAGCACTCATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTTAGCG,TTCACTCTTGCGAGCGTACTCCCCAGGTGGGATACTTAACGCTTTCGCTAAGCCAGTAACTGTGTATCGCTACCAGCGAGTATCCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTCGCTCCCCACGCTTTCGTGCCTCAG,TTCACACTTGCGTGCGTACTCCCCAGGCGGAGTGTTTAATGCGTTAGCTGCGGCTCCCTGATTATTCCAAGAACCTAACACTCATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGTACCTCAGCG,TTTAACCTTGCGGTCGTACTCCCCAGGGCGGTCGATTTATCACGTTAGCTACGGGCGCCAAACTCAAAGTTCAACCCCCAAATCGACATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATG,TTTAACCTTGCGGTCGTACTCCCCCCAGGCGGTCGATTTATCACGTTAGCTACGGGCGCCAAACTCAAAGTTCAACCCCCAAATCGACATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACAT,TTCATCCTTGCGGACGTACTCCCCAGGCGGGGTACTTATTGCGTTAACTCCGGCACAGAAGGGGTCGATACCTCCTACACCTAGTACCCATCGTTTACGGCCAGGACTACCGGGGTATCTAATCCCGTTCGCTACCCTGACTTTCGCATC,TTCAACCTTGCGGTCGTACTCCCCAGGTGGATTACTTATCGTGTTAACTGCGGCACTGAAGGGGTCAATCCTCCAACACCTAGTAATCATCGTTTACAGTGTGGACTACCAGGGTATCTAATCCTGTTTGCTACCCACACTTTCGAACCT
1928.SRS015121.SRX020555.SRR045717,2297.0,1350.0,720.0,718.0,669.0,419.0,190.0,179.0,176.0,94.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1928.SRS064354.SRX020689.SRR048775,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1928.SRS063673.SRX020579.SRR047060,557.0,144.0,0.0,178.0,103.0,717.0,1.0,7.0,19.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1928.SRS021145.SRX022230.SRR058093,0.0,0.0,0.0,5.0,8.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1928.SRS017951.SRX019690.SRR041630,832.0,39.0,27.0,36.0,357.0,23.0,0.0,62.0,224.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
len(feature)

3157

We can see that the feature table does not have the same number of rows as the metadata table. Let's join them:

In [34]:
joined = pd.merge(meta, feature, how="left", left_on="sampleid", right_index=True)
joined.head()

Unnamed: 0,sampleid,host_subject_id,env,body_site,sample_type,env_material,elevation,latitude,longitude,geo_loc_name,...,TTCACCGTTGCCGGCGTACTCCCCAGGTGGGATGCTTAACGCTTTCGCTTAGCCGCGTACCATAATTGGCATACAGCGGGCATCCATCGTTTACTGTGCGGACTACCAGGGTATCTAATCCTGTTTGATACCCGCACCTTCGAGCTTAAG,TTCACCGTTGCCGGCGTACTCCCCAGGTGGGATGCTTAACGCTTTCGCTTGGCCGCTGAAATCAATATCCCAACGGCGGGCATCCATCGTTTACCGCGCGGACTACCAGGGTATCTAATCCTGTTCGATACCCGCGCTTTCGAGCCTCAG,TTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTGCGGCACTGAATCCCAGAAAGGATCCAACACCTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTCGCTCCCCACGCTTTCGAGCC,TTTAATCTTGCGACCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTGCGATACTGATCCGAAGACCAACATCTAGCACTCATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTTAGCG,TTCACTCTTGCGAGCGTACTCCCCAGGTGGGATACTTAACGCTTTCGCTAAGCCAGTAACTGTGTATCGCTACCAGCGAGTATCCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTCGCTCCCCACGCTTTCGTGCCTCAG,TTCACACTTGCGTGCGTACTCCCCAGGCGGAGTGTTTAATGCGTTAGCTGCGGCTCCCTGATTATTCCAAGAACCTAACACTCATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGTACCTCAGCG,TTTAACCTTGCGGTCGTACTCCCCAGGGCGGTCGATTTATCACGTTAGCTACGGGCGCCAAACTCAAAGTTCAACCCCCAAATCGACATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATG,TTTAACCTTGCGGTCGTACTCCCCCCAGGCGGTCGATTTATCACGTTAGCTACGGGCGCCAAACTCAAAGTTCAACCCCCAAATCGACATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACAT,TTCATCCTTGCGGACGTACTCCCCAGGCGGGGTACTTATTGCGTTAACTCCGGCACAGAAGGGGTCGATACCTCCTACACCTAGTACCCATCGTTTACGGCCAGGACTACCGGGGTATCTAATCCCGTTCGCTACCCTGACTTTCGCATC,TTCAACCTTGCGGTCGTACTCCCCAGGTGGATTACTTATCGTGTTAACTGCGGCACTGAAGGGGTCAATCCTCCAACACCTAGTAATCATCGTTTACAGTGTGGACTACCAGGGTATCTAATCCTGTTTGCTACCCACACTTTCGAACCT
0,1928.SRS063768.SRX020548.SRR049963,103092734,Skin,UBERON:skin of nose,sebum,sebum,97,38.98,-77.11,USA,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1928.SRS064411.SRX020548.SRR049630,103092734,Skin,UBERON:skin of elbow,sebum,sebum,97,38.98,-77.11,USA,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1928.SRS065595.SRX020548.SRR047332,103092734,Skin,UBERON:skin of elbow,sebum,sebum,97,38.98,-77.11,USA,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1928.SRS045788.SRX020527.SRR049597,132902142,Skin,UBERON:skin of elbow,sebum,sebum,97,38.98,-77.11,USA,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1928.SRS048971.SRX020527.SRR047153,132902142,Skin,UBERON:skin of external ear,sebum,sebum,97,38.98,-77.11,USA,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [37]:
# keep only the rows that contain NA, i.e. samples without feature counts
rows_with_nan = joined[joined.isna().any(axis=1)]
rows_with_nan.head()

Unnamed: 0,sampleid,host_subject_id,env,body_site,sample_type,env_material,elevation,latitude,longitude,geo_loc_name,...,TTCACCGTTGCCGGCGTACTCCCCAGGTGGGATGCTTAACGCTTTCGCTTAGCCGCGTACCATAATTGGCATACAGCGGGCATCCATCGTTTACTGTGCGGACTACCAGGGTATCTAATCCTGTTTGATACCCGCACCTTCGAGCTTAAG,TTCACCGTTGCCGGCGTACTCCCCAGGTGGGATGCTTAACGCTTTCGCTTGGCCGCTGAAATCAATATCCCAACGGCGGGCATCCATCGTTTACCGCGCGGACTACCAGGGTATCTAATCCTGTTCGATACCCGCGCTTTCGAGCCTCAG,TTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTGCGGCACTGAATCCCAGAAAGGATCCAACACCTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTCGCTCCCCACGCTTTCGAGCC,TTTAATCTTGCGACCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTGCGATACTGATCCGAAGACCAACATCTAGCACTCATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTTAGCG,TTCACTCTTGCGAGCGTACTCCCCAGGTGGGATACTTAACGCTTTCGCTAAGCCAGTAACTGTGTATCGCTACCAGCGAGTATCCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTCGCTCCCCACGCTTTCGTGCCTCAG,TTCACACTTGCGTGCGTACTCCCCAGGCGGAGTGTTTAATGCGTTAGCTGCGGCTCCCTGATTATTCCAAGAACCTAACACTCATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGTACCTCAGCG,TTTAACCTTGCGGTCGTACTCCCCAGGGCGGTCGATTTATCACGTTAGCTACGGGCGCCAAACTCAAAGTTCAACCCCCAAATCGACATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATG,TTTAACCTTGCGGTCGTACTCCCCCCAGGCGGTCGATTTATCACGTTAGCTACGGGCGCCAAACTCAAAGTTCAACCCCCAAATCGACATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACAT,TTCATCCTTGCGGACGTACTCCCCAGGCGGGGTACTTATTGCGTTAACTCCGGCACAGAAGGGGTCGATACCTCCTACACCTAGTACCCATCGTTTACGGCCAGGACTACCGGGGTATCTAATCCCGTTCGCTACCCTGACTTTCGCATC,TTCAACCTTGCGGTCGTACTCCCCAGGTGGATTACTTATCGTGTTAACTGCGGCACTGAAGGGGTCAATCCTCCAACACCTAGTAATCATCGTTTACAGTGTGGACTACCAGGGTATCTAATCCTGTTTGCTACCCACACTTTCGAACCT
42,1928.SRS021613.SRX020679.SRR048054,158013734,Oral,UBERON:tongue,saliva,sebum,97,38.98,-77.11,USA,...,,,,,,,,,,
147,1928.SRS022127.SRX020197.SRR042796,158337416,Skin,UBERON:skin of elbow,sebum,sebum,97,38.98,-77.11,USA,...,,,,,,,,,,
288,1928.SRS023577.SRX020197.SRR042796,158802708,Skin,UBERON:skin of elbow,sebum,sebum,97,38.98,-77.11,USA,...,,,,,,,,,,
289,1928.SRS011290.SRX020515.SRR045368,158802708,Skin,UBERON:skin of external ear,sebum,sebum,97,38.98,-77.11,USA,...,,,,,,,,,,
302,1928.SRS011163.SRX020570.SRR045452,158822939,Oral,UBERON:hard palate,saliva,sebum,97,38.98,-77.11,USA,...,,,,,,,,,,


In [38]:
len(rows_with_nan)

151
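The same check can be done without a join by comparing sample IDs as sets. The sketch below uses made-up IDs (`s1` to `s5`) rather than the real data; with the real tables one would pass the metadata `sampleid` column and the feature table index instead.

```python
# hypothetical sample IDs standing in for the real tables
metadata_ids = ["s1", "s2", "s3", "s4", "s5"]
feature_table_ids = ["s1", "s3", "s4"]

# samples described in the metadata but absent from the feature table
missing = sorted(set(metadata_ids) - set(feature_table_ids))

# samples available to the classifier: present in both tables
shared = sorted(set(metadata_ids) & set(feature_table_ids))

print(f"{len(missing)} missing: {missing}")
print(f"{len(shared)} usable: {shared}")
```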

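The resulting sum matches 20% of the rows in the feature table rather than 20% of the metadata rows: only the `3157` samples that are present in the feature table (i.e. the samples that passed QC and have counts) are used, while the `151` metadata samples without feature data are dropped.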

In [42]:
len(feature) * 0.2

631.4000000000001

In [40]:
56 + 279 + 173 + 42 + 41 + 41

632
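The remaining gap between `631.4` and `632` comes from rounding, since a test set needs a whole number of samples. The sketch below assumes that the split rounds the test set size up, as scikit-learn's `train_test_split` does for a fractional `test_size`. That Q2 relies on this behaviour is an assumption here, not something shown in the output above.

```python
import math

n_metadata = 3308       # rows in metadata_proc.tsv
n_feature_table = 3157  # samples in feature-table.qza
test_size = 0.2

# samples that have metadata but no feature counts
n_missing = n_metadata - n_feature_table

# assumed rounding rule: test set size is rounded up to a whole sample
n_test = math.ceil(test_size * n_feature_table)
n_train = n_feature_table - n_test

print(n_missing, n_test, n_train)
```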

In [5]:
Visualization.load(f"{data_dir}/small-RF-classifier/accuracy_results.qzv")

**b)** Which `sample_type` class does our model predict accurately and which ones does it predict less accurately? (List `sample_type` in decreasing accuracy)
            

- Most accurately: `mucus`, with an accuracy near 100%.
- Least accurately: `subgingival dental plaque` and `supragingival dental plaque`, with accuracies of ~61% and ~68%, respectively.

**c)** By which factor is our trained classifier more accurate compared to a model that always predicts the most frequent class?        
       

The accuracy ratio reported in the table accompanying the confusion matrix is `2.1`, i.e. our classifier is about 2.1 times as accurate as the baseline.

**d)** Which other `sample_type` does our trained classifier mix up true `supragingival dental plaque` samples with most frequently?      
           

It's `subgingival dental plaque`, which is plausible since both sample types are dental plaque.

The last plot in `accuracy_results.qzv` depicts two [Receiver Operating Characteristic (ROC) curves](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). The ROC curve is a frequently used graphical representation of how well a trained classifier performs. It plots the relationship between the true positive rate (TPR on y-axis) and the false positive rate (FPR on x-axis) at various thresholds. If our classifier were to choose at random between two classes ("Chance" classifier), the ROC curve would be the grey straight line with slope 1. The closer the ROC curve lies to the top-left corner, the better the classifier. Frequently, when plotting the ROC curve one also calculates the area under the ROC curve (AUC). The larger the AUC, the better the trained classifier is at distinguishing the target classes.
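As a rough illustration of how a ROC curve and its AUC are obtained, the sketch below sweeps the decision threshold over a tiny binary toy example. The labels and scores are invented, the helper names `roc_points` and `auc` are assumptions of this example, and tied scores are not handled, so this is not a replacement for a library implementation.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # sort by decreasing score: each step admits one more sample as "positive"
    ranked = sorted(zip(scores, y_true), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# toy binary problem: 1 = positive class, scores = predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
roc_auc = auc(points)
print(points)
print(roc_auc)
```

A perfect classifier would reach the point (0, 1) and an AUC of 1.0, while random guessing follows the diagonal with an AUC of 0.5.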

**e)** According to the ROC curves and their AUC values, which `sample_type` is predicted least accurately?       

`supragingival dental plaque`

**f)** How would you evaluate our trained classifier's modelling performance overall?

Looks pretty good: the ROC curves are very steep (pushed towards the upper-left corner), indicating a high TPR at a very low FPR, and the AUC values are all close to 1.0.

**g)** Train another classifier to predict `env` from the metadata. How does this classifier's performance compare to the `sample_type` classifier?        

In [43]:
! qiime sample-classifier classify-samples \
  --i-table $data_dir/feature-table.qza \
  --m-metadata-file $data_dir/metadata_proc.tsv \
  --m-metadata-column env \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-random-state 22 \
  --p-n-jobs 3 \
  --output-dir $data_dir/small-RF-classifier-env

Saved SampleEstimator[Classifier] to: w8_hmp_data/small-RF-classifier-env/sample_estimator.qza
Saved FeatureData[Importance] to: w8_hmp_data/small-RF-classifier-env/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: w8_hmp_data/small-RF-classifier-env/predictions.qza
Saved Visualization to: w8_hmp_data/small-RF-classifier-env/model_summary.qzv
Saved Visualization to: w8_hmp_data/small-RF-classifier-env/accuracy_results.qzv
Saved SampleData[Probabilities] to: w8_hmp_data/small-RF-classifier-env/probabilities.qza
Saved Visualization to: w8_hmp_data/small-RF-classifier-env/heatmap.qzv
Saved SampleData[TrueTargets] to: w8_hmp_data/small-RF-classifier-env/training_targets.qza
Saved SampleData[TrueTargets] to: w8_hmp_data/small-RF-classifier-env/test_targets.qza

In [44]:
Visualization.load(f"{data_dir}/small-RF-classifier-env/accuracy_results.qzv")

Wow, almost every class is predicted with an accuracy of 1.0, the ROC curves are pushed even further to the upper left, and the AUC values are all 1.0. This classifier works even better than the `sample_type` classifier.

### 2.3 Evaluate trained classifier: Individual predictions

Training a classifier in Q2 also yields the predictions and class probabilities for each individual test sample, in `predictions.qza` and `probabilities.qza` respectively, along with their true values in `test_targets.qza`. We can view them with the Q2 CLI:

In [45]:
! qiime metadata tabulate \
  --m-input-file $data_dir/small-RF-classifier/test_targets.qza \
  --m-input-file $data_dir/small-RF-classifier/predictions.qza \
  --m-input-file $data_dir/small-RF-classifier/probabilities.qza \
  --o-visualization $data_dir/small-RF-classifier/test_predprob.qzv

Saved Visualization to: w8_hmp_data/small-RF-classifier/test_predprob.qzv

**a)** Have a look at the rows of sampleid `1928.SRS011475.SRX020664.SRR047524`. Why was the sample classified as `stool`?  

In [46]:
Visualization.load(f"{data_dir}/small-RF-classifier/test_predprob.qzv")

In [52]:
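# a .qzv file is a zip archive: unpack it to read the tabulated metadata directly
qzv = f"{data_dir}/small-RF-classifier/test_predprob.qzv"
unzip_log = !unzip -o $qzv
# the archive's top-level folder is named after the visualization's UUID
digest = unzip_log[1].split('/')[0].replace("  inflating: ","")
fname = os.path.join(digest, "data", "metadata.tsv")
meta_pred = pd.read_csv(fname, sep="\t", index_col=[0])
# remove the unpacked folder again
!rm -r $digest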

In [54]:
meta_pred.loc["1928.SRS011475.SRX020664.SRR047524", :]

sample_type                    sebum
prediction                     stool
mucus                           0.08
saliva                          0.13
sebum                           0.35
stool                           0.43
subgingival dental plaque       0.01
supragingival dental plaque        0
Name: 1928.SRS011475.SRX020664.SRR047524, dtype: object

The classifier assigns each sample the class with the highest predicted probability. For this sample that is `stool` (`0.43`), ahead of the true class `sebum` (`0.35`), which only comes second. The sample is therefore classified as `stool`, even though no class reaches a probability of `0.5`.

**b)** If we evaluated the stool predicted probability at a threshold of `0.5`, would it still be classified as stool?         
           


No, because the predicted probability for `stool` is `0.43`, which is below the threshold of `0.5`.
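Both answers can be checked with the probabilities printed above. The sketch assumes that the predicted class is simply the one with the highest probability, which is how a Random Forest classifier in scikit-learn derives its prediction; the variable names are made up for this example.

```python
# predicted class probabilities of sample 1928.SRS011475.SRX020664.SRR047524
probabilities = {
    "mucus": 0.08,
    "saliva": 0.13,
    "sebum": 0.35,
    "stool": 0.43,
    "subgingival dental plaque": 0.01,
    "supragingival dental plaque": 0.0,
}

# multi-class prediction: pick the class with the highest probability
predicted = max(probabilities, key=probabilities.get)

# one-vs-rest view: would "stool" pass a fixed threshold of 0.5?
passes_threshold = probabilities["stool"] >= 0.5

print(predicted, passes_threshold)
```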

### 2.4 Evaluate trained classifier: Feature importances 

Knowing that our classifier predicts `sample_type` quite accurately, we are now interested in which microbial features are the most important ones for distinguishing the sample types. The most predictive features are listed in the output `feature_importance.qza`.

**a)** Tabulate the `feature_importance.qza` and view the created visualisation. What are the first 5 nucleotides of the most important feature in distinguishing `sample_types` as indicated by the highest feature importance score?          


Note: For the Human Microbiome project dataset we never performed a taxonomic classification of the features. Hence, we'd need more information on the features to make interesting conclusions from the most predictive features of our trained classifier. 

In [56]:
! qiime metadata tabulate \
  --m-input-file $data_dir/small-RF-classifier/feature_importance.qza \
  --o-visualization $data_dir/small-RF-classifier/feature_importance.qzv

Saved Visualization to: w8_hmp_data/small-RF-classifier/feature_importance.qzv

In [58]:
Visualization.load(f"{data_dir}/small-RF-classifier/feature_importance.qzv")

In [60]:
feat_importance = q2.Artifact.load(f"{data_dir}/small-RF-classifier/feature_importance.qza").view(pd.DataFrame)

In [61]:
feat_importance.columns

Index(['importance'], dtype='object')

In [62]:
feat_importance.sort_values(by="importance", ascending=False).head(5).index

Index(['TTTAGCCTTGCGGCCGTACTCCCCAGGCGGGGTACTTAAAGCGTTAGCTACGGCACGGAACCCGTGGAATGGACCCCACACCTAGTACCCACCGTTTACAGCGTGGACTACCAGGGTATCTAAGCCTGTTCGCTCCCCACGCTTTCGCTC',
       'TTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTGCAGCACTAAGGGGCGGAAACCCCCTAACACTTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGATCCCCACGCTTTCGCACA',
       'TTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTGCAGCACTGAGAGGCGGAAACCTCCCAACACTTAGCACTCATCGTTTACGGCATGGACTACCAGGGTATCTAATCCTGTTCGCTACCCATGCTTTCGAGCC',
       'TTTAGCCTTGCGGCCGTACTCCCCAGGCGGGGCGCTTAATGCGTTAGCTACGGCACGAAAGTCGTGAAAAGACCCTCACACCTAGCGCCCACCGTTTACGGCATGGACTACCAGGGTATCTAATCCTGTTCGCTACCCATGCTTTCGCTC',
       'TTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTACGGCACTAAACCCCGGAAAGGGTCTAACACCTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGCC'],
      dtype='object', name='id')

Q2 further allows you to investigate the abundances of the top features in your sample groups of interest with the command below:

In [63]:
! qiime sample-classifier heatmap \
  --i-table $data_dir/feature-table.qza \
  --i-importance $data_dir/small-RF-classifier/feature_importance.qza \
  --m-sample-metadata-file $data_dir/metadata_proc.tsv  \
  --m-sample-metadata-column sample_type \
  --p-group-samples \
  --p-feature-count 30 \
  --o-filtered-table $data_dir/small-RF-classifier/important-feature-table-top-30.qza \
  --o-heatmap $data_dir/small-RF-classifier/important-feature-heatmap.qzv

Saved Visualization to: w8_hmp_data/small-RF-classifier/important-feature-heatmap.qzv
Saved FeatureTable[Frequency] to: w8_hmp_data/small-RF-classifier/important-feature-table-top-30.qza

**b)** Inspect the created visualisation in `$data_dir/small-RF-classifier/important-feature-heatmap.qzv`. Which `sample_types` depict clustered abundances of the top features?

In [64]:
Visualization.load(f"{data_dir}/small-RF-classifier/important-feature-heatmap.qzv")

- Dental plaque samples (both subgingival and supragingival) form a distinct cluster with similar abundance patterns.
- The oral-related samples (saliva and mucus) show somewhat related abundance patterns (in the first 2 and the last several top features), which makes sense given that they are both from the oral cavity.

<a id='sec3'></a>  

## 3. Training and evaluating regressors

If the target to be predicted in supervised ML is of a numeric type, then we call the modelling task a regression. 

The Human Microbiome Project data does not offer a suitable regression task, as all of its numeric columns (e.g. latitude and longitude) are constant across all samples. Hence, we will use a slightly edited dataset from the Earth Microbiome Project here.

In [65]:
# new earth microbiome folder:
dir_earth = 'w8_emp_data'

In [66]:
%%bash -s $dir_earth
# Please do NOT modify this cell - here we copy the required data into
# your personal Jupyter workspace.

mkdir -p "$1"
cp -rn /data/w8_emp_data/* "$1"
chmod -R +rxw "$1"

### 3.1 Choosing target and setup for regression

**a)** Inspect the downloaded metadata. Which column(s) would be suitable as a regression target? Why?

**b)** How many samples do we have in this excerpt of the Earth Microbiome Project? Is this enough to train a regressor? (Hint: Check out [this nice overview on choosing the right ML model](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) by scikit-learn)

### 3.2 Training regressor to predict `age` with microbial composition

Let's now train two different regression models that predict `age` from the microbial composition, which is given here as a `FeatureTable[Frequency]` artifact. Given our limited sample size we choose a Lasso regression and a RandomForestRegressor. For more information about Lasso regression read [this short description](https://scikit-learn.org/stable/modules/linear_model.html#lasso) and for more information on RandomForestRegressors read [this description](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). For both models we choose a simple 80% to 20% train-test split with a fixed random seed for reproducibility.

Training the Lasso regressor:

In [67]:
! qiime sample-classifier regress-samples \
  --i-table $dir_earth/table.qza \
  --m-metadata-file $dir_earth/sample-metadata.tsv \
  --m-metadata-column age \
  --p-test-size 0.2 \
  --p-estimator Lasso \
  --p-random-state 22 \
  --output-dir $dir_earth/lasso-regressor

Saved SampleEstimator[Regressor] to: w8_emp_data/lasso-regressor/sample_estimator.qza
Saved FeatureData[Importance] to: w8_emp_data/lasso-regressor/feature_importance.qza
Saved SampleData[RegressorPredictions] to: w8_emp_data/lasso-regressor/predictions.qza
Saved Visualization to: w8_emp_data/lasso-regressor/model_summary.qzv
Saved Visualization to: w8_emp_data/lasso-regressor/accuracy_results.qzv

Training the RandomForestRegressor:

In [68]:
! qiime sample-classifier regress-samples \
  --i-table $dir_earth/table.qza \
  --m-metadata-file $dir_earth/sample-metadata.tsv \
  --m-metadata-column age \
  --p-test-size 0.2 \
  --p-estimator RandomForestRegressor \
  --p-random-state 22 \
  --output-dir $dir_earth/rf-regressor

Saved SampleEstimator[Regressor] to: w8_emp_data/rf-regressor/sample_estimator.qza
Saved FeatureData[Importance] to: w8_emp_data/rf-regressor/feature_importance.qza
Saved SampleData[RegressorPredictions] to: w8_emp_data/rf-regressor/predictions.qza
Saved Visualization to: w8_emp_data/rf-regressor/model_summary.qzv
Saved Visualization to: w8_emp_data/rf-regressor/accuracy_results.qzv

### 3.3 Evaluate trained regressor: Accuracy

When training a regressor in Q2, we again obtain an `accuracy_results.qzv` visualisation. In contrast to the one produced for the classifier, this visualisation contains a scatter plot displaying predicted values (y-axis) vs. true values (x-axis) for each sample of the test set. The scatter plot further contains a linear regression line fitted to the test data with 95% confidence intervals, and a dotted line representing where the predictions of a "perfect" regressor would lie (a "perfect" regressor predicts each true value correctly).

**a)** Inspect the `accuracy_results` of both trained regressors. Which regressor performs better?

In [69]:
Visualization.load(f"{dir_earth}/lasso-regressor/accuracy_results.qzv")

In [70]:
Visualization.load(f"{dir_earth}/rf-regressor/accuracy_results.qzv")

The Random Forest Regressor performs better because:

1. Higher r-squared value (0.235 vs 0.008)
2. Lower mean squared error (0.178 vs 0.332)

But I tend to say neither model performs exceptionally well...
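For reference, the two metrics used in this comparison can be computed by hand as sketched below. The numbers are toy values, not our test set, and the r-squared here is the coefficient of determination; how exactly Q2 derives its reported r-squared is not shown in this notebook, so treat the correspondence as an assumption.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual variance / total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# hypothetical true and predicted values of a small test set
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(mse, r2)
```

A lower mean squared error and a higher r-squared both indicate predictions that lie closer to the true values.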

### 3.4 Training regressor to predict `age` with metadata AND microbial composition

Let's try to improve our regressor by adding metadata features to the microbial composition features.        

Currently Q2 only allows numeric metadata features, not categorical ones, to be used to enrich a FeatureTable (for more info see the output of `! qiime sample-classifier metatable --help`).
Let's use the `metatable` method in sample-classifier to convert our metadata table to a feature table and see which features our feature table is enriched with:

In [71]:
! qiime sample-classifier metatable \
  --i-table $dir_earth/table.qza \
  --m-metadata-file $dir_earth/sample-metadata.tsv \
  --o-converted-table $dir_earth/table-w-metadata.qza

Saved FeatureTable[Frequency] to: w8_emp_data/table-w-metadata.qza

**a)** Inspect the created feature table. Which columns were added to the microbial features?

In [75]:
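table_w_meta = q2.Artifact.load(f"{dir_earth}/table-w-metadata.qza").view(pd.DataFrame)
# assumes microbial feature IDs are 32-character hashes,
# so any shorter column name must be an added metadata column
[c for c in table_w_meta.columns if len(c) < len("fdcd6808ef8269653d25dce4a55a025d")]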

['age', 'height']

`age` and `height` are added to the microbial features.

Let's now train a RandomForestRegressor with the new feature table. 

In [76]:
! qiime sample-classifier regress-samples \
  --i-table $dir_earth/table-w-metadata.qza \
  --m-metadata-file $dir_earth/sample-metadata.tsv \
  --m-metadata-column age \
  --p-test-size 0.2 \
  --p-estimator RandomForestRegressor \
  --p-random-state 22 \
  --output-dir $dir_earth/rf-regressor-dangerous

Saved SampleEstimator[Regressor] to: w8_emp_data/rf-regressor-dangerous/sample_estimator.qza
Saved FeatureData[Importance] to: w8_emp_data/rf-regressor-dangerous/feature_importance.qza
Saved SampleData[RegressorPredictions] to: w8_emp_data/rf-regressor-dangerous/predictions.qza
Saved Visualization to: w8_emp_data/rf-regressor-dangerous/model_summary.qzv
Saved Visualization to: w8_emp_data/rf-regressor-dangerous/accuracy_results.qzv

**b)** Inspect the model's performance. Why is the trained model equal to a "perfect" predictor? Is this actually the "perfect" model or did we do something that could be considered cheating during training? 



Because we included our target variable `age` in the feature table 😂, we were essentially using `age` to predict `age` when training the Random Forest regressor! This is a major form of data leakage: the model did not have to learn any real patterns or meaningful relationships from the microbial composition or the other features (`height` etc.).

In [77]:
Visualization.load(f"{dir_earth}/rf-regressor-dangerous/accuracy_results.qzv")

Generally when training any sort of model, you want to ensure that the features that are used are not secretly (or obviously) carrying the target data within. Another less obvious example for this would be having a feature of height in `cm` and the target being height in `km`.           
Another important point is to ensure that different samples are being used to train and to evaluate the model. If you train and evaluate the model on the same samples, you lack the information on how your model performs on previously unseen samples.  
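One simple sanity check for the first kind of leakage can be sketched as follows: flag every feature that is (almost) perfectly correlated with the target before training. The function names, the `threshold` value and the toy data are assumptions for illustration only, and a correlation check like this would only catch linear leakage such as the `cm` vs. `km` example.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    if sd_x == 0 or sd_y == 0:
        # a constant column carries no information about the target
        return 0.0
    return cov / (sd_x * sd_y)


def find_leaky_features(features, target, threshold=0.999):
    """Return names of features whose |correlation| with the target
    reaches the (arbitrarily chosen) threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson_r(values, target)) >= threshold
    ]


# toy data (hypothetical): one metadata feature and one microbial feature
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "taxon_a": [3.0, 0.0, 12.0, 5.0, 1.0],
}
# the target is the same height, only expressed in km
target_height_km = [h / 100_000 for h in features["height_cm"]]

leaky = find_leaky_features(features, target_height_km)
print(leaky)
```

Here only `height_cm` is flagged, since it carries the target in a different unit, while the microbial feature is kept.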


Okay, so keeping all this in mind, let's now remove `age` from the `$dir_earth/table-w-metadata.qza` file and re-train the RandomForestRegressor:    

In [78]:
# load artifact and view as pandas dataframe
df_table_all = q2.Artifact.load(
    f"{dir_earth}/table-w-metadata.qza").view(pd.DataFrame)

# drop age column
df_table_corr = df_table_all.drop(columns=['age'])

# assert that age is not a column in the dataframe anymore
assert 'age' not in df_table_corr.columns

# re-import pandas feature table into QIIME 2
table_corr_artifact = q2.Artifact.import_data(
    'FeatureTable[Frequency]', df_table_corr)

# save QIIME 2 Artifact
table_corr_artifact.save(f'{dir_earth}/table-w-metadata-corr.qza')

'w8_emp_data/table-w-metadata-corr.qza'

In [79]:
# re-training the RandomForestRegressor without age as an input feature (age remains the target):
! qiime sample-classifier regress-samples \
  --i-table $dir_earth/table-w-metadata-corr.qza \
  --m-metadata-file $dir_earth/sample-metadata.tsv \
  --m-metadata-column age \
  --p-test-size 0.2 \
  --p-estimator RandomForestRegressor \
  --p-random-state 22 \
  --output-dir $dir_earth/rf-regressor-corr

Saved SampleEstimator[Regressor] to: w8_emp_data/rf-regressor-corr/sample_estimator.qza
Saved FeatureData[Importance] to: w8_emp_data/rf-regressor-corr/feature_importance.qza
Saved SampleData[RegressorPredictions] to: w8_emp_data/rf-regressor-corr/predictions.qza
Saved Visualization to: w8_emp_data/rf-regressor-corr/model_summary.qzv
Saved Visualization to: w8_emp_data/rf-regressor-corr/accuracy_results.qzv

**c)** Compare the model performance of this new RandomForest regressor (`$dir_earth/rf-regressor-corr`) trained on microbial features and metadata (only `height`) to the RandomForest regressor we had previously trained on only microbial features (`$dir_earth/rf-regressor`). Which model performs better?         

The new regressor performs much better! It has $R^2 \approx 0.81 \gg 0.24$ and $\mathrm{MSE} \approx 0.04 \ll 0.18$. The predicted points cluster closely around the "perfect" prediction line, and the fitted regression line aligns almost exactly with it.

In [80]:
Visualization.load(f"{dir_earth}/rf-regressor-corr/accuracy_results.qzv")

In [81]:
Visualization.load(f"{dir_earth}/rf-regressor/accuracy_results.qzv")

<a id='sec4'></a>    

## 4. Over- vs. underfitting


Another way of evaluating a trained supervised ML model (classifier or regressor) is by checking whether the model tends to overfit or underfit the data. 

Let's first define these two terms: 
* Underfitting: The model displays poor performance on the training as well as the test dataset. It essentially fails to learn from the data provided.  
* Overfitting: The model fits the training data very well (almost learns it by heart) but its performance on the test data is disproportionally bad. 

As we have seen above in the classification section 2.2, our initially trained classifier (in `$data_dir/small-RF-classifier/sample_estimator.qza`) performs very well on the test set. Hence, it generalises well to previously unseen data. If the test performance is that high, there is usually no need to worry about over- or underfitting, as both are indicated by low test performance. But if the model trained above had performed very badly on the test set, we would want to find out whether this was due to over- or underfitting. In case of overfitting, we would need to restrict the model's learning, e.g. by regularisation or by choosing a simpler model. In case of underfitting, we might need to choose a more complex model or create more sophisticated features (feature engineering).

Now you might be interested in how you could find out whether your model over- or underfits in QIIME 2. Great thought :). To do this, you want to create predictions on the previously used train set, evaluate the model on these and compare this model performance to the model performance on the test set.
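The comparison described above can be sketched as a small helper that turns a train score and a test score into a rough diagnosis. The function name and both thresholds (`good` and `max_gap`) are arbitrary assumptions for illustration; in practice, sensible values depend on the task and on the metric used.

```python
def diagnose_fit(train_score, test_score, good=0.8, max_gap=0.1):
    """Rough diagnosis from train and test scores (higher = better,
    e.g. accuracy). The thresholds are illustrative assumptions."""
    if train_score < good:
        # the model does not even fit the data it was trained on
        return "underfitting"
    if train_score - test_score > max_gap:
        # fits the train set well but is much worse on unseen samples
        return "overfitting"
    return "reasonable fit"


# hypothetical scores, NOT taken from the models trained in this notebook
print(diagnose_fit(train_score=0.55, test_score=0.50))
print(diagnose_fit(train_score=0.99, test_score=0.60))
print(diagnose_fit(train_score=0.97, test_score=0.95))
```

The order of the checks mirrors the definitions: poor train performance points to underfitting, while overfitting only applies once the train set is fitted well.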

If you want to see the model performance on the train set, run and inspect the commands provided below.


* Our first goal is to get a feature table that only contains features of the train set. For this we first want to filter the metadata file by the train targets that were used to train the classifier:


In [83]:
# defining some paths
path2classifier_results = f'{data_dir}/small-RF-classifier/'
path2train_targets = os.path.join(
    path2classifier_results, 'training_targets.qza')

# load complete metadata table to dataframe & check shape
df_metadata = q2.Metadata.load(f'{data_dir}/metadata_proc.tsv').to_dataframe()
df_metadata.shape

(3308, 9)

In [84]:
# load training targets artifact, view as pandas series & check shape
ser_train_target = q2.Artifact.load(path2train_targets).view(pd.Series)
ser_train_target.shape




(2525,)

In [85]:
# filter df_metadata to only contain train samples
df_metadata_train = df_metadata[df_metadata.index.isin(ser_train_target.index)]
df_metadata_train.shape

(2525, 9)

In [86]:
# save filtered metadata as a TSV metadata file
q2.Metadata(df_metadata_train).save(f'{data_dir}/metadata_proc_train.tsv')

'w8_hmp_data/metadata_proc_train.tsv'

* With the metadata only containing the training samples, we want to filter the feature table such that it also only includes the training samples: 

In [87]:
# filter feature table by filtered metadata creating feature table of only samples belonging to train set
! qiime feature-table filter-samples \
    --i-table $data_dir/feature-table.qza \
    --m-metadata-file $data_dir/metadata_proc_train.tsv \
    --o-filtered-table $data_dir/feature-table-train.qza

Saved FeatureTable[Frequency] to: w8_hmp_data/feature-table-train.qza

* Now we're ready to create predictions with the formerly trained classifier only for the train set:

In [88]:
! qiime sample-classifier predict-classification \
   --i-table $data_dir/feature-table-train.qza \
   --i-sample-estimator $data_dir/small-RF-classifier/sample_estimator.qza \
   --output-dir $data_dir/small-RF-classifier/training-predictions

Saved SampleData[ClassifierPredictions] to: w8_hmp_data/small-RF-classifier/training-predictions/predictions.qza
Saved SampleData[Probabilities] to: w8_hmp_data/small-RF-classifier/training-predictions/probabilities.qza

* Lastly, we evaluate the created train predictions & inspect the visualisation:

In [89]:
! qiime sample-classifier confusion-matrix \
   --i-predictions $data_dir/small-RF-classifier/training-predictions/predictions.qza \
   --i-probabilities $data_dir/small-RF-classifier/training-predictions/probabilities.qza \
   --m-truth-file $data_dir/metadata_proc_train.tsv \
   --m-truth-column sample_type \
   --o-visualization $data_dir/small-RF-classifier/training-predictions/model_performance

Saved Visualization to: w8_hmp_data/small-RF-classifier/training-predictions/model_performance.qzv

In [90]:
Visualization.load(f"{data_dir}/small-RF-classifier/training-predictions/model_performance.qzv")

<a id='sec5'></a>    

## 5. Creating predictions for all available samples


Throughout this exercise sheet we have always employed an 80%:20% train:test split, allowing us to evaluate the trained model on the test samples (20%). In some cases you are interested in creating predictions for all available samples without using the same samples to train the model (i.e. without data leakage). If you are interested in how this works (could be helpful for your group project), have a look at the section "Nested cross-validation provides predictions for all samples" in [this Q2 tutorial](https://docs.qiime2.org/2024.5/tutorials/sample-classifier/). 
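The core idea behind getting leakage-free predictions for all samples can be sketched in plain Python: split the samples into folds and predict each fold with a model trained only on the other folds, so every sample lands in a test fold exactly once. This is a simplified illustration, not the Q2 implementation: the function name is hypothetical, the "model" is a trivial mean predictor, and the inner loop used for model tuning in nested cross-validation is left out.

```python
def out_of_fold_predictions(targets, n_folds=3):
    """Predict every sample with a 'model' that never saw that sample.

    The model is deliberately trivial: it predicts the mean of the
    targets in the training folds.
    """
    n = len(targets)
    predictions = [None] * n
    for fold in range(n_folds):
        # samples whose index falls into this fold form the test set
        test_idx = [i for i in range(n) if i % n_folds == fold]
        # all remaining samples form the train set
        train = [targets[i] for i in range(n) if i % n_folds != fold]
        fold_mean = sum(train) / len(train)
        for i in test_idx:
            predictions[i] = fold_mean
    return predictions


# made-up target values for six samples
targets = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
preds = out_of_fold_predictions(targets, n_folds=3)
print(preds)
```

Each prediction comes from a model fitted without the predicted sample, so all samples receive a prediction without the train and test sets ever overlapping.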