# 2. Quality Control and Denoising
## Import data & packages

In [2]:
#Import all necessary packages
import IPython
import pandas as pd
import matplotlib.pyplot as plt
import os
import qiime2 as q2
from qiime2 import Visualization

%matplotlib inline

In [7]:
#Set working directory
os.chdir("/home/jovyan/MicrobiomeAnalysis_TummyTribe/")

# Verify that your working directory is the overall project folder (.../MicrobiomeAnalysis_TummyTribe)
print("Current working directory:", os.getcwd())

Current working directory: /home/jovyan/MicrobiomeAnalysis_TummyTribe


In [4]:
#Data directories for the raw and the processed data
data_dir = "data/raw"
processed_data_dir = "data/processed"

## Quality Control

In [20]:
! qiime tools peek $data_dir/sequences-demux-paired.qza

UUID:        b4782ab7-550b-41f5-b906-ca2cda29ca9b
Type:        SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt


In [21]:
! qiime demux summarize \
    --i-data $data_dir/sequences-demux-paired.qza \
    --o-visualization $data_dir/sequences-demux-paired.qzv

Saved Visualization to: data/raw/sequences-demux-paired.qzv

In [9]:
Visualization.load(f"{data_dir}/sequences-demux-paired.qzv")

## Denoising and merging

Parameters
- `p-trunc-len-f` / `p-trunc-len-r` - we will truncate the forward and reverse reads to 140 bp (sequences shorter than this will be removed automatically)
- `p-n-threads` - if we have more than 1 CPU available, we can specify the number here to make the processing faster
- `o-table` - this will be our ASVs feature table
- `o-representative-sequences` - this will be a list of all the denoised features (DNA sequences)
- `o-denoising-stats` - this will be some stats from the denoising process

Information on parameters and function: https://docs.qiime2.org/2024.10/plugins/available/dada2/denoise-paired/
Example tutorial of paired read analysis: https://docs.qiime2.org/2024.10/tutorials/atacama-soils/

Even though the forward reads look good, we still need sufficient overlap for merging the paired reads into one continuous sequence. Using a truncation length of 140 for both the forward and reverse reads gives us fewer chimeras and loses fewer reads during denoising than keeping a longer part of the forward read.

So we shorten both reads to balance:
- High-quality bases (for denoising)
- Enough overlap (for merging)
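
The overlap side of this trade-off can be checked with simple arithmetic. The sketch below is only an illustration: the helper `expected_overlap` and the 250 bp amplicon length are assumptions made for the example and are not values from this project. The minimum overlap of 12 bp is the `--min_overlap` value reported in the DADA2 log of the denoising run.

```python
def expected_overlap(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Number of bases shared by the truncated forward and reverse reads."""
    return trunc_len_f + trunc_len_r - amplicon_len


MIN_OVERLAP = 12  # --min_overlap value shown in the DADA2 log of this notebook

# ASSUMPTION: 250 bp is a placeholder amplicon length, not a value from this project.
amplicon_len = 250

overlap = expected_overlap(140, 140, amplicon_len)
print(f"Expected overlap: {overlap} bp, enough to merge: {overlap >= MIN_OVERLAP}")
```

Replacing the placeholder with the real amplicon length of the sequenced region shows how far the reads could be truncated before merging starts to fail.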

In [8]:
# this cell takes a loooong time to run. Time for coffee? 
# Or a cool video? https://www.youtube.com/watch?v=-z4gNr7mN3U
# Or a pull-up!
# Nvm that last one, too difficult
# Find Laura's long lost dads

! qiime dada2 denoise-paired \
    --i-demultiplexed-seqs $data_dir/sequences-demux-paired.qza \
    --p-trunc-len-f 140 \
    --p-trunc-len-r 140 \
    --p-n-threads 3 \
    --o-table $processed_data_dir/dada2_table_140.qza \
    --o-representative-sequences $processed_data_dir/dada2_rep_set_140.qza \
    --o-denoising-stats $processed_data_dir/dada2_stats_140.qza \
    --verbose

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada.R --input_directory /tmp/tmpodeddc0l/forward --input_directory_reverse /tmp/tmpodeddc0l/reverse --output_path /tmp/tmpodeddc0l/output.tsv.biom --output_track /tmp/tmpodeddc0l/track.tsv --filtered_directory /tmp/tmpodeddc0l/filt_f --filtered_directory_reverse /tmp/tmpodeddc0l/filt_r --truncation_length 140 --truncation_length_reverse 140 --trim_left 0 --trim_left_reverse 0 --max_expected_errors 2.0 --max_expected_errors_reverse 2.0 --truncation_quality_score 2 --min_overlap 12 --max_merge_mismatch 0 --trim_overhang False --pooling_method independent --chimera_method consensus --min_parental_fold 1.0 --allow_one_off False --num_threads 3 --learn_min_reads 1000000

R version 4.3.3 (2024-02-29) 
Loading required package

In [6]:
! qiime metadata tabulate \
    --m-input-file $processed_data_dir/dada2_stats_140.qza \
    --o-visualization $processed_data_dir/dada2_stats_140.qzv

There was an issue with loading the file data/processed/dada2_stats_140.qza as metadata:

  Metadata file path doesn't exist, or the path points to something other than a file. Please check that the path exists, has read permissions, and points to a regular file (not a directory): data/processed/dada2_stats_140.qza

  There may be more errors present in the metadata file. To get a full report, sample/feature metadata files can be validated with Keemei: https://keemei.qiime2.org

  Find details on QIIME 2 metadata requirements here: https://docs.qiime2.org/2025.7/tutorials/metadata/

In [7]:
Visualization.load(f"{processed_data_dir}/dada2_stats_140.qzv")

ValueError: data/processed/dada2_stats_140.qzv does not exist.
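
Both errors above say that the DADA2 stats artifact was not on disk when these cells were run, which would happen if the denoising cell had not finished in this session. A small check like the following could be run before the summary cells. It is a minimal sketch: the helper `missing_files` is a hypothetical name introduced here, and the file names are the ones used in the denoising command above.

```python
from pathlib import Path


def missing_files(paths):
    """Return the entries of `paths` that are not existing regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


processed_data_dir = "data/processed"  # same value as set at the top of the notebook

expected_outputs = [
    f"{processed_data_dir}/dada2_table_140.qza",
    f"{processed_data_dir}/dada2_rep_set_140.qza",
    f"{processed_data_dir}/dada2_stats_140.qza",
]

missing = missing_files(expected_outputs)
if missing:
    print("Missing DADA2 outputs, rerun the denoising cell first:")
    for path in missing:
        print("  -", path)
else:
    print("All DADA2 outputs found.")
```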

In [None]:
! qiime feature-table summarize \
    --i-table $processed_data_dir/dada2_table_140.qza \
    --m-sample-metadata-file $data_dir/metadata.tsv \
    --o-visualization $processed_data_dir/dada2_table_140.qzv

In [None]:
Visualization.load(f"{processed_data_dir}/dada2_table_140.qzv")