# 2. Quality Control and Denoising
## Import data & packages

In [4]:
#Import all necessary packages
import IPython
import pandas as pd
import matplotlib.pyplot as plt
import os
import qiime2 as q2
from qiime2 import Visualization

%matplotlib inline

In [5]:
#Set working directory
os.chdir("/home/jovyan/Project/MicrobiomeAnalysis_TummyTribe/")

# Verify that your working directory is the overall project folder (.../MicrobiomeAnalysis_TummyTribe)
print("Current working directory:", os.getcwd())

Current working directory: /home/jovyan/Project/MicrobiomeAnalysis_TummyTribe


In [6]:
# Directories for the raw and the processed data
data_dir = "data/raw"
processed_data_dir = "data/processed"

## Quality Control

In [5]:
! qiime tools peek $data_dir/sequences-demux-paired.qza

UUID:        b4782ab7-550b-41f5-b906-ca2cda29ca9b
Type:        SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt


In [6]:
! qiime demux summarize \
    --i-data $data_dir/sequences-demux-paired.qza \
    --o-visualization $data_dir/sequences-demux-paired.qzv

Saved Visualization to: data/raw/sequences-demux-paired.qzv

In [7]:
Visualization.load(f"{data_dir}/sequences-demux-paired.qzv")

## Denoising and merging

Parameters
- `p-trunc-len-f` / `p-trunc-len-r` - we will truncate the forward and reverse reads to 140 bp (reads shorter than this will be discarded automatically)
- `p-n-threads` - if we have more than 1 CPU available, we can specify the number here to make the processing faster
- `o-table` - this will be our ASV feature table
- `o-representative-sequences` - this will be a list of all the denoised features (DNA sequences)
- `o-denoising-stats` - this will be some stats from the denoising process

Information on parameters and function: https://docs.qiime2.org/2024.10/plugins/available/dada2/denoise-paired/
Example tutorial of paired read analysis: https://docs.qiime2.org/2024.10/tutorials/atacama-soils/

Even though the forward reads look good, we still need sufficient overlap to merge the paired reads into one continuous sequence. Truncating both the forward and the reverse reads at 140 bp gives us fewer chimeras and loses fewer reads during denoising than keeping a longer part of the forward read.

So we shorten both reads to balance:
- High-quality bases (for denoising)
- Enough overlap (for merging)
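
The overlap side of this trade-off can be checked with simple arithmetic. The sketch below is only an illustration: `expected_overlap` is a hypothetical helper, the amplicon length of 250 bp is an assumed value (not measured from this dataset), and primers and length variation are ignored. The minimum of 12 nucleotides is the figure quoted in the `denoise-paired` help text.

```python
def expected_overlap(amplicon_len: int, trunc_len_f: int, trunc_len_r: int) -> int:
    """Bases shared by the truncated forward and reverse reads.

    A negative value means the two reads no longer meet at all.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


# Minimum overlap quoted in the denoise-paired help text
MIN_OVERLAP = 12

# ASSUMPTION: hypothetical amplicon length, for illustration only
amplicon_len = 250

overlap = expected_overlap(amplicon_len, trunc_len_f=140, trunc_len_r=140)
print(f"Expected overlap: {overlap} bp, enough to merge: {overlap >= MIN_OVERLAP}")
```

If the real amplicon is longer than assumed here, the overlap shrinks by the same amount, so the truncation lengths would have to be raised accordingly.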

In [8]:
# this cell takes a loooong time to run. Time for coffee? 
# Or a cool video? https://www.youtube.com/watch?v=-z4gNr7mN3U
# Or a pull-up!
# Nvm that last one, too difficult
# Find Laura's long lost dads

! qiime dada2 denoise-paired \
    --i-demultiplexed-seqs $data_dir/sequences-demux-paired.qza \
    --p-trunc-len-f 140 \
    --p-trunc-len-r 140 \
    --p-n-threads 3 \
    --o-table $processed_data_dir/dada2_table_140.qza \
    --o-representative-sequences $processed_data_dir/dada2_rep_set_140.qza \
    --o-denoising-stats $processed_data_dir/dada2_stats_140.qza \
    --verbose

Usage: qiime dada2 denoise-paired [OPTIONS]

  This method denoises paired-end sequences, dereplicates them, and filters
  chimeras.

Inputs:
  --i-demultiplexed-seqs ARTIFACT SampleData[PairedEndSequencesWithQuality]
                          The paired-end demultiplexed sequences to be
                          denoised.                                 [required]
Parameters:
  --p-trunc-len-f INTEGER Position at which forward read sequences should be
                          truncated due to decrease in quality. This truncates
                          the 3' end of the of the input sequences, which will
                          be the bases that were sequenced in the last cycles.
                          Reads that are shorter than this value will be
                          discarded. After this parameter is applied there
                          must still be at least a 12 nucleotide overlap
                

In [9]:
! qiime metadata tabulate \
    --m-input-file $processed_data_dir/dada2_stats_140.qza \
    --o-visualization $processed_data_dir/dada2_stats_140.qzv

Saved Visualization to: data/processed/dada2_stats_140.qzv

In [10]:
Visualization.load(f"{processed_data_dir}/dada2_stats_140.qzv")
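
When reading the denoising stats, it can help to express each step as a share of the input reads per sample. The following is a minimal sketch on made-up numbers: the sample names and counts are hypothetical, and the column names are assumptions modelled on the stats table, so they may need adjusting to the actual export.

```python
import pandas as pd

# ASSUMPTION: hypothetical per-sample read counts, for illustration only
stats = pd.DataFrame(
    {
        "input": [10000, 8000, 12000],
        "filtered": [9000, 7600, 9000],
        "merged": [8000, 7000, 6000],
        "non-chimeric": [7500, 6800, 5400],
    },
    index=["sample_a", "sample_b", "sample_c"],
)

# Percentage of the input reads that survive each step, per sample
steps = ["filtered", "merged", "non-chimeric"]
retained_pct = stats[steps].div(stats["input"], axis=0).mul(100).round(1)
print(retained_pct)
```

A large drop between `filtered` and `merged` would point to insufficient overlap, while a large drop at `non-chimeric` would point to many chimeras.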

In [11]:
! qiime feature-table summarize \
    --i-table $processed_data_dir/dada2_table_140.qza \
    --m-sample-metadata-file $data_dir/metadata.tsv \
    --o-visualization $processed_data_dir/dada2_table_140.qzv

Saved Visualization to: data/processed/dada2_table_140.qzv

In [12]:
Visualization.load(f"{processed_data_dir}/dada2_table_140.qzv")

We end up with 3,358 unique features (ASVs) and roughly 16 million reads, so overall we have good microbial richness and sequencing depth.

Before rarefaction, we may need to exclude the samples with very low read counts (below 5,000). The mean is also higher than the median, so the distribution is right-skewed by a few high-depth samples. This kind of skew is common in amplicon data, but it means we will need to rarefy carefully to balance retention of samples and features.
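
The two observations above (mean above median, and a minimum-depth cut-off) can be illustrated on a small example. This is a sketch with invented per-sample read counts, not values from our feature table; only the 5,000-read threshold is taken from the text.

```python
import pandas as pd

# ASSUMPTION: hypothetical per-sample read counts, for illustration only
depths = pd.Series(
    [1200, 4800, 9000, 11000, 12500, 14000, 95000],
    index=[f"sample_{i}" for i in range(1, 8)],
)

mean_depth = depths.mean()
median_depth = depths.median()
# A mean well above the median indicates a right-skewed distribution
print(f"mean = {mean_depth:.0f}, median = {median_depth:.0f}")

# Samples that would survive a minimum-frequency filter of 5,000 reads
min_frequency = 5000
kept = depths[depths >= min_frequency]
print(f"kept {len(kept)} of {len(depths)} samples")
```

In this toy example a single very deep sample pulls the mean far above the median, which is the same pattern the summary shows for our data.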

In [7]:
! qiime feature-table filter-samples \
      --i-table $processed_data_dir/dada2_table_140.qza \
      --p-min-frequency 5000 \
      --o-filtered-table $processed_data_dir/sample_frequency_filtered_table.qza

Saved FeatureTable[Frequency] to: data/processed/sample_frequency_filtered_table.qza

In [8]:
! qiime feature-table summarize \
    --i-table $processed_data_dir/sample_frequency_filtered_table.qza \
    --m-sample-metadata-file $data_dir/metadata.tsv \
    --o-visualization $processed_data_dir/sample_frequency_filtered_table.qzv

Saved Visualization to: data/processed/sample_frequency_filtered_table.qzv

In [9]:
Visualization.load(f"{processed_data_dir}/sample_frequency_filtered_table.qzv")