# 01 Preprocessing

This notebook covers the preprocessing of the 16S rRNA amplicon sequencing data. It includes data import, quality control, denoising, and taxonomic classification.

<img src=./figures/workflow_preprocessing.jpg alt="Description" width="750" height="">

## Setup

Activate the `microbEvolve` environment before running this Jupyter notebook.

This step loads all required packages. The paths to the scripts and the data are stored in the variables `scripts_dir` and `data_dir`.

In [1]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt

%matplotlib inline

In [3]:
scripts_dir = "src"
data_dir = "../data"

## Importing sequencing data and metadata

This step imports the demultiplexed data, which is already available as a QIIME 2 artifact on Polybox.

Next, the metadata is imported. It is provided as an Excel file with three sheets: `DataDictionary`, `metadata_per_sample`, and `metadata_per_age`. Each sheet is exported to its own `.tsv` file (`metadata_dictionary`, `metadata_per_sample`, and `metadata_per_age`). Lastly, the first column of `metadata_per_sample`, which is read in as `Unnamed: 0`, is renamed to `sampleid`.
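The sheet-to-TSV step described above can be sketched in pandas as follows. This is a minimal illustration, not the contents of `importing.sh`: the function name `sheets_to_tsv`, the mapping `SHEET_TO_STEM`, and the Excel file name in the usage comment are assumptions made for this example.

```python
from pathlib import Path

import pandas as pd

# Assumed mapping from Excel sheet name to output file stem (taken from the text above).
SHEET_TO_STEM = {
    "DataDictionary": "metadata_dictionary",
    "metadata_per_sample": "metadata_per_sample",
    "metadata_per_age": "metadata_per_age",
}


def sheets_to_tsv(sheets, out_dir):
    """Write each sheet to <out_dir>/<stem>.tsv and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for sheet_name, stem in SHEET_TO_STEM.items():
        df = sheets[sheet_name]
        if sheet_name == "metadata_per_sample":
            # pandas labels a header-less first column "Unnamed: 0".
            df = df.rename(columns={"Unnamed: 0": "sampleid"})
        path = out_dir / f"{stem}.tsv"
        df.to_csv(path, sep="\t", index=False)
        written[stem] = path
    return written


# Usage (the file name is a placeholder; reading .xlsx files requires openpyxl):
# sheets = pd.read_excel("metadata.xlsx", sheet_name=None)  # dict of DataFrames
# sheets_to_tsv(sheets, "../data/raw")
```

Passing `sheet_name=None` makes `pd.read_excel` return all sheets at once as a dictionary, which keeps the export loop independent of how the workbook is read.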

In [4]:
! sh {scripts_dir}/importing.sh

[2025-12-06 14:40:57] Starting importing script
[2025-12-06 14:40:57] Importing 16S rRNA sequencing data...
--2025-12-06 14:40:59--  https://polybox.ethz.ch/index.php/s/zi5ZBrBwcn7SYof/download/demux-paired-end.qza
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 572079812 (546M) [application/octet-stream]
Saving to: '../data/raw/demux_paired_end.qza'


2025-12-06 14:43:47 (3.25 MB/s) - '../data/raw/demux_paired_end.qza' saved [572079812/572079812]

[2025-12-06 14:43:48] 16S rRNA sequencing import completed and stored in ../data/raw/dada2_rep_set.qza
[2025-12-06 14:43:48] Importing metadata...
--2025-12-06 14:43:48--  https://polybox.ethz.ch/index.php/s/YQQggAqcQCApJmQ/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (pol

## Quality Control

For quality control, we summarized the `demux_paired_end.qza` artifact as a `.qzv` visualization.

In [5]:
! sh {scripts_dir}/quality_control.sh

  import pkg_resources
[31m[1mPlugin error from demux:

  /var/folders/56/1kg2s72566n3fjbq7sy5d0mm0000gn/T/qiime2/yararoth/data/3e48be7c-af27-4f87-b7d2-2726aca3b65d/data is not a(n) SingleLanePerSamplePairedEndFastqDirFmt:

  A pair of paired-end files were found not to have the same number of records. /var/folders/56/1kg2s72566n3fjbq7sy5d0mm0000gn/T/qiime2/yararoth/data/3e48be7c-af27-4f87-b7d2-2726aca3b65d/data/sample_5_S771_L001_R1_001.fastq.gz has 112892 records. /var/folders/56/1kg2s72566n3fjbq7sy5d0mm0000gn/T/qiime2/yararoth/data/3e48be7c-af27-4f87-b7d2-2726aca3b65d/data/sample_51_S266_L001_R1_001.fastq.gz has 14284 records.

Debug info has been saved to /var/folders/56/1kg2s72566n3fjbq7sy5d0mm0000gn/T/qiime2-q2cli-err-svzmrumr.log

In [9]:
Visualization.load(f"{data_dir}/processed/demux_paired_end.qzv")

In the subsampled reads, the minimum and maximum sequence lengths were both 301 bp, for forward and reverse reads alike.

Overall, sequence quality was very high:
- Forward reads: median quality score of 34 across all base positions.
- Reverse reads: median quality score of 34 up to position 293, dropping to 20 from position 294 onward.

Variability was generally higher in the reverse reads and increased substantially from position 221 onward.
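The per-position medians reported above come from the interactive quality plot. As a rough illustration of what such a summary computes, the sketch below decodes FASTQ quality strings and takes the median score at each position. It assumes Phred+33 encoding and uses made-up quality strings; the function names are hypothetical and this is not how QIIME 2 implements the plot.

```python
import statistics


def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def median_quality_per_position(quality_lines):
    """Return the median Phred score at each base position across all reads."""
    columns = {}
    for line in quality_lines:
        for pos, score in enumerate(phred_scores(line)):
            columns.setdefault(pos, []).append(score)
    return [statistics.median(columns[pos]) for pos in sorted(columns)]


# Toy quality strings (made up): 'C' encodes Q34, '5' encodes Q20.
toy = ["CCCC5", "CCC55", "CCCC5"]
print(median_quality_per_position(toy))  # [34, 34, 34, 34, 20]
```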

## Cutadapt and Denoising

### Cutadapt

The V4 region of the 16S rRNA gene is considerably shorter than 301 bp, so we suspected that the reads still contained primer sequences.

#### 1. Initial Trimming Attempt (Original and Modified Primers)

The first attempt used the known V4-specific forward and reverse primer sequences ([source](https://earthmicrobiome.ucsd.edu/protocols-and-standards/16s/)), in both their original and modified versions. We set `--p-discard-untrimmed True` to see how many reads contained a primer.
- This step resulted in zero sequences being trimmed or retained.
- This result indicated that the sequencing facility had already removed the forward and reverse primers before data delivery, which our TA confirmed. Because the primers were anchored, they could only match at the very start of a read, so `--p-discard-untrimmed True` discarded every read even if a reverse-complemented primer was present further along.
- In hindsight, it does not matter whether the original or the modified primers are used: primers are not required to match perfectly, and the one-base difference does not influence the result.
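To illustrate why an anchored search finds nothing once the primers are gone, and why the one-base difference between primer versions is harmless, here is a simplified sketch of anchored matching with IUPAC codes. It counts substitutions only, whereas cutadapt also allows insertions and deletions. The 515F sequences are quoted from the EMP protocol as we understand it and should be checked against the linked page; the function names, the 0.1 error rate, and the toy reads are assumptions for this example.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Assumed 515F primer sequences (original and modified); verify before use.
PRIMER_515F_ORIGINAL = "GTGCCAGCMGCCGCGGTAA"
PRIMER_515F_MODIFIED = "GTGYCAGCMGCCGCGGTAA"


def anchored_mismatches(read, primer):
    """Count substitutions between the primer and the first len(primer) bases of the read."""
    prefix = read[:len(primer)]
    if len(prefix) < len(primer):
        return len(primer)  # read too short: treat as no match
    return sum(base not in IUPAC[code] for base, code in zip(prefix, primer))


def has_anchored_primer(read, primer, max_error_rate=0.1):
    """True if the read starts with the primer within the allowed number of mismatches."""
    return anchored_mismatches(read, primer) <= int(max_error_rate * len(primer))


with_primer = "GTGTCAGCAGCCGCGGTAATACGG"        # toy read that still starts with 515F
without_primer = "TACGGAGGGTGCAAGCGTTAATCGG"    # toy read whose primer was already removed

for primer in (PRIMER_515F_ORIGINAL, PRIMER_515F_MODIFIED):
    print(has_anchored_primer(with_primer, primer),
          has_anchored_primer(without_primer, primer))
```

Both primer versions accept the first toy read (one mismatch at most, which is within tolerance) and reject the second, so a data set without leading primers yields zero matches regardless of which version is used.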

The remaining explanation for the read length is read-through, where sequencing continues past the end of the amplicon into the primer at the opposite end.

#### 2. Identifying and Trimming Read-Through (Successful Strategy)

To identify read-through even though the forward and reverse primers had already been removed from the start of the reads, we searched only for the reverse complements of the primers.
- Approximately 4.5 million sequences were successfully truncated and retained using this approach, which confirms the presence of read-through.
- The successful trim yielded reads with an approximate length of 250 bases, which likely represents the true length of the amplicon.
- **Many reverse reads remained longer** than 250 bases after trimming, suggesting that low base quality towards the 3' end introduced too many mismatches for cutadapt to recognize the reverse complement of the forward primer.
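The read-through search described in this list can be sketched as follows: build the reverse complement of a primer, scan the read for it, and cut the read at the match. This is a simplified, substitution-only illustration and not the contents of `cutadapt.sh`. The 806R sequence is quoted from the EMP protocol as we understand it; the function names, the 0.1 error rate, and the toy read are assumptions.

```python
# Complement table that also covers IUPAC ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

PRIMER_806R = "GGACTACNVGGGTWTCTAAT"  # assumed reverse primer; verify before use


def reverse_complement(seq):
    """Return the reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_adapter(read, adapter, max_error_rate=0.1):
    """Return the first position where the full adapter matches, or -1 if none does."""
    allowed = int(max_error_rate * len(adapter))
    for start in range(len(read) - len(adapter) + 1):
        window = read[start:start + len(adapter)]
        mismatches = sum(b not in IUPAC[c] for b, c in zip(window, adapter))
        if mismatches <= allowed:
            return start
    return -1


def trim_read_through(read, adapter):
    """Cut the read where the adapter starts; leave it unchanged if no match is found."""
    pos = find_adapter(read, adapter)
    return read if pos == -1 else read[:pos]


# In a forward read, read-through runs into the reverse complement of the reverse primer.
adapter = reverse_complement(PRIMER_806R)
toy_read = "ACGTACGTAC" + "ATTAGATACCCTGGTAGTCC" + "AGATCGG"  # made-up insert + primer + tail
print(adapter)
print(trim_read_through(toy_read, adapter))
```

When sequencing errors push the number of mismatches in the primer region above the allowed threshold, `find_adapter` returns -1 and the read keeps its full length, which mirrors the behavior observed for many reverse reads.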

In [6]:
! sh {scripts_dir}/cutadapt.sh

  import pkg_resources
[31m[1mPlugin error from cutadapt:

  /var/folders/56/1kg2s72566n3fjbq7sy5d0mm0000gn/T/qiime2/yararoth/data/3e48be7c-af27-4f87-b7d2-2726aca3b65d/data is not a(n) SingleLanePerSamplePairedEndFastqDirFmt:

  A pair of paired-end files were found not to have the same number of records. /var/folders/56/1kg2s72566n3fjbq7sy5d0mm0000gn/T/qiime2/yararoth/data/3e48be7c-af27-4f87-b7d2-2726aca3b65d/data/sample_5_S771_L001_R1_001.fastq.gz has 112892 records. /var/folders/56/1kg2s72566n3fjbq7sy5d0mm0000gn/T/qiime2/yararoth/data/3e48be7c-af27-4f87-b7d2-2726aca3b65d/data/sample_51_S266_L001_R1_001.fastq.gz has 14284 records.

Debug info has been saved to /var/folders/56/1kg2s72566n3fjbq7sy5d0mm0000gn/T/qiime2-q2cli-err-7eh3c_zw.log

In [None]:
Visualization.load(f"{data_dir}/processed/demux_paired_end_trimmed-modified-primers.qzv")

In [None]:
Visualization.load(f"{data_dir}/processed/demux_paired_end_trimmed-original-primers.qzv")

### Denoising with DADA2

As trimming with cutadapt failed for many reverse reads, we decided to truncate aggressively during denoising to remove any read-through sequences. The truncation lengths were set to 220 bp for forward reads and 200 bp for reverse reads.

These settings resulted in good denoising performance: around 90% of the reads passed the filtering step, and nearly all of those reads could be merged.
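A quick arithmetic check shows why these truncation lengths still leave enough overlap for merging. The sketch assumes an amplicon length of about 250 bases (the approximate length suggested by the cutadapt result) and a minimum merge overlap of 12 bases, which is DADA2's default to our knowledge. The function names are hypothetical.

```python
DEFAULT_MIN_OVERLAP = 12  # assumed DADA2 default for the minimum merge overlap


def truncate(read, trunc_len):
    """Cut a read to trunc_len bases; reads shorter than trunc_len are dropped (None)."""
    return read[:trunc_len] if len(read) >= trunc_len else None


def merge_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Expected overlap of forward and reverse reads after truncation.

    Assumes neither truncated read is longer than the amplicon.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


# 250 is an approximation of the amplicon length, not a measured value.
overlap = merge_overlap(220, 200, 250)
print(overlap, overlap >= DEFAULT_MIN_OVERLAP)  # 170 True
```

With roughly 170 overlapping bases, the truncation could be even more aggressive without endangering the merge step, as long as the real amplicon length does not vary much between taxa.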

In [None]:
! sh {scripts_dir}/denoising.sh

In [7]:
Visualization.load(f"{data_dir}/processed/dada2_stats.qzv")

## Taxonomy

To identify which organisms are present in our samples, we used a pretrained, weighted classifier optimized for stool samples. It targets the V4 region of the 16S rRNA gene (515F/806R) and is based on the SILVA 138.2 database (99% NR). The classifier incorporates taxonomic weights derived from a large collection of human stool samples. This weighting is designed to improve classification accuracy for human gut samples by prioritizing taxa commonly found in that environment.
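The effect of such weights can be illustrated with a toy Bayes calculation, assuming the classifier is a naive Bayes model in which the weights act as prior probabilities for each taxon. All taxa names and numbers below are made up for illustration and do not come from the actual classifier.

```python
def posterior(likelihoods, weights):
    """Combine per-taxon sequence likelihoods with prior weights via Bayes' rule."""
    unnormalized = {taxon: likelihoods[taxon] * weights[taxon] for taxon in likelihoods}
    total = sum(unnormalized.values())
    return {taxon: value / total for taxon, value in unnormalized.items()}


# Made-up likelihoods of one query sequence under two hypothetical taxa.
likelihoods = {"gut_taxon": 0.4, "marine_taxon": 0.6}

uniform_weights = {"gut_taxon": 0.5, "marine_taxon": 0.5}  # unweighted classifier
stool_weights = {"gut_taxon": 0.9, "marine_taxon": 0.1}    # stool-derived weights

for name, weights in (("uniform", uniform_weights), ("stool", stool_weights)):
    result = posterior(likelihoods, weights)
    best = max(result, key=result.get)
    print(name, best, round(result[best], 3))
```

With uniform weights the slightly better sequence match wins, whereas the stool-derived weights shift the call towards the taxon that is common in the gut. This is the intended behavior for ambiguous sequences, but it also means that rare, unexpected taxa need stronger sequence evidence to be reported.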

In [None]:
! sh {scripts_dir}/taxonomy.sh