# 01 Preprocessing

This notebook covers the preprocessing of the 16S rRNA amplicon sequencing data. It includes data import, quality control, denoising, and taxonomic classification.

<img src="./figures/workflow_preprocessing.jpg" alt="Preprocessing workflow" width="750">

## Setup

Activate the environment `microbEvolve` before running this Jupyter notebook. You can also submit this notebook as a SLURM job with the following command. Make sure you submit the job from the `scripts/` directory.

```bash
sbatch --time=03:59:00 --cpus-per-task=16 --mem-per-cpu=6G --output=slurm-%j.out --error=slurm-%j.err --wrap="bash -c 'module load eth_proxy && source $HOME/.bashrc && conda activate microbEvolve && jupyter execute --inplace ./01-1_preprocessing.ipynb'"
```

This step loads all required packages. The paths to the scripts and the data are stored in the variables `scripts_dir` and `data_dir`.

In [1]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
scripts_dir = "src"
data_dir = "../data"

## Importing sequencing data and metadata

This step imports the demultiplexed data, which is already available as a QIIME 2 artifact on Polybox.

Next, the metadata is imported. It is provided as an Excel file with three sheets: `DataDictionary`, `metadata_per_sample`, and `metadata_per_age`. Each sheet is saved as its own `.tsv` file (`metadata_dictionary`, `metadata_per_sample`, and `metadata_per_age`). Lastly, the first column of `metadata_per_sample`, `Unnamed: 0`, is renamed to `sampleid`.
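The sheet-to-file conversion described above can be sketched with pandas as follows. This is only an illustration under assumptions: the actual logic lives in `importing.sh`, which is not shown here, and the helper names `rename_sample_id` and `sheets_to_tsv` are hypothetical.

```python
from pathlib import Path

import pandas as pd

# Sheet name -> output file stem, following the description in the text.
SHEET_TO_STEM = {
    "DataDictionary": "metadata_dictionary",
    "metadata_per_sample": "metadata_per_sample",
    "metadata_per_age": "metadata_per_age",
}


def rename_sample_id(df):
    """Rename the header-less first column (read by pandas as 'Unnamed: 0')."""
    return df.rename(columns={"Unnamed: 0": "sampleid"})


def sheets_to_tsv(sheets, out_dir):
    """Write each sheet (a DataFrame) to its own .tsv file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for sheet_name, stem in SHEET_TO_STEM.items():
        df = sheets[sheet_name]
        if sheet_name == "metadata_per_sample":
            df = rename_sample_id(df)
        path = out_dir / f"{stem}.tsv"
        df.to_csv(path, sep="\t", index=False)
        written[sheet_name] = path
    return written


# Usage (reading .xlsx files requires the openpyxl package):
# sheets = pd.read_excel("../data/raw/metadata.xlsx", sheet_name=None)
# sheets_to_tsv(sheets, "../data/raw")
```

Passing `sheet_name=None` to `pd.read_excel` returns a dictionary with one DataFrame per sheet, which is why the helper takes a dictionary as input.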

In [3]:
! bash {scripts_dir}/importing.sh

[2025-12-17 16:52:27] Starting importing script
[2025-12-17 16:52:27] Importing 16S rRNA sequencing data...
--2025-12-17 16:52:27--  https://polybox.ethz.ch/index.php/s/zi5ZBrBwcn7SYof/download/demux-paired-end.qza


Resolving proxy.service.consul (proxy.service.consul)... 10.205.212.167
Connecting to proxy.service.consul (proxy.service.consul)|10.205.212.167|:3128... connected.
Proxy request sent, awaiting response... 

200 OK
Length: 572079812 (546M) [application/octet-stream]
Saving to: ‘../data/raw/demux_paired_end.qza’

2025-12-17 16:52:28 (302 MB/s) - ‘../data/raw/demux_paired_end.qza’ saved [572079812/572079812]



[2025-12-17 16:52:29] 16S rRNA sequencing import completed and stored in ../data/raw/dada2_rep_set.qza
[2025-12-17 16:52:29] Importing metadata...
--2025-12-17 16:52:29--  https://polybox.ethz.ch/index.php/s/YQQggAqcQCApJmQ/download
Resolving proxy.service.consul (proxy.service.consul)... 10.205.212.167
Connecting to proxy.service.consul (proxy.service.consul)|10.205.212.167|:3128... connected.
Proxy request sent, awaiting response... 

200 OK
Length: 25082 (24K) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘../data/raw/metadata.xlsx’


2025-12-17 16:52:30 (2.97 MB/s) - ‘../data/raw/metadata.xlsx’ saved [25082/25082]

[2025-12-17 16:52:30] Metadata import completed and stored in ../data/raw/metadata.xlsx
[2025-12-17 16:52:30] Saving metadata as single .tsv files...


[2025-12-17 16:52:31] metadata_dictionary, metadata_per_age and metadata_per_sample stored successfuly as single .tsv files in ../data/raw/
[2025-12-17 16:52:31] Import script completed successfully!


## Quality Control

For quality control, we summarized the `demux_paired_end.qza` artifact as a `.qzv` visualization.

In [4]:
! bash {scripts_dir}/quality_control.sh

Saved Visualization to: ../data/processed/demux_paired_end.qzv

In [5]:
Visualization.load(f"{data_dir}/processed/demux_paired_end.qzv")

Among the subsampled reads, both the minimum and the maximum read length were 301 bp for forward and reverse reads.

Overall, sequence quality was very high:
- Forward reads: median quality score of 34 across all base positions.
- Reverse reads: median quality score of 34 up to position 293, dropping to 20 from position 294 onward.

Variability was generally higher in the reverse reads and increased substantially from position 221 onward.
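The drop in reverse-read quality can be located programmatically once per-position median scores are available. The sketch below uses toy profiles built only from the numbers reported above; the function name `first_position_below` and the threshold of 30 are assumptions chosen for illustration, not values used in the pipeline.

```python
def first_position_below(median_quality, threshold):
    """Return the 1-based position of the first median score below threshold."""
    for position, score in enumerate(median_quality, start=1):
        if score < threshold:
            return position
    return None  # the profile never drops below the threshold


# Toy profiles reconstructed from the text (301 positions each).
forward_medians = [34] * 301
reverse_medians = [34] * 293 + [20] * 8

print(first_position_below(forward_medians, 30))  # None
print(first_position_below(reverse_medians, 30))  # 294
```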

## Cutadapt and Denoising

### Cutadapt

The V4 region of the 16S rRNA gene is much shorter than 301 bp, so we suspected that our reads still contained primer sequences.

#### 1. Initial Trimming Attempt (Original and Modified Primers)

The first attempt used the known V4-specific forward and reverse primer sequences ([source](https://earthmicrobiome.ucsd.edu/protocols-and-standards/16s/)). We set `--p-discard-untrimmed True` to see how many reads would be trimmed.
- No sequences were trimmed, so none were retained.
- In hindsight, the choice between the original and the modified primers is irrelevant: the primers are not required to match perfectly, so the one-base difference does not influence the result.
- This indicated that the sequencing facility had already removed the forward and reverse primers before data delivery, which our TA confirmed. Because we anchored the primers, `--p-discard-untrimmed True` caused all reads to be discarded, even those that might contain the reverse complement of a primer.

The other possible explanation for the length of our reads is read-through, where sequencing continues past the end of the amplicon.

#### 2. Identifying and Trimming Read-Through (Successful Strategy)

To be able to identify read-through, even though the forward and reverse primers had already been removed from the sequences, we decided to only look for the reverse complement of the primers.
- Approximately 4.5 million sequences were successfully truncated and retained using this approach, which confirms the presence of read-through.
- The successful truncation yielded reads with an approximate length of 250 bases, which likely represents the true length of the amplicon.
- **Many reverse reads were longer** than 250 bases, suggesting that low base quality towards the read ends introduced too many mismatches for cutadapt to recognize the reverse complement of the forward primer.
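The read-through search can be illustrated with a minimal sketch: take the reverse complement of the opposite primer and truncate the read where it first occurs. The primer sequence below is assumed to be the modified 806R primer from the linked protocol and should be checked against it. Unlike cutadapt, which tolerates a configurable error rate, this toy matcher allows no mismatches beyond IUPAC degeneracy, and all function names are hypothetical.

```python
# IUPAC complement table (degenerate codes included).
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}

# Concrete bases each IUPAC code can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain IUPAC codes."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def find_adapter(read, adapter):
    """Return the 0-based start of the first IUPAC-aware match, or -1."""
    n = len(adapter)
    for start in range(len(read) - n + 1):
        window = read[start:start + n]
        if all(base in IUPAC[code] for base, code in zip(window, adapter)):
            return start
    return -1


def trim_read_through(read, opposite_primer):
    """Cut the read where the reverse complement of the opposite primer begins."""
    start = find_adapter(read, reverse_complement(opposite_primer))
    return read if start == -1 else read[:start]


# Assumed reverse primer (modified 806R); the read is a made-up toy example.
REVERSE_PRIMER = "GGACTACNVGGGTWTCTAAT"
toy_forward_read = "ACGT" * 5 + "ATTAGATACCCTGGTAGTCC" + "GGGG"
print(trim_read_through(toy_forward_read, REVERSE_PRIMER))  # ACGTACGTACGTACGTACGT
```

A read without the reverse-complemented primer is returned unchanged, which mirrors why reverse reads with many low-quality mismatches stayed longer than 250 bases.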

In [6]:
! bash {scripts_dir}/cutadapt.sh

Saved SampleData[PairedEndSequencesWithQuality] to: ../data/raw/demux_paired_end_trimmed-modified-primers.qza
Saved Visualization to: ../data/processed/demux_paired_end_trimmed-modified-primers.qzv
Saved SampleData[PairedEndSequencesWithQuality] to: ../data/raw/demux_paired_end_trimmed-original-primers.qza
Saved Visualization to: ../data/processed/demux_paired_end_trimmed-original-primers.qzv

In [7]:
Visualization.load(f"{data_dir}/processed/demux_paired_end_trimmed-modified-primers.qzv")

In [8]:
Visualization.load(f"{data_dir}/processed/demux_paired_end_trimmed-original-primers.qzv")

### Denoising with DADA2

Because trimming with cutadapt failed for many reverse reads, we truncated aggressively during denoising to remove any remaining read-through sequences. The truncation lengths were set to 220 bp for forward reads and 200 bp for reverse reads.

These truncation lengths resulted in good denoising performance: around 90% of the reads passed the filtering step, and nearly all of those reads could be merged.
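A quick arithmetic check shows why these truncation lengths both remove read-through and still allow merging. The sketch assumes an amplicon length of about 250 bp, as estimated in the cutadapt section, and a minimum merge overlap of 12 bp, which is assumed here to be the DADA2 default and should be verified against the installed version. The function name is hypothetical.

```python
def truncation_summary(trunc_len_f, trunc_len_r, amplicon_len, min_overlap=12):
    """Summarize whether truncated read pairs can still be merged."""
    # Bases covered by both reads once they are aligned on the amplicon.
    overlap = trunc_len_f + trunc_len_r - amplicon_len
    return {
        "overlap": overlap,
        "can_merge": overlap >= min_overlap,
        # Read-through starts after the amplicon ends, so truncating at or
        # before the amplicon length removes it.
        "removes_read_through": max(trunc_len_f, trunc_len_r) <= amplicon_len,
    }


print(truncation_summary(220, 200, amplicon_len=250))
# {'overlap': 170, 'can_merge': True, 'removes_read_through': True}
```

With roughly 170 bp of overlap, there is ample room to truncate further if quality requires it, as long as the overlap stays above the minimum.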

In [9]:
! bash {scripts_dir}/denoising.sh

Saved FeatureTable[Frequency] to: ../data/raw/dada2_table.qza
Saved FeatureData[Sequence] to: ../data/raw/dada2_rep_set.qza
Saved SampleData[DADA2Stats] to: ../data/raw/dada2_stats.qza
Saved Visualization to: ../data/processed/dada2_stats.qzv
Saved Visualization to: ../data/processed/dada2_rep_set.qzv
Saved Visualization to: ../data/processed/dada2_table.qzv

In [10]:
Visualization.load(f"{data_dir}/processed/dada2_stats.qzv")

## Taxonomy

To identify which organisms are present in our samples, we used a pretrained, weighted classifier optimized for stool samples. It targets the V4 region of the 16S rRNA gene (515F/806R) and is based on the SILVA 138.2 database (NR99). The classifier incorporates taxonomic weights derived from a large collection of human stool samples. These weights are intended to improve classification accuracy for human gut samples by prioritizing taxa commonly found in that environment.
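The effect of taxonomic weights can be illustrated with a toy Bayesian calculation in which the posterior is proportional to the sequence likelihood times the class weight. This is a conceptual sketch only, not the classifier's actual implementation; the taxon names, likelihoods, and weights are invented for illustration.

```python
def posterior(likelihoods, weights=None):
    """Combine per-taxon likelihoods with class weights and normalize."""
    if weights is None:
        # Uniform weights correspond to an unweighted classifier.
        weights = {taxon: 1.0 for taxon in likelihoods}
    unnormalized = {t: likelihoods[t] * weights[t] for t in likelihoods}
    total = sum(unnormalized.values())
    return {t: value / total for t, value in unnormalized.items()}


def best_taxon(probabilities):
    """Return the taxon with the highest posterior probability."""
    return max(probabilities, key=probabilities.get)


# Invented numbers: the sequence fits the soil taxon slightly better,
# but the gut taxon is far more common in stool samples.
likelihoods = {"gut_taxon": 0.45, "soil_taxon": 0.55}
stool_weights = {"gut_taxon": 0.9, "soil_taxon": 0.1}

print(best_taxon(posterior(likelihoods)))                 # soil_taxon
print(best_taxon(posterior(likelihoods, stool_weights)))  # gut_taxon
```

When two reference taxa fit a sequence almost equally well, the weights tip the decision toward the taxon that is more plausible in the sampled environment.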

In [11]:
! ./$scripts_dir/taxonomy.sh

[2025-12-17 17:13:50] Starting taxonomy classification script


[2025-12-17 17:13:50] Input file verified: ../data/raw/dada2_rep_set.qza
[2025-12-17 17:13:50] Downloading weighted classifier for human stool samples...
--2025-12-17 17:13:50--  https://www.arb-silva.de/fileadmin/silva_databases/current/QIIME2/2025.7/SSU/V4-515f-806r/weighted/human-stool/SILVA138.2_SSURef_NR99_weighted_classifier_V4-515f-806r_human-stool.qza
Resolving proxy.service.consul (proxy.service.consul)... 10.205.212.167
Connecting to proxy.service.consul (proxy.service.consul)|10.205.212.167|:3128... connected.


Proxy request sent, awaiting response... 

200 OK
Length: 164259156 (157M) [application/octet-stream]
Saving to: ‘../data/raw/silva-138-99-515-806-nb-classifier-weighted-stool.qza’

2025-12-17 17:13:55 (35.6 MB/s) - ‘../data/raw/silva-138-99-515-806-nb-classifier-weighted-stool.qza’ saved [164259156/164259156]

[2025-12-17 17:13:55] Weighted classifier downloaded successfully
[2025-12-17 17:13:55] Classifier file verified
[2025-12-17 17:13:55] Starting taxonomic classification with weighted classifier...


Saved FeatureData[Taxonomy] to: ../data/raw/taxonomy_weighted_stool.qza

[2025-12-17 17:18:56] Weighted taxonomic classification completed
[2025-12-17 17:18:56] Creating visualization for weighted taxonomy results...


Saved Visualization to: ../data/processed/taxonomy_weighted_stool.qzv

[2025-12-17 17:19:30] Weighted taxonomy visualization created
[2025-12-17 17:19:30] Creating taxa bar plot for weighted taxonomy results...


Saved Visualization to: ../data/processed/taxa-bar-plots_weighted.qzv

[2025-12-17 17:20:04] Taxa bar plot for weighted taxonomy results created
[2025-12-17 17:20:04] Taxonomy classification script completed successfully!
