# **CHAPTER 1. Quality control and taxonomic identification**

**Create the conda environment and activate it**

```
conda env create -f qc_n_tax_id.yaml
```

```
conda activate qc_n_tax_id
```

> Disclaimer: This part of the work was performed on a university server. The taxonomic identification step using Kraken2 and the PlusPF database requires a minimum of 84 GB of RAM.
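
Before starting the taxonomic identification step, it can be worth checking that the machine meets the memory requirement stated above. The following is a minimal sketch for Linux-like systems; the helper names `total_ram_gb` and `has_enough_ram` are illustrative, and GB is taken here to mean 10^9 bytes, which is an assumption.

```python
import os

# Minimum stated in the disclaimer above for Kraken2 with PlusPF
REQUIRED_GB = 84


def total_ram_gb():
    """Return total physical memory in GB (10**9 bytes); Linux/Unix only."""
    page_size = os.sysconf('SC_PAGE_SIZE')
    page_count = os.sysconf('SC_PHYS_PAGES')
    return page_size * page_count / 1e9


def has_enough_ram(total_gb, required_gb=REQUIRED_GB):
    """Return True if total_gb meets the memory requirement."""
    return total_gb >= required_gb


if __name__ == '__main__':
    ram = total_ram_gb()
    status = 'OK' if has_enough_ram(ram) else 'insufficient'
    print(f'Total RAM: {ram:.1f} GB ({status} for {REQUIRED_GB} GB)')
```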

## **Part 1. Combine several `.fastq` files by sample**

Import `os` and `shutil`

In [None]:
import os
import shutil

Combine files

In [None]:
import glob

# Directory with the source sequences
raw_sequences_dir = 'samples/raw_sequences'
# Directory for merged sequences
unified_sequences_dir = 'samples/unified_sequences'

# List of sample folders
samples = ['D1', 'D2', 'D3', 'D4', 'D5', 'P1', 'P2', 'P3', 'P4', 'P5']

for sample in samples:
    # Path to the current sample folder
    sample_dir = os.path.join(raw_sequences_dir, sample)
    # Folder for merged files of the current sample (created if missing)
    sample_output_dir = os.path.join(unified_sequences_dir, sample)
    os.makedirs(sample_output_dir, exist_ok=True)

    for read in ('R1', 'R2'):
        # Sort the inputs so that R1 and R2 chunks are merged in the same order
        pattern = os.path.join(sample_dir, f'*_{read}_*.fastq.gz')
        files = sorted(glob.glob(pattern))
        if not files:
            raise FileNotFoundError(f'No files match {pattern}')

        # Path for saving the merged file
        combined_path = os.path.join(sample_output_dir, f'{sample}_combined_{read}.fastq.gz')

        # Concatenated gzip files form a valid gzip file, so a byte-level copy is enough
        with open(combined_path, 'wb') as combined:
            for path in files:
                with open(path, 'rb') as source:
                    shutil.copyfileobj(source, combined)

print("File merge completed.")
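
After merging, a quick sanity check is to confirm that the combined R1 and R2 files of a sample contain the same number of reads. The sketch below is one way to do this, assuming standard four-line FASTQ records (no wrapped sequence lines); `count_fastq_reads` and `check_pair` are illustrative helper names, not part of the original workflow.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    # gzip.open reads through concatenated gzip members transparently
    with gzip.open(path, 'rt') as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f'{path}: {n_lines} lines is not a multiple of 4')
    return n_lines // 4


def check_pair(r1_path, r2_path):
    """Return the shared read count, or raise if R1 and R2 differ."""
    n_r1 = count_fastq_reads(r1_path)
    n_r2 = count_fastq_reads(r2_path)
    if n_r1 != n_r2:
        raise ValueError(f'Read count mismatch: {n_r1} (R1) vs {n_r2} (R2)')
    return n_r1
```

For example, `check_pair('samples/unified_sequences/D1/D1_combined_R1.fastq.gz', 'samples/unified_sequences/D1/D1_combined_R2.fastq.gz')` would return the number of read pairs for sample `D1`.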

## **Part 2. Quality control**

`QC_pipeline` is a Snakemake pipeline that first runs `FastQC` and then merges its reports with `MultiQC`.

In [None]:
! snakemake -s QC_pipeline --cores all --use-conda

When the pipeline finishes, the `FastQC` reports are stored in the `fastqc_reports` folder<br>
and the `MultiQC` reports are stored in the `multiqc_reports` folder.
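
To confirm that the QC run covered every sample, one can check that a `FastQC` report exists for each merged file. The sketch below assumes that the pipeline ran `FastQC` on the `*_combined_R1/R2.fastq.gz` files and wrote the reports directly into `fastqc_reports`; the pipeline file is not shown here, so this layout is an assumption. `FastQC` names its outputs after the input file with a `_fastqc.html` suffix.

```python
import os

SAMPLES = ['D1', 'D2', 'D3', 'D4', 'D5', 'P1', 'P2', 'P3', 'P4', 'P5']


def expected_fastqc_reports(samples, reads=('R1', 'R2')):
    """Build the FastQC HTML report names expected for the merged files."""
    return [f'{s}_combined_{r}_fastqc.html' for s in samples for r in reads]


def missing_reports(report_dir, samples):
    """List expected FastQC HTML reports that are absent from report_dir."""
    return [
        name for name in expected_fastqc_reports(samples)
        if not os.path.isfile(os.path.join(report_dir, name))
    ]


if __name__ == '__main__':
    # 'fastqc_reports' is the output folder named in the text above
    missing = missing_reports('fastqc_reports', SAMPLES)
    print('All reports present.' if not missing else f'Missing: {missing}')
```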

## **Part 3. Taxonomic identification**

> Disclaimer: Do not forget to install the PlusPF (full) database.

**Option 1**: Direct download<br>
1. Visit https://benlangmead.github.io/aws-indexes/k2<br>
2. Download the PlusPF (full) `.tar.gz` file<br>
3. Extract the archive
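
The extraction step of Option 1 can also be done from Python. The sketch below unpacks a downloaded archive and checks for the three files a Kraken2 database needs (`hash.k2d`, `opts.k2d`, `taxo.k2d`). It assumes the archive unpacks these files directly into the target directory; the archive file name in the usage note is hypothetical.

```python
import os
import tarfile

# Files that Kraken2 requires in a database directory
REQUIRED_DB_FILES = ('hash.k2d', 'opts.k2d', 'taxo.k2d')


def extract_database(archive_path, db_dir):
    """Unpack a .tar.gz database archive into db_dir."""
    os.makedirs(db_dir, exist_ok=True)
    with tarfile.open(archive_path, 'r:gz') as archive:
        archive.extractall(db_dir)


def missing_db_files(db_dir):
    """List required Kraken2 database files that are absent from db_dir."""
    return [
        name for name in REQUIRED_DB_FILES
        if not os.path.isfile(os.path.join(db_dir, name))
    ]
```

For example, `extract_database('k2_pluspf.tar.gz', os.path.expanduser('~/PlusPF'))` followed by `missing_db_files(os.path.expanduser('~/PlusPF'))` should return an empty list if the database is complete.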

**Option 2**: `kraken2` built-in method (Recommended)<br>

```
kraken2-build --pluspf --threads $THREADS --db $DBNAME
```

Replace `$THREADS` with the number of threads to use<br>
Replace `$DBNAME` with the desired path to the database

> Note: `--pluspf` is not among the `kraken2-build` options listed in the Kraken2 manual. If your installed version rejects it, use Option 1 instead.

In our setup, the PlusPF database was located at `~/PlusPF`.

`kraken2_pipeline` is a Snakemake pipeline that runs `kraken2` and `bracken` against the `PlusPF` database.

In [None]:
! snakemake -s kraken2_pipeline --cores all --use-conda
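
The pipeline file itself is not reproduced here. As a rough idea of what such a pipeline might run for each sample, the sketch below builds `kraken2` and `bracken` command lines. The per-sample output layout, the `kraken_output.txt` and `bracken_output.txt` file names, the read length of 150, and the species level are all assumptions made for illustration, not settings taken from `kraken2_pipeline`.

```python
import os
import shlex


def kraken2_command(sample, db_dir, in_dir, out_dir, threads=8):
    """Build a kraken2 command for one paired-end, gzipped sample."""
    r1 = os.path.join(in_dir, sample, f'{sample}_combined_R1.fastq.gz')
    r2 = os.path.join(in_dir, sample, f'{sample}_combined_R2.fastq.gz')
    return [
        'kraken2', '--db', db_dir, '--threads', str(threads),
        '--paired', '--gzip-compressed',
        '--report', os.path.join(out_dir, sample, 'kraken_report.txt'),
        '--output', os.path.join(out_dir, sample, 'kraken_output.txt'),
        r1, r2,
    ]


def bracken_command(sample, db_dir, out_dir, read_len=150, level='S'):
    """Build a bracken command that re-estimates abundances from the report."""
    return [
        'bracken', '-d', db_dir,
        '-i', os.path.join(out_dir, sample, 'kraken_report.txt'),
        '-o', os.path.join(out_dir, sample, 'bracken_output.txt'),
        '-r', str(read_len), '-l', level,
    ]


if __name__ == '__main__':
    db = os.path.expanduser('~/PlusPF')
    cmd = kraken2_command('D1', db, 'samples/unified_sequences', 'kraken2_bracken')
    print(shlex.join(cmd))
```

Building the commands as lists rather than strings means they can be passed to `subprocess.run` without shell quoting issues.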

When the pipeline finishes, the reports (`kraken_report.txt`) are stored in the `kraken2_bracken` folder.
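
To inspect a report programmatically, the standard Kraken2 report can be parsed as six tab-separated columns: percentage of reads in the clade, reads in the clade, reads assigned directly to the taxon, rank code, NCBI taxonomy ID, and the indented scientific name. The following is a minimal sketch under that assumption; `ReportRow`, `parse_kraken_report` and `top_species` are illustrative names.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float
    clade_reads: int
    direct_reads: int
    rank: str
    taxid: int
    name: str


def parse_kraken_report(lines):
    """Parse the lines of a standard six-column Kraken2 report."""
    rows = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 6:
            continue  # skip blank or non-standard lines
        pct, clade, direct, rank, taxid, name = fields
        rows.append(ReportRow(float(pct), int(clade), int(direct),
                              rank.strip(), int(taxid), name.strip()))
    return rows


def top_species(rows, n=10):
    """Return the n species-level rows with the most clade reads."""
    species = [row for row in rows if row.rank == 'S']
    return sorted(species, key=lambda row: row.clade_reads, reverse=True)[:n]
```

A report file can be passed in directly, for example `with open(report_path) as fh: rows = parse_kraken_report(fh)`, where `report_path` points to one of the `kraken_report.txt` files.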