# **Sample Processing – Nextflow Assembly Pipeline (GHRU)**

Samples sequenced on the Illumina platform were processed with the **GHRU assembly pipeline**. This notebook documents the pipeline's **inputs, execution steps, parameters, and expected outputs**.

The workflow is implemented in **Nextflow**, which makes runs reproducible and lets them scale across many samples.

## Pipeline Description

The GHRU assembly pipeline is a Nextflow-based workflow designed for **short-read whole-genome assembly**. It trims and quality-checks the reads, assembles them, evaluates assembly quality and contamination, and assigns species, producing quality-controlled outputs suitable for downstream genomic analyses.

This pipeline is used as the **initial processing step** prior to annotation, typing, and comparative genomics.

## Cloning the Pipeline

The GHRU assembly repository contains the Nextflow workflow for genome assembly. The pipeline was cloned from the official GitHub repository:

https://github.com/ghruproject/GHRU-assembly

The repository was cloned into a dedicated pipeline directory to keep workflow code separate from analysis outputs and notebooks.

In [None]:
%%bash

git clone https://github.com/ghruproject/GHRU-assembly.git

In [24]:
%%bash

cd /home/nidhi/GHRU-assembly
git rev-parse HEAD

733404aaba789fad54ef942b3d4a036f11882ad5


The commit hash was recorded so that the exact pipeline version used here can be checked out again, which keeps the analysis reproducible.
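
When the analysis is repeated on another machine, the recorded hash can be compared against the local clone before running anything. The following is a minimal sketch rather than part of the original analysis: `check_commit` is a hypothetical helper name, and it assumes `git` is available on the `PATH`.

```bash
# Hypothetical helper (not part of the GHRU pipeline): succeeds only if the
# repository in <repo_dir> is currently checked out at <expected_sha>.
check_commit() {
  local repo_dir="$1" expected_sha="$2" actual_sha
  # git -C runs the command inside repo_dir without changing directory;
  # return 2 if repo_dir is not a usable git repository
  actual_sha=$(git -C "$repo_dir" rev-parse HEAD) || return 2
  [ "$actual_sha" = "$expected_sha" ]
}
```

For this notebook the call would be `check_commit /home/nidhi/GHRU-assembly 733404aaba789fad54ef942b3d4a036f11882ad5`. If it fails, running `git checkout` with the recorded hash inside the clone restores that version.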

## Tools Included in the Pipeline

The GHRU assembly pipeline integrates multiple tools. Each tool is listed below with the output directory (or directories) that hold its results.

- **Trimmomatic**  
  Used for adapter removal and quality trimming of raw Illumina paired-end reads.  
  Output directory:
  - `trimmed_fastqs`

- **FastQC**  
  Used to assess sequencing read quality before and after trimming.  
  Output directories:
  - `qc_summary`
  - `post_trimming_short_read_stats`

- **Shovill (SPAdes)**  
  Used for short-read genome assembly from trimmed Illumina reads.  
  Output directory:
  - `assemblies`

- **QUAST**  
  Used to evaluate assembly quality metrics such as genome length, number of contigs, and N50.  
  Output directory:
  - `quast_summary`

- **Sylph**  
  Used for k-mer–based species identification directly from sequencing reads.  
  Output directory:
  - `sylph_summary`

- **SpecCheck**  
  Used to validate and confirm species assignments generated by Sylph.  
  Output directory:
  - `speccheck`

- **CheckM**  
  Used to assess genome completeness and contamination of assembled genomes.  
  Output directory:
  - `checkm_summary`

- **Species Classification (Sylph + SpecCheck)**  
  Combined species-level assignment based on read-based prediction and validation.  
  Output directory:
  - `speciation_summary`

## Execution Metadata

- **Pipeline:** GHRU assembly pipeline (Nextflow)
- **Sequencing platform:** Illumina
- **Read type:** Paired-end
- **Assembly strategy:** Short-read assembly using Shovill (SPAdes backend)

## Input Files

The pipeline requires a **CSV samplesheet** as input.

### Samplesheet format

The samplesheet must contain the following columns:

- `sample_id` – Unique identifier for each sample  
- `short_reads1` – Absolute path to forward reads (R1 FASTQ)  
- `short_reads2` – Absolute path to reverse reads (R2 FASTQ)

Each row corresponds to one sample.

### Input Requirements

- FASTQ files must be **paired-end Illumina reads**
- File paths must be **absolute paths**
- Sample IDs should not contain spaces or special characters
- All input files must be readable by the execution environment
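
The requirements above can be checked before launching the pipeline. The sketch below is illustrative only: `validate_samplesheet` is a hypothetical helper, not something shipped with the GHRU pipeline, and the set of allowed sample ID characters (letters, digits, dot, underscore, hyphen) is an assumption, since the requirement only rules out spaces and special characters. It also assumes a plain comma-separated file with no quoted fields.

```bash
# Hypothetical pre-flight check (not part of the GHRU pipeline).
# Prints one message per problem to stderr; returns non-zero if any were found.
validate_samplesheet() {
  local sheet="$1" errors=0 line_no=0 sample_id r1 r2 extra f

  [ -r "$sheet" ] || { echo "cannot read samplesheet: $sheet" >&2; return 1; }

  # "|| [ -n ... ]" keeps a final line that lacks a trailing newline
  while IFS=, read -r sample_id r1 r2 extra || [ -n "$sample_id" ]; do
    line_no=$((line_no + 1))

    # Line 1 must be exactly the expected three-column header
    if [ "$line_no" -eq 1 ]; then
      if [ "$sample_id,$r1,$r2,$extra" != "sample_id,short_reads1,short_reads2," ]; then
        echo "line 1: unexpected header" >&2
        errors=$((errors + 1))
      fi
      continue
    fi

    # Skip completely empty lines
    [ -n "$sample_id$r1$r2" ] || continue

    # Assumed allowed characters: letters, digits, dot, underscore, hyphen
    case "$sample_id" in
      ''|*[!A-Za-z0-9._-]*)
        echo "line $line_no: invalid sample_id '$sample_id'" >&2
        errors=$((errors + 1)) ;;
    esac

    # Both read files must be given as absolute, readable paths
    for f in "$r1" "$r2"; do
      case "$f" in
        /*) if [ ! -r "$f" ]; then
              echo "line $line_no: file not readable: $f" >&2
              errors=$((errors + 1))
            fi ;;
        *)  echo "line $line_no: not an absolute path: '$f'" >&2
            errors=$((errors + 1)) ;;
      esac
    done
  done < "$sheet"

  # Pass only if the file had at least a header and no problems were found
  [ "$errors" -eq 0 ] && [ "$line_no" -gt 0 ]
}
```

Calling `validate_samplesheet` with the path to the samplesheet returns success only when every row passes, so it can gate the pipeline launch in the same cell.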

In [None]:
%%bash

# Go to FASTQ folder
cd /data/internship_data/nidhi/aba/new_fastqs || exit 1   # stop if the folder is missing

# Create header
echo "sample_id,short_reads1,short_reads2" > ../fastq_sheet.csv

# Loop over common paired-end patterns: sample_1.fastq.gz or sample_R1.fastq.gz
for r1 in *_1.fastq.gz *_1.fq.gz *_R1.fastq.gz *_R1.fq.gz; do
  [ -e "$r1" ] || continue
  # derive r2 name by replacing _1 with _2 or _R1 with _R2
  if [[ "$r1" == *_1.* ]]; then
    r2="${r1/_1./_2.}"
  else
    r2="${r1/_R1./_R2.}"
  fi
  [ -e "$r2" ] || { echo "Warning: partner R2 not found for $r1" >&2; continue; }
  sample=$(echo "$r1" | sed -E 's/_R?1\.(fastq|fq)\.gz$//')   # strip only the trailing _1/_R1 and file extension
  echo "${sample},$(realpath "$r1"),$(realpath "$r2")" >> ../fastq_sheet.csv
done

# Show resulting CSV
echo "Written:" && realpath ../fastq_sheet.csv
column -s, -t ../fastq_sheet.csv | sed -n '1,5p'

Written:
/data/internship_data/nidhi/aba/fastq_sheet.csv
sample_id  short_reads1                                                     short_reads2
ABA-1000   /data/internship_data/nidhi/aba/new_fastqs/ABA-1000_R1.fastq.gz  /data/internship_data/nidhi/aba/new_fastqs/ABA-1000_R2.fastq.gz
ABA-1001   /data/internship_data/nidhi/aba/new_fastqs/ABA-1001_R1.fastq.gz  /data/internship_data/nidhi/aba/new_fastqs/ABA-1001_R2.fastq.gz
ABA-1002   /data/internship_data/nidhi/aba/new_fastqs/ABA-1002_R1.fastq.gz  /data/internship_data/nidhi/aba/new_fastqs/ABA-1002_R2.fastq.gz
ABA-1003   /data/internship_data/nidhi/aba/new_fastqs/ABA-1003_R1.fastq.gz  /data/internship_data/nidhi/aba/new_fastqs/ABA-1003_R2.fastq.gz


## Pipeline Execution

The pipeline was executed within a Conda-managed environment. To avoid excessive runtime output in the Jupyter notebook, standard output and error streams were redirected to an external log file.

In [19]:
%%bash

# initialise conda
source /home/anaconda/miniconda3/etc/profile.d/conda.sh
conda activate nextflow

# move into GHRU-assembly folder 
cd /home/nidhi/GHRU-assembly

# command for execution
nextflow run main.nf \
  --samplesheet /data/internship_data/nidhi/aba/fastq_sheet.csv \
  --outdir /data/internship_data/nidhi/aba/new_output/nextflow_output \
  > /data/internship_data/nidhi/aba/new_output/logs/nextflow.log 2>&1
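
Because the redirect hides any failure from the notebook, it can help to report the exit status explicitly. Below is a minimal sketch of that pattern, assuming a hypothetical wrapper named `run_logged` (not part of Nextflow or the GHRU pipeline). It runs any command with both streams sent to a log file, creates the log directory if it does not exist, and prints the end of the log when the command fails.

```bash
# Hypothetical wrapper: run_logged <log_file> <command> [args...]
# Sends stdout and stderr of the command to log_file and reports the outcome.
run_logged() {
  local log_file="$1" status=0
  shift
  # Make sure the directory for the log exists before redirecting into it
  mkdir -p "$(dirname "$log_file")" || return 1
  "$@" > "$log_file" 2>&1 || status=$?
  if [ "$status" -eq 0 ]; then
    echo "OK: log written to $log_file"
  else
    echo "FAILED (exit $status): last lines of $log_file" >&2
    tail -n 20 "$log_file" >&2
  fi
  # Propagate the exit status of the wrapped command
  return "$status"
}
```

In the cell above, this would mean passing the log path as the first argument, followed by the same `nextflow run` command and its options, in place of the trailing redirect.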

## Output Directory Structure

The pipeline generates the following outputs:

In [20]:
%%bash

ls /data/internship_data/nidhi/aba/new_output/nextflow_output

assemblies
checkm_summary
post_trimming_short_read_stats
qc_summary
quast_summary
speccheck
speciation_summary
sylph_summary
trimmed_fastqs


All results are written to the output directory passed to `--outdir` at runtime, with one subdirectory per result type.
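
A quick way to confirm that a run produced every expected result folder is to test for each one by name. This is a minimal sketch, assuming the nine directory names from the listing above; `check_outputs` is a hypothetical helper and the pipeline itself provides no such command.

```bash
# Hypothetical helper: check_outputs <outdir>
# Reports each expected subdirectory that is missing; non-zero if any are.
check_outputs() {
  local outdir="$1" missing=0 d
  for d in assemblies checkm_summary post_trimming_short_read_stats \
           qc_summary quast_summary speccheck speciation_summary \
           sylph_summary trimmed_fastqs; do
    if [ ! -d "$outdir/$d" ]; then
      echo "missing: $outdir/$d" >&2
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

For this run the call would be `check_outputs /data/internship_data/nidhi/aba/new_output/nextflow_output`. The check covers only the presence of the directories, not whether every sample has results inside them.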