# Submodule 3: Workflow automation with Nextflow and acquisition of public datasets
--------
## Overview

The primary goal of this submodule is to **generate multiple genome datasets** that will be used in the subsequent submodule on comparative genomics. This will be a short submodule in terms of content, but it is the most computationally intensive submodule.

To achieve this, we will **automate the processes outlined in Submodules 1 & 2** using **Nextflow**, a workflow management platform designed to streamline bioinformatics pipelines and make them reproducible. With Nextflow we can automate the entire workflow, control program versions, protect data integrity, and process multiple datasets in parallel. This lets us run many genome datasets at once while managing the computational resources available in our environment.

After executing the Nextflow workflow on our set of genomes, we will supplement our data with **reference-quality genome sequences from NCBI** in **Submodule 4**.


### Learning Objectives
- **Learn Reproducibility in Bioinformatics**:  
  Understand the importance of reproducibility in bioinformatics workflows and learn how to implement practices that ensure consistent, reliable results across different systems and environments.

- **Automate Genome Data Processing**:  
  Learn how to automate and streamline the processes of genome assembly, quality control, and annotation to handle multiple genome datasets efficiently using the workflow management platform **Nextflow**.

- **Execute Parallel Processing and Ensure Data Integrity**:  
  Gain proficiency in managing and processing large sets of genomic data simultaneously, while also developing the ability to monitor, assess, and maintain data integrity and quality throughout the workflow.

- **Prepare and Refine Genome Sequences for Comparative Genomics**:  
  Refine and prepare a collection of genome datasets, ensuring they are processed, cleaned, and ready for detailed examination in comparative genomics.

## The Importance of Reproducibility in Bioinformatics

Reproducibility is a cornerstone of scientific research, particularly in bioinformatics, where computational analyses of biological data play a significant role in advancing our understanding of genomics, disease, and evolution. In bioinformatics, ensuring that analyses can be repeated with the same results is critical for validating findings and building trust within the scientific community. This is especially important given the large datasets and complex analyses often involved, where slight variations in workflows, software versions, or data can lead to different outcomes. Reproducible research ensures that experiments can be independently verified, improving the credibility of results and facilitating the development of robust, evidence-based conclusions.

To support reproducibility in bioinformatics, many workflows are built using standardized, transparent tools and platforms, such as **Nextflow**, **Snakemake**, and **Galaxy**, which allow scientists to document and automate every step of their analysis. This makes it possible for others to re-run the same pipeline on their own systems, using the same parameters and input data, thus ensuring consistent results.
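One small practice that supports re-running a pipeline on the same input data is recording checksums of the input files, so that anyone repeating the analysis can confirm they started from identical data. The sketch below is illustrative only: the function names `write_manifest` and `verify_manifest` and the manifest file name are hypothetical, and it assumes GNU `sha256sum` is installed.

```shell
#!/usr/bin/env bash
# Sketch: record and verify SHA-256 checksums for a directory of FASTQ files.
# Assumes GNU coreutils `sha256sum`; all names here are illustrative and are
# not part of the course pipeline.

# write_manifest DIR MANIFEST: write one checksum line per *.fastq.gz in DIR.
write_manifest() {
    local dir=$1 manifest=$2
    ( cd "$dir" && sha256sum -- *.fastq.gz ) > "$manifest"
}

# verify_manifest DIR MANIFEST: succeed only if every listed file still matches.
# The manifest is fed on stdin so a relative MANIFEST path keeps working after cd.
verify_manifest() {
    local dir=$1 manifest=$2
    ( cd "$dir" && sha256sum --check --quiet ) < "$manifest"
}

# Example calls (paths are assumptions):
#   write_manifest raw-reads raw-reads.sha256
#   verify_manifest raw-reads raw-reads.sha256 && echo "inputs unchanged"
```

Keeping the manifest next to the workflow code gives collaborators a quick way to check their copy of the data before they compare results.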

### FAIR Principles

The FAIR principles are guidelines designed to enhance the visibility and accessibility of research data, particularly in bioinformatics. FAIR stands for **Findable, Accessible, Interoperable, and Reusable**. These principles aim to make scientific data easier to find, use, and share, which is especially important in bioinformatics, where data often comes from multiple sources and is essential for advancing knowledge in genetics, medicine, and ecology.

- **Findable**: Data should be easy to locate through searchable metadata, unique identifiers, and standardized naming conventions.
- **Accessible**: Data should be available in formats that are easy to access and retrieve, even if only through proper authentication or permissions.
- **Interoperable**: Data and tools should be compatible, allowing for the exchange of information across systems, platforms, and communities.
- **Reusable**: Data should be well-documented, with clear descriptions of methods, metadata, and usage licenses, so others can confidently reuse it in future research.

Adhering to the FAIR principles enhances collaboration, promotes transparency, and ensures that bioinformatics data can be effectively utilized by researchers across disciplines, ultimately accelerating scientific discovery and innovation.


### Tools to Aid in Reproducibility in Bioinformatics
To ensure reproducibility in bioinformatics workflows, several tools and platforms are available that help automate, document, and track the analysis process. These tools facilitate consistent execution of pipelines, improve collaboration, and ensure that results can be verified and reproduced across different environments and systems.

| Tool               | Description | Link |
|--------------------|-------------|------|
| **Nextflow**       | A workflow management system for running bioinformatics pipelines in a highly scalable, reproducible, and portable manner. | [Nextflow Documentation](https://www.nextflow.io/) |
| **RepeatFS**       | A file system that records the provenance of file operations so that analyses can be replicated and verified. | [RepeatFS GitHub](https://github.com/RepeatFS) |
| **Snakemake**      | A popular workflow management system for bioinformatics, known for its simple syntax and scalability. | [Snakemake Documentation](https://snakemake.readthedocs.io/en/stable/) |
| **Cromwell**       | A workflow engine designed for scientific workflows, with built-in support for WDL (Workflow Description Language). | [Cromwell GitHub](https://github.com/broadinstitute/cromwell) |
| **Galaxy**         | A web-based platform for data-intensive research, offering tools for reproducible data analysis and workflow management. | [Galaxy Project](https://galaxyproject.org/) |
| **Airflow**        | An open-source tool for orchestrating complex workflows, with strong scheduling and monitoring features. | [Apache Airflow](https://airflow.apache.org/) |
| **Bash / Python**  | Bash and Python scripts can automate tasks, with Bash being simple and flexible for Unix-like systems and Python offering more extensive libraries for bioinformatics tasks. | [Bash Scripting Guide](https://www.gnu.org/software/bash/manual/bash.html), [Python Documentation](https://docs.python.org/3/) |
| **Other Languages**| Other programming languages like **R**, **Perl**, and **Ruby** can be used for bioinformatics workflows, depending on the user's needs and preferences. | N/A |
| **PipeLine**       | A pipeline management tool that allows for easy tracking and execution of bioinformatics workflows, especially useful in large-scale projects. | [PipeLine GitHub](https://github.com/Pipeline) |


## Nextflow Overview

**Nextflow** is a powerful workflow management platform designed for bioinformatics and data science. It enables the automation of complex workflows, facilitating reproducibility, parallel execution of tasks, and efficient use of computational resources across different environments. Nextflow is especially useful in genomics pipelines where large-scale datasets need to be processed with consistent results.

---

### Key Features of Nextflow:
- **Reproducibility**: Nextflow ensures that workflows are reproducible by capturing the environment, dependencies, and exact command used for each execution.
- **Parallelization**: It efficiently runs tasks in parallel, utilizing computational resources optimally, which speeds up large-scale data analyses.
- **Compatibility**: Nextflow supports multiple computing environments, including local machines, cloud infrastructure, and high-performance computing clusters.
- **Version Control**: Integrated support for managing software versions and dependencies, ensuring consistency across different systems and workflows.
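
As a rough illustration of the reproducibility feature listed above, Nextflow can write an execution report, a task trace, and a timeline for each run. The sketch below only assembles and prints such a command unless Nextflow and a `main.nf` script are present. The helper `run_or_show` and the record file names are assumptions made for this example; the `-with-report`, `-with-trace`, and `-with-timeline` options are Nextflow run options.

```shell
#!/usr/bin/env bash
# Sketch: ask Nextflow to write run records alongside the results.
# -with-report, -with-trace and -with-timeline are Nextflow run options;
# the file names and the run_or_show helper are assumptions for this example.

# run_or_show CMD...: print the command, then run it only when Nextflow
# and a main.nf script are available in the current directory.
run_or_show() {
    echo "Command: $*"
    if command -v nextflow >/dev/null 2>&1 && [ -f main.nf ]; then
        "$@"
    else
        echo "Skipped: nextflow or main.nf not found here."
    fi
}

# Time-stamp the record files so repeated runs do not collide.
stamp=$(date +%Y%m%d-%H%M%S)

run_or_show nextflow run main.nf \
    -with-report "report-$stamp.html" \
    -with-trace "trace-$stamp.txt" \
    -with-timeline "timeline-$stamp.html"
```

Storing these records with the results makes it easier to see later which tasks ran, in what order, and with what resources.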

---

### Learning Resources for Nextflow

| Resource | Description | Link |
|----------|-------------|------|
| **Nextflow Official Documentation** | Comprehensive guide on installation, syntax, and core concepts. | [Nextflow Documentation](https://www.nextflow.io/docs/latest/) |
| **Nextflow Tutorials** | Step-by-step tutorials for beginners and advanced users. | [Nextflow Tutorials](https://www.nextflow.io/docs/latest/tutorial/) |
| **Nextflow YouTube Channel** | Recorded webinars, conference talks, and demos for visual learners. | [Nextflow YouTube Channel](https://www.youtube.com/c/Nextflow) |
| **Nextflow Community Forum** | A forum for discussing issues, asking questions, and sharing knowledge. | [Nextflow Forum](https://www.nextflow.io/community/) |
| **Nextflow GitHub Repository** | Access to source code, examples, and community contributions. | [Nextflow GitHub](https://github.com/nextflow-io/nextflow) |
| **Bioinformatics with Nextflow – A Book** | In-depth resource for bioinformaticians, with practical examples. | [Bioinformatics with Nextflow](https://www.packtpub.com/product/bioinformatics-with-nextflow/9781801071363) |


---

By using these resources, you’ll be able to gain the skills necessary to leverage Nextflow in your genomic data analysis workflows and harness its full potential for automation and scalability.

---

## Install Required Software

The only additional tool required for this submodule is **Nextflow**. All the tools used by the Nextflow workflow were installed and described in the previous submodules. To install Nextflow, see the official documentation: https://www.nextflow.io/docs/latest/install.html.
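
Before launching the workflow, it can help to confirm that each program is reachable on the `PATH`. The following is a minimal sketch, not part of the course materials: the function `check_tools` is hypothetical, and the executable names passed to it are assumptions about how each tool is installed on your system. Nextflow itself needs a Java runtime, so `java` is included in the list; the installation page linked above states the supported versions.

```shell
#!/usr/bin/env bash
# Sketch: confirm required executables are on PATH before launching the workflow.
# The executable names at the bottom are assumptions about how each program
# was installed in the earlier submodules.

# check_tools NAME...: print one status line per tool; return 1 if any is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'found    %s\n' "$tool"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools java nextflow fastqc fastp spades.py bwa samtools \
    busco quast.py blastn blobtools bakta \
    || echo "Install the missing tools before running the workflow."
```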


## Run the Nextflow Workflow

### Overview

Next, we will run a bash command that starts the Nextflow workflow on a directory of raw FASTQ files. Each pair of FASTQ files holds the sequencing data generated from a single isolate. We begin by listing the contents of the directory and then execute the workflow, which automates all the steps from Submodules 1 and 2. Nextflow determines from the data dependencies which steps can run concurrently (e.g., BLAST and BWA read mapping) and which must wait for others to finish (e.g., BlobTools runs only after BLAST and related processes are complete).

First, let's take a quick look at the format of Nextflow code.

### Example of Nextflow Step: Trimming Paired-End Reads with FASTP
Below is a snippet from `wgs-nf/main.nf` that shows how a Nextflow process wraps a command. In this step of the workflow, we trim paired-end sequencing reads using **FASTP**, a fast and efficient tool for preprocessing high-throughput sequencing data. The `RUN_FASTP` process takes paired-end reads as input and outputs the trimmed versions of the reads. Here's the Nextflow code to perform this task:

```nextflow
// Step 2: Trim paired-end reads with FASTP
process RUN_FASTP {
    input:
    tuple val(sampleid), path(fq1), path(fq2)
    
    output:
    tuple val(sampleid), path("${sampleid}.trimmed_R1.fastq.gz"), path("${sampleid}.trimmed_R2.fastq.gz")
    
    script:
    """
    fastp -i $fq1 -I $fq2 \
        -o "$sampleid".trimmed_R1.fastq.gz -O "$sampleid".trimmed_R2.fastq.gz \
        --thread ${params.threads}
    """
}
```


### Example of Nextflow Workflow: Putting it all together

A similar block of code exists for each step of the workflow, with its own options and thread settings. Once all the processes are written, the `workflow` block connects them and defines how data flows from one step to the next. Below is the `workflow` block for this pipeline; it includes steps for read assessment, quality control, genome assembly, and several downstream analyses.

```nextflow
workflow {
    // Load paired-end FASTQ files
    fastq_files = Channel
    .fromFilePairs(params.reads, flat: false)
    .map { sample_id, files -> tuple(sample_id, files[0], files[1]) }

    // Step 0: Count reads
    read_info = fastq_files | ASSESS_READS
    read_info.view()

    // Step 1: Run FASTQC
    fastqc_report = fastq_files | RUN_FASTQC

    // Step 2: Trim reads with FASTP
    trimmed_fastq = fastq_files | RUN_FASTP

    // Step 3: Assemble genome with SPADES
    genome = trimmed_fastq | RUN_SPADES

    // Step 4.1: Run BWA and Samtools
    bwa_channel = trimmed_fastq.join(genome)
    bam = RUN_BWA(bwa_channel)

    // Step 4.2: Run BUSCO
    busco_results = genome | RUN_BUSCO

    // Step 4.3: Run QUAST
    quast_report = genome | RUN_QUAST

    // Step 4.4: Run BLAST
    blast_results = genome | RUN_BLAST

    // Step 5: Run Blobtools
    busco_channel = genome.join(bam).join(blast_results)
    blob_output = RUN_BLOBTOOLS(busco_channel)
}
```



```bash
%%bash

cd wgs-nf/

# starting directory
echo "Directory contents:"
echo "--------------------"

ls raw-reads/
```

```text
Directory contents:
--------------------
full-file
testagain2_1.fastq.gz
testagain2_2.fastq.gz
testagain_1.fastq.gz
testagain_2.fastq.gz
```
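
The workflow expects every sample to have both a forward and a reverse read file. As a quick check before launching, the sketch below lists the sample IDs found in a reads directory and flags any sample without a mate file. It assumes the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming seen in the listing above; the function name `list_pairs` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list sample IDs in a reads directory and flag samples missing a mate file.
# Assumes the <sample>_1.fastq.gz / <sample>_2.fastq.gz naming convention.

# list_pairs DIR: print "<sample><TAB>ok" or "<sample><TAB>missing_R2"
# for each forward-read file found in DIR.
list_pairs() {
    local dir=$1 r1 sample
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        sample=$(basename "$r1" _1.fastq.gz)
        if [ -e "$dir/${sample}_2.fastq.gz" ]; then
            printf '%s\tok\n' "$sample"
        else
            printf '%s\tmissing_R2\n' "$sample"
        fi
    done
}

# Example call (path taken from this notebook's layout):
#   list_pairs wgs-nf/raw-reads
```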


```bash
%%bash

# move into working directory
cd wgs-nf/

# run Nextflow, reusing cached results from earlier runs
# (-resume is a Nextflow option and takes a single dash)
nextflow run main.nf -resume
```

The output below was recorded from an earlier run of this cell, which was launched with `--continue`. The progress display is condensed to its final frames; the capture is cut off and the cell ended with an error.

```text
Nextflow 24.10.3 is available - Please consider updating your version to it

 N E X T F L O W   ~  version 24.04.4

Launching `main.nf` [elegant_church] DSL2 - revision: d4f95c633b

executor >  local (1)
[-        ] ASSESS_READS  | 0 of 2
[-        ] RUN_FASTQC    | 0 of 2
[58/8a3928] RUN_FASTP (1) | 0 of 2
[-        ] RUN_SPADES    -
[-        ] RUN_BWA       -
[-        ] RUN_BUSCO     -
[-        ] RUN_QUAST     -
[-        ] RUN_BLAST     -
[-        ] RUN_BLOBTOOLS -
[-        ] RUN_BAKTA     -

executor >  local (2)
[-        ] ASSESS_READS   | 0 of 2
[8e/a002af] RUN_FASTQC (1) | 0 of 2
[58/8a3928] RUN_FASTP (1)  | 0 o

CalledProcessError: Command 'b'\ncd wgs-nf\n\nnextflow run main.nf --continue\n\n'' returned non-zero exit status 1.
```

## Explanation:
The command above launches the Nextflow workflow so that it picks up where earlier runs left off. Nextflow's built-in flag for this is `-resume`, written with a single dash; a double-dash argument such as `--continue` is instead passed to the pipeline as a parameter. With `-resume`, Nextflow reuses cached results for tasks whose inputs and scripts are unchanged and runs only the tasks that still need to run. For the sake of time, most outputs were pre-prepared, so only the summaries and finalization should have run. The main outputs are a directory of proteomes and a report file. We also saved selected files from processes throughout the workflow (BlobTools plots, etc.). Let's view these files.

```bash
%%bash

# view contents of output directory
ls wgs-nf/output-dir/*
```

```text
wgs-nf/output-dir/output-blast:
testagain2-blast-genome-vs-db.tsv

wgs-nf/output-dir/output-fastqc:
output-fastqc

wgs-nf/output-dir/output-genome:
testagain.fasta
testagain2.fasta

wgs-nf/output-dir/output-quast:
testagain2_report.txt
```
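
With many isolates, scanning a listing by eye becomes error-prone. The sketch below reports, for each sample in the reads directory, whether a non-empty assembly file exists. It assumes reads are named `<sample>_1.fastq.gz` and assemblies `<sample>.fasta`, as in the listings in this notebook; the function name `check_assemblies` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report which samples in the reads directory have an assembled genome.
# Assumes reads named <sample>_1.fastq.gz and assemblies named <sample>.fasta.

# check_assemblies READS_DIR GENOME_DIR: print one line per sample and
# return 1 if any sample lacks a non-empty assembly file.
check_assemblies() {
    local reads_dir=$1 genome_dir=$2 r1 sample status=0
    for r1 in "$reads_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        sample=$(basename "$r1" _1.fastq.gz)
        if [ -s "$genome_dir/$sample.fasta" ]; then
            printf '%s\tassembled\n' "$sample"
        else
            printf '%s\tmissing\n' "$sample"
            status=1
        fi
    done
    return "$status"
}

# Example call (paths taken from this notebook's layout):
#   check_assemblies wgs-nf/raw-reads wgs-nf/output-dir/output-genome
```

The same pattern could be pointed at the other output directories by changing the expected file name.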


```bash
%%bash

# view the results file
cat wgs-nf/results_summary.tsv
```

```text
cat: wgs-nf/results_summary.tsv: No such file or directory

CalledProcessError: Command 'b'\ncat wgs-nf/results_summary.tsv\n'' returned non-zero exit status 1.
```

```bash
%%bash

# final output directory for the next steps

ls wgs-nf/output-dir/proteomes
```
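
Before moving on to comparative genomics, a quick sanity check is to count the protein records in each proteome file. The sketch below is a rough illustration: it assumes the proteomes are uncompressed FASTA files in which each record starts with `>`, and the function name `count_sequences` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count FASTA records in every file of a proteome directory.
# Assumes uncompressed FASTA files in which each record header starts with ">".

# count_sequences DIR: print "<file name><TAB><number of records>" per file.
count_sequences() {
    local dir=$1 f n
    for f in "$dir"/*; do
        [ -f "$f" ] || continue           # skip subdirectories and empty globs
        # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
        n=$(grep -c '^>' "$f" || true)
        printf '%s\t%s\n' "$(basename "$f")" "$n"
    done
}

# Example call (path taken from this notebook's layout):
#   count_sequences wgs-nf/output-dir/proteomes
```

A proteome with far fewer records than the others is worth inspecting before it is included in downstream comparisons.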