# Running Nextflow from Colab

This guide provides the code needed to run nf-core pipelines from Google Colab notebooks.

## Installing Java

 Nextflow is a workflow management system that is written in the Groovy programming language. Groovy runs on the **Java Virtual Machine (JVM)**, which means that a Java Development Kit (JDK) or Java Runtime Environment (JRE) is a non-negotiable prerequisite.
 
 This code block uses the `apt` package manager (native to the Ubuntu-based Colab environment) to install Java.
 
 * `!apt update`: This refreshes the list of available packages from the software repositories.
 
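 * `!apt install -y openjdk-17-jdk`: This installs version 17 of the open-source Java Development Kit. The `-y` flag answers the confirmation prompt automatically.
 
 * `os.environ[...]`: These lines set the `JAVA_HOME` and `PATH` environment variables so the system knows where to find the Java executables. In a regular shell this is done with `export`, but in Colab each `!` command runs in a separate shell, so an `!export` would not persist for later commands or cells. Setting the variables through `os.environ` changes the notebook process itself, and every later `!` command inherits them. (The package installation often configures the default Java path correctly on its own, so this step is mainly a safeguard.)

In [None]:
import os

!apt update
!apt install -y openjdk-17-jdk
# Set the variables on the notebook process; `!export` would only affect a throwaway subshell
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]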
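
The note above about separate shells can be checked directly. The following is a minimal sketch using only the Python standard library; `DEMO_VAR` and `run_shell` are hypothetical names used for illustration. It suggests that a variable exported in one subshell is gone in the next, while a variable set through `os.environ` is inherited by every later child process.

```python
import os
import subprocess


def run_shell(command: str) -> str:
    """Run a command in a fresh shell, much as each `!` line does, and return its output."""
    result = subprocess.run(["sh", "-c", command], capture_output=True, text=True)
    return result.stdout.strip()


# Make sure the hypothetical variable starts out unset.
os.environ.pop("DEMO_VAR", None)

# An export inside one subshell disappears when that shell exits.
run_shell("export DEMO_VAR=from_subshell")
after_export = run_shell('echo "${DEMO_VAR:-unset}"')

# A value set on the Python process is inherited by later subshells.
os.environ["DEMO_VAR"] = "from_python"
after_environ = run_shell('echo "${DEMO_VAR:-unset}"')

print(after_export)   # unset
print(after_environ)  # from_python
```

The same reasoning applies to `JAVA_HOME` and `PATH`: values set from Python are visible to any `!` command run afterwards.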

## Installing Nextflow

With Java installed, we can now install Nextflow. This cell uses the standard quick-install method provided by the Nextflow team.


1.  `!wget -qO- https://get.nextflow.io | bash`: This command downloads the installer script from `get.nextflow.io` (`-q` means "quiet" and `-O-` means "write to standard output") and pipes (`|`) the script's content to the `bash` interpreter, which executes it. The script places the `nextflow` executable in the current directory.

2.  `!mv nextflow /usr/bin/nextflow`: We move the downloaded `nextflow` file from our local directory to `/usr/bin/`. This directory is part of the system's `PATH`, which allows us to run the `nextflow` command from any location.

3.  `!chmod +x /usr/bin/nextflow`: This command modifies the file's permissions to make it executable (`+x`).

4.  `!nextflow -v`: Finally, we run `nextflow -v` (version) to test the installation and confirm that the system can find and execute the program.

In [None]:
!wget -qO- https://get.nextflow.io | bash # Download Nextflow
!mv nextflow /usr/bin/nextflow # Move to a path Colab can access
!chmod +x /usr/bin/nextflow # Make it executable
!nextflow -v # Test it
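
To make steps 2 to 4 more concrete, here is a small Python sketch of the same mechanics: move a file into a directory, set its execute bits, and look it up by name. It runs entirely in temporary directories; `demo-tool` and `install_executable` are hypothetical names, and this is an illustration of the idea rather than a replacement for the cell above.

```python
import os
import shutil
import stat
import tempfile


def install_executable(source: str, bin_dir: str) -> str:
    """Move a file into bin_dir and mark it executable, like `mv` followed by `chmod +x`."""
    target = os.path.join(bin_dir, os.path.basename(source))
    shutil.move(source, target)
    mode = os.stat(target).st_mode
    os.chmod(target, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return target


# Throwaway directories standing in for the download location and /usr/bin.
download_dir = tempfile.mkdtemp()
bin_dir = tempfile.mkdtemp()
script = os.path.join(download_dir, "demo-tool")
with open(script, "w") as handle:
    handle.write("#!/bin/sh\necho demo-tool 0.0\n")

# Before installation the tool cannot be found in bin_dir.
before = shutil.which("demo-tool", path=bin_dir)
installed = install_executable(script, bin_dir)
# After installation a lookup restricted to bin_dir should locate it.
after = shutil.which("demo-tool", path=bin_dir)
print(before, after)
```

`shutil.which` performs the same kind of lookup the shell does when you type `nextflow`: it searches the given directories for an executable file with that name.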

## Setting up Conda

Bioinformatics pipelines often depend on many different software tools, each with its own specific version requirements. Managing these dependencies manually is extremely difficult.

**Conda** is a package and environment manager that solves this problem. Nextflow can integrate directly with Conda, allowing it to automatically create isolated environments for each step of the pipeline and install the exact software versions needed.

This cell uses `condacolab`, a small Python library, to install the Conda package manager directly into our Google Colab environment. The `-q` flag for `pip` means "quiet," suppressing the installation output.

After this cell runs, the Colab kernel will restart to activate the Conda installation. Wait for the restart to finish before running the cells that follow.

In [None]:
!pip install -q condacolab # -q here means quiet
import condacolab
condacolab.install()

In [None]:
!conda config --add channels bioconda
!conda config --add channels conda-forge
!conda config --set channel_priority strict
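
The order of the two `--add` commands matters, because `conda config --add` places each channel at the top of the list, so the channel added last ends up with the highest priority. The sketch below is a plain-Python model of that behaviour, not conda's own code; `add_channel` is a hypothetical helper, and the starting list `["defaults"]` is an assumption, since the initial channels depend on the installer.

```python
def add_channel(channels: list[str], name: str) -> list[str]:
    """Model `conda config --add channels NAME`: put NAME first, without duplicates."""
    return [name] + [c for c in channels if c != name]


channels = ["defaults"]  # assumed starting point
for name in ["bioconda", "conda-forge"]:
    channels = add_channel(channels, name)

print(channels)  # ['conda-forge', 'bioconda', 'defaults']
```

With `channel_priority strict`, a package is taken from the highest-priority channel that provides it, so here `conda-forge` is consulted before `bioconda`.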

## Running a Test pipeline

Before running our complex analysis, we'll test our environment with `nf-core/demo`, a simple pipeline designed for this purpose.

The `nextflow pull` command (commented out here) downloads the pipeline's code and dependency definitions from the nf-core repository. This "pulls" the pipeline into the local Nextflow cache, which can make the subsequent `run` command start faster.

**Note**: Uncomment the next cells to run the test.

In [None]:
#! nextflow pull nf-core/demo

In [None]:
#! nextflow run nf-core/demo -profile conda,test --outdir demo-results

# Running Taxprofiler pipeline

**nf-core/taxprofiler** is a bioinformatics best-practice analysis pipeline for taxonomic classification and profiling of shotgun short- and long-read metagenomic data. It allows for in-parallel taxonomic identification of reads or taxonomic abundance estimation with multiple classification and profiling tools against multiple databases, and produces standardised output tables for facilitating results comparison between different tools and databases.

![image.png](https://raw.githubusercontent.com/Multiomics-Analytics-Group/course_multi-omics_data_science/refs/heads/main/metagenomics/notebooks/img/taxprofiler_tube.png)


### Creating Data Folders for Metagenomics Analysis

Now we begin the setup for our real analysis. The first step is to create a structured set of directories to keep our files organized.

* `!mkdir metagenomics`: Creates a main parent directory for the project.

* `!mkdir metagenomics/data`: Creates a subdirectory to hold our raw sequencing data and the sample sheet.

* `!mkdir metagenomics/databases`: Creates a subdirectory to hold our database configuration file.

In [None]:
!mkdir metagenomics
!mkdir metagenomics/data
!mkdir metagenomics/databases
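
Running the `mkdir` cell a second time reports an error because the directories already exist. As an optional alternative, the following sketch builds the same layout with `pathlib` in a way that is safe to rerun. `make_project_dirs` is a hypothetical helper; it is demonstrated in a temporary directory here, and in the notebook you would pass `Path(".")` instead.

```python
import tempfile
from pathlib import Path


def make_project_dirs(base: Path) -> list[Path]:
    """Create metagenomics/data and metagenomics/databases under base."""
    created = []
    for name in ("data", "databases"):
        path = base / "metagenomics" / name
        # parents=True also creates `metagenomics`; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstrated in a temporary directory so nothing is written to the project.
base = Path(tempfile.mkdtemp())
first = make_project_dirs(base)
second = make_project_dirs(base)  # the second call succeeds instead of raising
print([p.relative_to(base).as_posix() for p in second])
```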

### Downloading Sample Sheet and Database File for nf-core/Taxprofiler

The `nf-core/taxprofiler` pipeline requires two main configuration files to run:

1.  **Sample Sheet**: This `sample_sheet.csv` file is the primary input. It's a table that tells the pipeline what samples to process and, critically, where to find their corresponding raw sequencing files (the forward and reverse reads). We use `wget` to download a pre-configured sample sheet and save it to `metagenomics/data/`.

2.  **Database Sheet**: This `database_full_v1.2.csv` file tells Taxprofiler which databases to use, which profiling tool each one belongs to (e.g., Kraken2, MetaPhlAn), and what parameters to pass when running that tool. We download this file and save it to `metagenomics/databases/`.

In [None]:
! wget https://raw.githubusercontent.com/Multiomics-Analytics-Group/course_multi-omics_data_science/refs/heads/main/metagenomics/data/sample_sheet.csv -O metagenomics/data/sample_sheet.csv
! wget https://raw.githubusercontent.com/Multiomics-Analytics-Group/course_multi-omics_data_science/refs/heads/main/metagenomics/databases/database_full_v1.2.csv -O metagenomics/databases/database_full_v1.2.csv
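
Since the sample sheet tells the pipeline where the reads live, it can help to look at it before launching anything. The sketch below parses CSV text with the standard library and lists referenced files that are missing on disk. `parse_sheet` and `missing_files` are hypothetical helpers, and the column names `sample`, `fastq_1`, and `fastq_2` are assumptions to check against the header of the downloaded file.

```python
import csv
import io
from pathlib import Path


def parse_sheet(text: str) -> list[dict]:
    """Parse CSV text into one dictionary per row, keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_files(rows: list[dict], columns=("fastq_1", "fastq_2")) -> list[str]:
    """List referenced paths that do not exist (the column names are an assumption)."""
    missing = []
    for row in rows:
        for column in columns:
            value = (row.get(column) or "").strip()
            if value and not Path(value).exists():
                missing.append(value)
    return missing


# Hypothetical one-sample sheet, for illustration only.
example = "sample,fastq_1,fastq_2\ndemo,/no/such/demo_R1.fastq.gz,\n"
rows = parse_sheet(example)
print(rows[0]["sample"])    # demo
print(missing_files(rows))  # ['/no/such/demo_R1.fastq.gz']
```

For the real file, `parse_sheet(Path("metagenomics/data/sample_sheet.csv").read_text())` would return its rows. Relative paths are resolved against the current working directory, so `missing_files` is most informative once the FASTQ downloads have finished.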

### Download Taxprofiler configuration
 
 Metagenomics analysis can be very computationally intensive, requiring large amounts of RAM and many CPUs. Since Google Colab provides a resource-limited environment, running the pipeline with its default settings (which are designed for servers or clusters) would likely cause it to crash.
 
 This command downloads a custom Nextflow configuration file named `low_resources.config`. This file contains settings that override the pipeline's defaults, instructing it to use less memory and fewer CPUs for each step. We will later pass this file to Nextflow using the `-c` flag.

In [None]:
! wget https://raw.githubusercontent.com/Multiomics-Analytics-Group/course_multi-omics_data_science/refs/heads/main/metagenomics/low_resources.config -O metagenomics/low_resources.config
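
To see what the current Colab runtime actually offers, a quick check like the one below may be useful. It is a minimal sketch that assumes a Linux runtime (which Colab provides); `available_resources` is a hypothetical helper, and the values it reports can be compared by eye with the limits set in `low_resources.config`.

```python
import os


def available_resources() -> tuple[int, float]:
    """Return the CPU count and total physical memory in GiB for this machine."""
    cpus = os.cpu_count() or 1
    # Total memory = page size in bytes * number of physical pages.
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return cpus, total_bytes / 1024**3


cpus, memory_gib = available_resources()
print(f"CPUs: {cpus}, memory: {memory_gib:.1f} GiB")
```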

### Downloading Raw Sequencing Data (FASTQ files) 💾

The data come from the [Inflammatory Bowel Disease Multi'omics Database (IBDMdb)](https://ibdmdb.org/).

All files can be found in the [Download Data](https://ibdmdb.org/results) option.

In [None]:
# Sample HSMA33OT
! wget https://g-227ca.190ebd.75bc.data.globus.org/ibdmdb/raw/HMP2/MGX/2018-05-04/HSMA33OT.tar -O metagenomics/data/HSMA33OT.tar
! tar -xf metagenomics/data/HSMA33OT.tar -C metagenomics/data
! rm metagenomics/data/HSMA33OT.tar

In [None]:
# Sample CSM9X233
! wget https://g-227ca.190ebd.75bc.data.globus.org/ibdmdb/raw/HMP2/MGX/2018-05-04/CSM9X233.tar -O metagenomics/data/CSM9X233.tar
! tar -xf metagenomics/data/CSM9X233.tar -C metagenomics/data
! rm  metagenomics/data/CSM9X233.tar

In [None]:
# Sample CSM5MCWG
! wget https://g-227ca.190ebd.75bc.data.globus.org/ibdmdb/raw/HMP2/MGX/2018-05-04/CSM5MCWG.tar -O metagenomics/data/CSM5MCWG.tar
! tar -xf metagenomics/data/CSM5MCWG.tar -C metagenomics/data
! rm  metagenomics/data/CSM5MCWG.tar

In [None]:
# Sample MSMAPC7P
! wget https://g-227ca.190ebd.75bc.data.globus.org/ibdmdb/raw/HMP2/MGX/2018-05-04/MSMAPC7P.tar -O metagenomics/data/MSMAPC7P.tar
! tar -xf metagenomics/data/MSMAPC7P.tar -C metagenomics/data
! rm  metagenomics/data/MSMAPC7P.tar
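
The four download cells above follow one pattern: fetch the archive, unpack it, delete it. As a sketch of how that could be written once, the helper below does the same with the Python standard library. `sample_url` and `fetch_sample` are hypothetical names; the base URL and sample IDs are copied from the cells above. The loop is left commented out so that defining the helper does not trigger any downloads.

```python
import tarfile
import urllib.request
from pathlib import Path

BASE_URL = "https://g-227ca.190ebd.75bc.data.globus.org/ibdmdb/raw/HMP2/MGX/2018-05-04"
SAMPLES = ["HSMA33OT", "CSM9X233", "CSM5MCWG", "MSMAPC7P"]


def sample_url(sample_id: str) -> str:
    """Build the archive URL for one sample ID."""
    return f"{BASE_URL}/{sample_id}.tar"


def fetch_sample(sample_id: str, data_dir: str = "metagenomics/data") -> None:
    """Download one archive, unpack it into data_dir, then delete the archive."""
    archive = Path(data_dir) / f"{sample_id}.tar"
    urllib.request.urlretrieve(sample_url(sample_id), archive)
    with tarfile.open(archive) as tar:
        tar.extractall(data_dir)
    archive.unlink()


# To download every sample from a single cell, uncomment:
# for sample_id in SAMPLES:
#     fetch_sample(sample_id)
```

Adding another sample then only requires appending its ID to `SAMPLES`.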

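### Pulling the nf-core/taxprofiler Pipeline Version

This command uses `nextflow pull` to download and cache the `nf-core/taxprofiler` pipeline locally. Without a revision, it fetches the latest version, so the exact release can change between sessions. To pin the run to a specific release, pass `-r` followed by the version tag to both `pull` and `run`.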

In [None]:
! nextflow pull nf-core/taxprofiler

### Executing the nf-core/taxprofiler Pipeline 🚀

This is the final command that executes the entire `nf-core/taxprofiler` pipeline. Let's break down each argument:

* `! nextflow run nf-core/taxprofiler`: The main command to run this specific pipeline.

* `--input ./metagenomics/data/sample_sheet.csv`: Points the pipeline to our sample sheet. This is how it discovers the input files.

* `--databases ./metagenomics/databases/database_full_v1.2.csv`: Points to our database configuration file.

* `--outdir metagenomics/results`: Tells the pipeline to save all output files into a new directory named `metagenomics/results`.

* `-profile conda`: Instructs Nextflow to use Conda for managing all software dependencies.

* `-c metagenomics/low_resources.config`: Loads our custom configuration file (`-c`) to ensure the pipeline runs within Colab's memory and CPU limits.

* `-resume`: If the pipeline is interrupted (e.g., Colab disconnects), you can run this exact same command again, and Nextflow will skip any steps that have already completed successfully and continue from where it stopped. This relies on the cached results in Nextflow's `work` directory, so it only helps while that directory is still present in the runtime.

In [None]:
! nextflow run nf-core/taxprofiler \
    --input ./metagenomics/data/sample_sheet.csv \
    --databases ./metagenomics/databases/database_full_v1.2.csv \
    --outdir metagenomics/results \
    -profile conda \
    -c metagenomics/low_resources.config \
    -resume
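
If you prefer to launch the run from Python, for example to toggle `-resume` or swap paths programmatically, the command can be assembled as a list of arguments. This is only a sketch: `taxprofiler_command` is a hypothetical helper that mirrors the flags used in the cell above and adds nothing new.

```python
import shlex


def taxprofiler_command(input_csv: str, databases_csv: str, outdir: str,
                        config: str, resume: bool = True) -> list[str]:
    """Assemble the same nextflow invocation as the cell above, as an argument list."""
    command = [
        "nextflow", "run", "nf-core/taxprofiler",
        "--input", input_csv,
        "--databases", databases_csv,
        "--outdir", outdir,
        "-profile", "conda",
        "-c", config,
    ]
    if resume:
        command.append("-resume")
    return command


command = taxprofiler_command(
    "./metagenomics/data/sample_sheet.csv",
    "./metagenomics/databases/database_full_v1.2.csv",
    "metagenomics/results",
    "metagenomics/low_resources.config",
)
# Print the command as a single shell-safe line for inspection.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would start the pipeline. Note the convention visible in the list: options with two dashes (`--input`) are pipeline parameters, while options with one dash (`-profile`, `-c`, `-resume`) belong to Nextflow itself.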