# Running Nextflow from Colab

These are the same steps as before:

1.  Installing **Java** (a prerequisite for Nextflow).
2.  Installing **Nextflow**.
3.  Setting up **Conda** to manage software dependencies.

Then we are ready to run the nf-core/rnaseq pipeline.

## Installing Java

In [None]:
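import os
!apt-get update -qq
!apt-get install -y -qq openjdk-17-jdk
# Each `!` command runs in its own subshell, so `!export` would not persist; set the variables from Python instead.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]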
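
Because each `!` command runs in a separate subshell, it can be worth confirming which Java the notebook actually sees after installation. The sketch below is a hypothetical helper (`parse_java_major` and `installed_java_major` are not part of any library) that extracts the major version from `java -version` output, assuming the usual `version "17.0.x"` format.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> int:
    """Return the Java major version found in `java -version` output.

    Handles the modern scheme ("17.0.9") and the legacy one ("1.8.0_292").
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError(f"no version string found in: {version_output!r}")
    major = int(match.group(1))
    # Legacy releases report "1.N", where N is the real major version.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> Optional[int]:
    """Return the major version of the `java` on PATH, or None if absent."""
    if shutil.which("java") is None:
        return None
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr or result.stdout)


print(parse_java_major('openjdk version "17.0.9" 2023-10-17'))  # 17
```

In a notebook cell, `installed_java_major()` returns `None` when no `java` executable is on the `PATH`, which is a quick signal that the installation step needs to be rerun.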

## Installing Nextflow

In [None]:
!wget -qO- https://get.nextflow.io | bash # Download Nextflow
!mv nextflow /usr/bin/nextflow # Move to a path Colab can access
!chmod +x /usr/bin/nextflow # Make it executable
!nextflow -v # Test it

## Setting up Conda

In [None]:
!pip install -q condacolab # -q means quiet
import condacolab
condacolab.install()

In [None]:
# `--add` prepends, so the channel added last (conda-forge) gets the highest priority.
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge
!conda config --set channel_priority strict
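
The order of the `conda config --add channels` calls matters because each call puts its channel at the top of the list. The following is a small pure-Python model of that behaviour; `add_channel` and `final_priority` are hypothetical helpers that do not call conda, and the model assumes the channel list starts empty (an existing `.condarc` may already contain entries).

```python
from typing import List


def add_channel(channels: List[str], name: str) -> List[str]:
    """Model `conda config --add channels NAME`: place NAME at the top."""
    return [name] + [c for c in channels if c != name]


def final_priority(add_order: List[str]) -> List[str]:
    """Return the channel list, highest priority first, after a series of adds."""
    channels: List[str] = []
    for name in add_order:
        channels = add_channel(channels, name)
    return channels


print(final_priority(["defaults", "bioconda", "conda-forge"]))
# ['conda-forge', 'bioconda', 'defaults']
```

Running `!conda config --show channels` in a cell prints the real list, which is the authoritative check.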

# Running the RNA-seq pipeline

This section focuses on executing the `nf-core/rnaseq` pipeline for transcriptome analysis. The data used is from **GSE137344** and is downsampled for quick execution in the Colab environment.

![RNASeq pipeline](https://raw.githubusercontent.com/Multiomics-Analytics-Group/course_multi-omics_data_science/refs/heads/main/transcriptomics/notebooks/img/rnaseq.png)

The data were downloaded from GEO series [GSE137344](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE137344).

### Creating Data Folders for Transcriptomics Analysis

The following commands create the necessary directory structure: a main `transcriptomics` directory and a `data` subdirectory within it. All raw data and configuration files will be stored here.

In [None]:
# -p creates both levels in one call and does not fail if the cell is rerun
!mkdir -p transcriptomics/data

### Downloading Sample Sheet File for nf-core/rnaseq

The **sample sheet** (`sample_sheet.csv`) is the main input file of the pipeline. It is a comma-separated file that tells the pipeline the name of each sample and the location of its raw sequencing files. The nf-core/rnaseq documentation describes the columns `sample`, `fastq_1`, `fastq_2` (left empty for single-end reads), and `strandedness`. This step downloads a pre-formatted sample sheet for the test data.

In [None]:
! wget https://raw.githubusercontent.com/Multiomics-Analytics-Group/course_multi-omics_data_science/refs/heads/main/transcriptomics/data/sample_sheet.csv -O transcriptomics/data/sample_sheet.csv
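
Checking the sample sheet before launching the pipeline can catch formatting problems early. The sketch below assumes the column names given in the nf-core/rnaseq documentation; `check_sample_sheet` is a hypothetical helper, and the example rows are made up rather than taken from the downloaded file.

```python
import csv
import io

# Columns this sketch treats as required (an assumption based on the
# nf-core/rnaseq documentation; `fastq_2` may be empty for single-end data).
REQUIRED_COLUMNS = ("sample", "fastq_1", "strandedness")


def check_sample_sheet(handle):
    """Parse a sample sheet and fail early on missing columns or values."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"sample sheet is missing columns: {missing}")
    rows = list(reader)
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(rows, start=2):
        if not row["sample"] or not row["fastq_1"]:
            raise ValueError(f"line {line_no}: 'sample' and 'fastq_1' must be set")
    return rows


# Made-up content for illustration only.
example = io.StringIO(
    "sample,fastq_1,fastq_2,strandedness\n"
    "ctrl_1,data/ctrl_1.fastq.gz,,auto\n"
)
rows = check_sample_sheet(example)
print(len(rows), rows[0]["sample"])  # 1 ctrl_1
```

To check the real file, open it with `open("transcriptomics/data/sample_sheet.csv", newline="")` and pass the handle to `check_sample_sheet`.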

### Getting the Config File for the Pipeline

Nextflow pipelines can be customized using configuration files. The `low_resources.config` file is specifically designed to adjust the resource requirements (CPU, memory) for processes within the pipeline to run successfully on a limited-resource environment like Google Colab. This step downloads that custom configuration.

In [None]:
! wget https://raw.githubusercontent.com/Multiomics-Analytics-Group/course_multi-omics_data_science/refs/heads/main/transcriptomics/low_resources.config -O transcriptomics/low_resources.config
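
To judge whether the limits in a resource configuration are realistic, it helps to know what the virtual machine actually offers. The sketch below is one way to check this on Linux; `total_memory_gb` is a hypothetical helper, and the sample text is a made-up excerpt in the format of `/proc/meminfo`, not a measurement from Colab.

```python
import os


def total_memory_gb(meminfo_text: str) -> float:
    """Extract MemTotal from /proc/meminfo-style text and convert kB to GB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


# Made-up excerpt for illustration.
sample = "MemTotal:       13290460 kB\nMemFree:         9876543 kB\n"
print(round(total_memory_gb(sample), 1))  # 12.7

# Number of CPU cores visible to Python on the current machine.
print(os.cpu_count())
```

On the Colab VM, the real value can be read with `total_memory_gb(open("/proc/meminfo").read())`.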

### Downloading Raw Sequencing Data (FASTQ files) 💾

These commands download four raw sequencing files (in compressed FASTQ format, `.fastq.gz`) from the NCBI Sequence Read Archive (SRA), using their accession numbers (`SRR10104255` through `SRR10104258`). These are the actual input files for the RNA-seq analysis.

In [None]:
# The URLs are quoted so that the shell does not treat `?` as a wildcard.
!wget "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR10104255" -O transcriptomics/data/SRR10104255.fastq.gz
!wget "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR10104256" -O transcriptomics/data/SRR10104256.fastq.gz
!wget "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR10104257" -O transcriptomics/data/SRR10104257.fastq.gz
!wget "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR10104258" -O transcriptomics/data/SRR10104258.fastq.gz
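
A download can finish without error and still produce an unusable file, for example an error page saved under a `.fastq.gz` name. The following sketch shows one way to sanity-check a file: it tests for the gzip magic bytes and counts records, assuming the standard four-lines-per-record FASTQ layout. `is_gzip` and `count_fastq_reads` are hypothetical helpers, and the demo runs on a tiny synthetic file rather than the real data.

```python
import gzip
import os
import tempfile


def is_gzip(path: str) -> bool:
    """True if the file starts with the two-byte gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == b"\x1f\x8b"


def count_fastq_reads(path: str) -> int:
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as fh:
        n_lines = sum(1 for _ in fh)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Self-contained demo on a synthetic two-read file.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fastq.gz")
    with gzip.open(demo, "wt") as fh:
        fh.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
    print(is_gzip(demo), count_fastq_reads(demo))  # True 2
```

Applied to the tutorial data, the same functions would be called with paths such as `transcriptomics/data/SRR10104255.fastq.gz`.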

### Pulling the nf-core/rnaseq Pipeline Version

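This command uses `nextflow pull` to download the `nf-core/rnaseq` pipeline and cache it locally, so the later `nextflow run` does not have to fetch it. Without a revision, the pipeline's default branch is pulled, which does not fix the version. To pin a specific release for reproducibility, add `-r <version>`.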

In [None]:
! nextflow pull nf-core/rnaseq

### Executing the nf-core/rnaseq Pipeline 🚀

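This is the main execution command. It runs the `nf-core/rnaseq` pipeline with the options below. Options with a double dash (such as `--input`) are pipeline parameters, while options with a single dash (such as `-profile`) belong to Nextflow itself.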

* `--input transcriptomics/data/sample_sheet.csv`: Specifies the sample sheet with file and metadata information.

* `--outdir transcriptomics/results`: Designates the output directory where all results, reports, and logs will be stored.
        
* `-profile conda`: Instructs Nextflow to use **Conda** to manage and install all required software dependencies.

* `-c transcriptomics/low_resources.config`: Applies the custom configuration file to adjust resource settings for Colab.

* `--igenomes_ignore` and `--genome null`: Prevent the pipeline from loading the iGenomes reference configuration and leave the `genome` key unset, so no reference is fetched from iGenomes. The reference files must then be supplied another way, here presumably through the custom configuration file.

* `-resume`: Reuses the cached results of tasks that have already completed, so an interrupted run continues where it stopped instead of starting over.

In [None]:
! nextflow run \
    nf-core/rnaseq \
    --input transcriptomics/data/sample_sheet.csv \
    --outdir transcriptomics/results \
    --igenomes_ignore \
    --genome null \
    -profile conda \
    -c transcriptomics/low_resources.config \
    -resume
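
When the same run has to be repeated with small variations, assembling the command in Python makes the options explicit and avoids quoting mistakes. The sketch below mirrors the command above; `build_rnaseq_command` is a hypothetical helper, and the optional `revision` argument maps to Nextflow's `-r` option. The value `3.14.0` is only an illustrative release tag, not the version this tutorial was written against.

```python
import shlex
from typing import List, Optional


def build_rnaseq_command(
    input_csv: str,
    outdir: str,
    config: str,
    revision: Optional[str] = None,
) -> List[str]:
    """Assemble the `nextflow run nf-core/rnaseq` argument list."""
    cmd = ["nextflow", "run", "nf-core/rnaseq"]
    if revision is not None:
        # Pin a pipeline release so that repeated runs use the same code.
        cmd += ["-r", revision]
    cmd += [
        "--input", input_csv,
        "--outdir", outdir,
        "--igenomes_ignore",
        "--genome", "null",
        "-profile", "conda",
        "-c", config,
        "-resume",
    ]
    return cmd


cmd = build_rnaseq_command(
    "transcriptomics/data/sample_sheet.csv",
    "transcriptomics/results",
    "transcriptomics/low_resources.config",
    revision="3.14.0",  # illustrative value
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which raises an error if Nextflow exits with a non-zero status.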