# miRNA-Seq Analysis Training Demo

## Overview

This notebook analyzes a subset of the zebrafish data from the King BL et al. study and walks through a complete workflow for processing zebrafish miRNA-seq data. The workflow sets up the environment and the necessary directories, then uses Entrez Direct and the SRA Toolkit to look up and download sequence data from the NCBI database. FastQC and MultiQC provide quality control, Cutadapt trims adapters, and STAR builds the genome index and aligns the reads; when you are not running in the pre-configured container, Cutadapt and STAR are run through Docker. After processing, gene counts are combined into a matrix for differential expression analysis, and several plots are used to interpret the results.

## Step 1: Getting Started

<div class="alert alert-block alert-warning"> NOTE: This Jupyter Notebook was developed to run within a customized container on AWS with all software and packages pre-configured. If running without this customized container, you will need to install tools using the Miniforge environment setup instructions below before moving on to Step 2.</div>

### Without Container: Install Miniforge and R Packages

Miniforge is a lightweight Conda distribution that offers a streamlined installation process and efficient package management. It provides access to a vast repository of packages.

The following code performs these steps:
- Downloads the Miniforge installer
- Installs Miniforge, which makes both conda and mamba available immediately
- Uses mamba with the conda-forge and bioconda channels to install the command-line tools (FastQC, MultiQC, Entrez Direct, SRA Toolkit, Samtools, and Subread)
- Installs the R packages used in this tutorial through BiocManager


<div class="alert alert-block alert-info">Tip: If using the Miniforge install, run the following code cells by removing the # pound from each command line. </div>

In [None]:
# Download Miniforge
#system("curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh", ignore.stdout = TRUE, ignore.stderr = TRUE)

# Install Miniforge (you can change the path as needed)
#system("bash Miniforge3-$(uname)-$(uname -m).sh -b -u -p $HOME/miniforge", ignore.stdout = TRUE, ignore.stderr = TRUE)

# Update PATH to point to the Miniforge bin files
#Sys.setenv(PATH = paste0(Sys.getenv("HOME"), "/miniforge/bin:", Sys.getenv("PATH")))

In [None]:
# Use mamba to install the required bioinformatics packages
#system("mamba install -y -c conda-forge -c bioconda fastqc multiqc entrez-direct parallel-fastq-dump sra-tools samtools subread")

In [None]:
# Install packages if not already installed
#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")

#BiocManager::install(c("DESeq2", "dplyr", "ggplot2", "pheatmap", "apeglm", "ggrepel", "EnhancedVolcano", "ComplexHeatmap", "RColorBrewer", "plotly", "base64enc", "IRdisplay"))

### Step 1.2: Pull Docker Image

In [None]:
# Pull the Docker image
#system("docker pull encodedcc/mirna-seq-pipeline:1.2.2")

----------------------------------------------------

## If running from a container, as noted above, start with <b> STEP 1.3 </b> below:

### Step 1.3: Load Libraries, Create Directories, and Define Thread Number

In [None]:
# Load libraries
library(DESeq2)
library(ggplot2)
library(pheatmap)
library(apeglm) 
library(ggrepel) 
library(EnhancedVolcano)
library(RColorBrewer)
library(ComplexHeatmap)
library(plotly)
library(IRdisplay)

In [None]:
# Create necessary directories on your host machine (outside Docker)
# showWarnings = FALSE lets this cell be re-run without warnings about existing directories
dirs <- c("data/aligned_bam", "data/fastqc", "data/fastqc_samples",
          "data/raw_fastq", "data/reference", "data/sample_STAR",
          "data/star_output", "data/trimmed", "data/zebrafish_STAR_index")
for (d in dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

In [None]:
# Detect number of cores
num_cores <- parallel::detectCores(logical = TRUE)
THREADS <- max(1, num_cores - 1)
print(paste("Number of threads:", THREADS))

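Use Entrez Direct (`esearch` and `efetch`) to retrieve the run information for BioProject PRJNA418313, then extract the run accessions for the six selected samples into `accs.txt`.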

In [None]:
system("esearch -db sra -query 'PRJNA418313' | efetch -format runinfo > all_sra_info.txt")
system("grep -E 'GSM2856755|GSM2856756|GSM2856757|GSM2856761|GSM2856762|GSM2856763' all_sra_info.txt | cut -d',' -f1 > accs.txt")
system("cat accs.txt")
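
The `grep | cut` pipeline above can also be expressed in R, which makes the sample-to-run mapping explicit. The sketch below is illustrative only: it parses a small inline table with invented IDs rather than `all_sra_info.txt`, it assumes the runinfo CSV exposes `Run` and `SampleName` columns, and `select_runs` is a hypothetical helper name.

```r
# Toy stand-in for all_sra_info.txt (all IDs are invented for illustration)
runinfo_text <- "Run,SampleName,LibraryStrategy
SRR0000001,GSM0000001,miRNA-Seq
SRR0000002,GSM0000002,miRNA-Seq
SRR0000003,GSM0000003,miRNA-Seq"

# Hypothetical helper: return the run accessions for the requested samples
select_runs <- function(runinfo, samples) {
  runinfo$Run[runinfo$SampleName %in% samples]
}

runinfo <- read.csv(text = runinfo_text, stringsAsFactors = FALSE)
selected <- select_runs(runinfo, c("GSM0000001", "GSM0000003"))
print(selected)
```

With the real file, the result could be written out with `writeLines(selected, "accs.txt")`.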

## Step 2: Download Data and Reference Files


The `prefetch` command downloads the SRA file (`.sra`) for each run accession (SRR) from the NCBI database, with several downloads running in parallel through `xargs`. Conversion to FASTQ happens in Step 3.

In [None]:
# Download multiple files using the SRA-Toolkit with parallel threads
system(paste("cat accs.txt | xargs -P", THREADS, "-I {} prefetch {} -O data/raw_fastq -f yes"))

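The reference genome and the annotation file for the zebrafish genome are downloaded from an S3 bucket. The 3' adapter file used in Step 5 is expected at `data/trimmed/three_prime_adapter.fa`; the cell below does not download it.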

In [None]:
# Download genome and annotation files
system("wget -P data/reference/ https://nigms-sandbox.s3.us-east-1.amazonaws.com/bulk-scRNAseq/reference/GRCz11.fa")
system("wget -P data/reference/ https://nigms-sandbox.s3.us-east-1.amazonaws.com/bulk-scRNAseq/reference/dre_zebrafish.gtf")

## Step 3: Convert SRA to FASTQ

Run this command to convert the SRA files downloaded by `prefetch` into compressed FASTQ files using `fastq-dump`.

In [None]:
# Convert SRA files to Fastq format using fastq-dump
system(paste("cat accs.txt | xargs -P", THREADS, "-I {} fastq-dump --outdir data/raw_fastq/ --gzip data/raw_fastq/{}/{}.sra"))
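
Because `xargs` runs the conversions in parallel, a single failed download or conversion is easy to miss in the output. A minimal sketch of a check that every accession produced a FASTQ file follows; `missing_outputs` is a hypothetical helper, and the demonstration uses a temporary directory with an empty placeholder file and invented accession names rather than real data.

```r
# Hypothetical helper: list accessions whose expected output file is absent
missing_outputs <- function(accs, dir, suffix) {
  expected <- file.path(dir, paste0(accs, suffix))
  accs[!file.exists(expected)]
}

# Demonstration in a temporary directory with made-up accession names
demo_dir <- file.path(tempdir(), "raw_fastq_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "SRR0000001.fastq.gz"))

missing <- missing_outputs(c("SRR0000001", "SRR0000002"), demo_dir, ".fastq.gz")
print(missing)
```

In this notebook the equivalent call would be `missing_outputs(readLines("accs.txt"), "data/raw_fastq", ".fastq.gz")`; an empty result means every file is present.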

## Step 4: Run FastQC

Run FastQC to analyze the quality of the FASTQ files, and then generate a MultiQC report.

In [None]:
# Run FastQC on the downloaded Fastq files
system(paste("cat accs.txt | xargs -P", THREADS, "-I {} fastqc data/raw_fastq/{}.fastq.gz -o data/fastqc/"))

In [None]:
display_html('<iframe src="./data/fastqc/SRR6289638_fastqc.html" width="800" height="600"></iframe>')

In [None]:
# Run MultiQC to generate a combined QC report
system("multiqc -f data/fastqc/")

# Read and display the MultiQC data using R's data frames
multiqc_data <- read.csv("./multiqc_data/multiqc_fastqc.txt", sep = "\t")
print(multiqc_data)

## Step 5: Adapter Trimming using Cutadapt

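The code runs Cutadapt on each FASTQ file, removing the 3' adapter sequence and saving the trimmed reads to `data/trimmed`. Inside the pre-configured container Cutadapt is called directly; without the container, use the Docker version in the dropdown below.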

In [None]:
# Define the list of SRA accessions from your file
accs_ids <- readLines("accs.txt")

# Specify adapter file location
adapter_file <- "data/trimmed/three_prime_adapter.fa"
trimmed_output_dir <- "data/trimmed"

# Run cutadapt for each accession (called directly inside the container)
for (i in seq_along(accs_ids)) {
  acc <- accs_ids[i]
  input_fastq <- paste0("data/raw_fastq/", acc, ".fastq.gz")
  trimmed_fastq <- paste0(trimmed_output_dir, "/", acc, "_trimmed.fastq")

  # Build and run the cutadapt command; the /data/ prefix assumes the working directory is /data
  system(paste0(
    "cutadapt -a file:/data/", adapter_file, " -e 0.25 -m 15 -M 30 ",
    " --untrimmed-output /data/", trimmed_output_dir, "/", acc, "_untrimmed.fastq",
    " -o /data/", trimmed_output_dir, "/", acc, "_trimmed.fastq",
    " --cores ", THREADS, " /data/", input_fastq
  ))
}

<details>
<summary><b>If running without a container replace above code with code hidden within this dropdown.</b></summary>

```
# Define the list of SRA accessions from your file
accs_ids <- readLines("accs.txt")

# Specify adapter file location
adapter_file <- "data/trimmed/three_prime_adapter.fa"
trimmed_output_dir <- "data/trimmed"

# Run cutadapt in Docker for each accession
for (i in 1:length(accs_ids)) {
  acc <- accs_ids[i]
  input_fastq <- paste0("data/raw_fastq/", acc, ".fastq.gz")
  trimmed_fastq <- paste0(trimmed_output_dir, "/", acc, "_trimmed.fastq")

  # Run cutadapt inside Docker
  system(paste0(
    "docker run --rm -v ", getwd(), ":/data ",
    "encodedcc/mirna-seq-pipeline:1.2.2 ",
    "cutadapt -a file:/data/", adapter_file, " -e 0.25 -m 15 -M 30 ",
    " --untrimmed-output /data/", trimmed_output_dir, "/", acc, "_untrimmed.fastq",
    " -o /data/", trimmed_output_dir, "/", acc, "_trimmed.fastq",
    " --cores ", THREADS, " /data/", input_fastq
  ))
}
```
</details>
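
To make the Cutadapt options concrete: `-a` removes a 3' adapter and everything after it, `-m 15 -M 30` keeps only reads that are 15 to 30 nt long after trimming, and `--untrimmed-output` diverts reads in which no adapter was found. The toy sketch below imitates that logic with exact string matching on invented sequences. It is not how Cutadapt is implemented (Cutadapt tolerates mismatches, here up to the `-e 0.25` error rate), and `trim_3prime` is a hypothetical function.

```r
# Hypothetical toy function: exact-match 3' adapter trimming plus a length filter
trim_3prime <- function(reads, adapter, min_len = 15, max_len = 30) {
  # Position of the adapter in each read; -1 when the adapter is absent
  pos <- as.integer(regexpr(adapter, reads, fixed = TRUE))
  found <- pos > 0
  # Keep everything before the adapter; leave reads without an adapter unchanged
  trimmed <- ifelse(found, substr(reads, 1, pos - 1), reads)
  len <- nchar(trimmed)
  status <- ifelse(!found, "untrimmed",
                   ifelse(len >= min_len & len <= max_len, "kept", "discarded"))
  data.frame(trimmed = trimmed, length = len, status = status,
             stringsAsFactors = FALSE)
}

adapter <- "TGGAATTC"                                 # made-up adapter sequence
reads <- c(
  paste0("ACGTACGTACGTACGTACGTAC", adapter, "AAAA"),  # 22 nt insert + adapter
  paste0("ACGTACGT", adapter),                        # 8 nt insert, too short
  "ACGTACGTACGTACGTACGTACGTACGT"                      # no adapter present
)
result <- trim_3prime(reads, adapter)
print(result)
```

The first read is kept, the second is discarded by the minimum-length filter, and the third is flagged as untrimmed, which mirrors the three outcomes the Cutadapt command distinguishes.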



## Step 6: Run FastQC

Run FastQC again after Cutadapt to confirm that the trimmed FASTQ files are of good quality before proceeding with STAR.

In [None]:
# Run FastQC on the trimmed Fastq files
system(paste("cat accs.txt | xargs -P", THREADS, "-I {} fastqc data/trimmed/{}_trimmed.fastq -o data/fastqc_samples"))

In [None]:
display_html('<iframe src="./data/fastqc_samples/SRR6289638_trimmed_fastqc.html" width="800" height="600"></iframe>')

In [None]:
# Run MultiQC to generate a combined QC report
system("multiqc -f data/fastqc_samples/")

# Read and display the MultiQC data using R's data frames
multiqc_data <- read.csv("./multiqc_data/multiqc_fastqc.txt", sep = "\t")
print(multiqc_data)

## Step 7: STAR Genome Indexing

<div class="alert alert-block alert-warning"> NOTE: This step is computationally expensive. Make sure your instance RAM aligns with the --limitGenomeGenerateRAM parameter. You may need to lower the parameter value. Additionally, make sure you have adequate disk space to store the generated index or the process may not complete.</div>

STAR is a genome aligner that requires a pre-built genome index to efficiently map reads to the reference genome. This step creates a genome index for the zebrafish reference genome, which will be used in the subsequent alignment step.

In [None]:
# Run STAR genome indexing (called directly inside the container)
system(paste0(
  "STAR --runThreadN ", THREADS, 
  " --runMode genomeGenerate",
  " --genomeDir /data/data/zebrafish_STAR_index",
  " --genomeFastaFiles /data/data/reference/GRCz11.fa",
  " --sjdbGTFfile /data/data/reference/dre_zebrafish.gtf",
  " --sjdbOverhang 1 --limitGenomeGenerateRAM 29000000000"
))

<details>
<summary><b>If running without a container replace above code with code hidden within this dropdown.</b></summary>

```
# Run STAR genome indexing inside Docker
system(paste0(
  "docker run --rm -v ", getwd(), ":/data ",
      "encodedcc/mirna-seq-pipeline:1.2.2 ",
  "STAR --runThreadN ", THREADS, 
  " --runMode genomeGenerate",
  " --genomeDir /data/data/zebrafish_STAR_index",
  " --genomeFastaFiles /data/data/reference/GRCz11.fa",
  " --sjdbGTFfile /data/data/reference/dre_zebrafish.gtf",
  " --sjdbOverhang 1 --limitGenomeGenerateRAM 29000000000"
))
```
</details>
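
One detail matters if you lower `--limitGenomeGenerateRAM` by computing the value instead of typing it into the string: R converts large numbers to scientific notation when pasting them, which may not be read as the intended byte count. A minimal sketch of a helper that formats the value as plain digits is shown below; `ram_flag` is a hypothetical name, and the decimal convention (1 GB = 1e9 bytes) is an assumption chosen to match the 29000000000 value used above.

```r
# Hypothetical helper: build the RAM flag with the byte count as plain digits
ram_flag <- function(gb) {
  bytes <- format(gb * 1e9, scientific = FALSE, trim = TRUE)
  paste("--limitGenomeGenerateRAM", bytes)
}

print(paste0(29e9))   # naive pasting switches to scientific notation
print(ram_flag(29))   # plain digits, safe to append to the STAR command
print(ram_flag(16))   # example for a smaller instance
```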



## Step 8: STAR Genome Alignment

Once the index has been created, this step aligns the trimmed FASTQ files to the reference genome using the STAR aligner. The aligned reads and gene counts are then used for downstream analysis, such as differential gene expression analysis.

In [None]:
# Read accession IDs from the file
accs_ids <- readLines("accs.txt")

# Loop through each accession ID
for (i in 1:length(accs_ids)) {
  acc <- accs_ids[i]

  # Construct the command
  system(paste0(
    "STAR --genomeDir /data/data/zebrafish_STAR_index ",
    " --readFilesIn /data/data/trimmed/", acc, "_trimmed.fastq ",
    " --sjdbGTFfile /data/data/reference/dre_zebrafish.gtf ",
    " --runThreadN ", THREADS,
    " --alignEndsType EndToEnd ",
    " --outFilterMismatchNmax 1 ",
    " --outFilterMultimapScoreRange 0 ",
    " --quantMode TranscriptomeSAM GeneCounts ",
    " --outReadsUnmapped Fastx ",
    " --outSAMtype BAM SortedByCoordinate ",
    " --outFilterMultimapNmax 10 ",
    " --outSAMunmapped Within ",
    " --outFilterScoreMinOverLread 0 ",
    " --outFilterMatchNminOverLread 0 ",
    " --outFilterMatchNmin 16 ",
    " --alignSJDBoverhangMin 1000 ",
    " --alignIntronMax 1 ",
    " --outWigType wiggle ",
    " --outWigStrand Stranded ",
    " --outWigNorm RPM ",
    "--outFileNamePrefix /data/data/star_output/", acc, "_"
  ))
}

<details>
<summary><b>If running without a container replace above code with code hidden within this dropdown.</b></summary>

```
# Read accession IDs from the file
accs_ids <- readLines("accs.txt")

# Loop through each accession ID
for (i in 1:length(accs_ids)) {
  acc <- accs_ids[i]

  # Construct the command
  system(paste0(
    "docker run --rm -v ", getwd(), ":/data ",
    "encodedcc/mirna-seq-pipeline:1.2.2 ",
    "STAR --genomeDir /data/data/zebrafish_STAR_index ",
    " --readFilesIn /data/data/trimmed/", acc, "_trimmed.fastq ",
    " --sjdbGTFfile /data/data/reference/dre_zebrafish.gtf ",
    " --runThreadN ", THREADS,
    " --alignEndsType EndToEnd ",
    " --outFilterMismatchNmax 1 ",
    " --outFilterMultimapScoreRange 0 ",
    " --quantMode TranscriptomeSAM GeneCounts ",
    " --outReadsUnmapped Fastx ",
    " --outSAMtype BAM SortedByCoordinate ",
    " --outFilterMultimapNmax 10 ",
    " --outSAMunmapped Within ",
    " --outFilterScoreMinOverLread 0 ",
    " --outFilterMatchNminOverLread 0 ",
    " --outFilterMatchNmin 16 ",
    " --alignSJDBoverhangMin 1000 ",
    " --alignIntronMax 1 ",
    " --outWigType wiggle ",
    " --outWigStrand Stranded ",
    " --outWigNorm RPM ",
    "--outFileNamePrefix /data/data/star_output/", acc, "_"
  ))
}
```
</details>
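
After alignment, STAR writes a summary named `<prefix>Log.final.out` for each run, so with the prefix used above the files would be `data/star_output/<accession>_Log.final.out`. The sketch below pulls the uniquely mapped percentage out of such a file. `read_unique_pct` is a hypothetical helper, and the demonstration parses a two-line mock file with invented numbers, not real output.

```r
# Hypothetical helper: extract "Uniquely mapped reads %" from a STAR Log.final.out
read_unique_pct <- function(log_path) {
  lines <- readLines(log_path)
  hit <- grep("Uniquely mapped reads %", lines, fixed = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_real_)
  value <- trimws(sub(".*\\|", "", hit[1]))   # text after the "|" separator
  as.numeric(sub("%", "", value, fixed = TRUE))
}

# Mock log file with invented values, laid out as "<label> |<tab><value>"
mock_log <- tempfile(fileext = "_Log.final.out")
writeLines(c(
  "                          Number of input reads |\t1000",
  "                        Uniquely mapped reads % |\t12.50%"
), mock_log)

pct <- read_unique_pct(mock_log)
print(pct)
```

Applying the helper to every accession, for example with `sapply`, gives a quick per-sample overview before moving on to counting.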



## Step 9: Performing Differential Expression Analysis

This step reads the gene count files generated by STAR and combines them into a single data frame for further analysis.

In [None]:
# Set up libraries
library(dplyr)

# Define the accession IDs and output directory
accs <- c("SRR6289638", "SRR6289639", "SRR6289640", "SRR6289644", "SRR6289645", "SRR6289646")
output_dir <- "data/star_output/"

# Create a function to read in gene counts from STAR output
read_gene_counts <- function(acc, output_dir) {
  filepath <- file.path(output_dir, paste0(acc, "_ReadsPerGene.out.tab"))
  # Columns: gene ID, unstranded counts, strand 1 counts, strand 2 counts
  gene_counts <- read.table(filepath, header = FALSE, sep = "\t", stringsAsFactors = FALSE)
  gene_counts <- gene_counts[, c(1, 2)]  # Keep gene ID and unstranded counts
  # Drop STAR's summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous)
  gene_counts <- gene_counts[!grepl("^N_", gene_counts[[1]]), ]
  colnames(gene_counts) <- c("Gene", acc)
  return(gene_counts)
}

# Initialize the combined matrix with the first file
combined_counts <- read_gene_counts(accs[1], output_dir)

# Loop through the remaining files and merge the counts by gene
for (i in 2:length(accs)) {
  acc <- accs[i]
  gene_counts <- read_gene_counts(acc, output_dir)
  combined_counts <- full_join(combined_counts, gene_counts, by = "Gene")
}

# View the combined gene counts
head(combined_counts)
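
Before modeling, it is worth confirming that the merged table is complete. A `full_join` inserts `NA` wherever a gene is missing from one file, and very uneven column totals can point to a failed sample. The sketch below runs both checks on a toy table with invented counts; `check_counts` is a hypothetical helper that assumes the first column holds gene IDs, as in `combined_counts`.

```r
# Hypothetical helper: count missing values and total counts per sample
check_counts <- function(counts) {
  mat <- as.matrix(counts[, -1, drop = FALSE])   # drop the gene ID column
  list(
    n_missing = sum(is.na(mat)),
    library_size = colSums(mat, na.rm = TRUE)
  )
}

# Toy table with invented counts; S2 is missing a value for geneB
toy_counts <- data.frame(
  Gene = c("geneA", "geneB", "geneC"),
  S1 = c(10, 0, 5),
  S2 = c(12, NA, 7)
)

qc <- check_counts(toy_counts)
print(qc)
```

A nonzero `n_missing` would need to be resolved before building the DESeq2 dataset, since the count matrix cannot contain `NA`.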


Differential expression analysis with the DESeq2 package identifies genes whose expression changes significantly between the two experimental conditions ("0dpa" and "3dpa").

In [None]:
# Load necessary libraries
library(DESeq2)

# Prepare the count matrix
rownames(combined_counts) <- combined_counts$Gene
count_data <- as.matrix(combined_counts[, -1])

# Define sample conditions in the same order as accs (three 0dpa samples, then three 3dpa samples)
conditions <- factor(c("0dpa", "0dpa", "0dpa", "3dpa", "3dpa", "3dpa"))
coldata <- data.frame(row.names = accs, condition = conditions)

# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(countData = count_data, colData = coldata, design = ~ condition)

# Run DESeq2
dds <- DESeq(dds)

# Get results
res <- results(dds)
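
Because R orders factor levels alphabetically, `0dpa` is the reference level here, so `results(dds)` reports log2 fold changes for 3dpa relative to 0dpa. A common next step is to pull out the genes that pass both a significance and an effect-size cutoff. The sketch below does this on a toy data frame with invented values standing in for `as.data.frame(res)`; the 0.05 and 1.0 thresholds are illustrative, and `filter_significant` is a hypothetical helper.

```r
# Hypothetical helper: keep rows passing padj and |log2FC| cutoffs, sorted by padj
filter_significant <- function(res_df, alpha = 0.05, lfc = 1) {
  keep <- !is.na(res_df$padj) & res_df$padj < alpha &
    abs(res_df$log2FoldChange) >= lfc
  out <- res_df[keep, , drop = FALSE]
  out[order(out$padj), , drop = FALSE]
}

# Toy results with invented values; padj can be NA in DESeq2 output
toy_res <- data.frame(
  log2FoldChange = c(2.1, -0.3, -1.8, 3.0),
  padj = c(0.001, 0.200, 0.030, NA),
  row.names = c("mir-A", "mir-B", "mir-C", "mir-D")
)

sig <- filter_significant(toy_res)
print(sig)
```

Rows with `NA` adjusted p-values are excluded explicitly, since comparing `NA < 0.05` would otherwise propagate `NA` into the row selection.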


## Step 10: Gene Expression Visualization and Exploration

In [None]:
#Volcano Plot
EnhancedVolcano(res,
    lab = rownames(res),
    x = 'log2FoldChange',
    y = 'pvalue',
    pCutoff = 0.05,
    FCcutoff = 1.0,
    title = 'Volcano Plot')


In [None]:
rld <- rlog(dds)
# Select top 20 genes by adjusted p-value
top_genes <- head(order(res$padj), 20)
mat <- assay(rld)[top_genes, ]

# Plot the heatmap
pheatmap(mat, cluster_rows = TRUE, cluster_cols = TRUE)
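
The heatmap above colors cells by rlog value, so highly expressed genes dominate the color scale. Centering and scaling each row makes per-gene patterns across samples easier to compare; `pheatmap` offers this through its `scale = "row"` argument. The base R sketch below shows the same idea on a toy matrix with invented values; `row_zscore` is a hypothetical helper.

```r
# Hypothetical helper: convert each row to z-scores (mean 0, standard deviation 1)
row_zscore <- function(mat) {
  t(scale(t(mat)))   # scale() works on columns, so transpose before and after
}

# Toy matrix: two genes with the same pattern at very different magnitudes
toy_mat <- matrix(
  c(1, 2, 3,
    10, 20, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("S1", "S2", "S3"))
)

z <- row_zscore(toy_mat)
print(z)
```

After scaling, both toy genes show the same profile, which is the behavior you would want when the interest is in direction of change rather than absolute abundance.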


In [None]:
#PCA plot
rld <- rlog(dds) 
plotPCA(rld, intgroup = "condition")
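
`plotPCA` selects the most variable genes (500 by default, through its `ntop` argument) and runs a principal component analysis with samples as observations. The sketch below reproduces that idea in base R on a random toy matrix, which can help when you want the variance explained by each component as numbers. `pca_top_var` is a hypothetical helper, and the simulated values have no biological meaning.

```r
# Hypothetical helper: PCA on the ntop most variable rows (genes)
pca_top_var <- function(mat, ntop = 500) {
  vars <- apply(mat, 1, var)                                   # per-gene variance
  top <- order(vars, decreasing = TRUE)[seq_len(min(ntop, nrow(mat)))]
  pca <- prcomp(t(mat[top, , drop = FALSE]))                   # samples as rows
  pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
  list(scores = pca$x, pct_var = pct_var)
}

# Simulated matrix: 10 genes by 6 samples
set.seed(1)
toy <- matrix(rnorm(60), nrow = 10,
              dimnames = list(paste0("g", 1:10), paste0("S", 1:6)))

pca_res <- pca_top_var(toy, ntop = 5)
print(round(pca_res$pct_var, 1))
```

With the real data, the input would be `assay(rld)`, and `pca_res$scores[, 1:2]` could be plotted against the `condition` column of `coldata`.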


In [None]:
#MA plot
plotMA(res, main="DESeq2 MA Plot", ylim=c(-2, 2))


In [None]:
# Interactive MA Plot
res_df <- as.data.frame(res)

p <- ggplot(res_df, aes(x=log10(baseMean), y=log2FoldChange, 
                        text=paste("Gene: ", rownames(res_df)), 
                        color=padj < 0.05)) +
  geom_point(alpha=0.5) +
  scale_color_manual(values=c("grey", "red")) + # Highlight significant genes
  labs(title="DESeq2 MA Plot", x="log10(baseMean)", y="log2 Fold Change") +
  theme_minimal() +
  ylim(-2, 2)

interactive_plot <- ggplotly(p, tooltip="text")
interactive_plot
