# <span> Module 1: RNA-seq analysis </span>

### <span> Pre-Module Flashcards to Revise the Basics </span>

In [None]:
IRdisplay::display_html('<iframe src = "../../docs/quiz_files/rna-pre_module.html" width=95% height=600></iframe>')

## Overview
### <span> Central Dogma of Molecular Biology. </span>
    
<figure>
<img src="../../images/central-dogma.jpeg" width="800" height="400">
<figcaption align = "center"> <b> Fig 1: Different omes and corresponding omics technologies. [1]</b> </figcaption>
</figure>
    
<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>[1] Reference:</b> Virkud, Y. V., Kelly, R. S., Wood, C., & Lasky-Su, J. A. (2019). The nuts and bolts of omics for the clinical allergist. Annals of Allergy, Asthma and Immunology, 123(6), 558-563.
</div>

The central dogma of molecular biology describes the flow of genetic information from DNA to RNA to protein, and each layer of this flow has a corresponding "ome" and "omics" field (Fig 1). Omic data of various kinds are frequently employed in human medical research. The classical omics fields follow the central dogma: genomics (DNA), transcriptomics (RNA), proteomics (proteins), and metabolomics (small molecules, including amino acids, fatty acids, carbohydrates, vitamins, lipids, and nucleotides). Newer types of omic data have since emerged, including epigenomics (methyl tags and histones), exposomics (allergens, toxins, diet), and microbiomics (bacteria and microorganisms). Most early scientific efforts were devoted to describing the genome, transcriptome, and proteome, but seven major omics disciplines are now being investigated in great detail: the genome (DNA), transcriptome (RNA), proteome (proteins), epigenome (DNA modifications that influence expression), metabolome (metabolites), microbiome (microbiota), and exposome (exposures).

### <span> Next Generation Sequencing Technique for RNA-Seq. </span>

<figure>
<img src="../../images/ngs.png" width="800" height="400">
<figcaption align = "center"> <b> Fig 2: High level workflow of RNA sequencing. [2]</b> </figcaption>
</figure>
    
<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>[2] Reference:</b> Van den Berge, K., Hembach, K. M., Soneson, C., Tiberi, S., Clement, L., Love, M. I., ... & Robinson, M. D. (2019). RNA sequencing data: Hitchhiker's guide to expression analysis. Annual Review of Biomedical Data Science, 2(1), 139-173.
</div>
    
The figure above gives an overview of the experimental phases of an RNA-seq protocol. RNA is isolated from the samples and converted into a cDNA library, the library is sequenced, and the resulting reads are mapped to a reference genome or transcriptome. Depending on the objective of the experiment, downstream data analysis may entail, among other things, evaluating differential expression, variant calling, or genome annotation.

### <span> **Gene Expression** </span>
The phrase "gene expression" refers to how a gene affects a cell's overall phenotype and functions through the activity of the molecular products encoded in its nucleotide sequence.

Knowing how much gene expression levels deviate from the norm can help identify the genes that are genuinely crucial for things like disease prognosis or cell/tissue identity. Low-throughput techniques that determine whether a single gene is expressed at all, such as using a reporter gene with a fluorescent protein product, have largely been replaced by high-throughput techniques that measure many genes at once.
  
### <span> **RNA-Seq** </span>
RNA sequencing has proven to be a groundbreaking tool in the study of transcriptomics over the last decade. The accuracy, throughput, and resolution of RNA-seq analysis have made a wide variety of transcriptome sequencing applications possible. Currently, RNA-seq is considered to be the most effective, reliable, and flexible method to determine gene expression and transcription activation at the genome-wide level.
    
### <span> **Differential Expression Analysis** </span>
Differential expression analysis is the process of statistically analyzing the normalized read count data to identify quantifiable differences in expression levels between experimental groups. A gene is said to be differentially expressed if there is a statistically significant difference or change in read counts or expression levels between two experimental conditions. Understanding the biological differences between healthy and diseased states depends on differential gene expression. With the use of this approach, researchers can choose specific gene expression targets for further investigation and pinpoint the molecular causes of phenotypic variations.
    
### <span> General Roadmap for RNA-Seq Experiment </span>
    
<figure>
<img src="../../images/rna-seq-roadmap.png" width="800" height="400">
<figcaption align = "center"> <b> Fig 3: Extensive roadmap for types of computational RNA-Seq methods. [3]</b> </figcaption>
</figure>
    
<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>[3] Reference:</b> Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., Szcześniak, M. W., Gaffney, D. J., Elo, L. L., Zhang, X., & Mortazavi, A. (2016). A survey of best practices for RNA-seq data analysis. Genome biology, 17, 13. https://doi.org/10.1186/s13059-016-0881-8</div>
    
This figure depicts a generic roadmap for the computational analysis of RNA-seq data. The major analysis steps for pre-analysis, core analysis, and advanced analysis are listed above the lines, and the key analysis issues for each step, which are discussed in the cited text, are listed below the lines. (a) Pre-analysis covers experimental design, sequencing design, and quality control. (b) Core analysis covers transcriptome profiling, differential gene expression, and functional profiling. (c) Advanced analysis involves data integration, visualization, and other RNA-seq technologies.
    
### <span> Some Handy Abbreviations. </span>
+ **ChIP-seq:** Chromatin immunoprecipitation sequencing
+ **eQTL:** Expression quantitative trait loci
+ **sQTL:** Splicing quantitative trait loci
+ **TPM:** Transcripts per million
+ **FPKM:** Fragments per kilobase of exon model per million mapped reads
+ **RPKM:** Reads per kilobase of exon model per million reads
+ **GSEA:** Gene set enrichment analysis
+ **PCA:** Principal component analysis
+ **TF:** Transcription factor
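
The expression units in the list above (RPKM and TPM) differ mainly in the order in which library size and gene length are normalized. The following is a minimal sketch on made-up numbers; the count values, gene lengths, and object names are illustrative assumptions and are not part of this module's data.

```r
# Toy raw counts (genes x samples) and gene lengths in kilobases; all values are made up
counts <- matrix(c(100, 200,
                   300, 600,
                   600, 200),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
gene_length_kb <- c(geneA = 1, geneB = 3, geneC = 2)

# RPKM: scale by library size (per million reads) first, then by gene length (per kb)
per_million <- colSums(counts) / 1e6
rpkm <- sweep(counts, 2, per_million, "/") / gene_length_kb

# TPM: scale by gene length first, then rescale each sample to sum to one million
rate <- counts / gene_length_kb
tpm <- sweep(rate, 2, colSums(rate), "/") * 1e6

# Every sample sums to one million in TPM, which is not guaranteed for RPKM
colSums(tpm)
colSums(rpkm)
```

FPKM follows the same arithmetic as RPKM but counts fragments (read pairs) instead of individual reads.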



### <span> **Analysis Architecture for this Module** </span>

<figure>
<img src="../../images/read_counts.png" width="900" height="400">
<figcaption align = "center"> <b> Fig 4: Analysis workflow for RNA-Seq module.</b> </figcaption>
</figure>
    
This figure represents the analysis architecture followed in this module. The module has been designed according to the resources and the availability of data. The blue box represents the pipeline that can be implemented using the Nextflow nf-core/rnaseq module. The purple box represents the data that can be directly extracted from GEO. Both the blue and purple boxes generate gene counts, and the user can use either method to generate the gene counts that feed the downstream analysis. However, the Nextflow pipeline requires substantial storage and processing power, so it is recommended to download the counts from GEO when they are available. If the required data is not available from GEO, the Nextflow pipeline can be used to generate the gene counts. The downstream analysis is carried out using the R kernel of a Jupyter notebook, and all the steps are discussed in detail in this module.

### Raw Reads to Gene Counts Table (Optional)

### <span> Preprocessing Raw reads to generate gene count table using nextflow </span>

<figure>
<img src="../../images/nfcore-rnaseq.png" width="900" height="400">
<figcaption align = "center"> <b> Fig 5: Summary of different methods under nf-core RNA-Seq pipeline. [4]</b> </figcaption>
</figure>

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>[4] Reference:</b> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x. </div>


A bioinformatics pipeline called nf-core/rnaseq can be used to analyze RNA sequencing data from organisms with an annotated reference genome. The figure shows the stages of the pipeline: preprocessing of the FASTQ data, genome alignment and quantification, pseudo-alignment and quantification, post-processing, and final quality control. The quantification steps generate gene expression levels from the sequencing data. The different colors in the figure represent alternative routes for processing the FASTQ files. For example, the black line represents the route that uses STAR for alignment and Salmon for quantification. Users can choose whichever route suits their data.

This step is **optional**. It is a preprocessing step that lets you experience generating your own gene counts table. To save on computational and storage resources, we have already provided the gene count table with this module; it will be copied from our bucket in step 3. The gene counts can also be downloaded from NCBI's GEO website, under the supplementary files section of the same data accession.

If, however, you want to try the Nextflow analysis, here are a few tips to help you along. You will need to configure your config file to point to AWS Batch. We provide a template that you can modify with your S3 bucket (you need to create one, for example with `aws s3 mb s3://UNIQUE-BUCKET-NAME`). For further details on how to use Nextflow for RNA-Seq analysis, please refer to [nf-core/rnaseq](https://nf-co.re/rnaseq) or the [Transcriptome-Assembly-Refinement-and-Applications](https://github.com/NIGMS/Transcriptome-Assembly-Refinement-and-Applications) module to learn more about pre-processing through Nextflow.

## Learning Objectives

* **Understanding RNA-seq fundamentals:**  The notebook introduces the central dogma of molecular biology, gene expression, RNA-seq technology, and differential expression analysis.

* **RNA-seq data processing:**  It demonstrates how to process raw RNA-seq data (using the optional Nextflow pipeline) to generate a gene counts table. This includes concepts like quality control and read mapping (though the details are mostly referenced rather than fully explained within the notebook).

* **Differential gene expression analysis using DESeq2:** The core of the notebook focuses on using the `DESeq2` R package to perform differential gene expression analysis on a provided gene count table (or one generated by Nextflow).  This includes normalization, transformation, exploration of results (using plots like MA plots and heatmaps), and interpretation of the results.

* **Data visualization and interpretation:** The notebook emphasizes creating and interpreting various visualizations, including boxplots, mean-standard deviation plots, PCA plots, histograms of p-values, MA plots, and heatmaps.

* **Using bioinformatics tools:** The notebook shows how to use Nextflow (a workflow management system) for RNA-seq data processing and R for statistical analysis and visualization.

## Prerequisites

**Software and Packages:**

* **Nextflow:** For running the RNA-Seq pipeline (optional, but recommended for learning the complete process).
* **R and several R packages:**  `DESeq2`, `vsn`, `genomation`, `hexbin`, `tidyverse`, `Biobase`, `NMF`, `ggplot2` and `BiocManager`.  These are crucial for the differential gene expression analysis and visualization steps.  The notebook includes code to install many of these.

**APIs that should be enabled:**

* **Amazon S3:** The notebook extensively uses `aws s3` commands, indicating it needs access to Amazon S3 for downloading input files (`GSE173380_RNAseq_counts.csv.gz`, `sample_info.txt`) and potentially for storing intermediate and output files from the Nextflow pipeline.  The notebook suggests creating a bucket (`aws s3 mb s3://UNIQUE-BUCKET-NAME`).

* **AWS Batch Compute Environment and Job Queue:** You must have an AWS Batch compute environment and job queue configured. The CloudFormation template automates this. You can set up one manually following the instructions in the link provided in the notebook, but using the template is *recommended* for ease of setup.

## Get Started

### AWS Batch Setup

AWS Batch will create the needed permissions, roles and resources to run Nextflow in a serverless manner. You can set up AWS Batch manually or deploy it **automatically** with a stack template. The Launch Stack button below will take you to the CloudFormation create-stack page with the template and its required resources already linked.

If you prefer to skip manual deployment and deploy automatically in the cloud, click the Launch Stack button below. For a walkthrough of the screens during automatic deployment please click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToLaunchAWSBatch.md). The deployment should take ~5 min and then the resources will be ready for use. 

[![Launch Stack](../../images/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=aws-batch-nigms&templateURL=https://nigms-sandbox.s3.us-east-1.amazonaws.com/cf-templates/AWSBatch_template.yaml )


Before beginning this tutorial, if you do not have the required roles, policies, permissions, or compute environment and would like to set them up **manually**, please click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/AWS-Batch-Setup.md).

### Install Nextflow

In [None]:
# Install Nextflow, make it executable, and update it
system('curl https://get.nextflow.io | bash' , intern=TRUE)
system('chmod +x nextflow' , intern=TRUE)
system('./nextflow self-update' , intern=TRUE)

**The output data generated by Nextflow is quite large. We can mitigate that by storing the temporary and output files in a bucket, by setting 'workDir' and 'params.outdir' to an existing bucket. Make sure you modify the file called rnaseq-aws.config**
 
`workDir = 's3://your_bucket_name/rna-tmp'`  
`params.outdir = 's3://your_bucket_name/rna-outputs'`

This next step can take about 45 min, but since it runs serverlessly using AWS Batch, we recommend you actually paste the command into a terminal, and then run the rest of the notebook. Then you can review the output at the end and see how the different jobs were processed. Plus Nextflow looks strange in `R` so it is better in the terminal anyway. 

In [None]:
system('./nextflow run nf-core/rnaseq -c rnaseq-aws.config -profile test,aws', intern=TRUE)

<div class="alert alert-block alert-info">
    <i class="fa fa-lightbulb-o" aria-hidden="true"></i>
    <b>Tip: </b> If you don't immediately see any output on your screen, check the output directory you have pointed to in your config file to ensure that Nextflow is running. You should see some output directories/files.
</div>

## Gene Counts Table to Differential Expression

### <span> Install and Load required packages </span>

Run the following to create a kernel with all required packages installed. It will take about 5 minutes to install the packages and create the kernel.

In [None]:
system('chmod +x install_packages.sh' , intern=TRUE)
system('bash install_packages.sh >> logs.txt' , intern=TRUE) # This creates "R-Custom-Kernel"

**Important**: Choose "R-Custom-Kernel" for the rest of the notebook.

In [None]:
library(DESeq2)
library(magrittr)

The analysis of an RNA-Seq experiment involves a number of phases. The sequencing reads (FASTQ files) are processed first, usually by aligning them to a reference genome. The number of reads mapped to each gene can then be determined. The result is a table of counts, on which we run statistical analyses to identify differentially expressed genes and pathways.

### <span> Importing RNA_seq raw counts and annotation file </span>

The files to import can also be prepared locally. The count table combines several samples from multiple runs, and the sample information file assigns each run to its experimental condition. For instance, the control condition might be represented by four runs with different run and file names, and all of them are annotated as control.

In [None]:
# download data files from storage bucket
system("aws s3 cp s3://nigms-sandbox/nosi-und/RNA-Seq/GSE173380_RNAseq_counts.csv.gz .", intern=TRUE)
system("aws s3 cp s3://nigms-sandbox/nosi-und/RNA-Seq/sample_info.txt .", intern=TRUE)

readcounts <- read.table("GSE173380_RNAseq_counts.csv.gz", sep = ",", header = T, row.names = 1)
sample_info <- read.table("sample_info.txt", sep = "\t")

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  If you've used Nextflow to produce your gene counts table and would like to use it for the downstream analysis instead of the provided counts table, enter your own files into the code above by copying <b>salmon.merged.gene_counts.tsv</b> from the salmon subdirectory within your Nextflow output directory.
</div>
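
As a hedged sketch of how a Nextflow-generated table could be read in place of the provided one: the example below writes a tiny stand-in file and loads it. It assumes the file is tab-separated with a `gene_id` column, a `gene_name` column, and one column per sample; check the header of your own file before relying on this layout. The gene IDs, sample names, and the `readcounts_nf` object are made up for illustration.

```r
# Write a tiny stand-in for salmon.merged.gene_counts.tsv (assumed column layout, made-up values)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "gene_id\tgene_name\tsample_A\tsample_B",
  "ENSG0001\tGENE1\t10.2\t20",
  "ENSG0002\tGENE2\t0\t3.2"
), tsv_path)

# Use gene_id as row names and keep the sample names exactly as written in the header
nf_counts <- read.delim(tsv_path, row.names = "gene_id", check.names = FALSE)

# Drop the annotation column so that only numeric sample columns remain
readcounts_nf <- nf_counts[, setdiff(colnames(nf_counts), "gene_name"), drop = FALSE]

# Salmon reports estimated (non-integer) counts; DESeq2 expects integers, hence the rounding
readcounts_nf <- round(readcounts_nf)
readcounts_nf
```

The rounding mirrors the `round(readcounts)` call used when the DESeq2 object is built later in this notebook.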

### <span> Store the information in a dataframe </span>
The next step is to build a DESeq2 data set. We will store the raw read counts, the sample information, and the experimental design (the condition) in a `DESeqDataSet` object, here named DESeq.ds.
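
Before combining counts and sample information, it is worth checking that the columns of the count table and the rows of the sample sheet refer to the same samples in the same order, since DESeq2 expects them to line up. A minimal sketch with made-up tables (`toy_counts` and `toy_samples` are hypothetical names, not objects from this notebook):

```r
# Toy count table and sample sheet; names and values are made up
toy_counts <- data.frame(ctrl_1 = c(5, 0), ctrl_2 = c(7, 1), trt_1 = c(20, 0),
                         row.names = c("geneA", "geneB"))
toy_samples <- data.frame(condition = c("treated", "control", "control"),
                          row.names = c("trt_1", "ctrl_1", "ctrl_2"))

# Same set of sample names, but in a different order
setequal(colnames(toy_counts), rownames(toy_samples))    # TRUE
identical(colnames(toy_counts), rownames(toy_samples))   # FALSE

# Reorder the sample sheet so that it follows the count table columns
toy_samples <- toy_samples[colnames(toy_counts), , drop = FALSE]
stopifnot(identical(colnames(toy_counts), rownames(toy_samples)))
```

The same two checks can be run on `readcounts` and `sample_info`, provided the sample names are stored as the row names of `sample_info`.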

In [None]:
DESeq.ds <- DESeqDataSetFromMatrix(countData = round(readcounts), colData = sample_info, design = ~condition)


### <span> Explore the dataframe </span>

It is important to investigate the dataframe to see if the raw counts uploaded correctly and all the required information for the analysis is available. 

In [None]:
colData(DESeq.ds) %>% head
assay(DESeq.ds, "counts") %>% head
rowData(DESeq.ds) %>% head
counts(DESeq.ds) %>% str
rowSums(counts(DESeq.ds)) %>% head

### <span> Improving the quality by removing the genes with no gene count </span>
In a full RNA-seq workflow, the quality of the input reads is improved before counting. When the sequencing quality is very high, this step may be viewed as optional, but it can still help even with the highest-quality datasets. The most frequent technical artifacts that can be filtered out are the adapter sequences that contaminate the sequenced reads and the low-quality nucleotides that are typically present at the ends of the reads. These sequencing quality control and read pre-processing steps can be repeated several times until a suitable level of quality is reached, before moving on to the downstream analysis.

Because we start from a gene counts table, the clean-up we perform here happens at the gene level: we remove the genes that have zero counts in every sample, since they carry no information for the comparison.
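
The gene-level filter keeps a gene only if its counts summed over all samples are greater than zero. A minimal sketch on a made-up matrix (the `toy` object and its values are illustrative only):

```r
# Toy count matrix: gene1 and gene3 are zero in every sample
toy <- matrix(c( 0, 0,  0,
                 3, 0,  1,
                 0, 0,  0,
                12, 9, 15),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))

# Keep genes with at least one read in at least one sample
keep <- rowSums(toy) > 0
filtered <- toy[keep, , drop = FALSE]
rownames(filtered)                        # "gene2" "gene4"

# Removing all-zero rows leaves the per-sample totals unchanged
all(colSums(filtered) == colSums(toy))    # TRUE
```

Note that gene2 is kept even though it is zero in one sample; only genes that are zero everywhere are removed.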

In [None]:
DESeq.ds <- DESeq.ds[ rowSums(counts(DESeq.ds)) > 0, ]
# Inspect the data after filtering
rowSums(counts(DESeq.ds)) %>% head
colSums(readcounts)
colSums(counts(DESeq.ds)) 
# colSums(counts(DESeq.ds)) and colSums(readcounts) should match (up to rounding of the counts),
# as we only removed the genes that were not expressed in any of the samples.

### <span> Normalizing the read counts</span>
The read counts are normalized by computing size factors, which addresses the differences not only in the library sizes, but also the library compositions.


In [None]:
# Estimate one size factor per sample using estimateSizeFactors() from DESeq2.
DESeq.ds <- estimateSizeFactors(DESeq.ds)
# Check the size factors; the raw counts stored in the object are left unchanged.
sizeFactors(DESeq.ds)
# colData now contains an additional sizeFactor column
colData(DESeq.ds)
# Retrieve the size-factor normalized read counts using the counts() function
counts.sf_normalized <- counts(DESeq.ds, normalized = TRUE)

The next step is to transform the size-factor normalized read counts to the log2 scale. Most downstream analyses perform better when the normalized read counts are log-transformed, because RNA-seq expression values span an unusually wide range and a few highly expressed genes would otherwise dominate plots and distance calculations.
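
The effect of the log2 transformation with a pseudocount of 1 can be seen on a few made-up values (the vector below is illustrative and not taken from the data set):

```r
# Made-up normalized counts spanning three orders of magnitude
toy_norm <- c(0, 1, 15, 1023)

# The pseudocount of 1 keeps zero counts defined, because log2(0) is -Inf
toy_log <- log2(toy_norm + 1)
toy_log                  # 0 1 4 10

# The range shrinks from 1023 on the raw scale to 10 on the log2 scale
diff(range(toy_norm))
diff(range(toy_log))
```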

In [None]:
# transform size-factor normalized read counts to log2 scale using pseudocount of 1
log.norm.counts <- log2(counts.sf_normalized + 1)
par(mfrow=c(2,1)) # stack the following two plots in one figure (2 rows, 1 column)

# first, boxplots of non-transformed read counts (one per sample)
boxplot(counts.sf_normalized, notch = TRUE,
        main = "Untransformed Read Counts", ylab = "read counts")

# box plots of log2-transformed read counts
boxplot(log.norm.counts, notch = TRUE,
          main = "log2-Transformed Read Counts",
          ylab = "log2(read counts)")

Numerous statistical tests and analyses make the assumption that the data is homoskedastic, or that the variance of each variable is identical. But heteroskedastic behavior frequently appears in data with significant variations in the sizes of the individual observations. Plotting the mean vs. the standard deviation is one method for visually examining heteroskedasticity. Some variability is expected, but if there is a hump as in these data, it means that the variance is influenced by the mean, which goes against the homoskedasticity assumption.
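
To see why log-transformed counts are heteroskedastic, the sketch below simulates counts at three expression levels and compares their spread after the log2 transformation. It uses Poisson noise only as a simplifying assumption (real RNA-seq counts are usually more variable), and all names and numbers are made up.

```r
set.seed(1)
n_samples <- 6
gene_means <- c(low = 5, mid = 50, high = 1000)

# For each expression level, simulate 200 genes x 6 samples and
# average the per-gene standard deviation of log2(count + 1)
mean_sd_by_level <- sapply(gene_means, function(mu) {
  sim_counts <- matrix(rpois(200 * n_samples, lambda = mu), ncol = n_samples)
  mean(apply(log2(sim_counts + 1), 1, sd))
})

# The spread on the log2 scale shrinks as the mean expression grows
round(mean_sd_by_level, 3)
```

A transformation that stabilizes the variance aims to flatten this dependence on the mean.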

In [None]:
# mean-sd plot
library(vsn)
library(ggplot2)
library(hexbin)

msd_plot <- meanSdPlot(log.norm.counts, 
                       ranks=FALSE, # show the data on the original scale
                       plot = FALSE)
msd_plot$gg + 
  ggtitle("Sequencing Depth Normalized log2(read counts)") +
  ylab("standard deviation")

We will reduce the heteroskedasticity using DESeq2's `rlog()` function. `rlog()` (regularized log) transforms the counts to the log2 scale and normalizes for sequencing depth.

In [None]:
# Regularized log-transformed values
DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)
rlog.norm.counts <- assay(DESeq.rlog)
# Mean-SD plot for rlog-transformed data
msd_plot <- meanSdPlot(rlog.norm.counts, 
                       ranks=FALSE, 
                       plot = FALSE)
msd_plot$gg + 
  ggtitle("rlog-Transformed Read Counts") +
  ylab("standard deviation")

### <span> Hierarchical clustering </span>
As an exploratory tool, clustering RNA-seq data enables the user to arrange and visualize relationships between samples or groups of genes and to choose particular genes for further analysis. The approach aims to isolate reasonably homogeneous groups based on the chosen measure of similarity. Here, we cluster the samples hierarchically, using 1 minus the Pearson correlation between samples as the distance.
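
The correlation-based distance can be sketched on simulated data before applying it to the real counts. In the example below, two "control" and two "treated" samples are generated from two independent profiles plus a little noise; the sample names and all values are made up.

```r
set.seed(42)
n_genes <- 50
profile_ctrl <- rnorm(n_genes)   # shared expression profile of the control replicates
profile_trt  <- rnorm(n_genes)   # independent profile of the treated replicates

toy_expr <- cbind(
  ctrl_1 = profile_ctrl + rnorm(n_genes, sd = 0.1),
  ctrl_2 = profile_ctrl + rnorm(n_genes, sd = 0.1),
  trt_1  = profile_trt  + rnorm(n_genes, sd = 0.1),
  trt_2  = profile_trt  + rnorm(n_genes, sd = 0.1)
)

# cor() works on columns, so this is a sample-by-sample correlation matrix;
# highly correlated samples get a distance close to 0
toy_dist <- as.dist(1 - cor(toy_expr, method = "pearson"))
toy_tree <- hclust(toy_dist)

# Cutting the tree into two groups should recover the two conditions
cutree(toy_tree, k = 2)
```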

In [None]:
# cor() calculates the correlation between columns of a matrix
distance.m_rlog <- as.dist(1 - cor(rlog.norm.counts, method = "pearson" ))
# plot() can directly interpret the output of hclust()
plot( hclust(distance.m_rlog), 
      labels = colnames(rlog.norm.counts),
      main = "rlog Transformed Read Counts\nDistance: Pearson Correlation")

The outcomes of differential gene expression analysis are frequently visualized as volcano plots, MA plots, and heatmaps, although many other types of analysis can be carried out as well. These plots make it possible to look more closely at the list of differentially expressed genes and to find or further explore candidate genes.

### <span> PCA using DESeq </span>
Let's generate a PCA plot to visualize how the replicates cluster, as a scatter plot in two dimensions.

A PCA plot shows how well the sample replicates reproduce each other and can serve as a final diagnostic before the differential expression analysis. The PCA is computed on the transformed counts (here the rlog-transformed object `DESeq.rlog`) rather than on the raw counts. To see whether the replicates cluster properly, the points in the scatter plot are colored according to the variable of interest.
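
Conceptually, a PCA plot of this kind selects the most variable genes and runs a principal component analysis with the samples as observations. The base R sketch below illustrates that idea on simulated data; it is not the internal code of `plotPCA()`, and the cutoff of 50 genes, the sample names, and all values are assumptions made for the example.

```r
set.seed(7)
n_genes <- 200
toy_expr <- matrix(rnorm(n_genes * 4, mean = 8, sd = 0.2), nrow = n_genes,
                   dimnames = list(paste0("gene", seq_len(n_genes)),
                                   c("ctrl_1", "ctrl_2", "trt_1", "trt_2")))
# Make the first 20 genes higher in the treated samples
toy_expr[1:20, c("trt_1", "trt_2")] <- toy_expr[1:20, c("trt_1", "trt_2")] + 3

# Select the most variable genes, then run PCA with samples as rows
gene_vars <- apply(toy_expr, 1, var)
top_genes <- order(gene_vars, decreasing = TRUE)[1:50]
toy_pca <- prcomp(t(toy_expr[top_genes, ]))

# Percentage of variance explained by each principal component
percent_var <- 100 * toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(percent_var, 1)

# Sample coordinates on the first two components
toy_pca$x[, 1:2]
```

In this toy case the first component separates control from treated samples, which is the pattern to look for among real replicates.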

In [None]:
P <- plotPCA(DESeq.rlog)
P <- P + theme_bw() + ggtitle("Rlog Transformed Counts")
print(P)

### <span> Differential Gene Expression Analysis (DGE) </span>
One of the most popular uses of RNA-sequencing (RNA-seq) data is differential gene expression (DGE) analysis. This procedure is frequently utilized in various RNA-seq data analysis applications because it enables the identification of genes that are differentially expressed across two or more conditions.

DESeq2 carries out an internal normalization in which the geometric mean of each gene across all samples is computed. The counts of that gene in each sample are then divided by this geometric mean. The size factor for a sample is the median of these ratios over all genes in that sample.
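
The median-of-ratios calculation described above can be written out in a few lines of base R. This is a sketch of the idea on a made-up matrix in which sample s2 has exactly twice and s3 exactly three times the depth of s1; it is not DESeq2's own code and it ignores special cases such as genes with zero counts.

```r
# Toy counts: every gene scales 1 : 2 : 3 across the three samples
toy_counts <- matrix(c( 10,  20,  30,
                       100, 200, 300,
                         5,  10,  15,
                        50, 100, 150),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))

# Log of the per-gene geometric mean across samples
log_geo_means <- rowMeans(log(toy_counts))

# Per sample: median over genes of (count / geometric mean), computed on the log scale
size_factors <- apply(toy_counts, 2, function(sample_counts) {
  exp(median(log(sample_counts) - log_geo_means))
})
size_factors

# Dividing each column by its size factor puts the samples on a common scale
toy_normalized <- sweep(toy_counts, 2, size_factors, "/")
toy_normalized
```

After normalization, each gene has the same value in all three toy samples, as expected when the only difference between samples is sequencing depth.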

In [None]:
# DESeq2 uses the levels of the condition to determine the order of the comparison
str(colData(DESeq.ds)$condition)
# set the first-level-factor
colData(DESeq.ds)$condition <- relevel(colData(DESeq.ds)$condition , "control")
# Finally run the DESeq analysis
DESeq.ds <- DESeq(DESeq.ds)

### <span> Explore the DGE analysis results </span>

In [None]:
#Check the results of deseq analysis
DGE.results <- results(DESeq.ds, independentFiltering = TRUE, alpha = 0.05)
summary(DGE.results)

### <span> P-value Histogram </span>
Our DE results can be quickly and easily "sanity checked" by creating a p-value histogram. A high bar between 0 and 0.05 should be seen, followed by a somewhat uniform tail to the right.
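
The expected shape can be reproduced with simulated p-values: genes without a true difference give uniformly distributed p-values, while truly changed genes pile up near zero. The mixture below (90% null genes, and a Beta distribution standing in for the changed genes) is purely an assumption for illustration.

```r
set.seed(3)
null_p <- runif(9000)                              # no true difference: uniform on [0, 1]
alt_p  <- rbeta(1000, shape1 = 0.5, shape2 = 20)   # true differences: concentrated near 0
sim_p  <- c(null_p, alt_p)

# Count p-values in bins of width 0.05 without drawing the plot
p_hist <- hist(sim_p, breaks = seq(0, 1, length.out = 21), plot = FALSE)
p_hist$counts

# The first bin (0 to 0.05) is the tallest; the remaining bins are roughly flat
which.max(p_hist$counts)
```

A histogram that instead rises toward 1, or shows several peaks, usually indicates a problem with the test or the design rather than biology.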

In [None]:
#Histogram
hist(DGE.results$pvalue, 
     col = "grey", border = "white", xlab = "", ylab = "",
     main = "Frequencies of P-values (all genes)")

### <span> MA Plot</span>
An MA plot is helpful for determining whether the data normalization was successful. It is a scatter plot where the y-axis represents the log fold change in the specified contrast and the x-axis represents the average of the normalized counts across samples. Since most genes are not expected to be differentially expressed, the majority of points should lie close to the horizontal 0 line.
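
The two axes of an MA plot can be computed by hand for a made-up table of normalized counts. The fold change here is a naive ratio of group means, which is a simplification: DESeq2 reports model-based estimates instead. Gene and sample names are hypothetical.

```r
# Made-up normalized counts for three genes in two control and two treated samples
toy_norm <- matrix(c(100, 100, 200, 200,
                      50,  50,  50,  50,
                     400, 400, 100, 100),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB", "geneC"),
                                   c("ctrl_1", "ctrl_2", "trt_1", "trt_2")))
is_trt <- grepl("^trt", colnames(toy_norm))

# A: average normalized count across all samples (x-axis)
A <- rowMeans(toy_norm)
# M: log2 ratio of the treated mean to the control mean (y-axis)
M <- log2(rowMeans(toy_norm[, is_trt]) / rowMeans(toy_norm[, !is_trt]))

plot(A, M, log = "x", xlab = "mean of normalized counts", ylab = "log2 fold change")
abline(h = 0, lty = 2)   # unchanged genes fall on this line
```

Here geneA is doubled (M = 1), geneB is unchanged (M = 0), and geneC drops fourfold (M = -2).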

In [None]:
#MA plot
plotMA(DGE.results, alpha = 0.05, main = "Control vs Test", ylim = c(-4,4))

### <span> Heatmap</span>
<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note:</b> Heatmap will be saved as a PDF file in the current working directory. </div>

This step shows how to make a heatmap of the top differentially expressed genes in an RNA-seq dataset. To do this, we need to extract the differentially expressed genes from the DE results. A heatmap displays the expression of each gene in each sample as a color, which makes groups of genes that behave similarly across samples easy to spot.
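
Scaling each gene (row) before plotting is what makes sample-specific differences visible when genes differ greatly in their overall expression level. The sketch below shows row-wise z-scores in base R on made-up values; it illustrates the concept and is assumed, not verified, to correspond to what a row-scaling option in a heatmap function does.

```r
# Made-up log-scale expression values for three genes in four samples
toy_expr <- matrix(c(  2,   4,   6,   8,
                      10,  10,  12,  12,
                     100, 300, 200, 400),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB", "geneC"),
                                   c("ctrl_1", "ctrl_2", "trt_1", "trt_2")))

# scale() standardizes columns, so transpose, scale, and transpose back
# to get (value - row mean) / row standard deviation
row_z <- t(scale(t(toy_expr)))
round(row_z, 2)

# After scaling, every gene has mean 0 and standard deviation 1
round(rowMeans(row_z), 10)
apply(row_z, 1, sd)
```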

In [None]:
#HEATMAP
# load the library with the aheatmap() function
library(NMF)

# aheatmap needs a matrix of values, e.g., a matrix of DE genes with the transformed read counts for each replicate
# sort the results according to the adjusted p-value
DGE.results.sorted <- DGE.results[order(DGE.results$padj), ]
# identify genes with the desired adjusted p-value cut-off
DGEgenes <- rownames(subset(DGE.results.sorted, padj < 0.05))
# extract the normalized read counts for DE genes into a matrix
hm.mat_DGEgenes <- log.norm.counts[DGEgenes , ]
#plot the normalized read counts of DE genes sorted by the adjusted p-value
pdf("plot.pdf")
aheatmap(hm.mat_DGEgenes, Rowv = NA, Colv = NA)
# combine the heatmap with hierarchical clustering
image1 <- aheatmap(hm.mat_DGEgenes, Rowv = TRUE, Colv = TRUE, # add dendrograms to rows and columns 
         distfun = "euclidean", hclustfun = "average")


# scale the read counts per gene to emphasize the sample-type-specific differences
aheatmap(hm.mat_DGEgenes ,
         Rowv = TRUE , Colv = TRUE ,
         distfun = "euclidean", hclustfun = "average",
         scale = "row") 
dev.off()
#values are transformed into distances from the center of the 
#row-specific average: (actual value - mean of the group) / standard deviation

### <span> Write the DGE results to a text file </span>

In [None]:
write.table(DGE.results.sorted, file="rna-seq_dge-results.txt", sep = "\t")

## Conclusion

This Jupyter Notebook provided a comprehensive introduction to RNA-seq analysis, guiding users through the process from raw read data to differential gene expression results.  We explored key concepts such as the central dogma, next-generation sequencing, and differential expression analysis, illustrated with clear diagrams and explanations. The notebook offered both a Nextflow pipeline for processing raw reads (optional, due to computational demands) and a streamlined approach utilizing pre-processed gene count data readily available from GEO.  Downstream analysis using R and DESeq2 enabled normalization, quality control assessment (through mean-SD plots and PCA), and the identification of differentially expressed genes, visualized via MA plots and heatmaps.  The resulting differentially expressed gene list, along with generated visualizations, provides a solid foundation for further biological interpretation and investigation.  Supplementary resources and references are provided for users seeking to delve deeper into specific aspects of RNA-seq data analysis.

## Clean up

Remember to stop your notebook instance when you are finished.

<hr style="border:2px solid Orange">

### <span> References and useful links </span>

- #### https://bioinformatics-core-shared-training.github.io/RNAseq_May_2020_remote/html/05_Annotation_and_Visualisation.html
- #### https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/RNASeq2018/html/02_Preprocessing_Data.nb.html  
- #### https://girke.bioinformatics.ucr.edu/GEN242/tutorials/sprnaseq/sprnaseq/ 
- #### https://compgenomr.github.io/book/rnaseqanalysis.html