# RNAseq training CEA - June 2023

*Instructors: Sandrine Caburet and Claire Vandiedonck*

IFB session: 5 CPUs + 21 GB of RAM

# Part 10 : Exploratory analysis of normalized read counts


- 0.1 - Setting up this R session on IFB core cluster  
- 0.2 - Parameters to be set or modified by the user   
- 1 - Loading input data and metadata   
- 2 - Principal Component Analysis   
- 3 - Clustering and Heatmaps
- 4 - Correlograms
- 5 - Volcano plot
- 6 - Saving our results for later use: DE genes lists and RData file

---

## 0.1 - Setting up this R session on IFB core cluster

<em>loaded JupyterLab</em> : Version 3.2.1

In [None]:
## Code cell 1 ##

session_parameters <- function(){
    
    jupytersession <- c(system('echo "=== Cell launched on $(date) ==="', intern = TRUE),
                        system('squeue -hu $USER', intern = TRUE))
    
    jobid <- system("squeue -hu $USER | awk '/jupyter/ {print $1}'", intern = TRUE)
    jupytersession <- c(jupytersession,
                        "=== Current IFB session size: Medium (5CPU, 21 GB) ===",
                        system(paste("sacct --format=JobID,AllocCPUS,NODELIST -j", jobid), intern = TRUE))
    print(jupytersession[1:6])
    
    return(invisible(NULL))
}

session_parameters()

__

Next we load into this R session the various tools that we will use.   
***DO NOT worry*** if you see a large red output!!   
You should see this large red output only once, when the relevant packages are installed in your home directory. Afterwards, they will be detected as present, and this large red output won't show if you run the cell another time.

In [None]:
## Code cell 2 ##

# list the required libraries from the CRAN repository
# (BiocManager is needed to install the Bioconductor packages below)
requiredLib <- c(
    "ggfortify",
    "ggrepel",
    "RColorBrewer",
    "ggplot2",
    "stringr",
    "matrixStats",
    "corrplot",
    "BiocManager"
)

# list the required libraries from the Bioconductor project
# (ComplexHeatmap is distributed through Bioconductor, not CRAN)
requiredBiocLib <- c("DESeq2", "ComplexHeatmap")

# install required libraries if not yet installed
for (lib in requiredLib) {
  if (!require(lib, character.only = TRUE, quiet = TRUE)) {
    install.packages(lib, quiet = TRUE)
  }
}

for( lib in requiredBiocLib) {
  if (!require(lib, character.only = TRUE, quiet = TRUE)) {
  BiocManager::install(lib, quiet = TRUE)
  }
}

# load libraries
message("Loading required libraries")
for (lib in requiredLib) {
  library(lib, character.only = TRUE)}
for (lib in requiredBiocLib) {
  library(lib, character.only = TRUE)}

# remove variables from the R session if they are no longer necessary 
rm(lib, requiredLib, requiredBiocLib)

---

## 0.2 - Parameters to be set or modified by the user


- Using a full path with a `/` at the end, **define the folder** of the project as the `gohome` variable, and the folder where you work as the `myfolder` variable:

In [None]:
## Code cell 3 ##


gohome <- "/shared/projects/2312_rnaseq_cea/"
gohome

# In JupyterHub, an R notebook starts with its working directory set to the folder where the notebook was opened
getwd()

myfolder <- getwd()
myfolder


- With a `/` at the end, define the path to the folder where the results of this exploratory analysis will be stored. As this is a logical step usually performed together with the normalisation by `DESeq2`, we can stay in the same output folder:

In [None]:
## Code cell 4 ##

# creation of the directory, recursive = TRUE is equivalent to the mkdir -p in Unix
# we can skip this step as the folder is already created
# dir.create(paste(myfolder,"/Results/deseq2/", sep = ""), recursive = TRUE)

# storing the path to this output folder in a variable
deseq2folder <- paste(myfolder,"/Results/deseq2/", sep = "")
deseq2folder

# listing the content of the folder
print(system(paste("ls -hlt", deseq2folder), intern = TRUE) )
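As a side note, `file.path()` is a possible alternative to `paste()` for building such paths. The sketch below is only an illustration: it works in a temporary directory (an assumption, not the course folder), and the `demo_*` names are hypothetical. It also shows that the trailing `/` still has to be added explicitly.

```r
# Illustration only: a temporary directory stands in for `myfolder`
demo_folder <- tempdir()

# file.path() inserts the "/" separators between the components
demo_out <- file.path(demo_folder, "Results", "deseq2")

# recursive = TRUE behaves like `mkdir -p`; showWarnings = FALSE keeps re-runs silent
dir.create(demo_out, recursive = TRUE, showWarnings = FALSE)
dir.exists(demo_out)

# add the trailing "/" expected by later paste0() calls
demo_out_slash <- paste0(demo_out, "/")
paste0(demo_out_slash, "deseq2.RData")
```

Checking `dir.exists()` before writing results is a cheap way to catch a mistyped path early.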

- Last, we specify the size of the graphical outputs that will be used for all the plots in the notebook.    
This setting could be modified at will for each plot. 

In [None]:
## Code cell 5 ##

options(repr.plot.width=15, repr.plot.height=8)

## 1 - Loading input data and metadata

We now need to retrieve the normalized data that we generated in the previous session.   
As we stored it in a global RData object at the end of Pipe_09, we can simply reload all our information by opening this RData object.  

In [None]:
## Code cell 6 ##

rdata <- paste0(deseq2folder,"deseq2.RData")
rdata
load(rdata,verbose = TRUE)

We can now list all the objects currently in our session: 

In [None]:
## Code cell 7 ##

ls() 

What we need now is the `rlog.dds2.annot` object, that contains the normalized read counts, with the Ensembl Gene ID in the first column, and the gene name in the 19th column:

In [None]:
## Code cell 8 ##

head(rlog.dds2.annot) 


In [None]:
## Code cell 9 ##

head(rlog.dds2.annot[ , 19])


We will use this matrix of normalized read counts as input for our in-depth exploratory analysis, so we store it in a specific object, `norm_counts`, together with the column of gene names, which we put in the first column:

In [None]:
## Code cell 10 ##

norm_counts <- rlog.dds2.annot[,c(19, 2:12)]
dim(norm_counts)
head(norm_counts, n= 5)
summary(norm_counts)

In order to have all the visualisation in a single session, we can plot again the distribution of normalised reads:

In [None]:
## Code cell 11 ##

# make a colour vector
conditionColor <- match(samples$Condition, c("dHet", "dHetRag", "WT")) + 1
# '+1' to avoid color '1' i.e. black

# Check distributions of samples using boxplots, using only the columns with read counts
boxplot(norm_counts[,2:12],
        xlab="",
        ylab="rlog.dds2.annot Counts",
        las=2,
        col=conditionColor,
        main="rlog.dds2.annot Counts")
# Let's add a blue horizontal line that corresponds to the median
abline(h=median(as.matrix(norm_counts[ ,2:12])), col="blue")

## 2 - Principal Component Analysis

### 2.1 - PCA on all genes   
We performed a first PCA before normalising the data (in Pipe_08). We are now going to see whether the normalisation of the read counts enables a better reduction of dimensionality.    
We run the PCA the same way we did in Pipe_08: 

In [None]:
## Code cell 12 ##

# run PCA
PCAdata <- prcomp(t(norm_counts[, 2:12]))
summary(PCAdata)

* **Scree plot**

We add a **scree plot**, which shows the part of variance described by the successive Principal Components. The first components are always the ones describing the largest part of variance, but a scree plot is a good way to see how many components could be interesting to explore.

In [None]:
## Code cell 13 ##

# to display the two scree plots side by side
layout(matrix(1:2, ncol=2))

screeplot(PCAdata)
screeplot(PCAdata, type="lines", main = "Screeplot PCAdata - Eigenvalues")
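The heights displayed in a scree plot are the eigenvalues, i.e. the squared `sdev` values returned by `prcomp()`. As a minimal sketch on simulated data (the `toy` matrix is an assumption, not the course dataset), the proportion of variance explained by each component can be computed by hand:

```r
# Toy data (simulated, not the course dataset): 6 samples x 20 genes
set.seed(1)
toy <- matrix(rnorm(6 * 20), nrow = 6,
              dimnames = list(paste0("S", 1:6), paste0("g", 1:20)))

toy_pca <- prcomp(toy)   # samples in rows, as after t() in the notebook

# eigenvalues are the squared standard deviations of the components
eig <- toy_pca$sdev^2
var_explained <- eig / sum(eig)

round(100 * var_explained, 1)          # percentage of variance per PC
round(100 * cumsum(var_explained), 1)  # cumulative percentage
```

These values should match the "Proportion of Variance" and "Cumulative Proportion" rows printed by `summary()` on the same PCA object.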

* **PCA plots**

Now we plot the PCA for the first 2 dimensions, and a second one for the third and fourth PC:

In [None]:
## Code cell 14 ##

autoplot(PCAdata,
         data = samples, 
         colour="Condition", 
         shape="Tissue",
         size=6) +
        geom_text_repel(aes(x=PC1, y=PC2, label=SampleName), box.padding = 0.8)


In [None]:
## Code cell 15 ##

autoplot(PCAdata,
         x = 3,    # PC3
         y = 4,    # PC4
         data = samples, 
         colour="Condition", 
         shape="Tissue",
         size=6) +
    geom_text_repel(aes(x=PC3, y=PC4, label=SampleName), box.padding = 0.8)


We can see that, now that we look at the normalised data, PC1 clearly separates the leukemic cells from the non-leukemic cells, and PC2 clearly separates the dHet cells from the WT ones. Again, PC3 and PC4 seem to separate the various mice, and to group the replicates.   
    

### 2.2 - PCA on the most variant genes

#### 2.2.a - Selection of the most variable genes

Although we can see a nice separation for our dataset, it is not recommended to use all the genes to perform a PCA, even if we limit it to expressed genes. Indeed, the differences between the conditions are likely to be due to a more limited number of genes that vary because of the conditions in a biologically meaningful way.   
So the genes considered for a PCA are usually restricted to the most variable ones, e.g. the 1000 most variable genes, or the top 5 or 10 %.   
Here, we are going to select the top 50 genes, on the basis of their variances.   

* **Initial number of genes in our data:**

In [None]:
## Code cell 16 ##

dim(norm_counts)

* **Variance and Coefficient of variation**   
The coefficient of variation (CV) is a relative measure of variability: the standard deviation divided by the mean. It is a standardized, unitless measure that allows you to compare variability between disparate groups and characteristics. It is recommended to use the CV instead of the variance when the variables in the dataset are very different and present large differences in ranges (such as age and dosages, for example), so that their ranges of variation have to be put on a common scale.

We could use the CV to select the most variant genes, but here it is not really required, as we do not have variables with different scales.  
Therefore we provide the way to compute the CV for reference, but we go on with the variance. 
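For reference, a minimal sketch of a per-gene CV computation is given here. It runs on a small simulated matrix (the `toy_counts` object and its values are assumptions for illustration), not on `norm_counts`:

```r
# Toy matrix (simulated): 4 genes x 5 samples, strictly positive values
set.seed(42)
toy_counts <- matrix(runif(20, min = 5, max = 15), nrow = 4,
                     dimnames = list(paste0("gene", 1:4), paste0("S", 1:5)))

gene_sd   <- apply(toy_counts, 1, sd)   # standard deviation per gene (row)
gene_mean <- rowMeans(toy_counts)       # mean per gene

# CV = sd / mean; it becomes unstable when the mean is close to 0
gene_cv <- gene_sd / gene_mean
sort(gene_cv, decreasing = TRUE)
```

Because the mean is in the denominator, the CV is best reserved for strictly positive measurements.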


We first compute the variance of all genes, using the `rowVars()` function, and we check that no variance is missing (`NA`):

In [None]:
## Code cell 17 ##
#m <- as.matrix(norm_counts[ ,2:12])
var_genes <- matrixStats::rowVars(as.matrix(norm_counts[ ,2:12]))
str(var_genes)
length(which(is.na(var_genes)))

`var_genes` does not contain the gene names, so we add them:

In [None]:
## Code cell 18 ##

names(var_genes) <- norm_counts[,1]
head(var_genes)
length(unique(var_genes))

We sort `var_genes` by decreasing variance, and we store the first 50 values in `topVargenes`:

In [None]:
## Code cell 19 ##

topVargenes <- head(sort(var_genes, decreasing = TRUE), 50)
head(topVargenes)


and we create a dataframe containing the normalised counts for these top genes, again with gene names in the first column, by selecting the top genes from the `norm_counts` dataframe:

In [None]:
## Code cell 20 ##

top50var  <- subset(norm_counts, gene_name %in% names(topVargenes))

We verify the size of the `top50var` dataframe, and that the genes are indeed unique:

In [None]:
## Code cell 21 ##

dim(top50var)
length(unique(top50var$gene_name))
head(top50var, n=5)
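The uniqueness check matters because a selection by name with `%in%` returns every row that carries one of the selected names. The sketch below, on a made-up data frame with one duplicated gene name (all `toy*` and `top_*` objects are hypothetical), compares it with an index-based selection using `order()`:

```r
# Toy data frame (simulated): note the duplicated gene name "A"
toy <- data.frame(gene_name = c("A", "B", "A", "C", "D"),
                  s1 = c(1, 5, 2, 0, 3),
                  s2 = c(9, 5, 2, 1, 3),
                  s3 = c(5, 5, 2, 2, 3))

toy_var <- apply(as.matrix(toy[, -1]), 1, var)   # variance per row

# row indices of the 2 most variable rows
top_idx <- order(toy_var, decreasing = TRUE)[1:2]
top_by_index <- toy[top_idx, ]

# name-based selection, as in the notebook, for comparison
top_names <- toy$gene_name[top_idx]
top_by_name <- subset(toy, gene_name %in% top_names)

nrow(top_by_index)   # exactly 2 rows
nrow(top_by_name)    # 3 rows here, because "A" appears twice
```

If the two row counts differ on real data, the index-based selection is the one that keeps exactly the requested number of rows.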


We can have a quick glance at those genes, to see if we recognize some of them... and indeed, the first one is interesting! :-D 

In [None]:
## Code cell 22 ##

names(topVargenes)

In order to use this dataframe for the next steps, we put the gene names as row names (instead of numbers). 

In [None]:
## Code cell 23 ##

row.names(top50var) <- top50var$gene_name

#### 2.2.b - PCA plots

We now use these top genes to perform our new PCA:

In [None]:
## Code cell 24 ##

# run PCA
PCAdata2 <- prcomp(t(top50var[,-1]))


* **Scree plots**

We display the corresponding scree plots and the first 2 PC:

In [None]:
## Code cell 25 ##

# to display the two scree plots side by side
layout(matrix(1:2, ncol=2))

screeplot(PCAdata2)
screeplot(PCAdata2, type="lines")

The scree plots confirm that there is no interest in looking into PC dimensions beyond PC3 or PC4, as the remaining ones explain very little of the inertia (or total variance).  

* **PCA plots**

In [None]:
## Code cell 26 ##

autoplot(PCAdata2,
         data = samples, 
         colour="Condition", 
         shape="Tissue",
         size=6) +
        geom_text_repel(aes(x=PC1, y=PC2, label=SampleName), box.padding = 0.8)


We can see that the portion of variance explained by PC1 and PC2 is higher, when we take into consideration only our top genes. This is indeed logical, as these genes are likely to be the most impacted by the change of conditions.   


* **Biplot**
 
The biplot is a very popular way to visualise PCA results, as it combines the principal component scores (samples) and the loading vectors (variables) in a single display. Each vector represents a gene, and the arrow represents the influence of this gene on the PC: the longer the arrow, the stronger the influence. Three properties of a vector can be read from the plot:

- The orientation (direction) of the vector with respect to the principal component space, in particular its angle with the principal component axes: the more parallel a vector is to a principal component axis, the more it contributes only to that PC.

- The length in the space: the longer the vector, the more of the variability of this variable is represented by the two displayed principal components; short vectors are thus better represented in other dimensions.

- The angles between vectors of different variables show their correlation in this space: small angles represent high positive correlation, right angles represent lack of correlation, opposite angles represent high negative correlation.
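The arrows of a biplot are drawn from the loadings stored in the `rotation` element of a `prcomp` object. As a hedged sketch on simulated data (the `toy` matrix and the inflated gene `g1` are assumptions for illustration), the loadings can also be inspected numerically:

```r
# Toy data (simulated): 8 samples x 5 genes, gene "g1" made highly variable
set.seed(7)
toy <- matrix(rnorm(8 * 5), nrow = 8,
              dimnames = list(paste0("S", 1:8), paste0("g", 1:5)))
toy[, "g1"] <- toy[, "g1"] * 10

toy_pca <- prcomp(toy)

# loadings: one row per gene, one column per principal component
loadings <- toy_pca$rotation
round(loadings[, 1:2], 2)

# genes ranked by the absolute size of their PC1 loading
sort(abs(loadings[, "PC1"]), decreasing = TRUE)

# vector length in the PC1/PC2 plane (arrow lengths are proportional to it)
sqrt(rowSums(loadings[, 1:2]^2))
```

Sorting the loadings this way gives a numeric counterpart to the visual reading of the arrows, which helps when many arrows overlap.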



In [None]:
## Code cell 27 ##

biplot(PCAdata2,
       scale = 0)

Here, we can see that `Xist` points towards the bottom right, but its vector is longer towards the bottom than towards the right. Therefore, we can deduce that the variation of `Xist` expression in our dataset contributes more to the PC2 axis, that is to the difference of dHet cells (leukemic or not) compared to WT cells. 

You can obtain more refined plots for PCA, scree plots and bi plots (and more) by using dedicated packages.    
One of the best and most popular is [FactoMineR](http://factominer.free.fr/index_fr.html), with its  companion package factoextra.    
You can find many tutorials, such as this one:    
http://www.sthda.com/english/wiki/wiki.php?id_contents=7851

---  


## 3 - Clustering and Heatmaps 

### 3.1 - Hierarchical clustering   

As we did before normalisation, we will look at the grouping of our samples using hierarchical clustering.    
This representation clusters the samples based on dissimilarity indexes *(see lecture 12; here we use the Euclidean distance, the default of `dist()`, combined with Ward's linkage method)*. More information can be found with `?hclust` (or in the Contextual Help panel on the right, which can be opened via the Help menu).

Hierarchical clustering is performed in two steps: calculate the distance matrix and apply clustering. 

In [None]:
## Code cell 28 ##

clusters <- hclust(dist(as.matrix(t(top50var[ ,-1]))), method ="ward.D")
plot(clusters)

rm(clusters)    
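To make the two steps explicit, and to illustrate how the robustness to the clustering method can be checked, here is a minimal sketch on simulated data (two artificial groups of samples; all object names are hypothetical):

```r
# Toy data (simulated): two groups of 3 samples, 10 genes each
set.seed(3)
grpA <- matrix(rnorm(30, mean = 0), nrow = 3)
grpB <- matrix(rnorm(30, mean = 5), nrow = 3)
toy <- rbind(grpA, grpB)
rownames(toy) <- c(paste0("A", 1:3), paste0("B", 1:3))

# Step 1: distance matrix between samples (rows); Euclidean by default
toy_dist <- dist(toy)

# Step 2: agglomerative clustering, here with two linkage methods
hc_ward     <- hclust(toy_dist, method = "ward.D")
hc_complete <- hclust(toy_dist, method = "complete")

# cut both trees into 2 groups and compare the memberships
cutree(hc_ward, k = 2)
cutree(hc_complete, k = 2)
```

When different linkage methods give the same memberships, as expected for such well-separated toy groups, the grouping can be considered robust to that choice.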

We can see that replicates from the same mouse cluster together, as expected. What is less expected is the grouping together of leukemic cells with WT ones! You could test other methods and see that this clustering is robust.

### 3.2 - Simple heatmaps   

Another way of looking at  expression data is a heatmap, that shows the level of expression of genes as colors, usually organised with samples in columns, and genes as rows.   

A simple and efficient way of plotting such a display is the function `heatmap()`, which combines a heatmap with dendrograms grouping samples and genes, as in hierarchical clustering:

In [None]:
## Code cell 29 ##

heatmap(as.matrix(top50var[,-1]))

### 3.3 - Enhanced heatmaps with `ComplexHeatmap`
---

There are two main packages to draw enhanced heatmaps:
- [pheatmap](https://cran.r-project.org/web/packages/pheatmap/index.html): a CRAN package.
- [ComplexHeatmap](https://jokergoo.github.io/ComplexHeatmap-reference/book/) on GitHub but also available in Bioconductor, the second generation of pheatmap.

ComplexHeatmap is the most flexible tool to draw heatmaps with a lot of options, notably to add annotations.  
Below is presented a quick example on ComplexHeatmap usage.

- To use ComplexHeatmap, the dataset must be a **matrix** with samples in columns and genes in rows. Our `top50var` dataframe already has genes in rows, so we only drop the column of gene names.

- In addition, we use the function `scale()` to **center and scale the values of each gene as a Z-score**. As `scale()` works on columns, the scaled result has to be transposed so that genes stay in rows. Centering the values provides a heatmap with a more neutral color for middle values, thus enabling a better visualisation of variations.

In [None]:
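## Code cell 30 ##

# Z-score per gene: scale() works on columns, so we transpose, scale, and transpose back
# (this keeps the gene names as row names and the sample names as column names)
t_top50var <- t(scale(t(as.matrix(top50var[,-1]))))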
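To see what the row-wise Z-score does, here is a small sketch on a made-up matrix (simulated values, not the course data). It uses `t(scale(t(x)))`, one common formulation that keeps the row and column names:

```r
# Toy matrix (simulated): 3 genes x 4 samples with very different expression levels
toy <- rbind(geneA = c(1, 2, 3, 4),
             geneB = c(100, 200, 300, 400),
             geneC = c(10, 10, 20, 20))
colnames(toy) <- paste0("S", 1:4)

# scale() works column-wise, so transpose, scale, and transpose back
toy_z <- t(scale(t(toy)))

round(toy_z, 2)
rowMeans(toy_z)        # each gene is now centred on 0
apply(toy_z, 1, sd)    # and has a standard deviation of 1
```

`geneA` and `geneB` end up with identical Z-scores although their raw levels differ by a factor of 100: after scaling, the heatmap colors reflect the expression pattern across samples rather than the absolute level.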

- Then you select a **distance** to measure the dissimilarity between genes or between samples. It can be a pre-defined character string among (`euclidean`, `maximum`, `manhattan`, `canberra`, `binary`, `minkowski`, `pearson`, `spearman`, `kendall`). Default is `euclidean`.    
It can also be a function. In R the function to compute distances is `dist()`. The correlation distance is defined as `1 - cor(x, y, method)`. [See there](https://jokergoo.github.io/ComplexHeatmap-reference/book/a-single-heatmap.html#distance-methods) for further details.

- You can also select the **clustering method** by `clustering_method_rows` and `clustering_method_columns`. Possible methods are those supported in `hclust()` function: `ward.D`, `ward.D2`, `single`, `complete`, `average` (= UPGMA), `mcquitty` (= WPGMA), `median` (= WPGMC) or `centroid` (= UPGMC).
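The correlation distance mentioned above can be sketched in base R as follows. The `toy` matrix is made up for illustration, and the linkage method `average` is an arbitrary choice among those listed:

```r
# Toy matrix (simulated): 4 genes x 6 samples
toy <- rbind(g1 = c(1, 2, 3, 4, 5, 6),
             g2 = c(2, 4, 6, 8, 10, 12),   # same profile as g1, larger scale
             g3 = c(6, 5, 4, 3, 2, 1),     # opposite profile
             g4 = c(3, 1, 4, 1, 5, 9))

# cor() works on columns, so genes must be put in columns with t()
gene_cor <- cor(t(toy), method = "pearson")

# correlation distance: 0 for identical profiles, 2 for opposite ones
gene_dist <- as.dist(1 - gene_cor)
round(gene_dist, 2)

# the distance object can be passed to hclust() like any other
hc <- hclust(gene_dist, method = "average")
hc$labels
```

With this distance, `g1` and `g2` are at distance 0 despite their different scales, which is why correlation-based distances are often preferred for clustering genes by expression pattern.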

In [None]:
## Code cell 31 ##

# we change the dimension of the output for better rendering
options(repr.plot.width=10, repr.plot.height=10)

ComplexHeatmap::Heatmap(t_top50var,
                        name = "Z-score",
                        clustering_distance_rows = "pearson",
                        column_title = "pre-defined distance method (1 - pearson)", row_title = "Top 50 genes",
                        clustering_method_rows = "ward.D")

- One nice feature of ComplexHeatmap is the possibility to add custom **annotations**: `ha`, for "heatmap annotation", is the object name used in the ComplexHeatmap tutorial.

- You can add annotations on samples, either quantitative annotations with functions such as `anno_points()` or `anno_barplot()`, or qualitative annotations from a dataframe. The `col` argument is used to specify the colors of the different categorical values. For the quantitative values, you can add points or boxplots for example.   

Many beautiful examples can be found in the [Complex Heatmap manual](https://jokergoo.github.io/ComplexHeatmap-reference/book/), and a nice progressive tutorial is available [here](https://www.datanovia.com/en/lessons/heatmap-in-r-static-and-interactive-visualization/). 
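As a hedged illustration of such annotations, the sketch below builds a small heatmap with one qualitative and one quantitative column annotation. The data, the `toy_condition` factor and the `toy_depth` values are all made up, and the code only runs if ComplexHeatmap is installed:

```r
# Toy matrix (simulated): 6 genes x 4 samples, with a made-up "Condition" factor
set.seed(11)
toy <- matrix(rnorm(24), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("S", 1:4)))
toy_condition <- c("ctrl", "ctrl", "treated", "treated")
toy_depth     <- c(10, 12, 9, 14)   # hypothetical quantitative value per sample

if (requireNamespace("ComplexHeatmap", quietly = TRUE)) {
    # column annotation: one coloured track and one barplot track
    ha <- ComplexHeatmap::HeatmapAnnotation(
        Condition = toy_condition,
        Depth     = ComplexHeatmap::anno_barplot(toy_depth),
        col       = list(Condition = c(ctrl = "grey60", treated = "firebrick"))
    )
    ht <- ComplexHeatmap::Heatmap(toy, name = "value", top_annotation = ha)
    ComplexHeatmap::draw(ht)
}
```

The annotation vectors must follow the same order as the columns of the matrix, since they are matched by position.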

---  

## 4 - Correlograms

Gene co-expression correlations provide a robust methodology for predicting gene function, as genes sharing a biological process or involved in the same pathways are often co-regulated. This is particularly helpful to provide clues to the functions and roles of lncRNA genes, on the basis of the *"guilt by association"* principle: if lncRNA genes cluster with genes with known functions, they are likely to participate in the same biological process.    

We want to see if the top variant genes identified above have a correlated expression in a pair-wise manner, that is if two genes share a similar pattern of expression across samples.

A Pearson or Spearman correlation is performed between continuous variables with the `cor()`  function, resulting in a correlation coefficient between each pair of genes. This correlation is displayed in a scatter plot with the function `plot()`.

<div class="alert alert-block alert-info"><b><u>What are correlograms?</u></b><br>
<br>When we want to study pairwise correlation between several variables (here genes are the variables), the <code>cor()</code> function can also be applied on a matrix of data. <b><br>
<i>A correlogram is the pairwise graphical representation of the matrix of correlation coefficients</i></b>. Such a correlogram helps highlight the most correlated variables. Some R packages also allow displaying the p-values or reordering the variables according to their degree of correlation.

You have already used heatmaps in session 1 to look at similar patterns of expression among a set of genes, such as the DE genes. Among the distances that could be used to cluster genes in a heatmap, we could have used the Pearson distance = (1-r). In a correlogram, we instead display the <b><i>correlation coefficient, ranging from -1 to 1</i></b>.
</div>
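As a minimal sketch of these two uses of `cor()`, on a made-up matrix of three genes (simulated values, hypothetical names):

```r
# Toy expression values (simulated): 3 genes measured in 5 samples
toy <- rbind(geneA = c(1, 2, 3, 4, 5),
             geneB = c(2, 4, 5, 4, 5),
             geneC = c(5, 4, 3, 2, 1))

# correlation between two genes
cor(toy["geneA", ], toy["geneB", ], method = "pearson")
cor(toy["geneA", ], toy["geneB", ], method = "spearman")

# pairwise correlation matrix: genes must be in columns, hence t()
toy_cor <- cor(t(toy))
round(toy_cor, 2)

isSymmetric(toy_cor)   # TRUE: cor(x, y) equals cor(y, x)
range(toy_cor)         # all coefficients lie between -1 and 1
```

The diagonal of the matrix is always 1, since each gene is perfectly correlated with itself.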


###  4.1 - With R native functions
---

- with the `heatmap` function,   
which can also display a correlogram.

We first have to transpose the data matrix with the `t()` function so that each gene becomes a variable (column).

In [None]:
## Code cell 32 ##

heatmap(cor(t(top50var[,-1])))

### 4.2 - With the `corrplot` package


The `corrplot` package is particularly efficient for customizing correlograms. https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html   
It allows for many improvements of the rendering.

In [None]:
## Code cell 33 ##

options(repr.plot.width = 13, repr.plot.height = 13)

# A first "simple" plot
corrplot(cor(t(top50var[,-1])))

# method = "square" draws squares (the default is "circle")
# order = 'hclust' reorders the genes by hierarchical clustering, grouping together the most correlated ones
# rectangles around clusters (addrect) and correlation coefficients (addCoef.col) are also added
corrplot(cor(t(top50var[,-1])),
         method = "square",
         order = 'hclust',
         addrect = 4,
         addCoef.col = 'black',
         number.cex = 0.3)

By definition a correlation matrix is symmetrical.   
We can add arguments to the `corrplot()` function in order to display only one half of the matrix (here, the upper one). We can also change the shape used to represent each pair-wise correlation.

In [None]:
## Code cell 34 ##

# type = 'upper' will show only the upper half of the matrix
corrplot(cor(t(top50var[,-1])),
        method = 'ellipse',
        type = 'upper',
        insig = 'blank')

____
## 5 - Volcano plot

  

Volcano plots are very commonly used to display the results of RNA-seq or other omics experiments. This plot uses the  information from the Differential Expression analysis, in particular the adjusted p-value and the log2 Fold Change, to highlight the most differentially expressed genes in a very evident way.    
A volcano plot is a type of scatterplot that shows **statistical significance (P value) versus magnitude of change (fold change)**.    
It enables quick visual identification of genes with large fold changes that are also statistically significant, because these are likely to be the most biologically significant genes.    

In a volcano plot :
- the most upregulated genes are towards the right (positive log2 fold change), 
- the most downregulated genes are towards the left (negative log2 fold change), 
- the most statistically significant genes are towards the top (smaller adjusted p-value).   

In our case, these values are in the `res2_dHet_dHetRag_sig_ranked_annot` object, in the `padj` and `log2FoldChange` columns respectively.

* **Thresholds for adjusted p-values and log2 Fold Change**

First we define the cutoff for the adjusted p-value:

In [None]:
## Code cell 35 ##

alpha <- 0.00001

Now we want to identify the genes that have an adjusted p-value below that cutoff, with a positive or a negative log2FoldChange.

We also define a threshold for the log2 fold change, for example 2 and -2.   

In [None]:
## Code cell 36 ##

## Genes with an adjusted p-value < alpha *and a log2(FC) > 2* (upregulated), and their number
upGenes <- res2_dHet_dHetRag_sig_ranked_annot[which(res2_dHet_dHetRag_sig_ranked_annot$padj < alpha & res2_dHet_dHetRag_sig_ranked_annot$log2FoldChange > 2),]
nrow(upGenes)

## Genes with an adjusted p-value < alpha *and a log2(FC) < -2* (downregulated), and their number
downGenes <- res2_dHet_dHetRag_sig_ranked_annot[which(res2_dHet_dHetRag_sig_ranked_annot$padj < alpha & res2_dHet_dHetRag_sig_ranked_annot$log2FoldChange < -2),]
nrow(downGenes)
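The same selection logic can be sketched on a small made-up table (the `toy_*` names and values are assumptions for illustration). The sketch labels each gene as `up`, `down` or `ns` (not selected) and treats a missing adjusted p-value as not selected:

```r
# Toy DE results (simulated): note the NA in padj, as DESeq2 can produce
toy_res <- data.frame(
    gene           = paste0("g", 1:6),
    log2FoldChange = c(3.1, -2.5, 0.4, 2.2, -4.0, 5.0),
    padj           = c(1e-8, 1e-6, 1e-9, 0.03, 1e-12, NA)
)
toy_alpha <- 1e-5   # same cutoff as `alpha` above
toy_lfc   <- 2      # absolute log2 fold-change threshold

# start with everything "ns" (not selected), then flag up and down genes
toy_res$status <- "ns"
is_sig <- !is.na(toy_res$padj) & toy_res$padj < toy_alpha
toy_res$status[is_sig & toy_res$log2FoldChange >  toy_lfc] <- "up"
toy_res$status[is_sig & toy_res$log2FoldChange < -toy_lfc] <- "down"

table(toy_res$status)
toy_res
```

A single `status` column of this kind can then be used directly to color the points of a volcano plot.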


* **Plotting of the volcano plot**   

When drawing the volcano plot, we can color the genes according to this selection.   
Finally, we add horizontal and vertical lines to show the thresholds. 

In [None]:
## Code cell 37 ##

options(repr.plot.width = 10, repr.plot.height = 10)

#draw the plot
plot(-log10(res2_dHet_dHetRag_sig_ranked_annot$padj) ~ res2_dHet_dHetRag_sig_ranked_annot$log2FoldChange, pch = "", xlim = c(-15,15), ylim = c(0,90),
     xlab = "",ylab = "", bty = "n", xaxt = "n", yaxt = "n"  )
title("Leukemic dHet vs non-leukemic dHetRag", font.main = 1, cex.main = 0.9)
axis(1, at = -15:15, tcl = -0.5,cex = 0.7, labels = F )
mtext(-15:15,side = 1,line = 1,at = -15:15, cex = 0.7)
axis(2, at = 0:90, tcl = -0.2, cex = 0.7, labels = F )
mtext(seq(0,90,10),side = 2,line = 0.5, at = seq(0,90,10), cex = 0.7)


## Color in grey genes
points(-log10(res2_dHet_dHetRag_sig_ranked_annot$padj) ~ res2_dHet_dHetRag_sig_ranked_annot$log2FoldChange, pch = 16, cex = 0.5, col = "grey")


## Override the previous colouring for some genes according to the fact they are DE
points(-log10(upGenes$padj) ~ upGenes$log2FoldChange,pch = 16, cex = 0.5, col = "red")
points(-log10(downGenes$padj) ~ downGenes$log2FoldChange, pch = 16,cex = 0.5, col = "green")
mtext("log2(FC)", side = 1, line = 2, cex = 0.8)
mtext("-log10 (adjusted p-value)", side = 2, line = 1.5, cex = 0.8)

abline(h=-log10(alpha), col="blue")
abline(v=-2, col="green")
abline(v=2, col="red")

* **Enlarging the volcano plot** 

As we can see in the list of DE genes ranked by p-value: 

In [None]:
## Code cell 38 ##

head(res2_dHet_dHetRag_sig_ranked_annot, n = 3)

the top DE genes have *very very* low adjusted p-values, which were not displayed in this first volcano plot.    

So in the cell below, we change the `ylim` and the range of the y axis to accommodate the high -log10(adjusted p-value) of these genes.    
We also change 
- the size of colored points: `cex = 0.8` instead of `cex = 0.5`
- the lettering size for the main title: `cex.main = 1.5` instead of `cex.main = 0.9`
- and for the axis titles: `cex = 1.1` instead of `cex = 0.8`. 

In [None]:
## Code cell 39 ##

options(repr.plot.width = 10, repr.plot.height = 10)

#draw the plot
plot(-log10(res2_dHet_dHetRag_sig_ranked_annot$padj) ~ res2_dHet_dHetRag_sig_ranked_annot$log2FoldChange, pch = "", xlim = c(-15,15), ylim = c(0,245),
     xlab = "",ylab = "", bty = "n", xaxt = "n", yaxt = "n"  )
title("Leukemic dHet vs non-leukemic dHetRag", font.main = 1, cex.main = 1.5)
axis(1, at = -15:15, tcl = -0.5,cex = 0.7, labels = F )
mtext(-15:15,side = 1,line = 1,at = -15:15, cex = 0.7)
axis(2, at = 0:245, tcl = -0.2, cex = 0.7, labels = F )
mtext(seq(0,245,10),side = 2,line = 0.5, at = seq(0,245,10), cex = 0.7)


## Color in grey genes
points(-log10(res2_dHet_dHetRag_sig_ranked_annot$padj) ~ res2_dHet_dHetRag_sig_ranked_annot$log2FoldChange, pch = 16, cex = 0.5, col = "grey")


## Override the previous colouring for some genes according to the fact they are DE
points(-log10(upGenes$padj) ~ upGenes$log2FoldChange,pch = 16, cex = 0.8, col = "red")
points(-log10(downGenes$padj) ~ downGenes$log2FoldChange, pch = 16,cex = 0.8, col = "green")
mtext("log2(FC)", side = 1, line = 2, cex = 1.1)
mtext("-log10 (adjusted p-value)", side = 2, line = 1.5, cex = 1.1)

abline(h=-log10(alpha), col="blue")
abline(v=-2, col="green")
abline(v=2, col="red")

Finally, we will write the lists of differentially expressed genes to tab-separated text files, in the `deseq2folder` defined at the beginning of this notebook.

___
## 6 - Saving our results for later use: DE genes lists and RData file

### 6.1 - DE genes lists    

As we have selected DE genes on the basis of their adjusted p-value and their log2 Fold Change, we can save those lists for further enrichment analysis.  

In [None]:
## Code cell 40 ##

dim(upGenes)
dim(downGenes)
head(upGenes)

We save those 2 lists in our Results/deseq2 folder:

In [None]:
## Code cell 41 ##

# FALSE/TRUE are spelled out, as the shortcuts F and T can be overwritten
write.table(as.data.frame(upGenes), file=paste0(deseq2folder,"DESeq2_significant_genes-0_00001-up.tsv"), sep="\t", quote=FALSE, col.names=TRUE)
write.table(as.data.frame(downGenes), file=paste0(deseq2folder,"DESeq2_significant_genes-0_00001-down.tsv"), sep="\t", quote=FALSE, col.names=TRUE)
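When a table is written with its row names, the header line has one field fewer than the data lines unless `col.names = NA` is used. The sketch below, on a made-up table written to a temporary file (both are assumptions for illustration), shows this variant and how the file can be read back:

```r
# Toy table (simulated) with gene IDs as row names
toy <- data.frame(log2FoldChange = c(3.1, -2.5),
                  padj = c(1e-8, 1e-6),
                  row.names = c("geneA", "geneB"))

toy_file <- tempfile(fileext = ".tsv")

# col.names = NA writes an empty header field above the row names,
# so the header has as many fields as the data lines
write.table(toy, file = toy_file, sep = "\t", quote = FALSE, col.names = NA)

readLines(toy_file)[1]   # the header line starts with a tab

# reading back: the first column holds the row names
toy_back <- read.delim(toy_file, row.names = 1)
all.equal(toy, toy_back)
```

This aligned layout is convenient when the file is later opened in a spreadsheet program or parsed by tools other than R.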


### 6.2 - RData object for the session

We can save all the R objects created in this session in a single RData file.   
This will help us to reload our results without having to run the same commands.   

In [None]:
## Code cell 42 ##

print(ls())

In [None]:
## Code cell 43 ##

head(res2_dHet_dHetRag_sig_ranked_annot)

We keep only the relevant objects:

In [None]:
## Code cell 44 ##

rm(PCAdata, PCAdata2, rdata, t_top50var)

and we save all our info in a single RData object in our output folder:

In [None]:
## Code cell 45 ##

ls()
save.image(file=paste0(deseq2folder,"deseq2-final.RData"))
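As an alternative to `save.image()`, `save()` stores only the objects that are named explicitly. The sketch below uses made-up objects and a temporary file (all hypothetical), and reloads them into a separate environment so that nothing in the current session is overwritten:

```r
# Toy objects (simulated) standing in for the session's results
toy_up   <- data.frame(gene = c("g1", "g2"), log2FoldChange = c(3.1, 2.4))
toy_down <- data.frame(gene = "g3", log2FoldChange = -2.8)

toy_rdata <- tempfile(fileext = ".RData")

# save() stores only the objects that are named, unlike save.image()
save(toy_up, toy_down, file = toy_rdata)

# load into a separate environment to inspect the content without
# overwriting objects of the current session
toy_env <- new.env()
loaded <- load(toy_rdata, envir = toy_env)
loaded
identical(toy_env$toy_up, toy_up)
```

`load()` returns the names of the restored objects, which is a quick way to check what an RData file contains.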

---
___

Now we go on with the enrichment analysis of gene lists of interest.  
  
**=> Step 11: Overrepresentation and enrichment analysis** 

The jupyter notebook used for the next session will be *Pipe_11-R403-Normcounts-exploratory-analysis-II.ipynb*    
***It is not ready yet!!***  

So we will not retrieve it in our personal directory as usual. 




**Save executed notebook**

To end the session, save your executed notebook in your `run_notebooks` folder. Adjust the name with yours and reformat as code cell to run it.

<div class="alert alert-block alert-success"><b>Success:</b> Well done! You now know how to visualize normalised expression data.<br>
Don't forget to save your notebook and export a copy as an <b>html</b> file as well <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to the server! 
</div>

---
---

## Useful commands
<div class="alert alert-block alert-info"> 
    
- <kbd>CTRL</kbd>+<kbd>S</kbd> : save notebook<br>    
- <kbd>CTRL</kbd>+<kbd>ENTER</kbd> : Run Cell<br>  
- <kbd>SHIFT</kbd>+<kbd>ENTER</kbd> : Run Cell and Select Next<br>   
- <kbd>ALT</kbd>+<kbd>ENTER</kbd> : Run Cell and Insert Below<br>   
- <kbd>ESC</kbd>+<kbd>y</kbd> : Change to *Code* Cell Type<br>  
- <kbd>ESC</kbd>+<kbd>m</kbd> : Change to *Markdown* Cell Type<br> 
- <kbd>ESC</kbd>+<kbd>r</kbd> : Change to *Raw* Cell Type<br>    
- <kbd>ESC</kbd>+<kbd>a</kbd> : Create Cell Above<br> 
- <kbd>ESC</kbd>+<kbd>b</kbd> : Create Cell Below<br> 

<em>  
To make nice html reports with markdown: <a href="https://dillinger.io/" title="dillinger.io">html visualization tool 1</a> or <a href="https://stackedit.io/app#" title="stackedit.io">html visualization tool 2</a>, <a href="https://www.tablesgenerator.com/markdown_tables" title="tablesgenerator.com">to draw nice tables</a>, and the <a href="https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd" title="Ultimate guide">Ultimate guide</a>. <br>
Further reading on JupyterLab notebooks: <a href="https://jupyterlab.readthedocs.io/en/latest/user/notebook.html" title="Jupyter Lab">Jupyter Lab documentation</a>.<br>   
</em>    
 
</div>

Bénédicte Noblet - 05-07 2021   
Sandrine Caburet and Claire Vandiedonck - 02-06 2023   
with adaptations from https://bioinformatics-core-shared-training.github.io/RNAseq_November_2020_remote/html/02_Preprocessing_Data.html   
and https://rpubs.com/adoughan/778146  
Updated 08/06/2023 by @SCaburet   