# RNAseq training CEA - June 2023

*Instructors: Sandrine Caburet and Claire Vandiedonck*

IFB session: 5 CPU + 21 GB of RAM

# Part 10 : Exploratory analysis of normalized read counts and visualization of DGE analysis


    0 - Getting started
        0.1 - Setting up this R session on IFB core cluster  
        0.2 - Parameters to be set or modified by the user   
    1 - Loading input data and metadata   
    2 - Principal Component Analysis   
    3 - Clustering and Heatmaps
    4 - Correlograms
    5 - Volcano plot to see DE genes
    6 - Saving our results for later use: DE genes lists and RData file

---
## **Before going further**

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly: <kbd>Ctrl</kbd> + <kbd>S</kbd> or click on the 💾 icon.
</div>

<div class="alert alert-block alert-warning"><b>Warning:</b> You are strongly advised to run the cells in the indicated order. If you want to rerun cells above, you can just restart the kernel so that the cell numbering starts at 1 again. </div>

---
---
## 0. Getting started

---
### 0.1 - Setting up this R session on IFB core cluster
---

<em>loaded JupyterLab</em> : Version 3.2.1

#### **0.1-a. Jupyter session**

In [None]:
## Code cell 1 ##

session_parameters <- function(){
    
    jupytersession <- c(system('echo "=== Cell launched on $(date) ==="', intern = TRUE),
                        system('squeue -hu $USER', intern = TRUE))
    
    jobid <- system("squeue -hu $USER | awk '/jupyter/ {print $1}'", intern = TRUE)
    jupytersession <- c(jupytersession,
                        "=== Current IFB session size: Medium (5CPU, 21 GB) ===",
                        system(paste("sacct --format=JobID,AllocCPUS,NODELIST -j", jobid), intern = TRUE))
    print(jupytersession[1:6])
    
    return(invisible(NULL))
}

session_parameters()

__

#### **0.1-b. R session**

Next we load into this R session the various tools that we will use.   
***DO NOT worry*** if you see a large red output! It only contains warning messages, for instance about functions from different packages that share the same name. If some required packages are not yet installed on the server, you will also see messages while they are installed in your home directory ("~/R/x86_64-conda-linux-gnu-library/4.0").

<div class="alert alert-danger" role="alert"> <b><u> Caution on R version and installation of R packages when using Jupyter notebooks </u></b>
<br>
In notebooks, we cannot interactively answer prompted questions when installing new packages in a session (except the question concerning the choice of the repository).
<br>If you never installed packages with the 4.0.3 R version on the IFB core cluster, you do not have a home folder called '~/R/x86_64-conda-linux-gnu-library/4.0'. In that case, you will have to open a terminal and enter the following commands:<br>
    <code>module load r/4.0.3</code><br>
    <code>R</code><br>
then copy the whole content of the cell below, paste it and execute it (press Enter): two questions will be prompted, just answer "y" (without the quotes) to both of them -> R will create a 4.0.3 library folder and install the missing packages in it.
<br>Then quit R by typing <code> quit()</code> and answer "n" (without the quotes).
<br>Finally, unload the R 4.0.3 version with <code> module unload r</code>
</div>



In [None]:
## Code cell 2 ##

# list the required libraries from the CRAN repository
requiredLib <- c(
    "ggfortify",
    "ggrepel",
    "RColorBrewer",
    "ggplot2",
    "stringr",
    "matrixStats",
    "corrplot",
    "BiocManager",
    "FactoMineR",
    "factoextra",
    "writexl",
    "readxl"
)

# list the required libraries from the Bioconductor project
requiredBiocLib <- c(
    "DESeq2",
    "ComplexHeatmap")


# install required libraries if not yet installed
for (lib in requiredLib) {
  # 'quietly' is the full name of the argument of require()
  if (!require(lib, character.only = TRUE, quietly = TRUE)) {
    install.packages(lib, quiet = TRUE, repos = "https://cloud.r-project.org")
  }
}
for (lib in requiredBiocLib) {
  if (!require(lib, character.only = TRUE, quietly = TRUE)) {
    BiocManager::install(lib, quiet = TRUE, update = FALSE)
  }
}

# load libraries
message("Loading required libraries")
for (lib in requiredLib) {
  library(lib, character.only = TRUE)}
for (lib in requiredBiocLib) {
  library(lib, character.only = TRUE)}

# remove variables from the R session if they are no longer necessary 
rm(lib, requiredLib, requiredBiocLib)



In [None]:
## Code cell 3 ##   

cat("Here is my R session with the loaded packages:\n")
sessionInfo()

---

### 0.2 - Parameters to be set or modified by the user
---

- Using a full path with a `/` at the end, **define the folder** of the project as the `gohome` variable, and the folder where you work as the `myfolder` variable:

<div class="alert alert-block alert-warning"> <b> Warning on working directory: </b><br>In JupyterHub, for a Jupyter notebook in R, the default working directory is the one where the notebook was opened for the <b> first time </b>. Even if you move the notebook to another directory, it keeps the original working directory unless you set it again with the function <code>setwd()</code>.</div>

In [None]:
## Code cell 4 ##

gohome <- "/shared/projects/2312_rnaseq_cea/"
gohome

myfolder <- getwd()
myfolder

- With a `/` at the end, define the path to the folder where the results of this analysis will be stored. As it is a logical step usually performed together with the normalisation by `DESeq2`, we can stay in the same output folder :

In [None]:
## Code cell 5 ##

# creation of the directory, recursive = TRUE is equivalent to the mkdir -p in Unix
# we can skip this step as the folder is already created
# dir.create(paste(myfolder,"/Results/deseq2/", sep = ""), recursive = TRUE)

# storing the path to this output folder in a variable
deseq2folder <- paste(myfolder,"/Results/deseq2/", sep = "")
deseq2folder

# listing the content of the folder
print(system(paste("ls -hlt", deseq2folder), intern = TRUE) )

- Last, we specify the size of the graphical outputs that will be used for all the plots in the notebook.    
This setting could be modified at will for each plot. 

In [None]:
## Code cell 6 ##

options(repr.plot.width = 15, repr.plot.height = 8) # for figure display in the notebook

---
---
## 1 - Input data
---

### 1.1 - Loading input data and metadata
---

We now need to retrieve the **normalized data** and **differential expression analysis data** that we generated in the previous session.   
As we stored it in a global RData object at the end of Pipe_09, we can simply reload all our information by opening this RData object.  
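
Note that `load()` silently overwrites any object of the same name already present in the session. The sketch below, which uses a throwaway file and made-up object names (not the course data), shows one possible way to inspect the content of an RData file in a separate environment before loading it for real:

```r
# Toy objects saved to a temporary .RData file (names are illustrative only)
toy_counts  <- data.frame(gene = c("g1", "g2"), s1 = c(5.1, 7.3))
toy_samples <- data.frame(SampleName = "s1", Condition = "WT")
tmp_rdata <- tempfile(fileext = ".RData")
save(toy_counts, toy_samples, file = tmp_rdata)

# Load into a dedicated environment instead of the global one
inspect_env <- new.env()
loaded_names <- load(tmp_rdata, envir = inspect_env)

# Names and classes of the restored objects
loaded_names
sapply(loaded_names, function(x) class(get(x, envir = inspect_env)))
```

Once you are satisfied with the content, the usual `load()` call in the global environment can be used as in the next cell.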

In [None]:
## Code cell 7 ##

rdata <- paste0(deseq2folder, "deseq2.RData") # generate the path of the deseq2 data
rdata # check the path
load(rdata, verbose = TRUE) # load the objects stored in the file and print their names
rm(rdata) # remove the path of deseq2 data, no longer needed

We can now list all the objects we currently have in our session: 

In [None]:
## Code cell 8 ##

ls() 

=> We can see that it contains the following four dataframes:

- `rlog.dds2.annot`: the dataframe containing all normalized read counts using the rlog method implemented in DESeq2.
- `res2_dHet_dHetRag_sig_ranked_annot`: the output of the DE analysis between the dHet and dHetRag conditions with the significantly DE genes, ranked by increasing p-values and annotated.
- `samples`: the metadata dataframe containing the information about the samples, in particular the conditions of the experiment.
- `genecode`: the Gencode annotation (GTF file) of the mouse genome (vM32), which we also imported in Pipe_09.


---
### 1.2 - Preparing "rlog" norm data for the exploratory analyses
---

In parts 2 to 5 of the current notebook, we will run exploratory analyses on the rlog normalized data. What we need for this is the `rlog.dds2.annot` object, that contains the normalized read counts, with the Ensembl Gene ID in the first column, and the gene name in the 19th column:

In [None]:
## Code cell 9 ##

head(rlog.dds2.annot) 
str(rlog.dds2.annot)

In [None]:
## Code cell 10 ##

head(rlog.dds2.annot[ , 19])
head(rlog.dds2.annot[,"gene_name"]) #the two commands are equivalent since column 19 name is "gene_name"

We notice below that the number of unique gene names is smaller than the number of Ensembl IDs: 12 gene names are each shared by two different Ensembl IDs.

In [None]:
## Code cell 11 ##

length(unique(rlog.dds2.annot$ensemblID))
length(unique(rlog.dds2.annot$gene_name))
table(table(rlog.dds2.annot$gene_name)) # number of gene_names found once, twice, ...
dup_genes <- names(which(table(rlog.dds2.annot$gene_name) == 2)) # to get the 12 gene_names that have two ensemblID
subset(rlog.dds2.annot, gene_name %in% dup_genes)[, c("ensemblID", "gene_name")]

<div class="alert alert-block alert-warning"> <b> <u>Caution on gene names: </u></b><br>
One may be tempted to keep working with gene names only, but as you can see from the previous cell, some gene names correspond to several distinct IDs. Since the mapping and the counts per feature were performed on Ensembl IDs, it is strongly advised to keep working with them. However, for nice plots, it is often easier to display gene names.
    </div>
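
As an illustration of this advice, here is a minimal sketch on a toy annotation table (the IDs and gene names are made up): the Ensembl ID stays the key of the table, and a separate column of unique labels is built for display purposes only with the base R function `make.unique()`.

```r
# Toy annotation: two Ensembl IDs share the same gene name (IDs are made up)
toy_annot <- data.frame(
  ensemblID = c("ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C"),
  gene_name = c("Gene1", "Gene2", "Gene1"),
  stringsAsFactors = FALSE
)

# Keep Ensembl IDs as the key; build unique labels for display only
toy_annot$label <- make.unique(toy_annot$gene_name, sep = "_")
toy_annot
```

Such labels could then be used as row names for a heatmap without losing the link to the Ensembl IDs.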


For further in-depth exploratory analyses, we will only use the columns with the normalised counts as input, so we store them in a specific, lighter object, `norm_counts`, together with the "Ensembl IDs" column, which we put first, and the "gene_name" column, which we put second.

In [None]:
## Code cell 12 ##

#norm_counts <- rlog.dds2.annot[,c(1, 19, 2:12)]
# or the same, in a more explicit way:
norm_counts <- rlog.dds2.annot[,c("ensemblID", "gene_name",
                                  "dHet_B-ALL_686_rep1", "dHet_B-ALL_686_rep2",
                                  "dHet_B-ALL_713_rep1", "dHet_B-ALL_713_rep2",
                                 "dHet_B-ALL_760_rep1", "dHet_B-ALL_760_rep2",
                                 "dHet_FetalLiver_proB_rep1",  "dHet_FetalLiver_proB_rep2", "dHet_FetalLiver_proB_rep3",
                                  "wt_BoneMar_proB_rep1", "wt_BoneMar_proB_rep2")]
dim(norm_counts)
head(norm_counts, n = 5)
summary(norm_counts[, -c(1:2)]) # no need for a summary of the first two columns, which contain qualitative data

We remove the `rlog.dds2.annot` to use less memory. If at some point we need other gene info, they are still present in the gencode object.

In [None]:
## Code cell 13 ##

rm(rlog.dds2.annot)
ls()

In order to have all the visualisation in a single session, we can plot again the distribution of normalised reads *(already done in Pipe08)*:

In [None]:
## Code cell 14 ##

# make a colour vector
conditionColor <- match(samples$Condition, c("dHet", "dHetRag", "WT")) + 1
# '+1' to avoid color '1' i.e. black

# Check distributions of samples using boxplots, using only the columns with read counts
boxplot(norm_counts[, 3:13], # or can do -c(1:2)
        xlab = "",
        ylab = "rlog.dds2.annot Counts",
        las = 2,
        col = conditionColor,
        main = "rlog.dds2.annot Counts")
# Let's add a blue horizontal line that corresponds to the median
abline(h = median(as.matrix(norm_counts[ , -c(1:2)])), col = "blue")

---
---
## 2 - Principal Component Analysis
---

### 2.1 - PCA on all genes with R base functions
---

#### **2.1-a. Run the PCA**

We performed a first PCA before normalising the data (in Pipe_08), we are now going to see if the normalisation of the read counts enables a better reduction of dimensionality.    
Here, we run again the PCA the same way we did in Pipe_08, but this time on norm data: 

In [None]:
## Code cell 15 ##

# run PCA
PCAdata <- prcomp(t(norm_counts[, -c(1:2)])) # we get rid of the first two columns of norm_counts that do not contain norm data
summary(PCAdata)

#### **2.1-b. Scree plot**

When looking at the summary of the PCA just above, we can see the proportion of the ***inertia*** *(total variance)* explained by each ***eigen vector*** *(other name for PC axis)*. 

The quality of the PCA can be determined by looking at these proportions. The more inertia is explained by the first components, the better the PCA (it means that we preserve as much as possible the shape of the original scatter plot).

To evaluate this quality, we can draw a **scree plot**, which shows the share of variance described by the successive principal components (eigen vectors). The first components are always the ones describing the largest part of variance, but a scree plot is a good way to see how many components could be interesting to explore (look at the inflection point).

In [None]:
## Code cell 16 ##

# to display the two scree plots side by side
layout(matrix(1:2, ncol = 2))

screeplot(PCAdata) # barplot representation
screeplot(PCAdata, type = "lines", main = "Screeplot PCAdata - Eigenvalues") # same but with a line

Here we see that the first three components explain most of the inertia. We will see further down another package to draw nicer scree plots.
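
For reference, the proportions displayed by `summary()` and by the scree plot can be recomputed by hand from the `sdev` element of a `prcomp` object. The sketch below uses a small random matrix (not the course data): the eigenvalues are the squared standard deviations, and each proportion is an eigenvalue divided by their sum.

```r
set.seed(1)
toy <- matrix(rnorm(6 * 20), nrow = 6)   # 6 samples (rows) x 20 genes (columns)
toy_pca <- prcomp(toy)

eigenvalues <- toy_pca$sdev^2                  # variance carried by each PC
prop_var <- eigenvalues / sum(eigenvalues)     # proportion of the total inertia
cum_var <- cumsum(prop_var)                    # cumulative proportion

round(rbind(prop_var, cum_var), 3)
```

The cumulative row is a convenient way to decide how many components are needed to reach a given share of the inertia.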

#### **2.1-c. PCA plots**

Now we plot the PCA for the first 2 dimensions, and a second one for the third and fourth PC:

In [None]:
## Code cell 17 ##

autoplot(PCAdata,
         data = samples, 
         colour = "Condition", 
         shape = "Tissue",
         size = 6) +
        geom_text_repel(aes(x = PC1, y = PC2, label = SampleName), box.padding = 0.8)


In [None]:
## Code cell 18 ##

autoplot(PCAdata,
         x = 3,    # PC3
         y = 4,    # PC4
         data = samples, 
         colour = "Condition", 
         shape = "Tissue",
         size = 6) +
    geom_text_repel(aes(x = PC3, y = PC4, label = SampleName), box.padding = 0.8)


We can see that, now that we look at the normalised data, PC1 clearly separates the leukemic cells from the non-leukemic cells, and PC2 clearly separates the dHet cells from the WT ones. Again, PC3 and PC4 seem to separate the various mice and to group the replicates.   
    

<div class="alert alert-block alert-info"><b>To go further:</b><br> We will see in sections 2.2 and 2.3 which other plots can be drawn.</div>

---
### 2.2 - PCA on the most variant genes with R base functions
---

#### **2.2-a. Initial number of genes**

*based on ensembl IDs* in our data:

In [None]:
## Code cell 19 ##

length(unique(norm_counts$ensemblID))

#### __2.2-b - Selection of the **most variable genes**__

<div class="alert alert-block alert-warning"> <b><u> Warning on PCA: </u></b><br><br>

Although we can see a nice separation for our dataset, <b>it is not recommended to use all the genes to perform a PCA</b>, even if we limit ourselves to expressed genes. Indeed, the differences between the conditions are likely to be due to a more limited number of genes that vary because of the conditions in a biologically meaningful way. <br>  
    <b>So the genes under consideration for a PCA are usually restricted to the most variable ones</b>, e.g. the 1000 most variable genes, or the top 5 or 10 %. Two main descriptive statistics can be used to look at the gene-wise variation: <ol>
    <li><b> the variance </b> (or standard deviation (sd)) of each gene </li>
    <li><b> the coefficient of variation (CV) </b>(<i>i.e. sd/mean</i>) of each gene: the most recommended one to account for the risk of <b><i>'heteroskedasticity'</i></b> (when the variance depends on the mean, which violates the assumptions of many downstream analyses). Indeed, the CV is a relative measure of variability that expresses the size of the standard deviation relative to the mean. It is a standardized, unitless measure that allows you to compare variability between disparate groups and characteristics. It is also recommended to use the CV instead of the variance when the variables in the dataset are very different and span very different ranges (such as age and dosages, for example). In that case, the ranges of variation of the different variables have to be scaled. </li>
    </ol>
</div>    

Here, we are going to select the top 50 genes on the basis of their variances. Indeed, all the variables considered are on the same scale (gene expression levels). In addition, the rlog transformation performed by DESeq2 stabilizes the variance across the range of mean expression, which makes the data approximately homoskedastic. But we provide you below the code to generate the coefficient of variation for your own analyses.



- using the **coefficient of variation**: we provide the way to compute the CV for reference, but we go on with the variance. 
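
A minimal sketch of the CV computation on a toy matrix (made-up values, genes in rows and samples in columns) is given here; the object names are illustrative and are not reused later in the notebook. Keep in mind that the CV becomes unstable when the mean is close to zero, which can happen with log-scale values.

```r
# Toy matrix: 3 genes (rows) x 4 samples (columns), made-up values
toy <- rbind(geneA = c(10, 10, 10, 10),
             geneB = c(8, 12, 8, 12),
             geneC = c(1, 3, 1, 3))

gene_mean <- rowMeans(toy)          # mean of each gene
gene_sd   <- apply(toy, 1, sd)      # standard deviation of each gene
gene_cv   <- gene_sd / gene_mean    # coefficient of variation

# Ranking by CV and by variance do not necessarily agree
sort(gene_cv, decreasing = TRUE)
sort(apply(toy, 1, var), decreasing = TRUE)
```

In this toy example, geneB has the largest variance, but geneC has the largest CV because its variation is large relative to its low mean.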



- using the **variance**:

We first compute the variance of all genes, using the `rowVars()` function of the ***matrixStats*** package *(or directly with the R base `apply()` function passing the `var()` function as an argument)*. We also verify that no gene has a null variance:
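
The check for null variances can also be made explicit rather than read from a summary. Here is a small sketch on a toy matrix (made-up values), using only base R; `toy_var` and `zero_var` are illustrative names.

```r
# Toy matrix: genes in rows, samples in columns (made-up values)
toy <- rbind(g1 = c(5, 5, 5),
             g2 = c(1, 2, 3),
             g3 = c(0, 4, 8))

toy_var <- apply(toy, 1, var)               # gene-wise variance
zero_var <- names(toy_var)[toy_var == 0]    # genes with a null variance

zero_var
sum(toy_var == 0)                           # how many genes are constant
```

Constant genes carry no information for a PCA and would have to be removed before running a PCA on scaled data.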

In [None]:
## Code cell 20 ##

var_genes <- matrixStats::rowVars(as.matrix(norm_counts[ , 3:13]))
#var_genes <- apply(X = norm_counts[, 3:13], MARGIN = 1, FUN = var, na.rm = TRUE) # similar way to generate var_genes

head(var_genes)
str(var_genes)
summary(var_genes)

We add these variance values to the norm_counts dataframe as a new `var` variable.

In [None]:
## Code cell 21 ##

norm_counts$var <- var_genes

We sort `norm_counts` on decreasing values of the variance and we add a variable with the rank of the variances.

In [None]:
## Code cell 22 ##

norm_counts <- norm_counts[order(norm_counts$var, decreasing = TRUE),]
norm_counts$rank_var <- 1:nrow(norm_counts)
head(norm_counts)

We now select only the top 50 most variable genes.

In [None]:
## Code cell 23 ##

top50var <- norm_counts[1:50,]
head(top50var)

We verify the size of the `top50var` dataframe, and check that we indeed have unique ensemblIDs. 

In [None]:
## Code cell 24 ##

str(top50var)
dim(top50var)
length(unique(top50var$ensemblID))

head(top50var, n = 5)

We also notice these 50 genes have a unique gene_name and sort the dataframe by gene name.

In [None]:
## Code cell 25 ##

length(unique(top50var$gene_name))
top50var <- top50var[order(top50var$gene_name, decreasing = TRUE),]
head(top50var)

- looking at **the 50 most variable genes**

We can have a quick glance at those genes, to see if we recognize some of them... and indeed, the first one is interesting! :-D 

In [None]:
## Code cell 26 ##

top50var$gene_name

In order to use this dataframe for the next steps, we put the gene names as row names (instead of numbers). 

In [None]:
## Code cell 27 ##

row.names(top50var) <- top50var$gene_name

- **Are they differentially expressed?** 

Since we are curious, we may wonder where these top 50 variable genes fall within the DE analysis.

    - We first look at the DE object:

In [None]:
## Code cell 28 ##

str(res2_dHet_dHetRag_sig_ranked_annot)

    - We generate a new variable with the rank of each gene in the DE analysis:

In [None]:
## Code cell 29 ##

res2_dHet_dHetRag_sig_ranked_annot$rank_DE <- 1:nrow(res2_dHet_dHetRag_sig_ranked_annot)
str(res2_dHet_dHetRag_sig_ranked_annot)

    - We look at the rank of the 50 most variable genes in the DE analysis

In [None]:
## Code cell 30 ##

merge(top50var, res2_dHet_dHetRag_sig_ranked_annot[,-14],
      by = "ensemblID", all = FALSE, sort = FALSE) [,c("ensemblID", "gene_name", "rank_var", "rank_DE", "log2FoldChange", "padj")]

We notice that Xist, the top var gene, is not among the DE genes, and neither are 9 other genes:

In [None]:
## Code cell 31 ##

setdiff(top50var$gene_name, res2_dHet_dHetRag_sig_ranked_annot$gene_name)

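Since this object only contains the significantly DE genes, these genes are either not significantly DE between the two conditions, or they were set aside by DESeq2 during the DE analysis (for instance because they were considered as outliers)...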

Conversely, if we look at the top 50 DE genes, they are not all in the list of the most variable genes. Indeed, the most variable genes may vary whatever the conditions, while DE genes vary between conditions!

In [None]:
## Code cell 32 ##

top50DE <- res2_dHet_dHetRag_sig_ranked_annot[1:50, ]
setdiff(top50DE$gene_name, top50var$gene_name) # list the top50 DE genes that are not among the top50 var genes
merge(top50DE, norm_counts[, - 2], by = "ensemblID", all.x = TRUE, sort = FALSE)[,c("ensemblID", "gene_name", "rank_var", "rank_DE", "log2FoldChange", "padj")] # all.x = TRUE keeps all the top50 DE genes

We will further get gene expression data of "Sos1" and "Gm2629", the second and third genes in the DE analysis that are not among the top 50 var genes (while the first DE gene,  "Ighj1", ranks 26th among the most variable genes).

--
#### __2.2-c - PCA on the top 50 var genes__

We now use these top 50 variable genes to perform a proper PCA:

In [None]:
## Code cell 33 ##

# run PCA
PCAdata2 <- prcomp(t(top50var[, 3:13]))


* **Scree plot**

We display the corresponding scree plot :

In [None]:
## Code cell 34 ##

# to display the two scree plots side by side
layout(matrix(1:2, ncol = 2))

screeplot(PCAdata2)
screeplot(PCAdata2, type = "lines")

The scree plot confirms that there is no interest in looking into PC dimensions beyond PC3 or PC4, as the remaining ones explain very little of the inertia (or total variance).  

* **PCA plots**

In [None]:
## Code cell 35 ##

autoplot(PCAdata2,
         data = samples, 
         colour = "Condition", 
         shape = "Tissue",
         size = 6) +
        geom_text_repel(aes(x = PC1, y = PC2, label = SampleName), box.padding = 0.8)


We can see that the portion of variance explained by PC1 and PC2 is higher, when we take into consideration only our top genes. This is indeed logical, as these genes are likely to be the most impacted by the change of conditions.   


* **Correlation circle**
 
The correlation circle plot is a very popular way for visualization of results from PCA, as it combines both the principal component scores and the loading vectors *(the variables)* that mostly contribute to the PC axes in a single display. Here, each vector represents a gene, and the arrow represents the influence of this gene on the PC: the longer the arrow, the stronger the influence.

- The orientation (direction) of the vector, with respect to the principal component space, in particular, its angle with the principal component axes: the more parallel to a principal component axis is a vector, the more it contributes only to that PC.

- The length in the space: the longer the vector, the more of the variability of this variable is represented by the two displayed principal components; short vectors are thus better represented in other dimensions.

- The angles between vectors of different variables show their correlation in this space: small angles represent high positive correlation, right angles represent lack of correlation, opposite angles represent high negative correlation.
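
To make these three reading rules concrete, here is a hedged sketch on simulated data (four made-up variables, not the course genes). It assumes a PCA on scaled variables, in which case the coordinates of a variable on the correlation circle are simply its correlations with the two displayed components, and the squared arrow length cannot exceed 1.

```r
set.seed(42)
n <- 30
x1 <- rnorm(n)
toy <- cbind(v1 = x1,
             v2 = x1 + rnorm(n, sd = 0.3),    # positively correlated with v1
             v3 = -x1 + rnorm(n, sd = 0.3),   # negatively correlated with v1
             v4 = rnorm(n))                   # independent of the others

toy_pca <- prcomp(toy, scale. = TRUE)

# Coordinates on the correlation circle = correlations with PC1 and PC2
circle_coord <- cor(toy, toy_pca$x[, 1:2])

# Squared arrow length = share of each variable represented in the PC1-PC2 plane
arrow_length2 <- rowSums(circle_coord^2)

round(cbind(circle_coord, arrow_length2), 2)
```

Here `v1` and `v2` point in the same direction along PC1 while `v3` points in the opposite direction, which reflects their positive and negative correlations.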

The function `biplot` is used to draw correlation circles. It is also used to draw biplots (see the explanation of what a biplot is in section 2.3 on the FactoMineR package).

In [None]:
## Code cell 36 ##

biplot(PCAdata2,
       scale = 0)

Here, we can see that `Xist` points towards the bottom right, but its vector is longer towards the bottom than towards the right. Therefore, we can deduce that the variation of `Xist` expression in our dataset contributes more to the PC2 axis, that is to the difference of dHet cells (leukemic or not) compared to WT cells. 

---

### 2.3. PCA using FactoMineR package on the most variable genes
---



You can obtain more refined plots for PCA, scree plots, bi plots, correlation plots (and more) by using dedicated packages.    

One of the best and most popular is [FactoMineR](http://factominer.free.fr/index_fr.html), with its  companion package [factoextra](https://cran.r-project.org/web/packages/factoextra/index.html).    

You can find many tutorials, such as this one:    
http://www.sthda.com/english/wiki/wiki.php?id_contents=7851

Let's have a look at a quick analysis with FactoMineR.

#### **2.3-a. Prepare input files for factominer**

- FactoMineR also needs the **variables *(i.e. genes here)* in columns**. So again we have to start with the transposed version of the norm_counts, here with the top 50 variable genes. Let's put this transposed dataframe in an object called `for_factominer`:

In [None]:
## Code cell 37 ##

for_factominer <- t(top50var [, 3:13])
str(for_factominer)
head(for_factominer)

- We can add **qualitative variables** as additional columns. Here we add the columns of the `samples` dataframe after checking that the samples are in the same order.
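
Binding two tables whose rows are in different orders would silently mislabel the samples. As a sketch of a stricter check (toy objects with made-up sample names, not the course data), one can test the order with `identical()` and, if needed, reorder the metadata with `match()`:

```r
# Toy expression matrix with samples in rows, deliberately in another order
toy_mat <- matrix(1:6, nrow = 3,
                  dimnames = list(c("s2", "s3", "s1"), c("geneA", "geneB")))
toy_samples <- data.frame(SampleName = c("s1", "s2", "s3"),
                          Condition = c("WT", "dHet", "dHet"),
                          stringsAsFactors = FALSE)

# Single TRUE/FALSE answer instead of a vector of comparisons
same_order <- identical(rownames(toy_mat), toy_samples$SampleName)
same_order

# Reorder the metadata so that it follows the rows of the matrix
toy_samples_ord <- toy_samples[match(rownames(toy_mat), toy_samples$SampleName), ]
stopifnot(identical(rownames(toy_mat), toy_samples_ord$SampleName))
toy_samples_ord
```

The `stopifnot()` call stops the execution with an error if the orders still differ, which is safer than reading a vector of `TRUE` values by eye.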

In [None]:
## Code cell 38 ##

row.names(for_factominer) == samples$SampleName
for_factominer <- data.frame(for_factominer, samples)
str(for_factominer)

Our dataframe has now 60 columns: 50 with the norm counts data for the top 50 most variable genes, followed by 10 columns with sample metadata.

We may create new qualitative columns:

- mouse ID (using arbitrary increment values from 1 to 5 for the proB mice)
- B type: B-ALL versus proB

In [None]:
## Code cell 39 ##

for_factominer$mouseID <- c(rep("686", 2), rep("713", 2), rep("760", 2), 1:5)
for_factominer$B_type <- c(rep("B-ALL", 6), rep("proB", 5))

Unfortunately, the sex was not provided as metadata by the authors. We will use the level of expression of `Xist` as a surrogate to look at the sex.

In [None]:
## Code cell 40 ##

summary(for_factominer$Xist)
hist(for_factominer$Xist)
for_factominer$sex <- ifelse(for_factominer$Xist < 6, "male", "female")
table(for_factominer$sex)
for_factominer[, c("Xist", "sex")]

- add **quantitative variables** as additional columns

Should you have some quantitative variables, add them as additional columns. Here, as we have no quantitative data in the metadata, let's choose arbitrary ones. We will take the expression levels of the second and third differentially expressed genes, "Sos1" and "Gm2629", which are not among the top 50 variable genes as seen above.

In [None]:
## Code cell 41 ##

for_factominer$Sos1 <- unlist(subset(norm_counts, gene_name == "Sos1")[,3:13])
for_factominer$Gm2629 <- unlist(subset(norm_counts, gene_name == "Gm2629")[,3:13])
head(for_factominer)
str(for_factominer)

#### **2.3-b. Run PCA and store results**

The `PCA()` function of FactoMineR creates an object of type `list` including all PCA results.

In [None]:
## Code cell 42 ##

# Use PCA() of FactoMineR
#?PCA
PCAres <- FactoMineR::PCA(for_factominer,
                          quali.sup = 51:(ncol(for_factominer)-2), # specify the indices of the qualitative columns
                          quanti.sup = (ncol(for_factominer)-1):ncol(for_factominer), # the last two columns are the quantitative ones
                          graph = FALSE)
#str(PCAres) # long structure!
print(class(PCAres))
names(PCAres)

#### **2.3-c. Visual analysis of the first two eigen vectors**


The first two axes explain most of the inertia of the scatter plot. Let's visualize the graph by plotting each mouse along these two axes.

In [None]:
## Code cell 43 ##

# Creates graphs with the samples (called "individuals" in FactoMineR) according to each axis:

FactoMineR::plot.PCA(PCAres, choix = "ind", autoLab = "yes", invisible = "quali", cex = 1)

Note that the contribution of each axis to the inertia is displayed in %.

Let's color the dots using some qualitative metadata:

In [None]:
## Code cell 44 ##

# Same graph adding colors for qualitative variables
# save in pdf if wanted

#pdf("PCA_individus.pdf")

FactoMineR::plot.PCA(PCAres, choix = "ind", autoLab = "yes", invisible = "quali", cex = 1)
FactoMineR::plot.PCA(PCAres, choix = "ind", autoLab = "yes", invisible = "quali", habillage = "Condition", cex = 1)
FactoMineR::plot.PCA(PCAres, choix = "ind", autoLab = "yes", invisible = "quali", habillage = "Tissue", cex = 1)
FactoMineR::plot.PCA(PCAres, choix = "ind", autoLab = "yes", invisible = "quali", habillage = "mouseID" , cex = 1)
FactoMineR::plot.PCA(PCAres, choix = "ind", autoLab = "yes", invisible = "quali", habillage = "B_type" , cex = 1)
FactoMineR::plot.PCA(PCAres, choix = "ind", autoLab = "yes", invisible = "quali", habillage = "sex", cex = 1)
#dev.off()

To look at other axes:

In [None]:
## Code cell 45 ##

FactoMineR::plot.PCA(PCAres, axes = c(2,3), choix = "ind", autoLab = "yes", invisible = "quali", cex = 1)

#### **2.3-d. PCA quality**

- The **inertia explained by each component** is stored in `PCAres$eig`.

In [None]:
## Code cell 46 ##

PCAres$eig

# only for first five:
PCAres$eig[1:5, ] # the first five rows of the matrix, i.e. the first five components

- **scree plots**: We draw the standard scree plot and a cumulative one with base R commands:

In [None]:
## Code cell 47 ##

# Graphic Inertia and dimensions

#pdf("PCA_inertia.pdf")
eig.val <- PCAres$eig
barplot(eig.val[, 2], 
        names.arg = 1:nrow(eig.val), 
        main = "Variances Explained by PCs (%)",
        xlab = "Principal Components",
        ylab = "Percentage of variances",
        ylim = c(0, 1.1 * max(eig.val[, 2])), # adapt the y axis to the highest percentage
        col ="steelblue")

barplot(eig.val[, 3], 
        names.arg = 1:nrow(eig.val), 
        main = "Cumulative Variances Explained by PCs (%)",
        xlab = "Principal Components",
        ylab = "Cumulative Percentage of variance",
        ylim = c(0,100),
        col ="steelblue")

#dev.off()

Using the `factoextra` package on the FactoMineR results allows us to draw nicer scree plots: 

In [None]:
## Code cell 48 ##

# scree plot with 10 components by default
factoextra::fviz_eig(PCAres, addlabels = TRUE)

In [None]:
## Code cell 49 ##

# same with only 5 components:
factoextra::fviz_eig(PCAres, addlabels = TRUE, ncp = 5)

#### **2.3-e. Analysis of correlations between PCA variables and metadata**




In addition to visually inspecting each component, it is possible to test the correlation between each component and the metadata.

- Correlation between each component and **qualitative metadata:**
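
For a qualitative variable, the association with a component is usually measured with a squared correlation ratio (eta2), i.e. the share of the variance of the coordinates explained by the groups. The sketch below recomputes it by hand on simulated scores (made-up values, not the FactoMineR output) and compares it with the R² of a one-way ANOVA; we assume here, without checking the package internals, that the `eta2` element reported by FactoMineR follows the same definition.

```r
set.seed(7)
group <- factor(rep(c("A", "B", "C"), each = 4))          # toy qualitative variable
pc_score <- c(rnorm(4, -2), rnorm(4, 0), rnorm(4, 2))     # toy coordinates on one axis

grand_mean  <- mean(pc_score)
group_means <- as.numeric(tapply(pc_score, group, mean))
group_sizes <- as.numeric(table(group))

ss_between <- sum(group_sizes * (group_means - grand_mean)^2)  # between-group sum of squares
ss_total   <- sum((pc_score - grand_mean)^2)                   # total sum of squares
eta2 <- ss_between / ss_total

# Same quantity obtained as the R-squared of a one-way ANOVA
r2_anova <- summary(lm(pc_score ~ group))$r.squared

c(eta2 = eta2, r2_anova = r2_anova)
```

A value close to 1 means that the groups are well separated along the component, a value close to 0 that they are mixed.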

In [None]:
## Code cell 50 ##

round(PCAres$quali.sup$eta2, 2)

- Correlation between each component and **quantitative metadata:**

In [None]:
## Code cell 51 ##

round(PCAres$quanti.sup$cor,2)

- graphical **representation of correlations with qualitative metadata**:

In particular, use the argument `axes` to specify the dimensions of interest like in the example below.

In [None]:
## Code cell 52 ##

#pdf("file_name.pdf")
# the default components are 1 and 2
# you can change them with the "axes" argument

FactoMineR::plot.PCA(PCAres, choix = "ind", autoLab = "yes", invisible = "quali", cex = 1,
                     habillage = "Condition", axes = c(1,2))

#dev.off()

The package `factoextra` allows us to color points and to add ***circles*** or ***ellipses*** around the groups of points sharing some qualitative characteristics. A large dot is plotted at the barycentre.
 
We can thus visualize points per conditions. Note that there are too few points for the WT and dHetRag to draw an ellipse.
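
The barycentre of a group is simply the mean of the coordinates of its points on each axis. As an illustration on simulated coordinates (made-up values and column names, not extracted from `PCAres`), it can be computed with the base R function `aggregate()`:

```r
set.seed(3)
# Toy coordinates of 8 samples on two axes, with a toy condition
toy_coord <- data.frame(Dim1 = c(rnorm(4, -1), rnorm(4, 1)),
                        Dim2 = rnorm(8),
                        Condition = rep(c("dHet", "WT"), each = 4))

# Mean of each axis per condition = barycentre of each group
barycentres <- aggregate(cbind(Dim1, Dim2) ~ Condition, data = toy_coord, FUN = mean)
barycentres
```

With very few points per group, the barycentre is still defined, whereas a confidence ellipse cannot be estimated reliably.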


In [None]:
## Code cell 53 ##

factoextra::fviz_pca_ind(PCAres, label = "none",
                         habillage = as.factor(PCAres$call$quali.sup$quali.sup$Condition),
             addEllipses = TRUE, ellipse.level = 0.95)

We look now at the sex on dim1 and dim2, then dim2 and dim3:

In [None]:
## Code cell 54 ##

factoextra::fviz_pca_ind(PCAres, label = "none",
                         habillage = as.factor(PCAres$call$quali.sup$quali.sup$sex),
             addEllipses = TRUE, ellipse.level = 0.95)

factoextra::fviz_pca_ind(PCAres, label = "none",
                         habillage = as.factor(PCAres$call$quali.sup$quali.sup$sex),
             addEllipses = TRUE, ellipse.level = 0.95, axes = c(2,3))

or here with the tissue :

In [None]:
## Code cell 55 ##

factoextra::fviz_pca_ind(PCAres, label = "none",
                         habillage = as.factor(PCAres$call$quali.sup$quali.sup$Tissue),
             addEllipses = TRUE, ellipse.level = 0.95)

- graphical **representation of correlations with quantitative metadata**:

One can also use a quantitative variable to color the points with a ***gradient***.

Let's give it a go with the Sos1 and Gm2629 genes:

In [None]:
## Code cell 56 ##

factoextra::fviz_pca_ind(PCAres, label="none",
                         col.ind = as.numeric(PCAres$call$quanti.sup$Sos1))

In [None]:
## Code cell 57 ##

factoextra::fviz_pca_ind(PCAres, label="none",
                         col.ind = as.numeric(PCAres$call$quanti.sup$Gm2629))

That's a good way to see which variable (*here gene*) contributes most to the inertia described by one component/axis.

#### **2.3-f. Analysis of correlations between observations/conditions, variables and the PCA components**

- **Contribution of one observation *(here a mouse condition)* to the different PCA components:**

This is used to know which condition contributes the most to an axis. Code cell 58 displays the 30 observations contributing the most to axes 1 and 2.
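As a rough illustration of what a contribution is, the sketch below recomputes it by hand on a small simulated matrix with the base R function `prcomp()`. It does not use the `PCAres` object of this notebook: `toy`, `scores` and `contrib` are hypothetical names, and we assume equal weights for all observations. Under that assumption, the contribution of an observation to an axis is its squared coordinate divided by the sum of the squared coordinates on that axis, expressed in %.

```r
# Toy data (hypothetical): 6 samples (rows) x 4 genes (columns)
set.seed(1)
toy <- matrix(rnorm(6 * 4), nrow = 6,
              dimnames = list(paste0("sample", 1:6), paste0("gene", 1:4)))

# Scaled PCA with base R; pca$x holds the coordinates of the samples
pca <- prcomp(toy, center = TRUE, scale. = TRUE)
scores <- pca$x

# Contribution (%) of each sample to each axis:
# squared coordinate / sum of squared coordinates on that axis
contrib <- 100 * sweep(scores^2, 2, colSums(scores^2), "/")
round(contrib[, 1:2], 1)

# On each axis, the contributions of all samples sum to 100
colSums(contrib)

# The three samples contributing the most to the first axis
sort(contrib[, "PC1"], decreasing = TRUE)[1:3]
```

A sample with a contribution well above 100 / (number of samples) weighs more than average on that axis.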

In [None]:
## Code cell 58 ##

factoextra::fviz_contrib(PCAres, choice = "ind", axes = 1:2, top = 30)

- **Contribution of each variable used for the PCA *(levels of expression)* to PCA components** :

In [None]:
## Code cell 59 ##

# Contributions to axis one of the top 10 genes contributing to it :
factoextra::fviz_contrib(PCAres, choice = "var", axes = 1, top = 10)

# Contributions to second axis:
factoextra::fviz_contrib(PCAres, choice = "var", axes = 2, top = 10)

- **Correlation between quantitative variables used for the PCA and the PCA components**

In our example, a gene that strongly contributes to a PCA axis will have its level of expression correlated with it. The contribution is expressed in % and is always positive, whereas the correlation can be either positive or negative.

- **positive correlation** *(0 < rho < 1)*: observations *(here conditions)* with high values for the considered component also have a high level of the quantitative variable *(here gene expression)*.

- **negative correlation** *(-1 < rho < 0)*: observations with high values for the considered component have a low level of the quantitative variable.
 
The __correlation circle__ shows the correlation between the quantitative variables *(here gene levels)* and the PCA components. Each transcript is represented with an arrow. The tip of the arrow indicates the correlation coefficients between the transcript and the two displayed components.
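To see where the arrow coordinates come from, here is a minimal base R sketch on simulated data (`toy`, `var_cor` and `var_cor_check` are hypothetical names; the PCA is computed with `prcomp()` rather than with FactoMineR). For a scaled PCA, the correlation between a variable and a component can be obtained directly with `cor()`, or equivalently from the loadings multiplied by the standard deviation of the component.

```r
# Toy data (hypothetical): 8 samples (rows) x 5 genes (columns)
set.seed(2)
toy <- matrix(rnorm(8 * 5), nrow = 8,
              dimnames = list(paste0("sample", 1:8), paste0("gene", 1:5)))

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Correlation between each gene and each component:
# the first two columns are the coordinates of the arrow tips on axes 1 and 2
var_cor <- cor(toy, pca$x)
round(var_cor[, 1:2], 2)

# Same values from the loadings (eigenvectors) times the component sd
var_cor_check <- sweep(pca$rotation, 2, pca$sdev, "*")
all.equal(var_cor, var_cor_check)
```

An arrow close to the circle of radius 1 means that the gene is almost entirely described by the two displayed components.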

In [None]:
## Code cell 60 ##

# Correlation circle
# with the argument "contrib 10" you display the 10 genes with the highest contributions to axes 1 and 2

#pdf(file = "PCA_diab_contrib30.pdf")
FactoMineR::plot.PCA(PCAres, choix = "var", cex = 1, select = "contrib 10", unselect = 1)
#dev.off()

Same plot, using `factoextra`:

In [None]:
## Code cell 61 ##

factoextra::fviz_pca_var(PCAres, select.var = list(contrib = 10))

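One can also set a threshold on the quality of representation `cos2` (the squared correlation between a variable and the displayed axes): with `cos2 = 0.8`, only the variables with a cos2 above 0.8 are drawn.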
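To make the `cos2` filter more concrete, here is a hedged base R sketch on simulated data (`toy`, `cos2` and `cos2_plane` are hypothetical names). We assume that the cos2 of a variable on a plane is the sum of its squared correlations with the two axes of that plane, which is our understanding of what `factoextra` uses for this selection.

```r
# Toy data (hypothetical): 8 samples (rows) x 5 genes (columns)
set.seed(3)
toy <- matrix(rnorm(8 * 5), nrow = 8,
              dimnames = list(paste0("sample", 1:8), paste0("gene", 1:5)))

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# cos2 of each gene on each component = squared correlation
cos2 <- cor(toy, pca$x)^2

# Over all components, the cos2 of a gene sum to 1
rowSums(cos2)

# Quality of representation on the plane defined by axes 1 and 2
cos2_plane <- rowSums(cos2[, 1:2])
round(cos2_plane, 2)

# Genes that would pass a cos2 threshold of 0.8 on this plane
names(cos2_plane)[cos2_plane > 0.8]
```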

In [None]:
## Code cell 62 ##

factoextra::fviz_pca_var(PCAres, select.var = list(cos2 = 0.8))

- **Biplots of observations and variables**

A biplot displays simultaneously the scatter plot (PCA plot) of observations (with the best labelling) and the variables contributing most to the displayed axes.   


<div class="alert alert-block alert-warning"><b>Warning:</b><br> Do not try to display too many variables, otherwise the plot would not be meaningful.</div>

Here we choose to display the top 10 variables, and the observations colored according to the condition, on the first and second PCA axes: 

In [None]:
## Code cell 63 ##

factoextra::fviz_pca_biplot(PCAres, select.var = list(contrib = 10),
                            col.var = "black",
                            habillage = as.factor(PCAres$call$quali.sup$quali.sup$Condition))

- **Correlation of a specific variable with specific axes:**

Finally, we can compute the correlation of one transcript level with positions on a given axis.

In [None]:
## Code cell 64 ##

# Compute the correlation coefficients between transcript levels and the positions of the samples on the axes
cor.test(for_factominer$Xist, PCAres$ind$coord[,1])
cor.test(for_factominer$Xist, PCAres$ind$coord[,2])

---  
---

## 3 - Clustering and Heatmaps 
---

### 3.1 - Hierarchical clustering   
---

As we did before normalisation, we will look at the grouping of our samples using a Hierarchical Clustering.    
This representation is used to cluster the samples based on dissimilarity indexes *(see lecture 12; here we will use Euclidean distances, the default of `dist()`, with Ward's agglomeration method)*. More information can be found with `?hclust` (or in the Contextual Help panel on the right, that can be opened via the Help menu).

Hierarchical clustering is performed in two steps: calculate the distance matrix and apply clustering. 
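The two steps can be sketched on a small simulated matrix (all names below, `toy`, `d`, `hc` and `groups`, are hypothetical and independent of the notebook objects). The toy matrix has genes in rows and samples in columns, so it is transposed before computing distances between samples.

```r
# Toy expression matrix (hypothetical): 20 genes (rows) x 6 samples (columns)
# samples A1-A3 and B1-B3 are simulated with very different mean levels
set.seed(4)
toy <- cbind(matrix(rnorm(20 * 3, mean = 0), nrow = 20),
             matrix(rnorm(20 * 3, mean = 5), nrow = 20))
colnames(toy) <- c(paste0("A", 1:3), paste0("B", 1:3))

# Step 1: distance matrix between samples (dist() works on rows, hence t())
d <- dist(t(toy), method = "euclidean")

# Step 2: agglomerative clustering on that distance matrix
hc <- hclust(d, method = "ward.D")
plot(hc)

# Cut the tree into 2 groups to get the membership of each sample
groups <- cutree(hc, k = 2)
groups
```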

In [None]:
## Code cell 65 ##

clusters <- hclust(dist(as.matrix(t(top50var[ ,- c(1:2)]))), method = "ward.D" )
plot(clusters)

rm(clusters)    

We can see that replicates from the same mouse cluster together, as expected. What is less expected is the grouping together of leukemic cells with WT ones! You could test other methods and see that this clustering is robust.
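One possible way to test other methods is to loop over several agglomeration methods and compare the resulting groups. The sketch below does this on simulated data (`toy`, `linkages` and `memberships` are hypothetical names); with the real data, you would replace `toy` with the matrix used in code cell 65.

```r
# Toy expression matrix (hypothetical): 20 genes x 6 samples, two clear groups
set.seed(5)
toy <- cbind(matrix(rnorm(20 * 3, mean = 0), nrow = 20),
             matrix(rnorm(20 * 3, mean = 5), nrow = 20))
colnames(toy) <- c(paste0("A", 1:3), paste0("B", 1:3))

d <- dist(t(toy))

# Agglomeration methods to compare (all are supported by hclust())
linkages <- c("ward.D", "ward.D2", "complete", "average", "single")

# One column per method, one row per sample: group membership for k = 2
memberships <- sapply(linkages,
                      function(m) cutree(hclust(d, method = m), k = 2))
memberships

# TRUE if every method gives the same partition of the samples
all(memberships == memberships[, 1])
```

If the partition changes from one method to another, the grouping should be interpreted with more caution.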

### 3.2 - Simple heatmaps   
---

Another way of looking at expression data is a heatmap, which shows the level of expression of genes as colors, usually organised with samples in columns and genes in rows.   

A simple and efficient way of plotting such a display is the function `heatmap()`, which combines a heatmap with dendrograms grouping samples and genes, as in hierarchical clustering:

In [None]:
## Code cell 66 ##

heatmap(as.matrix(top50var[,-c(1:2)]))

### 3.3 - Enhanced heatmaps with `ComplexHeatmap`
---

There are two main packages to draw enhanced heatmaps:
- [pheatmap](https://cran.r-project.org/web/packages/pheatmap/index.html): a CRAN package.
- [ComplexHeatmap](https://jokergoo.github.io/ComplexHeatmap-reference/book/) on GitHub but also available in Bioconductor, the second generation of pheatmap.

ComplexHeatmap is the most flexible tool to draw heatmaps with a lot of options, notably to add annotations.  
Below is presented a quick example on ComplexHeatmap usage.

- To use ComplexHeatmap, the dataset must be a **matrix** with samples in columns and genes in rows, as in `top50var`.

- In addition, we use the function `scale()` on each gene (row) to **center and scale its values into a Z-score**. Centering the values will provide a heatmap with a more neutral color for middle values, thus enabling a better visualisation of variations. Because `apply()` returns the scaled genes in columns, we transpose the result with `t()` to get the genes back in rows.
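The row-wise Z-score can be checked on a small simulated matrix (`toy`, `z1` and `z2` are hypothetical names). The sketch compares two equivalent ways of scaling each gene, and shows that `apply()` does not carry the sample names over, whereas `t(scale(t(x)))` keeps them.

```r
# Toy matrix (hypothetical): 4 genes (rows) x 5 samples (columns)
set.seed(6)
toy <- matrix(rnorm(4 * 5, mean = 10, sd = 3), nrow = 4,
              dimnames = list(paste0("gene", 1:4), paste0("sample", 1:5)))

# Option 1: scale each row with apply(), then transpose back
z1 <- t(apply(toy, 1, scale))

# Option 2: scale() works on columns, so transpose, scale, transpose back
z2 <- t(scale(t(toy)))

# Same values, but only option 2 keeps the sample names
all.equal(z1, z2, check.attributes = FALSE)
colnames(z1)
colnames(z2)

# After scaling, each gene has mean 0 and standard deviation 1
round(rowMeans(z2), 10)
apply(z2, 1, sd)
```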

In [None]:
## Code cell 67 ##

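t_top50var <- t(apply(top50var[,-c(1:2)], 1, scale))
# apply() drops the sample names: we restore them as column names
colnames(t_top50var) <- colnames(top50var)[-c(1:2)]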

- Then you select a **distance** for the similarity between samples. It can be a pre-defined character which is in (`euclidean`, `maximum`, `manhattan`, `canberra`, `binary`, `minkowski`, `pearson`, `spearman`, `kendall`). Default is `euclidean`.    
It can also be a function. In R the function to compute distances is `dist()`. The correlation distance is defined as 1 - cor(x, y, method). [See there](https://jokergoo.github.io/ComplexHeatmap-reference/book/a-single-heatmap.html#distance-methods) for further details.

- You can also select the **clustering method** by `clustering_method_rows` and `clustering_method_columns`. Possible methods are those supported in `hclust()` function: `ward.D`, `ward.D2`, `single`, `complete`, `average` (= UPGMA), `mcquitty` (= WPGMA), `median` (= WPGMC) or `centroid` (= UPGMC).
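As an illustration of a distance given as a function, the sketch below defines a Pearson correlation distance with base R and uses it for a hierarchical clustering of genes. The names `toy` and `pearson_dist` are hypothetical. To our understanding, a function of this form (one matrix argument, returning a `dist` object) can also be passed to `clustering_distance_rows` in ComplexHeatmap.

```r
# Toy matrix (hypothetical): 10 genes (rows) x 6 samples (columns)
set.seed(7)
toy <- matrix(rnorm(10 * 6), nrow = 10,
              dimnames = list(paste0("gene", 1:10), paste0("sample", 1:6)))

# Correlation distance between rows: 1 - Pearson correlation
# cor() works on columns, so the matrix is transposed first
pearson_dist <- function(m) as.dist(1 - cor(t(m), method = "pearson"))

d <- pearson_dist(toy)

# The distance ranges from 0 (r = 1) to 2 (r = -1)
range(d)

# Clustering of the genes with one of the methods supported by hclust()
hc <- hclust(d, method = "ward.D")
plot(hc)
```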

In [None]:
## Code cell 68 ##

# we change the dimension of the output for better rendering
options(repr.plot.width = 10, repr.plot.height = 10)

ComplexHeatmap::Heatmap(t_top50var,
                        name = "Z-score",
                        clustering_distance_rows = "pearson",
                        column_title = "pre-defined distance method (1 - pearson)", row_title = "Top 50 genes",
                        clustering_method_rows = "ward.D")

- One nice usage of ComplexHeatmap is the possibility to add custom **annotations**: `ha` for "heatmap annotation" is the object name used in ComplexHeatmap tutorial.

- You can add annotations on samples, either quantitative annotations with functions such as `anno_points()` or `anno_barplot()`, or qualitative annotations from a dataframe. The `col` argument is used to specify the colors of the different categorical values. For the quantitative values, you can add points or boxplots for example.   

Many beautiful examples can be found in the [Complex Heatmap manual](https://jokergoo.github.io/ComplexHeatmap-reference/book/), and a nice progressive tutorial is available [here](https://www.datanovia.com/en/lessons/heatmap-in-r-static-and-interactive-visualization/). 
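Here is a minimal sketch of such annotations on simulated data. The objects `toy`, `condition` and `lib_size` and the colors are hypothetical and only serve the illustration; the code is skipped if ComplexHeatmap is not installed.

```r
# Toy data (hypothetical): 10 genes x 6 samples, with invented sample metadata
set.seed(8)
toy <- matrix(rnorm(10 * 6), nrow = 10,
              dimnames = list(paste0("gene", 1:10), paste0("sample", 1:6)))
condition <- rep(c("ctrl", "treated"), each = 3)  # qualitative annotation
lib_size <- round(runif(6, 10, 20))               # quantitative annotation

if (requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  # "ha" combines a categorical track and a barplot track, one value per sample
  ha <- ComplexHeatmap::HeatmapAnnotation(
    Condition = condition,
    LibSize = ComplexHeatmap::anno_barplot(lib_size),
    col = list(Condition = c(ctrl = "grey60", treated = "firebrick"))
  )

  # The annotation is displayed above the columns (samples) of the heatmap
  ht <- ComplexHeatmap::Heatmap(toy, name = "Z-score", top_annotation = ha)
  ComplexHeatmap::draw(ht)
}
```

With the real data, the annotation vectors would come from the sample metadata, in the same order as the columns of the matrix.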

---  
---

## 4 - Correlograms
---

Gene co-expression correlations provide a robust methodology for predicting gene function, as genes sharing a biological process or a common implication in pathways are often co-regulated. This is particularly helpful to provide clues to the functions and roles of lncRNA genes, on the basis of the *"guilt by association"* principle: if lncRNA genes cluster with genes with known functions, they are likely to participate in the same biological process.    

We want to see if the most variable genes identified above have a correlated expression in a pair-wise manner, that is if two genes share a similar pattern of expression across samples.

A Pearson or Spearman correlation is performed between continuous variables with the `cor()`  function, resulting in a correlation coefficient between each pair of genes. This correlation is displayed in a scatter plot with the function `plot()`.

<div class="alert alert-block alert-info"><b><u>What are correlograms?</u></b><br>
<br>When we want to study pairwise correlation between several variables (here genes are the variables), the <code>cor()</code> function can also be applied on a matrix of data. <b><br>
<i>A correlogram is the pairwise graphical representation of the matrix of correlation coefficients</i></b>. Such a correlogram helps highlighting the most correlated variables. Some R packages will also allow to display the pvalues or to reorder variables according to their degree of correlation.

You have already used heatmaps in session 1 to look at similar patterns of expression among a set of genes, such as the DE genes. Among distances that could be used to cluster genes in a heatmap, we could have used the Pearson distance = (1-r). In a correlogram, we rather display the <b><i>correlation coefficient ranging from -1 to 1</i></b>.
</div>


###  4.1 - With R native functions
---

- with the `heatmap()` function,   
which can also display a correlogram.

We will have first to transpose the matrix of data with the `t()` function to have each gene name as a variable.
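The effect of the transposition can be checked on a small simulated matrix (`toy` and the `cor_*` objects are hypothetical names). `cor()` computes correlations between the columns of a matrix, so a matrix with genes in rows gives sample-to-sample correlations unless it is transposed.

```r
# Toy matrix (hypothetical): 5 genes (rows) x 8 samples (columns)
set.seed(9)
toy <- matrix(rnorm(5 * 8), nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("sample", 1:8)))

# Without transposition: correlations between samples (8 x 8)
cor_samples <- cor(toy)
dim(cor_samples)

# With transposition: correlations between genes (5 x 5)
cor_genes <- cor(t(toy))
dim(cor_genes)

# A correlation matrix is symmetrical, with 1 on the diagonal
isSymmetric(cor_genes)
diag(cor_genes)

# Spearman (rank-based) correlation instead of the default Pearson
cor_genes_spearman <- cor(t(toy), method = "spearman")
round(cor_genes_spearman, 2)
```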

In [None]:
## Code cell 69 ##

heatmap(cor(t(top50var[,-c(1:2)])))

### 4.2 - With the `corrplot` package


The `corrplot` package is particularly efficient for customizing correlograms (see the [introductory vignette](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html)).   
It allows for many improvements of the rendering.

In [None]:
## Code cell 70 ##

options(repr.plot.width = 13, repr.plot.height = 13)

# A first "simple" plot
corrplot(cor(t(top50var[,-c(1:2)])))

# method = "square" draws squares (the default method is "circle")
# hclust clusterises the genes, grouping together the most correlated ones
# rectangles around clusters and correlation coefficients are also added
corrplot(cor(t(top50var[,-c(1:2)])),
         method = "square",
         order = 'hclust',
         addrect = 4,
         addCoef.col = 'black',
         number.cex = 0.3)

By definition a correlation matrix is symmetrical.   
We can add arguments to the `corrplot()` function in order to display only one half of the matrix (here, the upper one). We can also change the shape of the representation of each pair-wise correlation.
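To our understanding, the `insig` argument of `corrplot()` only has an effect when a matrix of p-values is supplied with `p.mat`. The sketch below, on simulated data (`toy`, `M` and `test_res` are hypothetical names), computes these p-values with `corrplot::cor.mtest()` and blanks the correlations that are not significant at the 5% level. The plotting part is skipped if `corrplot` is not installed.

```r
# Toy matrix (hypothetical): 12 samples (rows) x 6 genes (columns),
# i.e. already in the orientation expected by cor()
set.seed(10)
toy <- matrix(rnorm(12 * 6), nrow = 12,
              dimnames = list(paste0("sample", 1:12), paste0("gene", 1:6)))
M <- cor(toy)

if (requireNamespace("corrplot", quietly = TRUE)) {
  # Pairwise correlation tests: test_res$p is the matrix of p-values
  test_res <- corrplot::cor.mtest(toy, conf.level = 0.95)

  # Upper half only, ellipses, non-significant correlations left blank
  corrplot::corrplot(M, method = "ellipse", type = "upper",
                     p.mat = test_res$p, sig.level = 0.05, insig = "blank")
}
```

On random data like these, most cells are expected to be blank.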

In [None]:
## Code cell 71 ##

# type = 'upper' will show only the upper half of the matrix
corrplot(cor(t(top50var[,-c(1:2)])),
        method = 'ellipse',
        type = 'upper',
        insig = 'blank')

____
---
## 5 - Volcano plot
---

  

Volcano plots are very commonly used to display the results of RNA-seq or other omics experiments. This plot uses the  information from the Differential Expression analysis, in particular the adjusted p-value and the log2 Fold Change, to highlight the most differentially expressed genes in a very evident way.    
A volcano plot is a type of scatterplot that shows **statistical significance (P value) versus magnitude of change (fold change)**.    
It enables quick visual identification of genes with large fold changes that are also statistically significant, because these are likely to be the most biologically significant genes.    

In a volcano plot :
- the most upregulated genes are towards the right (positive log2 fold change), 
- the most downregulated genes are towards the left (negative log2 fold change), 
- the most statistically significant genes are towards the top (smaller adjusted p-value).   

In our case, these values are in the `res2_dHet_dHetRag_sig_ranked_annot` object, in the `padj` and `log2FoldChange` columns respectively.

* **Thresholds for adjusted p-values and log2 Fold Change**

First we define the cutoff for the adjusted p-value:

In [None]:
## Code cell 72 ##

alpha <- 0.00001

Now we want to identify the genes that have an adjusted p-value below that cutoff, with a positive or a negative log2FoldChange.

We also define a threshold for the log2 fold change, for example 2 and -2.   
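The combination of the two thresholds can be illustrated on a small invented table (`toy_res`, `toy_alpha`, `lfc_cutoff` and the `status` column are hypothetical and are not part of the DESeq2 results). Genes with a missing adjusted p-value are treated as not significant.

```r
# Invented results table (hypothetical values)
toy_res <- data.frame(
  gene_name      = paste0("gene", 1:8),
  log2FoldChange = c(3.1, -2.5, 0.4, 2.2, -4.0, 1.0, -0.2, 5.0),
  padj           = c(1e-8, 1e-6, 0.5, 1e-3, 1e-10, 1e-7, NA, 1e-20)
)

toy_alpha  <- 1e-5  # cutoff on the adjusted p-value
lfc_cutoff <- 2     # cutoff on the absolute log2 fold change

# Significant genes: adjusted p-value available and below the cutoff
sig <- !is.na(toy_res$padj) & toy_res$padj < toy_alpha

# A gene is "up" or "down" only if it passes both thresholds
toy_res$status <- "ns"
toy_res$status[sig & toy_res$log2FoldChange >  lfc_cutoff] <- "up"
toy_res$status[sig & toy_res$log2FoldChange < -lfc_cutoff] <- "down"

toy_res
table(toy_res$status)
```

In this toy table, gene4 has a large fold change but fails the p-value cutoff, and gene6 is significant but has a small fold change: both stay in the "ns" class.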

In [None]:
## Code cell 73 ##

## Genes with an adjusted p-value < alpha and a log2(FC) > 2
upGenes <- res2_dHet_dHetRag_sig_ranked_annot[which(res2_dHet_dHetRag_sig_ranked_annot$padj < alpha & res2_dHet_dHetRag_sig_ranked_annot$log2FoldChange > 2),]
nrow(upGenes)

## Genes with an adjusted p-value < alpha and a log2(FC) < -2
downGenes <- res2_dHet_dHetRag_sig_ranked_annot[which(res2_dHet_dHetRag_sig_ranked_annot$padj < alpha & res2_dHet_dHetRag_sig_ranked_annot$log2FoldChange < -2),]
nrow(downGenes)


* **Plotting of the volcano plot**   

When drawing the volcano plot, we can color the genes according to this selection.   
Finally, we add horizontal and vertical lines to show the thresholds. 

In [None]:
## Code cell 74 ##

options(repr.plot.width = 10, repr.plot.height = 10)

#draw the plot
plot(-log10(res2_dHet_dHetRag_sig_ranked_annot$padj) ~ res2_dHet_dHetRag_sig_ranked_annot$log2FoldChange, pch = "", xlim = c(-15,15), ylim = c(0,90),
     xlab = "",ylab = "", bty = "n", xaxt = "n", yaxt = "n"  )
title("Leukemic dHet vs non-leukemic dHetRag", font.main = 1, cex.main = 0.9)
axis(1, at = -15:15, tcl = -0.5,cex = 0.7, labels = F )
mtext(-15:15,side = 1,line = 1,at = -15:15, cex = 0.7)
axis(2, at = 0:90, tcl = -0.2, cex = 0.7, labels = F )
mtext(seq(0,90,10),side = 2,line = 0.5, at = seq(0,90,10), cex = 0.7)


## Color in grey genes
points(-log10(res2_dHet_dHetRag_sig_ranked_annot$padj) ~ res2_dHet_dHetRag_sig_ranked_annot$log2FoldChange, pch = 16, cex = 0.5, col = "grey")


## Override the previous colouring for some genes according to the fact they are DE
points(-log10(upGenes$padj) ~ upGenes$log2FoldChange,pch = 16, cex = 0.5, col = "red")
points(-log10(downGenes$padj) ~ downGenes$log2FoldChange, pch = 16,cex = 0.5, col = "green")
mtext("log2(FC)", side = 1, line = 2, cex = 0.8)
mtext("-log10 (adjusted p-value)", side = 2, line = 1.5, cex = 0.8)

abline(h=-log10(alpha), col="blue")
abline(v=-2, col="green")
abline(v=2, col="red")

* **Enlarging the volcano plot** 

As we can see in the list of DE genes ranked by p-value: 

In [None]:
## Code cell 75 ##

head(res2_dHet_dHetRag_sig_ranked_annot, n = 3)

the top DE genes have *extremely* low adjusted p-values, which were not displayed in this first volcano plot.    

So in the cell below, we change the `ylim` and the range of the y axis to accommodate these high -log10 values.    
We also change:
- the size of colored points: `cex = 0.8` instead of `cex = 0.5`
- the lettering size for the main title: `cex.main = 1.5` instead of `cex.main = 0.9`
- and for the axis titles: `cex = 1.1` instead of `cex = 0.8`. 
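Rather than choosing the upper limit of the y axis by eye, it can be derived from the data. The sketch below uses invented adjusted p-values (`toy_padj`, `neg_log10` and `y_max` are hypothetical names) and ignores missing values and p-values equal to 0, for which -log10 is infinite.

```r
# Invented adjusted p-values (hypothetical), including 0 and NA
toy_padj <- c(1e-243, 1e-120, 1e-30, 0.02, 0.8, 0, NA)

# -log10 transformation used on the y axis of the volcano plot
neg_log10 <- -log10(toy_padj)
neg_log10

# Largest finite value (Inf and NA are excluded)
finite_max <- max(neg_log10[is.finite(neg_log10)])

# Round up to the next multiple of 10 to get a clean axis limit
y_max <- ceiling(finite_max / 10) * 10
y_max
```

The value of `y_max` could then be used in `ylim = c(0, y_max)` and in the `seq()` calls that place the tick labels.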

In [None]:
## Code cell 76 ##

options(repr.plot.width = 10, repr.plot.height = 10)

#draw the plot
plot(-log10(res2_dHet_dHetRag_sig_ranked_annot$padj) ~ res2_dHet_dHetRag_sig_ranked_annot$log2FoldChange, pch = "", xlim = c(-15,15), ylim = c(0,245),
     xlab = "",ylab = "", bty = "n", xaxt = "n", yaxt = "n"  )
title("Leukemic dHet vs non-leukemic dHetRag", font.main = 1, cex.main = 1.5)
axis(1, at = -15:15, tcl = -0.5,cex = 0.7, labels = F )
mtext(-15:15,side = 1,line = 1,at = -15:15, cex = 0.7)
axis(2, at = 0:245, tcl = -0.2, cex = 0.7, labels = F )
mtext(seq(0,245,10),side = 2,line = 0.5, at = seq(0,245,10), cex = 0.7)


## Color in grey genes
points(-log10(res2_dHet_dHetRag_sig_ranked_annot$padj) ~ res2_dHet_dHetRag_sig_ranked_annot$log2FoldChange, pch = 16, cex = 0.5, col = "grey")


## Override the previous colouring for some genes according to the fact they are DE
points(-log10(upGenes$padj) ~ upGenes$log2FoldChange,pch = 16, cex = 0.8, col = "red")
points(-log10(downGenes$padj) ~ downGenes$log2FoldChange, pch = 16,cex = 0.8, col = "green")
mtext("log2(FC)", side = 1, line = 2, cex = 1.1)
mtext("-log10 (adjusted p-value)", side = 2, line = 1.5, cex = 1.1)

abline(h=-log10(alpha), col="blue")
abline(v=-2, col="green")
abline(v=2, col="red")

Finally, we will write the lists of differentially expressed genes to tab-separated text files, in the `deseq2folder` defined at the beginning of the notebook.

___
---
## 6 - Saving our results for later use: DE genes lists and RData file
---

### 6.1 - DE genes lists    
---

As we have selected DE genes on the basis of their adjusted p-value and their log2 Fold change, we can save those lists for further enrichment analysis.  
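As a quick check of what `write.table()` produces, here is a small round trip on an invented gene list written to a temporary file (`toy_up`, `out_file` and `back` are hypothetical names). Keeping the data as a one-column data frame gives a meaningful header, whereas a bare vector is written with the header `x`; `row.names = FALSE` avoids an extra column of row numbers.

```r
# Invented list of gene names (hypothetical)
toy_up <- data.frame(gene_name = c("GeneA", "GeneB", "GeneC"),
                     stringsAsFactors = FALSE)

# Write to a temporary tab-separated file
out_file <- tempfile(fileext = ".tsv")
write.table(toy_up["gene_name"], file = out_file, sep = "\t",
            quote = FALSE, col.names = TRUE, row.names = FALSE)

# Content of the file: one header line, then one gene per line
readLines(out_file)

# Reading the file back gives the same gene names
back <- read.delim(out_file, stringsAsFactors = FALSE)
identical(back$gene_name, toy_up$gene_name)
```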

In [None]:
## Code cell 77 ##

dim(upGenes)
dim(downGenes)
head(upGenes)

We save those 2 lists in our Results/deseq2 folder:

In [None]:
## Code cell 78 ##

# row.names = FALSE in both calls, so that only the gene names are written
write.table(upGenes$gene_name, file=paste0(deseq2folder,"DESeq2_significant_genes-0_00001-up.tsv"), sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE)
write.table(downGenes$gene_name, file=paste0(deseq2folder,"DESeq2_significant_genes-0_00001-down.tsv"), sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE)


### 6.2 - RData object for the session
---

We can save all the R objects created in this session in a single RData file.   
This will help us to reload our results without having to run the same commands.   
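The save and reload mechanism can be tried on a temporary file with two invented objects (`toy_a`, `toy_b`, `rdata_file` and `restored` are hypothetical names). Here `save()` stores only the listed objects, whereas `save.image()` stores the whole workspace; `load()` returns the names of the restored objects.

```r
# Two invented objects (hypothetical)
toy_a <- 1:3
toy_b <- data.frame(gene = c("GeneA", "GeneB"), padj = c(1e-6, 0.2))

# Save the selected objects in a temporary RData file
rdata_file <- tempfile(fileext = ".RData")
save(toy_a, toy_b, file = rdata_file)

# Reload them in a separate environment, to avoid overwriting anything
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)
loaded_names

# The restored objects are identical to the original ones
identical(restored$toy_a, toy_a)
identical(restored$toy_b, toy_b)
```

In a new session, `load()` called without the `envir` argument restores the objects directly in the workspace.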

In [None]:
## Code cell 79 ##

print(ls())

In [None]:
## Code cell 80 ##

head(res2_dHet_dHetRag_sig_ranked_annot)

We keep only the relevant objects:

In [None]:
## Code cell 81 ##

rm(PCAdata, PCAdata2, rdata, t_top50var)

and we save all our info in a single RData object in our output folder:

In [None]:
## Code cell 82 ##

ls()
save.image(file=paste0(deseq2folder,"deseq2-final.RData"))

---
___

Now we go on with the enrichment analysis of the gene lists of interest.  
  
**=> Step 11: Overrepresentation and enrichment analysis** 

The jupyter notebook used for the next session will be *Pipe_11-R403-Normcounts-exploratory-analysis-II.ipynb*    
***It is not ready yet!!***  

So we will not retrieve it in our personal directory as usual. 


**Save executed notebook**

To end the session, save your executed notebook in your `run_notebooks` folder. Adjust the name with yours and reformat as code cell to run it.

<div class="alert alert-block alert-success"><b>Success:</b> Well done! You now know how to visualize normalised expression data.<br>
Don't forget to save your notebook and export a copy as an <b>html</b> file as well <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to the server! 
</div>

---
---

## Useful commands
<div class="alert alert-block alert-info"> 
    
- <kbd>CTRL</kbd>+<kbd>S</kbd> : save notebook<br>    
- <kbd>CTRL</kbd>+<kbd>ENTER</kbd> : Run Cell<br>  
- <kbd>SHIFT</kbd>+<kbd>ENTER</kbd> : Run Cell and Select Next<br>   
- <kbd>ALT</kbd>+<kbd>ENTER</kbd> : Run Cell and Insert Below<br>   
- <kbd>ESC</kbd>+<kbd>y</kbd> : Change to *Code* Cell Type<br>  
- <kbd>ESC</kbd>+<kbd>m</kbd> : Change to *Markdown* Cell Type<br> 
- <kbd>ESC</kbd>+<kbd>r</kbd> : Change to *Raw* Cell Type<br>    
- <kbd>ESC</kbd>+<kbd>a</kbd> : Create Cell Above<br> 
- <kbd>ESC</kbd>+<kbd>b</kbd> : Create Cell Below<br> 

<em>  
To make nice html reports with markdown: <a href="https://dillinger.io/" title="dillinger.io">html visualization tool 1</a> or <a href="https://stackedit.io/app#" title="stackedit.io">html visualization tool 2</a>, <a href="https://www.tablesgenerator.com/markdown_tables" title="tablesgenerator.com">to draw nice tables</a>, and the <a href="https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd" title="Ultimate guide">Ultimate guide</a>. <br>
Further reading on JupyterLab notebooks: <a href="https://jupyterlab.readthedocs.io/en/latest/user/notebook.html" title="Jupyter Lab">Jupyter Lab documentation</a>.<br>   
</em>    
 
</div>

Bénédicte Noblet - 05-07 2021   
Sandrine Caburet and Claire Vandiedonck - 02-06 2023   
with adaptations from https://bioinformatics-core-shared-training.github.io/RNAseq_November_2020_remote/html/02_Preprocessing_Data.html   
and https://rpubs.com/adoughan/778146  
Updated 14/09/2023 by @CVandiedonck   