# RNA-seq training CEA - June 2024

*Instructors: Sandrine Caburet and Claire Vandiedonck*

IFB session: 7 CPUs + 33 GB of RAM

R version 4.2.3

# Part 8: Exploratory analysis I (before normalisation)

   

- 0.1 - Setting up this R session on IFB core cluster
- 0.2 - Parameters to be set or modified by the user
- 1 - Loading input data and metadata
- 2 - Quality assessment
- 3 - Saving our results

---
## **Before going further**

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly: <kbd>Ctrl</kbd> + <kbd>S</kbd> or click on the 💾 icon.
</div>

<div class="alert alert-block alert-warning"><b>Warning:</b> You are strongly advised to run the cells in the indicated order. If you need to rerun earlier cells, restart the kernel so that the cell numbering starts at 1 again. </div>

---
---
## 0. Set up parameters and RSession

---

### 0.1 - Setting up this R session on IFB core cluster

<em>loaded JupyterLab</em> : Version 3.5.0

#### **0.1-a. Jupyter session**

Just as in a bash notebook, we can get information about the resources allocated to this session. To do so, we use the `system()` function in R, which executes bash commands.

In [1]:
## Code cell 1 ##

session_parameters <- function(){
    
    jupytersession <- c(system('echo "=== Cell launched on $(date) ==="', intern = TRUE),
                        system('squeue -hu $USER', intern = TRUE))
    
    # keep the job ID (first column) of the job whose name contains "sys/dash"
    jobid <- system("squeue -hu $USER | awk '/sys\\/dash/ {print $1}'", intern = TRUE)
    jupytersession <- c(jupytersession,
                        "=== Current IFB session size: Medium (5CPU, 21 GB) ===",
                        system(paste("sacct --format=JobID,AllocCPUS,ReqMem,NodeList,Elapsed,State -j", jobid), intern = TRUE))
    print(jupytersession[1:6])
    
    return(invisible(NULL))
}

session_parameters()

[1] "=== Cell launched on Sun Jun 23 18:51:00 CEST 2024 ==="                         
[2] "          40340009      fast sys/dash cvandied  R    2:55:17      1 cpu-node-51"
[3] "=== Current IFB session size: Medium (5CPU, 21 GB) ==="                         
[4] "JobID         AllocCPUS     ReqMem        NodeList    Elapsed      State "      
[5] "------------ ---------- ---------- --------------- ---------- ---------- "      
[6] "40340009              5        21G     cpu-node-51   02:55:17    RUNNING "      


---

#### **0.1-b. R session**

Next we load into this R session the various packages that we will use.


<div class="alert alert-block alert-info"> <b> Info on packages: </b><br>
    
Since this is the first time we are using R packages in this course, the next code cells (2 to 6) explain some basic commands to deal with R packages. </div>

In R, packages provide additional functions for specialised analyses.
To load a package, the usual command is `library(name_of_the_package)`. It works if the package is already installed on your computer/server; otherwise you first have to install it from the main R repository, CRAN (https://cran.r-project.org/), with the function `install.packages()`. Some packages, in particular many of those used in bioinformatics, are not available on CRAN but in another repository: ***Bioconductor***. To install a package from Bioconductor, you need to install the `BiocManager` package first, then use its function `install()`. An example of these commands is provided in Code cell 6.

- We can first identify the directories, called ***libraries***, where these packages are installed. 

In [2]:
## Code cell 2 ##

.libPaths()

If this is the first time you are using this R version on the IFB core cluster or you never installed packages before, the previous command should return only one folder, readable by all IFB users. It should be : `/shared/ifbstor1/software/miniconda/envs/r-4.2.3/lib/R/library` since the R version currently installed is 4.2.3.

If you have already installed R packages, you should see another folder in your home directory, most likely with this absolute path: `/shared/ifbstor1/home/mylogin/R/x86_64-conda-linux-gnu-library/4.2` *(for R versions 4.2)*.

- You can see how many **packages were already installed on the IFB core cluster** with the following command:

In [3]:
## Code cell 3 ##

length(list.files("/shared/ifbstor1/software/miniconda/envs/r-4.2.3/lib/R/library"))  # adjust the path to the library you want to inspect

That is an impressive number! It is highly likely that the packages you need were already installed on the server!
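
The same count can be obtained for every library at once. The sketch below is only an illustration: it loops over whatever `.libPaths()` returns in the current session, and `lib_sizes` is a name chosen here, not one used elsewhere in the notebook.

```r
# Count the entries found in each library folder returned by .libPaths()
lib_sizes <- vapply(.libPaths(),
                    function(p) length(list.files(p)),
                    integer(1))

# Named integer vector: one count per library path
lib_sizes
```

Each name of the resulting vector is a library path, so it is easy to see which folder holds which share of the installed packages.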

- To know **which packages were already installed**, you can either go to the corresponding folder, or use the R function `installed.packages()`, which returns all installed packages with a lot of information. The most interesting information is in the first three columns: the name, the library path and the version. We can see what it looks like for the first packages with the command `head()`, or verify the presence of a given package with the command `grep()`.

In [4]:
## Code cell 4 ##

head(installed.packages()[,c(1,2,3)])

grep("ggplot2", installed.packages()[,c(1,2,3)], value = TRUE)

| | Package | LibPath | Version |
|---|---|---|---|
| ggfortify | ggfortify | /shared/ifbstor1/home/cvandiedonck/R/x86_64-conda-linux-gnu-library/4.2 | 0.4.16 |
| writexl | writexl | /shared/ifbstor1/home/cvandiedonck/R/x86_64-conda-linux-gnu-library/4.2 | 1.4.2 |
| ADGofTest | ADGofTest | /shared/ifbstor1/software/miniconda/envs/r-4.2.3/lib/R/library | 0.3 |
| ANCOMBC | ANCOMBC | /shared/ifbstor1/software/miniconda/envs/r-4.2.3/lib/R/library | 2.0.2 |
| AUCell | AUCell | /shared/ifbstor1/software/miniconda/envs/r-4.2.3/lib/R/library | 1.20.2 |
| AlgDesign | AlgDesign | /shared/ifbstor1/software/miniconda/envs/r-4.2.3/lib/R/library | 1.2.1 |
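
Because `grep()` matches any cell that merely contains the pattern, an exact test on the package names can be easier to read. Here is a minimal sketch, assuming a hypothetical helper called `is_installed()` (not part of the notebook) and using the base package `stats` as an example that is always present.

```r
# Hypothetical helper: TRUE if a package with exactly this name is installed
# in any of the libraries listed by .libPaths()
is_installed <- function(pkg) {
    pkg %in% rownames(installed.packages())
}

is_installed("stats")                      # base package shipped with R
is_installed("aPackageThatDoesNotExist")   # made-up name used for illustration
```

The helper is vectorised, so a character vector of package names returns one logical value per package.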


- **If your package of interest was not yet installed on the IFB cluster, you have two options:**

1. either you ask the IFB team to install it for you *(and therefore for all other users)* using the community forum (https://community.france-bioinformatique.fr/). They are quite responsive, and this is a good option to make sure all dependencies are correctly installed together with your package.

2. or you install it in your home directory.


If you go for the second option because you are in a rush with your analyses, you may skip the above commands and just try to install it, only if needed.

If the `.libPaths()` command run in Code cell 2 above did not return the path of an R library folder in your home directory, you first have to create this folder, then add its path to the library search paths:


In [5]:
## Code cell 5 ##

# creation of the directory, recursive = TRUE is equivalent to the mkdir -p in Unix (thus only created if not yet present)
dir.create("~/R/x86_64-conda-linux-gnu-library/4.2", recursive = TRUE, showWarnings = FALSE)

# this new directory is added to the possible paths:
.libPaths('~/R/x86_64-conda-linux-gnu-library/4.2')

# and we verify its addition:
.libPaths()
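
If you prefer to state explicitly which libraries remain searchable, the new folder can be prepended to the current paths. The sketch below is only an illustration under assumptions: `my_lib` is a hypothetical library created in a temporary folder, and the original paths are restored at the end so that the session is left unchanged.

```r
# Hypothetical personal library, created in a temporary folder for this demo
my_lib <- file.path(tempdir(), "my_R_library")
dir.create(my_lib, recursive = TRUE, showWarnings = FALSE)

# Remember the current search paths, then put the new library first
old_paths <- .libPaths()
.libPaths(c(my_lib, old_paths))

# The first path is the default target of install.packages()
first_lib <- .libPaths()[1]
first_lib

# Restore the initial search paths
.libPaths(old_paths)
```

The first element of `.libPaths()` is where `install.packages()` writes by default, which is why the personal library is placed first.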

Now we can install new packages in our home, if they are not available on the cluster.    

The following command lines using `require` and a condition with `if()` will install the package only if it is not already there!<br>
Then the packages are loaded into your R session.       
   
<span style="color:red">It is good practice to have such a command at the top of your notebook/script when you are doing analyses in R.</span>

In [6]:
## Code cell 6 ##

# list the required libraries from the CRAN repository
requiredLib <- c(
    "tidyverse",
    "data.table",
    "ggfortify",
    "ggrepel",
    "RColorBrewer",
    "matrixStats",
    "BiocManager"    
)

# list the required libraries from the Bioconductor project
requiredBiocLib <- c("affy")

# install required libraries if not yet installed
for (lib in requiredLib) {
  if (!require(lib, character.only = TRUE, quietly = TRUE)) {
    install.packages(lib, quiet = TRUE)
  }
}

for (lib in requiredBiocLib) {
  if (!require(lib, character.only = TRUE, quietly = TRUE)) {
    BiocManager::install(lib, quiet = TRUE)
  }
}

# load libraries
message("Loading required libraries")
for (lib in requiredLib) {
  library(lib, character.only = TRUE)}
for (lib in requiredBiocLib) {
  library(lib, character.only = TRUE)}

# remove variables from the R session if they are no longer necessary 
rm(lib, requiredLib, requiredBiocLib)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.4     v readr     2.1.5
v forcats   1.0.0     v stringr   1.5.1
v ggplot2   3.5.1     v tibble    3.2.1
v lubridate 1.9.3     v tidyr     1.3.1
v purrr     1.0.2     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: 'data.table'


The following objects are masked from 'package:lubridate':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    y



<div class="alert alert-block alert-warning"><b>Warning:</b><ul>
    <li><b><i>DO NOT worry</i></b> if you see a large red output! You should see it only once, when the relevant packages are installed in your home directory. Afterwards, they will be detected as present, and this large red output will not appear if you run the cell again.</li>
    <li><b><i>DO NOT panic</i></b> if you see a warning in the output: such commands sometimes return warnings caused by changes between R versions, but as a general rule the functions in these packages will still work.</li>
</ul>
</div>

The next command will return the versions for R and R-packages used in your current R session. You can check that the packages called in a previous command are indeed loaded:

In [7]:
## Code cell 7 ##   

cat("--- Information about my R session and the loaded packages:\n")
sessionInfo()

--- Information about my R session and the loaded packages:


R version 4.2.3 (2023-03-15)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.2.3/lib/libopenblasp-r0.3.21.so

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] affy_1.76.0         Biobase_2.62.0      BiocGenerics_0.44.0
 [4] BiocManager_1.30.23 matrixStats_1.3.0   RColorBrewer_1.1-3 
 [7] ggrepel_0.9.5       ggfortify_0.4.16    data.table_1.15.4  
[10] lubridate_1.9.3     forcats_1.0.0       stringr_1.5.1      
[13] dplyr_1.1.4         purrr_1.0.2         readr_2.1.5        
[16] tidyr_1.3.1         tibble_3.2.1        ggplot2_3.5.1      
[19] tidyverse_2.0.0    

loaded via a namespace (and not attached):
 [1] pbdZMQ_0.3-11         tidyselect_1.2.1      repr_1.1.7           
 [4] colorspace_2.1-0      vctrs_0.6.5           generics_0.1.3       
 [7] htmltools_0.5.8.1     base64en

It looks ok and we can start the analysis!
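
For reproducibility, the same information can also be written to a text file kept next to the results. This is a minimal sketch: the file name `sessionInfo.txt` and its location in a temporary folder are assumptions made for the example, not paths used by the notebook.

```r
# Hypothetical output file; a temporary folder is used for this demo
session_file <- file.path(tempdir(), "sessionInfo.txt")

# capture.output() turns the printed session information into character lines
writeLines(capture.output(sessionInfo()), session_file)

# Quick look at the first lines that were saved
head(readLines(session_file), n = 3)
```

In a real analysis, the file would more naturally go into the results folder of the project.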

---

### 0.2 - Parameters to be set or modified by the user
---



- To know our **current working directory**, we use the function `getwd()` (for *get working directory*, equivalent to `pwd` in bash).

In [8]:
## Code cell 8 ##   

getwd()

<div class="alert alert-block alert-info"> <b> Info on working directory: </b><br>In a Jupyter notebook running R on a JupyterHub, the default working directory is the folder from which the notebook was opened.</div>

- **Define the folder** of the project as the `gohome` variable, using a full path with a `/` at the end, and the folder where you work as the `myfolder` variable *(here the value returned by `getwd()`, which has no trailing `/`)*.

In [9]:
## Code cell 9 ##

gohome <- "/shared/projects/2413_rnaseq_cea/"
gohome

myfolder <- getwd()
myfolder

- We also define the folder with the raw data as `datafolder` and the folder with the reference genome data as `reffolder`, and we store the version of the reference genome annotation in the variable `annot_version`.

In [10]:
## Code cell 10 ##

datafolder <- paste0(myfolder, "/Data/")
datafolder

reffolder <- paste0(gohome, "alldata/Reference/")
reffolder
annot_version <- "vM35"

- With a `/` at the end, we also define the folder `salmonfolder` containing the Salmon results, which we still need to aggregate for all samples at both the transcript and gene levels.

In [11]:
## Code cell 11 ##

salmonfolder <- paste0(myfolder, "/Results/salmon/")
salmonfolder
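
As an alternative to `paste0()` with hand-written `/` separators, base R provides `file.path()`, which inserts the separators itself. The sketch below is only an illustration: `base_dir` is a hypothetical folder standing in for `myfolder`, and nothing is created on disk.

```r
# Hypothetical base folder standing in for myfolder
base_dir <- "/shared/projects/my_project"

# file.path() joins its arguments with "/" and adds no trailing separator
salmon_dir <- file.path(base_dir, "Results", "salmon")
salmon_dir

# The same function builds the path to a file inside that folder
quant_file <- file.path(salmon_dir, "SRR12730403", "quant.sf")
quant_file
```

Note that paths built this way have no trailing `/`, unlike the folder variables defined in this notebook, so it is safer not to mix the two conventions within one project.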

- With a `/` at the end, define the path to the folder where the results of this initial exploratory analysis will be stored:

In [12]:
## Code cell 12 ##

# creation of the directory, recursive = TRUE is equivalent to the mkdir -p in Unix
dir.create(paste0(myfolder,"/Results/pca1/"), recursive = TRUE)

# storing the path to this output folder in a variable
pca1folder <- paste(myfolder,"/Results/pca1/", sep = "")
pca1folder

# listing the content of the folder
print(system(paste("ls -hlt", pca1folder), intern = TRUE) )

"'/shared/ifbstor1/projects/2413_rnaseq_cea/cvandiedonck/Results/pca1' already exists"


[1] "total 668K"                                                                       
[2] "-rw-rw----+ 1 cvandiedonck cvandiedonck 663K Jun 21 18:14 RawCounts_Samples.RData"


(When creating a directory with `dir.create()`, you obtain a red warning if the directory already exists.    
You can suppress this warning with the additional argument `showWarnings = FALSE`, as in Code cell 5 above.) 
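
Another common pattern is to test for the folder first, so that `dir.create()` is only called when needed. This is a sketch under assumptions: `out_dir` is a hypothetical output folder placed in a temporary directory for the example.

```r
# Hypothetical output folder, placed in a temporary directory for this demo
out_dir <- file.path(tempdir(), "Results", "pca1")

# Create the folder (and its parents) only if it does not exist yet
if (!dir.exists(out_dir)) {
    dir.create(out_dir, recursive = TRUE)
}

dir.exists(out_dir)
```

Running this block a second time produces no warning, because the `if()` condition is then `FALSE` and `dir.create()` is skipped.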

- Last, we specify the size of the graphical outputs that will be used for all the plots in the notebook.    
This setting could be modified at will for each plot. 

In [13]:
## Code cell 13 ##

options(repr.plot.width = 15, repr.plot.height = 8)
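
To change the size for a single plot and then return to the previous setting, one possibility is to keep the value returned by `options()`, which holds the old settings. The sketch below uses arbitrary sizes (8 and 5) chosen for illustration, and `old_opts` and `size_during` are names introduced here.

```r
# options() returns the previous values of the options it modifies
old_opts <- options(repr.plot.width = 8, repr.plot.height = 5)

# Value in effect while the smaller plot would be drawn
size_during <- getOption("repr.plot.width")
size_during

# ... the plot needing the smaller size would be drawn here ...

# Restore the settings that were active before
options(old_opts)
```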

---
## 1 - Loading metadata and input (STAR and Salmon) data
---

<div class="alert alert-block alert-info"><b>Info:</b><br>
In the previous parts of the pipeline, we worked on only three samples. Now that we have obtained read counts per feature, we have <i>lighter</i> files and we can work on <b>the whole dataset with all 11 samples</b>.</div>

We now need three types of files:   
1. a **metadata** file, providing information about the samples, in particular the conditions of the experiment. 
2. the **gene read counts** produced by `STAR` and `featureCounts` (as in the Pipe_06 notebook), on all 11 samples   
3. the **transcript counts** produced by `SALMON` for each sample

Each `file.copy()` call used below to fetch these files should return `TRUE`, and the copied files should then appear in your folder in the left-hand panel.

### 1.1 - Loading metadata   
---

We first copy the metadata file from the alldata folder into our personal folder with the `file.copy()` function.

In [14]:
## Code cell 14 ##

file.copy(from = paste0(gohome, "alldata/Data/sampleData-GSE158661.tsv"),
         to = datafolder)
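
`file.copy()` reports success with a logical value rather than an error, and it returns `FALSE` when the destination file already exists and `overwrite = FALSE` (the default). A minimal sketch of how the result could be checked is given below; the source file and the `Data_demo` folder are hypothetical and created in a temporary directory.

```r
# Hypothetical source file and destination folder, created for this demo
src <- tempfile(fileext = ".tsv")
writeLines("SampleID\tCondition", src)
dest_dir <- file.path(tempdir(), "Data_demo")
dir.create(dest_dir, showWarnings = FALSE)

# overwrite = TRUE makes the cell safe to rerun
copied <- file.copy(from = src, to = dest_dir, overwrite = TRUE)
if (!copied) stop("Copy failed: check the source path and your permissions")

# The file should now be present in the destination folder
file.exists(file.path(dest_dir, basename(src)))
```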

We read the tabulated table and we store the data in a dataframe named `samples` in our R session.  
The file has column names, so we set `header=TRUE`, and the separator as a tab (<code>sep = "\t"</code>).

In [15]:
## Code cell 15 ##

samples <- read.table(paste0(datafolder, "sampleData-GSE158661.tsv"),
                      sep = "\t",
                      header = TRUE,
                      stringsAsFactors = FALSE)

We then look at the class, structure and first rows of the imported file.

In [16]:
## Code cell 16 ##

class(samples)

In [17]:
## Code cell 17 ##

str(samples)

'data.frame':	11 obs. of  10 variables:
 $ SampleID           : chr  "SRR12730403" "SRR12730404" "SRR12730405" "SRR12730406" ...
 $ SampleName         : chr  "dHet_B-ALL_686_rep1" "dHet_B-ALL_686_rep2" "dHet_B-ALL_713_rep1" "dHet_B-ALL_713_rep2" ...
 $ Condition          : chr  "dHet" "dHet" "dHet" "dHet" ...
 $ Genotype           : chr  "Ebf1+/- Pax5+/-" "Ebf1+/- Pax5+/-" "Ebf1+/- Pax5+/-" "Ebf1+/- Pax5+/-" ...
 $ GEO_Accession..exp.: chr  "GSM4805160" "GSM4805161" "GSM4805162" "GSM4805163" ...
 $ Sample.Name        : chr  "GSM4805160" "GSM4805161" "GSM4805162" "GSM4805163" ...
 $ source_name        : chr  "Ebf1+/- Pax5+/- dHet leukemic cells derived from lymph node (mouse 686)\\,Replicate1" "Ebf1+/- Pax5+/- dHet leukemic cells derived from lymph node (mouse 686)\\,Replicate2" "Ebf1+/- Pax5+/- dHet leukemic cells derived from lymph node (mouse 713)\\,Replicate1" "Ebf1+/- Pax5+/- dHet leukemic cells derived from lymph node (mouse 713)\\,Replicate2" ...
 $ SRA.Study          : chr  "SRP

In [18]:
## Code cell 18 ##

head(samples, n = 12)

Unnamed: 0_level_0,SampleID,SampleName,Condition,Genotype,GEO_Accession..exp.,Sample.Name,source_name,SRA.Study,Strain,Tissue
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,SRR12730403,dHet_B-ALL_686_rep1,dHet,Ebf1+/- Pax5+/-,GSM4805160,GSM4805160,"Ebf1+/- Pax5+/- dHet leukemic cells derived from lymph node (mouse 686)\,Replicate1",SRP285633,C57BL/6J,Ebf1+/- Pax5+/- lymph node
2,SRR12730404,dHet_B-ALL_686_rep2,dHet,Ebf1+/- Pax5+/-,GSM4805161,GSM4805161,"Ebf1+/- Pax5+/- dHet leukemic cells derived from lymph node (mouse 686)\,Replicate2",SRP285633,C57BL/6J,Ebf1+/- Pax5+/- lymph node
3,SRR12730405,dHet_B-ALL_713_rep1,dHet,Ebf1+/- Pax5+/-,GSM4805162,GSM4805162,"Ebf1+/- Pax5+/- dHet leukemic cells derived from lymph node (mouse 713)\,Replicate1",SRP285633,C57BL/6J,Ebf1+/- Pax5+/- lymph node
4,SRR12730406,dHet_B-ALL_713_rep2,dHet,Ebf1+/- Pax5+/-,GSM4805163,GSM4805163,"Ebf1+/- Pax5+/- dHet leukemic cells derived from lymph node (mouse 713)\,Replicate2",SRP285633,C57BL/6J,Ebf1+/- Pax5+/- lymph node
5,SRR12730407,dHet_B-ALL_760_rep1,dHet,Ebf1+/- Pax5+/-,GSM4805164,GSM4805164,"Ebf1+/- Pax5+/- dHet leukemic cells derived from lymph node (mouse 760)\,Replicate1",SRP285633,C57BL/6J,Ebf1+/- Pax5+/- lymph node
6,SRR12730408,dHet_B-ALL_760_rep2,dHet,Ebf1+/- Pax5+/-,GSM4805165,GSM4805165,"Ebf1+/- Pax5+/- dHet leukemic cells derived from lymph node (mouse 760)\,Replicate2",SRP285633,C57BL/6J,Ebf1+/- Pax5+/- lymph node
7,SRR12730409,dHet_FetalLiver_proB_rep1,dHetRag,Ebf1+/- Pax5+/- Rag2-/-,GSM4805166,GSM4805166,"Ebf1+/- Pax5+/- dHet fetal liver proB cells\,Replicate1",SRP285633,C57BL/6J,Ebf1+/- Pax5+/- Rag2-/- proB fetal liver
8,SRR12730410,dHet_FetalLiver_proB_rep2,dHetRag,Ebf1+/- Pax5+/- Rag2-/-,GSM4805167,GSM4805167,"Ebf1+/- Pax5+/- dHet fetal liver proB cells\,Replicate2",SRP285633,C57BL/6J,Ebf1+/- Pax5+/- Rag2-/- proB fetal liver
9,SRR12730411,dHet_FetalLiver_proB_rep3,dHetRag,Ebf1+/- Pax5+/- Rag2-/-,GSM4805168,GSM4805168,"Ebf1+/- Pax5+/- dHet fetal liver proB cells\,Replicate3",SRP285633,C57BL/6J,Ebf1+/- Pax5+/- Rag2-/- proB fetal liver
10,SRR12730412,wt_BoneMar_proB_rep1,WT,Ebf1+/+ Pax5+/+ Rag2+/+,GSM4805169,GSM4805169,"wt proB cells\, Replicate1",SRP285633,C57BL/6J,wt proB (Fraction BC) bone marrow


Note that samples are in rows and metadata in columns.
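
Column names such as `GEO_Accession..exp.` and `Sample.Name` are typical of what `read.table()` produces when headers contain spaces or brackets, because by default (`check.names = TRUE`) it converts them into syntactically valid R names. The sketch below illustrates this on a toy table; the headers are hypothetical and are not claimed to be those of the original file.

```r
# Toy tab-separated table with hypothetical headers containing spaces and brackets
toy_tsv <- "SampleID\tGEO_Accession (exp)\tSample Name\nSRR1\tGSM1\tGSM1\n"

# Default behaviour: invalid characters in the headers are replaced by "."
cleaned <- read.table(text = toy_tsv, sep = "\t", header = TRUE,
                      stringsAsFactors = FALSE)
names(cleaned)

# check.names = FALSE keeps the headers exactly as written in the file
kept <- read.table(text = toy_tsv, sep = "\t", header = TRUE,
                   stringsAsFactors = FALSE, check.names = FALSE)
names(kept)
```

Keeping the raw headers means that columns with spaces must then be accessed with backticks or with `[["..."]]`.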

### 1.2 - STAR/featureCounts raw data   
---

#### 1.2.1. Loading the data

We copy the file with the STAR/featureCounts read counts from the alldata folder into our personal folder with the `file.copy()` function.

In [19]:
## Code cell 19 ##

file.copy(from = paste0(gohome, "alldata/Results/featurecounts/11samples_paired-unstranded.counts"),
          to = datafolder)

We read the tab-delimited table and we store the data in a dataframe named `countdata`.   
In addition to specifying the separator and the presence of column names, we also indicate that lines beginning with "#" are comments and should be ignored.

In [20]:
## Code cell 20 ##

countdata <- read.table(paste0(datafolder, "11samples_paired-unstranded.counts"),
                     sep = "\t",
                     header = TRUE,
                     comment.char = "#")

In [21]:
## Code cell 21 ##

class(countdata)

In [22]:
## Code cell 22 ##

str(countdata)

'data.frame':	57186 obs. of  17 variables:
 $ Geneid                                                                                               : chr  "ENSMUSG00000102693.2" "ENSMUSG00000064842.3" "ENSMUSG00000051951.6" "ENSMUSG00000102851.2" ...
 $ Chr                                                                                                  : chr  "chr1" "chr1" "chr1;chr1;chr1;chr1;chr1;chr1;chr1" "chr1" ...
 $ Start                                                                                                : chr  "3143476" "3172239" "3276124;3276746;3283662;3283832;3284705;3491925;3740775" "3322980" ...
 $ End                                                                                                  : chr  "3144545" "3172348" "3277540;3277540;3285855;3286567;3287191;3492124;3741721" "3323459" ...
 $ Strand                                                                                               : chr  "+" "+" "-;-;-;-;-;-;-" "+" ...
 $ Length                   

In [23]:
## Code cell 23 ##

head(countdata)

Unnamed: 0_level_0,Geneid,Chr,Start,End,Strand,Length,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730403_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730404_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730405_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730406_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730407_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730408_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730409_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730410_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730411_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730412_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730413_Aligned.sortedByNames.bam
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,ENSMUSG00000102693.2,chr1,3143476,3144545,+,1070,0,0,0,0,0,0,0,0,0,0,0
2,ENSMUSG00000064842.3,chr1,3172239,3172348,+,110,0,0,0,0,0,0,0,0,0,0,0
3,ENSMUSG00000051951.6,chr1;chr1;chr1;chr1;chr1;chr1;chr1,3276124;3276746;3283662;3283832;3284705;3491925;3740775,3277540;3277540;3285855;3286567;3287191;3492124;3741721,-;-;-;-;-;-;-,6094,0,0,0,0,0,0,0,2,0,0,0
4,ENSMUSG00000102851.2,chr1,3322980,3323459,+,480,0,0,0,0,0,0,0,0,0,0,1
5,ENSMUSG00000103377.2,chr1,3435954,3438772,-,2819,0,0,0,0,0,0,0,0,0,0,0
6,ENSMUSG00000104017.2,chr1,3445779,3448011,-,2233,0,0,0,0,0,0,0,0,0,0,0


Note that each row corresponds to a gene, with the first 6 columns providing Ensembl/Gencode gene annotations. The next columns correspond to our samples.
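
The long sample column names starting with `X.` come from the BAM paths in the header: a name beginning with `/` is not a valid R name, so `read.table()` prepends `X` and replaces each `/` with `.`. The sketch below reproduces this on a toy table; the comment line, the gene names and the BAM path are all made up for illustration.

```r
# Toy table mimicking the layout of a featureCounts output (hypothetical content)
toy_counts <- paste(
    "# Program:featureCounts (hypothetical command line)",
    "Geneid\tLength\t/data/S1_Aligned.bam",
    "geneA\t1070\t0",
    "geneB\t110\t2",
    sep = "\n")

# The line starting with "#" is skipped; the next line is used as header
toy <- read.table(text = toy_counts, sep = "\t", header = TRUE,
                  comment.char = "#")
names(toy)
dim(toy)
```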

---
#### 1.2.2 - Counts data formatting


- **removing useless columns**

The `countdata` dataframe contains some columns which are irrelevant for our current analysis and will be removed.
These are the chromosome names, start and end positions, strand and gene length (columns 2 to 6). We don't need this information to perform the exploratory and differential expression analyses, hence we drop these columns with the following code. We however keep the gene ID in the first column *(better than as row names)*. Thus, if we want to get only the columns of count data without the gene ID, we will have to use either `countdata[,-1]` or `countdata[,2:12]`.

<div class="alert alert-block alert-warning"> <b> Warning: </b><br>
<b>R is extremely memory-consuming</b>, since all objects are kept in memory. It is not recommended to keep redundant R objects in a session. While formatting the <code>countdata</code> object in this section, and filtering it in the next one, we will overwrite it after each step. Thus, be careful to run the cells in the right order. If you have a doubt at some point, go back to Code cell 20 to start again from the initial dataframe.</div>

In [24]:
## Code cell 24 ##

countdata <- countdata[, c(1,7:17)]
str(countdata)
head(countdata)

'data.frame':	57186 obs. of  12 variables:
 $ Geneid                                                                                               : chr  "ENSMUSG00000102693.2" "ENSMUSG00000064842.3" "ENSMUSG00000051951.6" "ENSMUSG00000102851.2" ...
 $ X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730403_Aligned.sortedByNames.bam: int  0 0 0 0 0 0 0 0 0 0 ...
 $ X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730404_Aligned.sortedByNames.bam: int  0 0 0 0 0 0 0 0 0 0 ...
 $ X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730405_Aligned.sortedByNames.bam: int  0 0 0 0 0 0 0 0 0 0 ...
 $ X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730406_Aligned.sortedByNames.bam: int  0 0 0 0 0 0 0 0 0 2 ...
 $ X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730407_Aligned.sortedByNames.bam: int  0 0 0 0 0 0 0 0 0 0 ...
 $ X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730

Unnamed: 0_level_0,Geneid,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730403_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730404_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730405_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730406_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730407_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730408_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730409_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730410_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730411_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730412_Aligned.sortedByNames.bam,X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730413_Aligned.sortedByNames.bam
Unnamed: 0_level_1,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,ENSMUSG00000102693.2,0,0,0,0,0,0,0,0,0,0,0
2,ENSMUSG00000064842.3,0,0,0,0,0,0,0,0,0,0,0
3,ENSMUSG00000051951.6,0,0,0,0,0,0,0,2,0,0,0
4,ENSMUSG00000102851.2,0,0,0,0,0,0,0,0,0,0,1
5,ENSMUSG00000103377.2,0,0,0,0,0,0,0,0,0,0,0
6,ENSMUSG00000104017.2,0,0,0,0,0,0,0,0,0,0,0


In [25]:
## Code cell 25 ##

dim(countdata)
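
Selecting columns by position works here, but it silently depends on the column order of the input file. A possible alternative is to drop the annotation columns by name. The sketch below runs on a small made-up dataframe (`toy`, with hypothetical samples `S1` and `S2`), not on the real `countdata`.

```r
# Toy dataframe with the same annotation columns as a featureCounts table
toy <- data.frame(Geneid = c("geneA", "geneB"), Chr = "chr1",
                  Start = c(1, 50), End = c(40, 90), Strand = "+",
                  Length = c(40, 41), S1 = c(0L, 5L), S2 = c(2L, 7L))

# Columns to discard, identified by name rather than by position
annot_cols <- c("Chr", "Start", "End", "Strand", "Length")
toy_counts <- toy[, !(names(toy) %in% annot_cols)]

names(toy_counts)
```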

- **reformatting Sample names**

Below are the sample names in the counts dataframe:

In [26]:
## Code cell 26 ##

names(countdata)

The names of the columns are not very easy to read and, most of all, they don't correspond to the sample IDs present in the first column of the metadata. Therefore, we cannot match the two tables yet. So we modify the names of the columns in `countdata` by removing the first part, in front of the sample ID, and then the last part, starting at the `_`. 

In [27]:
## Code cell 27 ##

# We remove the common prefix in all columns except the first one
# (fixed = TRUE: the pattern is matched literally, "." is not a regex wildcard)
colnames(countdata)[-1] <- gsub("X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.",
                         "",
                         colnames(countdata)[-1],
                         fixed = TRUE)

# We remove the common suffix in all columns except the first one
colnames(countdata)[-1] <- gsub("_Aligned.sortedByNames.bam",
                         "",
                         colnames(countdata)[-1],
                         fixed = TRUE)

We check the new names:

In [28]:
## Code cell 28 ##

names(countdata)
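
When the prefix differs between projects, the sample ID can instead be extracted directly with a regular expression and a capture group. This is only a sketch, assuming that every column name contains exactly one run ID of the form `SRR` followed by digits; `long_names` holds two column names copied from the output above.

```r
# Two of the original column names, as produced by read.table()
long_names <- c(
  "X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730403_Aligned.sortedByNames.bam",
  "X.shared.projects.2413_rnaseq_cea.alldata.Results.featurecounts.SRR12730404_Aligned.sortedByNames.bam")

# Keep only the captured group: "SRR" followed by one or more digits
short_names <- sub("^.*(SRR[0-9]+).*$", "\\1", long_names)
short_names
```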

To ensure that the metadata `samples` and the `countdata` dataframes have the same sample names in the same order *(tested element by element with `==`)*, we run the code below.   
The table counts the comparisons: 11 `TRUE` and no `FALSE` means that the 11 samples match in both files.

In [29]:
## Code cell 29 ##

table(colnames(countdata)[-1] == samples$SampleID)


TRUE 
  11 
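
A table of `TRUE`/`FALSE` has to be read by eye. In a script, it can be safer to stop on a mismatch, and to reorder one set of IDs on the other when only the order differs. The sketch below uses made-up IDs (`meta_ids`, `count_ids`) to illustrate `identical()` and `match()`.

```r
# Hypothetical IDs: same samples, but in a different order
meta_ids  <- c("SRR1", "SRR2", "SRR3")   # order in the metadata
count_ids <- c("SRR3", "SRR1", "SRR2")   # order in the count table

# identical() is TRUE only if content and order are both the same
same_order <- identical(count_ids, meta_ids)
same_order

# match() gives, for each metadata ID, its position in the count table
idx <- match(meta_ids, count_ids)
reordered <- count_ids[idx]

# Stops with an error if the reordering did not align the two vectors
stopifnot(identical(reordered, meta_ids))
```

On a real count table, `idx` would be applied to the sample columns (all columns except the gene ID) before renaming them.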

- **rename SampleID with SampleName**:

SRR names are not very informative. We replace them by the corresponding sample name.

In [30]:
## Code cell 30 ##

colnames(countdata)[-1] <- samples$SampleName
str(countdata)
head(countdata)

'data.frame':	57186 obs. of  12 variables:
 $ Geneid                   : chr  "ENSMUSG00000102693.2" "ENSMUSG00000064842.3" "ENSMUSG00000051951.6" "ENSMUSG00000102851.2" ...
 $ dHet_B-ALL_686_rep1      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ dHet_B-ALL_686_rep2      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ dHet_B-ALL_713_rep1      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ dHet_B-ALL_713_rep2      : int  0 0 0 0 0 0 0 0 0 2 ...
 $ dHet_B-ALL_760_rep1      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ dHet_B-ALL_760_rep2      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ dHet_FetalLiver_proB_rep1: int  0 0 0 0 0 0 0 0 0 0 ...
 $ dHet_FetalLiver_proB_rep2: int  0 0 2 0 0 0 0 0 0 0 ...
 $ dHet_FetalLiver_proB_rep3: int  0 0 0 0 0 0 0 0 0 0 ...
 $ wt_BoneMar_proB_rep1     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ wt_BoneMar_proB_rep2     : int  0 0 0 1 0 0 0 0 0 0 ...


Unnamed: 0_level_0,Geneid,dHet_B-ALL_686_rep1,dHet_B-ALL_686_rep2,dHet_B-ALL_713_rep1,dHet_B-ALL_713_rep2,dHet_B-ALL_760_rep1,dHet_B-ALL_760_rep2,dHet_FetalLiver_proB_rep1,dHet_FetalLiver_proB_rep2,dHet_FetalLiver_proB_rep3,wt_BoneMar_proB_rep1,wt_BoneMar_proB_rep2
Unnamed: 0_level_1,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,ENSMUSG00000102693.2,0,0,0,0,0,0,0,0,0,0,0
2,ENSMUSG00000064842.3,0,0,0,0,0,0,0,0,0,0,0
3,ENSMUSG00000051951.6,0,0,0,0,0,0,0,2,0,0,0
4,ENSMUSG00000102851.2,0,0,0,0,0,0,0,0,0,0,1
5,ENSMUSG00000103377.2,0,0,0,0,0,0,0,0,0,0,0
6,ENSMUSG00000104017.2,0,0,0,0,0,0,0,0,0,0,0


### 1.3 - Salmon pseudo-mapping data   
---

In [31]:
## Code cell 31 ##

cat("Start time is :", date(), "\n")
cat("Gencode reference : ", annot_version, "\n")  
cat("Salmon directory is ", salmonfolder, "\n")

Start time is : Sun Jun 23 18:51:05 2024 
Gencode reference :  vM35 
Salmon directory is  /shared/ifbstor1/projects/2413_rnaseq_cea/cvandiedonck/Results/salmon/ 


#### 1.3.1. Getting the salmon files

We do this directly for the 11 samples, after copying the Salmon results to our own directory *(be aware that this erases our previous folders)*. You can skip this cell in your own study if all your Salmon results are already in your personal directory.

In [32]:
## Code cell 32 ##

R.utils::copyDirectory(from = paste0(gohome, "alldata/Results/salmon/"),
                       to = salmonfolder,
                       recursive = TRUE)

We now retrieve IDs of our samples: 

In [33]:
## Code cell 33 ##

sample_names <- list.files("./Results/salmon/")
sample_names

We then list the file paths to the corresponding `quant.sf` files containing the Salmon pseudomapping counts per transcripts for each sample.

In [34]:
## Code cell 34 ##

sample_files <- paste0("./Results/salmon/",sample_names, "/quant.sf")
sample_files

We check the number of samples we are dealing with:

In [35]:
## Code cell 35 ##

cat("\nNumber of samples : ", length(sample_names), "\n")


Number of samples :  11 
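
Before reading the files, it can be useful to verify that each sample folder really contains a `quant.sf` file. The sketch below builds a small hypothetical folder tree in a temporary directory (`sampleA`, `sampleB`) to show the check; only the header line of a Salmon `quant.sf` file is written.

```r
# Hypothetical Salmon-like folder tree created for this demo
demo_dir <- file.path(tempdir(), "salmon_demo")
for (s in c("sampleA", "sampleB")) {
    dir.create(file.path(demo_dir, s), recursive = TRUE, showWarnings = FALSE)
    writeLines("Name\tLength\tEffectiveLength\tTPM\tNumReads",
               file.path(demo_dir, s, "quant.sf"))
}

# One sub-folder per sample, one quant.sf expected in each
demo_names <- list.files(demo_dir)
demo_files <- file.path(demo_dir, demo_names, "quant.sf")

# Stop early, naming the missing files, if any sample is incomplete
missing <- demo_files[!file.exists(demo_files)]
if (length(missing) > 0) {
    stop("Missing quant.sf for: ", paste(missing, collapse = ", "))
}
length(demo_files)
```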


---
#### 1.3.2. Create a table of **raw read counts** *(NumReads)* per transcript for all samples

We start with the first sample and read its `quant.sf` file, keeping the first column *(name of the transcript)* and the last (5th) column *(NumReads)*, which contains the raw counts per transcript:

In [36]:
## Code cell 36 ##

cat("\nFirst sample :", sample_files[1], "\n")


First sample : ./Results/salmon/SRR12730403/quant.sf 


In [37]:
## Code cell 37 ##

numreads <- read.table(sample_files[1], sep = "\t", header = TRUE)
numreads <- numreads[, c(1, 5)]
head(numreads)
dim(numreads)

Unnamed: 0_level_0,Name,NumReads
Unnamed: 0_level_1,<chr>,<dbl>
1,ENSMUST00000193812.2,0
2,ENSMUST00000082908.3,0
3,ENSMUST00000162897.2,0
4,ENSMUST00000159265.2,0
5,ENSMUST00000070533.5,0
6,ENSMUST00000192857.2,0


We keep going with a loop to retrieve in the same dataframe the other samples and we rename the columns: the first one as `transcript_id` and the `NumReads` columns with the sample name.

In [38]:
## Code cell 38 ##

for (i in 2:length(sample_files)) { 
 mytab <- read.table(sample_files[i], sep = "\t", header= T)
 mytab  <- mytab  [,c(1,5)]  
 names(mytab)[2] <- sample_names[i]
 numreads <- numreads %>% left_join(mytab, by='Name')
}
names(numreads) <- c("transcript_id", sample_names)	
head(numreads)
rm(mytab)

Unnamed: 0_level_0,transcript_id,SRR12730403,SRR12730404,SRR12730405,SRR12730406,SRR12730407,SRR12730408,SRR12730409,SRR12730410,SRR12730411,SRR12730412,SRR12730413
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,ENSMUST00000193812.2,0,0,0,0,0,0,0,0,0,0,0
2,ENSMUST00000082908.3,0,0,0,0,0,0,0,0,0,0,0
3,ENSMUST00000162897.2,0,0,0,0,0,0,0,0,0,0,0
4,ENSMUST00000159265.2,0,0,0,0,0,0,0,0,0,0,0
5,ENSMUST00000070533.5,0,0,0,0,0,0,0,0,0,0,0
6,ENSMUST00000192857.2,0,0,0,0,0,0,0,0,0,0,0


For further analyses, it is even better to replace the SRR numbers that currently identify the samples with the corresponding sample names, after checking that the IDs match. 

In [39]:
## Code cell 39 ##

table(names(numreads)[-1] == samples$SampleID)
names(numreads)[-1]  <- samples$SampleName
head(numreads)


TRUE 
  11 

Unnamed: 0_level_0,transcript_id,dHet_B-ALL_686_rep1,dHet_B-ALL_686_rep2,dHet_B-ALL_713_rep1,dHet_B-ALL_713_rep2,dHet_B-ALL_760_rep1,dHet_B-ALL_760_rep2,dHet_FetalLiver_proB_rep1,dHet_FetalLiver_proB_rep2,dHet_FetalLiver_proB_rep3,wt_BoneMar_proB_rep1,wt_BoneMar_proB_rep2
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,ENSMUST00000193812.2,0,0,0,0,0,0,0,0,0,0,0
2,ENSMUST00000082908.3,0,0,0,0,0,0,0,0,0,0,0
3,ENSMUST00000162897.2,0,0,0,0,0,0,0,0,0,0,0
4,ENSMUST00000159265.2,0,0,0,0,0,0,0,0,0,0,0
5,ENSMUST00000070533.5,0,0,0,0,0,0,0,0,0,0,0
6,ENSMUST00000192857.2,0,0,0,0,0,0,0,0,0,0,0
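The renaming above relies on the count columns and the metadata rows being in the same order, which the `table()` check confirms. If the order differed, a lookup with `match()` would still assign the right names. A small sketch on toy objects: `toy_counts` and `toy_samples` are made up for illustration, and only the column names `SampleID` and `SampleName` come from this notebook.

```r
toy_counts <- data.frame(transcript_id = c("t1", "t2"),
                         SRR02 = c(5, 0),
                         SRR01 = c(1, 3))
toy_samples <- data.frame(SampleID   = c("SRR01", "SRR02"),
                          SampleName = c("ctrl_rep1", "treated_rep1"))

# position of each count column in the metadata table
idx <- match(names(toy_counts)[-1], toy_samples$SampleID)
stopifnot(!anyNA(idx))   # stop if a count column has no metadata entry

# rename through the lookup, whatever the row order of the metadata
names(toy_counts)[-1] <- toy_samples$SampleName[idx]
names(toy_counts)
```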


---
#### 1.3.3. Create a table of **transcripts per million** *(TPM)* for all samples

Similarly, we generate a dataframe with TPM values for all samples. These TPMs were already calculated by salmon and are present in the fourth column of the `quant.sf` files.

In [40]:
## Code cell 40 ##

tpm <- read.table(sample_files[1], sep = "\t", header= T)
tpm <- tpm [,c(1,4)]
#head(tpm)

for (i in 2:length(sample_files)) { 
 mytab <- read.table(sample_files[i], sep = "\t", header= T)
 mytab  <- mytab  [,c(1,4)]  
 names(mytab)[2] <- sample_names[i]
 tpm <- tpm %>% left_join(mytab, by='Name')
}
names(tpm) <- c("transcript_id", sample_names)	
cat("Table dimensions : ", dim(tpm), "\n")
head(tpm)
rm(mytab)

Table dimensions :  147556 12 


Unnamed: 0_level_0,transcript_id,SRR12730403,SRR12730404,SRR12730405,SRR12730406,SRR12730407,SRR12730408,SRR12730409,SRR12730410,SRR12730411,SRR12730412,SRR12730413
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,ENSMUST00000193812.2,0,0,0,0,0,0,0,0,0,0,0
2,ENSMUST00000082908.3,0,0,0,0,0,0,0,0,0,0,0
3,ENSMUST00000162897.2,0,0,0,0,0,0,0,0,0,0,0
4,ENSMUST00000159265.2,0,0,0,0,0,0,0,0,0,0,0
5,ENSMUST00000070533.5,0,0,0,0,0,0,0,0,0,0,0
6,ENSMUST00000192857.2,0,0,0,0,0,0,0,0,0,0,0


As for the raw counts, we replace the SRR numbers that currently identify the samples with the corresponding sample names, after checking that the IDs match. 

In [41]:
## Code cell 41 ##

table(names(tpm)[-1] == samples$SampleID)
names(tpm)[-1]  <- samples$SampleName
head(tpm)


TRUE 
  11 

Unnamed: 0_level_0,transcript_id,dHet_B-ALL_686_rep1,dHet_B-ALL_686_rep2,dHet_B-ALL_713_rep1,dHet_B-ALL_713_rep2,dHet_B-ALL_760_rep1,dHet_B-ALL_760_rep2,dHet_FetalLiver_proB_rep1,dHet_FetalLiver_proB_rep2,dHet_FetalLiver_proB_rep3,wt_BoneMar_proB_rep1,wt_BoneMar_proB_rep2
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,ENSMUST00000193812.2,0,0,0,0,0,0,0,0,0,0,0
2,ENSMUST00000082908.3,0,0,0,0,0,0,0,0,0,0,0
3,ENSMUST00000162897.2,0,0,0,0,0,0,0,0,0,0,0
4,ENSMUST00000159265.2,0,0,0,0,0,0,0,0,0,0,0
5,ENSMUST00000070533.5,0,0,0,0,0,0,0,0,0,0,0
6,ENSMUST00000192857.2,0,0,0,0,0,0,0,0,0,0,0
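The NumReads and TPM tables are built with the same loop, differing only in the column that is kept. One possible way to avoid the duplication is a small helper function. The sketch below is an illustration only: `collect_column()` is a hypothetical name, it uses base R with `match()` instead of `left_join()`, and the two input files are tiny made-up tables with the five Salmon column names.

```r
# Build one table from the column `value_col` of several quant.sf-like files.
# Rows are aligned on the transcript names of the first file (left-join logic).
collect_column <- function(files, sample_ids, value_col) {
  first <- read.table(files[1], sep = "\t", header = TRUE)
  out <- data.frame(transcript_id = first$Name)
  for (i in seq_along(files)) {
    tab <- read.table(files[i], sep = "\t", header = TRUE)
    out[[sample_ids[i]]] <- tab[[value_col]][match(out$transcript_id, tab$Name)]
  }
  out
}

# Toy input: two tiny files with the five Salmon columns (values are made up)
toy_dir <- tempfile("quant_toy")
dir.create(toy_dir)
toy_files <- file.path(toy_dir, c("s1.sf", "s2.sf"))
write.table(data.frame(Name = c("t1", "t2"), Length = c(100, 200),
                       EffectiveLength = c(80, 180), TPM = c(10, 90),
                       NumReads = c(2, 18)),
            toy_files[1], sep = "\t", quote = FALSE, row.names = FALSE)
write.table(data.frame(Name = c("t2", "t1"), Length = c(200, 100),
                       EffectiveLength = c(180, 80), TPM = c(60, 40),
                       NumReads = c(12, 8)),
            toy_files[2], sep = "\t", quote = FALSE, row.names = FALSE)

toy_tpm      <- collect_column(toy_files, c("s1", "s2"), "TPM")
toy_numreads <- collect_column(toy_files, c("s1", "s2"), "NumReads")
toy_tpm
```

Selecting the column by name (`"TPM"`, `"NumReads"`) rather than by position also makes the code easier to read than `c(1,4)` or `c(1,5)`.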


---
#### 1.3.4. Aggregate counts at the **gene level** for all samples

- **Upload the transcript reference:**

We need to get the transcript IDs and the corresponding gene IDs.

We read the transcript FASTA `.gz` file and keep only the rows starting with `>` to get the transcript IDs.

In [42]:
## Code cell 42 ##

transcripts_info <- data.table::fread(paste0(reffolder, "salmon/gencode.vM35.transcripts.fa.gz"), header = FALSE) %>% filter(str_detect(V1, "^>"))
str(transcripts_info)

Classes 'data.table' and 'data.frame':	149138 obs. of  1 variable:
 $ V1: chr  ">ENSMUST00000193812.2|ENSMUSG00000102693.2|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|" ">ENSMUST00000082908.3|ENSMUSG00000064842.3|-|-|Gm26206-201|Gm26206|110|snRNA|" ">ENSMUST00000162897.2|ENSMUSG00000051951.6|OTTMUSG00000026353.2|OTTMUST00000086625.1|Xkr4-203|Xkr4|4153|protein"| __truncated__ ">ENSMUST00000159265.2|ENSMUSG00000051951.6|OTTMUSG00000026353.2|OTTMUST00000086624.1|Xkr4-202|Xkr4|2989|protein"| __truncated__ ...
 - attr(*, ".internal.selfref")=<externalptr> 


As you can see, the first and only column contains several pieces of information separated by `|`, so we split this column into 8.

In [43]:
## Code cell 43 ##

transcripts_info <- transcripts_info %>% separate(V1,
                                                  into = c("transcript_id", "gene_id", "havana_gene_id", "havana_tr_id", "ext_tr_name", "gene_name", "length", "biotype"),
                                                  remove = TRUE,
                                                 sep = "\\|",
                                                 extra ="drop")
str(transcripts_info)

'data.frame':	149138 obs. of  8 variables:
 $ transcript_id : chr  ">ENSMUST00000193812.2" ">ENSMUST00000082908.3" ">ENSMUST00000162897.2" ">ENSMUST00000159265.2" ...
 $ gene_id       : chr  "ENSMUSG00000102693.2" "ENSMUSG00000064842.3" "ENSMUSG00000051951.6" "ENSMUSG00000051951.6" ...
 $ havana_gene_id: chr  "OTTMUSG00000049935.1" "-" "OTTMUSG00000026353.2" "OTTMUSG00000026353.2" ...
 $ havana_tr_id  : chr  "OTTMUST00000127109.1" "-" "OTTMUST00000086625.1" "OTTMUST00000086624.1" ...
 $ ext_tr_name   : chr  "4933401J01Rik-201" "Gm26206-201" "Xkr4-203" "Xkr4-202" ...
 $ gene_name     : chr  "4933401J01Rik" "Gm26206" "Xkr4" "Xkr4" ...
 $ length        : chr  "1070" "110" "4153" "2989" ...
 $ biotype       : chr  "TEC" "snRNA" "protein_coding_CDS_not_defined" "protein_coding_CDS_not_defined" ...


Finally, we need to further format this dataframe by removing the `>` at the beginning of the ENSMUST transcript ID in the first column.

In [44]:
## Code cell 44 ##

transcripts_info <- transcripts_info %>% mutate(transcript_id = str_replace(transcript_id, ">", ""))
head(transcripts_info)
tail(transcripts_info)

Unnamed: 0_level_0,transcript_id,gene_id,havana_gene_id,havana_tr_id,ext_tr_name,gene_name,length,biotype
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,ENSMUST00000193812.2,ENSMUSG00000102693.2,OTTMUSG00000049935.1,OTTMUST00000127109.1,4933401J01Rik-201,4933401J01Rik,1070,TEC
2,ENSMUST00000082908.3,ENSMUSG00000064842.3,-,-,Gm26206-201,Gm26206,110,snRNA
3,ENSMUST00000162897.2,ENSMUSG00000051951.6,OTTMUSG00000026353.2,OTTMUST00000086625.1,Xkr4-203,Xkr4,4153,protein_coding_CDS_not_defined
4,ENSMUST00000159265.2,ENSMUSG00000051951.6,OTTMUSG00000026353.2,OTTMUST00000086624.1,Xkr4-202,Xkr4,2989,protein_coding_CDS_not_defined
5,ENSMUST00000070533.5,ENSMUSG00000051951.6,OTTMUSG00000026353.2,OTTMUST00000065166.1,Xkr4-201,Xkr4,3634,protein_coding
6,ENSMUST00000192857.2,ENSMUSG00000102851.2,OTTMUSG00000049958.1,OTTMUST00000127143.1,Gm18956-201,Gm18956,480,processed_pseudogene


Unnamed: 0_level_0,transcript_id,gene_id,havana_gene_id,havana_tr_id,ext_tr_name,gene_name,length,biotype
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
149133,ENSMUST00000082418.1,ENSMUSG00000064367.1,-,-,mt-Nd5-201,mt-Nd5,1824,protein_coding
149134,ENSMUST00000082419.1,ENSMUSG00000064368.1,-,-,mt-Nd6-201,mt-Nd6,519,protein_coding
149135,ENSMUST00000082420.1,ENSMUSG00000064369.1,-,-,mt-Te-201,mt-Te,69,Mt_tRNA
149136,ENSMUST00000082421.1,ENSMUSG00000064370.1,-,-,mt-Cytb-201,mt-Cytb,1144,protein_coding
149137,ENSMUST00000082422.1,ENSMUSG00000064371.1,-,-,mt-Tt-201,mt-Tt,67,Mt_tRNA
149138,ENSMUST00000082423.1,ENSMUSG00000064372.1,-,-,mt-Tp-201,mt-Tp,67,Mt_tRNA
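To see what the header parsing does, here is a minimal base-R sketch applied to the first two header lines shown in the output of cell 42. It is an alternative illustration using `strsplit()`, not the `separate()` call used above, and unlike the notebook it converts `length` to an integer.

```r
headers <- c(
  ">ENSMUST00000193812.2|ENSMUSG00000102693.2|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|",
  ">ENSMUST00000082908.3|ENSMUSG00000064842.3|-|-|Gm26206-201|Gm26206|110|snRNA|"
)
fields <- c("transcript_id", "gene_id", "havana_gene_id", "havana_tr_id",
            "ext_tr_name", "gene_name", "length", "biotype")

# drop the leading ">" then split on the literal "|" character
parts <- strsplit(sub("^>", "", headers), "|", fixed = TRUE)

# keep the first 8 fields of each header and stack them into a data frame
parsed <- as.data.frame(do.call(rbind, lapply(parts, `[`, seq_along(fields))),
                        stringsAsFactors = FALSE)
names(parsed) <- fields
parsed$length <- as.integer(parsed$length)   # lengths are numbers, not text

parsed[, c("transcript_id", "gene_id", "gene_name", "length", "biotype")]
```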


As a final check, we verify that all the transcript IDs present in `numreads` (and `tpm`) are also present in this transcript annotation file.

In [45]:
## Code cell 45 ##

cat("\nCheck missing transcripts...\n")
cat(" this experiment - the reference :")
cat(length(setdiff(numreads$transcript_id, transcripts_info$transcript_id)), "\n")
# [1] 0
cat(" the reference - this experiment :")
cat(length(setdiff(transcripts_info$transcript_id, numreads$transcript_id)), "\n")
# [1] 1582



Check missing transcripts...
 this experiment - the reference :0 
 the reference - this experiment :1582 


We are now ready to perform the aggregation by gene, since the transcript annotation file contains both the transcript and the gene IDs.

##### **a-gene aggregation on numreads**

In [46]:
## Code cell 46 ##

genecounts <- numreads  %>% left_join(transcripts_info %>% select(transcript_id, gene_id),
                                      by='transcript_id') %>% 
 select(transcript_id, gene_id, everything()) %>% select(-1) %>% 
 group_by(gene_id) %>% summarize_all(sum, na.rm = TRUE) 

genecounts <- genecounts %>% mutate(gene_id = str_replace(gene_id, "\\.\\d{1,2}$", ""))

cat("Table dimensions:", dim(genecounts), "\n")

head(genecounts)

Table dimensions: 56065 12 


gene_id,dHet_B-ALL_686_rep1,dHet_B-ALL_686_rep2,dHet_B-ALL_713_rep1,dHet_B-ALL_713_rep2,dHet_B-ALL_760_rep1,dHet_B-ALL_760_rep2,dHet_FetalLiver_proB_rep1,dHet_FetalLiver_proB_rep2,dHet_FetalLiver_proB_rep3,wt_BoneMar_proB_rep1,wt_BoneMar_proB_rep2
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSMUSG00000000001,10106.681,7297.753,8223.932,7960.522,7963.566,7331.798,8147.421,7791.765,7611.429,9301.767,8276.171
ENSMUSG00000000003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSMUSG00000000028,2506.125,1785.969,1188.757,1324.565,1818.4,1720.502,1429.305,2342.227,649.554,2712.591,2054.161
ENSMUSG00000000031,0.0,3.0,24.0,31.0,149.0,160.0,100.061,102.0,416.616,0.0,0.0
ENSMUSG00000000037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,17.0,42.012
ENSMUSG00000000049,1.0,3.0,12.0,13.0,3.0,3.0,5.0,0.0,8.0,2.0,2.0
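The aggregation step sums, for each sample, the counts of all transcripts that belong to the same gene. A minimal base-R sketch of the same idea with `rowsum()`: the transcript and gene IDs are taken from the annotation shown above, while the counts and the sample names `s1` and `s2` are made up.

```r
# Toy transcript-level counts (made-up numbers) for 4 transcripts, 2 samples
toy_tx <- data.frame(
  transcript_id = c("ENSMUST00000162897.2", "ENSMUST00000159265.2",
                    "ENSMUST00000070533.5", "ENSMUST00000082908.3"),
  s1 = c(10, 5, 0, 2),
  s2 = c(1, 0, 4, 3)
)

# transcript-to-gene map: three Xkr4 transcripts and one Gm26206 transcript
tx2gene <- c("ENSMUST00000162897.2" = "ENSMUSG00000051951.6",
             "ENSMUST00000159265.2" = "ENSMUSG00000051951.6",
             "ENSMUST00000070533.5" = "ENSMUSG00000051951.6",
             "ENSMUST00000082908.3" = "ENSMUSG00000064842.3")

# rowsum() adds up the rows that share the same group label
gene_sums <- rowsum(toy_tx[, c("s1", "s2")],
                    group = tx2gene[toy_tx$transcript_id])
gene_sums
```

The three Xkr4 transcripts collapse into a single row, so the result has one row per gene.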


We add gene names (symbols) for further analyses:

In [47]:
## Code cell 47 ##

genecounts_symbols <- genecounts %>% left_join(transcripts_info %>%
                                   mutate(gene_id = str_remove(gene_id, "\\.\\d{1,2}$")) %>%  # escaped dot: matches a literal "."
 select(gene_id, gene_name), by = 'gene_id') %>% select(-gene_id) %>% 
        select(gene_name, everything()) %>% distinct()

cat("Table dimensions:", dim(genecounts_symbols), "\n")

head(genecounts_symbols)

Table dimensions: 55970 12 


gene_name,dHet_B-ALL_686_rep1,dHet_B-ALL_686_rep2,dHet_B-ALL_713_rep1,dHet_B-ALL_713_rep2,dHet_B-ALL_760_rep1,dHet_B-ALL_760_rep2,dHet_FetalLiver_proB_rep1,dHet_FetalLiver_proB_rep2,dHet_FetalLiver_proB_rep3,wt_BoneMar_proB_rep1,wt_BoneMar_proB_rep2
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Gnai3,10106.681,7297.753,8223.932,7960.522,7963.566,7331.798,8147.421,7791.765,7611.429,9301.767,8276.171
Pbsn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cdc45,2506.125,1785.969,1188.757,1324.565,1818.4,1720.502,1429.305,2342.227,649.554,2712.591,2054.161
H19,0.0,3.0,24.0,31.0,149.0,160.0,100.061,102.0,416.616,0.0,0.0
Scml2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,17.0,42.012
Apoh,1.0,3.0,12.0,13.0,3.0,3.0,5.0,0.0,8.0,2.0,2.0
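The gene IDs are joined after stripping their version suffix with a regular expression. In such a pattern the dot should be escaped, because an unescaped `.` matches any character. A short base-R check, where the first ID comes from the annotation and the other two (a two-digit version and an ID without version) are hypothetical:

```r
ids <- c("ENSMUSG00000102693.2", "ENSMUSG00000051951.12", "ENSMUSG00000102693")

# escaped dot: only a literal "." followed by 1-2 digits is removed
escaped   <- sub("\\.\\d{1,2}$", "", ids)
# unescaped dot: "." matches any character, so an unversioned ID loses digits
unescaped <- sub(".\\d{1,2}$", "", ids)

data.frame(ids, escaped, unescaped)
```

Both patterns give the same result on versioned IDs, but only the escaped one leaves an unversioned ID intact.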


##### **b-gene aggregation on TPMs**

We do not recommend working further with gene-level TPMs, but they can be useful for a quick and dirty evaluation of gene expression levels.

In [48]:
## Code cell 48 ##

gtpm <- tpm  %>% left_join(transcripts_info %>% select(transcript_id, gene_id), by='transcript_id') %>% 
	select(transcript_id, gene_id, everything()) %>% select(-1) %>% 
	group_by(gene_id) %>% summarize_all(sum, na.rm = TRUE) 

gtpm <- gtpm %>% mutate(gene_id = str_replace(gene_id, "\\.\\d{1,2}$", ""))

cat("Table dimensions:", dim(gtpm), "\n")

head(gtpm)

Table dimensions: 56065 12 


gene_id,dHet_B-ALL_686_rep1,dHet_B-ALL_686_rep2,dHet_B-ALL_713_rep1,dHet_B-ALL_713_rep2,dHet_B-ALL_760_rep1,dHet_B-ALL_760_rep2,dHet_FetalLiver_proB_rep1,dHet_FetalLiver_proB_rep2,dHet_FetalLiver_proB_rep3,wt_BoneMar_proB_rep1,wt_BoneMar_proB_rep2
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSMUSG00000000001,94.90927,85.63001,107.549622,99.406256,89.016454,94.202083,129.173375,115.818792,129.110559,103.294256,110.482353
ENSMUSG00000000003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSMUSG00000000028,39.795792,35.414221,26.567223,28.400625,35.635202,38.068106,39.002803,59.254796,18.93121,52.404483,47.476351
ENSMUSG00000000031,0.0,0.084361,0.87498,0.561346,4.494676,4.208726,4.311468,3.427708,18.374148,0.0,0.0
ENSMUSG00000000037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028093,0.0,0.12396,0.474874
ENSMUSG00000000049,0.200706,0.700875,0.709392,1.361124,0.665852,0.720562,0.702987,0.0,2.401379,0.249507,0.491915


And we add gene names:

In [49]:
## Code cell 49 ##

gtpm_symbols <- gtpm %>% left_join(transcripts_info %>%
                                   mutate(gene_id = str_remove(gene_id, "\\.\\d{1,2}$")) %>%  # escaped dot: matches a literal "."
 select(gene_id, gene_name), by = 'gene_id') %>% select(-gene_id) %>% 
        select(gene_name, everything()) %>% distinct()

cat("Table dimensions:", dim(gtpm_symbols), "\n")

head(gtpm_symbols)

Table dimensions: 55970 12 


gene_name,dHet_B-ALL_686_rep1,dHet_B-ALL_686_rep2,dHet_B-ALL_713_rep1,dHet_B-ALL_713_rep2,dHet_B-ALL_760_rep1,dHet_B-ALL_760_rep2,dHet_FetalLiver_proB_rep1,dHet_FetalLiver_proB_rep2,dHet_FetalLiver_proB_rep3,wt_BoneMar_proB_rep1,wt_BoneMar_proB_rep2
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Gnai3,94.90927,85.63001,107.549622,99.406256,89.016454,94.202083,129.173375,115.818792,129.110559,103.294256,110.482353
Pbsn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cdc45,39.795792,35.414221,26.567223,28.400625,35.635202,38.068106,39.002803,59.254796,18.93121,52.404483,47.476351
H19,0.0,0.084361,0.87498,0.561346,4.494676,4.208726,4.311468,3.427708,18.374148,0.0,0.0
Scml2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028093,0.0,0.12396,0.474874
Apoh,0.200706,0.700875,0.709392,1.361124,0.665852,0.720562,0.702987,0.0,2.401379,0.249507,0.491915


In [50]:
## Code cell 50 ##

cat("End time is :", date(), "\n")

End time is : Sun Jun 23 18:51:44 2024 


---

## 2 - Filtering out non-informative genes

---

For many analysis methods, it is advisable to filter out as many genes as possible before the analysis to decrease the impact of multiple testing correction on false discovery rates.   

This is normally done by filtering out **genes with low numbers of reads** and thus likely to be uninformative from a biological point of view.  
(With DESeq2, the tool that we will use later for differential analysis, this filtering is not necessary beforehand, as DESeq2 applies independent filtering during the analysis.)   
In addition, filtering out genes that are very lowly expressed does reduce the size of the dataset, meaning that less memory is required and processing steps are carried out faster.

<div class="alert alert-block alert-info"> <b> What is a lowly expressed gene ? </b><br>
A general rule is to keep genes whose total read count across all samples is equal to or above the number of samples.</div>

For some biological questions, we may want to **keep only coding genes with or without long-non-coding RNAs**, thus also getting rid of small RNAs, microRNAs, pseudogenes, tRNAs, Ig genes, TCR genes, etc.

=> Here, we decide to keep all gene categories but to filter on the level of expression. We thus keep genes where the total number of reads across all 11 samples is greater than 10.


- For each gene, we compute the total count of reads in all samples and compare it to our threshold. We count the number of genes passing the test.

In [None]:
## Code cell 51 ##

# keeping outcome in a vector of 'logical' values (ie TRUE or FALSE, or NA)
keep <- rowSums(countdata[-1]) > 10

# summary of test outcome: number of genes in each class:
table(keep, useNA="always") 

This means that 32754 genes have a total count across samples of 10 or less. We keep the other 24432 genes.  
For information, when we performed the same analysis in 2023 with the M32 annotation, we eliminated 34083 genes with a total count across samples of 10 or less and we kept 22927 genes.
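The filtering logic can be checked on a toy table before applying it to the real data. In the sketch below all values are made up; `keep_fixed` uses the fixed threshold of 10 chosen here, and `keep_rule` applies the general rule of thumb (total count at least equal to the number of samples).

```r
# Toy count table (made-up values): first column is the gene ID
toy <- data.frame(gene_id = c("g1", "g2", "g3", "g4"),
                  s1 = c(0, 3, 120, 5),
                  s2 = c(0, 4, 98, 5),
                  s3 = c(1, 3, 110, 0))

totals <- rowSums(toy[-1])               # total reads per gene across samples
keep_fixed <- totals > 10                # fixed threshold: strictly more than 10
keep_rule  <- totals >= ncol(toy) - 1    # rule of thumb: total >= number of samples

table(keep_fixed)
toy[keep_fixed, ]
```

Note that genes `g2` and `g4`, with a total of exactly 10, fail the strict `> 10` test but pass the rule of thumb for 3 samples.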

- We extract those genes and replace `countdata` with this subset of genes.

In [None]:
## Code cell 52 ##


# subset genes where test was TRUE
countdata <- countdata[keep,]
rm(keep)

# check dimension of new count matrix
dim(countdata)

# 24432 12

<div class="alert alert-block alert-warning"> <b> Warning: </b><br>
If your dataframe after cell 52 does not have 24432 rows and 12 columns, you probably ran a command twice. In that case, rerun all cells from cell 17.</div>

---
## 3 - Quality assessment
---

Before moving on to the actual differential expression analysis, it is important to assess the quality of our data.   
Quality control is an important step in any data analysis. Since our purpose is to identify differentially expressed genes, we may consider excluding a sample if it is an obvious outlier that could reflect an issue in its upstream processing or preparation, or if it is inconsistent with the metadata (e.g. Y chromosome genes expressed in females).

In the ensuing steps, we will explore our data using boxplots, PCA, density plots and heatmaps.


### 3.1 - Initial visualization of the count distributions
---

Differential expression calculations with DESeq2 use raw read counts as input, but for visualization purposes we use transformed counts.

Why not raw counts? Two issues:

    - Raw counts range is very large: some highly expressed genes can have hundreds of thousands of reads
    - Variance increases with mean gene expression, which biases the assessment of relationships between samples.
    
Let's display the range of expression of our raw read counts:

In [None]:
## Code cell 53 ##

summary(countdata[,-1])

We can also see that a few outlier genes affect the visualization of the distributions:

In [None]:
## Code cell 54 ##

opar <- par(no.readonly=TRUE) # no.readonly=TRUE returns only the settable parameters, so restoring them later raises no warnings
par(mar = c(8,2,2,2)) # to increase margin at the bottom to display full sample names
boxplot(countdata[,-1], main = 'Raw read counts distribution across samples', xaxt = "n") # no display of xlabels
text(1:11, y = par("usr")[3] - 0.45,
     labels = names(countdata)[-1],
     srt = 30, adj = 1, xpd = NA, cex= 0.9) # add xlabels with a 30-degree rotation
suppressWarnings(par(opar))


We can see that the variance increases with the average level of expression by plotting standard deviation vs mean expression :

In [None]:
## Code cell 55 ##

# Raw counts mean expression vs Standard Deviation (SD)
plot(rowMeans(countdata[,-1]), apply(countdata[,2:12], 1, sd, na.rm = TRUE), 
     main = 'Raw read counts: sd vs mean', 
     xlim = c(0,100000), # to zoom
     ylim = c(0,50000), # to zoom
     xlab = "Mean of raw counts per gene",
     ylab = "SD of raw counts per gene"
    )


This problem is called heteroskedasticity: variances and means are not independent.
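The dependence between mean and variance can be reproduced on simulated counts. The sketch below is a simplification and does not use the course data: it draws Poisson counts, for which the standard deviation equals the square root of the mean, whereas real RNA-seq counts are usually even more variable than that.

```r
set.seed(1)
# Simulated counts (not the course data): 1000 Poisson draws per expression level
lambdas <- c(10, 100, 1000, 10000)
sim <- sapply(lambdas, function(l) rpois(1000, lambda = l))

# the standard deviation grows with the mean (for Poisson counts, sd = sqrt(mean))
sim_summary <- data.frame(mean = colMeans(sim), sd = apply(sim, 2, sd))
sim_summary
```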

### 3.2 -  Data transformation and visualization
---

In exploratory data analysis, we are going to look at the distance between samples. It is thus crucial to make the data homoscedastic.

To avoid the problems posed by raw counts, they can be transformed. **Several transformation methods exist to limit the dependence of variance on mean gene expression**, among which :

    - Simple log2 transformation
    - VST : variance stabilizing transformation
    - rlog : regularized log transformation

For the moment, we are going to use only the first one to perform this first exploratory analysis on non-normalised data.   
The two other types of transformation are available in the DESeq2 package, that we will use in the next jupyter notebook.  
 

#### **-> log2 transformation**

This is one of the most used transformations in transcriptomics, as it helps to normalize the data (in the sense of bringing the data distribution closer to the normal distribution), and it enables a better visualization of low counts.   
    
Because some genes are not expressed (detected) in some samples, their counts are 0. As log2(0) equals -Inf, we add an ***offset***, usually of 1, to every count value to create **pseudocounts**. The lowest value then is 1, or 0 on the log2 scale (log2(1) = 0).
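A quick illustration of the pseudocount on a made-up vector of counts:

```r
counts <- c(0, 1, 3, 1023)   # toy counts, including a zero

log2(counts)       # the zero count becomes -Inf
log2(counts + 1)   # with the offset of 1: 0 1 2 10
```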

In [None]:
## Code cell 56 ##

summary(log2(countdata[,2]+1)) # summary for first sample column 2

In [None]:
## Code cell 57 ##

summary(log2(countdata[,2:12]+1)) # summary for each sample

We will check the distribution of read pseudocounts using a boxplot and add some colour to see if there is any difference between sample groups.

In [None]:
## Code cell 58 ##

# make a colour vector
conditionColor <- match(samples$Condition, c("dHet", "dHetRag", "WT")) + 1
# '+1' to avoid color '1' i.e. black

# Check distributions of samples using boxplots
opar <- par(no.readonly=TRUE) # no.readonly=TRUE returns only the settable parameters, so restoring them later raises no warnings
par(mar = c(6,2,2,2))
boxplot(log2(countdata[,2:12]+1),
        xlab = "", xaxt = "n",
        ylab = "Log2(Counts)",
        las = 2,
        col = conditionColor,
        main = "Log2(Counts) distribution across samples")
text(1:11, y = par("usr")[3] - 0.45,
     labels = names(countdata)[-1],
     srt = 30, adj = 1, xpd = NA, cex= 0.9)


# Let's add a blue horizontal line that corresponds to the median
abline(h = median.default(as.matrix(log2(countdata[,2:12]+1))), col="blue")

suppressWarnings(par(opar))

From the boxplot, we see that overall the distributions of log2-transformed raw counts are not identical but still not very different from one sample to another.   
If a sample is really far above or below the blue horizontal line (overall median) we may need to investigate that sample further.

In [None]:
## Code cell 59 ##

# Log2 counts standard deviation (sd) vs mean expression

plot(rowMeans(log2(countdata[,2:12]+1)),
              matrixStats::rowSds(as.matrix(log2(countdata[,2:12]+1)), na.rm = TRUE), # na.rm belongs to rowSds(), not as.matrix()
     main = 'Log2 Counts: sd vs mean',
     xlab = "Mean of log2(raw counts) per gene",
     ylab = "SD of log2(raw counts) per gene"
    
    )

In contrast to raw counts, with log2 transformed counts, lowly expressed genes (mean expression around 5 or below) show higher variation than highly expressed genes (mean expression above 10-12).

### 3.3 - Principal Component Analysis (PCA)
---

A principal component analysis (PCA) is a method used to explore the variance structure of the data **by reducing its dimensions to a few principal components (PC) that explain the greatest variation in the data**.  It is an unsupervised analysis, meaning that we do not specify the grouping of the samples *(see lecture 12)*.  
If the experiment is well controlled and has worked properly, we should find that replicate samples cluster closely, whilst the greatest sources of variation in the data should be between treatments/sample groups.   
It is also an incredibly useful tool for checking for outliers and batch effects.

We cannot run the PCA directly on raw, non-normalised counts: the variables must be on a comparable scale. We therefore run the PCA here on the log2-transformed data. Another method could have been to center (subtract the mean) and scale (divide by the standard deviation) each variable. 
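As an illustration of centering and scaling, here is a minimal sketch on a made-up matrix with two variables on very different scales (`geneA` and `geneB` are hypothetical names). It is not the approach used in this notebook, which relies on the log2 transformation.

```r
# Toy matrix (made-up values): 4 samples in rows, 2 variables on different scales
m <- cbind(geneA = c(1, 2, 3, 4),
           geneB = c(100, 300, 200, 400))

z <- scale(m, center = TRUE, scale = TRUE)  # subtract column mean, divide by column sd
round(colMeans(z), 10)   # both columns now have mean 0
apply(z, 2, sd)          # and standard deviation 1

# prcomp() can apply the same standardisation itself
pca_scaled <- prcomp(m, center = TRUE, scale. = TRUE)
```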

- transposition of data : 

In a PCA where we want to visualize the grouping of samples, the variables are genes, and thus should be in columns. So in the code below, the `t()` function is used to transpose the dataframe: the result is a new dataframe with samples as rows and genes as columns.

In [None]:
## Code cell 60 ##

tlogcounts <- t(log2(countdata[,2:12]+1))
dim(tlogcounts)

- plot PCA:

To plot the PCA results, we will use here the `autoplot` function from the `ggfortify` package (Tang, Horikoshi, and Li 2016). `ggfortify` is built on top of `ggplot2` and is able to recognise common statistical objects such as PCA results or linear models and to automatically generate a summary plot of the results in an appropriate manner.

In [None]:
## Code cell 61 ##

# run PCA
PCAdata <- prcomp(tlogcounts)

# plot PCA
autoplot(PCAdata)

Without the names of the samples, or colours indicating their group, we cannot see if they cluster correctly.    
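The percentage of variance attached to each principal component can be computed directly from the `prcomp` result. A minimal sketch on simulated data (not the course data), with samples in rows as in `tlogcounts`; the sample and gene names are made up:

```r
set.seed(42)
# Toy data (simulated): 6 samples x 50 genes, the last 3 samples shifted upwards
toy_mat <- matrix(rnorm(6 * 50), nrow = 6,
                  dimnames = list(paste0("sample", 1:6), paste0("gene", 1:50)))
toy_mat[4:6, ] <- toy_mat[4:6, ] + 3

toy_pca <- prcomp(toy_mat)            # samples in rows, genes in columns
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)

round(100 * var_explained, 1)         # percentage of variance per PC
head(toy_pca$x[, 1:2])                # sample coordinates on PC1 and PC2
```

In this toy example the group shift dominates, so PC1 carries most of the variance and separates the first three samples from the last three.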

- add labels to PCA plot:

So we add colours and labels to the PCA plot.   
The package `ggrepel` allows us to add text to the plot, but ensures that points that are close together don’t have overlapping labels  (they repel each other).

In [None]:
## Code cell 62 ##

autoplot(PCAdata,
         data = samples, 
         colour = "Condition", 
         shape = "Tissue",
         size = 6) +
    geom_text_repel(aes(x = PC1, y = PC2, label = SampleName),
                        box.padding = 0.8)


We can see that the first Principal Component (explaining the largest source of variation) shows in this dataset variation between samples from different conditions (the effect of interest), while the second PC (explaining the second largest source of variation) displays here sample differences due to WT vs mutant genotypes.   
It seems that there is no batch effect, but let's verify that none appears in the next principal components.  

In [None]:
## Code cell 63 ##

autoplot(PCAdata,
         x = 2,    # PC2
         y = 3,    # PC3
         data = samples, 
         colour = "Condition", 
         shape = "Tissue",
         size = 6) +
    geom_text_repel(aes(x = PC2, y = PC3, label = SampleName),
                    box.padding = 0.8)


In [None]:
## Code cell 64 ##

autoplot(PCAdata,
         x = 3,    # PC3
         y = 4,    # PC4
         data = samples, 
         colour = "Condition", 
         shape = "Tissue",
         size = 6) +
    geom_text_repel(aes(x = PC3, y = PC4, label = SampleName),
                    box.padding = 0.8)

rm(PCAdata) # we remove the PCA data from the session for memory reasons

On the basis of these last two plots, we cannot see a clear effect of a given factor to the variation depicted in PC3 and PC4, except maybe the difference between mice.

*Note: We will perform a more detailed PCA after data normalisation in notebook Pipe_10.*

### 3.4 - Hierarchical clustering
---

The log2-transformed counts can also be used to cluster the samples based on dissimilarity indexes *(see lecture 13; here we compute Euclidean distances with `dist()` and use Ward's agglomeration method)*. More information can be found with `?hclust` (or in the Contextual Help panel on the right, that can be opened via the Help menu).

In [None]:
## Code cell 65 ##

clusters <- hclust(dist(as.matrix(tlogcounts)), method ="ward.D")
plot(clusters, labels = samples$SampleName)

rm(clusters, tlogcounts)    

We can see that our samples are grouped correctly, and that replicates from the same mouse cluster together, as expected.
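Whether replicates fall in the same cluster can also be checked programmatically with `cutree()`. A minimal sketch on made-up values (the sample names are hypothetical). It uses `method = "ward.D2"`, which according to `?hclust` is the variant that implements Ward's criterion when given unsquared distances; this choice is an assumption of the sketch, the cell above uses `"ward.D"`.

```r
# Toy data (made-up values): two pairs of replicates measured on 3 genes
toy <- rbind(mouseA_rep1 = c(1.0, 5.0, 9.0),
             mouseA_rep2 = c(1.2, 5.1, 8.8),
             mouseB_rep1 = c(6.0, 1.0, 2.0),
             mouseB_rep2 = c(6.3, 0.8, 2.1))

d  <- dist(toy)                        # Euclidean distance by default
hc <- hclust(d, method = "ward.D2")    # Ward's criterion on unsquared distances
groups <- cutree(hc, k = 2)            # cut the tree into 2 clusters
groups
```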

### 3.5 - Density plot
---

A Density Plot can be used to visualize the distribution of data over a continuous interval. In RNA-seq analysis, this could be used to detect the presence or absence of batch effects in the data. Batch effects may be introduced through different experimental platforms, laboratory conditions, different sources of samples, different technicians, etc, and may introduce spurious variability which is not due to the condition under study (cancerous state of B and pro-B cells).   
This [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4636836/) comprehensively discusses batch effects and how they can be corrected.

In [None]:
## Code cell 66 ##

affy::plotDensity(log2(countdata[,2:12]+1),
                  xlab = "log2(Counts + 1)", ylab = "Density", # axis labels as plain strings (not ggplot2's xlab()/ylab())
                  col = 1:11, lty = 1)
legend(x = 14, y = 0.13, legend = names(countdata)[2:12],
       col = 1:11, lty = 1, bty = "n")

The density plots overlap for all samples; none stands out as having a different profile.   
This confirms that no batch effect is visible in our data that would need to be corrected before going on with normalisation.   

To see examples of batch effects, you can have a look [there](https://evayiwenwang.github.io/Managing_batch_effects/detect.html). 

## 4 - Saving our results

We can save the main R objects created in this session in a single `.RData` file.   
This will help us to reload our dataframes without having to run the same commands.   

In [None]:
## Code cell 67 ##  

ls()
save(countdata, samples, conditionColor, file = paste0(pca1folder,"RawCounts_Samples.RData"))
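To check what an `.RData` file contains, it can be loaded into a separate environment. A minimal sketch with toy objects and a temporary file (all names here are made up and stand in for the real objects):

```r
# Toy objects standing in for the real ones
toy_counts  <- data.frame(gene_id = c("g1", "g2"), s1 = c(12, 0))
toy_samples <- data.frame(SampleName = "s1", Condition = "WT")

rdata_file <- file.path(tempdir(), "toy_objects.RData")
save(toy_counts, toy_samples, file = rdata_file)

# load() into a separate environment to inspect the content without
# overwriting objects of the current session; it returns the object names
check_env <- new.env()
restored <- load(rdata_file, envir = check_env)
restored
identical(check_env$toy_counts, toy_counts)
```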

---
___

Now we go on with the normalisation of read counts and differential expression analysis using `DESeq2`.  
  
**=> Step 9: DESeq2 Normalisation and Differential Expression analysis** 

The jupyter notebook used for the next session will be *Pipe_09-R-DESeq2-normalisation-DE.ipynb*    
Let's retrieve it in our directory, in order to have a private copy to work on:   

In [None]:
## Code cell 68 ##   

myfolder
file.copy("/shared/projects/2413_rnaseq_cea/pipeline/Pipe_09-R-DESeq2-normalisation-DE.ipynb", myfolder)




**Save executed notebook**

To end the session, save your executed notebook in your `run_notebooks` folder. **Adjust the name with yours** and reformat as code cell to run it.

In [None]:
## Code cell 69 ##   

# creation of the directory, recursive = TRUE is equivalent to the mkdir -p in Unix
dir.create(paste0(myfolder,"/run_notebooks"), recursive = TRUE)

runfolder <- paste0(myfolder,"/run_notebooks")
       
# file.copy(paste0(myfolder, "/Pipe_08-R_counts-exploratory-analysis-I.ipynb"), runfolder)
file.copy(paste0(myfolder, "/Pipe_08-R_counts-exploratory-analysis-I.ipynb"), paste0(runfolder, "/Pipe_08-R_counts-exploratory-analysis-I-run.ipynb"))

---

<div class="alert alert-block alert-success"><b>Success:</b> Well done! You now know how to perform an exploratory analysis of RNAseq expression data in R.<br>
Don't forget to save your notebook and export a copy as an <b>html</b> file as well <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to the server! 
</div>

## Useful commands
<div class="alert alert-block alert-info"> 
    
- <kbd>CTRL</kbd>+<kbd>S</kbd> : save notebook<br>    
- <kbd>CTRL</kbd>+<kbd>ENTER</kbd> : Run Cell<br>  
- <kbd>SHIFT</kbd>+<kbd>ENTER</kbd> : Run Cell and Select Next<br>   
- <kbd>ALT</kbd>+<kbd>ENTER</kbd> : Run Cell and Insert Below<br>   
- <kbd>ESC</kbd>+<kbd>y</kbd> : Change to *Code* Cell Type<br>  
- <kbd>ESC</kbd>+<kbd>m</kbd> : Change to *Markdown* Cell Type<br> 
- <kbd>ESC</kbd>+<kbd>r</kbd> : Change to *Raw* Cell Type<br>    
- <kbd>ESC</kbd>+<kbd>a</kbd> : Create Cell Above<br> 
- <kbd>ESC</kbd>+<kbd>b</kbd> : Create Cell Below<br> 

<em>  
To make nice html reports with markdown: <a href="https://dillinger.io/" title="dillinger.io">html visualization tool 1</a> or <a href="https://stackedit.io/app#" title="stackedit.io">html visualization tool 2</a>, <a href="https://www.tablesgenerator.com/markdown_tables" title="tablesgenerator.com">to draw nice tables</a>, and the <a href="https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd" title="Ultimate guide">Ultimate guide</a>. <br>
Further reading on JupyterLab notebooks: <a href="https://jupyterlab.readthedocs.io/en/latest/user/notebook.html" title="Jupyter Lab">Jupyter Lab documentation</a>.<br>   
</em>    
 
</div>

Claire Vandiedonck - 2021-2023   
Sandrine Caburet - 05/2023   
Updated: 23/06/2024 by @CVandiedonck