# RNAseq training CEA - June 2024

*Instructors: Sandrine Caburet and Claire Vandiedonck*

IFB session: 5 CPUs + 21 GB of RAM

# Part 07a: Introduction to R - part A :
## *Discovery of the basics of R "base" language*

    I. First steps into R
      0. What is R ?
      1. R as a calculator 
      2. Assigning data into R objects, using and reading them  
      3. Managing your session
      4. Managing objects in your R Session
      5. Saving your data, session, and history
      6. Classes and types of R objects
    II. Vectors
    III. Matrices
    IV. Dataframes
      1. creating a dataframe
      2. reading  a text file into a dataframe in R and vice versa
      3. extracting information from a dataframe
      4. adding variables to a dataframe
      5. subsetting a dataframe
      6. merging dataframes

---
## **Before going further**

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly: <kbd>Ctrl</kbd> + <kbd>S</kbd> or click on the 💾 icon.
</div>

<div class="alert alert-block alert-warning"><b>Warning:</b> You are strongly advised to run the cells in the indicated order. If you want to rerun cells above, you can just restart the kernel to start at 1 again. </div>

 
___

__*=> About this jupyter notebook*__

This is a Jupyter notebook in **R**, meaning that the commands you enter or run in `Code` cells are directly understood by the server in the R language. <br>You could run the same commands in a `Terminal` by starting R: type `R` and press `enter`. 

>_If you want to see this by yourself, you can open a terminal on the IFB server:_
>- _in the `File` menu in the top bar, select `New Launcher` or click on the `+` sign below_
>- _open either a R `Console` or a `Terminal`_
>  - if you choose the `Terminal`, type `module load r` and press enter. Then, you can start R by typing `R`. The bash `$` prompt will change to an R `>` prompt   
> _you'll be able to copy and paste the commands from the `Code` cells of the notebook in the "bottom cell" (for the console) or after the `>` sign (for the terminal)_
>- to quit R from the terminal, type `quit()` or `q()` and press enter. You can choose whether to save the data and history of your session by answering "y" or "n", or directly by typing `q("yes")` or `q("no")`. 
>
>
>_This is for your information only, and not needed for our RNAseq training. All the commands are already included in this notebook (and the following ones)._
<br>    

<em>loaded JupyterLab</em> : Version 3.5.0

---
---
## **I. First steps into R**
---

### **I.0 What is R ?**
---

R is available on this website: https://www.r-project.org

The R language is:
- open-source
- available for Windows, Mac and Unix
- widely used in academia, finance, pharma, social sciences...

R is a statistical programming language. This project started in 1993. We are currently at version 4.4.1 (2024/06/14). There is a new release twice a year.

R comes with a "core language" called `R base`, complemented by thousands of contributed packages. A package is a set of functions.

R can be used for:
1. data manipulation: import, format, edit, export
2. statistics
3. (very) advanced graphics

***Some useful links***
- Quick R: https://www.statmethods.net/index.html
- Emmanuel Paradis tutorial: [in French](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_fr.pdf) or [in English](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf)
- R cheatsheet: https://rstudio.com/resources/cheatsheets/
- R style guide: https://google.github.io/styleguide/Rguide.html


___

### **I-1. - R as a calculator**

---

**Some very simple examples**

You can directly use R to perform mathematical operations with the usual operators: `+`, `-`, `*` to multiply, `^` to raise to the power, `/` to divide, `%%` to get the modulo.

In [None]:
## Code cell n°1 ##

2+2

In [None]:
## Code cell n°2 ##

2-3

In [None]:
## Code cell n°3 ##

6/3

In [None]:
## Code cell n°4 ##

10/3

In [None]:
## Code cell n°5 ##

10%%3
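
As a complement to the cells above, here is a minimal sketch that combines these operators. The integer-division operator `%/%` and the variable names `power`, `quotient`, `remainder` and `rebuilt` are not part of the original notebook; they are added here only for illustration.

```r
# Raise 2 to the power 3
power <- 2^3
# Integer quotient and remainder of 10 divided by 3
quotient <- 10 %/% 3
remainder <- 10 %% 3
# The quotient and the remainder rebuild the original number
rebuilt <- quotient * 3 + remainder
print(c(power, quotient, remainder, rebuilt))
```

The modulo is handy, for instance, to test whether a number is even (`n %% 2 == 0`).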

You can use built-in functions like `round()`, `log()`, `mean()`...


In [None]:
## Code cell n°6 ##

mean(c(1,2)) # we will see further down in this notebook that different values must first be concatenated with c()
exp(-2)

You can nest functions within each other, in the following example, `exp()` is nested within `round()`. This means that the function `exp()` is going to be executed first and its result will be passed as an argument for the function `round()`.

In [None]:
## Code cell n°7 ##

round(exp(-2), 2)

For some functions, you need to enter several arguments. In the example below, we add the `base` argument for the `log()` function.

In [None]:
## Code cell n°8 ##

log(100, base = 10) # we want to get the log of 100 in base 10 
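
The sketch below illustrates, under the assumption that you only need base R, three equivalent ways of getting a base-10 logarithm, and what happens when no base is given. The variable names are hypothetical and chosen only for this example.

```r
# Named argument: explicit and independent of the argument order
by_name <- log(100, base = 10)
# Positional argument: the second argument of log() is the base
by_position <- log(100, 10)
# log10() is a built-in shortcut for base 10
shortcut <- log10(100)
# Without a base, log() returns the natural logarithm
natural <- log(exp(1))
print(c(by_name, by_position, shortcut, natural))
```

Naming the argument is usually the safer choice, because the code stays readable even when a function has many parameters.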

***Getting help on functions:***

To know which arguments to use, it is recommended to always look at the help page of the function. To do so, enter the name of the function after `?`, or use `help()` with the name of the function between the parentheses. A help page will be displayed with different sections:

- description: what is the purpose of the function?
- usage: how is it used?
- arguments: which parameters are used by the function? Default values may be specified.
- details: technical description of the function
- value: type of the output returned by the function
- see also: similar functions in R
- source/references: not always present
- example: concrete examples -> the best way to learn how it works!

In [None]:
## Code cell n°9 ##

help(round)

In [None]:
## Code cell n°10 ##

?exp

<div class="alert alert-block alert-info"><b>Tip:</b> If you want to mask the help, go to the cell and click on its left. </div>

<div class="alert alert-block alert-info"><b>Remark:</b><br> In the Jupyter Lab interface, you can also access the help page of a function using the <b>Contextual Help</b> panel. You can open this panel by selecting <b>Show Contextual Help</b> in the Help menu at the top of the Jupyter Lab page. Clicking on the name of a function will display its help page in this panel. This can be helpful to keep the help visible while navigating elsewhere in the notebook, or to avoid a lengthy output within the notebook.</div>


---

### **I-2 - Assigning data into R objects, using and reading them**
---

We can store values in R objects/variables to reuse them later in another command.
To do so, use `<-` made with `<` and `-`. *An alternative is to use `=`, but for code clarity it is not recommended.*

Let's assign for example `2` to the variable `x`:

In [None]:
## Code cell n°11 ##

x <- 2

To know what is in `x` just enter `x`:

In [None]:
## Code cell n°12 ##

x

We can do operations on x:

In [None]:
## Code cell n°13 ##

x + x

You can then assign to the variable `y` the result of an operation with `x`.

In [None]:
## Code cell n°14 ##

y <- x + 3

To get the result of y, enter it in the next command:

In [None]:
## Code cell n°15 ##

y

<div class="alert alert-block alert-danger"><b>Caution:</b> 
If you assign a new value in x, y will not change because the result of the operation x+3 was stored in y, not the operation "x + 3" itself.
</div>

So you would have to rerun the command assigning `x+3` in y to change the value of y.

In [None]:
## Code cell n°16 ##

y <- x + 3
y
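
The behaviour described in the caution box can be reproduced step by step. This is a small sketch using hypothetical variables `a_val`, `b_val` and `b_before`, so that the `x` and `y` of the notebook are left untouched.

```r
a_val <- 2
b_val <- a_val + 3    # b_val stores the result 5, not the formula
a_val <- 10           # changing a_val afterwards...
b_before <- b_val     # ...leaves b_val unchanged (still 5)
b_val <- a_val + 3    # rerunning the assignment updates b_val
print(c(b_before, b_val))
```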

In addition to numeric values, we can store other kinds of data in an R object. For example, we will put a string of characters in the variable "s". Strings of characters have to be put between "quotes".

In [None]:
## Code cell n°17 ##

s <- "this is a string of characters"
s

Of note, you can always check the type of an R object using `class()`.

In [None]:
## Code cell n°18 ##

class(x)
class(s)

It is important that numeric values are properly encoded as numeric in R and not as strings of characters. You cannot perform operations on strings of characters:

In [None]:
## Code cell n°19 ##

"1"
class("1")
class(1)

If you try to add `"1"` and `3`, an error message is returned since we are trying to perform an impossible operation:

In [None]:
## Code cell n°20 ##

try("1" + 3) # try() is added so that the error does not stop the notebook if you run all the cells

If you are using numeric variables, the operation can be done:

In [None]:
## Code cell n°21 ##

1 + 3

___

### **I-3 - Managing your session**
---

When working with R, it is always a good practice to document the R version you are using and the packages that are loaded. The function is `sessionInfo()`.

In [None]:
## Code cell n°22 ##

sessionInfo()

As you can see, the version 4.2.3 is the one installed on the server. By default, some "base" packages such as `stats` are loaded. We will see in the next notebook that we can load other packages.

The command `getwd()` gives you the path of your working directory. It is similar to `pwd` in bash.

In [None]:
## Code cell n°23 ##

getwd()

<div class="alert alert-block alert-warning"><b>The result should be like this:</b>`/shared/ifbstor1/projects/2312_rnaseq_cea/mylogin` with your "login".</div>

If needed, you can change the working directory to another directory with `setwd()`.

Here is an example where we will save your current working directory in a variable as a string of characters, then we set up a new one, and we get back to the initial one.

In [None]:
## Code cell n°24 ##

initial_path <- getwd()
initial_path #current working directory

Then we change it:

In [None]:
## Code cell n°25 ##

setwd(paste0(initial_path, "/Results/")) #change current working directory (paste0() takes no sep argument)
getwd() #change is visible

And we put back our initial directory.

In [None]:
## Code cell n°26 ##

setwd(initial_path) # reset to initial working directory by using the content of the variable
rm(initial_path)
getwd()
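
The same round trip can be sketched in a way that does not depend on an existing `Results` folder. This is only an illustration: the folder name `Results_demo` and the variable names are hypothetical, and the folder is created inside R's temporary directory. `file.path()` builds the path with the right separator, and `dir.exists()` checks the folder before moving into it.

```r
start_dir <- getwd()
# Hypothetical target folder, created inside R's temporary directory
target_dir <- file.path(tempdir(), "Results_demo")
if (!dir.exists(target_dir)) {
  dir.create(target_dir)
}
setwd(target_dir)
moved_to <- basename(getwd())    # name of the folder we are now in
setwd(start_dir)                 # always go back to where we started
back_home <- identical(getwd(), start_dir)
print(moved_to)
print(back_home)
```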

___


### **I-4 - Managing objects in your R Session and working directory**
___

The objects `x`, `y` and `s` you have created above are only present in your R session; they are not written in your working directory on the computer -> they are not present in the left-hand panel of Jupyter Lab.

So, to know which objects are currently available in your R session, you can use the function `ls()`, which has the same name as the Unix/bash command that lists files. The only difference is that in R you add parentheses to call functions.

In [None]:
## Code cell n°27 ##

ls()

Similarly, you can get rid of an object with the function `rm()`.

In [None]:
## Code cell n°28 ##

rm(y) # the object y is removed
ls()

Conversely, you can also look at the data on your computer from R with the function `dir()` or `list.files()`. With the second function, you can add an argument to specify a pattern of interest.

In [None]:
## Code cell n°29 ##

dir()

In [None]:
## Code cell n°30 ##

list.files(pattern = "\\.ipynb$") # the pattern is a regular expression: "\\." matches a literal dot, "$" the end of the name

___

### **I.5 - Saving your data, session, and history**
___


Before quitting R, you will probably want to save objects and other session information on your computer to be able to find them again next time you work on your project.    
By default, all the data and files you save will be saved in your ***working directory***.

#### **a - Saving specific data *(or functions)***

The function `save()` is used to save a specific object in your computer or the server. You will have to give a name to the file on your computer or server.        
Usually, we save them with the extension `.RData`.

In [None]:
## Code cell n°31 ##

save(x,file = "x.RData")

With the above command, you should have created the file `x.RData` in your working directory. Check it is present on the left-hand panel of Jupyter Lab.<br>
Now, if you remove `x` from your R session, you can load it back again with the `load()` function.

In [None]:
## Code cell n°32 ##

rm(x)
ls()

In [None]:
## Code cell n°33 ##

load("x.RData")
ls()
x    # x is again accessible

You can also delete the file from the working directory with the function `file.remove()`.

In [None]:
## Code cell n°34 ##

file.remove("x.RData") #remove file: returns TRUE on successful removal

When saving a single R object, you can also use the function `saveRDS()`. It writes a single R object to a specified file (in `.rds` file format). The object can be restored back using the function `readRDS()`. See, for example: http://www.sthda.com/english/wiki/saving-data-into-r-data-format-rds-and-rdata
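
Here is a minimal sketch of the `saveRDS()` / `readRDS()` pair. The object `counts` and the temporary file are hypothetical and used only for this example. Unlike `load()`, `readRDS()` returns the object, so you decide under which name it is restored.

```r
# Hypothetical object and temporary file, used only for this sketch
counts <- c(geneA = 10, geneB = 25)
rds_file <- tempfile(fileext = ".rds")
saveRDS(counts, file = rds_file)
# readRDS() returns the object: assign it to the name of your choice
restored <- readRDS(rds_file)
print(identical(counts, restored))
file.remove(rds_file)   # clean up the temporary file
```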



Instead of saving a single object, you can save several by listing them all as separate arguments in the `save()` function *(the function `saveRDS()` does not work for several objects)*.

In [None]:
## Code cell n°35 ##

save(x, s, file = "xands.RData")

In [None]:
## Code cell n°36 ##

file.remove("xands.RData")  # to clean the working directory

#### **b - Saving all variables *(and functions)* at once**

When you want to save all objects, it is even more efficient to use the function `save.image()`

In [None]:
## Code cell n°37 ##

ls()
save.image(file = "AllMyData.RData")

And similarly you can load them all back after removing all the objects in the session or starting a new one.

In [None]:
## Code cell n°38 ##

rm(list=ls()) # this command removes all the objects on the R session
ls() #all variables have been removed

In [None]:
## Code cell n°39 ##

load("AllMyData.RData")
ls() #all variables are accessible again
file.remove("AllMyData.RData")
ls()

#### **c- Save "history"** = all past commands

<div class="alert alert-block alert-warning"><b> Do not run as a Code cell</b>. It does not work in R notebooks, where no history is saved because we are running independent cells!  <br>  
    The command below would be the one to run in an R shell (Terminal > R) or in RStudio (on the IFB, you can access RStudio via a new launcher).</div>
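
A minimal sketch of what such a command could look like, assuming the base R function `savehistory()` is the one intended here. The file name `MyHistory.Rhistory` is hypothetical. The call is guarded by `interactive()` so that nothing is attempted outside an interactive R shell or RStudio session.

```r
# Hypothetical file name for the saved history
history_file <- "MyHistory.Rhistory"
# Only attempt this in an interactive R session (R shell or RStudio)
if (interactive()) {
  savehistory(file = history_file)   # write the past commands to the file
  # loadhistory(file = history_file) would read them back in a later session
}
```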

___

### **I.6 - Classes and types of R objects**
___


#### **a - Classes of R objects**

The main types of variables are:

- numeric/integer
- character
- logical (FALSE/TRUE/NA)
- factors

In [None]:
## Code cell n°40 ##

x <- c(3, 7, 1, 2) # we define a variable x with 4 numeric values concatenated
x

To have a more classical R display than in a notebook, you can add `print()`.

In [None]:
## Code cell n°41 ##

print(x) 

The variable x contains 4 numeric values. We can check it is numeric with the function `is.numeric()`.

In [None]:
## Code cell n°42 ##

is.numeric(x)

It returns the logical value `TRUE`.

You can also perform tests that will return logical values. Below we test whether the values in x are below 2.

In [None]:
## Code cell n°43 ##

x < 2 # we test whether the 4 values are < 2

Only the third value of x is < 2. Similarly, we can test which values of x are equal to 2.

In [None]:
## Code cell n°44 ##

x == 2

In R, the function `class()` returns the class of the object. The functions `is.logical()`, `is.numeric()`, `is.character()`, ... test whether the values are of this type.    
If needed, you can convert from one type to another with `as.numeric()`, `as.logical()`, ...

In [None]:
## Code cell n°45 ##

class(x)
class(s)
is.character(s)
is.numeric(s)
print(as.numeric(x < 2))
is.numeric("1")
is.numeric(as.numeric("1"))
is.numeric(c(1, "1"))   # c(1, "1") is a concatenation of a numeric value and a character value
class(c(1, "1"))

***Coercion rules:*** When elements of different types are concatenated, R coerces them according to the following hierarchy: `logical < integer < numeric < complex < character < list`

- if character strings are present, everything will be coerced to a character string.
- otherwise logical values are coerced to numbers: TRUE is converted to 1, FALSE to 0
- values are converted to the simplest type required to represent all information
- object attributes (sort of metadata of objects/variables like their names) are dropped
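
The rules above can be checked on small examples. This is a sketch with hypothetical variable names; `suppressWarnings()` is used only to keep the output clean when a string cannot be converted to a number.

```r
logical_and_number <- c(TRUE, FALSE, 3)   # logicals become 1 and 0
number_and_string <- c(1, "a")            # the number becomes the string "1"
print(class(logical_and_number))
print(class(number_and_string))
# Because TRUE counts as 1, sum() counts the TRUE values
n_true <- sum(c(TRUE, FALSE, TRUE))
# A string that is not a number becomes NA when converted
not_a_number <- suppressWarnings(as.numeric("abc"))
print(logical_and_number)
print(number_and_string)
print(n_true)
print(not_a_number)
```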

#### **b. Main data structures in R**

There are 4 main data structures in R. The heterogeneous ones accept several classes inside.

|   object  | Can it be heterogeneous? |
|:---------:|:------------------------:|
|   vector  |            no            |
|   matrix  |            no            |
| dataframe |            yes           |
|    list   |            yes           |
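
Lists are not detailed further in this notebook, so here is a brief sketch of a heterogeneous list. The object `sample_info` and its content are hypothetical and only meant to show that each element keeps its own class.

```r
# A hypothetical sample record mixing three classes in one list
sample_info <- list(id = "S1", counts = c(12, 0, 7), treated = TRUE)
str(sample_info)
# Elements are reached with $ or with [[ ]]
print(sample_info$counts)
print(sample_info[["treated"]])
# Each element keeps its own class
element_classes <- sapply(sample_info, class)
print(element_classes)
```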


---
___
## **II. Vectors**
---

- They are **the most elementary R objects**. They have one dimension. Some functions to create them are `c()`, `seq()`, `:`, `rep()`, `append()`...


In [None]:
## Code cell n°46 ##

a <- c()
a

In [None]:
## Code cell n°47 ##

weight <- c(60, 72, 57, 90, 95, 72)
weight

<div class="alert alert-block alert-info"><b>Remark:</b><br> In a jupyter notebook, by default each item of a vector can be displayed on a different numbered row. Should you wish to display a vector in a more classical way, like in the R console, with the index of the first value of each line indicated between [], you should use the function <b>print()</b>. </div>

In [None]:
## Code cell n°48 ##

print(weight)

In [None]:
## Code cell n°49 ##

4:10
print(4:10)

In [None]:
## Code cell n°50 ##

print(seq(from = 4, to = 10))
print(seq(4, 10))

In the above cell 50, the two commands are equivalent. In the first one, we name each argument. In the second one, we rely on the default order of the arguments, as you can see in the help.

<div class="alert alert-block alert-info"><b>Tip:</b><br> In R, you may omit the names of the arguments, provided you give them in the correct order. </div>

In [None]:
## Code cell n°51 ##
? seq

In [None]:
## Code cell n°52 ##

print(seq(from = 2, to = 10, by = 2))

In [None]:
## Code cell n°53 ##

print(rep(x = 4, times = 2))

In [None]:
## Code cell n°54 ##

print(rep(seq(4, 10, 2)))
print(c(rep(1, 4),rep(2, 4)))
print(c(5, s))
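
To complement the cells above, the following sketch contrasts the `times` and `each` arguments of `rep()` and shows `append()`, which is listed among the vector-building functions. The variable names are hypothetical.

```r
repeated_whole <- rep(c(1, 2), times = 3)   # the whole vector, 3 times
repeated_each <- rep(c(1, 2), each = 3)     # each value, 3 times in a row
# append() adds values at the end, or after a given position
extended <- append(c(10, 20, 30), 99)
inserted <- append(c(10, 20, 30), 99, after = 1)
print(repeated_whole)
print(repeated_each)
print(extended)
print(inserted)
```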

You can check the class of a vector but also get some information on its length with `length()` and on its structure with `str()`.

In [None]:
## Code cell n°55 ##

class(c(5, s))
length(1:10)
length(weight)
str(weight)

- You can perform **operations** directly on vectors:

In [None]:
## Code cell n°56 ##

size <- c(1.75, 1.8, 1.65, 1.9, 1.74, 1.69)
print(size^2)
print(bmi <- weight/size^2 )
print(bmi)

- You can **order** them or **compute summary statistics** (central tendency and dispersion):

In [None]:
## Code cell n°57 ##

print(sort(size))
mean(size)
sd(size)
median(size)
min(size)
max(size)
print(range(size))

summary(size)

- You can **extract some values** from a vector by giving the index of the values you want inside square brackets `[]`:

In [None]:
## Code cell n°58 ##

print(size)
size[1]
size[2]
size[6]
print(size[c(2,6)])
print(size[c(6,2)])
min(size[c(6,2)])

- Finally you can **add a name to the different values**. Names on vector values are attributes of the vector. Here the function `names()` returns a vector of the names of vector `size`. 

In [None]:
## Code cell n°59 ##

names(size)
names(size) <- c("Fabien", "Pierre", "Sandrine", "Claire", "Bruno", "Delphine")
size
str(size)
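
Besides positive positions, square brackets also accept a logical test, a negative index, or a name. The sketch below uses a hypothetical named vector `heights` so that it can be run on its own.

```r
# Hypothetical heights in metres, with names
heights <- c(Ana = 1.75, Ben = 1.80, Cleo = 1.65)
tall <- heights[heights > 1.7]    # keep the values passing a test
without_first <- heights[-1]      # a negative index drops a position
by_name <- heights["Cleo"]        # names can be used instead of positions
print(tall)
print(without_first)
print(by_name)
```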

___
---

## **III - Matrices**
---

Matrices properties:
- **2-dimension objects** (rows x columns)
- contain only one type of variable (e.g. numeric) = **homogeneous** 

The function to create a matrix is `matrix()`

In [None]:
## Code cell n°60 ##

myData <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3)
myData
class(myData)

Thus, by default, a matrix is filled by columns, but you can change this behaviour and fill it by rows.

In [None]:
## Code cell n°61 ##

myData <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE)
myData
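
The effect of `byrow` is easier to see with consecutive values. In this sketch the object names are hypothetical. It also checks that filling by rows gives the same result as filling a 3 x 2 matrix by columns and transposing it with `t()`.

```r
values <- 1:6
by_column <- matrix(values, nrow = 2, ncol = 3)                # default filling
by_row <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)     # row-wise filling
print(by_column)
print(by_row)
# Transposing the 3 x 2 column-filled matrix gives the row-filled 2 x 3 matrix
same_as_transpose <- all(by_row == t(matrix(values, nrow = 3, ncol = 2)))
print(same_as_transpose)
```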

- you can check the dimensions with `dim()` or `str()`, `nrow()` or `ncol()`

In [None]:
## Code cell n°62 ##

print(dim(myData))
str(myData)
nrow(myData)
ncol(myData)

Printing the matrix shows you `[i,j]` coordinates, where *i* is the index of the row and *j* the index of the column.

In [None]:
## Code cell n°63 ##

print(myData)

- values can be extracted or sliced with the `[]`

In [None]:
## Code cell n°64 ##

myData[1, 2] # returns the value of the 1st row and 2nd column

In [None]:
## Code cell n°65 ##

myData[2, 1] # returns the value of the 2nd row and 1st column

In [None]:
## Code cell n°66 ##

print(myData[, 1]) # returns the values of the vector corresponding to the 1st column

In [None]:
## Code cell n°67 ##

print(myData[2,])  # returns the values of the vector corresponding to the 2nd row

In [None]:
## Code cell n°68 ##

myData[, 2:3] # subsets the initial matrix returning a sub-matrix
             # with all rows of the 2nd and 3rd columns from the initial matrix
             # the generated matrix has 2 rows and 2 columns

In [None]:
## Code cell n°69 ##

print(dim(myData[, 2:3])) # the generated matrix has 2 rows and 2 columns

In [None]:
## Code cell n°70 ##

myData[, 1]        # returns the values of the vector corresponding to the 1st column
class(myData[, 1]) # we extract a vector containing numbers -> thus the class is numeric and no longer matrix
length(myData[1,])
length(myData[, 1])
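
If you need the extracted column to remain a matrix, base R offers the `drop = FALSE` argument of `[`. This argument is not used elsewhere in this notebook; the sketch below, with a hypothetical matrix `m`, only illustrates the difference.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
as_vector <- m[, 1]                  # the dimensions are dropped
as_matrix <- m[, 1, drop = FALSE]    # stays a matrix with 2 rows and 1 column
print(is.matrix(as_vector))
print(is.matrix(as_matrix))
print(dim(as_matrix))
```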

- Vectors can be combined to generate a matrix with `rbind()` or `cbind()`

In [None]:
## Code cell n°71 ##

myData2 <- cbind(weight, size, bmi) # joining the vectors as columns
myData2
myData3 <- rbind(weight, size, bmi) # joining the vectors as rows
myData3

- of course, operations can be applied to the values in the matrix

In [None]:
## Code cell n°72 ##

myData2*2
summary(myData2)
mean(myData2)
mean(myData2[, 1])
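
Note that `mean()` on a whole matrix averages all its cells. To get one value per column or per row, base R provides `colMeans()` and `apply()`. The sketch below uses a hypothetical matrix `measures` to stay self-contained.

```r
# Hypothetical measurements: 3 individuals (rows) x 2 variables (columns)
measures <- cbind(weight = c(60, 72, 90), size = c(1.75, 1.80, 1.90))
overall_mean <- mean(measures)        # one value over all the cells
column_means <- colMeans(measures)    # one mean per column
row_max <- apply(measures, 1, max)    # max() applied to each row (MARGIN = 1)
print(overall_mean)
print(column_means)
print(row_max)
```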

---
---
## **IV. Dataframes**
---

Dataframes properties:
- **2-dimension objects** (rows x columns)
- can be **heterogeneous** : can contain several types of variables in different columns



<div class="alert alert-block alert-danger"><b> Caution: </b> Within a column, all values must have the same type (each column is homogeneous).</div>


### **IV.1. - Creating a dataframe**
---

- They are generated with the function `data.frame()`:

This can be done **using existing vectors of same length** like the previously generated "weight", "size" and "bmi" .

In [None]:
## Code cell n°73 ##

myDataf <- data.frame(weight, size, bmi)
myDataf

The obtained dataframe looks pretty much like our previous matrix myData2.

In [None]:
## Code cell n°74 ##

class(myDataf)

In [None]:
## Code cell n°75 ##

str(myDataf)

In [None]:
## Code cell n°76 ##

print(dim(myDataf))

>*Note that if the vectors used to generate the dataframe are character strings, it is advised in versions of R < 4 to add the argument `stringsAsFactors=FALSE`*

If the vectors that will generate the dataframe do not exist yet in the session, but you would like to initiate a dataframe to fill it during your analysis, you could imagine creating an empty dataframe. However, this is of little use, as a dataframe with 0 columns and 0 rows cannot easily be filled afterwards.

In [None]:
## Code cell n°77 ##

d <- data.frame()
d
dim(d)

- Dataframes can be generated by **converting a matrix into a dataframe** with `as.data.frame()`

Let's try with the object myData2 we previously created. It is a matrix:

In [None]:
## Code cell n°78 ##

class(myData2)
class(as.data.frame(myData2))
str(as.data.frame(myData2))

You may use `as.data.frame()` on a matrix generated by binding columns of vectors with  `cbind()`:

In [None]:
## Code cell n°79 ##

d2 <- as.data.frame(cbind(1:2, 10:11))
d2
str(d2)

Similarly, we can convert an empty matrix (created with `matrix()`) into a dataframe, as in the example below with a matrix of two rows and three columns currently filled with missing values:

In [None]:
## Code cell n°80 ##

d <- as.data.frame(matrix(NA, 2, 3))
d
dim(d)
str(d)

### **IV.2. - Reading a text file into R and vice versa**
---

#### **a. Reading a text file into R**

The function `read.table()` reads a delimited text file (tab-delimited, csv or other column separator) into R.

<div class="alert alert-block alert-warning">
    <span style="color:red"><b>read.table() returns a dataframe</b></span></div>

Let's have a look at the file `Temperatures.txt`. It is located in `/shared/projects/2413_rnaseq_cea/alldata/Example_Data`. You should also have a copy of this text file in your personal folder (done at the end of the previous notebook), visible in the panel on the left. Just double click on it to open it.


You will see it is a ***tab-delimited*** text file.

Now let's import it in R by specifying the correct separator with the `read.table()` function:

In [None]:
## Code cell n°81 ##

path_to_file <- "/shared/projects/2413_rnaseq_cea/alldata/Example_Data/Temperatures.txt" #in case you did not copy it on your personal directory in Pipe06
temperatures <- read.table(path_to_file,
                           sep = "\t",
                           header = TRUE,
                           stringsAsFactors = FALSE)
temperatures
str(temperatures)

In the above command, I used the argument `stringsAsFactors = FALSE` to avoid a factorisation of the columns containing character strings (here the "Month" column).
In R versions < 4, the default value for this argument is `TRUE`. Let's see what would have happened:

In [None]:
## Code cell n°82 ##

temperatures.2 <- read.table(path_to_file,
                             sep = "\t",
                             header = TRUE,
                             stringsAsFactors = TRUE)
str(temperatures.2)

Here the "Month" column has been factorised. How?

In [None]:
## Code cell n°83 ##

levels(temperatures.2$Month)

In alphabetical order, which is not what you want!
Thus always use `stringsAsFactors = FALSE`.
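
When you do want a factor, you can set the order of its levels yourself instead of relying on the alphabetical default. A minimal sketch, assuming month names written in English as in R's built-in constant `month.name`; the vector `months_seen` is hypothetical.

```r
# Hypothetical month values, deliberately out of calendar order
months_seen <- c("March", "January", "February")
alphabetical <- factor(months_seen)                     # default: sorted levels
calendar <- factor(months_seen, levels = month.name)    # explicit calendar order
print(levels(alphabetical))
print(head(levels(calendar), 3))
# Sorting follows the order of the levels, not the alphabet
sorted_months <- as.character(sort(calendar))
print(sorted_months)
```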

<div class="alert alert-block alert-info"><b>Tutorial on factors:</b> to better understand the behaviour of factors, you may follow a tutorial available on the google drive.</div>

#### **b. Writing a dataframe on your computer**

Conversely, save a dataframe into your working directory with `write.table()`:

In [None]:
## Code cell n°84 ##

# save a dataframe as a text file in the working directory
write.table(myDataf,
            file = "bmi_data.txt",
            sep = "\t",
            quote = FALSE,
            col.names = TRUE)

Have a look at it by double clicking on it in your working directory.

and check you can import it back in R again:

In [None]:
## Code cell n°85 ##

rm(myDataf)
myDataf <- read.table("bmi_data.txt",
                      sep = "\t",
                      header = TRUE,
                      stringsAsFactors = FALSE)
head(myDataf) #myDataf is again accessible
file.remove("bmi_data.txt") #to clean the working directory
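
The round trip above also preserves the row names. The sketch below reproduces it on a hypothetical dataframe `people` written to a temporary file, to show why: `write.table()` writes the row names as a first column without a header, and `read.table()` then uses that column as row names.

```r
# Hypothetical dataframe with row names
people <- data.frame(weight = c(60.5, 72.2), size = c(1.75, 1.80),
                     row.names = c("Ana", "Ben"))
tmp_file <- tempfile(fileext = ".txt")
write.table(people, file = tmp_file, sep = "\t", quote = FALSE)
# The header line has one field fewer than the data lines,
# so read.table() takes the first column as row names
reloaded <- read.table(tmp_file, sep = "\t", header = TRUE)
print(reloaded)
print(identical(row.names(reloaded), row.names(people)))
file.remove(tmp_file)   # clean up the temporary file
```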

---
### **IV.3. - Extracting information from a dataframe**
---

To better follow, let's first display again myDataf:

In [None]:
## Code cell n°86 ##

print(myDataf)

- Getting **row and column names** of a dataframe:

Use the functions dedicated to dataframes which are `row.names()` and `names()`:

In [None]:
## Code cell n°87 ##

row.names(myDataf)
names(myDataf)

<div class="alert alert-block alert-danger"><b>Caution:</b>
    each row name must be unique in a dataframe!
</div>

- **Getting a variable from a dataframe:**



Variables are the columns of a dataframe. You can extract the vector corresponding to a column from a dataframe with its `index`, with the `name` of the column inside `""`, or using the symbol `$`:

In [None]:
## Code cell n°88 ##

print(myDataf[, 2])
print(myDataf[, "size"])
print(myDataf$size)

- **Extracting rows from a dataframe:**

You have two options to do so:

**1.** either by specifying the index of the row

In [None]:
## Code cell n°89 ##

myDataf[2,]

**2.** or by giving its name within the `""` inside the square brackets:

In [None]:
## Code cell n°90 ##

myDataf["Pierre",]

In [None]:
## Code cell n°91 ##

class(myDataf["Pierre",])

In both cases, you may notice that you obtain a dataframe and not a vector, even if you extract only one row.
If you wish to get the vector corresponding to a row, you have to convert it with the `unlist()` function.

>*Of note, a dataframe is a special kind of list: its elements (the columns) all have the same number of rows, and its row names are unique.*

In [None]:
## Code cell n°92 ##

temp <- unlist(myDataf["Pierre",])
print(temp)
class(temp)
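
Be aware that `unlist()` follows the coercion rules seen earlier. The sketch below uses a hypothetical dataframe `people` mixing numeric and character columns: the extracted row is still a dataframe, but once unlisted everything becomes a character string.

```r
# Hypothetical dataframe mixing a numeric and a character column
people <- data.frame(weight = c(60, 72), sex = c("Woman", "Man"),
                     row.names = c("Ana", "Ben"), stringsAsFactors = FALSE)
one_row <- people["Ana", ]       # still a dataframe: column types are kept
row_vector <- unlist(one_row)    # a vector: everything is coerced to character
print(class(one_row))
print(class(row_vector))
print(row_vector)
```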

<div class="alert alert-block alert-warning"><b>Your turn:</b> which command could you use to extract the blue cells in the 3 dataframes below? Think of answers on your own -> we will discuss the solutions together. </div>

![image.png](attachment:3602e3b7-dc0e-4cf5-a783-1a5f82441946.png)

---
### **IV.4. - Adding variables/columns to a dataframe**
---

- **adding a column:** create a new vector and include it in the dataframe

1. either you add one vector at a time:

In [None]:
## Code cell n°93 ##

d2
d2$new <- 1:2
d2

Here is another example, adding a column "sex" to the dataframe myDataf using a vector called "gender". Here the column name differs from the name of the vector, but you could keep the same name!

In [None]:
## Code cell n°94 ##

gender <- c("Man", "Man", "Woman", "Woman", "Man", "Woman")
print(gender)
myDataf$sex <- gender
print(myDataf$sex)
myDataf
str(myDataf)

2. or add several vectors or several columns from another dataframe at once using `data.frame()`:

In [None]:
## Code cell n°95 ##

d3 <-  data.frame(d, d2)
d3

<div class="alert alert-block alert-danger"><b>Caution:</b> 
    You could also use <b>cbind()</b>, but this is risky because cbind() is primarily a function for matrices. With dataframes, it keeps the data types only if you combine several variables from both dataframes. If you take a single variable from a dataframe, cbind() treats it as a vector, with a possible risk of coercion and, in versions of R < 4, of factorisation.
</div>

- **adding a row**: We generally add columns, which correspond to variables. In case you really want to add a row, this can be done with the function `rbind()`, whose result is converted into a dataframe with `as.data.frame()`.

<div class="alert alert-block alert-danger"><b>Caution:</b> 
    But it is extremely dangerous, as your new row might modify the type of the values within columns/variables.
</div>

In [None]:
## Code cell n°96 ##

d3 <- as.data.frame(rbind(d3, rep("toto", 6)))
d3
str(d3)

---
### **IV.5. - Subsetting a dataframe**
---

#### **a. The function `which()` returns the index of what is TRUE in a tested condition**

In [None]:
## Code cell n°97 ##

print(which ( myDataf$sex == "Woman") )

Here, we obtain a vector where 3, 4 and 6 correspond to the positions or indexes (1-based) of the occurrences of "Woman" in the vector/variable myDataf$sex. We can then use this vector as usual in a dataframe, before the ",", to select the corresponding rows.

In [None]:
## Code cell n°98 ##

myDataf [ which ( myDataf$sex == "Woman") , ] 

In [None]:
## Code cell n°99 ##

str(myDataf [ which ( myDataf$sex == "Woman") , ])

Instead of `==`, one can use `!=` ("is different from") to detect what does not match.

In [None]:
## Code cell n°100 ##

print(which ( myDataf$sex != "Man"))

Another method would be to add `!` for "not" before the test, to get the complementary result:

In [None]:
## Code cell n°101 ##
print(which (! myDataf$sex == "Man"))

<div class="alert alert-block alert-danger"><b>Caution:</b>
    What happens if you do not use <code>which()</code>?
</div>

Let's make a copy of our dataframe and replace the gender of Claire by a missing value:

In [None]:
## Code cell n°102 ##

myDataf2 <- myDataf
myDataf2["Claire", "sex"] <- NA
myDataf2

and rerun the same command as above without which() on the new myDataf2:

In [None]:
## Code cell n°103 ##

myDataf2[myDataf2$sex == "Woman",]

In [None]:
## Code cell n°104 ##

myDataf2[which(myDataf2$sex == "Woman"),]

<div class="alert alert-block alert-danger"><b>Caution:</b>
    If you have missing data and you forget to use which(), the rows with missing values will also be returned.<b> =>  Always use which()</b>
</div>
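
The reason lies in how R compares a missing value: the result of the test is `NA`, neither `TRUE` nor `FALSE`. The sketch below shows this on a hypothetical vector `sex` rather than on the dataframe.

```r
sex <- c("Man", "Woman", NA, "Woman")    # hypothetical vector with one missing value
is_woman <- sex == "Woman"                # comparing NA gives NA, not FALSE
print(is_woman)
kept_without_which <- sex[is_woman]       # the NA is carried into the result
kept_with_which <- sex[which(is_woman)]   # which() keeps only the TRUE positions
print(kept_without_which)
print(kept_with_which)
```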

#### **b. One can also search for a pattern with `grep()`**

It returns the index of what matches, even partially.

In [None]:
## Code cell n°105 ##

print(grep("Wom", myDataf$sex))

In [None]:
## Code cell n°106 ##

print(grep("Woman", myDataf$sex))

In [None]:
## Code cell n°107 ##

myDataf [grep("Woman", myDataf$sex), ] 

In [None]:
## Code cell n°108 ##

print(grep("a", row.names(myDataf)))

In [None]:
## Code cell n°109 ##

myDataf [grep("a", row.names(myDataf)),]
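
Because `grep()` matches partially and is case-sensitive, the pattern deserves some care. The following sketch, on a hypothetical vector `sex`, shows a pattern that matches more than intended, the `^` anchor, the `ignore.case` argument, and `grepl()`, which returns logical values instead of positions.

```r
sex <- c("Man", "Woman", "Man", "Woman")   # hypothetical values
# "an" is contained in both "Man" and "Woman": every position matches
partial <- grep("an", sex)
# "^" anchors the pattern at the start of the string
anchored <- grep("^Man", sex)
# Matching is case-sensitive unless ignore.case = TRUE
case_sensitive <- grep("man", sex)
case_insensitive <- grep("man", sex, ignore.case = TRUE)
# grepl() returns TRUE/FALSE for each element instead of positions
flags <- grepl("^Wo", sex)
print(partial)
print(anchored)
print(case_sensitive)
print(case_insensitive)
print(flags)
```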

#### **c. The function `subset()` is even simpler than `which()`**

Just give the dataframe as the first argument, then the condition, naming the variable to filter on without "quotes".

In [None]:
## Code cell n°110 ##

WomenDataf <- subset(myDataf, sex == "Woman")
WomenDataf
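
Two properties of `subset()` are worth knowing: it treats missing values in the condition as `FALSE`, so no row filled with `NA` is returned, and its optional `select` argument keeps only some columns. A minimal sketch on a toy dataframe (`toy` and its values are invented for this illustration):

```r
# Toy dataframe, invented for this illustration only
toy <- data.frame(sex = c("Man", "Woman", NA, "Woman"),
                  age = c(25, 30, 41, 56),
                  weight = c(80, 62, 70, 58),
                  row.names = c("A", "B", "C", "D"))

# subset() treats NA in the condition as FALSE: no row full of NA is returned
women <- subset(toy, sex == "Woman")
print(women)

# The optional select argument keeps only the listed columns
womenAges <- subset(toy, sex == "Woman", select = c(sex, age))
print(womenAges)
```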

#### **d. You can even combine conditions**

- logical: `&` = AND, `|` = OR, `!` = not
- comparisons: `==`, `!=` for different, `>`, `<`, `>=`, `<=`
- "is an element of" a vector using `%in%`

In [None]:
## Code cell n°111 ##

filteredData <- myDataf [ which ( myDataf$sex == "Woman" & myDataf$weight < 80 & myDataf$bmi > 20), ]
filteredData

In [None]:
## Code cell n°112 ##

subset( myDataf, sex == "Woman" & weight < 80 & bmi > 20)
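
The operators `%in%`, `|` and `!` listed above are not used in the two cells above. Here is a minimal sketch of each on a toy dataframe (`toy`, its columns and its values are invented for this illustration):

```r
# Toy dataframe, invented for this illustration only
toy <- data.frame(name = c("A", "B", "C", "D", "E"),
                  sex = c("Man", "Woman", "Woman", "Man", "Woman"),
                  weight = c(85, 62, 90, 70, 58))

# %in% tests membership in a vector, here a set of names of interest
selected <- subset(toy, name %in% c("A", "C", "E"))
print(selected)

# | (OR): women, or anyone lighter than 75
womenOrLight <- subset(toy, sex == "Woman" | weight < 75)
print(womenOrLight)

# ! (NOT) combined with %in%: everyone except A and B
notAB <- toy[which(!toy$name %in% c("A", "B")), ]
print(notAB)
```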

---
### **IV.4. -Merging dataframes:** using a column as a "key"
---

In this example, I add one column with indexes that I will use as a key, but we can also use an existing variable as a key.

In [None]:
## Code cell n°113 ##

myDataf$index <- 1:6
myDataf

Then I generate another dataframe with handedness information on 6 samples, but one sample is new compared to the initial dataframe.

In [None]:
## Code cell n°114 ##

OtherData <- data.frame(c(1:5, 7),rep(c("right-handed","left-handed"),3))
names(OtherData) <- c("ID","handedness")
OtherData

We can now merge them together by specifying the "key" column with the argument `by`. The `all` argument is used to keep the rows of one dataframe that have no match in the other. In `by.x`, `by.y`, `all.x` and `all.y`, the `.x` refers to the first dataframe while `.y` refers to the second one.

<div class="alert alert-block alert-warning"><b>Warning:</b> By default, the merged dataframe is sorted by the "key" column. Adding <b>sort = FALSE</b> prevents this sorting. </div>


In [None]:
## Code cell n°115 ##

myMergedDataf <- merge(myDataf, OtherData,
                       by.x = "index", by.y = "ID",
                       all.x = TRUE, all.y = TRUE,
                       sort = FALSE)
myMergedDataf

In the merged dataframe, we start with all the rows present in both dataframes. The next row contains the data only present in the first dataframe with missing data for the columns in the second dataframe. The last rows are the ones with data only present in the second dataframe with missing data for the first dataframe.

The row names of the initial dataframes are lost: the new dataframe has its own automatic row names. If the merge is done on the row names (`by = 0` or `by = "row.names"`), they are kept in an extra column called `Row.names`.

If two columns have the same name in both dataframes, by default R adds an ".x" to the one from the first dataframe and ".y" to the one of the second dataframe. The names can be changed with the argument `suffixes`.
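
The effect of `all`, `all.x` and `suffixes` can be checked on two small toy dataframes (`left`, `right` and their values are invented for this illustration). Both have a column named `score`, so R has to rename it in the merged result.

```r
# Two toy dataframes sharing the key "ID" and a column "score"
left  <- data.frame(ID = c(1, 2, 3), score = c(10, 20, 30))
right <- data.frame(ID = c(2, 3, 4), score = c(200, 300, 400))

# Default (all = FALSE): only the keys present in both dataframes are kept
inner <- merge(left, right, by = "ID")
print(inner)

# all.x = TRUE keeps every row of the first dataframe, filling with NA
leftJoin <- merge(left, right, by = "ID", all.x = TRUE)
print(leftJoin)

# all = TRUE keeps every row of both; suffixes renames the duplicated columns
full <- merge(left, right, by = "ID", all = TRUE,
              suffixes = c("_left", "_right"))
print(full)
```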

<div class="alert alert-block alert-success"><b>Success:</b> Well done! You now know all the main functions to create and manipulate dataframes.

</div>
    

---
___

## Conclusion

---

### A - Saving our results

We can save the dataframe myDataf created in this session as an R object in an `.RData` file, in a dedicated folder.  
This will help us to reload our dataframes without having to run the same commands.   

In [None]:
## Code cell n°116 ##

myfolder <- getwd()
#myfolder <- setwd('/shared/ifbstor1/projects/2423_rnaseq_cea/mylogin') # should not be needed
#myfolder

# creation of the directory, recursive = TRUE is equivalent to the mkdir -p in Unix
dir.create(paste0(myfolder,"/Results/Rintro"), recursive = TRUE)

# storing the path to this output folder in a variable
rintrofolder <- paste0(myfolder,"/Results/Rintro/")  # paste0() has no sep argument
rintrofolder

save(myDataf, file = paste0(rintrofolder,"myDataf.RData"))
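
To reload a saved object in a later session, use `load()`. The sketch below is a self-contained round trip that writes to a temporary folder so that nothing is overwritten (`toyDataf` and `tmpFile` are names invented for this illustration). For the file saved above, the equivalent call would be `load(paste0(rintrofolder, "myDataf.RData"))`.

```r
# Temporary file, used here so that no existing result is overwritten
tmpFile <- file.path(tempdir(), "toyDataf.RData")

# Toy dataframe, invented for this illustration only
toyDataf <- data.frame(x = 1:3, y = c("a", "b", "c"))
save(toyDataf, file = tmpFile)

# Keep a copy for comparison, then remove the object from the session
original <- toyDataf
rm(toyDataf)

# load() restores the object under its original name
# and returns the names of the restored objects
loadedNames <- load(tmpFile)
print(loadedNames)
print(identical(toyDataf, original))
```

Note that `load()` restores objects under the names they had when saved, so it can silently overwrite an object of the same name in the current session.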

---
---
### B - Next step : Graphics and plots with R 

Now that you know the basics of the R language, we can start playing with some data.
  
**=> Step 07b: Graphics and statistics with R** 

The jupyter notebook used for the next session will be *Pipe_07b-R_plots_stats_withR.ipynb*    
Let's retrieve it in our directory, in order to have a private copy to work on:   

In [None]:
## Code cell n°117 ##

file.copy("/shared/projects/2413_rnaseq_cea/pipeline/Pipe_07b-R_plots_stats_withR.ipynb", myfolder)



**Save executed notebook**

To end the session, save your executed notebook in your `run_notebooks` folder. **Adjust the name with yours** and reformat as code cell to run it.

---

<div class="alert alert-block alert-success"><b>Success:</b> Well done! You now know all the main functions to create and manipulate basic R objects.<br>
Don't forget to save your notebook and export a copy as an <b>html</b> file as well <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to the server! 
</div>

## Useful commands
<div class="alert alert-block alert-info"> 
    
- <kbd>CTRL</kbd>+<kbd>S</kbd> : save notebook<br>    
- <kbd>CTRL</kbd>+<kbd>ENTER</kbd> : Run Cell<br>  
- <kbd>SHIFT</kbd>+<kbd>ENTER</kbd> : Run Cell and Select Next<br>   
- <kbd>ALT</kbd>+<kbd>ENTER</kbd> : Run Cell and Insert Below<br>   
- <kbd>ESC</kbd>+<kbd>y</kbd> : Change to *Code* Cell Type<br>  
- <kbd>ESC</kbd>+<kbd>m</kbd> : Change to *Markdown* Cell Type<br> 
- <kbd>ESC</kbd>+<kbd>r</kbd> : Change to *Raw* Cell Type<br>    
- <kbd>ESC</kbd>+<kbd>a</kbd> : Create Cell Above<br> 
- <kbd>ESC</kbd>+<kbd>b</kbd> : Create Cell Below<br> 

<em>  
To make nice html reports with markdown: <a href="https://dillinger.io/" title="dillinger.io">html visualization tool 1</a> or <a href="https://stackedit.io/app#" title="stackedit.io">html visualization tool 2</a>, <a href="https://www.tablesgenerator.com/markdown_tables" title="tablesgenerator.com">to draw nice tables</a>, and the <a href="https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd" title="Ultimate guide">Ultimate guide</a>. <br>
Further reading on JupyterLab notebooks: <a href="https://jupyterlab.readthedocs.io/en/latest/user/notebook.html" title="Jupyter Lab">Jupyter Lab documentation</a>.<br>   
</em>    
 
</div>

Claire Vandiedonck - 2021-2022   
Sandrine Caburet - 05/2023   
Updated: 17/06/2024 by @CVandiedonck