# __R Session n°2 : tutorial__
M1 MEG UE5 - Claire Vandiedonck
***

## **Introduction to dataframes**

*Content of this tutorial:*

1. Creating a dataframe
2. Reading a text file into R
3. Saving a dataframe on my computer

In the next tutorial we will see how to subset dataframes on several criteria, and how to merge them.

---
## Before going further

<div class="alert alert-block alert-danger"><b>Attention:</b> 
Do not work directly on this notebook, so that you do not lose it. Duplicate it, rename it (for example by adding your initials) and work on this new copy. To do so, right-click on the file in the left panel and select "Duplicate". Then, still in the left panel, right-click on the copy and select "Rename" to change its name. Finally, open this new version by double-clicking on it. You are ready to start! <br>
<br>
<b>Don't forget to save your notebook regularly</b>: <kbd>Ctrl</kbd> + <kbd>S</kbd>, or click the 💾 icon at the top left of your notebook, or use the JupyterLab menu "File" then "Save Notebook"! You can also save it in HTML format: menu "File" > "Export Notebook As" > "Export Notebook as HTML".
</div>
 

__*=> About this jupyter notebook*__

This is a Jupyter notebook in **R**, meaning that the commands you enter or run in `Code` cells are directly understood by the server in the R language.
<br>You could run the same commands in a Terminal or in RStudio. 


> In this tutorial, you will run one cell at a time.    



<div class="alert alert-block alert-warning"><b>Warning:</b> you are strongly advised to run the cells in the indicated order. If you want to rerun cells above, you can just restart the kernel to start at 1 again. </div>

#### 1. Identify your working directory

In [None]:
getwd()

If this directory does not suit you, change it to an existing directory, for example `my_directory`, with the following command, specifying the relative or absolute path of your directory:
```setwd("path/my_directory")```
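
As a minimal sketch of changing the working directory and then coming back, the cell below relies on the fact that `setwd()` invisibly returns the previous directory. The session temporary directory is used only as a stand-in for `my_directory`, and the `ex_` object names are invented for this illustration.

```r
# Remember where we start
ex_start <- getwd()

# Move to a directory that is guaranteed to exist (the session temp dir);
# setwd() invisibly returns the directory we just left
ex_previous <- setwd(tempdir())
print(getwd())

# Go back to the initial working directory
setwd(ex_previous)
print(getwd())
```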

#### 2. Identify the R version of your environment and the installed packages

In [None]:
sessionInfo()

---
---

## **I - Dataframes**


Dataframes are two-dimensional objects that can be heterogeneous between columns (each column may have its own data type) but are homogeneous within a column.
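
To illustrate this property, here is a small sketch on a hypothetical dataframe (the names and values are invented, they are not the course data). It mixes a character, a numeric and a logical column, and `sapply()` reports one class per column.

```r
# Hypothetical example data: each column has its own type
ex_people <- data.frame(
  name   = c("Ana", "Bob", "Cleo"),
  height = c(1.62, 1.80, 1.71),
  smoker = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE  # keeps 'name' as character in R < 4 too
)

# One class per column: types differ between columns, not within a column
ex_classes <- sapply(ex_people, class)
print(ex_classes)
```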

### **I.1. - Creating a dataframe:**
---

- They are generated with the function `data.frame()`:

This can be done **using existing vectors of the same length**, like the vectors "weight", "size" and "bmi" that we previously generated and saved in the file "anthropo.RData". If you didn't generate it, you can find it on moodle or in the shared data folder: `/srv/data/meg-m1-ue5/`. You only have to change the path in the next command.

In [None]:
load("../R1/anthropo.RData")
ls()

In [None]:
myDataf <- data.frame(weight, size, bmi)
myDataf

The obtained dataframe looks pretty much like the previous matrix myData2.

In [None]:
class(myDataf)

In [None]:
str(myDataf)

In [None]:
print(dim(myDataf))

>*Note that if the vectors used to generate the dataframe are character strings, it is advised in R versions < 4 to add the argument `stringsAsFactors = FALSE`.*
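
The effect of this argument can be sketched on a hypothetical character vector (values invented for illustration). Setting the argument explicitly gives the same result in every R version.

```r
# Hypothetical character vector
ex_fruit <- c("pear", "apple", "fig")

# Strings kept as character
ex_as_text <- data.frame(fruit = ex_fruit, stringsAsFactors = FALSE)

# Strings converted to a factor (the default behaviour in R < 4)
ex_as_factor <- data.frame(fruit = ex_fruit, stringsAsFactors = TRUE)

class(ex_as_text$fruit)
class(ex_as_factor$fruit)
```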

If the vectors that will make up the dataframe do not exist yet in the session, but you would like to initialise a dataframe and fill it during your analysis, you could imagine creating an empty dataframe. However, this method is not very practical: the generated dataframe has 0 rows and 0 columns, so, for example, assigning a vector to a new column with `$` fails.

In [None]:
d <- data.frame()
d
dim(d)

In that case, it is better to create an empty matrix and to convert it to a dataframe. See below.

- Dataframes can be generated by **converting a matrix into a dataframe** with `as.data.frame()`

Let's try with the object myData2 which is a matrix we create with `cbind()`.

In [None]:
myData2 <- cbind(weight, size, bmi)

In [None]:
class(myData2)
class(as.data.frame(myData2))
str(as.data.frame(myData2))

You may also apply `as.data.frame()` directly to a matrix built by binding rows:

In [None]:
d2 <- as.data.frame(rbind(1:2, 10:11))
str(d2)

Similarly, we can convert an empty matrix into a dataframe, as in this example with a matrix of two rows and three columns currently filled with missing values:

In [None]:
d <- as.data.frame(matrix(NA,2,3))
d
dim(d)
str(d)
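
Such a placeholder can then be filled column by column. The sketch below works on a separate, hypothetical object (`ex_placeholder`, with invented values) so that `d` stays untouched for the following cells. It uses the default column names `V1`, `V2` and `V3` given by `as.data.frame()`.

```r
# Placeholder with 2 rows and 3 columns, all NA
ex_placeholder <- as.data.frame(matrix(NA, nrow = 2, ncol = 3))

# Fill each column once its values are known; each column takes the
# type of the vector assigned to it
ex_placeholder$V1 <- 1:2
ex_placeholder$V2 <- c(12.5, 15)
ex_placeholder$V3 <- c("low", "high")

str(ex_placeholder)
```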

- Getting **row and column names** of a dataframe:

You may use the same functions as the ones used for matrices: `rownames()` and `colnames()`:

In [None]:
rownames(d)
colnames(d)

But it is better to use the functions dedicated to dataframes which are `row.names()` and `names()`:

In [None]:
row.names(d)
names(d)

<div class="alert alert-block alert-danger"><b>Caution:</b>
    each row name must be unique in a dataframe!
</div>
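
A minimal sketch of this constraint, on a hypothetical two-row dataframe: assigning unique row names works, whereas assigning duplicated row names is refused. `tryCatch()` is used here only so that the cell keeps running.

```r
ex_df <- data.frame(x = 1:2)

# Unique row names are accepted
row.names(ex_df) <- c("a", "b")

# Duplicated row names are refused: tryCatch() captures the condition
# raised by R instead of stopping the cell
ex_outcome <- tryCatch({
  row.names(ex_df) <- c("a", "a")
  "accepted"
}, warning = function(w) "rejected", error = function(e) "rejected")

print(ex_outcome)
print(row.names(ex_df))  # the previous, unique row names are still in place
```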

- **Getting a variable from a dataframe:**

To follow along more easily, let's first display myDataf again:

In [None]:
print(myDataf)

Variables are the columns of a dataframe. You can extract the vector corresponding to a column from a dataframe with its `index`, with the `name` of the column inside `""`, or using the symbol `$`:

In [None]:
print(myDataf[,2])
print(myDataf[,"size"])
print(myDataf$size)
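
These three notations all return a vector. As a complementary sketch on a hypothetical dataframe (invented values), the double square brackets `[[ ]]` do the same, while the argument `drop = FALSE` keeps the result as a one-column dataframe.

```r
# Hypothetical data, not the course data
ex_measures <- data.frame(weight = c(60, 72), size = c(1.65, 1.80))

ex_vec     <- ex_measures[, "size"]                # numeric vector
ex_same    <- ex_measures[["size"]]                # same vector, list-style extraction
ex_one_col <- ex_measures[, "size", drop = FALSE]  # still a dataframe

class(ex_vec)
class(ex_one_col)
dim(ex_one_col)
```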

- **Extracting rows from a dataframe:**

You have two options to do so:

1. either by specifying the index of the row

In [None]:
myDataf[2,]

2. or by giving its name within `""` inside the square brackets:

In [None]:
myDataf["Pierre",]

In [None]:
class(myDataf["Pierre",])

In both cases, you may notice that you obtain a dataframe and not a vector, even if you extract only one row.
If you wish to get the vector corresponding to a row, you have to convert it with the `unlist()` function.

>*Of note, a dataframe is a special case of a list: its elements are the variables, which all have the same number of rows, and its row names are unique.*

In [None]:
temp <- unlist(myDataf["Pierre",])
print(temp)
class(temp)
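
One caveat, sketched here on a hypothetical dataframe (invented names and values): a vector can hold only one type, so when a row mixes numbers and character strings, `unlist()` coerces everything to character.

```r
# Hypothetical dataframe mixing a numeric and a character column
ex_mixed <- data.frame(
  weight = c(60, 72),
  sex    = c("Woman", "Man"),
  row.names = c("Ana", "Bob"),
  stringsAsFactors = FALSE
)

# The row is a one-row dataframe; unlist() turns it into a named vector
ex_row <- unlist(ex_mixed["Ana", ])
print(ex_row)
class(ex_row)  # character: the number was converted to a string
```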

<div class="alert alert-block alert-warning"><b>Your turn:</b> have a look at slide 23 and start thinking of answers on your own -> we will discuss the solutions together. </div>

- **adding a column:** creating a new vector with characters and including it in the dataframe

1. either you add one vector at a time:

In [None]:
d2$new <- 1:2
d2

Here is another example, adding a column "sex" to the dataframe myDataf using a vector called "gender". The vector and the column have different names here, but you could use the same name!

In [None]:
gender <- c("Man","Man","Woman","Woman","Man","Woman")
print(gender)
myDataf$sex <- gender
print(myDataf$sex)
myDataf
str(myDataf)

2. or add several vectors or several columns from another dataframe at once using `data.frame()`:

In [None]:
d3 <-  data.frame(d, d2)
d3

<div class="alert alert-block alert-danger"><b>Caution:</b> 
    You could also use <b>cbind()</b>, but this is risky because cbind() is primarily a function for matrices. If you use it with dataframes, it keeps the data types only if you combine several variables from both dataframes. If you take only one variable from a dataframe, cbind() will treat it as a vector, with a possible risk of coercion, and of factorisation in R versions < 4.
</div>
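
The coercion risk can be sketched with two hypothetical vectors of different types (invented values): `cbind()` on plain vectors builds a matrix, which can hold only one type, whereas `data.frame()` keeps one type per column.

```r
ex_num <- c(1.5, 2.5)
ex_chr <- c("a", "b")

# cbind() on vectors returns a matrix: the numbers become strings
ex_bound <- cbind(ex_num, ex_chr)
class(ex_bound)
typeof(ex_bound)

# data.frame() keeps each column's own type
ex_frame <- data.frame(ex_num, ex_chr, stringsAsFactors = FALSE)
str(ex_frame)
```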

### **I.2. - Reading a text file into R and vice versa**
---

#### **a. reading a text file into R**

The function `read.table()` reads a delimited text file (tab-delimited, csv or using another column separator) into R and **generates a dataframe**.
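
As a small self-contained sketch of how the separator and header arguments work, `read.table()` can also parse a character string passed to its `text` argument. The column names and values below are invented for illustration and are not the content of the course file.

```r
# Hypothetical tab-delimited content: one header line, two data lines
ex_text <- "Month\tTemp\nJanuary\t3.5\nFebruary\t4.8"

ex_read <- read.table(text = ex_text, sep = "\t", header = TRUE,
                      stringsAsFactors = FALSE)

class(ex_read)  # a dataframe
str(ex_read)
```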

We will import into R the file `Temperatures.txt`, which is located in the shared folder `/srv/data/meg-m1-ue5`. It is also on moodle.
To import it into your `R2` folder, open a terminal and enter the following command:

Before importing it into R, let's see what it looks like. Just double-click on it.

You will see it is a tab-delimited text file.

Now let's import it in R by specifying the correct separator with the `read.table()` function:

In [None]:
temperatures <- read.table("Temperatures.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
temperatures
str(temperatures)

In the above command, I used the argument `stringsAsFactors = FALSE` to avoid a factorisation of the columns containing character strings (here the "Month" column).
In R versions < 4, the default value for this argument is `TRUE`. Let's see what would have happened:

In [None]:
temperatures.2 <- read.table("Temperatures.txt", sep = "\t", header = TRUE, stringsAsFactors = TRUE)
str(temperatures.2)

Here the "Month" column has been factorised. How?

In [None]:
levels(temperatures.2$Month)

In alphabetical order, which is not what you want!
Thus always use `stringsAsFactors = FALSE`.
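
The alphabetical ordering can be reproduced on a short hypothetical vector of month names (not the course data). If a factor is needed later on, one possible approach is to give the desired order explicitly through the `levels` argument of `factor()`.

```r
# Hypothetical vector, in calendar order
ex_months <- c("January", "February", "March")

# Default factorisation: levels are sorted alphabetically
ex_default <- factor(ex_months)
levels(ex_default)

# Explicit levels: the calendar order is preserved
ex_calendar <- factor(ex_months, levels = ex_months)
levels(ex_calendar)
```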

<div class="alert alert-block alert-info"><b>Personal work:</b> to better understand the behaviour of factors, see the dedicated material, which is available on moodle.</div>

#### **b. writing a dataframe on your computer**

Conversely, save a dataframe into your working directory with `write.table()`:

In [None]:
# save a dataframe as a text file in the working directory

write.table(myDataf, file = "bmi_data.txt", sep = "\t", quote = FALSE, col.names = TRUE)

Have a look at it by double clicking on it in your working directory.

Then check that you can import it back into R:

In [None]:
rm(myDataf)
myDataf <- read.table("bmi_data.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
head(myDataf) #myDataf is again accessible
file.remove("bmi_data.txt") #to clean the working directory
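
The same write-then-read round trip can be sketched without touching the working directory, using a temporary file. The dataframe below is hypothetical (invented values), and `row.names = FALSE` is a choice made for this sketch so that the file contains only the columns.

```r
# Hypothetical dataframe to export
ex_out <- data.frame(name = c("Ana", "Bob"), value = c(1.5, 2),
                     stringsAsFactors = FALSE)

# Write it to a temporary tab-delimited file, without row names
ex_path <- tempfile(fileext = ".txt")
write.table(ex_out, file = ex_path, sep = "\t", quote = FALSE,
            row.names = FALSE)

# Read it back and compare with the original
ex_back <- read.table(ex_path, sep = "\t", header = TRUE,
                      stringsAsFactors = FALSE)
print(ex_back)

# Remove the temporary file
unlink(ex_path)
```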

<div class="alert alert-block alert-success"><b>Success:</b> Well done! You now know all the main functions to create and save dataframes.

</div>
    

---
Let's save the main objects of this session into an RData file:

In [None]:
ls()

We will keep `myDataf` and `temperatures`.

In [None]:
save(myDataf, temperatures, file="R2_tuto.RData")
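
To check what an RData file contains without overwriting objects in the current session, one option is to load it into a separate environment. The sketch below uses a hypothetical object and a temporary file rather than `R2_tuto.RData`. `load()` returns the names of the restored objects.

```r
# Hypothetical object saved into a temporary RData file
ex_a <- 1:3
ex_file <- tempfile(fileext = ".RData")
save(ex_a, file = ex_file)

# Load the file into a fresh environment instead of the global one
ex_env <- new.env()
ex_loaded <- load(ex_file, envir = ex_env)
print(ex_loaded)                  # names of the objects found in the file
print(get("ex_a", envir = ex_env))

# Remove the temporary file
unlink(ex_file)
```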

In [None]:
sessionInfo()

***
***
## Useful commands
<div class="alert alert-block alert-info"> 
    
- <kbd>CTRL</kbd>+<kbd>S</kbd> : save notebook<br>    
- <kbd>CTRL</kbd>+<kbd>ENTER</kbd> : Run Cell<br>  
- <kbd>SHIFT</kbd>+<kbd>ENTER</kbd> : Run Cell and Select Next<br>   
- <kbd>ALT</kbd>+<kbd>ENTER</kbd> : Run Cell and Insert Below<br>   
- <kbd>ESC</kbd>+<kbd>y</kbd> : Change to *Code* Cell Type<br>  
- <kbd>ESC</kbd>+<kbd>m</kbd> : Change to *Markdown* Cell Type<br> 
- <kbd>ESC</kbd>+<kbd>r</kbd> : Change to *Raw* Cell Type<br>    
- <kbd>ESC</kbd>+<kbd>a</kbd> : Create Cell Above<br> 
- <kbd>ESC</kbd>+<kbd>b</kbd> : Create Cell Below<br> 

<em>  
To make nice html reports with markdown: <a href="https://dillinger.io/" title="dillinger.io">html visualization tool 1</a> or <a href="https://stackedit.io/app#" title="stackedit.io">html visualization tool 2</a>, <a href="https://www.tablesgenerator.com/markdown_tables" title="tablesgenerator.com">to draw nice tables</a>, and the <a href="https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd" title="Ultimate guide">Ultimate guide</a>. <br>
Further reading on JupyterLab notebooks: <a href="https://jupyterlab.readthedocs.io/en/latest/user/notebook.html" title="Jupyter Lab">Jupyter Lab documentation</a>.<br>   
</em>    
 
</div>