# __R Session 3: tutorial__
M1 MEG UE5 - Claire Vandiedonck
***

## **Manipulating dataframes**

*Content of this tutorial:*

1. Subsetting a dataframe on several criteria
2. Merging dataframes

---
## Before going further

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Do not work directly on this notebook, so that you do not lose the original. Duplicate it and rename it, for example by adding your initials, then work on this new copy. To do so, in the left panel, right-click on the file and select "Duplicate". Then, still in the left panel, right-click on the copy and select "Rename" to change its name. Open this new version by double-clicking on it. You are ready to start! <br>
<br>
<b>Remember to save your notebook regularly</b>: <kbd>Ctrl</kbd> + <kbd>S</kbd>, by clicking on the 💾 icon at the top left of your notebook, or from the JupyterLab menu "File" then "Save Notebook". You can also save it in HTML format: menu "File" > "Export Notebook As" > "Export notebook as HTML".
</div>
 

__*=> About this jupyter notebook*__

This is a Jupyter notebook in **R**, meaning that the commands you enter or run in `Code` cells are interpreted directly by the server in the R language.
<br>You could run the same commands in a Terminal or in RStudio. 


> In this tutorial, you will run one cell at a time.    



<div class="alert alert-block alert-warning"><b>Warning:</b> you are strongly advised to run the cells in the indicated order. If you need to rerun earlier cells, restart the kernel so that the execution count starts at 1 again. </div>

#### 1. Identify your working directory

In [None]:
#cell1
getwd()

If this directory does not suit you, change it to an existing directory, for example `my_directory`, with the following command, specifying the relative or absolute path of your directory:
```setwd("path/my_directory")```

#### 2. Identify the R version of your environment and the installed packages

In [None]:
#cell2
sessionInfo()

## **0. Load data**
---

To start this tutorial, let's first load the data we saved at the end of the introductory tutorial on dataframes.

<div class="alert alert-block alert-warning"><b>If you want to start directly here rather than working through the tutorial step by step:</b> you need to run all the cells above so that all required objects are already loaded in the session. To do so, click on "Run" in the top menu and select "Run All Above Selected Cell".</div>

In [None]:
#cell3
load("../R2/R2_tuto.RData")
ls()
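As a minimal sketch of what `load()` does, the example below saves one object to a temporary file and restores it. The object name `example_vector` and the temporary file are made up for this illustration; they are unrelated to `R2_tuto.RData`.

```r
# Temporary file used only for this illustration
tmp_file <- tempfile(fileext = ".RData")

# Save one object to the file
example_vector <- c(1.5, 2.5, 3.5)
save(example_vector, file = tmp_file)

# Remove the object from the session, then restore it from the file
rm(example_vector)
loaded_names <- load(tmp_file)  # load() returns the names of the restored objects
print(loaded_names)
print(example_vector)

# Clean up the temporary file
unlink(tmp_file)
```

Storing the value returned by `load()` is a convenient way to see which objects a file added to the session, in addition to calling `ls()`.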

## **I. - Subsetting a dataframe**
---

#### **a. The function `which()` returns the indices of the elements for which a tested condition is TRUE:**

In [None]:
#cell4
print(which ( myDataf$sex == "Woman") )

Here, we obtain a vector where 3, 4 and 6 correspond to the positions, or indices (1-based), of the occurrences of "Woman" in the vector/variable `myDataf$sex`. We can then use this vector as usual in a dataframe, before the `,`, to select the corresponding rows.

In [None]:
#cell 5
myDataf [ which ( myDataf$sex == "Woman") , ] 

In [None]:
#cell 6
str(myDataf [ which ( myDataf$sex == "Woman") , ])

Instead of `==`, one can use `!=` ("is different from") to detect what does not match.

In [None]:
#cell 7
print(which ( myDataf$sex != "Man"))

Another method would be to add `!`  for "not" before the test, to get the complementary result:

In [None]:
#cell 8
print(which (! myDataf$sex == "Man"))
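The same logic can be tried on a self-contained example that does not depend on the objects loaded from the previous session. The sketch below uses a small toy dataframe (the name `toy` and its values are invented for this illustration) and checks that the two ways of negating a test select the same positions.

```r
# Toy dataframe, invented for this illustration (not the tutorial's myDataf)
toy <- data.frame(
  sex    = c("Man", "Woman", "Woman", "Man"),
  weight = c(80, 62, 70, 95)
)

# which() converts a logical vector into the positions of its TRUE values
idx_women <- which(toy$sex == "Woman")
print(idx_women)

# Both forms of negation give the same positions
idx_not_man_1 <- which(toy$sex != "Man")
idx_not_man_2 <- which(!toy$sex == "Man")
print(identical(idx_not_man_1, idx_not_man_2))

# The positions are used before the comma to select rows
women_rows <- toy[idx_women, ]
print(women_rows)
```

In `!toy$sex == "Man"`, the comparison `==` is evaluated before `!`, so the expression negates the result of the whole test.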

<div class="alert alert-block alert-danger"><b>Caution:</b>
    What happens if you do not use `which()`?
</div>

Let's make a copy of our dataframe and replace the gender of "Claire" with a missing value:

In [None]:
#cell 9
myDataf2 <- myDataf
myDataf2["Claire", "sex"] <- NA
myDataf2

and rerun the same command as above without `which()` on the new `myDataf2`:

In [None]:
# cell 10
myDataf2[myDataf2$sex == "Woman",]

In [None]:
#cell 11
myDataf2[which(myDataf2$sex == "Woman"),]

<div class="alert alert-block alert-danger"><b>Caution:</b>
    If the tested column contains missing data and you forget to use which(), the rows where the test is NA are returned as well, as rows filled with NA.<b> =>  Always use which()</b>
</div>
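The difference can be reproduced on a self-contained toy dataframe. This is a minimal sketch; the name `toy_na` and its values are invented for this illustration.

```r
# Toy dataframe with one missing value, invented for this illustration
toy_na <- data.frame(
  sex    = c("Man", "Woman", NA, "Woman"),
  weight = c(80, 62, 70, 58)
)

# The test itself returns NA where the value is missing
test <- toy_na$sex == "Woman"
print(test)

# Logical indexing keeps the NA position as a row filled with NA
with_logical <- toy_na[test, ]
print(with_logical)

# which() keeps only the positions that are TRUE, so the NA is dropped
with_which <- toy_na[which(test), ]
print(with_which)
```

The first selection has three rows, one of them containing only `NA`, while the selection with `which()` has the two expected rows.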

#### **b. One can also search for a pattern with `grep()`:**

It returns the indices of the elements that match the pattern, even partially.

In [None]:
#cell 12
print(grep("Wom", myDataf$sex))

In [None]:
#cell 13
print(grep("Woman", myDataf$sex))

In [None]:
#cell 14
myDataf [grep("Woman", myDataf$sex), ] 

In [None]:
#cell 15
print(grep("a", row.names(myDataf)))

In [None]:
#cell 16
myDataf [grep("a", row.names(myDataf)),]
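Because `grep()` matches partially and is case-sensitive by default, a pattern can catch more or less than intended. The sketch below, on a toy character vector invented for this illustration, shows this pitfall together with a few standard options of base R (`^` and `$` anchors, `value = TRUE`, and `grepl()`) that go slightly beyond the cells above.

```r
# Toy character vector, invented for this illustration
sex <- c("Man", "Woman", "Woman", "Man")

# Partial match: "Wom" is found inside "Woman"
idx_wom <- grep("Wom", sex)
print(idx_wom)

# Caution: lowercase "man" is part of "Woman" but not of "Man"
idx_man_lower <- grep("man", sex)
print(idx_man_lower)

# The anchors ^ and $ restrict the match to the whole string
idx_exact <- grep("^Man$", sex)
print(idx_exact)

# value = TRUE returns the matching elements instead of their positions
matched_values <- grep("Wom", sex, value = TRUE)
print(matched_values)

# grepl() returns a logical vector of the same length as the input
is_wom <- grepl("Wom", sex)
print(is_wom)
```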

#### **c. The function `subset()` is even simpler than `which()`:**

Just enter the dataframe as the first argument, then the condition, in which the variable used for filtering is written ***without "quotes"***.

In [None]:
#cell 17
WomenDataf <- subset(myDataf, sex == "Woman")
WomenDataf
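A self-contained sketch of `subset()` on a toy dataframe follows (the name `toy` and its values are invented for this illustration). It also shows two documented behaviours of `subset()`: rows where the condition is `NA` are treated as not matching, and the optional `select` argument keeps only some columns.

```r
# Toy dataframe with one missing value, invented for this illustration
toy <- data.frame(
  id     = c("s1", "s2", "s3", "s4"),
  sex    = c("Man", "Woman", NA, "Woman"),
  weight = c(80, 62, 70, 58)
)

# Column names are written without quotes inside subset()
women <- subset(toy, sex == "Woman")
print(women)

# The optional select argument keeps only the listed columns
women_weight <- subset(toy, sex == "Woman", select = c(id, weight))
print(women_weight)
```

Here the row with a missing `sex` is dropped without needing `which()`.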

#### **d. You can even combine conditions:**

- logical: `&` = AND, `|` = OR, `!` = not
- comparisons: `==`, `!=` for different, `>`, `<`, `>=`, `<=`
- "is an element of" a vector using `%in%`

In [None]:
#cell 18
filteredData <- myDataf [ which ( myDataf$sex == "Woman" & myDataf$weight < 80 & myDataf$bmi > 20), ]
filteredData

In [None]:
# cell 19
subset( myDataf, sex == "Woman" & weight < 80 & bmi > 20)
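The cells above only combine conditions with `&`. As a minimal sketch, the example below also tries `|`, `%in%` and `!` on a toy dataframe; the name `toy`, the column `city` and all values are invented for this illustration.

```r
# Toy dataframe, invented for this illustration
toy <- data.frame(
  sex    = c("Man", "Woman", "Woman", "Man", "Woman"),
  weight = c(80, 62, 85, 95, 70),
  city   = c("Paris", "Lyon", "Paris", "Nice", "Lille")
)

# AND: both conditions must be TRUE
light_women <- subset(toy, sex == "Woman" & weight < 80)

# OR: at least one condition must be TRUE
man_or_heavy <- subset(toy, sex == "Man" | weight > 80)

# %in%: the value must be one of the elements of the vector
paris_or_lyon <- subset(toy, city %in% c("Paris", "Lyon"))

# NOT combined with %in%: parentheses make the negation explicit
elsewhere <- toy[which(!(toy$city %in% c("Paris", "Lyon"))), ]

print(light_women)
print(man_or_heavy)
print(paris_or_lyon)
print(elsewhere)
```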

## **II. - Merging dataframes:** using a column as a "key"

In this example, I add one column with indexes that I will use as a key, but we can also use an existing variable as a key.

In [None]:
#cell 20
myDataf$index <- 1:6
myDataf

Then I generate another dataframe with handedness information on 6 samples, but one sample is new compared to the initial dataframe.

In [None]:
#cell 21
OtherData <- data.frame(c(1:5, 7),rep(c("right-handed","left-handed"),3))
names(OtherData) <- c("ID","handedness")
OtherData

We can now merge them by specifying the "key" column of each dataframe with the arguments `by.x` and `by.y` (or with a single `by` argument when the key has the same name in both dataframes). The arguments `all.x` and `all.y` keep the rows of one dataframe that have no match in the other. In these argument names, `.x` refers to the first dataframe while `.y` refers to the second one.

<div class="alert alert-block alert-warning"><b>Warning:</b> Adding <b>sort=FALSE</b> prevents the merged dataframe from being sorted by the "key" column. </div>


In [None]:
# cell 22
# TRUE/FALSE are written in full: T and F are ordinary variables that can be overwritten
myMergedDataf <- merge(myDataf, OtherData, by.x="index", by.y="ID", all.x=TRUE, all.y=TRUE, sort=FALSE)
myMergedDataf

In the merged dataframe, we start with all the rows present in both dataframes. The next rows contain the data present only in the first dataframe, with missing values for the columns of the second dataframe. The last rows are the ones present only in the second dataframe, with missing values for the columns of the first dataframe.
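To see the effect of the `all` arguments, the sketch below merges two toy dataframes in three ways. The names `left_df` and `right_df` and their values are invented for this illustration.

```r
# Two toy dataframes sharing only the keys 1 and 2, invented for this illustration
left_df  <- data.frame(index = 1:4, weight = c(80, 62, 70, 95))
right_df <- data.frame(ID = c(1L, 2L, 5L),
                       handedness = c("right-handed", "left-handed", "right-handed"))

# Default: only the rows whose key is present in both dataframes
inner <- merge(left_df, right_df, by.x = "index", by.y = "ID")

# all.x = TRUE: also keep the rows found only in the first dataframe
keep_left <- merge(left_df, right_df, by.x = "index", by.y = "ID", all.x = TRUE)

# all = TRUE is a shortcut for all.x = TRUE and all.y = TRUE
full <- merge(left_df, right_df, by.x = "index", by.y = "ID", all = TRUE)

print(inner)
print(keep_left)
print(full)
```

The rows without a match are completed with `NA` in the columns coming from the other dataframe.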

The row names of the initial dataframes are lost: **the new dataframe has its own row names.** The only exception is a merge done on the row names (`by=0` or `by="row.names"`), which stores them in a column called `Row.names`.

If a column that is not used as a key has the same name in both dataframes, by default R adds `.x` to the name of the one from the first dataframe and `.y` to the one from the second dataframe. These suffixes can be changed with the argument `suffixes`.
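The naming rules can be checked on two toy dataframes that share a non-key column. This is a minimal sketch; the names `x`, `y`, `score` and the suffixes `_first` and `_second` are invented for this illustration.

```r
# Two toy dataframes with a shared non-key column "score", invented for this illustration
x <- data.frame(id = 1:3, score = c(10, 20, 30), row.names = c("a", "b", "c"))
y <- data.frame(id = 2:4, score = c(200, 300, 400), row.names = c("b", "c", "d"))

# Default suffixes .x and .y on the duplicated column
default_names <- merge(x, y, by = "id")
print(names(default_names))

# Custom suffixes
custom_names <- merge(x, y, by = "id", suffixes = c("_first", "_second"))
print(names(custom_names))

# Merging on the row names stores them in a column called Row.names
by_rows <- merge(x, y, by = "row.names")
print(by_rows)
```

In the last merge, `id` is no longer the key, so it is duplicated and receives the suffixes as well.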

<div class="alert alert-block alert-success"><b>Success:</b> Well done! You now know all the main functions to create and manipulate dataframes.

</div>
    

In [None]:
# cell 23
save.image(file="RSession3_tutorial.RData")

<div class="alert alert-block alert-danger"><b>Caution:</b><br> 
 Don't forget to save your notebook and export a copy as an <b>html</b> file as well <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to the IFB Jupyter hub! 
</div>