# Introduction to the Tidyverse

This Jupyter Notebook is an introduction to the tidyverse, a collection of R packages for data science. We will use a simple dataset and focus on the most important functions. The first step is to install the packages:

In [2]:
install.packages(c("tidyverse","palmerpenguins"))


Installing packages into ‘/usr/local/lib/R/4.1/site-library’
(as ‘lib’ is unspecified)
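Since a package only has to be installed once, a small guard can skip the installation when the packages are already available. The following is a minimal sketch; `paquetes_faltantes` is a hypothetical helper introduced here for illustration and is not part of any package.

```r
# Hypothetical helper: returns the packages in `pkgs` that are not installed yet.
paquetes_faltantes <- function(pkgs) {
  # requireNamespace() returns TRUE when the package can be loaded, FALSE otherwise
  instalado <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!instalado]
}

faltantes <- paquetes_faltantes(c("tidyverse", "palmerpenguins"))

# Install only what is missing
if (length(faltantes) > 0) {
  install.packages(faltantes)
}
```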



Installation is only needed once. After that, the `library` function loads a package into the current session:

In [3]:
library(tidyverse) # load the packages
library(palmerpenguins)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.8
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



The `data` function loads datasets that ship with packages, in this case the `penguins` dataset:

In [20]:
data("penguins", package = "palmerpenguins") # load the penguins dataset from the package


## 1) Taking a first look

Before working with a dataset, it is worth taking a quick look at it. Useful functions for this are `glimpse`, `head`, and `tail`.

In [21]:
glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…


`glimpse` shows the data type of each column:

* logical vectors `<lgl>` contain TRUE or FALSE
* integer vectors `<int>` contain integers
* double vectors `<dbl>` contain real numbers
* character vectors `<chr>` contain strings ("")
* factors `<fct>` represent categorical variables with a fixed set of values (called levels)
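To see these type labels on a small scale, the sketch below builds a toy tibble with one column of each type and passes it to `glimpse`. The tibble `ejemplo` and its column names are made up for illustration.

```r
library(dplyr)

# Toy tibble with one column per type listed above (names are illustrative)
ejemplo <- tibble(
  logico = c(TRUE, FALSE, NA),                 # <lgl>
  entero = c(1L, 2L, 3L),                      # <int>, note the L suffix
  real   = c(1.5, 2.0, NA),                    # <dbl>
  texto  = c("a", "b", "c"),                   # <chr>
  nivel  = factor(c("bajo", "alto", "bajo"),
                  levels = c("bajo", "alto"))  # <fct> with two levels
)

glimpse(ejemplo)
```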

In [22]:
head(penguins)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007


`head` returns the first rows and `tail` the last ones.

Both `head` and `tail` accept arguments. Here, `n` is the number of rows to show.

In [23]:
tail(penguins, n = 3)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Chinstrap,Dream,49.6,18.2,193,3775,male,2009
Chinstrap,Dream,50.8,19.0,210,4100,male,2009
Chinstrap,Dream,50.2,18.7,198,3775,female,2009
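The effect of `n` is easy to check on a tiny data frame. This sketch uses a made-up data frame `df` with ten rows rather than `penguins`.

```r
# Ten-row toy data frame (illustrative)
df <- data.frame(id = 1:10)

primeras <- head(df, n = 3)  # first three rows
ultimas  <- tail(df, n = 3)  # last three rows

primeras$id  # 1 2 3
ultimas$id   # 8 9 10
```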


## 2) Manipulating data

Here we review functions for manipulating the dataset and summarizing information. Useful functions for this are `mutate`, `filter`, `group_by`, and `select`, along with the `%>%` operator.

### The `%>%` operator

It is often read as "and then...". The idea is to chain a sequence of functions so that the code is easy to read: `x %>% f()` is equivalent to `f(x)`.


In [24]:
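# Pseudocode (me at 6 am, and then: get up, get dressed, have breakfast).
# These names are not defined, so running the cell raises an error.
yo_a_las_6am %>%
    me_levanto %>%
    me_visto %>%
    tomo_desayuno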

ERROR: Error in tomo_desayuno(.): could not find function "tomo_desayuno"


The goal is to make the code flow. As in the example, **order matters**.
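As a small runnable counterpart, the sketch below writes the same computation twice, once with nested calls and once as a pipeline. The vector `numeros` is made up for illustration.

```r
library(dplyr)  # provides the %>% operator

numeros <- c(4, 9, 16)

# Nested calls are read from the inside out
anidado <- round(mean(sqrt(numeros)), 1)

# The pipeline is read top to bottom: square root, and then mean, and then round
encadenado <- numeros %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(anidado, encadenado)  # TRUE
```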
    

### Question: what is the average body mass of females on each island?

It helps to sketch the steps mentally first:

In [None]:
# Pseudocode: keep the females' weights, group by island, average the weights
penguins %>%
    selecciono_los_pesos_de_las_hembras %>%
    agrupo_por_isla %>%
    promedio_los_pesos

With the steps defined, we use `filter`, `group_by`, and `mean` inside `summarize`, chained with the `%>%` operator.

In [34]:
penguins %>%
    filter(sex == "female") %>% 
    group_by(island) %>%
    summarize(promedio_hembras = mean(body_mass_g))

island,promedio_hembras
<fct>,<dbl>
Biscoe,4319.375
Dream,3446.311
Torgersen,3395.833


## 3) The functions `mutate()`, `filter()`, `group_by()`, `select()`, `rename()`, `arrange()`, and others

### `mutate()` creates a new variable or modifies an existing one

In [8]:
penguins %>% 
    mutate(razon_bill_length_depth = bill_length_mm / bill_depth_mm) %>% # ratio of bill length to bill depth
    tail(n = 3)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,razon_bill_length_depth
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>,<dbl>
Chinstrap,Dream,49.6,18.2,193,3775,male,2009,2.725275
Chinstrap,Dream,50.8,19.0,210,4100,male,2009,2.673684
Chinstrap,Dream,50.2,18.7,198,3775,female,2009,2.684492
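The two uses of `mutate()` can be seen side by side in the sketch below, which works on a made-up two-row tibble (`pinguinos_mini` and its columns are illustrative): a name that does not exist yet adds a column, while an existing name overwrites that column.

```r
library(dplyr)

# Toy data (illustrative)
pinguinos_mini <- tibble(
  especie  = c("adelie", "gentoo"),
  largo_mm = c(40, 50)
)

resultado <- pinguinos_mini %>%
  mutate(
    largo_cm = largo_mm / 10,    # new name: creates a column
    especie  = toupper(especie)  # existing name: modifies the column in place
  )

resultado
```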


### `recode()` quickly replaces the values of a variable

In [14]:
penguins %>% 
    mutate(sex = recode(sex, male = "macho", female = "hembra")) %>% 
    tail(n = 3)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Chinstrap,Dream,49.6,18.2,193,3775,macho,2009
Chinstrap,Dream,50.8,19.0,210,4100,macho,2009
Chinstrap,Dream,50.2,18.7,198,3775,hembra,2009


### `filter()` keeps the rows that satisfy the given conditions

In [57]:
penguins %>% 
    filter(sex == "female") %>% 
    tail(n = 3)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Chinstrap,Dream,45.7,17.0,195,3650,female,2009
Chinstrap,Dream,43.5,18.1,202,3400,female,2009
Chinstrap,Dream,50.2,18.7,198,3775,female,2009


### To use `filter()`, it helps to know the operators used to write conditions:
* `==`, `>`, `>=`, etc.

* `&`, `|`, `!`

* `is.na()`

* `between()`, `near()`
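Before applying these helpers inside `filter()`, it can be useful to see what they return on plain vectors. The values below are made up for illustration.

```r
library(dplyr)

pesos <- c(2900, 3000, 3300, 3400)

# between() is inclusive on both ends
en_rango <- between(pesos, 3000, 3300)  # FALSE TRUE TRUE FALSE

# == compares floating-point numbers exactly; near() allows a small tolerance
exacto  <- sqrt(2)^2 == 2               # FALSE
cercano <- near(sqrt(2)^2, 2)           # TRUE

# is.na() flags missing values
faltante <- is.na(c(1, NA))             # FALSE TRUE
```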

Let's look at a few examples of these operators before moving on.

In [45]:
penguins %>% # `!` is logical negation (NOT) and `&` is logical AND
    filter(!(sex == "female") & between(body_mass_g, 3000, 3300)) # parentheses make explicit that `!` negates the whole comparison

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Gentoo,Biscoe,46.5,14.8,217,5200,female,2008
Gentoo,Biscoe,45.2,14.8,212,5200,female,2009

Note that the two rows shown above are females weighing 5200 g, which this filter would exclude, so the displayed output appears to come from an earlier version of the cell.


In [56]:
penguins %>% # is.na() flags missing values (NA) and `|` is logical OR: keep rows where sex or body_mass_g is NA
    filter(is.na(sex) | is.na(body_mass_g))

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,34.1,18.1,193.0,3475.0,,2007
Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007
Adelie,Torgersen,37.8,17.1,186.0,3300.0,,2007
Adelie,Torgersen,37.8,17.3,180.0,3700.0,,2007
Adelie,Dream,37.5,18.9,179.0,2975.0,,2007
Gentoo,Biscoe,44.5,14.3,216.0,4100.0,,2007
Gentoo,Biscoe,46.2,14.4,214.0,4650.0,,2008
Gentoo,Biscoe,47.3,13.8,216.0,4725.0,,2009
Gentoo,Biscoe,44.5,15.7,217.0,4875.0,,2009
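Rows with `NA` need this explicit treatment because `filter()` only keeps rows where the condition is `TRUE`; a comparison against `NA` yields `NA`, so those rows are dropped silently. A minimal sketch with a made-up tibble `datos`:

```r
library(dplyr)

# Toy data with one missing value (illustrative)
datos <- tibble(sexo = c("female", "male", NA))

# NA != "female" evaluates to NA, so the NA row is dropped
sin_na <- datos %>% filter(sexo != "female")

# Adding is.na() keeps the NA row explicitly
con_na <- datos %>% filter(sexo != "female" | is.na(sexo))

nrow(sin_na)  # 1
nrow(con_na)  # 2
```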


### `group_by()` groups the rows by one or more variables for the operations that follow; `ungroup()` removes the grouping

In [67]:
penguins %>% 
    group_by(sex) %>%
    summarize(promedio = mean(body_mass_g)) %>%
    ungroup()

sex,promedio
<fct>,<dbl>
female,3862.273
male,4545.685
,
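The empty average in the row where `sex` is missing comes from how `mean()` treats `NA`: if any value is missing, the result is `NA` unless `na.rm = TRUE` is set. A short sketch with made-up values:

```r
pesos <- c(3750, 3800, NA)

con_na <- mean(pesos)                # NA, because one value is missing
sin_na <- mean(pesos, na.rm = TRUE)  # 3775, the mean of the non-missing values
```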


### Now we can combine `filter()` with `group_by()` and `summarize()`

In [66]:
penguins %>% 
    filter(!is.na(sex) & !is.na(body_mass_g)) %>% 
    group_by(sex) %>%
    summarize(promedio = mean(body_mass_g)) %>%
    ungroup()

sex,promedio
<fct>,<dbl>
female,3862.273
male,4545.685


### `ungroup()` is important!

In [84]:
penguins %>% 
  group_by(sex) %>% 
  mutate(promedio_body_mass_g = mean(body_mass_g)) %>%    # mean body mass within each sex group
  mutate(promedio_bill_depth = mean(bill_depth_mm)) %>%  # mean bill depth within each sex group
  ungroup() %>% 
  tail(n = 6) # ungroup() removes the grouping before the final step


species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,promedio_body_mass_g,promedio_bill_depth
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>,<dbl>,<dbl>
Chinstrap,Dream,45.7,17.0,195,3650,female,2009,3862.273,16.42545
Chinstrap,Dream,55.8,19.8,207,4000,male,2009,4545.685,17.89107
Chinstrap,Dream,43.5,18.1,202,3400,female,2009,3862.273,16.42545
Chinstrap,Dream,49.6,18.2,193,3775,male,2009,4545.685,17.89107
Chinstrap,Dream,50.8,19.0,210,4100,male,2009,4545.685,17.89107
Chinstrap,Dream,50.2,18.7,198,3775,female,2009,3862.273,16.42545
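One way to see why `ungroup()` matters is to compute the same expression before and after it. In this sketch (the tibble `datos` is made up), `mean(peso)` gives a per-group value while the data is grouped and a single overall value once the grouping is removed.

```r
library(dplyr)

# Toy data (illustrative)
datos <- tibble(
  sexo = c("f", "f", "m", "m"),
  peso = c(10, 20, 30, 50)
)

resultado <- datos %>%
  group_by(sexo) %>%
  mutate(promedio_grupo = mean(peso)) %>%  # computed within each group
  ungroup() %>%
  mutate(promedio_total = mean(peso))      # computed over all rows

resultado$promedio_grupo  # 15 15 40 40
resultado$promedio_total  # 27.5 27.5 27.5 27.5
```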


### To count the rows in each group we use `tally()`

In [88]:
penguins %>% 
  group_by(sex, island) %>%
  tally(sort = TRUE) # sort = TRUE orders the counts from largest to smallest

sex,island,n
<fct>,<fct>,<int>
male,Biscoe,83
female,Biscoe,80
male,Dream,62
female,Dream,61
female,Torgersen,24
male,Torgersen,23
,Biscoe,5
,Torgersen,5
,Dream,1
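dplyr also offers `count()`, which combines the `group_by()` and `tally()` steps in a single call. The sketch below compares both forms on a made-up tibble `datos`.

```r
library(dplyr)

# Toy data (illustrative)
datos <- tibble(isla = c("Biscoe", "Biscoe", "Dream"))

# Two steps: group, then tally
con_tally <- datos %>%
  group_by(isla) %>%
  tally(sort = TRUE)

# One step: count() groups and tallies
con_count <- datos %>%
  count(isla, sort = TRUE)

con_count$n  # 2 1
```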


### `group_by()` is especially useful together with `summarise()`

In [91]:
penguins %>% 
  group_by(sex, island) %>%
  summarise(
    n = n(),
    promedio_peso = mean(body_mass_g, na.rm = TRUE)
  )

`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.


sex,island,n,promedio_peso
<fct>,<fct>,<int>,<dbl>
female,Biscoe,80,4319.375
female,Dream,61,3446.311
female,Torgersen,24,3395.833
male,Biscoe,83,5104.518
male,Dream,62,3987.097
male,Torgersen,23,4034.783
,Biscoe,5,4587.5
,Dream,1,2975.0
,Torgersen,5,3681.25
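The message printed by `summarise()` refers to its `.groups` argument. As a sketch of one option, `.groups = "drop"` returns an ungrouped result and silences the message; the tibble `datos` below is made up for illustration.

```r
library(dplyr)

# Toy data (illustrative)
datos <- tibble(
  sexo = c("f", "f", "m"),
  isla = c("A", "A", "B"),
  peso = c(10, NA, 30)
)

resumen <- datos %>%
  group_by(sexo, isla) %>%
  summarise(
    n = n(),
    promedio_peso = mean(peso, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble
  )

group_vars(resumen)  # character(0): no grouping left
```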


### To finish, one extra operator, `%in%`, a bit of statistics with `t.test`, and a plot
* `%in%` checks whether each value belongs to a given set (read it as "I want to keep these values")
* `ggplot()`, from the ggplot2 package, is the tidyverse's main plotting function
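On a plain vector, `%in%` returns one `TRUE` or `FALSE` per element, which is what `filter()` then uses to keep rows. A short sketch with made-up values:

```r
islas <- c("Biscoe", "Dream", "Torgersen", "Dream")

# TRUE where the island is one of the two we want to keep
seleccion <- islas %in% c("Biscoe", "Dream")  # TRUE TRUE FALSE TRUE

islas[seleccion]  # "Biscoe" "Dream" "Dream"
```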

### Mission: evaluate and plot how different the body mass of females is on two islands, Biscoe and Dream

In [103]:
penguins %>% 
  filter(island %in% c("Biscoe", "Dream") & sex == "female") %>%
  mutate(id_pinguinas = row_number()) %>%
  pivot_wider(names_from = island, values_from = body_mass_g)


species,bill_length_mm,bill_depth_mm,flipper_length_mm,sex,year,id_pinguinas,Biscoe,Dream
<fct>,<dbl>,<dbl>,<int>,<fct>,<int>,<int>,<int>,<int>
Adelie,37.8,18.3,174,female,2007,1,3400,
Adelie,35.9,19.2,189,female,2007,2,3800,
Adelie,35.3,18.9,187,female,2007,3,3800,
Adelie,40.5,17.9,187,female,2007,4,3200,
Adelie,37.9,18.6,172,female,2007,5,3150,
Adelie,39.5,16.7,178,female,2007,6,,3250
Adelie,39.5,17.8,188,female,2007,7,,3300
Adelie,36.4,17.0,195,female,2007,8,,3325
Adelie,42.2,18.5,180,female,2007,9,,3550
Adelie,37.6,19.3,181,female,2007,10,,3300
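The shape of this output can be reproduced on a small scale. In the sketch below (the tibble `datos` is made up), `pivot_wider()` creates one column per island and places each weight in the column of its island, leaving `NA` in the other; the `id` column keeps one row per individual.

```r
library(dplyr)
library(tidyr)

# Toy data (illustrative): one row per individual
datos <- tibble(
  id   = 1:3,
  isla = c("Biscoe", "Dream", "Biscoe"),
  peso = c(10, 20, 30)
)

ancho <- datos %>%
  pivot_wider(names_from = isla, values_from = peso)

ancho$Biscoe  # 10 NA 30
ancho$Dream   # NA 20 NA
```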
