## Lecture 5 demo

There are 3 key themes to this lecture:

1. selectively changing values (and the weirdness that comes with it)

2. iterating over groups of rows

3. iterating over columns in a data frame

Thanks for providing feedback to the block reps. Here are the changes made in response to your comments:
- Lecture notes will be posted before Sunday morning (instead of before Monday morning)
- New look for the optional questions.
- Summary at the end of the lecture.

First, let's load the packages we need:

In [43]:
library(palmerpenguins)
library(tidyverse)
options(repr.matrix.max.rows = 10)

## Theme 1: Selectively changing values

Let's say we want to change the penguin species names from their common names to their Latin species names:

- Adelie: *Pygoscelis adeliae*
- Gentoo: *Pygoscelis papua*
- Chinstrap: *Pygoscelis antarcticus*

Let's use {dplyr}'s `case_when` to do this!

In [44]:
penguins

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181,3750,male,2007
Adelie,Torgersen,39.5,17.4,186,3800,female,2007
Adelie,Torgersen,40.3,18.0,195,3250,female,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193,3450,female,2007
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
Chinstrap,Dream,55.8,19.8,207,4000,male,2009
Chinstrap,Dream,43.5,18.1,202,3400,female,2009
Chinstrap,Dream,49.6,18.2,193,3775,male,2009
Chinstrap,Dream,50.8,19.0,210,4100,male,2009


In [45]:
latin_penguins <- penguins |> 
  mutate(species = case_when(species == "Adelie" ~ "Pygoscelis adeliae",
                             species == "Gentoo" ~ "Pygoscelis papua",
                             species == "Chinstrap" ~ "Pygoscelis antarcticus"))

latin_penguins

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<chr>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Pygoscelis adeliae,Torgersen,39.1,18.7,181,3750,male,2007
Pygoscelis adeliae,Torgersen,39.5,17.4,186,3800,female,2007
Pygoscelis adeliae,Torgersen,40.3,18.0,195,3250,female,2007
Pygoscelis adeliae,Torgersen,,,,,,2007
Pygoscelis adeliae,Torgersen,36.7,19.3,193,3450,female,2007
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
Pygoscelis antarcticus,Dream,55.8,19.8,207,4000,male,2009
Pygoscelis antarcticus,Dream,43.5,18.1,202,3400,female,2009
Pygoscelis antarcticus,Dream,49.6,18.2,193,3775,male,2009
Pygoscelis antarcticus,Dream,50.8,19.0,210,4100,male,2009
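
Notice that `species` changed from a factor (`<fct>`) to a character vector (`<chr>`). If you would rather keep it as a factor, one option is {forcats}'s `fct_recode` (part of the tidyverse). Here is a minimal sketch on a small made-up tibble; `toy_penguins` and `latin_toy` are hypothetical names, not part of the lecture data:

```r
library(tibble)
library(dplyr)
library(forcats)

# Hypothetical toy data standing in for the penguins data frame
toy_penguins <- tibble(species = factor(c("Adelie", "Gentoo", "Chinstrap")))

# fct_recode() takes "new name" = "old name" pairs and returns a factor
latin_toy <- toy_penguins |>
  mutate(species = fct_recode(species,
                              "Pygoscelis adeliae" = "Adelie",
                              "Pygoscelis papua" = "Gentoo",
                              "Pygoscelis antarcticus" = "Chinstrap"))

latin_toy
```

The column stays a `<fct>`, which matters if later steps (plot ordering, for example) rely on factor levels.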


### Let's visit some weirdness of `case_when`
We need to ensure that we understand the behavior to do it right. So let's create a dummy dataframe for demonstration.

In [46]:
a_df <- tibble(x = c(3, 1, 2, NA,5), 
               y = c("a", "b", NA, "c","d"))
a_df

x,y
<dbl>,<chr>
3.0,a
1.0,b
2.0,
,c
5.0,d


Let's replace `a` with `apple`, `b` with `ball`, `c` with `cat` and let's leave `d` as it is. 

In [47]:
a_df |> mutate(y = case_when(y == "a" ~ "apple",
                             y == "b" ~ "ball",
                             y == "c" ~ "cat"))

x,y
<dbl>,<chr>
3.0,apple
1.0,ball
2.0,
,cat
5.0,


Oh no... What happened to `d`? It was replaced with `NA`, because `case_when` returns `NA` for any value that matches none of the conditions.
Let's try again, this time specifying that any value not covered by the conditions should stay as it is.

In [48]:
a_df |> mutate(y = case_when(y == "a" ~ "apple",
                             y == "b" ~ "ball",
                             y == "c" ~ "cat",
                             TRUE ~ y))

x,y
<dbl>,<chr>
3.0,apple
1.0,ball
2.0,
,cat
5.0,d


Wonderful!

### Let's visit some weirdness of `NA`
What about selectively changing `NA` values? Let's work through a simple example where we want to use `case_when` to change the `NA` in the `x` column to a `0`.

Let's try what we did above:

In [49]:
a_df

x,y
<dbl>,<chr>
3.0,a
1.0,b
2.0,
,c
5.0,d


In [51]:
a_df |> 
  mutate(x = case_when(x == NA ~ 0,
                       TRUE ~ x))

x,y
<dbl>,<chr>
3.0,a
1.0,b
2.0,
,c
5.0,d


Well, that didn't work! It turns out that `NA`s are special: `x == NA` evaluates to `NA` rather than `TRUE`, so the condition never matches. We have to use `is.na` instead:

In [52]:
a_df |>
  mutate(x = case_when(is.na(x) ~ 0,
                       TRUE ~ x))

x,y
<dbl>,<chr>
3,a
1,b
2,
0,c
5,d
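
To see why `x == NA` never matched, it can help to try the comparison directly in base R. A minimal sketch (the vector `x` below just mirrors the `x` column of `a_df`):

```r
# Any comparison with NA is itself NA ("unknown"), never TRUE
NA == NA          # NA
3 == NA           # NA

x <- c(3, 1, 2, NA, 5)
x == NA           # NA NA NA NA NA
is.na(x)          # FALSE FALSE FALSE  TRUE FALSE
```

Because `case_when` only acts on conditions that are `TRUE`, a condition that is `NA` everywhere changes nothing.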


Okay, so that's `is.na()`. But is there a way to remove every row that contains an `NA`? Yes: you can use {tidyr}'s `drop_na()`. Because `NA` is special, a comparison such as `filter(x != NA)` will not do the job; inside `filter()` you would need `!is.na(x)`.

In [53]:
a_df |> drop_na()

x,y
<dbl>,<chr>
3,a
1,b
5,d


Great! Every row containing an `NA` is gone. In other words, we are left with the complete cases of this data frame. But what if I only want to drop rows that have an `NA` in particular columns?

In [54]:
## You can also specify multiple columns
a_df |> drop_na(x)

x,y
<dbl>,<chr>
3,a
1,b
2,
5,d
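
For comparison, the same rows can be kept with `filter()` as long as the condition uses `is.na()`. A small sketch, re-creating `a_df` so the block is self-contained (`filtered` and `dropped` are names chosen here for illustration):

```r
library(tibble)
library(dplyr)
library(tidyr)

a_df <- tibble(x = c(3, 1, 2, NA, 5),
               y = c("a", "b", NA, "c", "d"))

# Keep only the rows where x is not missing
filtered <- a_df |> filter(!is.na(x))

# drop_na(x) expresses the same intent more directly
dropped <- a_df |> drop_na(x)

filtered
dropped
```

Both approaches keep the row where only `y` is missing, because only `x` was checked.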


### Let's visit some weirdness of `factors`

Let's see whether the behaviour changes when `y` is a factor.

In [55]:
f_df <- tibble(x = c(3, 1, 2, NA,5), 
               y = as.factor(c("a", "b", NA, "c","d")))
f_df

x,y
<dbl>,<fct>
3.0,a
1.0,b
2.0,
,c
5.0,d


In [None]:
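# Why might we get an error here?
# case_when() needs every right-hand side to have a compatible type: "apple", "ball" and "cat"
# are strings, but `y` is a factor. Depending on your dplyr version, this mix is rejected.
# It runs okay if every case returns a string (in this example, replace `d` with `dog`).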
f_df |> mutate(y = case_when(y == "a" ~ "apple",
                             y == "b" ~ "ball",
                             y == "c" ~ "cat",
                            TRUE ~ y))
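
One way to sidestep the type mismatch, sketched below under the assumption that a character result is acceptable, is to convert the factor to character in the fallback case. `fixed_df` is a name introduced here for illustration:

```r
library(tibble)
library(dplyr)

f_df <- tibble(x = c(3, 1, 2, NA, 5),
               y = as.factor(c("a", "b", NA, "c", "d")))

# Every right-hand side is now a character vector, so the types agree
fixed_df <- f_df |>
  mutate(y = case_when(y == "a" ~ "apple",
                       y == "b" ~ "ball",
                       y == "c" ~ "cat",
                       TRUE ~ as.character(y)))

fixed_df
```

The `y` column comes back as `<chr>`; wrap the result in `as.factor()` if a factor is needed afterwards.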

## Theme 2: Iterating over groups of rows

Let's say we want to calculate the mean weight of chicks fed on each different diet. We can do this with {dplyr}'s `group_by` + `summarise`:

In [57]:
chickwts

weight,feed
<dbl>,<fct>
179,horsebean
160,horsebean
136,horsebean
227,horsebean
217,horsebean
⋮,⋮
359,casein
216,casein
222,casein
283,casein


In [58]:
# Try experimenting with other summary functions too: median, max, min, n
mean_weight <- chickwts |>
  group_by(feed) |> 
  summarise(mean_weight = mean(weight, na.rm = TRUE))
mean_weight

feed,mean_weight
<fct>,<dbl>
casein,323.5833
horsebean,160.2
linseed,218.75
meatmeal,276.9091
soybean,246.4286
sunflower,328.9167
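
Several summaries can be computed in one `summarise` call. The sketch below uses the same built-in `chickwts` data; the column names (`n_chicks`, `median_weight`, and so on) are just illustrative choices:

```r
library(dplyr)

feed_summary <- chickwts |>
  group_by(feed) |>
  summarise(n_chicks = n(),                               # rows in each group
            median_weight = median(weight, na.rm = TRUE),
            min_weight = min(weight, na.rm = TRUE),
            max_weight = max(weight, na.rm = TRUE))

feed_summary
```

Each summary function is applied once per feed group, so the result has one row per feed type.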


Again, watch out for `NA` here. Let's see an example:

In [59]:
library(palmerpenguins)
penguins

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181,3750,male,2007
Adelie,Torgersen,39.5,17.4,186,3800,female,2007
Adelie,Torgersen,40.3,18.0,195,3250,female,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193,3450,female,2007
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
Chinstrap,Dream,55.8,19.8,207,4000,male,2009
Chinstrap,Dream,43.5,18.1,202,3400,female,2009
Chinstrap,Dream,49.6,18.2,193,3775,male,2009
Chinstrap,Dream,50.8,19.0,210,4100,male,2009


In [60]:
penguins %>%
    group_by(species) %>%
    summarise(max_bill_length = max(bill_length_mm))

species,max_bill_length
<fct>,<dbl>
Adelie,
Chinstrap,58.0
Gentoo,


In [61]:
## Don't forget to handle NAs properly
penguins %>%
    group_by(species) %>%
    summarise(max_bill_length = max(bill_length_mm, na.rm = TRUE))

species,max_bill_length
<fct>,<dbl>
Adelie,46.0
Chinstrap,58.0
Gentoo,59.6
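
Before reaching for `na.rm = TRUE`, it can be worth checking how many values are missing in each group. A minimal sketch on made-up data (the tibble `toy` and its columns are hypothetical, not from the lecture):

```r
library(tibble)
library(dplyr)

# Hypothetical toy data with one missing measurement in group "a"
toy <- tibble(group = c("a", "a", "b", "b"),
              value = c(1.5, NA, 2.0, 4.0))

toy_summary <- toy |>
  group_by(group) |>
  summarise(n_missing = sum(is.na(value)),        # count the NAs per group
            max_value = max(value, na.rm = TRUE)) # ignore them for the max

toy_summary
```

Counting the `NA`s alongside the summary makes it clear how much data each result is actually based on.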


## Theme 3: Iterating over columns in a data frame

(a data frame is a list of columns, so this is equivalent to iterating over the elements of a list)

Let's look at the built-in `USJudgeRatings` data set, which has lawyers' ratings of state judges in the US Superior Court on their various courtroom attributes. 

What if we were interested in the median rating of each of these attributes, to see, for example, whether lawyers routinely rated certain attributes higher than others across all judges? We can use {purrr}'s `map_df` function to do this and get back the results as a tibble. 

In [63]:
USJudgeRatings

,CONT,INTG,DMNR,DILG,CFMG,DECI,PREP,FAMI,ORAL,WRIT,PHYS,RTEN
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
"AARONSON,L.H.",5.7,7.9,7.7,7.3,7.1,7.4,7.1,7.1,7.1,7.0,8.3,7.8
"ALEXANDER,J.M.",6.8,8.9,8.8,8.5,7.8,8.1,8.0,8.0,7.8,7.9,8.5,8.7
"ARMENTANO,A.J.",7.2,8.1,7.8,7.8,7.5,7.6,7.5,7.5,7.3,7.4,7.9,7.8
"BERDON,R.I.",6.8,8.8,8.5,8.8,8.3,8.5,8.7,8.7,8.4,8.5,8.8,8.7
"BRACKEN,J.J.",7.3,6.4,4.3,6.5,6.0,6.2,5.7,5.7,5.1,5.3,5.5,4.8
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
"TESTO,R.J.",8.3,7.3,7.0,6.8,7.0,7.1,6.7,6.7,6.7,6.7,8.0,7.0
"TIERNEY,W.L.JR.",8.3,8.2,7.8,8.3,8.4,8.3,7.7,7.6,7.5,7.7,8.1,7.9
"WALL,R.A.",9.0,7.0,5.9,7.0,7.0,7.2,6.9,6.9,6.5,6.6,7.6,6.6
"WRIGHT,D.B.",7.1,8.4,8.4,7.7,7.5,7.7,7.8,8.2,8.0,8.1,8.3,8.1


In [66]:
## Let's play with map_* and some functions
median_ratings <- map_df(USJudgeRatings, median, na.rm = TRUE)
median_ratings

CONT,INTG,DMNR,DILG,CFMG,DECI,PREP,FAMI,ORAL,WRIT,PHYS,RTEN
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
7.3,8.1,7.7,7.8,7.6,7.7,7.7,7.6,7.5,7.6,8.1,7.8
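
Other members of the `map_*` family return different output types. As a small sketch, `map_dbl` gives back a named numeric vector instead of a tibble; `median_vec` and `rating_range` are names made up for this example:

```r
library(purrr)

# map_dbl() applies median() to each column and returns a named numeric vector
median_vec <- map_dbl(USJudgeRatings, median, na.rm = TRUE)
median_vec

# Any function of a single column works, including one we write ourselves:
# here, the spread (max minus min) of each attribute's ratings
rating_range <- map_dbl(USJudgeRatings, function(col) max(col) - min(col))
rating_range
```

Choosing `map_dbl` over `map_df` is mostly a question of what the next step expects: a vector is handy for sorting or plotting, a tibble for further {dplyr} work.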


### What did we learn today?

- How to use `case_when` to selectively change values in a data frame (similar to base R `if` statements)

- How to use `group_by` to iterate over groups of rows (similar to `for` loops in base R)

- How to use {purrr} `map_*` functions to iterate over columns (similar to `apply` in base R)