## Lecture 7 theme

## Announcements
- Quiz: well done, the class average was 86%.
- Make sure you read the re-grading policy at https://ubc-mds.github.io/policies/#re-grading before approaching TAs with regrade requests.
- Early feedback on deliverables is provided so that you have time to reflect and make any necessary changes to your study pattern.
- Thursday, after our last lab: new programmers session from 4 PM. Let's work things out!
- Do you have a strategy to prepare for the quiz? For example cheat sheets, important links, etc.

There are 3 key themes to this lecture:

1. Using anonymous functions with {purrr} `map_*` functions

2. Nested data frames

3. Mapping with nested data frames

In [247]:
## loading necessary packages 
library(gapminder)
library(repurrrsive)
library(tidyverse)
library(infer)
options(repr.matrix.max.rows = 10)

## Theme 0: Knowledge check
### Clicker 1: Functions and lexical scoping

In [248]:
# x <- 20
# y <- 10
# z <- 5
# w <- 1
# sum <- function(x, y = 0) {
#   ( x - y ) - z 
#   }

# try({sum(30)})
# try({sum(x,y)})
# try({sum()})
# try({base::sum(x,y)})

### Clicker 2: Group_by & mutate

In [249]:
# # calculate the total grade for each student
# students <- data.frame(name = c("James","Diana","James","Diana","George","Diana"),
#            grade = c(10, 20, 30, 40, 50, 60))
# students
# students %>% 
#     group_by(____) %>% 
#     _____(total_grade = ____(grade))

### Clicker 3: map

In [250]:
# ## For students to experiment 
# students <- data.frame(name = c("James","Diana","James","Diana","George","Diana"),
#            grade = c(10, 20, 30, 40, 50, 60))

# What is the output of the following map operation?

# map_dbl(students, is.numeric)

# print("Option A :---")
# map_dbl(students, is.numeric)
# print("Option B :---")
# map_lgl(students, is.numeric)
# print("Option C :---")
# map(students, is.numeric)
# print("Option D :---")
# print("Error bad type")

## Theme 1: Using anonymous functions with {purrr} `map_*` functions

The function below takes `x` as an argument and adds one to it. The function definition is wrapped in parentheses, and so is the value being passed to the anonymous function.

In [257]:
(function(x) 1 + x)(1)

For comparison, here is the same logic written as a named function:

In [258]:
add_one <- function(x){
    x + 1
    }
add_one(1)
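
To connect the two cells above, here is a minimal sketch showing that the anonymous and named versions behave identically, and that an anonymous function can be passed directly to a `map_*` call. The input vector `c(1, 2, 3)` and the object names are made up for illustration.

```r
library(purrr)

# Named helper, as in the cell above
add_one <- function(x) {
  x + 1
}

# Anonymous function defined and called immediately
immediate <- (function(x) x + 1)(1)

# The same anonymous function handed straight to map_dbl()
mapped_anonymous <- map_dbl(c(1, 2, 3), function(x) x + 1)

# The named function gives the same result
mapped_named <- map_dbl(c(1, 2, 3), add_one)

print(immediate)
print(mapped_anonymous)
print(identical(mapped_anonymous, mapped_named))
```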

Let's now use anonymous functions within {purrr} `map_*` calls. As a working example, imagine a very wide data frame with a data entry error that occurs in several columns: we want to map `str_replace` over all the columns to replace every instance of `"Cdn"` with `"Canadian"`. Let's first make a small dummy table:

In [259]:
data_entry <- tibble(id = c("25323", "45234", "23471"),
                    birth_citizenship = c("Canadian", "American", "Cdn"),
                    current_citizenship = c("Canadian", "Vietnamese", "Cdn"))
data_entry

id,birth_citizenship,current_citizenship
<chr>,<chr>,<chr>
25323,Canadian,Canadian
45234,American,Vietnamese
23471,Cdn,Cdn


Before writing an anonymous function, ask: do we actually need one for this problem?

No. You can take advantage of `...` instead: `map_*` functions pass any extra arguments along to the function you are mapping. We will see more complicated cases, where an anonymous function is needed, later in the lecture and in the lab. These cases arise when you work with nested data frames (which is, in fact, our next theme).

In [260]:
# map_df(df,sum,na.rm=TRUE)

map_df(data_entry, str_replace, pattern = "Cdn", replacement = "Canadian")

id,birth_citizenship,current_citizenship
<chr>,<chr>,<chr>
25323,Canadian,Canadian
45234,American,Vietnamese
23471,Canadian,Canadian
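
As a small illustration of how `...` is forwarded, the sketch below passes `na.rm = TRUE` through `map_dbl` to `sum`, in the spirit of the commented-out line in the cell above. The list `scores` is made up for illustration and is not part of the lecture data.

```r
library(purrr)

# Hypothetical toy input with one missing value
scores <- list(a = c(1, NA, 3), b = c(4, 5, 6))

# Without na.rm, the NA propagates into the sum for `a`
totals_default <- map_dbl(scores, sum)

# na.rm = TRUE is not an argument of map_dbl() itself;
# it travels through `...` and is handed to sum() on every call
totals_na_rm <- map_dbl(scores, sum, na.rm = TRUE)

print(totals_default)
print(totals_na_rm)
```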


To illustrate the syntax, let's solve the same problem with an anonymous function.

Using verbose anonymous function syntax:

In [261]:
map_df(data_entry, function(vect) str_replace(vect, pattern = "Cdn", replacement = "Canadian"))

id,birth_citizenship,current_citizenship
<chr>,<chr>,<chr>
25323,Canadian,Canadian
45234,American,Vietnamese
23471,Canadian,Canadian


Using shorthand anonymous function syntax:

In [262]:
map_df(data_entry, ~ str_replace(.x, pattern = "Cdn", replacement = "Canadian"))
# map_df(data_entry, ~str_replace(., pattern = "Cdn", replacement = "Canadian"))

id,birth_citizenship,current_citizenship
<chr>,<chr>,<chr>
25323,Canadian,Canadian
45234,American,Vietnamese
23471,Canadian,Canadian
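
As a quick check, the three approaches should be interchangeable. The sketch below uses `map` on a plain list (`citizenship`, a hypothetical stand-in for the columns of `data_entry`) and compares the results with `identical`.

```r
library(purrr)
library(stringr)

# Hypothetical stand-in for the character columns of data_entry
citizenship <- list(
  birth = c("Canadian", "American", "Cdn"),
  current = c("Canadian", "Vietnamese", "Cdn")
)

# 1. Extra arguments forwarded through `...`
via_dots <- map(citizenship, str_replace,
                pattern = "Cdn", replacement = "Canadian")

# 2. Verbose anonymous function
via_function <- map(citizenship, function(vect) {
  str_replace(vect, pattern = "Cdn", replacement = "Canadian")
})

# 3. Shorthand (formula) anonymous function
via_formula <- map(citizenship,
                   ~ str_replace(.x, pattern = "Cdn", replacement = "Canadian"))

print(identical(via_dots, via_function))
print(identical(via_dots, via_formula))
```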


## Theme 2: Nested data frames

In [263]:
# create a nested data frame DSCI 552
# bootstrap a confidence interval for a statistic of lifeExp
gap_lifeExp_ci <- function(df, statistic) {
    df %>% 
        specify(response = lifeExp) %>% 
        generate(reps = 1000, type = "bootstrap") %>% 
        calculate(stat = statistic) %>% 
        get_ci()
}

# nest by country, then add the mean (atomic column) and the CI (list column)
by_country <- gapminder %>%
    group_by(continent, country) %>%
    nest() %>% 
    mutate(mean_life_exp = map_dbl(data, ~ mean(.$lifeExp)), 
           life_exp_ci = map(data, ~ gap_lifeExp_ci(., "mean")))
print(by_country)

# A tibble: 142 × 5
# Groups:   continent, country [142]
   country     continent data              mean_life_exp life_exp_ci
   <fct>       <fct>     <list>                    <dbl> <list>
 1 Afghanistan Asia      <tibble [12 × 4]>          37.5 <tibble [1 × 2]>
 2 Albania     Europe    <tibble [12 × 4]>          68.4 <tibble [1 × 2]>
 3 Algeria     Africa    <tibble [12 × 4]>          59.0 <tibble [1 × 2]>
 4 Angola      Africa    <tibble [12 × 4]>          37.9 <tibble [1 × 2]>
 5 Argentina   Americas  <tibble [12 × 4]>          69.1 <tibble [1 × 2]>
 6 Australia   Oceania   <tibble [12 × 4]>          74.7 <tibble [1 × 2]>
 7 Austria     Europe    <tibble [12 × 4]

### List column workflow:

1. Create a list column using function `nest`

2. Create other intermediate list-columns by transforming existing list columns with `map`

3. Simplify the list-column back down to a data frame or atomic vector, often with `unnest`, or with `mutate` + `map_*` functions that return atomic vectors rather than lists.

#### 1. List-columns

To create a nested data frame we start with a **grouped** data frame, and “nest” it:

NB: Please use `print()` to display a nested data frame. Otherwise, Jupyter does not print it nicely (RStudio does).

In [264]:
# create a nested data frame
by_country <- gapminder %>% 
    group_by(continent, country) %>% 
    nest()
print(by_country)

# A tibble: 142 × 3
# Groups:   continent, country [142]
   country     continent data
   <fct>       <fct>     <list>
 1 Afghanistan Asia      <tibble [12 × 4]>
 2 Albania     Europe    <tibble [12 × 4]>
 3 Algeria     Africa    <tibble [12 × 4]>
 4 Angola      Africa    <tibble [12 × 4]>
 5 Argentina   Americas  <tibble [12 × 4]>
 6 Australia   Oceania   <tibble [12 × 4]>
 7 Austria     Europe    <tibble [12 × 4]>
 8 Bahrain     Asia      <tibble [12 × 4]>
 9 Bangladesh  Asia      <tibble [12 × 4]>
10 Belgium     Europe    <tibble [12 × 4]>
# … with 132 more rows


What is the `data` column here? 

In [265]:
by_country$data[[1]]
by_country$data[[2]]

year,lifeExp,pop,gdpPercap
<int>,<dbl>,<int>,<dbl>
1952,28.801,8425333,779.4453
1957,30.332,9240934,820.8530
1962,31.997,10267083,853.1007
1967,34.020,11537966,836.1971
1972,36.088,13079460,739.9811
⋮,⋮,⋮,⋮
1987,40.822,13867957,852.3959
1992,41.674,16317921,649.3414
1997,41.763,22227415,635.3414
2002,42.129,25268405,726.7341


year,lifeExp,pop,gdpPercap
<int>,<dbl>,<int>,<dbl>
1952,55.23,1282697,1601.056
1957,59.28,1476505,1942.284
1962,64.82,1728137,2312.889
1967,66.22,1984060,2760.197
1972,67.69,2263554,3313.422
⋮,⋮,⋮,⋮
1987,72.000,3075321,3738.933
1992,71.581,3326498,2497.438
1997,72.950,3428038,3193.055
2002,75.651,3508512,4604.212
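
To see the same idea on something small enough to read in full, here is a sketch that nests the `students` data from the knowledge check by `name` and pulls out one element of the resulting list-column. The object names `students_nested` and `diana_rows` are illustrative only.

```r
library(dplyr)
library(tidyr)

# Toy data from the knowledge check
students <- tibble(
  name = c("James", "Diana", "James", "Diana", "George", "Diana"),
  grade = c(10, 20, 30, 40, 50, 60)
)

# One row per student; the remaining columns are packed into `data`
students_nested <- students %>%
  group_by(name) %>%
  nest()

print(students_nested)

# `[[` returns the tibble stored in a single cell of the list-column
diana_rows <- students_nested$data[[which(students_nested$name == "Diana")]]
print(diana_rows)
```

Each element of `data` is an ordinary tibble holding only the non-grouping columns, which is why the gapminder tibbles above have four columns instead of six.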


Now let's explore how we can create other intermediate list-columns by transforming existing columns with `map`.

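#### 2. Create other intermediate list-columns with `map`

We'd like to compute the mean life expectancy for each country and store it in a new column. Because `mean` returns a single number for each nested tibble, we use `map_dbl` inside `mutate`, which gives a regular double column rather than a list-column: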

In [266]:
by_country <- gapminder %>%
    group_by(continent, country) %>%
    nest() %>%
    mutate(mean_life_exp = map_dbl(data, ~ mean(.$lifeExp))) %>%
    arrange(country)
print(by_country)

# A tibble: 142 × 4
# Groups:   continent, country [142]
   country     continent data              mean_life_exp
   <fct>       <fct>     <list>                    <dbl>
 1 Afghanistan Asia      <tibble [12 × 4]>          37.5
 2 Albania     Europe    <tibble [12 × 4]>          68.4
 3 Algeria     Africa    <tibble [12 × 4]>          59.0
 4 Angola      Africa    <tibble [12 × 4]>          37.9
 5 Argentina   Americas  <tibble [12 × 4]>          69.1
 6 Australia   Oceania   <tibble [12 × 4]>          74.7
 7 Austria     Europe    <tibble [12 × 4]>          73.1
 8 Bahrain     Asia      <tibble [12 × 4]>          65.6
 9 Bangladesh  Asia      <tibble [12 × 4]>          49.8
10 Belgium     Europe

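The choice between `map` and `map_dbl` decides what kind of column you get. A minimal sketch, assuming the small `students` data from the knowledge check rather than gapminder (the column names `mean_as_list` and `mean_as_dbl` are made up for illustration):

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy data from the knowledge check, nested by student
students_nested <- tibble(
  name = c("James", "Diana", "James", "Diana", "George", "Diana"),
  grade = c(10, 20, 30, 40, 50, 60)
) %>%
  group_by(name) %>%
  nest()

with_means <- students_nested %>%
  mutate(
    mean_as_list = map(data, ~ mean(.x$grade)),     # map() always returns a list
    mean_as_dbl = map_dbl(data, ~ mean(.x$grade))   # map_dbl() returns a double vector
  )

print(with_means)
```
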
Now we'd like to apply the `gap_lifeExp_ci` function to each tibble in the `data` list column to obtain another list column containing the confidence interval tibbles. We can use `mutate` + `map` to do this:

In [267]:
# bootstrap a confidence interval for a statistic of lifeExp
gap_lifeExp_ci <- function(df, statistic) {
    df %>% 
        specify(response = lifeExp) %>% 
        generate(reps = 1000, type = "bootstrap") %>% 
        calculate(stat = statistic) %>% 
        get_ci()
}

# nest by country, then add the mean (atomic column) and the CI (list column)
by_country <- gapminder %>%
    group_by(continent, country) %>%
    nest() %>% 
    mutate(mean_life_exp = map_dbl(data, ~ mean(.$lifeExp)), 
           life_exp_ci = map(data, ~ gap_lifeExp_ci(., "mean")))
print(by_country)

# A tibble: 142 × 5
# Groups:   continent, country [142]
   country     continent data              mean_life_exp life_exp_ci
   <fct>       <fct>     <list>                    <dbl> <list>
 1 Afghanistan Asia      <tibble [12 × 4]>          37.5 <tibble [1 × 2]>
 2 Albania     Europe    <tibble [12 × 4]>          68.4 <tibble [1 × 2]>
 3 Algeria     Africa    <tibble [12 × 4]>          59.0 <tibble [1 × 2]>
 4 Angola      Africa    <tibble [12 × 4]>          37.9 <tibble [1 × 2]>
 5 Argentina   Americas  <tibble [12 × 4]>          69.1 <tibble [1 × 2]>
 6 Australia   Oceania   <tibble [12 × 4]>          74.7 <tibble [1 × 2]>
 7 Austria     Europe    <tibble [12 × 4]

#### 3. Simplify the list-column back down to a data frame or atomic vector

After we create some other intermediate list-columns with `map`, we usually want to get some values back as regular atomic vector columns in our data frame, for visualization, further analysis, or reporting. 

We will first demonstrate how to do this using `unnest` in our example to convert the `life_exp_ci` list column into two columns, one for the lower bound of the confidence interval and one for the upper bound:

In [268]:
# unnest the ci column
by_country %>% 
    unnest(life_exp_ci) %>% 
    print()

# A tibble: 142 × 6
# Groups:   continent, country [142]
   country     continent data              mean_life_exp lower_ci upper_ci
   <fct>       <fct>     <list>                    <dbl>    <dbl>    <dbl>
 1 Afghanistan Asia      <tibble [12 × 4]>          37.5     34.7     40.2
 2 Albania     Europe    <tibble [12 × 4]>          68.4     64.9     71.6
 3 Algeria     Africa    <tibble [12 × 4]>          59.0     53.0     64.5
 4 Angola      Africa    <tibble [12 × 4]>          37.9     35.7     39.9
 5 Argentina   Americas  <tibble [12 × 4]>          69.1     66.8     71.2
 6 Australia   Oceania   <tibble [12 × 4]>          74.7     72.5     76.9
 7 Austria     Europe    <tibble [12 × 4]>          73.1     70.8

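Both simplification routes from the workflow can be sketched on toy data. The helper `grade_limits()` below is hypothetical: it stands in for `gap_lifeExp_ci` by returning a one-row tibble, but without any bootstrapping, so the result is deterministic. The `students` data is the one from the knowledge check.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical helper: returns a one-row tibble, like the CI tibbles above
grade_limits <- function(df) {
  tibble(lowest = min(df$grade), highest = max(df$grade))
}

students_summary <- tibble(
  name = c("James", "Diana", "James", "Diana", "George", "Diana"),
  grade = c(10, 20, 30, 40, 50, 60)
) %>%
  group_by(name) %>%
  nest() %>%
  mutate(limits = map(data, grade_limits))   # list-column of one-row tibbles

# Route 1: unnest() turns the one-row tibbles into regular columns
unnested <- students_summary %>%
  unnest(limits)

# Route 2: pull a single value out of each tibble with map_dbl()
extracted <- students_summary %>%
  mutate(highest = map_dbl(limits, ~ .x$highest))

print(unnested)
print(extracted)
```
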
## What did we learn:

- how to write anonymous functions
- how to use {purrr} `map_*` with anonymous functions to add extra arguments
- what nested data frames are
- how to use {tidyr}'s `nest` & `unnest` and {purrr} `map_*` functions to nest, modify and unnest data frames