# Worksheet A-4: Tidy Data & The Model-Fitting Paradigm in R

By the end of this worksheet, you will be able to:

(From tidyr)

- convert a dataset between the 'long' and 'wide' formats, using `tidyr::pivot_longer()` and `tidyr::pivot_wider()`
- assess which format is best suited for each type of analysis
- deal with missing data in a tibble

(From modelling)

- make a model object in R, using `lm()` as an example.
- write a formula in R.
- predict on a model object with the `broom::augment()` and `predict()` functions.
- extract information from a model object using `broom::tidy()`, `broom::glance()`, and traditional means.

To get full marks for this worksheet, you must successfully answer at least 10 of the autograded questions. There are 15 questions in total.

## Getting Started

Load the requirements for this worksheet:

In [1]:
library(testthat)
library(digest)
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(broom))
suppressPackageStartupMessages(library(gapminder))
lotr  <- suppressMessages(read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/lotr_tidy.csv"))
guest <- suppressMessages(read_csv("https://raw.githubusercontent.com/STAT545-UBC/Classroom/master/data/wedding/attend.csv"))

The following code chunk has been unlocked, to give you the flexibility to start this document with some of your own code. Remember, it's bad manners to keep a call to `install.packages()` in your source code, so don't forget to delete these lines if you ever need to run them.

In [4]:
# An unlocked code cell.

Film,Race,Gender,Words
<chr>,<chr>,<chr>,<dbl>
The Fellowship Of The Ring,Elf,Female,1229
The Fellowship Of The Ring,Hobbit,Female,14
The Fellowship Of The Ring,Man,Female,0
The Two Towers,Elf,Female,331
The Two Towers,Hobbit,Female,0
The Two Towers,Man,Female,401
The Return Of The King,Elf,Female,183
The Return Of The King,Hobbit,Female,2
The Return Of The King,Man,Female,268
The Fellowship Of The Ring,Elf,Male,971


# Part 1: Tidy Data with Univariate Pivoting

Consider the Lord of the Rings data. Run the code cell below to see the first few lines of the tibble.

In [7]:
print(lotr, n = 5)

# A tibble: 18 x 4
  Film                       Race   Gender Words
  <chr>                      <chr>  <chr>  <dbl>
1 The Fellowship Of The Ring Elf    Female  1229
2 The Fellowship Of The Ring Hobbit Female    14
3 The Fellowship Of The Ring Man    Female     0
4 The Two Towers             Elf    Female   331
5 The Two Towers             Hobbit Female     0
# ... with 13 more rows


## Question 1.1
Widen the data so that we see the words spoken by each race, by giving each race its own column. Store the answer in `answer1.1`.

```
(answer1.1 <- lotr %>%
    FILL_THIS_IN(id_cols = c(-FILL_THIS_IN, -FILL_THIS_IN), 
                names_from = FILL_THIS_IN,
                values_from = FILL_THIS_IN))
```

Your `answer1.1` should look something like this (full tibble not always shown):

![answer1.1](https://github.com/UBC-STAT/stat545.stat.ubc.ca/raw/master/content/data/worksheet_04a/answer1.1.png)

_Sidenote:_ Wrapping a variable assignment in parentheses not only assigns the value to the variable, but also prints it to the console. Normally, when you assign a variable, you do not get to see its value. This is a helpful tip so you can see what you are storing!
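
As a minimal sketch of the sidenote above (the variable names `x` and `y` are made up for illustration), compare a plain assignment with a parenthesized one:

```r
# Plain assignment: the value is stored, but nothing is printed.
x <- c(1, 2, 3)

# Wrapped in parentheses: the value is stored and also printed.
(y <- x * 2)
```

Both lines create a variable; only the second one echoes its value to the console.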

In [6]:
(answer1.1 <- lotr %>%
    pivot_wider(id_cols = c(-Race, -Words), 
                names_from = Race,
                values_from = Words))
print(answer1.1)
arrange_all(answer1.1)

Film,Gender,Elf,Hobbit,Man
<chr>,<chr>,<dbl>,<dbl>,<dbl>
The Fellowship Of The Ring,Female,1229,14,0
The Two Towers,Female,331,0,401
The Return Of The King,Female,183,2,268
The Fellowship Of The Ring,Male,971,3644,1995
The Two Towers,Male,513,2463,3589
The Return Of The King,Male,510,2673,2459


# A tibble: 6 x 5
  Film                       Gender   Elf Hobbit   Man
  <chr>                      <chr>  <dbl>  <dbl> <dbl>
1 The Fellowship Of The Ring Female  1229     14     0
2 The Two Towers             Female   331      0   401
3 The Return Of The King     Female   183      2   268
4 The Fellowship Of The Ring Male     971   3644  1995
5 The Two Towers             Male     513   2463  3589
6 The Return Of The King     Male     510   2673  2459


Film,Gender,Elf,Hobbit,Man
<chr>,<chr>,<dbl>,<dbl>,<dbl>
The Fellowship Of The Ring,Female,1229,14,0
The Fellowship Of The Ring,Male,971,3644,1995
The Return Of The King,Female,183,2,268
The Return Of The King,Male,510,2673,2459
The Two Towers,Female,331,0,401
The Two Towers,Male,513,2463,3589


In [8]:
test_that("Question 1.1", {
    expect_true(all(c("Film", "Gender", "Elf", "Hobbit", "Man") %in% names(answer1.1)))
    expect_equal(nrow(answer1.1), 6L)
})

Test passed


## Question 1.2
Re-lengthen the wide `lotr` data (i.e. `answer1.1`) from Question 1.1 above. Store your answer in `answer1.2`.

**Hint:** the resulting data frame should be _almost the same_ as the original! (No need to reorder the columns.)
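
To see what "almost the same" can mean, here is a small sketch on a made-up tibble (the data and the names `long`, `wide`, and `back` are assumptions for illustration, not the `lotr` data). Widening and then re-lengthening returns the same values, but the columns may come back in a different order:

```r
library(tidyr)
library(tibble)

# Hypothetical toy data in long format.
long <- tibble(
  film   = c("F1", "F1", "F1", "F1"),
  race   = c("Elf", "Man", "Elf", "Man"),
  gender = c("Female", "Female", "Male", "Male"),
  words  = c(10, 20, 30, 40)
)

# Widen: one column per race.
wide <- pivot_wider(long, names_from = race, values_from = words)

# Re-lengthen: gather the race columns back into name/value pairs.
back <- pivot_longer(wide, cols = c(Elf, Man),
                     names_to = "race", values_to = "words")

names(long)  # race comes before gender
names(back)  # gender comes before race
```

The contents agree once the columns are put in the same order, for example with `back[names(long)]`.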

```
(answer1.2 <- answer1.1 %>% 
  FILL_THIS_IN(cols = c(-FILL_THIS_IN, -FILL_THIS_IN), 
               names_to  = FILL_THIS_IN, 
               values_to = FILL_THIS_IN))
```

Your `answer1.2` should look something like this (full tibble not always shown):

![answer1.2](https://github.com/UBC-STAT/stat545.stat.ubc.ca/raw/master/content/data/worksheet_04a/answer1.2.png)

In [13]:
(answer1.2 <- answer1.1 %>% 
  pivot_longer(cols = c(-Film, -Gender), 
               names_to  = "Race", 
               values_to = "Words"))
print(answer1.2)

Film,Gender,Race,Words
<chr>,<chr>,<chr>,<dbl>
The Fellowship Of The Ring,Female,Elf,1229
The Fellowship Of The Ring,Female,Hobbit,14
The Fellowship Of The Ring,Female,Man,0
The Two Towers,Female,Elf,331
The Two Towers,Female,Hobbit,0
The Two Towers,Female,Man,401
The Return Of The King,Female,Elf,183
The Return Of The King,Female,Hobbit,2
The Return Of The King,Female,Man,268
The Fellowship Of The Ring,Male,Elf,971


# A tibble: 18 x 4
   Film                       Gender Race   Words
   <chr>                      <chr>  <chr>  <dbl>
 1 The Fellowship Of The Ring Female Elf     1229
 2 The Fellowship Of The Ring Female Hobbit    14
 3 The Fellowship Of The Ring Female Man        0
 4 The Two Towers             Female Elf      331
 5 The Two Towers             Female Hobbit     0
 6 The Two Towers             Female Man      401
 7 The Return Of The King     Female Elf      183
 8 The Return Of The King     Female Hobbit     2
 9 The Return Of The King     Female Man      268
10 The Fellowship Of The Ring Male   Elf      971
11 The Fellowship Of The Ring Male   Hobbit  3644
12 The Fellowship Of The Ring Male   Man     1995
13 The Two Towers             Male   E

In [14]:
test_that("Question 1.2", {
    expect_true(all(c("Film", "Gender", "Race", "Words") %in% names(answer1.2)))
    expect_equal(nrow(answer1.2), 18L)
})

Test passed


## Question 1.3

Using the `gapminder` dataset: what's the relationship between Canada's GDP per capita and the United Kingdom's? First, produce a tidy tibble from the `gapminder` tibble to address this question. Store your tibble in a variable named `answer1.3`. Do not rename any columns.

_Food for thought_: After tidying the data for this problem, we should be able to make a scatterplot of Canada's GDP per capita against the UK's. But, doing so for the original `gapminder` dataset would be difficult. 
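
A rough sketch of why the wide shape helps, using invented numbers for two made-up countries `A` and `B` (none of this is `gapminder` data): once each country has its own column, the two series line up row by row and can be compared or plotted directly.

```r
library(tidyr)
library(tibble)

# Hypothetical values: one row per country-year.
long <- tibble(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(2000, 2005, 2010), times = 2),
  gdp     = c(1, 2, 3, 2, 4, 6)
)

# One row per year, one column per country.
wide <- pivot_wider(long, id_cols = year,
                    names_from = country, values_from = gdp)

# The two countries are now paired by year, so a direct comparison is easy.
cor(wide$A, wide$B)
# A scatterplot would follow the same idea, e.g.
# ggplot2::ggplot(wide, ggplot2::aes(A, B)) + ggplot2::geom_point()
```

In the long format, the same comparison would first require matching each country's rows by year.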

```
answer1.3 <- gapminder %>% 
  filter(FILL_THIS_IN) %>% 
  pivot_FILL_THIS_IN(FILL_THIS_IN)
```

In [27]:
answer1.3 <- gapminder %>% 
  filter(country %in% c("United Kingdom", "Canada")) %>% 
  pivot_wider(id_cols = year, names_from = country, values_from = gdpPercap)
print(answer1.3)

# A tibble: 12 x 3
    year Canada `United Kingdom`
   <int>  <dbl>            <dbl>
 1  1952 11367.            9980.
 2  1957 12490.           11283.
 3  1962 13462.           12477.
 4  1967 16077.           14143.
 5  1972 18971.           15895.
 6  1977 22091.           17429.
 7  1982 22899.           18232.
 8  1987 26627.           21665.
 9  1992 26343.           22705.
10  1997 28955.

In [105]:
test_that("Question 1.3", {
    expect_true("Canada" %in% names(answer1.3))
    expect_true("United Kingdom" %in% names(answer1.3))
    expect_equal(nrow(answer1.3), 12L)
})

Test passed


# Part 2: Tidy Data with Multivariate Pivoting

Congratulations, you’re getting married! In addition to the wedding, you’ve decided to hold two other events: a day-of brunch and a day-before round of golf. You’ve made a guestlist of attendance so far, along with food preference for the food events (wedding and brunch).

Run the code cell below to see the first few rows of the `guest` data frame.

In [29]:
head(guest)

party,name,meal_wedding,meal_brunch,attendance_wedding,attendance_brunch,attendance_golf
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,Sommer Medrano,PENDING,PENDING,PENDING,PENDING,PENDING
1,Phillip Medrano,vegetarian,Menu C,CONFIRMED,CONFIRMED,CONFIRMED
1,Blanka Medrano,chicken,Menu A,CONFIRMED,CONFIRMED,CONFIRMED
1,Emaan Medrano,PENDING,PENDING,PENDING,PENDING,PENDING
2,Blair Park,chicken,Menu C,CONFIRMED,CONFIRMED,CONFIRMED
2,Nigel Webb,,,CANCELLED,CANCELLED,CANCELLED


## Question 2.1
Put `meal` and `attendance` as their own columns, with the events living in a new column. Store your answer in `answer2.1`.

```
(answer2.1 <- guest %>% 
  FILL_THIS_IN(cols      = c(-FILL_THIS_IN, -FILL_THIS_IN), 
               names_to  = c(FILL_THIS_IN, FILL_THIS_IN),
               names_sep = FILL_THIS_IN))
```               

**Hint**: Read the possible values for `names_to` in the corresponding documentation of the function you choose!
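
As a hedged illustration of the special `".value"` entry that the `names_to` documentation describes, here is a toy example (the tibble and its column names are invented). The part of each column name matched by `".value"` becomes an output column, and the other part becomes a value in the new `event` column:

```r
library(tidyr)
library(tibble)

# Hypothetical data with column names of the form <variable>_<event>.
wide <- tibble(
  name         = c("Ann", "Bob"),
  meal_lunch   = c("soup", "salad"),
  meal_dinner  = c("fish", "pasta"),
  score_lunch  = c(1, 2),
  score_dinner = c(3, 4)
)

long <- pivot_longer(
  wide,
  cols      = -name,
  names_to  = c(".value", "event"),  # first piece -> column, second -> value
  names_sep = "_"
)

names(long)
```

Each person now has one row per event, with `meal` and `score` as ordinary columns.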

Your `answer2.1` should look something like this (full tibble not always shown):

![answer2.1](https://github.com/UBC-STAT/stat545.stat.ubc.ca/raw/master/content/data/worksheet_04a/answer2.1.png)

In [103]:
(answer2.1 <- guest %>% 
  pivot_longer(cols      = c(-party, -name), 
               names_to  = c(".value", "event"),
               names_sep = "_"))
print(answer2.1)

party,name,event,meal,attendance
<dbl>,<chr>,<chr>,<chr>,<chr>
1,Sommer Medrano,wedding,PENDING,PENDING
1,Sommer Medrano,brunch,PENDING,PENDING
1,Sommer Medrano,golf,,PENDING
1,Phillip Medrano,wedding,vegetarian,CONFIRMED
1,Phillip Medrano,brunch,Menu C,CONFIRMED
1,Phillip Medrano,golf,,CONFIRMED
1,Blanka Medrano,wedding,chicken,CONFIRMED
1,Blanka Medrano,brunch,Menu A,CONFIRMED
1,Blanka Medrano,golf,,CONFIRMED
1,Emaan Medrano,wedding,PENDING,PENDING


# A tibble: 90 x 5
   party name            event   meal       attendance
   <dbl> <chr>           <chr>   <chr>      <chr>
 1     1 Sommer Medrano  wedding PENDING    PENDING
 2     1 Sommer Medrano  brunch  PENDING    PENDING
 3     1 Sommer Medrano  golf    NA         PENDING
 4     1 Phillip Medrano wedding vegetarian CONFIRMED
 5     1 Phillip Medrano brunch  Menu C     CONFIRMED
 6     1 Phillip Medrano golf    NA         CONFIRMED
 7     1 Blanka Medrano  wedding chicken    CONFIRMED
 8     1 Blanka Medrano  brunch  Menu A     CONFIRMED
 9     1 Blanka Medrano  golf    NA         CONFIRMED
10     1 Emaan Medrano   wedding PENDING    PENDING
# ... with 80 more rows


In [104]:
test_that("Question 2.1", {
    expect_true(all(c("party", "name", "event", "meal", "attendance") %in% names(answer2.1)))
    expect_equal(nrow(answer2.1), 90L)
})

Test passed


## Question 2.2
Use `tidyr::separate()` to split the `name` in `answer2.1` into two columns: `first_name` and `last_name`. Store your answer in `answer2.2`.

```
(answer2.2 <- answer2.1 %>% 
  FILL_THIS_IN(FILL_THIS_IN, into = c(FILL_THIS_IN, FILL_THIS_IN), sep=FILL_THIS_IN))
```  

Your `answer2.2` should look something like this (full tibble not always shown):

![answer2.2](https://github.com/UBC-STAT/stat545.stat.ubc.ca/raw/master/content/data/worksheet_04a/answer2.2.png)

In [101]:
(answer2.2 <- answer2.1 %>% 
  separate(name, into = c("first_name", "last_name"), sep=" "))
print(answer2.2)

party,first_name,last_name,event,meal,attendance
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
1,Sommer,Medrano,wedding,PENDING,PENDING
1,Sommer,Medrano,brunch,PENDING,PENDING
1,Sommer,Medrano,golf,,PENDING
1,Phillip,Medrano,wedding,vegetarian,CONFIRMED
1,Phillip,Medrano,brunch,Menu C,CONFIRMED
1,Phillip,Medrano,golf,,CONFIRMED
1,Blanka,Medrano,wedding,chicken,CONFIRMED
1,Blanka,Medrano,brunch,Menu A,CONFIRMED
1,Blanka,Medrano,golf,,CONFIRMED
1,Emaan,Medrano,wedding,PENDING,PENDING


# A tibble: 90 x 6
   party first_name last_name event   meal       attendance
   <dbl> <chr>      <chr>     <chr>   <chr>      <chr>
 1     1 Sommer     Medrano   wedding PENDING    PENDING
 2     1 Sommer     Medrano   brunch  PENDING    PENDING
 3     1 Sommer     Medrano   golf    NA         PENDING
 4     1 Phillip    Medrano   wedding vegetarian CONFIRMED
 5     1 Phillip    Medrano   brunch  Menu C     CONFIRMED
 6     1 Phillip    Medrano   golf    NA         CONFIRMED
 7     1 Blanka     Medrano   wedding chicken    CONFIRMED
 8     1 Blanka     Medrano   brunch  Menu A     CONFIRMED
 9     1 Blanka     Medrano   golf    NA         CONFIRMED
10     1 Emaan      Medrano   wedding PENDING    PENDING

In [102]:
test_that("Question 2.2", {
    expect_true(all(c("party", "first_name", "last_name", "event", "meal", "attendance") %in% names(answer2.2)))
    expect_equal(nrow(answer2.2), 90L)
})

Test passed


## Question 2.3
Re-unite `first_name` and `last_name` in `answer2.2` back into `name` using `tidyr::unite()`. Store your answer in `answer2.3`.
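
The sketch below shows `separate()` and `unite()` as inverse operations on a made-up tibble (the names in `people` are invented for illustration). Note that the original strings are recovered only if `unite()` uses the same separator that `separate()` split on:

```r
library(tidyr)
library(tibble)

# Hypothetical data: one full name per row.
people <- tibble(name = c("Ada Lovelace", "Alan Turing"))

# Split on the space into two columns.
split_names <- separate(people, name,
                        into = c("first_name", "last_name"), sep = " ")

# Paste the two columns back together with the same separator.
rejoined <- unite(split_names, col = "name",
                  first_name, last_name, sep = " ")

identical(rejoined$name, people$name)
```

Choosing a different `sep` in `unite()` still produces a single `name` column, but its values will differ from the originals.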

```
(answer2.3 <- answer2.2 %>%
    FILL_THIS_IN(col = FILL_THIS_IN, c(FILL_THIS_IN, FILL_THIS_IN), sep = FILL_THIS_IN))
```    

Your `answer2.3` should look something like this (full tibble not always shown):

![answer2.3](https://github.com/UBC-STAT/stat545.stat.ubc.ca/raw/master/content/data/worksheet_04a/answer2.3.png)

In [98]:
(answer2.3 <- answer2.2 %>%
    unite(col = "name", c(first_name, last_name), sep = "_" ))
print(answer2.3)

party,name,event,meal,attendance
<dbl>,<chr>,<chr>,<chr>,<chr>
1,Sommer_Medrano,wedding,PENDING,PENDING
1,Sommer_Medrano,brunch,PENDING,PENDING
1,Sommer_Medrano,golf,,PENDING
1,Phillip_Medrano,wedding,vegetarian,CONFIRMED
1,Phillip_Medrano,brunch,Menu C,CONFIRMED
1,Phillip_Medrano,golf,,CONFIRMED
1,Blanka_Medrano,wedding,chicken,CONFIRMED
1,Blanka_Medrano,brunch,Menu A,CONFIRMED
1,Blanka_Medrano,golf,,CONFIRMED
1,Emaan_Medrano,wedding,PENDING,PENDING


# A tibble: 90 x 5
   party name            event   meal       attendance
   <dbl> <chr>           <chr>   <chr>      <chr>
 1     1 Sommer_Medrano  wedding PENDING    PENDING
 2     1 Sommer_Medrano  brunch  PENDING    PENDING
 3     1 Sommer_Medrano  golf    NA         PENDING
 4     1 Phillip_Medrano wedding vegetarian CONFIRMED
 5     1 Phillip_Medrano brunch  Menu C     CONFIRMED
 6     1 Phillip_Medrano golf    NA         CONFIRMED
 7     1 Blanka_Medrano  wedding chicken    CONFIRMED
 8     1 Blanka_Medrano  brunch  Menu A     CONFIRMED
 9     1 Blanka_Medrano  golf    NA         CONFIRMED
10     1 Emaan_Medrano   wedding PENDING    PENDING
# ... with 80 more rows


In [99]:
test_that("Question 2.3", {
    expect_true(all(c("party", "name", "event", "meal", "attendance") %in% names(answer2.3)))
    expect_equal(nrow(answer2.3), 90L)
})

Test passed


## Question 2.4

Which parties still have a "PENDING" attendance status for all of their members and all of the events? Your answer should be a vector of party IDs (not a tibble). Store your answer in `answer2.4`.

**Hint**: use `answer2.1` as a starting point. Use `pull()` to access a column as a vector.
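
To illustrate the difference the hint relies on, here is a short sketch with an invented tibble `scores`: `pull()` returns the column as a plain vector, whereas `select()` returns a one-column tibble.

```r
library(dplyr)
library(tibble)

# Hypothetical data.
scores <- tibble(id = 1:3, score = c(90, 85, 70))

score_vec <- pull(scores, score)    # a numeric vector
score_tbl <- select(scores, score)  # still a tibble

is.numeric(score_vec)
is.data.frame(score_tbl)
```

Since the question asks for a vector rather than a tibble, `pull()` is the natural last step of the pipeline.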

```
answer2.4 <- answer2.1 %>% 
    group_by(FILL_THIS_IN) %>% 
    summarize(all_pending = all(FILL_THIS_IN == "PENDING")) %>%
    filter(all_pending) %>%
    FILL_THIS_IN(party)
```    

In [68]:
answer2.4 <- answer2.1 %>% 
    group_by(party) %>% 
    summarize(all_pending = all(attendance == "PENDING")) %>%
    filter(all_pending) %>%
    pull(party)
print(answer2.4)

[1]  3  4  8 10


In [69]:
test_that("Question 2.4", {
    expect_equal(digest(unclass(answer2.4)),"f13a65bc5c8793a2cad1415aad7dff93")
})

Test passed


## Question 2.5
Which parties still have a "PENDING" attendance status for all of their members, for the wedding event only? Your answer should be a vector of party IDs (not a tibble). Store your answer in `answer2.5`.

**Hint**: Use `pull()` to access a column as a vector.

```
answer2.5 <- guest %>% 
    group_by(FILL_THIS_IN) %>% 
    summarize(pending_wedding = all(FILL_THIS_IN == "PENDING")) %>%
    filter(FILL_THIS_IN) %>%
    FILL_THIS_IN(party)
```    

In [73]:
answer2.5 <- guest %>% 
    group_by(party) %>% 
    summarize(pending_wedding = all(attendance_wedding == "PENDING")) %>%
    filter(pending_wedding) %>%
    pull(party)
print(answer2.5)

[1]  3  4  8 10


In [74]:
test_that("Question 2.5", {expect_equal(digest(unclass(answer2.5)), "f13a65bc5c8793a2cad1415aad7dff93")})

Test passed


# Part 3: The Model-Fitting Paradigm in R

**Overview**

1. Fit a linear regression model to life expectancy ("Y") from year ("X") by filling in the formula. Notice what appears as the output.
2. Use the `unclass` function to uncover the object's true nature: a list.
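
The two overview steps can be previewed on R's built-in `cars` dataset, used here only as a stand-in for the worksheet data (the names `fit` and `raw` are illustrative):

```r
# Step 1: fit a linear model; printing it shows only the call and coefficients.
fit <- lm(dist ~ speed, data = cars)
class(fit)

# Step 2: strip the class attribute to see the underlying structure.
raw <- unclass(fit)
class(raw)
names(raw)

# Components can then be accessed like any list element.
raw$coefficients
```

The compact printout of a model object comes from its print method; the underlying list holds much more, such as residuals and fitted values.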

## Question 3.1
First, create a subset of the `gapminder` dataset containing only the country of `France`. Store your answer in `answer3.1`.

```
(answer3.1 <- gapminder %>%
   FILL_THIS_IN(FILL_THIS_IN == FILL_THIS_IN))
```   

In [75]:
(answer3.1 <- gapminder %>%
   filter(country == 'France'))
print(answer3.1)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
France,Europe,1952,67.41,42459667,7029.809
France,Europe,1957,68.93,44310863,8662.835
France,Europe,1962,70.51,47124000,10560.486
France,Europe,1967,71.55,49569000,12999.918
France,Europe,1972,72.38,51732000,16107.192
France,Europe,1977,73.83,53165019,18292.635
France,Europe,1982,74.89,54433565,20293.897
France,Europe,1987,76.34,55630100,22066.442
France,Europe,1992,77.46,57374179,24703.796
France,Europe,1997,78.64,58623428,25889.785


# A tibble: 12 x 6
   country continent  year lifeExp      pop gdpPercap
   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
 1 France  Europe     1952    67.4 42459667     7030.
 2 France  Europe     1957    68.9 44310863     8663.
 3 France  Europe     1962    70.5 47124000    10560.
 4 France  Europe     1967    71.6 49569000    13000.
 5 France  Europe     1972    72.4 51732000    16107.
 6 France  Europe     1977    73.8 53165019    18293.
 7 France  Europe     1982    74.9 54

In [76]:
test_that("Question 3.1", {expect_equal(digest(unclass(answer3.1)), "a6125bcbb25047b7a8c932acbb1f2250")})

Test passed


## Question 3.2

> Fit a linear regression model to life expectancy ("Y") from year ("X") by filling in the formula

Now, using the `lm()` function we will create the linear model. Store your answer in `answer3.2`.

```
(answer3.2 <- FILL_THIS_IN(FILL_THIS_IN ~ FILL_THIS_IN, answer3.1))
```

In [81]:
(answer3.2 <- lm(lifeExp ~ year, answer3.1))
print(answer3.2)


Call:
lm(formula = lifeExp ~ year, data = answer3.1)

Coefficients:
(Intercept)         year  
  -397.7646       0.2385  



Call:
lm(formula = lifeExp ~ year, data = answer3.1)

Coefficients:
(Intercept)         year  
  -397.7646       0.2385  



In [82]:
test_that("Question 3.2", {
    expect_known_hash(round(coef(answer3.2), 4), "e7375b4c2683882feb9d7215f6929f69")
    expect_known_hash(answer3.2$terms, "9c71cfd9974bfbc8e160b1a31936d137")
})
print("Success!")

Test passed
[1] "Success!"


## Question 3.3

We are interested in the modeling results over the period covered by the data, which starts in 1952. As fitted above, the intercept is the predicted life expectancy at year 0, which is far outside the data. To get a meaningful, interpretable intercept, we can use the `I()` function.

Use `I()` to shift the year so that the beginning of our dataset (1952) corresponds to 0 in the model. This makes every year in the dataset relative to the first year, 1952.

Store your answer in `answer3.3`.
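
As a sketch of the idea on invented data (the data frame `toy` and both model names are assumptions for illustration): inside a formula, `-` normally means "remove a term", so `I()` is what tells R to treat `year - 1952` as ordinary arithmetic. Shifting the predictor changes the intercept but not the slope.

```r
# Hypothetical data: y starts at 10 in 1952 and rises by 2 per year.
toy <- data.frame(year = c(1952, 1957, 1962, 1967))
toy$y <- 10 + 2 * (toy$year - 1952)

fit_raw     <- lm(y ~ year, data = toy)
fit_shifted <- lm(y ~ I(year - 1952), data = toy)

coef(fit_raw)      # intercept is the fitted value at year 0
coef(fit_shifted)  # intercept is the fitted value at year 1952
```

The shifted model's intercept can be read directly as the fitted value in the first year of the data.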

```
answer3.3 <- FILL_THIS_IN(FILL_THIS_IN ~ I(FILL_THIS_IN-1952), answer3.1)
```

In [96]:
answer3.3 <- lm(lifeExp ~ I(year-1952), answer3.1)
print(answer3.3)


Call:
lm(formula = lifeExp ~ I(year - 1952), data = answer3.1)

Coefficients:
   (Intercept)  I(year - 1952)  
       67.7901          0.2385  



In [86]:
test_that("Question 3.3", {
    expect_known_hash(round(coef(answer3.3), 4), "6a83f591b39b440f2a699dbee2c23468")
    expect_known_hash(answer3.3$terms, "f7d8f19ef010f5932ba9b321f3f88282")
})

Test passed


## Question 3.4
Use the `unclass()` function to see what the object returned by `lm()` actually looks like. Store your answer in `answer3.4`.

```
answer3.4 <- FILL_THIS_IN(answer3.3)
```

In [87]:
answer3.4 <- unclass(answer3.3)
print(answer3.4)

$coefficients
   (Intercept) I(year - 1952) 
    67.7901282      0.2385014 

$residuals
          1           2           3           4           5           6 
-0.38012821 -0.05263520  0.33485781  0.18235082 -0.18015618  0.07733683 
          7           8           9          10          11          12 
-0.05517016  0.20232284  0.12981585  0.11730886 -0.12519814 -0.25070513 

$effects
   (Intercept) I(year - 1952)                                              
 -257.55220231    14.26030956     0.41516662     0.26479522    -0.09557618 
                                                                           
    0.16405242     0.03368103     0.29330963     0.22293823     0.21256684 
                              
   -0.02780456    -0.15117596 

$rank
[1] 2

$fitted.values
       1        2        3        4        5        6        7        8 
67.79013 68.98264 70.17514 71.36765 72.56016 73.75266 74.94517 76.13768 
       9       10       11       12 
77.33018 78.52269 79.71520 80.90

In [88]:
test_that("Question 3.4", {
    expect_known_hash(round(answer3.4$coefficients, 4), "6a83f591b39b440f2a699dbee2c23468")
    expect_known_hash(class(answer3.4), "086ebc4c59c08c43e75bae74f1e16897")
    expect_known_hash(answer3.4$terms, "f7d8f19ef010f5932ba9b321f3f88282")
    
})

Test passed


# Part 4: Producing Tidy Tibbles with broom

## Question 4.1

Apply `broom::tidy()` to `answer3.3`. Store your answer in `answer4.1`.

```
answer4.1 <- FILL_THIS_IN(answer3.3)
```
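
For comparison with "traditional means" of extracting information, here is a minimal sketch on the built-in `cars` dataset (a stand-in, not the worksheet's model). The base R route yields a matrix with terms as row names, while `broom` returns tibbles:

```r
library(broom)

fit <- lm(dist ~ speed, data = cars)

# Traditional: a numeric matrix; the term names live in the row names.
coef(summary(fit))

# broom: the same estimates as a tibble, with terms in a `term` column.
tidy(fit)

# Model-level summaries (such as R-squared) as a one-row tibble.
glance(fit)
```

Because the `broom` outputs are tibbles, they can be filtered, joined, or plotted with the usual tidyverse tools.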

In [89]:
answer4.1 <- broom::tidy(answer3.3)
print(answer4.1)

# A tibble: 2 x 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      67.8     0.119       567.  7.12e-24
2 I(year - 1952)    0.239   0.00368      64.8 1.86e-14


In [90]:
test_that("Question 4.1", {
    expect_known_hash(dimnames(answer4.1), "b7e5db66048ee1c33cb090078dc59103")
    expect_known_hash(answer4.1[[1]], "6736fedebcaa557ef7a78f8db206000f")
    expect_known_hash(round(answer4.1[[3]], 4), "004727aa166a650b3f55c2c11f6be257")
})

Test passed


## Question 4.2

Apply `broom::augment()` to `answer3.3`. Store your answer in `answer4.2`.

```
answer4.2 <- FILL_THIS_IN(answer3.3)
```
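
As a rough illustration of how `broom::augment()` relates to `predict()`, the sketch below uses a stand-in model. The names `example_fit` and `new_cars`, and the built-in `mtcars` data, are hypothetical choices for this example only and do not come from the worksheet.

```r
library(broom)

# Hypothetical example model (NOT the worksheet's answer3.3).
example_fit <- lm(mpg ~ wt, data = mtcars)

# augment() returns one row per observation: the model's data plus
# per-observation columns such as .fitted.
example_aug <- augment(example_fit)
print(head(example_aug))

# The .fitted column should agree with predict() on the training data.
all.equal(as.numeric(example_aug$.fitted), as.numeric(predict(example_fit)))

# Supplying newdata gives predictions for unseen observations.
# new_cars is a made-up data frame for illustration.
new_cars <- data.frame(wt = c(2.5, 3.5))
new_pred <- augment(example_fit, newdata = new_cars)
print(new_pred)
```

The main difference from `predict()` is the shape of the result: `predict()` returns a bare numeric vector, while `augment()` keeps the predictions alongside the data they belong to.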

In [91]:
answer4.2 <- broom::augment(answer3.3)
print(answer4.2)

# A tibble: 12 x 7
   lifeExp `I(year - 1952)` .fitted   .hat .sigma .cooksd .std.resid
     <dbl>         <I<dbl>>   <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
 1    67.4                0    67.8 0.295   0.176 0.885       -2.06 
 2    68.9                5    69.0 0.225   0.231 0.0107      -0.272
 3    70.5               10    70.2 0.169   0.197 0.283        1.67 
 4    71.6               15    71.4 0.127   0.223 0.0572       0.887
 5    72.4               20    72.6 0.0991  0.223 0.0409      -0.863
 6    73.8               25    73.8 0.0851  0.230 0.00628      0.367
 7    74.9               30    74.9 0.0851  

In [92]:
test_that("Question 4.2", {
    expect_known_hash(round(answer4.2$.fitted, 4), "3a2e6323312173544d30c4dc75d5b604")
    expect_known_hash(round(answer4.2$.std.resid, 4), "b2796c0f920703cb5066fcda716be2e7")
})

Test passed


## Question 4.3

Apply `broom::glance()` to `answer3.3`. Store your answer in `answer4.3`.

```
answer4.3 <- FILL_THIS_IN(answer3.3)
```
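
For comparison with the "traditional means" of extracting model summaries, here is a small sketch of `broom::glance()` on a stand-in model. As before, `example_fit` and the built-in `mtcars` data are assumptions for illustration and are not the worksheet's `answer3.3`.

```r
library(broom)

# Hypothetical example model (NOT the worksheet's answer3.3).
example_fit <- lm(mpg ~ wt, data = mtcars)

# glance() returns a one-row tibble of model-level summaries
# (for lm: r.squared, sigma, AIC, and others).
example_glance <- glance(example_fit)
print(example_glance)

# The same R-squared is available the traditional way, via summary().
all.equal(example_glance$r.squared, summary(example_fit)$r.squared)
```

Since each model yields exactly one row, the outputs of `glance()` for several models can be stacked with `dplyr::bind_rows()` to compare fits side by side.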

In [94]:
answer4.3 <- broom::glance(answer3.3)
print(answer4.3)

# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.998         0.997 0.220     4200. 1.86e-14     1   2.23  1.53  2.99
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>


In [95]:
test_that("Question 4.3", {
    expect_known_hash(dimnames(answer4.3), "78044903eb403fb9220d796ac127297c")
    expect_known_hash(round(answer4.3[[2]], 4), "a3ec6ee89f16b0783571e8f9e26c9ef5")
    expect_known_hash(round(answer4.3[[4]], 4), "f95d29661fc511c6a038ae4e06b9ea02")
})

Test passed


### Attribution

Thanks to Diana Lin, Icíar Fernández Boyano, David Kepplinger, and Vincenzo Coia for putting this worksheet together.