# Worksheet 05a: Working With Factors & Tibble Joins
_**Leader**: Almas Khan **Reviewer:** Diana Lin **ASDA Assist**: David Kepplinger_

**Version 1.1** - Delivered Wednesday, October 14, 2020

Fixes:
- _Question 2.5_: change **decreasing** to **increasing**; add a hint

This is the corresponding worksheet for Class 10 (Oct 13, 2020) & Class 11 (Oct 15, 2020).

For marking purposes, we will need the packages below.
Remember to pay attention to the variable name to store your answer in, or else it will not be autograded correctly.
To ensure everything works properly, remember to run all code cells, not just the ones with your answer.

If you want to use packages which are not yet installed, you can use the code cell below to install them.

In [None]:
# Install additional packages, e.g.
# install.packages("forcats")
# install.packages('tsibble')

Use the following code cell to load any additional packages you want to use for this worksheet.

In [None]:
# Load packages, e.g.
# library(devtools)
# library(tsibble)

Run the code cell below to load the packages.

In [None]:
library(testthat)
library(digest)

## TOPIC 1: Working With Factors in R

For the best experience working with factors in R, we will use the `forcats` package.

In [None]:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(forcats))
suppressPackageStartupMessages(library(gapminder))

### Question 1: Creating Factors

Let's look again at the `gapminder` dataset and create a new column, `life_level`, that contains five categories ("very high", "high", "moderate", "low" and "very low") based on life expectancy in 1997.
Assign categories according to the table below:

| Criteria          |`life_level`   | 
|-------------------|-------------|
| less than 23      | very low    |
| between 23 and 48 | low         |
| between 48 and 59 | moderate    |
| between 59 and 70 | high        |
| more than 70      | very high   |

#### Question 1.1

Create a new data set for the year 1997 by first filtering by the year and then adding a new column `life_level` according to the table above.

Store this new data frame in variable `gapminder_1997`.

**Hint**: We are using `case_when()`, a tidier way to vectorise multiple `if_else()` statements.
You can read more about this function [in the tidyverse reference](https://dplyr.tidyverse.org/reference/case_when.html).
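
As a rough illustration of how `case_when()` evaluates its conditions, here is a minimal sketch on made-up exam scores. The tibble `scores`, its columns, and the cut-offs are hypothetical and unrelated to the worksheet data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: exam scores (not part of the worksheet)
scores <- tibble(student = c("A", "B", "C", "D"),
                 score   = c(35, 62, 78, 91))

# Conditions are checked top to bottom and the first TRUE one wins,
# so each later condition implicitly excludes the earlier ranges.
graded <- scores %>%
  mutate(grade = case_when(score < 50 ~ "fail",
                           score < 75 ~ "pass",
                           TRUE       ~ "distinction"))
graded
```

Because the conditions are evaluated in order, only the upper bound of each range needs to be written out.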

```
(gapminder_1997 <- gapminder %>% 
   FILL_THIS_IN(year == FILL_THIS_IN) %>% 
   FILL_THIS_IN(life_level = case_when(FILL_THIS_IN < FILL_THIS_IN ~ "very low",
                                 FILL_THIS_IN < FILL_THIS_IN ~ "low",
                                 FILL_THIS_IN < FILL_THIS_IN ~ "moderate",
                                 FILL_THIS_IN < FILL_THIS_IN ~ "high",
                                 TRUE ~ "very high")))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 1.1", expect_known_hash(table(gapminder_1997$life_level), "3d2e691667d4706e66ce5784bb1d7042"))
print("Success!")

We can now plot boxplots for the GDP per capita per level of life expectancy.
Run the following code to see the boxplots.

In [None]:
ggplot(gapminder_1997) + geom_boxplot(aes(x = life_level, y = gdpPercap)) +
  labs(y = "GDP per capita ($)", x= "Life expectancy level (years)") +
  ggtitle("GDP per capita per Level of Life Expectancy") +
  theme_bw() 

We notice a few oddities here:

- It seems that none of the countries had a "very low" life expectancy in 1997, so this category is missing from the plot.
- However, since it was an option in our analysis, it should be included in our plot. Right?
- Notice also how the levels on the x-axis are placed in the "wrong" order (alphabetical order).
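
Both oddities come from how levels are derived when none are given explicitly. The sketch below shows this in base R on a made-up character vector; `sizes` and its values are hypothetical.

```r
# Hypothetical toy data (not part of the worksheet)
sizes <- c("small", "large", "small", "large")

# Without explicit levels, factor() sorts the levels alphabetically
# and only creates levels for values that actually occur.
levels(factor(sizes))

# With explicit levels, the order is ours, and "medium" is kept as a
# level even though no observation has that value.
sizes_fct <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_fct)
table(sizes_fct)   # "medium" is listed with a count of 0
```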

#### Question 1.2

You can correct these issues by explicitly making `life_level` a factor and setting the levels parameter.
Create a new data frame as in Question 1.1, but make the column `life_level` a factor with levels ordered from *very low* to *very high*.
Store this new data frame in variable `gapminder_1997_fct`.

```
(gapminder_1997_fct <- gapminder %>% 
   FILL_THIS_IN(year == 1997) %>% 
   FILL_THIS_IN(life_level = FILL_THIS_IN(case_when(FILL_THIS_IN < FILL_THIS_IN ~ "very low",
                                        FILL_THIS_IN < FILL_THIS_IN ~ "low",
                                        FILL_THIS_IN < FILL_THIS_IN ~ "moderate",
                                        FILL_THIS_IN < FILL_THIS_IN ~ "high",
                                        TRUE ~ "very high"),
                              levels = c('FILL_THIS_IN', 'FILL_THIS_IN', 'FILL_THIS_IN', 'FILL_THIS_IN', 'FILL_THIS_IN'))))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 1.2", expect_known_hash(table(gapminder_1997_fct$life_level), "8e62f09fbd0756d7e69d1bc95715d333"))
print("Success!")

Run the following code to see the boxplots from the new data frame with life expectancy level as factor.

In [None]:
ggplot(gapminder_1997_fct) + geom_boxplot(aes(x = life_level, y = gdpPercap)) +
  labs(y = "GDP per capita ($)", x= "Life expectancy level (years)") +
  scale_x_discrete(drop = FALSE) + # Don't drop the unused "very low" level
  ggtitle("GDP per capita per level of Life Expectancy") +
  theme_bw() 

### Question 2: Inspecting Factors

In Question 1, you created your own factors, so now let's explore what categorical variables are in the `gapminder` dataset.

#### Question 2.1

What levels does the column `continent` have?
Assign the levels to variable `continent_levels`.

```
(continent_levels <- FILL_THIS_IN(gapminder$FILL_THIS_IN))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 2.1", expect_known_hash(continent_levels, "6926255b7f073fb8e7d89773802102a6"))
print("Success!")

#### Question 2.2

How many levels does the column `country` have?
Assign the number of levels to variable `gap_nr_countries`.

```
(gap_nr_countries <- FILL_THIS_IN(gapminder$FILL_THIS_IN))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 2.2", expect_known_hash(as.integer(gap_nr_countries), "3b6d002135d8d45a3c5f4a9fb857c323"))
print("Success!")

#### Question 2.3

Suppose we are only interested in the following 5 countries: Egypt, Haiti, Romania, Thailand, and Venezuela.
Create a new data frame with only these 5 countries and store it in variable `gap_5`.

```
(gap_5 <- gapminder %>%
   FILL_THIS_IN(FILL_THIS_IN %in% c("FILL_THIS_IN", "FILL_THIS_IN", "FILL_THIS_IN", "FILL_THIS_IN", "FILL_THIS_IN")))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 2.3", {
  expect_known_hash(dim(gap_5), "6c0f8c2a8d488051f33fc89b2c327dcd")
  expect_known_hash(table(gap_5$country), "05b8ca3033e94f96b9ec5422a69c1207")
})
print("Success!")

#### Question 2.4

However, sub-setting the data set does not affect the levels of the factors.
The column `country` in data frame `gap_5` still has the same number of levels as in the original data frame.
Create a new data frame from `gap_5`, but drop all unused levels from column `country`.
Store the new data frame in variable `gap_5_dropped`.
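
To see that subsetting keeps unused levels, consider this small base R sketch on a made-up factor. The name `fruit` and its values are hypothetical, and the sketch only shows the problem, not the fix.

```r
# Hypothetical toy factor (not part of the worksheet)
fruit <- factor(c("apple", "banana", "cherry", "apple"))

# Keep only the "apple" observations
apples <- fruit[fruit == "apple"]

# table() lists every level, so the levels that no longer occur
# still appear, each with a count of 0.
table(apples)
```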

```
(gap_5_dropped <- gap_5 %>% FILL_THIS_IN())
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 2.4", expect_known_hash(sort(levels(gap_5_dropped$country)), "ac97b9af845a59395697b028c5121503"))
print("Success!")

#### Question 2.5

The factor levels of column `continent` in data frame `gapminder` are ordered alphabetically.
Create a new data frame, with the levels of column `continent` in ~*decreasing*~ *increasing* order according to their frequency (i.e., the number of rows for each continent).
Store the new data frame in variable `gap_continent_freq`.

```
(gap_continent_freq <- gapminder %>%
   mutate(continent = FILL_THIS_IN(FILL_THIS_IN(continent))))
```

**Hint**: The first `FILL_THIS_IN` corresponds to a `fct_*` function that reverses the levels of the factor. The second `FILL_THIS_IN` corresponds to a `fct_*` function that orders the levels by *decreasing* frequency.
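
One possible way to check the resulting level order is `table()`, which reports counts in level order. The sketch below uses a made-up factor `colour` whose levels were ordered by hand; it is only meant to show the check, not the reordering itself.

```r
# Hypothetical toy factor with levels given in a hand-picked order
colour <- factor(c("red", "blue", "red", "green", "red", "blue"),
                 levels = c("green", "blue", "red"))

# table() reports counts in level order, so counts that rise from left
# to right mean the levels are in increasing order of frequency.
counts <- table(colour)
counts
!is.unsorted(as.integer(counts))   # TRUE when the counts never decrease
```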

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 2.5", expect_known_hash(table(gap_continent_freq$continent), "0bb23ea87ce71deb5452eaae8cdbf7cf"))
print("Success!")

#### Question 2.6

Again based on the `gapminder` data set, create another data frame, with the levels of column `continent` in *increasing* order of their average life expectancy (from column `lifeExp`).
Store the new data frame in variable `gap_continent_life`.

```
(gap_continent_life <- gapminder %>%
   mutate(continent = FILL_THIS_IN(FILL_THIS_IN, FILL_THIS_IN, FILL_THIS_IN)))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 2.6", expect_known_hash(table(gap_continent_life$continent), "7688676a0807063f1bfa5b4cc721c2d9"))
print("Success!")

#### Question 2.7

Suppose now that you want to make comparisons between countries, relative to Canada.
Create a new data frame, with the levels of column `country` rearranged to have Canada as the first one.
Store the new data frame in variable `gap_canada_base`.

```
(gap_canada_base <- gapminder %>%
   mutate(country = FILL_THIS_IN(FILL_THIS_IN, "FILL_THIS_IN")))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 2.7", expect_known_hash(table(gap_canada_base$country), "72d75ce05a16d8965f7bd0ae3fb986d3"))
print("Success!")

#### Question 2.8

Sometimes you want to manually change a few factor levels, e.g., if the level is too wide for plotting.
Based on the `gapminder` data set, create a new data frame with the Central African Republic renamed to *Central African Rep.* and Bosnia and Herzegovina renamed to *Bosnia & Herzegovina*.
Store the new data frame in variable `gap_car`.

```
(gap_car <- gapminder %>%
   mutate(country = FILL_THIS_IN(FILL_THIS_IN, "Central African Rep." = "FILL_THIS_IN",
                               "Bosnia & Herzegovina" = "FILL_THIS_IN")))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 2.8", expect_known_hash(table(gap_car$country), "9cc15f09cb70b5596bbf3feaa73ee471"))
print("Success!")

## TOPIC 2: Tibble Joins

### Question 3

Run the following R code to load the data (extracted from the R package [singer](https://github.com/JoeyBernhardt/singer)).

In [None]:
suppressMessages({
  time <- read_csv("https://raw.githubusercontent.com/STAT545-UBC/Classroom/master/data/singer/songs.csv") %>% rename(song = title)
  album <- read_csv("https://raw.githubusercontent.com/STAT545-UBC/Classroom/master/data/singer/loc.csv") %>% select(title, everything()) %>% rename(song = title, album = release)
})

These two data sets contain information about a few popular songs and albums.
Run the following R code to look at the two data sets:

In [None]:
time

In [None]:
album
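
Before joining these two data sets, it may help to recall what the `dplyr` join verbs return. The following is a minimal sketch on two made-up tibbles, `pets` and `owners`; all names and values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tibbles (not part of the worksheet)
pets   <- tibble(name    = c("Rex", "Tom", "Nemo"),
                 species = c("dog", "cat", "fish"))
owners <- tibble(name  = c("Rex", "Tom", "Polly"),
                 owner = c("Ana", "Bo", "Cy"))

# Mutating joins add the columns of `owners` to `pets`
inner_join(pets, owners, by = "name")  # only names found in both
left_join(pets, owners, by = "name")   # all pets; owner is NA for Nemo
full_join(pets, owners, by = "name")   # every name from either tibble

# Filtering joins keep or drop rows of `pets` without adding columns
semi_join(pets, owners, by = "name")   # pets that have an owner
anti_join(pets, owners, by = "name")   # pets that have no owner
```

When the key spans several columns, `by` takes a character vector of all of them.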

#### Question 3.1
We really care about the songs in `time`.
But for which of the songs do we know the corresponding album?
Create a new data frame with all songs from `time` and the information on the corresponding album.
This new data frame should contain only the songs with a corresponding album.
Store the joined data set in variable `songs_with_album`.

```
(songs_with_album <- time %>% 
  FILL_THIS_IN(FILL_THIS_IN, by = c("FILL_THIS_IN", "FILL_THIS_IN")))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 3.1", {
  expect_known_hash(sort(songs_with_album$song), "146ff293a74ccc1ad24505a6bc0b6682")
  expect_known_hash(table(songs_with_album$artist_name), "51f7daeec65e839e5ae6c84ac5a1cb70")
})
print("Success!")

#### Question 3.2
Go ahead and add the corresponding albums to the `time` tibble, being sure to preserve rows even if album info is not readily available.
Store the joined data set in variable `all_songs`.

```
(all_songs <- time %>% 
  FILL_THIS_IN(FILL_THIS_IN, by = c("FILL_THIS_IN", "FILL_THIS_IN")))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 3.2", {
  expect_known_hash(sort(all_songs$song), "dd1c0b2e14a879cb1a6f07077ed38e97")
  expect_known_hash(all_songs$album[order(all_songs$song)], "2baea3c1a23797fdac5a9e0dc119073e")
})
print("Success!")

#### Question 3.3: Joining Rows by Columns
Create a new data frame with songs from `time` for which there is no album info.
Store the new data set in variable `songs_without_album`.

```
(songs_without_album <- time %>% 
  FILL_THIS_IN(FILL_THIS_IN, by = c("FILL_THIS_IN", "FILL_THIS_IN")))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 3.3", expect_known_hash(sort(songs_without_album$song), "146ff293a74ccc1ad24505a6bc0b6682"))
print("Success!")

#### Question 3.4
Create a new data frame with *all* songs from artists for which there is no album information.
Store the new data set in variable `songs_artists_no_album`.

```
(songs_artists_no_album <- time %>% 
  FILL_THIS_IN(FILL_THIS_IN, by = "FILL_THIS_IN"))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 3.4", expect_known_hash(table(songs_artists_no_album$artist_name), "244510c51477c31e6e795cbc0ca0b0d7"))
print("Success!")

#### Question 3.5
Create a new data frame with all the information from both tibbles, regardless of whether corresponding information is present in the other tibble.
Store the new data set in variable `all_songs_and_albums`.

```
(all_songs_and_albums <- time %>% 
  FILL_THIS_IN(FILL_THIS_IN, by = c("FILL_THIS_IN", "FILL_THIS_IN")))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 3.5", {
  expect_known_hash(sort(all_songs_and_albums$song), "ba2ba3507e50c56d21028893404259a5")
  expect_known_hash(with(all_songs_and_albums, album[order(song)]), "dbc70af8d3078ea830be9cfb0dee6b9d")
  expect_known_hash(with(all_songs_and_albums, year[order(song)]), "10669b0750ab4d53b54f0e509430e2d1")
})
print("Success!")

### Question 4: Concatenating Rows

Run the following R code to load the three Lord of the Rings tibbles that we saw a few times already.

In [None]:
suppressMessages({
  fell <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv")
  ttow <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv")
  retk <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv")
})

Run the following R code to take a look at the 3 tibbles:

In [None]:
fell

In [None]:
ttow

In [None]:
retk

#### Question 4.1

Combine the three data sets into a single tibble, storing the new tibble in variable `lotr`.

```
(lotr <- FILL_THIS_IN(fell, ttow, retk))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 4.1", expect_known_hash(table(lotr$Film), "41c29122f6c217d447e85a9069f5a92f"))
print("Success!")

#### Question 4.2

Create a new data set with all races that are present in "The Fellowship of the Ring" (`fell`), but not in any of the other ones.
Store the new data frame in variable `only_fell`.

```
(only_fell <- fell %>% 
  FILL_THIS_IN(ttow, by = "Race") %>% 
  FILL_THIS_IN(retk, by = "Race"))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 4.2", expect_known_hash(dim(only_fell), "d1e54b618e5808b540dbdfbb7f75026f"))
print("Success!")

### Question 5: Set Operations

Let's use three set functions: `intersect()`, `union()` and `setdiff()`.
They work for data frames with the same column names.
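
As a rough sketch of what these three functions return for data frames, here is an example on two made-up tibbles, `a` and `b`, which are hypothetical and share the same column names. It assumes `dplyr` is loaded, since `dplyr` provides the data frame methods.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tibbles with identical column names
a <- tibble(id = 1:3, label = c("p", "q", "r"))
b <- tibble(id = 2:4, label = c("q", "r", "s"))

intersect(a, b)  # rows found in both a and b
union(a, b)      # rows found in either, with duplicates removed
setdiff(a, b)    # rows of a that are not in b
setdiff(b, a)    # order matters: rows of b that are not in a
```

Rows are compared as a whole, so two rows match only if they agree in every column.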

We'll work with two toy tibbles named `y` and `z`, similar to the Data Wrangling Cheatsheet.

Run the following R code to create the data.

In [None]:
(y <-  tibble(x1 = LETTERS[1:3], x2 = 1:3))

In [None]:
(z <- tibble(x1 = c("B", "C", "D"), x2 = 2:4))

#### Question 5.1

Use one of the three methods mentioned above to create a new data set which contains all rows that appear in both `y` and `z`.
Store the new data frame in variable `in_both`.

```
(in_both <- FILL_THIS_IN(y, z))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 5.1", expect_known_hash(in_both$x1, "745ec49ab3231655a04484be44a15f98"))
print("Success!")

#### Question 5.2
Assume that rows in `y` are from *Day 1* and rows in `z` are from *Day 2*.
Create a new data set with all rows from `y` and `z`, as well as an additional column `day` which is *Day 1* for rows from `y` and *Day 2* for rows from `z`.
Store the new data set in variable `both_days`.

```
(both_days <- FILL_THIS_IN(
  mutate(y, day = "Day 1"),
  mutate(z, day = "Day 2")
))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 5.2", expect_known_hash(with(both_days, x1[order(x2, day)]), "66b9eefd39c2f0b5d130453c139a2051"))
print("Success!")

#### Question 5.3

The rows contained in `z` are bad.
Use one of the three methods mentioned above to create a new data set which contains only the rows from `y` which are not in `z`.
Store the new data frame in variable `only_y`.

```
(only_y <- FILL_THIS_IN(y, z))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 5.3", expect_known_hash(only_y$x1, "75f1160e72554f4270c809f041c7a776"))
print("Success!")

### Question 6: Dates and Tsibble

We're going to take a look at the `tsibble` package and how it works with dates. Let's first load this package.

#### Question 6.1

In [None]:
#install.packages("tsibble")
suppressPackageStartupMessages(library(tsibble))

Next, let's take a look at the built-in `presidential` dataset (from `ggplot2`), which lists the start and end dates of the terms of US presidents.

In [None]:
presidential

Use `tsibble` to convert the dates in the existing `start` and `end` columns to year and month only.
Store this tibble in variable `president_ym`.
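
`tsibble` provides several classes that store a date at a coarser resolution. The sketch below shows two related ones, `yearquarter()` and `yearweek()`, on a single made-up date `d`; the function needed for this question follows the same pattern.

```r
library(tsibble)

# A single hypothetical date, used only for illustration
d <- as.Date("2020-10-14")

# Each function maps the date to the period that contains it
yearquarter(d)   # the quarter containing d
yearweek(d)      # the week containing d

# Converting back to Date gives the first day of the period
as.Date(yearquarter(d))
```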

```
(president_ym <- presidential %>%
 mutate(start=FILL_THIS_IN(start), end=FILL_THIS_IN(end)))
```


In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that("Question 6.1", expect_known_hash(president_ym[1,], "8b9ac24bc52a692ab7d1bd83f9e0a19c"))
print("Success!")