# DSCI 523 Lab Assignment 3

## Tidy control flow in R, as well as functions & testing in R

## Lab Mechanics
rubric={5}

- All files necessary to run your work must be pushed to your GitHub.ubc.ca repository for this lab.
- You need to have a minimum of 3 commit messages associated with your GitHub.ubc.ca repository for this lab.
- You must also submit this `.ipynb` notebook to Gradescope, and it must be executed so the TAs can see the results of your work.
- **There is autograding in this lab, so please do not move or rename this file. Also, do not copy and paste cells; if you need to add new cells, create them via the "Insert a cell below" button instead.**
- Follow the [MDS general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).

## Code Quality
rubric={quality:5}

The code that you write for this assignment will be given one overall grade for code quality; see our code quality rubric as a guide to what we are looking for. Also, for this course (and other MDS courses that use R), we are trying to follow the tidyverse code style. There is a guide you can refer to: http://style.tidyverse.org/

Each code question will also be assessed for code accuracy (i.e., does it do what it is supposed to do?).

## Writing 
rubric={writing:5}

To get the marks for this writing component, you should:

- Use proper English, spelling, and grammar throughout your submission (the non-coding parts).
- Be succinct. This means being specific about what you want to communicate, without being superfluous.

## Table of contents

1. [Exercise 1: control flow with {dplyr}](#Exercise-1:-control-flow-with-{dplyr})

2. [Exercise 2: mapping with {purrr}](#Exercise-2:-mapping-with-{purrr})

3. [Exercise 3: functions](#Exercise-3:-functions)

4. [Exercise 4: testing](#Exercise-4:-testing)

5. [Exercise 5: (Challenging Question)](#Exercise-5:-(Challenging-Question))

6. [Submission instructions](#Submission)

Run the cell below to load the libraries needed for this lab, as well as the test file so you can check your answers as you go!

In [2]:
library(nycflights13)
library(testthat)
library(tidyverse)
options(repr.matrix.max.rows = 10)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::edition_get()   masks testthat::edition_get()
✖ dplyr::filter()        masks stats::filter()
✖ purrr::is_null()       masks testthat::is_null()
✖ dplyr::lag()

## Exercise 1: control flow with {dplyr}
rubric={autograde=15}

Use the {tidyverse} control flow functions we learned about this week to take the {nycflights13} `flights` data set and obtain the average speed (in km/hr) and average distance (in km) for the carriers AA, AS, UA and US.
Name these new columns `avg_speed` and `avg_distance_km`, and round the values so that the answer is a whole number (i.e., no decimal points). Convert the carrier acronyms to their full names (American Airlines, Alaska Airlines, 
United Airlines and US Airways). Sort the results according to the average speed. Bind the name `avg_flights` to the data frame.

Some hints:
- The distance is in miles and air time is in minutes in the `flights` data. 
- You will have to create a column that holds the speed of each flight before you can average it for each carrier.
- You may also need to handle `NA` entries in the data.
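
The unit conversion implied by the first hint can be sketched on a single made-up flight. The values below are illustrative and are not taken from the `flights` data, the variable names are hypothetical, and 1.6093 km per mile is a rounded conversion factor.

```r
# Hypothetical flight: 500 miles flown in 80 minutes of air time
distance_miles <- 500
air_time_min <- 80

km_per_mile <- 1.6093                       # approximate conversion factor
distance_km <- distance_miles * km_per_mile
air_time_hr <- air_time_min / 60

speed_km_hr <- distance_km / air_time_hr    # kilometres per hour
speed_km_hr

# mean() propagates NA unless missing values are dropped explicitly
mean(c(speed_km_hr, NA))                    # NA
mean(c(speed_km_hr, NA), na.rm = TRUE)      # same value as speed_km_hr
```

The last two lines illustrate why the third hint matters: a single missing air time would otherwise turn a carrier's whole average into `NA`.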

In [3]:
avg_flights <- NULL
# BEGIN SOLUTION NO PROMPT
avg_flights <- flights |>
    filter(carrier %in% c("AA", "AS", "UA", "US")) |>
    mutate(carrier = case_when(carrier == "AA" ~ "American Airlines",
                              carrier == "AS" ~ "Alaska Airlines",
                              carrier == "UA" ~ "United Airlines",
                              carrier == "US" ~ "US Airways")) |> 
    mutate(avg_speed = (distance * 1.6093) / (air_time / 60)) |>
    group_by(carrier) |>
    summarise(avg_speed = round(mean(avg_speed, na.rm = TRUE)), 
              avg_distance_km = round(mean(1.6093 * distance, na.rm = TRUE))) |>
    arrange(desc(avg_speed))
# END SOLUTION
avg_flights

| carrier | avg_speed | avg_distance_km |
|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` |
| Alaska Airlines | 714 | 3866 |
| United Airlines | 677 | 2461 |
| American Airlines | 672 | 2157 |
| US Airways | 550 | 891 |


The tests below only check that the object has the correct names. The other tests are intentionally hidden.

In [4]:
# visible tests to check object name
# the remaining tests are hidden
expect_true(exists("avg_flights"))
expect_named(avg_flights, c("carrier", "avg_speed", "avg_distance_km"), ignore.order = TRUE)

In [5]:
# HIDDEN
expect_s3_class(avg_flights, "tbl_df")
expect_type(avg_flights$carrier, "character")
expect_type(avg_flights$avg_speed, "double")
expect_type(avg_flights$avg_distance_km, "double")
expect_equal(nrow(avg_flights), 4)
expect_equal(ncol(avg_flights), 3)
#expect_equal(avg_flights[[1]][1], 'Alaska Airlines')
#expect_equal(avg_flights[[1]][3], 'American Airlines')
expect_true(avg_flights[[1]][1] %in% c("Alaska Airlines", "US Airways"))
expect_true(avg_flights[[1]][3] %in% c("American Airlines", "United Airlines"))
expect_equal(round(sum(as.numeric(avg_flights$avg_speed)), -2), 2600)

## Exercise 2: mapping with {purrr}
rubric={accuracy:20}

We want to know whether the list `mixed_bag` given below contains only numeric elements. If it does, we want to output `TRUE`; if not, we want to output `FALSE`.

To do this, use a {purrr} `map*` function to iterate over the list given below and generate a logical vector that holds `TRUE` if the list element is numeric and `FALSE` if it is not. Then use the fact that R can sum logical vectors (`TRUE` takes on the value of 1 and `FALSE` takes on the value of 0) and check whether the sum of the logical vector generated by the `map*` function equals the length of the `mixed_bag` list.
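
The summing trick can be seen on a small hand-written logical vector. This is only a sketch in base R; the vector names are hypothetical stand-ins for the output of a `map*` call.

```r
# Hypothetical flags, standing in for the result of a map* call
some_flags <- c(TRUE, FALSE, TRUE)
sum(some_flags)                           # 2, because each TRUE counts as 1
sum(some_flags) == length(some_flags)     # FALSE: not every flag is TRUE

all_flags <- c(TRUE, TRUE, TRUE)
sum(all_flags) == length(all_flags)       # TRUE: every flag is TRUE
```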

In [5]:
mixed_bag <- list(c(11232, 21231, 32123),
                 "https://github.com/UBC-DSCI/introduction-to-datascience",
                 c(TRUE, FALSE, FALSE, TRUE, TRUE),
                 c("CRC Press"),
                 list(1, 2, 3))
# BEGIN SOLUTION NO PROMPT
sum(map_lgl(mixed_bag, is.numeric)) == length(mixed_bag)
# END SOLUTION

## Exercise 3: functions
rubric={accuracy:16}

Below we provide code that performs a random walk for 10 steps (it follows the same logic as the code you wrote in DSCI 511 lab 1). Turn this code into a function in R that takes an argument `n` for the number of steps the random walk should take.
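
One detail worth keeping in mind once the number of steps becomes an argument: `1:n` counts down when `n` is 0, so a loop written with it still runs. The sketch below, using only base R and hypothetical variable names, shows the difference between `1:n` and `seq_len(n)`.

```r
n <- 0

length(1:n)          # 2, because 1:0 is c(1, 0)
length(seq_len(n))   # 0, so a loop over seq_len(n) does not run

steps_taken <- 0
for (i in seq_len(n)) {
    steps_taken <- steps_taken + 1
}
steps_taken          # 0
```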

Additionally, although the code below works, it neither adheres to the [tidyverse style guide](https://style.tidyverse.org/) nor uses roxygen2-style comments. Identify where it deviates from the tidyverse style guide and correct it. The {[styler](https://styler.r-lib.org/)} package will get you part way, but you will still need a human in the loop to adhere to all of the tidyverse style guide recommendations.

In [7]:
X=0
    y=0
    
    
    for (i in 1:10){
        dirGo = runif(1)
        if(dirGo<0.25)
        {
            # go right
            X = X+1
        } else if(dirGo<0.5){
            # go left
            X = X-1
        } else if(dirGo<0.75){
            # go up
            y = y+1
        } else
        {
            # go down
            y = y-1
        }
        
        print(c(X,y))
    }
    
    
    return(X ^ 2+y ^ 2)

# BEGIN SOLUTION NO PROMPT
#' Simulates n steps of a 2D random walk. Prints the result of each step
#' and calculates the squared distance from the origin.
#'    
#' @param n the number of steps to take
#' 
#' @return the squared distance from the origin
#' 
#' @examples
#' random_walk(20)
random_walk <- function(n) {
    x <- 0
    y <- 0
    for (i in seq_len(n)) {
        dir_go <- runif(1)
        if (dir_go < 0.25) {
            # go right
            x <- x + 1
        } else if (dir_go < 0.5) {
            # go left
            x <- x - 1
        } else if (dir_go < 0.75) {
            # go up
            y <- y + 1
        } else {
            # go down
            y <- y - 1
        }
        print(c(x, y))
    }
    x^2 + y^2
}

random_walk(20)
# END SOLUTION

[1] 1 0
[1] 0 0
[1] 0 1
[1] 1 1
[1] 1 2
[1] 1 1
[1] 1 0
[1]  1 -1
[1]  2 -1
[1]  3 -1


[1] 1 0
[1] 0 0
[1]  0 -1
[1]  1 -1
[1] 1 0
[1] 2 0
[1]  2 -1
[1] 2 0
[1] 1 0
[1] 1 1
[1] 1 0
[1]  1 -1
[1]  0 -1
[1]  1 -1
[1] 1 0
[1] 1 1
[1] 2 1
[1] 2 0
[1] 3 0
[1] 2 0


## Exercise 4: testing
rubric={accuracy:18} 

Sample variance of data generated from a normal/Gaussian distribution is defined as:

$$variance = \frac{\sum (x - mean)^2}{n - 1}$$

where $mean$ is the mean of our observations, $x$ is each individual observation, and $n$ is the number of observations.
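
As a worked instance of the formula, the arithmetic can be traced step by step on three made-up observations. This is a sketch with hypothetical variable names; `var()` appears only as a cross-check of the hand computation.

```r
x <- c(2, 4, 9)                          # made-up observations
n <- length(x)

x_mean <- sum(x) / n                     # 5
squared_deviations <- (x - x_mean)^2     # 9, 1, 16
by_hand <- sum(squared_deviations) / (n - 1)
by_hand                                  # 13

# Cross-check against the built-in function
all.equal(by_hand, var(x))               # TRUE
```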

Your task is to use test-driven development to write a function that calculates the variance from scratch (*i.e.*, do not use the `var` function in R). Your function should take in a vector and return a vector of length 1. Make sure you use defensive programming so that your function fails early (and provides useful error messages) if the user provides incorrect inputs (e.g., lists, data frames, etc.). Use {testthat} statements to check the correctness of your function on tractable edge cases, as well as to check that your function handles exceptions as expected. 
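
The pattern of failing early and then testing both the normal and the error paths can be sketched on an unrelated toy function. `range_width` below is a hypothetical helper invented for illustration; it is not part of the lab. In a test-driven workflow the two `test_that()` blocks would be written before the function body.

```r
library(testthat)

# Hypothetical helper: width of the range of a numeric vector
range_width <- function(x) {
    # Fail early with an informative message on unsupported input
    if (!is.numeric(x)) {
        stop("`x` must be a numeric vector")
    }
    max(x) - min(x)
}

test_that("range_width returns the expected value on simple input", {
    expect_equal(range_width(c(1, 5, 3)), 4)
    expect_equal(range_width(c(2, 2)), 0)
})

test_that("range_width rejects non-numeric input", {
    expect_error(range_width(list(1, 2)), "numeric vector")
    expect_error(range_width("a"), "numeric vector")
})
```

Passing a pattern such as `"numeric vector"` to `expect_error()` checks that the function fails for the intended reason and not because of some unrelated error.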

*Hint - you likely need to avoid using {tidyverse} functions in your solution as we will not learn how to write functions with them until next week (they are a little trickier to program with due to their unquoted column names).*

In [8]:
# BEGIN SOLUTION NO PROMPT
#' Calculates the variance of a vector of numbers.
#'
#' Calculates the sample variance of data generated from a normal/Gaussian distribution, 
#' omitting NA's in the data.
#'
#' @param data numeric vector of numbers whose length is > 1.
#'
#' @return numeric vector of length one, the variance.
#'
#' @examples
#' variance(c(1, 2, 3))
variance <- function(data) {
    if (is.list(data)) {
        stop("input should be a vector")
    }
    if (!is.numeric(data)) {
        stop("input should be a numeric vector")
    }
    
    # Drop missing values once so that every step uses the same observations
    data <- data[!is.na(data)]
    n <- length(data)
    data_mean <- sum(data) / n
    sum_diffsq <- sum((data - data_mean)^2)
    sum_diffsq / (n - 1)
}

test_that('variance is calculated correctly', {
    expect_equal(variance(c(1, 1)), 0)
    expect_equal(variance(c(2, 4)), 2)
    expect_true(is.na(variance(c(1))))
})
test_that('variance expects a numeric vector', {
    expect_error(variance(list(1, 2, 3)))
    expect_error(variance(data.frame(1, 2, 3)))
    expect_error(variance(c("one", "two", "three")))
})
# END SOLUTION

Test passed 😸
Test passed 🌈


## Exercise 5: (Challenging Question)
rubric={accuracy:5}

We're going to be working with a data set from Kaggle to further explore the {purrr} `map*` functions. This data was collected under the instructions from Madrid's City Council and is publicly available on their website. It is named `madrid_pollution.tsv` and is available here https://github.com/UBC-DSCI/dsci-100-assets/blob/master/2019-fall/materials/worksheet_03/data/madrid_pollution.csv?raw=true. This data includes daily and hourly measurements of air quality from 2001 to 2006. Pollutants are categorized based on their chemical properties. More information about this data set can be found [here](https://www.kaggle.com/decide-soluciones/air-quality-madrid). 

In this exercise we want you to create a subset of this data frame that contains only the records for the year 2006, and only the columns with the pollutant values. Then we want you to use a {purrr} `map*` function and a standard error function (that you write yourself) to obtain the standard error for each pollutant in 2006, stored as a tibble. 

The standard error (of the mean) is defined as the standard deviation divided by the square root of the number of observations:

$$se = \frac{sd}{\sqrt{n}}$$

There is no built-in function for this in base R, so for this question you need to write it yourself. Be sure to also write tests for your function to prove that it works as expected.
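
To make the definition concrete, the calculation can be traced on four made-up observations. This is only a sketch of the arithmetic with hypothetical variable names.

```r
x <- c(2, 4, 2, 4)                  # made-up observations
n <- length(x)                      # 4

# sd(x) is sqrt(4 / 3): four squared deviations of 1, divided by n - 1
se_by_hand <- sd(x) / sqrt(n)
se_by_hand                          # about 0.577
```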

In [9]:
# BEGIN SOLUTION NO PROMPT
#' Calculates the standard error of a vector of numbers.
#'
#' Calculates the standard error of data generated from a normal/Gaussian distribution, 
#' omitting NA's in the data.
#'
#' @param data numeric vector of numbers whose length is > 1.
#'
#' @return numeric vector of length one, the standard error.
#'
#' @examples
#' se(c(1, 2, 3))
se <- function(data) {
    if (is.list(data)) {
        stop("input should be a vector")
    }
    if (!is.numeric(data)) {
        stop("input should be a numeric vector")
    }
    
    sd(data, na.rm = TRUE) / sqrt(length(na.omit(data)))
}

test_that('standard error is calculated correctly', {
    expect_equal(se(c(2, 2, 2, 2)), 0)
    expect_equal(se(c(2, 4, 2, 4)), 0.58, tolerance = 0.01)
    expect_true(is.na(se(c(1))))
})
test_that('se expects a numeric vector', {
    expect_error(se(list(1, 2, 3)))
    expect_error(se(data.frame(1, 2, 3)))
    expect_error(se(c("one", "two", "three")))
})

madrid_se <- read_tsv("https://raw.githubusercontent.com/UBC-DSCI/dsci-100-assets/master/2019-fall/materials/worksheet_03/data/madrid_pollution.csv") |>
    filter(year == 2006) |> 
    select(-date, -year, -month) |> 
    map_df(se)
madrid_se
# END SOLUTION

Test passed 🥇
Test passed 🌈


Rows: 51864 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr   (1): month
dbl  (15): BEN, CO, EBE, MXY, NMHC, NO_2, NOx, OXY, O_3, PM10, PXY, SO_2, TC...
dttm  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


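| BEN | CO | EBE | MXY | NMHC | NO_2 | NOx | OXY | O_3 | PM10 | PXY | SO_2 | TCH | TOL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0.0105806 | 0.003330898 | 0.01547974 | 0.03847104 | 0.001128442 | 0.3594865 | 1.169954 | 0.01616178 | 0.2584397 | 0.33335 | 0.0135156 | 0.04546101 | 0.002264205 | 0.06696747 |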


Note - there is a new {tidyverse} function, `across`, that is also useful for applying a function across columns (docs: https://dplyr.tidyverse.org/reference/across.html). However, we focus on teaching `map_*` in MDS as it is more general. Feel free to use either in the future if the use of `map_*` is not specified.
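
For comparison, the two approaches can be placed side by side on a tiny made-up tibble. This is a sketch only: `toy` and the result names are hypothetical, and `mean` stands in for whatever column-wise function is being applied.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data, not the Madrid pollution data
toy <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))

# Column-wise means with a purrr map* function
means_map <- map_df(toy, mean)

# The same summary with dplyr's across()
means_across <- toy |>
    summarise(across(everything(), mean))

all.equal(means_map, means_across)
```

Both calls return a one-row tibble with one column per input column. `across()` only works inside {dplyr} verbs such as `summarise()`, whereas `map_*` functions also apply to plain lists and vectors.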

Congratulations! You are done with the lab! Pat yourself on the back, make sure you have pushed at least 3 commits to GitHub, and submit your worksheet to Gradescope!