diff --git a/01-Visualize.Rmd b/01-Visualize.Rmd index 64cfd64..0d1c474 100755 --- a/01-Visualize.Rmd +++ b/01-Visualize.Rmd @@ -94,7 +94,7 @@ Make a density plot of `budget` colored by `clean_test`. ## Your Turn 7 -Make a density plot of `clean_test` colored by `clean_test`. +Make a barchart of `clean_test` colored by `clean_test`. ```{r} diff --git a/03-Tidy.Rmd b/03-Tidy.Rmd index 0b189b2..23f1503 100755 --- a/03-Tidy.Rmd +++ b/03-Tidy.Rmd @@ -5,7 +5,7 @@ editor_options: chunk_output_type: inline --- - + ```{r setup} library(tidyverse) diff --git a/cheatsheets/dplyr-data-transformation.pdf b/cheatsheets/dplyr-data-transformation.pdf old mode 100755 new mode 100644 index 27646dc..22c6003 Binary files a/cheatsheets/dplyr-data-transformation.pdf and b/cheatsheets/dplyr-data-transformation.pdf differ diff --git a/cheatsheets/ggplot2-data-visualization.pdf b/cheatsheets/ggplot2-data-visualization.pdf old mode 100755 new mode 100644 index 51998d8..0eab6ca Binary files a/cheatsheets/ggplot2-data-visualization.pdf and b/cheatsheets/ggplot2-data-visualization.pdf differ diff --git a/cheatsheets/lubridate-dates-times.pdf b/cheatsheets/lubridate-dates-times.pdf old mode 100755 new mode 100644 index 4a71263..f183fa0 Binary files a/cheatsheets/lubridate-dates-times.pdf and b/cheatsheets/lubridate-dates-times.pdf differ diff --git a/cheatsheets/purrr-iterate.pdf b/cheatsheets/purrr-iterate.pdf old mode 100755 new mode 100644 index d90a621..db4d751 Binary files a/cheatsheets/purrr-iterate.pdf and b/cheatsheets/purrr-iterate.pdf differ diff --git a/cheatsheets/rstudio-ide.pdf b/cheatsheets/rstudio-ide.pdf old mode 100755 new mode 100644 index dd2c750..486f7fd Binary files a/cheatsheets/rstudio-ide.pdf and b/cheatsheets/rstudio-ide.pdf differ diff --git a/cheatsheets/stringr-strings.pdf b/cheatsheets/stringr-strings.pdf old mode 100755 new mode 100644 index d7157bd..5286cc1 Binary files a/cheatsheets/stringr-strings.pdf and b/cheatsheets/stringr-strings.pdf differ diff --git a/slides/05-Data-Types.pdf b/slides/05-Data-Types.pdf old mode 100755 new mode 100644 index 020fca5..041a4b4 Binary files a/slides/05-Data-Types.pdf and b/slides/05-Data-Types.pdf differ diff --git a/slides/06-Iteration.pdf b/slides/06-Iteration.pdf old mode 100755 new mode 100644 index 7d152ee..25be1aa Binary files a/slides/06-Iteration.pdf and b/slides/06-Iteration.pdf differ diff --git a/solutions/01-Visualize-solutions.Rmd b/solutions/01-Visualize-solutions.Rmd new file mode 100644 index 0000000..c87d9bc --- /dev/null +++ b/solutions/01-Visualize-solutions.Rmd @@ -0,0 +1,172 @@ +--- +title: "Visualization - solutions" +output: html_notebook +editor_options: + chunk_output_type: inline +--- + + + +## Setup + +The first chunk in an R Notebook is usually titled "setup," and by convention includes the R packages you want to load. Remember, in order to use an R package you have to run some `library()` code every session. Execute these lines of code to load the packages. + +```{r setup} +library(ggplot2) +library(fivethirtyeight) +``` + +## Bechdel test data + +We're going to start by playing with data collected by the website FiveThirtyEight on movies and [the Bechdel test](https://en.wikipedia.org/wiki/Bechdel_test). + +To begin, let's just preview our data. There are a couple ways to do that. One is just to type the name of the data and execute it like a piece of code. + +```{r} +bechdel +``` + +Notice that you can page through to see more of the dataset. + +Sometimes, people prefer to see their data in a more spreadsheet-like format, and RStudio provides a way to do that. Go to the Console and type `View(bechdel)` to see the data preview. + +(An aside-- `View` is a special function. Since it makes something happen in the RStudio interface, it doesn't work properly in R Notebooks. Most R functions have names that start with lowercase letters, so the uppercase "V" is there to remind you of its special status.) + + + +## Consider +What relationship do you expect to see between movie budget (budget) and domestic gross(domgross)? + +## Your Turn 1 + +Run the code on the slide to make a graph. Pay strict attention to spelling, capitalization, and parentheses! + +```{r} +ggplot(data = bechdel) + + geom_point(mapping = aes(x = budget, y = domgross)) +``` + +## Your Turn 2 + +Add `color`, `size`, `alpha`, and `shape` aesthetics to your graph. Experiment. + +```{r} +ggplot(data = bechdel) + + geom_point(mapping = aes(x = budget, y = domgross, color=clean_test)) + +ggplot(bechdel) + + geom_point(mapping = aes(x = budget, y = domgross, size=clean_test)) +ggplot(bechdel) + + geom_point(mapping = aes(x = budget, y = domgross, shape=clean_test)) +ggplot(bechdel) + + geom_point(mapping = aes(x = budget, y = domgross, alpha=clean_test)) + +``` + +## Set vs map + +```{r} +ggplot(bechdel) + + geom_point(mapping = aes(x = budget, y = domgross), color="blue") +``` + +## Your Turn 3 + +Replace this scatterplot with one that draws boxplots. Use the cheatsheet. Try your best guess. + +```{r} +ggplot(data = bechdel) + geom_point(aes(x = clean_test, y = budget)) + +ggplot(data = bechdel) + geom_boxplot(aes(x = clean_test, y = budget)) +``` + +## Your Turn 4 + +Make a histogram of the `budget` variable from `bechdel`. + +```{r} +ggplot(bechdel) + + geom_histogram(aes(x=budget)) +``` + +## Your Turn 5 +Try to find a better binwidth for `budget`. + +```{r} +ggplot(data = bechdel) + + geom_histogram(mapping = aes(x = budget), binwidth=10000000) +``` + +## Your Turn 6 + +Make a density plot of `budget` colored by `clean_test`. + +```{r} +ggplot(data = bechdel) + + geom_density(mapping = aes(x = budget)) + +ggplot(data = bechdel) + + geom_density(mapping = aes(x = budget, color=clean_test)) +``` + + +## Your Turn 7 + +Make a barchart of `clean_test` colored by `clean_test`. + +```{r} +ggplot(data=bechdel) + + geom_bar(mapping = aes(x = clean_test, fill = clean_test)) +``` + + +## Your Turn 8 + +Predict what this code will do. Then run it. + +```{r} +ggplot(bechdel) + + geom_point(aes(budget, domgross)) + + geom_smooth(aes(budget, domgross)) +``` + +## global vs local + +```{r} +ggplot(data = bechdel, mapping = aes(x = budget, y = domgross)) + + geom_point(mapping = aes(color = clean_test)) + + geom_smooth() +``` + +```{r} +ggplot(data = bechdel, mapping = aes(x = budget, y = domgross)) + + geom_point(mapping = aes(color = clean_test)) + + geom_smooth(data = filter(bechdel, clean_test == "ok")) +``` + +## Your Turn + +What does `getwd()` return? + +```{r} +getwd() +``` + +## Your Turn 9 + +Save the last plot and then locate it in the files pane. If you run your `ggsave()` code inside this notebook, the image will be saved in the same directory as your .Rmd file (likely, project -> code), but if you run `ggsave()` in the Console it will be in your working directory. + +```{r} +ggsave("my-plot.png") +``` + +*** + +# Take aways + +You can use this code template to make thousands of graphs with **ggplot2**. + +```{r eval = FALSE} +ggplot(data = ) + + (mapping = aes()) +``` \ No newline at end of file diff --git a/solutions/02-Transform-Solutions.Rmd b/solutions/02-Transform-Solutions.Rmd new file mode 100644 index 0000000..2645232 --- /dev/null +++ b/solutions/02-Transform-Solutions.Rmd @@ -0,0 +1,357 @@ +--- +title: "Transform Data - solutions" +output: html_notebook +editor_options: + chunk_output_type: inline +--- + + + +```{r setup} +library(dplyr) +library(babynames) +library(nycflights13) +library(skimr) +``` + +## Babynames + +```{r} +babynames +skim(babynames) +skim_with(integer = list(p25 = NULL, p75=NULL)) +``` + + +## Your Turn 1 +Run the skim_with() command, and then try skimming babynames again to see how the output is different +```{r} +skim(babynames) +``` + +## Select + +```{r} +select(babynames, name, prop) +``` + +## Your Turn 2 + +Alter the code to select just the `n` column: + +```{r} +select(babynames, n) +``` + +## Consider + +Which of these is NOT a way to select the `name` and `n` columns together? + +```{r} +select(babynames, -c(year, sex, prop)) +select(babynames, name:n) +select(babynames, starts_with("n")) +select(babynames, ends_with("n")) +``` + +## Your Turn 3 + +Show: + +* All of the names where prop is greater than or equal to 0.08 +* All of the children named "Sea" +* All of the names that have a missing value for `n` + +```{r} +filter(babynames, prop >= 0.08) +filter(babynames, name == "Sea") +filter(babynames, is.na(n)) +``` + +## Your Turn 4 + +Use Boolean operators to alter the code below to return only the rows that contain: + +* Girls named Sea +* Names that were used by exactly 5 or 6 children in 1880 +* Names that are one of Acura, Lexus, or Yugo + +```{r} +filter(babynames, name == "Sea", sex == "F") +filter(babynames, n == 5 | n == 6, year == 1880) +filter(babynames, name %in% c("Acura", "Lexus", "Yugo")) +``` + +## Arrange + +```{r} +arrange(babynames, n) +``` + +## Your Turn 5 + +Arrange babynames by `n`. Add `prop` as a second (tie breaking) variable to arrange on. Can you tell what the smallest value of `n` is? + +```{r} +arrange(babynames, n, prop) +``` + +## desc + +```{r} +arrange(babynames, desc(n)) +``` + +## Your Turn 6 + +Use `desc()` to find the names with the highest prop. +Then, use `desc()` to find the names with the highest n. + +```{r} +arrange(babynames, desc(prop)) +arrange(babynames, desc(n)) +``` + +## Steps and the pipe + +```{r} +babynames %>% + filter(year == 2015, sex == "M") %>% + select(name, n) %>% + arrange(desc(n)) +``` + +## Your Turn 7 + +Use `%>%` to write a sequence of functions that: + +1. Filter babynames to just the girls that were born in 2015 +2. Select the `name` and `n` columns +3. Arrange the results so that the most popular names are near the top. + +```{r} +babynames %>% + filter(year == 2015, sex == "F") %>% + select(name, n) %>% + arrange(desc(n)) +``` + +## Your Turn 8 + +1. Trim `babynames` to just the rows that contain your `name` and your `sex` +2. Trim the result to just the columns that will appear in your graph (not strictly necessary, but useful practice) +3. Plot the results as a line graph with `year` on the x axis and `prop` on the y axis + +```{r} +babynames %>% + filter(name == "Amelia", sex == "F") %>% + select(year, prop) %>% + ggplot() + + geom_line(mapping = aes(year, prop)) +``` + +## Your Turn 9 + +Use summarise() to compute three statistics about the data: + +1. The first (minimum) year in the dataset +2. The last (maximum) year in the dataset +3. The total number of children represented in the data + +```{r} +babynames %>% + summarise(first = min(year), + last = max(year), + total = sum(n)) +``` + +## Your Turn 10 + +Extract the rows where `name == "Khaleesi"`. Then use `summarise()` and a summary functions to find: + +1. The total number of children named Khaleesi +2. The first year Khaleesi appeared in the data + +```{r} +babynames %>% + filter(name == "Khaleesi") %>% + summarise(total = sum(n), first = min(year)) +``` + +## Toy data for transforming + +```{r} +# Toy dataset to use +pollution <- tribble( + ~city, ~size, ~amount, + "New York", "large", 23, + "New York", "small", 14, + "London", "large", 22, + "London", "small", 16, + "Beijing", "large", 121, + "Beijing", "small", 56 +) +``` + +## Summarize + +```{r} +pollution %>% + summarise(mean = mean(amount), sum = sum(amount), n = n()) +``` + +```{r} +pollution %>% + group_by(city) %>% + summarise(mean = mean(amount), sum = sum(amount), n = n()) +``` + + +## Your Turn 11 + +Use `group_by()`, `summarise()`, and `arrange()` to display the ten most popular names. Compute popularity as the total number of children of a single gender given a name. + +```{r} +babynames %>% + group_by(name, sex) %>% + summarise(total = sum(n)) %>% + arrange(desc(total)) +``` + +## Your Turn 12 + +Use grouping to calculate and then plot the number of children born each year over time. + +```{r} +babynames %>% + group_by(year) %>% + summarise(n_children = sum(n)) %>% + ggplot() + + geom_line(mapping = aes(x = year, y = n_children)) +``` + +## Ungroup + +```{r} +babynames %>% + group_by(name, sex) %>% + summarise(total = sum(n)) %>% + arrange(desc(total)) +``` + +## Mutate + +```{r} +babynames %>% + mutate(percent = round(prop*100, 2)) +``` + +## Your Turn 13 + +Use `min_rank()` and `mutate()` to rank each row in `babynames` from largest `n` to lowest `n`. + +```{r} +babynames %>% + mutate(rank = min_rank(desc(prop))) +``` + +## Your Turn 14 + +Compute each name's rank _within its year and sex_. +Then compute the median rank _for each combination of name and sex_, and arrange the results from highest median rank to lowest. + +```{r} +babynames %>% + group_by(year, sex) %>% + mutate(rank = min_rank(desc(prop))) %>% + group_by(name, sex) %>% + summarise(score = median(rank)) %>% + arrange(score) +``` + +## Flights data +```{r} +flights +skim(flights) +``` + +## Toy data + +```{r} +band <- tribble( + ~name, ~band, + "Mick", "Stones", + "John", "Beatles", + "Paul", "Beatles" +) + +instrument <- tribble( + ~name, ~plays, + "John", "guitar", + "Paul", "bass", + "Keith", "guitar" +) + +instrument2 <- tribble( + ~artist, ~plays, + "John", "guitar", + "Paul", "bass", + "Keith", "guitar" +) +``` + +## Mutating joins + +```{r} +band %>% left_join(instrument, by = "name") +``` + +## Your Turn 15 + +Which airlines had the largest arrival delays? Complete the code below. + +1. Join `airlines` to `flights` +2. Compute and order the average arrival delays by airline. Display full names, no codes. + +```{r} +flights %>% + drop_na(arr_delay) %>% + left_join(airlines, by = "carrier") %>% + group_by(name) %>% + summarise(delay = mean(arr_delay)) %>% + arrange(delay) +``` + +## Different names + +```{r} +band %>% left_join(instrument2, by = c("name" = "artist")) +``` + +## Your Turn 16 + +How many airports in `airports` are serviced by flights originating in New York (i.e. flights in our dataset?) Notice that the column to join on is named `faa` in the **airports** data set and `dest` in the **flights** data set. + + +```{r} +airports %>% + semi_join(flights, by = c("faa" = "dest")) %>% + distinct(faa) +``` + +*** + +# Take aways + +* Extract variables with `select()` +* Extract cases with `filter()` +* Arrange cases, with `arrange()` + +* Make tables of summaries with `summarise()` +* Make new variables, with `mutate()` +* Do groupwise operations with `group_by()` + +* Connect operations with `%>%` + +* Use `left_join()`, `right_join()`, `full_join()`, or `inner_join()` to join datasets +* Use `semi_join()` or `anti_join()` to filter datasets against each other diff --git a/solutions/03-Tidy-Solutions.Rmd b/solutions/03-Tidy-Solutions.Rmd new file mode 100644 index 0000000..6cbdd2b --- /dev/null +++ b/solutions/03-Tidy-Solutions.Rmd @@ -0,0 +1,158 @@ +--- +title: "Tidy -- Solutions" +output: + github_document: + df_print: tibble + html_document: + df_print: paged +--- + + + + +```{r setup} +library(tidyverse) +library(babynames) + +# Toy data +cases <- tribble( + ~Country, ~"2011", ~"2012", ~"2013", + "FR", 7000, 6900, 7000, + "DE", 5800, 6000, 6200, + "US", 15000, 14000, 13000 +) +pollution <- tribble( + ~city, ~size, ~amount, + "New York", "large", 23, + "New York", "small", 14, + "London", "large", 22, + "London", "small", 16, + "Beijing", "large", 121, + "Beijing", "small", 121 +) +bp_systolic <- tribble( + ~ subject_id, ~ time_1, ~ time_2, ~ time_3, + 1, 120, 118, 121, + 2, 125, 131, NA, + 3, 141, NA, NA +) +bp_systolic2 <- tribble( + ~ subject_id, ~ time, ~ systolic, + 1, 1, 120, + 1, 2, 118, + 1, 3, 121, + 2, 1, 125, + 2, 2, 131, + 3, 1, 141 +) +``` + +## Tidy and untidy data + +`table1` is tidy: +```{r} +table1 +``` + +For example, it's easy to add a rate column with `mutate()`: +```{r} +table1 %>% + mutate(rate = cases/population) +``` + +`table2` isn't tidy, the count column really contains two variables: +```{r} +table2 +``` + +It makes it very hard to manipulate. + +## Your Turn 1 + +Is `bp_systolic` tidy? + +```{r} +bp_systolic2 +``` + +## Your Turn 2 + +Using `bp_systolic2` with `group_by()`, and `summarise()`: + +* Find the average systolic blood pressure for each subject +* Find the last time each subject was measured + +```{r} +bp_systolic2 %>% + group_by(subject_id) %>% + summarise(avg_bp = mean(systolic), + last_time = max(time)) +``` + +## Your Turn 3 + +On a sheet of paper, draw how the cases data set would look if it had the same values grouped into three columns: **country**, **year**, **n** + +----------------------------- + country year cases +------------- ------ -------- + Afghanistan 1999 745 + + Afghanistan 2000 2666 + + Brazil 1999 37737 + + Brazil 2000 80488 + + China 1999 212258 + + China 2000 213766 +----------------------------- + +## Your Turn 4 + +Use `gather()` to reorganize `table4a` into three columns: **country**, **year**, and **cases**. + +```{r} +table4a %>% + gather(key = "year", + value = "cases", -country) %>% + arrange(country) +``` + +## Your Turn 5 + +On a sheet of paper, draw how `pollution` would look if it had the same values grouped into three columns: **city**, **large**, **small** + +-------------------------- + city large small +---------- ------- ------- + Beijing 121 121 + + London 22 16 + + New York 23 14 +-------------------------- + +## Your Turn 6 + +Use `spread()` to reorganize `table2` into four columns: **country**, **year**, **cases**, and **population**. + +```{r} +table2 %>% + spread(key = type, value = count) +``` + +*** + +# Take Aways + +Data comes in many formats but R prefers just one: _tidy data_. + +A data set is tidy if and only if: + +1. Every variable is in its own column +2. Every observation is in its own row +3. Every value is in its own cell (which follows from the above) + +What is a variable and an observation may depend on your immediate goal. \ No newline at end of file diff --git a/solutions/04-Case-Study-Solutions.Rmd b/solutions/04-Case-Study-Solutions.Rmd new file mode 100644 index 0000000..842ed7d --- /dev/null +++ b/solutions/04-Case-Study-Solutions.Rmd @@ -0,0 +1,211 @@ +--- +title: 'Case Study: Friday the 13th Effect (Solution)' +output: + html_document: + df_print: paged + github_document: + df_print: tibble +--- + + + +```{r setup} +library(fivethirtyeight) +library(tidyverse) +``` + +## Task + +Reproduce this figure from fivethirtyeight's article [*Some People Are Too Superstitious To Have A Baby On Friday The 13th*](https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/): + +![](resources/bialik-fridaythe13th-2.png) + +## Data + +In the `fivethiryeight` package there are two datasets containing birth data, but for now let's just work with one `US_births_1994_2003`. Note that since we have data from 1994-2003, our results may differ somewhat from the figure based on 1994-2014. + +## Your Turn 1 + +With your neighbour, brainstorm the steps needed to get the data in a form ready to make the plot. + +```{r} +US_births_1994_2003 +``` + +## Some overviews of the data + +Whole time series: +```{r} +ggplot(US_births_1994_2003, aes(x = date, y = births)) + + geom_line() +``` +There is so much fluctuation it's really hard to see what is going on. + +Let's try just looking at one year: +```{r} +US_births_1994_2003 %>% + filter(year == 1994) %>% + ggplot(mapping = aes(x = date, y = births)) + + geom_line() +``` +Strong weekly pattern accounts for most variation. + +## Strategy + +Use the figure as a guide for what the data should like to make the final plot. We want to end up with something like: + +--------------------------- + day_of_week avg_diff_13 +------------- ------------- + Mon -2.686 + + Tues -1.378 + + Wed -3.274 + + ... ... + +--------------------------- + +There is more than one way to get there, but we +ll roughly follow this strategy: + +* Get just the data for the 6th, 13th, and 20th +* Calculate variable of interest: + * (For each month/year): + * Find average births on 6th and 20th + * Find percentage difference between births on 13th and average births on 6th and 20th + + * Average percent difference by day of the week +* Create plot + +## Your Turn 2 + +Extract just the 6th, 13th and 20th of each month: + +```{r} +US_births_1994_2003 %>% + select(-date) %>% + filter(date_of_month %in% c(6, 13, 20)) +``` + +## Your Turn 3 + +Which arrangement is tidy? + +**Option 1:** + +----------------------------------------------------- + year month date_of_month day_of_week births +------ ------- --------------- ------------- -------- + 1994 1 6 Thurs 11406 + + 1994 1 13 Thurs 11212 + + 1994 1 20 Thurs 11682 +----------------------------------------------------- + +**Option 2:** + +---------------------------------------------------- + year month day_of_week 6 13 20 +------ ------- ------------- ------- ------- ------- + 1994 1 Thurs 11406 11212 11682 +---------------------------------------------------- + +(**Hint:** think about our next step *"Find the percent difference between the 13th and the average of the 6th and 12th"*. In which layout will this be easier using our tidy tools?) + +**Solution**: Option 2, since then we can easily use `mutate()`. + +## Your Turn 4 + +Tidy the filtered data to have the days in columns. + +```{r} +US_births_1994_2003 %>% + select(-date) %>% + filter(date_of_month %in% c(6, 13, 20)) %>% + spread(date_of_month, births) +``` + +## Your Turn 5 + +Now use `mutate()` to add columns for: + +* The average of the births on the 6th and 20th +* The percentage difference between the number of births on the 13th and the average of the 6th and 20th + +```{r} +US_births_1994_2003 %>% + select(-date) %>% + filter(date_of_month %in% c(6, 13, 20)) %>% + spread(date_of_month, births) %>% + mutate( + avg_6_20 = (`6` + `20`)/2, + diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 + ) +``` + +## A little additional exploring + +Now we have a percent difference between the 13th and the 6th and 20th of each month, it's probably worth exploring a little (at the very least to check our calculations seem reasonable). + +To make it a little easier let's assign our current data to a variable +```{r} +births_diff_13 <- US_births_1994_2003 %>% + select(-date) %>% + filter(date_of_month %in% c(6, 13, 20)) %>% + spread(date_of_month, births) %>% + mutate( + avg_6_20 = (`6` + `20`)/2, + diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 + ) +``` + +Then take a look +```{r} +births_diff_13 %>% + ggplot(mapping = aes(day_of_week, diff_13)) + + geom_point() +``` + +Looks like we are on the right path. There's a big outlier one Monday +```{r} +births_diff_13 %>% + filter(day_of_week == "Mon", diff_13 > 10) +``` + +Seem's to be driven but a particularly low number of births on the 6th of Sep 1999. Maybe a holiday effect? Labour Day was of the 6th of Sep that year. + +## Your Turn 6 + +Summarize each day of the week to have mean of diff_13. + +Then, recreate the fivethirtyeight plot. + +```{r} +US_births_1994_2003 %>% + select(-date) %>% + filter(date_of_month %in% c(6, 13, 20)) %>% + spread(date_of_month, births) %>% + mutate( + avg_6_20 = (`6` + `20`)/2, + diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 + ) %>% + group_by(day_of_week) %>% + summarise(avg_diff_13 = mean(diff_13)) %>% + ggplot(aes(x = day_of_week, y = avg_diff_13)) + + geom_bar(stat = "identity") +``` + +## Extra Challenges + +* If you wanted to use the `US_births_2000_2014` data instead, what would you need to change in the pipeline? How about using both `US_births_1994_2003` and `US_births_2000_2014`? + +* Try not removing the `date` column. At what point in the pipeline does it cause problems? Why? + +* Can you come up with an alternative way to investigate the Friday the 13th effect? Try it out! + +## Takeaways + +The power of the tidyverse comes from being able to easily combine functions that do simple things well. \ No newline at end of file diff --git a/solutions/06-Iterate-solutions.Rmd b/solutions/06-Iterate-solutions.Rmd new file mode 100644 index 0000000..8bbb371 --- /dev/null +++ b/solutions/06-Iterate-solutions.Rmd @@ -0,0 +1,164 @@ +--- +title: "Iteration (solutions)" +output: + html_document: + df_print: paged + github_document: + df_print: tibble +--- + + + +```{r setup} +library(tidyverse) + +# Toy data +set.seed(1000) +exams <- list( + student1 = round(runif(10, 50, 100)), + student2 = round(runif(10, 50, 100)), + student3 = round(runif(10, 50, 100)), + student4 = round(runif(10, 50, 100)), + student5 = round(runif(10, 50, 100)) +) + +extra_credit <- list(0, 0, 10, 10, 15) +``` + +## Your Turn 1 + +What kind of object is `mod`? Why are models stored as this kind of object? + +```{r} +mod <- lm(price ~ carat + cut + color + clarity, data = diamonds) +# View(mod) +``` + +`mod` is a list. A list is used because we need to store lots of heterogeneous information. + +## Quiz + +What's the difference between a list and an **atomic** vector? + +Atomic vectors are: "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw" vectors. + +Lists can hold data of different types and different lengths, we can even put lists inside other lists. + +## Your Turn 2 + +Here is a list: + +```{r} +a_list <- list(num = c(8, 9), + log = TRUE, + cha = c("a", "b", "c")) +``` + +Here are two subsetting commands. Do they return the same values? Run the code chunk above, _and then_ run the code chunks below to confirm + +```{r} +a_list["num"] +``` + +```{r} +a_list$num +``` + +## Your Turn 3 + +What will each of these return? Run the code chunks to confirm. + +```{r} +vec <- c(-2, -1, 0, 1, 2) +abs(vec) +``` + +`abs()` returns the absolute value of each element. + +```{r, error = TRUE} +lst <- list(-2, -1, 0, 1, 2) +abs(lst) +``` + +Out intent might be to take the absolute value of each element, but we get an error, because `abs()` doens't know how to handle a list. + +## Your Turn 4 + +Run the code in the chunks. What does it return? + +```{r} +list(student1 = mean(exams$student1), + student2 = mean(exams$student2), + student3 = mean(exams$student3), + student4 = mean(exams$student4), + student5 = mean(exams$student5)) +``` + +This chunk manually iterates over the elements of `exams` taking the mean of each element, and returning the results in a list. + +```{r} +library(purrr) +map(exams, mean) +``` + +This does the exact same thing, but automatically. + + +## Your Turn 5 + +Calculate the variance (`var()`) of each student’s exam grades. + +```{r} +exams %>% map(var) +``` + +## Your Turn 6 + +Calculate the max grade (max())for each student. Return the result as a vector. + +```{r} +exams %>% map_dbl(max) +``` + +## Your Turn 7 + +Write a function that counts the best exam twice and then takes the average. Use it to grade all of the students. + +1. Write code that solves the problem for a real object +2. Wrap the code in `function(){}` to save it +3. Add the name of the real object as the function argument + +```{r} +double_best <- function(x) { + (sum(x) + max(x)) / (length(x) + 1) +} + +exams %>% + map_dbl(double_best) +``` + +### Your Turn 8 + +Compute a final grade for each student, where the final grade is the average test score plus any `extra_credit` assigned to the student. Return the results as a double (i.e. numeric) vector. + +```{r} +exams %>% + map2_dbl(extra_credit, function(x, y) mean(x) + y) +``` + + +*** + +# Take Aways + +Lists are a useful way to organize data, but you need to arrange manually for functions to iterate over the elements of a list. + +You can do this with the `map()` family of functions in the purrr package. + +To write a function, + +1. Write code that solves the problem for a real object +2. Wrap the code in `function(){}` to save it +3. Add the name of the real object as the function argument + +This sequence will help prevent bugs in your code (and reduce the time you spend correcting bugs). diff --git a/solutions/07-Model-Solutions.Rmd b/solutions/07-Model-Solutions.Rmd new file mode 100644 index 0000000..67320a7 --- /dev/null +++ b/solutions/07-Model-Solutions.Rmd @@ -0,0 +1,125 @@ +--- +title: "Model (solutions)" +output: + html_document: + df_print: paged + github_document: + df_print: tibble +--- + +```{r setup, message=FALSE} +library(tidyverse) +library(modelr) +library(broom) + +wages <- heights %>% filter(income > 0) +``` + +## Your Turn 1 + +Fit the model on the slide and then examine the output. What does it look like? + +```{r} +mod_e <- lm(log(income) ~ education, data = wages) +mod_e +``` + +## Your Turn 2 + +Use a pipe to model `log(income)` against `height`. Then use broom and dplyr functions to extract: + +1. The **coefficient estimates** and their related statistics +2. The **adj.r.squared** and **p.value** for the overall model + +```{r} +mod_h <- wages %>% lm(log(income) ~ height, data = .) +mod_h %>% + tidy() + +mod_h %>% + glance() %>% + select(adj.r.squared, p.value) +``` + +## Your Turn 3 + +Model `log(income)` against `education` _and_ `height`. Do the coefficients change? + +```{r} +mod_eh <- wages %>% + lm(log(income) ~ education + height, data = .) + +mod_eh %>% + tidy() +``` + +## Your Turn 4 + +Model `log(income)` against `education` and `height` and `sex`. Can you interpret the coefficients? + +```{r} +mod_ehs <- wages %>% + lm(log(income) ~ education + height + sex, data = .) + +mod_ehs %>% + tidy() +``` + +## Your Turn 5 + +Use a broom function and ggplot2 to make a line graph of `height` vs `.fitted` for our heights model, `mod_h`. + +_Bonus: Overlay the plot on the original data points._ + +```{r} +mod_h <- wages %>% lm(log(income) ~ height, data = .) + +mod_h %>% + augment(data = wages) %>% + ggplot(mapping = aes(x = height, y = .fitted)) + + geom_point(mapping = aes(y = log(income)), alpha = 0.1) + + geom_line(color = "blue") +``` + +## Your Turn 6 + +Repeat the process to make a line graph of `height` vs `.fitted` colored by `sex` for model mod_ehs. Are the results interpretable? Add `+ facet_wrap(~education)` to the end of your code. What happens? + +```{r} +mod_ehs <- wages %>% lm(log(income) ~ education + height + sex, data = .) + +mod_ehs %>% + augment(data = wages) %>% + ggplot(mapping = aes(x = height, y = .fitted, color = sex)) + + geom_line() + + facet_wrap(~ education) +``` + +## Your Turn 7 + +Use one of `spread_predictions()` or `gather_predictions()` to make a line graph of `height` vs `pred` colored by `model` for each of mod_h, mod_eh, and mod_ehs. Are the results interpretable? + +Add `+ facet_grid(sex ~ education)` to the end of your code. What happens? + +```{r warning = FALSE, message = FALSE} +mod_h <- wages %>% lm(log(income) ~ height, data = .) +mod_eh <- wages %>% lm(log(income) ~ education + height, data = .) +mod_ehs <- wages %>% lm(log(income) ~ education + height + sex, data = .) + +wages %>% + gather_predictions(mod_h, mod_eh, mod_ehs) %>% + ggplot(mapping = aes(x = height, y = pred, color = model)) + + geom_line() + + facet_grid(sex ~ education) +``` + +*** + +# Take Aways + +* Use `glance()`, `tidy()`, and `augment()` from the **broom** package to return model values in a data frame. + +* Use `add_predictions()` or `gather_predictions()` or `spread_predictions()` from the **modelr** package to visualize predictions. + +* Use `add_residuals()` or `gather_residuals()` or `spread_residuals()` from the **modelr** package to visualize residuals. + diff --git a/solutions/08-Organize-Solutions.Rmd b/solutions/08-Organize-Solutions.Rmd new file mode 100644 index 0000000..ff70215 --- /dev/null +++ b/solutions/08-Organize-Solutions.Rmd @@ -0,0 +1,143 @@ +--- +title: "Organize with List Columns" +output: + html_document: + df_print: paged + github_document: + df_print: tibble +--- + + + +```{r setup} +library(tidyverse) +library(gapminder) +library(broom) + +nz <- gapminder %>% + filter(country == "New Zealand") +us <- gapminder %>% + filter(country == "United States") +``` + +## Your turn 1 + +How has life expectancy changed in other countries? +Make a line plot of lifeExp vs. year grouped by country. +Set alpha to 0.2, to see the results better. + +```{r} +gapminder %>% + ggplot(mapping = aes(x = year, y = lifeExp, group = country)) + + geom_line(alpha = 0.2) +``` + +## Quiz + +How is a data frame/tibble similar to a list? + +```{r} +gapminder_sm <- gapminder[1:5, ] +``` + +It is a list! Columns are like elements of a list + +You can extract them with `$` of `[[` +```{r} +gapminder_sm$country +gapminder_sm[["country"]] +``` + +Or get a new smaller list with `[`: +```{r} +gapminder_sm["country"] +``` + +## Quiz + +If one of the elements of a list can be another list, +can one of the columns of a data frame be another list? + +**Yes!**. + +```{r} +tibble( + num = c(1, 2, 3), + cha = c("one", "two", "three"), + listcol = list(1, c("1", "two", "FALSE"), FALSE) +) +``` + +And we call it a **list column**. + +## Your turn 2 + +Run this chunk: +```{r} +gapminder_nested <- gapminder %>% + group_by(country) %>% + nest() + +fit_model <- function(df) lm(lifeExp ~ year, data = df) + +gapminder_nested <- gapminder_nested %>% + mutate(model = map(data, fit_model)) + +get_rsq <- function(mod) glance(mod)$r.squared + +gapminder_nested <- gapminder_nested %>% + mutate(r.squared = map_dbl(model, get_rsq)) +``` + +Then filter `gapminder_nested` to find the countries with r.squared less than 0.5. + +```{r} +gapminder_nested %>% + filter(r.squared < 0.5) +``` + +## Your Turn 3 + +Edit the code in the chunk provided to instead find and plot countries with a slope above 0.6 years/year. + +```{r} +get_slope <- function(mod) { + tidy(mod) %>% filter(term == "year") %>% pull(estimate) +} + +# Add new column with r-sqaured +gapminder_nested <- gapminder_nested %>% + mutate(slope = map_dbl(model, get_slope)) + +# filter out low r-squared countries +big_slope <- gapminder_nested %>% + filter(slope > 0.6) + +# unnest and plot result +unnest(big_slope, data) %>% + ggplot(aes(x = year, y = lifeExp)) + + geom_line(aes(color = country)) +``` + +## Your Turn 4 + +**Challenge:** + +1. Create your own copy of `gapminder_nested` and then add one more list column: `output` which contains the output of `augment()` for each model. + +2. Plot the residuals against time for the countries with small r-squared. + +```{r} +charlotte_gapminder <- gapminder_nested + +charlotte_gapminder %>% + mutate(output = model %>% map(augment)) %>% + unnest(output) %>% + filter(r.squared < 0.5) %>% + ggplot() + + geom_line(aes(year, .resid, color = country)) + +``` + +# Take away +