# Data manipulation with `dplyr` and visualization with `ggplot2`

In [0]:
# Some initial setup
options(digits = 3, repr.matrix.max.rows = 6)

# load tidyverse, which includes dplyr and ggplot
library("tidyverse")

The [tidyverse](https://www.tidyverse.org) is a collection of actively developed `R` packages that share a common design philosophy, grammar, and data structures.
Here, we will primarily focus on `dplyr`, which deals with data manipulation, and `ggplot2`, which deals with data visualization.

First, we will start by reading some data. 
<!-- There are many functions in `tidyverse` (and `R` in general) for reading various data formats into the `R` environment as a table of data (or, `data.table`).
For example, either `read_csv` (from `tidyverse`) or `read.csv` (base `R`) can be used to read a `csv` file (e.g., a file where each line represents a row in the table, and the columns are separated by a `,`). -->

For this tutorial, we will be using the [titanic](https://www.kaggle.com/c/titanic/data) passenger data from Kaggle.
The (partial) schema of the data is described in the following table.

**Variable**|**Definition**|**Key**
:-----:|:-----:|:-----:
Survived|Survival|0 = No, 1 = Yes
Pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd
Sex|Sex| 
Age|Age in years| 
SibSp|# of siblings / spouses aboard the Titanic| 
Parch|# of parents / children aboard the Titanic| 
Ticket|Ticket number| 
Fare|Passenger fare| 
Cabin|Cabin number| 
Embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton

To get started, let's first load and take a look at the data:

In [0]:
# load the csv data
# column types can be inferred automatically, but can also be specified with col_types argument
# run ?read_csv to learn more about reading in data!
passengers <- read_csv(
    "titanic.csv",
    col_types = cols(.default = "d", Name='c', Sex='f', Ticket='c', Embarked='c', Cabin='f')
    )

# cleaning some data
passengers <- passengers %>% drop_na(Embarked, Age)

# Summary of each column
summary(passengers)

## Introduction to `dplyr` verbs

In the world of `dplyr`, a _verb_ is a function that

* takes a data frame as its first argument, and
* returns another data frame as a result

Any function that meets these criteria, even if it's not a function in the `dplyr` package, can be considered a _verb_.

For example, the `head()` function in `R`, if applied to a data frame, returns the first `n` rows of that data frame.

In [0]:
# Return a data frame that consists of the first three rows of passengers
head(passengers, n = 3)

In this sense, the base `R` function `head()` is a verb.
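
To make the definition concrete, here is a minimal sketch of a user-defined verb. The helper `first_and_last()` and the `toy` data frame are hypothetical names made up for this illustration; they are not part of `dplyr` or of the Titanic data.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): keep only the first and last rows.
# It takes a data frame as its first argument and returns a data frame,
# so it qualifies as a verb. Assumes `df` has at least one row.
first_and_last <- function(df) {
  df[unique(c(1, nrow(df))), , drop = FALSE]
}

# Made-up data, for illustration only
toy <- tibble(id = 1:5, value = c(10, 20, 30, 40, 50))

result <- first_and_last(toy)
result
```

Because `first_and_last()` follows the verb convention, it can be combined with the `dplyr` verbs introduced below just like any built-in verb.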

The **core idea** of `dplyr` is that a vast majority of data manipulation needs can be satisfied through a combination of five verbs.

verb                 | action
-------------------- | ---------
`filter(df, ...)`    | select a subset of _rows_ by some specified condition
`select(df, ...)`    | select a subset of _columns_
`mutate(df, ...)`    | create a _new column_ (often as a function of existing columns)
`arrange(df, ...)`   | reorder (sort) _rows_ according to values of specific _columns_
`summarize(df, ...)` | aggregate and reduce a vector (column) to a single value

We will explore each of these verbs (and some additional variations within each category) below.

### Selecting rows (1/5)

The `filter(df, ...)` verb is used to select a subset of _rows_ that satisfy the conditions specified in `...`. 
The conditions must be written in a form that would evaluate to either `TRUE` or `FALSE`.

For example, if we want a data frame of passengers in the 3rd class:

In [0]:
filter(passengers, Pclass == 3)

Multiple conditions can also be specified.

For example, if we want a data frame of all passengers in the 3rd class that are older than 50 years old:

In [0]:
filter(passengers, Pclass == 3, Age > 50)

By default, `filter()` will combine multiple conditions as `AND` operations.
In the example above, we are returned passengers where `Pclass == 3` _**AND**_ `Age > 50`.

We can explicitly specify an AND condition using the `&` operator.

We can specify an OR condition by using the `|` operator.

For example, if we want passengers in the 3rd class who are either younger than 18 years old or older than 50 years old:

In [0]:
filter(passengers, Pclass == 3, Age > 50 | Age < 18)
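
As a small sketch of how `,`, `&`, and `|` interact, the example below uses a made-up `toy` data frame (not the Titanic data). It assumes only base `R` operator precedence, in which `&` binds more tightly than `|`, so parentheses matter when the two are mixed in a single condition.

```r
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(
  Pclass = c(1, 3, 3, 3, 1),
  Age    = c(60, 55, 30, 10, 5)
)

# Separate arguments are combined with AND ...
with_comma <- filter(toy, Pclass == 3, Age > 50 | Age < 18)

# ... which is the same as a single condition using `&` and parentheses.
with_amp <- filter(toy, Pclass == 3 & (Age > 50 | Age < 18))

# Without parentheses, `&` is evaluated before `|`, so this keeps
# 3rd class passengers over 50 OR passengers of any class under 18.
no_parens <- filter(toy, Pclass == 3 & Age > 50 | Age < 18)

identical(with_comma, with_amp)
nrow(with_comma)
nrow(no_parens)
```

Writing each condition as its own argument, as in the cell above, avoids the precedence question altogether.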

Use the `%in%` operator to filter to values that match a collection of values. 

For example, suppose we want to look at passengers who embarked at Queenstown (`Q`) or Southampton (`S`):

In [0]:
filter(passengers, 
       Embarked %in% c("Q", "S"))

Finally, use `!` to negate any condition. 

For example, if we wanted to find all passengers who did NOT embark at either Queenstown (`Q`) or Southampton (`S`):

In [0]:
filter(passengers, 
       ! (Embarked %in% c("Q", "S")))

### Exercise: `filter()`
Find all passengers with 2 siblings/spouses and 2 parents/children aboard.

In [0]:
# YOUR CODE HERE


### Selecting columns (2/5)

Use `select(df, ...)` to either specify which columns to select,

In [0]:
select(passengers, Survived, Pclass, Age)

or to specify which columns to exclude, using `-`.

In [0]:
select(passengers, -Name, -SibSp, -Parch)

# Equivalently:
# select(passengers, -c(Name, SibSp, Parch))

`tidyverse` also provides some useful helper functions to `select()` columns that match specific criteria.

* `starts_with(x)`: match column names that start with `x`
* `ends_with(x)`: match column names that end with `x`
* `contains(x)`: match column names that contain `x`
* `matches(x)`: match column names that match (the regular expression) `x`

where `x` is a string (in either single- or double-quotes).

For example, if we want all the columns that start with letter `S`:

In [0]:
select(passengers, starts_with('S'))
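
The other helpers work the same way. The sketch below uses a made-up one-row `toy` data frame whose column names mimic a few of the Titanic columns. Note that these helpers ignore case by default (`ignore.case = TRUE`), so `starts_with('s')` would match the same columns as `starts_with('S')`.

```r
library(dplyr)

# Made-up one-row data frame; only the column names matter here
toy <- tibble(
  Survived = 1, Pclass = 3, Sex = "male", Age = 22,
  SibSp = 1, Parch = 0, Fare = 7.25
)

# Column names that end with "e"
ends_e <- names(select(toy, ends_with("e")))

# Column names that contain "ar"
has_ar <- names(select(toy, contains("ar")))

# Column names that match a regular expression: start with "S", end with "p"
regex <- names(select(toy, matches("^S.*p$")))

ends_e
has_ar
regex
```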

You can see the documentation for `select()` for details.
In general, for any `R` function, you can pull up the documentation (if it exists) by running `?` followed by the function name.
For example, to see the documentation for `select()` as provided in the `dplyr` package, uncomment and run the following (warning! the output is long):

In [0]:
# ?dplyr::select

### Create new columns (3/5)

Use `mutate(df, ...)` to create new columns, usually as a function of existing columns.

Suppose we wish to create a column called `no_sib`, indicating whether the passenger has no sibling or spouse aboard.
We would write,

In [0]:
mutate(passengers, no_sib = (SibSp == 0))

Note that within a single `mutate()` call, you can refer to new columns created by earlier arguments. For example:

In [0]:
mutate(passengers, 
       no_sib = (SibSp == 0),
       first_class = (Pclass == 1),
       no_sib_and_first_class = no_sib & first_class
      )

### Exercise: `mutate()`
Create a column `num_family` representing the number of family members on board for each passenger (the sum of the sibling/spouse count and the parent/child count).

In [0]:
# YOUR CODE HERE


### Sorting (4/5)

Use `arrange(df, ...)` to reorder the rows of a data frame by the value of specified columns.
When multiple columns are given, rows are sorted by the first column, and ties are broken by each subsequent column, from left to right.

Note that `arrange` orders from lowest to highest by default.

In [0]:
arrange(passengers, Pclass, Name)

Use `desc()` around columns that you want to sort in `desc`ending order.

In [0]:
arrange(passengers, Pclass, desc(Name))
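
To see the left-to-right tie-breaking on a small scale, here is a sketch with a made-up four-row `toy` data frame (the names are arbitrary examples, not taken from the actual data file).

```r
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(
  Pclass = c(3, 1, 3, 1),
  Name   = c("Braund", "Cumings", "Allen", "Futrelle")
)

# Sort by Pclass first; within each Pclass, sort by Name (ascending)
ascending <- arrange(toy, Pclass, Name)

# Sort by Pclass first; within each Pclass, sort by Name (descending)
descending <- arrange(toy, Pclass, desc(Name))

ascending
descending
```

In both results the `Pclass` order is the same; only the order of names within each class changes.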

### Aggregating (5/5)

Use `summarize(df, ...)` to aggregate multiple rows into a single row. Unlike in `mutate()`, functions used in `summarize()` must return a single value (i.e., "aggregate" the provided vector).

For example, to find the min, mean, and max age of all passengers,

In [0]:
summarize(passengers,
          min_age = min(Age),
          avg_age = mean(Age),
          max_age = max(Age)
         )

`dplyr` also provides a special function, `n()`, which, when used inside a `dplyr` verb such as `summarize()`, evaluates to the number of rows in the data being summarized.

For example, to count how many passengers (rows) there are in our dataset in total, 

In [0]:
summarize(passengers, N = n())

### Exercise: `summarize()`
Calculate the mean, median and standard deviation of `Fare`.
(hint: use the `median()` and `sd()` functions for the median and standard deviation, respectively)

In [0]:
# YOUR CODE HERE


### Grouping (Split-apply-combine)

Now that we've covered the five core verbs, we should be able to manipulate data to our heart's desire!

Then, how about:

* The number of passengers for each `Pclass`?
* Survival rates in each `Pclass`?
* Number of passengers in each `Pclass` for each port of embarkation?

As an example, let's just consider the number of passengers for each `Pclass`.

A natural, but _**tedious**_ way to compute this would look something like this:

In [0]:
#########
# NOTE: Code in this cell is intended to be an example of a BAD way to compute this.
# This is purely for illustrative purposes, and should NEVER EVER be re-used, in any context.
#########

# 1. Split into separate datasets by Pclass
class1_passengers <- filter(passengers, Pclass == 1)
class2_passengers <- filter(passengers, Pclass == 2)
class3_passengers <- filter(passengers, Pclass == 3)

# 2. Count the number of rows in each dataset.
N_1 <- summarize(class1_passengers, N = n())
N_2 <- summarize(class2_passengers, N = n())
N_3 <- summarize(class3_passengers, N = n())

# 3. Aggregate the counts into a single vector.
c(class1_total = N_1$N, 
  class2_total = N_2$N,
  class3_total = N_3$N)

# REMINDER: This is a TERRIBLE way to do this and should not be repeated!

This style of code can easily get out of hand, and would be a nightmare to maintain! 
(e.g., what happens if we get a new dataset where we want to count the number of rows for 50 different categories rather than just 3?)

As horrible as the above code is, it is useful in highlighting a common pattern that emerges when manipulating data:

1. **Split**: The data are split into smaller pieces of data, according to one (or more) column. 
   In this case, we've split the data by the `Pclass` column.
1. **Apply**: Some operation is applied to each of the smaller pieces.
   In this case, we've simply counted the number of rows of each piece using `summarize()` and `n()`.
1. **Combine**: The results of the previous **apply** are combined to some final data structure.
   In the above case, for simplicity, we've combined the result as a vector; but in practice we usually want to keep everything in the form of a data frame.

This pattern in data manipulation is so common, that there is a `dplyr` verb for it. 
This is the `group_by` verb.

On its own, `group_by` makes no visible changes to a data frame, other than marking the data frame as being "grouped".
The difference is only made apparent when we apply some other verb to a grouped data frame.

Note that none of the `dplyr` verbs make any changes to the original data frame! This is very intentional. 
So, for now, we need to save the "grouped" data as a new variable for our changes to have effect (but we'll see a more convenient approach to this later).

In [0]:
passengers_by_class <- group_by(passengers, Pclass)

# Note that the two data frames, on the surface, seem identical.
passengers
passengers_by_class
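
If you want to confirm the grouping without applying another verb, `dplyr` provides a few inspection functions. The sketch below uses a made-up `toy` data frame in place of `passengers`; `is_grouped_df()`, `group_vars()`, `n_groups()`, and `ungroup()` are all `dplyr` functions.

```r
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(
  Pclass   = c(1, 1, 2, 3, 3, 3),
  Survived = c(1, 0, 1, 0, 0, 1)
)

toy_by_class <- group_by(toy, Pclass)

# Is the data frame marked as grouped?
is_grouped_df(toy)
is_grouped_df(toy_by_class)

# Which column(s) is it grouped by, and how many groups are there?
group_vars(toy_by_class)
n_groups(toy_by_class)

# ungroup() removes the grouping again
toy_ungrouped <- ungroup(toy_by_class)
is_grouped_df(toy_ungrouped)
```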

In [0]:
# But we can see a difference when applying, for example, a summarize
summarize(passengers, N = n())

In [0]:
summarize(passengers_by_class, N = n())

Note: For now, don't be too concerned about the warning `summarise() ungrouping output (override with .groups argument)`. To silence the warning, include `.groups="drop"` as an argument to `summarize`. Or, simply ignore the warning for now.

As shown in the simple example above, when a `dplyr` verb is applied to a "grouped" data frame,
`dplyr` internally **splits**-**applies**-and **combines** the data, finally returning results for
_each of the unique values that are found in the columns by which the data frame is grouped_.

This can be a lot to process, if it's the first time you've seen this. 
But once you get used to it (via trying a bunch of manipulation tasks and seeing some more examples), you'll find it extremely convenient and powerful.

Let's try answering the other questions we started this section with.

* Survival rates in each `Pclass`?

In [0]:
passengers_by_class <- group_by(passengers, Pclass)

summarize(passengers_by_class, 
          survival_rate = mean(Survived),
          
          .groups = "drop")

* Number of passengers in each `Pclass` for each port of embarkation?

In [0]:
passengers_by_port <- group_by(passengers, Embarked, Pclass)

summarize(passengers_by_port, 
          N = n(),
          
          .groups = "drop")

### Exercise: `group_by()`

Calculate the mean fare and the survival rate for passengers in each `Pclass`.

In [0]:
# YOUR CODE HERE


### Multiple (chained) operations

As we've briefly seen above, we would often like to apply multiple operations (verbs) to a data frame.
However, by design, verbs do not save intermediate changes to the original data frame, so for each operation we would have to assign the result to a new data frame.

Even for a reasonable number of operations, this can get quite messy (i.e., we'd end up with so many names and data frames that we only use as intermediate steps).

Consider the following query:

* For each port, what is the proportion of each class?

We can think of finding the answer in multiple steps:

1. group by `Embarked` and `Pclass`
1. find the number of passengers for each of the groups in the previously grouped data frame
1. with the computed number of passengers for each class-port pair, re-group by only `Embarked`
1. create a new column containing each row's count divided by the total count for its port

Using the current method of saving all intermediate results, the implementation would look something like this:

In [0]:
###########
# NOTE: Code in this cell is intended to be an example of a LESS-THAN-IDEAL implementation.
#       It runs and gives the correct result, but it is shown for illustrative purposes only
#       and is best avoided.
###########

passengers_by_port_and_class <- group_by(passengers, Embarked, Pclass)
counts_by_port_and_class <- summarize(passengers_by_port_and_class, N = n())
regroup_by_port <- group_by(counts_by_port_and_class, Embarked)
mutate(regroup_by_port, prop = N / sum(N))

# REMINDER: This is NOT ideal code. Use the %>% instead (see below).

The above code is bad for multiple reasons. Among others, it's

* creating a lot of unnecessary intermediate results that will not be used again
* difficult to read, if you don't already know what the end goal is (e.g., your eyes have to wander left-to-right-to-left a few times to see what's going on)

A sophisticated, yet quite simple, solution to this problem is the introduction of `%>%`, also called the "pipe operator".

`%>%` is a _binary operator_ (much like `+` or `-`) which, in words, takes the result of the left-hand side and uses it as the first argument of the function call on the right-hand side.
This may be confusing at first, but might make more sense in the context of `dplyr` _verbs_. Recall, a _verb_ in `dplyr` is any function that _returns a data frame_ (LHS) and _takes a data frame as its first argument_ (RHS).

If further notation is helpful, one could also write the `%>%` as
```
x %>% f(y) = f(x, y)
```
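
As a quick check of this equivalence, the sketch below compares a piped call with its nested counterpart on a made-up `toy` data frame (the names and ages are arbitrary examples).

```r
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(
  Name = c("Allen", "Braund", "Cumings"),
  Age  = c(35, 22, 8)
)

# Piped version: read top to bottom, left to right
piped <- toy %>%
    filter(Age > 18) %>%
    select(Name)

# Nested version: read inside out
nested <- select(filter(toy, Age > 18), Name)

identical(piped, nested)
```

Both versions perform the same two steps; the piped version simply lists them in the order they happen.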

What this means from a practical standpoint, however, is that we no longer need to _save_ intermediate results just to use them in the next verb. 
Instead, we can use `%>%` to send results from a verb down a "pipe" to the next verb.
Consider our previous example, which involved four verbs, with three intermediate steps.
Using `%>%`, the same result can now be achieved in a (conceptually) single line:

In [0]:
passengers %>% 
    group_by(Embarked, Pclass) %>%
    summarize(N = n(),
              .groups = "drop") %>%
    group_by(Embarked) %>%
    mutate(prop = N / sum(N))

Note the intentional style of (1) starting from the data frame (instead of a verb that explicitly includes the data frame) and (2) keeping each verb on its own line. 
This not only makes it easier to read, but also easier to maintain and modify.

### Exercise: Putting it all together
1. Create a column `age_group` that rounds each passenger's age down to the nearest multiple of 10 (i.e., ages 0-9 become 0, 10-19 become 10, 20-29 become 20, etc.)
1. Calculate the number of passengers and survival rate for each (`Pclass`, `age_group`) pair

What patterns do you observe?

(hint1: function `floor` takes in x and returns the largest integers not greater than the corresponding elements of x, e.g., `floor(1.9)` will return `1`)

(hint2: you can run `options(repr.matrix.max.rows = 20)` to display more rows at a time)


In [0]:
options(repr.matrix.max.rows = 20)
# YOUR CODE HERE


## `dplyr` ending notes

There are many, MANY more verbs that we simply did not have the time to cover here, but are immensely useful. 
Some examples are:

* `rename(df, ...)`: rename columns
* `slice(df, ...)`: select rows of a data frame by index, instead of some condition
* `top_n(df, N, col)`: retrieve the top N rows for values in some specified column (superseded by `slice_max()` as of `dplyr` 1.0.0)
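
As a brief taste of these three verbs, here is a minimal sketch on a made-up `toy` data frame. The column name `ticket_price` is an arbitrary choice for this example.

```r
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(
  Name = c("A", "B", "C", "D"),
  Fare = c(7.25, 71.3, 8.05, 53.1)
)

# rename() uses the form new_name = old_name
renamed <- rename(toy, ticket_price = Fare)

# slice() selects rows by position
first_two <- slice(toy, 1:2)

# top_n() keeps the rows with the N largest values of a column
top_two <- top_n(toy, 2, Fare)

# slice_max() is the newer equivalent; it also sorts the result
top_two_sorted <- slice_max(toy, Fare, n = 2)

names(renamed)
first_two
top_two
top_two_sorted
```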

We highly recommend exploring further. One great resource for learning about `tidyverse` and using it to work with data is Hadley Wickham's online book: [R for Data Science](https://r4ds.had.co.nz/). 

Hadley Wickham is also the original author of many of the packages in `tidyverse`. In fact, in the "early days" (circa 2016), before the word `tidyverse` was coined, the collection of Hadley Wickham's `R` packages was unofficially referred to as the `hadleyverse`, until [Hadley announced tidyverse and explicitly asked people to stop calling it the hadleyverse](https://twitter.com/hadleywickham/status/774008060549312512?lang=en).

# `ggplot2` mini-tutorial

First off, here are four great ways to become a `ggplot2` pro:
- Chapter 3 of R4DS (**I cannot stress this enough!**): https://r4ds.had.co.nz/data-visualisation.html
- Look at the sample code for the plots that Sharad showed in lecture
- Visualization slides in `extra-materials` folder for this section
- Google "how to do [X] with ggplot" !

Now, consider our final result from the `dplyr` introduction:

In [0]:
survival_by_age_group <- 
    passengers  %>% 
    mutate(age_group = floor(Age / 10) * 10) %>%
    group_by(Pclass, age_group) %>%
    summarize(
        N = n(),
        survival_rate = mean(Survived),
        
        .groups = "drop"
    )

survival_by_age_group
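
To see what `floor(Age / 10) * 10` does on its own, here is a small worked example on a made-up vector of ages. Each age is mapped to the lower bound of its decade, so an `age_group` of 30 stands for ages 30 to 39.

```r
# Made-up ages, for illustration only
ages <- c(0.42, 9, 10, 19.5, 38, 80)

# Divide by 10, drop the fractional part, then scale back up
age_group <- floor(ages / 10) * 10
age_group
# → 0 0 10 10 30 80
```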

With some effort, we can glean a story from this table. But the table imposes a high cognitive load. 
    
Reducing cognitive load is a key goal of data visualization. Let's see if we can tell a richer story with a scatterplot.

The following formula is a helpful starting point for building a scatterplot with `ggplot2`: 

``` 
ggplot(<DATAFRAME>) +
    geom_point(aes(x = <X_VARIABLE_NAME>, y = <Y_VARIABLE_NAME>))
```

Note: Both `dplyr` and `ggplot2` are attached automatically when you call `library(tidyverse)`.

Here's the formula in action:

In [0]:
ggplot(survival_by_age_group) +
    geom_point(aes(x = age_group, y = survival_rate))

Oof, this is messy. So far, I think I'd rather read the table!

To make it easier to distinguish the groups, let's color the points by group.

In [0]:
ggplot(survival_by_age_group) +
    geom_point(aes(x = age_group, y = survival_rate, color = factor(Pclass)))

A little better? Notice how ``ggplot`` automatically gave us a legend. That sure is nice!

Since age groups are sequential and ordered, it would be helpful to connect our points. Connecting the points will make it easier to see a trend.


Note: ``ggplot2`` uses `+` to chain calls, not `%>%`

In [0]:
ggplot(survival_by_age_group) +
    geom_point(aes(x = age_group, y = survival_rate, color = factor(Pclass))) +
    geom_line(aes(x = age_group, y = survival_rate, color = factor(Pclass)))

Better! It looks like something funny is going on at the high end of the first class group.

After inspecting the table, it looks like the highest age group has few individuals. Let's change the size of our points to reflect this lack of data.

Notice how point size is only relevant to the points, not the lines, so we only refer to size inside `geom_point`.

Also, `geom_point` and `geom_line` currently share the same call to `aes`, so we can conveniently move that code into the main ``ggplot`` call, where it is inherited by both layers.

In [0]:
ggplot(survival_by_age_group, aes(x = age_group, y = survival_rate, color = factor(Pclass))) +
    geom_point(aes(size = N)) +
    geom_line()

This plot is looking pretty good. Below are some potential improvements; can you think of any others?
- Add better labels and units to the x and y axes
- Change the y-axis to percentages
- Rename the legends to something more descriptive
- Give the plot a title that briefly summarizes the takeaway.
- Remove the gridlines (an example of "chart junk")
- Remove the gray background

The code below contains each of these improvements.

In [0]:
theme_set(theme_bw())

ggplot(survival_by_age_group, aes(x = age_group, y = survival_rate, color = factor(Pclass))) +
    geom_point(aes(size = N)) +
    geom_line() +
    scale_x_continuous(
        name = "Age group (years)"
    ) +
    scale_y_continuous(
        name = "Survival rate (%)",
        labels = scales::percent_format(accuracy=1)
    ) +
    scale_size_continuous(
        name = "Number of passengers"
    ) +
    scale_color_discrete(
        name = "Passenger class"
    ) +
    ggtitle("Survival rates differ across age and class") +
    theme(
        panel.grid = element_blank()
    )
    

To save a plot, use ``ggsave``. 

``ggsave`` will automatically save the most recently created plot, unless you specify otherwise. See ``?ggsave`` for more details. 

In [0]:
ggsave("survival_rates.pdf", width=6, height=4)
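
If you would rather not rely on "the most recently created plot", you can store a plot in a variable and pass it to ``ggsave`` through its `plot` argument. The sketch below uses a made-up `toy` data frame and writes to a temporary directory; the file name `toy_survival.pdf` is an arbitrary choice for this example.

```r
library(ggplot2)

# Made-up data, for illustration only
toy <- data.frame(
  age_group     = c(0, 10, 20),
  survival_rate = c(0.6, 0.4, 0.35)
)

# Store the plot in a variable instead of just displaying it
p <- ggplot(toy, aes(x = age_group, y = survival_rate)) +
    geom_point() +
    geom_line()

# Save that specific plot; width and height are in inches by default
out_file <- file.path(tempdir(), "toy_survival.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```

Being explicit about which plot to save is helpful in longer notebooks, where several plots may have been created since the one you care about.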

Finally, I want to again stress three great ways to familiarize yourself with `ggplot2`: 
- Chapter 3 of R4DS (**I cannot stress this enough!**): https://r4ds.had.co.nz/data-visualisation.html
- Look at the sample code for the plots that Sharad showed in lecture
- Google "how to do [X] with ggplot" !