# Lecture 4.1: More about Data Transformation

<div style="border: 1px double black; padding: 10px; margin: 10px">

**Goals for today's lecture:**
* A review of R basics
* Learn more about `group_by`
* Summary of what we have learnt and common mistakes
    
This lecture note corresponds to Chapter 5.6--5.7 of your book.
</div>


In [1]:
library(tidyverse)
library(nycflights13)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Useful Functions in R for Data Transformation

R provides several built-in vectorized functions that can be combined to build more complicated functions. These include:

* **Arithmetic operators** `+, -, *, /, ^`
* **Modular arithmetic operators** `%/%` and `%%` 
* **Logarithms** `log()`, `log10()`, `log2()`
* **Offsets** `lag()` and `lead()`

To do regular division, we use `/`. To do integer division, we use `%/%`, which keeps only the whole-number part of the quotient (rounding down). The companion operator `%%` returns the remainder.
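As a small illustration of the two operators, the sketch below splits clock times stored as numbers like `517` (meaning 5:17) into hours and minutes. The vector `dep_time` holds made-up values chosen for this example.

```r
# Hypothetical clock times in HHMM format (517 means 5:17 am)
dep_time <- c(517, 533, 1459)

hour   <- dep_time %/% 100  # integer division keeps the whole-number part
minute <- dep_time %% 100   # modulo keeps the remainder

hour    # 5 5 14
minute  # 17 33 59
```

Combining the two results as `hour * 100 + minute` recovers the original values.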

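You may also find the functions `lag()` and `lead()` useful. With `dplyr` loaded, `lag()` shifts the values of a vector by a given number of positions (one by default) so that each entry holds the previous value, while `lead()` does the opposite and holds the next value. Positions with no previous or next value are filled with `NA`. Note that `dplyr::lag()` masks `stats::lag()`, which instead shifts the time base of a time series.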
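A minimal sketch of the offset functions, assuming `dplyr` is loaded so that `lag()` refers to `dplyr::lag()`. The vector `x` is a made-up example.

```r
library(dplyr)

# Hypothetical running totals
x <- c(1, 3, 6, 10)

lag(x)      # NA  1  3  6   (previous value)
lead(x)     #  3  6 10 NA   (next value)
x - lag(x)  # NA  2  3  4   (change since the previous value)
```

A typical use is computing differences between consecutive observations, as in the last line.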


We also have:

* **Logical comparisons** `==, !=, <, <=, >, >=`
* **Cumulative aggregates** `cumsum(), cumprod(), cummin(), cummax()` (`dplyr` also provides `cummean()`)
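The short sketch here shows what these functions return on small made-up vectors. `cummean()` comes from `dplyr`; the others are part of base R.

```r
library(dplyr)

x <- c(1, 2, 3, 4)  # hypothetical values

x >= 3                 # FALSE FALSE  TRUE  TRUE
cumsum(x)              # 1 3 6 10
cumprod(x)             # 1 2 6 24
cummax(c(1, 3, 2, 5))  # 1 3 3 5
cummean(x)             # 1.0 1.5 2.0 2.5
```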

## Ranking functions
Sometimes, we want to *rank* our data by assigning integers for 1st place, 2nd place, etc. The functions `dense_rank()`, `min_rank()`, and `row_number()` can be used for this purpose.

Note the differences in behavior when values are tied: 
- The rankings from `dense_rank()` never have gaps.
- The rankings from `min_rank()` give tied entries the same rank and then skip ahead: if two entries are tied for 2nd, there is no 3rd place and the next entry is ranked 4th.
- The rankings from `row_number()` give every entry a distinct rank, breaking ties by order of appearance: if two 4.0 GPAs compete for 5th place, the first one gets ranked 5th and the second gets ranked 6th.
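The three behaviors can be compared on a small GPA vector. The values in this sketch are hypothetical, chosen so that two entries tie for 2nd place and two 4.0 GPAs tie at the top.

```r
library(dplyr)

# Hypothetical GPAs with two ties
gpa <- c(3.0, 3.5, 3.5, 3.8, 4.0, 4.0)

dense_rank(gpa)  # 1 2 2 3 4 4   (no gaps)
min_rank(gpa)    # 1 2 2 4 5 5   (3rd place is skipped)
row_number(gpa)  # 1 2 3 4 5 6   (ties broken by position)
```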

By default, the ranking functions rank the lowest value first. To reverse that and assign rank 1 to the highest GPA, wrap the variable in `desc()`.
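A minimal sketch of reversing the ranking with `desc()`, again using a hypothetical GPA vector:

```r
library(dplyr)

# Hypothetical GPAs with two ties
gpa <- c(3.0, 3.5, 3.5, 3.8, 4.0, 4.0)

min_rank(desc(gpa))    # 6 4 4 3 1 1   (the 4.0 GPAs share rank 1)
dense_rank(desc(gpa))  # 4 3 3 2 1 1
```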

## More about Summary Functions

`summarize()` can be used to summarize entire data frames by collapsing them into single-number summaries.


Many summarization functions are available:

* Center: `mean(), median()`
* Spread: `sd(), IQR(), mad()`
* Range: `min(), max(), quantile()`
* Position: `first(), last(), nth()`
* Count: `n(), n_distinct()`
* Logical: `any(), all()`
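As a sketch of how several of these functions can be combined in one `summarize()` call, the example here collapses the whole `flights` table into a single row. The column names on the left-hand side (`mean_delay`, `n_flights`, and so on) are labels chosen for this example.

```r
library(dplyr)
library(nycflights13)

overall <- flights %>%
  summarize(
    mean_delay   = mean(dep_delay, na.rm = TRUE),    # center
    median_delay = median(dep_delay, na.rm = TRUE),
    sd_delay     = sd(dep_delay, na.rm = TRUE),      # spread
    max_delay    = max(dep_delay, na.rm = TRUE),     # range
    first_dest   = first(dest),                      # position
    n_flights    = n(),                              # count
    n_dests      = n_distinct(dest),
    any_missing  = any(is.na(dep_delay))             # logical
  )

overall
```

The `na.rm = TRUE` argument matters here: without it, a single missing delay would turn the mean, median, standard deviation, and maximum into `NA`.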

Now, let us use some of the summary functions to create a new table with the variables `dest`, total number of flights, mean distance, and standard deviation of the distance, sorted by mean distance in descending order. Try to guess which airport has the largest mean distance before running the code!
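One possible way to build this table is sketched here. The output column names `total_flights`, `mean_dist`, and `sd_dist` are names chosen for this example.

```r
library(dplyr)
library(nycflights13)

by_dest <- flights %>%
  group_by(dest) %>%
  summarize(
    total_flights = n(),             # number of flights to each destination
    mean_dist     = mean(distance),  # mean distance flown
    sd_dist       = sd(distance)     # NA when a destination has a single flight
  ) %>%
  arrange(desc(mean_dist))           # largest mean distance first

by_dest
```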

## Using the Pipe with ggplot

You can even plot the data by adding a `ggplot` command at the end after manipulating your data.

Let's try to create a table with the mean delay time for each month. Then plot a bar chart of the mean delay by month.
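A minimal sketch of this task follows. It assumes that "delay" means departure delay (`dep_delay`); swap in `arr_delay` if arrival delay is intended. The names `monthly`, `mean_delay`, and `p` are chosen for this example.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# One row per month with the mean departure delay (assumption: dep_delay)
monthly <- flights %>%
  group_by(month) %>%
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE))

# Pipe the summary table straight into ggplot
p <- monthly %>%
  ggplot(aes(x = factor(month), y = mean_delay)) +
  geom_col() +
  labs(x = "Month", y = "Mean departure delay (minutes)")

p
```

`geom_col()` is used rather than `geom_bar()` because the bar heights are already computed in `mean_delay`.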

How about a bar chart of mean arrival delay by destination airport for the top 10 airports that have the highest traffic volume?  We will use `group_by`, `summarize`, `arrange`, `slice`, and `ggplot`.
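The pipeline described above could look like the following sketch. The names `top10`, `n_flights`, and `mean_arr_delay` are chosen for this example.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

top10 <- flights %>%
  group_by(dest) %>%
  summarize(
    n_flights      = n(),                           # traffic volume
    mean_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  arrange(desc(n_flights)) %>%                      # busiest destinations first
  slice(1:10)                                       # keep the top 10 rows

p <- top10 %>%
  ggplot(aes(x = dest, y = mean_arr_delay)) +
  geom_col() +
  labs(x = "Destination", y = "Mean arrival delay (minutes)")

p
```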

Now, let us make a scatter plot of average distance vs. average arrival delay after grouping by destination airport. We will also superimpose a smoothed curve on the scatter plot.
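A sketch of this plot, assuming average distance goes on the x-axis and average arrival delay on the y-axis (the text does not say which way round). `geom_smooth()` is used with its default smoothing method.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

p <- flights %>%
  group_by(dest) %>%
  summarize(
    avg_dist      = mean(distance),
    avg_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = avg_dist, y = avg_arr_delay)) +
  geom_point() +    # one point per destination
  geom_smooth() +   # smoothed trend drawn on top of the points
  labs(x = "Average distance (miles)", y = "Average arrival delay (minutes)")

p
```

Destinations whose arrival delays are all missing have no average and are dropped from the plot with a warning.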

## Group Mutate/Filter

In [46]:
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)

In [15]:
print(flights_sml)

# A tibble: 336,776 x 7
    year month   day dep_delay arr_delay distance air_time
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
 1  2013     1     1         2        11     1400      227
 2  2013     1     1         4        20     1416      227
 3  2013     1     1         2        33     1089      160
 4  2013     1     1        -1       -18     1576      183
 5  2013     1     1        -6       -25      762      116
 6  2013     1     1        -4        12      719      150
 7  2013     1     1        -5        19  

## Find the flight with the worst delay for each day
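One way to approach this is a grouped `filter()`: group by day, rank the flights within each day, and keep the top-ranked one. The sketch assumes "worst delay" means the largest arrival delay (`arr_delay`); the name `worst_per_day` is chosen for this example.

```r
library(dplyr)
library(nycflights13)

flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)

worst_per_day <- flights_sml %>%
  group_by(year, month, day) %>%
  # Rank within each day, largest arrival delay first.
  # Flights with a missing arr_delay get rank NA and are dropped by filter().
  filter(min_rank(desc(arr_delay)) == 1) %>%
  ungroup()

worst_per_day
```

Because `min_rank()` gives tied flights the same rank, a day with a tie for the worst delay contributes more than one row.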

# Summary of Chapter 5
Before we move on to the next part of the book, I want to spend some time summarizing and tying together the main ideas from the past few lectures. In chapter 5 we learned about five types of operations for altering data tibbles:
* `filter()`: keep only the rows of a data table that satisfy certain logical conditions, dropping the rest.
* `select()`: keep *columns* in a data table by name, range, or logical conditions.
* `arrange()`: sort / reorder the rows of a data table.
* `mutate()`: generate new columns in a data table by applying functions to the existing ones.
* `group_by()` / `summarize()`: group rows together based on one or more variables, and compute summary statistics within each group.

#### `filter()` vs `select()`
Some students mix up `filter()` and `select()`:

* `filter()` picks *rows* based on a logical criterion.
* `select()` picks *columns* of your data set.
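The difference is easy to see on a tiny table. The tibble `df` and its columns are made up for this sketch.

```r
library(dplyr)

# Hypothetical data: 4 rows, 3 columns
df <- tibble(
  id    = 1:4,
  score = c(90, 55, 72, 38),
  group = c("a", "b", "a", "b")
)

rows_kept <- filter(df, score > 60)   # fewer rows (2), same 3 columns
cols_kept <- select(df, id, score)    # same 4 rows, fewer columns (2)

dim(rows_kept)  # 2 3
dim(cols_kept)  # 4 2
```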

#### Common error: backticks `` ` ``, single quotes `' '`, and double quotes `" "`
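The notes give no example under this heading, so the following is only a sketch of the usual distinction: backticks surround a *name* (needed when a column name contains spaces or other special characters), while single and double quotes both create a *string*. The tibble `df` and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical data with a column name that contains a space
df <- tibble(`exam score` = c(90, 55), name = c("Ann", "Bob"))

select(df, `exam score`)               # backticks: refers to the column
passed <- filter(df, `exam score` > 60)  # compares the column's values to 60

filter(df, name == "Ann")              # quotes: a string value
filter(df, name == 'Ann')              # single and double quotes are interchangeable
```

Writing `"exam score" > 60` instead would compare the literal text `"exam score"` to 60 rather than the values stored in the column.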

#### `=` versus `==`
Remember that `=` and `==` mean different things. The former is used for assignment and to pass keyword parameters to functions. The latter is used to test for equality and returns either `TRUE` or `FALSE`.
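A short sketch of both uses, with a made-up tibble `df`:

```r
library(dplyr)

x = 5    # assignment (same effect as x <- 5 here)
x == 5   # comparison: TRUE
x == 6   # comparison: FALSE

# Hypothetical data
df <- tibble(month = c(1, 1, 2), delay = c(10, -3, 25))

jan <- filter(df, month == 1)  # `==` tests each row; keeps the two rows with month 1
avg <- mean(x = df$delay)      # `=` passes a named argument to mean()

# filter(df, month = 1) raises an error: `=` names an argument instead of comparing
```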

#### Vector versus column versus data table
There is particular confusion about when it is appropriate to use vectors, columns and data tables. We will be discussing these concepts at greater length in the coming weeks, but here are some essentials that you should know:

**Vectors** in R contain multiple values. You create vectors using the `c()` function. If you neglect to do this, R will produce an error and/or do the wrong thing. Some examples of this that I saw include:
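```{r}
b <- c(1, 3, 5, 2, 4)  # hypothetical data so that the correct line runs
# a = factor(b, levels=1, 2, 3, 4, 5)   ## wrong: only 1 is a level; 2, 3, 4, 5 are matched to other arguments
# a = factor(b, levels=(1, 2, 3, 4, 5)) ## wrong: syntax error
a = factor(b, levels=c(1, 2, 3, 4, 5))  ## correct
```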

Vectors have a particular type, and all the entries of a vector must be of that same type; if they are not, R will coerce them to a common type.
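The coercion is silent, which is why it can be surprising. A few small cases:

```r
class(c(1, 2, 3))   # "numeric"
class(c(1, "a"))    # "character": the number 1 becomes the string "1"

c(1, "a", TRUE)     # "1" "a" "TRUE"
c(1, TRUE, FALSE)   # 1 1 0   (logicals become numbers)
```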

You can think of a data table as a list of vectors, one for each column. To access the vector of values stored in a column in R, we traditionally use the `$` operator, as in `flights$arr_delay`.
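A minimal sketch of the `$` operator on a made-up tibble `df`:

```r
library(tibble)

# Hypothetical data
df <- tibble(id = 1:3, score = c(90, 55, 72))

df$score        # 90 55 72, a plain numeric vector
max(df$score)   # 90: ordinary vector functions work on the column
```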

If working inside one of the `dplyr` functions like `mutate()`, `filter()`, etc., the dataset is specified by the first parameter. So you don't need to use the `$` operator, just specify the column name:
```{r}
filter(flights, flights$arr_delay < 10)  # wrong (although it will work)
filter(flights, arr_delay < 10)  # correct
```

Even though they contain the same information, a column vector is *not the same* as a table containing only that column: `flights$arr_delay` is a vector, whereas `select(flights, arr_delay)` is a tibble with one column.
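The distinction can be checked directly. This sketch uses a made-up tibble `df`; the names `col_vec` and `col_tbl` are chosen for the example.

```r
library(dplyr)

# Hypothetical data
df <- tibble(id = 1:3, score = c(90, 55, 72))

col_vec <- df$score           # a numeric vector
col_tbl <- select(df, score)  # a tibble with one column

is.data.frame(col_vec)  # FALSE
is.data.frame(col_tbl)  # TRUE

max(col_vec)            # 90
pull(df, score)         # same vector as df$score, convenient inside a pipe

# mean(col_tbl) returns NA with a warning, because a tibble is not a numeric vector
```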