# Appendix C Data 


## C.4 Basic operations on data

`R` offers several built-in vectorized functions that can be combined to build more complicated functions. These include:

* **Arithmetic operators** `+, -, *, /, ^`
* **Modular arithmetic operators** `%/%` and `%%` 
* **Logarithms** `log()`, `log10()`, `log2()`
* **Offsets** `lag()` and `lead()`
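As a quick illustration of what "vectorized" means for the operators listed above, the sketch below applies arithmetic and logarithms to whole vectors at once. The vectors `x` and `y` are made up for this example.

```r
# Illustrative vectors (not from any data set in this appendix)
x <- c(1, 2, 4, 8)
y <- c(2, 2, 2, 2)

sums   <- x + y    # elementwise sum:     3 4 6 10
powers <- x ^ y    # elementwise power:   1 4 16 64
halves <- x / 2    # the scalar 2 is recycled: 0.5 1 2 4
logs2  <- log2(x)  # base-2 logarithm:    0 1 2 3

print(sums)
print(powers)
print(halves)
print(logs2)
```

Each operation returns a vector of the same length as its input, so no explicit loop is needed.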

We use `/` for regular division. For integer division we use `%/%`, which discards the fractional part of the quotient.

The modulo operator `%%` is also useful. It returns the remainder of a division, so `7 %/% 2` is `3` and `7 %% 2` is `1`.
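The following minimal sketch shows integer division and the remainder side by side. The value `517`, read as a clock time in `HHMM` form, is a hypothetical example chosen only to show a typical use.

```r
x <- c(7, 8, 9, -7)

quotient  <- x %/% 2   # integer division:  3 4 4 -4 (rounds toward -Inf)
remainder <- x %% 2    # remainder:         1 0 1  1

# The two pieces always reassemble the original value
rebuilt <- quotient * 2 + remainder

# Hypothetical clock time stored as the integer 517, meaning 5:17
hhmm   <- 517
hour   <- hhmm %/% 100  # 5
minute <- hhmm %% 100   # 17

print(quotient)
print(remainder)
print(c(hour, minute))
```

Note that for negative numbers `%/%` rounds down rather than toward zero, which is why the remainder stays non-negative when the divisor is positive.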

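There are also the handy offset functions `lag()` and `lead()` from `dplyr`. The function `lag()` shifts a vector back by a given number of observations, so that each element is paired with the value that preceded it. The function `lead()` shifts in the opposite direction, pairing each element with the value that follows it. Positions that have no value to draw on are filled with `NA`.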
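A short sketch of the offset functions, assuming the `dplyr` versions of `lag()` and `lead()` (base `R` also has a `stats::lag()` that behaves differently, so the calls are written with an explicit `dplyr::` prefix). The vector `x` is made up.

```r
library(dplyr)

x <- c(10, 20, 40, 70)

lagged  <- dplyr::lag(x)         # previous value:   NA 10 20 40
led     <- dplyr::lead(x)        # next value:       20 40 70 NA
change  <- x - dplyr::lag(x)     # step-to-step difference: NA 10 20 30
lagged2 <- dplyr::lag(x, n = 2)  # shift by two positions:  NA NA 10 20

print(lagged)
print(led)
print(change)
print(lagged2)
```

Subtracting the lagged vector from the original is a common way to compute running differences.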

We also have:

* **Logical comparisons** `==, !=, <, <=, >, >=`
* **Cumulative aggregates** `cumsum(), cumprod(), cummin(), cummax()` (`dplyr` also provides `cummean()`)
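The comparisons and cumulative aggregates above can be tried on a small made-up vector, as in this sketch. The running mean uses `dplyr::cummean()`, so `dplyr` is assumed to be installed.

```r
library(dplyr)

x <- c(3, 1, 4, 1, 5)

above_two <- x > 2      # TRUE FALSE TRUE FALSE TRUE
is_one    <- x == 1     # FALSE TRUE FALSE TRUE FALSE

run_sum  <- cumsum(x)   # 3 4 8 9 14
run_prod <- cumprod(x)  # 3 3 12 12 60
run_min  <- cummin(x)   # 3 1 1 1 1
run_max  <- cummax(x)   # 3 3 4 4 5
run_mean <- cummean(x)  # running mean, i.e. cumsum(x) / seq_along(x)

print(above_two)
print(run_sum)
print(run_mean)
```

A logical comparison returns one `TRUE`/`FALSE` per element, and each cumulative function returns the aggregate of all elements up to and including the current one.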

Sometimes, we want to *rank* our data by assigning integers for 1st place, 2nd place, etc. The functions `dense_rank()`, `min_rank()`, and `row_number()` can be used for this purpose:

In [1]:
gpas = c(3.9, 3.8, 2.7, 3.8, 4.0, 4.0)


Note the differences in behavior:

- The rankings from `dense_rank()` never have gaps.
- The rankings from `min_rank()` skip over 3rd place (because we have two entries tied for 2nd).
- The rankings from `row_number()` break ties by order of appearance, so the first 4.0 GPA gets ranked 5th and the second 4.0 GPA gets ranked 6th.
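The three behaviors described above can be reproduced with the following sketch, which applies each ranking function to the same `gpas` vector. It assumes the functions come from `dplyr`.

```r
library(dplyr)

gpas <- c(3.9, 3.8, 2.7, 3.8, 4.0, 4.0)

dense   <- dense_rank(gpas)  # no gaps:               3 2 1 2 4 4
minimum <- min_rank(gpas)    # ties share lowest rank: 4 2 1 2 5 5
rownum  <- row_number(gpas)  # ties split by position: 4 2 1 3 5 6

print(dense)
print(minimum)
print(rownum)
```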

By default, the ranking functions rank lowest first. If we want to reverse that, and assign rank 1 to the highest entry, we can use the `desc()` function:

In [2]:
(x <- sample(c(11, 12, 12, 14, 14, 14, 17, 21, 26, NA))) # returns a random permutation of the input
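To illustrate the reversal with `desc()` described above, here is a minimal sketch that ranks a fixed vector from highest to lowest. It also shows how the `dplyr` ranking functions treat a missing value. The vector `scores` is a made-up, non-random stand-in so that the output is reproducible.

```r
library(dplyr)

gpas <- c(3.9, 3.8, 2.7, 3.8, 4.0, 4.0)

# Rank 1 now goes to the highest GPA
highest_first <- min_rank(desc(gpas))  # 3 4 6 4 1 1

# Missing values are not ranked; they stay NA
scores     <- c(11, 12, 12, NA)
score_rank <- min_rank(scores)         # 1 2 2 NA

print(highest_first)
print(score_rank)
```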


`summarize()` can be used to summarize an entire data frame by collapsing it into a single row of summary statistics.


In [3]:
# departure delay
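A minimal sketch of such a summary is shown below. It does not use the course data: `toy_flights` and its `dep_delay` column are hypothetical stand-ins, built by hand so the example is self-contained.

```r
library(dplyr)

# Hypothetical toy data; dep_delay is in minutes, NA marks a missing record
toy_flights <- tibble(
  origin    = c("AAA", "AAA", "BBB", "BBB"),
  dep_delay = c(5, NA, 20, -3)
)

# Collapse the whole data frame into one row
delay_summary <- summarize(
  toy_flights,
  mean_delay = mean(dep_delay, na.rm = TRUE),  # ignore missing delays
  flights    = n()                             # number of rows
)

print(delay_summary)
```

Without `na.rm = TRUE`, a single missing delay would make the mean `NA`.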

Many summarization functions are available:

* Center: `mean(), median()`
* Spread: `sd(), IQR(), mad()`
* Range: `min(), max(), quantile()`
* Position: `first(), last(), nth()`
* Count: `n(), n_distinct()`
* Logical: `any(), all()`
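Several of the functions listed above can be combined in a single `summarize()` call, as in this sketch. The data frame `toy` and its columns `carrier` and `distance` are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  carrier  = c("X", "Y", "X", "Z", "Y"),
  distance = c(200, 1400, 750, 750, 2500)
)

stats <- toy %>%
  summarize(
    center      = median(distance),       # 750
    spread      = IQR(distance),          # 1400 - 750 = 650
    smallest    = min(distance),          # 200
    first_value = first(distance),        # 200
    third_value = nth(distance, 3),       # 750
    rows        = n(),                    # 5
    carriers    = n_distinct(carrier),    # 3
    any_long    = any(distance > 2000),   # TRUE
    all_long    = all(distance > 2000)    # FALSE
  )

print(stats)
```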

Now, let us use some of these summary functions to create a new table with one row per airport, containing the total number of flights, the mean distance, and the standard deviation of the distance. We want to sort the table by mean distance in descending order. Try to guess which airport has the largest mean distance before proceeding!
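The pipeline for such a table might look like the sketch below. It runs on a made-up data frame with three fictional airports (`toy_flights`, with hypothetical columns `origin` and `distance`), so it shows the shape of the solution without revealing the answer for the real data.

```r
library(dplyr)

# Hypothetical toy data with three fictional airports
toy_flights <- tibble(
  origin   = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC"),
  distance = c(100, 300, 1000, 2000, 500, 700)
)

by_airport <- toy_flights %>%
  group_by(origin) %>%                  # one group per airport
  summarize(
    total_flights = n(),                # flights per airport
    mean_distance = mean(distance),     # AAA 200, BBB 1500, CCC 600
    sd_distance   = sd(distance)
  ) %>%
  arrange(desc(mean_distance))          # largest mean distance first

print(by_airport)
```

Grouping before summarizing is what produces one row per airport rather than a single row for the whole data frame.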