[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/CU-Denver-MathStats-OER/Data-Wrangling-and-Visualization/blob/main/07-Data-Manipulation-with-dplyr.ipynb)



# <a name="07-title"><font size="6">Module 07: Data Manipulation with `dplyr`</font></a>

---

# <a name="tidyverse">Welcome to the `tidyverse`!</a>

---

![The tidyverse logos](https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Data-Wrangling-and-Visualization/main/Images/tidyverse.png)


The [`tidyverse`](https://www.tidyverse.org/) is a [collection of packages](https://www.tidyverse.org/packages/) by [Hadley Wickham](https://blog.revolutionanalytics.com/2016/09/tidyverse.html) that "share an underlying design philosophy, grammar, and data structures" of tidy data.

> The core tidyverse includes the packages that you're likely to use in everyday data analyses:


- [`ggplot2`](https://ggplot2.tidyverse.org/): data visualization.
- [`dplyr`](https://dplyr.tidyverse.org/): data wrangling and transformation.
- [`tidyr`](https://tidyr.tidyverse.org/): arrange into tidy data.
- [`readr`](https://readr.tidyverse.org/): importing tables and files from other applications into R.
- And others ...


The `dplyr` package is a powerful package for restructuring and manipulating data frames. In this notebook, we will learn how to perform basic data frame manipulation using `dplyr`.




## <a name="load-tidyverse">Loading `tidyverse` Packages</a>

---

We can load individual packages within the `tidyverse` one by one, or we can load all packages at once with the command `library(tidyverse)`.


In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## <a name="quest1">Question 1</a>

---

What other packages are part of the `tidyverse`?

<br>



### <a name="sol1">Solution to Question 1</a>

---



<br>  
<br>  


# <a name="tidy-data">Tidy data</a>

---

<font color="dodgerblue">**Tidy data**</font> is structured so that:

1. Each variable is in a single column and no variables share a column.
2. Each observation is in a single row.
3. Each data value has its own cell.




## <a name="quest2">Question 2</a>

---

The `tidyr` package, which is part of the `tidyverse`, provides several data sets to illustrate the concept of tidy data.

The data sets `table1`, `table2`, `table3`, `table4a`, `table4b`, and `table5` provide data about tuberculosis cases for various countries in different years. The variables provided include:

- `country`: the name of the country
- `year`: the year an observation was observed
- `cases`: the number of tuberculosis cases observed in the specified year
- `population`: the population of the country in the specified year
- `rate`: the rate of disease observed in a country in a specified year (`cases/population`)
- `1999`, `2000`: values of a variable observed for a country in that year

<br>  



### <a name="quest2a">Question 2a</a>

---

Run the code cell below and determine if the data in `table2` is tidy. If not, explain why not.

<br>  


In [None]:
table2

country,year,type,count
<chr>,<dbl>,<chr>,<dbl>
Afghanistan,1999,cases,745
Afghanistan,1999,population,19987071
Afghanistan,2000,cases,2666
Afghanistan,2000,population,20595360
Brazil,1999,cases,37737
Brazil,1999,population,172006362
Brazil,2000,cases,80488
Brazil,2000,population,174504898
China,1999,cases,212258
China,1999,population,1272915272


#### <a name="sol2a">Solution to Question 2a</a>

---



<br>  
<br>  


### <a name="quest2b">Question 2b</a>

---

Run the code cell below and determine if the data in `table3` is tidy. If not, explain why not.

<br>  


In [None]:
table3

country,year,rate
<chr>,<dbl>,<chr>
Afghanistan,1999,745/19987071
Afghanistan,2000,2666/20595360
Brazil,1999,37737/172006362
Brazil,2000,80488/174504898
China,1999,212258/1272915272
China,2000,213766/1280428583


#### <a name="sol2b">Solution to Question 2b</a>

---



<br>  
<br>  


## <a name="time-series">Time Series Data</a>

---

Time series data are often stored in a non-tidy way. In that case, the values of a single variable are spread across several time-related columns (one column per day or year, for example). Different variables may then be stored across multiple data frames!

`table4a` and `table4b` in the `tidyr` package exhibit this type of untidiness. `table4a` provides the `cases` values for each `country` for each `year`. `table4b` provides the same thing for the `population` values.

In these two tables, the values for `cases` and `population` are split across multiple columns. Specifically, the `cases` are split across columns by `year` (and similarly for `population`). Additionally, observations are not in a single row since the values of `cases` and `population` for each `country` for each `year` are not in a single row.

In [None]:
table4a

country,1999,2000
<chr>,<dbl>,<dbl>
Afghanistan,745,2666
Brazil,37737,80488
China,212258,213766


In [None]:
table4b

country,1999,2000
<chr>,<dbl>,<dbl>
Afghanistan,19987071,20595360
Brazil,172006362,174504898
China,1272915272,1280428583
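
As a rough illustration of what "tidying" `table4a` could look like, the sketch below stacks the two year columns into a single `year` column. It assumes the `pivot_longer()` function from `tidyr`, which is not otherwise covered in this notebook; the name `tidy4a` is made up for this example.

```r
library(tidyr)  # provides table4a, table1, and pivot_longer()

# Stack the `1999` and `2000` columns into one `year` column and
# move the cell values into a single `cases` column.
tidy4a <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)

tidy4a  # one row per country-year pair; `year` is stored as text here
```

The result has the same `cases` values as `table1`, but each country-year pair now sits in its own row.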


## <a name="a-good-one">A Tidy Example</a>

----



In [None]:
table1

country,year,cases,population
<chr>,<dbl>,<dbl>,<dbl>
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583


# <a name="dplyr">The `dplyr` Package</a>

---


The [`dplyr`](https://dplyr.tidyverse.org/) package provides functions for the "verbs" (i.e., actions) related to common data frame manipulation tasks. Each verb performs a different task, but they all share the following structure:


1. The first argument is always a data frame.

2. The next arguments typically describe which columns to operate on using variable names **without quotes**.

3. The output is a new data frame.

<br>  
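
The three shared properties can be seen in a single call. The sketch below uses `filter()` on the `storms` data that ships with `dplyr`; the cutoff of 150 and the name `strong` are arbitrary choices for illustration.

```r
library(dplyr)

# 1. The first argument is the data frame (storms).
# 2. The column (wind) is referred to without quotes.
strong <- filter(storms, wind > 150)

# 3. The result is a new data frame; storms itself is unchanged.
is.data.frame(strong)
nrow(strong) < nrow(storms)
```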

The table below summarizes some of the most important verbs (functions) by the aspect of a data frame it manipulates.

<br>  

aspect | function | purpose
---|---|---
row | `filter` | selects rows based on a logical statement
row | `slice` | selects rows based on position or other property
row | `arrange` | reorders the rows based on some property
row | `group_by` | groups a collection of rows hierarchically based on one or more variables
group of rows | `summarize` | produces a summarizing value or values for a group of rows
column | `select` | selects columns based on a logical statement, column names, etc.
column | `rename` | changes one or more column names
column | `mutate` | changes an existing column or adds a new column
column | `relocate` | reorders the columns

<br>  

Link to [cheatsheet](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf) for `dplyr`.

<br>  


# <a name="pipe">The pipe</a>

---

The <font color="dodgerblue">**pipe**</font>, `|>`, is a popular approach for "piping" an object into the first argument (usually) of another function.

- The pipe was introduced by the `magrittr` package with `%>%`.
  - `magrittr` is in the `tidyverse`, so you can use `%>%` or `|>` whenever you load the `tidyverse`.
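  - `|>` was added to base R in version 4.1.0 (2021), so it works with or without the `tidyverse`.
  - Note that `%>%` is not part of base R: it only works after loading `magrittr` or a package that re-exports it (such as `dplyr`), which may affect the reproducibility of code and the ability to collaborate.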
- Because each verb does a single action, solving complex problems typically requires composing a sequence of actions together in a particular order.  
- It is extremely popular in the `tidyverse` for stringing sequences of operations together in a readable way.

We will learn about the pipe in the examples that follow (we need to learn actions before we can pipe them), but Wickham, Çetinkaya-Rundel, and Grolemund briefly summarize the concept in [R for Data Science](https://r4ds.hadley.nz/data-transform.html#dplyr-basics) as follows:

> the pipe takes the thing on its left and passes it along to the function on its right so that `x |> f(y)` is equivalent to $f(x, y)$, and `x |> f(y) |> g(z)` is equivalent to $g(f(x, y), z)$. The easiest way to pronounce the pipe is “then”.

<br>  
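
The equivalence in the quote can be checked with base R functions alone. This is a minimal sketch; the vector `x` and the object names are made up for illustration, and the native pipe requires R 4.1.0 or later.

```r
x <- c(4, 9, 16)

nested <- sum(sqrt(x))          # read inside-out: sqrt first, then sum
piped  <- x |> sqrt() |> sum()  # read left to right: x, then sqrt, then sum

identical(nested, piped)

# Extra arguments stay inside the call: x |> f(y) is f(x, y)
rounded <- 3.14159 |> round(digits = 2)
```

Both versions compute the same value; the piped version lists the steps in the order they happen.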




In [4]:
data(package = "dplyr")

In [28]:
?storms

In [16]:
# the glimpse function in dplyr shows all columns of a data frame
glimpse(storms)

Rows: 19,537
Columns: 13
$ name                         <chr> "Amy", "Amy", "Amy", "Amy", "Amy", "Amy",…
$ year                         <dbl> 1975, 1975, 1975, 1975, 1975, 1975, 1975,…
$ month                        <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ day                          <int> 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 2…
$ hour                         <dbl> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18,…
$ lat                          <dbl> 27.5, 28.5, 29.5, 30.5, 31.5, 32.4, 33.3,…
$ long                         <dbl> -79.0, -79.0, -79.0, -79.0, -78.8, -78.7,…
$ status                       <fct> tropical depression, tropical depression,…
$ category                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 

# <a name="tibbles">Tibbles</a>

---

When the `dplyr` package interacts with a data frame, it usually produces a <font color="dodgerblue">**tibble**</font>.

- Tibbles come from the [`tibble`](https://tibble.tidyverse.org/) package, which is automatically imported by `dplyr`.
- A tibble is supposed to be a "modern re-imagining" of R's standard data frame (referred to as `data.frame` for clarity).
- A tibble has class `tbl_df`.
- A tibble is also a `data.frame`, but has different default behaviors.
- Functions that take a `data.frame` as an input should also take a tibble as an input.

<br>  
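
The points above can be checked directly. The sketch below inspects the class of `storms` and converts the built-in `mtcars` data frame to a tibble with `as_tibble()`; the object names `cars_tbl`, `col_df`, and `col_tbl` are made up for illustration. Single-column indexing is used as one example of a different default behavior.

```r
library(dplyr)

class(storms)  # a tibble carries the classes "tbl_df", "tbl", and "data.frame"

cars_tbl <- as_tibble(mtcars)  # convert a base data.frame to a tibble

# One difference in defaults: indexing a single column
col_df  <- mtcars[, 1]    # a data.frame drops to a plain vector
col_tbl <- cars_tbl[, 1]  # a tibble stays a one-column tibble
```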


# <a name="rows">Row Actions</a>

---


## <a name="filter">`filter()`</a>

---

The `filter()` function filters out the rows of a data frame that do not meet the designated criteria. For example, we can `filter()` the rows of the `storms` data frame based on whether the `hurricane_force_diameter` variable is more than 240 nautical miles.


- First, we try accomplishing this goal using indexing with logical statements.
- Then, we use the `filter()` function to perform the same task.


In [None]:
# subset all rows with hurricane force diameter greater than 240 nautical miles
storms[storms$hurricane_force_diameter > 240, ]

In [25]:
# and remove rows where hurricane force diameter is NA
storms[storms$hurricane_force_diameter > 240 &
       !is.na(storms$hurricane_force_diameter), ]

name,year,month,day,hour,lat,long,status,category,wind,pressure,tropicalstorm_force_diameter,hurricane_force_diameter
<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<int>,<int>,<int>,<int>
Maria,2005,9,11,12,52.0,-32.9,extratropical,,65,971,600,250
Maria,2005,9,11,18,54.0,-32.0,extratropical,,65,968,600,250
Maria,2005,9,12,0,55.5,-31.0,extratropical,,65,962,600,250
Maria,2005,9,12,6,57.0,-29.0,extratropical,,65,967,600,250
Maria,2005,9,12,12,58.5,-26.0,extratropical,,65,970,600,250
Sandy,2012,10,29,18,38.3,-73.2,hurricane,1.0,80,940,840,300
Sandy,2012,10,29,21,38.8,-74.0,extratropical,,75,943,840,300
Alex,2016,1,17,0,57.0,-42.0,extratropical,,70,978,600,300


In [27]:
# filter() keeps only rows where the condition is TRUE, so rows with missing values are dropped automatically
filter(storms, hurricane_force_diameter > 240)

name,year,month,day,hour,lat,long,status,category,wind,pressure,tropicalstorm_force_diameter,hurricane_force_diameter
<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<int>,<int>,<int>,<int>
Maria,2005,9,11,12,52.0,-32.9,extratropical,,65,971,600,250
Maria,2005,9,11,18,54.0,-32.0,extratropical,,65,968,600,250
Maria,2005,9,12,0,55.5,-31.0,extratropical,,65,962,600,250
Maria,2005,9,12,6,57.0,-29.0,extratropical,,65,967,600,250
Maria,2005,9,12,12,58.5,-26.0,extratropical,,65,970,600,250
Sandy,2012,10,29,18,38.3,-73.2,hurricane,1.0,80,940,840,300
Sandy,2012,10,29,21,38.8,-74.0,extratropical,,75,943,840,300
Alex,2016,1,17,0,57.0,-42.0,extratropical,,70,978,600,300
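
The difference in how missing values are handled is easier to see on a tiny data frame. The sketch below uses a made-up data frame `toy` (not part of `storms`) and also shows that several conditions separated by commas are combined with "and".

```r
library(dplyr)

toy <- data.frame(id = 1:4, diameter = c(100, NA, 300, 250))

# Base R indexing keeps a row of NAs wherever the test itself is NA
base_rows <- toy[toy$diameter > 240, ]

# filter() keeps only the rows where the test is TRUE
dplyr_rows <- filter(toy, diameter > 240)

# Conditions separated by commas must all hold
both <- filter(toy, diameter > 240, id < 4)
```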


## <a name="quest3">Question 3</a>

---

Modify the code in the previous cell to subset all rows in `storms` that have a hurricane force diameter greater than 240 nautical miles using the `filter()` function and the pipe, `|>`.


### <a name="sol3">Solution to Question 3</a>

---


<br>  


## <a name="slice">`slice()`</a>

---


The `slice()` function subsets rows of a data frame by index or with respect to other properties. To slice by row index, we can use the syntax `slice(data_name, a:b)`.


## <a name="quest4">Question 4</a>

---


Subset rows 10-12 of `storms` using the following methods:

- Indexing without using the `slice()` function at all.
- Using the `slice()` function without the pipe.
- Using the `slice()` function with the pipe.

<br>  

### <a name="sol4">Solution to Question 4</a>

---


<br>  


In [None]:
# subset rows 10-12 by indexing


In [None]:
# subset rows 10-12 using slice() without the pipe


In [None]:
# subset rows 10-12 using slice() with the pipe


### <a name="other-slice">Other Slice Functions</a>

---


- `slice_head(df, n = k)` subsets the first `k` rows of a data frame `df`.
- `slice_tail(df, n = k)` subsets the last `k` rows of a data frame `df`.

These are similar to the `head` and `tail` functions in base R.






In [101]:
slice_head(storms, n = 3)  # subset first three rows of storms

name,year,month,day,hour,lat,long,status,category,wind,pressure,tropicalstorm_force_diameter,hurricane_force_diameter
<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<int>,<int>,<int>,<int>
Amy,1975,6,27,0,27.5,-79,tropical depression,,25,1013,,
Amy,1975,6,27,6,28.5,-79,tropical depression,,25,1013,,
Amy,1975,6,27,12,29.5,-79,tropical depression,,25,1013,,


In [102]:
storms |> slice_head(n = 3)  # subset first three rows with pipe

name,year,month,day,hour,lat,long,status,category,wind,pressure,tropicalstorm_force_diameter,hurricane_force_diameter
<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<int>,<int>,<int>,<int>
Amy,1975,6,27,0,27.5,-79,tropical depression,,25,1013,,
Amy,1975,6,27,6,28.5,-79,tropical depression,,25,1013,,
Amy,1975,6,27,12,29.5,-79,tropical depression,,25,1013,,


In [103]:
slice_tail(storms, n = 2)  # subset last two rows of storms

name,year,month,day,hour,lat,long,status,category,wind,pressure,tropicalstorm_force_diameter,hurricane_force_diameter
<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<int>,<int>,<int>,<int>
Nicole,2022,11,11,12,33.2,-84.6,tropical depression,,25,999,0,0
Nicole,2022,11,11,18,35.4,-83.8,other low,,25,1000,0,0


In [104]:
storms |> slice_tail(n = 2)  # subset last two rows with pipe

name,year,month,day,hour,lat,long,status,category,wind,pressure,tropicalstorm_force_diameter,hurricane_force_diameter
<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<int>,<int>,<int>,<int>
Nicole,2022,11,11,12,33.2,-84.6,tropical depression,,25,999,0,0
Nicole,2022,11,11,18,35.4,-83.8,other low,,25,1000,0,0


- `slice_max(df, var, n = k)` subsets the `k` rows of data frame `df` with the largest values of the variable `var`.
- `slice_min(df, var, n = k)` subsets the `k` rows of data frame `df` with the smallest values of the variable `var`.
- `slice_sample(df, n = k)` randomly selects `k` rows from data frame `df`. We can make results reproducible by fixing the randomization with the base R function `set.seed()`.
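
Two details of these functions are worth a small sketch: `slice_max()` and `slice_min()` keep tied rows by default (controlled by the `with_ties` argument), and `set.seed()` makes `slice_sample()` repeatable. The data frame `scores` and the seed value are made up for illustration.

```r
library(dplyr)

scores <- data.frame(id = 1:4, score = c(10, 30, 30, 20))

# Two rows share the largest score, so both are returned by default
top_ties <- slice_max(scores, score, n = 1)

# with_ties = FALSE returns exactly n rows
top_no_ties <- slice_max(scores, score, n = 1, with_ties = FALSE)

# Resetting the same seed before sampling gives the same random rows
set.seed(2024)
draw1 <- slice_sample(scores, n = 2)
set.seed(2024)
draw2 <- slice_sample(scores, n = 2)
identical(draw1, draw2)
```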




## <a name="quest5">Question 5</a>

---


Without running the code, type a comment in the code to explain what the code does. Then run the code cells to check your answer.

<br>  



### <a name="sol5">Solution to Question 5</a>

---


<br>  


In [None]:
slice_max(storms, wind, n = 5)

In [None]:
storms |> slice_min(pressure, n = 2)

In [None]:
storms |>  slice_sample(n = 7)

## <a name="arrange">`arrange()`</a>

---

The `arrange()` function arranges or sorts the rows of a data frame with respect to a certain variable.

- By default, the rows are ordered in ascending order.
- The ordering can be done within groups if specified.

In the code below, we arrange the rows of `storms` in ascending and then descending order, respectively, with respect to `wind`.



In [None]:
# order rows of storms in ascending order by wind speed
arrange(storms, wind)

### <a name="print-colab">Viewing or Printing Tibbles and Data Frames</a>

---


When viewing a data frame or tibble as output from a code cell in Colab, by default the entire data frame is displayed if possible. If there are a lot of rows, then the first 30 rows and last 30 rows are displayed in a scrolling window, which may not be very convenient since that is a lot of raw data to look at!

If instead we use the `print()` function to display the data frame, then we get a condensed view that takes up less of the screen when displayed and is easier to read.

In [29]:
# print more concise output to screen
print(arrange(storms, wind))

# A tibble: 19,537 × 13
   name      year month   day  hour   lat  long status   category  wind pressure
   <chr>    <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>       <dbl> <int>    <int>
 1 Bonnie    1986     6    28     6  36.5 -91.3 tropica…       NA    10     1013
 2 Bonnie    1986     6    28    12  37.2 -90   tropica…       NA    10     1012
 3 Charley   1986     8    13    12  30.1 -84   subtrop…       NA    10     1009
 4 Charley   1986     8    13    18  30.8 -84   subtrop…       NA    10     1012
 5 Charley   1986     8    14     0  31.

In [80]:
# using pipes
storms |> arrange(wind) |> print()

# A tibble: 19,537 × 13
   name      year month   day  hour   lat  long status   category  wind pressure
   <chr>    <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>       <dbl> <int>    <int>
 1 Bonnie    1986     6    28     6  36.5 -91.3 tropica…       NA    10     1013
 2 Bonnie    1986     6    28    12  37.2 -90   tropica…       NA    10     1012
 3 Charley   1986     8    13    12  30.1 -84   subtrop…       NA    10     1009
 4 Charley   1986     8    13    18  30.8 -84   subtrop…       NA    10     1012
 5 Charley   1986     8    14     0  31.

In [81]:
storms |>
  arrange(wind) |>
  print(n = 15, width = Inf)  # print first 15 rows and all columns

# A tibble: 19,537 × 13
   name      year month   day  hour   lat  long status                 category
   <chr>    <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>                     <dbl>
 1 Bonnie    1986     6    28     6  36.5 -91.3 tropical depression          NA
 2 Bonnie    1986     6    28    12  37.2 -90   tropical depression          NA
 3 Charley   1986     8    13    12  30.1 -84   subtropical depression       NA
 4 Charley   1986     8    13    18  30.8 -84   subtropical depression       NA
 5 Charley   1986     8    14     0  31.4 -83.6 subtropical depression       NA


### <a name="desc">`desc()`</a>


---

The `desc()` function can be used to arrange the rows in descending order.

<br>  


In [82]:
arrange(storms, desc(wind)) |>  # order rows of storms in descending order by wind speed
  print()

# A tibble: 19,537 × 13
   name     year month   day  hour   lat  long status    category  wind pressure
   <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>        <dbl> <int>    <int>
 1 Allen    1980     8     7    18  21.8 -86.4 hurricane        5   165      899
 2 Gilbert  1988     9    14     0  19.7 -83.8 hurricane        5   160      888
 3 Wilma    2005    10    19    12  17.3 -82.8 hurricane        5   160      882
 4 Dorian   2019     9     1    16  26.5 -77   hurricane        5   160      910
 5 Dorian   2019     9     1    18  26.5 -77.1

In [83]:
storms |>
  arrange(desc(wind)) |>  # order rows of storms in descending order by wind speed
  print()

# A tibble: 19,537 × 13
   name     year month   day  hour   lat  long status    category  wind pressure
   <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>        <dbl> <int>    <int>
 1 Allen    1980     8     7    18  21.8 -86.4 hurricane        5   165      899
 2 Gilbert  1988     9    14     0  19.7 -83.8 hurricane        5   160      888
 3 Wilma    2005    10    19    12  17.3 -82.8 hurricane        5   160      882
 4 Dorian   2019     9     1    16  26.5 -77   hurricane        5   160      910
 5 Dorian   2019     9     1    18  26.5 -77.1

In [84]:
storms |>
  arrange(desc(wind)) |>
  slice_head(n = 10)

name,year,month,day,hour,lat,long,status,category,wind,pressure,tropicalstorm_force_diameter,hurricane_force_diameter
<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<int>,<int>,<int>,<int>
Allen,1980,8,7,18,21.8,-86.4,hurricane,5,165,899,,
Gilbert,1988,9,14,0,19.7,-83.8,hurricane,5,160,888,,
Wilma,2005,10,19,12,17.3,-82.8,hurricane,5,160,882,265.0,65.0
Dorian,2019,9,1,16,26.5,-77.0,hurricane,5,160,910,200.0,60.0
Dorian,2019,9,1,18,26.5,-77.1,hurricane,5,160,910,210.0,60.0
Allen,1980,8,5,12,15.9,-70.5,hurricane,5,155,932,,
Allen,1980,8,7,12,21.0,-84.8,hurricane,5,155,910,,
Allen,1980,8,8,0,22.2,-87.9,hurricane,5,155,920,,
Allen,1980,8,9,6,25.0,-94.2,hurricane,5,155,909,,
Gilbert,1988,9,14,6,19.9,-85.3,hurricane,5,155,889,,
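
`arrange()` also accepts more than one sorting variable, with later variables breaking ties in earlier ones. The sketch below is one possible combination, sorting by `year` in ascending order and then by `wind` in descending order within each year; the name `by_year_wind` is made up for illustration.

```r
library(dplyr)

# Sort by year (ascending), then by wind speed (descending) within each year
by_year_wind <- arrange(storms, year, desc(wind))

slice_head(by_year_wind, n = 5)  # strongest observations from the earliest year
```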


### <a name="slice-group">Arranging `.by_group`</a>

---


The data frame can also be arranged within groups. In the code below, we perform the following actions:

1. Go into the `storms` data frame.
2. Group the data together by storm `status`.
3. Slice rows 1 through 3 from each `status` group.
4. Store the result to `storms_slice`.

We then sort the rows of `storms_slice` by `wind` within each group by setting `.by_group` to `TRUE`.

<br>  





In [72]:
storms_slice <- storms |>  # go into storms data frame
                  group_by(status) |>  # group data together by status
                  slice(1:3)  # slice rows 1-3 from each status group

In [73]:
# order storms_slice in ascending order by wind speed within each group
arrange(storms_slice, wind, .by_group = TRUE)

name,year,month,day,hour,lat,long,status,category,wind,pressure,tropicalstorm_force_diameter,hurricane_force_diameter
<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<int>,<int>,<int>,<int>
Georges,1980,9,4,6,22.1,-62.3,disturbance,,20,1012,,
Georges,1980,9,4,12,23.4,-63.6,disturbance,,20,1013,,
Georges,1980,9,4,18,24.8,-64.8,disturbance,,20,1013,,
Amy,1975,7,4,12,47.0,-48.0,extratropical,,45,995,,
Blanche,1975,7,28,12,44.0,-65.2,extratropical,,60,988,,
Blanche,1975,7,28,18,47.2,-62.4,extratropical,,60,992,,
Blanche,1975,7,27,6,35.9,-70.0,hurricane,1.0,65,987,,
Blanche,1975,7,27,12,36.9,-69.0,hurricane,1.0,70,984,,
Blanche,1975,7,27,18,37.9,-68.0,hurricane,1.0,75,981,,
Arlene,1987,8,8,0,34.3,-77.5,other low,,10,1016,,


# <a name="column">Column Actions</a>

---


## <a name="group-by">`group_by()`</a>

---


The `group_by()` function groups the rows of a data frame with respect to one or more variables so that the groups can then be manipulated in various ways. `group_by()` does not change the data itself, but we can see from the output that the data frame now has a different structure: it is now a grouped data frame. This grouped structure affects how subsequent verbs act on the data.
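
One way to inspect the grouped structure without printing any rows is sketched below, using the `dplyr` helpers `group_vars()`, `n_groups()`, `is_grouped_df()`, and `ungroup()`. The names `by_month` and `no_groups` are made up for illustration.

```r
library(dplyr)

by_month <- group_by(storms, month)

group_vars(by_month)     # name(s) of the grouping variable(s)
n_groups(by_month)       # how many groups there are
is_grouped_df(by_month)  # TRUE for a grouped data frame

# ungroup() removes the grouping; the rows themselves are untouched
no_groups <- ungroup(by_month)
```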



In [None]:
# this will print too much to screen
group_by(storms, month)

In [86]:
storms |>
  group_by(month) |>  # group storms data together by month
  print(n = 20)  # print first 20 rows to screen

# A tibble: 19,537 × 13
# Groups:   month [10]
   name   year month   day  hour   lat  long status      category  wind pressure
   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>          <dbl> <int>    <int>
 1 Amy    1975     6    27     0  27.5 -79   tropical d…       NA    25     1013
 2 Amy    1975     6    27     6  28.5 -79   tropical d…       NA    25     1013
 3 Amy    1975     6    27    12  29.5 -79   tropical d…       NA    25     1013
 4 Amy    1975     6    27    18  30.5 -79   tropical d…       NA    25     1013
 5 Amy    1975     6    28 

## <a name="summarize">`summarize()`</a>

---

`group_by()` is often used in close connection with the `summarize()` function, which collapses each group of rows into a single row of summary statistics.

- Since we now want to string several operations in a row, **the pipe is useful for stringing the commands together**.

In the code below, we group the `storms` data with respect to `status` and `month` and summarize `wind` within each group (compute the mean, the median, and the number of observations).

- We create variables with designated names inside `summarize()`.
- `n()` counts the number of observations in each grouping.
- `na.rm` may be needed when computing statistics (not in this situation).
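
The bullets above can be sketched as follows. This version also summarizes `hurricane_force_diameter`, a column that does contain missing values, to show where `na.rm = TRUE` matters, and it sets the `.groups` argument of `summarize()` so the result comes back ungrouped. The names `wind_summary`, `mean_hurr_diam`, and `n_obs` are made up for illustration.

```r
library(dplyr)

wind_summary <- storms |>
  group_by(status, month) |>
  summarize(
    mean_wind = mean(wind),  # wind has no missing values in storms
    # this column has missing values, so na.rm = TRUE is needed;
    # groups with no recorded diameters give NaN
    mean_hurr_diam = mean(hurricane_force_diameter, na.rm = TRUE),
    n_obs = n(),             # number of rows in each group
    .groups = "drop"         # return an ungrouped tibble
  )

print(wind_summary, n = 5)
```

Without `.groups`, a summary over two grouping variables stays grouped by the first one, and `summarize()` prints a message saying so.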





In [None]:
##############################################################
# running this code cell prints too much output to the screen
##############################################################
storms |>  # read in storms data
  group_by(status, month) |>  # group together data by status and month
  summarize(mean_wind = mean(wind),  # compute mean wind speed of grouped data
            median_wind =  median(wind),  # compute median wind speed of grouped data
            count_wind = n())  # count number of observations in group

In [88]:
storms |>
  group_by(status, month) |>
  summarize(mean_wind = mean(wind),
            median_wind =  median(wind),
            count_wind = n()) |>
  print(n=24)  # print 24 rows of condensed output

`summarise()` has grouped output by 'status'. You can override using the
`.groups` argument.


# A tibble: 75 × 5
# Groups:   status [9]
   status        month mean_wind median_wind count_wind
   <fct>         <dbl>     <dbl>       <dbl>      <int>
 1 disturbance       6      34.6        35           35
 2 disturbance       7      30.2        27.5         46
 3 disturbance       8      31          30           25
 4 disturbance       9      26.1        25           41
 5 disturbance      10      30.9        30           16
 6 disturbance      11      24.4        25            8
 7 extratropical     1      54.0        55           29
 8 extratropical     4      37.6        35           40
 9 extratropical     5      42.2        45           18
10 extratropical     6      35.7        35          130
11 extratropical     7      36.5        35          135

## <a name="quest6">Question 6</a>

---

Create a data frame called `yearly_wind` that gives the maximum, minimum, and mean storm wind speed for the storms in each year of the `storms` data frame. Then subset the 5 rows of `yearly_wind` with the largest maximum wind speeds.

<br>  


### <a name="sol6">Solution to Question 6</a>

---


<br>  


## <a name="select">`select()`</a>

---


The `select` function can be used to select columns of a data frame. You can get pretty creative in how you select the columns (e.g., using string matching), but we only consider some simple examples.

- We can select specific columns of a data frame by providing the column names.
- We can select the columns of `storms` that end in `meter`.
- Run `?select` to see help documentation and more complex examples.



In [None]:
# select just the lat and long columns from storms
# prints the full data frame to screen which is not desirable
# we should print summarized or condensed output
select(storms, lat, long)

In [89]:
storms |>  # read in storms data
  select(lat, long) |>  # select lat and long columns
  print(n = 10)  # print first 10 rows

# A tibble: 19,537 × 2
     lat  long
   <dbl> <dbl>
 1  27.5 -79  
 2  28.5 -79  
 3  29.5 -79  
 4  30.5 -79  
 5  31.5 -78.8
 6  32.4 -78.7
 7  33.3 -78  
 8  34   -77  
 9  34.4 -75.8
10  34   -74.8
# ℹ 19,527 more rows


In [76]:
storms |>  # read in storms data
  select(ends_with("meter")) |>  # select columns whose names end with the string "meter"
  slice_head(n = 10)  # subset the first 10 rows

tropicalstorm_force_diameter,hurricane_force_diameter
<int>,<int>
,
,
,
,
,
,
,
,
,
,
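
A few other ways to pick columns are sketched below: a range of adjacent columns, dropping columns with a minus sign, matching the start of a name, and choosing by column type. These are only a sample of what `?select` documents, and the object names are made up for illustration.

```r
library(dplyr)

date_cols <- select(storms, name, year:day)     # name plus the adjacent columns year through day
no_coords <- select(storms, -c(lat, long))      # everything except lat and long
h_cols    <- select(storms, starts_with("h"))   # columns whose names begin with "h"
num_cols  <- select(storms, where(is.numeric))  # columns that hold numeric data

names(date_cols)
```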


## <a name="rename">`rename()`</a>

---

The `rename()` function renames the columns of a data frame.

- The syntax is `new_name = old_name`.

In the example below, we rename the `tropicalstorm_force_diameter` column to `trop_diameter` and `hurricane_force_diameter` to `hurr_diameter`.



In [61]:
# renaming columns using base R
names(storms)
# names(storms)[12:13] <- c("trop_diameter", "hurr_diameter")

In [None]:
# renaming columns using dplyr
# this prints too much output to screen (full data set)
rename(storms, trop_diameter = tropicalstorm_force_diameter, hurr_diameter = hurricane_force_diameter)

In [96]:
# renaming columns using dplyr
storms |>
  rename(trop_diameter = tropicalstorm_force_diameter,
         hurr_diameter = hurricane_force_diameter) |>
  glimpse()  # display glimpse of data frame to screen

Rows: 19,537
Columns: 13
$ name          <chr> "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", …
$ year          <dbl> 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975, 19…
$ month         <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7,…
$ day           <int> 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 29, 29, 30, 30, …
$ hour          <dbl> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18, …
$ lat           <dbl> 27.5, 28.5, 29.5, 30.5, 31.5, 32.4, 33.3, 34.0, 34.4, 34…
$ long          <dbl> -79.0, -79.0, -79.0, -79.0, -78.8, -78.7, -78.0, -77.0, …
$ status        <fct> tropical depression, tropical depression, tropical depre…
$ category      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
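
Like the other `dplyr` verbs, `rename()` returns a new data frame and leaves `storms` untouched, so the result has to be assigned to an object if the new names are to be kept. A minimal sketch, assuming the `storms` data frame from `dplyr`; `storms_renamed` is a name chosen here for illustration:

```r
library(dplyr)

# store the renamed data frame in a new object
storms_renamed <- storms |>
  rename(trop_diameter = tropicalstorm_force_diameter,
         hurr_diameter = hurricane_force_diameter)

# the new object has the new names
"trop_diameter" %in% names(storms_renamed)

# the original storms data frame still has the old names
"tropicalstorm_force_diameter" %in% names(storms)
```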

## <a name="mutate">`mutate()`</a>

---


The `mutate()` function modifies an existing column or creates a new column from existing columns.

- To create a new column, the syntax is `new_column = function(existing_columns)`.
- To modify an existing column, the syntax is `existing_column = function(existing_columns)`.
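
Both forms can be sketched in one call. This example assumes the `storms` data frame from `dplyr`, whose `pressure` column is recorded in millibars. The new column name `pressure_inhg`, the object name `storms_mutated`, and the conversion to inches of mercury (1 millibar is roughly 0.02953 inHg) are choices made here for illustration.

```r
library(dplyr)

storms_mutated <- storms |>
  mutate(pressure_inhg = pressure * 0.02953,  # create a new column
         hour = as.integer(hour))             # modify an existing column

# display only the affected columns for the first 5 rows
storms_mutated |>
  select(hour, pressure, pressure_inhg) |>
  slice_head(n = 5)
```

Because the name `hour` already exists in `storms`, `mutate()` overwrites that column in the result, while `pressure_inhg` is appended as a new last column.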
  


## <a name="quest7">Question 7</a>

---


The variable `wind` in `storms` gives the storm's maximum sustained wind speed in knots. One knot is approximately equal to $1.15078$ miles per hour. In the code cell below, transform `wind` into a new variable named `wind_mph` that gives the storm's maximum sustained wind speed in miles per hour instead of knots. Then select the two wind columns (`wind` and `wind_mph`) and print the first 10 rows of those two columns.

<br>  


### <a name="sol7">Solution to Question 7</a>

---


<br>  


## <a name="relocate">`relocate()`</a>

---


The `relocate()` function moves one or more columns.

- By default, the columns are moved to the front of the data frame.
- The `.before` and `.after` arguments can be used to specify where the columns should be placed.


In [98]:
# move pressure to first column
storms |>
  relocate(pressure) |>
  slice_head(n = 5)

pressure,name,year,month,day,hour,lat,long,status,category,wind,tropicalstorm_force_diameter,hurricane_force_diameter
<int>,<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<int>,<int>,<int>
1013,Amy,1975,6,27,0,27.5,-79.0,tropical depression,,25,,
1013,Amy,1975,6,27,6,28.5,-79.0,tropical depression,,25,,
1013,Amy,1975,6,27,12,29.5,-79.0,tropical depression,,25,,
1013,Amy,1975,6,27,18,30.5,-79.0,tropical depression,,25,,
1012,Amy,1975,6,28,0,31.5,-78.8,tropical depression,,25,,


In [99]:
# move pressure and status to after wind column
storms |>
  relocate(pressure, status, .after = wind) |>
  slice_head(n = 5)

name,year,month,day,hour,lat,long,category,wind,pressure,status,tropicalstorm_force_diameter,hurricane_force_diameter
<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>,<int>
Amy,1975,6,27,0,27.5,-79.0,,25,1013,tropical depression,,
Amy,1975,6,27,6,28.5,-79.0,,25,1013,tropical depression,,
Amy,1975,6,27,12,29.5,-79.0,,25,1013,tropical depression,,
Amy,1975,6,27,18,30.5,-79.0,,25,1013,tropical depression,,
Amy,1975,6,28,0,31.5,-78.8,,25,1012,tropical depression,,


In [100]:
# move all columns ending in `meter` to before the `status` column.
storms |>
  relocate(ends_with("meter"), .before = status) |>
  slice_head(n = 5)

name,year,month,day,hour,lat,long,tropicalstorm_force_diameter,hurricane_force_diameter,status,category,wind,pressure
<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<fct>,<dbl>,<int>,<int>
Amy,1975,6,27,0,27.5,-79.0,,,tropical depression,,25,1013
Amy,1975,6,27,6,28.5,-79.0,,,tropical depression,,25,1013
Amy,1975,6,27,12,29.5,-79.0,,,tropical depression,,25,1013
Amy,1975,6,27,18,30.5,-79.0,,,tropical depression,,25,1013
Amy,1975,6,28,0,31.5,-78.8,,,tropical depression,,25,1012
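
To move a column to the end instead of the front, one option is to combine `.after` with the selection helper `last_col()`. A small sketch, again assuming the `storms` data frame from `dplyr`; `name_last` is an illustrative object name:

```r
library(dplyr)

# move name to the last column
name_last <- storms |>
  relocate(name, .after = last_col()) |>
  slice_head(n = 5)

# check the new column order
names(name_last)
```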


## <a name="CC License">Creative Commons License Information</a>
---

![Creative Commons
License](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)

Materials created by the [Department of Mathematical and Statistical Sciences at the University of Colorado Denver](https://github.com/CU-Denver-MathStats-OER/)
are licensed under a [Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License](http://creativecommons.org/licenses/by-nc-sa/4.0/).