# Data Wrangling 2

Welcome to part 2! In this session, we will recap everything we covered in [Part 1](https://jackedtaylor.github.io/expra-wise23/introduction/data_wrangling_1.html), and we will cover:

* [Pipes](#pipes): `|>`
* [Reformatting into Wide / Long format](#wide-and-long-data-formatting): `pivot_wider()` and `pivot_longer()`

We will use two main packages today, `dplyr` and `tidyr`:

In [14]:
library(dplyr)
library(tidyr)

<br>

---

## Pipes

The pipe operator looks like this: `|>`. It takes the output of one function, and "pipes" it into the first argument of the next function.

But why would such a thing be useful? Well, here's some example code to hopefully demonstrate...

In [1]:
options(repr.plot.width=3.5, repr.plot.height=3, repr.matrix.max.rows=10)

In [2]:
data_1 <- starwars
data_2 <- filter(data_1, homeworld != "Alderaan")
data_3 <- mutate(data_2, height_inches = height/2.54)
data_4 <- select(data_3, name, height, height_inches)
data_5 <- arrange(data_4, height)

data_5


| name | height | height_inches |
|---|---|---|
| `<chr>` | `<int>` | `<dbl>` |
| Ratts Tyerell | 79 | 31.10236 |
| Wicket Systri Warrick | 88 | 34.64567 |
| Dud Bolt | 94 | 37.00787 |
| R2-D2 | 96 | 37.79528 |
| R5-D4 | 97 | 38.18898 |
| ... | ... | ... |
| Roos Tarpals | 224 | 88.18898 |
| Chewbacca | 228 | 89.76378 |
| Lama Su | 229 | 90.15748 |
| Tarfful | 234 | 92.12598 |


In [3]:
print(data_5)

# A tibble: 74 x 3
   name                  height height_inches
   <chr>                  <int>         <dbl>
 1 Ratts Tyerell             79          31.1
 2 Wicket Systri Warrick     88          34.6
 3 Dud Bolt                  94          37.0
 4 R2-D2                     96          37.8
 5 R5-D4                     97          38.2
 6 Sebulba                  112          44.1
 7 Gasgano                  122          48.0
 8 Watto                    137          53.9
 9 Mon Mothma               150          59.1
10 Cordé                    157          61.8
# i 64 more rows


As you can hopefully see, we start with one data frame, `starwars`, and then apply the `filter()`, `mutate()`, `select()`, and `arrange()` functions in turn. At each step, we take the result of the previous step, apply the next function to it, and store the result in a new variable.

Rather than storing several variables that we are not interested in, another approach would be to nest the functions within each other's parentheses. For example:

In [4]:
# an example of nested data wrangling - difficult to read isn't it?
final_data <- arrange(
    select(
        mutate(
            filter(starwars, homeworld != "Alderaan"),
            height_inches = height/2.54
        ),
        name, height, height_inches
    ),
    height
)

This is really difficult to read, isn't it!?

What if we want code as readable as the first example, but without the unnecessary intermediate variables? Pipes are a perfect solution!

In [5]:
# a clear, readable example using pipes
final_data <- starwars |>
    filter(homeworld != "Alderaan") |>
    mutate(height_inches = height/2.54) |>
    select(name, height, height_inches) |>
    arrange(height)

Each line tells R what to do with the output of the previous line. That output is always passed as the first argument of the function on the next line.
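To make this rule concrete, here is a minimal sketch that uses only base R functions. It assumes R 4.1 or later (where the native `|>` pipe is available), and the vector `x` is made up purely for illustration.

```r
# A made-up vector, just for illustration
x <- c(4, 9, 16)

# Nested version: has to be read from the inside out
nested <- round(sum(sqrt(x)), 1)

# Piped version: can be read from top to bottom
piped <- x |>
    sqrt() |>
    sum() |>
    round(1)

# Both approaches give exactly the same result
identical(nested, piped)

# R rewrites the pipe into an ordinary function call when it reads the code,
# with the left-hand side placed in the first argument
quote(x |> round(1))
```

The last line should show `round(x, 1)`: the piped value `x` has been slotted in as the first argument, and any arguments we wrote ourselves (here `1`) follow after it.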

<br>

### Check your Knowledge!

Rewrite the following snippets of code to use pipes (`|>`). Check that your piped version produces the same value as the last variable assigned in the non-piped example.

##### 1A)

In [6]:
filtered_naboo <- filter(starwars, homeworld=="Naboo")
naboo_characters <- pull(filtered_naboo, name)

##### 1B)

In [7]:
height_summ <- summarise(group_by(starwars, homeworld), mean_height=mean(height))

##### 1C)

In [8]:
sw_filt <- filter(starwars, birth_year>50)
sw_grp <- group_by(sw_filt, species)
mass_summ <- summarise(sw_grp, M = mean(mass), SD = sd(mass))

##### 1D)

In [27]:
hws <- pull(starwars, homeworld)
unique_worlds <- sort(unique(hws))

<br>

---

## Wide and Long Data Formatting

Imagine we have a dataset of names, heights, and ages. There are two sensible ways we can organise these data in a table.

##### Wide Format Example

* Each variable is in a different column
* Each row refers to one ID (name)

In [26]:
data.frame(
    name = c("Julia", "Hans", "Paul", "Laura"),
    height = c(180, 193, 174, 168),
    age = c(25, 32, 28, 30)
)

| name | height | age |
|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` |
| Julia | 180 | 25 |
| Hans | 193 | 32 |
| Paul | 174 | 28 |
| Laura | 168 | 30 |


##### Long Format Example

* One column contains values from multiple variables
* Another column tells us which variable each value comes from
* Each row refers to one combination of variable and ID (name)

In [24]:
data.frame(
    name = rep(c("Julia", "Hans", "Paul", "Laura"), each=2),
    variable = rep(c("height", "age"), 4),
    value = c(180, 25, 193, 32, 174, 28, 168, 30)
)

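| name | variable | value |
|---|---|---|
| `<chr>` | `<chr>` | `<dbl>` |
| Julia | height | 180 |
| Julia | age | 25 |
| Hans | height | 193 |
| Hans | age | 32 |
| Paul | height | 174 |
| Paul | age | 28 |
| Laura | height | 168 |
| Laura | age | 30 |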


Wide and Long data formats are simply different ways of representing the data.
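As a rough sketch of how the two formats relate, the example tables above can be converted into one another with `pivot_longer()` and `pivot_wider()` from `tidyr`. The object names `wide`, `long`, and `wide_again` are chosen here just for illustration, and the column names `variable` and `value` are taken from the long format example.

```r
library(tidyr)

# The wide format example: one row per name, one column per variable
wide <- data.frame(
    name = c("Julia", "Hans", "Paul", "Laura"),
    height = c(180, 193, 174, 168),
    age = c(25, 32, 28, 30)
)

# Wide -> long: gather the height and age columns into two new columns,
# one naming the variable and one holding its value
long <- wide |>
    pivot_longer(
        cols = c(height, age),
        names_to = "variable",
        values_to = "value"
    )

# Long -> wide: spread the values back out into one column per variable
wide_again <- long |>
    pivot_wider(names_from = variable, values_from = value)

long
wide_again
```

Here `long` should have eight rows (four names times two variables), and `wide_again` should contain the same values as `wide`. The only difference is that `tidyr` returns a tibble rather than a plain data frame.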