# Data Wrangling

<h5>

**Wrangling** /ˈræŋ.ɡəl.ɪŋ/

the activity of taking care of, controlling, or moving animals, especially large animals such as cows or horses

([Cambridge Dictionary](https://dictionary.cambridge.org/dictionary/english/wrangling))

</h5>

![Cattle Wrangler - image from https://commons.wikimedia.org/wiki/File:Pioneer_Day_Wrangler.jpg](https://upload.wikimedia.org/wikipedia/commons/thumb/8/83/Pioneer_Day_Wrangler.jpg/320px-Pioneer_Day_Wrangler.jpg)

**[Data wrangling](https://en.wikipedia.org/wiki/Data_wrangling)** commonly refers to the transformation of data from one "input" format (e.g., `.csv` files from an experiment) to a different format (e.g., a tidy dataframe) that is better suited to the needs of an analysis. In the context of the ExPra experiments, you will use data wrangling techniques to implement the transformations and data cleaning steps specified in your preregistrations.

## Setup

### Setup Part 1: Install Packages

We will use [`tidyverse`](https://www.tidyverse.org/) packages to implement our data wrangling. The "tidyverse" is a collection of packages which share a philosophy based around code and data structures that are (a) tidy, and (b) readable. You can install the tidyverse packages like so:

```
install.packages("tidyverse")
```

This includes many packages that we won't be using today, but which will be useful in other parts of the course (e.g., on Data Visualisation).

Remember, you should install packages in the console, never in a script that you share with others. Otherwise, your script will go to the effort of reinstalling the package *every time* it is run!
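If a shared script should still fail gracefully when a package is missing, one option is to check for the package and stop with a helpful message rather than installing anything. The following is a minimal sketch: `require_installed()` is a hypothetical helper written for this example (it is not part of any package), built on base R's `requireNamespace()`.

```r
# Hypothetical helper (not part of any package): fail early with a clear
# message instead of installing anything from inside the script.
require_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(
      "Package '", pkg, "' is not installed. ",
      "Run install.packages(\"", pkg, "\") in the console first.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# 'stats' ships with R, so this call passes silently.
# In a real script you would check e.g. "dplyr" instead.
require_installed("stats")
```

This way the person running the script decides when (and whether) to install the package.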

### Setup Part 2: Check the Packages Load

Now, we can test that the packages we will be using today actually load. You should be able to run this code without any errors:

In [1]:
options(repr.plot.width=3.5, repr.plot.height=3)

In [2]:
library(dplyr)
library(tidyr)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




### Setup Part 3: Check the Dataset

Finally, check that you can access the dataset we'll be using in this session. The `starwars` dataset ships with the `dplyr` package (so it is available once `dplyr` is loaded) and contains details of characters from the Star Wars films:

In [4]:
print(starwars)

# A tibble: 87 x 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
 2 C-3PO       167    75 NA         gold       yellow         112   none  mascu~
 3 R2-D2        96    32 NA         white, bl~ red             33   none  mascu~
 4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
 5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
 6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
 7 Beru Wh~    165    75 brown      light      blue

This snapshot shows an example of tidy data: a way of organising data such that each observation (here, each *character*) has its own row, and each variable describing that observation has its own column.
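To make the idea of tidy data more concrete, here is a small sketch of turning "wide" data into tidy data with `pivot_longer()` from `tidyr`. The `wide` dataframe, its column names, and its values are made up purely for illustration; they are not from the ExPra experiments.

```r
library(tidyr)

# Hypothetical "wide" data: one row per participant,
# with one column of reaction times per condition
wide <- data.frame(
  participant = c("p1", "p2"),
  congruent   = c(510, 480),
  incongruent = c(590, 560)
)

# Tidy version: one row per observation (participant x condition),
# with the condition and the measured value each in their own column
tidy <- pivot_longer(
  wide,
  cols      = c(congruent, incongruent),
  names_to  = "condition",
  values_to = "rt"
)

print(tidy)
```

In the tidy version, every row is a single measurement, which is the layout that most tidyverse functions expect.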

Now that we're all set up, let's dig into the data...

## Arranging

We can use the `arrange()` function to sort the rows of a dataframe by one or more variables. For example, we can arrange all characters in order of height (shortest to tallest) like so:

In [None]:
arrange(starwars, height)

In [None]:
arrange(starwars, height) |> print()
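As a further sketch of how `arrange()` can be used, the example below reverses the sort order with `desc()` and adds a second variable to break ties. The object names `tallest_first` and `by_height_then_mass` are chosen just for this example. Note that `arrange()` places missing values (`NA`) at the end, whichever direction you sort in.

```r
library(dplyr)

# Tallest to shortest: wrap the variable in desc()
tallest_first <- arrange(starwars, desc(height))

# Sort by height first, then use mass to order characters of equal height
by_height_then_mass <- arrange(starwars, height, mass)

# Show only a few columns so the result is easy to read
tallest_first |>
  select(name, height, mass) |>
  print()
```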