# Cleaning and Wrangling Data

## Important Packages

- ```dplyr```
    - Part of ```tidyverse```. If you load ```tidyverse```, this does not need to be loaded.
    - Provides: ```select```, ```filter```, ```mutate```, ```arrange```, ```summarize```, and ```group_by```
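- ```purrr```
    - Part of ```tidyverse``` metapackage.
    - Provides: ```map()``` and ```map_*``` functions.
- ```tidyr```
    - Part of ```tidyverse``` metapackage.
    - Provides: ```pivot_longer```, ```pivot_wider```, and ```separate```.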

## Tidy Data

<div style="display: flex; flex-direction: row; align-items:flex-start;">
    <div style="width: 500px;">
        <ul>
            <li>Each row is a single observation</li>
            <li>Each column is a single variable</li>
            <li>Each value has its own cell (i.e., its entry in the data frame is not shared with another value).</li>
        </ul>
    </div>
    <img src="media/tidy_data.png" style="width:200px; margin-left: 50px;"> 
</div>



Scenario 1:

If the columns are ```year1```, ```year2```, ```year3```, and ```year4```, the table violates the ROW rule: each row holds 4 observations (one per year) instead of one.

**USE PIVOT_LONGER TO FIX ROW RULE**

Scenario 2:

If a single ```key``` column holds the labels ```case``` and ```population``` (with the numbers stored in a separate value column), the table violates the COLUMN rule: 2 variables are stored in the same column.

**USE PIVOT_WIDER TO FIX COLUMN RULE**

## Tidying Up

### Wide to long using ```pivot_longer```

<div style="display: flex; flex-direction: row; align-items:flex-start;">
    <div style="width: 500px;">
        When the values of a variable are used as column names (for example, one column per city), the table is too wide and needs to be lengthened. To make a wide table into a longer one, we can apply the ```pivot_longer``` function from the 
        ```tidyr``` package (part of ```tidyverse```). It combines several columns into two: one holding the old column names and one holding their values.
    </div>
    <img src="media/pivot_longer.png" style="width:400px; margin-left: 50px;"> 
</div>

```r
pivot_longer(lang_wide,                   # Data we want to reshape
             cols = Toronto:Edmonton,     # Columns to combine
             names_to = 'region',         # New column that holds the old column names (the cities)
             values_to = 'mother_tongue'  # New column that holds the old cell values
)
```
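As a rough illustration, the sketch below runs the same call on a tiny made-up table. The data frame ```lang_wide_demo```, its two city columns, and all counts are invented for this example and are not taken from the real data set.

```r
library(tidyr)
library(tibble)

# Made-up wide table: one column per city (all counts are invented)
lang_wide_demo <- tribble(
  ~language, ~Toronto, ~Vancouver,
  "English",      100,         80,
  "French",        10,          5
)

lang_long_demo <- pivot_longer(lang_wide_demo,
                               cols = Toronto:Vancouver,    # Columns to combine
                               names_to = "region",         # Old column names go here
                               values_to = "mother_tongue") # Old cell values go here

# Each (language, city) pair is now its own row: 2 languages x 2 cities = 4 rows
print(lang_long_demo)
```

The number of columns shrinks from 3 to 3 here only because two city columns were replaced by two new ones; with more city columns the table becomes noticeably narrower and longer.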

### Long to wide using ```pivot_wider```

<div style="display: flex; flex-direction: row; align-items:flex-start;">
    <div style="width: 500px;">
        When a single observation is spread across multiple rows rather than gathered in a single row, we need to shift the data into a wider format. We can do this by applying the ```pivot_wider```
        function.
    </div>
    <img src="media/pivot_wider.png" style="width:500px; margin-left: 50px;"> 
</div>

```r
pivot_wider(lang_long,           # Data set we want to reshape
            names_from = type,   # Name of the column for the variable names
            values_from = count  # Name of column to take values
           )
```
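A minimal sketch of the same reshaping on invented data follows. The data frame ```lang_long_demo``` and its counts are hypothetical; only the column names ```type``` and ```count``` mirror the call above.

```r
library(tidyr)
library(tibble)

# Made-up long table: each observation (region + language) takes two rows
lang_long_demo <- tribble(
  ~region,   ~language, ~type,          ~count,
  "Toronto", "English", "most_at_home",     50,
  "Toronto", "English", "most_at_work",     60,
  "Toronto", "French",  "most_at_home",      5,
  "Toronto", "French",  "most_at_work",      3
)

lang_tidy_demo <- pivot_wider(lang_long_demo,
                              names_from = type,    # Values here become column names
                              values_from = count)  # Values here fill the new columns

# One row per (region, language), with most_at_home and most_at_work as columns
print(lang_tidy_demo)
```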

### Using ```separate``` to split cells on a delimiter

Data is also not tidy when multiple values are stored in the same cell. For instance, in the data below, the number of Canadians reporting their primary language at home and at work is combined in one column, separated by the delimiter (```/```).

<img src="media/separate.png" width="400px">

```r
separate(lang_messy_longer,                        # Data set
         col = value,                              # Name of a column we need to split
         into = c("most_at_home", "most_at_work"), # New column names we would split data into
         sep = "/",                                # Separator on which to split
         convert = TRUE                            # Converts variables to their type (e.g. from char to int)
        )
```
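The effect of ```convert = TRUE``` is easiest to see on a small example. The sketch below uses an invented data frame, ```lang_messy_demo```, with made-up counts; it is an illustration, not the real data.

```r
library(tidyr)
library(tibble)

# Made-up messy table: two counts share one cell, separated by "/"
lang_messy_demo <- tibble(
  language = c("English", "French"),
  value    = c("50/60", "5/3")
)

lang_split_demo <- separate(lang_messy_demo,
                            col = value,
                            into = c("most_at_home", "most_at_work"),
                            sep = "/",
                            convert = TRUE)  # Without this, both new columns stay character

# The new columns are integers because of convert = TRUE
str(lang_split_demo)
```

In recent versions of ```tidyr```, ```separate``` is marked as superseded by ```separate_wider_delim```, but it continues to work as shown.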

### Using ```select``` to extract a range of columns.

Ex. 1
```r
selected_columns <- select(tidy_lang, language, region, most_at_home, most_at_work)
```

Ex. 2
```r
column_range <- select(tidy_lang, language:most_at_work)
```

Ex. 3
```r
select(tidy_lang, starts_with("most"))
```

Ex. 4
```r
select(tidy_lang, contains("_"))
```

### Using ```filter``` to extract rows.

Ex. 1
```r
official_langs <- filter(tidy_lang, category == "Official languages")
```

Ex. 2
```r
filter(tidy_lang, category != "Official languages")
```

Ex. 3: "AND"
```r
filter(official_langs, region == "Montréal", language == "French")
filter(official_langs, region == "Montréal" & language == "French")
```

Ex. 4: "OR"
```r
filter(official_langs, region == "Calgary" | region == "Edmonton")
```

Ex. 5 "IN"
```r
five_cities <- filter(region_data, 
                      region %in% c("Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"))
```

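NOTE: ```==``` compares two vectors element by element (position by position), whereas ```%in%``` checks whether each element on the left appears anywhere in the vector on the right.

```c("Vancouver", "Toronto") == c("Toronto", "Vancouver")```
> FALSE, FALSE

```c("Vancouver", "Toronto") %in% c("Toronto", "Vancouver")```
> TRUE, TRUE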
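The difference matters most inside ```filter```, where the column is usually longer than the vector of wanted values. The sketch below uses a made-up ```region``` vector to show how ```==``` silently recycles the shorter vector and gives misleading results.

```r
# Made-up region column with four rows
region <- c("Vancouver", "Toronto", "Calgary", "Toronto")
wanted <- c("Toronto", "Vancouver")

# == recycles `wanted` and compares position by position:
# "Vancouver" vs "Toronto", "Toronto" vs "Vancouver",
# "Calgary" vs "Toronto",   "Toronto" vs "Vancouver"
by_position <- region == wanted
print(by_position)    # FALSE FALSE FALSE FALSE

# %in% asks, for each region, whether it appears anywhere in `wanted`
by_membership <- region %in% wanted
print(by_membership)  # TRUE TRUE FALSE TRUE
```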


Ex. 6
```r
filter(official_langs, most_at_home > 2669195)
```

### Using ```mutate``` to modify columns

Ex. 1 Change column data types
```r
official_langs_numeric <- mutate(official_langs_chr,
  most_at_home = as.numeric(most_at_home),                 # Convert the column from character to numeric (double)
  most_at_work = as.numeric(most_at_work)
)
```

Ex. 2 Create a new column based on previous columns
```r
english_langs <- mutate(english_langs, most_at_home_proportion = most_at_home / city_pops)
```

## Aggregating Data with ```summarize``` and ```map```

### Basic uses of ```summarize```

Part of data analysis is to calculate a summary value for the data (a summary statistic).

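NOTE: ```na.rm = TRUE``` is very important if there are NAs in the data. Without it, functions such as ```min```, ```max```, and ```mean``` return ```NA``` as soon as the column contains a single missing value.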

<img src="media/summarize_1.png" width="400px">

```r
summarize(region_lang,
          min_most_at_home = min(most_at_home, na.rm = TRUE),
          max_most_at_home = max(most_at_home, na.rm = TRUE))
```

<img src="media/summarize_2.png" width="400px">
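To make the role of ```na.rm = TRUE``` concrete, here is a small sketch on invented data. The data frame ```na_demo``` is hypothetical and contains one missing value on purpose.

```r
library(dplyr)
library(tibble)

# Made-up column with one missing value
na_demo <- tibble(most_at_home = c(10, NA, 30))

# Without na.rm = TRUE the missing value propagates into the summary
with_na <- summarize(na_demo, max_most_at_home = max(most_at_home))

# With na.rm = TRUE the missing value is ignored
without_na <- summarize(na_demo, max_most_at_home = max(most_at_home, na.rm = TRUE))

print(with_na)     # max_most_at_home is NA
print(without_na)  # max_most_at_home is 30
```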


### Summarize & group_by

Using ```summarize``` and ```group_by``` together can allow us to summarize values for subgroups within a data set.

<img src="media/summarize&group_by_1.png" width="400px">

In the graphic above, ```group_by``` followed by ```summarize``` creates a new data frame with one row for each group, containing the summary statistic for each column being summarized. It also includes a column listing the value of the grouping variable.

```r
group_by(region_lang, region) |>
summarize(min_most_at_home = min(most_at_home),
          max_most_at_home = max(most_at_home)
         )
```

Notably, ```group_by``` doesn't change how the data looks. Instead, it changes how other functions work with the data.

A very useful function when working with ```summarize``` and dealing with categorical variables is: ```n()```
```r
top_restaurants <- fast_food %>%
                   group_by(name) %>%
                   summarize(n=n())
```

Here, we group up the fast food joints by their name, and count how large each group is.
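The following sketch, on an invented ```fast_food_demo``` table (restaurant names and cities are made up), shows both points: grouping alone leaves the rows and columns untouched, and ```n()``` counts the rows in each group.

```r
library(dplyr)
library(tibble)

# Made-up table of restaurant locations
fast_food_demo <- tibble(
  name = c("Burger Hut", "Taco Stop", "Burger Hut", "Pizza Place", "Burger Hut"),
  city = c("A", "A", "B", "B", "C")
)

# Grouping only attaches metadata: still 5 rows and 2 columns
grouped <- group_by(fast_food_demo, name)
print(dim(grouped))
print(group_vars(grouped))  # "name"

# summarize() now works per group: one row per restaurant name
counts <- summarize(grouped, n = n())
print(counts)
```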

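***Notably***: ```summarize``` returns only the grouping columns and the new summary columns; all other columns are dropped. The ```.drop = FALSE``` argument of ```group_by``` does not change that. Instead, when the grouping column is a factor, it keeps groups for factor levels that have no rows, so they still appear in the summary (for example, with a count of 0):

```r
group_by(df, b, .drop = FALSE)   # df: a data frame (placeholder name); b: a factor column
```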
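A minimal sketch of what ```.drop = FALSE``` does in practice: when the grouping column is a factor, levels that have no rows are kept as (empty) groups. The ```orders_demo``` data frame and its ```size``` levels are invented for illustration.

```r
library(dplyr)
library(tibble)

# Made-up data: the factor has a "medium" level that never occurs
orders_demo <- tibble(
  size = factor(c("small", "small", "large"),
                levels = c("small", "medium", "large"))
)

# Default (.drop = TRUE): the empty "medium" group is dropped
default_counts <- orders_demo %>%
  group_by(size) %>%
  summarize(n = n())

# .drop = FALSE: the empty "medium" group is kept and counted as 0
kept_counts <- orders_demo %>%
  group_by(size, .drop = FALSE) %>%
  summarize(n = n())

print(default_counts)  # 2 rows
print(kept_counts)     # 3 rows
```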

### Calculating summary statistics on many columns

<img src="media/summary-stats_on_many_cols.png">

#### ```summarize``` and ```across``` to calculate summary stats on many columns.

```r
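region_lang %>% 
    # Extra arguments go inside a lambda; passing them through across()'s `...` is deprecated in dplyr >= 1.1.0
    summarize(across(mother_tongue:lang_known, ~ max(.x, na.rm = TRUE)))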
```
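As a small worked example, the sketch below applies the same pattern to an invented data frame, ```across_demo```. Only the column names ```mother_tongue``` and ```lang_known``` are borrowed from the notes; the values are made up.

```r
library(dplyr)
library(tibble)

# Made-up data with a missing value in each numeric column
across_demo <- tibble(
  region        = c("A", "B", "C"),
  mother_tongue = c(1, NA, 3),
  lang_known    = c(10, 20, NA)
)

# Column-wise maximum over the selected range, ignoring NAs
maxes <- across_demo %>%
  summarize(across(mother_tongue:lang_known, ~ max(.x, na.rm = TRUE)))

print(maxes)  # one row: mother_tongue = 3, lang_known = 20
```

Columns outside the selected range (here ```region```) do not appear in the result.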

#### ```map``` for calculating summary statistics on many columns.

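- ```map(dataframe, function_name, na.rm = TRUE)```
    - Takes two main arguments: an object (vector, data frame, or list) and the function to apply to each element (each column, for a data frame).
    - Additional arguments such as ```na.rm = TRUE``` are passed on to the function. If ```na.rm = FALSE``` (the default for functions like ```max```), NA values are not ignored.
    - Returns an object of type list.
- ```map_df(dataframe, function_name, na.rm = TRUE)```
    - Works the same way, but returns a data frame instead of a list.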

Notably, ```map``` does not have an argument to specify which columns to apply the function to. Therefore, we will use ```select``` before calling ```map``` to choose the columns for which we want the maximum.

```r
region_lang %>%
    select(mother_tongue:lang_known) %>%
    map_df(max, na.rm = TRUE)
```

Common choices for ```function_name```:

- max
- min
- mean
- sum
- diff

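We need to use the appropriate ```map_*``` function to get the output type we want: ```map``` returns a list, ```map_dbl``` a double vector, ```map_chr``` a character vector, and ```map_df``` a data frame.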

<img src="media/map1.png" width="600px">
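The sketch below compares three of these variants on the same invented input. The data frame ```map_demo``` and its values are hypothetical; note that ```map_df``` additionally needs ```dplyr``` to be installed.

```r
library(purrr)
library(tibble)

# Made-up data with a missing value in each column
map_demo <- tibble(
  mother_tongue = c(1, NA, 3),
  lang_known    = c(10, 20, NA)
)

as_list <- map(map_demo, max, na.rm = TRUE)      # named list, one element per column
as_dbl  <- map_dbl(map_demo, max, na.rm = TRUE)  # named double vector
as_df   <- map_df(map_demo, max, na.rm = TRUE)   # one-row data frame (requires dplyr)

print(as_list)
print(as_dbl)
print(as_df)
```

All three hold the same two maxima; only the container differs, which decides what can be done with the result next (for example, piping a data frame into further ```dplyr``` verbs).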

## ```mutate``` and ```across```

### Apply functions across many columns with ```mutate``` and ```across```

<img src="media/mutate_across.png" width="400px;">

For example, suppose we want to convert all the numeric columns in the ```region_lang``` data frame from double type to integer type using the ```as.integer``` function.

<img src="media/mutate_across1.png" width="300px;">

We can use ```mutate``` paired with ```across```:

```r
region_lang %>%
    mutate(across(mother_tongue:lang_known, as.integer))   # Pass the function itself, without parentheses
```

<img src="media/mutate_across2.png" width="300px">
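A minimal sketch of the conversion on invented data follows. The data frame ```convert_demo``` is hypothetical; the second call uses ```where(is.numeric)``` as one possible way to target every numeric column without naming a range.

```r
library(dplyr)
library(tibble)

# Made-up data: one character column and two double columns
convert_demo <- tibble(
  region        = c("A", "B"),
  mother_tongue = c(1, 2),
  lang_known    = c(10, 20)
)

# Convert a named range of columns
by_range <- convert_demo %>%
  mutate(across(mother_tongue:lang_known, as.integer))

# Alternative: convert every numeric column, whatever it is called
by_type <- convert_demo %>%
  mutate(across(where(is.numeric), as.integer))

str(by_range)  # region stays character; the other two are now integer
```

Unlike ```summarize```, ```mutate``` keeps all rows and all columns; only the selected columns are modified.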


### Apply functions across many columns within one row with ```rowwise``` and ```mutate```

<img src="media/rowwise-mutate1.png" width="300px;">

First, we need to select the columns we are going to be using.

<img src="media/rowwise-mutate2.png" width="400px">

Now we apply ```rowwise``` before ```mutate```, to tell R that we would like the mutate function to be applied across, and within, a row, instead of applying it to a column.

```r
region_lang %>%
    select(mother_tongue:lang_known) %>%
    rowwise() %>%
    mutate(maximum = max(c(mother_tongue,
                           most_at_home,
                           most_at_work,
                           lang_known)))
```

<img src="media/rowwise-mutate3.png" width="400px">
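To see why ```rowwise``` is needed, the sketch below runs the same ```mutate``` with and without it on an invented two-row data frame, ```rowwise_demo```. The values are made up for illustration.

```r
library(dplyr)
library(tibble)

# Made-up data: two rows, two numeric columns
rowwise_demo <- tibble(
  most_at_home = c(1, 50),
  most_at_work = c(9, 20)
)

# Without rowwise(): max() sees both whole columns, so every row gets the overall maximum
without_rowwise <- rowwise_demo %>%
  mutate(maximum = max(c(most_at_home, most_at_work)))

# With rowwise(): max() is evaluated once per row, using only that row's values
with_rowwise <- rowwise_demo %>%
  rowwise() %>%
  mutate(maximum = max(c(most_at_home, most_at_work))) %>%
  ungroup()  # Drop the row-wise grouping so later steps behave normally

print(without_rowwise$maximum)  # 50 50
print(with_rowwise$maximum)     # 9 50
```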

Notably, ```rowwise``` doesn't appear to do anything when it's called by itself. However, we can apply ```rowwise``` in combination with other functions to change how these other functions operate on the data.

## Simple functions

- ```top_n(dataframe, n, col_name)```
    - Keeps the rows with the top ```n``` values of the selected column. In current ```dplyr``` it is superseded by ```slice_max```.
- ```semi_join(df1, df2)```
    - Keeps the rows of ```df1``` that have a match in ```df2```.
    - Only the columns of ```df1``` are returned; ```df2``` is used for matching and contributes no columns.
    - Matching is done on the columns the two data frames have in common (or on the columns given with ```by```).
    
<img src ="media/joins.png" width="400px">

- ```ifelse(condition, true_value, false_value)```
    - E.g. ```mutate(islands_top12, is_continent = ifelse(landmass %in% continents, "Continent", "Other"))```
- ```pull(dataframe, column_name)```
    - Extracts a single column of a data frame as a vector.
- ```factor(col_name, levels = c(..., ..., ...))```
    - This is used to encode a vector as a factor; it allows you to specify the values, and whether they are ordered or not.
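- ```as.factor()``` (```as_factor```)
    - Simply coerces an existing vector to a factor, if possible. ```as.factor``` is base R and orders the levels of a character vector alphabetically; ```as_factor``` comes from ```forcats``` and orders them by first appearance.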
- ```slice_max(data, order_by = ..., n = ...)```
    - ```data```: the data frame to take rows from.
    - ```order_by```: the column (or expression) to order by; rows with the largest values come first.
    - ```n```: the number of rows to keep.
    - Returns a new data frame containing only the top ```n``` rows, ordered by the chosen column.
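
The sketch below exercises a few of these helpers on invented data. The data frames ```scores_demo``` and ```keep_demo``` and all their values are hypothetical.

```r
library(dplyr)
library(tibble)

# Made-up data frames sharing an `id` column
scores_demo <- tibble(id = c(1, 2, 3, 4), score = c(10, 40, 30, 20))
keep_demo   <- tibble(id = c(2, 4, 5), note = c("x", "y", "z"))

# semi_join: rows of scores_demo whose id appears in keep_demo; no columns from keep_demo
matched <- semi_join(scores_demo, keep_demo, by = "id")

# slice_max: the 2 rows with the largest score, largest first
top_two <- slice_max(scores_demo, order_by = score, n = 2)

# pull: one column as a plain vector
score_vec <- pull(scores_demo, score)

# factor: encode a character vector with an explicit level order
level_fct <- factor(c("low", "high", "low"), levels = c("low", "high"))

# ifelse: vectorised choice between two values
label <- ifelse(score_vec > 25, "High", "Low")

print(matched)
print(top_two)
print(levels(level_fct))
print(label)
```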
   

## Summary

<img src="media/wrangling_func_summary.png" width="500px">