
# Part 1 - Introduction to the `tidyverse` in `R`

## Why use R?


- Save and rerun code
- Several data science/statistics packages available
- Great graphics
- Built for data
- Free and open-source
- Large user community

### Market share

![](https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/Fig-1a-IndeedJobs-2017.png?raw=1)

## What is the `tidyverse`?

The tidyverse is a collection of `R` packages designed for data science. They share an underlying design philosophy, grammar, and data structures. We will focus on two packages for managing data, both built around a data-verb syntax:

* `dplyr` (`select`, `filter`, `mutate`, `group_by`, `summarize`)
* `tidyr` (stack and unstack with `gather` and `spread`; newer versions of `tidyr` recommend `pivot_longer` and `pivot_wider` for the same tasks)
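As a rough preview of the five `dplyr` verbs listed above, here is a minimal sketch on a small made-up data frame. The data frame `animals` and every intermediate name are hypothetical and are not part of the course data; each step is saved to its own variable so that one verb appears per line.

```r
library(dplyr)

# Hypothetical toy data (not the course dataset); weight is in grams
animals <- data.frame(
  species = c("A", "A", "B", "B"),
  year    = c(1995, 1996, 1995, 1995),
  weight  = c(40, 44, 200, 220)
)

kept    <- select(animals, species, year, weight)         # choose columns
in_1995 <- filter(kept, year == 1995)                     # choose rows
with_kg <- mutate(in_1995, weight_kg = weight / 1000)     # add a column
grouped <- group_by(with_kg, species)                     # define groups
result  <- summarize(grouped, mean_weight_kg = mean(weight_kg))  # one row per group

print(result)
```

Each verb takes a data frame as its first argument and returns a new data frame, which is the property the rest of this notebook relies on.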


In future data science courses, you will likely use `ggplot2` to create nice graphics.

# Introduction to the `dplyr` package in `R`

## Loading a Library

In [1]:
# This loads all of the dplyr functions
# You must do this every time you start a new R session

library("dplyr")


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
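The startup message means that, once `dplyr` is loaded, plain `filter` and `lag` refer to the `dplyr` versions rather than the ones in `stats`. A minimal sketch of how either version can still be called explicitly with the `::` operator (the toy data frame `df` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(x = 1:5)

# After library(dplyr), plain filter() is dplyr's row filter
rows <- filter(df, x > 3)

# The same call with the package named explicitly
rows_explicit <- dplyr::filter(df, x > 3)

# The masked stats version is still reachable; here it computes
# a centered 3-point moving average of the numbers 1 to 5
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

print(rows)
print(smoothed)
```

Masking is not an error, so the message can usually be ignored; the `package::function` form is only needed when you want the masked version.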




## Reading in data

In [2]:
surveys <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/portal_data_joined.csv')

# Good habit: Always inspect the result with head
head(surveys,n=10)

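| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32 | NA | Neotoma | albigula | Rodent | Control |
| 2 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31 | NA | Neotoma | albigula | Rodent | Control |
| 3 | 224 | 9 | 13 | 1977 | 2 | NL | | NA | NA | Neotoma | albigula | Rodent | Control |
| 4 | 266 | 10 | 16 | 1977 | 2 | NL | | NA | NA | Neotoma | albigula | Rodent | Control |
| 5 | 349 | 11 | 12 | 1977 | 2 | NL | | NA | NA | Neotoma | albigula | Rodent | Control |
| 6 | 363 | 11 | 12 | 1977 | 2 | NL | | NA | NA | Neotoma | albigula | Rodent | Control |
| 7 | 435 | 12 | 10 | 1977 | 2 | NL | | NA | NA | Neotoma | albigula | Rodent | Control |
| 8 | 506 | 1 | 8 | 1978 | 2 | NL | | NA | NA | Neotoma | albigula | Rodent | Control |
| 9 | 588 | 2 | 18 | 1978 | 2 | NL | M | NA | 218 | Neotoma | albigula | Rodent | Control |
| 10 | 661 | 3 | 11 | 1978 | 2 | NL | | NA | NA | Neotoma | albigula | Rodent | Control |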


## Selecting columns with `select`

In [None]:
# Syntax: select(df, col1, col2, ...)

new_df <- select(surveys, plot_id, species_id, weight)
head(new_df)

## Filtering rows with `filter`

In [None]:
new_df2 <- filter(surveys, year == 1995)
head(new_df2)

*Question: Why are the columns not selected up above still appearing here?*

## Creating a new column with `mutate`

In [None]:
new_df <- select(surveys, plot_id, species_id, weight, year)
new_df2 <- filter(new_df, year == 1995)
new_df3 <- mutate(new_df2, weight_kg = weight / 1000)
head(new_df3)

In [None]:
# To drop the old weight column:

new_df4 <- select(new_df3, -weight)
head(new_df4)

## Motivating pipes

The pipe, `%>%`, is a powerful tool for clearly expressing a sequence of multiple operations. Before we explore using the pipe with `dplyr` functions, let's look at some alternatives.

### Alternative #1: Imperative coding pattern - save, save, save!


<img width="450" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/imperative_pattern.png?raw=1">

This works, but it's not the best approach.
- **Problem 1:** Creates lots of temporary variables 
- **Problem 2:** Messy and lots of overhead

All the extra *stuff* clouds the meaning/intent of the code!

### Alternative #2 - Rewrite to the same data frame

Instead of creating new objects at each step, we could just overwrite the original:

```{R}
surveys <- select(surveys, plot_id, species_id, weight, year)
surveys <- filter(surveys, year == 1995)
surveys <- mutate(surveys, weight_kg = weight / 1000)
```

**Problem:** This approach obscures what's changing on each line.
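A related practical drawback, sketched here as an illustration rather than something stated in the lecture: once the original object is overwritten, any column that was dropped is gone, so a later step that needs it fails until the data is reloaded. The toy data frame `toy` is hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(id = 1:3, sex = c("M", "F", "M"), weight = c(40, 200, 48))

# Overwrite the original: the sex column is now gone from toy itself
toy <- select(toy, id, weight)

# A later step that needs sex can no longer run; tryCatch() captures
# the error so that this sketch keeps running
outcome <- tryCatch(
  filter(toy, sex == "M"),
  error = function(e) "error: the sex column no longer exists"
)

print(names(toy))
print(outcome)
```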



### Alternative #3 - Functional coding approach

This approach just strings the function calls together:

In [None]:
# Nested so that the steps run in the order select -> filter -> mutate
surveys2 <-
  mutate(
    filter(
      select(surveys, plot_id, species_id, weight, year),
      year == 1995),
    weight_kg = weight / 1000)
head(surveys2)

**Problem:** We have to read from inside-out and from right to left. This is difficult to understand!

### The fix: use a pipe for cleaner code

The pipe helps us write code that is easier to read and understand. It passes the data frame on its left into the first argument position of the function on its right:

<img width="350" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/pipe1.png?raw=1">

Imagine an invisible data frame in the first spot... but don't write it!

<img width="350" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/pipe2.png?raw=1">

Note this important point: each step in a pipe returns a NEW data frame, and the original is left unchanged.
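Both ideas can be checked directly. The following is a minimal sketch on a hypothetical toy data frame named `toy`: the piped call and the ordinary call give identical results, and the original data frame still has all of its rows afterwards.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(id = 1:4, year = c(1995, 1996, 1995, 1997))

# The pipe places toy in the first argument position of filter()
piped  <- toy %>% filter(year == 1995)
direct <- filter(toy, year == 1995)

# TRUE: the two calls are the same operation written two ways
print(identical(piped, direct))

# toy itself is unchanged; the filtered rows live in a new data frame
print(nrow(toy))
print(nrow(piped))
```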


### The code with pipes - much cleaner!
The code shown below uses the pipe with `dplyr` functions. The advantage is that we are now focusing on the data verbs!

In [None]:
surveys  %>% 
  select(plot_id, species_id, weight, year) %>%
  filter(year == 1995) %>%
  mutate(weight_kg = weight / 1000) %>%
  head()

### My preferred code format

In [None]:
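# The outer parentheses are required: without them, R would treat `surveys`
# as a complete statement and the leading %>% on the next line would be a syntax error.
(surveys
 %>% select(plot_id, species_id, weight, year)
 %>% filter(year == 1995)
 %>% mutate(weight_kg = weight / 1000)
 %>% head()
)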

## <font color="red"> Exercise 1 </font>

Write code using `dplyr` with pipes to perform the following tasks.

1. Compute the weight of each animal in pounds (lbs). The `weight` column is in grams.
2. Keep only the rows whose weight (in lbs) is greater than 0.2 lbs.

In [4]:
# Your code for task 1 here (1 g is about 0.002205 lbs)
surveys %>% mutate(weight_lbs = weight * 0.002205) %>% head()


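| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | weight_lbs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` |
| 1 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32 | NA | Neotoma | albigula | Rodent | Control | NA |
| 2 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31 | NA | Neotoma | albigula | Rodent | Control | NA |
| 3 | 224 | 9 | 13 | 1977 | 2 | NL | | NA | NA | Neotoma | albigula | Rodent | Control | NA |
| 4 | 266 | 10 | 16 | 1977 | 2 | NL | | NA | NA | Neotoma | albigula | Rodent | Control | NA |
| 5 | 349 | 11 | 12 | 1977 | 2 | NL | | NA | NA | Neotoma | albigula | Rodent | Control | NA |
| 6 | 363 | 11 | 12 | 1977 | 2 | NL | | NA | NA | Neotoma | albigula | Rodent | Control | NA |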


In [5]:
# Your code for task 2 here

surveys %>%
  mutate(weight_lbs = weight * 0.002205) %>%
  filter(weight_lbs > 0.2) %>%
  head(n = 10)


| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | weight_lbs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` |
| 1 | 588 | 2 | 18 | 1978 | 2 | NL | M | NA | 218 | Neotoma | albigula | Rodent | Control | 0.48069 |
| 2 | 845 | 5 | 6 | 1978 | 2 | NL | M | 32 | 204 | Neotoma | albigula | Rodent | Control | 0.44982 |
| 3 | 990 | 6 | 9 | 1978 | 2 | NL | M | NA | 200 | Neotoma | albigula | Rodent | Control | 0.441 |
| 4 | 1164 | 8 | 5 | 1978 | 2 | NL | M | 34 | 199 | Neotoma | albigula | Rodent | Control | 0.438795 |
| 5 | 1261 | 9 | 4 | 1978 | 2 | NL | M | 32 | 197 | Neotoma | albigula | Rodent | Control | 0.434385 |
| 6 | 1453 | 11 | 5 | 1978 | 2 | NL | M | NA | 218 | Neotoma | albigula | Rodent | Control | 0.48069 |
| 7 | 1756 | 4 | 29 | 1979 | 2 | NL | M | 33 | 166 | Neotoma | albigula | Rodent | Control | 0.36603 |
| 8 | 1818 | 5 | 30 | 1979 | 2 | NL | M | 32 | 184 | Neotoma | albigula | Rodent | Control | 0.40572 |
| 9 | 1882 | 7 | 4 | 1979 | 2 | NL | M | 32 | 206 | Neotoma | albigula | Rodent | Control | 0.45423 |
| 10 | 2133 | 10 | 25 | 1979 | 2 | NL | F | 33 | 274 | Neotoma | albigula | Rodent | Control | 0.60417 |


# Part 2 - Converting code and types of errors

## You've seen piping before...
 
<img width="850" src="https://github.com/thooks630/DSCI_210_R_notebooks/raw/main/img/openrefine_piping.PNG">

## Saving the result of a piped operation

In [None]:
surveys_small <- 
(surveys 
  %>% filter(weight < 5) 
  %>% select(species_id, sex, weight)
)

head(surveys_small)

## A recap - the advantages of piping

* Reads left-to-right
* Reads top-to-bottom
* Focuses on verbs
* Removes pointless nouns

## Comparing three different coding approaches

* Imperative
* Functional
* Piping

### Imperative:

In [None]:
x <- pi
r_x <- round(x, 2)
c_x <- as.character(r_x)
c_x

### Functional:

In [None]:
as.character(round(pi,2))

### Piping:

In [None]:
pi %>%
  round(2) %>%
  as.character

## Example 1 - converting to pipes

In [None]:
surveys_small <- filter(surveys, weight < 5) 
survey_small_id_sex_wgt <- select(surveys_small, species_id, sex, weight)
head(survey_small_id_sex_wgt)

In [None]:
# Convert to piped code
surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight) %>%
  head()

## Example 2 - converting to imperative approach

In [None]:
surveys_small <- surveys %>%
  filter(species_id == 'NL') %>%
  select(species_id, sex, weight)

head(surveys_small)

In [None]:
# Convert to imperative
surveys_small <- filter(surveys, species_id == "NL")
surveys_small2 <- select(surveys_small, species_id, sex, weight)
head(surveys_small2)



## Example 3 - converting to functional approach

In [None]:
surveys_small <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

head(surveys_small)

In [None]:
# Convert to functional
head(select(filter(surveys, weight < 5), species_id, sex, weight))

## <font color="red"> Exercise 2 </font>

Perform each of the following code conversions.

In [6]:
sales <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/auto_sales.csv')
head(sales)

| | Salesperson | Compact | Sedan | SUV | Truck |
|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | Ann | 22 | 18 | 15 | 12 |
| 2 | Bob | 19 | 12 | 17 | 20 |
| 3 | Yolanda | 19 | 8 | 32 | 15 |
| 4 | Xerxes | 12 | 23 | 18 | 9 |


#### <font color="red">TASK 1</font>. Convert the following *piped code* to the *imperative style*

In [None]:
sales %>%
    select(Salesperson, Compact, Sedan) %>%
    mutate(Car = Compact + Sedan) 

In [8]:
# Your code here (using imperative approach)
sales2 <- select(sales, Salesperson, Compact, Sedan)
sales3 <- mutate(sales2, Car = Compact + Sedan)
head(sales3)

| | Salesperson | Compact | Sedan | Car |
|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<int>` |
| 1 | Ann | 22 | 18 | 40 |
| 2 | Bob | 19 | 12 | 31 |
| 3 | Yolanda | 19 | 8 | 27 |
| 4 | Xerxes | 12 | 23 | 35 |


#### <font color="red">TASK 2</font>. Convert the following *imperative code* to the *piped style*

In [13]:
df2 <- mutate(sales, Car = Compact + Sedan)
df3 <- mutate(df2, Utility = SUV + Truck)
df4 <- select(df3, Salesperson, Car, Utility)
head(df4)

| | Salesperson | Car | Utility |
|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` |
| 1 | Ann | 40 | 27 |
| 2 | Bob | 31 | 37 |
| 3 | Yolanda | 27 | 47 |
| 4 | Xerxes | 35 | 27 |


In [16]:
# Your code here
sales %>%
  mutate(Car = Compact + Sedan) %>%
  mutate(Utility = SUV + Truck) %>%
  select(Salesperson, Car, Utility)


| Salesperson | Car | Utility |
|---|---|---|
| `<chr>` | `<int>` | `<int>` |
| Ann | 40 | 27 |
| Bob | 31 | 37 |
| Yolanda | 27 | 47 |
| Xerxes | 35 | 27 |


## Types of programming errors

* Name errors
* Syntax errors
* Semantic errors (hardest/worst)
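To make the three categories concrete, here is a minimal sketch that triggers one error of each kind on a hypothetical toy data frame named `scores`. The `tryCatch()` wrappers and the use of `parse()` are only there so the sketch keeps running after each failure; they are not part of the lecture code.

```r
library(dplyr)

# Hypothetical toy data
scores <- data.frame(name = c("Ann", "Bob"), score = c(18, 12))

# Name error: R is case-sensitive, so the column Score does not exist
name_err <- tryCatch(
  select(scores, Score),
  error = function(e) "name error"
)

# Syntax error: the closing parenthesis is missing, so R cannot even parse
# the code. parse() is given the code as text so this script itself stays valid.
syntax_err <- tryCatch(
  parse(text = "mutate(scores, double = score * 2"),
  error = function(e) "syntax error"
)

# Semantic error: the code runs without any message, but the column is
# named mean_score while the code actually computes a sum
semantic <- summarize(scores, mean_score = sum(score))

print(name_err)
print(syntax_err)
print(semantic)
```

The first two kinds stop the code with a message, which makes them easy to notice. The semantic error produces a plausible-looking answer with no warning, which is why it is the hardest to catch.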

### Name Errors - Using the wrong name

In [15]:
sales <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/auto_sales.csv')
head(sales)

| | Salesperson | Compact | Sedan | SUV | Truck |
|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | Ann | 22 | 18 | 15 | 12 |
| 2 | Bob | 19 | 12 | 17 | 20 |
| 3 | Yolanda | 19 | 8 | 32 | 15 |
| 4 | Xerxes | 12 | 23 | 18 | 9 |


In [None]:
# Find the name errors (as written, this cell has none: both column names match `sales` exactly)
sales %>%
  select(Salesperson, Sedan)

### Syntax errors - Incorrect syntax

In [None]:
head(sales)

In [None]:
# Find the syntax errors (as written, this cell parses and runs without error)
sales %>%
  mutate(monthly_sedan = Sedan/3,
         monthly_suv = SUV/3,
         monthly_truck = Truck/3)

### Semantic Errors - Correct code, wrong meaning

In [None]:
# Find the semantic errors
sales %>%
  group_by(Salesperson) %>%
  mutate(avg_sedan = median(Truck))

## <font color="red"> Exercise 3 </font>

Identify all of the errors in the following code and classify each as either a name, syntax, or semantic error.

In [None]:
sales %>%
    mutate(Car = Compact + Sedan) %>%
    mutate(Utility = SUV * Truck) 

> Your answer here