In [7]:
library(dplyr)

In [5]:
# Install dplyr once if it is not already available (uncomment to run)
# install.packages("dplyr")

## Tibbles
The main data structure in the tidyverse is the tibble. Tibbles are like data frames: a collection of rows and named columns. Tibbles and data frames are largely interchangeable:

* A data frame can be converted into a tibble using the as_tibble() function
* A tibble can be converted into a data frame using the as.data.frame() function
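
As a minimal sketch of the two conversions listed above, the cell below round-trips a small data frame. The data frame `df` and its columns are made up for illustration; `as_tibble()` is available here because dplyr re-exports it from the tibble package.

```r
library(dplyr)  # re-exports tibble() and as_tibble() from the tibble package

# A plain data frame (hypothetical example data)
df <- data.frame(id = 1:3, score = c(90, 85, 77))

# Data frame -> tibble
tb <- as_tibble(df)
class(tb)   # "tbl_df" "tbl" "data.frame"

# Tibble -> data frame
df2 <- as.data.frame(tb)
class(df2)  # "data.frame"
```

Note that a tibble still carries the `data.frame` class, which is why most functions written for data frames also accept tibbles.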

## The dplyr package
We can think of dplyr as a grammar of data manipulation. The dplyr package is part of the tidyverse and lets us handle data using five main functions: mutate(), select(), filter(), summarise(), and arrange().

The examples in the following sections use a small tibble called tib to explore how these functions work.

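## Mutate
The mutate() function adds new columns to a tibble (or overwrites an existing column when the same name is reused).
* A helpful function to use with mutate() is case_when(), which acts like a vectorized chain of if / else-if statements: each row gets the value from the first condition it satisfies.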

In [11]:
# Create the tibble
tib <- tibble(
  names = c("Michael", "Jae-jin", "Molly", "Harvey", "Mia", "Kylee", "Dhee", "Peter", "Cate", "Lauren"),
  test1 = c(87, 92, 73, 85, 95, 86, 89, 80, 75, 91),
  test2 = c(85, 87, 82, 90, 92, 90, 82, 84, 86, 88),
  quiz1 = c(10, 9, 10, 8, 10, 7, 9, 8, 9, 8),
  quiz2 = c(8, 9, 10, 8, 8, 8, 9, 10, 9, 9)
)

tib

names,test1,test2,quiz1,quiz2
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Michael,87,85,10,8
Jae-jin,92,87,9,9
Molly,73,82,10,10
Harvey,85,90,8,8
Mia,95,92,10,8
Kylee,86,90,7,8
Dhee,89,82,9,9
Peter,80,84,8,10
Cate,75,86,9,9
Lauren,91,88,8,9


In [12]:
mutate(tib, 
    # create age column
    age = c(20, 21, 22, 19, 20, 22, 19, 18, 21, 20),
    # create test1_letter column
    test1_letter = case_when(
        # assign letter based on test1 score
        test1 < 80 ~ "C",
        test1 >= 80 & test1 < 90 ~ "B",
        test1 >= 90 ~ "A"))

names,test1,test2,quiz1,quiz2,age,test1_letter
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
Michael,87,85,10,8,20,B
Jae-jin,92,87,9,9,21,A
Molly,73,82,10,10,22,C
Harvey,85,90,8,8,19,B
Mia,95,92,10,8,20,A
Kylee,86,90,7,8,22,B
Dhee,89,82,9,9,19,B
Peter,80,84,8,10,18,B
Cate,75,86,9,9,21,C
Lauren,91,88,8,9,20,A
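
In the example above the three conditions cover every possible score. When no condition matches a row, case_when() returns NA for that row, so a catch-all branch is often useful. The cell below is a sketch on a hypothetical tibble `scores` (not part of the original data); it also relies on conditions being checked in order, so the `& test1 < 90` part can be dropped.

```r
library(dplyr)

# Hypothetical scores, including one missing value
scores <- tibble(
  name  = c("A", "B", "C", "D"),
  test1 = c(95, 84, 61, NA)
)

graded <- scores %>%
  mutate(
    test1_letter = case_when(
      test1 >= 90 ~ "A",        # checked first
      test1 >= 80 ~ "B",        # only reached when the first test failed
      test1 <  80 ~ "C",
      TRUE        ~ "missing"   # fallback when nothing above is TRUE
    )
  )

graded
```

The missing score falls through to the `TRUE` branch because a comparison with NA is never TRUE.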


## Select
The select() function keeps only the columns you name. Prefixing a column name with a minus sign (-) drops that column instead, which makes it easy to remove columns from a tibble.

In [13]:
# select only quiz1 and quiz2 columns from tib
select(tib, quiz1, quiz2)

quiz1,quiz2
<dbl>,<dbl>
10,8
9,9
10,10
8,8
10,8
7,8
9,9
8,10
9,9
8,9


In [14]:
# select everything except names and quiz1 from tib
select(tib, -names, -quiz1)

test1,test2,quiz2
<dbl>,<dbl>,<dbl>
87,85,8
92,87,9
73,82,10
85,90,8
95,92,8
86,90,8
89,82,9
80,84,10
75,86,9
91,88,9
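
Beyond listing columns one by one, select() also accepts helper functions that match column names. The sketch below uses `starts_with()` and `everything()` on `tib_small`, a two-row copy of the data recreated here only so the cell runs on its own.

```r
library(dplyr)

# Two-row copy of the example data (recreated so this cell is self-contained)
tib_small <- tibble(
  names = c("Michael", "Jae-jin"),
  test1 = c(87, 92),
  test2 = c(85, 87),
  quiz1 = c(10, 9),
  quiz2 = c(8, 9)
)

# Keep every column whose name starts with "quiz"
quizzes <- select(tib_small, starts_with("quiz"))
names(quizzes)    # "quiz1" "quiz2"

# Move names and quiz1 to the front, then keep all remaining columns
reordered <- select(tib_small, names, quiz1, everything())
names(reordered)  # "names" "quiz1" "test1" "test2" "quiz2"
```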


## Filter
The filter() function keeps only the rows that satisfy a condition on the values in one or more columns. For example, to keep only students who scored over 80 on test 1:

In [16]:
# filter tib to only rows where test1 > 80
filter(tib, test1 > 80)

names,test1,test2,quiz1,quiz2
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Michael,87,85,10,8
Jae-jin,92,87,9,9
Harvey,85,90,8,8
Mia,95,92,10,8
Kylee,86,90,7,8
Dhee,89,82,9,9
Lauren,91,88,8,9
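
filter() can also combine several conditions. As a sketch, conditions separated by commas must all hold (logical AND), while `|` expresses OR. `tib_small` is a four-row copy of the data, recreated here only so the cell runs on its own.

```r
library(dplyr)

# First four rows of the example data (recreated so this cell is self-contained)
tib_small <- tibble(
  names = c("Michael", "Jae-jin", "Molly", "Harvey"),
  test1 = c(87, 92, 73, 85),
  test2 = c(85, 87, 82, 90),
  quiz1 = c(10, 9, 10, 8),
  quiz2 = c(8, 9, 10, 8)
)

# AND: scored over 80 on test1 and at least 9 on quiz1
both <- filter(tib_small, test1 > 80, quiz1 >= 9)
both$names    # "Michael" "Jae-jin"

# OR: scored over 90 on test1 or at least 90 on test2
either <- filter(tib_small, test1 > 90 | test2 >= 90)
either$names  # "Jae-jin" "Harvey"
```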


## Summarise
The summarise() function reduces one or more variables to a summary value. Common summary values are mean, median, standard deviation, minimum, maximum, etc.

In [20]:
# summarise tib into a one-row tibble holding the mean and
# the standard deviation of quiz1
summarise(tib, quiz1_avg = mean(quiz1), quiz1_sd = sd(quiz1))

quiz1_avg,quiz1_sd
<dbl>,<dbl>
8.8,1.032796
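
The other summary values mentioned above work the same way. The sketch below computes the row count, minimum, median, and maximum of quiz1 in one call, using `tib_small`, a four-row copy of the data recreated here only so the cell runs on its own.

```r
library(dplyr)

# quiz1 scores for the first four students (recreated so this cell is self-contained)
tib_small <- tibble(
  names = c("Michael", "Jae-jin", "Molly", "Harvey"),
  quiz1 = c(10, 9, 10, 8)
)

quiz1_stats <- summarise(
  tib_small,
  n            = n(),            # number of rows
  quiz1_min    = min(quiz1),
  quiz1_median = median(quiz1),
  quiz1_max    = max(quiz1)
)

quiz1_stats
```

Each named argument becomes one column of the one-row result.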


## Arrange
The arrange() function reorders the rows of a tibble according to the values in a selected column. The default is ascending order (smallest to largest value); wrap the column in desc(column_name) to sort in descending order instead.

In [24]:
# arrange tib from highest to lowest scores of test1
arrange(tib, desc(test1))

names,test1,test2,quiz1,quiz2
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Mia,95,92,10,8
Jae-jin,92,87,9,9
Lauren,91,88,8,9
Dhee,89,82,9,9
Michael,87,85,10,8
Kylee,86,90,7,8
Harvey,85,90,8,8
Peter,80,84,8,10
Cate,75,86,9,9
Molly,73,82,10,10
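
arrange() also accepts more than one column, with later columns breaking ties in earlier ones. The sketch below sorts by quiz2 (ascending) and then by test1 (descending) within equal quiz2 scores, using `tib_small`, a four-row copy of the data recreated here only so the cell runs on its own.

```r
library(dplyr)

# First four rows of the example data (recreated so this cell is self-contained)
tib_small <- tibble(
  names = c("Michael", "Jae-jin", "Molly", "Harvey"),
  test1 = c(87, 92, 73, 85),
  quiz2 = c(8, 9, 10, 8)
)

# Sort by quiz2 ascending; among equal quiz2 scores, highest test1 first
sorted <- arrange(tib_small, quiz2, desc(test1))
sorted$names  # "Michael" "Harvey" "Jae-jin" "Molly"
```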


## Piping
The tidyverse uses the pipe operator %>% to string multiple functions together. The pipe operator takes the output of one function and passes it as the first argument to the next function. This allows us to build more intricate operations without having to save intermediate steps.

For example, if we wanted to get the average scores for tests 1 and 2 and then get the letter value for both test averages, we could pipe a summarise() operation and a mutate() operation. Then if we wanted to put the columns together in a specific order (test 1 average, test 1 letter, test 2 average, test 2 letter), we could pipe an additional select() operation:

In [25]:
tib %>% 
  # make a new tibble that has a column for the 
  # averages of test1 and test2
  summarise(avg_test1 = mean(test1), avg_test2 = mean(test2)) %>% 
  # add new columns test1_letter and test2_letter
  mutate(
    # assign test1_letter based on test1 average score
    test1_letter = case_when(
      avg_test1 < 80 ~ "C",
      avg_test1 >= 80 & avg_test1 < 90 ~ "B",
      avg_test1 >= 90 ~ "A"),
    # assign test2_letter based on test2 average score
    test2_letter = case_when(
      avg_test2 < 80 ~ "C",
      avg_test2 >= 80 & avg_test2 < 90 ~ "B",
      avg_test2 >= 90 ~ "A")) %>% 
  # reorder the columns to be by test1 then test2
  select(avg_test1, test1_letter, avg_test2, test2_letter)

avg_test1,test1_letter,avg_test2,test2_letter
<dbl>,<chr>,<dbl>,<chr>
85.3,B,86.6,B
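
To make the mechanics of the pipe concrete, the sketch below writes the same two-step operation twice: once as nested function calls and once with %>%. `tib_small` is a four-row copy of the data, recreated here only so the cell runs on its own.

```r
library(dplyr)

# First four rows of the example data (recreated so this cell is self-contained)
tib_small <- tibble(
  names = c("Michael", "Jae-jin", "Molly", "Harvey"),
  test1 = c(87, 92, 73, 85)
)

# Nested calls: read from the inside out
nested <- arrange(filter(tib_small, test1 > 80), desc(test1))

# Piped calls: read top to bottom, each result feeds the next function
piped <- tib_small %>%
  filter(test1 > 80) %>%
  arrange(desc(test1))

piped$names  # "Jae-jin" "Michael" "Harvey"

# Compare the two results
identical(nested, piped)
```

Both versions perform the same steps; the piped form simply lists them in the order they happen.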
