## **What's in this tutorial?**

In this short tutorial, you'll learn and practice some basic functions that help you work easily and intuitively with data. You'll learn how to:

-   Load packages in R using `library()`

-   Drop missing values using `drop_na()`

-   Query whether there are missing values using `anyNA()`

-   Subset unique rows using `distinct()`

-   Keep rows that satisfy your conditions using `filter()`

<p >
<img src = "../../images/r_learners_sm.jpeg", width= 500>



Let's get started!

## 1. **Meet the Tidyverse and data**

Before we can start doing some data science, we typically load packages into our current R environment and import data.


In [None]:
# Load the Tidyverse
library(tidyverse)

# Load the library containing the data set
library(palmerpenguins)

# Print the first few rows of the data
head(penguins)


`NA` values represent `Not Available`, that is, missing values. We'll cover better ways to deal with them later in the course, but for now, let's drop them using `drop_na()`.



In [None]:
# Drop rows containing missing values
penguins <- penguins %>%
  drop_na()


In [None]:

# Do we still have any missing values?
penguins %>% 
  anyNA()


No missing values! Excellent!
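
To see what `drop_na()` and `anyNA()` do on a small scale, here is a minimal sketch. The tibble `toy` and its values are made up for illustration and are not part of the penguins data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not the real penguins data set)
toy <- tibble(
  species     = c("Adelie", "Gentoo", NA, "Chinstrap"),
  body_mass_g = c(3750, NA, 4200, 3800)
)

# Is there at least one missing value anywhere in the data?
anyNA(toy)

# Keep only the rows that have no missing value in any column
toy_complete <- toy %>%
  drop_na()

# Check again after dropping the incomplete rows
anyNA(toy_complete)

# Only the first and last rows were complete
nrow(toy_complete)
```

Note that `drop_na()` with no arguments removes a row if any column is missing, so it can discard more data than you expect on wide data sets.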



In [None]:
penguins %>% 
  head()


We only see penguins of species `Adelie`. Are there any other species of penguins? To answer this, we use a function/verb in dplyr called `distinct()`. It returns only the unique rows in our data or, when given column names, only the unique values in those columns:



In [None]:
# What are the different species of penguins in our data?
penguins %>% 
  distinct(species)


In [None]:
# What are the different islands in our data?
penguins %>% 
  distinct(island)
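
`distinct()` also accepts more than one column, in which case it returns the unique combinations. The sketch below uses a made-up tibble `toy` (hypothetical values, not the penguins data) to show the difference.

```r
library(dplyr)

# Hypothetical toy data (not the real penguins data set)
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Adelie"),
  island  = c("Dream", "Dream", "Biscoe", "Torgersen")
)

# Unique values of a single column: Adelie and Gentoo
toy_species <- toy %>%
  distinct(species)

# Unique combinations of two columns: the duplicated
# Adelie/Dream row appears only once
toy_pairs <- toy %>%
  distinct(species, island)

toy_species
toy_pairs
```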


Now let's go forth and filter!

## 2. filter: keep rows that satisfy your conditions
<p >
    <img src = "../../images/dplyr_filter_sm.png", width = 500>



### **Example 1**

Make a subset with only Adelie penguins.

In the code below, we **filter** the **penguins** data to only keep rows where the entry for **species** exactly matches "Adelie" (case sensitive).


In [None]:
penguins %>% 
  filter(species == "Adelie")


Easy, right? We may want to save this to a variable named `adelie_penguins`. This is how we would go about it.



In [None]:
# Subset data to only obtain Adelie penguins
adelie_penguins <- penguins %>% 
  filter(species == "Adelie")

# Print the data
adelie_penguins %>% 
  head()
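
As a side note on the exact, case-sensitive matching used in these filters, the following sketch shows what happens when the capitalisation differs. The tibble `toy` and its values are made up for illustration; `tolower()` is a base R function.

```r
library(dplyr)

# Hypothetical toy data with inconsistent capitalisation
toy <- tibble(
  species        = c("Adelie", "adelie", "Gentoo"),
  bill_length_mm = c(39.1, 38.6, 47.5)
)

# Exact match: only the row spelled "Adelie" is kept
exact <- toy %>%
  filter(species == "Adelie")

# Lower-casing first makes the comparison ignore case
any_case <- toy %>%
  filter(tolower(species) == "adelie")

nrow(exact)
nrow(any_case)
```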


Good job!! Give it a try too.

**Question 1: filter** the **penguins** data to only keep rows where the entry for **species** exactly matches "Chinstrap". Save this in a variable named `chinstrap_penguins`.


In [None]:
chinstrap_penguins <- penguins %>% 
  filter(species == "___")

chinstrap_penguins %>% 
  head()


In [None]:
. = ottr::check("tests/Question 1.R")



### **Example 2**

That went well. What if now we wanted to keep rows where species matches "Chinstrap" **OR** "Gentoo"?

We use the "or" operator, `|` (the vertical line), when we want to keep rows that match any one of several values in a specific column.


In [None]:
# Make a subset only containing Chinstrap and Gentoo penguins
penguins %>% 
  filter(species == "Chinstrap" | species == "Gentoo")
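
When the list of allowed values grows, chaining `|` gets long. One common alternative, sketched here on a made-up tibble `toy` (hypothetical data), is the base R operator `%in%`, which tests whether each value appears in a vector.

```r
library(dplyr)

# Hypothetical toy data (not the real penguins data set)
toy <- tibble(
  species = c("Adelie", "Chinstrap", "Gentoo", "Chinstrap")
)

# Two conditions joined with the "or" operator
with_or <- toy %>%
  filter(species == "Chinstrap" | species == "Gentoo")

# The same subset written with %in%
with_in <- toy %>%
  filter(species %in% c("Chinstrap", "Gentoo"))

# Both approaches keep the same rows
all.equal(with_or, with_in)
```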


Now, let's take this a little bit further. What if we wanted to make subsets based on conditions that span different columns? Say we only want to keep observations (rows) where the species is "Gentoo" **AND** the island is "Dream" - a row should only be retained if both of those conditions are met.

There are a number of ways you can write an **and** statement within `filter()`, including:

-   A comma between conditions indicates both must be met (`filter(x == "a", y == "b")`)

-   An ampersand between conditions indicates both must be met (`filter(x == "a" & y == "b")`)
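
The two forms in the list above should give the same result. A minimal sketch on a made-up tibble `toy` (hypothetical values, not the penguins data) can confirm this:

```r
library(dplyr)

# Hypothetical toy data (not the real penguins data set)
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  island  = c("Dream", "Biscoe", "Biscoe", "Dream")
)

# Conditions separated by a comma
with_comma <- toy %>%
  filter(species == "Adelie", island == "Dream")

# Conditions joined with an ampersand
with_amp <- toy %>%
  filter(species == "Adelie" & island == "Dream")

# Both keep only the single Adelie/Dream row
all.equal(with_comma, with_amp)
```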

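We can create a subset of `penguins` that contains only observations for Gentoo penguins on Dream Island as follows. In this data set, Gentoo penguins were recorded only on Biscoe Island, so the result is empty, which shows that a row is kept only when every condition holds. Change `"Dream"` to `"Biscoe"` to see matching rows: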


In [None]:
penguins %>%
  filter(species == "Gentoo", island == "Dream") %>% 
  head()


Give it a try yourself.

**Question 2:** Create a subset from `penguins` containing observations for **female** **Adelie** penguins on **Dream** or **Torgersen** Islands.


In [None]:
penguins_subset <- penguins %>%
  filter(sex == "____",
         "____" == "Adelie") %>%
  filter("____" | "____")

  

penguins_subset %>% 
  head()


In [None]:
. = ottr::check("tests/Question 2.R")



## Wrap up and Resources

Fantastic! You just did some data wrangling in R. You learnt how to:

-   Load packages in R using `library()`

-   Drop missing values using `drop_na()`

-   Query whether there are missing values using `anyNA()`

-   Subset unique rows using `distinct()`

-   Keep rows that satisfy your conditions using `filter()`

The fun doesn't end here: `dplyr` has a whole bunch of useful verbs:

-   `select()`: keep or exclude some columns

-   `rename()`: rename columns

-   `relocate()`: move columns around

-   `mutate()`: add a new column

-   `group_by()` + `summarize()`: get summary statistics by group
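
As a quick taste, the verbs listed above can be chained together. This is only a sketch on a made-up tibble `toy`; the data values and the new column names `mass_g`, `mass_kg` and `mean_mass_kg` are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data (not the real penguins data set)
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  island      = c("Dream", "Biscoe", "Biscoe", "Biscoe"),
  body_mass_g = c(3000, 4000, 5000, 6000)
)

toy_kg <- toy %>%
  select(species, body_mass_g) %>%      # keep two columns
  rename(mass_g = body_mass_g) %>%      # new_name = old_name
  mutate(mass_kg = mass_g / 1000) %>%   # add a new column
  relocate(mass_kg, .before = mass_g)   # move it before mass_g

# One row per species with the mean mass in kilograms
mass_summary <- toy_kg %>%
  group_by(species) %>%
  summarize(mean_mass_kg = mean(mass_kg))

toy_kg
mass_summary
```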

Here are some great places you can learn all about them:

-   [dplyr.tidyverse.org](https://dplyr.tidyverse.org/)

-   [R for Data Science](https://r4ds.had.co.nz/) by Hadley Wickham and Garrett Grolemund

Happy learning!
