## Lecture 4

## Logistics:
- All important links are bookmarked at the top of Canvas.
- I expect you to spend 30 - 40 min before the lecture going over lecture notes.
- [Here you can find issues](https://github.ubc.ca/MDS-2022-23/DSCI_523_r-prog_students/issues) where you can ask questions and mention topics that you would like me to go over. If none is there, I will go with the default theme.
- You can find theme lecture notes [here.](https://github.ubc.ca/MDS-2022-23/DSCI_523_r-prog_students/tree/master/lecture_theme)
- Release schedule:
    - Lab: Monday morning.
    - Lecture notes for the week: before Monday morning.
    - Worksheets: before the lecture.
    - Lecture theme notes: before 10 PM on the lecture day.
    - Lecture recording: 5 PM on the lecture day.
    - Solutions: after the grade release.
    - Practice Quiz

## Halfway refresher
<img src="data/halfway.png" width=290>

- Check out the visualization at https://tidydatatutor.com

```
library(dplyr)
library(palmerpenguins)

set.seed(2021-12-03)  # note: R evaluates this as 2021 - 12 - 3, so the seed is 2006

sample_penguins <- penguins %>%
  group_by(species) %>% 
  sample_n(3) %>% 
  select(species, bill_length_mm) %>% 
  ungroup()

sample_penguins %>% 
  filter(bill_length_mm > 45) %>% 
  mutate(bill_length_cm = bill_length_mm/10) %>% 
  select(species, bill_length_cm) %>% 
  arrange(desc(bill_length_cm)) %>%
  slice(1)
```
***CLICKER 1***

In [42]:
library(tidyverse)
extracurricular <-  tibble(name = c("Tiff","James","Mike","Matt","Yan"),
       category = as.factor(c("sports","sports","singing","singing","dance")),
       age = c(24,25,31,31,13))
# print(extracurricular)
extracurricular |> filter(age == 31) |> pull(category) |> fct_drop() |> levels()
# extracurricular$category |> fct_drop() |> levels()

## Today's theme 
Helpful links
- https://www.garrickadenbuie.com/project/tidyexplain/#inner-join

There are 3 key themes to this lecture:

1. joins

2. for loops

3. `if`, `else if` and `else` statements

First, let's load the packages we need:

In [20]:
library(tidyverse)

*Note: if you have to install an R package that exists on CRAN, the command is: `install.packages("PACKAGE_NAME")`.*

And then let's limit the output of data frames in Jupyter to 6 lines:

In [21]:
options(repr.matrix.max.rows = 6)

## 1. Joins

You can combine data frames by stacking rows (“row binding”) or by placing columns side by side (“column binding”). Row binding is sometimes the right tool, but column binding is best avoided whenever you can, because it matches rows purely by position. Joins let us combine multiple data frames by matching values in shared key columns, and were inspired by the database query language SQL. Let's practice a few of the most common joins you might use in your data science work with the `band_members` and `band_instruments` data frames from the `dplyr` package.
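
Before turning to joins, here is a minimal sketch of the two kinds of binding mentioned above. The data frames `first_batch`, `second_batch` and `plays` are made up for illustration and are not part of `dplyr`; the point is only to show why column binding is risky.

```r
library(dplyr)

# Two small, made-up data frames with the same columns
first_batch <- tibble(name = c("Mick", "John"), band = c("Stones", "Beatles"))
second_batch <- tibble(name = "Paul", band = "Beatles")

# Row binding stacks the rows; columns are matched by name
all_members <- bind_rows(first_batch, second_batch)
all_members

# Column binding glues columns side by side purely by position.
# Nothing checks that row 1 of `plays` belongs to the person in row 1,
# so a differently ordered input silently produces wrong records.
plays <- tibble(plays = c("guitar", "bass", "vocals"))
bound <- bind_cols(all_members, plays)
bound
```

A join avoids this problem because it lines rows up by the values in a key column rather than by row order.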

In [3]:
band_members

name,band
<chr>,<chr>
Mick,Stones
John,Beatles
Paul,Beatles


In [4]:
band_instruments

name,plays
<chr>,<chr>
John,guitar
Paul,bass
Keith,guitar


**Question - what column can we join these two dataframes on?**

What would we do if we want to combine all rows of the dataframes, so we get all records back, with all columns? We would do a `full_join`:

In [5]:
full_join(band_members, band_instruments)

Joining, by = "name"


name,band,plays
<chr>,<chr>,<chr>
Mick,Stones,
John,Beatles,guitar
Paul,Beatles,bass
Keith,,guitar


What if we just wanted the intersection of these two data frames? Only the rows where the same people exist in both dataframes? We would use an `inner_join`:

In [6]:
inner_join(band_members, band_instruments)

Joining, by = "name"


name,band,plays
<chr>,<chr>,<chr>
John,Beatles,guitar
Paul,Beatles,bass


What if we wanted to add the `plays` column to the `band_members` data frame, keeping every record in `band_members` and adding no rows for people who appear only in `band_instruments`? We would use `left_join` with `band_members` as the first argument:

In [7]:
left_join(band_members, band_instruments)

Joining, by = "name"


name,band,plays
<chr>,<chr>,<chr>
Mick,Stones,
John,Beatles,guitar
Paul,Beatles,bass
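
The three joins above can be compared side by side. The sketch below names the key column explicitly with `by = "name"`, which also silences the `Joining, by = "name"` message; the variable names `full`, `inner` and `left` are chosen here for illustration only.

```r
library(dplyr)

# Naming the key explicitly avoids the "Joining, by = ..." message
full  <- full_join(band_members, band_instruments, by = "name")
inner <- inner_join(band_members, band_instruments, by = "name")
left  <- left_join(band_members, band_instruments, by = "name")

# Number of rows each join returns: 4 for full, 2 for inner, 3 for left,
# matching the tables shown above
row_counts <- c(full = nrow(full), inner = nrow(inner), left = nrow(left))
row_counts
```

Stating `by` explicitly is a reasonable habit in scripts, since it documents the key and protects against accidentally joining on an extra column that happens to share a name.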


What if your column names don't match? You can specify which columns to join by! Let's rename columns...

In [44]:
band_members <- band_members |> rename(band_name = name)

In [46]:
band_members
band_instruments

band_name,band
<chr>,<chr>
Mick,Stones
John,Beatles
Paul,Beatles


name,plays
<chr>,<chr>
John,guitar
Paul,bass
Keith,guitar


In [47]:
left_join(band_members, band_instruments)

ERROR: Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ use by = character()` to perform a cross-join.


In [10]:
## The call above fails because the data frames share no column name; name the key columns explicitly
left_join(band_members, band_instruments, by = c("band_name" = "name"))

band_name,band,plays
<chr>,<chr>,<chr>
Mick,Stones,
John,Beatles,guitar
Paul,Beatles,bass
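
As a hedged aside: assuming `dplyr` 1.1.0 or later is installed (the output in these notes looks like it came from an older release, so this may not run in the same environment), the same key mapping can be written with `join_by()`. The name `members_renamed` is introduced here only for illustration.

```r
library(dplyr)

# Start from the packaged data so the sketch is self-contained
members_renamed <- dplyr::band_members |> rename(band_name = name)

# join_by() states the key mapping as an expression: left column == right column
# (assumes dplyr >= 1.1.0; older versions only accept the named-vector form)
joined <- left_join(
    members_renamed,
    dplyr::band_instruments,
    by = join_by(band_name == name)
)
joined
```

Both forms should give the same result; the named vector `c("band_name" = "name")` works in older and newer versions alike.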


Alternatively, we can rename the column back so that the two data frames share a common column again:

In [11]:
band_members <- band_members |> rename(name = band_name)

In [12]:
left_join(band_members, band_instruments)

Joining, by = "name"


name,band,plays
<chr>,<chr>,<chr>
Mick,Stones,
John,Beatles,guitar
Paul,Beatles,bass


## 2. `for` loops



Let's be a little silly here. Why? Because life is a little too serious right now.

Let's loop over the character vector below to print out the joke:

In [48]:
the_joke <- c("Helvetica and Times New Roman", "walk into a bar", "Get out of here!", "Shouts the bartender", "We don't serve your type!")

In [49]:
for (lines in the_joke) {
    print(lines)
}

[1] "Helvetica and Times New Roman"
[1] "walk into a bar"
[1] "Get out of here!"
[1] "Shouts the bartender"
[1] "We don't serve your type!"


In contrast to Python, R uses `{` and `}` to define what code is part of the `for` loop. Indentation inside the loop is not required in R, but it is used to make the code more readable.

We can also use indices in R when we iterate with a `for` loop:

In [51]:
for (i in seq_along(the_joke)) {
    # print(i)
    print(the_joke[i])
}

[1] "Helvetica and Times New Roman"
[1] "walk into a bar"
[1] "Get out of here!"
[1] "Shouts the bartender"
[1] "We don't serve your type!"


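Beware of using `1:length()` instead. When the vector is empty, `1:length(x)` is `1:0`, which counts down through 1 and 0, so the loop body runs twice: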

In [52]:
the_joke_empty <- c()

In [53]:
for (i in 1:length(the_joke_empty)) {
    print(the_joke_empty[i])
}

NULL
NULL


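Let's see how the loop behaves with `seq_along()`. It returns an empty sequence for an empty vector, so the loop body never runs and nothing is printed: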

In [54]:
for (i in seq_along(the_joke_empty)) {
    print(the_joke_empty[i])
}
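
To see why the two loops differ, it can help to look at the index sequences themselves. This is a small sketch; `empty_vec` and `n_iter` are names chosen here for illustration.

```r
empty_vec <- c()                 # an empty vector (NULL), length 0

idx_colon <- 1:length(empty_vec) # 1:0 counts down, giving 1 0
idx_along <- seq_along(empty_vec) # integer(0), an empty sequence
idx_len   <- seq_len(length(empty_vec)) # integer(0) as well

idx_colon
idx_along
idx_len

# Count how many times a loop over 1:length() actually runs
n_iter <- 0
for (i in 1:length(empty_vec)) {
    n_iter <- n_iter + 1
}
n_iter  # 2, even though the vector has no elements
```

`seq_len(n)` is the counterpart of `seq_along(x)` for cases where only a length `n` is available rather than the vector itself.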

## 3. `if`, `else if` and `else` statements

Let's do a dice rolling exercise, where we simulate rolling two fair dice and R will comment whether we got a pair or not (or snake eyes!):

In [19]:
dice1 <- sample(1:6, 1)
dice2 <- sample(1:6, 1)
paste(dice1, "&", dice2)

if (dice1 == dice2) {
    print("You rolled a pair!")
    if (dice1 == 1) {
        print("And they are snake eyes! What a lucky day!")
    }
} else {
    print("Try again!")
}

[1] "Try again!"
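
The same logic can also be written as a flat `if` / `else if` / `else` chain instead of a nested `if`. The sketch below wraps it in a hypothetical helper, `comment_on_roll()`, which is not part of the lecture code, so that fixed inputs can be tried without relying on random rolls. Conditions are checked from top to bottom and only the first one that is `TRUE` runs, which is why snake eyes must be tested before the general pair.

```r
# Hypothetical helper (not from the lecture) that comments on a roll
comment_on_roll <- function(dice1, dice2) {
    if (dice1 == 1 && dice2 == 1) {
        # Most specific case first: a pair of ones
        "Snake eyes! What a lucky day!"
    } else if (dice1 == dice2) {
        # Any other pair
        "You rolled a pair!"
    } else {
        "Try again!"
    }
}

comment_on_roll(1, 1)
comment_on_roll(4, 4)
comment_on_roll(2, 5)
```

Unlike the nested version, this chain returns a single message per roll, so snake eyes produces one comment rather than two.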
