## Lecture 3 demo

There are 3 key themes to this lecture:

1. dates & times

2. strings

3. factors

> Note: Anything marked optional won't be tested in the assignments or the exam.




First, let's load the packages we need:

In [13]:
library(tidyverse)
library(lubridate)
library(palmerpenguins)

*Note: if you have to install an R package that exists on CRAN, the command is: `install.packages("PACKAGE_NAME")`.*

And then let's limit the output of data frames in Jupyter to 6 lines:

In [14]:
options(repr.matrix.max.rows = 6)

> Show them cheat sheet. https://www.rstudio.com/resources/cheatsheets/

## 1. Dates & times

We learned about how to make date and time objects, but why are we doing this? What kinds of things can we do with these that are more difficult if we left these as characters or numeric vectors?

- We can use date accessors such as `month` and `year` to pull out components.
- We can ask whether a date falls within a certain interval. Use the `interval` function to create an interval.
- Proper date types affect modeling and data visualization (you will see this in DSCI 531).
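
To make the first point concrete, here is a minimal sketch of what a parsed date allows that a raw string does not. The example date is illustrative, and the weekday number assumes lubridate's default of counting from Sunday = 1.

```r
library(lubridate)

# Parse a day-month-year string into a Date object
d <- dmy("15-1-2019")

class(d)      # "Date"
year(d)       # 2019
month(d)      # 1
wday(d)       # 3 (a Tuesday, counting from Sunday = 1)

# Date arithmetic also works once the value is a real date
d + days(30)  # "2019-02-14"
```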

Show them the datatypes image

### Clicker 1

For the following data frame (`time_df`), give the code to list the rows where the weekday is Monday.

| date_col    | other_col     |
|-------------|---------------|
| 15-1-2020   | Alice Wed |
| 15-1-2019   | Bob Tue    |
| 20-1-2019   | Charlie Sun     |
| 30-1-2019   | Jab Wed     |
| 13-1-2020   | Bobby Mon     |

A) `time_df |> filter(wday(dmy(date_col), label = TRUE) == 'Mon')`

B) `time_df |> filter(wday(dmy(date_col), label = TRUE) == 'Monday')`

C) `time_df |> wday(dmy(date_col), label = TRUE) == 'Monday'`

D) `time_df |> mutate(date_col = dmy(date_col)) |> filter(wday(date_col) == 1)`

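Answer: A

A) `time_df |> filter(wday(dmy(date_col), label = TRUE) == 'Mon')`

D does not work with lubridate's defaults: the numeric `wday()` counts from Sunday = 1, so `wday(date_col) == 1` keeps Sundays. It selects Mondays only if the comparison is `== 2` or the call sets `week_start = 1`.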
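
The numeric form of `wday()` depends on which day the week is taken to start on. A short sketch, assuming the `lubridate.week.start` option has not been changed from its default of 7 (Sunday) and an English locale for the labels:

```r
library(lubridate)

monday <- dmy("13-1-2020")   # a Monday

wday(monday)                               # 2: Sunday is day 1 by default
wday(monday, week_start = 1)               # 1: weeks now start on Monday
wday(monday, label = TRUE)                 # Mon (abbreviated label)
wday(monday, label = TRUE, abbr = FALSE)   # Monday (full label)
```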

In [None]:
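# Comparing against the full day name needs label = TRUE and abbr = FALSE
time_df |> filter(wday(dmy(date_col), label = TRUE, abbr = FALSE) == 'Monday')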

### Clicker 2

What will be the output of `print(time_df)` after running the following code?

```r
time_df |> mutate(date_col = dmy(date_col)) |> filter(wday(date_col) == 1)
```

A. 

| date_col    | other_col     |
|-------------|---------------|
| 15-1-2020   | Alice |
| 15-1-2019   | Bob      |
| 20-1-2019   | Charlie      |
| 30-1-2019   | Jab      |
| 13-1-2020   | Bobby     |

B. 

| date_col    | other_col     |
|-------------|---------------|
| 13-1-2020   | Bobby     |


In [44]:
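# Keep the original and store the filtered result under a new name ...
time_df_new <- time_df |> mutate(date_col = dmy(date_col)) |> filter(wday(date_col) == 1)
# ... or overwrite the original; date_col is then a Date, so do not apply dmy() to it again
time_df <- time_df |> mutate(date_col = dmy(date_col)) |> filter(wday(date_col) == 1)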

Answer: A

The pipeline returns a new data frame and leaves `time_df` untouched. Assign the result back to `time_df` (or to a new name) to keep the changes.
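
This can be checked on a small example. The sketch rebuilds `time_df` from the dates in the clicker question and uses the hypothetical name `time_df_parsed` for the saved result:

```r
library(dplyr)
library(lubridate)

time_df <- tibble(
  date_col = c("15-1-2020", "15-1-2019", "20-1-2019", "30-1-2019", "13-1-2020"),
  other_col = c("Alice", "Bob", "Charlie", "Jab", "Bobby")
)

# The pipeline returns a new data frame; without an assignment it is only printed
time_df |> mutate(date_col = dmy(date_col))
class(time_df$date_col)         # "character": the original is unchanged

# Assigning the result keeps the parsed column
time_df_parsed <- time_df |> mutate(date_col = dmy(date_col))
class(time_df_parsed$date_col)  # "Date"
```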

In [15]:
# time_df <- tibble(date_col = c("15-1-2020","15-1-2019","20-1-2019","30-1-2019","13-1-2020"),
#                    other_col = c("Alice", "Bob", "Charlie", "Jab", "Bobby"))
# time_df |> mutate(week_day = wday(dmy(date_col), abbr = FALSE, label = TRUE))

In [50]:
datev <- "01-01-2011"
typeof(datev)
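# datev is still a character string here; parse it with dmy() before extracting the month
# month(dmy(datev))
# month() on the unparsed string is unreliable, because the day-month-year order is not known
# month(datev)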

In [17]:
datev <- dmy("31-Jan-2011")
datev <- mdy("Jan-31-2011")
class(datev)
month(datev)

In [18]:
# setting up an interval 
my_interval <- interval(start = dmy("1-1-2019"), end = dmy("31-1-2019"))

- `%within%` is useful for pairing with `filter` to get rows in an interval

In [19]:
## see if 15-1-2019 is in this interval
dmy("15-1-2019") %within% my_interval

In [20]:
## How you apply this to a data frame
time_df <- tibble(date_col = c("15-1-2020","15-1-2019","20-1-2019","30-1-2019","13-1-2020"),
                   other_col = c("Alice", "Bob", "Charlie", "Jab", "Bobby"))
time_df |> mutate(date_col = dmy(date_col)) |> filter(date_col %within% my_interval) 

date_col,other_col
<date>,<chr>
2019-01-15,Bob
2019-01-20,Charlie
2019-01-30,Jab


You will do this in the worksheet using `lubridate`'s `interval` function to create the interval, and then `filter` + `%within%` to subset rows from that interval.

## 2. Strings

My most common uses for string manipulation in data wrangling are filtering for rows with partial matches and replacing strings. The former can be done by pairing `filter` with `stringr`'s `str_detect` function, while the latter can be done by pairing `mutate` with `stringr`'s `str_replace` function.

Two other useful operations are splitting one column of a data frame into two separate columns and combining two columns into one. The former can be done using `separate`, while the latter can be done using `unite`.

I will demonstrate all of these on the `lakers` data frame below (which is the Los Angeles Lakers 2008-2009 basketball data set from the `lubridate` package):

In [51]:
str_length("gittu")

In [52]:
nchar("gittu")

In [53]:
lakers

date,opponent,game_type,time,period,etype,team,player,result,points,type,x,y
<int>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
20081028,POR,home,12:00,1,jump ball,OFF,,,0,,,
20081028,POR,home,11:39,1,shot,LAL,Pau Gasol,missed,0,hook,23,13
20081028,POR,home,11:37,1,rebound,LAL,Vladimir Radmanovic,,0,off,,
...,...,...,...,...,...,...,...,...,...,...,...,...
20090414,UTA,home,00:27,4,turnover,LAL,Andrew Bynum,,0,,,
20090414,UTA,home,00:21,4,shot,UTA,Kyle Korver,missed,0,3pt,41,25
20090414,UTA,home,00:20,4,rebound,LAL,Luke Walton,,0,def,,


Let's find all the rows where the player has the name Kyle:

In [22]:
lakers |>
  filter(str_detect(player, "Kyle"))

date,opponent,game_type,time,period,etype,team,player,result,points,type,x,y
<int>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
20081222,MEM,away,09:33,1,shot,MEM,Kyle Lowry,missed,0,3pt,13,28
20081222,MEM,away,04:59,1,rebound,MEM,Kyle Lowry,,0,def,,
20081222,MEM,away,04:39,1,shot,MEM,Kyle Lowry,missed,0,jump bank,30,12
...,...,...,...,...,...,...,...,...,...,...,...,...
20090414,UTA,home,00:42,4,free throw,UTA,Kyle Korver,made,1,,,
20090414,UTA,home,00:42,4,free throw,UTA,Kyle Korver,made,1,,,
20090414,UTA,home,00:21,4,shot,UTA,Kyle Korver,missed,0,3pt,41,25


Wait! I am interested in knowing about Trevor as well.

In [54]:
# Filter rows whose player column contains "Kyle", "Trevor", or "James".
# The alternation pattern "Kyle|Trevor|James" matches any one of them.
lakers |> 
    filter(str_detect(player,"Kyle|Trevor|James"))

date,opponent,game_type,time,period,etype,team,player,result,points,type,x,y
<int>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
20081028,POR,home,02:31,1,rebound,LAL,Trevor Ariza,,0,def,,
20081028,POR,home,02:11,1,shot,LAL,Trevor Ariza,made,2,dunk,25,6
20081028,POR,home,01:31,1,shot,LAL,Trevor Ariza,made,3,3pt,46,17
...,...,...,...,...,...,...,...,...,...,...,...,...
20090414,UTA,home,00:42,4,free throw,UTA,Kyle Korver,made,1,,,
20090414,UTA,home,00:42,4,free throw,UTA,Kyle Korver,made,1,,,
20090414,UTA,home,00:21,4,shot,UTA,Kyle Korver,missed,0,3pt,41,25


What if I am interested in all the players whose names start with "Ky"?

### Clicker 3

What code would you use to find all the players whose names start with "Ky"?

A) `lakers |> filter(str_detect(player, "^Ky"))`

B) `lakers |> filter(str_detect(player, "Ky$"))`

C) `lakers |> filter(str_detect(player, "Ky_"))`

D) `lakers |> filter(str_detect(player, "Ky^"))`

Answer: A
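
In a regular expression, `^` anchors the match to the start of the string and `$` anchors it to the end. A small sketch on a made-up vector of names (the third name is invented to show an end-of-string match):

```r
library(stringr)

players <- c("Kyle Lowry", "Kyle Korver", "Rocky", "Trevor Ariza")

str_detect(players, "^Ky")   # TRUE  TRUE  FALSE FALSE: starts with "Ky"
str_detect(players, "ky$")   # FALSE FALSE TRUE  FALSE: ends with "ky"
str_detect(players, "wry$")  # TRUE  FALSE FALSE FALSE: ends with "wry"
```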

In [55]:
## That starts with "Ky".
lakers |> 
    filter(str_detect(player,"^Ky"))

date,opponent,game_type,time,period,etype,team,player,result,points,type,x,y
<int>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
20081222,MEM,away,09:33,1,shot,MEM,Kyle Lowry,missed,0,3pt,13,28
20081222,MEM,away,04:59,1,rebound,MEM,Kyle Lowry,,0,def,,
20081222,MEM,away,04:39,1,shot,MEM,Kyle Lowry,missed,0,jump bank,30,12
...,...,...,...,...,...,...,...,...,...,...,...,...
20090414,UTA,home,00:42,4,free throw,UTA,Kyle Korver,made,1,,,
20090414,UTA,home,00:42,4,free throw,UTA,Kyle Korver,made,1,,,
20090414,UTA,home,00:21,4,shot,UTA,Kyle Korver,missed,0,3pt,41,25


Sorry, but now I am interested in all the players whose names end with "wry".

In [56]:
## That ends with "wry"
lakers |> 
    filter(str_detect(player,"wry$"))

date,opponent,game_type,time,period,etype,team,player,result,points,type,x,y
<int>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
20081222,MEM,away,09:33,1,shot,MEM,Kyle Lowry,missed,0,3pt,13,28
20081222,MEM,away,04:59,1,rebound,MEM,Kyle Lowry,,0,def,,
20081222,MEM,away,04:39,1,shot,MEM,Kyle Lowry,missed,0,jump bank,30,12
...,...,...,...,...,...,...,...,...,...,...,...,...
20090403,HOU,home,11:38,4,foul,HOU,Kyle Lowry,,0,personal,,
20090403,HOU,home,11:25,4,rebound,HOU,Kyle Lowry,,0,def,,
20090403,HOU,home,10:45,4,shot,HOU,Kyle Lowry,missed,0,layup,25,6


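Do we have any rows with no player name? Searching for a space does not find them: `" "` matches the space between a first and a last name, so the filter below returns the rows that *do* have a player name. Rows without a player store an empty string, which can be selected with `player == ""`.

In [58]:
## Rows whose player value contains a space, i.e. rows that do have a player name
lakers |>
    filter(str_detect(player," "))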

date,opponent,game_type,time,period,etype,team,player,result,points,type,x,y
<int>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
20081028,POR,home,11:39,1,shot,LAL,Pau Gasol,missed,0,hook,23,13
20081028,POR,home,11:37,1,rebound,LAL,Vladimir Radmanovic,,0,off,,
20081028,POR,home,11:25,1,shot,LAL,Derek Fisher,missed,0,layup,25,6
...,...,...,...,...,...,...,...,...,...,...,...,...
20090414,UTA,home,00:27,4,turnover,LAL,Andrew Bynum,,0,,,
20090414,UTA,home,00:21,4,shot,UTA,Kyle Korver,missed,0,3pt,41,25
20090414,UTA,home,00:20,4,rebound,LAL,Luke Walton,,0,def,,
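
The pattern `" "` matches any name that contains a space, which is why the output above lists rows that have a player. A minimal sketch of selecting the rows without a player instead, assuming missing players are stored as empty strings (as the blank `player` cells in the `lakers` output suggest); the `is.na()` check is a precaution in case they are stored as `NA`:

```r
library(dplyr)
library(lubridate)  # provides the lakers data set

# Keep rows where player is an empty string; the is.na() check also covers
# the case where a missing player is stored as NA instead
no_player <- lakers |>
  filter(is.na(player) | player == "")

nrow(no_player)
```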


Next, let's change some of the text so it is more readable - so "3pt" to "3 point":

In [27]:
lakers |>
  mutate(type = str_replace(type, "3pt", "3 point"))

date,opponent,game_type,time,period,etype,team,player,result,points,type,x,y
<int>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
20081028,POR,home,12:00,1,jump ball,OFF,,,0,,,
20081028,POR,home,11:39,1,shot,LAL,Pau Gasol,missed,0,hook,23,13
20081028,POR,home,11:37,1,rebound,LAL,Vladimir Radmanovic,,0,off,,
...,...,...,...,...,...,...,...,...,...,...,...,...
20090414,UTA,home,00:27,4,turnover,LAL,Andrew Bynum,,0,,,
20090414,UTA,home,00:21,4,shot,UTA,Kyle Korver,missed,0,3 point,41,25
20090414,UTA,home,00:20,4,rebound,LAL,Luke Walton,,0,def,,
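
`str_replace` replaces only the first match within each string. That is enough here if each `type` value contains "3pt" at most once (an assumption about the data, not something checked above); `str_replace_all` replaces every match. A toy illustration:

```r
library(stringr)

x <- "3pt 3pt"

str_replace(x, "3pt", "3 point")      # "3 point 3pt": first match only
str_replace_all(x, "3pt", "3 point")  # "3 point 3 point": every match
```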


Now let's try to separate out column `player` into 2 columns `player_f_name` and `player_l_name`. 

In [59]:
new_lakers <- lakers |> separate(player, into = c("player_f_name","player_l_name"), sep = " ")
head(new_lakers)

Warning: Expected 2 pieces. Additional pieces discarded in 59 rows [2417, 2419, 5601,
5615, 5617, 5655, 5666, 5668, 5681, 5684, 5704, 5714, 5718, 5777, 5788, 7745,
7748, 7770, 7785, 7816, ...].
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 5398 rows [1, 35, 40, 44,
50, 51, 52, 69, 70, 81, 82, 83, 84, 96, 97, 105, 110, 114, 123, 127, ...].


date,opponent,game_type,time,period,etype,team,player_f_name,player_l_name,result,points,type,x,y
<int>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
20081028,POR,home,12:00,1,jump ball,OFF,,,,0,,,
20081028,POR,home,11:39,1,shot,LAL,Pau,Gasol,missed,0,hook,23,13
20081028,POR,home,11:37,1,rebound,LAL,Vladimir,Radmanovic,,0,off,,
20081028,POR,home,11:25,1,shot,LAL,Derek,Fisher,missed,0,layup,25,6
20081028,POR,home,11:23,1,rebound,LAL,Pau,Gasol,,0,off,,
20081028,POR,home,11:22,1,shot,LAL,Pau,Gasol,made,2,hook,25,10
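
The warnings come from names that do not split into exactly two pieces. One way to handle them, sketched here on made-up values rather than the `lakers` data, is `separate`'s `extra` and `fill` arguments: `extra = "merge"` keeps everything after the first space in the last column, and `fill = "right"` pads short names with `NA` on the right without a warning. The name `toy_split` is hypothetical.

```r
library(tidyr)
library(tibble)

toy <- tibble(player = c("Pau Gasol", "Luc Mbah a Moute", "Nene"))

toy_split <- toy |>
  separate(player,
           into = c("player_f_name", "player_l_name"),
           sep = " ",
           extra = "merge",   # "Mbah a Moute" stays together
           fill = "right")    # "Nene" gets NA as the last name

toy_split
```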


Now let's unite those columns back into a single column `player`.

In [29]:
new_lakers |> unite(col = "player", c(player_f_name,player_l_name), sep= " ")

date,opponent,game_type,time,period,etype,team,player,result,points,type,x,y
<int>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
20081028,POR,home,12:00,1,jump ball,OFF,,,0,,,
20081028,POR,home,11:39,1,shot,LAL,Pau Gasol,missed,0,hook,23,13
20081028,POR,home,11:37,1,rebound,LAL,Vladimir Radmanovic,,0,off,,
...,...,...,...,...,...,...,...,...,...,...,...,...
20090414,UTA,home,00:27,4,turnover,LAL,Andrew Bynum,,0,,,
20090414,UTA,home,00:21,4,shot,UTA,Kyle Korver,missed,0,3pt,41,25
20090414,UTA,home,00:20,4,rebound,LAL,Luke Walton,,0,def,,
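
One thing to watch for: `unite` pastes missing values in as the text "NA" unless `na.rm = TRUE` is set. A sketch on a made-up two-row tibble (the names `with_na` and `without_na` are hypothetical):

```r
library(tidyr)
library(tibble)

toy <- tibble(player_f_name = c("Pau", "Nene"),
              player_l_name = c("Gasol", NA))

# Default: the NA is pasted in as text
with_na <- toy |>
  unite(col = "player", c(player_f_name, player_l_name), sep = " ")
with_na$player      # "Pau Gasol" "Nene NA"

# na.rm = TRUE drops missing pieces before pasting
without_na <- toy |>
  unite(col = "player", c(player_f_name, player_l_name), sep = " ", na.rm = TRUE)
without_na$player   # "Pau Gasol" "Nene"
```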


## 3. Factors

Show them datatype image....

Factors are a kind of vector in R. Since data frame columns are vectors, you will most commonly encounter factors as columns in a data frame. They are very useful for modeling and data visualization.

Let's see what happens when we convert the `etype` column in the `lakers` data frame to a factor, as well as how we can do that:

First, let's look at what the column originally is:

In [62]:
class(lakers$etype)
str(lakers$etype)
# str(lakers)
as.factor(lakers$etype)

 chr [1:34624] "jump ball" "shot" "rebound" "shot" "rebound" "shot" "foul" ...


In [63]:
myname <- "gittu"
str(myname)

 chr "gittu"


Then we can use `as.factor` to convert it to a factor column:

> Note: By default, factor levels are ordered alphabetically.

In [32]:
lakers_factor <- lakers |>
  mutate(etype = as.factor(etype))

str(lakers_factor$etype)
levels(lakers_factor$etype)

 Factor w/ 10 levels "ejection","foul",..: 4 6 5 6 5 6 2 3 2 6 ...
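
When alphabetical order is not meaningful, the levels can be set explicitly with base R's `factor()`. A small sketch on a made-up vector (not from the `lakers` data):

```r
sizes <- c("small", "large", "medium", "small")

# as.factor() sorts the levels alphabetically
levels(as.factor(sizes))   # "large" "medium" "small"

# factor() with an explicit levels argument sets a meaningful order
sizes_fct <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_fct)          # "small" "medium" "large"

# The underlying integer codes follow the level order
as.integer(sizes_fct)      # 1 3 2 1
```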


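You can use `fct_reorder` to reorder the levels of a factor by the values of another variable. The factor acts as the grouping variable: the other variable is summarized within each level, and the levels are sorted by that summary. The default summarizing function is `median`, but you can specify something else; the cell below uses `min`.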

In [39]:
lakers_factor <- lakers_factor |> mutate(etype = fct_reorder(etype,period,min))

In [37]:
lakers_factor$etype |> levels()
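
To see how the summarizing function changes the result, here is a toy sketch with made-up values, where the median and the minimum give different level orders:

```r
library(forcats)

f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(5, 7, 1, 9, 2, 3)

# Default: levels sorted by the median of x within each level
# medians are a = 6, b = 5, c = 2.5
levels(fct_reorder(f, x))              # "c" "b" "a"

# Sorted by the minimum of x within each level instead
# minimums are a = 5, b = 1, c = 2
levels(fct_reorder(f, x, .fun = min))  # "b" "c" "a"
```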

In [67]:
nchar(as.factor(c("gittu","gittu")))

ERROR: Error in nchar(as.factor(c("gittu", "gittu"))): 'nchar()' requires a character vector


In [68]:
str_length(as.factor(c("gittu","gittu")))

In [35]:
lakers_factor

date,opponent,game_type,time,period,etype,team,player,result,points,type,x,y
<int>,<chr>,<chr>,<chr>,<int>,<fct>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
20081028,POR,home,12:00,1,jump ball,OFF,,,0,,,
20081028,POR,home,11:39,1,shot,LAL,Pau Gasol,missed,0,hook,23,13
20081028,POR,home,11:37,1,rebound,LAL,Vladimir Radmanovic,,0,off,,
...,...,...,...,...,...,...,...,...,...,...,...,...
20090414,UTA,home,00:27,4,turnover,LAL,Andrew Bynum,,0,,,
20090414,UTA,home,00:21,4,shot,UTA,Kyle Korver,missed,0,3pt,41,25
20090414,UTA,home,00:20,4,rebound,LAL,Luke Walton,,0,def,,


In the lab, you will explore how factor levels impact data visualization. Next block, in DSCI 552, you will start to use them for statistical analysis.