Refresh character data lesson

STAT545-UBC · Oct 25, 2017 · ea814b7
1 parent afaad83
commit ea814b7

Showing 3 changed files with 149 additions and 130 deletions.
diff --git a/block028_character-data.Rmd b/block028_character-data.Rmd
@@ -29,13 +29,14 @@ I start with this because we cannot possibly do this topic justice in a short am
 #### Manipulating character vectors
 
   * [stringr package](https://cran.r-project.org/web/packages/stringr/index.html)
-    - A non-core package in the tidyverse. It is installed via `install.packages("tidyverse")`, but not loaded via `library(tidyverse)`. Load it as needed via `library(stringr)`.
+    - A non-core package in the tidyverse. It is installed via `install.packages("tidyverse")`, but not loaded via `library(tidyverse)`. Load it as needed via `library(stringr)`. *2017-10 note: this is changing and stringr will be a core package at the next CRAN update.*
     - Main functions start with `str_`. Auto-complete is your friend.
     - Replacements for base functions re: string manipulation and regular expressions (see below).
     - Main advantages over base functions: greater consistency about inputs and outputs. Outputs are more ready for your next analytical task.
   * [tidyr package](https://cran.r-project.org/web/packages/tidyr/index.html)
     - Especially useful for functions that split 1 character vector into many and *vice versa*: `separate()`, `unite()`, `extract()`.
   * Base functions: `nchar()`, `strsplit()`, `substr()`, `paste()`, `paste0()`.
+  * The [glue package](http://glue.tidyverse.org) is fantastic for string interpolation. If `stringr::str_interp()` doesn't get your job done, check out glue.
 
 #### Regular expressions
 
@@ -100,29 +101,29 @@ Determine presence/absence of a literal string with `str_detect()`. Spoiler: lat
 Which fruits actually use the word "fruit"?
 
 ```{r}
-str_detect(fruit, "fruit")
+str_detect(fruit, pattern = "fruit")
 ```
 
 What's the easiest way to get the actual fruits that match? Use `str_subset()` to keep only the matching elements. Note we are storing this new vector `my_fruit` to use in later examples!
 
 ```{r}
-(my_fruit <- str_subset(fruit, "fruit"))
+(my_fruit <- str_subset(fruit, pattern = "fruit"))
 ```
 
 ### String splitting by delimiter
 
 Use `stringr::str_split()` to split strings on a delimiter. Some of our fruits are compound words, like "grapefruit", but some have two words, like "ugli fruit". Here we split on a single space `" "`, but we show the use of a regular expression later.
 
 ```{r}
-str_split(my_fruit, " ")
+str_split(my_fruit, pattern = " ")
 ```
 
 It's a bummer that we get a *list* back. But it must be so! In full generality, split strings must return a list, because who knows how many pieces there will be?
 
 If you are willing to commit to the number of pieces, you can use `str_split_fixed()` and get a character matrix. You're welcome!
 
 ```{r}
-str_split_fixed(my_fruit, " ", n = 2)
+str_split_fixed(my_fruit, pattern = " ", n = 2)
 ```
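
A small self-contained sketch of the list-versus-matrix distinction described above. The vector `x` is a made-up stand-in for `my_fruit` (an assumption for illustration, not the lesson's data):

```r
library(stringr)

# Hypothetical input; stands in for my_fruit
x <- c("ugli fruit", "grapefruit", "star fruit")

# str_split() returns a list with one element per input string,
# because the number of pieces can differ from string to string
pieces <- str_split(x, pattern = " ")
lengths(pieces)

# str_split_fixed() commits to n pieces and returns a character matrix;
# strings with fewer pieces are padded with empty strings
m <- str_split_fixed(x, pattern = " ", n = 2)
dim(m)
m[2, ]
```

The padding is the price of the fixed shape: "grapefruit" has no second piece, so its second column is an empty string rather than `NA`.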
 
 If the to-be-split variable lives in a data frame, `tidyr::separate()` will split it into 2 or more variables.
@@ -160,7 +161,7 @@ tibble(fruit) %>%
 Finally, `str_sub()` also works for assignment, i.e. on the left hand side of `<-`.
 
 ```{r}
-x <- head(fruit, 3)
+(x <- head(fruit, 3))
 str_sub(x, 1, 3) <- "AAA"
 x
 ```
@@ -191,7 +192,10 @@ str_c(fruit[1:4], fruit[5:8], sep = " & ", collapse = ", ")
 If the to-be-combined vectors are variables in a data frame, you can use `tidyr::unite()` to make a single new variable from them.
 
 ```{r}
-fruit_df <- tibble(fruit1 = fruit[1:4], fruit2 = fruit[5:8])
+fruit_df <- tibble(
+  fruit1 = fruit[1:4],
+  fruit2 = fruit[5:8]
+)
 fruit_df %>% 
   unite("flavor_combo", fruit1, fruit2, sep = " & ")
 ```
@@ -201,13 +205,13 @@ fruit_df %>%
 You can replace a pattern with `str_replace()`. Here we use an explicit string-to-replace, but later we revisit with a regular expression.
 
 ```{r}
-str_replace(my_fruit, "fruit", "THINGY")
+str_replace(my_fruit, pattern = "fruit", replacement = "THINGY")
 ```
 
 A special case that comes up a lot is replacing `NA`, for which there is `str_replace_na()`.
 
 ```{r}
-melons <- str_subset(fruit, "melon")
+melons <- str_subset(fruit, pattern = "melon")
 melons[2] <- NA
 melons
 str_replace_na(melons, "UNKNOWN MELON")
@@ -242,7 +246,7 @@ Frequently your string tasks cannot be expressed in terms of a fixed string, but
 The first metacharacter is the period `.`, which stands for any single character except a newline (which, by the way, is represented by `\n`). The regex `i.a` will match all countries that have an `i`, followed by any single character, followed by `a`. Yes, regexes are case sensitive, i.e. "Italy" does not match.
 
 ```{r}
-str_subset(countries, "i.a")
+str_subset(countries, pattern = "i.a")
 ```
 
 Notice that `i.a` matches "ina", "ica", "ita", and more.
@@ -252,17 +256,17 @@ Notice that `i.a` matches "ina", "ica", "ita", and more.
 Note how the regex `i.a$` matches many fewer countries than `i.a` alone. Likewise, more elements of `my_fruit` match `d` than `^d`, which requires "d" at string start.
 
 ```{r}
-str_subset(countries, "i.a$")
-str_subset(my_fruit, "d")
-str_subset(my_fruit, "^d")
+str_subset(countries, pattern = "i.a$")
+str_subset(my_fruit, pattern = "d")
+str_subset(my_fruit, pattern = "^d")
 ```
 
 The metacharacter `\b` indicates a **word boundary** and `\B` indicates NOT a word boundary. This is our first encounter with something called "escaping" and right now I just want you to accept that we need to prepend a second backslash to use these sequences in regexes in R. We'll come back to this tedious point later.
 
 ```{r}
-str_subset(fruit, "melon")
-str_subset(fruit, "\\bmelon")
-str_subset(fruit, "\\Bmelon")
+str_subset(fruit, pattern = "melon")
+str_subset(fruit, pattern = "\\bmelon")
+str_subset(fruit, pattern = "\\Bmelon")
 ```
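
To see the boundary anchors without scrolling through all of `fruit`, here is a minimal sketch on a hand-made vector. The vector `x` and the result names are assumptions for illustration only:

```r
library(stringr)

# Hypothetical input chosen to cover both cases
x <- c("melon", "watermelon", "rock melon")

# \b requires a word boundary just before "melon":
# true at the start of a string or right after a space
starts_word <- str_subset(x, pattern = "\\bmelon")

# \B requires that there is NO word boundary just before "melon":
# true only when "melon" continues an existing word
inside_word <- str_subset(x, pattern = "\\Bmelon")

starts_word
inside_word
```

The two patterns partition the strings that contain "melon": each occurrence is either preceded by a word boundary or it is not.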
 
 ### Character classes
@@ -273,19 +277,19 @@ Here we match `ia` at the end of the country name, preceded by one of the charac
 
 ```{r}
 ## make a class "by hand"
-str_subset(countries, "[nls]ia$")
+str_subset(countries, pattern = "[nls]ia$")
 ## use ^ to negate the class
-str_subset(countries, "[^nls]ia$")
+str_subset(countries, pattern = "[^nls]ia$")
 ```
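
The same two patterns on a short, hand-picked vector make the effect of negation easier to check by eye. The vector `x` below is an assumption for illustration, not the Gapminder `countries` object:

```r
library(stringr)

# Hypothetical input: five names that all end in "ia"
x <- c("Albania", "Australia", "Tunisia", "Bolivia", "India")

# "ia" at the end, preceded by one of n, l, or s
in_class <- str_subset(x, pattern = "[nls]ia$")

# "ia" at the end, preceded by any single character OTHER than n, l, or s
not_in_class <- str_subset(x, pattern = "[^nls]ia$")

in_class
not_in_class
```

Note that `^` only negates when it is the first character inside the square brackets; outside them it is still the start-of-string anchor.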
 
 Here we revisit splitting `my_fruit` with two more general ways to match whitespace: the `\s` metacharacter and the POSIX class `[:space:]`. Notice that we must prepend an extra backslash `\` to escape `\s` and the POSIX class has to be surrounded by two sets of square brackets.
 
 ```{r}
 ## remember this?
-# str_split_fixed(fruit, " ", 2)
+# str_split_fixed(fruit, pattern = " ", n = 2)
 ## alternatives
-str_split_fixed(my_fruit, "\\s", 2)
-str_split_fixed(my_fruit, "[[:space:]]", 2)
+str_split_fixed(my_fruit, pattern = "\\s", n = 2)
+str_split_fixed(my_fruit, pattern = "[[:space:]]", n = 2)
 ```
 
 Let's see the country names that contain punctuation.
@@ -310,28 +314,28 @@ Explore these by inspecting matches for `l` followed by `e`, allowing for variou
 `l.*e` will match strings with 0 or more characters in between, i.e. any string with an `l` eventually followed by an `e`. This is the most inclusive regex for this example, so we store the result as `matches` to use as a baseline for comparison.
 
 ```{r}
-(matches <- str_subset(fruit, "l.*e"))
+(matches <- str_subset(fruit, pattern = "l.*e"))
 ```
 
 Change the quantifier from `*` to `+` to require at least one intervening character. The strings that no longer match all have a literal `le`, with no preceding `l` and no following `e`.
 
 ```{r}
-list(match = intersect(matches, str_subset(fruit, "l.+e")),
-     no_match = setdiff(matches, str_subset(fruit, "l.+e")))
+list(match = intersect(matches, str_subset(fruit, pattern = "l.+e")),
+     no_match = setdiff(matches, str_subset(fruit, pattern = "l.+e")))
 ```
 
 Change the quantifier from `*` to `?` to require at most one intervening character. In the strings that no longer match, the shortest gap between `l` and following `e` is at least two characters.
 
 ```{r}
-list(match = intersect(matches, str_subset(fruit, "l.?e")),
-     no_match = setdiff(matches, str_subset(fruit, "l.?e")))
+list(match = intersect(matches, str_subset(fruit, pattern = "l.?e")),
+     no_match = setdiff(matches, str_subset(fruit, pattern = "l.?e")))
 ```
 
 Finally, we remove the quantifier and allow for no intervening characters. The strings that no longer match lack a literal `le`.
 
 ```{r}
-list(match = intersect(matches, str_subset(fruit, "le")),
-     no_match = setdiff(matches, str_subset(fruit, "le")))
+list(match = intersect(matches, str_subset(fruit, pattern = "le")),
+     no_match = setdiff(matches, str_subset(fruit, pattern = "le")))
 ```
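
The quantifier comparison can be condensed into one sketch on a small vector where every result is easy to verify by hand. The vector `x` and the helper `keep()` are hypothetical names introduced for illustration:

```r
library(stringr)

# Hypothetical input; every element has an "l" eventually followed by an "e"
x <- c("apple", "lemon", "lime", "olive", "lychee")

# Hypothetical helper: which elements of x match a given regex?
keep <- function(regex) str_subset(x, pattern = regex)

any_gap      <- keep("l.*e")  # 0 or more characters between l and e
at_least_one <- keep("l.+e")  # 1 or more characters between l and e
at_most_one  <- keep("l.?e")  # 0 or 1 characters between l and e
no_gap       <- keep("le")    # e immediately after l

any_gap
at_least_one
at_most_one
no_gap
```

In this particular vector `l.?e` and `le` select the same strings, because no element has exactly one character between an `l` and a following `e`.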
 
 ### Escaping
@@ -361,20 +365,20 @@ Here is routine, non-regex use of backslash `\` escapes in plain vanilla R strin
 
 Examples of using escapes in regexes to match characters that would otherwise have a special interpretation.
 
-We know several Gapminder country names contain a period. How do we isolate them? Although it's tempting, this command `str_subset(countries, ".")` won't work!
+We know several Gapminder country names contain a period. How do we isolate them? Although it's tempting, this command `str_subset(countries, pattern = ".")` won't work!
 
 ```{r}
 ## cheating using a POSIX class ;)
-str_subset(countries, "[[:punct:]]")
+str_subset(countries, pattern = "[[:punct:]]")
 ## using two backslashes to escape the period
-str_subset(countries, "\\.")
+str_subset(countries, pattern = "\\.")
 ```
 
 A last example that matches an actual square bracket.
 
 ```{r}
 (x <- c("whatever", "X is distributed U[0,1]"))
-str_subset(x, "\\[")
+str_subset(x, pattern = "\\[")
 ```
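
A minimal sketch of why the unescaped period "won't work", using a two-element vector in place of `countries` (the vector `x` is an assumption for illustration). It also shows `stringr::fixed()`, which treats the pattern as a literal string and so avoids the escaping altogether:

```r
library(stringr)

# Hypothetical input: one name with periods, one without
x <- c("Korea, Dem. Rep.", "Canada")

# Unescaped: "." matches any character, so every non-empty string matches
unescaped <- str_subset(x, pattern = ".")

# Escaped: the regex \. matches only a literal period
escaped <- str_subset(x, pattern = "\\.")

# Literal matching: fixed() switches off regex interpretation
literal <- str_subset(x, pattern = fixed("."))

unescaped
escaped
literal
```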
 
 ### Groups and backreferences