This repository has been archived by the owner on Sep 18, 2019. It is now read-only.

Refresh character data lesson
jennybc committed Oct 25, 2017
1 parent afaad83 commit ea814b7
Showing 3 changed files with 149 additions and 130 deletions.
68 changes: 36 additions & 32 deletions block028_character-data.Rmd
@@ -29,13 +29,14 @@ I start with this because we cannot possibly do this topic justice in a short am
#### Manipulating character vectors

* [stringr package](https://cran.r-project.org/web/packages/stringr/index.html)
- A non-core package in the tidyverse. It is installed via `install.packages("tidyverse")`, but not loaded via `library(tidyverse)`. Load it as needed via `library(stringr)`.
- A non-core package in the tidyverse. It is installed via `install.packages("tidyverse")`, but not loaded via `library(tidyverse)`. Load it as needed via `library(stringr)`. *2017-10 note: this is changing and stringr will be a core package at the next CRAN update.*
- Main functions start with `str_`. Auto-complete is your friend.
- Replacements for base functions re: string manipulation and regular expressions (see below).
- Main advantages over base functions: greater consistency about inputs and outputs. Outputs are more ready for your next analytical task.
* [tidyr package](https://cran.r-project.org/web/packages/tidyr/index.html)
- Especially useful for functions that split 1 character vector into many and *vice versa*: `separate()`, `unite()`, `extract()`.
* Base functions: `nchar()`, `strsplit()`, `substr()`, `paste()`, `paste0()`.
* The [glue package](http://glue.tidyverse.org) is fantastic for string interpolation. If `stringr::str_interp()` doesn't get your job done, check out glue.
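
To make the "greater consistency" point a bit more concrete, here is a minimal sketch that compares `paste()` with `stringr::str_c()` when a missing value is involved. The vector `x` is made up for this illustration and is not part of the lesson data.

```r
library(stringr)

## a made-up vector with a missing value (an assumption for this sketch)
x <- c("apple", NA, "kiwi")

## base: paste() turns the NA into the literal string "NA"
base_out <- paste(x, "pie")
base_out

## stringr: str_c() keeps the missing value missing
stringr_out <- str_c(x, "pie", sep = " ")
stringr_out

## str_length() also propagates the NA
str_length(x)
```

Whether the base or the stringr behaviour is what you want depends on the task, but the stringr result is usually easier to spot and handle downstream.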

#### Regular expressions

@@ -100,29 +101,29 @@ Determine presence/absence of a literal string with `str_detect()`. Spoiler: lat
Which fruits actually use the word "fruit"?

```{r}
str_detect(fruit, "fruit")
str_detect(fruit, pattern = "fruit")
```

What's the easiest way to get the actual fruits that match? Use `str_subset()` to keep only the matching elements. Note we are storing this new vector `my_fruit` to use in later examples!

```{r}
(my_fruit <- str_subset(fruit, "fruit"))
(my_fruit <- str_subset(fruit, pattern = "fruit"))
```
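
If the relationship between `str_detect()` and `str_subset()` feels fuzzy, this small sketch may help. It uses a made-up vector, `snacks`, instead of the lesson's `fruit`, and it also shows `str_which()`, which is not used above but returns the positions of the matches.

```r
library(stringr)

## a made-up vector (an assumption for this sketch, not the lesson's `fruit`)
snacks <- c("grapefruit", "banana", "ugli fruit", "fig")

## logical vector: one TRUE/FALSE per element
hits <- str_detect(snacks, pattern = "fruit")
hits

## these two give the same answer
by_index <- snacks[hits]
by_subset <- str_subset(snacks, pattern = "fruit")
by_subset

## str_which() gives the positions of the matches instead
positions <- str_which(snacks, pattern = "fruit")
positions
```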

### String splitting by delimiter

Use `stringr::str_split()` to split strings on a delimiter. Some of our fruits are compound words, like "grapefruit", but some have two words, like "ugli fruit". Here we split on a single space `" "`, but we show the use of a regular expression later.

```{r}
str_split(my_fruit, " ")
str_split(my_fruit, pattern = " ")
```

It's a bummer that we get a *list* back. But it must be so! In full generality, split strings must return a list, because who knows how many pieces there will be?

If you are willing to commit to the number of pieces, you can use `str_split_fixed()` and get a character matrix. You're welcome!

```{r}
str_split_fixed(my_fruit, " ", n = 2)
str_split_fixed(my_fruit, pattern = " ", n = 2)
```
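
Here is a tiny sketch of the list-versus-matrix difference, using a made-up two-element vector so the shapes are easy to see. Note how `str_split_fixed()` pads with an empty string when an element has fewer pieces than `n`.

```r
library(stringr)

## a made-up vector (an assumption for this sketch)
two_fruits <- c("ugli fruit", "grapefruit")

## list: the elements can have different lengths
pieces <- str_split(two_fruits, pattern = " ")
lengths(pieces)

## matrix: always n columns, padded with "" where needed
pieces_mat <- str_split_fixed(two_fruits, pattern = " ", n = 2)
dim(pieces_mat)
pieces_mat
```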

If the to-be-split variable lives in a data frame, `tidyr::separate()` will split it into 2 or more variables.
@@ -160,7 +161,7 @@ tibble(fruit) %>%
Finally, `str_sub()` also works for assignment, i.e. on the left hand side of `<-`.

```{r}
x <- head(fruit, 3)
(x <- head(fruit, 3))
str_sub(x, 1, 3) <- "AAA"
x
```
@@ -191,7 +192,10 @@ str_c(fruit[1:4], fruit[5:8], sep = " & ", collapse = ", ")
If the to-be-combined vectors are variables in a data frame, you can use `tidyr::unite()` to make a single new variable from them.

```{r}
fruit_df <- tibble(fruit1 = fruit[1:4], fruit2 = fruit[5:8])
fruit_df <- tibble(
fruit1 = fruit[1:4],
fruit2 = fruit[5:8]
)
fruit_df %>%
unite("flavor_combo", fruit1, fruit2, sep = " & ")
```
@@ -201,13 +205,13 @@ fruit_df %>%
You can replace a pattern with `str_replace()`. Here we use an explicit string-to-replace, but later we revisit with a regular expression.

```{r}
str_replace(my_fruit, "fruit", "THINGY")
str_replace(my_fruit, pattern = "fruit", replacement = "THINGY")
```
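
One thing that is easy to miss: `str_replace()` only replaces the *first* match in each string. A quick sketch with a made-up word shows the difference from `str_replace_all()`, which is a stringr function not otherwise used in this section.

```r
library(stringr)

## a made-up example string (an assumption for this sketch)
word <- "banana"

## first match only
first_only <- str_replace(word, pattern = "a", replacement = "A")
first_only   # "bAnana"

## every match
every_one <- str_replace_all(word, pattern = "a", replacement = "A")
every_one    # "bAnAnA"
```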

A special case that comes up a lot is replacing `NA`, for which there is `str_replace_na()`.

```{r}
melons <- str_subset(fruit, "melon")
melons <- str_subset(fruit, pattern = "melon")
melons[2] <- NA
melons
str_replace_na(melons, "UNKNOWN MELON")
@@ -242,7 +246,7 @@ Frequently your string tasks cannot be expressed in terms of a fixed string, but
The first metacharacter is the period `.`, which stands for any single character except a newline (which, by the way, is represented by `\n`). The regex `a.b` will match all countries that have an `a`, followed by any single character, followed by `b`. Yes, regexes are case sensitive, i.e. "Italy" does not match `i.a`.

```{r}
str_subset(countries, "i.a")
str_subset(countries, pattern = "i.a")
```

Notice that `i.a` matches "ina", "ica", "ita", and more.
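
A small sketch of the same idea on a made-up handful of names (not the Gapminder `countries` vector), including the case-sensitivity point. The `regex(..., ignore_case = TRUE)` modifier is a stringr helper that this section does not otherwise use.

```r
library(stringr)

## a made-up vector (an assumption for this sketch)
places <- c("China", "Costa Rica", "Italy", "Peru")

## lowercase i, any single character, then a
strict <- str_subset(places, pattern = "i.a")
strict    # "China" "Costa Rica"

## ignoring case lets the "Ita" in "Italy" match too
relaxed <- str_subset(places, pattern = regex("i.a", ignore_case = TRUE))
relaxed   # "China" "Costa Rica" "Italy"
```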
@@ -252,17 +256,17 @@ Notice that `i.a` matches "ina", "ica", "ita", and more.
Note how the regex `i.a$` matches many fewer countries than `i.a` alone. Likewise, more elements of `my_fruit` match `d` than `^d`, which requires "d" at string start.

```{r}
str_subset(countries, "i.a$")
str_subset(my_fruit, "d")
str_subset(my_fruit, "^d")
str_subset(countries, pattern = "i.a$")
str_subset(my_fruit, pattern = "d")
str_subset(my_fruit, pattern = "^d")
```
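
To see the anchors in isolation, here is a sketch on three made-up strings. Combining `^` and `$` in one pattern forces the regex to match the whole string.

```r
library(stringr)

## made-up examples (an assumption for this sketch, not the lesson's `my_fruit`)
treats <- c("date", "dragonfruit", "breadfruit")

anywhere <- str_subset(treats, pattern = "d")       # all three contain a "d"
at_start <- str_subset(treats, pattern = "^d")      # "date" "dragonfruit"
at_end   <- str_subset(treats, pattern = "t$")      # "dragonfruit" "breadfruit"
whole    <- str_subset(treats, pattern = "^date$")  # "date"
```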

The metacharacter `\b` indicates a **word boundary** and `\B` indicates NOT a word boundary. This is our first encounter with something called "escaping" and right now I just want you to accept that we need to prepend a second backslash to use these sequences in regexes in R. We'll come back to this tedious point later.

```{r}
str_subset(fruit, "melon")
str_subset(fruit, "\\bmelon")
str_subset(fruit, "\\Bmelon")
str_subset(fruit, pattern = "melon")
str_subset(fruit, pattern = "\\bmelon")
str_subset(fruit, pattern = "\\Bmelon")
```
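
A compact sketch of `\b` versus `\B` on three made-up strings, chosen so that each pattern picks out a different subset.

```r
library(stringr)

## made-up examples (an assumption for this sketch)
melon_ish <- c("watermelon", "melon", "honeydew melon")

## "melon" at the start of a word
word_start <- str_subset(melon_ish, pattern = "\\bmelon")
word_start   # "melon" "honeydew melon"

## "melon" NOT at the start of a word, i.e. glued onto something
mid_word <- str_subset(melon_ish, pattern = "\\Bmelon")
mid_word     # "watermelon"
```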

### Character classes
@@ -273,19 +277,19 @@ Here we match `ia` at the end of the country name, preceded by one of the charac

```{r}
## make a class "by hand"
str_subset(countries, "[nls]ia$")
str_subset(countries, pattern = "[nls]ia$")
## use ^ to negate the class
str_subset(countries, "[^nls]ia$")
str_subset(countries, pattern = "[^nls]ia$")
```

Here we revisit splitting `my_fruit` with two more general ways to match whitespace: the `\s` metacharacter and the POSIX class `[:space:]`. Notice that we must prepend an extra backslash `\` to escape `\s` and the POSIX class has to be surrounded by two sets of square brackets.

```{r}
## remember this?
# str_split_fixed(fruit, " ", 2)
# str_split_fixed(fruit, pattern = " ", n = 2)
## alternatives
str_split_fixed(my_fruit, "\\s", 2)
str_split_fixed(my_fruit, "[[:space:]]", 2)
str_split_fixed(my_fruit, pattern = "\\s", n = 2)
str_split_fixed(my_fruit, pattern = "[[:space:]]", n = 2)
```
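
Why bother with `\s` or `[:space:]` when a literal space works? A sketch with a made-up string that contains a tab instead of a space shows the difference.

```r
library(stringr)

## a made-up string with a tab between the words (an assumption for this sketch)
tabbed <- "ugli\tfruit"

## a literal space finds nothing to split on
by_space <- str_split_fixed(tabbed, pattern = " ", n = 2)
by_space

## the whitespace classes treat the tab as a delimiter
by_s     <- str_split_fixed(tabbed, pattern = "\\s", n = 2)
by_posix <- str_split_fixed(tabbed, pattern = "[[:space:]]", n = 2)
by_s
```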

Let's see the country names that contain punctuation.
@@ -310,28 +314,28 @@ Explore these by inspecting matches for `l` followed by `e`, allowing for variou
`l.*e` will match strings with 0 or more characters in between, i.e. any string with an `l` eventually followed by an `e`. This is the most inclusive regex for this example, so we store the result as `matches` to use as a baseline for comparison.

```{r}
(matches <- str_subset(fruit, "l.*e"))
(matches <- str_subset(fruit, pattern = "l.*e"))
```

Change the quantifier from `*` to `+` to require at least one intervening character. The strings that no longer match: all have a literal `le` with no preceding `l` and no following `e`.

```{r}
list(match = intersect(matches, str_subset(fruit, "l.+e")),
no_match = setdiff(matches, str_subset(fruit, "l.+e")))
list(match = intersect(matches, str_subset(fruit, pattern = "l.+e")),
no_match = setdiff(matches, str_subset(fruit, pattern = "l.+e")))
```

Change the quantifier from `*` to `?` to require at most one intervening character. In the strings that no longer match, the shortest gap between `l` and following `e` is at least two characters.

```{r}
list(match = intersect(matches, str_subset(fruit, "l.?e")),
no_match = setdiff(matches, str_subset(fruit, "l.?e")))
list(match = intersect(matches, str_subset(fruit, pattern = "l.?e")),
no_match = setdiff(matches, str_subset(fruit, pattern = "l.?e")))
```

Finally, we remove the quantifier and allow for no intervening characters. The strings that no longer match lack a literal `le`.

```{r}
list(match = intersect(matches, str_subset(fruit, "le")),
no_match = setdiff(matches, str_subset(fruit, "le")))
list(match = intersect(matches, str_subset(fruit, pattern = "le")),
no_match = setdiff(matches, str_subset(fruit, pattern = "le")))
```
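
The same four patterns, side by side, on a made-up vector that is small enough to check by eye. Each string was picked to have a different gap between `l` and the next `e`.

```r
library(stringr)

## made-up examples (an assumption for this sketch)
## gap between l and e: apple = 0, blueberry = 1, lime = 2, lychee = 3
gaps <- c("apple", "blueberry", "lime", "lychee")

star     <- str_subset(gaps, pattern = "l.*e")  # any gap: all four
plus     <- str_subset(gaps, pattern = "l.+e")  # gap of 1 or more: drops "apple"
question <- str_subset(gaps, pattern = "l.?e")  # gap of 0 or 1: "apple" "blueberry"
none     <- str_subset(gaps, pattern = "le")    # no gap: "apple"
```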

### Escaping
@@ -361,20 +365,20 @@ Here is routine, non-regex use of backslash `\` escapes in plain vanilla R strin

Examples of using escapes in regexes to match characters that would otherwise have a special interpretation.

We know several Gapminder country names contain a period. How do we isolate them? Although it's tempting, this command `str_subset(countries, ".")` won't work!
We know several Gapminder country names contain a period. How do we isolate them? Although it's tempting, this command `str_subset(countries, pattern = ".")` won't work!

```{r}
## cheating using a POSIX class ;)
str_subset(countries, "[[:punct:]]")
str_subset(countries, pattern = "[[:punct:]]")
## using two backslashes to escape the period
str_subset(countries, "\\.")
str_subset(countries, pattern = "\\.")
```
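
To see why the unescaped period "won't work", here is a sketch on a made-up vector (not the Gapminder `countries`). It also shows `fixed()`, a stringr helper not otherwise used in this section, which asks for a literal match and so sidesteps the escaping.

```r
library(stringr)

## a made-up vector (an assumption for this sketch)
names_vec <- c("Congo, Dem. Rep.", "Chile", "Peru")

## unescaped: . matches any character, so everything comes back
unescaped <- str_subset(names_vec, pattern = ".")
unescaped

## escaped: only strings with a literal period
escaped <- str_subset(names_vec, pattern = "\\.")
escaped   # "Congo, Dem. Rep."

## fixed() treats the pattern as a plain string, not a regex
literal <- str_subset(names_vec, pattern = fixed("."))
```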

A last example that matches an actual square bracket.

```{r}
(x <- c("whatever", "X is distributed U[0,1]"))
str_subset(x, "\\[")
str_subset(x, pattern = "\\[")
```

### Groups and backreferences