# Worksheet 3-B: Strings and Regular Expressions

**Version 1.0.0**

In this tutorial, you'll practice how to:

- Manipulate a character vector in R using the stringr package.
- Write simple regular expressions (regex).
- Apply regular expressions to data manipulation.

In [59]:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(testthat))
suppressPackageStartupMessages(library(digest))

## Part 1: Warming up to the stringr functions


### Question 1

There's that famous sentence about the "quick fox" that contains all letters of the alphabet, although we don't quite remember the sentence. Obtain a vector of all sentences from the `stringr::sentences` dataset containing the word `"fox"`. Store the resulting vector in a variable named `answer1`.

In [17]:
 answer1 <- str_subset(sentences, "fox")

print(answer1)

[1] "The quick fox jumped on the sleeping cat."


In [18]:
expect_identical(digest(answer1), "b54efc522343ff2628fee7e71bd17747")
cat("success!")

success!

### Question 2

Make an (atomic) vector of the individual words in the sentence. Store the result in a variable named `answer2`.

Hint: Use `str_split(string, pattern)`, and carefully note what the output of this function is.
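To see what the hint is getting at, here is a minimal sketch on a made-up vector (`toy` is an assumption for illustration, not part of the worksheet data). `str_split()` returns a list with one element per input string, so you need `[[ ]]` to pull out an atomic character vector.

```r
library(stringr)

# Hypothetical toy input, not from the worksheet
toy <- c("red green blue", "one two")

pieces <- str_split(toy, " ")

# The result is a list, one element per input string
result_class <- class(pieces)

# Double brackets extract the character vector for the first string
first_words <- pieces[[1]]
print(first_words)
```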

In [19]:
answer2<-str_split(answer1, " ")[[1]]
print(answer2)

[1] "The"      "quick"    "fox"      "jumped"   "on"       "the"      "sleeping"
[8] "cat."    


In [20]:
expect_identical(digest(answer2), "e9776d44cb7da14ddffdfa985e1d8908")
cat("success!")

success!

### Question 3

With stringr, we can substitute parts of a string, too. Replace the word "fox" from `answer1` with "giraffe" using `str_replace()`, and store the result in a variable named `answer3`.

In [22]:
 answer3 <- str_replace(answer1, pattern = "fox", replacement = "giraffe")

# your code here
print(answer3)

[1] "The quick giraffe jumped on the sleeping cat."


In [23]:
expect_identical(digest(answer3), "8659b349bfc1e359cdbb08cd38f5537d")
cat("success!")

success!

### Question 4: pig latin

(Solutions are indicated below.)

Convert `words` to a simplistic version of pig latin:

1. Move the first letter to the end of the word.
2. Add "ay" to the end of the word.

Hint: subset by position using `str_sub(string, start, end)`.

Store the result in a variable named `answer4`.
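As a quick illustration of positional subsetting, the sketch below uses a made-up vector (`toy` is hypothetical). Omitting `end` runs to the end of the string, and negative positions count from the end.

```r
library(stringr)

# Hypothetical toy input, not from the worksheet
toy <- c("apple", "kiwi")

first_char <- str_sub(toy, 1, 1)    # first character of each string
the_rest   <- str_sub(toy, 2)       # from the second character to the end
last_two   <- str_sub(toy, -2, -1)  # negative positions count from the end

print(first_char)
print(the_rest)
print(last_two)
```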

In [34]:
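first_letter <- str_sub(words, 1, 1)        # first letter of each word
rest <- str_sub(words, 2)                   # everything after the first letter
answer4 <- str_c(rest, first_letter, "ay")  # move the first letter to the end, then add "ay"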
head(answer4)

In [35]:
expect_identical(digest(answer4), "66f9cc0b279607492b6d015f979210d2")
cat("success!")

success!

## Part 2: Manipulating character columns in a tibble

Consider the wedding dataset on the UBC-STAT/stat545.stat.ubc.ca GitHub repository:

In [24]:
wedding <- suppressMessages(read_csv("https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/data/wedding/attend.csv"))
head(wedding)

party,name,meal_wedding,meal_brunch,attendance_wedding,attendance_brunch,attendance_golf
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,Sommer Medrano,PENDING,PENDING,PENDING,PENDING,PENDING
1,Phillip Medrano,vegetarian,Menu C,CONFIRMED,CONFIRMED,CONFIRMED
1,Blanka Medrano,chicken,Menu A,CONFIRMED,CONFIRMED,CONFIRMED
1,Emaan Medrano,PENDING,PENDING,PENDING,PENDING,PENDING
2,Blair Park,chicken,Menu C,CONFIRMED,CONFIRMED,CONFIRMED
2,Nigel Webb,,,CANCELLED,CANCELLED,CANCELLED


### Question 5

Split the `name` column into two columns, `first` and `last`, containing the first and last names (currently separated by a space). Store the resulting tibble in a variable called `answer5`.

**Hint**:

- Use `tidyr::separate()`.
- Want a challenge? Try the same exercise, using `str_split()`. 
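
For the challenge route, one possible starting point (a sketch only, on made-up names) is `str_split_fixed()`, a sibling of `str_split()` that returns a character matrix with a fixed number of columns instead of a list. The `toy_names` vector is an assumption for illustration.

```r
library(stringr)

# Hypothetical names, not from the wedding data
toy_names <- c("Ada Lovelace", "Alan Turing")

# n = 2 gives a matrix with exactly two columns
parts <- str_split_fixed(toy_names, " ", n = 2)

first <- parts[, 1]
last  <- parts[, 2]

print(first)
print(last)
```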

In [27]:
 answer5 <- wedding %>% 
   separate(name, into = c("first","last"), sep = " ")

# your code here
print(answer5)

# A tibble: 30 x 8
   party first last  meal_wedding meal_brunch attendance_wedd… attendance_brun…
   <dbl> <chr> <chr> <chr>        <chr>       <chr>            <chr>
 1     1 Somm… Medr… PENDING      PENDING     PENDING          PENDING
 2     1 Phil… Medr… vegetarian   Menu C      CONFIRMED        CONFIRMED
 3     1 Blan… Medr… chicken      Menu A      CONFIRMED        CONFIRMED
 4     1 Emaan Medr… PENDING      PENDING     PENDING          PENDING
 5     2 Blair Park  chicken      Menu C      CONFIRMED        CONFIRMED
 6     2 Nigel Webb  NA           NA          CANCELLED        CANCELLED
 7     3 Sine… Engl… PENDING      PENDING     PENDING          PENDING
 8     4 Ayra  Mar

In [28]:
expect_identical(
    digest(unclass(select(answer5, first, last))), 
    "8fa9b7d74019d7998c6e3b5e63a1d08c"
)
cat("success!")

success!

### Question 6

Using the answer to the previous question, do the opposite operation: combine the `last` and `first` columns into a new column named `name` that holds each person's name in the form "last, first". Store the resulting tibble in a variable named `answer6`.

**Hint**:

- Use the `tidyr::unite()` function.
- Want a challenge? Try the same exercise, using `str_c()` -- an important step for understanding the difference between the `sep` and `collapse` arguments.
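
The difference between `sep` and `collapse` is easiest to see side by side. The sketch below uses made-up vectors (`first` and `last` here are hypothetical, not the tibble columns): `sep` joins the input vectors element by element, while `collapse` glues the elements of the result into a single string.

```r
library(stringr)

# Hypothetical vectors, not the columns of answer5
first <- c("Ada", "Alan")
last  <- c("Lovelace", "Turing")

# sep: combines several vectors element-wise; the output keeps length 2
pairwise <- str_c(last, first, sep = ", ")

# collapse: reduces one vector to a single string of length 1
one_string <- str_c(first, collapse = ", ")

print(pairwise)
print(one_string)
```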

Starter code:

```
# Using tidyr:
answer6 <- answer5 %>% 
    unite(col = FILL_THIS_IN, FILL_THIS_IN, FILL_THIS_IN, sep = FILL_THIS_IN)
# Challenge:
answer6 <- answer5 %>% 
    mutate(name = str_c(FILL_THIS_IN)) %>% 
    select(-first, -last) %>%
    select(party, name, everything())
```

In [36]:
 answer6 <- answer5 %>% 
  unite(col = name, last, first, sep = ", ")

# your code here
print(answer6)

# A tibble: 30 x 7
   party name  meal_wedding meal_brunch attendance_wedd… attendance_brun…
   <dbl> <chr> <chr>        <chr>       <chr>            <chr>
 1     1 Medr… PENDING      PENDING     PENDING          PENDING
 2     1 Medr… vegetarian   Menu C      CONFIRMED        CONFIRMED
 3     1 Medr… chicken      Menu A      CONFIRMED        CONFIRMED
 4     1 Medr… PENDING      PENDING     PENDING          PENDING
 5     2 Park… chicken      Menu C      CONFIRMED        CONFIRMED
 6     2 Webb… NA           NA          CANCELLED        CANCELLED
 7     3 Engl… PENDING      PENDING     PENDING          PENDING
 8     4 Mark… vegetarian   Menu B      PENDING          PENDING
 9

In [37]:
expect_identical(digest(answer6$name), "86be22dab9696ef3e2e5c235ce7914a3")
cat("success!")

success!

### Question 7

Still using the tibble with the first and last names separated into their own columns, make a tibble with one row per party, with columns named `people` and `wedding_status`:

- `people`: contains the first names of everyone in the party, separated by commas (and a space: `", "`).
- `wedding_status`: should be `"CONFIRMED"` if all their wedding status entries are `"CONFIRMED"`, and `"PENDING"` otherwise. 

Store the resulting tibble in a variable named `answer7`.

Starter code:

```
answer7 <- answer5 %>% 
   group_by(party) %>% 
   summarise(
       people = str_c(FILL_THIS_IN),
       wedding_status = if_else(FILL_THIS_IN, "CONFIRMED", "PENDING")
   )
```

In [41]:
answer7 <- answer5 %>% 
   group_by(party) %>% 
   summarise(
       people = str_c(first,collapse=", "),
       wedding_status = if_else(all(attendance_wedding=="CONFIRMED"), "CONFIRMED", "PENDING")
   )
print(answer7)

`summarise()` ungrouping output (override with `.groups` argument)



# A tibble: 15 x 3
   party people                                  wedding_status
   <dbl> <chr>                                   <chr>
 1     1 Sommer, Phillip, Blanka, Emaan          PENDING
 2     2 Blair, Nigel                            PENDING
 3     3 Sinead                                  PENDING
 4     4 Ayra                                    PENDING
 5     5 Atlanta, Denzel, Chanelle               PENDING
 6     6 Jolene, Hayley                          PENDING
 7     7 Amayah, Erika                           PENDING
 8     8 Ciaron                                  PENDING
 9     9 Diana                                   CONFIRMED
10    10 Cosmo                                   PENDING
11    11 Cai

In [42]:
expect_identical(
    digest(unclass(select(answer7, people, wedding_status))), 
    "cf18af7e44c899c48e53b83614739b86"
)
cat("success!")

success!

### Question 8

Select individuals in the `wedding` tibble whose first name starts between "A" and "Em" inclusive. Store the resulting tibble in a variable named `answer8`.

Starter code:

```
answer8 <- wedding %>% 
    filter(FILL_THIS_IN(name, "FILL_THIS_IN")) %>% 
    arrange(name)
```

In [44]:
# One way to express the range as a regex (not checked against the digest below):
# the name starts with A-D, or with "E" followed by a letter from a to m.
answer8 <- wedding %>% 
    filter(str_detect(name, "^([A-D]|E[a-m])")) %>% 
    arrange(name)
print(answer8)


In [None]:
expect_identical(digest(sort(answer8$name)), "6bbf440d3cca5b2e4b670b48f7bddc14")
cat("success!")

## Part 3: Exploring Regular Expressions (regex)


Let's work with the gapminder countries:

In [60]:
countries <- levels(gapminder$country)
head(countries)

### Question 9: The "any character"

Use `str_subset()` to find all countries in the gapminder data set with the following pattern: "i", followed by any single character, followed by "a". Store the result in a vector named `answer9`.

Note that Italy is not on the list, because regex is case-sensitive.

Explore further: use `str_view_all()` to get a visual of what's being matched, and where. This is especially useful for debugging!
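
To make the case-sensitivity point concrete, here is a small sketch on made-up strings (`toy` is hypothetical). The `.` stands for exactly one character, and wrapping the pattern in `regex(..., ignore_case = TRUE)` is one way to relax the case matching.

```r
library(stringr)

# Hypothetical toy input, not the gapminder countries
toy <- c("lima", "Ina", "pita", "Italy", "tia")

# Lowercase "i", any one character, then "a"
case_sensitive <- str_subset(toy, "i.a")

# Same pattern, ignoring case
case_insensitive <- str_subset(toy, regex("i.a", ignore_case = TRUE))

print(case_sensitive)
print(case_insensitive)
```

Note that "tia" never matches: the `.` must consume a character, so there has to be something between the "i" and the "a".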

In [61]:
 answer9 <- str_subset(countries, pattern = "i.a")
# str_view_all(countries, pattern = "FILL_THIS_IN", match = TRUE)

# your code here
print(answer9)

 [1] "Argentina"                "Bosnia and Herzegovina"  
 [3] "Burkina Faso"             "Central African Republic"
 [5] "China"                    "Costa Rica"              
 [7] "Dominican Republic"       "Hong Kong, China"        
 [9] "Jamaica"                  "Mauritania"              
[11] "Nicaragua"                "South Africa"            
[13] "Swaziland"                "Taiwan"                  
[15] "Thailand"                 "Trinidad and Tobago"     


In [62]:
expect_identical(digest(answer9), "fdf1c0b93db219fb32d927700cab3c4e")
cat("success!")

success!

### Question 10

Canada isn't the only country with three interspersed "a"'s. Find all countries with a similar pattern, storing the result in a vector named `answer10`.

In [63]:
 answer10 <- str_subset(countries, pattern = "a.a.a")
# str_view_all(countries, pattern = "FILL_THIS_IN", match = TRUE)

print(answer10)

[1] "Canada"     "Madagascar" "Panama"    


In [64]:
expect_identical(digest(answer10), "4751851d94825e74a6569abdf9759209")
cat("success!")

success!

### Question 11: The escape

What if I wanted to literally search for countries with a period in the name? "Escape the period" to make a vector of all countries with at least one period in their name. Store the result in a vector named `answer11`.

Explore further: use `str_view_all()` to get a visual of what's being matched, and where. This is especially useful for debugging!
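
As a sketch of what escaping looks like in R, using made-up strings (`toy` is hypothetical): the regex for a literal period is `\.`, but inside an R string the backslash itself has to be escaped, so it is typed as `"\\."`. Wrapping the pattern in `fixed()` is an alternative that turns off regex interpretation altogether.

```r
library(stringr)

# Hypothetical toy input, not the gapminder countries
toy <- c("U.S.", "UK", "St. Lucia", "Peru")

# Unescaped: "." matches any character, so every non-empty string matches
unescaped <- str_subset(toy, ".")

# Escaped: only strings containing a literal period
escaped <- str_subset(toy, "\\.")

# fixed() treats the pattern as plain text rather than as a regex
as_fixed <- str_subset(toy, fixed("."))

print(unescaped)
print(escaped)
```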

In [None]:
# answer11 <- str_subset(countries, pattern = "FILL_THIS_IN")
# str_view_all(countries, pattern = "FILL_THIS_IN", match = TRUE)

# your code here
fail() # No Answer - remove if you provide an answer
print(answer11)

In [None]:
expect_identical(digest(answer11), "4c500f226f5abbe540ef2506a4644375")
cat("success!")

### Question 12: Groups

Find all countries with three non-vowel letters next to each other (don't count spaces, commas, and periods). Store the resulting vector in a variable named `answer12`. 
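
Square brackets define a character class, and a leading `^` inside the brackets negates it. The sketch below, on made-up strings (`toy` is hypothetical), shows why the exclusions in the question matter: a negated class such as `[^aeiou]` matches anything that is not a lowercase vowel, spaces included.

```r
library(stringr)

# Hypothetical toy input, not the gapminder countries
toy <- c("strong", "banana", "a b c", "area")

# Two lowercase vowels in a row
two_vowels <- str_subset(toy, "[aeiou][aeiou]")

# Three characters in a row that are not lowercase vowels.
# "a b c" matches too, because spaces are also "not a vowel".
non_vowel_run <- str_subset(toy, "[^aeiou][^aeiou][^aeiou]")

print(two_vowels)
print(non_vowel_run)
```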

In [None]:
# answer12 <- str_subset(countries, pattern = "FILL_THIS_IN")
# str_view_all(countries, pattern = "FILL_THIS_IN", match = TRUE)

# your code here
fail() # No Answer - remove if you provide an answer
print(answer12)

In [None]:
expect_identical(digest(answer12), "3bc834e20e4109423e850f263d1c0cee")
cat("success!")

### Question 13: "Or" and Precedence

Use `|` to denote "or". Writing characters side by side implies "and then" (concatenation), which takes precedence over `|`. Use parentheses to be deliberate with precedence.

For example:

In [None]:
bbb <- c("bear", "beer", "bar")
cat("'bee' or 'ar':")
str_view_all(bbb, pattern =  "bee|ar")
cat("'e' or 'a':")
str_view_all(bbb, pattern = "be(e|a)r") 

Now, find all countries that have either "o" twice in a row or "e" twice in a row (no changeover allowed). Store the resulting vector in a variable named `answer13`.

In [None]:
# answer13 <- str_subset(countries, "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer13)

In [None]:
expect_identical(digest(answer13), "558af24f1d19b86ffc6b74541aef9f9b")
cat("success!")

### Question 14

Task: what letters are used in the first sentence of the `stringr::sentences` dataset? Make a vector of all the unique letters in the sentence (in lowercase), and store it in a variable called `answer14`. Don't forget to remove non-letters, which are either a space or a period.

Hint:

```
answer14 <- sentences[1] %>% 
  str_remove_all("FILL_THIS_IN") %>% 
  FILL_THIS_IN() %>% 
  str_split(FILL_THIS_IN) %>% 
  .[[1]] %>% 
  unique()
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
print(answer14)

In [None]:
expect_identical(digest(sort(answer14)), "d586631001ba6d44947a09efecc4f960")
cat("success!")

### Question 15: Quantifiers/Repetition

The handy ones are:

- `*` for 0 or more
- `+` for 1 or more
- `?` for 0 or 1

See list at https://r4ds.had.co.nz/strings.html#repetition
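
A minimal sketch of the three quantifiers on made-up strings (`toy` is hypothetical). Each quantifier applies to the single item just before it, here the letter "a".

```r
library(stringr)

# Hypothetical toy input, not the gapminder countries
toy <- c("ct", "cat", "caat", "caaat")

zero_or_more <- str_subset(toy, "ca*t")  # "a" may be absent or repeated
one_or_more  <- str_subset(toy, "ca+t")  # at least one "a" is required
zero_or_one  <- str_subset(toy, "ca?t")  # at most one "a" is allowed

print(zero_or_more)
print(one_or_more)
print(zero_or_one)
```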

Find all countries that have any number of "o"'s (but at least 1), following an "r". Store the resulting vector in a variable named `answer15`.

In [None]:
# answer15 <- str_subset(countries, "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer15)

In [None]:
expect_identical(digest(answer15), "fa31d9cfe634b9a841cabdf9e31c0eeb")
cat("success!")

### Question 16

Find all countries that have either "o" or "e", twice in a row (with a changeover allowed, such as "oe" or "eo"). Store the resulting vector in a variable named `answer16`.

In [None]:
# answer16 <- str_subset(countries, "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer16)

In [None]:
expect_identical(digest(answer16), "f64216702a5b71dfb2b4ae0661d084cb")
cat("success!")

### Question 17: Position indicators

Use:

- `^` to correspond to the __beginning__ of a string.
- `$` to correspond to the __end__ of a string.
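
The two anchors can be sketched on made-up strings as follows (`toy` is hypothetical). Using both at once forces the pattern to span the whole string.

```r
library(stringr)

# Hypothetical toy input, not the gapminder countries
toy <- c("apple pie", "pineapple", "apple")

starts_with_apple <- str_subset(toy, "^apple")   # "apple" at the beginning
ends_with_apple   <- str_subset(toy, "apple$")   # "apple" at the end
exactly_apple     <- str_subset(toy, "^apple$")  # the whole string is "apple"

print(starts_with_apple)
print(ends_with_apple)
print(exactly_apple)
```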

Find all countries that end in "land". Store the result in a vector named `answer17`.

In [None]:
# answer17 <- str_subset(countries, "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer17)

In [None]:
expect_identical(digest(answer17), "692ee00b59194cea743c5ac3bf2302ae")
cat("success!")

### Question 18

Find all countries that start with "Ca". Store the result in a vector named `answer18`.

In [None]:
# answer18 <- str_subset(countries, pattern = "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer18)

In [None]:
expect_identical(digest(answer18), "649cb10a94daec6fe36112c82e659b39")
cat("success!")

### Question 19

Find all countries that only contain letters. Hint for making the regex: the name should start with a letter, continue with letters, and end with a letter. Store the result in a vector named `answer19`.

In [None]:
# answer19 <- str_subset(countries, "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
head(answer19)

In [None]:
expect_identical(digest(answer19), "43b7b3e81361aa5d00367c2ff01ab240")
cat("success!")

### Question 20: Groups

You can use parentheses not only to specify precedence, but also to define groups that you can refer back to later by number: `\\1` for the first group, `\\2` for the second, and so on.

Example using a's and b's: matching all instances of a character sandwiched between the same two characters:

In [None]:
ab <- c("aaa", "aab", "aba", "baa", "abb", "bab", "bba", "bbb")
str_view_all(ab, pattern="(.)(.)\\1")

Example: matching all instances of a character followed by two identical characters:

In [None]:
str_view_all(ab, pattern="(.)(.)\\2")

Your task: Find all countries that have the same letter twice in a row (like "Greece", which has "ee"). Store the result in a vector named `answer20`.

In [None]:
# answer20 <- str_subset(countries, "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer20)

In [None]:
expect_identical(digest(answer20), "3531a88f6935e86d4ff1054504182875")
cat("success!")

### Question 21

Find all countries that end in two vowels (not including "y"). Store the result in a vector named `answer21`.

In [None]:
# answer21 <- str_subset(countries, "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer21)

In [None]:
expect_identical(digest(answer21), "627ee4e6c6f8977349ed20734a52c1c0")
cat("success!")

### Question 22

Find all countries that start with two non-vowels (don't count "y" as a vowel). Store the result in a vector named `answer22`.

In [None]:
# answer22 <- str_subset(countries, "FILL_THIS_IN")

# your code here
fail() # No Answer - remove if you provide an answer
print(answer22)

In [None]:
expect_identical(digest(answer22), "2256308acab2196cc90529e7410881e0")
cat("success!")

## More Practice

Want more interactive practice? Check out this [regex crossword](https://regexcrossword.com/challenges/beginner/puzzles/1).

