---
title: "STAT545 HW06"
output:
  html_document:
    toc: true
    number_sections: true
---
<style type="text/css">
.twoC {width: 100%}
.clearer {clear: both}
.twoC table {max-width: 100%; float: left; max-height: 200px}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Character data
To work with strings, we first load the libraries.
```{r}
library(tidyverse)
library(stringr)
```
After finishing the strings chapter, we can work through the exercises.
## Exercises 14.2.5
1. In code that doesn’t use stringr, you’ll often see `paste()` and `paste0()`. What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of `NA`?
```{r}
paste("stat", "545")
paste0("stat", "545")
str_c("stat", NA)
paste("stat", NA)
```
`paste()` separates its arguments with `sep`, which defaults to a single space, while `paste0()` fixes the separator to the empty string. The corresponding stringr function is `str_c()`, whose `sep` also defaults to the empty string, so it behaves like `paste0()`. They differ on missing values: `str_c()` returns `NA` if any argument is `NA`, whereas `paste()` and `paste0()` convert `NA` to the string `"NA"` before joining.
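The `NA` behaviour described above can be sketched as follows. This is only an illustration: the variable names `na_result` and `na_as_text` are made up for this example, and `str_replace_na()` is used to make `str_c()` treat the missing value as text, the way `paste()` does.

```{r}
library(stringr)

# str_c() propagates a missing value to the result
na_result <- str_c("stat", NA_character_)
na_result

# Converting NA to the literal string "NA" first mimics paste()
na_as_text <- str_c("stat", str_replace_na(NA_character_))
na_as_text
```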
2. In your own words, describe the difference between the `sep` and `collapse` arguments to `str_c()`.
```{r}
str_c("stat", "545", c(101, 102), sep = " ")
str_c("stat", "545", c(101, 102), sep = " ", collapse = ",")
```
The `sep` argument is the string inserted between the arguments as they are joined element-wise; `collapse` is the string used to join the elements of the resulting vector into a single string.
3. Use `str_length()` and `str_sub()` to extract the middle character from a string. What will you do if the string has an even number of characters?
```{r}
x <- "abcdef"
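# Integer division keeps the index whole for odd-length strings too
str_sub(x, str_length(x) %/% 2 + 1, str_length(x) %/% 2 + 1)
```
For a string with an odd number of characters this returns the middle one; for an even number it returns the second of the two middle characters.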
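One alternative for even-length strings is to return both middle characters. The helper below is a minimal sketch of that choice; `middle_chars` is a hypothetical name, not part of stringr.

```{r}
library(stringr)

# Hypothetical helper: one middle character for odd lengths,
# the two middle characters for even lengths
middle_chars <- function(s) {
  n <- str_length(s)
  str_sub(s, ceiling(n / 2), floor(n / 2) + 1)
}

middle_chars("abcde")
middle_chars("abcdef")
```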
4. What does `str_wrap()` do? When might you want to use it?
```{r}
txt <- "What does `str_wrap()` do? When might you want to use it?"
cat(str_wrap(txt, width = 40, indent = 2))
```
`str_wrap()` reflows a string into lines of a target `width` by inserting line breaks, with optional indentation. It is useful when writing long text to a text file or printing it to the console.
5. What does `str_trim()` do? What’s the opposite of `str_trim()`?
```{r}
txt <- " What does `str_wrap()` do? When might you want to use it? "
str_trim(txt)
str_pad(txt, 70, side = "both")
```
`str_trim()` removes whitespace from one or both sides of a string. Its opposite is `str_pad()`, which pads a string to a given width with the `pad` character, a space by default.
6. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`. Think carefully about what it should do if given a vector of length 0, 1, or 2.
```{r}
vector_3 <- c("a", "b", "c")
vector_2 <- c("a", "b")
vector_1 <- c("a")
vector_0 <- c()
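and <- function(v) {
  len <- length(v)
  # Nothing to join for zero or one element
  if (len %in% c(0, 1)) return(v)
  # Two elements: no comma needed
  if (len == 2) return(str_c(v, collapse = " and "))
  # Three or more: prefix the last element with "and ", then join with commas
  v[len] <- str_c("and ", v[len])
  str_c(v, collapse = ", ")
}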
and(vector_0)
and(vector_1)
and(vector_2)
and(vector_3)
```
## Exercises 14.3.1.1
1. Explain why each of these strings don’t match a `\`: `"\"`, `"\\"`, `"\\\"`.
```{r}
x <- "a\\b"
str_view(x, "\\\\")
```
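`"\"` is not a complete string: the backslash escapes the closing quote, so the string never ends. The same happens with `"\\\"`, where `\\` is an escaped backslash and `\"` again escapes the closing quote. `"\\"` is a valid string holding a single `\`, but as a regular expression a lone `\` is an incomplete escape. Matching a literal backslash needs the regular expression `\\`, which is written as the string `"\\\\"`.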
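The two levels of escaping can be checked directly. The sketch below uses illustrative variable names (`bs_text`, `bs_regex`, `bs_fixed`); `writeLines()` prints the characters a string actually contains.

```{r}
library(stringr)

bs_text <- "a\\b"        # three characters: a, one backslash, b
writeLines(bs_text)      # prints the characters themselves
writeLines("\\\\")       # the regular expression that matches one backslash

# As a regex the backslash must be escaped again; fixed() skips that layer
bs_regex <- str_detect(bs_text, "\\\\")
bs_fixed <- str_detect(bs_text, fixed("\\"))
```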
2. How would you match the sequence `"'\`?
```{r}
x <- "a\"\'\\b"
str_view(x, "\"\'\\\\")
```
3. What patterns will the regular expression `\..\..\..` match? How would you represent it as a string?
This matches three repetitions of a literal dot followed by any single character, for example `".a.b.c"`. As a string it is written `"\\..\\..\\.."`.
## Exercises 14.3.2.1
1. How would you match the literal string `"$^$"`?
```{r}
x <- "$^$"
str_view(x, "\\$\\^\\$")
```
2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
- Start with “y”.
- End with “x”
- Are exactly three letters long. (Don’t cheat by using `str_length()`!)
- Have seven letters or more.
Since this list is long, you might want to use the 'match' argument to 'str_view()' to show only the matching or non-matching words.
<div class="twoC">
```{r}
str_view(stringr::words, "^y", match = TRUE)
str_view(stringr::words, "x$", match = TRUE)
str_view(stringr::words, "^.{3}$", match = TRUE)
str_view(stringr::words, "^.{7,}$", match = TRUE)
```
</div><div class="clearer"></div>
## Exercises 14.3.3.1
1. Create regular expressions to find all words that:
- Start with a vowel.
- That only contain consonants. (Hint: thinking about matching “not”-vowels.)
- End with `ed`, but not with `eed`.
- End with `ing` or `ise`.
```{r}
str_view(stringr::words, "^[aeiou]", match = TRUE)
str_view(stringr::words, "^[^aeiou]*$", match = TRUE)
str_view(stringr::words, "(^|[^e])ed$", match = TRUE)
str_view(stringr::words, "ing$|ise$", match = TRUE)
```
2. Empirically verify the rule “i before e except after c”.
```{r}
str_view(stringr::words, "[^c]ie", match = TRUE)
```
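A fuller check of the rule compares how often each spelling occurs. The sketch below is one way to do it; the names `ie_patterns` and `ie_counts` are made up here. The rule is supported if the first two counts are clearly larger than the last two.

```{r}
library(stringr)

# Spellings that follow the rule first, spellings that break it last
ie_patterns <- c(
  ie_not_after_c = "[^c]ie",
  ei_after_c     = "cei",
  ie_after_c     = "cie",
  ei_not_after_c = "[^c]ei"
)

# Number of words in the corpus containing each spelling
ie_counts <- sapply(ie_patterns, function(p) sum(str_detect(stringr::words, p)))
ie_counts
```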
3. Is “q” always followed by a “u”?
```{r}
str_view(stringr::words, "q[^u]", match = TRUE)
```
Yes: the pattern finds no match, so in `stringr::words` every "q" is followed by a "u".
4. Write a regular expression that matches a word if it’s probably written in British English, not American English.
```{r}
str_view(stringr::words, "our$", match = TRUE)
```
5. Create a regular expression that will match telephone numbers as commonly written in your country.
```{r}
str_view("123(456)7890", "\\d{3}\\(\\d{3}\\)\\d{4}")
```
## Exercises 14.3.4.1
1. Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.
`?` equals `{0,1}`. `+` equals `{1,}`. `*` equals `{0,}`.
2. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
- `^.*$` matches any whole single-line string, because `.` matches any character except a newline.
- `"\\{.+\\}"` matches a pair of braces with at least one character between them.
- `\d{4}-\d{2}-\d{2}` matches something like a date, for example "2018-01-01".
- `"\\\\{4}"` matches four `\` in a row.
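These descriptions can be spot-checked with `str_detect()`. This is a small sketch with made-up test strings; in each pair the first string should match and the second should not.

```{r}
library(stringr)

# Braces need at least one character inside
brace_check <- str_detect(c("{abc}", "{}"), "\\{.+\\}")

# Four digits, two digits, two digits, separated by hyphens
date_check <- str_detect(c("2018-01-01", "18-1-1"), "\\d{4}-\\d{2}-\\d{2}")

# Four backslashes in a row versus only two
slash_check <- str_detect(c("a\\\\\\\\b", "a\\\\b"), "\\\\{4}")

rbind(brace_check, date_check, slash_check)
```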
3. Create regular expressions to find all words that:
- Start with three consonants.
- Have three or more vowels in a row.
- Have two or more vowel-consonant pairs in a row.
```{r}
str_view(stringr::words, "^[^aeiou]{3}", match = TRUE)
str_view(stringr::words, "[aeiou]{3,}", match = TRUE)
str_view(stringr::words, "([aeiou][^aeiou]){2,}", match = TRUE)
```
4. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.
## Exercises 14.3.5.1
1. Describe, in words, what these expressions will match:
- `(.)\1\1` matches the same character three times in a row, for example "aaa".
- `"(.)(.)\\2\\1"` matches two characters followed by the same two characters in reverse order, for example "abba".
- `(..)\1` matches any two characters repeated, for example "abab".
- `"(.).\\1.\\1"` matches a character that appears three times with any single character between each occurrence, for example "abaca".
- `"(.)(.)(.).*\\3\\2\\1"` matches three characters, then any number of characters, then the same three characters in reverse order, for example "abcxyzcba".
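The examples in the list above can be verified in one call, since `str_detect()` is vectorised over both the strings and the patterns. The vector names are illustrative only.

```{r}
library(stringr)

# Each example string is paired with the pattern at the same position
backref_examples <- c("aaa", "abba", "abab", "abaca", "abcxyzcba")
backref_patterns <- c(
  "(.)\\1\\1",
  "(.)(.)\\2\\1",
  "(..)\\1",
  "(.).\\1.\\1",
  "(.)(.)(.).*\\3\\2\\1"
)

backref_hits <- str_detect(backref_examples, backref_patterns)
backref_hits
```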
2. Construct regular expressions to match words that:
- Start and end with the same character.
- Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
- Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
```{r}
str_view(stringr::words, "^(.).*\\1$", match = TRUE)
str_view(stringr::words, "(..).*\\1", match = TRUE)
str_view(stringr::words, "^.*(.).*(\\1.*){2,}$", match = TRUE)
```
## Exercises 14.4.2
1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.
- Find all words that start or end with `x`.
- Find all words that start with a vowel and end with a consonant.
- Are there any words that contain at least one of each different vowel?
```{r}
str_view(stringr::words, "^x|x$", match = TRUE)
str_view(stringr::words, "^[aeiou].*[^aeiou]$", match = TRUE)
str_view(stringr::words, "(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)", match = TRUE)
```
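```{r}
words <- tibble(
  words = stringr::words
)
# Start or end with x: combine two simple tests
words %>%
  filter(str_detect(words, "^x") | str_detect(words, "x$"))
# Start with a vowel and end with a consonant
words %>%
  filter(str_detect(words, "^[aeiou]"), str_detect(words, "[^aeiou]$"))
# Contain every vowel: one str_detect() call per vowel
words %>%
  filter(
    str_detect(words, "a"), str_detect(words, "e"), str_detect(words, "i"),
    str_detect(words, "o"), str_detect(words, "u")
  )
```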
2. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
```{r}
words <- words %>%
mutate(
vowels = str_count(words, "[aeiou]"),
proportion = str_count(words, "[aeiou]") / str_count(words, ".")
)
words %>%
arrange(desc(vowels)) %>%
head()
words %>%
arrange(desc(proportion)) %>%
head()
```
## Exercises 14.4.3.1
1. In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.
```{r}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
# Word boundaries keep "red" from matching inside "flickered"
colour_match <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
```
2. From the Harvard sentences data, extract:
- The first word from each sentence.
- All words ending in `ing`.
- All plurals.
```{r}
str_view_all(sentences, "^[A-Za-z']+")
str_view_all(sentences, "\\b\\w+ing\\b", match = TRUE)
# Heuristic for plurals: words of four or more letters ending in "s"
str_view_all(sentences, "\\b\\w{3,}s\\b", match = TRUE)
```
## Exercises 14.4.4.1
1. Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
```{r}
noun <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten)\\s+(\\w+)"
has_num <- sentences %>%
str_subset(noun)
has_num %>%
str_extract(noun)
```
2. Find all contractions. Separate out the pieces before and after the apostrophe.
```{r}
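# Capture the pieces before and after the apostrophe separately
con <- "(\\w+)'(\\w+)"
has_con <- sentences %>%
  str_subset(con)
# str_match() returns the full match plus one column per captured piece
has_con %>%
  str_match(con)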
```
## Exercises 14.4.5.1
1. Replace all forward slashes in a string with backslashes.
```{r}
x <- "\\\\a//b\\c///d\\\\"
str_replace_all(x, "/", "\\\\")
```
2. Implement a simple version of `str_to_lower()` using `replace_all()`.
```{r}
x <- "AaBbCcDd"
# Named vector: names are the patterns (upper case), values the replacements
to_lower <- setNames(letters, LETTERS)
str_replace_all(x, to_lower)
```
3. Switch the first and last letters in `words`. Which of those strings are still words?
```{r}
str_replace_all(stringr::words, "(\\w)(\\w*)(\\w)", "\\3\\2\\1") %>% head()
```
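To see which swapped strings are still words, the result can be intersected with the original corpus. This is a sketch under two assumptions: the words are lower-case ASCII, and the names `swapped_words` and `still_words` are made up for this example. Single-letter words do not match the pattern and stay unchanged.

```{r}
library(stringr)

# Anchor the pattern so the first and last letters of the whole word are swapped
swapped_words <- str_replace(stringr::words, "^([a-z])(.*)([a-z])$", "\\3\\2\\1")

# Keep only the swapped strings that also appear in the corpus
still_words <- intersect(swapped_words, stringr::words)
head(still_words)
```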
## Exercises 14.4.6.1
1. Split up a string like `"apples, pears, and bananas"` into individual components.
```{r}
x <- "apples, pears, and bananas"
str_split(x, "(, and )|(, )")
```
2. Why is it better to split up by `boundary("word")` than " "?
```{r}
str_split(x, " ")
str_split(x, boundary("word"))
```
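Splitting on `" "` leaves punctuation attached to the words (for example `"apples,"`), while `boundary("word")` splits on word boundaries, so whitespace and punctuation such as commas and full stops are dropped.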
3. What does splitting with an empty string (`""`) do? Experiment, and then read the documentation.
```{r}
str_split(x, "")
```
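It splits the string into individual characters. According to the documentation, an empty pattern is equivalent to `boundary("character")`.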
## Exercises 14.5.1
1. How would you find all strings containing `\` with `regex()` vs. with `fixed()`?
```{r}
x <- "\\\\a//b\\c///d\\\\"
str_view_all(x, regex("\\\\"))
str_view_all(x, fixed("\\"))
```
2. What are the five most common words in `sentences`?
```{r}
words_in_s <-
str_split(sentences, boundary("word")) %>%
unlist() %>%
str_to_lower()
as.data.frame(table(words_in_s)) %>%
arrange(desc(Freq)) %>%
head(5)
```
## Exercises 14.7.1
1. Find the stringi functions that:
- Count the number of words. `stringi::stri_count_words()`
- Find duplicated strings. `stringi::stri_duplicated()`
- Generate random text. `stringi::stri_rand_lipsum()` generates lorem ipsum text, and `stringi::stri_rand_strings()` generates random strings.
2. How do you control the language that `stri_sort()` uses for sorting?
Use the `locale` argument. For example:
`stri_sort(c("hladny", "chladny"), locale="sk_SK")`
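The functions named above can be tried on small inputs. The inputs and variable names in this sketch are invented for illustration; the random strings will differ on every run.

```{r}
library(stringi)

# Count the words in a string
word_total <- stri_count_words("a b c")
word_total

# Flag elements that repeat an earlier element
dup_flags <- stri_duplicated(c("a", "b", "a"))
dup_flags

# Two random strings of five characters each
rand_ids <- stri_rand_strings(2, 5)
rand_ids

# Sort using Slovak collation rules
stri_sort(c("hladny", "chladny"), locale = "sk_SK")
```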
# Write function
Write one (or more) functions that do something useful to pieces of the Gapminder or Singer data. It is logical to think about computing on the mini-data frames corresponding to the data for each specific country, location, year, band, album, … This would pair well with the prompt below about working with a nested data frame, as you could apply your function there.
Make it something you can’t easily do with built-in functions. Make it something that’s not trivial to do with the simple dplyr verbs. The linear regression function presented here is a good starting point. You could generalize that to do quadratic regression (include a squared term) or use robust regression, using `MASS::rlm()` or `robustbase::lmrob()`.
```{r}
library(gapminder)
gap_cad <- gapminder %>%
filter(country == "Canada") %>%
mutate(
year_m = year - 1952
)
p <- ggplot(gap_cad, aes(x = year, y = lifeExp))
p + geom_point() + geom_smooth(method = "lm", se = FALSE)
```
`le_qua_fit` fits a quadratic regression. For countries like Zimbabwe, where life expectancy rises and then falls, the plot shows that it fits better than the linear regression.
```{r}
le_lin_fit <- function(dat, offset = 1952) {
the_fit <- lm(lifeExp ~ I(year - offset), dat)
setNames(coef(the_fit), c("intercept", "slope"))
}
le_qua_fit <- function(dat, offset = 1952) {
the_fit <- lm(lifeExp ~ poly(year - offset, 2, raw = TRUE), dat)
setNames(coef(the_fit), c("intercept", "poly 1", "poly 2"))
}
gap_zim <- gapminder %>% filter(country == "Zimbabwe")
(model_lin <- le_lin_fit(gap_zim))
(model_qua <- le_qua_fit(gap_zim))
years <- seq(1952, 2007, 5)
le_fit <- tibble(
  year = years,
  le_lin = model_lin[1] + (years - 1952) * model_lin[2],
  le_qua = model_qua[1] + (years - 1952) * model_qua[2] + (years - 1952)^2 * model_qua[3]
)
ggplot(gap_zim, aes(x = year, y = lifeExp)) +
geom_point() +
geom_line(aes(x = year, y = le_lin), le_fit, color = "blue") +
geom_line(aes(x = year, y = le_qua), le_fit, color = "red")
```
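The visual comparison can be backed by numbers. The sketch below refits both models directly with `lm()` so that it stands alone; the object names (`zim_data`, `zim_lin`, `zim_qua`, `fit_r2`) are chosen for this example. Because the linear model is nested in the quadratic one, the quadratic R-squared can never be lower, so the `anova()` F-test is the more informative check of whether the squared term is worth keeping.

```{r}
library(gapminder)

zim_data <- gapminder[gapminder$country == "Zimbabwe", ]

# Same model formulas as le_lin_fit() and le_qua_fit(), with offset 1952
zim_lin <- lm(lifeExp ~ I(year - 1952), data = zim_data)
zim_qua <- lm(lifeExp ~ poly(year - 1952, 2, raw = TRUE), data = zim_data)

# Share of variance explained by each model
fit_r2 <- c(
  linear    = summary(zim_lin)$r.squared,
  quadratic = summary(zim_qua)$r.squared
)
fit_r2

# F-test for the added squared term
anova(zim_lin, zim_qua)
```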