# Working with strings: The `stringr` package
To ensure a common workflow and package use, we will try to stick to packages from the "Tidyverse" as much as possible: https://www.tidyverse.org/

These packages are all developed with a consistent syntax for the functions and are all well-documented, used and tested.

The packages in the tidyverse work best with "tidy data": a data format in which each row of a data set is a single observation and each column is a variable.

The package `stringr` is a tidyverse package for working with strings. Install it with `install.packages('stringr')`.

Once installed, load it with `library(stringr)` in your script. Good practice is to place the call either at the very beginning of the script or just before you first use functions from the package.

In [24]:
library(stringr)

`stringr` offers functions for many different string tasks. The examples below use the following text snippet:

In [39]:
text <- "The Greeks believed that the mental qualifications of their gods were of a
much higher order than those of men, but nevertheless, as we shall see,
they were not considered to be exempt from human passions, and we
frequently behold them actuated by revenge, deceit, and jealousy. They,
however, always punish the evil-doer, and visit with dire calamities any
impious mortal who dares to neglect their worship or despise their rites.
We often hear of them visiting mankind and partaking of their hospitality,
and not unfrequently both gods and goddesses {8} become attached to
mortals, with whom they unite themselves, the offspring of these unions
being called heroes or demi-gods, who were usually renowned for their great
strength and courage. But although there were so many points of resemblance
between gods and men, there remained the one great characteristic
distinction, viz., that the gods enjoyed immortality. Still, they were not
invulnerable, and we often hear of them being wounded, and suffering in
consequence such exquisite torture that they have earnestly prayed to be
deprived of their privilege of immortality."

Functions from `stringr` mostly follow the same structure: `function(textinput, argument, ...)`
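
As a minimal sketch of this shared structure, the lines below call three `stringr` functions on a small made-up vector (`fruits` is a hypothetical example, not part of the text used in this notebook). In each call the string input comes first and the pattern second:

```r
library(stringr)

# Hypothetical example vector, used only for illustration
fruits <- c("apple", "banana", "cherry")

# String input first, then the pattern (and further arguments if needed)
has_an  <- str_detect(fruits, "an")            # FALSE TRUE FALSE
n_a     <- str_count(fruits, "a")              # 1 3 0
swapped <- str_replace_all(fruits, "a", "A")   # "Apple" "bAnAnA" "cherry"
```

Because the string is always the first argument, these functions also combine well with the pipe operator used elsewhere in the tidyverse.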

`str_replace_all()`, for example, is used to replace pieces of text (similar to `gsub()` in base R). Here it replaces the line breaks in the text with spaces:

In [40]:
text <- str_replace_all(text, "\\n", " ")

In [41]:
# Changing case (here to lowercase)
str_to_lower(text)

In [42]:
# Looking up words
str_detect(text, "god")

In [43]:
# Counting matches
str_count(text, "god")

## Regular expressions
Often when working with text, we have more "fuzzy" patterns we want to search for. 

Regular expressions are a common language for describing text patterns.

`stringr` supports regular expressions as pattern arguments.

Possible uses:
- Finding words of a certain length
- Finding sentences containing a certain word or pattern
- Finding words following a certain pattern
- and so on.

Regular expressions (or "regex") can be used in most functions in `stringr`.

The function `str_detect()` is used to check whether a string contains a specific text or pattern.

In [44]:
# Detecting a literal string (the pattern is still interpreted as a regular expression)

str_detect(text, "god")

In [45]:
# Using modifier to specify that text should be matched literally

str_detect(text, fixed("god"))
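
To illustrate what the `fixed()` modifier changes, here is a small sketch on made-up file names (`files` is a hypothetical vector). Without `fixed()`, the `.` in the pattern is treated as a regex operator that matches any character; with `fixed()`, it only matches an actual period:

```r
library(stringr)

# Hypothetical file names, used only for illustration
files <- c("report.csv", "report_csv", "notes.txt")

# Default: the pattern is a regular expression, so "." matches any character
as_regex <- str_detect(files, ".csv")          # TRUE TRUE FALSE

# fixed(): the pattern is matched character by character
as_fixed <- str_detect(files, fixed(".csv"))   # TRUE FALSE FALSE
```

For a pattern like `"god"`, which contains no special characters, both variants give the same result.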

### Operators for matching in regular expression

Regular expressions use different operators to match specific patterns in a text. Some general operators include:

|Character|Description|
|--|--|
|.|Match any character|
|^|Match beginning of string|
|$|Match end of string|
|\||Either or|
|?|Match zero times or once|
|+|Match once or more|
|*|Match zero or more|
|{x,y}|Match between x and y times|
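
The operators in the table can be tried out on a few short strings. The sketch below uses a made-up vector (`words` is hypothetical) and `str_detect()` to show which elements each pattern matches:

```r
library(stringr)

# Hypothetical example words, used only for illustration
words <- c("color", "colour", "colouur", "cat", "dog")

zero_or_one  <- str_detect(words, "colou?r")   # TRUE TRUE FALSE FALSE FALSE
one_or_more  <- str_detect(words, "colou+r")   # FALSE TRUE TRUE FALSE FALSE
zero_or_more <- str_detect(words, "colou*r")   # TRUE TRUE TRUE FALSE FALSE
starts_c     <- str_detect(words, "^c")        # TRUE TRUE TRUE TRUE FALSE
ends_t_or_g  <- str_detect(words, "t$|g$")     # FALSE FALSE FALSE TRUE TRUE
three_chars  <- str_detect(words, "^.{3}$")    # FALSE FALSE FALSE TRUE TRUE
two_three_u  <- str_detect(words, "u{2,3}")    # FALSE FALSE TRUE FALSE FALSE
```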


`str_extract_all()` extracts all text snippets matching the pattern. To normalize, we convert all text to lower-case:

In [50]:
text_lower <- str_to_lower(text)

# Extract 10 characters on each side of "god" or "hero"
str_extract_all(text_lower, ".{10}(god|hero).{10}")

### Special characters in regular expression

Regular expressions use a series of special characters to match specific character types. These are usually denoted by a single backslash (`\`), but because the backslash itself has to be escaped inside an R string, R uses a double backslash (`\\`). Some of these special characters include:

|Character|Description|
|--|--|
|\\\w|Any word character|
|\\\W|Non-word character (like a space or newline)|
|\\\d|Digit|
|\\\s|Whitespace|
|\\\S|Non-whitespace|
|\\\n|Newline|
|\\\b|Word boundary|
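
As a small sketch of how these special characters behave, the lines below apply some of them to a made-up sentence (`sample_text` is hypothetical). Note the double backslash in every pattern:

```r
library(stringr)

# Hypothetical sentence, used only for illustration
sample_text <- "Room 42 opens at 9 am"

digits   <- str_extract_all(sample_text, "\\d+")[[1]]   # "42" "9"
words    <- str_extract_all(sample_text, "\\w+")[[1]]   # "Room" "42" "opens" "at" "9" "am"
n_spaces <- str_count(sample_text, "\\s")               # 5

# Word boundaries make sure only the whole word "at" is replaced
upper_at <- str_replace_all(sample_text, "\\bat\\b", "AT")   # "Room 42 opens AT 9 am"
```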

#### Sets and groups

Regular expressions support matching sets of characters using `[]`. For example, `[0-9]` matches any single digit from 0 to 9.

It is also possible to specify "groups". Groups in a regular expression can be seen as "subpatterns". This makes it possible, for example, to specify a pattern that has to be matched one or more times as part of a longer pattern. Groups are created using `()`.
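
A minimal sketch of sets and groups, using a made-up vector (`codes` is hypothetical): the set `[0-9]` matches a single digit, the set `[aeiou]` a single vowel, and the group `(ab)` is treated as one unit that can be repeated with `+`:

```r
library(stringr)

# Hypothetical codes, used only for illustration
codes <- c("ab12", "abab7", "xyz", "b99")

# Set: extract the first run of digits (NA if there is none)
first_digits <- str_extract(codes, "[0-9]+")        # "12" "7" NA "99"

# Set: count vowels
n_vowels <- str_count(codes, "[aeiou]")             # 1 2 0 0

# Group: "ab" repeated one or more times at the start, followed by a digit
ab_then_digit <- str_detect(codes, "^(ab)+[0-9]")   # TRUE TRUE FALSE FALSE
```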

In [55]:
# Find all words beginning with "d" with a length of 2 to 10 characters
str_extract_all(text_lower, "\\bd\\S{1,9}\\b")

Notice the difference between using `\\S` (non-whitespace) and `\\w` (word character):

In [56]:
str_extract_all(text_lower, "\\bd\\w{1,9}\\b")

### Escape characters

Because characters like `?`, `.`, `+` etc. have a special meaning in regular expressions, they have to be written differently if the pattern should match those characters as they are.

This is referred to as "escaping" the character. It is usually done with a single backslash (`\`), but because the backslash itself has to be escaped inside an R string, R uses a double backslash (`\\`):

In [97]:
# Find words followed by a period:

str_extract_all(text, "\\b\\S+\\.")
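
As a further sketch of escaping, the lines below count characters that would otherwise act as regex operators. The string `expr` is a made-up example:

```r
library(stringr)

# Hypothetical string, used only for illustration
expr <- "Is 2+2=4? Yes. Is 1.5+1.5=3? Yes."

n_question <- str_count(expr, "\\?")   # 2
n_plus     <- str_count(expr, "\\+")   # 2
n_period   <- str_count(expr, "\\.")   # 4

# A digit, a literal period, and another digit
decimals <- str_extract_all(expr, "\\d\\.\\d")[[1]]   # "1.5" "1.5"
```

Without the double backslash, `"."` would match every character in the string rather than only the periods.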

### Look ahead and look behind

When using regular expression for extracting text pieces, it can be useful to specify what should follow or precede the pattern without capturing it.

This can be achieved using operators for look ahead and look behind:

|Character|Description|
|--|--|
|(?=)|Positive look ahead|
|(?!)|Negative look ahead|
|(?<=)|Positive look behind|
|(?<!)|Negative look behind|

In [96]:
# Find words followed by a period, excluding the period:

str_extract_all(text, "\\b\\S+(?=\\.)")
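
The four operators can be sketched on a made-up sentence (`prices` and the two words in the last call are hypothetical examples). In each case the text inside the look ahead or look behind has to be present (or absent) but is not part of the extracted match:

```r
library(stringr)

# Hypothetical sentence, used only for illustration
prices <- "apples cost $3, pears cost $12, and 7 plums are free"

# Positive look behind: numbers preceded by a dollar sign
dollar_amounts <- str_extract_all(prices, "(?<=\\$)\\d+")[[1]]       # "3" "12"

# Positive look ahead: words followed by " cost"
priced_items <- str_extract_all(prices, "\\b\\w+(?= cost)")[[1]]     # "apples" "pears"

# Negative look behind: whole numbers NOT preceded by a dollar sign
other_numbers <- str_extract_all(prices, "(?<!\\$)\\b\\d+")[[1]]     # "7"

# Negative look ahead: "god" NOT followed by "like"
not_godlike <- str_detect(c("godlike", "goddess"), "god(?!like)")    # FALSE TRUE
```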

## EXERCISE: REGULAR EXPRESSIONS

1. Write a regular expression pattern that matches words beginning with "d" with a minimum length of 5 characters.
    -  Use `str_extract_all()` to test whether it works


2. Use `str_count()` together with your regular expression to determine how many matches there are in the text snippet.