# Strings

A "string" is a programming term for a value containing text. A value of the class "character" is a string.

Below, the first paragraph of chapter 15 of "The Picture of Dorian Gray" by Oscar Wilde is stored as a string (copied from: https://www.gutenberg.org/files/174/174-h/174-h.htm#chap15).

In [2]:
my_text <- "That evening, at eight-thirty, exquisitely dressed and wearing a large button-hole of Parma violets, Dorian Gray was ushered into Lady Narborough’s drawing-room by bowing servants. His forehead was throbbing with maddened nerves, and he felt wildly excited, but his manner as he bent over his hostess’s hand was as easy and graceful as ever. Perhaps one never seems so much at one’s ease as when one has to play a part. Certainly no one looking at Dorian Gray that night could have believed that he had passed through a tragedy as horrible as any tragedy of our age. Those finely shaped fingers could never have clutched a knife for sin, nor those smiling lips have cried out on God and goodness. He himself could not help wondering at the calm of his demeanour, and for a moment felt keenly the terrible pleasure of a double life. "
class(my_text)

## Texts as vectors
Texts can be thought of as vectors in a number of ways:
1. A collection of texts
2. A collection of words in a text
3. A collection of sentences in a text
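
The three views above can be sketched with a small made-up example. The object names and the sample texts are assumptions chosen for illustration, and splitting sentences at periods is a simplification that ignores abbreviations and other punctuation.

```r
# Hypothetical example texts, not taken from the source material
texts <- c("The cat sat. The dog barked.", "Birds sing at dawn.")

# 1. A collection of texts: one element per text
length(texts)

# 2. A collection of words in a text: split the first text at whitespace
words <- unlist(strsplit(texts[1], split = "\\s+"))
words

# 3. A collection of sentences in a text: split the first text at periods
#    (and any whitespace following them)
sentences <- unlist(strsplit(texts[1], split = "\\.\\s*"))
sentences
```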

Depending on what we are trying to figure out with text, R can be used in a number of ways.

Let's start by thinking of a text as a series of single words:

In [4]:
# Split at every whitespace character: "\\s" is the regex class for whitespace
# (the backslash is doubled because it is written inside an R string)
text_words <- unlist(strsplit(my_text, split = "\\s"))
text_words
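
One detail worth knowing: `"\\s"` matches a single whitespace character, so a text with two spaces in a row produces an empty string in the result. A minimal sketch with a made-up sample string (an assumption, not part of the Dorian Gray text) shows the difference between `"\\s"` and `"\\s+"`:

```r
# Hypothetical sample with a double space between "two" and "three"
sample_text <- "one two  three"

# "\\s" splits at every single whitespace character, leaving an empty string
split_single <- strsplit(sample_text, split = "\\s")[[1]]
split_single

# "\\s+" treats a run of whitespace as one separator
split_run <- strsplit(sample_text, split = "\\s+")[[1]]
split_run
```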

The text is now a vector, each word with its own index (subset using `[]`):

In [5]:
text_words[5]
text_words[22]

This allows us to perform counts and other summaries.
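
A few such summaries, sketched with base R on a short made-up text. The names `sample_text` and `sample_words` are assumptions for illustration; the same calls should work on `text_words`.

```r
# Hypothetical short text standing in for a longer one
sample_text <- "the cat and the dog and the bird"
sample_words <- unlist(strsplit(sample_text, split = "\\s"))

# Total number of words
length(sample_words)

# Number of characters in each word
nchar(sample_words)

# Frequency of each word, most frequent first
word_freq <- sort(table(sample_words), decreasing = TRUE)
word_freq
```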

## Working with strings: The `stringr` package

The package `stringr` is a tidyverse package for working with strings.

In [6]:
library(stringr)

`stringr` can be used in a variety of different ways.

In [7]:
# Changing case (here to lowercase)
str_to_lower(my_text)

In [9]:
# Looking up words
str_detect(my_text, "tragedy")

In [11]:
# Counting matches
str_count(my_text, "tragedy")

The functions of `stringr` also work on a collection of elements (like a vector). We can see this when we split the text into sentences and then use the same functions:

In [20]:
# Splitting text into elements; here separating at periods
# unlist is used to coerce to a vector; otherwise it is returned as a list
text_sent <- str_split(my_text, pattern = "\\.") %>%
  unlist()

In [21]:
# Looking up word in each sentence
str_detect(text_sent, "tragedy")

In [22]:
# Counting word in each sentence
str_count(text_sent, "tragedy")

### A note on indexing and booleans
When a vector of boolean (logical) values is used as an index, R returns only the elements at the positions where the index is `TRUE`.
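
To see this behaviour in isolation, here is a minimal sketch in base R. The vector `fruits` and the index `keep` are made-up names used only for illustration.

```r
# Hypothetical vector of fruit names
fruits <- c("apple", "banana", "cherry", "date")

# A logical vector of the same length marks which elements to keep
keep <- c(TRUE, FALSE, TRUE, FALSE)
kept <- fruits[keep]
kept

# The logical vector can also come from a comparison
long_names <- fruits[nchar(fruits) > 5]
long_names
```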

This means that we can use commands like `str_detect` to only return text elements containing specific words.

In [23]:
text_sent[str_detect(text_sent, "tragedy")]

`str_subset()` combines detecting and subsetting in one function:

In [25]:
str_subset(text_sent, "tragedy")

## EXERCISE: WORKING WITH A TEXT AS A VECTOR

In the following, you will create a vector containing the sentences of a text and then look up certain words.

Make sure you have the package `stringr` installed and loaded.

1. Assign the following text snippet to an object:

    "Themis, who has already been alluded to as the wife of Zeus, was the daughter of Cronus and Rhea, and personified those divine laws of justice and order by means of which the well-being and morality of communities are regulated. She presided over the assemblies of the people and the laws of hospitality."
    

2. Convert the text snippet to a vector of sentences.

    a. Split the text using `str_split(texts, pattern = ",")`, where `texts` is the object from step 1. Assign the result to an object. Note that this splits at commas instead of periods, so the elements are comma-separated segments rather than full sentences.

    b. Unlist the object to convert it to a vector using `unlist()`. Assign to the same or a new object.


3. Use `str_subset()` to extract the sentences that contain the name "Zeus".

## Regular expressions
Often when working with text, we have more "fuzzy" patterns we want to search for. 

Regular expressions are a common "language" for describing text patterns.

`stringr` supports regular expressions as pattern arguments.

Possible uses:
- Finding words of a certain length
- Finding sentences containing a certain word or pattern
- Finding words following a certain pattern
- and so on.
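
A few of these uses, sketched on a small made-up word vector. The vector `words` is an assumption for illustration; the patterns use standard regex syntax (`^` and `$` for start and end of the string, `{5}` for a repeat count, `|` for "or").

```r
library(stringr)

# Hypothetical word vector
words <- c("calm", "tragedy", "knife", "pleasure", "sin")

# Words of exactly five characters: ^ and $ anchor the pattern to the whole string
five_letters <- str_subset(words, "^\\w{5}$")
five_letters

# Words ending in "e"
ends_in_e <- str_subset(words, "e$")
ends_in_e

# Words containing either "tra" or "plea"
either <- str_subset(words, "tra|plea")
either
```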

Regular expressions (or "regex") can be used in most functions in `stringr`.

The function `str_subset()` creates a subset of elements with the strings containing the pattern.

In [26]:
text_sent

In [27]:
# Return sentences containing either "excited" or "wonder"
str_subset(text_sent, "excited|wonder")

In [33]:
# Return sentences containing an uppercase letter somewhere after the first word character
# (so a capital at the very start of the sentence does not count on its own)
str_subset(text_sent, "\\w.*[A-Z]")
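
To check how the pattern `"\\w.*[A-Z]"` behaves, it can help to try it on a few short strings first. The vector `examples` below is made up for illustration; `str_detect()` returns one logical value per element.

```r
library(stringr)

# Hypothetical test strings: only the second has a capital after the first word character
examples <- c(" The evening was calm",
              " He greeted Lady Narborough",
              "no capitals here")

# \\w matches a word character, .* any run of characters, [A-Z] an uppercase letter
has_inner_capital <- str_detect(examples, "\\w.*[A-Z]")
has_inner_capital
```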

## EXERCISE: SIMPLE REGEX

Use your vector of sentences from the previous exercise.

1. Use `str_subset()` to extract sentences containing either the word "justice" or "people".