# Strings

A "string" is a programming term for a value containing text. A value of the class "character" is a string.

In [17]:
text <- "The ancient Greeks had several different theories with regard to the origin of the world, but the generally accepted notion was that before this world came into existence, there was in its place a confused mass of shapeless elements called Chaos. "
class(text)
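A string in R is itself a character vector of length one, which is worth keeping in mind for the next section. The following is a minimal sketch using a made-up value (`greeting` is a hypothetical name, not part of the text used in this notebook):

```r
# Hypothetical example value
greeting <- "Chaos came first"

class(greeting)         # class of the value
is.character(greeting)  # checks whether the value is of class "character"
length(greeting)        # number of elements in the vector
nchar(greeting)         # number of characters in the string
```

Note the difference between `length()`, which counts elements, and `nchar()`, which counts characters.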

## Texts as vectors
Texts can be thought of as vectors in a number of ways:
1. A collection of texts
2. A collection of words in a text
3. A collection of sentences in a text

Which of these representations is most useful depends on the question we are asking of the text.

Let's start by thinking of a text as a series of single words:

In [24]:
# Split at every whitespace character; "\\s" is the regex \s (whitespace) with the backslash escaped for R
text_words <- unlist(strsplit(text, split = "\\s"))
text_words

The text is now a vector, each word with its own index (subset using `[]`):

In [25]:
text_words[5]
text_words[22]

This allows us to perform counts and other summaries.
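As a minimal sketch of what such counts and summaries might look like, the example below uses only base R on a short made-up word vector (`example_words` is a hypothetical name, so the objects above are not overwritten):

```r
# Hypothetical example sentence, split into words at whitespace
example_words <- unlist(strsplit("the world before this world came into existence", split = "\\s"))

# Number of words in the vector
n_words <- length(example_words)

# Number of characters in each word
word_lengths <- nchar(example_words)

# Frequency of each distinct word, most frequent first
word_freq <- sort(table(example_words), decreasing = TRUE)

n_words
word_lengths
word_freq
```

The same calls should work on `text_words`, although punctuation attached to words (such as "world,") would be counted as part of the word unless it is removed first.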

## Working with strings: The `stringr` package

The package `stringr` is a tidyverse package for working with strings.

In [18]:
library(stringr)

`stringr` can be used in a variety of different ways.

In [27]:
# Changing case (here to lowercase)
str_to_lower(text)

In [28]:
# Looking up words
str_detect(text, "world")

In [29]:
# Counting matches
str_count(text, "world")

The functions of `stringr` also work on a group of elements (such as a vector). We can see this when we split the text into smaller pieces (here at each comma, treating each piece as a "sentence") and then apply the same functions:

In [30]:
# Splitting text into elements; here separating at commas
# unlist is used to coerce to a vector; otherwise it is returned as a list
text_sent <- str_split(text, pattern = ",") %>%
    unlist()

In [31]:
# Looking up word in each sentence
str_detect(text_sent, "world")

In [32]:
# Counting word in each sentence
str_count(text_sent, "world")

### A note on indexing and booleans
When a vector of boolean values is used as an index, R returns only the elements at the positions where the index is `TRUE`.

This means that we can use commands like `str_detect` to only return text elements containing specific words.

In [33]:
text_sent[str_detect(text_sent, "world")]

`str_subset()` combines detection and subsetting in a single function:

In [34]:
str_subset(text_sent, "world")
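To make the equivalence of the two approaches explicit, here is a small sketch on a made-up vector (`fragments` and the other object names are hypothetical and chosen only for illustration):

```r
library(stringr)

# Hypothetical example vector (not the text used above)
fragments <- c("the origin of the world", "a confused mass", "before this world")

# Logical vector: one TRUE/FALSE per element
has_world <- str_detect(fragments, "world")

# Logical indexing keeps the elements at the TRUE positions
by_index <- fragments[has_world]

# str_subset() does detection and subsetting in one call
by_subset <- str_subset(fragments, "world")

# Compare the two results
identical(by_index, by_subset)
```

Keeping the logical vector around (as in `has_world`) can still be useful, for example to count matching elements with `sum(has_world)`.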

## EXERCISE: WORKING WITH A TEXT AS A VECTOR

In the following, you will create a vector containing the sentences of a text and then look up certain words.

Make sure you have the package `stringr` installed and loaded.

1. Assign the following text snippet to an object:

    "Themis, who has already been alluded to as the wife of Zeus, was the daughter of Cronus and Rhea, and personified those divine laws of justice and order by means of which the well-being and morality of communities are regulated. She presided over the assemblies of the people and the laws of hospitality."
    

2. Convert the text snippet to a vector of sentences.
    
    a. Split the text into sentences using `str_split(texts, pattern = ",")`. Assign to an object.
    
    b. Unlist the object to convert it to a vector using `unlist()`. Assign to the same or a new object.
    
    
3. Use `str_subset()` to extract the sentences that contain the name "Zeus". 

## Regular expressions
Often when working with text, we have more "fuzzy" patterns we want to search for. 

Regular expressions are a common language for describing text patterns.

`stringr` accepts regular expressions as pattern arguments.

Possible uses:
- Finding words of a certain length
- Finding sentences containing a certain word or pattern
- Finding words following a certain pattern
- and so on.

Regular expressions (or "regex") can be used in most functions in `stringr`. 

The function `str_subset()` creates a subset of elements with the strings containing the pattern.

In [36]:
text_sent

In [79]:
# Return sentences containing either "origin" or "before"
str_subset(text_sent, "origin|before")

In [38]:
# Return sentences containing an uppercase letter
str_subset(text_sent, "[A-Z]")
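The other uses listed above, such as finding words of a certain length or words following a certain pattern, can be sketched in the same way. The example below assumes a small made-up vector of words (`example_words` is hypothetical), and the patterns are just one possible way of writing each search:

```r
library(stringr)

# Hypothetical example vector of words (not taken from the text above)
example_words <- c("Chaos", "world", "the", "origin", "of", "Greeks", "elements")

# Words of exactly five letters: ^ and $ anchor the match to the whole string
five_letters <- str_subset(example_words, "^[A-Za-z]{5}$")

# Words that start with an uppercase letter
capitalised <- str_subset(example_words, "^[A-Z]")

# Words ending in "s"
ends_in_s <- str_subset(example_words, "s$")

five_letters
capitalised
ends_in_s
```

Without the anchors `^` and `$`, a pattern matches anywhere in the string, so `"[A-Za-z]{5}"` would also match longer words such as "origin".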

## EXERCISE 4: SIMPLE REGEX

Use your vector of sentences from the previous exercise.

1. Use `str_subset()` to extract sentences containing either the word "justice" or "people".