## Texts as vectors

"Vectors" in R are collections of objects of the same class.
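
A minimal sketch of what "same class" means in practice, using made-up values rather than anything from the text below: `c()` builds a vector, and mixing types makes R convert every element to a common class.

```r
# Hypothetical values, chosen only for illustration
words <- c("gods", "and", "men")
class(words)            # "character"

# Mixing strings, numbers and logicals: every element is coerced to character
mixed <- c("gods", 8, TRUE)
class(mixed)            # "character"
mixed                   # "gods" "8" "TRUE"
```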

Texts can be thought of as vectors in a number of ways:
1. A collection of texts
2. A collection of words in a text
3. A collection of sentences in a text
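
As a rough sketch, the three views above could look like this for two short, made-up texts. The object names `texts`, `words` and `sentences` are illustrative only; `str_split()` comes from the `stringr` package used throughout this section.

```r
library(stringr)

# Hypothetical example strings, not taken from the text used below
texts <- c("Zeus ruled the sky. Hera was his queen.",
           "Poseidon ruled the sea.")

# 1. A collection of texts: one element per text
length(texts)

# 2. A collection of words in a text: split the first text at whitespace
words <- str_split(texts[1], pattern = "\\s")[[1]]
words

# 3. A collection of sentences in a text: split the first text at ". "
sentences <- str_split(texts[1], pattern = "\\. ")[[1]]
sentences
```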

Which of these views is most useful depends on what we want to find out about the text.

Let's start by thinking of a text as a series of single words:

In [26]:
text <- "The Greeks believed that the mental qualifications of their gods were of a
much higher order than those of men, but nevertheless, as we shall see,
they were not considered to be exempt from human passions, and we
frequently behold them actuated by revenge, deceit, and jealousy. They,
however, always punish the evil-doer, and visit with dire calamities any
impious mortal who dares to neglect their worship or despise their rites.
We often hear of them visiting mankind and partaking of their hospitality,
and not unfrequently both gods and goddesses become attached to
mortals, with whom they unite themselves, the offspring of these unions
being called heroes or demi-gods, who were usually renowned for their great
strength and courage. But although there were so many points of resemblance
between gods and men, there remained the one great characteristic
distinction, viz., that the gods enjoyed immortality. Still, they were not
invulnerable, and we often hear of them being wounded, and suffering in
consequence such exquisite torture that they have earnestly prayed to be
deprived of their privilege of immortality."

In [27]:
library(stringr)
text_words <- str_split(text, pattern = "\\s")  # split at every whitespace character; "\\s" is the regex class for whitespace (the backslash is doubled inside an R string)

Some functions from `stringr`, including `str_split()`, return a list by default. A list is a collection of objects that may be of different types: strings, numbers, vectors or even other lists (a list of lists).

This is done so that the function can be used on a vector containing several strings: the list holds one element per input string.

We can convert the list to a vector simply by using the function `unlist()` or by referring to the first element in the list with `[[1]]` (which is the vector of words):

In [29]:
text_words <- unlist(text_words)
#text_words <- text_words[[1]] # Alternative (use instead of unlist(), not after it)
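
To see why the result is a list, and how the two conversions differ once there is more than one input string, here is a small sketch. The vector `snippets` is a hypothetical example and not part of the text above.

```r
library(stringr)

# Two hypothetical strings, used only for illustration
snippets <- c("gods and men", "heroes or demi-gods")

split_words <- str_split(snippets, pattern = "\\s")

class(split_words)      # "list"
length(split_words)     # 2: one list element per input string

# unlist() flattens every element into one vector (all six words)
all_words <- unlist(split_words)

# [[1]] keeps only the words of the first string (three words)
first_words <- split_words[[1]]
```

With a single input string, as in the cell above, both approaches give the same vector.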

The text is now subsettable - each word with its own index (subset using `[]`):

In [30]:
text_words[5]
text_words[22]

This allows us to perform counts and other summaries (which we will get back to).
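
A small preview of subsetting and simple summaries, shown on a short hypothetical word vector rather than the full text. All functions here are base R.

```r
# Hypothetical word vector, used only for illustration
words <- c("the", "greeks", "believed", "that", "the", "mental")

words[2:4]        # a range of positions
words[c(1, 6)]    # several specific positions
words[-1]         # everything except the first word

length(words)     # number of words

# Frequency of each distinct word, most frequent first
word_counts <- sort(table(words), decreasing = TRUE)
word_counts
```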

## Sentences as vectors

The functions in `stringr` work both on single strings and on collections of strings (like vectors). That way the functions can easily be applied over a collection of texts, whether it is stored as a vector, a list or a column in a data frame.

As an example, we will split the text above into sentence-like pieces (splitting at every comma or period that is followed by a space):

In [45]:
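text <- str_replace_all(text, "\\n", " ")                # replace line breaks with spaces
text_lower <- str_to_lower(text)                         # convert to lower case
text_sent <- str_split(text_lower, pattern = ", |\\. ")  # split at ", " or ". "
text_sent <- unlist(text_sent)                           # list -> character vector
text_sent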

Using a function from `stringr` with a vector of strings as input will automatically apply the function to each string in the vector:

In [46]:
# Looking up word in each sentence
str_detect(text_sent, "god")

In [47]:
# Counting word in each sentence
str_count(text_sent, "god")
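
Because the results are ordinary vectors, they can be summarised directly. The sketch below uses three hypothetical clauses (not the full `text_sent`) and also shows that a plain pattern like "god" matches inside longer words such as "goddesses".

```r
library(stringr)

# Hypothetical clauses, used only for illustration
clauses <- c("the gods enjoyed immortality",
             "both gods and goddesses",
             "renowned for their strength")

has_god <- str_detect(clauses, "god")   # TRUE TRUE FALSE
n_god   <- str_count(clauses, "god")    # 1 2 0

sum(has_god)   # number of clauses containing "god": 2
sum(n_god)     # total number of matches: 3

# "\\b" marks a word boundary, so only the whole word "gods" is counted
n_gods_word <- str_count(clauses, "\\bgods\\b")   # 1 1 0
```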

## Subsetting text

The function `str_subset()` can be used to subset a collection of strings, returning only the strings that contain the provided pattern:

In [48]:
# Return sentences containing "god"
str_subset(text_sent, "god")

In [49]:
# Return sentences that start with a "t"
str_subset(text_sent, "^t")
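
One way to think about `str_subset()` is as a shortcut for subsetting with `str_detect()`. The sketch below illustrates this on three hypothetical clauses; the object names are illustrative only.

```r
library(stringr)

# Hypothetical clauses, used only for illustration
clauses <- c("the gods enjoyed immortality",
             "both gods and goddesses",
             "they were not invulnerable")

with_god  <- str_subset(clauses, "god")
same_thing <- clauses[str_detect(clauses, "god")]
identical(with_god, same_thing)   # TRUE

# Clauses that do NOT contain the pattern, by negating the logical vector
without_god <- clauses[!str_detect(clauses, "god")]

# "^" anchors the pattern to the start of each string
starts_with_t <- str_subset(clauses, "^t")
```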

## EXERCISE: STRINGS AS VECTORS

1. Convert your two text snippets from earlier to a vector of sentences.

    a. Put both texts into a vector using `c()`. Assign to an object.
    
    b. Split the texts into sentences using `str_split(texts, pattern = ", |\\. ")`. Assign to an object.
    
    c. Unlist the object to convert to a vector using `unlist()`. Assign to the same or a new object.
    
    
2. Use `str_subset()` along with a regular expression to locate sentences containing a word that starts with an upper-case letter.

**Bonus**
- Can you write the regular expression in a way that avoids matching words that are capitalised only because they follow a period?