# R Objects: Vectors
So far we have looked at R objects containing single values: a number, a word, a text or a boolean.

R has different ways of storing a series of elements (values, objects, etc.). One of the more common is the *vector*.

A vector is a grouping of elements. Vectors are created using the function `c`:

In [20]:
my_vec <- c(1, 9, 7, 3)
my_vec

Elements can be added to the vector with the function `append`, or by using `c` to create a new vector that contains the existing vector together with the new elements:

In [21]:
my_vec2 <- append(my_vec, 22)
my_vec2

In [22]:
my_vec3 <- c(my_vec, 22)
my_vec3
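
As a small sketch of the two approaches above, the example below rebuilds the vector and checks that `append` and `c` give the same result. The `after` argument of `append` is not used elsewhere in this text; it is shown only to illustrate inserting at a position other than the end. The object names `with_append`, `with_c` and `inserted` are made up for this illustration.

```r
# Rebuild the example vector from above
my_vec <- c(1, 9, 7, 3)

# Both calls add 22 to the end of the vector
with_append <- append(my_vec, 22)
with_c <- c(my_vec, 22)
identical(with_append, with_c)  # TRUE

# append() can also insert at a given position: here after the second element
inserted <- append(my_vec, 22, after = 2)
inserted                        # 1 9 22 7 3
```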

Note that a vector is not a class. The class it takes depends on the elements in the vector. This means that all elements in a vector have to be the same class (or will be coerced to the same class).

In [23]:
class(my_vec)
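
The coercion mentioned above can be seen in a short sketch. The object `mixed` is a made-up example and not part of the earlier material.

```r
# All numbers: the vector is numeric
class(c(1, 9, 7, 3))      # "numeric"

# Mixing numbers and text: everything is coerced to character
mixed <- c(1, "nine", 7)
mixed                     # "1" "nine" "7"
class(mixed)              # "character"

# Mixing booleans and numbers: TRUE and FALSE become 1 and 0
class(c(TRUE, 2, FALSE))  # "numeric"
```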

## Texts as vectors
Texts can be thought of as vectors in a number of ways:
1. A collection of texts
2. A collection of words in a text
3. A collection of sentences in a text

Depending on what we are trying to figure out with text, R can be used in a number of ways.

Let's start by thinking of a text as a series of single words:

In [24]:
# Split at every whitespace character; "\\s" is the regex class for whitespace
# (the backslash is doubled because it must be escaped inside an R string)
text_words <- unlist(strsplit(text, split = "\\s"))
text_words

The text is now a vector, each word with its own index (subset using `[]`):

In [25]:
text_words[5]
text_words[22]

This allows us to perform counts and other summaries (which we will get back to).
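
The steps above can be tried on a short, self-contained sentence. This is only a sketch: `sample_text` and `sample_words` are hypothetical objects standing in for `text` and `text_words`, and `table` is one of several base R ways to count elements.

```r
# Hypothetical sample text, used here instead of the `text` object from earlier
sample_text <- "In the beginning the world was without form"

# strsplit() returns a list, so unlist() is needed to get a plain vector
class(strsplit(sample_text, split = "\\s"))   # "list"
sample_words <- unlist(strsplit(sample_text, split = "\\s"))

length(sample_words)          # 8 words
sample_words[5]               # "world"
sample_words[c(1, 5)]         # "In" "world"
sample_words[2:3]             # "the" "beginning"

# Simple counts
sum(sample_words == "the")    # 2
table(sample_words)           # frequency of every distinct word
```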

# R Libraries - Packages 

R being open source means that a lot of developers are constantly adding new functions to R.
These new functions are distributed as *R packages* that can be installed into your R library and then loaded when needed.

All the commands you have been using so far have been part of the `base` package (ships with R). 

Packages are installed using (name of package *with* quotes!): 

`install.packages('packagename')` 

The functions from the package are loaded into the environment using (name of package *without* quotes!):
    
`library(packagename)` 

Information for installed packages can be found using (name of package *with* quotes!):

`library(help = 'packagename')` 
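
One possible pattern, sketched here under the assumption that you want a script to fail gently, is to check whether a package is installed before loading it. `requireNamespace` returns `TRUE` or `FALSE` without loading the functions into the environment, and `character.only = TRUE` tells `library` that the package name is given as a string. The object `pkg` is a made-up name for this illustration.

```r
# Name of the package to load (as a string)
pkg <- "stringr"

if (requireNamespace(pkg, quietly = TRUE)) {
  # The package is installed: load it
  library(pkg, character.only = TRUE)
} else {
  # The package is missing: tell the user how to install it
  message("Package '", pkg, "' is not installed; run install.packages('", pkg, "') first.")
}
```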

The many additional functionalities R gains from other packages are a huge benefit.

Some notes of caution should be mentioned:
- Packages are added constantly and there is no long, centralized review process. Make sure to use packages with good documentation and documented uses.
- Packages can be developed by anyone, which means that the developers can also stop maintaining them.
- Packages can use the same name for functions. This can cause confusion and sometimes conflict.
- Packages are developed with different workflows in mind, so compatibility between packages can vary.
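
When two packages use the same function name, one way to avoid confusion is to state the package explicitly with `package::function`. The sketch below uses only packages that ship with R so that it runs without installing anything; the object names are made up for this illustration.

```r
# Call a function from a specific package with package::function
explicit_median <- stats::median(c(1, 9, 7, 3))
explicit_median   # 5

explicit_upper <- base::toupper("zeus")
explicit_upper    # "ZEUS"
```

The same syntax works for installed packages that have not been loaded with `library`.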

# Working with strings: The `stringr` package
To ensure a common workflow and package use, we will try to stick to packages from the "Tidyverse" as much as possible: https://www.tidyverse.org/

These packages are all developed with a consistent syntax for the functions and are all well-documented, used and tested.

The packages in the tidyverse are best for working with "tidy data": A data format where each row in a data set is a unique observation and each column is a variable.

The package `stringr` is a tidyverse package for working with strings. Install it with `install.packages('stringr')`.

When installed, load it with `library(stringr)`. Write it into your script. (Good practice is to either write it at the very beginning or just before you are using the commands from the package).

In [26]:
library(stringr)

`stringr` can be used in a variety of different ways.

In [27]:
# Changing case (here to lowercase)
str_to_lower(text)

In [28]:
# Looking up words
str_detect(text, "world")

In [29]:
# Counting matches
str_count(text, "world")

The functions of `stringr` also work on a grouping of elements (like a vector). We can see this when we split the text into sentences and then use the same functions:

In [30]:
# Splitting text into elements; here separating at commas
# unlist is used to coerce to a vector; otherwise it is returned as a list
text_sent <- str_split(text, pattern = ",") %>%
  unlist()

In [31]:
# Looking up word in each sentence
str_detect(text_sent, "world")

In [32]:
# Counting word in each sentence
str_count(text_sent, "world")
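
To see the element-by-element behaviour on a text you can read in full, here is a minimal sketch. `sample_text` and `sample_parts` are hypothetical objects, not part of the earlier material. Note that splitting at commas leaves a leading space on most elements; `str_trim` from `stringr` removes it.

```r
library(stringr)

# Hypothetical sample text, used here instead of the `text` object from earlier
sample_text <- "Hello world, this is a small world, goodbye"

# Split at commas and flatten the resulting list to a character vector
sample_parts <- unlist(str_split(sample_text, pattern = ","))
sample_parts                       # "Hello world" " this is a small world" " goodbye"

# Each function returns one result per element
str_detect(sample_parts, "world")  # TRUE TRUE FALSE
str_count(sample_parts, "world")   # 1 1 0

# Remove the leading spaces left over from the split
str_trim(sample_parts)             # "Hello world" "this is a small world" "goodbye"
```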

### A note on indexing and booleans
When inputting boolean values as an index, R will only return the elements at the positions where the index is `TRUE`.

This means that we can use commands like `str_detect` to only return text elements containing specific words.

In [33]:
text_sent[str_detect(text_sent, "world")]

`str_subset` combines detecting and subsetting in one function:

In [34]:
str_subset(text_sent, "world")
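
Boolean indexing is not specific to text. As a small sketch in base R, the same idea works on a numeric vector; `nums`, `keep` and `is_large` are made-up names for this illustration.

```r
nums <- c(1, 9, 7, 3)

# A boolean index written by hand: keep the first and third elements
keep <- c(TRUE, FALSE, TRUE, FALSE)
nums[keep]               # 1 7

# A boolean index created by a comparison
is_large <- nums > 5
is_large                 # FALSE TRUE TRUE FALSE
nums[is_large]           # 9 7
```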

# EXERCISE 3: WORKING WITH VECTORS

You should still have the objects `mytext1` and `mytext2`. In the following, you will create a vector containing the sentences of the texts and then look up certain words.

Make sure you have the package `stringr` installed and loaded.

1. Convert the two text snippets from earlier to a vector of sentences.

    a. Put both texts into a vector using `c()`. Assign to an object.
    
    b. Split the texts into sentences using `str_split(texts, pattern = ",")`. Assign to an object.
    
    c. Unlist the object to convert to a vector using `unlist()`. Assign to the same or a new object.
    
    
2. Use `str_detect()` to see which sentences contain the name "Zeus".

In [35]:
mytexts <- c(mytext1, mytext2)
mysents <- str_split(mytexts, pattern = ",")
mysents <- unlist(mysents)

In [78]:
str_detect(mysents, "Zeus")

## Regular expressions
Often when working with text, we have more "fuzzy" patterns we want to search for. 

Regular expressions are a common language for describing text patterns.

`stringr` supports regular expressions as pattern arguments.

Possible uses:
- Finding words of a certain length
- Finding sentences containing a certain word or pattern
- Finding words following a certain pattern
- and so on.

Regular expressions (or "regex") can be used in most functions in `stringr`.

The function `str_subset()` creates a subset of elements with the strings containing the pattern.

In [36]:
text_sent

In [79]:
# Return sentences containing either "origin" or "before"
str_subset(text_sent, "origin|before")

In [38]:
# Return sentences containing an uppercase letter.
str_subset(text_sent, "[A-Z]")
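
A few of the uses listed above can be sketched on a small vector of words. `regex_words` is a hypothetical object; in the patterns, `^` anchors the start of the string, `$` anchors the end, `.` matches any character and `{5}` means "exactly five times".

```r
library(stringr)

# Hypothetical vector of words for illustration
regex_words <- c("in", "the", "beginning", "world", "Zeus")

# Words of exactly five characters
str_subset(regex_words, "^.{5}$")   # "world"

# Words starting with an uppercase letter
str_subset(regex_words, "^[A-Z]")   # "Zeus"

# Words ending in "ing"
str_subset(regex_words, "ing$")     # "beginning"
```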

# EXERCISE 4: SIMPLE REGEX

Use your vector of sentences from the previous exercise.

1. Use `str_subset()` to extract sentences containing either "Zeus" or "Athene".

In [82]:
str_subset(mysents, "Zeus|Athene")