# Refresher?
If including the refresher: go through part-1-refresher in the code folder, then give the exercise below. Otherwise, skip to the cell on data frames.

# EXERCISE 0: REFRESHER

1. Load the package `stringr` into your environment (if not installed, install with `install.packages('stringr')`).
2. Assign the paragraph below to an object using `<-` (name the object how you see fit):

"CALLISTO, the daughter of Lycaon, king of Arcadia, was a huntress in the train of Artemis, devoted to the pleasures of the chase, who had made a vow never to marry; but Zeus, under the form of the huntress-goddess, succeeded in obtaining her affections. Hera, being extremely jealous of her, changed her into a bear, and caused Artemis (who failed to recognize her attendant under this form) to hunt her in the chase, and put an end to her existence. After her death she was placed by Zeus among the stars as a constellation, under the name of Arctos, or the bear."

3. Check the class of the object using `class()`. What class is it?
4. Create a vector of sentences from the paragraph using `str_split`. Split at `.` and `,` (`pattern = ",|\\."`). You can consult the help file with `?str_split`.
    - Note that `str_split` returns a list. Use `unlist()` to coerce it to a vector.
5. Check the length of the vector using `length()`. How many sentences?
6. Find the sentences containing "Artemis" using `str_subset()`.

# R Objects: Data Frames
A "data frame" is the R-equivalent of a spreadsheet (a table of rows and columns). It is one of the most useful storage structures for data analysis in R.

Typically, each row is an individual observation and each column is a variable (this layout is known as tidy data).

Data frames are a useful format whether you are working with text, numbers, dates or other kinds of data.
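
As a minimal sketch of the row/column idea, the block below builds a small data frame by hand with base R's `data.frame()`. The object name `myth_df` and its columns are chosen only for illustration; the three word counts are copied from the results shown later in these slides.

```r
# A small data frame: each row is one observation (a word),
# each column is one variable (the word itself and how often it occurs)
myth_df <- data.frame(
  word = c("king", "zeus", "son"),
  n = c(232, 216, 209),
  stringsAsFactors = FALSE # keep the words as character strings
)

myth_df        # print the whole table
class(myth_df) # "data.frame"
dim(myth_df)   # number of rows, then number of columns
```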

# Text as data frames: The `tidytext` package

`tidytext` is a package for working with texts as data frames.

It provides a lot of simple functions for converting text to tokens (individual text elements).

Combined with other functionality in the tidyverse, it is easy to create simple summaries.

Install the packages `tidytext` and `gutenbergr` (which gives access to the texts from Project Gutenberg: https://www.gutenberg.org/).

If `dplyr` is not already installed, install it as well. Alternatively, install the whole `tidyverse`.

Let's inspect a single text from the Gutenberg project: "Myths and Legends of Ancient Greece and Rome" by E.M. Berens.

The commands below load the packages and download the text as a data frame.

In [42]:
library(tidytext)
library(gutenbergr)
library(dplyr)
library(stringr) # needed later for str_subset() and str_detect()


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [43]:
# Downloading the text "Myths and Legends of Ancient Greece and Rome" by E.M. Berens
text_df <- gutenberg_download(22381)

Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
Using mirror http://aleph.gutenberg.org


## Exploring Data Frames
To get an idea of what the data contains, we can use `head()`:

In [44]:
head(text_df) # Shows the first 6 rows (observations) of the data frame

gutenberg_id,text
22381,_A HAND-BOOK OF MYTHOLOGY._
22381,
22381,* * * * *
22381,
22381,THE
22381,


We can check the names of the columns (the variable names) using `colnames`:

In [45]:
colnames(text_df)

# BREAK

![dog_plant](https://i.pinimg.com/originals/e0/ea/96/e0ea96a82a68cb3699f6aeaa9f1b8275.jpg)

## Working with tidy text data
The data frame does not make a lot of sense to work with in its current state.

In a few simple steps, we can create a new data frame containing counts of individual words.

In [46]:
# Importing stopwords (commonly used words that are mostly void of meaning)
data(stop_words) #the "stop_words" dataset is a part of the tidytext package

In [47]:
# Unnesting tokens - similar to splitting. Splits the text into individual words by default and converts them to lower case
tokens_df <- text_df %>% # The pipe (%>%) passes the object on its left as the first argument to the function on the next line
  unnest_tokens(word, text) %>% # Unnest tokens: first argument is the new column, second is the existing column to split
  anti_join(stop_words) # Removes rows whose word appears in stop_words

Joining, by = "word"


In [48]:
# Print first 6 rows of new data - now a data frame with individual words as rows
head(tokens_df)

gutenberg_id,word
22381,_a
22381,hand
22381,book
22381,mythology
22381,myths
22381,legends
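
To see what the tokenizing step does without downloading a whole book, here is a minimal sketch on a two-line toy data frame. The object names `toy_df` and `toy_tokens` and the two sentences are made up for illustration; the functions are the same ones used above.

```r
library(dplyr)
library(tidytext)

# Toy input: one row per line of text (sentences invented for illustration)
toy_df <- tibble(
  line = 1:2,
  text = c("Zeus was the king of the gods", "Hera was the wife of Zeus")
)

# One row per word, lower-cased; the other columns (here: line) are kept
toy_tokens <- toy_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") # by = "word" names the join column explicitly

toy_tokens
```

Naming the join column with `by = "word"` gives the same result as `anti_join(stop_words)`, but without the message about which column was used for the join.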


The `count` command counts up identical values and returns a data frame. This can then tell us the most common words in the text.

In [49]:
tokens_count <- count(tokens_df, word, sort = TRUE)
head(tokens_count)

word,n
king,232
zeus,216
son,209
gods,199
god,181
heracles,154
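
As a rough illustration of what `count()` does, the sketch below compares it with the longer `group_by()`, `summarise()` and `arrange()` chain on a toy data frame. The names `toy_words`, `counted` and `manual` are hypothetical and only used here.

```r
library(dplyr)

# Toy data: one row per word occurrence
toy_words <- tibble(word = c("zeus", "hera", "zeus", "apollo", "zeus", "hera"))

# count() groups identical values, counts the rows in each group
# and (with sort = TRUE) orders the result by descending count
counted <- count(toy_words, word, sort = TRUE)

# The same steps written out one by one
manual <- toy_words %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

counted
manual
```

Both objects hold the same counts: "zeus" three times, "hera" twice and "apollo" once.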


## Subsetting Data Frames
Specific columns/variables can be called by their name using `$`:

In [50]:
head(tokens_count$word) # First 6 values of the variable "word"

Data frames can also be subset with square brackets, `[row, column]`, using either index numbers or column names:

| Code | Description |
|:-----|:------------|
|`tokens_count[2, 1]` | Row 2, column 1 |
|`tokens_count[2, "n"]` | Row 2, column "n" (column 2) |
|`tokens_count[10, ]` | The entire 10th row |
|`tokens_count[, 2]` | The entire 2nd column |
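
The indexing patterns in the table can be tried on a small stand-in for `tokens_count`. This is only a sketch: `toy_count` is a hypothetical three-row tibble with values copied from the output above. Because `count()` on a tibble returns a tibble, `[, 2]` gives back a one-column data frame; use `[[2]]` or `$n` to get a plain vector.

```r
library(dplyr)

# Small stand-in for tokens_count
toy_count <- tibble(
  word = c("king", "zeus", "son"),
  n = c(232L, 216L, 209L)
)

toy_count[2, 1]   # row 2, column 1 (a 1 x 1 tibble)
toy_count[2, "n"] # row 2, column "n"
toy_count[3, ]    # the entire 3rd row
toy_count[, 2]    # the entire 2nd column, still a tibble
toy_count[[2]]    # the 2nd column as a plain vector
toy_count$n       # the same vector, selected by name
```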

## Operations on a data frame
Each column in a data frame works like a vector. Functions can therefore be used on data frame columns like they can be used on vectors:

In [51]:
head(nchar(tokens_count$word))

In [52]:
str_subset(tokens_count$word, "^w")

## Creating variables
Variables can be added to a data frame using the operator `$` and calling a name not yet used in the data frame:

In [53]:
tokens_count$nchar <- nchar(tokens_count$word)
head(tokens_count)

word,n,nchar
king,232,4
zeus,216,4
son,209,3
gods,199,4
god,181,3
heracles,154,8


Alternatively, do it the tidyverse way with `mutate()`:

In [54]:
tokens_count <- mutate(tokens_count, nchar = nchar(word))
head(tokens_count)

word,n,nchar
king,232,4
zeus,216,4
son,209,3
gods,199,4
god,181,3
heracles,154,8


## Subsetting the tidyverse way

The command `filter()` returns only the rows that meet a given condition.

In [55]:
# Filtering rows with word longer than 4 characters
tokens_filter <- tokens_count %>%
    filter(nchar > 4)

head(tokens_filter)

word,n,nchar
heracles,154,8
called,152,6
apollo,133,6
father,126,6
beautiful,120,9
daughter,112,8


`filter()` keeps all rows where the evaluated statement returns `TRUE`. This means we can use other commands, like `str_detect()`, inside the filter.

In [56]:
# Filtering rows with words starting with a "w" - "^" is the regular expression anchor for "starts with"
tokens_filter <- tokens_count %>%
    filter(str_detect(word, "^w"))

head(tokens_filter)

word,n,nchar
world,96,5
whilst,94,6
wife,73,4
war,67,3
worship,54,7
wine,43,4
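
To make the `TRUE`/`FALSE` logic visible, here is a small sketch that first prints the logical vector produced by `str_detect()` and then uses it to filter. The tibble `toy_count` and the vector `starts_with_w` are illustrative names; the counts are taken from the outputs above.

```r
library(dplyr)
library(stringr)

toy_count <- tibble(
  word = c("world", "zeus", "wife", "apollo"),
  n = c(96L, 216L, 73L, 133L)
)

# One TRUE or FALSE per row: does the word start with "w"?
starts_with_w <- str_detect(toy_count$word, "^w")
starts_with_w # TRUE FALSE TRUE FALSE

# filter() keeps the rows where the condition is TRUE
filtered <- filter(toy_count, str_detect(word, "^w"))
filtered

# Base R equivalent: use the logical vector as a row index
toy_count[starts_with_w, ]
```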


# The R Help Files
All R functions and commands are thoroughly documented so you do not have to remember what every function does or even how it should be written.

Every function and command in R has its own help file. The help file describes how to use the various functions and commands.

The help file for a specific function is accessed using the operator `?` (also works for the built-in datasets):

In [57]:
?sum

When in doubt, just look up the help file.

# EXERCISE 5: SIMPLE TEXT MINING
You will be repeating a lot of the same steps as you just saw. The goal is to figure out what names are used the most in the text "Pagan and Christian Rome" by Rodolfo Lanciani.

You can download these slides at: https://forskning.moodle.aau.dk/course/view.php?id=4 (select "CAS" and login), if you want to go back and look at the previous slides.

1. Load the text "Pagan and Christian Rome" from gutenberg into an object using `gutenberg_download(22153)`
2. Use `unnest_tokens()` to convert the text to word tokens *without* converting to lower case (consult the help file with `?unnest_tokens` to see which option to change)
3. Use `count` to create a data frame with the word tokens counted: `count(df, word, sort = TRUE)`
4. Combine `filter()` and `str_detect()` to only keep words starting with an uppercase letter (use the pattern "^[A-Z]")
5. Determine the most mentioned names (if you sorted the data, you can just print the top rows with `head()`)

**Bonus (for a better result)**
1. Try using `filter()` and `nchar()` to only keep words longer than 3 characters.

In [58]:
paga_text <- gutenberg_download(22153)

In [59]:
paga_tokens <- paga_text %>%
    unnest_tokens(word, text, to_lower = FALSE) %>%
    count(word, sort = TRUE) %>%
    filter(str_detect(word, "^[A-Z]"))

head(paga_tokens)

word,n
The,1149
S,608
Rome,326
I,285
In,188
It,186


In [60]:
paga_tokens <- paga_text %>%
    unnest_tokens(word, text, to_lower = FALSE) %>%
    count(word, sort = TRUE) %>%
    filter(str_detect(word, "^[A-Z]")) %>%
    filter(nchar(word) > 3)

head(paga_tokens)

word,n
Rome,326
Christian,135
Pope,135
Illustration,124
Roman,113
This,110
