# Introduction
## What is R and why use it?

R is a free software environment with its own programming language. 

It is especially suited for statistical analysis and graphical outputs.

R is popular as a data science tool and it is open source, which has led to a vast range of applications.

R can work with a large variety of data formats and is (with a few add-ons) compatible with data from other software solutions (Excel, SPSS, SAS, STATA).

## Why use R for History?

<u>Data versatility:</u> It can work with numerical data, text data, dates etc.

<u>Complete workflow in one environment:</u> From data loading to cleaning to output (and even reporting, if you want)

<u>Lots of help to get:</u> Very large community offering their help via forums, free educational material, blogs and so on.

## Content of the R introduction

- The R language
- The RStudio environment
- Constructing variables and objects
- Working with strings
- Simple text mining with tables (dataframes)
- Using the help command
- Reading data from Excel
- Working with dates

The introduction will combine presenting R and R code in Jupyter Notebook while demonstrating in RStudio. You are encouraged to work and write in RStudio during the workshop.

*Please write along as we go through the different examples.*

*Download the slides via Moodle to go back during the workshop, if necessary: https://forskning.moodle.aau.dk/course/view.php?id=4 (select "CAS" user and login)*

## The RStudio environment

During the workshop (and the other R workshops in CALDISS), we will be working with RStudio.

RStudio is an IDE (Integrated Development Environment) for R. It makes for a nicer workspace.

<https://www.rstudio.com/products/rstudio/download/>

# The R language
R has its own programming language. R works by you writing lines of code in that language (writing commands) and R interpreting that code (running commands).

R (and RStudio) has a limited user interface meaning almost all functionality (statistics, plots, simulations etc.) must be executed using code in the R language.

A programming language is a lot like any other language (except not being very dialogical): 
- You can only expect to be understood, if you speak the same language = R will only execute code written correctly
- You are contributing to the language by speaking it = Create your own functions in R that R will understand

## R as a calculator
So what does it mean that R interprets our code?
It means that you tell R to do something by writing a command and R will do that (if R can understand you).

R, for example, understands mathematical expressions:

In [1]:
2 + 5

In [2]:
0.37 * 256

I can "ask" R for different results and outputs using functions:

In [3]:
nchar("when in Rome, do as Romans do")

In [4]:
toupper("when in Rome, do as Romans do")

When you run a command that R doesn't know, R will throw an error:

In [5]:
finish_sentece("when in Rome")

ERROR: Error in finish_sentece("when in Rome"): could not find function "finish_sentece"


The commands in R are virtually endless, as you are able to create your own:

In [6]:
finish_sentence <- function(text) {
    paste0(text, ", do as Romans do")
}

finish_sentence("when in Rome")

# The R Language: Objects and Functions
R works by storing values in "objects". These objects can then be used in various commands like calculating differences, saving a file, creating a graph and so on. To simplify a bit: An object is some kind of stored value and a function is something that can manipulate a stored value (which then creates a new object). 

Most of R can be boiled down to these 3 basic steps:

1. Assign values to an object
2. Make sure R interprets the object correctly (its class)
3. Perform some operation or manipulation on the object using a function

Translated to data analysis, the steps would (in general terms) look as follows:

1. Load our dataset: `data <- read.csv("my_datafile.csv")`
2. Check that the variables are the correct class: `class(data$age)`
3. Perform some kind of analysis: `mean(data$age)`

The gap between these steps of course varies greatly.
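
The three steps can be sketched without an external file. The small data frame below is made up for illustration and stands in for `my_datafile.csv`; the column names `name` and `age` and all values are assumptions.

```r
# Step 1: assign values to an object
# (a made-up data frame standing in for read.csv("my_datafile.csv"))
data <- data.frame(
  name = c("Anna", "Bo", "Carl"),
  age  = c("34", "51", "27"),      # ages typed as text on purpose
  stringsAsFactors = FALSE
)

# Step 2: check how R interprets the variable
class(data$age)                    # "character", so it cannot be averaged yet

# Coerce the variable to a numeric class
data$age <- as.numeric(data$age)
class(data$age)                    # "numeric"

# Step 3: perform the analysis
mean(data$age)
```

In real data the second step is often the longest, since a single stray text value in a column is enough to make R read the whole column as text.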

## Objects
A lot of writing in R is about defining objects: A name to use to call up stored data.

Objects can be a lot of things: 
- a word
- a text
- a number
- a series of numbers
- a dataset 
- a corpus of texts
- a URL
- a formula
- a result 
- a filepath
- and so on...

When an object is defined, it is available in the current working space (or environment).

This makes it possible to store and work with a variety of information simultaneously.

### Defining objects
Objects are defined using the `<-` operator:

In [7]:
a <- 2 + 5
a

In [8]:
b <- 'Rome'
b

Using `' '` or `" "` denotes that the code should be read as text.

Objects with text (known in programming as `strings`) can be as long as you like.

In [9]:
text <- "The ancient Greeks had several different theories with regard to the origin of the world, but the generally accepted notion was that before this world came into existence, there was in its place a confused mass of shapeless elements called Chaos. "

## Functions

When an object is created, we can use functions on it.

Most functions are written in the syntax of `function(object, option = something)`. A lot of functions only need the object as an argument.

In [10]:
toupper(text) #Convert to uppercase

In [11]:
nchar(text) #Number of characters

Others take several arguments:

In [12]:
gsub("world", "cheese", text) #Pattern replacement

### Naming objects
Objects can be named almost anything but a good rule of thumb is to use names that are indicative of what the object contains.

#### Restrictions for naming objects
- Most special characters are not allowed: `/`, `?`, `*`, `+` and so on (most characters mean something to R and will be read as an expression)
- Names that already exist in R should be avoided (your object will mask the existing function/object in the environment)

#### Good naming conventions 
- Using '`_`': `my_object`, `room_number`

or:

- Capitalize each word except the first: `myObject`, `roomNumber`

# EXERCISE 1: DEFINING OBJECTS

Below are two text snippets from the book "Myths and Legends of Ancient Greece and Rome" by E.M. Berens.

<u>Snippet 1:</u>

"Themis, who has already been alluded to as the wife of Zeus, was the daughter of Cronus and Rhea, and personified those divine laws of justice and order by means of which the well-being and morality of communities are regulated. She presided over the assemblies of the people and the laws of hospitality."

<u>Snippet 2:</u>

"Athene was universally worshipped throughout Greece, but was regarded with special veneration by the Athenians, she being the guardian deity of Athens. Her most celebrated temple was the Parthenon, which stood on the Acropolis at Athens, and contained her world-renowned statue by Phidias, which ranks second only to that of Zeus by the same great artist."

**1.** Assign each snippet to its own object (make up your own object names or use `mytext1` and `mytext2`).

**2.** The function `nchar()` returns the number of characters. Determine which text snippet has the most characters.

In [13]:
mytext1 <- "Themis, who has already been alluded to as the wife of Zeus, was the daughter of Cronus and Rhea, and personified those divine laws of justice and order by means of which the well-being and morality of communities are regulated. She presided over the assemblies of the people and the laws of hospitality."
mytext2 <- "Athene was universally worshipped throughout Greece, but was regarded with special veneration by the Athenians, she being the guardian deity of Athens. Her most celebrated temple was the Parthenon, which stood on the Acropolis at Athens, and contained her world-renowned statue by Phidias, which ranks second only to that of Zeus by the same great artist."

In [14]:
nchar(mytext1)
nchar(mytext2)

nchar(mytext1) - nchar(mytext2)

# Different types of objects (classes)
R distinguishes between different types of objects.

An object is stored with a *class*. The class denotes what type of object it is and affects what operations are possible.

## Numeric and character classes
As you work with R, you will encounter a lot of different classes. For now we will be focusing on two of the more common ones:
- Numeric classes
- Character classes

Numbers are automatically stored as a numeric class (or one of the variants: double, integer etc.).

When using `''` or `""` around the information to be stored in the object, R will interpret that as text; meaning it will be stored as a character class. 

*Numbers enclosed in `''` or `""` are therefore stored as a character class, as R interprets it as text!*

R has to be told that something is text as R would otherwise interpret it as an object.

In [15]:
mytext3 <- rome

ERROR: Error in eval(expr, envir, enclos): object 'rome' not found


## Coercing classes
The class of an object can be examined with `class(object)`.

Objects can be coerced with specific functions:

- Coerce to character class:`as.character(object)`
- Coerce to numeric class: `as.numeric(object)`

R will always try to "guess" the class. If R guesses wrong, you can tell R what class it should be (if possible).
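
A minimal sketch of checking and coercing a class. The object names `year_text` and `year_num` and the value are made up for illustration.

```r
# A number written in quotes is stored as text
year_text <- "1712"
class(year_text)                   # "character"

# Coerce to a numeric class so that it can be used in calculations
year_num <- as.numeric(year_text)
class(year_num)                    # "numeric"
year_num + 4                       # 1716

# Coercing back to a character class
as.character(year_num)             # "1712"
```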

# R scripts: For reproducibility! 
Script files are text files containing code that R can interpret.

It is your "analysis recipe" showing what you have done as well as allowing you to re-run commands easily.

Always make a habit of writing your commands into a script, when you have the command figured out.

- `#` can be used for comments (skipped when run)
- `Ctrl` + `Enter`: Runs the current line or selection
- `Ctrl` + `Alt` + `R`: Runs the whole script

# EXERCISE 2: CLASSES

1. Assign the number of characters of text snippet 1 (`nchar(mytext1)`) to the object `text1_nc`.
2. Check the classes of your object `mytext1` and `text1_nc` with `class()`. What are they?
3. Try changing the class of `text1_nc` to a character class with `as.character()`. Is it possible?
4. Try changing the class of `mytext1` to a numeric class with `as.numeric()`. Is it possible?

**Bonus**
1. Assign the number of characters of text snippet 2 to another object
2. Test if the number of characters of the two texts are the same with the operator `==`:
    - `text1_nc == text2_nc`
3. Assign the test above to the object `mytest`
4. Check the class of `mytest`. What class is it?

In [16]:
text1_nc <- nchar(mytext1)
text2_nc <- nchar(mytext2)

mytest <- text1_nc == text2_nc

In [17]:
as.character(text1_nc)

In [18]:
as.numeric(mytext1)

"NAs introduced by coercion"

In [19]:
class(mytest)

# The logical class
Logicals are *boolean* objects meaning they will either have the value `TRUE` or `FALSE`.

When using the following operators (among others), the result will be of the logical class:
- `>`
- `>=`
- `<`
- `<=`
- `==`
- `!=`

Logicals can be used in functions, loops and if-statements to ensure that a certain condition is met before something is run.
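
As a small illustration of a logical used in an if-statement, the sketch below runs different code depending on the length of a word. The names `word`, `is_long` and `result` are made up for this example.

```r
word <- "Parthenon"

# A comparison returns a logical value
is_long <- nchar(word) > 4
class(is_long)                     # "logical"

# The code in the first braces only runs when the condition is TRUE
if (is_long) {
  result <- paste(word, "has more than 4 characters")
} else {
  result <- paste(word, "has 4 characters or fewer")
}
result
```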

# BREAK

![cat_window](https://2.bp.blogspot.com/-C8QgO2Yd3ew/TbMiWtd-VEI/AAAAAAACFu8/AhHEIYkfvnU/s1600/cats_chillin_02.jpg)

# R Objects: Vectors
So far we have looked at R objects containing single values: a number, a word, a text or a boolean.

R has different ways of storing a series of elements (values, objects, etc.). One of the more common is the *vector*.

A vector is a grouping of elements. They are created using the function `c`:

In [20]:
my_vec <- c(1, 9, 7, 3)
my_vec

Elements can be added to the vector with the function `append` or by creating a vector containing the vector:

In [21]:
my_vec2 <- append(my_vec, 22)
my_vec2

In [22]:
my_vec3 <- c(my_vec, 22)
my_vec3

Note that a vector is not a class. The class it takes depends on the elements in the vector. This means that all elements in a vector have to be the same class (or will be coerced to the same class).

In [23]:
class(my_vec)
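
To illustrate the coercion, the sketch below mixes numbers and text in one vector (`mixed_vec` is a made-up name). Because text cannot be stored as a number, R stores all elements as text instead.

```r
# Mixing numbers and text in one vector
mixed_vec <- c(1, "Rome", 7)

mixed_vec                          # "1" "Rome" "7"
class(mixed_vec)                   # "character"
```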

## Texts as vectors
Texts can be thought of as vectors in a number of ways:
1. A collection of texts
2. A collection of words in a text
3. A collection of sentences in a text

Depending on what we are trying to figure out with text, R can be used in a number of ways.

Let's start by thinking of a text as a series of single words:

In [24]:
text_words <- unlist(strsplit(text, split = "\\s"))  #split at every whitespace - \\s is the regular expression for a whitespace character
text_words

The text is now a vector, each word with its own index (subset using `[]`):

In [25]:
text_words[5]
text_words[22]

This allows us to perform counts and other summaries (which we will get back to).
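
As a small preview of such counts, here is a sketch in base R on a short made-up sentence. The object names are assumptions, and `table()` is just one of several ways to count.

```r
sentence <- "when in Rome do as Romans do"

# Split the sentence into a vector of single words
words <- unlist(strsplit(sentence, split = "\\s"))
length(words)                      # 7 words in total

# Count how many times each word occurs
word_counts <- table(words)
sort(word_counts, decreasing = TRUE)

# Look up the count for a single word
word_counts[["do"]]                # 2
```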

# R Libraries - Packages 

R being open source means that a lot of developers are constantly adding new functions to R.
These new functions are distributed as *R packages* that can be loaded into the R library.

All the commands you have been using so far have been part of the `base` package (ships with R). 

Packages are installed using (name of package *with* quotes!): 

`install.packages('packagename')` 

The functions from the package are loaded into the environment using (name of package *without* quotes!):
    
`library(packagename)` 

Information for installed packages can be found using (name of package *with* quotes!):

`library(help = 'packagename')` 

The many additional functionalities of R gained from the addition of other packages is a huge benefit.

Some notes of caution should be mentioned:
- Packages are added constantly and there is no centralized and lengthy review process. Make sure to use packages with good documentation and documented uses.
- Packages can be developed by anyone, which means that the developers can also stop maintaining them.
- Packages can use the same name for functions. This can cause confusion and sometimes conflict.
- Packages are developed with different workflows in mind, so compatibility between packages can vary.
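
One way to handle functions sharing a name is to write the package name in front of the function with `::`. The sketch below only uses packages that ship with R (`base` and `stats`), so nothing has to be installed; the vector `my_numbers` is made up for illustration.

```r
my_numbers <- c(1, 9, 7, 3)

# package::function() tells R exactly which package the function comes from
stats::median(my_numbers)          # 5
base::toupper("rome")              # "ROME"
```

The same syntax works for installed packages, which makes it clear to a reader where a function comes from.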

# Working with strings: The `stringr` package
To ensure a common workflow and package use, we will try to stick to packages from the "Tidyverse" as much as possible: https://www.tidyverse.org/

These packages are all developed with a consistent syntax for the functions and are all well-documented, used and tested.

The packages in the tidyverse are best for working with "tidy data": A data format where each row in a data set is a unique observation. 

The package `stringr` is a tidyverse package for working with strings. Install it with `install.packages('stringr')`.

When installed, load it with `library(stringr)`. Write it into your script. (Good practice is to either write it at the very beginning or just before you are using the commands from the package).

In [26]:
library(stringr)

`stringr` can be used in a variety of different ways.

In [27]:
# Changing case (here to lowercase)
str_to_lower(text)

In [28]:
# Looking up words
str_detect(text, "world")

In [29]:
# Counting matches
str_count(text, "world")

The functions of `stringr` also work on a grouping of elements (like a vector). We can see this when we split the text into sentences and then use the same functions.

In [30]:
# Splitting text into elements; here separating at commas
# unlist is used to coerce to a vector; otherwise it is returned as a list
text_sent <- str_split(text, pattern = ",") %>% 
    unlist()

In [31]:
# Looking up word in each sentence
str_detect(text_sent, "world")

In [32]:
# Counting word in each sentence
str_count(text_sent, "world")

### A note on indexing and booleans
When inputting boolean values as an index, R will only return the elements where the value is `TRUE`.

This means that we can use commands like `str_detect` to only return text elements containing specific words.

In [33]:
text_sent[str_detect(text_sent, "world")]

`str_subset` has combined this functionality in one function:

In [34]:
str_subset(text_sent, "world")
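
Indexing with booleans is not specific to text. A minimal sketch in base R with made-up numbers (the names `word_lengths` and `is_long` are assumptions):

```r
word_lengths <- c(4, 8, 3, 9)

# The comparison returns one TRUE/FALSE per element
is_long <- word_lengths > 4
is_long                            # FALSE TRUE FALSE TRUE

# Only the elements where the index is TRUE are returned
word_lengths[is_long]              # 8 9

# TRUE counts as 1, so sum() gives the number of matches
sum(is_long)                       # 2
```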

# EXERCISE 3: WORKING WITH VECTORS

You should still have the objects `mytext1` and `mytext2`. In the following, you will create a vector containing the sentences of the texts and then look up certain words.

Make sure you have the package `stringr` installed and loaded.

1. Convert the two text snippets from earlier to a vector of sentences.

    a. Put both texts into a vector using `c()`. Assign to an object.
    
    b. Split the texts into sentences using `str_split(texts, pattern = ",")`. Assign to an object.
    
    c. Unlist the object to convert to a vector using `unlist()`. Assign to the same or a new object.
    
    
2. Use `str_detect()` to see which sentences contain the name "Zeus". 

In [35]:
mytexts <- c(mytext1, mytext2)
mysents <- str_split(mytexts, pattern = ",")
mysents <- unlist(mysents)

In [78]:
str_detect(mysents, "Zeus")

## Regular expressions
Often when working with text, we have more "fuzzy" patterns we want to search for. 

Regular expressions are a common language for processing text patterns.

`stringr` supports regular expressions as arguments.

Possible uses:
- Finding words of a certain length
- Finding sentences containing a certain word or pattern
- Finding words following a certain pattern
- and so on.

Regular expressions (or "regex") can be used in most functions in `stringr`. 

The function `str_subset()` creates a subset of elements with the strings containing the pattern.

In [36]:
text_sent

In [79]:
# Return sentences containing either "origin" or "before"
str_subset(text_sent, "origin|before")

In [38]:
# Return sentences containing an uppercase.
str_subset(text_sent, "[A-Z]")
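
A few more patterns, sketched here with the base R function `grep()` so the example runs without any packages. With `value = TRUE` it returns the matching elements, much like `str_subset()`. The vector `words` is made up for illustration, and the same simple patterns should also be usable in `stringr`.

```r
words <- c("Zeus", "was", "the", "father", "of", "Athene")

# Words of exactly 3 characters: ^ = start, .{3} = any 3 characters, $ = end
grep("^.{3}$", words, value = TRUE)        # "was" "the"

# Words of 5 characters or more
grep("^.{5,}$", words, value = TRUE)       # "father" "Athene"

# Words ending with an "e"
grep("e$", words, value = TRUE)            # "the" "Athene"
```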

# EXERCISE 4: SIMPLE REGEX

Use your vector of sentences from the previous exercise.

1. Use `str_subset()` to extract sentences containing either "Zeus" or "Athene".

In [82]:
str_subset(mysents, "Zeus|Athene")

# R Objects: Data Frames
A "data frame" is the R-equivalent of a spreadsheet (a table of rows and columns). It is one of the most useful storage structures for data analysis in R.

Typically rows consist of individual observations and the columns of the different variables (tidy data).

Data frames are useful formats regardless of working with text, numbers, dates etc.

# Text as data frames: The `tidytext` package

`tidytext` is a package useful for working with texts as dataframes.

It provides a lot of simple functions for converting text to tokens (individual text elements).

Combined with other functionality in the tidyverse, it is easy to create simple summaries.

Install the packages `tidytext` and `gutenbergr` (contains the texts from the Gutenberg project: https://www.gutenberg.org/).

(Install `dplyr` as well, in case it is not already installed. Alternatively, install `tidyverse`, which includes `dplyr`.)

Let's inspect a single text from the Gutenberg project: "Myths and Legends of Ancient Greece and Rome" by E.M. Berens.

The command below downloads the text as a dataframe.

In [42]:
library(tidytext)
library(gutenbergr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [43]:
# Downloading the text "Myths and Legends of Ancient Greece and Rome" by E.M. Berens
text_df <- gutenberg_download(22381)

Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
Using mirror http://aleph.gutenberg.org


## Exploring Data Frames
To get an idea of what the data contains, we can use `head()`:

In [44]:
head(text_df) #Shows the first 6 rows of the data frame

gutenberg_id,text
22381,_A HAND-BOOK OF MYTHOLOGY._
22381,
22381,* * * * *
22381,
22381,THE
22381,


We can check the names of the columns (the variable names) using `colnames`:

In [45]:
colnames(text_df)

# BREAK

![dog_plant](https://i.pinimg.com/originals/e0/ea/96/e0ea96a82a68cb3699f6aeaa9f1b8275.jpg)

## Working with tidy text data
The data frame does not make a lot of sense to work with in its current state.

We can in a few simple steps create a new data frame containing counts of individual words.

In [46]:
# Importing stopwords (commonly used words that are mostly void of meaning)
data(stop_words) #the "stop_words" dataset is a part of the tidytext package

In [47]:
# Unnesting tokens - similar to splitting. Splits individual words by default and converts to lower case
tokens_df <- text_df %>% # Start from text_df and "pipe" (%>%) it: the result on the left is passed on to the function on the next line
  unnest_tokens(word, text) %>% #unnest tokens. first arg is the new column, second is the old
  anti_join(stop_words) #filters out rows containing stop words

Joining, by = "word"


In [48]:
# Print first 6 rows of new data - now a data frame with individual words as rows
head(tokens_df)

gutenberg_id,word
22381,_a
22381,hand
22381,book
22381,mythology
22381,myths
22381,legends


The `count` command counts up identical values and returns a data frame. This can then tell us the most common words in the text.

In [49]:
tokens_count <- count(tokens_df, word, sort = TRUE)
head(tokens_count)

word,n
king,232
zeus,216
son,209
gods,199
god,181
heracles,154


## Subsetting Data Frames
Specific columns/variables can be called by their name using `$`:

In [50]:
head(tokens_count$word) #First 6 values of the variable "word"

Data frames can also be subset via index: `[]` - either by index or column names:

| Code | Description |
|:-----|:------------|
|`tokens_count[2, 1]` | Row 2, column 1 |
|`tokens_count[2, "n"]` | Row 2, column "n" (column 2) |
|`tokens_count[10, ]` | The entire 10th row |
|`tokens_count[, 2]` | The entire 2nd column |
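
The index notation can be tried on a small data frame built by hand. The sketch below assumes a plain base R `data.frame` called `tokens_small` (a made-up name) holding the three most common words from the counts above. Data frames created by tidyverse functions (tibbles) may return a small data frame rather than a single value or vector for some of these calls.

```r
tokens_small <- data.frame(
  word = c("king", "zeus", "son"),
  n    = c(232, 216, 209),
  stringsAsFactors = FALSE
)

tokens_small[2, 1]                 # row 2, column 1: "zeus"
tokens_small[2, "n"]               # row 2, column "n": 216
tokens_small[3, ]                  # the entire 3rd row
tokens_small[, 2]                  # the entire 2nd column: 232 216 209
tokens_small$word                  # the column "word" by name
```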

## Operations on a data frame
Each column in a data frame works like a vector. Functions can therefore be used on data frame columns like they can be used on vectors:

In [51]:
head(nchar(tokens_count$word))

In [52]:
str_subset(tokens_count$word, "^w")

## Creating variables
Variables can be added to a data frame using the operator `$` and calling a name not yet used in the data frame:

In [53]:
tokens_count$nchar <- nchar(tokens_count$word)
head(tokens_count)

word,n,nchar
king,232,4
zeus,216,4
son,209,3
gods,199,4
god,181,3
heracles,154,8


Alternatively, do it the tidyverse way with mutate:

In [54]:
tokens_count <- mutate(tokens_count, nchar = nchar(word))
head(tokens_count)

word,n,nchar
king,232,4
zeus,216,4
son,209,3
gods,199,4
god,181,3
heracles,154,8


## Subsetting the tidyverse way

The command `filter()` can be used to only return rows meeting certain criteria.

In [55]:
# Filtering rows with word longer than 4 characters
tokens_filter <- tokens_count %>%
    filter(nchar > 4)

head(tokens_filter)

word,n,nchar
heracles,154,8
called,152,6
apollo,133,6
father,126,6
beautiful,120,9
daughter,112,8


Filter keeps all rows where the evaluated statement returns `TRUE`. This means we can use other commands like `str_detect` to filter.

In [56]:
# Filtering rows with word starting with a "w" - "^" is regular expression for "starts with"
tokens_filter <- tokens_count %>%
    filter(str_detect(word, "^w"))

head(tokens_filter)

word,n,nchar
world,96,5
whilst,94,6
wife,73,4
war,67,3
worship,54,7
wine,43,4


# The R Help Files
All R functions and commands are thoroughly documented so you do not have to remember what every function does or even how it should be written.

Every function and command in R has its own help file. The help file describes how to use the various functions and commands.

The help file for a specific function is accessed using the operator `?` (also works for the built-in datasets):

In [57]:
?sum

When in doubt, just look up the help file.

# EXERCISE 5: SIMPLE TEXT MINING
You will be repeating a lot of the same steps as you just saw. The goal is to figure out what names are used the most in the text "Pagan and Christian Rome" by Rodolfo Lanciani.

You can download these slides at: https://forskning.moodle.aau.dk/course/view.php?id=4 (select "CAS" and login), if you want to go back and look at the previous slides.

1. Load the text "Pagan and Christian Rome" from gutenberg into an object using `gutenberg_download(22153)`
2. Use `unnest_tokens()` to convert the text to word tokens *without* converting to lower case (consult the help file with `?unnest_tokens` to see what option to change)
3. Use `count` to create a data frame with the word tokens counted: `count(df, word, sort = TRUE)`
4. Combine `filter()` and `str_detect()` to only keep words starting with an uppercase letter (use the pattern "^[A-Z]") 
5. Determine the most mentioned names (if you sorted the data, you can just print the top rows with `head()`)

**Bonus (for a better result)**
1. Try using `filter()` and `nchar()` to only keep words longer than 3 characters.

In [58]:
paga_text <- gutenberg_download(22153)

In [59]:
paga_tokens <- paga_text %>%
    unnest_tokens(word, text, to_lower = FALSE) %>%
    count(word, sort = TRUE) %>%
    filter(str_detect(word, "^[A-Z]"))

head(paga_tokens)

word,n
The,1149
S,608
Rome,326
I,285
In,188
It,186


In [60]:
paga_tokens <- paga_text %>%
    unnest_tokens(word, text, to_lower = FALSE) %>%
    count(word, sort = TRUE) %>%
    filter(str_detect(word, "^[A-Z]")) %>%
    filter(nchar(word) > 3)

head(paga_tokens)

word,n
Rome,326
Christian,135
Pope,135
Illustration,124
Roman,113
This,110


# Reading data from Excel: The `readxl` package
R can read in data from Excel with the `readxl` package.

The command below reads an Excel file into an R object:

In [61]:
library(readxl)

In [62]:
runaways <- read_excel("runaways_exampledata.xlsx")

The data is now loaded into R as a dataframe:

In [63]:
dim(runaways)
head(runaways)

name,id,registered,escaped,returned
Peter vesmand,590,21/03/1712,29/10/1716,03/11/1716
Thomas Petersen,591,21/03/1712,17/10/1713,26/10/1713
Niels Jensen Skaaning,592,31/03/1712,25/08/1716,02/09/1716
Niels Jensen Skaaning,592,31/03/1712,21/09/1716,13/11/1716
[ ] Jens Sönner,594,26/04/1712,00/00/1716,
Magnus Bendixsen / Mogens,599,08/03/1713,09/07/1719,11/11/1720


What are the "registered", "escaped" and "returned" columns showing? What would be interesting to do with them?

In [65]:
class(runaways$registered)

In [66]:
# Stored as character - R does not know how to evaluate!
runaways$escaped[1] - runaways$registered[1]

ERROR: Error in runaways$escaped[1] - runaways$registered[1]: non-numeric argument to binary operator


# BREAK?

![lion_chill](https://i.pinimg.com/736x/66/70/75/6670750ccf134bb4d0de4eb726a396e2.jpg)

# Working with dates: The `lubridate` package
We have worked with numeric and text classes before but R also has a `Date` class. 

The base R functionality of working with dates can be a bit tricky but `lubridate` makes it very simple!
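
For comparison, a minimal sketch of the base R route, using the two dates from the first row of the runaways data shown above. In base R the format has to be spelled out by hand: `%d` is the day, `%m` the month and `%Y` the four-digit year. The object names `registered` and `escaped` are chosen for this example.

```r
# Convert text to dates by describing the format explicitly
registered <- as.Date("21/03/1712", format = "%d/%m/%Y")
escaped    <- as.Date("29/10/1716", format = "%d/%m/%Y")
class(registered)                  # "Date"

# Dates can be subtracted from each other
escaped - registered               # Time difference of 1683 days
as.numeric(escaped - registered)   # 1683

# Extracting the year as text
format(registered, "%Y")           # "1712"
```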

*Install and load the `lubridate` package.*

In [67]:
library(lubridate)


Attaching package: 'lubridate'

The following object is masked from 'package:base':

    date



In [68]:
# Some example dates - all stored as character
date1 <- "29 aug 1876"
date2 <- "1770-11-26"
date3 <- "12.26.1810"

lapply(c(date1, date2, date3), class)

Converting to dates with `lubridate` is very simple! You just need to know the order of the information in the date (year, month, day).

The main function for converting is `ymd()` (short for year-month-day). This will take a character class object and convert to a date.

A function is there for each combination of year-month-date; meaning you just have to shuffle the letters around to fit the format:

In [69]:
date1 <- dmy(date1)
date2 <- ymd(date2)
date3 <- mdy(date3)

lapply(c(date1, date2, date3), class)

When stored as dates, it is easy to extract components with commands such as `year()`, `month()`, `day()`:

In [70]:
print(date1)
year(date1)
month(date1)
day(date1)
wday(date1, label = TRUE, locale = "English")

[1] "1876-08-29"


Date objects allow us to calculate time differences:

In [71]:
date1 - date2

Time difference of 38627 days

Calculating differences between two dates creates a `difftime` object by default. `difftime` objects are more useful for shorter differences. For longer time differences, it is more useful to work with `interval()`.

With intervals, we can ask the number of days, years, months in the interval with `as.period()`.

It is also possible to coerce directly to a numeric object with the specified units with `as.numeric()`.

In [72]:
# Create time difference as interval
time_int <- interval(date2, date1)

# Display time differences with different units
as.period(time_int, unit = "days")
as.period(time_int, unit = "months")
as.period(time_int, unit = "years")

# Numeric coercion
as.numeric(time_int, "years")

Because `lubridate` is a part of the tidyverse, the functions support vectors or vector-like objects as well!

In [73]:
dates <- c("1876-12-21", "1873-11-01", "1885-01-30", "1842-06-10")
dates <- ymd(dates)
dates

date_ints <- interval(ymd("1800-01-01"), dates)
as.numeric(date_ints, "years")

# EXERCISE 6: WORKING WITH DATES

Make sure you have the runaways data loaded.

1. Convert the columns containing dates to date formats using the proper variation of `ymd()`
2. Create a new column calculating the time difference in *days* between `registered` and `escaped`. Use `interval()` and `as.numeric()`
3. Determine the shortest and longest stays (you can use the `arrange()` command)

# Saving files with `readr`

R can already save files in some other formats but the `readr` package is often more intuitive to use.

Check your directory with `getwd`. Directory can be changed with `setwd`.

We can save a .csv file (comma-separated values) for use in Excel with `readr` using `write_excel_csv`:

In [77]:
library(readr)
write_excel_csv(runaways, path = "my_data.csv", delim = ";", col_names = TRUE)

**Code breakdown:**

| Code | Description |
|:-----|:------------|
|`runaways` | The object we want to save |
|`path = "my_data.csv"` | The filename - .csv for comma-separated values (the argument is named `file` in newer versions of `readr`) |
|`delim = ";"` | Setting the separator between values to be semicolons |
|`col_names = TRUE` | Specifying that the column names are written as the first row |
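
To see what a semicolon-separated file looks like, here is a sketch using base R instead of `readr`, so it runs without extra packages. The data frame `example_df` is made up from two names in the runaways data, and the file is written to a temporary location (`tempfile()`) rather than the working directory.

```r
example_df <- data.frame(
  name = c("Thomas Petersen", "Niels Jensen Skaaning"),
  id   = c(591, 592),
  stringsAsFactors = FALSE
)

# Write to a temporary file with ";" between the values
out_file <- tempfile(fileext = ".csv")
write.table(example_df, file = out_file, sep = ";",
            row.names = FALSE, col.names = TRUE)

# Inspect the raw lines of the file
readLines(out_file)

# Read the file back into R, stating the separator used
reloaded <- read.csv(out_file, sep = ";", stringsAsFactors = FALSE)
reloaded
```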

# Summary of part 1

R objects are created using the operator `<-`. R objects can be pretty much anything.

Objects are assigned a `class`. The class can be checked with `class(object)`. Classes can be coerced with commands like `as.numeric()` or `as.character()`

Vectors are groupings of values. They are created with `c()`. Vectors can be subset using `[]`.

Texts can be thought of as vectors in a number of ways (grouping of words, sentences, texts etc.)

Data frames are R's version of a spreadsheet (rows and columns). A column/variable can be selected using `$`. A column/variable can (in most cases) be treated like a vector.

Data frames can be subset using `[ , ]`. Rows are specified before the comma, and columns after the comma.

# Summary of part 1 - Continued

Other functions can be loaded into R via packages:
- `stringr` for manipulating strings
- `tidytext` for working with text data frames
- `lubridate` for working with dates
- `readxl` for reading Excel files
- `haven` for storing data in other formats

# What's next?