# Introduction to R (part 1)

# Introduction
## What is R and why use it?

R is a free software environment with its own programming language. 

It is especially suited for statistical analysis and graphical output.

Because R is open source and popular as a data science tool, it is used for a wide range of applications.

R can work with a large variety of data formats and is (with a few add-ons) compatible with data from other software solutions (Excel, SPSS, SAS, Stata).

## Content of the R introduction

- The RStudio environment
- The R language
- Constructing variables and objects
- Functions and libraries/packages
- Working with dataframes (table structure in R)
- Simple data handling in R
- Simple visualizations in R (`ggplot2` package)

The introduction will combine presenting R and R code in Jupyter Notebook while demonstrating in RStudio. You are encouraged to work and write in RStudio during the workshop.

*Please write along as we go through the different examples.*

## The RStudio environment

During the workshop (and the other R workshops in CALDISS), we will be working with RStudio.

RStudio is an IDE (Integrated Development Environment) for R. It makes for a nicer workspace.

<https://www.rstudio.com/products/rstudio/download/>

# The R language
R has its own programming language. You work in R by writing lines of code in that language (writing commands), which R then interprets (running commands).

R (and RStudio) have a limited graphical user interface, meaning that almost all functionality (statistics, plots, simulations etc.) must be executed using code in the R language.

## R as a calculator
So what does it mean that R interprets our code?
It means that you tell R to do something by writing a command and R will do that (if R can understand you).

R, for example, understands mathematical expressions:

In [1]:
7 * 6

In [2]:
912 - 132

# Using R scripts

Script files are text files containing code that R can interpret.

It is your "analysis recipe" showing what you have done as well as allowing you to re-run commands easily.

Make a habit of writing your commands into a script once you have them figured out.

- `#` can be used for comments (skipped when run)
- `Ctrl` + `Enter`: Runs the current line or selection
- `Ctrl` + `Alt` + `R`: Runs the whole script

*NOTE!* There is no undo in R. When code is executed, the change has been made. The only way to undo it is to re-run previous code to get back to an earlier stage. This is what scripts are used for.
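As a minimal sketch of what "re-running previous code" means in practice (the object name `price` is made up for illustration, and the `<-` operator is explained in the section on objects):

```r
# Step 1: define an object (hypothetical name, for illustration only)
price <- 100

# Step 2: overwrite the object; the old value is gone and cannot be undone
price <- price * 2
price   # 200

# To get back to the earlier stage, re-run step 1 from the script
price <- 100
price   # 100
```

Because both steps are written down in the script, getting back to the earlier value is just a matter of running the first line again.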

# The R Language: Objects and Functions

R works by storing values and information in "objects". These objects can then be used in various commands like calculating a statistical model, saving a file, creating a graph and so on. To simplify a bit: An object is some kind of stored information and a function is something that can manipulate that stored information (which then creates a new object). 

## Objects

A lot of writing in R is about defining objects. An object is a name used to call up stored information.

Objects can be a lot of things: 
- a word
- a number
- a series of numbers
- a dataset 
- a URL
- a formula
- a result 
- a filepath
- a series of datasets
- and so on...

When an object is defined, it is available in the current working space (or environment).

This makes it possible to store and work with a variety of information simultaneously.

### Defining objects
Objects are defined using the `<-` operator (`Alt` + `-`):

In [3]:
year <- 1964

In [4]:
year

Once defined, the object can be used like any other numeric value.

In [5]:
year + 10

Notice that R differentiates between lower- and upper-case letters:

In [6]:
Year # Does not exist

ERROR: Error in eval(expr, envir, enclos): object 'Year' not found


Using `' '` or `" "` denotes that the input should be read as text. *This also applies to numbers!*

In [7]:
name <-  "keenan"

In [8]:
name

In [9]:
year_now  <- '2021'

In [10]:
year_now

Notice that numbers stored as text will be enclosed in quotes. Numbers stored as text cannot immediately be used as numbers:

In [11]:
year_now - 5

ERROR: Error in year_now - 5: non-numeric argument to binary operator


This error happens because R differentiates between objects by assigning them to a specific *class*. The class denotes what is possible with the object.

### Naming objects
Objects can be named almost anything but a good rule of thumb is to use names that are indicative of what the object contains.

#### Restrictions for naming objects
- Most special characters are not allowed: `/`, `?`, `*`, `+` and so on (most of these characters mean something to R and will be read as part of an expression)
- Avoid names that already exist in R (your object will mask the existing function/object in the environment)

#### Good naming conventions 
- Using '`_`': `my_object`, `room_number`

or:

- Capitalize each word except the first: `myObject`, `roomNumber`
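A short sketch of the two naming styles, with made-up object names and values:

```r
# Underscore style
room_number <- 12

# Capitalized-word style
roomNumber <- 12

# Both are valid names and hold the same value
room_number == roomNumber   # TRUE

# A name containing a special character is read as an expression instead:
# room-number <- 12   # would be read as "room minus number" and fail
```

Whichever style you pick, using it consistently makes a script easier to read.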

## Functions

Functions are commands used to transform an object in some way and give an output.

The input to a function is called an "argument". The number of arguments varies between functions.

Functions have the basic syntax: `function(arg1, arg2, arg3)`.

Some arguments are required while others are optional.
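To illustrate required and optional arguments, here is a small sketch using the base function `round()`. The object name `pi_approx` is made up for the example. The number to round is required, while `digits` is optional and defaults to 0:

```r
pi_approx <- 3.14159

# Only the required argument: `digits` falls back to its default of 0
round(pi_approx)               # 3

# Optional argument given by name
round(pi_approx, digits = 2)   # 3.14
```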

In [12]:
name <- 'kilmister'
toupper(name) #Returns the object in upper-case

Most functions take the object as the first input but not all.

In [13]:
gsub("e", "a", name) #Replace all e's with a's

### Functions and their outputs

Note that functions almost *never* change the object. When calling functions you are asking R for a specific output but not to change anything.

The output of a function has to be stored in an object if you want to keep it.

In [14]:
name # Unchanged even though it was used in functions

In [15]:
name <- gsub("e", "a", name) # Store object with characters "e" swapped with "a" - replacing the object

In [16]:
name

# EXERCISE 1: OBJECTS AND FUNCTIONS

1. Define the following objects:

    - `name1`: `"araya"`
    - `name2`: `"townsend"`
    - `year1`: `1961` (without quotation marks)
    - `year2`: `"1972"` (with quotation marks)

2. Try calculating the ages for `year1` and `year2` (current year - year-object). 

3. Use the function `toupper()` to convert `name1` to upper-case.

# EXERCISE 1: DEFINING OBJECTS
*What happens?*

In [17]:
name1 <- "araya"
name2 <- "townsend"
year1 <- 1961
year2 <- "1972"

In [18]:
age1 <- 2021 - year1 # Works fine

age1

In [19]:
age2 <- 2021 - year2 # Produces an error

ERROR: Error in 2021 - year2: non-numeric argument to binary operator


In [20]:
name1 <- toupper(name1)

name1

# R Libraries - Packages 

R being open source means that a lot of developers are constantly adding new functions to R.
These new functions are distributed as *R packages* that can be loaded into the R library.

All the commands you have been using so far have been part of the `base` package (ships with R). 

Packages are installed using (name of package *with* quotes!): 

`install.packages('packagename')` 

The functions from the package are loaded into the environment using (name of package *without* quotes!):
    
`library(packagename)` 

Information for installed packages can be found using (name of package *with* quotes!):

`library(help = 'packagename')` 
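The sketch below shows the load step with `tools`, a package that ships with R, so nothing needs to be downloaded. The install line is left as a comment, and the file name is made up for illustration:

```r
# For a package that is not yet installed you would first run, for example:
# install.packages("readr")   # name WITH quotes

# `tools` ships with R, so it can be loaded right away
library(tools)                 # name WITHOUT quotes

# file_ext() comes from the tools package and is available after library()
file_ext("survey_data.csv")    # "csv"
```

Installing is only needed once per computer, while `library()` has to be run in every new R session.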

## Importing data with `readr`

`readr` is a package for reading various data files into R.

R does have some "base" functions for doing this but `readr` is more efficient.

`readr` is part of a collection of packages called `tidyverse`: https://www.tidyverse.org/

In [22]:
library(readr)

ess18 <- read_csv("https://github.com/RolfLund/4semesterR/raw/master/teaching-materials/r-intro/datasets/ESS2018DK_subset.csv")

Rows: 1285 Columns: 17

-- Column specification --------------------------------------------------------
Delimiter: ","
chr  (7): vote, prtvtddk, health, lvpntyr, tygrtr, gndr, edlvddk
dbl (10): idno, netustm, ppltrst, yrbrn, eduyrs, wkhct, wkhtot, grspnum, frl...


i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.



# R Objects: Data Frames

`ess18` is now an object containing a dataset. Notice that the basic syntax stays the same: `objectname <- somefunction(something)`.

A "data frame" is the R-equivalent of a spreadsheet (a table of rows and columns). It is one of the most useful storage structures for data analysis in R.
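As a rough illustration of the structure, a data frame can also be built by hand with the base function `data.frame()`. The object name `respondents` and the values below are made up and are not taken from the workshop data. `c()` combines several values into one column:

```r
# A small, made-up data frame: each row is an observation,
# each column is a variable
respondents <- data.frame(
  idno  = c(1, 2, 3),
  gndr  = c("Male", "Female", "Female"),
  yrbrn = c(1974, 1975, 1958)
)

respondents        # Print the table
dim(respondents)   # 3 rows, 3 columns
```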

## Data for this workshop: European Social Survey 2018

We are using a subset of the Danish European Social Survey data from 2018 (https://www.europeansocialsurvey.org/)

- https://www.europeansocialsurvey.org/docs/round9/fieldwork/source/ESS9_source_questionnaires.pdf
- https://www.europeansocialsurvey.org/docs/round9/survey/ESS9_appendix_a7_e03_1.pdf

The data contains the following variables:

|variable | description |
|----|---|
|idno|Respondent's identification number|
|netustm |Internet use, how much time on typical day, in minutes|
|ppltrst|Most people can be trusted or you can't be too careful|
|vote|Voted last national election|
|prtvtddk|Party voted for in last national election, Denmark|
|health|Subjective general health|
|lvpntyr|Year first left parents for living separately for 2 months or more|
|tygrtr|Retire permanently, age too young. SPLIT BALLOT|
|gndr|Gender|
|yrbrn|Year of birth|
|edlvddk|Highest level of education, Denmark|
|eduyrs|Years of full-time education completed|
|wkhct|Total contracted hours per week in main job overtime excluded|
|wkhtot|Total hours normally worked per week in main job overtime included|
|grspnum|What is your usual [weekly/monthly/annual] gross pay|
|frlgrsp|Fair level of [weekly/monthly/annual] gross pay for you|
|inwtm|Interview length in minutes, main questionnaire|


## Exploring Data Frames
To get an idea of what the data contains, we can use `head()`:

In [26]:
head(ess18)

idno,netustm,ppltrst,vote,prtvtddk,health,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5816,90.0,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,Good,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61
7251,300.0,5,Yes,Dansk Folkeparti - Danish People's Party,Fair,1993,40,Female,1975,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",13,32,34,22000.0,30000.0,68
7887,360.0,8,Yes,Socialdemokratiet - The Social democrats,Fair,1983,55,Male,1958,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks",25,39,39,36000.0,42000.0,89
9607,540.0,9,Yes,Alternativet - The Alternative,Good,1982,64,Female,1964,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",13,32,34,32000.0,,50
11688,,5,Yes,Socialdemokratiet - The Social democrats,Very bad,1968,50,Female,1952,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",2,37,37,,,77
12355,120.0,5,Yes,Socialdemokratiet - The Social democrats,Fair,1987,60,Male,1963,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",14,38,37,36000.0,38000.0,48


We can check the names of the columns (the variable names) using `colnames()`. `dim()` returns the number of rows and columns.

In [27]:
colnames(ess18)

In [28]:
dim(ess18)

See key summary statistics using `summary()` (min, max, mean, median, quartiles and the number of missing values for numeric variables).

In [29]:
summary(ess18)

      idno           netustm          ppltrst          vote          
 Min.   :  5816   Min.   :   0.0   Min.   : 0.00   Length:1285       
 1st Qu.: 93707   1st Qu.:  90.0   1st Qu.: 6.00   Class :character  
 Median :112877   Median : 150.0   Median : 7.00   Mode  :character  
 Mean   :110980   Mean   : 227.4   Mean   : 7.08                     
 3rd Qu.:131072   3rd Qu.: 300.0   3rd Qu.: 8.00                     
 Max.   :150446   Max.   :1020.0   Max.   :10.00                     
                  NA's   :151      NA's   :3                         
   prtvtddk            health            lvpntyr             tygrtr         
 Length:1285        Length:1285        Length:1285        Length:1285       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                       

# Data frames and vectors

Data frames consist of rows and columns. Typically R expects the rows of a data frame to contain observations and the columns of the data frame to contain variables (information about the observations).

R treats single columns (or variables) as "vectors". A vector is a series of values of the same class.
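A small sketch of what "same class" means, using made-up values. `c()` is the base function for combining values into a vector. If numbers and text are mixed, R converts all values to text so that the vector still has a single class:

```r
# c() combines values into a vector (values made up for illustration)
birth_years <- c(1974, 1975, 1958)
class(birth_years)   # "numeric"

# Mixing numbers and text: R converts every value to text
mixed <- c(1974, "unknown", 1958)
class(mixed)         # "character"
mixed                # "1974" "unknown" "1958"
```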

We can refer to a single column in a data frame (a vector) with `$`.

In [30]:
head(ess18$yrbrn) # First six values of yrbrn variable

Each value in a vector is assigned an index referring to the position of the value in the vector (starts from 1).

A vector is indexed using `[]`:

In [31]:
ess18$yrbrn[10] # Returns the 10th value (row 10) of the yrbrn variable

In [32]:
ess18$yrbrn[2:10] # Returns values 2-10 of the yrbrn variable (both inclusive)

A range of useful functions exist for calculating descriptive measures for a vector, e.g. `mean()`, `min()`, `max()`, `sd()` and `length()`.

In [34]:
min(ess18$yrbrn) # Returns smallest value
max(ess18$yrbrn) # Returns largest value
mean(ess18$yrbrn) # Returns mean value
sd(ess18$yrbrn) # Returns standard deviation
length(ess18$yrbrn) # Returns number of values in the vector (corresponding to the number of rows)

Remember that in R we can store anything as an object. If, for example, we need to recall a statistic repeatedly, we can store it as an object of its own:

In [36]:
mean_yrbrn <- mean(ess18$yrbrn)

In [37]:
mean_yrbrn

`unique()` returns the unique values in a vector (useful for getting familiar with a variable):

In [39]:
unique(ess18$health)

## Missing values

Data will often contain missing values. Missing values can denote a lot of things like a non-response, an invalid answer, inaccessible information and so on.

A missing value is a placeholder for a value that is absent. Missing values are denoted as `NA` in R.

NOTE: It is not given that missing values are actually coded as missing in the dataset. Conventions between software solutions vary, so often specific values (like 777777 or 888888) are used to indicate missing values.
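A minimal sketch of how such special codes could be turned into proper missing values. The vector `income` and its values are made up, and the codes 777777 and 888888 are simply the examples mentioned above. Always check the documentation of your own data for the codes actually used:

```r
# Made-up income values where 777777 and 888888 mean "no answer"
income <- c(37000, 777777, 22000, 888888, 36000)

mean(income)   # Misleading: the codes are treated as real values

# Replace the special codes with NA
income[income %in% c(777777, 888888)] <- NA

income            # 37000 NA 22000 NA 36000
summary(income)   # Now reports 2 NA's
```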

The `summary()` function includes information about the number of missing values:

In [40]:
summary(ess18$inwtm)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  18.00   51.00   59.00   63.32   70.00  613.00       5 

Missing values are neither high nor low in R. This means that it is not possible to perform computations on missing values:

In [33]:
min(ess18$inwtm) # NA is neither high nor low - returns NA
max(ess18$inwtm) # NA is neither high nor low - returns NA
mean(ess18$inwtm) # NA is neither high nor low - returns NA

Usually one will have to deal with the missing values in some way, either by replacing them or removing them.

Some functions have a built-in argument for dealing with missing values.
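The two options (removing or replacing) can be sketched in base R as follows. The vector `minutes` is made up, and replacing with 0 is only for illustration, not a recommendation for real data:

```r
# Made-up interview lengths with one missing value
minutes <- c(48, 61, NA, 77)

is.na(minutes)        # FALSE FALSE TRUE FALSE
sum(is.na(minutes))   # 1 missing value

# Option 1: remove the missing values
minutes_complete <- minutes[!is.na(minutes)]
mean(minutes_complete)   # 62

# Option 2: replace the missing values (here with 0, purely for illustration)
minutes_replaced <- minutes
minutes_replaced[is.na(minutes_replaced)] <- 0
mean(minutes_replaced)   # 46.5
```

Note how the two choices give different means. How missing values are handled is part of the analysis and should be a deliberate decision.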

## Using the help function

All R functions and commands are thoroughly documented so you do not have to remember what every function does or even how it should be written.

Every function and command in R has its own help file. The help file describes how to use the various functions and commands.

The help file for a specific function is accessed using the operator `?` followed by the name of the function, e.g. `?max` (this also works for the built-in datasets).

Looking at the help file for `max()`, we can see an argument called `na.rm`. We can see that this argument is used for removing missing values when performing the calculation.

Notice that in the help file, the argument is set to `na.rm = FALSE`. This is the default setting of the function, meaning that unless otherwise specified, the function will run with the argument set to `FALSE` (missing values will be kept).

When we change the argument in the function call, the missing values are removed:

In [41]:
max(ess18$inwtm, na.rm = TRUE)

# EXERCISE 2: DESCRIPTIVE MEASURES

Use some of the functions for calculating descriptive measures to calculate the following:

- The mean (`mean()`) time used on the internet per day (`netustm`)
- The highest value (`max()`) for time used on the internet per day (`netustm`)
- The median (`median()`) for monthly gross pay (`grspnum`)

NOTE! The variables may contain missing values. Consult the documentation of the functions to see how to account for that.

# EXERCISE 2: DESCRIPTIVE MEASURES

In [42]:
mean(ess18$netustm, na.rm = TRUE)
max(ess18$netustm, na.rm = TRUE)
median(ess18$grspnum, na.rm = TRUE)

# Data handling in R

When working with data, we usually need to perform some introductory data handling steps before being able to conduct our analysis.

This could include:

- Filtering observations and variables (also referred to as subsetting)
- Creating new variables
- Recoding values

R supports all these operations with "base" functions, but the functions available in the tidyverse (https://www.tidyverse.org/) are far more intuitive.

## Subsetting data with `dplyr` 

The package `dplyr` contains various commands for filtering and subsetting data. The functions `filter` and `select` can be used to subset data instead of base R commands.

`filter()` takes a dataset and a logical statement using a variable in the data. It returns a dataset with the observations that meet the criteria.

`select()` takes a dataset and a list of variable names. It returns the dataset with only the specified variables.

NOTE: R also ships with a function called `filter()` (in the `stats` package). That function is masked when `dplyr` is loaded.

### Filtering observations with `filter`

In [43]:
library(dplyr)

ess18_male <- filter(ess18, gndr == 'Male') # Subset with only males

head(ess18_male)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




idno,netustm,ppltrst,vote,prtvtddk,health,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,Good,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37.0,37,37000,35000.0,61
7887,360,8,Yes,Socialdemokratiet - The Social democrats,Fair,1983,55,Male,1958,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks",25,39.0,39,36000,42000.0,89
12355,120,5,Yes,Socialdemokratiet - The Social democrats,Fair,1987,60,Male,1963,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",14,38.0,37,36000,38000.0,48
16357,488,5,Yes,Dansk Folkeparti - Danish People's Party,Very good,2013,50,Male,1991,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",4,37.0,37,40000,,50
20724,60,5,Yes,"Venstre, Danmarks Liberale Parti - The Liberal Party",Good,1981,Never too young,Male,1958,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",14,37.0,40,28000,34000.0,65
24928,120,8,Yes,"Venstre, Danmarks Liberale Parti - The Liberal Party",Very good,1984,Should never retire permanently,Male,1965,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",19,,80,50000,,77


Subset rows with people with more than 15 years of education:

In [50]:
ess18_edusub <- filter(ess18, eduyrs > 15) 

head(ess18_edusub)

idno,netustm,ppltrst,vote,prtvtddk,health,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,Good,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37.0,37,37000.0,35000.0,61
7887,360,8,Yes,Socialdemokratiet - The Social democrats,Fair,1983,55,Male,1958,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks",25,39.0,39,36000.0,42000.0,89
19970,240,9,Yes,Liberal Alliance - Liberal Alliance,Very good,1984,60,Female,1966,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks",21,36.0,36,85000.0,,42
22248,121,9,Yes,Socialdemokratiet - The Social democrats,Good,1970,Never too young,Female,1950,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",19,37.0,37,,,62
24928,120,8,Yes,"Venstre, Danmarks Liberale Parti - The Liberal Party",Very good,1984,Should never retire permanently,Male,1965,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",19,,80,50000.0,,77
27211,120,7,Yes,Kristendemokraterne - Christian Democrats,Fair,1983,60,Male,1969,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",16,15.0,17,26000.0,,99


### Filtering with booleans/logical values

Normally we do not know the row index of the values we want to keep. Rather, we want to filter observations based on certain criteria.

In R this is done via the use of "booleans" or "logical values". These are values that are either `TRUE` or `FALSE`.

A number of operations in R always return a logical value:

- `>`
- `>=`
- `<`
- `<=`
- `==`
- `!=`

In [44]:
42 > 10

In [45]:
10 != 10
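The same comparisons work on whole vectors, which is how a criterion ends up selecting rows: every row whose value is `TRUE` meets the criterion. A small base R sketch with made-up values (here `yrbrn` and `gndr` are stand-alone vectors, not columns in `ess18`):

```r
# Made-up values for three respondents
yrbrn <- c(1974, 1985, 1991)
gndr  <- c("Male", "Female", "Female")

# A comparison on a vector returns one logical value per element
yrbrn > 1980                      # FALSE TRUE TRUE

# & combines two conditions: TRUE only where both are TRUE
gndr == "Female" & yrbrn > 1980   # FALSE TRUE TRUE

# TRUE counts as 1, so sum() gives the number of matches
sum(yrbrn > 1980)                 # 2
```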

### Selecting columns/variables with `select`

In [44]:
ess18_male_subset <- select(ess18_male, idno, gndr, yrbrn, edlvddk) # Selecting specific variables

head(ess18_male_subset)

idno,gndr,yrbrn,edlvddk
<dbl>,<chr>,<dbl>,<chr>
5816,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,"
7887,Male,1958,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks"
12355,Male,1963,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-"
16357,Male,1991,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,"
20724,Male,1958,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-"
24928,Male,1965,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,"


# EXERCISE 3: FILTERING AND SELECTING

1. Create a subset of the data containing females born after 1980 with the variables: `idno`, `gndr`, `yrbrn`, `netustm`

2. Use `head()` to check if the subset looks correct

3. Calculate the mean time used on the internet per day (`netustm`) for the subset


## Bonus exercise

1. Create a subset for males born after 1980
2. Calculate the mean time used on the internet per day for this subset - is it higher or lower than the subset for females?

# EXERCISE 3: FILTERING AND SELECTING

In [46]:
ess18_female_subset <- filter(ess18, gndr == 'Female', yrbrn > 1980)
ess18_female_subset <- select(ess18_female_subset, idno, gndr, yrbrn, netustm)


ess18_male_subset <- filter(ess18, gndr == 'Male', yrbrn > 1980)
ess18_male_subset <- select(ess18_male_subset, idno, gndr, yrbrn, netustm)

In [47]:
mean(ess18_female_subset$netustm, na.rm = TRUE)

In [48]:
mean(ess18_male_subset$netustm, na.rm = TRUE) # Bonus

In [49]:
mean(ess18_male_subset$netustm, na.rm = TRUE) > mean(ess18_female_subset$netustm, na.rm = TRUE)

## Recoding and creating variables

Creating variables and (simple) recoding are usually done in the same way. The only difference is whether the result is assigned to a new variable or overwrites an existing one (here we only look at recoding by arithmetic operations, not at replacing values).

### Recoding and creating variables using `dplyr`

The function `mutate()` in `dplyr` is used for creating and recoding variables:

In [52]:
ess18 <- mutate(ess18, inwth = inwtm / 60)

head(ess18$inwth)
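Whether `mutate()` creates or recodes depends only on the name on the left of `=`. Here is a sketch on a small made-up data frame (`interviews` is a hypothetical object, and the example assumes `dplyr` is installed):

```r
library(dplyr)

# Made-up data: interview length in minutes
interviews <- data.frame(idno = c(1, 2, 3), inwtm = c(61, 89, 48))

# New name on the left of `=`: a new variable is added
interviews <- mutate(interviews, inwth = inwtm / 60)

# Existing name on the left of `=`: the variable is overwritten (recoded)
interviews <- mutate(interviews, inwth = round(inwth, digits = 1))

interviews   # Still three columns: idno, inwtm, inwth
```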

# EXERCISE 4: CREATING VARIABLES

1. Create an age variable (the dataset is from 2018)

2. Create a variable containing the difference between hours worked per week (`wkhtot`) and hours contracted per week (`wkhct`).

3. What is the highest overtime value? (use either `summary()` or `max()`)

# EXERCISE 4: CREATING VARIABLES

In [53]:
ess18 <- mutate(ess18, age = 2018 - yrbrn,
           overthrs = wkhtot - wkhct)

max(ess18$overthrs, na.rm = TRUE)

# Classes in R

As mentioned earlier, R differentiates between objects via the "class" of the object.

The function `class()` is used to check the class of an object:

In [54]:
name <- "keenan"
year <- 1964

In [55]:
class(name)

In [56]:
class(year)

Remember that single variables/vectors can only contain values of the same class. The `class()` function therefore works on vectors too.

The variable `tygrtr` (Retire permanently, age too young) seems like a variable that should contain numeric values (the age). However, looking at the first couple of rows, we see that it also contains text values:

In [58]:
head(ess18$tygrtr)

When we check the class, we also see that the values are stored as text:

In [59]:
class(ess18$tygrtr)

This means that we cannot perform meaningful calculations with this variable (`max()` compares the values as text rather than as numbers):

In [60]:
max(ess18$tygrtr)

## Class coercion

In most cases, R can coerce values from one class to another. When doing this, values that are incompatible with the class are coded to missing (`NA`) so beware!

Values can be coerced to character values with `as.character()`

Values can be coerced to numeric values with `as.numeric()`
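A small sketch of both directions on made-up values (the vector `answers` is hypothetical, loosely modelled on the kind of answers seen in `tygrtr`):

```r
# Made-up answers: two ages stored as text and one text answer
answers <- c("60", "55", "Never too young")
class(answers)                       # "character"

# Text that looks like a number is converted; anything else becomes NA
answers_num <- as.numeric(answers)   # Warning: NAs introduced by coercion
answers_num                          # 60 55 NA

# Coercing the other way: numbers to text
as.character(1964)                   # "1964"
```

The warning is R's way of telling you that some values could not be converted, so it is worth checking which values were turned into `NA`.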

Here we coerce the variable to be numeric (notice the warning):

In [66]:
ess18 <- mutate(ess18, tygrtr_num = as.numeric(tygrtr))

"NAs introduced by coercion"


Now the variable can be used in calculations:

In [67]:
max(ess18$tygrtr_num, na.rm = TRUE)

# EXERCISE 5: CLASS COERCION

Create a variable containing the number of years the respondent lived at home before living separately for 2 months or more (`lvpntyr`)

Note that the variable `lvpntyr` might not be ready for calculations right away. 

Remember that `as.numeric()` converts values to numeric.

# EXERCISE 5: CLASS COERCION

In [69]:
ess18 <- mutate(ess18, lvpntyr_num = as.numeric(lvpntyr),
                yrhome = lvpntyr_num - yrbrn)

head(ess18)

"NAs introduced by coercion"


idno,netustm,ppltrst,vote,prtvtddk,health,lvpntyr,tygrtr,gndr,yrbrn,⋯,wkhtot,grspnum,frlgrsp,inwtm,inwth,age,overthrs,tygrtr_num,lvpntyr_num,yrhome
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5816,90.0,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,Good,1994,60,Male,1974,⋯,37,37000.0,35000.0,61,1.0166667,44,0,60,1994,20
7251,300.0,5,Yes,Dansk Folkeparti - Danish People's Party,Fair,1993,40,Female,1975,⋯,34,22000.0,30000.0,68,1.1333333,43,2,40,1993,18
7887,360.0,8,Yes,Socialdemokratiet - The Social democrats,Fair,1983,55,Male,1958,⋯,39,36000.0,42000.0,89,1.4833333,60,0,55,1983,25
9607,540.0,9,Yes,Alternativet - The Alternative,Good,1982,64,Female,1964,⋯,34,32000.0,,50,0.8333333,54,2,64,1982,18
11688,,5,Yes,Socialdemokratiet - The Social democrats,Very bad,1968,50,Female,1952,⋯,37,,,77,1.2833333,66,0,50,1968,16
12355,120.0,5,Yes,Socialdemokratiet - The Social democrats,Fair,1987,60,Male,1963,⋯,37,36000.0,38000.0,48,0.8,55,-1,60,1987,24


# Saving files 

The most important thing to save is the R script.

Use the save icon to save your script.

![save](./img/save-script.png)