# Extended introduction to R

## Content of the workshop

This workshop introduces various useful concepts in R. The workshop contains the following sections:

- Common errors when working with data in R
- Recoding categories in R
- Working with strings
- Control structures in R
- Functions in R

# Common errors when working with data in R

R makes some assumptions about how the data being imported is set up (value delimiters, decimal points etc.). Furthermore, data may contain errors or need some handling before being ready for any kind of analysis. 

Errors that one encounters when working with data can be rather unique and solving them will often involve a lot of trial and error specific to the data being worked with.

In this section we take a look at some of the errors one may encounter when working with tabular data in R. The section uses the same subset of ESS 2018 from the introduction but with some errors added to the data.

## Common error 1: Data uses a non-standard separator

This error can occur when working with CSV files. "CSV" stands for comma-separated values, but the standard CSV format actually differs a bit across countries: some countries use commas; others use semicolons.

"CSV" is a type of "delimited data file". Delimited data files consist of lines where the values are separated by some character (tab, comma, semicolon or something else).

In the code below, the dataset is imported with the standard `read_csv()` function from `readr`:

In [3]:
library(readr)

ess2018 <- read_csv("https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv")

head(ess2018, 5)

Parsed with column specification:
cols(
  `idno;netustm;ppltrst;vote;prtvtddk;lvpntyr;tygrtr;gndr;yrbrn;edlvddk;eduyrs;wkhct;wkhtot;grspnum;frlgrsp;inwtm` = col_character()
)
"1572 parsing failures.
row col  expected    actual                                                                                                                   file
  1  -- 1 columns 4 columns 'https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv'
  2  -- 1 columns 4 columns 'https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv'
  3  -- 1 columns 4 columns 'https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv'
  4  -- 1 columns 4 columns 'https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv'
  5  -- 1 columns 4 columns 'https://raw.githubusercontent.com/CAL

idno;netustm;ppltrst;vote;prtvtddk;lvpntyr;tygrtr;gndr;yrbrn;edlvddk;eduyrs;wkhct;wkhtot;grspnum;frlgrsp;inwtm
110;180
705;60
1327;240
3760;300
4658;90


As can be seen, something is off with the data: all the values are condensed into a single column. This happens because `read_csv()` assumes the values are separated by commas, but in this dataset the values are separated by semicolons.

`read_csv2()` can be used in this case, as it assumes semicolons as separators (which is common in some European countries, such as Denmark):

In [4]:
ess2018 <- read_csv2("https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv") # Works with ";"
head(ess2018)

Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_character(),
  ppltrst = col_character(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_character()
)


idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
110,1800,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,9999999,9999999,1190
705,600,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,9999999,9999999,550
1327,2400,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,9999999,9999999,370
3760,3000,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200,9999999,430
4658,900,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,9999999,9999999,620
5816,900,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000,35000,610


Alternatively, the function `read_delim()` can be used where the delimiter (the character separating the values) is specified:

In [5]:
ess2018 <- read_delim("https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv", delim = ";") # Alternative - specify the delimiter
head(ess2018)

Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_character(),
  ppltrst = col_character(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_character()
)


idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
110,1800,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,9999999,9999999,1190
705,600,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,9999999,9999999,550
1327,2400,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,9999999,9999999,370
3760,3000,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200,9999999,430
4658,900,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,9999999,9999999,620
5816,900,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000,35000,610
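When it is unclear which delimiter a file uses, it can help to look at the raw lines before parsing. The sketch below is only an illustration: it writes a small, made-up semicolon-separated file to a temporary location (the file contents and the object names `path` and `mini` are assumptions, not part of the workshop data), inspects the first lines with `readLines()` and then imports the file with `read_delim()`.

```r
library(readr)

# Hypothetical miniature file, written to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("idno;vote;gndr",
             "110;Yes;Male",
             "705;Yes;Male"), path)

# Inspect the raw lines: the separator is visible directly
readLines(path, n = 2)

# Import with the delimiter found above
mini <- read_delim(path, delim = ";")
mini
```

Looking at the raw lines first avoids guessing between `read_csv()`, `read_csv2()` and `read_delim()`.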


## Common error 2: Missing not coded as missing

As there is no global standard for denoting missing values, the values used for denoting missing values will often vary from dataset to dataset. For surveys, the codebook usually contains information about what values are used to denote missing values (often very high numbers are used).

If this is overlooked, one can end up with erroneous results. In the code below, the mean for the variable `grspnum` (usual weekly/monthly/annual gross pay) is calculated:

In [6]:
mean(ess2018$grspnum)

The result is very high. This is because the variable contains high values to denote the missing values (this can for example be seen using `summary()`):

In [7]:
summary(ess2018$grspnum)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0   32000  500000 4833784 9999999 9999999 

The function `na_if` from `dplyr` can be used to simply recode specific values to missing:

In [8]:
library(dplyr)

ess2018 <- ess2018 %>%
    mutate(grspnum = na_if(grspnum, 9999999))

mean(ess2018$grspnum, na.rm = TRUE)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
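The effect of `na_if()` is easier to see on a small vector. The values below are made up for illustration (they are not taken from the dataset), and `pay`, `pay_clean` and `pay_base` are hypothetical object names. The sketch also shows a base R way of doing the same recode.

```r
library(dplyr)

# Made-up pay values where 9999999 denotes "missing"
pay <- c(200, 37000, 9999999, 35000, 9999999)

mean(pay)  # inflated by the missing-value code

# Recode the missing-value code to NA with dplyr
pay_clean <- na_if(pay, 9999999)
mean(pay_clean, na.rm = TRUE)

# The same recode with base R indexing
pay_base <- pay
pay_base[pay_base == 9999999] <- NA
```

Both approaches give the same vector; `na_if()` is convenient inside `mutate()`, while the base R version needs no extra package.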



## Common error 3: Data uses a different decimal point

R assumes that periods (".") are used as the decimal point in the data. This is however not standard in all countries (some countries use commas). 

Because R usually does not recognize commas as decimal points, R will instead treat vectors with commas as a character class (string).

The variable `inwtm` contains the length of the interview in minutes which should be numeric. It is however currently not possible to perform arithmetic operations on it:

In [9]:
mean(ess2018$inwtm, na.rm = TRUE)

"argument is not numeric or logical: returning NA"

The code above returns `NA` with a warning because the variable/vector has the wrong class:

In [10]:
class(ess2018$inwtm)
head(ess2018$inwtm)

If we simply try to coerce the class, R will convert them all to missing values:

In [11]:
head(as.numeric(ess2018$inwtm))

"NAs introduced by coercion"

This error can be fixed either by correcting the data before importing or by replacing the commas with periods in R and then coercing the variable to numeric.

In the code below, the function `str_replace()` from the package `stringr` is used to replace the commas with periods (`gsub()` also works):

In [12]:
library(stringr)
# Using stringr to replace the decimal commas

ess2018$inwtm <- str_replace(ess2018$inwtm, ",", ".") # Replace commas with periods
ess2018$inwtm <- as.numeric(ess2018$inwtm) # Coerce to numeric class

mean(ess2018$inwtm, na.rm = TRUE)

"NAs introduced by coercion"

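If the warning "NAs introduced by coercion" still appears after replacing the commas, some values are not numbers at all. The sketch below shows one possible way to find them. It uses made-up values and hypothetical object names (`inwtm_raw`, `inwtm_num`, `failed`), and it uses `gsub()` from base R instead of `str_replace()`.

```r
# Made-up raw values: decimal commas, a text value and a missing value
inwtm_raw <- c("119,0", "55,5", "unknown", NA)

# Replace decimal commas with periods and coerce to numeric
inwtm_period <- gsub(",", ".", inwtm_raw, fixed = TRUE)
inwtm_num <- suppressWarnings(as.numeric(inwtm_period))

# Values that were not missing before the conversion but are missing after it
failed <- inwtm_raw[is.na(inwtm_num) & !is.na(inwtm_raw)]
failed

mean(inwtm_num, na.rm = TRUE)
```

Inspecting `failed` shows which raw values need to be handled (for example recoded to missing) before the variable can be used as numeric.
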
## EXERCISE 1: COMMON ERRORS

- Load the data "ESS2018DK_subset_with-errors.csv" if you have not already. Make sure it is imported using the correct delimiter.
    - Link: https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv


- The variable `frlgrsp` contains what level of weekly/monthly/annual gross pay the respondent feels is fair for them. Try calculating the mean of the variable. The variable may contain high values to denote missing values, so be sure to recode these first.

- The variable `netustm` contains how much time the respondent spends on the internet on a typical day (in minutes). Try calculating the mean time the respondents spend on the internet on a typical day. If you encounter errors, try correcting them and calculating the mean again.

# Recoding categories in R

We have previously seen how variables can be created or recoded from existing variables using arithmetic operations (for example `df$newvar <- df$oldvar^2`).

Data often contains categorical data which may have to be recoded as well. Often categories are stored as strings. Changing the content of the category name or combining categories thus requires one to replace the text with something else.

## Recoding categories with base R

It is possible to recode categories with base R operations. Recoding is done by pinpointing the values that need to be replaced and then replacing those values with the new category.

In the example below, a variable is created indicating the level of educational attainment recoded to ISCED ([International Standard Classification of Education](https://ec.europa.eu/eurostat/statistics-explained/index.php?title=International_Standard_Classification_of_Education_(ISCED)#Implementation_of_ISCED_2011_.28levels_of_education.29)).

The values are recoded using the following schema:

| edlvddk | ISCED |
|----|----|
| Folkeskole 6.-8. klasse | 1 |
| Folkeskole 9.-10. klasse | 2 |
| Gymnasielle uddannelser, studentereksamen, HF, HHX, HTX | 3 |
| Kort erhvervsuddannelse under 1-2 års varighed, F.eks. AMU Arbejdsmarkedsuddann | 3 |
| Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social- | 3 |
| Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem | 5 |
| Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer, | 6 |
| Universitetsbachelor. 1. del af kandidatuddannelse | 6 |
| Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks | 7 |
| Licentiat | 7 |
| Forskeruddannelse. Ph.d., doktor | 8 |
| Other | NA |


In [13]:
# Read in the data

library(readr)

data <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")
head(data)

Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_double(),
  ppltrst = col_double(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_double()
)


idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43
4658,90,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61


In [14]:
# Create new empty variable (all values as missing)

data$edlvisced <- NA

# Specify values to replace and replace with ISCED

data[which(data$edlvddk == "Folkeskole 6.-8. klasse"), "edlvisced"] <- 1
data[which(data$edlvddk == "Folkeskole 9.-10. klasse"), "edlvisced"] <- 2
data[which(data$edlvddk == "Gymnasielle uddannelser, studentereksamen, HF, HHX, HTX"), "edlvisced"] <- 3
data[which(data$edlvddk == "Kort erhvervsuddannelse under 1-2 års varighed, F.eks. AMU Arbejdsmarkedsuddann"), "edlvisced"] <- 3
data[which(data$edlvddk == "Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-"), "edlvisced"] <- 3
data[which(data$edlvddk == "Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"), "edlvisced"] <- 5
data[which(data$edlvddk == "Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,"), "edlvisced"] <- 6
data[which(data$edlvddk == "Universitetsbachelor. 1. del af kandidatuddannelse"), "edlvisced"] <- 6
data[which(data$edlvddk == "Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks"), "edlvisced"] <- 7
data[which(data$edlvddk == "Licentiat"), "edlvisced"] <- 7
data[which(data$edlvddk == "Forskeruddannelse. Ph.d., doktor"), "edlvisced"] <- 8
data[which(data$edlvddk == "Other"), "edlvisced"] <- NA

In [15]:
head(data[, c('edlvddk', 'edlvisced')])

edlvddk,edlvisced
"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",5
"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",5
Folkeskole 9.-10. klasse,2
Folkeskole 9.-10. klasse,2
"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",5
"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",6

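A more compact base R alternative to one assignment per category is a lookup with a named vector: the names are the old categories and the values are the new codes. The sketch below uses a made-up vector with only three of the categories; `isced_lookup` and the toy `edlvddk` vector are hypothetical and not the full schema from the table.

```r
# Made-up example values, including a category not in the lookup and a missing value
edlvddk <- c("Folkeskole 9.-10. klasse", "Licentiat", "Other", NA,
             "Folkeskole 6.-8. klasse")

# Named vector: names are the old categories, values are the ISCED codes
isced_lookup <- c("Folkeskole 6.-8. klasse" = 1,
                  "Folkeskole 9.-10. klasse" = 2,
                  "Licentiat" = 7)

# Index the lookup by category name; unmatched and missing values become NA
edlvisced <- unname(isced_lookup[edlvddk])
edlvisced
```

Categories that are not in the lookup (here "Other") automatically end up as `NA`, which matches the schema above.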

## Recoding categories with dplyr 

`dplyr` offers functions for recoding. There are three main functions:

- `recode`: For recoding single values
- `if_else`: For recoding based on a single logical condition
- `case_when`: For recoding based on several logical conditions

All these have to be combined with `mutate`.

In [16]:
# Read in the data

library(readr)
library(dplyr)

data <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")

Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_double(),
  ppltrst = col_double(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_double()
)


In [17]:
# Recoding two categories of edlvddk to ISCED (text value)
data <- data %>%
    mutate(edlvisced = recode(edlvddk, "Folkeskole 6.-8. klasse" = "Primary education", "Folkeskole 9.-10. klasse" = "Lower secondary education"))

head(data)

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm,edlvisced
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37,Lower secondary education
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43,Lower secondary education
4658,90,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,"


Using the `.default` argument, new values can be set for the values not specified.

In [18]:
# Recoding edlvddk to two categories ("lower secondary or below" and "above lower secondary")
data <- data %>%
    mutate(edlvbin = recode(edlvddk, "Folkeskole 6.-8. klasse" = "lower secondary or below", "Folkeskole 9.-10. klasse" = "lower secondary or below",
                              .default = "above lower secondary"))

head(data)

# Can you see any problems with the code above?

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm,edlvisced,edlvbin
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37,Lower secondary education,lower secondary or below
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43,Lower secondary education,lower secondary or below
4658,90,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",above lower secondary


Use `if_else` when recoding based on a single logical condition.

In [19]:
data <- data %>% # note: rows where edlvddk is missing (NA) stay NA unless the `missing` argument is set
    mutate(phdornot = if_else(edlvddk == "Forskeruddannelse. Ph.d., doktor", "PhD", "Not PhD"))

head(data)

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm,edlvisced,edlvbin,phdornot
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary,Not PhD
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary,Not PhD
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37,Lower secondary education,lower secondary or below,Not PhD
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43,Lower secondary education,lower secondary or below,Not PhD
4658,90,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary,Not PhD
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",above lower secondary,Not PhD
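How `if_else` treats missing values is worth checking on a small example. The vector below is made up for illustration, and `edu`, `default_result` and `labelled_result` are hypothetical object names.

```r
library(dplyr)

# Made-up education values, including a missing value
edu <- c("Forskeruddannelse. Ph.d., doktor", "Licentiat", NA)

# A missing condition gives a missing result
default_result <- if_else(edu == "Forskeruddannelse. Ph.d., doktor", "PhD", "Not PhD")
default_result

# The `missing` argument sets the value used when the condition is NA
labelled_result <- if_else(edu == "Forskeruddannelse. Ph.d., doktor", "PhD", "Not PhD",
                           missing = "Unknown")
labelled_result
```

Whether missing values should stay missing or get their own category depends on the analysis, so it is a choice worth making explicitly.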


Use `case_when` when recoding based on several logicals.

In [20]:
# Recoding two categories of edlvddk to ISCED (text value) - same as first recode example

data <- data %>%
    mutate(edlvisced = case_when(
        edlvddk == "Folkeskole 6.-8. klasse" ~ "Primary education", 
        edlvddk == "Folkeskole 9.-10. klasse" ~ "Lower secondary education",
        TRUE ~ edlvddk)) #This line keeps remaining values as they are

head(data)

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm,edlvisced,edlvbin,phdornot
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary,Not PhD
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary,Not PhD
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37,Lower secondary education,lower secondary or below,Not PhD
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43,Lower secondary education,lower secondary or below,Not PhD
4658,90,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary,Not PhD
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",above lower secondary,Not PhD
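`case_when` is not limited to matching text values; the conditions can be any logical expression and they are checked in order. The sketch below groups a made-up numeric variable into categories. The data frame `toy`, the variable `edugroup` and the cut-off values are assumptions chosen for illustration.

```r
library(dplyr)

# Made-up years of education, including a missing value
toy <- data.frame(eduyrs = c(9, 22, 11, NA, 35))

toy <- toy %>%
    mutate(edugroup = case_when(
        is.na(eduyrs) ~ NA_character_,     # keep missing values missing
        eduyrs <= 10 ~ "10 years or less",
        eduyrs <= 15 ~ "11-15 years",      # only reached if eduyrs is above 10
        TRUE ~ "more than 15 years"))      # everything else

toy
```

The first line handles missing values explicitly. Without it, a missing `eduyrs` would match none of the comparisons and fall through to the final `TRUE` line.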


## DISCUSSION: RECODING FUNCTIONS

- Now being familiar with the various ways of recoding with `dplyr`, how would you prefer recoding all values in `edlvddk` to the ISCED categories?

- Can you identify (possibly other) situations where `case_when` could be useful?

# Strings

A "string" is a programming term for a value containing text. A value of the class "character" is a string.

Below the first paragraph from chapter 15 of "The Picture of Dorian Gray" by Oscar Wilde is stored as a string (copied from: https://www.gutenberg.org/files/174/174-h/174-h.htm#chap15)

In [22]:
my_text <- "That evening, at eight-thirty, exquisitely dressed and wearing a large button-hole of Parma violets, Dorian Gray was ushered into Lady Narborough’s drawing-room by bowing servants. His forehead was throbbing with maddened nerves, and he felt wildly excited, but his manner as he bent over his hostess’s hand was as easy and graceful as ever. Perhaps one never seems so much at one’s ease as when one has to play a part. Certainly no one looking at Dorian Gray that night could have believed that he had passed through a tragedy as horrible as any tragedy of our age. Those finely shaped fingers could never have clutched a knife for sin, nor those smiling lips have cried out on God and goodness. He himself could not help wondering at the calm of his demeanour, and for a moment felt keenly the terrible pleasure of a double life. "
class(my_text)

## Texts as vectors
Texts can be thought of as vectors in a number of ways:
1. A collection of texts
2. A collection of words in a text
3. A collection of sentences in a text

Depending on what we are trying to figure out with text, R can be used in a number of ways.

Let's start by thinking of a text as a series of single words:

In [23]:
text_words <- unlist(strsplit(my_text, split = "\\s"))  # split at every whitespace character - "\\s" is the regular expression for whitespace
head(text_words)

The text is now a vector, each word with its own index (subset using `[]`):

In [24]:
text_words[5]
text_words[22]

This allows us to perform counts and other summaries.
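As a small illustration of such counts, the sketch below uses a short made-up sentence rather than the full text; `sentence`, `words` and `word_counts` are hypothetical object names.

```r
# Made-up sentence split into words
sentence <- "the cat and the dog and the bird"
words <- unlist(strsplit(sentence, split = "\\s"))

# Number of words
length(words)

# How often each word occurs, most frequent first
word_counts <- table(words)
sort(word_counts, decreasing = TRUE)

# Number of characters in each word
nchar(words)
```

The same functions can be applied to `text_words` from the paragraph above.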

## Working with strings: The `stringr` package

The package `stringr` is a tidyverse package for working with strings.

In [25]:
library(stringr)

`stringr` can be used in a variety of different ways.

In [26]:
# Changing case (here to lowercase)
str_to_lower(my_text)

In [27]:
# Looking up words
str_detect(my_text, "tragedy")

In [28]:
# Counting matches
str_count(my_text, "tragedy")

The functions of `stringr` also work on a collection of elements (like a vector). We can see this when we split the text into sentences and then use the same functions:

In [29]:
# Splitting text into elements; here separating at periods
# unlist is used to coerce to a vector; otherwise it is returned as a list
text_sent <- str_split(my_text, pattern = "\\.") %>% 
    unlist()

In [30]:
# Looking up word in each sentence
str_detect(text_sent, "tragedy")

In [31]:
# Counting word in each sentence
str_count(text_sent, "tragedy")

### A note on indexing and booleans
When a vector of boolean values is used as an index, R returns only the elements where the index is `TRUE`.

This means that we can use commands like `str_detect` to only return text elements containing specific words.

In [32]:
text_sent[str_detect(text_sent, "tragedy")]

`str_subset` has combined this functionality in one function:

In [33]:
str_subset(text_sent, "tragedy")
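The boolean vector itself can also be stored and reused, for example to select the elements that do *not* match or to count matches. The sentences below are made up, and `sentences` and `has_word` are hypothetical object names.

```r
library(stringr)

# Made-up sentences
sentences <- c("A tragedy unfolded", "He felt calm", "No tragedy here")

# One TRUE/FALSE per sentence
has_word <- str_detect(sentences, "tragedy")

sentences[has_word]   # sentences containing the word
sentences[!has_word]  # sentences NOT containing the word
sum(has_word)         # TRUE counts as 1, so this is the number of matching sentences
```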

## EXERCISE 2: WORKING WITH A TEXT AS A VECTOR

In the following, you will create a vector containing the sentences of a text and then look up certain words.

Make sure you have the package `stringr` installed and loaded.

1. Assign the following text snippet to an object:

    "Themis, who has already been alluded to as the wife of Zeus, was the daughter of Cronus and Rhea, and personified those divine laws of justice and order by means of which the well-being and morality of communities are regulated. She presided over the assemblies of the people and the laws of hospitality."
    

2. Convert the text snippet to a vector of sentences.
    
    a. Split the text into sentences using `str_split(texts, pattern = ",")` (splitting at commas instead of periods). Assign the result to an object.
    
    b. Unlist the object to convert it to a vector using `unlist()`. Assign to the same or a new object.
    
    
3. Use `str_subset()` to extract the sentences that contain the name "Zeus". 

## Regular expressions
Often when working with text, we have more "fuzzy" patterns we want to search for. 

Regular expressions are a common "language" for describing text patterns.

`stringr` supports regular expressions as pattern arguments.

Possible uses:
- Finding words of a certain length
- Finding sentences containing a certain word or pattern
- Finding words following a certain pattern
- and so on.

Regular expressions (or "regex") can be used in most functions in `stringr`. 

The function `str_subset()` creates a subset of elements with the strings containing the pattern.

In [34]:
text_sent

In [35]:
# Return sentences containing either "excited" or "wonder"
str_subset(text_sent, "excited|wonder")

In [36]:
# Return sentences containing words with uppercase (not including the first word of the sentence).
str_subset(text_sent, "\\w.*[A-Z]")
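Two of the uses listed earlier, finding words of a certain length and finding words following a certain pattern, can be sketched as follows. The sentence is shortened from the text above, and the object names (`sentence`, `long_words`, `words`, `capitalised`) as well as the minimum length of seven are assumptions made for illustration.

```r
library(stringr)

sentence <- "Dorian Gray was ushered into the drawing room by bowing servants"

# Words of seven or more characters:
# \\b is a word boundary, \\w{7,} is seven or more word characters
long_words <- str_extract_all(sentence, "\\b\\w{7,}\\b")[[1]]
long_words

# Words starting with an uppercase letter: ^ anchors the match to the start of the string
words <- str_split(sentence, pattern = " ")[[1]]
capitalised <- str_subset(words, "^[A-Z]")
capitalised
```

`str_extract_all()` returns a list with one element per input string, which is why `[[1]]` is used to get the vector of matches for the single sentence.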

# Control structures in R

R contains several control structures that can be implemented. "Loops" can for example be used to repeat commands over a range of values (a "for loop") or while a certain condition is met (a "while loop").

Furthermore, if-else statements can be used to specify conditions that have to be met before certain commands are run.

## For loops

A for loop is used to repeat one or more commands over a range of values.

Below a vector is created with some values. A for loop is then used to evaluate whether or not the value is above 5.

In [37]:
values <- c(9, 10, 4, 91, 27)

for (value in values){
    print(value > 5)
    }

[1] TRUE
[1] TRUE
[1] FALSE
[1] TRUE
[1] TRUE


Notice that the original vector is not affected by this particular for loop. 

When writing a for loop like the one above, the content of `value` changes with each iteration of the for loop (first 9, then 10, then 4 and so on).

Note that the object `value` is overwritten in each iteration, meaning that `value` still exists in the environment (holding the last value) after the for loop has run.

In [38]:
print(value)

[1] 27


Because `value` is created as a separate object, a for loop like the one above *cannot* be used to change the contents of the vector:

In [39]:
values <- c(9, 10, 4, 91, 27)

# Try to replace each value with itself plus 5
for (value in values){
    value <- value + 5
    }

print(values) # No change

[1]  9 10  4 91 27


If one wanted to change the actual content of a vector with a for loop, this can be done by iterating over the *index* of the vector instead:

In [40]:
values <- c(9, 10, 4, 91, 27)

# Replace each value with itself plus 5 by iterating over the index
for (i in 1:length(values)){
    values[i] <- values[i] + 5
    }

print(values)

[1] 14 15  9 96 32
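Two practical notes on index-based loops, sketched below with hypothetical object names (`looped`, `vectorised`, `empty`). First, many operations in R are vectorised, so simple arithmetic like this does not need a loop at all. Second, `seq_along()` is a slightly safer way of generating the index than `1:length()`, because it also behaves sensibly for empty vectors.

```r
values <- c(9, 10, 4, 91, 27)

# Loop over the index using seq_along()
looped <- values
for (i in seq_along(looped)){
    looped[i] <- looped[i] + 5
    }

# The vectorised equivalent: the addition is applied to every element
vectorised <- values + 5

identical(looped, vectorised)

# With an empty vector, 1:length() counts backwards from 1 to 0
empty <- numeric(0)
1:length(empty)   # two index values (1 and 0), so a loop would run twice
seq_along(empty)  # no index values, so a loop would not run
```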


## While loops

Where a for loop is limited by the values it iterates over, a while loop keeps running as long as a condition is met.

*This means that it is possible to create infinite loops! Be careful if you want to use while loops.*

In [41]:
x <- 1

while (x < 5){
    print(paste0("x is ", x, " and the loop keeps going"))
    x <- x + 1
    }

[1] "x is 1 and the loop keeps going"
[1] "x is 2 and the loop keeps going"
[1] "x is 3 and the loop keeps going"
[1] "x is 4 and the loop keeps going"


In [42]:
fruit <- "banana"

while (nchar(fruit) <= 22){
    print(paste0("the word is ", fruit))
    fruit <- paste0(fruit, "na")
    }

[1] "the word is banana"
[1] "the word is bananana"
[1] "the word is banananana"
[1] "the word is bananananana"
[1] "the word is banananananana"
[1] "the word is bananananananana"
[1] "the word is banananananananana"
[1] "the word is bananananananananana"
[1] "the word is banananananananananana"
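
One possible way to guard against an accidental infinite loop is to count the iterations and leave the loop once a cap is reached. The sketch below is only an illustration: the names `iterations` and `max_iterations` are made up for the example, and `break` is the base R statement for leaving a loop early.

```r
x <- 1
iterations <- 0
max_iterations <- 100 # Hypothetical safety cap

# The condition below never becomes FALSE because x only grows
while (x > 0){
    x <- x + 1
    iterations <- iterations + 1

    # Leave the loop once the cap is reached
    if (iterations >= max_iterations){
        print("Safety cap reached - stopping the loop")
        break
        }
    }

print(iterations)
```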


## If-else statements

If-else statements allow one to set conditions that have to be met before code is run. 

The most basic structure contains an "if" block and an "else" block. The "if" block contains the code to be run if the condition is met, while the "else" block is run in any other instance.

In [43]:
x <- 12

if (x > 10){
    print("The number is larger than 10!")
    } else {
    print("The number is not larger than 10!")
    }

[1] "The number is larger than 10!"


It is possible to specify several conditions using `else if`. The conditions are evaluated in order, meaning that as soon as a condition is met, the corresponding block of code is run and the remaining blocks are skipped. 

In [44]:
x <- 7

if (x > 10){
    print("The number is larger than 10!")
    } else if (x > 5) {
    print("The number is larger than 5!")
    } else {
    print("The number is not larger than 5!")
    }

[1] "The number is larger than 5!"


Because only the block belonging to the first condition that is met is run, later blocks are disregarded even if their conditions are also met.

In [45]:
x <- 11

if (x > 10){ # This condition is met - run the code below
    print("The number is larger than 10!")
    } else if (x > 5) {
    print("The number is larger than 5!") # This condition is also met but the code is not run
    } else {
    print("The number is not larger than 5!")
    }

[1] "The number is larger than 10!"
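
The order of the conditions therefore matters. As a sketch of what can go wrong, the example below places the least restrictive condition first, so the `x > 10` block can never be reached. The object name `message_wrong` is made up for the illustration:

```r
x <- 11

# Conditions ordered from least to most restrictive
if (x > 5){
    message_wrong <- "The number is larger than 5!"
    } else if (x > 10){
    # Never reached: any x above 10 is also above 5
    message_wrong <- "The number is larger than 10!"
    } else {
    message_wrong <- "The number is not larger than 5!"
    }

print(message_wrong)
```

Placing the most restrictive condition first, as in the examples above, avoids this.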


## If-statements and for loops

If-statements are useful when combined with other control structures. By combining for loops and if-statements, one can write code where the commands executed in the for loop differ based on a condition:

In [46]:
values <- c(1:10)

for (value in values){
    if (value %% 2 == 0){
        print(paste0(value, " is an even number!"))
        } else {
        print(paste0(value, " is not an even number!"))
        }
    }

[1] "1 is not an even number!"
[1] "2 is an even number!"
[1] "3 is not an even number!"
[1] "4 is an even number!"
[1] "5 is not an even number!"
[1] "6 is an even number!"
[1] "7 is not an even number!"
[1] "8 is an even number!"
[1] "9 is not an even number!"
[1] "10 is an even number!"
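
The same combination can be used to change only some of the values in a vector by iterating over the index. The sketch below caps every value above 20 at 20; the cut-off is arbitrary and chosen only for the example. Note that the `else` block can be left out when nothing should happen in the other case.

```r
values <- c(9, 10, 4, 91, 27)

# Replace every value above 20 with 20, leave the rest unchanged
for (i in seq_along(values)){
    if (values[i] > 20){
        values[i] <- 20
        }
    }

print(values)
```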


# EXERCISE 3: CONTROL STRUCTURES

- Create a vector of words: `words <- c("potato", "cat", "dog", "monitor", "carpenter", "mouse", "refrigerator")`

- Write a for loop that only prints the words with more than 5 characters (the function `nchar` returns the number of characters in a word).

In [47]:
words <- c("potato", "cat", "dog", "monitor", "carpenter", "mouse", "refrigerator")

for (word in words){
    if (nchar(word) > 5) {
        print(word)
        }
    }

[1] "potato"
[1] "monitor"
[1] "carpenter"
[1] "refrigerator"


# Functions in R

Functions are commands used to transform an object in some way and return an output.

The input to a function is an "argument". The number of arguments varies between functions.

Functions have the basic syntax: `function(arg1, arg2, arg3)`.

## Functions and their arguments

Arguments can be either required or optional. Required arguments are arguments the function needs in order to return an output.

**Required arguments**

In the base function `mean()`, `x` (the object to calculate the mean of) is a required argument. If we try to run the function without it, it will return an error:

In [48]:
mean()

ERROR: Error in mean.default(): argument "x" is missing, with no default


The function needs `x` to work:

In [49]:
numbers <- c(2, 9, 10, 13)
mean(numbers)

**Optional arguments**

Optional arguments are additional arguments that can be given to a function. Often times these arguments can be seen as "settings" for the function, changing a certain way the function behaves. Optional arguments always have a default value. The default values of the optional arguments can be seen in the documentation of the function.

In the base function `mean()`, the argument `na.rm` is an optional argument. The default value can be inspected either by looking up the documentation (`?mean`) or by inspecting the source code itself.

The source code of a function can be inspected by simply inputting the function name in the console (in RStudio, it is also possible to inspect the source code by placing the cursor inside the function and pressing F2). 

(NOTE: for ordinary numeric vectors, `mean()` automatically calls `mean.default()`, which contains the actual source code).

In [50]:
mean.default

As can be seen in the source code, the argument `na.rm` is set to `FALSE`. The documentation describes what the argument does ("a logical value indicating whether NA values should be stripped before the computation proceeds").
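
As a supplement to reading the source code, the defaults can also be looked up programmatically. The sketch below uses the base R function `formals()`, which returns the arguments of a function together with their default values; the object name `defaults` is made up for the example:

```r
# formals() returns the arguments of a function together with their defaults
defaults <- formals(mean.default)

print(names(defaults)) # Argument names
print(defaults$na.rm)  # Default value of na.rm
print(defaults$trim)   # Default value of trim
```

Required arguments such as `x` appear in the list without a default value.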

The default value can be changed simply by passing the argument when calling the function, as long as the input given is valid for the argument. The documentation describes what inputs are valid for the argument. In the case of `na.rm`, a boolean value is a valid input (`TRUE` or `FALSE`).

In [51]:
numbers <- c(2, 9, 10, 13, NA)
mean(numbers) # With default

mean(numbers, na.rm = TRUE) # Default changed

### Specifying arguments

Notice that it is often not necessary to specify the name of the argument. As long as the arguments are specified in the right order (the one given in the documentation), it is enough to give the input values. This is why it is not necessary to specify `x` when using `mean()`:

In [52]:
numbers <- c(2, 9, 10, 13)
mean(numbers) # Works

mean(x = numbers) # Also works

When arguments are not named, R assumes that the arguments are specified in order. If one names the arguments, they can be put in any order:

In [53]:
numbers <- c(2, 9, 10, 13, NA)
mean(TRUE, numbers) # Does not work

ERROR: Error in mean.default(TRUE, numbers): 'trim' must be numeric of length one


In [54]:
mean(na.rm = TRUE, x = numbers) # Does work
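
The error from `mean(TRUE, numbers)` arises because the second argument of `mean.default()` is `trim`, so `numbers` ends up as the value of `trim`. The sketch below shows three equivalent ways of passing the arguments; the object names `positional`, `named` and `mixed` are made up for the example:

```r
numbers <- c(2, 9, 10, 13, NA)

# All positional: x = numbers, trim = 0, na.rm = TRUE
positional <- mean(numbers, 0, TRUE)

# All named: the order no longer matters
named <- mean(na.rm = TRUE, trim = 0, x = numbers)

# Mixed: first argument by position, the rest by name
mixed <- mean(numbers, na.rm = TRUE)

print(c(positional, named, mixed))
```

The mixed form is often the most readable: the main input is given by position and the "settings" are given by name.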

## EXERCISE 4: FUNCTIONS AND THEIR ARGUMENTS

- Inspect the documentation for `str_detect`. Which arguments does the function require and which are optional?

- The function `head()` is used to return the first 6 rows of a dataframe. Is there a way to make the function return more than 6 rows?

## Creating functions in R

Functions can be created like any other object in R. A function consists of arguments, some code to be executed and a return statement indicating what the function should return.

The code below creates a function for adding 5 to a number:

In [1]:
add5 <- function(x) {
    result <- x + 5
    return(result)
    }

Running the code does not return any output but makes the function available in the environment:

In [19]:
add5(10)

**Several arguments**

Just like existing functions, any number of arguments can be added to a function. Below the function is changed to simply add two input numbers:

In [21]:
add2num <- function(x, y){
    result <- x + y
    return(result)
    }

add2num(7, 9)

**Optional arguments in own functions**

Optional arguments can be created simply by specifying a default value for the argument. Below the function is changed to have the second number be 10 by default but it can still be changed when using the function:

In [24]:
add2num <- function(x, y = 10){
    result <- x + y
    return(result)
    }

add2num(7) # Using the default value

add2num(7, 8) # Changing the default value

**The return statement**

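The return statement indicates what the function should return. Without a return statement, the function returns the value of the last expression it evaluated. If that expression is an assignment, as below, the value is returned invisibly, so calling the function prints no output:

In [25]:
add2num <- function(x, y = 10){
    result <- x + y
    }

add2num(7) # No return statement and the last expression is an assignment - no printed output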
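
That nothing is printed does not mean the value is lost: it can still be captured by assigning the call to an object, and a function whose last expression is the value itself prints it as usual. A minimal sketch; the names `add2num_assign`, `add2num_last` and `stored` are made up for the example:

```r
# Last expression is an assignment: the value is returned invisibly
add2num_assign <- function(x, y = 10){
    result <- x + y
    }

# Last expression is the value itself: it is returned and printed
add2num_last <- function(x, y = 10){
    x + y
    }

stored <- add2num_assign(7) # Nothing is printed, but the value is stored
print(stored)

add2num_last(7)
```

Writing an explicit `return()` makes the intended output easier to see when reading the function.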

The return statement also marks the end of the function, meaning that the function will stop execution when reaching a return statement (i.e. code following a return statement is ignored):

In [26]:
add2num <- function(x, y = 10){
    result <- x + y
    print("This is included in the function")
    
    return(result)
    
    print("This is not included in the function")
    
    }

add2num(7) # Execution stops at return() - the second print() is never run

[1] "This is included in the function"


**Objects inside functions**

Objects created inside a function only exist while the function is run; i.e. the objects are not created outside of the function and do not become part of the accessible environment:

In [32]:
result # Does not exist - only created inside the function

ERROR: Error in eval(expr, envir, enclos): object 'result' not found
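
The same point can be checked without triggering an error by using the base R function `exists()`. In the sketch below, the object name `result_inside` is made up for the example and is assumed not to exist in the environment beforehand:

```r
add5 <- function(x){
    result_inside <- x + 5 # Only exists while the function runs
    return(result_inside)
    }

output <- add5(10)

print(output)                  # The returned value is available
print(exists("result_inside")) # The object created inside the function is not
```

Only what the function returns is available afterwards, and only if it is stored in an object.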


### Functions and control structures

Control structures like if-else statements can easily be incorporated in a function. In a previous example an if-else statement was used to check whether a number is larger than 10. This could be written as a function instead:

In [28]:
isabove10 <- function(x){
    if (x > 10){
        print("The number is larger than 10!")
    } else {
        print("The number is not larger than 10!")
    }
    } 

isabove10(12)

[1] "The number is larger than 10!"


Note that this function does not contain a return statement because it simply prints text telling whether or not the number is larger than 10.
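
A function can also combine several control structures and return a value that can be reused. The sketch below is an illustration only: `count_above` and its argument `threshold` are hypothetical names. It loops over a vector and counts how many values exceed the threshold.

```r
count_above <- function(values, threshold = 10){
    count <- 0

    # Go through the values one by one and count those above the threshold
    for (value in values){
        if (value > threshold){
            count <- count + 1
            }
        }

    return(count)
    }

count_above(c(9, 10, 4, 91, 27))    # Default threshold of 10
count_above(c(9, 10, 4, 91, 27), 5) # Threshold changed to 5
```

Because the count is returned rather than printed, it can be stored in an object and used in later calculations.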

### Functions in functions

When writing functions, it is possible to incorporate other functions. This allows one to, for example, create so-called "wrapper functions": functions that use existing functions but with different settings (for example, `read_csv` from the package `readr` is a wrapper function of `read_delim`).

Here is a simple wrapper function for the `head()` function with a different default:

In [2]:
head10 <- function(x, n = 10){
    head(x, n)
    }

In [30]:
library(readr)

data <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")

head10(data)

Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_double(),
  ppltrst = col_double(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_double()
)


idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43
4658,90,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61
7251,300,5,Yes,Dansk Folkeparti - Danish People's Party,1993,40,Female,1975,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",13,32,34,22000.0,30000.0,68
7887,360,8,Yes,Socialdemokratiet - The Social democrats,1983,55,Male,1958,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks",25,39,39,36000.0,42000.0,89
9607,540,9,Yes,Alternativet - The Alternative,1982,64,Female,1964,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",13,32,34,32000.0,,50
11123,150,7,Yes,Socialdemokratiet - The Social democrats,1994,45,Male,1974,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",16,37,37,9000.0,,62


## EXERCISE 5: CREATING FUNCTIONS

- Create a wrapper function for the `mean()` function where missing values are removed by default.

- Use the function on a valid variable in the ESS dataset.

### Use cases for creating own functions

Creating your own functions is often not necessary at all when using R but it can come in handy. Here are some examples of possible use cases:

- Creating wrapper functions
- Converting repeated parts of a script to a function
- Creating functions for specific datasets (for example, a single function containing various data management tasks for variables in ESS data that are used in several rounds, so that the same function can be used for ESS 2010, 2012, 2014 and so on).
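
As an illustration of the second use case, a calculation that is repeated for several variables can be moved into a function. The sketch below uses made-up numbers rather than the ESS data, and `to_hours` is a hypothetical helper name:

```r
# Made-up example values (minutes)
netuse_minutes <- c(180, 60, 240, NA)
interview_minutes <- c(119, 55, 37, 43)

# Without a function: the same calculation is written out for each vector
netuse_hours <- round(netuse_minutes / 60, 1)
interview_hours <- round(interview_minutes / 60, 1)

# With a function: the calculation is written once and reused
to_hours <- function(minutes, digits = 1){
    round(minutes / 60, digits)
    }

print(to_hours(netuse_minutes))
print(to_hours(interview_minutes))
```

If the calculation needs to change later, it only has to be changed in one place.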