# <center>WORKING WITH TABLE DATA WITH R</center>
<img src="../elem/caldiss_symbol_square.png" width="200">


<i><center>Kristian Gade Kjelmann</center></i>
<i><center>March 5th 2020</center></i>

# Reading and inspecting table data in R

In [None]:
library(readr)

ess_data <- read_csv("https://github.com/CALDISS-AAU/workshop_r-table-data/raw/master/data/ess2014_mainsub_p1.csv")

In [None]:
#First 6 rows of data
head(ess_data)

In [None]:
#Summary statistics
summary(ess_data)

In [None]:
#Names of variables (columns)
colnames(ess_data)

In [None]:
#First six values of a variable (column)
head(ess_data$gndr)

In [None]:
#The class of the variable
class(ess_data$gndr)

In [None]:
#Basic R subsetting with index
ess_data[c(1:5), c("gndr", "alcfreq")]

In [None]:
#Basic R subsetting with logical
ess_data[which(ess_data$height > 190), ] #Select respondents taller than 190 cm
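The `which()` wrapper in the cell above matters when the variable contains missing values. The sketch below uses a small made-up data frame (not the ESS data; `toy` and its values are assumptions for illustration) to show the difference between subsetting with a plain logical vector and subsetting with `which()`.

```r
# Toy data with made-up values, used only for illustration
toy <- data.frame(idno = 1:4, height = c(195, 170, NA, 192))

# A plain logical condition evaluates to NA for the missing height,
# and base R returns an extra row filled with NA for it
logical_subset <- toy[toy$height > 190, ]

# which() returns only the positions where the condition is TRUE,
# so the row with the missing height is dropped
which_subset <- toy[which(toy$height > 190), ]

nrow(logical_subset)
nrow(which_subset)
```

The first subset has three rows (one of them all `NA`), the second has the two rows that actually satisfy the condition.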

## Datawrangling with dplyr

`dplyr` is incredibly useful for data wrangling for several reasons. It provides a more concise syntax for writing commands, and it offers quick and intuitive functions for selecting, arranging, filtering, merging and so on.

Let's start by looking at some common datawrangling functions:
- `select()`: Select subset of variables
- `filter()`: Select subset of observations based on condition
- `arrange()`: Order dataset by specific variable

`dplyr` is a part of the `tidyverse` collection. It often makes sense to just load the entire `tidyverse` instead of just `dplyr`.

In [None]:
library(tidyverse)

### Select

`select()` is used to pick specific variables or to reorder the variables.

In [None]:
#Select specific columns
ess_data %>%
    select(idno, ppltrst, vote) %>%
    head(4)

In [None]:
#Select all columns except ppltrst with '-'
ess_data %>%
    select(-ppltrst)

In [None]:
#Select all columns but with a specific column moved (yrbrn)
ess_data %>%
    select(yrbrn, everything())

### Filter

`filter()` is used to select observations based on a given condition. It is often easier to write and more intuitive to use than basic R subsetting/filtering.

In [None]:
#Filter for a given condition
ess_data %>%
    filter(yrbrn > 1990)

In [None]:
#Filter for several conditions
ess_data %>%
    filter(yrbrn > 1990 & vote == "Yes")

In [None]:
#Filter for non-missing
ess_data %>%
    filter(is.na(cgtsday)==FALSE) #or !(is.na(cgtsday))

In [None]:
#Filter across variables - only observations with missing
ess_data %>%
    filter_all(any_vars(is.na(.)))

In [None]:
#Filter across variables - only complete observations
ess_data %>%
    filter_all(all_vars(is.na(.)==FALSE))

In [None]:
#Alternative for keeping only complete observations
ess_data %>%
    drop_na()
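`filter_all()` with `any_vars()`/`all_vars()` still works, but newer versions of `dplyr` mark these as superseded. The sketch below shows the same two filters written with `if_any()` and `if_all()`, assuming `dplyr` 1.0.4 or later is installed. The data frame `toy` and its values are made up for illustration.

```r
library(dplyr)

# Toy data with made-up values, used only for illustration
toy <- data.frame(
    idno = 1:3,
    cgtsday = c(10, NA, 5),
    height = c(180, 175, NA)
)

# Only observations with at least one missing value
with_missing <- toy %>%
    filter(if_any(everything(), is.na))

# Only complete observations
complete <- toy %>%
    filter(if_all(everything(), ~ !is.na(.x)))

with_missing
complete
```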

### Arrange

`arrange()` is used to sort the observations by one or several variables.

Sorting/arranging has few practical applications in statistics but can be useful for inspecting the data or when working with time series data.

In [None]:
#Sort ascending
ess_data %>%
    arrange(yrbrn)

In [None]:
#Sort descending using desc()
ess_data %>%
    arrange(desc(yrbrn))

In [None]:
#Sort by several
ess_data %>%
    arrange(desc(yrbrn), height)

A note on missing values and `arrange()`: missing values are always placed last, regardless of whether you sort ascending or descending.
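The placement of missing values can be checked on a small made-up data frame (`toy` and its values are assumptions for illustration, not taken from the ESS data):

```r
library(dplyr)

# Toy data with made-up values; respondent 2 has a missing year of birth
toy <- data.frame(idno = 1:4, yrbrn = c(1980, NA, 1995, 1960))

# Ascending sort
asc <- toy %>%
    arrange(yrbrn)

# Descending sort
dsc <- toy %>%
    arrange(desc(yrbrn))

asc$idno
dsc$idno
```

In both results, respondent 2 (the one with the missing `yrbrn`) ends up in the last row.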

### Wonders of the pipe

The great thing about the pipe (`%>%`) is that it makes the code a lot shorter to write.

Instead of having to constantly specify the dataset, R uses the output from the previous line as the input to the current line.

This also means that commands can easily be chained:

In [None]:
#Chaining commands with pipe
ess_data %>%
    drop_na() %>%
    filter(yrbrn > 1983) %>%
    select(yrbrn, height, weight, gndr) %>%
    arrange(desc(yrbrn), height) %>%
    head(4)
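To see what the pipe saves, the sketch below writes the same short chain twice: once with the pipe and once as nested function calls. It runs on a small made-up data frame (`toy` and its values are assumptions for illustration).

```r
library(dplyr)

# Toy data with made-up values, used only for illustration
toy <- data.frame(
    yrbrn = c(1990, 1985, 1970, 1995),
    height = c(180, 165, 172, 190)
)

# With the pipe: read from top to bottom
piped <- toy %>%
    filter(yrbrn > 1983) %>%
    arrange(desc(yrbrn)) %>%
    head(2)

# Without the pipe: read from the inside out
nested <- head(arrange(filter(toy, yrbrn > 1983), desc(yrbrn)), 2)

# Both versions contain the same rows in the same order
all.equal(piped, nested)
```

The nested version does the same work, but the order of operations has to be read from the innermost call outwards.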

# EXERCISE 1

Using the `dplyr` package and the functions `drop_na`, `filter` and `arrange`, subset the data to show the following:
- Only complete observations (no missing)
- Only people born before 1970
- The oldest respondents and the ones smoking the most (`cgtsday`) at the top of the dataset

# Recoding

Basic R recoding can quickly become a bit verbose as you have to specify the dataset several times and write out a longer condition.

In [None]:
#Basic R recoding for numerical values
ess_copy <- ess_data

ess_copy[which(is.na(ess_copy$cgtsday)), "cgtsday"] <- 999

head(ess_copy, 4)

In [None]:
#Basic R recoding for text
ess_copy <- ess_data

ess_copy[which(ess_copy$alcfreq == "Once a week"), "alcfreq"] <- "WEEKLY DRINKER"

head(ess_copy, 4)

## New variables in R

In base R, a new variable is created by assigning values to a variable name that does not already exist in the dataset:

In [None]:
#Creating bmi variable in base R (weight in kg, height in cm)
ess_data$bmi <- ess_data$weight / (ess_data$height/100)^2
head(ess_data)

## New variables with dplyr

New variables can be created with dplyr using the function `mutate`. This function is used both for creating and for manipulating/recoding variables.

The advantage of `mutate` is that it can be used in a pipe:

In [None]:
#Creating bmi variable with mutate
ess_data %>%
    select(idno, weight, height, gndr, yrbrn) %>%
    mutate(bmi = weight/(height/100)^2) %>%
    head(4)

Using `mutate` in combination with `if_else`, we can specify different values based on conditions.

In [None]:
ess_data

In [None]:
#Creating smoker dummy with mutate and if_else
library(dplyr)
ess_data %>%
    mutate(smoker = if_else(is.na(cgtsday), "No", "Yes")) %>%
    head(4)

It is also possible to create several variables in the same function call:

In [None]:
#Creating both smoker dummy and bmi
ess_data %>%
    mutate(smoker = if_else(is.na(cgtsday), "No", "Yes"),
          bmi = weight/(height/100)^2) %>%
    select(idno, gndr, bmi, cgtsday, smoker) %>%
    head(4)

Using `case_when` we can specify multiple conditions and assign a value for each. Observations that match none of the conditions get a missing value (`NA`):

In [None]:
#Creating height_cat using case_when
ess_data %>%
    mutate(height_cat = case_when(
        height >= 190 ~ "tall",
        height < 177 ~ "not tall"
    )) %>%
    select(idno, height, gndr, height_cat) %>%
    head(4)
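The conditions in a `case_when` do not have to cover every value. The sketch below, on a small made-up data frame (`toy`, `height_gap` and `height_full` are assumptions for illustration), compares a version with a gap between the conditions to a version where a final `TRUE ~ ...` line catches everything that is left.

```r
library(dplyr)

# Toy data with made-up values, used only for illustration
toy <- data.frame(idno = 1:4, height = c(195, 180, 170, NA))

toy_cat <- toy %>%
    mutate(
        # 180 matches neither condition, so it becomes NA
        height_gap = case_when(
            height >= 190 ~ "tall",
            height < 177 ~ "not tall"
        ),
        # Conditions cover all heights; TRUE catches the missing height
        height_full = case_when(
            height >= 190 ~ "tall",
            height < 190 ~ "not tall",
            TRUE ~ "unknown"
        )
    )

toy_cat
```

Conditions are evaluated in order, so the `TRUE` line only applies to observations that none of the earlier lines matched.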

## Recoding with dplyr 

`dplyr` offers functions for recoding. There are three main functions:
- `recode`: For recoding single values
- `if_else`: For recoding based on logical
- `case_when`: For recoding based on several logicals

All these have to be combined with `mutate`.

In [None]:
#Recoding alcfreq to two categories
ess_data %>%
    mutate(alcfreq = recode(alcfreq, "Every day" = "DAILY DRINKER", "Once a week" = "WEEKLY DRINKER"))

Using the `.default` argument, a new value can be set for all the values not specified.

In [None]:
#Recoding alcfreq to three categories
ess_data %>%
    mutate(new_alcfreq = recode(alcfreq, "Every day" = "DAILY DRINKER", "Once a week" = "WEEKLY DRINKER", 
                            .default = "IRRELEVANT"))

Use `if_else` when recoding based on a single logical condition.

In [None]:
ess_data %>% #note that all values other than "Very good" are recoded; missing stay missing (NA)
    mutate(health = if_else(health == "Very good", "HEALTHY PERSON", "LESS HEALTHY PERSON"))
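How `if_else` treats missing values can be checked on a small made-up data frame (`toy`, `health_two` and `health_three` are assumptions for illustration). When the condition evaluates to `NA`, the result is `NA` unless the `missing` argument is given.

```r
library(dplyr)

# Toy data with made-up values; respondent 3 has missing health
toy <- data.frame(
    idno = 1:3,
    health = c("Very good", "Bad", NA),
    stringsAsFactors = FALSE
)

toy_health <- toy %>%
    mutate(
        # Missing health stays missing
        health_two = if_else(health == "Very good",
                             "HEALTHY PERSON", "LESS HEALTHY PERSON"),
        # The missing argument sets an explicit value for missing
        health_three = if_else(health == "Very good",
                               "HEALTHY PERSON", "LESS HEALTHY PERSON",
                               missing = "UNKNOWN")
    )

toy_health
```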

Use `case_when` when recoding based on several logicals.

In [None]:
#Recoding health to healthy/unhealthy
ess_data %>%
    mutate(health = case_when(
        health == "Very good" ~ "healthy", 
        health == "Good" ~ "healthy",
        health == "Bad" ~ "unhealthy",
        health == "Very bad" ~ "unhealthy", 
        TRUE ~ health)) #This line keeps remaining values as they are

# EXERCISE 2

1. Use `mutate` to create an age variable
2. Use `mutate` and `case_when` to create a variable for whether or not the respondent drinks at least once a week

# Categorical variables

Categorical variables in R are typically stored as "factors".

Unlike in some other statistical software, you do not work with the categories of a factor through numerical codes. R does store a factor internally as integer codes, but values in a factor are referred to by their category name.

Factors can sometimes cause issues. Before R 4.0.0, base import functions such as `read.csv` imported text variables as factors by default, which gives you little control over how they are converted to categorical variables.
It often makes more sense to recode the variables as factors yourself.

Factors are necessary in a lot of functions for creating graphs or statistical models.

In [None]:
#Coerce as factor
ess_data %>%
    mutate(gndr = as.factor(gndr)) 

In [None]:
#Isolating a factor
gend_cat <- as.factor(ess_data$gndr)

In [None]:
#Inspecting values and levels
unique(gend_cat)

In [None]:
#Create factor as ordered/ordinal (but what order?)
gend_order <- factor(ess_data$gndr, ordered = TRUE)

In [None]:
#Inspecting values and levels
unique(gend_order)

In [None]:
#Creating ordered factor but setting custom order
polintr_fact <- factor(ess_data$polintr, levels = c('Not at all interested', 'Hardly interested',
                                                    'Quite interested', 'Very interested'), ordered = TRUE)

unique(polintr_fact)
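The question "but what order?" can be answered by inspecting the levels. The sketch below uses a short made-up vector (`gndr_toy` is an assumption for illustration) to show that levels are sorted alphabetically when no `levels` argument is given, and that a factor is stored as integer codes pointing to those levels.

```r
# Toy vector with made-up values, used only for illustration
gndr_toy <- c("Male", "Female", "Female", "Male")

# Unordered factor: levels are sorted alphabetically by default
gend_toy_cat <- factor(gndr_toy)
levels(gend_toy_cat)

# The underlying integer codes refer to positions in levels()
as.integer(gend_toy_cat)

# Ordered factor without custom levels: same alphabetical order,
# so "Female" is treated as lower than "Male"
gend_toy_order <- factor(gndr_toy, ordered = TRUE)
gend_toy_order[1] > gend_toy_order[2]
```

An alphabetical order is rarely meaningful for ordinal variables, which is why the `levels` argument is set explicitly for `polintr` above.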

# Statistical models

There are a lot of packages for creating statistical models, including packages for all kinds of specific analyses.

A recurring element of a lot of these packages and functions, however, is that the model is specified as a formula.

Formulas are specified as:
- `y ~ x1 (+x2 +x3 ... +xn)`


The code below creates a linear model with weight as the outcome and year of birth as the predictor:

In [None]:
#Linear model for weight and yrbrn
lm(weight ~ yrbrn, ess_data)

In [None]:
#Multiple
lm(bmi ~ weight + height, ess_data)

An advantage of R is the ability to store the model as any other object making it easy to store and recall past results.

In [None]:
#Storing model
bmi_model <- lm(bmi ~ weight + height, ess_data)

In [None]:
#Summary statistics for bmi_model
summary(bmi_model)
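Because the model is an ordinary object, individual parts of it can be pulled out with standard functions. The sketch below fits a model on a small made-up data frame (`toy`, `toy_model` and the values are assumptions for illustration) and extracts coefficients, residuals and a prediction.

```r
# Toy data with made-up values, used only for illustration
toy <- data.frame(
    weight = c(60, 72, 85, 90, 68),
    height = c(165, 175, 182, 190, 170)
)

# Storing the model
toy_model <- lm(weight ~ height, data = toy)

# Estimated coefficients as a named vector
coef(toy_model)

# Residuals, one per observation
residuals(toy_model)

# Predicted weight for a new observation with height 180
predict(toy_model, newdata = data.frame(height = 180))
```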

## Models and categorical

When working with categoricals in R, almost everything about how to treat that categorical in a model should be specified *before* creating the model.

- Should the variable be treated as ordered (ordinal) or unordered (nominal)?
- What value should be used as reference/base?
- Is the ordinal variable to be used as an interval variable?
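These choices show up directly in the coefficients of the model. The sketch below uses a small made-up data frame (`toy` and all variable names in it are assumptions for illustration) and R's default contrast settings. An unordered factor is entered as dummies against a reference level, `relevel()` changes that reference, and an ordered factor is entered with polynomial terms (`.L`, `.Q`).

```r
# Toy data with made-up values, used only for illustration
toy <- data.frame(
    height = c(170, 175, 180, 168, 182, 177),
    health = c("Bad", "Fair", "Good", "Bad", "Good", "Fair"),
    stringsAsFactors = FALSE
)

# Unordered factor: first level ("Bad") is the reference
toy$health_nom <- factor(toy$health)

# Same factor with "Good" as the reference
toy$health_ref <- relevel(toy$health_nom, ref = "Good")

# Ordered factor with an explicit order
toy$health_ord <- factor(toy$health, levels = c("Bad", "Fair", "Good"),
                         ordered = TRUE)

names(coef(lm(height ~ health_nom, toy)))
names(coef(lm(height ~ health_ref, toy)))
names(coef(lm(height ~ health_ord, toy)))
```

The first two models report one coefficient per non-reference category, while the third reports a linear and a quadratic term across the ordered categories.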


In [None]:
#Linear model with categorical (2 values)
lm(height ~ yrbrn + gndr, ess_data)

In [None]:
#Linear model with ordinal
ess_data$healthcat <- factor(ess_data$health, levels = c('Very bad', 'Bad', 'Fair', 'Good', 'Very good'), ordered = TRUE)

summary(lm(height ~ yrbrn + healthcat, ess_data))

In [None]:
#Linear model with nominal (character as factor)
summary(lm(height ~ yrbrn + health, ess_data))

## Output a model

In [None]:
library(stargazer)

In [None]:
height_model <- lm(height ~ yrbrn + health, ess_data)
stargazer(height_model, type = "html", out = "../output/modelout.html")