# Week 11: (Different) types of Regression


## Introduction 

In this tutorial, we will learn how to perform multiple logistic regression.

**Preparation and session set up**

Before turning to the analysis, please install the packages by running the code below this paragraph (remove the leading `#` from each line first). If you have already installed these packages, you can skip this section. Installing all of the libraries may take between 1 and 5 minutes, so do not worry if it takes some time.


In [None]:
# install packages
#install.packages("here")
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("glmulti")
#install.packages("sjPlot")
#install.packages("report")
#install.packages("car")
#install.packages("rms")
#install.packages("broom")
#install.packages("readxl")


Now that we have installed the packages, we activate them as shown below.



In [None]:
# activate packages
library(here)
library(dplyr)
library(ggplot2)
library(glmulti)
library(sjPlot)
library(report)
library(car)
library(rms)
library(broom)


##  Tutorial Activity 

Form groups and help each other bring the data into the correct format, visualize the data, and perform the logistic regression.

## Task 1

Multiple logistic regression is a more informative alternative to the χ²-test because it shows exactly which variable levels have a significant relationship with the dependent variable.
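
To make this contrast concrete, here is a minimal sketch on simulated data. The data frame `sim` and its columns `error`, `abroad`, and `prof` are invented for illustration and are not the course data.

```r
# Simulated data (hypothetical; NOT the course data set)
set.seed(11)
n <- 300
abroad <- factor(sample(c("no", "yes"), n, replace = TRUE))
prof <- factor(sample(c("low", "mid", "high"), n, replace = TRUE),
               levels = c("low", "mid", "high"))
# assumed true log-odds of making an error
eta <- 0.5 - 1.0 * (abroad == "yes") - 0.8 * (prof == "high")
error <- factor(ifelse(rbinom(n, 1, plogis(eta)) == 1, "error", "correct"),
                levels = c("correct", "error"))
sim <- data.frame(error, abroad, prof)

# chi-squared test: a single overall p-value for error vs. proficiency
chi <- chisq.test(table(sim$error, sim$prof))
chi$p.value

# logistic regression: one coefficient (and p-value) per non-reference level
m <- glm(error ~ abroad + prof, family = binomial, data = sim)
coef(summary(m))
```

The χ²-test only says whether `error` and `prof` are associated at all, whereas the regression table has separate rows for `profmid` and `profhigh` (each compared to the reference level `low`) while also accounting for `abroad`.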

In this example, we want to see which factors affect whether students make plural marking errors.

Load the data set `week11d2.xlsx`. Visualize the data and perform a logistic regression to determine which factors affect whether students make plural marking errors.


In [None]:
dat1 <- readxl::read_excel(here::here("data", "week11d2.xlsx")) 
# inspect
head(dat1)


Prepare data



In [None]:
dat1 <- dat1 %>% 
  dplyr::mutate(Proficiency = factor(Proficiency),
                Abroad = factor(Abroad),
                University = factor(University),
                PluralError = factor(PluralError))
# inspect
head(dat1)


Visualize data



In [None]:
dat1  %>%
  ggplot(aes(PluralError, fill = PluralError)) +
  geom_bar(position = position_dodge(), stat = "count") + 
  facet_grid(Abroad ~ University)


Set options



In [None]:
# set contrasts
options(contrasts  = c("contr.treatment", "contr.poly"))
# extract distribution summaries for all potential variables
blrdata.dist <- rms::datadist(dat1)
# store distribution summaries for all potential variables
options(datadist = "blrdata.dist")


Fitting a model



In [None]:
m1 <- stats::glm(PluralError ~ Abroad,
                 family = binomial,
                 data = dat1)
# inspect results
summary(m1)


Model fitting



In [None]:
mfit <- glmulti(PluralError ~ Abroad * University * Proficiency, 
                family = "binomial", 
                crit = bic, 
                data = dat1)
# extract best models
top <- weightable(mfit)
top <- top[1:10,]
# inspect top 10 models
top
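
The idea behind the BIC-based model search can be sketched by hand in base R. This is not what `glmulti` does internally (it searches the candidate space automatically). The sketch below only compares a few hand-picked candidate models on simulated data, and the names `sim`, `uni`, `candidates`, and `bic_values` are made up for illustration.

```r
# Simulated data (hypothetical; NOT the course data set)
set.seed(3)
n <- 400
abroad <- factor(sample(c("no", "yes"), n, replace = TRUE))
uni <- factor(sample(c("A", "B"), n, replace = TRUE))
y <- rbinom(n, 1, plogis(0.4 - 1.2 * (abroad == "yes")))
sim <- data.frame(y, abroad, uni)

# a small, hand-picked set of candidate formulas
candidates <- list(
  null        = y ~ 1,
  abroad      = y ~ abroad,
  uni         = y ~ uni,
  additive    = y ~ abroad + uni,
  interaction = y ~ abroad * uni
)

# fit each candidate and rank by BIC (lower is better)
fits <- lapply(candidates, glm, family = binomial, data = sim)
bic_values <- sort(sapply(fits, BIC))
bic_values
```

BIC penalizes each additional parameter by `log(n)`, so a more complex model only ranks higher if it improves the fit enough to offset that penalty.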


Define final minimal adequate model



In [None]:
m1 <- glm(PluralError ~ Abroad + Proficiency, family = "binomial", data = dat1)
# inspect results
summary(m1)


Diagnostics

Multicollinearity


In [None]:
rms::vif(m1)



All good: the VIF values are smaller than 5!
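
To see what a variance inflation factor measures, here is a base-R sketch on simulated data. The helper `vif_sketch()` is hypothetical. It derives VIFs from the correlation matrix of the coefficient estimates, which is assumed to mirror the idea behind `rms::vif()` but is not guaranteed to reproduce its output in every case.

```r
# Hypothetical helper: VIFs from the covariance matrix of the estimates
vif_sketch <- function(fit) {
  v <- vcov(fit)
  keep <- rownames(v) != "(Intercept)"   # drop the intercept
  v <- v[keep, keep, drop = FALSE]
  d <- sqrt(diag(v))
  r <- v / outer(d, d)                   # correlation matrix of the estimates
  diag(solve(r))                         # diagonal of its inverse = VIFs
}

# Simulated predictors (hypothetical; NOT the course data set)
set.seed(21)
n <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # strongly correlated with x1
x3 <- rnorm(n)                                # independent of x1 and x2
y <- rbinom(n, 1, plogis(0.5 * x1 + 0.5 * x3))

fit <- glm(y ~ x1 + x2 + x3, family = binomial)
vifs <- vif_sketch(fit)
vifs
```

The correlated predictors `x1` and `x2` receive clearly higher values than the independent predictor `x3`, which is the pattern the rule of thumb (values below 5) is meant to detect.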

Outliers?


In [None]:
plot(m1, which = 4, id.n = 3)



Effects



In [None]:
sjPlot::plot_model(m1, type = "pred", terms = c("Abroad", "Proficiency"))
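
An effects plot of this kind shows model-predicted probabilities. As a rough sketch of where such values come from, the predicted probabilities can be computed by hand with `predict()`. The data and the names `sim`, `prof`, and `grid` below are simulated and hypothetical.

```r
# Simulated data (hypothetical; NOT the course data set)
set.seed(5)
n <- 300
sim <- data.frame(
  abroad = factor(sample(c("no", "yes"), n, replace = TRUE)),
  prof   = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                  levels = c("low", "mid", "high"))
)
sim$error <- rbinom(n, 1, plogis(0.5 - 1.0 * (sim$abroad == "yes") -
                                   0.8 * (sim$prof == "high")))

m <- glm(error ~ abroad + prof, family = binomial, data = sim)

# one row per combination of predictor levels
grid <- expand.grid(abroad = levels(sim$abroad), prof = levels(sim$prof))
# type = "response" returns probabilities instead of log-odds
grid$p_error <- predict(m, newdata = grid, type = "response")
grid
```

Each row of `grid` holds the predicted probability of an error for one combination of `abroad` and `prof`.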



Summarize



In [None]:
sjPlot::tab_model(m1)
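
For logistic models, summary tables of this kind typically report odds ratios rather than raw log-odds. The sketch below shows how odds ratios follow from the coefficients, using simulated data and Wald confidence intervals. The intervals in the `sjPlot` table may be computed differently, so small differences are to be expected.

```r
# Simulated data (hypothetical; NOT the course data set)
set.seed(42)
n <- 200
abroad <- factor(sample(c("no", "yes"), n, replace = TRUE))
y <- rbinom(n, 1, plogis(0.3 - 0.9 * (abroad == "yes")))

m <- glm(y ~ abroad, family = binomial)

or <- exp(coef(m))              # odds ratios = exponentiated log-odds
ci <- exp(confint.default(m))   # Wald intervals on the odds-ratio scale
cbind(OR = or, ci)

# with a single binary predictor, the odds ratio equals the
# cross-product ratio of the 2 x 2 table
tab <- table(abroad, y)
or_tab <- (tab["yes", "1"] / tab["yes", "0"]) / (tab["no", "1"] / tab["no", "0"])
or_tab
```

An odds ratio below 1 means the odds of the outcome are lower than at the reference level, and an odds ratio above 1 means they are higher.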



Report



In [None]:
report::report(m1)



## Task 2

We now turn to a new data set. The data represent the results of a language test that students either passed or failed. As predictors, we use the students' IQs, their language proficiency, and how much sleep they had before taking the test.

Load the data set `week11d1.xlsx`. Visualize the data and perform a full regression analysis. 


In [None]:
dat2 <- readxl::read_excel(here::here("data", "week11d1.xlsx"))
# inspect
head(dat2)


Prepare data



In [None]:
dat2 <- dat2 %>%
  dplyr::mutate_if(is.character, factor) %>%
  dplyr::mutate(Proficiency = dplyr::case_when(Proficiency < 3 ~ "low",
                                               Proficiency < 6 ~ "mid",
                                               TRUE ~ "high")) %>%
  dplyr::mutate(Proficiency = factor(Proficiency, levels = c("low", "mid", "high")))
# inspect
head(dat2)
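
The binning step above can also be written with base R's `cut()`. The sketch below uses a made-up vector `scores` (an assumption: proficiency is taken to be numeric) to show how the cut-offs at 3 and 6 map onto the three levels.

```r
# hypothetical numeric proficiency scores
scores <- c(1, 2.9, 3, 5.5, 6, 9)

# right = FALSE gives the intervals [-Inf, 3), [3, 6), [6, Inf),
# which matches "< 3 is low, < 6 is mid, otherwise high"
prof <- cut(scores,
            breaks = c(-Inf, 3, 6, Inf),
            right  = FALSE,
            labels = c("low", "mid", "high"))
data.frame(scores, prof)
```

One difference is worth knowing: `cut()` keeps missing values as `NA`, whereas the `case_when()` version assigns them to "high" through its `TRUE` fallback.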


Visualize data



In [None]:
dat2  %>%
  dplyr::mutate(Sleep = ifelse(Sleep > mean(Sleep), "MuchSleep", "LittleSleep")) %>%
  ggplot(aes(Result, IQ, fill = Result)) +
  geom_boxplot() +
  facet_grid(Sleep ~ Proficiency)


Fitting a model



In [None]:
m2 <- glm(Result ~ Proficiency * IQ * Sleep, 
          family = binomial,
          data = dat2)
# inspect results
summary(m2)


Model fitting



In [None]:
mfit <- glmulti(Result ~ Proficiency * IQ * Sleep,  data = dat2, 
                family = binomial, 
                crit = bic)
# extract best models
top <- weightable(mfit)
top <- top[1:20,]
# inspect top 20 models
top


Define final minimal adequate model



In [None]:
m2 <- glm(Result ~ Proficiency + IQ + Sleep,
          family = binomial,
          data = dat2)
# inspect results
summary(m2)


Diagnostics

Multicollinearity


In [None]:
rms::vif(m2)



This is not optimal! The VIF values are greater than 5!

Outliers?


In [None]:
plot(m2, which = 4, id.n = 3)



> In a real analysis, you should remove data points with a high Cook's distance and re-run the analysis on the reduced data set. You can remove data points with `dat2 <- dat2[-c(185, 159, 110), ]`.
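
The removal step can be sketched as follows on simulated data. The names `sim`, `top3`, and `m_reduced` are hypothetical, and taking the three largest values simply mirrors `id.n = 3` in the plot call. It is not a general rule for deciding what counts as an outlier.

```r
# Simulated data (hypothetical; NOT the course data set)
set.seed(7)
n <- 150
iq <- rnorm(n, mean = 100, sd = 15)
pass <- rbinom(n, 1, plogis((iq - 100) / 10))
sim <- data.frame(pass, iq)

m <- glm(pass ~ iq, family = binomial, data = sim)

# Cook's distance for every observation
cd <- cooks.distance(m)
# row positions of the three most influential points
top3 <- order(cd, decreasing = TRUE)[1:3]

# drop those rows and refit the same model on the reduced data
sim_reduced <- sim[-top3, ]
m_reduced <- update(m, data = sim_reduced)

# compare the estimates before and after removal
cbind(full = coef(m), reduced = coef(m_reduced))
```

If the coefficients barely change after removal, the flagged points had little influence on the conclusions.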


Effects


In [None]:
sjPlot::plot_model(m2, type = "pred", terms = c("IQ", "Proficiency", "Sleep"))



Summarize



In [None]:
sjPlot::tab_model(m2)



Report



In [None]:
report::report(m2)



## Outro



In [None]:
sessionInfo()

