# Week 09: Simple Linear Regression

## Introduction 

In this tutorial, we will learn how to perform simple linear regression.

**Preparation and session set up**

Before turning to the analysis, please install the packages by running the code below this paragraph. If you have already installed these packages, you can skip this section. Installation may take between 1 and 5 minutes, so there is no need to worry if it takes some time.


In [None]:
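# install packages (remove the leading # to run a line)
#install.packages("here")
#install.packages("readxl")  # needed for readxl::read_excel() used below
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("sjPlot", dependencies = TRUE)
#install.packages("report")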


Now that we have installed the packages, we activate them as shown below.



In [None]:
# activate packages
library(here)
library(dplyr)
library(ggplot2)
library(sjPlot)
library(report)


## Tutorial Activity

Go into groups and help each other bring the data into the correct format, visualize the data, and perform the test.

## Task 1


Simple linear regression is a better alternative to independent t-tests: in addition to providing the same information as a t-test, it also provides model fit information.


Let's use the example we already encountered in week 8 and perform a regression on this data set.

Imagine you want to investigate whether L1 Chinese learners of English differ from L1 Australian English speakers in the duration of the vowel sounds they produce. This would be important because vowel length in English distinguishes meaning, as in *bit* vs *beat*. Thus, English speakers pay particular attention to vowel duration and perceive unnaturally long or short vowels as odd or more difficult to understand.

To this end, Martin has extracted vowel duration for you from Chinese learners of English and L1 English speakers.

The research question (RQ) is whether Chinese learners of English differ from L1 English speakers in terms of vowel duration.

Can you answer the RQ based on the week8d3.xlsx data set?

Load data


In [None]:
# load data
dat1 <- readxl::read_excel(here::here("data", "week8d3.xlsx"))
# inspect
head(dat1)


In [None]:
dat1 %>%
  ggplot(aes(L1, Duration)) +
  geom_boxplot()


Fit linear regression



In [None]:
m1 <- lm(Duration ~ L1, data = dat1)
# inspect results
summary(m1)
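
To make the coefficient table easier to read, here is a minimal sketch on simulated data (the data frame `sim` and all of its values are hypothetical and are not taken from week8d3.xlsx). With a two-level categorical predictor, the intercept is the mean of the reference group and the slope is the difference between the two group means.

```r
# Simulated example data: hypothetical values, NOT the week8d3.xlsx data
set.seed(123)
sim <- data.frame(
  L1 = factor(rep(c("English", "Chinese"), each = 50),
              levels = c("English", "Chinese")),  # English = reference level
  Duration = c(rnorm(50, mean = 100, sd = 15),
               rnorm(50, mean = 118, sd = 15))
)

# fit the same kind of model as m1
m_sim <- lm(Duration ~ L1, data = sim)

# group means computed directly from the data
group_means <- tapply(sim$Duration, sim$L1, mean)

# the intercept should match the mean of the reference group (English)
intercept_ok <- all.equal(coef(m_sim)[["(Intercept)"]],
                          group_means[["English"]])

# the slope should match the Chinese mean minus the English mean
slope_ok <- all.equal(coef(m_sim)[["L1Chinese"]],
                      group_means[["Chinese"]] - group_means[["English"]])

coef(m_sim)
group_means
c(intercept_ok = intercept_ok, slope_ok = slope_ok)
```

Which group serves as the reference level depends on the order of the factor levels. By default R orders them alphabetically, so the sketch sets the order explicitly.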


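Let us now compare these results to those of the t-test. Note that `t.test()` runs Welch's t-test by default, which does not assume equal variances, so its degrees of freedom and p-value can differ slightly from the regression output. Setting `var.equal = TRUE` gives the classical t-test that the regression reproduces.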



In [None]:
t.test(Duration ~ L1, data = dat1)
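
The correspondence between the two methods can be checked with a short sketch on simulated data (again, `sim` and its values are hypothetical). It compares the regression slope's t-value and p-value with a classical t-test (`var.equal = TRUE`) and shows the Welch degrees of freedom for comparison.

```r
# Simulated example data: hypothetical values, NOT the week8d3.xlsx data
set.seed(42)
sim <- data.frame(
  L1 = factor(rep(c("English", "Chinese"), each = 40),
              levels = c("English", "Chinese")),
  Duration = c(rnorm(40, mean = 100, sd = 15),
               rnorm(40, mean = 115, sd = 15))
)

# regression: t-value and p-value of the L1 coefficient
m_sim <- lm(Duration ~ L1, data = sim)
coefs <- summary(m_sim)$coefficients
t_lm <- coefs["L1Chinese", "t value"]
p_lm <- coefs["L1Chinese", "Pr(>|t|)"]

# classical (equal-variance) t-test and Welch t-test (the default)
t_student <- t.test(Duration ~ L1, data = sim, var.equal = TRUE)
t_welch   <- t.test(Duration ~ L1, data = sim)

# the sign of t can differ because t.test() subtracts the groups in the
# opposite order, so we compare absolute values
same_t <- all.equal(abs(t_lm), abs(unname(t_student$statistic)))
same_p <- all.equal(p_lm, t_student$p.value)

c(same_t = same_t, same_p = same_p)
# degrees of freedom: classical test vs Welch test
c(student_df = unname(t_student$parameter),
  welch_df = unname(t_welch$parameter))
```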



Diagnose model: do we need to remove outliers or use another method?



In [None]:
plot(m1)
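
In addition to inspecting the diagnostic plots by eye, influential data points can be flagged numerically. The sketch below uses simulated data with one deliberately planted extreme value (all values hypothetical) and Cook's distance. The cut-off of 4 / n is a common rule of thumb and is an assumption here, not a fixed standard.

```r
# Simulated example data: hypothetical values, NOT the week8d3.xlsx data
set.seed(1)
sim <- data.frame(
  L1 = factor(rep(c("English", "Chinese"), each = 30),
              levels = c("English", "Chinese")),
  Duration = c(rnorm(30, mean = 100, sd = 10),
               rnorm(30, mean = 115, sd = 10))
)

# plant one extreme value in row 1 so there is something to detect
sim$Duration[1] <- 250

m_sim <- lm(Duration ~ L1, data = sim)

# Cook's distance: how much the fitted model changes if a row is dropped
cd <- cooks.distance(m_sim)

# rule-of-thumb threshold (an assumption; other cut-offs are also in use)
threshold <- 4 / nrow(sim)

# row numbers of potentially influential observations
flagged <- which(cd > threshold)
flagged
```

Flagged rows are candidates for closer inspection, not automatic removal. Whether to exclude them or switch to another method remains a judgment call.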



Visualize results



In [None]:
sjPlot::plot_model(m1, type = "pred", terms = c("L1"))



Summarize



In [None]:
sjPlot::tab_model(m1)



Write-up results



In [None]:
report::report(m1)



> We fitted a linear model (estimated using OLS) to predict vowel duration based on speakers' L1 (formula: Duration ~ L1). The model explains a statistically significant and moderate proportion of variance (R^2^ = 0.19, F(1, 188) = 44.37, p < .001, adj. R^2^ = 0.19). The model's intercept, corresponding to L1 = English, is at 102.13 ([98.67, 105.60], t(188) = 58.19, p < .001). The effect of L1 [Chinese] is statistically significant and positive (beta = 18.02 [12.68, 23.35], t(188) = 6.66, p < .001; Std. beta = 0.88 [0.62, 1.14]).



## Task 2

We now return to the data sets we analyzed in week 6. Here, the RQ is whether the courses differ in how satisfied students were with them. Satisfaction is operationalized as secat scores, and the course a student attended is given in the *class* column.

Load the data set `week6g1.xlsx`. Visualize the data and perform a full regression analysis. 


In [None]:
dat2 <- readxl::read_excel(here::here("data", "week6g1.xlsx"))
# inspect
head(dat2)


Visualize data



In [None]:
dat2 %>%
  ggplot(aes(class, secat)) +
  geom_boxplot()


Fitting a model



In [None]:
m2 <- lm(secat ~ class, data = dat2)
# inspect results
summary(m2)
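
If a categorical predictor has more than two levels, the individual coefficients each compare one level to the reference level, while the F-statistic at the bottom of the summary tests whether the groups differ overall. The sketch below uses simulated data with three hypothetical classes (`A`, `B`, `C`; the number of classes and all values are assumptions, not the week6g1.xlsx data) and shows that this F-statistic matches the one from `anova()`.

```r
# Simulated example data: hypothetical classes and values,
# NOT the week6g1.xlsx data
set.seed(7)
sim <- data.frame(
  class = factor(rep(c("A", "B", "C"), each = 25)),
  secat = c(rnorm(25, mean = 4.0, sd = 1),
            rnorm(25, mean = 4.5, sd = 1),
            rnorm(25, mean = 5.2, sd = 1))
)

m_sim <- lm(secat ~ class, data = sim)

# overall F-statistic reported at the bottom of summary()
f_summary <- summary(m_sim)$fstatistic[["value"]]

# F-statistic for the class term from the ANOVA table of the same model
f_anova <- anova(m_sim)[["F value"]][1]

same_f <- all.equal(f_summary, f_anova)
c(f_summary = f_summary, f_anova = f_anova)
same_f
```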


Diagnostics: outliers?



In [None]:
plot(m2)



Effects



In [None]:
sjPlot::plot_model(m2, type = "pred", terms = c("class")) +
  coord_cartesian(ylim = c(0, 7))


Summarize



In [None]:
sjPlot::tab_model(m2)



Report



In [None]:
report::report(m2)



## Outro



In [None]:
sessionInfo()

