-
Notifications
You must be signed in to change notification settings - Fork 13
/
12-STAT_220_RegressionInferencePreLab-formula.Rmd
106 lines (61 loc) · 2.99 KB
/
12-STAT_220_RegressionInferencePreLab-formula.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
---
title: 'Inference for regression pre-lab: formula'
author: "Professor McNamara"
output:
html_document:
toc: true
toc_depth: 3
toc_float: true
---
## Setup and packages
We use three packages in this course: `Lock5Data`, `mosaic` and `ggformula`. To load a package, you use the `library()` function, wrapped around the name of a package. I've put the code to load one package into the chunk below. Add the other two you need.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error = TRUE)
library(Lock5Data)
# put in the other two packages you need here
```
## Loading in data
As usual, we'll load the example data, `GSS_clean.csv`. It should be inside the `data` folder in your RStudio Cloud. We'll use the `read.csv()` function to read in the data.
```{r data-load}
GSS <- read.csv("data/GSS_clean.csv")
```
## Regression
We've talked about regression before, in lab 4. However, that was just regression as a descriptive technique. Now we'll be going further to do inference about regression, using a distributional approximation.
Let's consider the same relationship as we did before, predicting `highest_year_of_school_completed` (our response) using `highest_year_school_completed_spouse` (our explanatory variable). First, we should do a visualization of this relationship.
```{r}
```
Now, let's compute the model and find the regression line
```{r}
```
We can use LaTeX to write out the equation of the line,
$$
\hat{y} = 5.85 + 0.59\cdot\text{highest year school completed spouse}
$$
Let's review how to interpret the slope cofficient:
## Inference for regression
R is great because it gives you all the inferential information you want in the `summary()` output,
```{r}
```
What is our test statistic? Our standard error?
What is our generic conclusion of the test? How can we interpret this in context?
What is the $R^2$ value?
```{r}
rsquared(m1)
```
How do we interpret that in context?
## Confidence and prediction intervals
Like we did last time, we could make a prediction for a person whose spouse has completed 12 years of school (high school),
```{r}
5.84740 + 0.59403 * 12
```
but, we might want to create an interval estimate rather than just a point estimate. To do that, we will create either a confidence interval (for a mean Y) or a prediction interval (for a particular Y). Both of them can be created with the `predict()` function.
```{r}
predict(m1, newdata = data.frame(highest_year_school_completed_spouse = 12), interval = "confidence")
predict(m1, newdata = data.frame(highest_year_school_completed_spouse = 12), interval = "prediction")
```
Let's interpret those two intervals
## Multiple regression
You can go even further with regression, adding additional predictors into the model. Let's say we want to predict the number of years of schooling someone has by using the highest year of school their spouse completed **and** the number of brothers and sisters they have.
```{r}
```
Now, we have to add another phrase to our interpretation sentence.