---
title: "Applied Econometrics"
site: workflowr::wflow_site
output:
  html_document
---
```{r eval=T, echo=F, warning=F, message=F}
# Packages
library(knitr)
library(tidyverse)
library(readxl)
library(broom)
library(stargazer)
library(gridExtra)
```
```{r eval=T, echo=F, warning=F}
# Data Import
corr_1 <- read_excel("data/corr_1.xlsx")
cancer_test <- read_excel("data/cancer_test.xlsx")
lag <- read_excel("data/lag.xlsx")
```
# Setting Up
## Packages
* **tidyverse** - core packages for data wrangling and plotting
* **readxl** - reads Excel files
* **readr** - reads delimited text files
* **broom** - organizes model results into tidy tibbles
* **stargazer** - well-formatted regression output tables
* **knitr** - report generation and `kable()` tables
* **gridExtra** - arranges multiple ggplots in a grid
## Output Tables
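Regression results can be printed as publication-style tables with **stargazer**. A minimal sketch, assuming a fitted model object `model` (hypothetical name):

```{r eval=F, echo=T}
# model is a hypothetical fitted lm object
stargazer(model, type="text") # use type="html" or type="latex" in rendered documents
```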
# Initial Statistics
## Correlation
### One Variable
Correlation measures the strength of linear association between two variables. It is sensitive to outliers.
```{r eval=T, echo=T}
cor <- corr_1 %>%
summarise(r=cor(X, Y)) %>%
pull(r)
cor
```
Correlations can also be visualized through scatterplots, which are the foundation of econometric analysis.
```{r eval=T, echo=T, message=F}
ggplot(corr_1, aes(x=X, y=Y))+
geom_point(alpha=0.5)+
geom_smooth(method = "lm", se=F)
```
```{r eval=T, echo=F}
remove(cor)
```
### Multiple Variables
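For several variables at once, `cor()` on a numeric data frame returns the full correlation matrix. A sketch using the cancer_test columns loaded above:

```{r eval=F, echo=T}
corr_matrix <- cancer_test %>%
  select(Cancer_Diagnosis, Median_Income, Median_Age, Percent_Black) %>%
  cor()
kable(round(corr_matrix, 2)) # pairwise correlations, rounded for readability
```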
# Simple Linear Regression
Linear regression can be performed by:
```{r eval=T, echo=T}
lm.model <- lm(Cancer_Diagnosis~Median_Income+Median_Age+Percent_Black, data=cancer_test)
lm.res <- augment(lm.model) # visualize all residuals in table form
```
## Least Squares Line
For two variables, the least squares slope is the correlation times the ratio of the standard deviations (b1 = r * s_y / s_x):
```{r eval=T, echo=T}
lm.ls <- lm.res %>%
  summarize(x.sd=sd(Median_Age), y.sd=sd(Cancer_Diagnosis),
            cor=cor(Cancer_Diagnosis, Median_Age)) %>%
  mutate(slope=(y.sd/x.sd)*cor) # slope = r * s_y / s_x = 0.015
```
When we look at the **lm** model, the slope is also 0.015.
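This can be checked directly by fitting a simple regression on the same pair of variables (a sketch; note that the Median_Age coefficient in the multiple regression above can differ once the other controls are included):

```{r eval=F, echo=T}
lm.simple <- lm(Cancer_Diagnosis ~ Median_Age, data=cancer_test)
coef(lm.simple)["Median_Age"] # should equal r * s_y / s_x from above
```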
## Visualizing Assumptions
**a.** Linearity (scatterplot + residual plot; residuals need to look random)
**b.** Nearly normal residuals (histogram of residuals or a QQ plot)
**c.** Constant variability (residual plot)
[Link](https://gallery.shinyapps.io/slr_diag/) for interactive regression diagnostic test.
```{r eval=T, echo=T}
a <- ggplot(lm.res, aes(x=.fitted, y=.resid))+
geom_point()+
geom_hline(yintercept = 0, linetype="dashed", color="red")+
labs(title="Residuals vs Fitted Values", x="Fitted Values", y="Residuals")
b <- ggplot(lm.res, aes(x=.resid))+
  geom_histogram()+
  labs(title="Histogram of residuals", x="Residuals") # geom_density() is an alternative
c <- ggplot(lm.res, aes(sample=.resid))+
  stat_qq()+
  stat_qq_line()+
  labs(title="QQ plot of residuals")
```
```{r eval=T, echo=F}
grid.arrange(a, b, c, ncol=3)
remove(a,b,c)
```
# Dummy Variables
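In R, `lm()` converts factors into dummy variables automatically, with one coefficient per non-baseline level. A sketch, assuming a hypothetical categorical column `Region` (not in the data loaded above):

```{r eval=F, echo=T}
# Region is a hypothetical categorical column, used only for illustration
reg.dummy <- lm(Cancer_Diagnosis ~ Median_Income + factor(Region), data=cancer_test)
tidy(reg.dummy) # one dummy coefficient per non-baseline Region level
```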
# Hypothesis Testing
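The t-statistics and p-values for the coefficient null hypothesis (beta = 0) can be read from broom's tidy output. A sketch using the model fitted above:

```{r eval=F, echo=T}
tidy(lm.model, conf.int=TRUE) # estimate, std.error, statistic (t), p.value, confidence bounds
```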
# Lagged Regression
```{r eval=T, echo=T}
lag1 <- lag %>%
  mutate(Work_1 = lag(Work, 1),  # dplyr::lag() shifts a column down by n rows
         Work_2 = lag(Work, 2))
kable(head(lag1))
```
Using the lagged columns, we can then run a linear regression in the usual way:
```{r eval=F, echo=T}
reg <- lm(Income ~ Work+Work_1+Work_2, data=lag1)
```
```{r eval=T, echo=F}
remove(lag, lag1)
```