---
title: "Applied Econometrics"
site: workflowr::wflow_site
output:
  html_document
---
```{r eval=T, echo=F, warning=F, message=F}
# Packages
library(knitr)
library(tidyverse)
library(readxl)
library(broom)
library(stargazer)
library(gridExtra)
```
```{r eval=T, echo=F, warning=F}
# Data Import
corr_1 <- read_excel("data/corr_1.xlsx")
cancer_test <- read_excel("data/cancer_test.xlsx")
lag <- read_excel("data/lag.xlsx")
```
# Setting Up
## Packages
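* **tidyverse** - core packages for data wrangling and plotting
* **readxl** - imports Excel files
* **readr** - imports delimited text files (loaded as part of **tidyverse**)
* **broom** - converts model results into tidy tibbles
* **stargazer** - formats regression output as well-organized tables
* **knitr** - provides `kable()` for simple tables
* **gridExtra** - arranges several plots in one figure with `grid.arrange()`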
## Output Tables
# Initial Statistics
## Correlation
### One Variable
Correlation measures the strength of the linear association between two variables. It is sensitive to outliers.
```{r eval=T, echo=T}
cor <- corr_1 %>%
  summarise(r = cor(X, Y)) %>%
  pull(r)
cor
```
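To illustrate the sensitivity to outliers, here is a minimal sketch using simulated data rather than the `corr_1` file. The names `x`, `y`, `x_out`, and `y_out` are made up for this example. It compares `cor()` before and after one extreme point is added.

```r
# Simulated data for illustration only (not the corr_1 data).
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 2)   # strong positive linear relationship

r_clean <- cor(x, y)

# Add one extreme point far away from the linear pattern.
x_out <- c(x, 60)
y_out <- c(y, -40)

r_outlier <- cor(x_out, y_out)

# A single point is enough to change the correlation substantially.
c(clean = r_clean, with_outlier = r_outlier)
```

Plotting the data before trusting a single correlation value helps catch this kind of point.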
Correlations can also be visualized with scatterplots, which are the foundation of econometric analysis.
```{r eval=T, echo=T, message=F}
ggplot(corr_1, aes(x = X, y = Y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE)
```
```{r eval=T, echo=F}
remove(cor)
```
### Multiple Variables
# Simple Linear Regression
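A linear regression is fit with `lm()`. The model below regresses `Cancer_Diagnosis` on three predictors (strictly a multiple regression), and `augment()` from **broom** returns the data with fitted values and residuals added: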
```{r eval=T, echo=T}
lm.model <- lm(Cancer_Diagnosis ~ Median_Income + Median_Age + Percent_Black, data = cancer_test)
lm.res <- augment(lm.model) # one row per observation, with .fitted and .resid columns added
```
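## Least Squares Line
For two variables, the least squares slope equals the correlation multiplied by the ratio of the standard deviation of the response to that of the predictor:
```{r eval=T, echo=T}
lm.ls <- lm.res %>%
  summarize(x.sd = sd(Median_Age), y.sd = sd(Cancer_Diagnosis),
            cor = cor(Cancer_Diagnosis, Median_Age)) %>%
  mutate(slope = cor * (y.sd / x.sd)) # slope = r * sd(y) / sd(x)
```
This slope matches the `Median_Age` coefficient from the simple regression `lm(Cancer_Diagnosis ~ Median_Age, data = cancer_test)`. The `Median_Age` coefficient in the three-predictor **lm** model above is adjusted for the other predictors, so it will generally differ.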
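The link between the correlation and the least squares line can be checked directly. The following is a minimal sketch that assumes the built-in `mtcars` data as a stand-in for `cancer_test`. The names `slope_manual` and `intercept_manual` are made up for this example. The slope is `r * sd(y) / sd(x)`, and the intercept follows from the line passing through the point of means.

```r
# Built-in mtcars data, used here only as a stand-in for cancer_test.
x <- mtcars$wt
y <- mtcars$mpg

# Slope and intercept computed from summary statistics.
slope_manual <- cor(x, y) * sd(y) / sd(x)
intercept_manual <- mean(y) - slope_manual * mean(x)

# The same line fitted by lm() with a single predictor.
fit <- lm(mpg ~ wt, data = mtcars)

# Compare the lm() coefficients with the hand-computed values.
all.equal(unname(coef(fit)), c(intercept_manual, slope_manual))
```

The equality holds only when the model has a single predictor. With additional predictors, each coefficient is adjusted for the others.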
## Visualizing Assumptions
**a.** Linearity (scatterplot and residual plot; the residuals should scatter randomly around zero)
**b.** Nearly normal residuals (histogram of residuals or normal Q-Q plot of residuals)
**c.** Constant variability (residual plot; the spread should be roughly even across fitted values)
An [interactive app](https://gallery.shinyapps.io/slr_diag/) is available for exploring these regression diagnostics.
```{r eval=T, echo=T}
a <- ggplot(lm.res, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted Values", x = "Fitted Values", y = "Residuals")
b <- ggplot(lm.res, aes(x = .resid)) +
  geom_histogram(bins = 30) + # geom_density() can be used instead for a smoothed version
  labs(title = "Histogram of Residuals", x = "Residuals")
c <- ggplot(lm.res, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Q-Q Plot of Residuals", x = "Theoretical Quantiles", y = "Sample Quantiles")
```
```{r eval=T, echo=F}
grid.arrange(a, b, c, ncol=3)
remove(a,b,c)
```
# Dummy Variables
# Lagged Regression
```{r eval=T, echo=T}
lag1 <- lag %>%
  mutate(Work_1 = dplyr::lag(Work, 1), # value of Work one row earlier
         Work_2 = dplyr::lag(Work, 2)) # value of Work two rows earlier
kable(head(lag1))
```
With the lagged columns added, the regression is fit with `lm()` in the usual way:
```{r eval=F, echo=T}
reg <- lm(Income ~ Work + Work_1 + Work_2, data = lag1) # rows with missing lags are dropped by lm()
```
```{r eval=T, echo=F}
remove(lag, lag1)
```
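The lagged regression can also be tried end to end without the Excel file. The following is a minimal sketch on simulated data. The data frame `sim`, its columns, and the helper `lag_by()` are hypothetical names for this example. For a plain vector, `lag_by(x, k)` gives the same shift as `dplyr::lag(x, k)`, written in base R so the block runs without extra packages.

```r
# Simulated series for illustration only; all names here are made up.
set.seed(42)
n <- 50
work <- rnorm(n, mean = 40, sd = 5)

# Hypothetical helper: shift a vector down by k positions, padding with NA.
lag_by <- function(x, k) c(rep(NA, k), head(x, -k))

sim <- data.frame(
  work   = work,
  work_1 = lag_by(work, 1),
  work_2 = lag_by(work, 2)
)

# Income depends on current and past work, so the first two rows are NA.
sim$income <- 10 + 2 * sim$work + 1 * sim$work_1 + 0.5 * sim$work_2 + rnorm(n)

fit <- lm(income ~ work + work_1 + work_2, data = sim)

# lm() drops the rows with missing lags, so n - 2 observations are used.
nobs(fit)
coef(fit)
```

Each extra lag costs one observation at the start of the series, which matters for short samples.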
# Hypothesis Testing
## t-test b=0
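The following is a minimal sketch of the test of the null hypothesis that a slope equals zero. It assumes the built-in `mtcars` data as a stand-in for the course data, and `t_manual` and `p_manual` are made-up names. The t statistic is the estimate divided by its standard error, and `summary()` reports it with a two-sided p-value.

```r
# Built-in mtcars data, used only for illustration.
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- summary(fit)$coefficients

# t statistic for H0: slope = 0 is estimate / standard error.
t_manual <- coefs["wt", "Estimate"] / coefs["wt", "Std. Error"]

# Two-sided p-value from the t distribution with residual degrees of freedom.
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

coefs["wt", ]
```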
## Confidence Interval
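The following is a minimal sketch of a confidence interval for a regression coefficient. It again assumes the built-in `mtcars` data as a stand-in, and `ci_manual` and `t_star` are made-up names. `confint()` returns the intervals directly, and the same result follows from the estimate plus or minus a t critical value times the standard error.

```r
# Built-in mtcars data, used only for illustration.
fit <- lm(mpg ~ wt, data = mtcars)

# 95% confidence intervals for all coefficients.
ci <- confint(fit, level = 0.95)

# The same interval for the slope by hand: estimate +/- t* x standard error.
est <- coef(summary(fit))["wt", "Estimate"]
se <- coef(summary(fit))["wt", "Std. Error"]
t_star <- qt(0.975, df = df.residual(fit))
ci_manual <- c(est - t_star * se, est + t_star * se)

ci["wt", ]
```

If the interval for a slope excludes zero, the corresponding two-sided t-test rejects a zero slope at the matching significance level.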