analysis/econometrics.Rmd

---
title: "Applied Econometrics"
site: workflowr::wflow_site
output:  
  html_document
---

```{r eval=T, echo=F, warning=F, message=F}
# Packages
library(knitr)
library(tidyverse)
library(readxl)
library(broom)
library(stargazer)
library(jtools)

library(GGally)
library(gridExtra)
library(AER)
```

```{r eval=T, echo=F, warning=F}
# Data Import
bloodp <- read_excel("data/bloodp.xlsx")
corr_1 <- read_excel("data/corr_1.xlsx")
cancer_test <- read_excel("data/cancer_test.xlsx")
lag <- read_excel("data/lag.xlsx")
dummy <- read_excel("data/dummy.xlsx")

cancer.1 <- cancer_test %>% 
  select(-7,-11,)
```
```{r eval=T, echo=F}
remove(cancer_test)
```
# Setting Up

## Packages 
* **tidyverse** - basic package for data wrangling

* **readxl** - allows inputs of excel files

* **readr** - allows inputs of text files

* **jtools** - `summ` functions for better regression outputs

* **broom** - result organization with tidy tibbles

* **stargazer** - better organized regression outputs 

* **sandwitch** - robust standard errors

## Output Tables
Here is a basic example of linear regression. Age and Weight of an individual is used to predict the blood pressure.
```{r eval=T, echo=T}
model <- lm(bloodp~age+weight, data=bloodp)
summ(model)
```
The function `summ` is good and direct in printing output. The base r `summary` is okay to use, and the more intricate `stargazer` is good too.

`summ` will be used mostly here when printing regression outputs.

***

# Correlation

## One Variable

Correlation is the strength of linear assosciation. It can be sensitive to outliers. The `pull` function is used to print out the correlation value more.
```{r eval=T, echo=T}
cor <- corr_1 %>% 
  summarise(r=cor(X, Y)) %>% 
  pull(r)
cor
```

Observing correlation from regression. Sqrt of R^2 is correlation.
```{r eval=T, echo=T}
cor.lm <- lm(Y~X, data=corr_1)
summ(cor.lm)
```

Or using base r function `cor`:
```{r eval=T, echo=T}
cor.2 <- cor(corr_1)
cor.2
```

Correlations can also be visualized through scatterplots which important to any econometric analysis. Method = "lm" produces the blue linear regression line. 
```{r eval=T, echo=T, message=F}
ggplot(corr_1, aes(x=X, y=Y))+
  geom_point(alpha=0.5)+
  geom_smooth(method = "lm", se=F)
```

Or, you can use the effects plot. This is an example of the blood pressure model.
```{r eval=T, echo=T}
effect_plot(model, pred = age, interval = T, plot.points = T)
```
```{r eval=T, echo=F}
remove(cor, cor.2, cor.lm)
```

## Multiple Variables
Multiple variables can be tabulated against eachother to observe the correlation. In this example, `cancer diagnosis ~ income + age + private + public + per_white + per_black`
```{r eval=T, echo=F}
cancer.2 <- cancer.1 %>% 
  select(3,4,7,20,21,22,23)
```

Viewing correlation table:
```{r eval=T, echo=T}
cor.cancer <- cor(cancer.2)
cor.cancer
```

Viewing in diagrams:
```{r eval=T, echo=T}
ggpairs(cancer.2)
```
```{r eval=T, echo=F}
remove(cor.cancer)
```
***

# Understanding Linear Regression

`cancer diagnosis ~ income + age + private + public + per_white + per_black` model is used in the following examples.

```{r eval=T, echo=T}
lm.model <- lm(Cancer_Diagnosis~Median_Income+Median_Age+Private_Coverage+Public_Coverage+Percent_White+Percent_Black, data=cancer.2)
summ(lm.model)
```

To visualize the residual table we use the function `augment`.
```{r eval=T, echo=T}
lm.res <- augment(lm.model) 
```

## Least Square Lines
The least square estimator is:
$$
\sum_{i=1}^{n} (Y_i-\beta_0-\beta_1 X_i)^2
$$
The estimators can be calculated as follows:
$$
\begin{align}
\beta_1 &= \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_1-\bar{X})^2}\\
\\
\beta_0 &= \bar{Y} - \hat{\beta_1} \bar{X}
\end{align}
$$

We can manually do these calculations. To do these we use the package `AER` and load the `CASchool` dataset.
```{r eval=T, echo=F}
data("CASchools")
ca.data <- CASchools %>% 
  mutate(Score = (read+math)/2) %>% 
  mutate(Str = students/teachers)
remove(CASchools)
```

Lets conduct a simple linear regression `score ~ STR` (STR = student teacher ratio).
```{r eval=T, echo=T}
ca.model <- lm(Score~Str, data=ca.data)
summ(ca.model)
```

Manually computing the estimators.
```{r eval=T, echo=T}
attach(ca.data)
beta_1 <- sum((Str-mean(Str))*(Score-mean(Score))) / sum((Str-mean(Str))^2)
beta_1 
beta_0 <- mean(Score) - beta_1*mean(Str)
beta_0
```
```{r eval=T, echo=F}
remove(beta_0, beta_1)
```
## Goodness of Fit
We need to understand: Explained Sum of Squares (**ESS**), Total Sum of Squares (**TSS**), and Sum of Square Residuals (**SSR**).
$$
\begin{align}
ESS &= \sum_{i=1}^{n} (\hat{Y_i}-\bar{Y})^2\\
TSS &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2\\
R^2 &= \frac{ESS}{TSS}
\\ \\
SSR &= \sum_{i=1}^{n} \hat{u_i}^2\\
TSS &= ESS + SSR\\
R^2 &= 1-\frac{SSR}{TSS}

\end{align}
$$
We can calculate the Standard Error of Regression (**SER**) as $s^2=\frac{SSR}{n-2}$

Computing these goodness of fit values manually.
```{r eval=T, echo=T}
ca.summary <- summary(ca.model)
SSR <- sum(ca.summary$residuals^2)
TSS <- sum((Score-mean(Score))^2)
R_sq <- 1 - SSR/TSS
R_sq
detach(ca.data)
```
```{r eval=T, echo=F}
remove(R_sq, SSR, TSS)
```

## Least Square Assumptions 

**a.** Linearity (scatterplot + residual plot - residuals needs to be random)

**b.** Nearly normal residuals (histogram of residuals or QQ residual plot)

**c.** Constant variability (residual plot)

[Link](https://gallery.shinyapps.io/slr_diag/) for interactive regression diagnostic test. 

```{r eval=T, echo=T}
a <- ggplot(lm.res, aes(x=.fitted, y=.resid))+
  geom_point()+
  geom_hline(yintercept = 0, linetype="dashed", color="red")+
  labs(title="Residuals vs Fitted Values", x="Fitted Values", y="Residuals")
b <- ggplot(lm.res, aes(x=.resid))+
  geom_density()+
  labs(title="Histogram of residuals", x="Residuals") #geom_density can also be added
c <- ggplot(lm.res, aes(sample=.resid))+
  stat_qq()+
  stat_qq_line()
```
```{r eval=T, echo=F}
grid.arrange(a, b, c, ncol=3)
remove(a,b,c)
```

## Covariance Matrix

Covaraince matrix can be created using `vocv`.
```{r eval=T, echo=T}
vcov(lm.model)
```

```{r eval=T, echo=F}
remove(lm.model, lm.res)
```


# Robust Standard Errors
This requires the `sandwitch` package. There are many robust standard error (heteroskedasctic errors) parameters ranging from `HC0` to `HC5`. The default for STATA is **HC1** which is important to keep in mind if we are trying to emulate STATA result.

```{r eval=T, echo=T}
model <- lm(bloodp~age+weight, data=bloodp)
summ(model, robust = "HC1")
```
```{r eval=T, echo=F}
remove(model)
```

***

# Dummy Variables
Group 1, Group 2, and Group 3 are all dummy variables (0 or 1). When regressors are dummy variables you need to regress it without the intercept term.

```{r eval=T, echo=T}
reg <- lm(Earnings ~ -1+Group1+Group2+Group3, data=dummy)
summ(reg)
```

If we want to add the intercept term, one of the dummy variables needs to be removed to avoid collinearity with the intercept term.
```{r eval=T, echo=T}
reg2 <- lm(Earnings~Group1+Group2, data=dummy)
summ(reg2)
```
```{r eval=T, echo=F}
remove(reg2)
```

***

# Lagged Regression

```{r eval=T, echo=T}
lag1 <- lag %>% 
  mutate(Work_1 = lag(Work, 1)) %>% 
  mutate(Work_2 = lag(Work, 2))
kable(head(lag1))
```

Using this we can perform linear regression the normal way:
```{r eval=F, echo=T}
reg.lag <- lm(Income ~ Work+Work_1+Work_2, data=dataset)
```
```{r eval=T, echo=F}
remove(reg.lag, lag, lag1)
```

***

# Hypothesis Testing

## Confidence Interval
There are two methods that can be used. The base r `confint` and `summ`. Both examples are below. The dummy variable regression result is used as example. 
```{r eval=T, echo=F}
summ(reg)
```

The `confint` method to find a 95% CI is,
```{r eval=T, echo=T}
confint(reg, level=0.95)
```

The `summ` method to find a 95% CI is,
```{r eval=T, echo=T}
summ(reg, confint = TRUE, ci.width = 0.95, digits = 3)
```

## t-test b=0

For this we will be using the **score~STR** regression. 

```{r eval=T, echo=T}
summ(ca.model)
```

$$
\begin{align}
H_0: \beta_1 = 0 &&& H_1: \beta_1 \neq 0 
\end{align}
$$

Testing this hypothesis manually:
$$
t = \frac{-2.2798 - 0}{0.4798} = -4.75
$$
This means we reject the null hypothesis at 5% because the level of significance is $|t| > 1.96$. This is what the `summ` results in **t val** and **p**.

## Joint Hypothesis
$$
F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n-k-1)}
$$

```{r eval=T, echo=F}
model.f <- lm(Score ~ expenditure + Str + english, data=ca.data)
summ(model.f, digits = 3)
```

Testing the joint hypothesis that **Str = 0** and **english = 0**. We use the function `linearHypothesis`.
```{r eval=T, echo=T}
linearHypothesis(model.f, c("Str=0", "english=0"))
```
F - stat is **147.68** and P-value is **2.2e-16** thus reject that null hypothesis.

A **heteroskedastic-robust F test** can also be conducted by adding `white.adjust`.
```{r eval=T, echo=T}
linearHypothesis(model.f, c("Str=0", "english=0"), white.adjust = "hc1")
```

# Testing Multiple Models