Merge pull request #156 from ScPoEcon/snow
Snow
floswald committed Oct 7, 2020
2 parents 6f50660 + 6ef251f commit 1eb643b
Showing 23 changed files with 1,138 additions and 20 deletions.
3 changes: 1 addition & 2 deletions .travis.yml
@@ -5,7 +5,7 @@ os:

before_install:
# - if [ $TRAVIS_OS_NAME = linux ]; then sudo apt-get update; fi
- if [ $TRAVIS_OS_NAME = linux ]; then sudo apt-get install -y ghostscript; sudo apt-get install -y libmagick++-dev; sudo add-apt-repository -y ppa:cran/poppler; sudo apt-get update; sudo apt-get install -y libpoppler-cpp-dev; sudo apt-get install -y libv8-dev ;fi
- if [ $TRAVIS_OS_NAME = linux ]; then sudo apt-get install -y ghostscript; sudo apt-get install -y libmagick++-dev; sudo add-apt-repository -y ppa:cran/poppler;sudo apt-get install -y libpoppler-cpp-dev; sudo apt-get install -y libv8-dev ; sudo apt-get install -y libudunits2-dev libgdal-dev libgeos-dev libproj-dev libfontconfig1-dev;fi
- if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew install llvm; brew install v8; brew install poppler;
export PATH="/usr/local/opt/llvm/bin:$PATH" &&
export LDFLAGS="-L/usr/local/opt/llvm/lib" &&
@@ -29,4 +29,3 @@ script:
- if [ $TRAVIS_OS_NAME = osx ]; then R CMD check *tar.gz ; fi
- if [ $TRAVIS_OS_NAME = linux ]; then R CMD check *tar.gz; fi
- if [ $TRAVIS_OS_NAME = linux ] && [[ $TRAVIS_COMMIT_MESSAGE != *"[nobook]"* ]]; then ./_build.sh && ./_deploy.sh; fi

8 changes: 1 addition & 7 deletions 06-StdErrors.Rmd
@@ -1,9 +1,3 @@
---
output:
pdf_document: default
html_document: default
---

# Regression Inference {#std-errors}

In this chapter we want to investigate uncertainty in regression estimates. We want to understand what the precise meaning of the `Std. Error` column in a typical regression table is telling us. In terms of a picture, we want to understand better the meaning of the shaded area as in this one here:
@@ -179,7 +173,7 @@ First, we *assumed* that \@ref(eq:abline-5) is the correct representation of the



### Violating the Assumptions of the CRM
### Violating the Assumptions of the CRM {#violating}

It's interesting to consider in which circumstances we might violate those assumptions. Let's give an example for each of them:

4 changes: 0 additions & 4 deletions 07-Causality.Rmd
@@ -312,7 +312,3 @@ If the direction of correlation between omitted variable $z$ and $x$ is the same
<br>



## STAR Experiment

tbd
241 changes: 241 additions & 0 deletions 10-IV.Rmd

Large diffs are not rendered by default.

239 changes: 239 additions & 0 deletions 11-IV2.Rmd

Large diffs are not rendered by default.

367 changes: 367 additions & 0 deletions 12-panel.Rmd

Large diffs are not rendered by default.

204 changes: 204 additions & 0 deletions 13-discrete.Rmd
@@ -0,0 +1,204 @@
# Binary Outcomes {#binary}

Until now we have encountered only continuously distributed outcomes on the left-hand side of our estimation equations. For example, in our typical linear model, we would define

\begin{align}
y &= b_0 + b_1 x + e \\
e &\sim N\left(0,\sigma^2\right)
\end{align}

where the second line defines the unobservable $e$ to be drawn from the Normal distribution with mean zero and variance $\sigma^2$.^[We have not insisted too much on the fact that $e$ should be distributed according to the *Normal* distribution (this is required in particular for the theoretical derivation of standard errors as seen in chapter \@ref(std-errors)). However, we'd always have an unbounded and continuous distribution underlying our models.] That means that, at least in principle, $y$ could be any number on the real line ($e$ could be arbitrarily small or large), and we can say that $y \in \mathbb{R}$.

For the outcomes we studied, that was fine: test scores, earnings, crime rates etc. are all continuous outcomes. But some outcomes are clearly binary (i.e. either `TRUE` or `FALSE`):

* You either work or you don't,
* You either have children or you don't,
* You either bought a product or you didn't,
* You flipped a coin and it came up either heads or tails.

In this situation, our outcome is restricted to come from a small set of values: `FALSE` vs `TRUE`, or `0` vs `1`. We'd have $y \in \{0,1\}$. In those situations we are primarily interested in estimating the **response probability** or the **probability of success**,

$$
p(x) = \Pr(y=1 | x),
$$
or in words, *the probability of observing $y=1$ (a success), given explanatory variables $x$*. In particular, we will often be interested in learning how $p(x)$ changes as we change $x$ - that is, we are interested in the same *partial effect* of $x$ on the outcome as in our usual linear regression setup. Here, we ask

```{block,type = "tip"}
If we increase $x$ by one unit, how would the probability of $y=1$ change?
```

It is worth reminding ourselves of two simple facts about binary random variables (i.e. variables drawn from the [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli_distribution) distribution). We call a random variable $y \in \{0,1\}$ such that

\begin{align}
\Pr(y = 1) &= p \\
\Pr(y = 0) &= 1-p \\
p &\in[0,1]
\end{align}

a *Bernoulli* random variable. In our setting, we just *condition* those probabilities on a covariate $x$, as above - that is, we measure the probability *given that $X$ takes value $x$*:

\begin{align}
\Pr(y = 1 | X = x) &= p(x) \\
\Pr(y = 0 | X = x) &= 1-p(x) \\
p(x) &\in[0,1]
\end{align}

Of particular interest for us is the fact that the *expected value* (i.e. the average) of $y$ given $x$ is

$$
E[y | x] = p(x) \times 1 + (1-p(x)) \times 0 = p(x)
$$
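
A quick simulation may help convince you of this fact. Here is a minimal sketch (the success probability `p` and the number of draws are arbitrary choices for illustration):

```{r}
# Simulate many Bernoulli draws with success probability p and check
# that their sample average is close to p, illustrating E[y] = p.
set.seed(42)                                # arbitrary seed, for reproducibility
p <- 0.3                                    # hypothetical success probability
y <- rbinom(n = 10000, size = 1, prob = p)  # 10,000 draws of 0 or 1
mean(y)                                     # should be close to 0.3
```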

There are several ways to model such binary outcomes. Let's look at them.

## The Linear Probability Model

The Linear Probability Model (LPM) is the simplest option. In this case, we model the response probability as

$$
\Pr(y = 1 | x) = p(x) = \beta_0 + \beta_1 x_1 + \dots + \beta_K x_K (\#eq:LPM)
$$
Our interpretation changes slightly from our usual setup: now we'd say that *a one-unit change in $x_1$, say, results in a change in $p(x)$ of $\beta_1$*.

Estimation of the LPM as in equation \@ref(eq:LPM) can be performed by standard OLS. Let's look at an example. The Mroz (1987) dataset lets us investigate female labor market participation. How does a woman's `inlf` (*in labor force*) status depend on non-wife household income, her education, age and number of small children? First, let's look at a quick plot that shows how the outcome varies with one variable, say age:

```{r}
data(mroz, package = "wooldridge")
plot(factor(inlf) ~ age, data = mroz,
     ylevels = 2:1,
     ylab = "in labor force?")
```

There is not much variation with respect to age, except in the later years. Let's run the LPM now:

```{r}
LPM = lm(inlf ~ nwifeinc + educ + exper + I(exper^2) +
           age + I(age^2) + kidslt6, data = mroz)
summary(LPM)
```
You can see that this is *identical* to our previous linear regression models - with the exception that the outcome `inlf` takes on only two values, 0 or 1. The results imply that if non-wife income increases by 10 units (i.e. 10,000 USD), the probability of being in the labor force falls by 0.034 (a small effect!), whereas an additional small child reduces the probability of work by 0.26 (a large one). So far, so simple.
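
As a quick check, we can read those two numbers straight off the estimated coefficient vector (scaling the `nwifeinc` coefficient by 10 for a 10-unit change):

```{r}
coef(LPM)["nwifeinc"] * 10  # effect of a 10,000 USD increase in non-wife income
coef(LPM)["kidslt6"]        # effect of one additional child below age 6
```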


One often-mentioned problem of this model is the fact that nothing restricts our predictions of $p(x)$ to be proper probabilities, i.e. to lie in the unit interval $[0,1]$. You can see that quite easily here:

```{r}
pr = predict(LPM)                  # fitted "probabilities" from the LPM
plot(pr[order(pr)], ylab = "p(inlf = 1)")
abline(a = 0, b = 0, col = "red")  # lower bound of a proper probability
abline(a = 1, b = 0, col = "red")  # upper bound
```

This picture tells you that for quite a few observations, this model predicts a probability of working which is either greater than one or smaller than zero. This may or may not be a big problem for your analysis. If you only care about marginal effects (i.e. the $\beta$s), that may be acceptable, in particular if you have discrete variables on the RHS; if you want actual *predictions*, then that's more problematic.
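
To quantify how often this happens, we can count the offending predictions directly - a small sketch reusing the vector `pr` computed above:

```{r}
sum(pr < 0 | pr > 1)   # number of predictions outside [0,1]
mean(pr < 0 | pr > 1)  # as a share of the sample
range(pr)              # smallest and largest fitted "probability"
```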

In the case of a *saturated model* - when we only have dummy explanatory variables - this problem does not exist for the LPM:

```{r saturated,message=FALSE,warning=FALSE,fig.cap = "LPM model in a saturated setting, i.e. only mutually exclusive and exhaustive dummy variables on the RHS."}
library(dplyr)
library(ggplot2)
library(magrittr)  # needed for the compound assignment pipe %<>%
mroz %<>%
  # classify age into 3 and huswage into 2 classes
  mutate(age_fct = cut(age, breaks = 3, labels = FALSE),
         huswage_fct = cut(huswage, breaks = 2, labels = FALSE)) %>%
  mutate(classes = paste0("age_", age_fct, "_hus_", huswage_fct))
LPM_saturated = mroz %>%
  lm(inlf ~ age_fct + huswage_fct, data = .)
mroz$pred <- predict(LPM_saturated)
ggplot(mroz[order(mroz$pred),], aes(x = 1:nrow(mroz), y = pred, color = classes)) +
  geom_point() +
  theme_bw() +
  scale_y_continuous(limits = c(0,1), name = "p(inlf)") +
  ggtitle("LPM in a Saturated Model is Perfectly Fine")
```

In figure \@ref(fig:saturated) each line segment corresponds to the average probability of work *within that cell*. For example, you see that women from the youngest age category and lowest husband income class (`age_1_hus_1`) have the highest probability of working (`r round(max(mroz$pred),3)`).
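
To see why saturation rules out out-of-range predictions, here is a sketch that regresses `inlf` directly on the cell indicator `classes` (a fully saturated specification, used here purely for illustration): the fitted value in each cell is just the within-cell average of a 0/1 variable, which must lie in $[0,1]$.

```{r}
# Saturated LPM: one dummy per cell, so fitted values equal cell means.
LPM_cells <- lm(inlf ~ classes, data = mroz)
mroz %>%
  mutate(pred_cells = predict(LPM_cells)) %>%
  group_by(classes) %>%
  summarise(mean_inlf = mean(inlf),         # raw average of inlf in the cell
            pred = first(pred_cells))       # fitted value in the same cell
```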

## Nonlinear Binary Response Models

In this class of models we change the way we model the response probability $p(x)$. Instead of the simple linear structure from above, we write

$$
\Pr(y = 1 | x) = p(x) = G \left(\beta_0 + \beta_1 x_1 + \dots + \beta_K x_K \right) (\#eq:GLM)
$$
Note that this is *almost* identical - only that the entire sum $\beta_0 + \beta_1 x_1 + \dots + \beta_K x_K$ now sits inside some function $G(\cdot)$. The main property of $G$ is that it transforms any value $z\in \mathbb{R}$ you give it into a number in the interval $(0,1)$. This immediately solves our problem of getting weird predictions for probabilities. The two most widely used forms of $G$ are the **probit** and the **logit** model. Here are both forms of $G$ in one plot:

```{r cdfs, fig.cap = "The Probit and Logit functional forms for binary choice models",warning = FALSE}
library(ggplot2)
ggplot(data.frame(x = c(-5,5)), aes(x=x)) +
stat_function(fun = pnorm, aes(colour = "Probit")) +
stat_function(fun = plogis, aes(colour = "Logit")) +
theme_bw() +
scale_colour_manual(name = "Function G",values = c("red", "blue")) +
scale_y_continuous(name = "Pr(y = 1 | x)")
```
You can see that

1. any value $x$ results in a value $p(x)$ between 0 and 1,
1. the higher $x$, the higher the resulting $p(x)$.
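
You can verify the first property directly in R: both candidate functions squash any real number into the unit interval (the evaluation points below are arbitrary):

```{r}
z <- c(-10, -1, 0, 1, 10)  # arbitrary points on the real line
pnorm(z)   # probit: standard normal CDF, always in (0,1)
plogis(z)  # logit: logistic CDF, always in (0,1)
```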


### Interpretation of Coefficients

Let's run the Mroz example from above in both probit and logit now:

```{r}
probit <- glm(inlf ~ age,
              data = mroz,
              family = binomial(link = "probit"))
logit <- glm(inlf ~ age,
             data = mroz,
             family = binomial(link = "logit"))
modelsummary::modelsummary(list("probit" = probit, "logit" = logit))
```

From this table, we learn that the coefficient for `age` is `r round(coef(probit)[2],3)` for probit and `r round(coef(logit)[2],3)` for logit, respectively. In both cases, this tells us that the impact of an additional year of age on the probability of working is **negative**. However, we cannot straightforwardly read off the *magnitude* of the effect - we can't tell **by how much** the probability decreases. Why is that?

One simple way to see this is to look back at figure \@ref(fig:cdfs) and imagine we had just one explanatory variable (like here - `age`). The model is

$$
\Pr(y = 1 | \text{age})= G \left(x \beta\right) = G \left(\beta_0 + \beta_1 \text{age} \right)
$$
and the *marginal effect* of `age` on the response probability is
$$
\frac{\partial{\Pr(y = 1 | \text{age})}}{ \partial{\text{age}}} = g \left(\beta_0 + \beta_1 \text{age} \right) \beta_1 (\#eq:ME)
$$
where the function $g$ is defined as $g(z) = \frac{dG}{dz}(z)$ - the first derivative of $G$ (i.e. the *slope* of $G$). The formulation in \@ref(eq:ME) is a result of the [chain rule](https://en.wikipedia.org/wiki/Chain_rule). Now, given that figure \@ref(fig:cdfs) shows a nonlinear $G$, $g$ will be nonlinear as well: close to the edges of the graph the slope is really small and close to zero, while in the center of the graph it is really steep. You can try this out yourself with our [app](https://github.com/ScPoEcon/ScPoApps/issues/4).
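
To make this concrete, we can evaluate \@ref(eq:ME) by hand for the simple `age` probit from above. For the probit, $g$ is the standard normal density `dnorm`; the evaluation ages below are arbitrary choices:

```{r}
b <- coef(probit)                 # intercept and age coefficient
ages <- c(30, 45, 60)             # arbitrary ages at which to evaluate
dnorm(b[1] + b[2] * ages) * b[2]  # one marginal effect per age
```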

So you can see that there is not one single *marginal effect* in those models - it depends on *where we evaluate* expression \@ref(eq:ME). The situation is identical with more than one $x$. In practice, there are two common approaches (a hand-rolled sketch of both follows the list):

1. report \@ref(eq:ME) at the average values of $x$: $$g(\bar{x} \beta) \beta_j$$
1. report the sample average of all marginal effects: $$\frac{1}{n} \sum_{i=1}^N g(x_i \beta) \beta_j$$
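
Here is a hand-rolled sketch of both approaches for the same `age` probit (illustrative only - dedicated packages also deliver standard errors):

```{r}
b  <- coef(probit)
xb <- b[1] + b[2] * mroz$age                # linear index for every observation
dnorm(b[1] + b[2] * mean(mroz$age)) * b[2]  # 1. effect at the average age
mean(dnorm(xb)) * b[2]                      # 2. average of individual effects
```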

Thankfully there are packages available that help us compute those marginal effects fairly easily. One of them is called [`mfx`](https://cran.r-project.org/web/packages/mfx/), and we would use it as follows:

```{r glms}
f <- "inlf ~ age + kidslt6 + nwifeinc" # setup a formula
glms <- list()
glms$probit <- glm(formula = f,
data = mroz,
family = binomial(link = "probit"))
glms$logit <- glm(formula = f,
data = mroz,
family = binomial(link = "logit"))
# now the marginal effects versions
glms$probitMean <- mfx::probitmfx(formula = f,
data = mroz, atmean = TRUE)
glms$probitAvg <- mfx::probitmfx(formula = f,
data = mroz, atmean = FALSE)
glms$logitMean <- mfx::logitmfx(formula = f,
data = mroz, atmean = TRUE)
glms$logitAvg <- mfx::logitmfx(formula = f,
data = mroz, atmean = FALSE)
modelsummary::modelsummary(glms,
stars = TRUE,
gof_omit = "AIC|BIC",
title = "Logit and Probit estimates and marginal effects evaluated at mean of x or as sample average of effects")
```

In table \@ref(tab:glms) you should first note that the estimates in the first two columns (probit or logit) don't correspond to the remaining columns. That's because they only give you the $\beta$'s. As we learned above, that in itself is not informative, as it depends on *where* one computes the marginal effects. Hence the remaining columns compute the marginal effects either at the mean of all regressors (`probitMean`) or as the sample average over all effects in the data (`probitAvg`). You can notice some differences here: for example, at the average regressor values, an additional child below the age of 6 reduces the probability of work by 0.314, whereas as an average over all sample effects it reduces it by 0.29. Furthermore, you see that the marginal effect estimates between probit and logit don't correspond exactly, which is a consequence of the different shapes of the curves in figure \@ref(fig:cdfs). Neither approach is universally correct; which one to use depends on how your data are distributed (e.g. is the mean a good summary of the data?). What is clear, though, is that in most cases reporting coefficient estimates only is not very informative (it only tells you the direction of any effect).


File renamed without changes.
15 changes: 12 additions & 3 deletions DESCRIPTION
@@ -1,5 +1,5 @@
Package: ScPoEconometrics
Type: Book
Type: Package
Title: ScPoEconometrics
Date: 2019-10-10
Version: 0.2.6
@@ -32,8 +32,17 @@ Imports: bookdown,
quantreg,
equatiomatic,
ungeviz,
masteringmetrics,
ggdag,
data.table,
huxtable
Remotes: datalorax/equatiomatic, wilkelab/ungeviz
huxtable,
cholera,
reshape2,
modelsummary,
estimatr,
gganimate,
fixest,
transformr,
mfx
Remotes: datalorax/equatiomatic, wilkelab/ungeviz, jrnold/masteringmetrics/masteringmetrics
RoxygenNote: 7.1.1
6 changes: 6 additions & 0 deletions R/utils.R
@@ -13,3 +13,9 @@ pasta_maker <- function(){
pasta_jar
}

pasta_image <- function(){
  data(pasta_jar)
  # encode each pasta color as an integer code
  pasta_jar$cf = as.numeric(factor(pasta_jar$color))
  # arrange the codes on a 44 x 45 grid and plot them as an image
  m = matrix(c(pasta_jar$cf),44,45)
  image(m,col = c("green","orange","white"),xaxt="n",yaxt = "n")
}
2 changes: 1 addition & 1 deletion _build.sh
@@ -4,6 +4,6 @@ set -e

# build book(s)
Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::gitbook')"
Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::pdf_book')"
# Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::pdf_book')"
Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::epub_book')"

2 changes: 0 additions & 2 deletions _deploy.sh
@@ -4,8 +4,6 @@ set -e

[ -z "${GH_TOKEN}" ] && exit 0
[ "${TRAVIS_BRANCH}" != "master" ] && exit 0
[ "${TRAVIS_PULL_REQUEST}" != "false" ] && exit 0


git config --global user.email "florian.oswald@gmail.com"
git config --global user.name "Florian Oswald"
File renamed without changes.
50 changes: 50 additions & 0 deletions book.bib
@@ -29,4 +29,54 @@ @article{pinotti
Pages = {138-68},
DOI = {10.1257/aer.20150355},
URL = {http://www.aeaweb.org/articles?id=10.1257/aer.20150355}}
@article{freedman1991,
title={Statistical models and shoe leather},
author={Freedman, David A},
journal={Sociological Methodology},
pages={291--313},
year={1991},
publisher={JSTOR}
}
@book{deaton1997,
title={The analysis of household surveys: a microeconometric approach to development policy},
author={Deaton, Angus},
year={1997},
publisher={The World Bank}
}
@article{angristlavy,
title={Using Maimonides' rule to estimate the effect of class size on scholastic achievement},
author={Angrist, Joshua D and Lavy, Victor},
journal={The Quarterly Journal of Economics},
volume={114},
number={2},
pages={533--575},
year={1999},
publisher={MIT Press}
}
@article{angristkrueger,
author = {Angrist, Joshua D. and Krueger, Alan B.},
title = {Does Compulsory School Attendance Affect Schooling and Earnings?},
journal = {The Quarterly Journal of Economics},
volume = {106},
number = {4},
pages = {979-1014},
year = {1991},
month = {11},
abstract = "{We establish that season of birth is related to educational attainment because of school start age policy and compulsory school attendance laws. Individuals born in the beginning of the year start school at an older age, and can therefore drop out after completing less schooling than individuals born near the end of the year. Roughly 25 percent of potential dropouts remain in school because of compulsory schooling laws. We estimate the impact of compulsory schooling on earnings by using quarter of birth as an instrument for education. The instrumental variables estimate of the return to education is close to the ordinary least squares estimate, suggesting that there is little bias in conventional estimates.}",
issn = {0033-5533},
doi = {10.2307/2937954},
url = {https://doi.org/10.2307/2937954},
eprint = {https://academic.oup.com/qje/article-pdf/106/4/979/5298446/106-4-979.pdf},
}
@article{angristkruegerIV,
title={Instrumental variables and the search for identification: From supply and demand to natural experiments},
author={Angrist, Joshua D and Krueger, Alan B},
journal={Journal of Economic Perspectives},
volume={15},
number={4},
pages={69--85},
year={2001}
}



Binary file removed favicon.gif
Binary file not shown.
Binary file added favicon.png
Binary file not shown.
Binary file added images/Cholera_art.jpg
Binary file not shown.
Binary file added images/father-thames.jpg
Binary file not shown.
Binary file added images/snow-map.jpg
Binary file not shown.
Binary file added images/snow-supply.jpg
Binary file not shown.
Binary file added inst/datasets/grade5.dta
Binary file not shown.
5 changes: 4 additions & 1 deletion preamble.tex
@@ -1,8 +1,11 @@
\usepackage{tcolorbox}
\usepackage{booktabs}
\usepackage{amsthm}
\usepackage{tcolorbox}

\newenvironment{note}{\begin{tcolorbox}[colback=blue!5!white,colframe=blue!75!black]}{\end{tcolorbox}}
\newenvironment{notel}{\begin{tcolorbox}[colback=blue!5!white,colframe=blue!75!black]}{\end{tcolorbox}}
\newenvironment{warning}{\begin{tcolorbox}[colback=orange!5!white,colframe=orange]}{\end{tcolorbox}}
\newenvironment{warningl}{\begin{tcolorbox}[colback=orange!5!white,colframe=orange]}{\end{tcolorbox}}
\newenvironment{tip}{\begin{tcolorbox}[colback=green!5!white,colframe=green]}{\end{tcolorbox}}

\makeatletter
Expand Down
12 changes: 12 additions & 0 deletions style.css
@@ -25,7 +25,13 @@ pre code {
background-color: #e7f2fa;
border-radius: 5px;
text-align: center;
}

.notel {
padding: 0.5em;
background-color: #e7f2fa;
border-radius: 5px;
text-align: left;
}

.warning {
@@ -34,6 +34,12 @@ pre code {
border-radius: 5px;
text-align: center;
}
.warningl {
padding: 0.5em;
background-color: #f0b37e;
border-radius: 5px;
text-align: left;
}

.tip {
padding: 0.5em;
