# 2 Regression analysis basics
## 2.2 Main objectives for regression analysis
1. The most common use of regression analysis is to quantify how one factor causally affects another.
1. A second objective of regressions is to forecast or predict an outcome.
1. A third use of regressions is to determine the predictors of some factor.
1. A fourth main use of regression analysis is to adjust an outcome for various factors.
## 2.3 The Simple Regression Model
### 2.3.1 The components and equations
$$Y_i = \beta_0+\beta_1 * X_i + \epsilon_i$$
- The **dependent variable** ($Y$), which is also called the outcome, response variable, regressand, or Y variable.
- The **explanatory variable** ($X$), which is also called the independent variable, treatment variable, regressor, or simply X variable.
- The **coefficient on the explanatory variable** ($\beta_1$), which indicates the slope of the regression line, or how the outcome ($Y$) is estimated to move, on average, with a one-unit change in the explanatory variable ($X$).
- The **intercept term** ($\beta_0$), which indicates the Y-intercept from the regression line, or what the expected value of Y would be when X = 0. This is sometimes called the "constant" term.
- The **error term ($\epsilon$)**, which indicates how far off an individual data point is, vertically, from the true regression line. This occurs because regressions typically cannot perfectly predict the outcome.
#### 2.3.1.1 X and Y Variables
Theoretically (and hopefully) Y depends on what X is and not vice versa. Hopefully, X is random with respect to Y, meaning that X changes for reasons that are unrelated to what is happening to Y.
#### 2.3.1.3 The error term
There are three main components of the error term:
1. The influence of variables not included in the regression.
1. The possibility that the X or Y variable is miscoded. For example, a person with a college degree (16 years-of-schooling) may be reported as having just two years of college (14).
1. The effects of random processes affecting the outcome.
#### 2.3.1.4 The theoretical/true regression equation vs. the estimated regression equation
The true regression equation:
$$Y_i = \beta_0+\beta_1 * X_i + \epsilon_i$$
With data available, we produce the estimated regression equation as follows:
$$Y_i = \hat{\beta}_0+\hat{\beta}_1 * X_i + \hat{\epsilon}_i\tag{2.3a}$$
$$\hat{Y}_i = \hat{\beta}_0+\hat{\beta}_1 * X_i\tag{2.3b}$$

The “hats” ($\hat{ }$) over $Y$, $\beta_0$, $\beta_1$, and $\epsilon$ indicate that they are predicted or estimated values.
- $\hat{\beta}_0$ is the predicted intercept term.
- $\hat{\beta}_1$ is the predicted coefficient on the variable X.
- $\hat{\epsilon}_i$ is the predicted error term (known as the **residual**) based on the actual X and Y values and the coefficient estimates.
- $\hat{Y}_i$ is the predicted value of Y based on the coefficient estimates and the value of the variable, X. Note that the predicted value of Y does not include the residual because the expected or average residual equals zero with most methods.

### 2.3.2 An example with education and income data
There are various methods for estimating the best-fitting regression line. By far, the most common one is the Ordinary Least Squares (OLS) method. The “Least Squares” part refers to minimizing the sum of the squared residuals across all observations.

The two relevant variables for the example now are:
- $Y$ = $income$ = income of the individual
- $X$ = $educ$ = years of schooling

With the sample available, we obtain the following regression equation (rounding):
$$\hat{Y} = -54,299 + 8,121 * X\tag{2.4a}$$
or $$\hat{income} = -54,299 + 8,121 * educ\tag{2.4b}$$

### 2.3.3 Calculating individual predicted values and residuals
The **predicted value** indicates what we would expect income to be for a given level of schooling. 

The **residual**, how the actual Y value differs from the predicted value, would be how much more (or less) income is relative to the predicted income, given the level of schooling. With some regression methods, notably the most common one of OLS, the average residual equals 0. Thus, the amount of over-prediction and under-prediction are equal.

As an example, one person in the sample has 10 years-of-schooling and \$25,000 of income. His regression statistics would be:
- Predicted value of $Y = \hat{Y} = E[Y|X=10] = -54,299 + 8,121 * 10 = \$26,911 $
- Residual $=Y - \hat{Y} = Y - E[Y|X] = \$25,000 - \$26,911 = -\$1,911 $ 

The interpretations are that:
- We predict that someone with 10 years-of-schooling would have an income of \$26,911.
- This person with 10 years-of-schooling and an income of \$25,000 has \$1,911 lower income than what would be predicted from the regression.
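
The arithmetic above can be reproduced in a few lines of Python. This is only a minimal sketch: the function names `predict_income` and `residual` are illustrative choices, and the coefficients are the rounded estimates from equation (2.4b).

```python
# Rounded coefficient estimates from equation (2.4b).
INTERCEPT = -54_299
SLOPE_EDUC = 8_121

def predict_income(educ):
    """Predicted income for a given number of years of schooling."""
    return INTERCEPT + SLOPE_EDUC * educ

def residual(actual_income, educ):
    """Actual income minus predicted income."""
    return actual_income - predict_income(educ)

predicted = predict_income(10)
resid = residual(25_000, 10)
print(predicted)  # 26911
print(resid)      # -1911
```

A negative residual means the person earns less than the regression predicts for that level of schooling.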

## 2.4 How are regression lines determined?
### 2.4.1 Calculating regression equations

<center>Table 2.2 Four-observation example of education and income</center>
    
|<div style="width: 50pt">Person</div>|Years of schooling (X)|Income (Y)|Deviation from mean X|Deviation from mean Y|<div style="width: 100pt">Numerator for slope $(X_i-\bar{X}) * (Y_i-\bar{Y})$</div>|Denominator for slope $(X_i-\bar{X})^2$|
|:-|:-|:-|:-|:-|:-|:-|
|1|10|40|-3|-5|15|9|
|2|12|45|-1|0|0|1|
|3|14|40|+1|-5|-5|1|
|4|16|55|+3|+10|30|9|
| |$\bar{X} = 13$|$\bar{Y} = 45$| | |40|20|

When using the OLS method for the Simple Regression Model, the following equation is the estimated slope of the regression line:
$$\hat{\beta}_1 = \frac{\sum(X_i-\bar{X}) * (Y_i-\bar{Y})}{\sum(X_i-\bar{X})^2}$$
It is fairly straightforward. The equation represents how X and Y move together (the numerator) relative to how much variation there is in X (the denominator). Given the column totals in Table 2.2, $\hat{\beta}_1 = 40 / 20 = 2$. To then derive the coefficient estimate for $\beta_0$, we use the centroid feature of Ordinary Least Squares: the regression line under OLS goes through the point that has the average X and Y values. Thus, $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 * \bar{X} = 45 - 2 * 13 = 19$.
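
As a check on the hand calculation, the same steps can be sketched in plain Python using the four observations from Table 2.2. The variable names are illustrative, and no regression library is assumed.

```python
# Table 2.2: years of schooling (X) and income (Y) for four people.
x = [10, 12, 14, 16]
y = [40, 45, 40, 55]

x_bar = sum(x) / len(x)  # 13.0
y_bar = sum(y) / len(y)  # 45.0

# Numerator: how X and Y move together; denominator: variation in X.
numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
denominator = sum((xi - x_bar) ** 2 for xi in x)

slope = numerator / denominator    # 40 / 20
intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
print(slope, intercept)  # 2.0 19.0
```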
    
### 2.4.2 Total variation, variation explained, and remaining variation
The **total variation** or **Total Sum of Squares** (*TSS*) of the outcome is the sum of the squared deviations from the mean, or:
$$TSS = \sum(Y_i - \bar{Y})^2$$
In our example, the TSS is calculated as $TSS = (-5)^2+(0)^2+(-5)^2+(+10)^2 = 150$    

The **TSS** can then be separated into two components:
- ExSS = Explained Sum of Squares = total variation explained by the regression model.
- RSS = Residual Sum of Squares = total variation remaining unexplained by the regression model (or the sum of the squared residuals).

When using Ordinary Least Squares, the model is estimated by finding the set of coefficients that yields the smallest RSS and thus the largest ExSS.
    
<center>Table 2.3 Predicted values and residuals for the four-observation sample</center>

|Person|Years of schooling|Income|Predicted income|Residual|
|:-|:-|:-|:-|:-|
|1|10|40|39|+1|
|2|12|45|43|+2|
|3|14|40|47|-7|
|4|16|55|51|+4|
    
In our example, the RSS is calculated as $RSS = (1)^2+(2)^2+(-7)^2+(+4)^2 = 70$
And the ExSS as $ExSS = TSS - RSS = 150 - 70 = 80$
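
The decomposition can be verified with a short sketch. It takes the OLS estimates from Section 2.4.1 (intercept 19, slope 2) as given. The `exss_direct` line is an extra check added here: it computes the explained variation directly from the predicted values rather than as TSS minus RSS.

```python
x = [10, 12, 14, 16]
y = [40, 45, 40, 55]
intercept, slope = 19, 2  # OLS estimates from Section 2.4.1

y_bar = sum(y) / len(y)
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
rss = sum(e ** 2 for e in residuals)                   # unexplained variation
exss = tss - rss                                       # explained variation
exss_direct = sum((pi - y_bar) ** 2 for pi in predicted)

print(predicted)       # [39, 43, 47, 51]
print(residuals)       # [1, 2, -7, 4]
print(tss, rss, exss)  # 150.0 70 80.0
```

Both routes to ExSS give 80, which holds for OLS with an intercept but not necessarily for other methods.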
    
## 2.5 The explanatory power of the regression
### 2.5.1 R-squared
R-squared (or $R^2$) is the proportion of the variation in the outcome, Y, that is explained by the X variable(s). In other words:
$$R^2 = \frac{ExSS}{TSS} = \frac{TSS-RSS}{TSS} $$
In our example:
$$R^2 = \frac{150-70}{150} = 0.533$$

The interpretation for $R^2$ is that it indicates how much of the total variation in the dependent variable (Y) can be explained by all of the explanatory (X) variables.

### 2.5.2 Adjusted R-squared
When a new explanatory variable is added to a model (as in the Multiple Regression Model below), $R^2$ never decreases and almost always increases, even if there is no systematic relationship between the new X variable and the Y variable. This is because there is almost always at least some incidental correlation between two variables. The Adjusted $R^2$ corrects for the incidental correlation. The formula is:
$$\text{Adjusted } R^2 = \bar{R}^2 = 1 - \frac{\frac{\sum{\hat{\epsilon}^2}}{(n-K-1)}}{\frac{TSS}{(n-1)}}$$

Here n is the sample size, and K is the number of explanatory variables.

The Adjusted $R^2$ increases only if the new X variable explains more of the variation in the outcome, Y, than a randomized variable would be expected to explain by chance, thereby reducing the sum of the squared residuals ($\sum{\hat{\epsilon}^2}$) by a meaningful amount.
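
To see the size of the adjustment, the two formulas can be applied to the four-observation example (TSS = 150, RSS = 70, n = 4, K = 1). This is a minimal sketch, and the function names are illustrative.

```python
def r_squared(tss, rss):
    """Share of the total variation explained by the model."""
    return (tss - rss) / tss

def adjusted_r_squared(tss, rss, n, k):
    """n = sample size, k = number of explanatory variables."""
    return 1 - (rss / (n - k - 1)) / (tss / (n - 1))

r2 = r_squared(150, 70)
adj_r2 = adjusted_r_squared(150, 70, n=4, k=1)
print(round(r2, 3))      # 0.533
print(round(adj_r2, 3))  # 0.3
```

With only four observations the penalty is large. In bigger samples the two statistics are usually much closer.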
    
### 2.5.3 Mean Square Error and Standard Error of the Regression
An important statistic that will be used later is the estimated variance of the error, or the **Mean Square Error (MSE)**, calculated as:
$$MSE = \hat{\sigma}^2 = \frac{\sum{\hat{\epsilon}^2}}{n-K-1}$$
The square root of MSE, or $\hat{\sigma}$, is referred to as: 
- The Standard Error of the Estimate.
- The Standard Error of the Regression.
- The Root Mean Square Error (**RMSE**).
    
The RMSE describes the spread of the residuals. In particular, if the residuals are approximately normally distributed, about 95% of them should be smaller than roughly 1.96 × RMSE in absolute value.
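
A small sketch using the residuals from Table 2.3 shows how the RMSE is computed. Conventions differ on the divisor (the sample size n, or the degrees of freedom n − K − 1), and the gap is large in a sample this small, so the helper below takes the divisor as an explicit argument instead of assuming one. The function name `rmse` is illustrative.

```python
import math

def rmse(residuals, divisor):
    """Square root of (sum of squared residuals / divisor)."""
    return math.sqrt(sum(e ** 2 for e in residuals) / divisor)

residuals = [1, 2, -7, 4]  # Table 2.3
n, k = 4, 1                # four observations, one explanatory variable

rmse_n = rmse(residuals, n)            # divide by the sample size
rmse_dof = rmse(residuals, n - k - 1)  # divide by the degrees of freedom
print(round(rmse_n, 3), round(rmse_dof, 3))  # 4.183 5.916
```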
    
## 2.8 Correlation vs. causation
A correlation is when two variables move together, positively or negatively. That is, when one variable is higher, the other variable tends to be higher (a positive correlation) or lower (a negative correlation). In contrast, two variables having no correlation (or a correlation very close to zero) means that when one moves, there is no tendency for the other variable to be higher or lower.
Causation is when one variable has an effect on another variable. Causation would mean that, if some randomness or some force were to cause variable X to change, then, on average, variable Y would change as a result.
The important point is this: a correlation can exist without causation.
    
## 2.9 The Multiple Regression Model
Continuing with the example from the prior section on estimating the effects of schooling on income, other factors (such as aptitude) may differ between people with different levels of schooling. We want to hold these other factors constant so that we can compare incomes for people who are observationally similar except for the level of education they have. One way to eliminate or reduce the influence of some of these other factors may be the Multiple Regression Model. 
    
The explanatory variables can be classified into two types, characterized here in terms of the objective of identifying causal effects:
- The key-explanatory (X) variable(s) is the variable or set of variables for which you are trying to identify the causal effect. This is often called the “treatment.” (“Years-of-schooling,” $X_1$, is the key-X variable in our example.)
- The control variables are the variables included in the model to help identify the causal effects of the key-X variable(s). (“Aptitude score,” $X_2$, is the control variable.)

When we add Armed Forces Qualification Test percentile (hereafter, called “AFQT” or “AFQT score”) to our former regression model, the regression equation (using descriptive variable names) is:
$$(\hat{income})_i = -34,027 + 5,395 * (educ)_i + 367 * (afqt)_i$$
compared with:
$$(\hat{income})_i = -54,299 + 8,121 * (educ)_i\tag{2.4b}$$

One reason why those with more schooling may have higher income is that they tend to have higher aptitude. As researchers, we are interested in how one extra year of schooling would affect a person’s income, not the combined effects of an extra year of schooling plus some extra aptitude associated with that extra year of schooling. With AFQT score in the model, the variation in income coming from the variation in years-of-schooling is now largely (but not entirely) independent of variation in aptitude, and the coefficient estimate on years-of-schooling falls from 8,121 to 5,395.
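
The following sketch illustrates why the schooling coefficient shrinks once aptitude is controlled for. Everything in it is an assumption made for illustration: the six observations are made up, income is generated with no error term from "true" coefficients of 5 on schooling and 0.4 on aptitude, and the helper `simple_ols` is hypothetical. Rather than fitting a full multiple regression, it uses the partialling-out (Frisch-Waugh-Lovell) result: the multiple-regression coefficient on schooling equals the slope from regressing income on the part of schooling that aptitude does not linearly predict.

```python
def mean(values):
    return sum(values) / len(values)

def simple_ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope

# Made-up data: aptitude rises with schooling, and both raise income.
educ = [10, 12, 12, 14, 16, 16]
afqt = [20, 40, 50, 50, 70, 90]
income = [10 + 5 * e + 0.4 * a for e, a in zip(educ, afqt)]

# Simple regression: the schooling slope also absorbs the aptitude effect.
_, slope_simple = simple_ols(educ, income)

# Controlling for aptitude: remove the part of educ that is linearly
# predicted by afqt, then regress income on what is left.
a0, a1 = simple_ols(afqt, educ)
educ_resid = [e - (a0 + a1 * a) for e, a in zip(educ, afqt)]
_, slope_controlled = simple_ols(educ_resid, income)

print(round(slope_simple, 2), round(slope_controlled, 2))
```

In this constructed example the controlled slope recovers the assumed value of 5, while the simple-regression slope is larger because it also picks up the aptitude that comes with extra schooling.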
    
## 2.10 Assumptions of regression models
There are assumptions or conditions under which OLS coefficient estimates and standard errors are unbiased estimators of the true population parameters – the standard error being the standard deviation of a coefficient estimate.

- **A1. The average error term, $\epsilon$, equals 0**.
- **A2. The error terms are independently and identically distributed (i.i.d.).** This means that if one observation has, say, a positive error term, it should have no bearing on whether another observation has a positive or negative error term. That is, a given observation is not correlated with another observation. This could be violated if, for example, there were siblings in a sample. The siblings’ incomes (beyond what would be explained by years-of-schooling and aptitude) would probably be correlated with each other, as they have common unobservable determinants. Thus, this would violate the i.i.d. assumption.
- **A3. The error terms are normally distributed.** A common mistake is to believe that it is the dependent variable that needs a normal distribution, but it is the error terms that we hope are normally distributed. With the Central Limit Theorem, however, almost any model that has an approximately continuous dependent variable and at least 200 observations should have an approximately normal distribution of error terms – that is, the errors would be asymptotically normal. Others say having 30–40 observations is adequate. Frost (2014) executes some simulations with various non-normal distributions and finds that having 15 observations is adequate.
- **A4. The error terms are homoskedastic.** This means that the variance of the error term, $\epsilon$, is uncorrelated with the values of the explanatory variables, or $var(\epsilon|X) = var(\epsilon)$ for all values of X.
- **A5. The key-explanatory variable(s) are uncorrelated with the error term, $\epsilon$**: If X1 represents a key-explanatory variable and X2 the set of control variables, then $E[\epsilon|X_1, X_2] = E[\epsilon|X_2]$. Note that the typical corresponding assumption used in most textbooks is that all of the explanatory variables, X, are uncorrelated with the error term. This is often called conditional mean independence. But this more stringent assumption is not necessary. Recall that the error term captures the effects of factors that are not included as explanatory variables. Consider the issue of how years-of-schooling affects income. If years-of-schooling (the X1 variable) were correlated with intelligence, and intelligence was part of the error term since it cannot be properly controlled for, then the coefficient estimate on years-of-schooling would be biased because it would reflect the effects of intelligence on income. There are also issues with A5:
    - For two of the regression objectives (forecasting and determining predictors), it is not important that this condition holds. It is only important for estimating causal effects and adjusting outcomes.
    - This condition is often violated, so technically it might not be such a great idea to make the assumption. Rather, use theory to assess whether it is a good assumption. If there were any possibility of this assumption being violated, then you would need to design a model to address the problem or, at the very least, acknowledge the problem.
    
## 2.13 Why regression results might be wrong: inaccuracy and imprecision
The two main reasons why economists could be wrong are: (1) inaccuracy brought on by a systematic bias; and (2) imprecision or uncertainty because of a limited sample or some randomness in the sample.
    
## 2.14 The use of regression flowcharts
The rectangles represent a variable or set of variables, while an oval represents an unobserved factor for which there would not be data due to being unavailable, unobserved, or non-quantifiable. In our case, we aim to estimate (for children) how the average daily hours of TV watched affect weight, as measured by the Body Mass Index (BMI). The regression model would hopefully be designed so that it would, as accurately and precisely as possible, produce a coefficient representing the true effect. Yet, as discussed in the prior section, many things might get in the way of that. If B or both C and D in the following regression flowchart are non-zero, then there would be a bias.
![image.png](attachment:image.png)
    
## 2.16 Definitions and key concepts
### 2.16.1 Different types of data (based on the unit of observation and sample organization)
#### 2.16.1.1 Unit of observation: individual vs. aggregate data
The unit of observation indicates what an observation represents, or who/what is the subject. Individual entities are often an individual or a family, but it could be an organization or business – e.g., a baseball team, with the variables of interest being wins and total payroll. Aggregated data are averaged or summed typically at some geographical or organizational level, such as class-level or school-level data.
#### 2.16.1.2 Sample organization
1. Cross-sectional data. It involves taking a sample of all subjects (an individual or, say, a state) at a given period of time.
1. Time-series data. This is based on one subject over many time periods. An example of this could be quarterly earnings per share for one company over time.
1. Panel data. This is a combination of cross-sectional and time-series data. It involves multiple subjects observed at two or more periods each.
1. Multiple cross-sectional-period data. This has multiple years (as with panel data), but the subjects are different each year.
1. Survival data. This is somewhat similar to panel data in that it involves multiple time periods for each person or subject, but the dependent variable is a variable that turns from 0 to 1 if some event occurs. For example, in an examination of what causes divorces, a couple will be in the data for each year of marriage until a divorce occurs or until the end of observation for the couple. If they get divorced, they are no longer in the sample of couples “at risk” for getting divorced.
    
### 2.16.2 A model vs. a method
A method indicates how the model is estimated. And a given model can be estimated with various methods.
    
### 2.16.3 How to use subscripts and when they are needed
There are some situations in which multiple subscripts will be needed (or at least recommended). 
- First, when there are observations over several periods, it is useful (though not essential) to add a “time” subscript, usually t.
- Second, subscripts are needed when there is an aggregated variable used in the model that applies to multiple observations.
- Third, subscripts are needed when fixed effects are used.

# 3. Essential tools for regression analysis
## 3.1 Using dummy (binary) variables
### 3.1.1 The basics of dummy variables
When creating the dummy variables, not all categories can be included in the regression model. Rather, there needs to be a “reference category” (also called “reference group” or “excluded category”) so that each group has another group to be compared to. For example, for a yes/no classification (e.g., whether a patient receives a medical treatment), the two categories are boiled down to one variable (say, T, for “treatment”) coded as follows:
- T = 1 if the person receives the treatment.
- T = 0 if the person does not receive the treatment (the reference category).

### 3.1.2 Be mindful of the proper interpretation based on the reference group
![image.png](attachment:image.png)
Consider an alternative specification for education, using the highest degree earned. A “college degree” refers to having a 4-year college degree as one’s highest degree, while a “graduate degree” involves having a Master’s, Doctoral, or professional degree. Figure 3.1 shows the notional average incomes for each group, plus the average for the first two categories combined, which is weighted towards the “High school” average because a larger share of the sample has just a high-school diploma.

The estimated income premium for a graduate degree depends on what the reference group is. Estimating a model just with a dummy variable for having a graduate degree produces the following equation:
$$(\hat{income})_i = 50,000 + 40,000 * (\text{graduate degree})_i$$

Note that the intercept (\$50,000) is the average income for the first two groups. The \$40,000 estimate is the difference between those with a graduate degree and all others. Thus, without assigning those with just a high-school degree to a separate category, they are counted in the reference group along with those with a college degree. This is an important point: For a categorization, any group not having a value of one for a dummy variable in the categorization is part of the reference group.

Adding a variable for “having one’s highest degree being a college degree” would make the reference group those with just a high-school diploma: 
$$(\hat{income})_i = 40,000 + 30,000 * (\text{college degree})_i + 50,000 * (\text{graduate degree})_i$$
The intercept is now the “High school” average. The \$50,000 is the difference between the \$90,000 (for those with a graduate degree) and the \$40,000 for those with just a high-school diploma. It follows that the \$30,000 coefficient estimate is the difference between the \$70,000 (for those with a college degree being the highest degree earned) and the \$40,000.

An alternative setup could use “has a college degree” rather than “highest degree is a college degree,” which gives the regression equation:
$$(\hat{income})_i = 40,000 + 30,000 * (\text{college degree or more})_i + 20,000 * (\text{graduate degree})_i$$
The coefficient estimate of \$20,000 on the graduate degree is now relative to those with a college degree (\$90,000 − \$70,000). This is because those with a graduate degree (all of whom have a college degree also) have the \$30,000 for the college degree contributing to the predicted value, so the coefficient estimate on graduate degree is now what is over and above those with a college degree.
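
A quick way to convince yourself that the last two specifications describe the same notional group averages is to evaluate both for each group. This is a minimal sketch: the function names and the 0/1 codings in `groups` are illustrative, and the coefficients are taken from the two equations above.

```python
def spec_highest_degree(college_is_highest, graduate):
    """Dummies for 'highest degree is college' and 'graduate degree'."""
    return 40_000 + 30_000 * college_is_highest + 50_000 * graduate

def spec_college_or_more(college_or_more, graduate):
    """Dummies for 'has a college degree or more' and 'graduate degree'."""
    return 40_000 + 30_000 * college_or_more + 20_000 * graduate

groups = {
    # name: (highest degree is college, college or more, graduate degree)
    "high school": (0, 0, 0),
    "college":     (1, 1, 0),
    "graduate":    (0, 1, 1),
}

predictions = {}
for name, (col_highest, col_or_more, grad) in groups.items():
    a = spec_highest_degree(col_highest, grad)
    b = spec_college_or_more(col_or_more, grad)
    predictions[name] = (a, b)
    print(name, a, b)
# high school 40000 40000
# college 70000 70000
# graduate 90000 90000
```

The predicted values are identical. Only the meaning of each coefficient changes with the coding.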

## 3.2 Non-linear functional forms using Ordinary Least Squares
### 3.2.1 Combining variables for interaction effects
Suppose you have a theory that divorce would not be as bad and may actually help children if it gets them away from a dysfunctional situation with high levels of conflict. How can you examine whether divorce effects on children are different for those in such families?
One option is to estimate separate models for families with a high level (H) vs. a low level (L) of conflict.

High-level of conflict families: $Y_{iH} = \beta_{1H}X_{iH}+\beta_{2H}D_{iH}+\epsilon_{iH}$

Low-level of conflict families: $Y_{iL} = \beta_{1L}X_{iL}+\beta_{2L}D_{iL}+\epsilon_{iL}$

Where
- “H” and “L” subscripts refer to the families with “high” and “low” levels of conflict, respectively.
- Y is the outcome for child i, measured as the change in test score from 2010 to 2014.
- X is a set of control variables.
- D is an indicator (dummy variable) for having one’s parents divorce between 2010 and 2014.

The test to examine whether children from high-conflict families have different effects of divorce from that for children from low-conflict families would be a comparison of the coefficient estimates, $\hat{\beta}_{2H}$ and $\hat{\beta}_{2L}$. The expectation would be that $\hat{\beta}_{2H}$ would be less negative than $\hat{\beta}_{2L}$ (or even positive) – that is, any adverse effects of the divorce may be lower or non-existent for children from high-conflict families, and the divorce effects may even be positive.

An alternative method is to use interaction effects, which would involve combining all children, from both high- and low-conflict families, into one model, as follows:
$$Y_i = \beta_1X_i + \beta_2D_i + \beta_3H_i + \beta_4(D_i * H_i) + \epsilon_i$$
where H is an indicator (dummy variable) for being in a high-conflict family. The interaction term is D × H, which has the “divorce” dummy variable being interacted with the “high-conflict” dummy variable. This variable equals 1 only if the child’s parents divorced and the family is “high conflict.” The estimated effect of a divorce would be calculated as:
$$\frac{\Delta Y}{\Delta D} = \hat{\beta}_2 + \hat{\beta}_4 * H$$

For children from low-conflict families, the estimated divorce effect would be $\hat{\beta}_2$ (because H = 0). For children from high-conflict families, the estimated divorce effect would be $\hat{\beta}_2+\hat{\beta}_4$ (because H = 1).
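
The sketch below shows how the interaction variable is built and how the two group-specific effects follow from the formula. The coefficient values are hypothetical, chosen only to illustrate the arithmetic, and the function name `divorce_effect` is illustrative.

```python
def divorce_effect(beta2_hat, beta4_hat, high_conflict):
    """Estimated effect of divorce: beta2_hat + beta4_hat * H."""
    return beta2_hat + beta4_hat * high_conflict

# Hypothetical estimates (not from any real data).
beta2_hat = -2.0  # divorce effect when H = 0
beta4_hat = 1.5   # additional divorce effect when H = 1

# The interaction variable D * H equals 1 only when both dummies equal 1.
rows = [(0, 0), (1, 0), (0, 1), (1, 1)]  # (D, H) combinations
interaction = [d * h for d, h in rows]
print(interaction)  # [0, 0, 0, 1]

print(divorce_effect(beta2_hat, beta4_hat, 0))  # -2.0 (low-conflict)
print(divorce_effect(beta2_hat, beta4_hat, 1))  # -0.5 (high-conflict)
```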

The test for whether the divorce effects are different for high-conflict vs. low-conflict families would be based on the value of $\hat{\beta}_4$. If there were evidence that the effect of a divorce on children’s test scores was different for high-conflict vs. low-conflict families, then the level of conflict would be a moderating factor for this effect.

The advantages of using the interaction effect instead of separate regression equations are that:
- It is a more direct test of differences in the estimate across the two models rather than comparing two coefficient estimates.
- It produces more precise estimates for all variables, as it has a larger sample than separate models.

The disadvantages are that:
- It does not produce a direct estimate of the effect of divorce on children from high-conflict families – rather, you have to add the two estimates, and the standard error may need to be calculated manually.
- It is constraining $\hat{\beta}_1$, the estimates on other X variables, to be the same for the two groups – the low-conflict and the high-conflict samples.

### 3.2.2 Logarithmic forms of variables
In some cases, we may be interested in percent changes in X and/or Y rather than the changes in the actual values. When a variable is transformed to its natural logarithm, changes in it are interpreted as (approximate) percentage changes.

|Functional form|Model|Interpretation|Formula for $\beta_1$|
|:-|:-|:-|:-|
|Linear|<div style="width: 150pt">$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$</div>|The change in Y associated with a one-unit higher value of X|$\frac{\Delta Y}{\Delta X}$|
|Linear-log|$Y_i = \beta_0 + \beta_1ln(X_i) + \epsilon_i$|The change in Y associated with a one-percent higher value of X|$\frac{\Delta Y}{\%\Delta X}$|
|Log-linear|$ln(Y_i) = \beta_0 + \beta_1X_i + \epsilon_i$|The percentage change in Y associated with a one-unit higher value of X|$\frac{\%\Delta Y}{\Delta X}$|
|Log-log|$ln(Y_i) = \beta_0 + \beta_1ln(X_i) + \epsilon_i$|The percentage change in Y associated with a one-percent higher value of X. (This is also called the elasticity.)|$\frac{\%\Delta Y}{\%\Delta X}$|
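
The percentage readings in the table treat percentages as proportions (1% is 0.01) and are approximations that work best for small changes. The sketch below compares them with the exact changes implied by the logarithm. The coefficient values are hypothetical and the function names are illustrative.

```python
import math

def loglinear_pct_change(beta1):
    """Exact % change in Y for a one-unit higher X when ln(Y) is the outcome."""
    return 100 * (math.exp(beta1) - 1)

def linearlog_change(beta1, pct_increase_x=1.0):
    """Exact change in Y for a given % increase in X when ln(X) is the regressor."""
    return beta1 * math.log(1 + pct_increase_x / 100)

print(round(loglinear_pct_change(0.05), 2))  # 5.13, close to 100 * 0.05 = 5
print(round(loglinear_pct_change(0.50), 2))  # 64.87, far from 100 * 0.50 = 50
print(round(linearlog_change(100), 3))       # 0.995, close to 100 / 100 = 1
```

For a small coefficient such as 0.05, the approximation is very good. For a large one such as 0.50, the exact calculation is worth doing.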

### 3.2.3 Quadratic and spline models
There are many situations in which there could be non-linear relationships between the explanatory variable and the outcome. Let us assume a sample of study hours and test scores as follows:
![image-2.png](attachment:image-2.png)
Let’s say that you use a regression analysis and merely test the linear relationship between hours studied (H) and test score (Y), as follows:
$$Y_i = \beta_0 + \beta_1H_i + \epsilon_i$$

It looks like $\hat{\beta}_1$ would understate the slope for low numbers of hours studied, as seen by the slope of the trend line being lower than the general slope for the data points. And $\hat{\beta}_1$ would overstate the slope for higher values of hours studied. There are two alternative models that could provide for a better fit of the data:

First, there is a **quadratic model** that adds the square of hours studied, as follows:
$$Y_i = \beta_0 + \beta_1H_i + \beta_2H_i^2 + \epsilon_i$$
Of course, one could use a higher-order polynomial than a quadratic model. This would be more accurate for predicting the test score, but it would make it more difficult to answer the question of whether and how hours studied affects the test score.
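
With a quadratic term, the estimated effect of one more hour is no longer a single number: it is the derivative $\hat{\beta}_1 + 2\hat{\beta}_2 H$, which changes sign at $H = -\hat{\beta}_1 / (2\hat{\beta}_2)$. A minimal sketch with hypothetical coefficient values (not estimated from the data in the figure):

```python
def marginal_effect(beta1, beta2, hours):
    """Slope of beta1 * H + beta2 * H**2 at a given value of H."""
    return beta1 + 2 * beta2 * hours

def turning_point(beta1, beta2):
    """Value of H at which the marginal effect equals zero."""
    return -beta1 / (2 * beta2)

beta1_hat, beta2_hat = 8.0, -0.5  # hypothetical estimates

print(marginal_effect(beta1_hat, beta2_hat, 2))   # 6.0
print(marginal_effect(beta1_hat, beta2_hat, 10))  # -2.0
print(turning_point(beta1_hat, beta2_hat))        # 8.0
```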

Second, one could use what is called a **spline function**, in which you allow H to have a different estimated linear effect at different levels of H. We can see from Figure 3.2 that around 8 hours of studying is when the benefit of studying appears to level off and maybe decrease. Thus, one could estimate the separate marginal effects of “an hour of studying up to 8 hours” and “an hour of studying beyond 8 hours.” The model is the following:
$$Y_i = \beta_0 + \beta_1 * \min(H_i, 8) + \beta_2 * I(H_i\geq8) + \beta_3 * [(H_i - 8) * I(H_i\geq8)] + \epsilon_i$$
where I(·) is an indicator function, taking the value of 1 if the expression is true and 0 otherwise; $\min(H_i, 8)$ is the number of hours studied up to 8; and $(H_i - 8)$ is the number of hours studied beyond 8.
![image-3.png](attachment:image-3.png)
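
One way to construct the three spline regressors is sketched below. The coding is an assumption about how the terms are built: hours up to the knot as `min(hours, knot)`, a dummy for reaching the knot, and hours beyond the knot. The function name `spline_terms` is illustrative.

```python
def spline_terms(hours, knot=8):
    """Return (hours up to the knot, at-or-above-knot dummy, hours beyond the knot)."""
    above = 1 if hours >= knot else 0
    return (min(hours, knot), above, (hours - knot) * above)

for h in (5, 8, 10):
    print(h, spline_terms(h))
# 5 (5, 0, 0)
# 8 (8, 1, 0)
# 10 (8, 1, 2)
```

Each tuple would become one row of regressors, so the coefficient on the first term captures an hour of studying up to 8 hours and the coefficient on the third term captures an hour beyond 8.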

## 3.3 Weighted regression models
There are many situations in which each observation should not be counted the same. Some of these reasons include:
- Some surveys, including the NLSY, over-sample from particular sub-populations.
- The unit of observation, if aggregated, may be based on different-sized populations.

Applying weights to OLS would be considered the Weighted Least Squares method. But practically any method can apply different weights to different observations.
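
For the Simple Regression Model, weighting amounts to replacing the ordinary means and sums with weighted ones. The sketch below is a minimal illustration using the four observations from Table 2.2. The helper `weighted_ols` and the weights are hypothetical. It also checks one intuitive property: giving an observation a weight of 3 is the same as including it three times.

```python
def weighted_ols(x, y, w):
    """Return (intercept, slope) minimizing the sum of w_i * residual_i**2."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    slope = (sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
             / sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x)))
    return y_bar - slope * x_bar, slope

x = [10, 12, 14, 16]
y = [40, 45, 40, 55]

# Equal weights reproduce the unweighted OLS estimates.
print(weighted_ols(x, y, [1, 1, 1, 1]))  # (19.0, 2.0)

# A weight of 3 on person 1 matches repeating person 1 three times.
weighted = weighted_ols(x, y, [3, 1, 1, 1])
duplicated = weighted_ols([10, 10, 10] + x[1:], [40, 40, 40] + y[1:], [1] * 6)
```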

## 3.4 Calculating standardized coefficient estimates to allow comparisons
To compare the magnitudes of coefficient estimates on variables measured in different units, what is often calculated is a standardized coefficient estimate. For a general regression equation:
$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon$$
the standardized coefficient estimate of a given variable, say $X_1$, is:
$$\text{Standardized coefficient estimate} = \hat{\beta}_1 * \frac{\hat{\sigma}_{X_1}}{\hat{\sigma}_Y}$$
Simply, a standardized coefficient estimate is the coefficient estimate multiplied by the ratio of the standard deviations of the corresponding variable ($X_1$) and the outcome (Y).
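
As an illustration, the formula can be applied to the four-observation education and income example from Chapter 2, where the OLS slope was 2. This is only a sketch with one explanatory variable, where the standardized coefficient happens to equal the correlation between X and Y. With several explanatory variables that equality no longer holds.

```python
import statistics

x = [10, 12, 14, 16]  # years of schooling
y = [40, 45, 40, 55]  # income
beta1_hat = 2         # OLS slope from Section 2.4.1

std_coef = beta1_hat * statistics.stdev(x) / statistics.stdev(y)
print(round(std_coef, 3))  # 0.73
```

The reading is that a one-standard-deviation higher value of schooling is associated with income that is about 0.73 standard deviations higher.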

# 4. What does “holding other factors constant” mean?
## 4.1 Why do we want to “hold other factors constant”?
When we say we want to hold some factors constant, we want to design the model so that one explanatory variable moves without the other factor moving with it.
## 4.2 Operative-vs-“held constant” and good-vs-bad variation in a key-explanatory variable
Good variation is the variation in the key-X variable whose effect on the Y variable we want to estimate. All other variation is bad variation: variation in the key-X variable that could also contribute to the Y variable but is not the focus of our research and could confound the good variation.

Within bad variation, there is held-constant variation, i.e., variation that has been held constant so that its effect on the Y variable is removed. And there is operative variation, i.e., variation that remains in the key-X variable and has not (yet) been removed.

## 4.3 How “holding other factors constant” works when done cleanly
One of the most important concepts of regression analysis is the fact that, often, we cannot adequately hold other factors constant. Before demonstrating this, I will give an example that cleanly holds a particular factor constant. This occurs when the factor being controlled for is a dummy variable or a series of dummy variables representing a categorization.

Let’s take the issue of how the number of students in a class affects the average student-evaluation scores of professors – I’ll call it “class-size”. We have a notional sample of 100 classes, 25 for each of the four professors. Each professor has a distinct range of class sizes, with the size of the range being 40 for each professor. There is a pre-imposed effect of class-size (CS) on the average evaluation (E), which is negative for Professors A and B, but zero for Professors C and D.

|Professor|Range of class size|Pre-imposed effects of CS on E|
|:-|:-|:-|
|A|20-60|-0.03|
|B|100-140|-0.01|
|C|180-220|0|
|D|260-300|0|

If we use a simple regression model to fit all data, the results would be as follows:
![image.png](attachment:image.png)
In this chart, classes for one professor are compared to classes of other professors. This means that, as class-size changes, eventually the professor changes as well. Thus, the operative variation in the key-X variable, the class-size, is not just from the randomization of class-sizes but also from who the professor is. There is no held-constant variation. The problem is that the operative variation in class-size from the professor is bad variation because the effectiveness of the professor affects the dependent variable (E). Thus, we need to move the variation in class-size due to the professor from operative to held-constant variation.

When the professor is controlled for, the coefficient estimate on CS turns negative (−0.0098):
$$\hat{E}_i = 3.92-0.0098*CS_i+0.081*(ProfB)_i+2.29*(ProfC)_i+3.23*(ProfD)_i$$

To demonstrate how we know that the professor is held constant and what that does, in the bottom chart, I estimate separate models for each professor. The results are:
![image-2.png](attachment:image-2.png)
Professor A: $\hat{E}_i = 4.57-0.0257*CS_i$

Professor B: $\hat{E}_i = 4.92-0.0113*CS_i$

Professor C: $\hat{E}_i = 3.82+0.0023*CS_i$

Professor D: $\hat{E}_i = 4.14-0.0010*CS_i$

When categories are controlled for with dummy variables, the coefficient estimate on CS is the average of the within-category coefficient estimates on CS, weighted by the product of:
- The number of observations in each category.
- The variance of CS in each category.
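
The weighting rule above can be checked numerically. The sketch below is a minimal simulation, assuming the class-size ranges and pre-imposed effects from the table; the professor intercepts, the noise level, the random seed, and the helper `ols` are illustrative assumptions rather than the data behind the charts. It compares the CS coefficient from the dummy-variable regression with the weighted average of the four within-professor slopes.

```python
import random

def ols(X, y):
    """Least squares via the normal equations, solved by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(0)
# (min CS, max CS, hypothetical intercept, pre-imposed effect of CS on E)
profs = {"A": (20, 60, 4.6, -0.03), "B": (100, 140, 4.9, -0.01),
         "C": (180, 220, 3.8, 0.0), "D": (260, 300, 4.1, 0.0)}

data = []  # one (professor, class size, evaluation) tuple per class
for name, (lo, hi, base, effect) in profs.items():
    for _ in range(25):
        cs = random.uniform(lo, hi)
        data.append((name, cs, base + effect * cs + random.gauss(0, 0.2)))

# Regression of E on CS plus dummies for Professors B, C, and D (A is the reference)
X = [[1.0, cs, float(p == "B"), float(p == "C"), float(p == "D")] for p, cs, _ in data]
y = [e for _, _, e in data]
beta_cs = ols(X, y)[1]

# Within-professor slopes, weighted by (observations) x (variance of CS)
num = den = 0.0
for name in profs:
    xs = [cs for p, cs, _ in data if p == name]
    ys = [e for p, _, e in data if p == name]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (v - y_bar) for x, v in zip(xs, ys)) / sxx
    weight = n * (sxx / n)  # number of observations times variance of CS
    num += weight * slope
    den += weight
weighted_avg = num / den

print(beta_cs, weighted_avg)  # the two numbers agree up to floating-point error
```

Because every professor here has 25 classes and a range of 40, the weights are nearly equal; unequal group sizes or ranges would tilt the pooled coefficient toward the groups with more observations or more spread in CS.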

## 4.4 Why is it difficult to “hold a factor constant”?
When one needs to hold constant a factor that is quantitative and not categorical, then it is not as clean.

When we control for a categorical variable, we only estimate the treatment-outcome relationship within each category. In contrast, when we control for a quantitative variable (i.e., not categorized), we are merely removing the estimated linear relationship between that variable and the key-X variable (and other variables). Any non-linear relationship between the control variable and the key-X variable would mean that, as the key-X variable (the treatment) varies in the sample, so does the control variable.

Furthermore, just like there will be error in the estimated causal effect of some treatment on an outcome, there will be error in the estimated relationship between a control variable and the key-X variable and between the control variable and the outcome.

## 4.6 Proper terminology for controlling for a variable

From the lessons in this chapter, we can only say we are “holding a factor constant” if that factor is categorical and not based on a quantitative variable.

For quantitative (non-categorical) variables, the best we can do is say that we “adjust for” or we “control for” the factor.

# 5. Standard errors, hypothesis tests, p-values, and aliens
## 5.1 Standard errors
The standard error is a measure of how precise the coefficient estimate is. Often standard errors are reported in studies in parentheses next to or under a coefficient estimate.

The standard error (SE) of a statistic (usually an estimate of a parameter) is the standard deviation of its sampling distribution or an estimate of that standard deviation. If the statistic is the sample mean, it is called the standard error of the mean (SEM). The standard error is a key ingredient in producing confidence intervals. Here the standard error is the standard deviation of the sampling distribution for the *coefficient estimate*. The formula for the SE on a coefficient estimate on a variable, $X_j$, is the following:
$$SE(\hat{\beta}_j) = \frac{\hat{\sigma}}{\sqrt{\sum{(X_j-\bar{X}_j)^2}*(1-R_j^2)}}$$

From the formula for the standard error of the coefficient estimate, we can determine that the following main factors would be associated with a *lower* standard error (which is always better) for some variable, $X_j$:
- A larger sample size (n is in the denominator in the formula for $\hat{\sigma}$)
- A smaller amount of unexplained variation in Y
- A larger standard deviation in $X_j$ (which gives more variation to explain Y)
- Less variation in $X_j$ explained by the other X variables

This means that adding a control variable would contribute to a lower standard error for variable, $X_j$, by reducing the standard-error-of-the-regression, but it would contribute to a higher standard error by leaving less variation in $X_j$ not covered by other variables. Thus, adding a variable that is more highly correlated with $X_j$ is more likely to increase the standard error than would a variable that is not highly correlated with $X_j$.
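
The trade-off can be made concrete with a small sketch that plugs numbers into the standard form of the formula, $SE(\hat{\beta}_j) = \hat{\sigma}/\sqrt{SST_j(1-R_j^2)}$, where $SST_j = \sum{(X_j-\bar{X}_j)^2}$. The function name and all input values below are hypothetical and chosen only to illustrate the two opposing forces.

```python
from math import sqrt

def coef_se(sigma_hat, sst_xj, r2_j):
    """Standard error of a coefficient estimate.

    sigma_hat: standard error of the regression
    sst_xj:    total sum of squared deviations of X_j from its mean
    r2_j:      R-squared from regressing X_j on the other X variables
    """
    return sigma_hat / sqrt(sst_xj * (1.0 - r2_j))

# Baseline: no control variables that explain X_j
se_baseline = coef_se(sigma_hat=2.0, sst_xj=400.0, r2_j=0.0)

# Adding a control that explains 75% of the variation in X_j but not Y:
# sigma_hat is unchanged while the usable variation in X_j shrinks
se_correlated = coef_se(sigma_hat=2.0, sst_xj=400.0, r2_j=0.75)

# Adding a control that explains Y well but X_j only slightly:
# sigma_hat falls and little variation in X_j is lost
se_better_fit = coef_se(sigma_hat=1.5, sst_xj=400.0, r2_j=0.10)

print(se_baseline, se_correlated, se_better_fit)
```

In this illustration the highly correlated control doubles the standard error, while the control that mainly explains Y lowers it.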

![image.png](attachment:image.png)
To demonstrate the importance of sample size and variation in the explanatory variable, consider the figure above. It provides not only the estimated slopes but also the slopes characterizing the 95% confidence intervals for the coefficient estimates. What differs across the three charts in Figure 5.1 is the sample size and the range of values for the key-X variable. The figure shows that the coefficient estimate is more precise (has a smaller standard error) with a larger sample size and with larger variation in the explanatory variable.

## 5.2 How the standard error determines the likelihood of various values of the true coefficient
The standard error is important for hypothesis testing but also for informing us on the range of likely values and even the likelihood of certain values for the true coefficient. To demonstrate that, we created a notional regression result presumably with no systematic biases as follows:
$$\hat{Y} = 0.065*X$$
$$(SE = 0.030)$$
The value of 0.065 is still likely to be wrong even without systematic biases. The following figure shows the likelihood of various ranges of values for the true coefficient.
![image-2.png](attachment:image-2.png)
The figure shows that:
- There is only a 13.2% chance that the true coefficient is between 0.060 and 0.070, which is centered around the coefficient estimate of 0.065.
- The likelihood that the true coefficient is about one standard error lower than the coefficient estimate (in the 0.030 to 0.040 range) is 0.081, which is about 61% as likely as being in the 0.060–0.070 range. The same applies to the true coefficient being about one standard error higher (in the 0.090–0.100 range).
- There is a greater likelihood that the true coefficient is in one of the two ranges that are one standard error away than in the band centered on the coefficient estimate (0.081 + 0.081 = 0.162 vs. 0.132). Note that all of these numbers would be a bit different for different-sized bands.
- There is a 1.5% chance that the true coefficient is below zero.
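
These probabilities can be reproduced with a short sketch. It assumes, as the figure implies, that the likely values of the true coefficient follow a normal distribution centered on the estimate (0.065) with a standard deviation equal to the standard error (0.030); the variable names are illustrative.

```python
from statistics import NormalDist

# Likely values of the true coefficient: normal, centered on the estimate
belief = NormalDist(mu=0.065, sigma=0.030)

def prob_between(lo, hi):
    """Probability that the true coefficient lies in [lo, hi]."""
    return belief.cdf(hi) - belief.cdf(lo)

p_center = prob_between(0.060, 0.070)        # band centered on the estimate
p_one_se_low = prob_between(0.030, 0.040)    # about one SE below the estimate
p_one_se_high = prob_between(0.090, 0.100)   # about one SE above the estimate
p_below_zero = belief.cdf(0.0)               # true coefficient is negative

print(round(p_center, 3), round(p_one_se_low, 3),
      round(p_one_se_high, 3), round(p_below_zero, 3))
```

Changing the band width in `prob_between` shows how the numbers shift for different-sized bands.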

There is often a wide range of possible values for the true coefficient. What good is the eighth decimal place when there is such uncertainty on the second decimal place, and even the first decimal place?

## 5.3 Hypothesis testing in regression analysis
### 5.3.1 Setting up the problem for hypothesis tests
Let's assume a case where the standardized scores on empathy have a normal distribution with a mean of 100 and a standard deviation of 15 in the historical population of teenagers. Now we have a sample of 25 current teenagers with a mean of 104, a little higher than the historical average of 100, and the same standard deviation of 15. We want to conduct a hypothesis test to see whether there is evidence that current teenagers do indeed have a different level of empathy from the historical teenage average. (Note that I am initially testing for a *different level* and not a *higher level* of empathy.) The test is really about how certain we can be in ruling out that randomness gave us a sample mean that is different from 100, the historical mean. Assuming this was a random sample of the population of teenagers, the higher mean level of empathy in the sample of current teenagers means that either:
- They do have the same mean empathy level as the historical population of teenagers, and it was random variation that caused this sample of 25 teenagers to have a mean level of empathy that is 4 points off from the population mean of 100 or
- Current teenagers indeed have a different mean empathy level from the historical population of teenagers.

The test procedures are:
1. **Define the hypotheses**:
- Null hypothesis: $H_0$: mean empathy of current teenagers = $\mu$ = 100
- Alternative hypothesis: $H_1$: mean empathy of current teenagers = $\mu$ ≠ 100
2. **Determine the standard error of the estimate**. The standard error of the mean is equal to:
$$SE = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{25}} = 3$$
3. **Determine how certain you want to be for the test by establishing a statistical significance level**. Most people use 95% certainty, which translates into a 5% (statistical) significance level.
4. Determine whether 104 is different enough from the hypothesized value of 100 that we can rule out (with a high degree of certainty) that randomness gave us the sample mean (104) this far from the hypothesized value of 100. Following the figure below, we can see that, to fall in the rejection region (the shaded region), the sample mean has to deviate a lot from the population mean. For a 5% significance level, the sample mean must be at least 1.96 SEMs away from the population mean.
![image-3.png](attachment:image-3.png)
Namely, the thresholds for the rejection region are:
$$100\pm{1.96*3}= 94.12, 105.88$$

5. Because 104 is between 94.12 and 105.88, we cannot reject the null hypothesis at 95% certainty.

This was a two-sided test, formally testing whether current teenagers have a different mean level of empathy from the historical population. Alternatively, you could do a one-sided test, testing whether current teenagers have higher empathy than the population. In this case, the hypotheses would be:
$$H_0: \mu \leq 100$$
$$H_1: \mu > 100$$

The critical value would be: $100 + 1.645 * 3 = 104.935$. (The z value of +1.645 is the z value that leaves all 0.05 in the right tail in the standard normal distribution.) Again, with the sample mean of 104, we could not conclude that current teenagers have a higher empathy level than the historical population of teenagers.
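
The two-sided and one-sided tests above can be sketched in a few lines using the numbers from the empathy example. The variable names are illustrative, and the critical z values come from the standard normal distribution rather than being hard-coded, so they differ slightly from the rounded 1.96 and 1.645.

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma, n, sample_mean = 100.0, 15.0, 25, 104.0
std_normal = NormalDist()

se = sigma / sqrt(n)                # standard error of the mean
z = (sample_mean - mu_0) / se       # standard errors away from the hypothesized mean

# Two-sided test at the 5% level: 2.5% in each tail
z_two = std_normal.inv_cdf(0.975)
lower, upper = mu_0 - z_two * se, mu_0 + z_two * se
reject_two_sided = not (lower <= sample_mean <= upper)
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))

# One-sided test at the 5% level: all 5% in the right tail
z_one = std_normal.inv_cdf(0.95)
critical_one_sided = mu_0 + z_one * se
reject_one_sided = sample_mean > critical_one_sided

print(se, round(lower, 2), round(upper, 2), reject_two_sided)
print(round(critical_one_sided, 3), reject_one_sided)
```

Neither test rejects the null hypothesis, matching the conclusions in the text.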

There is always the possibility that the conclusion of the test could be wrong. The two types of errors are:
- Type I error (False Positive) – current teens have the same level of empathy as historical teens, but random sampling happens to produce a level of empathy that is significantly different from the historical value of 100.
- Type II error (False Negative) – current teens do have a different level of empathy, but we are not able to detect a statistically-significant difference.

### 5.3.2 The four steps for hypothesis testing for regression coefficient estimates
1. Define the hypotheses for a coefficient, with the most common being:
<br>Null Hypothesis: $H_0: \beta_i = 0$
<br>Alternative Hypothesis: $H_1: \beta_i\neq{0}$
2. Determine the standard error of the coefficient estimate.
3. Determine how certain you want to be for your test.
4. Starting with the assumption that the null hypothesis is true ($\beta_i$ = 0), test whether the coefficient estimate, $\hat{\beta}_i$, is far enough from 0, given the standard error, to rule out (with the chosen level of certainty) that randomness gave us this estimate.

### 5.3.3 t-statistics
With regression analysis, the Student’s t-distribution is used. The Student’s t-distribution (or just t-distribution) is very much like the standard normal distribution, but it is a little wider. With samples of around 100 or more, the Student’s t-distribution gets pretty close to the standard normal distribution.

The observed t-statistic or t-stat ($t_{observed}$ or $t_o$) for a coefficient estimate is simply:
$$t_o = \frac{coefficient\ {estimate}(\hat{\beta})}{standard\ {error}(SE(\hat{\beta}))}$$
In rare cases, the hypothesized value for the null hypothesis will be some number other than zero. In those cases, the numerator for the t-stat would be the coefficient estimate minus the hypothesized value.

Similar to the transformation to Z for the standard normal distribution, the t-stat transformation involves subtracting out the hypothesized value (which is normally zero) and dividing by the standard error. And so the t-stat is similar to the Z value in that it represents the number of standard errors away from zero that the coefficient estimate is. The more standard errors away from zero it is, the more certainty we can have that the true coefficient is not zero.

The number of degrees of freedom indicates the number of values that are free to estimate the parameters, which equals the number of observations minus the constraints (*parameters to be estimated*). The definition is:
$$Degrees\ of\ freedom = n-K-1 = n-k$$
where
- n = sample size
- K = number of explanatory variables
- k = K + 1 = number of parameters to be estimated (explanatory variables and the constant)

We then apply the standard errors and the t-distribution in order to conduct the hypothesis tests and calculate p-values and confidence intervals. This would be appropriate as long as the error terms were normally distributed or at least approximately normally distributed. 

### 5.3.4 Statistical significance and p-values
#### 5.3.4.1 Two-sided hypothesis tests
We start with the following hypotheses:
- $Null\; H_0:\beta_i = 0$
- $Alternative\ H_1:\beta_i\neq{0}$

With the t-stat, the statistical significance of a coefficient estimate can be determined. Note the language: *It is not a variable but rather it is the coefficient estimate that is statistically significant or insignificant*. The test for significance for a coefficient estimate involves comparing the t-stat to critical values on the Student’s t-distribution.

![image-4.png](attachment:image-4.png)

In our case, to achieve 95% confidence, one would reject the null hypothesis (that the true coefficient equals 0) if the t-stat were in the rejection region of greater than 1.9608 or less than −1.9608.

The p-value indicates the likelihood that, if the true coefficient were actually zero, random processes (i.e., randomness from sampling or from determining the outcomes) would generate a coefficient estimate as far from zero as it is. From Tables 5.1 and 5.2, note how the larger the t-stat is, the lower the p-value is.

If two or more explanatory variables are highly correlated, they may be sharing any “effect” or empirical relationship with the dependent variable. This may cause the variables to, individually, have statistically-insignificant coefficient estimates even though the variables are collectively significantly related to the dependent variable. One option in such a situation is to exclude one or more of the correlated explanatory variables to see if the one remaining in the model has a significant coefficient estimate. A second option is to test the joint significance of the coefficient estimates.

#### 5.3.4.2 One-sided hypothesis tests
A one-sided hypothesis test should be used when it is clear, theoretically, that an X variable could affect the outcome only in one direction.

For example, for a certain variable, the hypotheses would be:
- $Null\; H_0:\beta_i\leq{0}$
- $Alternative\ H_1:\beta_i>{0}$

![image-5.png](attachment:image-5.png)

Pretty much all researchers (me included) make the wrong official interpretation of two-sided tests. We would take, say, the coefficient estimate and t-stat (−3.13) on black and conclude: “being Black is associated with significantly lower income than non-Hispanic Whites (the reference group).” But the proper interpretation is “being Black is associated with significantly different income from non-Hispanic Whites.”

The reason it is okay that people make this incorrect interpretation is that, if an estimate passes a two-sided test, then it would pass the one-sided test in its direction as well. This is because the rejection region for a one-sided test would be larger in the relevant direction than for a two-sided test. Thus, the two-sided test is a stricter test.

### 5.3.5 Confidence intervals
Confidence intervals indicate the interval in which you can be fairly “confident” that the value of the true coefficient lies.

![image-6.png](attachment:image-6.png)

For example, from the above table, the 95% confidence interval for the coefficient estimate on *age* would be:
$$752.8\pm1.9608*386.2=(-4.4,\ 1510.1)$$

Note that the interval includes zero. In fact, there is a relationship between statistical significance (for two-sided hypothesis tests) and confidence intervals:
- Significant at the 5% level (p < 0.05) ↔ 95% confidence interval does not include 0
- Insignificant at the 5% level (p > 0.05) ↔ 95% confidence interval includes 0
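
The link between the t-stat and the confidence interval can be sketched with the *age* numbers quoted above (estimate 752.8, standard error 386.2, critical value 1.9608). Because these inputs are rounded, the computed endpoints may differ in the last digit from the interval reported in the table; the variable names are illustrative.

```python
estimate, se, t_crit = 752.8, 386.2, 1.9608  # values quoted in the text

t_stat = estimate / se                  # standard errors away from zero
ci_lower = estimate - t_crit * se
ci_upper = estimate + t_crit * se

significant_5pct = abs(t_stat) > t_crit
ci_excludes_zero = ci_lower > 0 or ci_upper < 0

print(round(t_stat, 3), round(ci_lower, 1), round(ci_upper, 1))
print(significant_5pct, ci_excludes_zero)  # the two flags always agree
```

The t-stat falls just short of the critical value, which is the same fact as the interval just barely including zero.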

### 5.3.6 F-tests for joint hypothesis tests and overall significance
#### 5.3.6.1 Joint hypothesis tests
The formal hypothesis test, for four variables, $X_1\ - \ X_4$, would be:
- $H_0:\beta_1=\beta_2=\beta_3=\beta_4=0$ (for corresponding explanatory variables $X_1,X_2,X_3,X_4$)
- $H_1:$ at least one of the $\beta$'s does not equal 0.

![image-7.png](attachment:image-7.png)

#### 5.3.6.2 Overall-significance test
For overall significance, the hypothesis test would be:
- $H_0:\beta_1=\beta_2=\beta_3=...=\beta_K=0$ (for a model with K explanatory variables $X_1,X_2,X_3,...,X_K$)
- $H_1:$ at least one of the $\beta$'s does not equal 0.

![image-8.png](attachment:image-8.png)

The overall-significance test is not that common a test. Most regressions that see the light of day would have some significant coefficient estimates, which is a good sign that the overall regression has significance. Furthermore, the test itself has minimal bearing on any of the four main objectives of regression analysis.

The joint hypothesis test is also rare, but it has more value. This is particularly the case when there are two variables that are highly correlated.

### 5.3.7 False positives and false negatives
Conventionally, the probability of a Type I error (a false positive), typically labeled as $\alpha$, is the significance level that one uses to test the coefficient estimate. Typically, 5% is used, so 5% would be the probability of a Type I error.

### 5.3.8 Choosing the optimal significance level for the hypothesis test
- First, a very large sample would mean that all coefficient estimates would have a good chance of being statistically significant, so the significance level should be lowered. 
- Second, the cost of being wrong one way or the other should be a factor. If there is a high cost from a false positive, then the significance level should be lowered. 
- Third, the likelihood that there would be an empirical relationship in the first place should be a factor.

### 5.3.9 Statistical vs. practical significance
Just because a coefficient estimate is statistically significant does not mean that the estimated effect is meaningful. That is, it may not be practically significant.

## 5.4 Problems with standard errors (multicollinearity, heteroskedasticity, and clustering) and how to fix them
### 5.4.1 Multicollinearity
Multicollinearity is a situation in which having explanatory variables that are highly correlated with each other causes the coefficient estimates to have inflated standard errors. Perfect multicollinearity occurs when one X variable is an exact linear transformation of another X variable or set of X variables. Controlling for a variable that is highly correlated with the key-X variable would significantly reduce the operative variation in the key-X variable, which would cause higher standard errors.

The inflated standard error from multicollinearity for the key-X variable does not necessarily mean that a potential control variable that is correlated with the key-X variable should be excluded, as doing so could cause omitted-factors bias. A researcher should assess the benefits of including a control variable (reducing omitted-factors bias) and the costs (inflated standard errors).

The criterion I like to use is, if the control variable were included, whether there would be any independent variation in the key-X variable. That is, can the key-X variable move on its own much without the correlated control variable also moving? If so, then I consider it okay to include the control variable. One situation in which you do not have to be concerned with multicollinearity is if it occurs for two or more control variables (and not the key-explanatory variable).

### 5.4.2 Heteroskedasticity
One of the assumptions for the Multiple Regression model was A4: the model has homoskedasticity in that the error terms have the same variance, regardless of the values of X or the predicted value of Y. 

Homoskedasticity - there is a relatively consistent distribution of values of income across the whole range of the AFQT percentile scores as follows:
![image-9.png](attachment:image-9.png)

Heteroskedasticity - the distribution of income is much narrower for low percentiles for the AFQT score than for higher percentiles as follows:
![image-10.png](attachment:image-10.png)

Heteroskedasticity causes biased standard errors in regression models. This occurs because, at the values of the X variable where there is greater variation in the Y variable, we have less certainty on where the central tendency is, but we assume that we have the same certainty that we do for the other values for the X variable. With the variation in the Y variable greater at some values of the X variable, the weight should be less for those observations to calculate the standard errors due to the greater uncertainty in the value of Y. Thus, the estimated standard errors would be biased estimators of the true population standard deviation of the coefficient. Given that the estimated standard errors are biased, any hypothesis tests would be affected. (Heteroskedasticity does not bias the coefficient estimates.)

The correction is simply to use robust standard errors, also known as Huber-White estimators. This allows the standard errors to vary by the values of the X variables. In most cases, the direction of the bias on the standard errors is downwards, so robust standard errors (those corrected for heteroskedasticity) are usually larger. Corrected standard errors will be smaller in the atypical cases in which there is wider variation in the error terms for the central values of an explanatory variable than for the extreme values. 
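
To illustrate the direction of the correction, the sketch below simulates a simple regression in which the error spread grows with X and compares the conventional standard error of the slope with the basic Huber-White (HC0) version, $\sqrt{\sum{(X_i-\bar{X})^2\hat{\epsilon}_i^2}}/\sum{(X_i-\bar{X})^2}$. The data-generating values, sample size, and seed are assumptions for illustration, and statistical packages typically apply small-sample refinements of this formula.

```python
import random
from math import sqrt

random.seed(1)
n = 1000
x = [random.uniform(0, 100) for _ in range(n)]
# Error standard deviation rises steeply with x (heteroskedasticity)
y = [10 + 0.5 * xi + random.gauss(0, 0.002 * xi ** 2) for xi in x]

x_bar, y_bar = sum(x) / n, sum(y) / n
dx = [xi - x_bar for xi in x]
sxx = sum(d * d for d in dx)

# OLS slope, intercept, and residuals for the simple regression
slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
intercept = y_bar - slope * x_bar
resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

# Conventional SE: assumes every observation has the same error variance
se_conventional = sqrt(sum(e * e for e in resid) / (n - 2) / sxx)

# HC0 robust SE: each squared residual is weighted by its own (x - x_bar)^2
se_robust = sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx

print(round(slope, 3), round(se_conventional, 4), round(se_robust, 4))
```

Here the largest errors occur at extreme values of X, so the robust standard error comes out larger than the conventional one, which is the typical case described above.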

An insignificant test for heteroskedasticity does not prove that there is no heteroskedasticity. The best approach may be to do a visual inspection of the data. Yet, given the simplicity of the correction for heteroskedasticity in most statistical programs, it is worth making the correction when there is even a small possibility of heteroskedasticity.

### 5.4.3 Correlated observations or clustering
Imagine that, in a school of 10 classes of 30 students each, 5 of the classes are randomly selected to have a new-math program that is given to all students in those 5 classes. So 150 students get the new math program, and 150 students receive the old program. Are the 150 students randomized? Not exactly. Really, only 5 of 10 entities were randomized – the classes. Randomizing 5 of 10 classes is not as powerful as randomizing 150 of 300 students would be. Thus, if we observe one child in a class with a positive residual, then we would expect others in that class to be more likely to have a positive residual than a negative residual. The error terms for observations would tend to be positively correlated for people from a particular class. This means that the “effective sample size” would be much less than 300.

In this situation, we would say that there is “clustering” at the class level. Because of this violation, the standard errors would need to be corrected. The reason for this is that the “effective sample size” is lower because, with a sample size of N, there are not N independent observations.
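
A rough feel for how much smaller the “effective sample size” can be comes from the standard design-effect approximation for equal-sized clusters, $1 + (m-1)\rho$, where $m$ is the cluster size and $\rho$ is the within-cluster correlation of the error terms. This is only an illustrative sketch and not the correction that statistical packages apply; the function name and the values of $\rho$ are hypothetical, while the 10 classes of 30 students come from the example.

```python
def effective_sample_size(n_clusters, cluster_size, icc):
    """Approximate number of independent observations under clustering.

    icc: within-cluster correlation of the error terms (0 = independent,
         1 = everyone in a cluster has the same error)
    """
    design_effect = 1 + (cluster_size - 1) * icc
    return n_clusters * cluster_size / design_effect

n_eff_independent = effective_sample_size(10, 30, icc=0.0)  # no clustering
n_eff_moderate = effective_sample_size(10, 30, icc=0.2)     # hypothetical correlation
n_eff_perfect = effective_sample_size(10, 30, icc=1.0)      # only the classes count

print(n_eff_independent, round(n_eff_moderate, 1), n_eff_perfect)
```

With no within-class correlation all 300 students count, with perfect correlation only the 10 classes count, and even a moderate correlation moves the effective sample size much closer to the number of classes than to the number of students.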

The correction is typically pretty simple in most statistical packages. It basically involves an indication that observations should be allowed to be correlated based on the value(s) of a particular variable or set of variables that you stipulate.

## 5.5 The Bayesian critique of p-values (and statistical significance)
### 5.5.1 The problem
Whereas most perceive the p-value to mean that the probability that the statistical relationship is real is one minus the p-value, the actual probability that the statistical relationship is legitimate requires extra information. The probability that a research finding is true depends on three important pieces of information:
- The prior probability that there is an effect or relationship.
- The statistical power of the study.
- The t-statistic for the coefficient estimate.

As an example, it is shown that the number of films Nicolas Cage appeared in for a given year was highly correlated with the number of people who drowned in a swimming pool in the United States with a p-value of 0.025. The problem is that we could do the same for the top 1000 actors/actresses, and we would get some statistically-significant relationship (at the 5% level of significance) for about 5% of them – this is a Type I error. Nicolas Cage just happens to be one of those 50 or so out of 1000 for the 1999–2009 period.
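
The “about 5% of them” claim can be checked with a small simulation. The sketch below is an illustration under stated assumptions: it draws 11 yearly values (1999–2009) for an outcome and for each of 1000 hypothetical actors as independent standard-normal noise, so every relationship is spurious by construction. It then tests each correlation with a t-stat on 9 degrees of freedom, for which the two-sided 5% critical value is 2.262.

```python
import random
from math import sqrt

random.seed(42)
years, n_actors = 11, 1000
t_crit = 2.262  # two-sided 5% critical value, t-distribution with 9 df

def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((u - a_bar) * (v - b_bar) for u, v in zip(a, b))
    saa = sum((u - a_bar) ** 2 for u in a)
    sbb = sum((v - b_bar) ** 2 for v in b)
    return sab / sqrt(saa * sbb)

# Outcome series, unrelated to any actor by construction
drownings = [random.gauss(0, 1) for _ in range(years)]

false_positives = 0
for _ in range(n_actors):
    films = [random.gauss(0, 1) for _ in range(years)]
    r = corr(films, drownings)
    t_stat = r * sqrt(years - 2) / sqrt(1 - r * r)
    if abs(t_stat) > t_crit:
        false_positives += 1

share_significant = false_positives / n_actors
print(false_positives, share_significant)
```

The share of “significant” actors should land near 0.05, even though none of the simulated series has any real connection to the outcome.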

To account for the likelihood that 5% of actors/actresses would have a significant coefficient estimate by chance, the prior probability that there is a relationship needs to be accounted for. This is Bayes’ critique of p-values.
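
One simplified way to see the role of the prior is to apply Bayes’ rule to the event “the estimate is significant at level $\alpha$.” This sketch conditions only on crossing the significance threshold rather than on the exact t-statistic, so it is a coarser calculation than the full approach described above; the function name and the prior and power values are hypothetical.

```python
def prob_finding_is_true(prior, power, alpha):
    """P(real relationship | significant result), by Bayes' rule.

    prior: probability of a real relationship before seeing the data
    power: probability of a significant result if the relationship is real
    alpha: probability of a significant result if there is no relationship
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# An implausible relationship (e.g., an actor's films and drownings)
p_implausible = prob_finding_is_true(prior=0.001, power=0.8, alpha=0.05)

# A relationship that theory already gives an even chance of being real
p_plausible = prob_finding_is_true(prior=0.5, power=0.8, alpha=0.05)

print(round(p_implausible, 3), round(p_plausible, 3))
```

With the same significance level and power, the significant result is very likely a false positive when the prior is tiny and very likely real when the prior is an even chance.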

A related problem that also casts doubt on the p-value is that those null hypotheses are almost always false – almost everything is related statistically by a non-zero amount. Many of these relationships are so small that they are meaningless, but with a large enough sample, p-values would indicate significance and null hypotheses would be rejected. What this suggests is that, with larger samples, the p-value thresholds for determining significance should be lower.

### 5.5.2 What is the best approach given these issues?
Some researchers do an “informal Bayesian approach,” which involves:
- Using p-values of 0.01 as the benchmark, and being skeptical of results with p-values above 0.01 unless they are backed by strong prior probabilities of there being a relationship.
- Lowering those p-value thresholds even more when sample sizes are very large.
- Focusing on practical (economic) significance as well as statistical significance.
- Being cautious with interpretations. When significance is not strong (e.g., a p-value greater than 0.01), then perhaps the most that can be said is that the data “support the theory.”

## 5.8 What does an insignificant estimate tell you?
I list four general possible explanations for an insignificant estimate:
- There is actually no effect of the explanatory variable on the outcome in the population.
- There is an effect in one direction, but the model is unable to detect the effect due to a modeling problem.
- There is a small effect that cannot be detected with the available data due to inadequate power.
- There are varying effects in the population (or sample); some people’s outcomes may be affected positively by the treatment, others’ outcomes may be affected negatively, and others’ outcomes may not be affected; and the estimated effect (which is the average effect) is insignificantly different from zero due to the positive and negative effects canceling each other out or being drowned out by those with zero effects.

The lack of evidence for a side effect does not mean that there is no effect.

# 6. What could go wrong when estimating causal effects?
## 6.1 Setting up the problem for estimating a causal effect
What researchers would typically be interested in is the Average Treatment Effect (ATE) on an outcome, or the average effect of the treatment on a given outcome in the relevant population. The problem is that most treatments are not randomly assigned, but rather are determined by some factor that could also affect the outcome. 

Whether it is problematic to only observe a subject in one scenario (being in either the treatment or control group) and not the counterfactual ultimately depends on the reasons why subjects receive the treatment. Or, for a quantitative variable on the treatment, we need to know why some receive greater exposure to the treatment.

## 6.2 Good variation vs. bad variation in the key-explanatory variable
If a key-X variable affected the dependent variable, then any factor that caused variation in the key-X variable would be correlated with the dependent variable through the key-X variable. If that was the only reason why the factor creating the variation in the key-X variable was correlated with the dependent variable, then it would be good variation.

In the presence of bad variation, the model needs to be designed to convert the bad variation from operative to held-constant variation, which would leave only good variation among the operative variation. 

Let’s take an example of how the state unemployment rate affects crime rates. Variation in the state unemployment rate coming from the state and year would be bad variation because certain states would have characteristics (from weather, wealth, and other factors) that could cause a state to have higher (or lower) crime rates than other states. Because state and year create bad variation in the unemployment rate, they should be controlled for with sets of dummy variables (or equivalently with fixed effects). That would shift the bad variation in the state unemployment rate from the state and year from operative variation to held-constant variation.

The reality is that there could be part of variation from the state and year that is good variation in that it has no connection with the outcome. It is unfortunate to lose good-operative variation by controlling for the sources (the state and year), but that is better than keeping bad-operative variation.

## 6.3 The 7 common PITFALLS
Basically, the sources of variation of the key-X variable could be related to the dependent variable, meaning it has bad variation.

|PITFALL|What to check for|Direction of bias|
|:-|:-|:-|
|1. Reverse causality|Does the outcome affect an X variable?|The direction of how the outcome affects the treatment variable|
|2. Omitted-factors bias|Does some omitted factor affect both the key-X variable and the outcomes? Might there be an incidental correlation between factors of the key-X variable and the outcome? Could those receiving little-or-no treatment engage in a replacement action that affects the outcome?|The sign of the product of the correlations between the omitted-factor to the treatment variable and to the outcome|
|3. Self-selection bias|Did the subject choose or get assigned to the key-X variable by some means that is related to the personal benefits or costs of the X variable?|Positive if higher values of the outcome are good; negative if higher values are bad|
|4. Measurement error|Is there non-trivial error in the coding of the explanatory variable? Is the X variable imperfect in representing the intended concept?|Towards zero if the measurement error is random; uncertain otherwise|
|5. Using mediating factors or outcomes as control variables|Is there a control variable included in the model that is a product of the key-X variable (or determined after the key-X variable)?|The opposite of the sign of the mechanism of how the treatment affects the outcome|
|6. Improper reference groups|Does the reference group represent the correct counterfactual? Is there a replacement action that impacts the outcome? Does the reference group have a lower-intensity effect of the treatment?|Positive if the reference group in the analysis has lower values of the outcome than what the proper reference group has|
|7. Over-weighted groups|Could the variance of the key-X variable and the effect of the key-X variable on the outcome vary across groups that are controlled for?|Positive if the groups being over-weighted have more positive treatment effects than the other groups; negative otherwise|
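
As one illustration of the “Direction of bias” column, the sketch below simulates PITFALL 4 with purely random measurement error in the X variable. All values (true coefficient of 2, unit variances, sample size, seed) are assumptions for illustration. With equal variance in the true X and in the measurement error, the estimated slope should shrink toward zero by roughly half.

```python
import random

random.seed(7)
n = 5000
true_beta = 2.0

def slope(x, y):
    """Slope from a simple regression of y on x."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx

x_true = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + true_beta * xi + random.gauss(0, 1) for xi in x_true]

# The researcher only observes X with random (non-systematic) error
x_noisy = [xi + random.gauss(0, 1) for xi in x_true]

slope_clean = slope(x_true, y)    # close to the true coefficient
slope_noisy = slope(x_noisy, y)   # biased toward zero

print(round(slope_clean, 2), round(slope_noisy, 2))
```

The attenuation factor here is the share of the observed variation in X that is real, Var(X) / (Var(X) + Var(error)), which is 0.5 under these assumed variances.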


$$\hat{Y} =\hat{\beta}_0 + \hat{\beta}_1*Poly + \hat{\beta}_2*2nd\_Wave + \hat{\beta}_3*3rd\_Wave + \hat{\beta}_4*Poly*2nd\_Wave + \hat{\beta}_5*Poly*3rd\_Wave $$

$$\frac{\partial \hat{Y}}{\partial Poly} = \hat{\beta}_1 + \hat{\beta}_4*2nd\_Wave + \hat{\beta}_5*3rd\_Wave$$
$$\hat{E}_2= \hat{\beta}_1 + \hat{\beta}_4\ (2nd\_wave = 1,\ 3rd\_wave = 0)$$


In the 1st wave: $\hat{E}_1 = \hat{\beta}_1$
<br>In the 2nd wave: $\hat{E}_2 = \hat{\beta}_1+ \hat{\beta}_4$
<br>In the 3rd wave: $\hat{E}_3 = \hat{\beta}_1+ \hat{\beta}_5$

Main effect: $\hat{E}_m = \frac{\hat{E}_1+\hat{E}_2+\hat{E}_3}{3}=\frac{3*\hat{\beta}_1+\hat{\beta}_4+\hat{\beta}_5}{3}$