## 1. Binary Responses (Classification problem)

Binary dependent variables are frequently studied in applied econometrics. Because a dummy variable y can only take the values 0 and 1, its conditional expected value equals the conditional probability that y=1. This follows directly from the Bernoulli distribution: $E(y|x) = 1\cdot P(y=1|x) + 0\cdot P(y=0|x) = P(y=1|x)$.

So when we study the conditional mean, it makes sense to think about it as the probability of outcome y=1. Likewise, the predicted value $\hat{y}$ should be thought of as a predicted probability.
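The equality between the mean of a dummy variable and the probability of a success can be checked with a small simulation. This is only a sketch: the success probability `p_true` and the sample size are assumed values chosen for illustration.

```python
import random

random.seed(0)
p_true = 0.3  # assumed success probability, chosen only for illustration

# Draw a binary sample: y = 1 with probability p_true, else 0.
y = [1 if random.random() < p_true else 0 for _ in range(100_000)]

sample_mean = sum(y) / len(y)        # estimate of E[y]
share_of_ones = y.count(1) / len(y)  # estimate of P(y = 1)

print(sample_mean, share_of_ones)
```

The two numbers coincide by construction, and both should be close to `p_true`.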

### 1.1 Linear Probability Models

If a dummy variable is used as the dependent variable y, and we still use the OLS method to obtain the estimators, the model is called a linear probability model, i.e.

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u$$

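This model yields an unbiased and consistent estimator as long as $E(u|x)=0$. However, the estimator is no longer efficient, because the error term is necessarily heteroskedastic: $\mathrm{Var}(y|x) = P(y=1|x)[1-P(y=1|x)]$ depends on x. With OLS, we should therefore use heteroskedasticity-consistent (HC) standard errors.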

The interpretation of the coefficients is straightforward: $\beta_j$ is a measure of the average change in probability of a "success" (y=1) as $x_j$ increases by one unit and the other determinants remain constant. 
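A linear probability model with heteroskedasticity-consistent standard errors can be sketched with plain NumPy. The data-generating process below (a true success probability of `0.2 + 0.1 * x`) is an assumption made for the simulation, and the HC0 (White) form of the sandwich covariance is one choice among several HC variants.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Simulated data (an assumption for illustration): P(y = 1 | x) = 0.2 + 0.1 * x.
x = rng.uniform(0, 5, size=n)
p = 0.2 + 0.1 * x
y = (rng.uniform(size=n) < p).astype(float)

# OLS via least squares on the design matrix [1, x].
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# White (HC0) sandwich covariance: (X'X)^-1 X' diag(u^2) X (X'X)^-1.
XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc0 = XtX_inv @ meat @ XtX_inv
se_hc0 = np.sqrt(np.diag(cov_hc0))

print("coefficients:", beta_hat)
print("HC0 standard errors:", se_hc0)
```

In statsmodels, the same kind of standard errors can be requested by passing `cov_type="HC0"` to the `fit` method of an OLS model.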

### 1.2 Logit and Probit Models

Logit and probit models are designed for binary responses: they restrict the probability P(y=1|x) (equivalently, E[y|x]) to lie between 0 and 1. An important class of models specifies the success probability as

$$P(y=1|x) = G(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)$$

where the "inverse link function" G(z) always returns values between 0 and 1. In the statistics literature, this type of model is often called a generalized linear model (GLM) because the linear part $x\beta$ appears inside the nonlinear transformation G.

Within the GLM family, the most popular specifications for binary response models are

- the **probit** model with $G(z)=\Phi(z)$, the standard normal CDF and
- the **logit** model with $G(z)=\Lambda(z)=\frac{e^z}{1+e^z}$, the CDF of the logistic distribution.
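The two inverse link functions are easy to write down directly. The sketch below uses only the standard library; the function names `probit_link` and `logit_link` are illustrative and do not come from any package.

```python
import math

def probit_link(z: float) -> float:
    """Standard normal CDF, Phi(z), written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_link(z: float) -> float:
    """Logistic CDF, Lambda(z) = e^z / (1 + e^z), in a numerically stable form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Both functions map any real index into the unit interval.
for z in (-3.0, 0.0, 3.0):
    print(z, probit_link(z), logit_link(z))
```

Both functions are increasing in z and equal 0.5 at z=0; the logistic CDF has heavier tails than the normal CDF.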

These two models are commonly estimated by **maximum likelihood** instead of OLS, because (1) maximum likelihood is more natural when the model is stated in terms of probabilities, and (2) the closed-form OLS formulas do not carry over to nonlinear relationships.

### Maximum Likelihood Estimation

Parallel to OLS, maximum likelihood is another method to find the estimators, such as $\hat{\beta_0}$ and $\hat{\beta_1}$.

Instead of minimizing the sample MSE, maximum likelihood chooses the $\hat{\beta}$ that maximizes the joint probability of the observed $\{y_i\}$ given the independent variables $X$. If we assume the $y_i$ are independent conditional on $X$, the joint probability is the product of the individual probabilities, which gives the following optimization problem.

$$\max_{\beta_0\dots\beta_k} \prod_{i} p(y_i|x_i)$$

This problem is equivalent to the following one, because a monotonic transformation of the objective function does not change the maximizer:

$$\max_{\beta_0\dots\beta_k} \ln\left(\prod_{i} p(y_i|x_i)\right)$$

which then gives us the familiar log-likelihood form (a sum is easier to work with than a product):

$$\max_{\beta_0\dots\beta_k} \sum_i \ln p(y_i|x_i)$$
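To make the optimization problem concrete, here is a minimal sketch that maximizes the logit log-likelihood with Newton-Raphson steps on simulated data. The helper names, the true coefficients `beta_true`, and the fixed number of iterations are assumptions for illustration; this is not how statsmodels implements its optimizer.

```python
import numpy as np

def logit_loglik(beta, X, y):
    """Sum of ln p(y_i | x_i) for the logit model."""
    z = X @ beta
    # ln Lambda(z) = -log(1 + e^-z) and ln(1 - Lambda(z)) = -log(1 + e^z)
    return np.sum(-y * np.logaddexp(0, -z) - (1 - y) * np.logaddexp(0, z))

def fit_logit_newton(X, y, n_iter=25):
    """Maximize the log-likelihood with Newton-Raphson steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)                        # score vector
        hess = -(X * (p * (1 - p))[:, None]).T @ X  # Hessian (negative definite)
        beta = beta - np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])  # assumed values for the simulation
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(X @ beta_true)))).astype(float)

beta_hat = fit_logit_newton(X, y)
print(beta_hat, logit_loglik(beta_hat, X, y))
```

Because the logit log-likelihood is concave, the Newton iterations converge quickly here, and the maximized log-likelihood is at least as large as the value at any other coefficient vector.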

In [2]:
import wooldridge as woo
import statsmodels.formula.api as smf

df = woo.data("mroz")
df.head()

Unnamed: 0,inlf,hours,kidslt6,kidsge6,age,educ,wage,repwage,hushrs,husage,...,faminc,mtr,motheduc,fatheduc,unem,city,exper,nwifeinc,lwage,expersq
0,1,1610,1,0,32,12,3.354,2.65,2708,34,...,16310.0,0.7215,12,7,5.0,0,14,10.91006,1.210154,196
1,1,1656,0,2,30,12,1.3889,2.65,2310,30,...,21800.0,0.6615,7,7,11.0,1,5,19.499981,0.328512,25
2,1,1980,1,3,35,12,4.5455,4.04,3072,40,...,21040.0,0.6915,12,7,5.0,0,15,12.03991,1.514138,225
3,1,456,0,3,34,12,1.0965,3.25,1920,53,...,7300.0,0.7815,7,7,5.0,0,6,6.799996,0.092123,36
4,1,1568,1,2,31,14,4.5918,3.6,2000,32,...,27300.0,0.6215,12,14,9.5,1,7,20.100058,1.524272,49


In [6]:
formula = "inlf~nwifeinc+educ+exper+expersq + age+kidslt6+kidsge6"
reg_logit = smf.logit(formula, data=df)
res_logit = reg_logit.fit()
print(res_logit.summary())

Optimization terminated successfully.
         Current function value: 0.533553
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                   inlf   No. Observations:                  753
Model:                          Logit   Df Residuals:                      745
Method:                           MLE   Df Model:                            7
Date:                Mon, 29 Nov 2021   Pseudo R-squ.:                  0.2197
Time:                        15:35:37   Log-Likelihood:                -401.77
converged:                       True   LL-Null:                       -514.87
Covariance Type:            nonrobust   LLR p-value:                 3.159e-45
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4255      0.860      0.494      0.621      -1.261       2.112
nwifeinc      -0.0213      0.

> LL-Null is the log-likelihood of a model with an intercept only. The pseudo R-squared is a goodness-of-fit measure that resembles the R-squared from OLS.

$$\text{pseudo } R^2 = 1 - \frac{\ln L}{\ln L_0}$$

where $\ln L$ is the log-likelihood of the fitted model and $\ln L_0$ is LL-Null.
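As a quick arithmetic check, the formula can be applied to the log-likelihood values printed in the logit and probit summaries in this section. The function name `pseudo_r2` is an illustrative choice.

```python
def pseudo_r2(llf: float, llnull: float) -> float:
    """Pseudo R-squared: 1 - lnL / lnL0."""
    return 1.0 - llf / llnull

# Log-Likelihood and LL-Null as printed in the summaries.
logit_r2 = pseudo_r2(-401.77, -514.87)
probit_r2 = pseudo_r2(-401.30, -514.87)

print(round(logit_r2, 4), round(probit_r2, 4))
```

The results agree with the "Pseudo R-squ." entries of the two summaries up to rounding.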

In [7]:
reg_probit = smf.probit(formula,data=df)
res_probit = reg_probit.fit()
print(res_probit.summary())

Optimization terminated successfully.
         Current function value: 0.532938
         Iterations 5
                          Probit Regression Results                           
Dep. Variable:                   inlf   No. Observations:                  753
Model:                         Probit   Df Residuals:                      745
Method:                           MLE   Df Model:                            7
Date:                Mon, 29 Nov 2021   Pseudo R-squ.:                  0.2206
Time:                        15:36:01   Log-Likelihood:                -401.30
converged:                       True   LL-Null:                       -514.87
Covariance Type:            nonrobust   LLR p-value:                 2.009e-45
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.2701      0.509      0.531      0.595      -0.727       1.267
nwifeinc      -0.0120      0.

Due to the nonlinear relationship, we can no longer interpret the coefficients as a partial effect. Instead, the interpretation depends on the link function we used.

![interpret](./images/logit_and_probit_interpretation.png)

Writing $g(z) = dG(z)/dz$, the partial effect of a continuous $x_j$ is $\partial P(y=1|x)/\partial x_j = \beta_j\, g(x\beta)$. There are four common ways to interpret results from logit or probit models.

1. Sign: for both logit and probit, the partial effect always has the same sign as $\beta_j$, because $g(z)>0$.
2. Relative effects: Equation (17.7) also shows that the *relative effects* of any two continuous explanatory variables do not depend on x: the ratio of the partial effects for $x_j$ and $x_h$ is $\beta_j/\beta_h$.
3. Partial effects at the mean: $\hat{\beta}_j \cdot g(\bar{x}\hat{\beta})$
4. Average partial effects: $\hat{\beta}_j \cdot \frac{1}{n}\sum_i g(x_i\hat{\beta})$
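The last two items in the list can be computed in a few lines. The sketch below uses the logit model, where $g(z)=\Lambda(z)[1-\Lambda(z)]$; the design matrix `X` and the coefficient vector `beta_hat` are made-up values for illustration and are not the mroz estimates.

```python
import numpy as np

def logistic_pdf(z):
    """g(z) = dLambda/dz = Lambda(z) * (1 - Lambda(z))."""
    lam = 1.0 / (1.0 + np.exp(-z))
    return lam * (1.0 - lam)

# Hypothetical inputs: a small design matrix (first column is the constant)
# and made-up logit coefficients. Neither comes from the mroz estimates.
X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.5, 0.0],
              [1.0, -1.0, 1.0],
              [1.0, 0.0, 3.0]])
beta_hat = np.array([0.2, 0.8, -0.4])

# Partial effects at the mean: beta_j * g(xbar @ beta).
pea = beta_hat * logistic_pdf(X.mean(axis=0) @ beta_hat)

# Average partial effects: beta_j * mean over i of g(x_i @ beta).
ape = beta_hat * logistic_pdf(X @ beta_hat).mean()

print("PEA:", pea[1:])  # skip the constant
print("APE:", ape[1:])
```

Both sets of effects keep the signs of the coefficients, and the ratio of any two effects equals the ratio of the corresponding coefficients. For fitted statsmodels results such as `res_logit`, `get_margeff(at="mean")` and `get_margeff(at="overall")` should report the same two kinds of quantities.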

## 2. Count Data: The Poisson Regression Model

Maximum likelihood opens up a new channel for estimation: optimization based on the probability P(y|x). This invites us to try different kinds of probability distributions.

As an example, if the dependent variable takes nonnegative integer values (for example, the number of classes a student attends or the death toll due to COVID), we can consider the Poisson distribution.

The model:

$$P(y=h|x) = \frac{e^{-\mu}\mu^h}{h!}, \quad h = 0, 1, 2, \dots$$

where x and $\beta$ enter through the conditional mean $\mu$:

$$\mu = E[y|x] = \exp(x\beta)$$

> recall that in $E[y|x]=G(x\beta)$, G(.) is called the inverse link function.

The log-likelihood optimization problem then becomes:

$$\max_{\beta} \sum_i \ln P(y=y_i|x_i)$$
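The objective can be written out in a few lines of plain Python, using the standard Poisson probability mass function $e^{-\mu}\mu^h/h!$ with $\mu_i=\exp(x_i\beta)$. The toy data and the coefficient values passed in at the end are hypothetical.

```python
import math

def poisson_logpmf(h: int, mu: float) -> float:
    """ln P(y = h) = -mu + h * ln(mu) - ln(h!)."""
    return -mu + h * math.log(mu) - math.lgamma(h + 1)

def poisson_loglik(beta, X, y):
    """Sum of ln P(y = y_i | x_i) with mu_i = exp(x_i . beta)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        mu_i = math.exp(sum(b * x for b, x in zip(beta, x_i)))
        total += poisson_logpmf(y_i, mu_i)
    return total

# Hypothetical toy data: a constant plus one regressor.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
y = [0, 2, 3]
loglik = poisson_loglik((0.0, 0.5), X, y)
print(loglik)
```

A numerical optimizer would search over `beta` for the value that makes this sum as large as possible.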

Again, the coefficients can no longer be read directly as partial effects, because the relationship is nonlinear. Indeed, for Poisson regression

$$\frac{\partial E[y|x]}{\partial x_j} = \beta_j e^{x\beta} = \beta_j E(y|x)$$

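In addition to the four interpretations mentioned above, $\beta_j$ is the semi-elasticity of E(y|x) with respect to $x_j$: all else equal, if $x_j$ increases by one unit, E(y|x) changes by approximately $100\cdot\beta_j$ percent. The exact change is $100\cdot[\exp(\beta_j)-1]$ percent; the approximation is accurate when $\beta_j$ is small.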
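The gap between the approximate and the exact percentage change can be seen with a short calculation. The sketch uses the `pcnv` coefficient printed in the Poisson summary in this section; the step sizes `dx` are arbitrary illustrative choices.

```python
import math

def approx_pct_change(beta_j: float, dx: float = 1.0) -> float:
    """Semi-elasticity approximation: 100 * beta_j * dx."""
    return 100.0 * beta_j * dx

def exact_pct_change(beta_j: float, dx: float = 1.0) -> float:
    """Exact change implied by E[y|x] = exp(x beta): 100 * (exp(beta_j * dx) - 1)."""
    return 100.0 * (math.exp(beta_j * dx) - 1.0)

beta_pcnv = -0.4016  # coefficient on pcnv as printed in the Poisson summary
for dx in (0.1, 1.0):
    print(dx, approx_pct_change(beta_pcnv, dx), exact_pct_change(beta_pcnv, dx))
```

For a small step the two numbers are close; for a full one-unit change the approximation noticeably overstates the decline.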

### Quasi-Maximum Likelihood Estimator

The Poisson model is quite restrictive. The Poisson distribution implies that the variance of y equals its expectation. If this restriction fails in the data, y does not follow a Poisson distribution.

However, it can be shown that as long as E[y|x] is correctly specified, the Poisson maximum likelihood estimator of $\beta$ is still consistent. In that case we call it the **quasi-maximum likelihood estimator** (QMLE) to remind us of the possible misspecification, and the usual standard errors should be replaced by robust ones.
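The consistency claim can be illustrated by simulation: the counts below are deliberately overdispersed, so they are not Poisson, yet the conditional mean is still $\exp(x\beta)$. Everything in this sketch is an assumption for illustration, including the helper name `fit_poisson_qmle`, the true coefficients, the gamma shock, and the sandwich form of the robust standard errors.

```python
import numpy as np

def fit_poisson_qmle(X, y, n_iter=50):
    """Poisson QMLE by Newton-Raphson; returns coefficients and robust s.e.

    Assumes the first column of X is a constant.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # start at the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)
        info = (X * mu[:, None]).T @ X  # minus the Hessian
        beta = beta + np.linalg.solve(info, score)
    mu = np.exp(X @ beta)
    A_inv = np.linalg.inv((X * mu[:, None]).T @ X)
    B = (X * ((y - mu) ** 2)[:, None]).T @ X
    robust_se = np.sqrt(np.diag(A_inv @ B @ A_inv))  # sandwich form
    return beta, robust_se

rng = np.random.default_rng(2)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])        # assumed values for the simulation
mu_true = np.exp(X @ beta_true)

# Overdispersed counts: Poisson with a gamma-distributed multiplicative shock
# (mean 1), so E[y|x] = mu_true but Var(y|x) > E[y|x].
shock = rng.gamma(shape=2.0, scale=0.5, size=n)
y = rng.poisson(mu_true * shock).astype(float)

beta_hat, robust_se = fit_poisson_qmle(X, y)
print("estimates:", beta_hat)
print("robust s.e.:", robust_se)
```

The estimates should land close to `beta_true` even though the variance assumption of the Poisson distribution is violated.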

Estimating Poisson regression models in **statsmodels** is straightforward.

In [8]:
df_crime = woo.data("crime1")
df_crime.head()

Unnamed: 0,narr86,nfarr86,nparr86,pcnv,avgsen,tottime,ptime86,qemp86,inc86,durat,black,hispan,born60,pcnvsq,pt86sq,inc86sq
0,0,0,0,0.38,17.6,35.200001,12,0.0,0.0,0.0,0,0,1,0.1444,144,0.0
1,2,2,0,0.44,0.0,0.0,0,1.0,0.8,0.0,0,1,0,0.1936,0,0.64
2,1,1,0,0.33,22.799999,22.799999,0,0.0,0.0,11.0,1,0,1,0.1089,0,0.0
3,2,2,1,0.25,0.0,0.0,5,2.0,8.8,0.0,0,1,1,0.0625,25,77.440002
4,1,1,0,0.0,0.0,0.0,0,2.0,8.1,1.0,0,0,0,0.0,0,65.610008


In [10]:
formula = "narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 + inc86 + black + hispan + born60"
reg_poisson = smf.poisson(formula, data=df_crime)
res_poisson = reg_poisson.fit()
print(res_poisson.summary())

Optimization terminated successfully.
         Current function value: 0.825233
         Iterations 6
                          Poisson Regression Results                          
Dep. Variable:                 narr86   No. Observations:                 2725
Model:                        Poisson   Df Residuals:                     2715
Method:                           MLE   Df Model:                            9
Date:                Mon, 29 Nov 2021   Pseudo R-squ.:                 0.07910
Time:                        19:03:48   Log-Likelihood:                -2248.8
converged:                       True   LL-Null:                       -2441.9
Covariance Type:            nonrobust   LLR p-value:                 1.134e-77
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.5996      0.067     -8.916      0.000      -0.731      -0.468
pcnv          -0.4016      0.

In [14]:
# Or use glm - generalized linear model with maximum likelihood estimator
import statsmodels.api as sm

reg_glm = smf.glm(formula, data=df_crime, family=sm.families.Poisson())
res_glm = reg_glm.fit()  # fit the GLM itself, not reg_poisson
# The Poisson family uses the log link by default, so the coefficient
# estimates should match the smf.poisson fit above.
print(res_glm.summary())

## 3. Censored Responses: The Tobit Model

Censored responses describe situations where the dependent variable is continuous but restricted to a certain range. A typical example is "wage", which cannot be negative.

This situation can be handled easily with maximum likelihood. The Tobit model assumes there is a latent variable $y^*$ that can take any real value, but we only observe its true value when $y^*>0$. For $y^*\le 0$ the observed value is $y=0$.

Hence the likelihood contribution of an observation with $y_i>0$ is the density of the latent variable evaluated at the observed value, $f(y_i|x_i)$. (Strictly, the probability is $f(y_i|x_i)\,dy$, but the factor $dy$ does not depend on $\beta$ and can be ignored.) For an observation with $y_i=0$ the contribution is the probability $P(y^*\le 0|x_i)$.

Of course, any maximum likelihood estimation requires a distributional assumption on $y^*$ (or $u$). The Tobit model assumes $y^*|x\sim N(x\beta,\sigma^2)$, so that $f(y_i|x_i)=\frac{1}{\sigma}\phi\left(\frac{y_i-x_i\beta}{\sigma}\right)$ and $P(y^*\le 0|x_i)=1-\Phi\left(\frac{x_i\beta}{\sigma}\right)$.
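The two kinds of likelihood contributions can be combined into a single log-likelihood function. The following is a minimal sketch for censoring at zero under the normality assumption; the function names and the tiny inputs at the end are illustrative, not part of any library.

```python
import math

def norm_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(beta, sigma, X, y):
    """Tobit log-likelihood with censoring at zero."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, x_i))
        if y_i > 0:
            # Uncensored: density of the latent variable at the observed value.
            total += math.log(norm_pdf((y_i - xb) / sigma) / sigma)
        else:
            # Censored: P(y* <= 0 | x_i) = Phi(-x_i beta / sigma).
            total += math.log(norm_cdf(-xb / sigma))
    return total

# Hypothetical one-observation checks with a constant-only model.
ll_censored = tobit_loglik((0.0,), 1.0, [(1.0,)], [0.0])
ll_observed = tobit_loglik((1.0,), 1.0, [(1.0,)], [1.0])
print(ll_censored, ll_observed)
```

Maximizing this function jointly over `beta` and `sigma` gives the Tobit estimates.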

## 4. Sample Selection Model

As an extension of Tobit, the sample selection model assumes that whether $y$ is observed is not determined by a constant threshold on $y^*$, but by a nonrandom selection process driven by a different mechanism. For example, we do not observe some respondents' wages because they do not want, need, or qualify to work, and that decision is determined by age, nationality, and family wealth.

To account for this possibility, Heckman developed a selection model that consists of a probit-like model for the binary indicator of whether y is observed and a linear regression-like model for y. Selection can be driven by the same determinants as y but should have **at least one additional factor** that is excluded from the equation for y.

Moreover, the second step of the sample selection model can be estimated using OLS.
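The reason OLS works in the second step can be summarized in one equation. As a sketch in notation that is assumed here rather than defined earlier ($z$ for the selection regressors, $\gamma$ for the probit coefficients, $s=1$ for observations where y is observed, $\rho$ for the coefficient on the correction term), the conditional mean in the selected sample is

$$E[y|z, s=1] = x\beta + \rho\,\lambda(z\gamma), \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}$$

where $\lambda(\cdot)$ is the inverse Mills ratio. Regressing y on x and $\lambda(z\hat{\gamma})$ over the observed sample therefore recovers $\beta$, and a significant coefficient on the inverse Mills ratio suggests that selection matters.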

In [18]:
import scipy.stats as stats
# step 1 (probit model for selection)

reg_probit = smf.probit("inlf ~ educ + exper + expersq + nwifeinc + age + kidslt6 + kidsge6", data=df)
res_probit = reg_probit.fit()

# store the fitted values (for probit, the linear index) and compute the inverse Mills ratio
pred_inlf = res_probit.fittedvalues
df["inv_mills"] = stats.norm.pdf(pred_inlf)/stats.norm.cdf(pred_inlf)

# step 2 (OLS on the observed wages; the inverse Mills ratio accounts for the conditional expectation)
reg_heckit = smf.ols("lwage~educ+exper+expersq+inv_mills", data=df)
res_heckit = reg_heckit.fit()

print(res_heckit.summary())

Optimization terminated successfully.
         Current function value: 0.532938
         Iterations 5
                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.157
Model:                            OLS   Adj. R-squared:                  0.149
Method:                 Least Squares   F-statistic:                     19.69
Date:                Mon, 29 Nov 2021   Prob (F-statistic):           7.14e-15
Time:                        19:47:02   Log-Likelihood:                -431.57
No. Observations:                 428   AIC:                             873.1
Df Residuals:                     423   BIC:                             893.4
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------