In [1]:
import numpy as np
from numpy import linalg as la
from tabulate import tabulate

# np.linalg.cholesky returns the lower-triangular factor; scipy's version returns the upper one by default
from scipy.linalg import cholesky
np.set_printoptions(precision=5)
import LinearModelsWeek4_ante as lm

%load_ext autoreload
%autoreload 2

# Prepare the data
In this problem set we consider state dependence in female **labour market participation** (LMP). The data in `dpdat.txt` come from the Panel Study of Income Dynamics (PSID) for the years 1986-1989. The sample consists of 1,442 women aged between 18 and 55 in 1986 who are married or cohabiting. The following variables are available:

| *Variable* | *Content* |
|------------|--------------------------------------------|
| y0         | Participation |
| x1         | Fertility |
| x2         | Children aged 2-6 |
| x5         | Children of the same sex (male or female) |
| x7         | Schooling level 1 |
| x8         | Schooling level 2 |
| x9         | Schooling level 3 |
| x10        | Age |
| x11        | Race |
| y1         | Lagged participation |
| Year       | Year of observation |
| const      | Constant (only ones) |

### Preparation
Start by loading the data. **For today we need only y0, y1 and year**. Use the `np.loadtxt` function, and remember to give it the proper delimiter. It also has an argument that lets you choose which columns to read; see if you can find which one that is.
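
As a hint on the mechanics, here is a minimal sketch of `np.loadtxt` with its `delimiter` and `usecols` arguments. It reads a small made-up comma-separated text instead of `dpdat.txt`; the delimiter and the column positions shown are assumptions for illustration only, so check them against the actual file.

```python
import io
import numpy as np

# Made-up stand-in for a data file: three comma-separated columns.
# The real dpdat.txt may use another delimiter and column order.
fake_file = io.StringIO("1,0,1986\n0,1,1987\n1,1,1988\n")

# usecols picks columns by zero-based position.
data = np.loadtxt(fake_file, delimiter=",", usecols=(0, 2))

# Reshape to column vectors, which is convenient for matrix algebra later.
first = data[:, 0].reshape(-1, 1)
third = data[:, 1].reshape(-1, 1)
```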

In [None]:
y = # FILL IN participation

const = # FILL IN
y_l = # FILL IN lagged participation
year = # FILL IN

T = 4

x = np.column_stack((const, y_l))

ylbl = 'participation'
xlbl = ['const', 'lag participation']

# Part 1: POLS
Today we will focus on a parsimonious model of female LMP (econometricians often use "parsimonious" to mean "simple").

Consider first the following AR(1) (autoregressive model of order $1$),

$$
LMP_{it} = \alpha_0 +  \rho LMP_{it-1} + c_i + u_{it}, \quad t = 1, 2, \dotsc, T \tag{1}
$$

As we have seen before, ignoring $c_i$ when estimating $\rho$ gives biased results. A common remedy for AR(1) processes is to take first differences, which gives the model,

$$
\Delta LMP_{it} = \rho \Delta LMP_{it-1} + \Delta u_{it}, \quad t = 2, \dotsc, T \tag{2}
$$

First differencing removes the fixed effect $c_i$ (and the constant $\alpha_0$) from the equation.
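
Written out, subtracting eq. (1) at $t-1$ from eq. (1) at $t$ gives, for $t = 2, \dotsc, T$,

$$
\begin{aligned}
\Delta LMP_{it} &= (\alpha_0 - \alpha_0) + \rho (LMP_{it-1} - LMP_{it-2}) + (c_i - c_i) + (u_{it} - u_{it-1}) \\
&= \rho \Delta LMP_{it-1} + \Delta u_{it}.
\end{aligned}
$$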

### Question 1
Estimate eq. (1) using POLS. 
* Are there signs of autocorrelation in female labour participation?
* What assumptions are no longer satisfied? What happens with fixed effects when we include a lag?

*Note:* We need to use the lagged values for participation. But this time we don't need to lag it ourselves, as it is already given to us in the data.
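
For reference, the POLS algebra can be sketched in a few lines of NumPy. This is only an illustration on a tiny made-up data set: `ols_sketch` is a hypothetical name, not the estimation function in `LinearModelsWeek4_ante`, and the standard errors assume homoskedastic errors.

```python
import numpy as np

def ols_sketch(y, x):
    """OLS coefficients, homoskedastic standard errors, t-values and sigma^2."""
    n, k = x.shape
    xtx_inv = np.linalg.inv(x.T @ x)
    b = xtx_inv @ x.T @ y                          # (X'X)^{-1} X'y
    resid = y - x @ b
    sigma2 = (resid.T @ resid).item() / (n - k)    # error variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv)).reshape(-1, 1)
    return b, se, b / se, sigma2

# Tiny made-up data set, not the PSID sample.
x = np.column_stack((np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])))
y = np.array([[2.0], [2.0], [4.0], [8.0]])
b, se, t, sigma2 = ols_sketch(y, x)
```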

In [None]:
# FILL IN
# Estimate the AR(1) model using OLS
# Print out in a nice table

Your table should look like this:

AR(1) <br>
Dependent variable: participation

|                   |   Beta |      Se |   t-values |
|-------------------|--------|---------|------------|
| const             |  0.278 | 0.01234 |      22.51 |
| lag participation |  0.637 | 0.01303 |      48.89 |

R² = 0.403 <br>
σ² = 0.106

### Question 2
Estimate eq. (2) using first differences. 
* What problem does this solve? 
* What type of exogeneity assumption is used to justify this method of estimation?

*Note:* You have to create the first-differencing matrix yourself, and use the `perm` function to transform the dependent and independent variables. <br>
*Note 2:* This time you should use robust standard errors. The function is provided to you.
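
To illustrate what a first-differencing matrix does, here is a minimal sketch on made-up numbers. The helper `transform_panel` is a hypothetical stand-in written for this example; I am assuming that `perm` applies a transformation matrix unit by unit in a similar way, but its actual signature is not shown in this notebook.

```python
import numpy as np

def fd_matrix(T):
    """(T-1) x T matrix whose rows compute a_t - a_{t-1}."""
    return np.eye(T - 1, T, k=1) - np.eye(T - 1, T)

def transform_panel(Q, a, T):
    """Apply Q to each unit's block of T rows in a stacked panel (hypothetical helper)."""
    n_units = a.shape[0] // T
    blocks = a.reshape(n_units, T, a.shape[1])   # shape (N, T, K)
    out = Q @ blocks                             # shape (N, rows of Q, K)
    return out.reshape(-1, a.shape[1])

# Two made-up units observed for T = 3 periods each, stacked unit by unit.
a = np.array([[1.0], [3.0], [6.0], [10.0], [10.0], [11.0]])
D = fd_matrix(3)
a_diff = transform_panel(D, a, 3)
```

Each unit loses its first period, so the differenced panel has $N(T-1)$ rows.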

In [None]:
# FILL IN
# Create a first difference matrix
# First difference both LMP and lag of LMP
# Estimate AR(1) model using OLS and print a nice table

Your table should look like this:

FD AR(1) <br>
Dependent variable: participation

|                   |   Beta |     Se |   t-values |
|-------------------|--------|--------|------------|
| lag participation | -0.321 | 0.0181 |     -17.76 |

R² = 0.105 <br>
σ² = 0.117

## Super short introduction to pooled IV (piv)

Suppose we want to estimate the effect of $x_K$ on $y$ while including $K - 1$ controls. We then have the usual equation,

$$
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{u} \tag{3}
$$

where $\mathbf{X} = (\mathbf{x}_1, \dotsc, \mathbf{x}_K)$. If $\mathbf{x}_K$ is not exogenous, we can define the instrument matrix $\mathbf{Z} = (\mathbf{x}_1, \dotsc, \mathbf{x}_{K - 1}, \mathbf{z}_1)$, where $\mathbf{z}_1$ is an instrument for $\mathbf{x}_K$. The details and the necessary assumptions and conditions are outlined in Wooldridge (2010, chapter 5).

We can estimate eq. (3) using $\mathbf{z}_1$ as an instrument for $\mathbf{x}_K$. To make it easier for you when writing code, here is the estimator in matrix notation,

$$
\boldsymbol{\hat{\beta}} = (\mathbf{\hat{X}}'\mathbf{\hat{X}})^{-1} \mathbf{\hat{X}}'\mathbf{y}, \tag{4}
$$

where $\mathbf{\hat{X}} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}$.
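
The two-step formula in eq. (4) can be sketched as follows. This is an illustration on simulated, made-up data, not the solution to the `est_piv` exercise: `piv_sketch` is a hypothetical name, and the data-generating process (true slope 2, with `x` sharing the component `v` with the error) is an assumption chosen so that OLS is visibly biased.

```python
import numpy as np

def piv_sketch(y, x, z):
    """Pooled IV coefficients: regress y on the fitted values of x given z."""
    # First stage: X_hat = Z (Z'Z)^{-1} Z'X, using solve instead of an explicit inverse.
    x_hat = z @ np.linalg.solve(z.T @ z, z.T @ x)
    # Second stage: (X_hat'X_hat)^{-1} X_hat'y.
    return np.linalg.solve(x_hat.T @ x_hat, x_hat.T @ y)

# Simulated data: xk is endogenous because it shares v with the error u.
rng = np.random.default_rng(0)
n = 20_000
z1 = rng.normal(size=(n, 1))
v = rng.normal(size=(n, 1))
u = 0.8 * v + 0.6 * rng.normal(size=(n, 1))
xk = z1 + v
y = 1.0 + 2.0 * xk + u

const = np.ones((n, 1))
x = np.hstack((const, xk))
z = np.hstack((const, z1))   # exogenous regressors plus the instrument

b_ols = np.linalg.solve(x.T @ x, x.T @ y)
b_iv = piv_sketch(y, x, z)
```

In this simulation the OLS slope should be pulled away from 2 by the shared component `v`, while the IV slope should land close to 2. Passing `x` itself as the instrument matrix reproduces OLS, which is a handy sanity check.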

# Part 2: Pooled IV
It should not be a surprise that models (1) and (2) violate the strict exogeneity assumption. Even if we relax this assumption to sequential exogeneity, the FD estimator remains inconsistent.
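
A short sketch of why: under sequential exogeneity, $u_{is}$ is uncorrelated with $c_i$ and with participation up to period $s-1$, but $u_{it-1}$ is itself a component of $LMP_{it-1}$ through eq. (1). The regressor and the error in eq. (2) are therefore correlated,

$$
\begin{aligned}
\mathrm{Cov}(\Delta LMP_{it-1}, \Delta u_{it}) &= \mathrm{Cov}(LMP_{it-1} - LMP_{it-2},\; u_{it} - u_{it-1}) \\
&= -\mathrm{Cov}(LMP_{it-1}, u_{it-1}) = -\mathrm{Var}(u_{it-1}) \neq 0.
\end{aligned}
$$

By the same argument, $\mathrm{Cov}(LMP_{it-2}, \Delta u_{it}) = 0$, so $LMP_{it-2}$ is not correlated with the differenced error.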

One solution is to use an instrument for $\Delta LMP_{it-1}$. The main challenge is to find an instrument that is not only relevant, but also exogenous.

We often use an additional lag as an instrument. So for $\Delta LMP_{it-1}$, we can use $LMP_{it-2}$. In general, all earlier lags are available as instruments. So for $\Delta LMP_{it-1}$ we have $LMP_{it-2}^{\textbf{o}} = (LMP_{i0}, LMP_{i1}, \dotsc, LMP_{it-2})$ available as instruments.

*Note:* $R^2$ has no meaningful interpretation in IV regressions. You can report it if you want to, but I set it to 0.

### Question 1
Estimate eq. (2) by using the lag of the independent variable in levels, $z_{it} = LMP_{it-2}$ as an instrument. You need to finish writing the `est_piv` function and a part of the `estimate` function.

*Note:* In the `estimate` function, the `variance` function takes `x` as an argument, but here we want to pass it $\mathbf{\hat{X}}$ instead. <br>
*Note 2:* In order to create the instrument, you need to create a lag matrix, and use `perm`.
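
As an illustration of what a lag matrix does, the sketch below builds one and applies it to a made-up stacked panel through a block-diagonal (Kronecker) operator. The name `lag_matrix` and the Kronecker approach are my own choices for this example; for the full sample a unit-by-unit transformation such as `perm` is likely preferable, since the Kronecker matrix grows with the square of the number of units.

```python
import numpy as np

def lag_matrix(T):
    """(T-1) x T matrix keeping periods 1..T-1, i.e. the one-period lag for periods 2..T."""
    return np.eye(T - 1, T)

# Two made-up units with T = 3 periods each, stacked unit by unit.
a = np.array([[1.0], [2.0], [3.0], [10.0], [20.0], [30.0]])
L = lag_matrix(3)

# Block-diagonal operator: one copy of L per unit.
L_panel = np.kron(np.eye(2), L)
a_lag = L_panel @ a
```

The result has one row fewer per unit and lines up with the rows produced by first differencing a panel of the same length.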

In [None]:
# FILL IN
# Create first a lag matrix
# Lag the lagged LMP variable
# Finish writing the piv function
# Finish writing the estimate function
# Estimate using first differences and lagged first differences. Use the second lag as the instrument.

Your table should look like this:

FD-IV AR(1) <br>
Dependent variable: participation

|                   |   Beta |     Se |   t-values |
|-------------------|--------|--------|------------|
| lag participation |  0.296 | 0.0469 |       6.30 |

R² = 0.000 <br>
σ² = 0.167

### Question 2
Estimate eq. (2) by using the lag of the independent variable in first differences, $z_{it} = \Delta LMP_{it-2}$ as an instrument.
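
Because lagging the differenced regressor costs one more period, the other differenced variables must be shortened to match. Here is a minimal sketch of doing that with a boolean mask on the year variable. The panel is made up (2 units, years 1986-1989), and `dy` is a placeholder array rather than actual differenced data.

```python
import numpy as np

# Made-up panel: 2 units, years 1986-1989, stacked unit by unit.
year = np.tile(np.arange(1986, 1990), 2)

# After first differencing, 1986 is gone for every unit.
reduced_year = year[year != 1986]

# Placeholder for a first-differenced variable with one row per remaining year.
dy = np.arange(reduced_year.size, dtype=float).reshape(-1, 1)

# The extra lag removes 1987 as well, so keep only the later years.
keep = reduced_year != 1987
dy_short = dy[keep]
```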

In [None]:
# FILL IN
# Lag the first differenced lag LMP variable
# The second lag uses up an extra observation, so you need to use the year variable to shorten both the first-differenced LMP and the first-differenced lag of LMP.
# Estimate using first differences and lagged first differences. Use the second first-difference lag as the instrument.
reduced_year = year[year != 1986]  # Remove the first year, since we lose the first observation when first differencing.

Your table should look like this:

FD-IV AR(1) <br>
Dependent variable: participation

|                   |   Beta |     Se |   t-values |
|-------------------|--------|--------|------------|
| lag participation |  0.210 | 0.0880 |       2.39 |

R² = 0.000 <br>
σ² = 0.154

### Summing up Exercise 1 and 2.

First, consider whether it is more natural to use $LMP_{it-2}$ or $\Delta LMP_{it-2}$ as an instrument for $\Delta LMP_{it-1}$.

Then consider how the different models compare to each other. Some questions that you might discuss with your classmates:
* Which ones make the most sense from an economic perspective?
* Which ones make the most sense from an econometric perspective?
* Is there conclusive evidence of state dependence in female labour market participation?