In [None]:
import numpy as np
from numpy import linalg as la
from tabulate import tabulate as tabulate
import LinearModelsWeek4_ante as lm

np.set_printoptions(precision=5)
%load_ext autoreload
%autoreload 2

# Prepare the data
In this problem set we consider state dependence in female **labour market participation** (LMP). The data in `dpdat.txt` come from the Panel Study of Income Dynamics (PSID) for the years 1986-1989. The sample consists of 1,442 women aged between 18 and 55 in 1986 who are married or cohabiting. The following variables are available:

| *Variable* | *Content* |
|------------|-----------|
| y0         | Participation |
| x1         | Fertility |
| x2         | Children aged 2-6 |
| x5         | Children of the same sex (male or female) |
| x7         | Schooling level 1 |
| x8         | Schooling level 2 |
| x9         | Schooling level 3 |
| x10        | Age |
| x11        | Race |
| y1         | Lagged participation |
| Year       | Year of observation |
| const      | Constant (only ones) |

### Preparation
Start by loading the data. **For today we need only y0, y1 and year**. Use the `np.loadtxt` function, and remember to give it the proper delimiter. It also has an argument that lets you choose which columns to load; see if you can find which one that is.
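
As a minimal sketch of how `delimiter` and `usecols` behave, the snippet below reads from an in-memory string rather than from `dpdat.txt`. The three-column layout is made up for illustration and does not mirror the real file.

```python
import io
import numpy as np

# Hypothetical three-column, comma-separated data (not the real dpdat.txt layout).
raw = "1,0,1986\n0,1,1987\n1,1,1988\n"

# usecols picks a column by position; negative indices count from the end.
first = np.loadtxt(io.StringIO(raw), delimiter=',', usecols=0)
last = np.loadtxt(io.StringIO(raw), delimiter=',', usecols=-1)

# A single column comes back one-dimensional, so reshape it into a column vector.
first_col = first.reshape(-1, 1)
print(first_col.shape, last)
```

Reshaping to `(-1, 1)` keeps the dependent variable as an explicit column, which is what the matrix formulas later in the notebook expect.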

In [None]:
y = np.loadtxt('dpdat.txt', delimiter=',', usecols=0).reshape(-1, 1)

const = np.loadtxt('dpdat.txt', delimiter=',', usecols=-1)
y_l = np.loadtxt('dpdat.txt', delimiter=',', usecols=12)
year = np.loadtxt('dpdat.txt', delimiter=',', usecols=-2)

T = 4

x = np.column_stack((const, y_l))

ylbl = 'participation'
xlbl = ['const', 'lag participation']

# Part 1: POLS
Today we will focus on a parsimonious model of female LMP (econometricians often use "parsimonious" to mean "simple").

Consider first the following AR(1) (autoregressive model of order $1$),

$$
LMP_{it} = \alpha_0 +  \rho LMP_{it-1} + c_i + u_{it}, \quad t = 1, 2, \dotsc, T \tag{1}
$$

As we have seen before, ignoring $c_i$ when estimating $\rho$ gives biased results. A common remedy for AR(1) processes is to take first differences. We then have the model,

$$
\Delta LMP_{it} = \rho \Delta LMP_{it-1} + \Delta u_{it}, \quad t = 2, \dotsc, T \tag{2}
$$

First differencing removes the fixed effect $c_i$ (and the constant $\alpha_0$) from the equation.
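
To see why, write eq. (1) for period $t-1$ and subtract it from eq. (1) for period $t$. Both $\alpha_0$ and $c_i$ are constant over time, so they cancel:

$$
\begin{aligned}
LMP_{it} - LMP_{it-1} &= (\alpha_0 - \alpha_0) + \rho (LMP_{it-1} - LMP_{it-2}) + (c_i - c_i) + (u_{it} - u_{it-1}) \\
\Delta LMP_{it} &= \rho \Delta LMP_{it-1} + \Delta u_{it}.
\end{aligned}
$$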

### Question 1
Estimate eq. (1) using POLS. 
* Are there signs of autocorrelation in female labour participation?
* What assumptions are no longer satisfied? What happens with fixed effects when we include a lag?

*Note:* We need to use the lagged values for participation. But this time we don't need to lag it ourselves, as it is already given to us in the data.

In [None]:
# FILL IN
# Estimate the AR(1) model using OLS
# Print out in a nice table

In [None]:
ar1_result = lm.estimate(y, x, robust_se=True)
lm.print_table(
    (ylbl, xlbl), ar1_result, title='AR(1)', floatfmt=['', '.3f', '.5f', '.2f']
)

Your table should look like this:

AR(1) <br>
Dependent variable: participation

|                   |   Beta |      Se |   t-values |
|-------------------|--------|---------|------------|
| const             |  0.278 | 0.01234 |      22.51 |
| lag participation |  0.637 | 0.01303 |      48.89 |
R² = 0.403 <br>
σ² = 0.106

### Question 2
Estimate eq. (2) using first differences. 
* What problem does this solve? 
* What type of exogeneity assumption is used to justify this method of estimation?

*Note:* You have to create the first differencing matrix yourself, and use the `perm` function to transform the dependent and independent variables. <br>
*Note 2:* This time you should use robust standard errors. The function is provided to you.
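
The following is a rough sketch of what applying a first difference matrix person by person could look like. `fd_matrix` and `perm_sketch` are hypothetical names; in particular, `perm_sketch` is only a guess at the behaviour of `lm.perm` and assumes the data are stacked person by person with the same number of periods for everyone.

```python
import numpy as np

def fd_matrix(t):
    # Row j holds -1 in column j and +1 in column j + 1, so D @ y gives y[j+1] - y[j].
    # The shape is (t - 1, t): differencing costs one period per person.
    return (np.eye(t, k=1) - np.eye(t))[:-1]

def perm_sketch(Q, A):
    # Hypothetical stand-in for lm.perm: apply Q to each person's block of rows.
    # Assumes A is stacked person by person with Q.shape[1] rows per person.
    m, t = Q.shape
    n = A.shape[0] // t
    blocks = A.reshape(n, t, -1)            # (person, period, variable)
    return (Q @ blocks).reshape(n * m, -1)  # matmul broadcasts Q over persons

# Two people, three periods each.
y_demo = np.array([1., 1., 0., 0., 1., 1.]).reshape(-1, 1)
dy_demo = perm_sketch(fd_matrix(3), y_demo)
print(dy_demo.ravel())
```

Each person contributes two differenced rows here, which is why the later call to `lm.estimate` is given `t=T-1`.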

In [None]:
# FILL IN
# Create a first difference matrix
# First difference both LMP and lag of LMP
# Estimate AR(1) model using OLS and print a nice table

In [None]:
def fd(t):
    # Create a first difference matrix: row j has -1 in column j and +1 in column j + 1.
    # Drop the last row, which has no following period, so each person loses one observation.
    D_t = np.eye(t, k=1) - np.eye(t)
    return D_t[:-1]

In [None]:
D_t = fd(T)
yfd = lm.perm(D_t, y)
yfd_l = lm.perm(D_t, y_l.reshape(-1, 1))

In [None]:
ar1_diff_result = lm.estimate(yfd, yfd_l, robust_se=True, t=T-1)
lm.print_table(
    (ylbl, ['lag participation']), 
    ar1_diff_result, title='FD AR(1)', floatfmt=['', '.3f', '.4f', '.2f']
)

Your table should look like this:

FD AR(1) <br>
Dependent variable: participation

|                   |   Beta |     Se |   t-values |
|-------------------|--------|--------|------------|
| lag participation | -0.321 | 0.0181 |     -17.76 |
R² = 0.105 <br>
σ² = 0.117

## Super short introduction to pooled IV (piv)

Suppose we want to estimate the effect of $x_K$ on $y$ while including $K - 1$ controls. We then have the usual equation,

$$
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{u} \tag{3}
$$

where $\mathbf{X} = (\mathbf{x}_1, \dotsc, \mathbf{x}_K)$. If $\mathbf{x}_K$ is not exogenous, we can define the instrument matrix $\mathbf{Z} = (\mathbf{x}_1, \dotsc, \mathbf{x}_{K - 1}, \mathbf{z}_1)$, where $\mathbf{z}_1$ is an instrument for $\mathbf{x}_K$. The details and the necessary assumptions and conditions are outlined in Wooldridge (2010), chapter 5.

We can estimate eq. (3) by pooled IV, using $\mathbf{z}_1$ as an instrument for $\mathbf{x}_K$. To make it easier for you when writing the code, here is the estimator in matrix notation,

$$
\boldsymbol{\hat{\beta}} = (\mathbf{\hat{X}}'\mathbf{\hat{X}})^{-1} \mathbf{\hat{X}}'\mathbf{y}, \tag{4}
$$

where $\mathbf{\hat{X}} = \mathbf{Z}(\mathbf{Z'}\mathbf{Z})^{-1}\mathbf{Z'}\mathbf{X}$.
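
As a minimal sketch, eq. (4) can be written in a few lines of NumPy. The function name `piv_sketch` and the simulated data are assumptions made for illustration; this is not the `est_piv` function from the `lm` module.

```python
import numpy as np
from numpy import linalg as la

def piv_sketch(y, x, z):
    # Eq. (4): project X on Z, then regress y on the projected X.
    x_hat = z @ la.solve(z.T @ z, z.T @ x)
    return la.solve(x_hat.T @ x_hat, x_hat.T @ y)

# Simulated data, for illustration only: x_k is endogenous because it contains e.
rng = np.random.default_rng(0)
n = 5000
z1 = rng.normal(size=(n, 1))
e = rng.normal(size=(n, 1))
x_k = z1 + e + rng.normal(size=(n, 1))
const = np.ones((n, 1))
y = 1.0 + 2.0 * x_k + e

X = np.hstack((const, x_k))
Z = np.hstack((const, z1))
b_iv = piv_sketch(y, X, Z)
b_ols = la.solve(X.T @ X, X.T @ y)
print(b_iv.ravel(), b_ols.ravel())
```

In this simulated setup the OLS slope is pushed above the true value of 2, while the IV slope should land close to it. With exactly as many instruments as regressors, eq. (4) reduces to $(\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y}$.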

# Part 2: Pooled IV
It should not be a surprise that models (1) and (2) violate the strict exogeneity assumption. But even if we relax this assumption to sequential exogeneity, the FD estimator remains inconsistent.

A solution for this is to use an instrument for $\Delta LMP_{it-1}$. The biggest issue is to find an instrument that is not only relevant, but also exogenous.

We often use an additional lag as an instrument. So for $\Delta LMP_{it-1}$, we can use $LMP_{it-2}$. In general, all earlier lags are available as instruments, so for $\Delta LMP_{it-1}$ we have $LMP_{it-2}^{\textbf{o}} = (LMP_{i0}, LMP_{i1}, \dotsc, LMP_{it-2})$ available.
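
A small simulation can illustrate the point. The sketch below generates a continuous AR(1) process with a fixed effect (not the binary LMP data) and compares OLS on first differences with IV using the level dated $t-2$ as instrument. All names and parameter values are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 20000, 0.5

# Simulate a stationary AR(1) with a fixed effect c_i (illustration only).
c = 0.5 * rng.normal(size=n)
y = np.empty((n, 5))
y[:, 0] = c / (1 - rho) + rng.normal(size=n) / np.sqrt(1 - rho**2)
for t in range(1, 5):
    y[:, t] = rho * y[:, t - 1] + c + rng.normal(size=n)

dy = (y[:, 2:] - y[:, 1:-1]).ravel()        # Delta y_t for t = 2, 3, 4
dy_lag = (y[:, 1:-1] - y[:, :-2]).ravel()   # Delta y_{t-1}
z = y[:, :-2].ravel()                       # y_{t-2} in levels

rho_fd = (dy_lag @ dy) / (dy_lag @ dy_lag)  # OLS on first differences
rho_iv = (z @ dy) / (z @ dy_lag)            # IV with y_{t-2} as instrument
print(round(rho_fd, 3), round(rho_iv, 3))
```

In this setup the FD-OLS estimate should come out negative even though the true $\rho$ is 0.5, because $\Delta y_{it-1}$ contains $u_{it-1}$, which also enters $\Delta u_{it}$ with a minus sign. The IV estimate should be close to 0.5.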

*Note:* $R^2$ has no meaning in IV regressions. You can report it if you want to, but I set it to 0.

### Question 1
Estimate eq. (2) by using the lag of the independent variable in levels, $z_{it} = LMP_{it-2}$ as an instrument. You need to finish writing the `est_piv` function and a part of the `estimate` function.

*Note:* In the `estimate` function, the `variance` function takes `x` as an argument, but for IV we want to pass it $\mathbf{\hat{X}}$ instead. <br>
*Note 2:* In order to create the instrument, you need to create a lag matrix, and use `perm`.
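
To see what a lag matrix does, here is a minimal sketch for a single person with four periods. The names `L_demo` and `v` are illustrative only.

```python
import numpy as np

t = 4
# Ones on the first sub-diagonal; the first row is dropped because
# the first period has no earlier value to look back on.
L_demo = np.eye(t, k=-1)[1:]

# One person's observations for periods 0..3.
v = np.array([10., 11., 12., 13.])
lagged = L_demo @ v
print(L_demo)
print(lagged)
```

Row $j$ of the result holds the value from the period before, so the three remaining rows contain the observations from periods 0, 1 and 2.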

In [None]:
# FILL IN
# First create a lag matrix
# Lag the lagged LMP variable
# Finish writing the piv function
# Finish writing the estimate function
# Estimate using first differences and lagged first differences. Use the second lag as instrument.

In [None]:
def lag(t):
    # Create a lag matrix.
    # Again remove the first observation, by removing the first row.
    L_t = np.eye(t, k=-1)
    return L_t[1:]

In [None]:
# First lag yl, and then use it as an instrument for the lagged differences.
L_t = lag(T)
y_2l = lm.perm(L_t, y_l.reshape(-1, 1))
ar1_iv_lvl_result = lm.estimate(
    yfd, yfd_l, y_2l, robust_se=True, t=T-1
)

In [None]:
lm.print_table(
    (ylbl, ['lag participation']), 
    ar1_iv_lvl_result, title='FD-IV AR(1)', floatfmt=['', '.3f', '.4f', '.2f']
)

Your table should look like this:

FD-IV AR(1) <br>
Dependent variable: participation

|                   |   Beta |     Se |   t-values |
|-------------------|--------|--------|------------|
| lag participation |  0.296 | 0.0469 |       6.30 |
R² = 0.000 <br>
σ² = 0.167

### Question 2
Estimate eq. (2) by using the lag of the independent variable in first differences, $z_{it} = \Delta LMP_{it-2}$ as an instrument.

In [None]:
# FILL IN
# Lag the first-differenced lag of LMP
# The second lag uses up an extra observation, so use the year variable to shorten both the first-differenced LMP and its first lag.
# Estimate using first differences and lagged first differences. Use the second lag of the first difference as instrument.

In [None]:
# Create a new lag matrix (that is shorter, since we already removed one obs)
# Create second lag of first differences.
L_t = lag(T - 1)
yfd_l2 = lm.perm(L_t, yfd_l)

In [None]:
# Remove the first observation for each person.
reduced_year = year[year != 1986]  # Remove first year, so that shape is the same as yfd
yfd0 = yfd[reduced_year != 1987]
yfd_l0 = yfd_l[reduced_year != 1987]
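
The boolean indexing used here can be illustrated on a toy panel. This is a sketch with made-up arrays (`year_demo`, `v`), assuming two people stacked person by person.

```python
import numpy as np

# Two people observed 1986-1989, stacked person by person.
year_demo = np.tile(np.arange(1986, 1990), 2)

# After first differencing, 1986 is gone for everyone.
reduced = year_demo[year_demo != 1986]

# A variable with one row per remaining person-year.
v = np.arange(reduced.size).reshape(-1, 1)

# Dropping 1987 as well leaves two rows per person.
v_short = v[reduced != 1987]
print(v_short.ravel())
```

The mask must have the same length as the array it indexes, which is why the year vector is shortened first.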

In [None]:
ar1_iv_result = lm.estimate(
    yfd0, yfd_l0, yfd_l2, robust_se=True, t=T-2
)
lm.print_table(
    (ylbl, ['lag participation']), 
    ar1_iv_result, title='FD-IV AR(1)', floatfmt=['', '.3f', '.4f', '.2f']
)

Your table should look like this:

FD-IV AR(1) <br>
Dependent variable: participation

|                   |   Beta |     Se |   t-values |
|-------------------|--------|--------|------------|
| lag participation |  0.210 | 0.0880 |       2.39 |
R² = 0.000 <br>
σ² = 0.154

### Summing up Exercises 1 and 2

First of all, consider whether $LMP_{it-2}$ or $\Delta LMP_{it-2}$ is the more natural instrument for $\Delta LMP_{it-1}$.

Then consider how the different models compare to each other. Some questions that you might discuss with your classmates:
* Which ones make the most sense from an economic perspective?
* Which ones make the most sense from an econometric perspective?
* Do you feel that there is conclusive evidence of state dependence in female labour market participation?

# Part 3: System 2SLS
As mentioned earlier, we have $LMP_{it-2}^{\textbf{o}} = (LMP_{i0}, LMP_{i1}, \dotsc, LMP_{it-2})$ available as instruments at time $t$. In order to take advantage of these instruments, we could create the following instrument matrix,

$$
\mathbf{Z^{\mathbf{o}}} = 
\begin{bmatrix}
    y_{i0} & 0 & 0 & 0 & 0 & 0 & \cdots & 0 \\
    0 & y_{i0} & y_{i1} & 0 & 0 & 0 & \cdots & 0 \\
    0 & 0 & 0 & y_{i0} & y_{i1} & y_{i2} & \cdots & 0 \\
    \vdots  & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & 0 & 0 & 0 & 0 & \cdots & \mathbf{y^o_{it-2}} \\
\end{bmatrix}
\begin{pmatrix}
t = 2 \\
t = 3 \\
t = 4 \\
\vdots \\
t = T
\end{pmatrix}.
$$

We will use the `gmm` module for this part, but parts of the `lm` module will be used inside the `gmm` module.

## Super short introduction to S2SLS (see also Wooldridge (2010), chapters 5 and 8)

Since we have a different number of instruments in each period, we cannot use them with piv. We therefore turn to system two-stage least squares (S2SLS).

To recap what ("normal") 2SLS looks like, eq. (3) would be estimated as follows:
* In the first-stage regression, we regress $\mathbf{x}_K$ on $\mathbf{Z}$, and then calculate the predicted $\hat{\mathbf{x}}_K$.
* In the second-stage regression, we use these predicted values in place of $\mathbf{x}_K$, so for eq. (3) we regress $\mathbf{y}$ on $\mathbf{x}_1, \dotsc, \mathbf{x}_{K-1}, \hat{\mathbf{x}}_K$.

For S2SLS, we perform a first-stage regression for each time period and store the predicted $\hat{\mathbf{x}}_K$ from each one. We then combine these predictions into one array (taking care that the time periods are sorted correctly for each person) and use it in the second-stage regression.

### Question 1: Create the level instrument matrix $\mathbf{Z^{\mathbf{o}}}$

Finish the function `sequential_instruments` in order to create the instrument matrix $\mathbf{Z^{\mathbf{o}}}$ using the second lag of LMP in **levels**. Note that you will not get one array that looks like $\mathbf{Z^{\mathbf{o}}}$, but an array that contains a block that looks like $\mathbf{Z^{\mathbf{o}}}$ for each individual in the data. Since we have four time periods, and access to $y_{i0}$, you should get three rows of instruments for each individual.

If it is too difficult to create the instrument matrix $\mathbf{Z^{\mathbf{o}}}$, you can in this instance hard-code it instead. Use the year column for boolean indexing in Python, and then `np.hstack` to create the necessary columns for the first-stage regressions. Look at *Question 2* below if you are uncertain about how many columns you need for each regression.
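
One possible way to build such a block structure is sketched below. `sequential_instruments_sketch` is a hypothetical helper, not the implementation in the `gmm` module. Note that its second argument is the number of rows per person, which may differ from what `gmm.sequential_instruments` expects.

```python
import numpy as np

def sequential_instruments_sketch(z, t):
    # Hypothetical helper, not the gmm module's implementation.
    # z: (n * t, 1) array stacked person by person, holding the instrument
    #    that becomes available in each of the t usable periods.
    # Row s of a person's block gets z[0..s], placed in its own group of columns.
    n = z.shape[0] // t
    zi = z.reshape(n, t)
    n_cols = t * (t + 1) // 2
    Z = np.zeros((n, t, n_cols))
    start = 0
    for s in range(t):
        Z[:, s, start:start + s + 1] = zi[:, :s + 1]
        start += s + 1
    return Z.reshape(n * t, n_cols)

# One person with three usable periods and instrument values 1, 2, 3.
Z_demo = sequential_instruments_sketch(np.array([[1.], [2.], [3.]]), 3)
print(Z_demo)
```

With three usable periods this gives $1 + 2 + 3 = 6$ columns, matching the column slices `0`, `1:3` and `3:` used by the helper functions further down.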

In [None]:
instrument_matrix = gmm.sequential_instruments(y_2l, T)
print(instrument_matrix)

To prepare for the next question, I have created a function that retrieves one time period for each person, stores the periods in separate arrays, and finally returns these in a list. (I also keep the array for all time periods, since we need that for the second-stage regression.)

In [None]:
def xy_eq_by_eq(x, t):
    equation = []
    # Reshape, so that each column is each time period
    xt = x.reshape(-1, t)

    # I then transpose this, so that each time period is in rows
    for row in xt.T:
        # I then put each time period in a separate array, that I store in the list.
        equation.append(row.reshape(-1, 1))
    # We need individual observations for all years in the second stage regression,
    # so let us keep that at the end of the list.
    equation.append(x)
    return equation
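
A quick check on a toy array shows what the helper returns. The function is repeated in compact form only so that the snippet runs on its own; `x_demo` is made up.

```python
import numpy as np

def xy_eq_by_eq(x, t):
    # Same logic as the helper above, repeated so the snippet is self-contained.
    equation = [row.reshape(-1, 1) for row in x.reshape(-1, t).T]
    equation.append(x)
    return equation

# Two people, three periods, stacked person by person.
x_demo = np.array([1., 2., 3., 4., 5., 6.]).reshape(-1, 1)
parts = xy_eq_by_eq(x_demo, 3)
print([p.ravel().tolist() for p in parts])
```

The first three list elements hold one period each (with one row per person), and the last element is the full stacked array.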

In [None]:
# Put in your own variables for y and x
y_arrays = xy_eq_by_eq(, T-1)
x_arrays = xy_eq_by_eq(, T-1)

In [None]:
y_arrays = xy_eq_by_eq(yfd, T-1)
x_arrays = xy_eq_by_eq(yfd_l, T-1)

### Question 2: Use the instruments to perform 2SLS

Now that you have the necessary instruments, use these to estimate eq. (2), with all possible lagged levels as instruments.

I recommend that you create two functions for this. The first, let us call it `first_stage`, simply performs the first-stage regression of the endogenous regressor on the instruments and returns the predicted $\hat{\mathbf{x}}_K$, which in this case is the lagged first difference of LMP.

The second function, let us call it `system_2sls`, loops over the lists of y, x and z arrays given as inputs and performs the `first_stage` regression on each set (remember, each array has all observations from one time period only). Since `first_stage` returns the predicted $\hat{\mathbf{x}}_K$, you need to store these and combine them at the end. You can then perform the second-stage regression.

So in our example where we use levels, you should perform three first-stage regressions that look like this:

$$
\begin{aligned}
\Delta LMP_{i1987} & = \rho_{11}LMP_{i1986} + u_{it} \\
\Delta LMP_{i1988} & = \rho_{21}LMP_{i1986} + \rho_{22}LMP_{i1987} + u_{it} \\
\Delta LMP_{i1989} & = \rho_{31}LMP_{i1986} + \rho_{32}LMP_{i1987} + \rho_{33}LMP_{i1988} + u_{it}
\end{aligned}
$$

Then, for each regression, you also need to compute the predicted values of the first difference of LMP. Store these and combine them at the end, so that you get a column that looks like this,

$$
\underset{N(T - 1) \times 1}{
\begin{pmatrix}
    \Delta \widehat{LMP}_{1, 1987} \\
    \Delta \widehat{LMP}_{1, 1988} \\
    \Delta \widehat{LMP}_{1, 1989} \\
    \Delta \widehat{LMP}_{2, 1987} \\
    \Delta \widehat{LMP}_{2, 1988} \\
    \Delta \widehat{LMP}_{2, 1989} \\
    \vdots \\
    \Delta \widehat{LMP}_{N, 1987} \\
    \Delta \widehat{LMP}_{N, 1988} \\
    \Delta \widehat{LMP}_{N, 1989} \\
\end{pmatrix}
}
$$

So in the end, you perform a second stage regression that looks like this

$$
    \Delta LMP_{it} = \rho\Delta \widehat{LMP}_{it-1} + \Delta u_{it}
$$
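
The two functions could be organised roughly as below. This is only a sketch under stated assumptions: `first_stage_sketch` and `system_2sls_sketch` are hypothetical names, they return only the coefficient (no standard errors), and the demo uses a simulated continuous AR(1) panel rather than the LMP data.

```python
import numpy as np
from numpy import linalg as la

def first_stage_sketch(x, z):
    # Regress the endogenous regressor on that period's instruments; return fitted values.
    return z @ la.solve(z.T @ z, z.T @ x)

def system_2sls_sketch(y_all, x_by_period, z_by_period):
    # Hypothetical helper: one first stage per period, then a pooled second stage.
    fitted = [first_stage_sketch(x, z) for x, z in zip(x_by_period, z_by_period)]
    # Put periods side by side and flatten row-wise, so rows are ordered
    # person by person, period by period (matching y_all).
    x_hat = np.hstack(fitted).reshape(-1, 1)
    return la.solve(x_hat.T @ x_hat, x_hat.T @ y_all)

# Simulated stationary AR(1) panel with a fixed effect (illustration only).
rng = np.random.default_rng(2)
n, rho = 20000, 0.5
c = 0.5 * rng.normal(size=n)
lvl = np.empty((n, 5))
lvl[:, 0] = c / (1 - rho) + rng.normal(size=n) / np.sqrt(1 - rho**2)
for t in range(1, 5):
    lvl[:, t] = rho * lvl[:, t - 1] + c + rng.normal(size=n)

d = np.diff(lvl, axis=1)                          # d[:, j] = lvl[:, j+1] - lvl[:, j]
y_all = d[:, 1:].reshape(-1, 1)                   # Delta y_t for t = 2, 3, 4
x_by_period = [d[:, [j]] for j in range(3)]       # Delta y_{t-1}, one array per period
z_by_period = [lvl[:, :j + 1] for j in range(3)]  # levels dated t-2 and earlier
rho_hat = system_2sls_sketch(y_all, x_by_period, z_by_period)
print(rho_hat)
```

Each period gets its own first stage because the number of available instruments grows over time, while the second stage pools all periods to estimate a single $\rho$.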

In [None]:
# I have created these helper functions. If you managed to create the instrument matrix, you can use IV_eq_by_eq to pick out the correct instrument columns for each time period.

In [None]:
def IV_eq_by_eq(instrument_matrix, reduced_year):
    z_1987 = instrument_matrix[reduced_year == 1987, 0].reshape(-1, 1)
    z_1988 = instrument_matrix[reduced_year == 1988, 1:3]
    z_1989 = instrument_matrix[reduced_year == 1989, 3:]
    return (z_1987, z_1988, z_1989)
z_arrays = IV_eq_by_eq(instrument_matrix, reduced_year)

In [None]:
s2sls_result = gmm.system_2sls(y_arrays, x_arrays, z_arrays)
lm.print_table(
    ('yfd', ['yfd_l']), s2sls_result, title='System 2SLS results', floatfmt='.4f'
)

System 2SLS results <br>
Dependent variable: yfd

|       |   Beta |     Se |   t-values |
|-------|--------|--------|------------|
| yfd_l | 0.2390 | 0.0166 |    14.3891 |
R² = 0.000 <br>
σ² = 0.158

### Question 3
Repeat questions 1 and 2, but now with first differences as instruments. In other words, perform 2SLS with twice-lagged first differences of LMP as instruments.

I have given you some of the code if you managed to finish the `gmm.sequential_instruments` function. To create the y, x, and z arrays you use the provided `xy_eq_by_eq` function.

In [None]:
reduced_year2 = reduced_year[reduced_year != 1987]

instrument_matrix = gmm.sequential_instruments(, T - 1)  # Input your second lag first differences variable
def IV_eq_by_eq2(instrument_matrix, reduced_year):
    z_1988 = instrument_matrix[reduced_year == 1988, 0].reshape(-1, 1)
    z_1989 = instrument_matrix[reduced_year == 1989, 1:]
    return (z_1988, z_1989)

z_arrays = IV_eq_by_eq2(instrument_matrix, reduced_year2)
y_arrays = # Fill in
x_arrays = # Fill in

In [None]:
reduced_year2 = reduced_year[reduced_year != 1987]
instrument_matrix = gmm.sequential_instruments(yfd_l2, T - 1)
def IV_eq_by_eq2(instrument_matrix, reduced_year):
    z_1988 = instrument_matrix[reduced_year == 1988, 0].reshape(-1, 1)
    z_1989 = instrument_matrix[reduced_year == 1989, 1:]
    return (z_1988, z_1989)

z_arrays = IV_eq_by_eq2(instrument_matrix, reduced_year2)
y_arrays = xy_eq_by_eq(yfd0, T-2)
x_arrays = xy_eq_by_eq(yfd_l0, T-2)

In [None]:
s2sls_result = gmm.system_2sls(y_arrays, x_arrays, z_arrays)
lm.print_table(
    ('yfd', ['yfd_l']), s2sls_result, title='System 2SLS results', floatfmt='.4f',
    tablefmt='github'
)

Your table should look something like this:

System 2SLS results <br>
Dependent variable: yfd

|       |   Beta |     Se |   t-values |
|-------|--------|--------|------------|
| yfd_l | 0.1959 | 0.0197 |     9.9622 |
R² = 0.000 <br>
σ² = 0.152

# Part 4: GMM
Coming up

In [None]:
gmm_result = gmm.est_gmm(
    W=la.inv(instrument_matrix.T @ instrument_matrix),
    y=yfd0,
    x=yfd_l0,
    z=instrument_matrix,
    t=T-2,
    step=1
)

In [None]:
lm.print_table(('yfd', ['yfd_l']), gmm_result, title='GMM results', floatfmt='.4f', tablefmt='github')