In [None]:
import numpy as np
from numpy import linalg as la
from tabulate import tabulate
import LinearModelsWeek5 as lm
import gmm_ante as gmm

np.set_printoptions(precision=5)
%load_ext autoreload
%autoreload 2

# Prepare the data
In this problem set we consider state dependence in female **labour market participation** (LMP). The data in `dpdat.txt` come from the Panel Study of Income Dynamics (PSID) for the years 1986-1989. The sample consists of 1,442 women aged between 18 and 55 in 1986 who are married or cohabiting. The following variables are available:

| *Variable* | *Content* |
|------------|-----------|
| y0         | Participation |
| x1         | Fertility |
| x2         | Children aged 2-6 |
| x5         | Children of the same sex (male or female) |
| x7         | Schooling level 1 |
| x8         | Schooling level 2 |
| x9         | Schooling level 3 |
| x10        | Age |
| x11        | Race |
| y1         | Lagged participation |
| Year       | Year of observation |
| const      | Constant (only ones) |

### Preparation
Start by loading the data. **For today we need only y0, y1 and year**. Use the `np.loadtxt` function, and remember to give it the proper delimiter. It also has an argument that lets you choose which columns to load; see if you can find which one that is.
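
As a small illustration of column selection, the sketch below reads a made-up, comma-separated snippet from memory instead of `dpdat.txt`. The column layout here is purely an assumption for the example and does not match the real file.

```python
import io
import numpy as np

# Toy data (hypothetical layout): participation, fertility, age, year, const
csv_text = "1,0,35,1986,1\n0,1,35,1987,1\n1,0,41,1986,1\n"

# usecols accepts a single index or a sequence of indices;
# unpack=True returns one array per selected column.
y0, year = np.loadtxt(
    io.StringIO(csv_text), delimiter=",", usecols=(0, 3), unpack=True
)

# Turn the dependent variable into a column vector, as done in this notebook.
y0 = y0.reshape(-1, 1)
print(y0.shape, year)
```

Reading several columns in one call avoids parsing the file once per variable.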

In [None]:
y = np.loadtxt('dpdat.txt', delimiter=',', usecols=0).reshape(-1, 1)

const = np.loadtxt('dpdat.txt', delimiter=',', usecols=-1)
y_l = np.loadtxt('dpdat.txt', delimiter=',', usecols=12)
year = np.loadtxt('dpdat.txt', delimiter=',', usecols=-2)

T = 4

x = np.column_stack((const, y_l))

ylbl = 'participation'
xlbl = ['const', 'lag participation']

In [None]:
# Some variables that we used last time. Note: these use my names; you may have chosen different ones.
## Variables used for OLS.
def fd(t):
    # Create a first-difference matrix.
    # Drop the last row: differencing loses one observation per individual,
    # since the first period has no previous value to difference against.
    D_t = np.eye(t, k=1) - np.eye(t)
    return D_t[:-1]
D_t = fd(T)
yfd = lm.perm(D_t, y)
yfd_l = lm.perm(D_t, y_l.reshape(-1, 1))

## Variables used for PIV
# Second lag
def lag(t):
    # Create a lag matrix.
    # Again remove the first observation, by removing the first row.
    L_t = np.eye(t, k=-1)
    return L_t[1:]
L_t = lag(T)
y_2l = lm.perm(L_t, y_l.reshape(-1, 1))

# Second lag of FD, and shortened arrays that I use with that second lagged instrument.
L_t = lag(T - 1)
yfd_l2 = lm.perm(L_t, yfd_l)
reduced_year = year[year != 1986]  # Remove first year, so that shape is the same as yfd
yfd0 = yfd[reduced_year != 1987]
yfd_l0 = yfd_l[reduced_year != 1987]
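
To see what the two transformation matrices do, the sketch below applies them directly to a single made-up individual with four periods. It deliberately bypasses `lm.perm`, whose internals are not shown here, so treat it as an illustration of the matrices only.

```python
import numpy as np

T = 4

# Same constructions as fd(T) and lag(T) above
D = (np.eye(T, k=1) - np.eye(T))[:-1]   # (T-1) x T first-difference matrix
L = np.eye(T, k=-1)[1:]                 # (T-1) x T lag matrix

# One hypothetical individual's participation over four years
y_i = np.array([1.0, 0.0, 0.0, 1.0])

dy = D @ y_i   # y_t - y_{t-1} for the last three periods
ly = L @ y_i   # y_{t-1} for the last three periods

print(D)
print(L)
print(dy, ly)
```

Both matrices have `T - 1` rows, which is why the first year has to be dropped from `year` to match the transformed arrays.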

# Part 3: System 2SLS
As mentioned earlier, at time $t$ we have $LMP_{it-2}^{\textbf{o}} = (LMP_{i0}, LMP_{i1}, \dotsc, LMP_{it-2})$ available as instruments. To take advantage of these instruments, we could create the following instrument matrix,

$$
\mathbf{Z^{\mathbf{o}}} = 
\begin{bmatrix}
    y_{i0} & 0 & 0 & 0 & 0 & 0 & \cdots & 0 \\
    0 & y_{i0} & y_{i1} & 0 & 0 & 0 & \cdots & 0 \\
    0 & 0 & 0 & y_{i0} & y_{i1} & y_{i2} & \cdots & 0 \\
    \vdots  & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & 0 & 0 & 0 & 0 & \cdots & \mathbf{y^o_{it-2}} \\
\end{bmatrix}
\begin{pmatrix}
t = 2 \\
t = 3 \\
t = 4 \\
\vdots \\
t = T
\end{pmatrix}.
$$

We will use the `gmm` module for this part, but parts of the `lm` module will be used inside the `gmm` module.
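
As a rough sketch of the block structure, the snippet below builds such a matrix for one individual. The function name `sequential_instruments_one` is hypothetical, and this is not the implementation expected in `gmm.sequential_instruments`, which works on the stacked data for all individuals.

```python
import numpy as np

def sequential_instruments_one(y_lags):
    """Build the sequential instrument block for ONE individual.

    y_lags: 1-D array (y_i0, y_i1, ..., y_i,T-2). Row r of the result
    holds y_i0..y_ir in its own group of columns, and zeros elsewhere.
    """
    n_rows = len(y_lags)
    n_cols = n_rows * (n_rows + 1) // 2   # 1 + 2 + ... + n_rows
    Z = np.zeros((n_rows, n_cols))
    start = 0
    for r in range(n_rows):
        Z[r, start:start + r + 1] = y_lags[:r + 1]
        start += r + 1
    return Z

# Made-up values 1, 2, 3 so the block pattern is easy to read
Z_i = sequential_instruments_one(np.array([1.0, 2.0, 3.0]))
print(Z_i)
```

With three usable periods this gives 1 + 2 + 3 = 6 instrument columns per individual.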

## Super short introduction to S2SLS (also read Wooldridge (2010), chapters 5 and 8)

Since we have a different number of instruments in each period, we cannot use them all with pooled IV (PIV). We therefore have to use system two-stage least squares (S2SLS).

To recap what ("normal") 2SLS looks like, eq. (3) would be estimated as follows if we used 2SLS:
* In the first-stage regression, we regress $\mathbf{x}_K$ on $\mathbf{Z}$, and then calculate the predicted $\hat{\mathbf{x}}_K$.
* We use these predicted $\hat{\mathbf{x}}_K$ in the second-stage regression, so for eq. (3) we regress $\mathbf{y}$ on $\mathbf{x}_1, \dotsc, \mathbf{x}_{K-1}, \hat{\mathbf{x}}_K$.

So for S2SLS, we perform a first-stage regression for each time period and store the predicted $\hat{\mathbf{x}}_K$ from each one. We then combine these predictions into one array (keeping the time periods sorted correctly for each person) and use it in the second-stage regression.
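
The two stages of ordinary 2SLS can be sketched on simulated data as follows. Everything here (the data-generating process, the coefficient 0.5, the helper `ols`) is an assumption made for illustration and is unrelated to the PSID data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated data: x is endogenous because it shares the shock v with the error u
z = rng.normal(size=(n, 1))            # instrument
v = rng.normal(size=(n, 1))
u = 0.8 * v + rng.normal(size=(n, 1))  # structural error, correlated with x
x = 1.0 * z + v
y = 0.5 * x + u                        # true coefficient is 0.5

def ols(y, x):
    # Least-squares coefficients of y on x
    return np.linalg.lstsq(x, y, rcond=None)[0]

b_ols = ols(y, x)        # inconsistent because of the endogeneity

x_hat = z @ ols(x, z)    # first stage: fitted values of x given z
b_2sls = ols(y, x_hat)   # second stage: regress y on the fitted values

print(b_ols.item(), b_2sls.item())
```

Note that standard errors from a naive second-stage regression are not valid, because they are based on residuals computed with $\hat{\mathbf{x}}_K$ rather than $\mathbf{x}_K$.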

### Question 1: Create the level instrument matrix $\mathbf{Z^{\mathbf{o}}}$

Finish the function `sequential_instruments` in order to create the instrument matrix $\mathbf{Z^{\mathbf{o}}}$ using the second lag of LMP in **levels**. Note that you will not get one array that looks like $\mathbf{Z^{\mathbf{o}}}$, but an array that contains a block that looks like $\mathbf{Z^{\mathbf{o}}}$ for each individual in the data. Since we have four time periods, and access to $y_{i0}$, you should get three rows of instruments for each individual.

If it is too difficult to create the instrument matrix $\mathbf{Z^{\mathbf{o}}}$, you can in this instance hard-code it instead. This can be done by using the year column for boolean indexing in Python. You can then use `np.hstack` to create the necessary columns for the first-stage regressions. Look at *question 2* below if you are uncertain about how many columns you should have for each regression.
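
A minimal sketch of the hard-coded route, using made-up data for two individuals. The arrays `lmp` and `year` are hypothetical; with the real data you would need to check which year each lagged value actually refers to.

```python
import numpy as np

# Toy long-format data: two individuals, four years each (hypothetical values)
year = np.tile(np.arange(1986, 1990), 2)
lmp = np.array([1, 1, 0, 1, 0, 0, 1, 1], dtype=float).reshape(-1, 1)

# Boolean indexing picks out one observation per individual for a given year
lmp_1986 = lmp[year == 1986]
lmp_1987 = lmp[year == 1987]
lmp_1988 = lmp[year == 1988]

# One, two and three instrument columns for the three first-stage regressions
z_1987 = lmp_1986
z_1988 = np.hstack((lmp_1986, lmp_1987))
z_1989 = np.hstack((lmp_1986, lmp_1987, lmp_1988))

print(z_1987.shape, z_1988.shape, z_1989.shape)
```

Each array has one row per individual, which is the shape the period-by-period first-stage regressions need.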

In [None]:
instrument_matrix = gmm.sequential_instruments(y_2l, T)
print(instrument_matrix)

To prepare for the next question, I have created a function that retrieves one time period for each person, stores each period in a separate array, and finally returns these in a list. (I also keep the array for all time periods, since we need that for the second-stage regression.)

In [None]:
def xy_eq_by_eq(x, t):
    equation = []
    # Reshape, so that each column is each time period
    xt = x.reshape(-1, t)

    # I then transpose this, so that each time period is in rows
    for row in xt.T:
        # I then put each time period in a separate array, that I store in the list.
        equation.append(row.reshape(-1, 1))
    # We need individual observations for all years in the second stage regression,
    # so let us keep that at the end of the list.
    equation.append(x)
    return equation
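
The reshape-and-transpose step can be checked on a tiny made-up example with two individuals and three periods. The values are chosen so that the tens digit identifies the person and the units digit the period.

```python
import numpy as np

t = 3
# Person-sorted column: person 1 (10, 11, 12), then person 2 (20, 21, 22)
x = np.array([[10.0], [11.0], [12.0], [20.0], [21.0], [22.0]])

# Each row of x.reshape(-1, t) is one person; transposing gives one row per period
by_period = [row.reshape(-1, 1) for row in x.reshape(-1, t).T]

for p, arr in enumerate(by_period):
    print(p, arr.ravel())
```

This only works if the data are sorted by person and then by period, with the same number of periods for everyone.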

In [None]:
y_arrays = xy_eq_by_eq(yfd, T-1)
x_arrays = xy_eq_by_eq(yfd_l, T-1)

### Question 2: Use the instruments to perform 2SLS

Now that you have the necessary instruments, use these to estimate eq. (2), with all possible lagged levels as instruments.

I recommend that you create two functions for this. The first, let's call it `first_stage`, simply performs the first-stage regression given y and x, and should return the predicted $\hat{\mathbf{x}}_K$, which in this case is the lagged first difference of LMP.

The second function, let us call it `system_2sls`, loops over the lists of y, x and z arrays that are given as inputs, and performs the `first_stage` regression on each array (remember, each array has all observations from one time period only). Since the `first_stage` function you made returns the predicted $\hat{\mathbf{x}}_K$, you need to keep these predictions and combine them at the end. You can then perform the second-stage regression.

So in our example where we use levels, you should perform three first-stage regressions that look like this:

$$
\begin{align}
\Delta LMP_{i1987} & = \rho_{11}LMP_{i1986} + u_{it} \\
\Delta LMP_{i1988} & = \rho_{21}LMP_{i1986} + \rho_{22}LMP_{i1987} + u_{it} \\
\Delta LMP_{i1989} & = \rho_{31}LMP_{i1986} + \rho_{32}LMP_{i1987} + \rho_{33}LMP_{i1988} + u_{it} \\
\end{align}
$$

Then, for each regression, you also need to compute the predicted values. These need to be stored and combined at the end, sorted by person and then by year, so that you get a column that looks like this,

$$
\underset{N(T - 1) \times 1}{
\begin{pmatrix}
    \Delta \widehat{LMP}_{1, 1987} \\
    \Delta \widehat{LMP}_{1, 1988} \\
    \Delta \widehat{LMP}_{1, 1989} \\
    \Delta \widehat{LMP}_{2, 1987} \\
    \Delta \widehat{LMP}_{2, 1988} \\
    \Delta \widehat{LMP}_{2, 1989} \\
    \vdots \\
    \Delta \widehat{LMP}_{N, 1987} \\
    \Delta \widehat{LMP}_{N, 1988} \\
    \Delta \widehat{LMP}_{N, 1989} \\
\end{pmatrix}
}
$$

So in the end, you perform a second stage regression that looks like this

$$
    \Delta LMP_{it} = \rho\Delta \widehat{LMP}_{it-1} + \Delta u_{it}
$$
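
The whole procedure can be sketched on simulated data as below. The names `first_stage` and `system_2sls_sketch`, the data-generating process and the true coefficient 0.3 are all assumptions for illustration; `gmm.system_2sls` may be organised differently, and this sketch returns only the point estimate, with no standard errors.

```python
import numpy as np

def first_stage(x, z):
    # Regress x on z and return the fitted values
    pi_hat = np.linalg.lstsq(z, x, rcond=None)[0]
    return z @ pi_hat

def system_2sls_sketch(y_arrays, x_arrays, z_arrays):
    # The last element of y_arrays / x_arrays is the full person-sorted array,
    # the earlier elements hold one time period each (as in xy_eq_by_eq).
    x_hats = [first_stage(x_t, z_t) for x_t, z_t in zip(x_arrays[:-1], z_arrays)]
    # Columns are periods; reshaping row by row restores person-then-period order
    x_hat = np.column_stack(x_hats).reshape(-1, 1)
    return np.linalg.lstsq(x_hat, y_arrays[-1], rcond=None)[0]

# Simulated example: 1, 2 and 3 instruments in periods 0, 1 and 2
rng = np.random.default_rng(1)
n, rho = 5_000, 0.3
x_cols, y_cols, z_arrays = [], [], []
for p in range(3):
    z_p = rng.normal(size=(n, p + 1))
    v = rng.normal(size=(n, 1))
    x_p = z_p.sum(axis=1, keepdims=True) + v             # endogenous regressor
    y_p = rho * x_p + 0.8 * v + rng.normal(size=(n, 1))  # error shares v with x
    x_cols.append(x_p)
    y_cols.append(y_p)
    z_arrays.append(z_p)

x_arrays = x_cols + [np.column_stack(x_cols).reshape(-1, 1)]
y_arrays = y_cols + [np.column_stack(y_cols).reshape(-1, 1)]

rho_hat = system_2sls_sketch(y_arrays, x_arrays, z_arrays)
print(rho_hat.item())
```

Stacking the fitted values with `np.column_stack(...).reshape(-1, 1)` is what keeps each person's periods together, matching the ordering of the stacked column shown above.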

In [None]:
# I have created the helper function below. If you managed to create the instrument matrix, you can use `IV_eq_by_eq` to get arrays with the correct number of columns for each time period.

In [None]:
def IV_eq_by_eq(instrument_matrix, reduced_year):
    z_1987 = instrument_matrix[reduced_year == 1987, 0].reshape(-1, 1)
    z_1988 = instrument_matrix[reduced_year == 1988, 1:3]
    z_1989 = instrument_matrix[reduced_year == 1989, 3:]
    return (z_1987, z_1988, z_1989)
z_arrays = IV_eq_by_eq(instrument_matrix, reduced_year)

In [None]:
s2sls_result = gmm.system_2sls(y_arrays, x_arrays, z_arrays)
lm.print_table(
    ('yfd', ['yfd_l']), s2sls_result, title='System 2SLS results', floatfmt='.4f'
)

System 2SLS results <br>
Dependent variable: yfd

|       |   Beta |     Se |   t-values |
|-------|--------|--------|------------|
| yfd_l | 0.2390 | 0.0166 |    14.3891 |
R² = 0.000 <br>
σ² = 0.158

### Question 3
Repeat questions 1 and 2, but now with first differences as instruments. In other words, perform 2SLS with twice-lagged first differences of LMP as instruments.

If you managed to finish the `gmm.sequential_instruments` function, some of the code is given below. To create the y, x and z arrays, use the provided `xy_eq_by_eq` function.

In [None]:
reduced_year2 = reduced_year[reduced_year != 1987]
instrument_matrix = gmm.sequential_instruments(yfd_l2, T - 1)
def IV_eq_by_eq2(instrument_matrix, reduced_year):
    z_1988 = instrument_matrix[reduced_year == 1988, 0].reshape(-1, 1)
    z_1989 = instrument_matrix[reduced_year == 1989, 1:]
    return (z_1988, z_1989)

z_arrays = IV_eq_by_eq2(instrument_matrix, reduced_year2)
y_arrays = xy_eq_by_eq(yfd0, T-2)
x_arrays = xy_eq_by_eq(yfd_l0, T-2)

In [None]:
s2sls_result = gmm.system_2sls(y_arrays, x_arrays, z_arrays)
lm.print_table(
    ('yfd', ['yfd_l']), s2sls_result, title='System 2SLS results', floatfmt='.4f'
)

Your table should look something like this:

System 2SLS results <br>
Dependent variable: yfd

|       |   Beta |     Se |   t-values |
|-------|--------|--------|------------|
| yfd_l | 0.1959 | 0.0197 |     9.9622 |
R² = 0.000 <br>
σ² = 0.152

# Part 4: GMM
Coming up

In [None]:
gmm_result = gmm.est_gmm(
    W=la.inv(instrument_matrix.T @ instrument_matrix),
    y=yfd0,
    x=yfd_l0,
    z=instrument_matrix,
    t=T - 2,
    step=1,
)

In [None]:
lm.print_table(('yfd', ['yfd_l']), gmm_result, title='GMM results', floatfmt='.4f', tablefmt='github')