In [2]:
## Preamble: Package Loading
import numpy as np
import ipywidgets as ipw
from IPython.display import display
import matplotlib.pyplot as plt
from matplotlib import gridspec
import pandas as pd
import json
import kernel as kr
import psc_dbl_sumdisp as psd 

<h1> Panel Selection and Control: Monte Carlo Results </h1>

<h2> Summary </h2>

This notebook contains the results of a Monte Carlo exercise conducted on the estimator detailed in 'psc.ipynb' and 'psc_proposal.pdf', using data sets generated by 'psc_dgp.ipynb' (see that notebook for details of the DGP). 

Important features of each of the following trials are:

* In all data sets the endogenous variables $Z_1$ have been generated by secondary equations of panel fixed effects type, corresponding to sections 3.3 and 3.4 of 'psc_dgp.ipynb'. 


* All estimates have been generated with the knowledge that the secondary equations are panel type (i.e. the estimation of the secondary equations is properly specified). 


* No subset selection (lasso/SCAD) has been used to generate the following results; this will come later. 


* The number of data sets used for each component of each trial is 'nds = 1000'.

<a id='index'></a>

<h2> Index </h2>
<ul>
    <li> <a href='#trial_1'> Trial Set 1: Varying the Number of Time Periods </a> <br>
        <br>
    <ul> 
        <li> <a href='#trial_11'> Trial Set 1.1: Varying the Number of Time Periods, Known Subset </a> <br>
        <br>
        <li> <a href='#trial_12'> Trial Set 1.2: Varying the Number of Time Periods, Lasso </a> <br>
        <br>
    </ul>
    <li> <a href='#trial_2'> Trial Set 2: Varying the number of Cross Sections </a> <br>
       <br>
    <ul> 
    <li> <a href='#trial_21'> Trial Set 2.1: Varying the number of Cross Sections, Known Subset </a> <br>
    <br>
    <li> <a href='#trial_22'> Trial Set 2.2: Varying the number of Cross Sections, Lasso </a> <br>
    <br>
    </ul>
    <li> <a href='#trial_3'> Trial Set 3:  Known Subset vs. Unknown Subset vs. Lasso with $t_{inst} = 5$</a> <br>
              <br>
    <li> <a href='#trial_4'> Trial Set 4: Known Subset vs. Unknown Subset vs. Lasso with $t_{inst} = 10$</a> <br>
        <br>
    <li> <a href='#trial_5'> Trial Set 5: Known Subset vs. Unknown Subset vs. Lasso with $t_{inst} = 20$</a> <br>
        <br>
    <li> <a href='#trial_6'> Trial Set 6: Two Instruments per Cross Section: Unknown Subset vs. Lasso </a> <br><br>
    <ul>
        <li> <a href='#trial_61'>  Trial Set 6.1:  Unknown Subset vs. Lasso, $ncs = 15,\;\; t_{inst} = 30$ </a> <br>
            <br>
        <li> <a href='#trial_62'>  Trial Set 6.2: Unknown Subset vs. Lasso, $ncs = 25,\;\; t_{inst} = 50$ </a> <br> 
            <br>
        <li> <a href='#trial_63'>   Trial Set 6.3: Unknown Subset vs. Lasso, $ncs = 35,\;\; t_{inst} = 70$ </a> <br> <br>
        <li> <a href='#trial_64'>   Trial Set 6.4: Lasso Comparison </a> <br> <br>
    </ul> 
    <li> <a href='#trial_7'> Trial Set 7: Five Instruments per Cross Section: Unknown Subset vs. Lasso </a> <br><br>
    <ul>
        <li> <a href='#trial_71'>  Trial Set 7.1:  Unknown Subset vs. Lasso, $ncs = 10,\;\; t_{inst} = 50$ </a> <br>
            <br>
        <li> <a href='#trial_72'>  Trial Set 7.2: Unknown Subset vs. Lasso, $ncs = 20,\;\; t_{inst} = 100$ </a> <br> 
            <br>
        <li> <a href='#trial_73'>   Trial Set 7.3: Unknown Subset vs. Lasso, $ncs = 30,\;\; t_{inst} = 150$ </a> <br><br>
        <li> <a href='#trial_74'>   Trial Set 7.4: Lasso Comparison </a> <br> <br>
    </ul> 
</ul>

<h3> Variable Description Table </h3>

A number of variables are used below; their descriptions follow. Refer back to 'psc.ipynb' or 'psc_dgp.ipynb' for more details.

Variable Name  |  Description  
--|--
k_H| Kernel number used for H function Estimation  
c_H |  Plug in bandwidth constant for H function Estimation
k_mvd  | Kernel number used for multivariate d>2 density estimation
c_mvd|  Plug in bandwidth constant for multivariate d>2 density estimation
k_uvd  |  Kernel number used for bivariate density  estimation 
c_uvd |  Plug in bandwidth used for bivariate density estimation
dep_nm|  Variable name of the dependent variable
en_nm |  Variable names of each endogenous variable
ex_nm |  Variable names of each exogenous variable
in_nm |  Variable names of instruments relevant to each cross section
err_vpro|  Vector of covariances used to construct the error cov matrix
ex_vpro|  Vector of covariances used to construct the exog variable cov matrix
inst_vpro | Vector of covariances used to construct the instrument cov matrix
frc |  Indicator for whether the functional form of control function is forced
input_filename|  Filename of dataset used to generate the results. 
kwnsub  | Indicator for whether the subset of instruments relevant to each cross section is known
n_end  |  Number of endogenous variables 
n_exo|  Number of exogenous variables
ncs  |  Number of cross sections
nds  |  Number of dgp data sets
ntp |  Number of time periods
orcl |  Indicator for whether residuals $V$ are observed (=1) or not
r_seed|  Random number generator seed used to generate the data set
sec_pan|  Indicator for whether the secondary eqn data is panel or not
c_inst  |  Number of instruments relevant to each cross section   
t_inst|  Total number of instruments
inc | List of instruments relevant to at least one cross section
tin  |  Variable name of the time period index
cin  |  Variable name of the cross section index 
lasso | Indicator for lasso estimation
alph | lasso penalty value
epsil | Threshold for averaging "non zero" coefficients

<a href='#index'> Back to Index </a>
<a id='trial_1'></a>

<h2> Trial Set 1: Varying the number of Time Periods $T \in \{30,50,70\}$ </h2>

<a href='#index'> Back to Index </a>
<a id='trial_11'></a>

<h2> Trial Set 1.1: Varying the Number of Time Periods, Known Subset </h2> 

Here we examine the sampling distribution of $\hat{\beta}_1, \hat{\alpha}_{1}$ and $\hat{\alpha}_{2}$ as the number of time periods $T$ increases, i.e. for $T \in \{30,50,70\}$, while holding the following constant (among others shown below).

* Number of Cross Sections: 5


* Number of Endogenous Regressors: 2


* Number of Exogenous Regressors: 2


* Total Number of Instruments: 5


* Number of Instruments Relevant to Each Cross Section: 3


* Set of instruments relevant to each cross section is known

<h3> Trial Set 1.1: Data Loading and Organization </h3> 

The following cell extracts and organizes all relevant information from the results data sets whose file names are listed here. 

In [4]:
# Results data sets included in trial #1
inpt_filenames0 = ['pscout_8_23_1040.json' ,'pscout_8_23_1693.json' ]
# Legend labels
line_nms0 = ['Unknown', 'Lasso']

# Each loaded result is indexed as: 0 estimator inputs, 1 beta estimates,
# 2 beta tables, 3 alpha estimates, 4 alpha tables.
res_out0 = [psd.psc_load(fn) for fn in inpt_filenames0]
estin_dcts0 = [res[0] for res in res_out0]
# The DGP summary file name is the data file name with 'pscdata' -> 'pscsum'.
dgp_sum_filenames0 = [dct['input_filename'].replace('pscdata', 'pscsum')
                      for dct in estin_dcts0]
dgp_dicts0 = [psd.pscsum_load(fn) for fn in dgp_sum_filenames0]
dgpin_dcts0 = [dgp[0] for dgp in dgp_dicts0]
# Merge estimator and DGP inputs; DGP values win on duplicate keys.
merged_dcts0 = [{**est, **dgp} for est, dgp in zip(estin_dcts0, dgpin_dcts0)]
true_bcoeffs0 = [dgp[1] for dgp in dgp_dicts0]
true_acoeffs0 = [dgp[2] for dgp in dgp_dicts0]
bcoeff0 = [res[1] for res in res_out0]
acoeff0 = [res[3] for res in res_out0]
btables0 = [res[2] for res in res_out0]
atables0 = [res[4] for res in res_out0]

ValueError: Length mismatch: Expected axis has 4 elements, new values have 3 elements
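
The message above has the form pandas raises when a list of labels is assigned to an axis of a different length. Where it originates inside the 'psd' loaders is not shown here, so the following is only a minimal reproduction of the same message on a hypothetical four-column table, not a diagnosis of this cell.

```python
import pandas as pd

# Hypothetical summary table with four columns (not taken from the results files).
table = pd.DataFrame([[0.1, 0.2, 0.3, 0.4]])

try:
    # Assigning three labels to a four-column axis triggers the error.
    table.columns = ["nm 1", "nm 2", "nm 3"]
except ValueError as err:
    message = str(err)
    print(message)
```

If the cause is similar, comparing the number of labels being assigned with the shape of the table being relabelled would be a reasonable first check.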

<h3> Trial Set 1.1: Merged DGP and Estimator Function Input Dictionary Comparison </h3> 

Here I have merged the dictionary used to generate the underlying data set (its file name appears below) with the dictionary used to produce the estimates based on that data. 

Below is a slider that selects which merged dictionary to summarize, indexed by the position at which its results file name appears in 'inpt_filenames0' above. 

In accordance with the trial description, the only differences that should exist are the number of time periods (ntp) and the file name of the data set used to generate the results. 
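
The merge in the loading cell uses dictionary unpacking, so where both dictionaries share a key the value from the right-hand (DGP) dictionary is kept. The sketch below illustrates this and adds a small helper for listing the keys that differ across merged dictionaries. The dictionaries, file names and the helper 'differing_keys' are hypothetical stand-ins, not part of 'psd'.

```python
# Hypothetical stand-ins for one estimator-input and one DGP-input dictionary.
estin = {"input_filename": "pscdata_example.json", "ntp": 30, "lasso": 0}
dgpin = {"ntp": 30, "ncs": 5, "r_seed": 1}

# Same construction as in the loading cell: keys from the right-hand
# dictionary overwrite duplicate keys from the left-hand one.
merged = {**estin, **dgpin}

def differing_keys(dicts):
    """Return the sorted keys whose values are not identical across all dicts."""
    keys = set().union(*dicts)
    return sorted(k for k in keys if len({repr(d.get(k)) for d in dicts}) > 1)

# A second hypothetical trial that differs only in 'ntp' and the data file name.
other = {**merged, "ntp": 50, "input_filename": "pscdata_other.json"}
print(differing_keys([merged, other]))
```

Applied to the merged dictionaries of a trial, a check like this should return only the keys the trial description says are allowed to vary.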

In [3]:
psd.indict_dsp(merged_dcts0,1)

<h3> Trial Set 1.1: True Secondary Equation Coefficients Comparison </h3> 

Here I interactively display the coefficient vectors $\alpha_{1jd}$ used to generate each data set (rows indicate cross section and equation), selected by the position at which its file name appears in 'inpt_filenames0' above. They should also be identical across data sets. 

**Note:** 

1.) Since 'sec_pan = 1' above, the secondary equations are panel type, so all non-zero coefficients in a column should be identical. 

2.) A zero coefficient in the following matrix means that the instrument it multiplies is not relevant to that cross section. 

3.) In accordance with the description above, the coefficients should be identical across results data sets.
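
The first two notes can be checked mechanically. The following is a minimal sketch on a hypothetical coefficient matrix for a single secondary equation, assuming rows are cross sections and columns are instruments. The values and the helper 'panel_type_ok' are illustrative only and are not read from the DGP files.

```python
import numpy as np

# Hypothetical 3 cross sections x 4 instruments coefficient matrix;
# zeros mark instruments that are not relevant to that cross section.
alpha = np.array([
    [0.5, 1.2, 0.0, 0.0],
    [0.5, 1.2, 0.8, 0.0],
    [0.5, 0.0, 0.8, 0.0],
])

def panel_type_ok(a):
    """True if, within every column, all non-zero entries are equal."""
    for col in a.T:
        nz = col[col != 0]
        if nz.size and not np.allclose(nz, nz[0]):
            return False
    return True

# Number of cross sections to which each instrument is relevant.
relevant_counts = (alpha != 0).sum(axis=0)
# Indices of instruments relevant to at least one cross section.
inc = np.flatnonzero(relevant_counts > 0)
print(panel_type_ok(alpha), relevant_counts, inc)
```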


In [4]:
psd.indict_dsp(true_acoeffs0,2)

<h3> Trial Set 1.1: Secondary Function Coefficient Estimates </h3>

Here I interactively show the sampling distribution of the elements of $\hat{\alpha}_{dj}$.  

In [5]:
display(psd.cfs_dsp(acoeff0,atables0,2,5,line_nms0))

<h3> Trial Set 1.1: Comments on Secondary Function Coefficient Estimates </h3>
    
* The changes in the properties of the sampling distribution of each coefficient are in line with what we would expect from a consistent estimator: the sample variance and mean squared error decrease as the number of time periods increases ($ntp \rightarrow \infty$).  


* Another feature evident from the above is that the variance of each coefficient is inversely related to the number of cross sections to which the instrument it multiplies is relevant. For example, $\hat{\alpha}_{d1,1}$ and $\hat{\alpha}_{d1,2}$ have the smallest variance since they are relevant to all cross sections, followed by $\hat{\alpha}_{d2,1}$ (relevant to 4 cross sections), followed by $\hat{\alpha}_{d2,4}$ and $\hat{\alpha}_{d2,5}$ (relevant to 3 cross sections), lastly followed by $\hat{\alpha}_{d2,4}$ (relevant to only 2 cross sections).
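
As a point of reference for what "behaving like a consistent estimator" looks like in these summaries, the sketch below runs a toy Monte Carlo for a single-regressor OLS slope at the same three sample lengths. It is not the estimator studied in this notebook; the DGP, the seed and the names 'ols_draws' and 'summary' are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_alpha = 1.0
nds = 1000  # replications per design, matching 'nds' in this notebook

def ols_draws(ntp):
    """Return nds OLS slope estimates from samples with ntp observations."""
    x = rng.normal(size=(nds, ntp))
    y = true_alpha * x + rng.normal(size=(nds, ntp))
    return (x * y).sum(axis=1) / (x * x).sum(axis=1)

summary = {}
for ntp in (30, 50, 70):
    est = ols_draws(ntp)
    bias = est.mean() - true_alpha
    var = est.var()
    # Mean squared error decomposes as variance plus squared bias.
    summary[ntp] = {"bias": bias, "var": var, "mse": var + bias ** 2}
    print(ntp, summary[ntp])
```

In this toy setting the variance and mean squared error fall as the sample length grows. The second bullet is consistent with the same mechanism, on the assumption that a coefficient shared by more cross sections is estimated from more pooled observations.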

<h3> Trial Set 1.1: True Primary Equations Coefficients Comparison </h3>

Here I interactively display the coefficient vector $\beta_1$ used to generate each data set, selected by the position at which its file name appears in 'inpt_filenames0' above. These should be identical across data sets. 

In [6]:
psd.indict_dsp(true_bcoeffs0,1)

<h3> Trial Set 1.1: Primary Function Coefficient Estimates </h3>

Here I show the sampling distribution of the elements of $\hat{\beta}_1$.  

In [7]:
display(psd.cfs_dsp(bcoeff0,btables0,1,12,line_nms0))

<h3> Trial Set 1.1: Comments on Primary Function Coefficient Estimates </h3>

1.) The sampling distributions behave as we would expect for a consistent estimator: the sample variance and mean squared error of all coefficients decrease as the number of time periods increases.  

2.) The sample variances of the coefficients multiplying the endogenous regressors are much larger than those of the coefficients multiplying the exogenous regressors. Given the DGP this makes sense: the exogenous regressors are not correlated with the error term $\varepsilon$, so their coefficients are identified without the need to estimate $V$.

<a href='#index'> Back to Index </a>
<a id='trial_12'></a>

<h2>Trial Set X.X: Description </h2> 

Here we examine the sampling distribution of $\hat{\beta}_1, \hat{\alpha}_{1}$ and $\hat{\alpha}_{2}$ .

* Number of Cross Sections: 


* Number of Endogenous Regressors: 


* Number of Exogenous Regressors: 


* Total Number of Instruments: 


* Number of Instruments Relevant to Each Cross Section: 



<h3> Trial Set X.X: Data Loading and Organization </h3> 

The following cell extracts and organizes all relevant information from the results data sets whose file names are listed here. 

In [8]:
# Results data sets included in trial #1
inpt_filenames0Y = ['pscout_X_XX_XXXX.json','pscout_X_XX_XXXX.json','pscout_X_XX_XXXX.json']
# Legend labels
line_nms0Y = ['nm 1', 'nm 2' ,'nm 3']

# Each loaded result is indexed as: 0 estimator inputs, 1 beta estimates,
# 2 beta tables, 3 alpha estimates, 4 alpha tables.
res_out0Y = [psd.psc_load(fn) for fn in inpt_filenames0Y]
estin_dcts0Y = [res[0] for res in res_out0Y]
# The DGP summary file name is the data file name with 'pscdata' -> 'pscsum'.
dgp_sum_filenames0Y = [dct['input_filename'].replace('pscdata', 'pscsum')
                       for dct in estin_dcts0Y]
dgp_dicts0Y = [psd.pscsum_load(fn) for fn in dgp_sum_filenames0Y]
dgpin_dcts0Y = [dgp[0] for dgp in dgp_dicts0Y]
# Merge estimator and DGP inputs; DGP values win on duplicate keys.
merged_dcts0Y = [{**est, **dgp} for est, dgp in zip(estin_dcts0Y, dgpin_dcts0Y)]
true_bcoeffs0Y = [dgp[1] for dgp in dgp_dicts0Y]
true_acoeffs0Y = [dgp[2] for dgp in dgp_dicts0Y]
bcoeff0Y = [res[1] for res in res_out0Y]
acoeff0Y = [res[3] for res in res_out0Y]
btables0Y = [res[2] for res in res_out0Y]
atables0Y = [res[4] for res in res_out0Y]

<h3> Trial Set X.X: Merged DGP and Estimator Function Input Dictionary Comparison </h3> 

Here I have merged the dictionary used to generate the underlying data set (its file name appears below) with the dictionary used to produce the estimates based on that data. 

Below is a slider that selects which merged dictionary to summarize, indexed by the position at which its results file name appears in 'inpt_filenames0Y' above. 

In accordance with the trial description, the only differences that should exist are the number of time periods (ntp) and the file name of the data set used to generate the results. 

In [9]:
psd.indict_dsp(merged_dcts0Y,1)

<h3> Trial Set X.X: True Secondary Equation Coefficients Comparison </h3> 

Here I interactively display the coefficient vectors $\alpha_{1jd}$ used to generate each data set (rows indicate cross section and equation), selected by the position at which its file name appears in 'inpt_filenames0Y' above. They should also be identical across data sets. 

**Note:** 

1.) Since 'sec_pan = 1' above, the secondary equations are panel type, so all non-zero coefficients in a column should be identical. 

2.) A zero coefficient in the following matrix means that the instrument it multiplies is not relevant to that cross section. 

3.) In accordance with the description above, the coefficients should be identical across results data sets.


In [10]:
psd.indict_dsp(true_acoeffs0Y,2)

<h3> Trial Set X.X: Secondary Function Coefficient Estimates </h3>

Here I interactively show the sampling distribution of the elements of $\hat{\alpha}_{dj}$.  

In [11]:
display(psd.cfs_dsp(acoeff0Y,atables0Y,2,5,line_nms0Y))

<h3> Trial Set X.X: Comments on Secondary Function Coefficient Estimates </h3>
<ul>
    <li> Due to the shrinkage inherent in the lasso estimator, the bias of the coefficients is substantial and, in nearly half the cases, growing. However, the variance of each is shrinking as the number of time periods grows. <br>
        <br>
</ul>
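
To illustrate why a lasso estimate can remain biased while its variance shrinks, the sketch below uses the closed-form lasso solution for a single regressor (soft thresholding of the sample moment) with a penalty held fixed across sample lengths. This is a toy setting and not the estimator used in this notebook; the DGP, the penalty value and the names 'soft_threshold', 'lasso_draws' and 'lasso_summary' are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
true_alpha, penalty, nds = 1.0, 0.2, 1000  # illustrative values only

def soft_threshold(z, lam):
    """Shrink z toward zero by lam, setting it to zero if |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_draws(ntp):
    """Single-regressor lasso for the objective (1/(2n))*RSS + penalty*|b|."""
    x = rng.normal(size=(nds, ntp))
    y = true_alpha * x + rng.normal(size=(nds, ntp))
    sxy = (x * y).mean(axis=1)
    sxx = (x * x).mean(axis=1)
    return soft_threshold(sxy, penalty) / sxx

lasso_summary = {}
for ntp in (30, 50, 70):
    est = lasso_draws(ntp)
    lasso_summary[ntp] = {"bias": est.mean() - true_alpha, "var": est.var()}
    print(ntp, lasso_summary[ntp])
```

With the penalty fixed, the downward bias in this sketch stays at roughly the size of the penalty at every sample length, while the variance falls as the sample grows.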

<h3> Trial Set X.X: True Primary Equations Coefficients Comparison </h3>

Here I interactively display the coefficient vector $\beta_1$ used to generate each data set, selected by the position at which its file name appears in 'inpt_filenames0Y' above. These should be identical across data sets. 

In [12]:
psd.indict_dsp(true_bcoeffs0Y,1)

<h3> Trial Set X.X: Primary Function Coefficient Estimates </h3>

Here I show the sampling distribution of the elements of $\hat{\beta}_1$.  

In [13]:
display(psd.cfs_dsp(bcoeff0Y,btables0Y,1,12,line_nms0Y))

<h3> Trial Set X.X: Comments on Primary Function Coefficient Estimates </h3>

<ul>
    <li> The behavior here is the same as in the known subset estimation in Trial Set 1.1.
        <br><br> 
</ul>

<a href='#index'> Back to Index </a>
<a id='trial_2'></a>