Merge grmpy-semipar in master branch

* Adding grmpy-semipar.
    new file: develop_test.ipynb
    modified: environment.yml
    modified: grmpy/check/check.py
    modified: grmpy/estimate/estimate.py
    new file: grmpy/estimate/estimate_par.py
    new file: grmpy/estimate/estimate_semipar.py
    new file: replication_grmpy_semipar.yml
    modified: requirements.txt
    modified: setup.py
    git commit -m "Adding grmpy-semipar to the estimation section"
* Did some reorganizations and included KernReg folder.
* Trying to let Travis run develop_test.ipynb.
* Create auxiliary files separately for par and semipar. Delete estimate_auxiliary.
* Updated documentation. Further updates will follow.
* Adjusted environment file.
* Modified test section.
* Working on Travis issues.
* Working on test section.
* Worked again on the check functions.
* Updated Tutorial section in the documentation.
* Further changes to the tutorial file.
* Fixed table format.
* Changed Estimation table.
* Updated the chapter Reliability in the documentation.
* Updated the chapter Reliability in the documentation.
* Included figures for the documentation file.
* Clean up.
* Further clean up.
* Minor changes to the check section.
* Undid some changes in the test section.
* Included tutorial notebook for the semiparametric estimation.
* Included tutorial notebook for the semiparametric estimation process.
* Final adjustments to tutorial semipar, updated references in docs.
* Improved code quality.
* Fixed some Code Quality issues.
* Fixed further Code Quality issues.
* Ironing out linter issues.
* There is an issue in the specification of travis_runner
* Removed duplicates in the check section.
* Worked on check section again.
* Minor changes.
* Weeding out code duplicates.
* Removed unused import statement.
* Removed obsolete parameter from the initialization file.
* Minor additional explanations to the tutorial notebook and the documentation
* Removing shell=True from travis_runner, as this might pose security issues.
* Updated environment.yml in promotion folder.
* Updated environment.yml in promotion folder.
* Updated requirement file and documentation
* Ironed out minor typos in the documentation.
* Found further typos.
* Updated links in the documentation.
* Included test for the semiparametric estimation process.
* Removed semipar test for now. Fixes will follow.
* Updated resource paths for semipar test.
* Added missing files.
* Trying to fix path. Don't see why file has not been found.
* Don't see why path is not found. Remove semipar test for now.
OpenSourceEconomics · Feb 6, 2020 · a76be2e
1 parent 397fefd
commit a76be2e

Showing 37 changed files with 3,117 additions and 106 deletions.
diff --git a/.gitignore b/.gitignore
@@ -2,6 +2,8 @@
 __pxcache__/
 *.pyc
 .idea/
+.ipynb_checkpoints/
+*.html
+.DS_Store
 docs/build
 promotion/
-
diff --git a/README.md b/README.md
@@ -1,6 +1,8 @@
 # grmpy
 
-``grmpy``  is an open-source Python package for the simulation and estimation of the generalized Roy model. It serves as a teaching tool to promote the conceptual framework of the generalized Roy model, illustrate a variety of issues in the econometrics of policy evaluation, and showcase basic software engineering practices.
+``grmpy``  is an open-source Python package for the simulation and estimation of the generalized Roy model. It serves as a teaching tool to promote the conceptual framework of the generalized Roy model, illustrate a variety of issues in the econometrics of policy evaluation, and showcase basic software engineering practices. <br>
+Marginal Treatment Effects (MTE) can be estimated based on a parametric normal model or,
+alternatively, via the semiparametric method of Local Instrumental Variables (LIV).
 
 Please visit our [online documentation](http://grmpy.readthedocs.io/) for details.
 

diff --git a/development/tests/property/run.py b/development/tests/property/run.py
@@ -10,13 +10,13 @@
 import numpy as np
 
 from grmpy.test.auxiliary import cleanup
-from property_auxiliary import distribute_command_line_arguments
-from property_auxiliary import process_command_line_arguments
-from property_auxiliary import get_random_string
-from property_auxiliary import run_property_test
-from property_auxiliary import print_rslt_ext
-from property_auxiliary import collect_tests
-from property_auxiliary import finish
+from development.tests.property.property_auxiliary import distribute_command_line_arguments
+from development.tests.property.property_auxiliary import process_command_line_arguments
+from development.tests.property.property_auxiliary import get_random_string
+from development.tests.property.property_auxiliary import run_property_test
+from development.tests.property.property_auxiliary import print_rslt_ext
+from development.tests.property.property_auxiliary import collect_tests
+from development.tests.property.property_auxiliary import finish
 
 
 def choose_module(inp_dict):

diff --git a/development/tests/regression/draft.py b/development/tests/regression/draft.py
@@ -12,7 +12,7 @@
 
 import numpy as np
 
-from grmpy.estimate.estimate_auxiliary import (
+from grmpy.estimate.estimate_par import (
     calculate_criteria,
     process_data,
     start_values,

diff --git a/development/tests/regression/run.py b/development/tests/regression/run.py
@@ -9,7 +9,7 @@
 import numpy as np
 
 import grmpy
-from grmpy.estimate.estimate_auxiliary import (
+from grmpy.estimate.estimate_par import (
     calculate_criteria,
     process_data,
     start_values,

diff --git a/docs/source/figures/beta_distribution.png b/docs/source/figures/beta_distribution.png
diff --git a/docs/source/figures/normal_distribution.png b/docs/source/figures/normal_distribution.png
diff --git a/docs/source/figures/replication_carneiroB.png b/docs/source/figures/replication_carneiroB.png
diff --git a/docs/source/refs.bib b/docs/source/refs.bib
@@ -1,5 +1,26 @@
 % Encoding: UTF-8
 
+@article{Fan1994,
+author = {Fan, Jianqing and Marron, James S.},
+title = {Fast Implementations of Nonparametric Curve Estimators},
+journal = {Journal of Computational and Graphical Statistics},
+volume = {3},
+number = {1},
+pages = {35-56},
+year  = {1994}
+}
+
+
+@article{Brave2014,
+	author  = {Brave, S. and Walstrum, T.},
+	year    = {2014},
+	title   = {Estimating marginal treatment effects using parametric and semiparametric methods},
+	journal = {Stata Journal},
+	volume  = {14},
+	number  = {1},
+	pages   = {191--217},
+}
+
 @Article{RoRu1983,
   author  = {Rosenbaum, Paul R. and Donald B. Rubin},
   title   = {The Central Role of the Propensity Score in Observational Studies for Causal Effects},

diff --git a/docs/source/reliability.rst b/docs/source/reliability.rst
@@ -31,10 +31,72 @@ As can be seen from the figure, the OLS estimator underestimates the effect sign
 
 The second figure shows the estimated :math:`B^{ATE}` from the ``grmpy`` estimation process. Conversely to the OLS results the estimate of the average effect is close to the true value even if the unobservables are almost perfectly correlated.
 
-Replication
+
+Sensitivity to Different Distributions of the Unobservables
+-----------------------------------------------------------
+The parametric specification makes the strong assumption that the unobservables follow a joint normal distribution.
+The semiparametric method of local instrumental variables is more flexible, as it does not invoke conditions on the functional form.
+We test how sensitive the two methods are to different distributions of the unobservables.
+To that end, we use a toy model of the returns to college (based on :cite:`Brave2014`), where we know the true shape of the :math:`B^{MTE}`.
+
+Normal Distribution
+^^^^^^^^^^^^^^^^^^^
+
+.. figure:: ../source/figures/normal_distribution.png
+    :align: center
+
+Both specifications come very close to the original curve. The parametric model even gets a perfect fit.
+
+*beta* Distribution
+^^^^^^^^^^^^^^^^^^^
+
+The shape of the *beta* distribution can be flexibly adjusted by the tuning parameters :math:`\alpha` and :math:`\beta`,
+which we set to 4 and 8, respectively.
+
+
+.. figure:: ../source/figures/beta_distribution.png
+    :align: center
+
+The parametric model underestimates the returns to college, whereas the semiparametric :math:`B^{MTE}` still fits the original
+curve well. The latter makes no assumption on the functional form of the unobservables and, thus, is more flexible
+in estimating the parameter of interest when the assumption of joint normality is violated.
+Which model is superior depends on the context. In empirical applications, we recommend examining both.
+
+
+Reliability
 -----------
 
-The second check of reliability compares the results of our estimation process with already existing results from the literature. For this purpose we replicate the results for the marginal treatment effect from Carneiro 2011 (:cite:`Carneiro2011`). Additionally we provide a `jupyter notebook <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/04_grmpy_tutorial_notebook/04_grmpy_tutorial_notebook.ipynb>`_ that runs an estimation based on an `initialization file <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/04_grmpy_tutorial_notebook/files/replication.grmpy.yml>`__ for easy reconstruction of our test setup. The init file corresponds to the specifications of the authors. As shown in the figure below the results are really close to the original results. The deviation seems to be negligible because of the usage of a mock dataset.
+In another check of reliability, we compare the results of our estimation process with already existing results from the literature.
+For this purpose we replicate the results for both the parametric and semiparametric MTE from Carneiro 2011 (:cite:`Carneiro2011`).
+Note that we make use of a mock data set, as the original data cannot be fully recreated from the
+`replication material <https://www.aeaweb.org/articles?id=10.1257/aer.101.6.2754>`_.
+
+We provide two jupyter notebooks for easy reconstruction of the
+`parametric <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/grmpy_tutorial_notebook.ipynb>`_
+as well as the
+`semiparametric <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/tutorial_semipar_notebook.ipynb>`_
+setup.
+The corresponding initialization files can be found
+`here <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/files/replication.grmpy.yml>`_ and
+`here <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/files/tutorial_semipar.yml>`__.
+
+Parametric Replication
+^^^^^^^^^^^^^^^^^^^^^^
+
+As shown in the figure below, the parametric :math:`B^{MTE}` is really close to the original results.
+The deviation seems to be negligible because of the use of a mock dataset.
 
 .. figure:: ../source/figures/fig-marginal-benefit-parametric-replication.png
     :align: center
+
+
+Semiparametric Replication
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. figure:: ../source/figures/replication_carneiroB.png
+    :align: center
+
+The semiparametric :math:`B^{MTE}`  also gets very close to the original curve. However, the 90 percent confidence bands
+(250 bootstrap replications) are wider. As opposed to the parametric model, where the standard error bands are computed
+analytically, confidence bands in the semiparametric setup are obtained via the bootstrap method,
+which is sensitive to the discrepancies in the mock data set.
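
The bootstrap confidence bands described at the end of the `reliability.rst` changes can be illustrated with a minimal sketch. It assumes a pointwise percentile bootstrap over resampled rows; the function name `bootstrap_bands`, the toy line-fit estimator, and the simulated data are hypothetical stand-ins and are not taken from ``grmpy``, whose own procedure may differ.

```python
import numpy as np


def bootstrap_bands(data, estimator, n_boot=250, level=0.90, seed=0):
    """Pointwise percentile-bootstrap bands for a curve-valued estimator.

    Hypothetical helper for illustration; not part of grmpy.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Re-estimate the curve on n_boot resamples drawn with replacement.
    curves = np.array(
        [estimator(data[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    )
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(curves, alpha, axis=0)
    upper = np.quantile(curves, 1.0 - alpha, axis=0)
    return lower, upper


# Evaluation grid for the toy curve (assumed, for illustration only).
grid = np.linspace(0.05, 0.95, 10)


def toy_curve(sample):
    """Toy stand-in for an MTE curve: a fitted line evaluated on the grid."""
    slope, intercept = np.polyfit(sample[:, 0], sample[:, 1], deg=1)
    return intercept + slope * grid


# Simulated data: a noisy, downward-sloping relationship.
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, size=500)
y = 0.5 - 0.3 * x + rng.normal(scale=0.1, size=500)
sample = np.column_stack([x, y])

# 90 percent bands from 250 replications, matching the numbers in the text.
lower, upper = bootstrap_bands(sample, toy_curve, n_boot=250, level=0.90)
```

Because each replication re-estimates the whole curve, the width of the bands reflects how much the estimate moves across resamples, which is consistent with the remark that the bands react to discrepancies in the mock data set.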
diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst
@@ -1,10 +1,13 @@
 Tutorial
-========
+=======================
 
-We now illustrate the basic capabilities of the ``grmpy`` package. We start with the assumptions about functional form and the distribution of unobservables and then turn to some simple use cases.
+We now illustrate the basic capabilities of the ``grmpy`` package.
+We start by outlining some basic functional form assumptions before introducing two alternative models that can be used to
+estimate the marginal treatment effect (MTE).
+We then turn to some simple use cases.
 
 Assumptions
-------------
+-----------
 
 The ``grmpy`` package implements the normal linear-in-parameters version of the generalized Roy model. Both potential outcomes and the choice :math:`(Y_1, Y_0, D)` are a linear function of the individual's observables :math:`(X, Z)` and random components :math:`(U_1, U_0, V)`.
 
@@ -15,7 +18,29 @@ The ``grmpy`` package implements the normal linear-in-parameters version of the
     D &= I[D^{*} > 0] \\
     D^{*}    &= Z \gamma -V
 
-We collect all unobservables determining treatment choice in :math:`V = U_C - (U_1 - U_0)`. The unobservables follow a normal distribution :math:`(U_1, U_0, V) \sim \mathcal{N}(0, \Sigma)` with mean zero and covariance matrix :math:`\Sigma`.  Individuals decide to select into latent indicator variable :math:`D^{*}` is positive. Depending on their decision, we either observe :math:`Y_1` or :math:`Y_0`.
+We collect all unobservables determining treatment choice in :math:`V = U_C - (U_1 - U_0)`.
+Individuals select into treatment if the latent indicator variable :math:`D^{*}` is positive. Depending on their decision, we either observe :math:`Y_1` or :math:`Y_0`.
+
+
+Parametric Normal Model
+^^^^^^^^^^^^^^^^^^^^^^^
+
+The parametric model imposes the assumption of joint normality of the unobservables :math:`(U_1, U_0, V) \sim \mathcal{N}(0, \Sigma)` with mean zero and covariance matrix :math:`\Sigma`.
+
+Semiparametric Model
+^^^^^^^^^^^^^^^^^^^^
+The semiparametric approach invokes no assumption on the distribution of the unobservables. It requires only the weaker condition
+:math:`(X,Z) \perp\!\!\!\perp (U_1, U_0, V)`.
+
+Under this assumption, the MTE is:
+
+* additively separable in :math:`X` and :math:`U_D`, which means that the shape of the MTE is independent of :math:`X`, and
+
+* identified over the common support of :math:`P(Z)`, unconditional on :math:`X`.
+
+
+The assumption of common support is crucial for the application of LIV and needs to be carefully evaluated every time.
+It is defined as the region where the support of :math:`P(Z)` given :math:`D=1` and the support of :math:`P(Z)` given :math:`D=0` overlap.
 
 Model Specification
 -------------------
@@ -36,11 +61,14 @@ source      str         specified name for the simulation output files
 
 **ESTIMATION**
 
-The *ESTIMATION* block determines the basic information for the estimation process.
+Depending on the model, different input parameters are required.
+
+**PARAMETRIC MODEL**
 
 ===========     ======      ===============================================
 Key             Value       Interpretation
 ===========     ======      ===============================================
+semipar         False       choose the parametric normal model
 agents          int         number of individuals (for the comparison file)
 file            str         name of the estimation specific init file
 optimizer       str         optimizer used for the estimation process
@@ -52,16 +80,59 @@ output_file     str         name for the estimation output file
 comparison	int         flag for enabling the comparison file creation
 ===========     ======      ===============================================
 
+**SEMIPARAMETRIC MODEL**
+
+=============     ======      =========================================================================================
+Key               Value       Interpretation
+=============     ======      =========================================================================================
+semipar           True        choose the semiparametric model
+show_output       bool        If *True*, intermediate outputs of the LIV estimation are displayed
+dependent         str         indicates the dependent variable
+indicator         str         label of the treatment indicator variable
+file              str         name of the estimation specific init file
+logit             bool        If false: probit. Probability model for the decision equation
+nbins             int         Number of histogram bins used to determine common support (default is 25)
+trim_support      bool        Trim the data outside the common support (default is *True*)
+bandwidth         float       Bandwidth for the locally quadratic regression
+gridsize          int         Number of evaluation points for the locally quadratic regression (default is 400)
+reestimate_p      bool        Re-estimate :math:`P(Z)` after trimming (default is *False*), not recommended
+rbandwidth        float       Bandwidth for the double residual regression (default is 0.05)
+derivative        int         Derivative of the locally quadratic regression (default is 1)
+degree            int         Degree of the local polynomial (default is 2)
+ps_range          list        Start and end point of the range of :math:`p = u_D` over which the MTE shall be plotted
+=============     ======      =========================================================================================
+
+In most empirical applications, bandwidth choices between 0.2 and 0.4 are appropriate for the locally quadratic regression.
+:cite:`Fan1994` find that a gridsize of 400 is a good default for graphical analysis.
+For data sets with fewer than 400 observations, we recommend a gridsize equivalent to the maximum number of observations that
+remain after trimming the common support.
+If the data set of size N is large enough, a gridsize of 400 should be considered as the minimal number of evaluation points.
+Since *grmpy*'s algorithm is fast enough, gridsize can be easily increased to N evaluation points.
+
+The "rbandwidth", which is 0.05 by default, specifies the bandwidth for the LOWESS (Locally Weighted Scatterplot Smoothing) regression of
+:math:`X`, :math:`X \ \times \ p`, and :math:`Y` on :math:`\widehat{P}(Z)`. If the sample size is small (N < 400),
+the user may need to increase "rbandwidth" to 0.1. Otherwise *grmpy* will throw an error.
+
+Note that the MTE identified by LIV consists of two components: :math:`\overline{x}(\beta_1 - \beta_0)` (which does not depend on :math:`P(Z) = p`) and :math:`k(p)`
+(which does depend on :math:`p`). The latter is estimated nonparametrically. The key "ps_range" in the initialization file specifies the interval
+over which :math:`k(p)` is estimated. After the data outside the overlapping support are trimmed, the locally quadratic kernel estimator
+uses the remaining data to predict :math:`k(p)` over the entire "ps_range" specified by the user. If "ps_range" is larger than the common support, *grmpy*
+extrapolates the values for the MTE outside this region. Technically speaking, interpretations of the MTE are only valid within the common support.
+In our empirical applications, we set "ps_range" to :math:`[0.005,0.995]`.
+
+The other parameters in this section are set by default and, normally, do not need to be changed.
 
 
 **TREATED**
 
-The *TREATED* block specifies the number and order of the covariates determining the potential outcome in the treated state and the values for the coefficients :math:`\beta_1`. Note that the length of the list which determines the paramters has to be equal to the number of variables that are included in the order list.
+The *TREATED* block specifies the number and order of the covariates determining the potential outcome in the treated state
+and the values for the coefficients :math:`\beta_1`. Note that the length of the list which determines the parameters has to be equal
+to the number of variables that are included in the order list.
 
 =======   =========  ======     ===================================
 Key       Container  Values     Interpretation
 =======   =========  ======     ===================================
-params    list       float      Paramters
+params    list       float      Parameters
 order     list       str        Variable labels
 =======   =========  ======     ===================================
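
The roles of `bandwidth`, `gridsize`, `derivative` and `degree` in the hunk above can be illustrated with a minimal sketch of a locally quadratic regression that returns the first derivative at each evaluation point. The function name `local_quadratic_derivative`, the Epanechnikov kernel, and the toy data are assumptions made for illustration; the `KernReg` code added in this commit may be organized differently (for instance, using the binned approach of Fan and Marron cited in the docs).

```python
import numpy as np


def local_quadratic_derivative(x, y, grid, bandwidth):
    """First derivative of E[y | x] at each grid point via local quadratic fits.

    Hypothetical helper for illustration; not grmpy's KernReg implementation.
    """
    slopes = np.empty(len(grid))
    for i, point in enumerate(grid):
        centered = x - point
        u = centered / bandwidth
        # Epanechnikov kernel weights (an assumed choice of kernel).
        weights = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
        # Degree-2 design matrix centered at the evaluation point.
        design = np.column_stack([np.ones_like(x), centered, centered**2])
        root_w = np.sqrt(weights)
        coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
        # The coefficient on the linear term is the first derivative at `point`.
        slopes[i] = coef[1]
    return slopes


# Toy data: y = x^2, so the true derivative at p is 2p.
x = np.linspace(0.0, 1.0, 200)
y = x**2
grid = np.array([0.2, 0.5, 0.8])
slopes = local_quadratic_derivative(x, y, grid, bandwidth=0.3)
```

A larger `bandwidth` averages over more neighbouring observations and smooths the curve, while a larger number of grid points only increases how finely the fitted curve is evaluated.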
 
@@ -73,7 +144,7 @@ The *UNTREATED* block specifies the covariates that a the potential outcome in t
 =======   =========  ======     ===================================
 Key       Container  Values     Interpretation
 =======   =========  ======     ===================================
-params    list       float      Paramters
+params    list       float      Parameters
 order     list       str        Variable labels
 =======   =========  ======     ===================================
 
@@ -84,10 +155,14 @@ The *CHOICE* block specifies the number and order of the covariates determining
 =======   =========  ======     ===================================
 Key       Container  Values     Interpretation
 =======   =========  ======     ===================================
-params    list       float      Paramters
+params    list       float      Parameters
 order     list       str        Variable labels
 =======   =========  ======     ===================================
 
+
+Further Specifications for the Parametric Model
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 **DIST**
 
 The *DIST* block specifies the distribution of the unobservables.
@@ -137,6 +212,9 @@ ftol       float      relative error in fun(*xopt*) that is acceptable for conve
 Examples
 --------
 
+Parametric Normal Model
+^^^^^^^^^^^^^^^^^^^^^^^
+
 In the following chapter we explore the basic features of the ``grmpy`` package. The resources for the tutorial are also available `online <https://github.com/OpenSourceEconomics/grmpy/tree/master/docs/tutorial>`_.
 So far the package provides the features to simulate a sample from the generalized Roy model and to estimate some parameters of interest for a provided sample as specified in your initialization file.
 
@@ -159,13 +237,27 @@ This creates a number of output files that contain information about the resulti
 
 **Estimation**
 
-The other feature of the package is the estimation of the parameters of interest. The specification regarding start values and and the optimizer options are determined in the *ESTIMATION* section of the initialization file.
+The other feature of the package is the estimation of the parameters of interest.
+By default, the parametric model is chosen, in which case the parameter *semipar* in the *ESTIMATION* section of the initialization file is set to *False*.
+The start values and optimizer options need to be specified in the *ESTIMATION* section.
 
 ::
 
-    grmpy.fit('tutorial.grmpy.yml')
+    grmpy.fit('tutorial.grmpy.yml', semipar=False)
 
-As in the simulation process this creates a number of output file that contains information about the estimation results.
+As in the simulation process this creates a number of output files that contain information about the estimation results.
 
 * **est.grmpy.info**, basic information of the estimation process
 * **comparison.grmpy.txt**, distributional characteristics of the input sample and the samples simulated from the start and result values of the estimation process
+
+
+Local Instrumental Variables
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the user wishes to estimate the parameters of interest using the semiparametric LIV approach, *semipar* must be changed to *True*.
+
+::
+
+    grmpy.fit('tutorial.semipar.yml', semipar=True)
+
+If *show_output* is *True*, ``grmpy`` plots the common support of the propensity score and shows some intermediate outputs of the estimation process.
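
The common support check described in the `tutorial.rst` changes above can be sketched as follows. This is a minimal illustration that assumes the support is taken as the range of histogram bins in which both treated and untreated observations occur; the function name `common_support` and the toy propensity scores are hypothetical, and ``grmpy`` may determine the bounds differently.

```python
import numpy as np


def common_support(ps, treated, nbins=25):
    """Bounds of the propensity-score region populated by both groups.

    Hypothetical helper for illustration; not grmpy's own routine.
    """
    edges = np.linspace(0.0, 1.0, nbins + 1)
    hist_treated, _ = np.histogram(ps[treated == 1], bins=edges)
    hist_untreated, _ = np.histogram(ps[treated == 0], bins=edges)
    # Bins in which both groups have at least one observation.
    overlap = np.flatnonzero((hist_treated > 0) & (hist_untreated > 0))
    if overlap.size == 0:
        raise ValueError("treated and untreated propensity scores do not overlap")
    return edges[overlap[0]], edges[overlap[-1] + 1]


# Toy propensity scores: treated units lie in [0.30, 0.99], untreated in [0.01, 0.70].
ps = np.concatenate([np.linspace(0.30, 0.99, 200), np.linspace(0.01, 0.70, 200)])
treated = np.concatenate([np.ones(200, dtype=int), np.zeros(200, dtype=int)])

lower, upper = common_support(ps, treated, nbins=25)

# Trimming keeps only the observations inside the common support.
keep = (ps >= lower) & (ps <= upper)
```

With 25 bins the bounds snap to bin edges, so the reported region can be slightly wider than the exact overlap of the two samples; a finer `nbins` tightens it at the cost of more empty bins in small samples.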
diff --git a/environment.yml b/environment.yml
@@ -8,20 +8,22 @@ dependencies:
   - pytest
   - pytest-xdist
   - scipy
+  - numba
   - matplotlib
   - flake8
   - seaborn
   - pandoc
+  - pip
   - pip:
     - sphinxcontrib.bibtex
     - sphinx_rtd_theme
     - linearmodels
     - oyaml
-    - grmpy
     - isort
+    - grmpy
     - nbsphinx
     - sphinx_rtd_theme
     - pytest-cov
     - jupyterlab
-
-
+    - scikit-misc
+    - sklearn
diff --git a/grmpy/KernReg/__init__.py b/grmpy/KernReg/__init__.py
@@ -0,0 +1 @@
+