Merge grmpy-semipar in master branch
* Adding grmpy-semipar.

	new file:   develop_test.ipynb
	modified:   environment.yml
	modified:   grmpy/check/check.py
	modified:   grmpy/estimate/estimate.py
	new file:   grmpy/estimate/estimate_par.py
	new file:   grmpy/estimate/estimate_semipar.py
	new file:   replication_grmpy_semipar.yml
	modified:   requirements.txt
	modified:   setup.py

git commit -m "Adding grmpy-semipar to the estimation section"

* Did some reorganizations and included KernReg folder.

* Trying let Travis run develop_test.ipynb.

* Create auxiliary files separately for par and semipar. Delete estimate_auxiliary.

* Updated documentation. Further updates will follow.

* Adjusted environment file.

* Modified test section.

* Working on Travis issues.

* Working on test section.

* Worked again on the check functions.

* Updated Tutorial section in the documentation.

* Further changes to the tutorial file.

* Fixed table format.

* Changed Estimation table.

* Updated the chapter Reliability in the documentation.

* Updated the chapter Reliability in the documentation.

* Included figures for the documentation file.

* Clean up.

* Further clean up.

* Minor changes to the check section.

* Undid some changes in the test section.

* Included tutorial notebook for the semiparametric estimation.

* Included tutorial notebook for the semiparametric estimation process.

* Final adjustments to tutorial semipar, updated references in docs.

* Improved code quality.

* Fixed some Code Quality issues.

* Fixed further Code Quality issues.

* Ironing out linter issues.

* There is an issue in the specification if travis_runner

* Removed duplicates in the check section.

* Worked on check section again.

* Minor changes.

* Weeding out code duplicates.

* Removed unused import statement.

* Removed obsolete parameter from the initialization file.

* Minor additional explanations to the tutorial notebook and the documentation

* Removing shell=True from travis_runner, as this might pose security issues.

* Updated environment.yml in promotion folder.

* Updated environment.yml in promotion folder.

* Updated requirement file and documenation

* Ironed out minor typos in the documentation.

* Found further typos.

* Updated links in the documentation.

* Included test for the semiparametric estimation process.

* Removed semipar test for now. Fixes will follow.

* Updated resource paths for semipar test.

* Added missing files.

* Trying to fix path. Don't see why file has not been found.

* Don't see why path is not found. Remove semipar test for now.
segsell committed Feb 6, 2020
1 parent 397fefd commit a76be2e
Showing 37 changed files with 3,117 additions and 106 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -2,6 +2,8 @@
__pxcache__/
*.pyc
.idea/
.ipynb_checkpoints/
*.html
.DS_Store
docs/build
promotion/

4 changes: 3 additions & 1 deletion README.md
@@ -1,6 +1,8 @@
# grmpy

``grmpy`` is an open-source Python package for the simulation and estimation of the generalized Roy model. It serves as a teaching tool to promote the conceptual framework of the generalized Roy model, illustrate a variety of issues in the econometrics of policy evaluation, and showcase basic software engineering practices.
``grmpy`` is an open-source Python package for the simulation and estimation of the generalized Roy model. It serves as a teaching tool to promote the conceptual framework of the generalized Roy model, illustrate a variety of issues in the econometrics of policy evaluation, and showcase basic software engineering practices. <br>
Marginal Treatment Effects (MTE) can be estimated based on a parametric normal model or,
alternatively, via the semiparametric method of Local Instrumental Variables (LIV).

Please visit our [online documentation](http://grmpy.readthedocs.io/) for details.

14 changes: 7 additions & 7 deletions development/tests/property/run.py
@@ -10,13 +10,13 @@
import numpy as np

from grmpy.test.auxiliary import cleanup
from property_auxiliary import distribute_command_line_arguments
from property_auxiliary import process_command_line_arguments
from property_auxiliary import get_random_string
from property_auxiliary import run_property_test
from property_auxiliary import print_rslt_ext
from property_auxiliary import collect_tests
from property_auxiliary import finish
from development.tests.property.property_auxiliary import distribute_command_line_arguments
from development.tests.property.property_auxiliary import process_command_line_arguments
from development.tests.property.property_auxiliary import get_random_string
from development.tests.property.property_auxiliary import run_property_test
from development.tests.property.property_auxiliary import print_rslt_ext
from development.tests.property.property_auxiliary import collect_tests
from development.tests.property.property_auxiliary import finish


def choose_module(inp_dict):
2 changes: 1 addition & 1 deletion development/tests/regression/draft.py
@@ -12,7 +12,7 @@

import numpy as np

from grmpy.estimate.estimate_auxiliary import (
from grmpy.estimate.estimate_par import (
calculate_criteria,
process_data,
start_values,
2 changes: 1 addition & 1 deletion development/tests/regression/run.py
@@ -9,7 +9,7 @@
import numpy as np

import grmpy
from grmpy.estimate.estimate_auxiliary import (
from grmpy.estimate.estimate_par import (
calculate_criteria,
process_data,
start_values,
Binary file added docs/source/figures/beta_distribution.png
Binary file added docs/source/figures/normal_distribution.png
Binary file added docs/source/figures/replication_carneiroB.png
21 changes: 21 additions & 0 deletions docs/source/refs.bib
@@ -1,5 +1,26 @@
% Encoding: UTF-8
@article{Fan1994,
author = {Fan, Jianqing and Marron, James S.},
title = {Fast Implementations of Nonparametric Curve Estimators},
journal = {Journal of Computational and Graphical Statistics},
volume = {3},
number = {1},
pages = {35-56},
year = {1994}
}


@article{Brave2014,
author = {Brave, S. and Walstrum, T.},
year = {2014},
title = {Estimating marginal treatment effects using parametric and semiparametric methods},
journal = {Stata Journal},
volume = {14},
number = {1},
pages = {191--217},
}

@Article{RoRu1983,
author = {Rosenbaum, Paul R. and Donald B. Rubin},
title = {The Central Role of the Propensity Score in Observational Studies for Causal Effects},
66 changes: 64 additions & 2 deletions docs/source/reliability.rst
@@ -31,10 +31,72 @@ As can be seen from the figure, the OLS estimator underestimates the effect sign

The second figure shows the estimated :math:`B^{ATE}` from the ``grmpy`` estimation process. Conversely to the OLS results the estimate of the average effect is close to the true value even if the unobservables are almost perfectly correlated.

Replication

Sensitivity to Different Distributions of the Unobservables
-----------------------------------------------------------
The parametric specification makes the strong assumption that the unobservables follow a joint normal distribution.
The semiparametric method of local instrumental variables is more flexible, as it does not invoke conditions on the functional form.
We test how sensitive the two methods are to different distributions of the unobservables.
To that end, we use a toy model of the returns to college (based on :cite:`Brave2014`), where we know the true shape of the :math:`B^{MTE}`.

Normal Distribution
^^^^^^^^^^^^^^^^^^^

.. figure:: ../source/figures/normal_distribution.png
:align: center

Both specifications come very close to the original curve. The parametric model even gets a perfect fit.

*beta* Distribution
^^^^^^^^^^^^^^^^^^^

The shape of the *beta* distribution can be flexibly adjusted by the tuning parameters :math:`\alpha` and :math:`\beta`,
which we set to 4 and 8, respectively.


.. figure:: ../source/figures/beta_distribution.png
:align: center

The parametric model underestimates the returns to college, whereas the semiparametric :math:`B^{MTE}` still fits the original
curve pretty well. The latter makes no assumption on the functional form of the unobservables and, thus, is more flexible
in estimating the parameter of interest when the assumption of joint normality is violated.
Which model is superior depends on the context. In empirical applications, we recommend examining both.


Reliability
-----------

The second check of reliability compares the results of our estimation process with already existing results from the literature. For this purpose we replicate the results for the marginal treatment effect from Carneiro 2011 (:cite:`Carneiro2011`). Additionally we provide a `jupyter notebook <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/04_grmpy_tutorial_notebook/04_grmpy_tutorial_notebook.ipynb>`_ that runs an estimation based on an `initialization file <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/04_grmpy_tutorial_notebook/files/replication.grmpy.yml>`__ for easy reconstruction of our test setup. The init file corresponds to the specifications of the authors. As shown in the figure below the results are really close to the original results. The deviation seems to be negligible because of the usage of a mock dataset.
In another check of reliability, we compare the results of our estimation process with already existing results from the literature.
For this purpose we replicate the results for both the parametric and semiparametric MTE from Carneiro 2011 (:cite:`Carneiro2011`).
Note that we make use of a mock data set, as the original data cannot be fully recreated from the
`replication material <https://www.aeaweb.org/articles?id=10.1257/aer.101.6.2754>`_.

We provide two jupyter notebooks for easy reconstruction of the
`parametric <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/grmpy_tutorial_notebook.ipynb>`_
as well as the
`semiparametric <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/tutorial_semipar_notebook.ipynb>`_
setup.
The corresponding initialization files can be found
`here <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/files/replication.grmpy.yml>`_ and
`here <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/files/tutorial_semipar.yml>`__.

Parametric Replication
^^^^^^^^^^^^^^^^^^^^^^

As shown in the figure below, the parametric :math:`B^{MTE}` is really close to the original results.
The remaining deviation appears negligible and is likely due to the use of a mock data set.

.. figure:: ../source/figures/fig-marginal-benefit-parametric-replication.png
:align: center


Semiparametric Replication
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. figure:: ../source/figures/replication_carneiroB.png
:align: center

The semiparametric :math:`B^{MTE}` also gets very close to the original curve. However, the 90 percent confidence bands
(250 bootstrap replications) are wider. As opposed to the parametric model, where the standard error bands are computed
analytically, confidence bands in the semiparametric setup are obtained via the bootstrap method,
which is sensitive to the discrepancies in the mock data set.
116 changes: 104 additions & 12 deletions docs/source/tutorial.rst
@@ -1,10 +1,13 @@
Tutorial
========
=======================

We now illustrate the basic capabilities of the ``grmpy`` package. We start with the assumptions about functional form and the distribution of unobservables and then turn to some simple use cases.
We now illustrate the basic capabilities of the ``grmpy`` package.
We start by outlining some basic functional form assumptions before introducing two alternative models that can be used to
estimate the marginal treatment effect (MTE).
We then turn to some simple use cases.

Assumptions
------------
-----------

The ``grmpy`` package implements the normal linear-in-parameters version of the generalized Roy model. Both potential outcomes and the choice :math:`(Y_1, Y_0, D)` are a linear function of the individual's observables :math:`(X, Z)` and random components :math:`(U_1, U_0, V)`.

@@ -15,7 +18,29 @@ The ``grmpy`` package implements the normal linear-in-parameters version of the
D &= I[D^{*} > 0] \\
D^{*} &= Z \gamma -V
We collect all unobservables determining treatment choice in :math:`V = U_C - (U_1 - U_0)`. The unobservables follow a normal distribution :math:`(U_1, U_0, V) \sim \mathcal{N}(0, \Sigma)` with mean zero and covariance matrix :math:`\Sigma`. Individuals decide to select into latent indicator variable :math:`D^{*}` is positive. Depending on their decision, we either observe :math:`Y_1` or :math:`Y_0`.
We collect all unobservables determining treatment choice in :math:`V = U_C - (U_1 - U_0)`.
Individuals select into treatment if the latent indicator variable :math:`D^{*}` is positive. Depending on their decision, we either observe :math:`Y_1` or :math:`Y_0`.


Parametric Normal Model
^^^^^^^^^^^^^^^^^^^^^^^

The parametric model imposes the assumption of joint normality of the unobservables :math:`(U_1, U_0, V) \sim \mathcal{N}(0, \Sigma)` with mean zero and covariance matrix :math:`\Sigma`.

Semiparametric Model
^^^^^^^^^^^^^^^^^^^^
The semiparametric approach invokes no assumption on the distribution of the unobservables. It only requires the weaker condition
:math:`(X,Z) \perp\!\!\!\perp \{U_1, U_0, V\}`, that is, the observables are independent of the unobservables.

Under this assumption, the MTE is:

* additively separable in :math:`X` and :math:`U_D`, which means that the shape of the MTE is independent of :math:`X`, and

* identified over the common support of :math:`P(Z)`, unconditional on :math:`X`.


The assumption of common support is crucial for the application of LIV and needs to be carefully evaluated every time.
It is defined as the region where the support of :math:`P(Z)` given :math:`D=1` and the support of :math:`P(Z)` given :math:`D=0` overlap.
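
The overlap condition can be illustrated with a short, self-contained sketch. It is not ``grmpy``'s implementation (which, according to the *ESTIMATION* options, works with histogram bins controlled by ``nbins``); it uses the simpler min-max rule instead, and the function names ``common_support`` and ``trim_to_common_support`` are hypothetical.

```python
def common_support(ps, d):
    """Return (lower, upper) bounds of the region where the propensity
    scores of treated (d == 1) and untreated (d == 0) units overlap.

    Assumption: a simple min-max rule, not grmpy's histogram-based check.
    """
    treated = [p for p, di in zip(ps, d) if di == 1]
    untreated = [p for p, di in zip(ps, d) if di == 0]
    if not treated or not untreated:
        raise ValueError("both treatment states must be observed")

    # The overlap starts at the larger of the two minima
    # and ends at the smaller of the two maxima.
    lower = max(min(treated), min(untreated))
    upper = min(max(treated), max(untreated))
    if lower > upper:
        raise ValueError("the supports do not overlap")
    return lower, upper


def trim_to_common_support(ps, d):
    """Return a boolean mask that is True for observations inside the overlap."""
    lower, upper = common_support(ps, d)
    return [lower <= p <= upper for p in ps]
```

Observations for which the mask is ``False`` would be dropped before the nonparametric step, which mirrors the role of the ``trim_support`` option.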

Model Specification
-------------------
@@ -36,11 +61,14 @@ source str specified name for the simulation output files

**ESTIMATION**

The *ESTIMATION* block determines the basic information for the estimation process.
Depending on the model, different input parameters are required.

**PARAMETRIC MODEL**

=========== ====== ===============================================
Key Value Interpretation
=========== ====== ===============================================
semipar False choose the parametric normal model
agents int number of individuals (for the comparison file)
file str name of the estimation specific init file
optimizer str optimizer used for the estimation process
@@ -52,16 +80,59 @@ output_file str name for the estimation output file
comparison int flag for enabling the comparison file creation
=========== ====== ===============================================

**SEMIPARAMETRIC MODEL**

============= ====== =========================================================================================
Key Value Interpretation
============= ====== =========================================================================================
semipar True choose the semiparametric model
show_output bool If *True*, intermediate outputs of the LIV estimation are displayed
dependent str indicates the dependent variable
indicator str label of the treatment indicator variable
file str name of the estimation specific init file
logit bool Probability model for the decision equation: logit if *True*, probit if *False*
nbins int Number of histogram bins used to determine common support (default is 25)
trim_support bool Trim the data outside the common support (default is *True*)
bandwidth float Bandwidth for the locally quadratic regression
gridsize int Number of evaluation points for the locally quadratic regression (default is 400)
reestimate_p bool Re-estimate :math:`P(Z)` after trimming (default is *False*), not recommended
rbandwidth float Bandwidth for the double residual regression (default is 0.05)
derivative int Derivative of the locally quadratic regression (default is 1)
degree int Degree of the local polynomial (default is 2)
ps_range list Start and end point of the range of :math:`p = u_D` over which the MTE shall be plotted
============= ====== =========================================================================================

In most empirical applications, bandwidth choices between 0.2 and 0.4 are appropriate for the locally quadratic regression.
:cite:`Fan1994` find that a gridsize of 400 is a good default for graphical analysis.
For data sets with fewer than 400 observations, we recommend a gridsize equivalent to the maximum number of observations that
remain after trimming the common support.
If the data set of size N is large enough, a gridsize of 400 should be considered as the minimal number of evaluation points.
Since *grmpy*'s algorithm is fast enough, gridsize can be easily increased to N evaluation points.

The "rbandwidth", which is 0.05 by default, specifies the bandwidth for the LOWESS (Locally Weighted Scatterplot Smoothing) regression of
:math:`X`, :math:`X \ \times \ p`, and :math:`Y` on :math:`\widehat{P}(Z)`. If the sample size is small (N < 400),
the user may need to increase "rbandwidth" to 0.1. Otherwise *grmpy* will throw an error.
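
The rules of thumb above for "gridsize" and "rbandwidth" can be written down as a small helper. This is only a sketch of the recommendations in the text, not part of *grmpy*; the function name ``suggest_semipar_settings`` and its arguments are hypothetical.

```python
def suggest_semipar_settings(n_obs, n_after_trimming):
    """Suggest gridsize and rbandwidth following the rules of thumb in the text.

    n_obs: sample size N before trimming.
    n_after_trimming: observations left inside the common support.
    """
    if n_after_trimming > n_obs:
        raise ValueError("trimming cannot increase the sample size")

    # 400 evaluation points by default; with fewer observations left,
    # use as many points as observations remain after trimming.
    gridsize = min(n_after_trimming, 400)

    # Small samples (N < 400) may need the larger LOWESS bandwidth.
    rbandwidth = 0.1 if n_obs < 400 else 0.05

    return {"gridsize": gridsize, "rbandwidth": rbandwidth}
```

For large samples the returned gridsize of 400 is meant as a lower bound; as noted above, it can be raised up to N evaluation points.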

Note that the MTE identified by LIV consists of two components: :math:`\overline{x}(\beta_1 - \beta_0)` (which does not depend on :math:`P(Z) = p`) and :math:`k(p)`
(which does depend on :math:`p`). The latter is estimated nonparametrically. The key "ps_range" in the initialization file specifies the interval
over which :math:`k(p)` is estimated. After the data outside the overlapping support are trimmed, the locally quadratic kernel estimator
uses the remaining data to predict :math:`k(p)` over the entire "ps_range" specified by the user. If "ps_range" is larger than the common support, *grmpy*
extrapolates the values for the MTE outside this region. Strictly speaking, interpretations of the MTE are only valid within the common support.
In our empirical applications, we set "ps_range" to :math:`[0.005,0.995]`.
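
To make the two-component structure concrete, the following sketch evaluates :math:`\overline{x}(\beta_1 - \beta_0) + k(p)` on an evenly spaced grid over "ps_range". It is an illustration only: the function ``mte_curve`` is hypothetical, and ``k`` stands in for the nonparametric estimate that *grmpy* obtains from the locally quadratic regression.

```python
def mte_curve(x_bar, beta1, beta0, k, ps_range=(0.005, 0.995), gridsize=400):
    """Evaluate MTE(p) = x_bar * (beta1 - beta0) + k(p) on a grid over ps_range.

    x_bar, beta1, beta0: sequences of equal length (covariate means and
    outcome coefficients); k: any callable mapping p to a float.
    """
    if gridsize < 2:
        raise ValueError("gridsize must be at least 2")

    # Component that does not depend on p.
    level = sum(x * (b1 - b0) for x, b1, b0 in zip(x_bar, beta1, beta0))

    # Evenly spaced evaluation points between the two ends of ps_range.
    start, end = ps_range
    step = (end - start) / (gridsize - 1)
    grid = [start + i * step for i in range(gridsize)]

    # Add the p-dependent component k(p) at every grid point.
    return grid, [level + k(p) for p in grid]
```

Any grid points outside the common support correspond to the extrapolated region mentioned above and should be interpreted with care.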

The other parameters in this section are set by default and, normally, do not need to be changed.


**TREATED**

The *TREATED* block specifies the number and order of the covariates determining the potential outcome in the treated state and the values for the coefficients :math:`\beta_1`. Note that the length of the list which determines the paramters has to be equal to the number of variables that are included in the order list.
The *TREATED* block specifies the number and order of the covariates determining the potential outcome in the treated state
and the values for the coefficients :math:`\beta_1`. Note that the length of the list which determines the parameters has to be equal
to the number of variables that are included in the order list.

======= ========= ====== ===================================
Key Container Values Interpretation
======= ========= ====== ===================================
params list float Paramters
params list float Parameters
order list str Variable labels
======= ========= ====== ===================================

@@ -73,7 +144,7 @@ The *UNTREATED* block specifies the covariates that a the potential outcome in t
======= ========= ====== ===================================
Key Container Values Interpretation
======= ========= ====== ===================================
params list float Paramters
params list float Parameters
order list str Variable labels
======= ========= ====== ===================================

@@ -84,10 +155,14 @@ The *CHOICE* block specifies the number and order of the covariates determining
======= ========= ====== ===================================
Key Container Values Interpretation
======= ========= ====== ===================================
params list float Paramters
params list float Parameters
order list str Variable labels
======= ========= ====== ===================================


Further Specifications for the Parametric Model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**DIST**

The *DIST* block specifies the distribution of the unobservables.
@@ -137,6 +212,9 @@ ftol float relative error in fun(*xopt*) that is acceptable for conve
Examples
--------

Parametric Normal Model
^^^^^^^^^^^^^^^^^^^^^^^

In the following chapter we explore the basic features of the ``grmpy`` package. The resources for the tutorial are also available `online <https://github.com/OpenSourceEconomics/grmpy/tree/master/docs/tutorial>`_.
So far the package provides the features to simulate a sample from the generalized Roy model and to estimate some parameters of interest for a provided sample as specified in your initialization file.

@@ -159,13 +237,27 @@ This creates a number of output files that contain information about the resulti

**Estimation**

The other feature of the package is the estimation of the parameters of interest. The specification regarding start values and and the optimizer options are determined in the *ESTIMATION* section of the initialization file.
The other feature of the package is the estimation of the parameters of interest.
By default, the parametric model is chosen, in which case the parameter *semipar* in the *ESTIMATION* section of the initialization file is set to *False*.
The start values and optimizer options need to be specified in the *ESTIMATION* section.

::

grmpy.fit('tutorial.grmpy.yml')
grmpy.fit('tutorial.grmpy.yml', semipar=False)

As in the simulation process this creates a number of output file that contains information about the estimation results.
As in the simulation process this creates a number of output files that contain information about the estimation results.

* **est.grmpy.info**, basic information of the estimation process
* **comparison.grmpy.txt**, distributional characteristics of the input sample and the samples simulated from the start and result values of the estimation process


Local Instrumental Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the user wishes to estimate the parameters of interest using the semiparametric LIV approach, *semipar* must be changed to *True*.

::

grmpy.fit('tutorial.semipar.yml', semipar=True)

If *show_output* is *True*, ``grmpy`` plots the common support of the propensity score and shows some intermediate outputs of the estimation process.
8 changes: 5 additions & 3 deletions environment.yml
@@ -8,20 +8,22 @@ dependencies:
- pytest
- pytest-xdist
- scipy
- numba
- matplotlib
- flake8
- seaborn
- pandoc
- pip
- pip:
- sphinxcontrib.bibtex
- sphinx_rtd_theme
- linearmodels
- oyaml
- grmpy
- isort
- grmpy
- nbsphinx
- sphinx_rtd_theme
- pytest-cov
- jupyterlab


- scikit-misc
- sklearn
1 change: 1 addition & 0 deletions grmpy/KernReg/__init__.py
@@ -0,0 +1 @@
