
Commit

Updated requirement file and documentation
segsell committed Jan 29, 2020
1 parent bd24ea8 commit 968b643
Showing 6 changed files with 886 additions and 8 deletions.
18 changes: 11 additions & 7 deletions docs/source/reliability.rst
@@ -67,20 +67,24 @@ Reliability
-----------

In another check of reliability, we compare the results of our estimation process with already existing results from the literature.
For this purpose, we replicate the results for both the parametric and semiparametric marginal treatment effect (MTE) from Carneiro et al. (:cite:`Carneiro2011`).
Note that we make use of a mock data set, as the original data cannot be fully recreated from the
`replication material <https://www.aeaweb.org/articles?id=10.1257/aer.101.6.2754>`_.

We provide two Jupyter notebooks for easy reconstruction of the
`parametric <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/grmpy_tutorial_notebook.ipynb>`_
as well as the
`semiparametric <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/tutorial_semipar_notebook.ipynb>`_
setup.
The corresponding initialization files can be found
`here <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/files/replication.grmpy.yml>`_ and
`here <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/files/tutorial_semipar.yml>`__.
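To see how the semiparametric replication fits together, the sketch below combines the initialization file with the plotting helper from ``tutorial_semipar_auxiliary.py`` (added in this commit). The estimation entry point, here called ``grmpy.fit``, is an assumption; use the estimation call from the notebook itself, which returns a result dictionary holding "mte" and "quantiles".

.. code-block:: python

    import grmpy  # assumption: exposes the notebook's estimation entry point

    from tutorial_semipar_auxiliary import plot_semipar_mte

    init_file = "files/tutorial_semipar.yml"

    # Hypothetical call; the returned dict holds "mte" and "quantiles".
    rslt = grmpy.fit(init_file, semipar=True)

    # Replicate the MTE figure with 250 bootstrap replications.
    mte, quantiles = plot_semipar_mte(rslt, init_file, nbootstraps=250)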

Parametric Replication
^^^^^^^^^^^^^^^^^^^^^^

As shown in the figure below, the parametric :math:`B^{MTE}` is very close to the original results.
The remaining deviation appears negligible and is attributable to the use of a mock data set.

.. figure:: ../source/figures/fig-marginal-benefit-parametric-replication.png
:align: center
21 changes: 20 additions & 1 deletion docs/source/tutorial.rst
@@ -104,11 +104,30 @@ ps_range list Start and end point of the range of :math:`p = u_D

In most empirical applications, bandwidth choices between 0.2 and 0.4 are appropriate for the locally quadratic regression.
:cite:`Fan1994` find that a gridsize of 400 is a good default for graphical analysis.
For data sets with fewer than 400 observations, we recommend a gridsize equal to the number of observations that
remain after trimming the common support.
If the data set of size *N* is large enough, a gridsize of 400 should be considered the minimum number of evaluation points.
Since *grmpy*'s algorithm is fast, the gridsize can easily be increased up to *N* evaluation points.
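As a concrete reading of this rule of thumb, consider the following sketch, where ``n_obs`` stands for the number of observations that remain after trimming the common support:

.. code-block:: python

    def choose_gridsize(n_obs, default=400):
        """Rule-of-thumb gridsize for the locally quadratic regression."""
        if n_obs < default:
            # Small sample: one evaluation point per remaining observation.
            return n_obs
        # Large sample: 400 is the minimum; anything up to n_obs works.
        return default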

The "rbandwidth", which is 0.05 by default, specifies the bandwidth for the LOWESS (Locally Weighted Scatterplot Smoothing) regression of
:math:`X`, :math:`X \ \times \ p`, and :math:`Y` on :math:`\widehat{P}(Z)`. If the sample size is small (*N* < 400),
the user may need to increase "rbandwidth" to 0.1. Otherwise, *grmpy* will throw an error.
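The sketch below illustrates this smoothing step with the LOWESS implementation from *statsmodels* (an illustration only, not *grmpy*'s internal code; the ``frac`` argument plays the role of "rbandwidth"):

.. code-block:: python

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(0)
    p_hat = rng.uniform(0.05, 0.95, size=1000)          # stand-in for the estimated P(Z)
    y = 0.5 * p_hat + rng.normal(scale=0.1, size=1000)  # stand-in outcome

    # Smooth y on the propensity score; the residual feeds the LIV estimator.
    y_fit = lowess(y, p_hat, frac=0.05, return_sorted=False)
    res_y = y - y_fit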

Note that the MTE identified by LIV consists of two components: :math:`\overline{x}(\beta_1 - \beta_0)` (which does not depend on :math:`P(Z) = p`) and :math:`k(p)`
(which does depend on :math:`p`). The latter is estimated nonparametrically. The section "ps_range" in the initialization file specifies the interval
over which :math:`k(p)` is estimated. After the data outside the overlapping support are trimmed, the locally quadratic kernel estimator
uses the remaining data to predict :math:`k(p)` over the entire "ps_range" specified by the user. If "ps_range" is larger than the common support, *grmpy*
extrapolates the values for the MTE outside this region. Technically speaking, interpretations of the MTE are only valid within the common support.
In our empirical applications, we set "ps_range" to :math:`[0.005, 0.995]`.
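In this notation, the estimated MTE is assembled as

.. math::

    \widehat{MTE}(\overline{x}, u_D) = \overline{x} \left(\widehat{\beta}_1 - \widehat{\beta}_0\right) + \widehat{k}(u_D),

where the first component is evaluated at the sample mean of the covariates and the second is the locally quadratic estimate of :math:`k(p)` at :math:`p = u_D`. This mirrors how the observed and unobserved components are combined in the bootstrap helper shipped with the tutorial notebook.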

The other parameters in this section are set by default and normally do not need to be changed.


**TREATED**

The *TREATED* block specifies the number and order of the covariates determining the potential outcome in the treated state
and the values for the coefficients :math:`\beta_1`. Note that the length of the list which determines the parameters has to equal
the number of variables included in the order list.
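As a quick sanity check of this requirement (a sketch; ``read`` is imported from ``grmpy.read.read`` as in the auxiliary module below, and the file path is the tutorial's parametric initialization file):

.. code-block:: python

    from grmpy.read.read import read

    dict_ = read("files/replication.grmpy.yml")

    # Every coefficient in params needs a matching covariate in order.
    treated = dict_["TREATED"]
    assert len(treated["params"]) == len(treated["order"])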

======= ========= ====== ===================================
Key Container Values Interpretation
112 changes: 112 additions & 0 deletions promotion/grmpy_tutorial_notebook/files/tutorial_semipar.yml
@@ -0,0 +1,112 @@
---
ESTIMATION:
file: data/aer-replication-mock.pkl
dependent: wage
indicator: state
semipar: True
show_output: True
logit: True
nbins: 30
bandwidth: 0.322
gridsize: 500
trim_support: True
reestimate_p: False
rbandwidth: 0.05
derivative: 1
degree: 2
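# Note: ps_range bounds the interval over which k(p) is estimated;
# MTE values outside the common support are extrapolated.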
ps_range: [0.005, 0.995]
TREATED:
order:
- exp
- expsq
- lwage5
- lurate
- cafqt
- cafqtsq
- mhgc
- mhgcsq
- numsibs
- numsibssq
- urban14
- lavlocwage17
- lavlocwage17sq
- avurate
- avuratesq
- d57
- d58
- d59
- d60
- d61
- d62
- d63
UNTREATED:
order:
- exp
- expsq
- lwage5
- lurate
- cafqt
- cafqtsq
- mhgc
- mhgcsq
- numsibs
- numsibssq
- urban14
- lavlocwage17
- lavlocwage17sq
- avurate
- avuratesq
- d57
- d58
- d59
- d60
- d61
- d62
- d63
CHOICE:
params:
- 1.0
order:
- const
- cafqt
- cafqtsq
- mhgc
- mhgcsq
- numsibs
- numsibssq
- urban14
- lavlocwage17
- lavlocwage17sq
- avurate
- avuratesq
- d57
- d58
- d59
- d60
- d61
- d62
- d63
- lwage5_17numsibs
- lwage5_17mhgc
- lwage5_17cafqt
- lwage5_17
- lurate_17
- lurate_17numsibs
- lurate_17mhgc
- lurate_17cafqt
- tuit4c
- tuit4cnumsibs
- tuit4cmhgc
- tuit4ccafqt
- pub4
- pub4numsibs
- pub4mhgc
- pub4cafqt
DIST:
params:
- 0.1
- 0.0
- 0.0
- 0.1
- 0.0
- 1.0
151 changes: 151 additions & 0 deletions promotion/grmpy_tutorial_notebook/tutorial_semipar_auxiliary.py
@@ -0,0 +1,151 @@
"""This module provides auxiliary functions for the semiparametric tutorial"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

from sklearn.utils import resample
from scipy.stats import norm

from grmpy.estimate.estimate_semipar import estimate_treatment_propensity
from grmpy.estimate.estimate_semipar import inputs_decision_equation
from grmpy.estimate.estimate_semipar import process_default_input
from grmpy.estimate.estimate_semipar import process_user_input
from grmpy.estimate.estimate_semipar import mte_components
from grmpy.estimate.estimate_semipar import trim_support

from grmpy.check.check import check_presence_init
from grmpy.check.auxiliary import read_data
from grmpy.read.read import read


def plot_semipar_mte(rslt, init_file, nbootstraps=250):
"""This function plots the original and the replicated MTE from Carneiro et al. (2011)"""
# mte per year of university education
mte = rslt["mte"] / 4
quantiles = rslt["quantiles"]

# bootstrap 90 percent confidence bands
mte_boot = bootstrap(init_file, nbootstraps)

# mte per year of university education
mte_boot = mte_boot / 4

# Get standard error of MTE at each gridpoint u_D
mte_boot_std = np.std(mte_boot, axis=1)

# Compute 90 percent confidence intervals
con_u = mte + norm.ppf(0.95) * mte_boot_std
con_d = mte - norm.ppf(0.95) * mte_boot_std

# Load original data
    mte_ = pd.read_csv("data/mte_semipar_original.csv")

# Plot
ax = plt.figure(figsize=(17.5, 10)).add_subplot(111)

ax.set_ylabel(r"$MTE$", fontsize=20)
ax.set_xlabel("$u_D$", fontsize=20)
ax.tick_params(
axis="both", direction="in", length=5, width=1, grid_alpha=0.25, labelsize=14
)
ax.xaxis.set_ticks_position("both")
ax.yaxis.set_ticks_position("both")
ax.xaxis.set_ticks(np.arange(0, 1.1, step=0.1))
ax.yaxis.set_ticks(np.arange(-1.8, 0.9, step=0.2))

ax.set_ylim([-0.77, 0.86])
ax.set_xlim([0, 1])
ax.margins(x=0.003)
ax.margins(y=0.03)

# Plot replicated curves
ax.plot(quantiles, mte, label="replicated $MTE$", color="orange", linewidth=4)
ax.plot(quantiles, con_u, color="orange", linestyle=":", linewidth=3)
ax.plot(quantiles, con_d, color="orange", linestyle=":", linewidth=3)

# Plot original curve
ax.plot(
mte_["quantiles"],
mte_["mte"],
label="$original MTE$",
color="blue",
linewidth=4,
)
ax.plot(mte_["quantiles"], mte_["con_u"], color="blue", linestyle=":", linewidth=3)
ax.plot(mte_["quantiles"], mte_["con_d"], color="blue", linestyle=":", linewidth=3)

    orange_patch = mpatches.Patch(color="orange", label="replicated $MTE$")
    blue_patch = mpatches.Patch(color="blue", label="original $MTE$")
    plt.legend(handles=[orange_patch, blue_patch], prop={"size": 16})

plt.show()

return mte, quantiles


def bootstrap(init_file, nbootstraps):
"""
This function generates bootsrapped standard errors
given an init_file and the number of bootsraps to be drawn.
"""
check_presence_init(init_file)
dict_ = read(init_file)

# Process the information specified in the initialization file
nbins, logit, bandwidth, gridsize, a, b = process_user_input(dict_)
trim, rbandwidth, reestimate_p = process_default_input(dict_)

# Suppress output
show_output = False

# Prepare empty array to store output values
mte_boot = np.zeros([gridsize, nbootstraps])

# Load the baseline data
data = read_data(dict_["ESTIMATION"]["file"])

counter = 0
while counter < nbootstraps:
boot_data = resample(data, replace=True, n_samples=len(data), random_state=None)

# Process the inputs for the decision equation
indicator, D, Z = inputs_decision_equation(dict_, boot_data)

# Estimate propensity score P(z)
ps = estimate_treatment_propensity(D, Z, logit, show_output)

if isinstance(ps, np.ndarray): # & (np.min(ps) <= 0.3) & (np.max(ps) >= 0.7):
# Define common support and trim the data, if trim=True
boot_data, ps = trim_support(
dict_,
boot_data,
logit,
ps,
indicator,
nbins,
trim,
reestimate_p,
show_output,
)

# Estimate the observed and unobserved component of the MTE
X, b1_b0, mte_u = mte_components(
dict_, boot_data, ps, rbandwidth, bandwidth, gridsize, a, b, show_output
)

# Calculate the MTE component that depends on X
mte_x = np.dot(X, b1_b0).mean(axis=0)

# Put the MTE together
mte = mte_x + mte_u
mte_boot[:, counter] = mte

counter += 1

else:
continue

return mte_boot
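A minimal usage sketch for the two helpers above (``rslt`` is assumed to come from the semiparametric estimation step in the accompanying notebook; paths are relative to ``promotion/grmpy_tutorial_notebook``):

.. code-block:: python

    mte_boot = bootstrap("files/tutorial_semipar.yml", nbootstraps=250)
    mte_boot.shape  # (gridsize, nbootstraps), i.e. (500, 250) for this init file

    # mte, quantiles = plot_semipar_mte(rslt, "files/tutorial_semipar.yml")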
