
Commit

Updated requirement file and documentation
segsell committed Jan 29, 2020
1 parent bd24ea8 commit 968b643
Showing 6 changed files with 886 additions and 8 deletions.
18 changes: 11 additions & 7 deletions docs/source/reliability.rst
@@ -67,20 +67,24 @@ Reliability
-----------

In another check of reliability, we compare the results of our estimation process with already existing results from the literature.
For this purpose, we replicate the results for both the parametric and semiparametric marginal treatment effect (MTE) from Carneiro et al. (:cite:`Carneiro2011`).
Note that we make use of a mock data set, as the original data cannot be fully recreated from the
`replication material <https://www.aeaweb.org/articles?id=10.1257/aer.101.6.2754>`_.

We provide two Jupyter notebooks for easy reconstruction of the
`parametric <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/grmpy_tutorial_notebook.ipynb>`_
as well as the
`semiparametric <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/tutorial_semipar_notebook.ipynb>`_
setup.
The corresponding initialization files can be found
`here <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/files/replication.grmpy.yml>`_ and
`here <https://github.com/OpenSourceEconomics/grmpy/blob/master/promotion/grmpy_tutorial_notebook/files/tutorial_semipar.yml>`__.
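To see how the semiparametric replication fits together, the sketch below combines the initialization file with the plotting helper from ``tutorial_semipar_auxiliary.py`` (added in this commit). The estimation entry point, here called ``grmpy.fit``, is an assumption; use the estimation call from the notebook itself, which returns a result dictionary holding "mte" and "quantiles".

.. code-block:: python

    import grmpy  # assumption: exposes the notebook's estimation entry point

    from tutorial_semipar_auxiliary import plot_semipar_mte

    init_file = "files/tutorial_semipar.yml"

    # Hypothetical call; the returned dict holds "mte" and "quantiles".
    rslt = grmpy.fit(init_file, semipar=True)

    # Replicate the MTE figure with 250 bootstrap replications.
    mte, quantiles = plot_semipar_mte(rslt, init_file, nbootstraps=250)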

Parametric Replication
^^^^^^^^^^^^^^^^^^^^^^

As shown in the figure below, the parametric :math:`B^{MTE}` is very close to the original results.
The remaining deviation appears negligible and is attributable to the use of a mock data set.

.. figure:: ../source/figures/fig-marginal-benefit-parametric-replication.png
:align: center
21 changes: 20 additions & 1 deletion docs/source/tutorial.rst
@@ -104,11 +104,30 @@ ps_range list Start and end point of the range of :math:`p = u_D

In most empirical applications, bandwidth choices between 0.2 and 0.4 are appropriate for the locally quadratic regression.
:cite:`Fan1994` find that a gridsize of 400 is a good default for graphical analysis.
For data sets with fewer than 400 observations, we recommend a gridsize equal to the number of observations that
remain after trimming the common support.
If the data set of size *N* is large enough, a gridsize of 400 should be considered the minimum number of evaluation points.
Since *grmpy*'s algorithm is fast, the gridsize can easily be increased up to *N* evaluation points.
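As a concrete reading of this rule of thumb, consider the following sketch, where ``n_obs`` stands for the number of observations that remain after trimming the common support:

.. code-block:: python

    def choose_gridsize(n_obs, default=400):
        """Rule-of-thumb gridsize for the locally quadratic regression."""
        if n_obs < default:
            # Small sample: one evaluation point per remaining observation.
            return n_obs
        # Large sample: 400 is the minimum; anything up to n_obs works.
        return default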

The "rbandwidth", which is 0.05 by default, specifies the bandwidth for the LOWESS (Locally Weighted Scatterplot Smoothing) regression of
:math:`X`, :math:`X \ \times \ p`, and :math:`Y` on :math:`\widehat{P}(Z)`. If the sample size is small (*N* < 400),
the user may need to increase "rbandwidth" to 0.1. Otherwise, *grmpy* will throw an error.
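The sketch below illustrates this smoothing step with the LOWESS implementation from *statsmodels* (an illustration only, not *grmpy*'s internal code; the ``frac`` argument plays the role of "rbandwidth"):

.. code-block:: python

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(0)
    p_hat = rng.uniform(0.05, 0.95, size=1000)          # stand-in for the estimated P(Z)
    y = 0.5 * p_hat + rng.normal(scale=0.1, size=1000)  # stand-in outcome

    # Smooth y on the propensity score; the residual feeds the LIV estimator.
    y_fit = lowess(y, p_hat, frac=0.05, return_sorted=False)
    res_y = y - y_fit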

Note that the MTE identified by LIV consists of two components: :math:`\overline{x}(\beta_1 - \beta_0)` (which does not depend on :math:`P(Z) = p`) and :math:`k(p)`
(which does depend on :math:`p`). The latter is estimated nonparametrically. The section "ps_range" in the initialization file specifies the interval
over which :math:`k(p)` is estimated. After the data outside the overlapping support are trimmed, the locally quadratic kernel estimator
uses the remaining data to predict :math:`k(p)` over the entire "ps_range" specified by the user. If "ps_range" is larger than the common support, *grmpy*
extrapolates the values for the MTE outside this region. Technically speaking, interpretations of the MTE are only valid within the common support.
In our empirical applications, we set "ps_range" to :math:`[0.005, 0.995]`.
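In this notation, the estimated MTE is assembled as

.. math::

    \widehat{MTE}(\overline{x}, u_D) = \overline{x} \left(\widehat{\beta}_1 - \widehat{\beta}_0\right) + \widehat{k}(u_D),

where the first component is evaluated at the sample mean of the covariates and the second is the locally quadratic estimate of :math:`k(p)` at :math:`p = u_D`. This mirrors how the observed and unobserved components are combined in the bootstrap helper shipped with the tutorial notebook.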

The other parameters in this section are set by default and normally do not need to be changed.


**TREATED**

The *TREATED* block specifies the number and order of the covariates determining the potential outcome in the treated state
and the values for the coefficients :math:`\beta_1`. Note that the length of the list which determines the parameters has to equal
the number of variables included in the order list.
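As a quick sanity check of this requirement (a sketch; ``read`` is imported from ``grmpy.read.read`` as in the auxiliary module below, and the file path is the tutorial's parametric initialization file):

.. code-block:: python

    from grmpy.read.read import read

    dict_ = read("files/replication.grmpy.yml")

    # Every coefficient in params needs a matching covariate in order.
    treated = dict_["TREATED"]
    assert len(treated["params"]) == len(treated["order"])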

======= ========= ====== ===================================
Key Container Values Interpretation
112 changes: 112 additions & 0 deletions promotion/grmpy_tutorial_notebook/files/tutorial_semipar.yml
@@ -0,0 +1,112 @@
---
ESTIMATION:
file: data/aer-replication-mock.pkl
dependent: wage
indicator: state
semipar: True
show_output: True
logit: True
nbins: 30
bandwidth: 0.322
gridsize: 500
trim_support: True
reestimate_p: False
rbandwidth: 0.05
derivative: 1
degree: 2
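# Note: ps_range bounds the interval over which k(p) is estimated;
# MTE values outside the common support are extrapolated.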
ps_range: [0.005, 0.995]
TREATED:
order:
- exp
- expsq
- lwage5
- lurate
- cafqt
- cafqtsq
- mhgc
- mhgcsq
- numsibs
- numsibssq
- urban14
- lavlocwage17
- lavlocwage17sq
- avurate
- avuratesq
- d57
- d58
- d59
- d60
- d61
- d62
- d63
UNTREATED:
order:
- exp
- expsq
- lwage5
- lurate
- cafqt
- cafqtsq
- mhgc
- mhgcsq
- numsibs
- numsibssq
- urban14
- lavlocwage17
- lavlocwage17sq
- avurate
- avuratesq
- d57
- d58
- d59
- d60
- d61
- d62
- d63
CHOICE:
params:
- 1.0
order:
- const
- cafqt
- cafqtsq
- mhgc
- mhgcsq
- numsibs
- numsibssq
- urban14
- lavlocwage17
- lavlocwage17sq
- avurate
- avuratesq
- d57
- d58
- d59
- d60
- d61
- d62
- d63
- lwage5_17numsibs
- lwage5_17mhgc
- lwage5_17cafqt
- lwage5_17
- lurate_17
- lurate_17numsibs
- lurate_17mhgc
- lurate_17cafqt
- tuit4c
- tuit4cnumsibs
- tuit4cmhgc
- tuit4ccafqt
- pub4
- pub4numsibs
- pub4mhgc
- pub4cafqt
DIST:
params:
- 0.1
- 0.0
- 0.0
- 0.1
- 0.0
- 1.0
151 changes: 151 additions & 0 deletions promotion/grmpy_tutorial_notebook/tutorial_semipar_auxiliary.py
@@ -0,0 +1,151 @@
"""This module provides auxiliary functions for the semiparametric tutorial"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

from sklearn.utils import resample
from scipy.stats import norm

from grmpy.estimate.estimate_semipar import estimate_treatment_propensity
from grmpy.estimate.estimate_semipar import inputs_decision_equation
from grmpy.estimate.estimate_semipar import process_default_input
from grmpy.estimate.estimate_semipar import process_user_input
from grmpy.estimate.estimate_semipar import mte_components
from grmpy.estimate.estimate_semipar import trim_support

from grmpy.check.check import check_presence_init
from grmpy.check.auxiliary import read_data
from grmpy.read.read import read


def plot_semipar_mte(rslt, init_file, nbootstraps=250):
"""This function plots the original and the replicated MTE from Carneiro et al. (2011)"""
# mte per year of university education
mte = rslt["mte"] / 4
quantiles = rslt["quantiles"]

# bootstrap 90 percent confidence bands
mte_boot = bootstrap(init_file, nbootstraps)

# mte per year of university education
mte_boot = mte_boot / 4

# Get standard error of MTE at each gridpoint u_D
mte_boot_std = np.std(mte_boot, axis=1)

# Compute 90 percent confidence intervals
con_u = mte + norm.ppf(0.95) * mte_boot_std
con_d = mte - norm.ppf(0.95) * mte_boot_std

# Load original data
    mte_ = pd.read_csv("data/mte_semipar_original.csv")

# Plot
ax = plt.figure(figsize=(17.5, 10)).add_subplot(111)

ax.set_ylabel(r"$MTE$", fontsize=20)
ax.set_xlabel("$u_D$", fontsize=20)
ax.tick_params(
axis="both", direction="in", length=5, width=1, grid_alpha=0.25, labelsize=14
)
ax.xaxis.set_ticks_position("both")
ax.yaxis.set_ticks_position("both")
ax.xaxis.set_ticks(np.arange(0, 1.1, step=0.1))
ax.yaxis.set_ticks(np.arange(-1.8, 0.9, step=0.2))

ax.set_ylim([-0.77, 0.86])
ax.set_xlim([0, 1])
ax.margins(x=0.003)
ax.margins(y=0.03)

# Plot replicated curves
ax.plot(quantiles, mte, label="replicated $MTE$", color="orange", linewidth=4)
ax.plot(quantiles, con_u, color="orange", linestyle=":", linewidth=3)
ax.plot(quantiles, con_d, color="orange", linestyle=":", linewidth=3)

# Plot original curve
ax.plot(
mte_["quantiles"],
mte_["mte"],
label="$original MTE$",
color="blue",
linewidth=4,
)
ax.plot(mte_["quantiles"], mte_["con_u"], color="blue", linestyle=":", linewidth=3)
ax.plot(mte_["quantiles"], mte_["con_d"], color="blue", linestyle=":", linewidth=3)

    orange_patch = mpatches.Patch(color="orange", label="replicated $MTE$")
    blue_patch = mpatches.Patch(color="blue", label="original $MTE$")
    plt.legend(handles=[orange_patch, blue_patch], prop={"size": 16})

plt.show()

return mte, quantiles


def bootstrap(init_file, nbootstraps):
"""
This function generates bootsrapped standard errors
given an init_file and the number of bootsraps to be drawn.
"""
check_presence_init(init_file)
dict_ = read(init_file)

# Process the information specified in the initialization file
nbins, logit, bandwidth, gridsize, a, b = process_user_input(dict_)
trim, rbandwidth, reestimate_p = process_default_input(dict_)

# Suppress output
show_output = False

# Prepare empty array to store output values
mte_boot = np.zeros([gridsize, nbootstraps])

# Load the baseline data
data = read_data(dict_["ESTIMATION"]["file"])

counter = 0
while counter < nbootstraps:
boot_data = resample(data, replace=True, n_samples=len(data), random_state=None)

# Process the inputs for the decision equation
indicator, D, Z = inputs_decision_equation(dict_, boot_data)

# Estimate propensity score P(z)
ps = estimate_treatment_propensity(D, Z, logit, show_output)

if isinstance(ps, np.ndarray): # & (np.min(ps) <= 0.3) & (np.max(ps) >= 0.7):
# Define common support and trim the data, if trim=True
boot_data, ps = trim_support(
dict_,
boot_data,
logit,
ps,
indicator,
nbins,
trim,
reestimate_p,
show_output,
)

# Estimate the observed and unobserved component of the MTE
X, b1_b0, mte_u = mte_components(
dict_, boot_data, ps, rbandwidth, bandwidth, gridsize, a, b, show_output
)

# Calculate the MTE component that depends on X
mte_x = np.dot(X, b1_b0).mean(axis=0)

# Put the MTE together
mte = mte_x + mte_u
mte_boot[:, counter] = mte

counter += 1

else:
continue

return mte_boot
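A minimal usage sketch for the two helpers above (``rslt`` is assumed to come from the semiparametric estimation step in the accompanying notebook; paths are relative to ``promotion/grmpy_tutorial_notebook``):

.. code-block:: python

    mte_boot = bootstrap("files/tutorial_semipar.yml", nbootstraps=250)
    mte_boot.shape  # (gridsize, nbootstraps), i.e. (500, 250) for this init file

    # mte, quantiles = plot_semipar_mte(rslt, "files/tutorial_semipar.yml")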
