# Augenblick and Rabin, 2019, "An Experiment on Time Preference and Misprediction in Unpleasant Tasks", Table 1

#### Authors:  

- Massimiliano Pozzi (Bocconi University, pozzi.massimiliano@studbocconi.it)
- Salvatore Nunnari (Bocconi University, salvatore.nunnari@unibocconi.it)

#### Description:

The code in this Jupyter notebook computes the aggregate estimates needed to replicate column 1 of Table 1.

This notebook was tested with the following package versions:
- Pozzi:   (Anaconda 4.10.3 on Windows 10 Pro) : python 3.8.3, numpy 1.18.5, pandas 1.0.5, scipy 1.5.0, autograd 1.3
- Nunnari: (Anaconda 4.10.1 on macOS 10.15.7): python 3.8.10, numpy 1.20.2, pandas 1.2.4, scipy 1.6.2, autograd 1.3

In [14]:
# Import the necessary libraries

from autograd.scipy.stats import norm    
import autograd.numpy as np
import pandas as pd
import scipy.optimize as opt
from autograd import grad, hessian

## 1. Data Cleaning and Data Preparation

We import the dataset containing the choices of all 100 individuals who participated in the experiment. To guarantee consistency with the authors' results, we then construct the primary sample used for the aggregate estimates. This sample consists of the 72 individuals whose individual parameter estimates converged in fewer than 200 iterations when using the authors' Stata algorithm. In particular, we run the "03MergeIndMLEAndConstructMainSample.do" file provided by the authors. This script creates a file named "ind_to_keep.csv", which contains the identifiers of the individuals to keep.

In [15]:
# Import the two datasets and drop subjects whose individual estimates do not converge

dt = pd.read_stata('../input/decisions_data.dta')    # full sample
ind_keep = pd.read_csv('../input/ind_to_keep.csv')   # import csv with ID of subjects to keep 

# drop subjects whose IDs are not listed in the ind_keep dataframe (28 individuals)

dt = dt[dt.wid.isin(ind_keep.wid_col1)] # this is the primary sample for the aggregate estimates (72 individuals)
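The filtering step can be illustrated on a toy example. This is only a sketch: the two small data frames below are made-up stand-ins for decisions_data.dta and ind_to_keep.csv, and the names `decisions`, `keep` and `kept` are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for the two input files (values are illustrative only)
decisions = pd.DataFrame({"wid": [1, 1, 2, 3, 3, 3],
                          "effort": [10, 40, 55, 110, 20, 35]})
keep = pd.DataFrame({"wid_col1": [1, 3]})

# Keep only the rows whose subject ID appears in the list of IDs to keep
kept = decisions[decisions["wid"].isin(keep["wid_col1"])]

# Count how many subjects were dropped, as a quick sanity check
n_dropped_subjects = decisions["wid"].nunique() - kept["wid"].nunique()
print(len(kept), n_dropped_subjects)
```

Comparing the number of unique IDs before and after the filter is a cheap way to confirm that the expected number of subjects was removed.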

We remove observations when a bonus was offered and create the following dummy variables that will be useful for estimation: pb is equal to one if the subject completed 10 mandatory tasks on subject-day (this is used to estimate the projection bias parameter &alpha;); ind_effort10 and ind_effort110 are equal to one if, respectively, the subject completed 10 or 110 tasks (and they are used for the Tobit correction when computing the likelihood).

In [16]:
# Remove observations when a bonus was offered and create dummy variables. 

dt = dt[dt.bonusoffered !=1]   # remove observations when a bonus was offered
dt['pb']= dt['workdone1']/10   # pb dummy variable. workdone1 can either be 10 or 0, so dividing the variable by 10 creates our dummy
dt['ind_effort10']  = (dt['effort']==10).astype(int)   # ind_effort10 dummy
dt['ind_effort110'] = (dt['effort']==110).astype(int)  # ind_effort110 dummy
dt.index = np.arange(len(dt.wid))                      # correct the index. The index should go from 0 to 8048
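As a sketch of the same preparation steps on made-up data, the cell below builds the three dummies with boolean comparisons and resets the index with `reset_index(drop=True)`. The data frame `toy` and its values are hypothetical. The comparison `workdone1 == 10` is an alternative to dividing by 10 that gives the same dummy as long as workdone1 only takes the values 0 and 10.

```python
import pandas as pd

# Made-up observations (illustrative values only)
toy = pd.DataFrame({
    "bonusoffered": [0, 1, 0, 0],
    "workdone1": [10, 10, 0, 10],
    "effort": [10, 60, 110, 45],
})

toy = toy[toy["bonusoffered"] != 1].copy()                # drop bonus observations
toy["pb"] = (toy["workdone1"] == 10).astype(int)          # projection bias dummy
toy["ind_effort10"] = (toy["effort"] == 10).astype(int)   # lower corner dummy
toy["ind_effort110"] = (toy["effort"] == 110).astype(int) # upper corner dummy
toy = toy.reset_index(drop=True)                          # index runs from 0 to len(toy)-1
```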

## 2. Define the Model and the Likelihood (Section 3 in Paper)

The agent needs to choose the optimal effort e to solve a simple tradeoff problem between disutility of effort and consumption utility derived from the consequent payment. More specifically, the agent takes a decision at a time k to complete a certain number of tasks at time t and to get paid a wage w per task at time T. Assuming the agent discounts utility using quasi-hyperbolic discounting and has a convex cost function C(e) the problem can be conveniently written as:

$$ \max_{{e}} \; \delta^{T-k}⋅(e⋅w)- \frac{1}{\beta^{I(k=t)}}⋅\frac{1}{\beta_h^{I(p=1)}}⋅\delta^{t-k}⋅ \frac{e^\gamma}{\phi⋅\gamma} $$

Where the last term is a two-parameter power cost function with slope parameter &phi; and curvature parameter &gamma;, I(k=t) is an indicator function equal to one if the decision occurs in the same period as the effort, I(p=1) is an indicator function equal to one if the decision is a prediction, &beta; is the present bias parameter, &beta;<sub>h</sub> is the perceived present bias parameter (that is, the agent's degree of awareness of his present bias), and &delta; is the standard time discounting parameter. Taking the derivative of the objective above with respect to effort and setting it equal to zero yields the following first order condition:

$$  e^*= \left(\frac{\delta^{T-k}⋅\phi⋅w}{\frac{1}{\beta^{I(k=t)}}⋅\frac{1}{\beta_h^{I(p=1)}}⋅\delta^{t-k}} \right)^{\frac{1}{\gamma-1}} $$

This is the optimal effort level, or what we will call in the code the predicted choice. To model heterogeneity, the authors assume that the observed effort is distributed as the predicted effort plus an implementation error which is Gaussian with mean zero and standard deviation &sigma;, so that the likelihood of observing an effort decision e<sub>j</sub> in the data is equal to:

$$ L(e_j)= \phi \left(\frac{e^*_j-e_j}{\sigma}\right)$$

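where &phi;(⋅) is the pdf of a standard normal (not to be confused with the cost slope parameter &phi; in the cost function). In the code, norm.pdf(effort, predchoice, sigma) evaluates the full normal density, which also includes the factor 1/&sigma;.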
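The first order condition above can be checked numerically. The sketch below uses hypothetical parameter values chosen only for illustration (they are not estimates), evaluates the closed form for e<sup>*</sup>, and compares it with a brute-force grid search over the agent's objective.

```python
import numpy as np

# Hypothetical values, chosen only to make the arithmetic easy to follow
beta, betahat, delta, gamma, phi = 0.8, 1.0, 1.0, 2.0, 500.0
wage, T_minus_k, t_minus_k = 0.2, 3, 1
today, prediction = 1, 0   # decision taken on the work day, not a prediction

def objective(e):
    """Discounted payment minus discounted effort cost, as in the maximization problem."""
    benefit = delta**T_minus_k * e * wage
    cost = delta**t_minus_k * e**gamma / (phi * gamma)
    return benefit - cost / (beta**today * betahat**prediction)

# Closed-form optimum from the first order condition (T - t = (T - k) - (t - k))
e_star = (phi * delta**(T_minus_k - t_minus_k) * beta**today
          * betahat**prediction * wage) ** (1 / (gamma - 1))

# Brute-force check on a grid with step 0.01
grid = np.linspace(1, 200, 19901)
e_grid = grid[np.argmax(objective(grid))]
print(e_star, e_grid)
```

With these values the objective is 0.2e - e<sup>2</sup>/800, so both approaches should agree on an optimum of 80 tasks, up to the grid step.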

To deal with corner solutions we apply a Tobit correction, so that the likelihood to maximize is:

$$ L^{tobit}(e_j)=(1-I(e=10)-I(e=110))⋅\phi \left(\frac{e^*_j-e_j}{\sigma}\right) + I(e=10)⋅\left(1- \Phi \left(\frac{e_j^*-10}{\sigma}\right)\right)+I(e=110)⋅ \Phi \left(\frac{e_j^*-110}{\sigma}\right) $$

where &Phi;(⋅) is the cdf of a standard normal, while I(e=10) and I(e=110) are the indicators ind_effort10 and ind_effort110 explained above. Note that, to keep the code simple, in this notebook we call effort the total number of tasks performed by the agent, that is, the number of tasks chosen by the agent (ranging between 0 and 100) plus the 10 compulsory tasks. In the paper, the authors call effort just the number of tasks chosen by the agent (and, thus, they add 10 tasks to get total effort). This explains the differences between the equations in this notebook and equations (7), (8) and (10) in Section 3 of the paper.

Our goal is to minimize the negative of the sum of the logarithms of L<sup>tobit</sup>.
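The Tobit-corrected likelihood can be sketched in vectorized form as follows. This is a minimal illustration, not the function used for estimation: the name `tobit_prob` and its `lower` and `upper` arguments are hypothetical, it uses scipy.stats rather than autograd, and it selects the corner cases with `effort <= lower` and `effort >= upper` instead of the dummies, which is equivalent when effort lies between 10 and 110.

```python
import numpy as np
from scipy.stats import norm

def tobit_prob(effort, predchoice, sigma, lower=10, upper=110):
    """Likelihood contribution of each observation with censoring at both corners."""
    effort = np.asarray(effort, dtype=float)
    predchoice = np.asarray(predchoice, dtype=float)
    interior = norm.pdf(effort, loc=predchoice, scale=sigma)  # density, includes 1/sigma
    at_lower = norm.cdf((lower - predchoice) / sigma)         # P(latent effort <= 10)
    at_upper = norm.sf((upper - predchoice) / sigma)          # P(latent effort >= 110)
    return np.where(effort <= lower, at_lower,
                    np.where(effort >= upper, at_upper, interior))

# Example: one observation at each corner and one interior observation
print(tobit_prob([10, 110, 60], [35.0, 90.0, 60.0], 40.0))
```

By the symmetry of the normal distribution, `norm.cdf((10 - e*) / sigma)` equals 1 - &Phi;((e<sup>*</sup>-10)/&sigma;) and `norm.sf((110 - e*) / sigma)` equals &Phi;((e<sup>*</sup>-110)/&sigma;), so the two corner terms match the formula above.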

In [17]:
# the function negloglike computes the negative of the log likelihood of observing our data given the parameters of the model.

# parameters:

# beta is the present bias parameter
# betahat is the perceived present bias parameter
# delta is the usual time-discounting parameter
# gamma and phi are the two parameters controlling the cost of effort function
# alpha is the projection bias parameter
# sigma is the standard deviation of the normal error term ϵ

# args:

# netdistance is (T-k)-(t-k) = T-t, the difference between the payment date T and the work time t
# wage is the amount paid per task in a certain session
# today is a dummy variable equal to one if the decision involves the choice of work today
# prediction is a dummy variable equal to one if the decision involves the choice of work in the future
# pb is a dummy equal to one if the subject completed 10 mandatory tasks on subject-day 
# effort is the number of tasks completed by a subject in a session. It can range from a minimum of 10 to a maximum of 110
# ind_effort10 is a dummy equal to one if the subject's effort was equal to 10
# ind_effort110 is a dummy equal to one if the subject's effort was equal to 110


def negloglike(params, *args):
    
    beta, betahat, delta, gamma, phi, alpha, sigma = params
    netdistance, wage, today, prediction, pb, effort, ind_effort10, ind_effort110 = args
    
    # We use np.array to allow for element-wise operations
    
    netdistance = np.array(netdistance)
    wage = np.array(wage)
    today = np.array(today)
    prediction = np.array(prediction)
    pb = np.array(pb)
    effort = np.array(effort)
    ind_effort10 = np.array(ind_effort10)
    ind_effort110 = np.array(ind_effort110)
    
    # predchoice is the predicted choice coming from the optimality condition of the subject
    
    predchoice=((phi*(delta**netdistance)*(beta**today)*(betahat**prediction)*wage)**(1/(gamma-1)))-pb*alpha
    
    # prob is a 1x8049 vector containing the probability of observing the effort of an individual. If effort is 10 or 110 we apply a Tobit correction
    
    prob = (1-ind_effort10-ind_effort110)*norm.pdf(effort, predchoice, sigma)+ind_effort10*(1 - norm.cdf((predchoice-effort)/sigma))+ind_effort110*norm.cdf((predchoice-effort)/sigma)
            
    # Replace probabilities exactly equal to 0 or 1 with values slightly inside (0, 1). This is necessary to avoid problems when taking logs.
    # np.where builds a new array instead of assigning in place, which autograd arrays do not support.
    
    prob = np.where(prob == 0, 1E-4, prob)      # prob=0 becomes 1E-4
    prob = np.where(prob == 1, 1 - 1E-4, prob)  # prob=1 becomes 1 - 1E-4
    
    negll = - np.sum(np.log(prob)) # negative log likelihood
    
    return negll

## 3. Estimation

### Point Estimates

We now estimate the model. First, we need to initialize a vector with the starting parameters for the minimization algorithm. We then minimize the negative log-likelihood function using the scipy.optimize package and the Nelder-Mead algorithm.

In [18]:
# Define the initial guesses (same as the ones used by the authors in their do.file) and the arguments for the function to minimize

# starting parameters for the algorithm

beta_init, betahat_init, delta_init, gamma_init, phi_init, alpha_init, sigma_init = 0.8, 1, 1, 2, 500, 7, 40
par_init = [beta_init, betahat_init, delta_init, gamma_init, phi_init, alpha_init, sigma_init]

# args necessary for the function to minimize

mle_args = (dt['netdistance'],dt['wage'],dt['today'],dt['prediction'],dt['pb'],dt['effort'],dt['ind_effort10'],dt['ind_effort110'])

# we now find the estimates using the scipy.optimize package

sol = opt.minimize(negloglike, par_init, args=(mle_args), method='Nelder-Mead', options={'maxiter': 1500})
res = sol.x
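Because Nelder-Mead can stop at the iteration limit without converging, it may be worth inspecting the fields of the result object returned by `opt.minimize`. The sketch below does this on a toy problem: it fits the mean and standard deviation of synthetic normal data (not the experimental data), and the names `toy_negloglike` and `toy_sol` are hypothetical.

```python
import numpy as np
import scipy.optimize as opt

# Synthetic data, used only to illustrate the workflow
rng = np.random.default_rng(0)
y = rng.normal(loc=50.0, scale=5.0, size=2000)

def toy_negloglike(params, y):
    """Negative log-likelihood of i.i.d. normal data with mean mu and sd sigma."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf   # rule out invalid standard deviations
    return (0.5 * len(y) * np.log(2 * np.pi * sigma**2)
            + np.sum((y - mu) ** 2) / (2 * sigma**2))

toy_sol = opt.minimize(toy_negloglike, [40.0, 10.0], args=(y,), method="Nelder-Mead",
                       options={"maxiter": 1500, "xatol": 1e-6, "fatol": 1e-6})

# success is False if the algorithm stopped because it hit maxiter
print(toy_sol.success, toy_sol.nit, toy_sol.x)
```

The same check applies to the estimation above through `sol.success`, `sol.nit` and `sol.message`.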

### Standard Errors

We now estimate individual cluster robust standard errors. 

These are computed by taking the square root of the diagonal elements of the following matrix: 

$$ Adj⋅(H^{-1} @ G @ H^{-1}) $$ 

Where Adj is an adjustment for the degrees of freedom and the number of clusters:

$$ Adj = \frac{Nr.observations-1}{Nr.observations-Nr.parameters}⋅\frac{Nr.clusters}{Nr.clusters-1} $$ 

H<sup>-1</sup> is the inverse of the Hessian of the negative log-likelihood evaluated at the minimum (our estimates), @ stands for matrix multiplication, and G is a 7x7 matrix of gradient contributions (one row and one column for each of the seven parameters).

We denote the gradient of the log likelihood function for a generic individual i as follows:

$$  g_i(y|\theta) = [log f_i(y|\theta)]' = \frac{\partial}{\partial \theta} log f_i(y|\theta) $$

where &theta; is the parameters vector and f<sub>i</sub>(y|&theta;) the likelihood function. Then G is defined as follows:

$$ G = \sum_{j=1}^{J} \left[\sum_{i \in c_j}g_i(y|\hat{\theta})\right]^T\left[\sum_{i \in c_j}g_i(y|\hat{\theta})\right] $$

where J is the number of clusters (in our case the number of unique individuals = 72) and c<sub>j</sub> is a generic cluster j, that includes all observations for a specific individual (in our case 130). For more information on how to compute standard errors when using maximum likelihood, we refer the reader to David A. Freedman, 2006, ["On The So-Called 'Huber Sandwich Estimator' and 'Robust Standard Errors'"](https://snunnari.github.io/freedman.pdf), *The American Statistician*, 60:4, 299-302.
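The sandwich formula can be sketched on a toy problem in which the gradient and the Hessian are available in closed form. The example below is an assumption-laden illustration, not the estimation in this notebook: it uses synthetic data, a linear model with objective 0.5⋅&Sigma;(y - X&theta;)<sup>2</sup> (a Gaussian negative log-likelihood with unit variance, up to a constant), and hypothetical names throughout.

```python
import numpy as np

# Synthetic clustered data: 30 clusters with 20 observations each
rng = np.random.default_rng(1)
n_clusters, n_per = 30, 20
cluster = np.repeat(np.arange(n_clusters), n_per)
N = n_clusters * n_per
X = np.column_stack([np.ones(N), rng.normal(size=N)])
u = rng.normal(size=n_clusters)[cluster] + rng.normal(size=N)  # cluster shock + noise
y = X @ np.array([1.0, 2.0]) + u

# Minimizer of 0.5 * sum((y - X theta)^2)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ theta_hat

scores = -resid[:, None] * X   # per-observation gradients of the objective, shape (N, 2)
H = X.T @ X                    # Hessian of the objective

# Sum the gradients within each cluster, then add up the outer products
cluster_scores = np.zeros((n_clusters, X.shape[1]))
np.add.at(cluster_scores, cluster, scores)
G = cluster_scores.T @ cluster_scores

k = X.shape[1]
adj = (N - 1) / (N - k) * n_clusters / (n_clusters - 1)
H_inv = np.linalg.inv(H)
V = adj * H_inv @ G @ H_inv
se_cluster_toy = np.sqrt(np.diag(V))
print(se_cluster_toy)
```

Stacking the cluster sums in a matrix and computing `cluster_scores.T @ cluster_scores` gives the same G as summing the outer products one cluster at a time.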

In [19]:
# Define the function that computes the matrix of individual gradient contribution G. 
# We use the autograd package that performs automatic differentiation. Automatic differentiation yields more precise results than finite differences

def gradcontr(dt, parameters):
    
    G = np.zeros((len(parameters), len(parameters))) # A 7x7 matrix 
    vsingle_grad = []  # This will be a 1x8049 vector whose elements are 1x7 vectors. Each 1x7 vector is the gradient of negloglike for a single observation
    
    for j in range(0, len(dt.wid)):  # loop over all 8049 observations
        
        # args needed to compute the individual observation likelihood
        args_ind = ([dt['netdistance'][j]],[dt['wage'][j]],[dt['today'][j]],[dt['prediction'][j]],[dt['pb'][j]],
                     [dt['effort'][j]],[dt['ind_effort10'][j]],[dt['ind_effort110'][j]]) 
    
        single_grad = np.array(gradfun(parameters, *args_ind))  # 1x7 vector. gradient of the negative log likelihood using only one observation in the dataset
        vsingle_grad.append(single_grad)

    # we create a two columns dataframe. The first one is the wid, the second one is vsingle_grad. This will simplify summing the gradients
    # over a specific individual. Each element in the column gradient is a 1x7 vector.
    
    dg = pd.DataFrame({'wid': dt.wid, 'gradient': vsingle_grad})
    
    for wid in np.unique(dt.wid): # loop over the individuals IDs
        
        ind_grad = [sum(i) for i in zip(*dg.loc[dg['wid'] == wid].gradient)] # we are summing the single observation gradients element-wise.
        G += np.outer(ind_grad,ind_grad)                                     # we take the outer product and sum them
        
    return G

In [20]:
# Compute the individual cluster robust standard errors

# Compute the hessian
Hfun = hessian(negloglike)
hess = Hfun(res, *mle_args)         # hessian (named hess so that the hessian function imported from autograd is not overwritten)
hess_inv = np.linalg.inv(hess)      # inverse of the hessian

# Compute the matrix of gradient contribution
gradfun = grad(negloglike)
grad_contribution = gradcontr(dt, res)

# Compute the adjustment for degrees of freedom and number of clusters
adj = (len(dt.wid)-1)/(len(dt.wid)-len(res)) * len(np.unique(dt.wid))/(len(np.unique(dt.wid))-1)

varcov_estimates = adj *(hess_inv @ grad_contribution @ hess_inv) # var-cov matrix of our estimates
se_cluster = np.sqrt(np.diag((varcov_estimates)))                 # individual cluster robust standard errors
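To get a sense of the size of the adjustment, the short calculation below plugs in the sample sizes reported in the output of this notebook (8,049 observations, 7 parameters, 72 clusters). The variable names are hypothetical.

```python
# Sample sizes as reported in the output of this notebook
n_obs, n_params, n_clusters = 8049, 7, 72

adj = (n_obs - 1) / (n_obs - n_params) * n_clusters / (n_clusters - 1)
se_inflation = adj ** 0.5   # standard errors scale with the square root of adj

print(round(adj, 4))           # 1.0148
print(round(se_inflation, 4))  # 1.0074
```

With these sample sizes the adjustment increases the standard errors by less than one percent, and almost all of it comes from the cluster term 72/71.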

### Hypothesis Testing

We now do some hypothesis testing on the parameters we obtained. We compute the z-test statistics and the corresponding p-values to check if beta, betahat or delta are statistically different from one. We then compute the p-value of a z-test to check if the parameter for projection bias is statistically different from zero.

In [21]:
# Compute the z-test statistics and the corresponding p-values to check if beta, betahat, delta are statistically different from one

zvalues_1 = (np.array(res[0:3])-1)/np.array(se_cluster[0:3]) # the first three elements are for beta (position 0), betahat (position 1) and delta (position 2)
pvalues_1 = 2*(1-norm.cdf(np.abs(zvalues_1),0,1))

# Now compute the z-test statistic and the corresponding p-value for H0: alpha = 0

zvalue_a = (np.array(res[5]))/np.array(se_cluster[5]) 
pvalue_a = 2*(1-norm.cdf(np.abs(zvalue_a),0,1))
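The two-sided z-test can be reproduced by hand with the standard library only. The sketch below plugs in the rounded estimates and standard errors for &beta; and &alpha; from the results table at the end of this notebook. Since the notebook computes its p-values from unrounded values, small differences are to be expected. The helper `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(estimate, se, null_value):
    """Return the z statistic and the two-sided p-value 2 * (1 - Phi(|z|))."""
    z = (estimate - null_value) / se
    p = math.erfc(abs(z) / math.sqrt(2))   # identical to 2 * (1 - Phi(|z|))
    return z, p

# Rounded values from the results table
z_beta, p_beta = two_sided_p(0.835, 0.038, 1.0)     # H0: beta = 1
z_alpha, p_alpha = two_sided_p(7.307, 2.598, 0.0)   # H0: alpha = 0

print(round(z_beta, 2), round(z_alpha, 2))
print(p_beta, p_alpha)
```

Using `math.erfc` avoids the loss of precision that 1 - cdf suffers when the p-value is very small.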

## 4. Print and Save Estimation Results

We create a table with point estimates and individual cluster robust standard errors. We then save the results as a csv file in the output folder and print the results. This replicates Column 1 of Table 1 in the paper.

In [22]:
# Create a new DataFrame with the results and save it as a csv file in output. We round the results to three decimal places.

parameters_name = ["Present Bias β",
                   "Naive Pres. Bias β_h",
                   "Discount Factor δ",
                   "Cost Curvature γ",
                   "Cost Slope ϕ",
                   "Proj Task Reduction α",
                   "Sd of error term σ"]

Table_1 = pd.DataFrame({'parameters':parameters_name,'estimates':np.round(res,3),'standarderr':np.round(se_cluster,3)})

Table_1.to_csv('../output/table1_python.csv')

In [23]:
# Print the results

from IPython.display import display

print("Table 1: Primary aggregate structural estimation")
display(Table_1)
print("Number of observations:","{:,}".format(len(dt.wid)))
print("Number of participants:","{:,}".format(len(np.unique(dt.wid))))
print("Log Likelihood:","{:,.0f}".format(-sol.fun))
print("H_0(β=1)","{:,.2f}".format(np.round(pvalues_1[0],2)))
print("H_0(β_h=1):","{:,.2f}".format(np.round(pvalues_1[1],2)))
print("H_0(α=0):","{:,.3f}".format(np.round(pvalue_a,3)))
print("H_0(δ=1):","{:,.2f}".format(np.round(pvalues_1[2],2)))

Table 1: Primary aggregate structural estimation


| | parameters | estimates | standarderr |
|---|---|---|---|
| 0 | Present Bias β | 0.835 | 0.038 |
| 1 | Naive Pres. Bias β_h | 0.999 | 0.011 |
| 2 | Discount Factor δ | 1.003 | 0.003 |
| 3 | Cost Curvature γ | 2.145 | 0.07 |
| 4 | Cost Slope ϕ | 723.974 | 251.855 |
| 5 | Proj Task Reduction α | 7.307 | 2.598 |
| 6 | Sd of error term σ | 42.625 | 3.306 |


Number of observations: 8,049
Number of participants: 72
Log Likelihood: -28,412
H_0(β=1) 0.00
H_0(β_h=1): 0.92
H_0(α=0): 0.005
H_0(δ=1): 0.37
