## Problem Set 3: Household Expenditures and The Supplemental Nutrition Assistance Program<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1)

**Harvard University**<br/>
**Spring 2023**<br/>
**Instructor**: Gregory Bruich, Ph.D.

- Posted on: 02/07/2023
- Due at: 11:59pm on 02/14/2023 <3


<a name="cite_note-1"></a>[1.](#cite_ref-1) Background information is from Bruich (2014)

<hr style="height:2.4pt">

### Suggested Imports

In [2]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from stargazer.stargazer import Stargazer

### Background

The Food Stamp Program (now called the Supplemental Nutrition Assistance Program or SNAP) provides income to low-income households each month with the stated goal of helping them buy food. In April 2014, the average SNAP household consisted of just over two people and received $256 in benefits per month. SNAP benefits are restricted in that they can only be used to pay for certain food items purchased at retailers that have applied for and received authorization to
participate in the program from the U.S. Department of Agriculture. Excluded items include alcohol, hot foods, and toiletries. 

There are two ways for a household to become eligible for SNAP. The first way is if all members of the household receive benefits through either the Supplemental Security Income program,<a name="cite_ref-2"></a>[<sup>[2]</sup>](#cite_note-2) the Temporary Assistance for Needy Families program, or a county general assistance program. The second way is if household income and assets are below certain thresholds. Income and assets are measured the month prior to applying for benefits and are re-assessed at periodic intervals (typically 6 months). In addition, there are minimum work requirements (20 hours per week) for non-disabled adults between 18 and 50 years old without children. SNAP benefits may only be received by adults who do not meet this work requirement for three months out of the previous three years. Beneficiaries can often substitute work training or volunteering for the work
requirement.

Table 1 describes the variables included in the Stata dataset `snap.dta`. The dataset is an extract from the National Household Food Acquisition and Purchase Survey (FoodAPS) from the U.S. Department of Agriculture (USDA). The sample is restricted to households receiving SNAP benefits. Survey respondents kept track of all food items purchased over a 24-hour period.

The survey responses were merged with administrative data from the SNAP program, allowing us to measure the exact number of days since the household received its SNAP benefits. Spending and days since receipt of SNAP will be the main variables we utilize in this problem set, although we will also control for various other household level variables (e.g., the number of people in the household, whether the household owns a vehicle, whether the primary respondent has a high school education).

<a name="cite_note-2"></a>[2.](#cite_ref-2) An exception is that in California, Supplemental Security Income payments have included an additional cash amount for food stamp benefits since 1974. Therefore, individuals in California who receive Supplemental Security Income cannot also receive SNAP benefits separately.


<hr style="height:2.4pt">


### Data Description
**File**: `snap.dta`

The data consist of $n=1215$ households from the National Household Food Acquisition and Purchase Survey (FoodAPS) from the U.S. Department of Agriculture (USDA). All households are SNAP recipients.

**Table 1: Definitions of Selected Variables in `snap.dta`**

| Variable          | Description                                                              | N     | Mean   |
| ----------------- | ------------------------------------------------------------------------ | ----- | ------ |
| (1)               | (2)                                                                      | (3)   | (4)    |
|                   |                                                                          |       |        |
| `hhnum`           | 6-digit unique identifier for each household                             | 1,215 | n/a    |
| `spending`        | Total amount spent ($) on both food away from home and food at home      | 1,215 | $21.80 |
| `days`            | Days since SNAP last received                                            | 1,215 | 14.01  |
| `week1`           | 1st week of SNAP benefit month                                           | 1,215 | 0.321  |
| `week2`           | 2nd week of SNAP benefit month                                           | 1,215 | 0.233  |
| `week3`           | 3rd week of SNAP benefit month                                           | 1,215 | 0.230  |
| `week4`           | 4th week of SNAP benefit month                                           | 1,215 | 0.216  |
| `anyvehicle`      | Whether anybody in household owns or leases a vehicle (y/n)              | 1,215 | 0.724  |
| `hhsize`          | Number of people at residence, excluding guests                          | 1,215 | 3.420  |
| `primstoretime_d` | Driving time, in minutes, between residence and primary food store       | 1,215 | 8.512  |
| `white`           | Primary respondent is White                                              | 1,215 | 0.645  |
| `black`           | Primary respondent is Black                                              | 1,215 | 0.207  |
| `asian`           | Primary respondent is Asian or Native Hawaiian or Other Pacific Islander | 1,215 | 0.0156 |
| `hispanic`        | Primary respondent is Hispanic                                           | 1,215 | 0.246  |
| `highschool`      | Primary respondent has high school education                             | 1,215 | 0.720  |

*Notes:* Table defines the variables from the FoodAPS data.


<hr style="height:2.4pt">

### Data Load

In [3]:
# Read dataset into a pandas dataframe
snap = pd.read_stata("snap.dta")

# Display first 5 rows of data
snap.head()

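|   | hhnum  | spending  | snap_amount | days | week1 | week2 | week3 | week4 | anyvehicle | hhsize | primstoretime_d | white | black | asian | hispanic | highschool |
| - | ------ | --------- | ----------- | ---- | ----- | ----- | ----- | ----- | ---------- | ------ | --------------- | ----- | ----- | ----- | -------- | ---------- |
| 0 | 100012 | 5.52      | 125.0       | 27   | 0     | 0     | 0     | 1     | 1          | 5      | 2.37            | 1     | 0     | 0     | 0        | 1          |
| 1 | 100028 | 23.620001 | 725.0       | 17   | 0     | 0     | 1     | 0     | 1          | 7      | 4.23            | 1     | 0     | 0     | 0        | 0          |
| 2 | 100040 | 22.01     | 225.0       | 6    | 1     | 0     | 0     | 0     | 1          | 2      | 5.72            | 0     | 0     | 0     | 1        | 1          |
| 3 | 100069 | 59.75     | 175.0       | 17   | 0     | 0     | 1     | 0     | 1          | 4      | 5.15            | 0     | 0     | 0     | 0        | 1          |
| 4 | 100076 | 30.65     | 325.0       | 27   | 0     | 0     | 0     | 1     | 1          | 4      | 8.75            | 1     | 0     | 0     | 0        | 1          |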


<hr style="height:2.4pt">

### Instructions

Please submit your Problem Set on Canvas. Your submission should include two files:
1. This notebook as a `.ipynb` file with your code and answers to questions
2. A `.pdf` version of this notebook.

<hr style="height:2.4pt">

### Questions

*Note: Short answers should be very succinct. Show your work and intuition clearly: credit is given for explanations and not just having the correct answer*

### 1

Estimate the following regressions and generate a table of the results using `stargazer`. Report appropriate standard errors for each regression, and
explain how you decided which standard errors to use.

You will have to generate new variables: the square of `days`, natural log of `spending` and
household size (`hhsize`), as well as the interaction terms `week2` × `anyvehicle`, `week3` ×
`anyvehicle`, and `week4` × `anyvehicle`. See Table 2a and Table 2b for more guidance.

- a) Column 1: Regress `spending` in dollars on `days`, the square of `days`, the natural log of `hhsize`, the variable `highschool`, and the variable `anyvehicle`. In Greek:
$$Spending_i = \alpha_0 + \alpha_1days_i + \alpha_2days_i^2 + \alpha_3log(hhsize_i) + \alpha_4highschool_i + \alpha_5Vehicle_i + u_i$$

- b) Column 2: Regress the natural log of `spending` on `days`, the square of `days`, the natural log of `hhsize`, the variable `highschool`, and the variable `anyvehicle`. In Greek:
$$log(Spending_i) = \alpha_0 + \alpha_1days_i + \alpha_2days_i^2 + \alpha_3log(hhsize_i) + \alpha_4highschool_i + \alpha_5Vehicle_i + u_i$$

- c) Column 3: Regress the natural log of `spending` on the dummy variables `week2`, `week3`, and `week4`. Estimate this regression over the subset of observations **for whom `anyvehicle` equals 1**. In Greek:
$$log(Spending_i) = \alpha_0 + \alpha_1Week2_i + \alpha_2Week3_i + \alpha_3Week4_i + u_i$$

- d) Column 4: Regress the natural log of `spending` on the dummy variables `week2`, `week3`, and `week4`. Estimate this regression over the subset of observations **for whom `anyvehicle` equals 0**. In Greek:
$$log(Spending_i) = \alpha_0 + \alpha_1Week2_i + \alpha_2Week3_i + \alpha_3Week4_i + u_i$$

- e) Column 5: Regress the natural log of `spending` on the dummy variables `week2`, `week3`, `week4`, the variable `anyvehicle`, and new variables that equal `week2` × `anyvehicle`, `week3` × `anyvehicle`, and `week4` × `anyvehicle`. Estimate this regression over the full sample. In Greek:

$$\begin{aligned}
log(Spending_i) = & \alpha_0 + \alpha_1Week2_i + \alpha_2Week3_i + \alpha_3Week4_i + \alpha_4 Vehicle_i + \alpha_5 Week2_i \times Vehicle_i\\
& + \alpha_6 Week3_i \times Vehicle_i + \alpha_7 Week4_i \times Vehicle_i + u_i
\end{aligned}$$

In [None]:
# Your Code Here
reg = sm.ols()

<hr style="height:2.4pt">

### 2

The number of observations in **Column 1** is greater than in **Column 2**. Why?

*[Your Answer Here]*

<hr style="height:2.4pt">

### 3
Interpret, in words, the coefficient on log household size
- (i) in column 1 and
- (ii) in column 2.

*[Your Answer Here]*

<hr style="height:2.4pt">

### 4
Use the regression in column 2 of your table to do the following:
- a) Calculate the predicted effect of increasing days from $days = 1$ to $days = 2$
- b) Calculate the standard error of the predicted effect by hand. Does your regression output table have all the information needed to calculate this standard error? Explain why or why not. (You can check your answer using Stata's `lincom` or the `t_test` method of a statsmodels regression result.)
- c) Calculate a 95% confidence interval for your predicted effect by hand. (You can check your answer the same way.)

Note: you may use software such as Python, Stata, or R (or a calculator) to help with your calculations in (b) and (c), but you should write out the formula you use with the appropriate values plugged in.

In [None]:
# Your Code Here

*[Your Answer Here]*

<hr style="height:2.4pt">

### 5

This question has to do with the regressions in columns 3, 4, and 5 of your table.
- a) Using the regression in column 4, interpret the coefficients on `week2`, `week3`, `week4` in words.
- b) Using the regression in column 5, interpret the coefficients on `week2`, `week3`, `week4` in words. How do these coefficients compare with the coefficients you reported in column 4?
- c) The “fully interacted regression” in column 5 provides exactly the same information as the regressions reported in column 4 and column 3. To see this, show that the sum of the coefficient on `week2` and `week2` × `anyvehicle` in column 5 exactly equals the coefficient on `week2` in column 3.
- d) What do you think is an advantage of estimating the regressions separately as in columns 3 and 4? What do you think is an advantage to estimating the fully interacted regression as in column 5?
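
The equivalence described in part (c) can be illustrated on simulated data before you check it in your own table. The sketch below is not based on `snap.dta`: the dummies `d` and `g`, the outcome `y`, and the helper `ols` are all made up for illustration, and plain NumPy least squares stands in for the regression command.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
d = rng.integers(0, 2, n)  # hypothetical period dummy
g = rng.integers(0, 2, n)  # hypothetical group dummy
y = 1.0 + 0.5 * d - 0.3 * g + 0.8 * d * g + rng.normal(size=n)


def ols(columns, outcome):
    """Return OLS coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(outcome))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef


# Fully interacted regression on the full sample:
# coefficients are ordered [intercept, d, g, d*g]
b_full = ols([d, g, d * g], y)

# Separate regressions of y on d within each group
b_g1 = ols([d[g == 1]], y[g == 1])
b_g0 = ols([d[g == 0]], y[g == 0])

# Slopes implied by the fully interacted regression
slope_g1_from_full = b_full[1] + b_full[3]
slope_g0_from_full = b_full[1]
```

Comparing `slope_g1_from_full` with `b_g1[1]`, and `slope_g0_from_full` with `b_g0[1]`, shows the match up to floating-point error. The sketch compares point estimates only and says nothing about standard errors.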

*[Your Answer Here]*

<hr style="height:2.4pt">

### 6

Using the regression in column 5, can you reject the null hypothesis that the coefficients on `week2` × `anyvehicle`, `week3` × `anyvehicle`, and `week4` × `anyvehicle` are jointly zero?
- a) Report your *F* statistic, its *p*-value, the 5% critical value for the test, and number of restrictions for this test.
- b) Does the conclusion you draw from the *F* test match the conclusion you would draw from *t*-tests of the significance of the coefficients individually? Explain.
- c) Give a qualitative interpretation of what the null hypothesis from the previous question means in words. This is a thinking question.

In [None]:
# Your Code Here

*[Your Answer Here]*

<hr style="height:2.4pt">

### 7
Suppose a policy maker is considering whether to fund a program that would make vehicles (zip cars) available to families receiving SNAP benefits. The policy maker would like to know if this program would increase spending on food. As we have discussed, policy questions like this are fundamentally “if-then” causal questions. Do you think that the coefficient on `anyvehicle` in column 1 measures the causal effect that the policy maker would want to know? If the coefficient does not measure the causal effect of interest, is the estimated coefficient too big or too small?

Hint: provide an example of an omitted variable, then use the omitted variable bias formula and use your knowledge of the world to infer the signs of the inputs to the formula.

*[Your Answer Here]*

<hr style="height:2.4pt">

### Sample Code

Adds another column to a dataframe storing the squares of the values in an original column
```python
# Note: not using simpler method np.square() due to
# integer overflow
snap["x_squared"] = snap.x.astype(int) ** 2
```
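
The same pattern extends to the product of two different columns, which is one way to build an interaction term. A minimal sketch on a toy dataframe; the column names `d` and `g` and their values are hypothetical and do not come from `snap.dta`:

```python
import pandas as pd

# Toy data: d and g are hypothetical 0/1 dummy columns
toy = pd.DataFrame({"d": [0, 1, 1, 0], "g": [1, 1, 0, 0]})

# The interaction equals 1 only in rows where both dummies equal 1
toy["d_x_g"] = toy["d"] * toy["g"]
```

Storing the product as its own column gives it an explicit name in the regression formula and in the output table.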

Adds another column to a dataframe storing the natural log of the values in an original column, replacing infinite values with NaNs.
```python
snap["log_x"] = np.log(snap.x).replace(-float("inf"), np.nan)
```
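
To see what the replacement does, the sketch below applies it to a toy column that contains a zero. The dataframe `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column containing a zero (hypothetical values)
toy = pd.DataFrame({"x": [0.0, 1.0, 10.0]})

# The log of 0 evaluates to -inf; errstate silences NumPy's warning
with np.errstate(divide="ignore"):
    raw_log = np.log(toy["x"])

# Replace -inf with NaN so the row is treated as missing
toy["log_x"] = raw_log.replace(-np.inf, np.nan)

# Count the rows that still have a usable log value
n_usable = toy["log_x"].notna().sum()
```

Rows whose log is NaN are treated as missing data, so a regression fitted with `missing="drop"` excludes them.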

Shows how to estimate an ordinary least squares regression with heteroskedasticity-robust standard errors. Notice we specify that we wish to drop rows with missing data.
```python
# sm is statsmodels.formula.api (see Suggested Imports)
mod = sm.ols(
    "yvar ~ xvar1 + xvar2 + xvar3",
    data=snap,
    missing="drop"
)
res = mod.fit(cov_type="HC2")
```

Create a new dataframe that is a subset of the rows of the original where the xvar is equal to 5.
```python
five_df = snap.loc[snap.xvar == 5]
```

These lines show how to make a regression table with three columns, corresponding to
- A regression of some variable `yvar1` on `xvar1`, `xvar2`, and `xvar3`
- A second regression of `yvar1` on just `xvar1` and `xvar2`
- A third regression with a different dependent variable, regressing `yvar2` on `xvar2` and `xvar3`

Note that we have to label the columns manually here. We also add an extra row at the bottom of the table and fill in its entries manually.
```python
# Estimate Regressions:
mod1 = sm.ols(
    "yvar1 ~ xvar1 + xvar2 + xvar3",
    data=snap
)
res1 = mod1.fit(cov_type="HC2")

mod2 = sm.ols(
    "yvar1 ~ xvar1 + xvar2",
    data=snap
)
res2 = mod2.fit(cov_type="HC2")

mod3 = sm.ols(
    "yvar2 ~ xvar2 + xvar3",
    data=snap
)
res3 = mod3.fit(cov_type="HC2")

# Create Table from the list of fitted results
table = Stargazer([res1, res2, res3])

# Label columns
# This list of 1s should be the same length as the
# number of columns
table.custom_columns(["yvar1", "yvar1", "yvar2"],
                     separators=[1, 1, 1])

# Add custom row at bottom of table:
table.add_line("Sample:", ["hello", "hi", "hey"])

# Display table
table
```



Shows how to access the coefficients and covariance matrix of a regression result
```python
# Access coefficient on xvar1
alpha_1 = res.params.xvar1
# Access covariance matrix of result
cov_matrix = res.cov_params()
# Access covariance between xvar1 and xvar2:
cov_x1_x2 = cov_matrix.xvar1.xvar2
```
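
These pieces are what a hand calculation of the standard error of a linear combination of coefficients needs. For weights $w$ and coefficient covariance matrix $V$, the variance of $w'\hat{\alpha}$ is $w'Vw$, which in the two-coefficient case expands to $w_1^2Var(\hat{\alpha}_1) + w_2^2Var(\hat{\alpha}_2) + 2w_1w_2Cov(\hat{\alpha}_1, \hat{\alpha}_2)$. A minimal sketch with made-up numbers; `beta`, `cov`, and `w` are hypothetical and are not estimates from `snap.dta`:

```python
import numpy as np

# Hypothetical coefficient estimates and covariance matrix
beta = np.array([0.50, -0.02])
cov = np.array([[0.0400, -0.0010],
                [-0.0010, 0.0001]])

# Weights defining the combination w[0]*beta[0] + w[1]*beta[1]
w = np.array([1.0, 2.0])

# Point estimate and variance of the linear combination
estimate = w @ beta
variance = w @ cov @ w
se = np.sqrt(variance)

# 95% confidence interval using the normal critical value 1.96
ci = (estimate - 1.96 * se, estimate + 1.96 * se)
```

With a fitted result, `w` would be ordered like `res.params` and `cov` would come from `res.cov_params()`.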


Shows how to run an F test, on a regression result, of the null hypothesis that the coefficients on xvar1, xvar2, and xvar3 are all 0
```python
from scipy import stats

# Run Test
test_results = res.wald_test([
    "xvar1 = 0",
    "xvar2 = 0",
    "xvar3 = 0",
], scalar=True, use_f=True)

# Collect F statistic
f_stat = test_results.fvalue

# Collect degrees of freedom (number of restrictions)
df = test_results.df_num

# Calculate true p-value
true_p = 1 - stats.chi2.cdf(f_stat * df, df)

# Calculate critical value:
# note we use 10^10 rather than infinity due to function
# constraints
INF = 10 ** 10
crit_val = stats.f.ppf(.95, df, INF)
```