# Data Cleaning Process for the Household Pulse Survey: Food Sufficiency

### Keeping relevant variables: Food Sufficiency


The use of food stamps at restaurants (April 28, 2020)

Source: https://www.washingtonpost.com/news/voraciously/wp/2020/04/28/democrats-want-to-let-millions-more-americans-use-their-food-stamps-at-restaurants/

A bill introduced in Congress on Tuesday would expand an underused part of the food stamps program to help feed millions of out-of-work Americans while assisting restaurants across the country that are struggling to survive a pandemic that has dramatically reduced their revenues or shut them down altogether.

Sen. Chris Murphy (D-Conn.) and Rep. Jimmy Panetta (D-Calif.) introduced legislation in their respective chambers to expand the Restaurant Meals Program, a little-known initiative that allows seniors, disabled and homeless people to use their Supplemental Nutrition Assistance Program (SNAP) benefits to purchase discounted restaurant meals because these folks often can’t cook for themselves or don’t have access to a kitchen. The RMP is voluntary for states, and to date, only about three use the program, including California, Arizona and Rhode Island. The low participation rate, in part, reflects the public health concern that fast-food chains would mostly sign up for the program, affecting the well-being of the people who receive benefits, Murphy said.

But as the unemployment rate swells, as grocery store shelves grow more depleted and as public transportation continues to pose a threat for low-income Americans who need it to visit supermarkets, states have been looking to SNAP for solutions. Governors in Louisiana and Texas have both said they would like to see the RMP expanded to feed their citizens and to bolster the hobbled hospitality industry. This month, several groups, including the National Council of Chain Restaurants and the National Restaurant Association, wrote to the U.S. Agriculture Department to recommend that Secretary Sonny Perdue open the RMP to more SNAP recipients and more restaurants.

### Food Variables


1. FOODSUFRSN1-FOODSUFRSN5: Why did you not have enough to eat?
    - Couldn't afford to buy more food
    - Couldn’t get out to buy food (for example, didn’t have transportation, or had mobility or health problems that prevented you from getting out)  
    - Afraid to go or didn’t want to go out to buy food
    - Couldn’t get groceries or meals delivered to me
    - The stores didn’t have the food I wanted
    
2. FREEFOOD


3. WHEREFREE1-WHEREFREE7: Where did you get free groceries or free meals? None of the options mentions the Restaurant Meals Program.


4. SNAP_YN: SNAP Receipt


5. PRIFOODSUF, CURFOODSUF, CHILDFOOD
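The `FOODSUFRSN1`-`FOODSUFRSN5` variables listed above are one column per reason, so a tally per reason is a natural first look. The following is a minimal sketch on made-up data. It assumes that a selected reason is coded `1` (with negative codes such as `-99` or `-88` for not selected or missing) and that the column order matches the order of the list above; both should be checked against the codebook. The helper name `count_reasons` is hypothetical.

```python
import pandas as pd

# Labels shortened from the list above; the 1-to-5 ordering is an assumption.
REASON_LABELS = {
    "FOODSUFRSN1": "Couldn't afford to buy more food",
    "FOODSUFRSN2": "Couldn't get out to buy food",
    "FOODSUFRSN3": "Afraid to go or didn't want to go out to buy food",
    "FOODSUFRSN4": "Couldn't get groceries or meals delivered",
    "FOODSUFRSN5": "The stores didn't have the food I wanted",
}

def count_reasons(df, selected_code=1):
    """Count respondents who selected each reason (assumed code: 1)."""
    counts = (df[list(REASON_LABELS)] == selected_code).sum()
    return counts.rename(index=REASON_LABELS)

# Made-up toy rows, not survey results.
toy = pd.DataFrame({
    "FOODSUFRSN1": [1, -99, 1],
    "FOODSUFRSN2": [-99, -99, 1],
    "FOODSUFRSN3": [-99, 1, -99],
    "FOODSUFRSN4": [-88, -88, -88],
    "FOODSUFRSN5": [-99, -99, -99],
})
reason_counts = count_reasons(toy)
print(reason_counts)
```

Because respondents can select several reasons, the counts do not need to add up to the number of rows.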

Ideas for EDA: explore the use of SNAP in the cities and food sufficiency to answer the following questions:
- How many people receive SNAP benefits?
- How many people don't get enough food, or the food they want, and why? People who can't get out to buy food, or who are afraid to, are potential targets for delivery. How much delivery capacity does a city need?
- Distribution of people by source of free groceries. How many people receive SNAP relative to the entire universe of people getting free meals from family, communities, churches, temples, food pantries, and school programs?
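As a starting point for the first question, the weighted share of SNAP recipients could be estimated roughly as follows. This is a minimal sketch on made-up data: it assumes `SNAP_YN` is coded `1` for yes and `2` for no, that any other code marks a missing or skipped answer, and that `PWEIGHT` is the person-level weight. These codes are assumptions to verify in the codebook, and the helper name `weighted_snap_share` is hypothetical.

```python
import pandas as pd

def weighted_snap_share(df, answer_col="SNAP_YN", weight_col="PWEIGHT"):
    """Weighted share of valid respondents who answered yes (assumed code 1)."""
    # Keep only valid answers; codes 1 = yes and 2 = no are an assumption.
    valid = df[df[answer_col].isin([1, 2])]
    if valid.empty:
        return float("nan")
    yes_weight = valid.loc[valid[answer_col] == 1, weight_col].sum()
    return yes_weight / valid[weight_col].sum()

# Made-up toy rows, not survey results.
toy = pd.DataFrame({
    "SNAP_YN": [1, 2, 2, -99, 1],
    "PWEIGHT": [100.0, 300.0, 100.0, 50.0, 100.0],
})
share = weighted_snap_share(toy)
print(f"{share:.3f}")
```

Dropping invalid answers before weighting keeps the denominator restricted to people who actually answered the question, which matters when a variable is absent from many survey weeks.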

In [1]:
import pandas as pd
import constants
import os
import numpy as np
import data_cleaning_methods

The food variables appear in all the datasets. **To explore the variables related to food affordability in the census data and apply cleaning steps to them, we read the first 1000 rows of each weekly survey file.**

In [2]:
rel_var1 = constants.ID_VAR + constants.WEEK_VAR + constants.DEMOGRAPHICS_VARS + constants.FOOD_VARS

In [3]:
base_name = 'pulse2020_puf_'
raw_dir = '../../data/raw/census/raw'
df_lists = []

# Read the first 1000 rows of every weekly public-use file in raw_dir.
for f in os.listdir(raw_dir):
    if base_name in f:
        data = pd.read_csv(os.path.join(raw_dir, f), index_col=False, nrows=1000)
        df_lists.append(data)

In [4]:
# Concatenate all weekly samples in one call instead of growing df in a loop.
df = pd.concat(df_lists, sort=False)

In [5]:
df.groupby('WEEK').count()

WEEK,SCRAM,EST_ST,EST_MSA,PWEIGHT,TBIRTH_YEAR,ABIRTH_YEAR,EGENDER,AGENDER,RHISPANIC,AHISPANIC,...,PSCHNG7,PSWHYCHG1,PSWHYCHG2,PSWHYCHG3,PSWHYCHG4,PSWHYCHG5,PSWHYCHG6,PSWHYCHG7,PSWHYCHG8,PSWHYCHG9
1,1000,1000,225,1000,1000,1000,1000,1000,1000,1000,...,0,0,0,0,0,0,0,0,0,0
2,1000,1000,304,1000,1000,1000,1000,1000,1000,1000,...,0,0,0,0,0,0,0,0,0,0
3,1000,1000,298,1000,1000,1000,1000,1000,1000,1000,...,0,0,0,0,0,0,0,0,0,0
4,1000,1000,301,1000,1000,1000,1000,1000,1000,1000,...,0,0,0,0,0,0,0,0,0,0
5,1000,1000,318,1000,1000,1000,1000,1000,1000,1000,...,0,0,0,0,0,0,0,0,0,0
6,1000,1000,290,1000,1000,1000,1000,1000,1000,1000,...,0,0,0,0,0,0,0,0,0,0
7,1000,1000,306,1000,1000,1000,1000,1000,1000,1000,...,0,0,0,0,0,0,0,0,0,0
8,1000,1000,299,1000,1000,1000,1000,1000,1000,1000,...,0,0,0,0,0,0,0,0,0,0
9,1000,1000,308,1000,1000,1000,1000,1000,1000,1000,...,0,0,0,0,0,0,0,0,0,0
10,1000,1000,317,1000,1000,1000,1000,1000,1000,1000,...,0,0,0,0,0,0,0,0,0,0


In [6]:
df_food = df.loc[:, rel_var1]

In [9]:
data_cleaning_methods.percent_missing(df_food)

SCRAM             0.000000
WEEK              0.000000
TBIRTH_YEAR       0.000000
EGENDER           0.000000
MS                0.000000
EST_ST            0.000000
REGION           70.588235
EST_MSA          69.888235
RHISPANIC         0.000000
RRACE             0.000000
EEDUC             0.000000
THHLD_NUMPER      0.000000
THHLD_NUMKID      0.000000
THHLD_NUMADLT     0.000000
INCOME            0.000000
FOODCONF          0.000000
TSPNDPRPD         0.000000
TSPNDFOOD         0.000000
SNAPMNTH1        70.588235
SNAPMNTH2        70.588235
SNAPMNTH3        70.588235
SNAPMNTH4        70.588235
SNAPMNTH5        70.588235
SNAPMNTH6        70.588235
SNAPMNTH7        70.588235
SNAPMNTH8        70.588235
SNAPMNTH9        70.588235
SNAPMNTH10       70.588235
SNAPMNTH11       70.588235
SNAPMNTH12       70.588235
SNAP_YN          70.588235
WHEREFREE1        0.000000
WHEREFREE2        0.000000
WHEREFREE3        0.000000
WHEREFREE4        0.000000
WHEREFREE5        0.000000
WHEREFREE6        0.000000
W

The `SNAPMNTHX` variables are about 70% missing. They record the months of the year in which respondents received SNAP benefits and are only present from Phase 2 of the survey (week 13 onward), which explains the high share of missing values. We skip these variables because the SNAP information can be extracted from another variable (`SPNDSRC8`, which is also about 70% missing, but is a single variable to handle).
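`data_cleaning_methods.percent_missing` is a project helper whose source is not shown in this notebook. A minimal sketch of an equivalent computation, assuming it returns the percentage of null values per column, might look like the following; `percent_missing_sketch` is a hypothetical name and the data is made up. In the toy frame, a column that is null in 12 of 17 rows gives the same 70.588235 figure reported above, which is what one would expect if 12 of 17 equally sized weekly samples lacked the column. That count of files is an inference from the percentage, not something the notebook states.

```python
import numpy as np
import pandas as pd

def percent_missing_sketch(df):
    """Percentage of null values in each column."""
    return df.isna().mean() * 100

# Made-up toy frame: 17 rows, where 'SNAP_YN' is null in the first 12.
toy = pd.DataFrame({
    "WEEK": range(1, 18),
    "SNAP_YN": [np.nan] * 12 + [2.0] * 5,
})
missing = percent_missing_sketch(toy)
print(missing.round(6))
```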

In [11]:
# Drop SNAPMNTH1..SNAPMNTH12; reassigning avoids an in-place edit on a slice of df.
snap_month_cols = [f'SNAPMNTH{i}' for i in range(1, 13)]
df_food = df_food.drop(columns=snap_month_cols)

In [13]:
df_food.groupby('REGION').count()

REGION,SCRAM,WEEK,TBIRTH_YEAR,EGENDER,MS,EST_ST,EST_MSA,RHISPANIC,RRACE,EEDUC,...,FREEFOOD,FOODSUFRSN1,FOODSUFRSN2,FOODSUFRSN3,FOODSUFRSN4,FOODSUFRSN5,CHILDFOOD,PRIFOODSUF,CURFOODSUF,SPNDSRC8
1.0,737,737,737,737,737,737,337,737,737,737,...,737,737,737,737,737,737,737,737,737,737
2.0,1598,1598,1598,1598,1598,1598,546,1598,1598,1598,...,1598,1598,1598,1598,1598,1598,1598,1598,1598,1598
3.0,972,972,972,972,972,972,149,972,972,972,...,972,972,972,972,972,972,972,972,972,972
4.0,1693,1693,1693,1693,1693,1693,507,1693,1693,1693,...,1693,1693,1693,1693,1693,1693,1693,1693,1693,1693


- The `REGION` variable is likewise only present from Phase 2 of the survey onward. During EDA, we are going to explore how crucial it is to keep the data going back to week 1, since many variables do not exist in those early weeks.

- Finally, `EST_MSA` is missing when the respondent doesn't live in a metropolitan area. We keep this variable in the analysis to split between rural and urban contexts.
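Both observations can be checked directly on the data. The sketch below, on made-up rows, lists the weeks in which `REGION` is never populated and derives a boolean flag from whether `EST_MSA` is present. The column name `IN_MSA` is hypothetical, and treating a missing `EST_MSA` as non-metropolitan follows the interpretation above rather than the codebook.

```python
import numpy as np
import pandas as pd

# Made-up toy rows, not survey results.
toy = pd.DataFrame({
    "WEEK":    [1, 1, 12, 13, 13, 14],
    "REGION":  [np.nan, np.nan, np.nan, 2.0, 4.0, 1.0],
    "EST_MSA": [np.nan, 35620.0, np.nan, np.nan, 31080.0, np.nan],
})

# Weeks where REGION has no non-null value at all.
region_present = toy.groupby("WEEK")["REGION"].count() > 0
weeks_without_region = region_present[~region_present].index.tolist()

# Hypothetical flag: True when the row carries a metropolitan-area code.
toy["IN_MSA"] = toy["EST_MSA"].notna()

print(weeks_without_region)
print(toy["IN_MSA"].value_counts())
```

Keeping the flag as a separate column leaves the original `EST_MSA` codes available for city-level breakdowns later.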

Finally, we save the sample data for the Exploratory Data Analysis of the food context:

In [15]:
df_food.to_csv('../../data/interim/census/food_affordability.csv', index = False)