# 1. Data Cleaning Process: Household Pulse Survey

In [1]:
import pandas as pd
import constants
import os
import numpy as np

In [2]:
base_name = 'pulse2020_puf_'
raw_dir = '../../data/household/raw'
df_lists = []

# Load every weekly PUF file whose name contains the base name
for f in os.listdir(raw_dir):
    if base_name in f:
        data = pd.read_csv(os.path.join(raw_dir, f), index_col=False)
        df_lists.append(data)

In [3]:
df_lists[0].head()

Unnamed: 0,SCRAM,WEEK,EST_ST,EST_MSA,PWEIGHT,TBIRTH_YEAR,ABIRTH_YEAR,EGENDER,AGENDER,RHISPANIC,...,COMP1,COMP2,COMP3,INTRNTAVAIL,INTRNT1,INTRNT2,INTRNT3,TSCHLHRS,TTCH_HRS,INCOME
0,V030000001S52011391390122,5,1,,1239.935394,1973,2,1,2,1,...,-88,-88,-88,-88,-88,-88,-88,-88.0,-88.0,3
1,V030000002S02020543300112,5,2,,196.842234,1973,2,2,2,1,...,-99,1,-99,1,-99,1,-99,0.0,0.0,8
2,V030000002S02020880630122,5,2,,295.425365,1951,2,1,2,2,...,-88,-88,-88,-88,-88,-88,-88,-88.0,-88.0,-88
3,V030000002S02020999610122,5,2,,1088.296594,1983,2,1,2,1,...,-99,1,-99,1,-99,1,-99,0.0,0.0,6
4,V030000005S58050092940112,5,5,,20476.738688,1960,2,1,2,1,...,-88,-88,-88,-88,-88,-88,-88,-88.0,-88.0,2


### 1. Studying the dataset

Every new release of the survey's Public Use Data File (PUF) includes a replicate weight data file and a data dictionary. The shape of the dataset has changed over time, depending on the number of people surveyed and the addition of new variables.

#### Features
- 17 weeks, from April 23 to October 26
- Roughly 42k to 133k respondents per weekly file (see the shapes printed below)
- Between 82 and 188 variables
- Demographic variables
- Index variables: SCRAM (ID) and WEEK
- Spending variables
- Food variables
- Shopping variables
- Telework variables
- Trips variables
- Health variables
- Work variables
- Missing data coded as -88 and -99
- Mostly categorical data
- The data dictionary is required to interpret column names and categories
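
Because -88 and -99 are sentinel codes rather than real values, leaving them in place would distort any count or average. A minimal sketch of recoding them as `NaN`, using a small made-up frame (the values below are illustrative and not taken from the survey):

```python
import numpy as np
import pandas as pd

# Sentinel codes used for missing data in the survey files
MISSING_CODES = [-88, -99]

# Toy frame for illustration only; values are made up
toy = pd.DataFrame({
    "WEEK": [5, 5, 5],
    "INCOME": [3, -88, 8],
    "TSCHLHRS": [-88.0, 0.0, -99.0],
})

# Replace both sentinel codes with NaN so pandas treats them as missing
cleaned = toy.replace(MISSING_CODES, np.nan)

# Count missing values per column
print(cleaned.isna().sum())
```

Whether the two codes should be kept distinct depends on how the data dictionary defines them, so collapsing both into `NaN` is only one possible choice.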

Some interesting variables related to the spending of the stimulus payment appear in weeks 7 to 12 of Phase 1 of the survey. Questions on difficulty with expenditures and changes in shopping patterns were added in Phase 2, from week 13 onwards. Although the spending and shopping questions were not asked of the population at the same time, they offer valuable insights for our study because the survey tries to be representative across demographic variables. The two groups of variables are pre-processed separately.

In [4]:
for df in df_lists:
    print(df.shape)

(105066, 82)
(91605, 105)
(90767, 105)
(101215, 82)
(86792, 105)
(83302, 84)
(73472, 105)
(109051, 188)
(88716, 188)
(132961, 82)
(41996, 82)
(95604, 188)
(110019, 188)
(99302, 188)
(74413, 82)
(98663, 105)
(108062, 105)
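
Note that `os.listdir` returns file names in arbitrary order, so the shapes above are not necessarily listed in week order. A minimal sketch of ordering a list of frames by their `WEEK` value, assuming each file holds a single week (the toy frames below stand in for `df_lists`):

```python
import pandas as pd

# Toy stand-ins for the weekly files; each holds a single WEEK value
frames = [
    pd.DataFrame({"WEEK": [5, 5], "SCRAM": ["a", "b"]}),
    pd.DataFrame({"WEEK": [1], "SCRAM": ["c"]}),
    pd.DataFrame({"WEEK": [13, 13, 13], "SCRAM": ["d", "e", "f"]}),
]

# Sort the frames by the week they contain
frames_sorted = sorted(frames, key=lambda df: df["WEEK"].iloc[0])

for df in frames_sorted:
    print(df["WEEK"].iloc[0], df.shape)
```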


In [5]:
df_lists[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105066 entries, 0 to 105065
Data columns (total 82 columns):
SCRAM            105066 non-null object
WEEK             105066 non-null int64
EST_ST           105066 non-null int64
EST_MSA          31955 non-null float64
PWEIGHT          105066 non-null float64
TBIRTH_YEAR      105066 non-null int64
ABIRTH_YEAR      105066 non-null int64
EGENDER          105066 non-null int64
AGENDER          105066 non-null int64
RHISPANIC        105066 non-null int64
AHISPANIC        105066 non-null int64
RRACE            105066 non-null int64
ARACE            105066 non-null int64
EEDUC            105066 non-null int64
AEDUC            105066 non-null int64
MS               105066 non-null int64
THHLD_NUMPER     105066 non-null int64
AHHLD_NUMPER     105066 non-null int64
THHLD_NUMKID     105066 non-null int64
AHHLD_NUMKID     105066 non-null int64
THHLD_NUMADLT    105066 non-null int64
WRKLOSS          105066 non-null int64
EXPCTLOSS        105066 non

### 2. Keeping relevant variables: Spending and Shopping

In [6]:
rel_var = constants.ID_VAR + constants.WEEK_VAR + constants.DEMOGRAPHICS_VARS + constants.EIP_VARS + constants.SHOPPING_VARS + constants.TRIPS_VAR

In [7]:
len(rel_var)

58

In [8]:
df_lists1 = []

# Keep only the relevant columns present in each weekly file
for df in df_lists:
    keep_cols = [col for col in df.columns if col in rel_var]
    df_lists1.append(df[keep_cols].copy())

### Spending variables

1. EIP: Use of Economic Impact Payment (Stimulus)
    - Pay for expenses
    - Pay off debt
    - Add to savings
    - NA
2. EIPSPND: Spending use of Economic Impact Payment (Stimulus)
    - Food (groceries, eating out, take out)
    - Clothing, household supplies, household items, recreational, rent, mortgage, vehicle, saving or investments, charitable, credit card, loans, others.
    - Spending categories are not mutually exclusive.
    
For EDA, compute the percent change of:
- EIPSPND split by food and others (weekly and grouping by demographics)
- EIP split by food and others (weekly and grouping by demographics)
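
One way such a weekly percent change could be computed is sketched below. It assumes that `EIPSPND1` is the food category and that a value of 1 means the category was selected; both should be checked against the data dictionary. The data are made up and the shares are unweighted.

```python
import pandas as pd

# Toy data for illustration; assumes EIPSPND1 == 1 means "spent on food"
toy = pd.DataFrame({
    "WEEK": [7, 7, 7, 7, 8, 8, 8, 8],
    "EIPSPND1": [1, 1, -99, -99, 1, 1, 1, -99],
})

# Share of respondents per week who selected the food category
food_share = (toy["EIPSPND1"] == 1).groupby(toy["WEEK"]).mean()

# Week-over-week percent change of that share
food_change = food_share.pct_change() * 100

print(food_share)
print(food_change)
```

The same pattern extends to demographic splits by grouping on `WEEK` together with a column such as `EEDUC` or `RRACE`.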

In [9]:
# Stack all weekly files that contain the EIP question
df_spending = pd.concat([df for df in df_lists1 if 'EIP' in df.columns])

In [10]:
df_spending.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 549361 entries, 0 to 108061
Data columns (total 28 columns):
SCRAM            549361 non-null object
WEEK             549361 non-null int64
EST_ST           549361 non-null int64
EST_MSA          167813 non-null float64
TBIRTH_YEAR      549361 non-null int64
EGENDER          549361 non-null int64
RHISPANIC        549361 non-null int64
RRACE            549361 non-null int64
EEDUC            549361 non-null int64
MS               549361 non-null int64
THHLD_NUMPER     549361 non-null int64
THHLD_NUMKID     549361 non-null int64
THHLD_NUMADLT    549361 non-null int64
EIP              549361 non-null int64
EIPSPND1         549361 non-null int64
EIPSPND2         549361 non-null int64
EIPSPND3         549361 non-null int64
EIPSPND4         549361 non-null int64
EIPSPND5         549361 non-null int64
EIPSPND6         549361 non-null int64
EIPSPND7         549361 non-null int64
EIPSPND8         549361 non-null int64
EIPSPND9         549361 non-

Save in interim folder:

In [11]:
#df_spending.to_csv('../../data/household/interim/household_spending.csv', index = False)

### Shopping variables: Expenditures and changes in shopping behaviors

1. EXPNS_DIF: Difficulty with expenses
    - Not at all difficult
    - A little difficult
    - Somewhat difficult
    - Very difficult
    
    
2. CHNGHOW1-CHNGHOW12: Spending and shopping change
    - Purchases modality
        - More purchases online (as opposed to in store)
        - More purchases by curbside pick-up (as opposed to in store)
        - More purchases in-store (as opposed to purchases online or curbside pick-up)
    - Cash/credit card
        - Increased use of credit cards or smartphone apps for purchases instead of cash
        - Increased use of cash     
    - Restaurants
        - Avoided eating at restaurants
        - Resumed eating at restaurants
        
        
3. WHYCHNGD1-WHYCHNGD13: Why did spending change?


4. SPNDSRC8: Use of SNAP (Supplemental Nutrition Assistance Program) as a source of income for spending needs


5. FEWRTRIPS, FEWRTRANS: Fewer trips to stores and fewer transit trips

In [12]:
num=0

for df in df_lists1:
    if 'WHYCHNGD1' in df.columns:
        num+=1
        df.to_csv('../../data/household/interim/household_shopping'+str(num)+'.csv', index = False)

### 3. Keeping relevant variables: Food Sufficiency


The use of food stamps at restaurants (April 28, 2020)

Source: https://www.washingtonpost.com/news/voraciously/wp/2020/04/28/democrats-want-to-let-millions-more-americans-use-their-food-stamps-at-restaurants/

A bill introduced in Congress on Tuesday would expand an underused part of the food stamps program to help feed millions of out-of-work Americans while assisting restaurants across the country that are struggling to survive a pandemic that has dramatically reduced their revenues or shut them down altogether.

Sen. Chris Murphy (D-Conn.) and Rep. Jimmy Panetta (D-Calif.) introduced legislation in their respective chambers to expand the Restaurant Meals Program, a little-known initiative that allows seniors, disabled and homeless people to use their Supplemental Nutrition Assistance Program (SNAP) benefits to purchase discounted restaurant meals because these folks often can’t cook for themselves or don’t have access to a kitchen. The RMP is voluntary for states, and to date, only about three use the program, including California, Arizona and Rhode Island. The low participation rate, in part, reflects the public health concern that fast-food chains would mostly sign up for the program, affecting the well-being of the people who receive benefits, Murphy said.

But as the unemployment rate swells, as grocery store shelves grow more depleted and as public transportation continues to pose a threat for low-income Americans who need it to visit supermarkets, states have been looking to SNAP for solutions. Governors in Louisiana and Texas have both said they would like to see the RMP expanded to feed their citizens and to bolster the hobbled hospitality industry. This month, several groups, including the National Council of Chain Restaurants and the National Restaurant Association, wrote to the U.S. Agriculture Department to recommend that Secretary Sonny Perdue open the RMP to more SNAP recipients and more restaurants.

### Food Variables


1. FOODSUFRSN1-FOODSUFRSN5: Why did you not have enough to eat?
    - Couldn't afford to buy more food
    - Couldn’t get out to buy food (for example, didn’t have transportation, or had mobility or health problems that prevented you from getting out)  
    - Afraid to go or didn’t want to go out to buy food
    - Couldn’t get groceries or meals delivered to me
    - The stores didn’t have the food I wanted
    
2. FREEFOOD


3. WHEREFREE1-WHEREFREE7: Where did you get free groceries or free meals? There is no mention of the Restaurant Meals Program.


4. SNAP_YN: SNAP Receipt


5. PRIFOODSUF, CURFOODSUF, CHILDFOOD

Ideas for EDA: explore SNAP use and food sufficiency in the cities to answer the following questions:
- How many people receive SNAP benefits?
- How many people don't get enough food, or the food they want, and why? People who can't get out to buy food or are afraid to go out are targets for delivery. How much delivery service does the city need?
- Distribution of people by source of free groceries. How many people receive SNAP relative to the entire universe of people getting free meals from family, communities, churches, temples, food pantries, and school programs?
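
As a starting point for the first question, the sketch below computes the share of respondents receiving SNAP per state. It assumes `SNAP_YN` is coded 1 = yes and 2 = no, with -88/-99 as missing; the data are made up and the shares are unweighted.

```python
import pandas as pd

# Toy data; assumes SNAP_YN is coded 1 = yes, 2 = no, -88/-99 = missing
toy = pd.DataFrame({
    "EST_ST": [1, 1, 1, 2, 2, 2],
    "SNAP_YN": [1, 2, -99, 1, 1, 2],
})

# Keep only valid answers before computing shares
valid = toy[toy["SNAP_YN"].isin([1, 2])]

# Unweighted share of valid respondents receiving SNAP, per state
snap_share = (valid["SNAP_YN"] == 1).groupby(valid["EST_ST"]).mean()
print(snap_share)
```

Grouping on `EST_MSA` instead of `EST_ST` would give the city-level view, though that column is empty for most respondents.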

In [15]:
rel_var1 = constants.ID_VAR + constants.WEEK_VAR + constants.DEMOGRAPHICS_VARS + constants.FOOD_VARS

The food variables are present in all the datasets; they will be combined and merged for EDA purposes as needed.
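
A minimal sketch of how that combination might look, selecting the same columns from every weekly frame and stacking them once. The column list here is a hypothetical subset (the real one comes from `constants`), and the frames are toy data.

```python
import pandas as pd

# Hypothetical subset of columns; the real list is built from constants
food_cols = ["SCRAM", "WEEK", "CURFOODSUF", "FREEFOOD"]

# Toy weekly frames with differing extra columns
frames = [
    pd.DataFrame({"SCRAM": ["a"], "WEEK": [1], "CURFOODSUF": [1],
                  "FREEFOOD": [2], "EIP": [-88]}),
    pd.DataFrame({"SCRAM": ["b"], "WEEK": [13], "CURFOODSUF": [2],
                  "FREEFOOD": [1], "EXPNS_DIF": [3]}),
]

# Select the shared food columns from every frame, then stack once
df_food = pd.concat([df[food_cols] for df in frames], ignore_index=True)
print(df_food.shape)
```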