<h1><center> Basic Monthly CPS Data </center></h1>

data dictionary [here](https://www2.census.gov/programs-surveys/cps/datasets/2022/basic/2022_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt)

data instruction [here](https://www2.census.gov/programs-surveys/cps/methodology/PublicUseDocumentation_final.pdf)

**Part 1: Select Observations and Variables**

The main purpose of the following code is to select the samples and variables that might be used later. The data cleaning process is described in the following steps:

1. Import all datasets in the raw data folder and concatenate them into one dataframe
2. Drop observations/rows based on degree of survey completeness and the definition of the civilian noninstitutional population
3. Select variables of interest
    - link variables: keep track of individual in different months
    - geographic and demographic variables
    - labor force variables: check edited universe
    - weight variables: used for estimation and inference later
4. Rename variables, provide dictionary of variable labels and values

**Part 2: Clean Variables** 

After selecting the sample and variables, we can investigate the values of each variable. This file deals with missing values, categorical variables, and numeric variables. The data cleaning process is described in the following steps:

1. Geographic and demographic variables
    - Is there NaN?
    - How to change values of categorical variables?
    - Convert numerical values to categorical values? (e.g. age --> age group)
2. Labor Force variables
    - Is there NaN?
    - How to work with negative values? (check edited universe in data instruction, also in variable label in clean1)
3. We have already checked the linking variables, and we should not modify the weights.
4. Export the whole year dataframe to csv file (replace the one in clean1_variables.ipynb), export each month dataframe to csv file (replace the one in clean1_variables.ipynb)

This file can be used to automate the data cleaning for all 12 months in a single year.
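
As a rough illustration of the Part 2 questions above, the sketch below bins age into groups and treats a negative code as missing. The toy data, the bin edges, and the `AgeGroup` and `UsualHoursClean` column names are assumptions made for this example, and -1 is assumed to mark records outside a variable's edited universe.

```python
import numpy as np
import pandas as pd

# Toy data; column names follow the renamed variables used later in this notebook
toy = pd.DataFrame({
    'Age': [16, 24, 25, 54, 55, 80, 85],
    'UsualHours': [40, -1, 35, -4, 20, -1, -1],
})

# Hypothetical age bins (an assumption, not the notebook's final choice)
age_bins = [16, 25, 55, np.inf]
age_labels = ['16-24', '25-54', '55+']
# right=False makes each bin closed on the left: [16, 25), [25, 55), [55, inf)
toy['AgeGroup'] = pd.cut(toy['Age'], bins=age_bins, labels=age_labels, right=False)

# Assumption: -1 means "not in universe", so it is turned into NaN.
# -4 ("hours vary") is left in place so it can be handled separately.
toy['UsualHoursClean'] = toy['UsualHours'].mask(toy['UsualHours'] == -1)

print(toy)
```

Keeping the cleaned values in a new column leaves the original codes available for checking against the data dictionary.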

In [168]:
import os
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# current working directory should be `./code`, check it:
print(os.getcwd())

GLOBAL_PATH = '../'
RAW_DATA_PATH = 'raw_data/'
PROCESSED_DATA_PATH = 'processed_data/'

# specify the year of CPS data
CPS_YEAR = 'cps2020'

/Users/niksun/Desktop/data-repository-partial-code/code


## **Part 1: Select Observations and Variables**

### **1.1 Read All Files into A Single Dataframe**

In [169]:
path, dirs, raw_files = next(os.walk(GLOBAL_PATH + RAW_DATA_PATH + CPS_YEAR))
# macOS may add a hidden .DS_Store file to the folder; skip it
if '.DS_Store' in raw_files:
    raw_files.remove('.DS_Store')
raw_file_count = len(raw_files)
# dataframe_list is a list of dataframes, not list of file names
dataframe_list = [pd.read_csv(os.path.join(path, file)) for file in raw_files]

for i in range(raw_file_count):
    print(raw_files[i], ':', dataframe_list[i].shape)

feb20pub.csv : (139248, 388)
jun20pub.csv : (123364, 388)
oct20pub.csv : (135242, 388)
aug20pub.csv : (126448, 388)
dec20pub.csv : (132036, 388)
jan20pub.csv : (138697, 388)
nov20pub.csv : (134122, 388)
may20pub.csv : (126557, 388)
sep20pub.csv : (133448, 388)
apr20pub.csv : (129382, 388)
mar20pub.csv : (131578, 388)
jul20pub.csv : (124102, 388)
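
The file order above follows whatever `os.walk` returns, so it can differ between machines. One possible alternative, sketched here on toy files written to a temporary folder, is to list only `*.csv` files and sort them. The folder contents and column values below are made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Write two toy monthly files plus a stray non-CSV file
    pd.DataFrame({'HRMONTH': [1, 1], 'prtage': [30, 41]}).to_csv(folder / 'jan20pub.csv', index=False)
    pd.DataFrame({'HRMONTH': [2], 'prtage': [55]}).to_csv(folder / 'feb20pub.csv', index=False)
    (folder / '.DS_Store').write_text('not a csv')

    # glob('*.csv') skips .DS_Store and other non-CSV files; sorting makes the order reproducible
    csv_files = sorted(folder.glob('*.csv'))
    frames = [pd.read_csv(f) for f in csv_files]
    combined = pd.concat(frames, axis=0, ignore_index=True)

print([f.name for f in csv_files], combined.shape)
```

Sorting is alphabetical by file name, not chronological, so the `HRMONTH` column remains the reliable way to identify each month.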


In [170]:
# concatenate 12 months of data into one dataframe
raw_df = pd.concat(dataframe_list, axis=0, ignore_index=True)
print('Shape of concatenated dataset:', raw_df.shape)
raw_df.head(10)

Shape of concatenated dataset: (1574224, 388)


Unnamed: 0,hrhhid2,HUFINAL,OCCURNUM,HUINTTYP,HURESPLI,HUPRSCNT,HUTYPEA,HUTYPB,HUTYPC,HUBUS,HUBUSL1,HUBUSL2,HUBUSL3,HUBUSL4,HRMIS,HRMONTH,HRYEAR4,HRLONGLK,qstnum,gereg,gestfips,gediv,hehousut,hxhousut,hephoneo,hxphoneo,hetelavl,hxtelavl,hetelhhd,hxtelhhd,hrhtype,hrintsta,hrnumhou,hefaminc,hxfaminc,hwhhwgt,hwhhwtln,PULINENO,PUCHINHH,PUWK,PUBUS1,PUDIS,PULAY,PUHROFF1,PUHROFF2,PUHROT1,PUHROT2,PUABSOT,PUBUSCK1,PUBUSCK2,PUBUSCK3,PUBUSCK4,PURETOT,PUHRCK1,PUHRCK2,PUHRCK3,PUHRCK4,PUHRCK5,PUHRCK6,PUHRCK7,PUHRCK12,PULAYDT,PULAY6M,PULAYAVR,PULK,PULKAVR,PULAYCK1,PULAYCK2,PULAYCK3,PUDWCK1,PUDWCK2,PUDWCK3,PUDWCK4,PUDWCK5,PUJHCK1,PUJHCK2,PUJHDP1O,PUJHCK3,PUJHCK4,PUJHCK5,PULKM2,PULKM3,PULKM4,PULKM5,PULKM6,PULKDK1,PULKDK2,PULKDK3,PULKDK4,PULKDK5,PULKDK6,PULKPS1,PULKPS2,PULKPS3,PULKPS4,PULKPS5,PULKPS6,PUIOCK1,PUIOCK2,PUIOCK3,PUIODP1,PUIODP2,PUIODP3,PUIO1MFG,PUIO2MFG,ptern2,pternh1c,PUNLFCK1,PUNLFCK2,PUSLFPRX,PUDIS1,PUDIS2,PUBUS2OT,perrp,pxrrp,pxage,peafnow,pxafnow,pesex,pxsex,pemaritl,pxmaritl,pxrace1,pehspnon,pxhspnon,peeduca,pxeduca,peafever,pxafever,peafwhn1,pxafwhn1,peafwhn2,peafwhn3,peafwhn4,pespouse,pxspouse,penatvty,pxnatvty,pemntvty,pxmntvty,pefntvty,pxfntvty,pxinusyr,pedipged,pxdipged,pehgcomp,pxhgcomp,pecyc,pxcyc,pepar1,pxpar1,pepar2,pxpar2,pepar1typ,pxpar1typ,pepar2typ,pxpar2typ,prdasian,prmarsta,ptdtrace,prdthsp,prpertyp,prfamnum,prfamtyp,prfamrel,prnmchld,prchld,prcitflg,prcitshp,prinuyer,prtage,prtfage,pecohab,pxcohab,peabspdo,peabsrsn,pedwavl,pedwavr,pedwlko,pedwlkwk,pedwrsn,pedwwk,pedwwnto,pedw4wk,pehractt,pehract1,pehract2,pehravl,pehrftpt,pehrrsn1,pehrrsn2,pehrrsn3,pehruslt,pehrusl1,pehrusl2,pehrwant,pejhrsn,pejhwant,pejhwko,pelayavl,pelaydur,pelayfto,pelaylk,pelkavl,pelkdur,pelkfto,pelkll1o,pelkll2o,pelklwo,pelkm1,pemjnum,pemjot,pemlr,penlfact,penlfjh,penlfret,peret1,pxabspdo,pxabsrsn,pxdwavl,pxdwavr,pxdwlko,pxdwlkwk,pxdwrsn,pxdwwk,pxdwwnto,pxdw4wk,pxhractt,pxhract1,pxhract2,pxhravl,pxhrftpt,pxhrrsn1,pxhrrsn2,pxhrrsn3,pxhruslt,pxhrusl1,pxhrusl2,pxhrwant,pxjhrsn,pxjhw
ant,pxjhwko,pxlayavl,pxlaydur,pxlayfto,pxlaylk,pxlkavl,pxlkdur,pxlkfto,pxlkll1o,pxlkll2o,pxlklwo,pxlkm1,pxmjnum,pxmjot,pxmlr,pxnlfact,pxnlfjh,pxnlfret,pxret1,prabsrea,prcivlf,prdisc,premphrs,prempnot,prexplf,prftlf,prhrusl,prjobsea,prpthrs,prptrea,prunedur,pruntype,prwksch,prwkstat,prwntjob,peio1icd,peio2icd,ptio1ocd,ptio2ocd,peio1cow,peio2cow,pepdemp1,pepdemp2,pxio1icd,pxio2icd,pxio1ocd,pxio2ocd,pxio1cow,pxio2cow,pxpdemp1,pxpdemp2,pxnmemp1,pxnmemp2,prioelg,premp,prcow1,prcow2,prnagws,prnagpws,prdtcow1,prdtcow2,prmjind1,prmjind2,primind1,primind2,prmjocc1,prmjocc2,prdtind1,prdtind2,prdtocc1,prdtocc2,pragna,prsjmj,prcowpg,prmjocgr,peernper,peernhry,peernuot,peernwkp,peernrt,pternh2,pternh1o,peernhro,ptern,peernlab,peerncov,pxernper,pxernhry,pxernuot,pxernwkp,pxernrt,pxernh2,pxernh1o,pxernhro,pxern,pxernlab,pxerncov,prerelg,pternwa,pternhly,prwernal,prhernal,peschlvl,peschenr,peschft,pxschlvl,pxschenr,pxschft,prnlfsch,pedisear,pediseye,pedisrem,pedisphy,pedisdrs,pedisout,pxdisear,pxdiseye,pxdisrem,pxdisphy,pxdisdrs,pxdisout,prdisflg,pecert1,pecert2,pecert3,pxcert1,pxcert2,pxcert3,pwsswgt,pwlgwgt,pwvetwgt,pworwgt,pwfmwgt,pwcmpwgt,pthr,ptwk,ptot,ptnmemp1,ptnmemp2,hrhhid,gtcbsa,gtco,gtcbsast,gtcbsasz,gtcsa,gtmetsta,gtindvpc
0,11011,201,1,2,1,0,-1,-1,-1,2,-1,-1,-1,-1,3,2,2020,2,1,3,1,6,1,0,1,0,-1,1,1,0,6,1,1,13,0,18525008.0,1.0,1.0,9.0,1.0,-1.0,-1.0,-1.0,2.0,-1.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,5.0,5.0,-1.0,2.0,3.0,5.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,4.0,3.0,3.0,1.0,2.0,1.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,41.0,0.0,0.0,2.0,0.0,1.0,0.0,6.0,0.0,0.0,2.0,0.0,43.0,0.0,2.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,57.0,-1.0,57.0,-1.0,57.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,-1,1.0,-1,1.0,-1,1.0,-1,1.0,-1.0,7.0,1.0,-1.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,40.0,0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,40.0,40.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,40.0,40.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,18.0,1.0,1.0,1.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,2.0,-1.0,6180.0,-1.0,1360.0,-1.0,2.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,1.0,2.0,-1.0,1.0,-1.0,8.0,-1.0,6.0,-1.0,8.0,-1.0,2.0,-1.0,23.0,-1.0,4.0,-1.0,2.0,1.0,2.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,-1.0,1.0,0.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,2.0,20.0,20.0,20.0,18525008.0,26774269.0,18566761.0,0.0,18525008.0,18786224.0,0,0,0,-1.0,-1.0,110024067491,33860,0,2,3,0,1,0
1,9012,201,1,2,1,0,-1,-1,-1,2,-1,-1,-1,-1,8,2,2020,2,2,3,1,6,5,0,1,0,-1,1,1,0,6,1,1,10,0,27647640.0,1.0,1.0,9.0,1.0,-1.0,-1.0,-1.0,2.0,-1.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,5.0,5.0,-1.0,2.0,3.0,5.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,4.0,3.0,3.0,1.0,2.0,1.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,41.0,0.0,0.0,2.0,0.0,1.0,0.0,2.0,0.0,0.0,1.0,0.0,39.0,0.0,2.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,57.0,-1.0,303.0,-1.0,303.0,42.0,-1.0,1.0,0.0,-1.0,1.0,-1.0,1.0,-1,1.0,-1,1.0,-1,1.0,-1,1.0,-1.0,3.0,1.0,1.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,21.0,0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,40.0,40.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,40.0,40.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,18.0,1.0,1.0,1.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,2.0,-1.0,8680.0,-1.0,4020.0,-1.0,4.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,1.0,4.0,-1.0,1.0,1.0,6.0,-1.0,11.0,-1.0,18.0,-1.0,3.0,-1.0,46.0,-1.0,13.0,-1.0,2.0,1.0,1.0,2.0,1.0,1.0,2.0,-1.0,-1.0,-1.0,1100.0,40.0,-1.0,2.0,2.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,44000.0,1100.0,0.0,0.0,-1.0,2.0,-1.0,1.0,0.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,-1.0,-1.0,20.0,0.0,0.0,27647640.0,39959244.0,28425098.0,111229844.0,27647640.0,28518067.0,0,0,0,-1.0,-1.0,761077501690006,19300,3,4,2,380,1,0
2,9011,201,1,2,1,0,-1,-1,-1,2,-1,-1,-1,-1,8,2,2020,2,3,3,1,6,5,0,0,0,-1,1,1,0,1,1,2,14,0,17330305.0,1.0,1.0,9.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,2.0,-1.0,-1.0,2.0,-1.0,2.0,5.0,1.0,-1.0,-1.0,-1.0,-1.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,4.0,3.0,3.0,1.0,2.0,1.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,40.0,0.0,0.0,2.0,0.0,2.0,0.0,1.0,0.0,0.0,2.0,0.0,43.0,0.0,2.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,2.0,0.0,57.0,-1.0,57.0,-1.0,57.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,-1,1.0,-1,1.0,-1,1.0,-1,1.0,-1.0,1.0,1.0,-1.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,51.0,0,-1.0,1.0,2.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,20.0,20.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,2.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,31.0,1.0,-1.0,2.0,1.0,1.0,2.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,10.0,-1.0,6991.0,-1.0,540.0,-1.0,7.0,-1.0,2.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,1.0,-1.0,1.0,1.0,5.0,-1.0,-1.0,-1.0,10.0,-1.0,8.0,-1.0,11.0,-1.0,1.0,-1.0,33.0,-1.0,2.0,-1.0,2.0,1.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,-1.0,1.0,0.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,20.0,20.0,20.0,17330305.0,25077749.0,17477571.0,69894376.0,17330305.0,17479668.0,0,0,0,-1.0,-1.0,177750666901000,19300,3,4,2,380,1,0
3,9011,201,2,2,1,0,-1,-1,-1,2,-1,-1,-1,-1,8,2,2020,2,3,3,1,6,5,0,0,0,-1,1,1,0,1,1,2,14,0,17330305.0,1.0,2.0,9.0,1.0,-1.0,-1.0,-1.0,2.0,-1.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,5.0,5.0,-1.0,2.0,3.0,5.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,4.0,3.0,3.0,1.0,2.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,-1.0,-1.0,-1.0,42.0,0.0,0.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,2.0,0.0,39.0,0.0,2.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,1.0,0.0,57.0,-1.0,57.0,-1.0,57.0,-1.0,-1.0,1.0,0.0,-1.0,1.0,-1.0,1.0,-1,1.0,-1,1.0,-1,1.0,-1,1.0,-1.0,1.0,1.0,-1.0,2.0,1.0,1.0,2.0,0.0,0.0,0.0,1.0,0.0,34.0,0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,35.0,35.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,35.0,35.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,17.0,1.0,1.0,1.0,3.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,2.0,-1.0,770.0,-1.0,6230.0,-1.0,4.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,1.0,4.0,-1.0,1.0,-1.0,6.0,-1.0,3.0,-1.0,3.0,-1.0,7.0,-1.0,4.0,-1.0,19.0,-1.0,2.0,1.0,1.0,5.0,2.0,1.0,2.0,-1.0,1.0,1250.0,-1.0,-1.0,-1.0,2.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,45000.0,1250.0,0.0,0.0,-1.0,2.0,-1.0,1.0,0.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,-1.0,-1.0,20.0,0.0,0.0,23835366.0,34449350.0,23908329.0,96708345.0,17330305.0,24269730.0,0,0,0,-1.0,-1.0,177750666901000,19300,3,4,2,380,1,0
4,9011,226,1,-1,-1,0,-1,1,-1,-1,-1,-1,-1,-1,5,2,2020,3,4,3,1,6,1,0,0,0,-1,1,-1,1,0,3,0,-1,1,,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,-1.0,-1,-1.0,-1,-1.0,-1,-1.0,,,,,,,,,,,,,,,0,-1.0,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,,,,,,,0,0,0,-1.0,-1.0,110865752009,19300,3,4,2,380,1,0
5,9011,226,1,-1,-1,0,-1,1,-1,-1,-1,-1,-1,-1,5,2,2020,3,5,3,1,6,1,0,0,0,-1,1,-1,1,0,3,0,-1,1,,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,-1.0,-1,-1.0,-1,-1.0,-1,-1.0,,,,,,,,,,,,,,,0,-1.0,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,,,,,,,0,0,0,-1.0,-1.0,609005258110007,19300,3,4,2,380,1,0
6,9011,226,1,-1,-1,0,-1,1,-1,-1,-1,-1,-1,-1,5,2,2020,3,6,3,1,6,1,0,0,0,-1,1,-1,1,0,3,0,-1,1,,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,-1.0,-1,-1.0,-1,-1.0,-1,-1.0,,,,,,,,,,,,,,,0,-1.0,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,,,,,,,0,0,0,-1.0,-1.0,210570160800905,19300,3,4,2,380,1,0
7,9011,226,1,2,-1,0,-1,1,-1,-1,-1,-1,-1,-1,8,2,2020,2,7,3,1,6,1,0,0,0,-1,1,-1,1,0,3,0,-1,1,,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,-1.0,-1,-1.0,-1,-1.0,-1,-1.0,,,,,,,,,,,,,,,0,-1.0,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,,,,,,,0,0,0,-1.0,-1.0,4890213002155,19300,3,4,2,380,1,0
8,9011,225,1,1,-1,1,-1,-1,-1,-1,-1,-1,-1,-1,8,2,2020,2,8,3,1,6,1,0,0,0,-1,1,-1,1,0,3,0,-1,1,,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,-1.0,-1,-1.0,-1,-1.0,-1,-1.0,,,,,,,,,,,,,,,0,-1.0,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,,,,,,,0,0,0,-1.0,-1.0,310950120805204,19300,3,4,2,380,1,0
9,9011,201,1,1,1,1,-1,-1,-1,1,-1,-1,-1,-1,8,2,2020,2,9,3,1,6,1,0,1,0,-1,1,1,0,7,1,1,8,23,13988189.0,1.0,1.0,9.0,1.0,-1.0,-1.0,-1.0,2.0,-1.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,5.0,4.0,3.0,2.0,3.0,5.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,4.0,3.0,3.0,1.0,2.0,1.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,41.0,0.0,0.0,2.0,0.0,2.0,0.0,4.0,0.0,0.0,2.0,0.0,44.0,0.0,1.0,0.0,4.0,0.0,3.0,-1.0,-1.0,-1.0,1.0,57.0,-1.0,57.0,-1.0,57.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,-1,1.0,-1,1.0,-1,1.0,-1,1.0,-1.0,5.0,1.0,-1.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,67.0,0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,30.0,30.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,30.0,30.0,-1.0,3.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,16.0,1.0,1.0,2.0,2.0,-1.0,12.0,22.0,-1.0,-1.0,1.0,7.0,-1.0,7780.0,-1.0,5940.0,-1.0,7.0,-1.0,2.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,1.0,-1.0,1.0,1.0,5.0,-1.0,-1.0,-1.0,10.0,-1.0,9.0,-1.0,14.0,-1.0,5.0,-1.0,38.0,-1.0,17.0,-1.0,2.0,1.0,-1.0,3.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,20.0,20.0,20.0,13988189.0,20241553.0,13089234.0,56139150.0,13988189.0,14039666.0,0,0,0,-1.0,-1.0,210005930580241,19300,3,4,2,380,1,0


### **1.2 Remove Rows/obs**

#### **1.2.1 Remove Non-Civilian Noninstitutional Population**

We care about **Civilian Noninstitutional Population** when studying employment. By [definition](https://webapps.dol.gov/dolfaq/go-dol-faq.asp?faqid=111&faqsub=Employment+%2F+Unemployment&faqtop=Statistics&topicid=6), **Civilian Noninstitutional Population** refers to persons who are 16 years of age and older residing in the 50 states and the District of Columbia, and who are not inmates of institutions (e.g., penal and mental facilities, homes for the aged), and who are not on active duty in the Armed Forces.

Two variables are used here: `prtage` (age) and `prpertyp` (person type). We remove observations with `prtage` less than 16 or with `prpertyp` equal to 1 (child household member) or 3 (Armed Forces member). In these cases, `pemlr` (labor force status) is recorded as -1.

In [171]:
print('============================\nCheck before removing rows:')
# NOTE: (pemlr is -1) may NOT match exactly (prpertyp is 1 or 3)
a = raw_df[raw_df['pemlr']==-1]['prpertyp'].value_counts()
b = raw_df[raw_df['prtage'] < 16].shape[0]
c = raw_df[raw_df['pemlr']==-1].shape[0]
print('Value counts for person type when pemlr is -1: \n', a)
print('Counts of person who is less than 16 years old:', b)
print('Counts of pemlr is -1:', c)


# remove: child household members (prpertyp 1), Armed Forces members (prpertyp 3), and anyone under 16
raw_df.drop(raw_df[(raw_df['prtage'] < 16) | (raw_df['prpertyp'].isin([1,3]))].index, inplace = True)


print('============================\nCheck after removing rows:')
a = raw_df[raw_df['pemlr']==-1]['prpertyp'].value_counts()
b = raw_df[raw_df['prtage'] < 16].shape[0]
c = raw_df[raw_df['pemlr']==-1].shape[0]
print('Value counts for person type when pemlr is -1: ', a) # should be empty series
print('Counts of person who is less than 16 years old:', b) # should be 0
print('Counts of pemlr is -1:', c) # should be 0
print('Shape of dataset:', raw_df.shape)

Check before removing rows:
Value counts for person type when pemlr is -1: 
 1.0    223289
3.0      4296
Name: prpertyp, dtype: int64
Counts of person who is less than 16 years old: 239334
Counts of pemlr is -1: 227585
Check after removing rows:
Value counts for person type when pemlr is -1:  Series([], Name: prpertyp, dtype: int64)
Counts of person who is less than 16 years old: 0
Counts of pemlr is -1: 0
Shape of dataset: (1330594, 388)
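
The same filter can also be written as a "keep" mask instead of a drop, which some readers find easier to check against the definition. This is a minimal sketch on made-up rows, assuming `prpertyp` code 2 denotes an adult civilian household member.

```python
import pandas as pd

# Toy rows; prpertyp: 1 = child household member, 2 = adult civilian, 3 = Armed Forces member
toy = pd.DataFrame({
    'prtage':   [10, 15, 16, 30, 45, 70],
    'prpertyp': [1,  1,  2,  2,  3,  2],
    'pemlr':    [-1, -1, 1,  4,  -1, 5],
})

# Keep only persons aged 16+ who are neither children (1) nor in the Armed Forces (3)
keep = (toy['prtage'] >= 16) & (~toy['prpertyp'].isin([1, 3]))
civilian = toy[keep].reset_index(drop=True)

print(civilian)
```

By De Morgan's law this mask is the exact complement of the drop condition used in the cell above.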


#### **1.2.2 Remove Rows with Missing Values**

The variable `HUFINAL` records the final outcome of the interview: its values indicate whether the survey was completed and, if not, why. We write a function `missing_perc_row` to compute the average percentage of missing values (NaN) per row for each value of `HUFINAL`, then a second function `remove_missing_row` to drop rows whose `HUFINAL` value has an average missing percentage above a given threshold (currently set to 10%).


This cell can take more than a minute to run, depending on your computer's performance.

In [172]:
def missing_perc_row(df, col):
    """
    Input: dataframe, column name = 'HUFINAL'
    Output: dictionary mapping each distinct value of the column to the
    average percentage of missing cells per row among rows with that value
    """
    dct = {}
    for cat in df[col].unique():
        subset = df[df[col] == cat]  # filter once and reuse
        avg_missing = subset.isnull().sum().sum() / subset.shape[0]
        dct[cat] = avg_missing / df.shape[1] * 100
    return dct

def remove_missing_row(df, col, threshold):
    """
    Input: dataframe, column name = 'HUFINAL', threshold (in percent)
    Output: the same dataframe, modified in place, with rows removed
    remove rows whose column value has an average missing percentage > threshold
    """
    # compute the percentages from df itself instead of relying on a global variable
    missing_row_dict = missing_perc_row(df, col)
    drop_keys = [key for key, value in missing_row_dict.items() if value > threshold]
    # drop all flagged values in one pass
    df.drop(df[df[col].isin(drop_keys)].index, inplace=True)
    return df


print('============================')
missing_row_dict = missing_perc_row(raw_df, 'HUFINAL')
print('The raw dataset has the following percentage of missing for each value of HUFINAL:\n', missing_row_dict)

print('============================')
raw_df = remove_missing_row(raw_df, 'HUFINAL', 10)
print('After dropping rows, the dataset has the following percentage of missing for each value of HUFINAL:\n', missing_perc_row(raw_df, 'HUFINAL'))
print('Shape of dataset:', raw_df.shape)

The raw dataset has the following percentage of missing for each value of HUFINAL:
 {201: 0.0006854764158332043, 226: 79.64022206866123, 225: 79.73039768615865, 228: 79.89690721649485, 231: 79.89690721649485, 230: 79.89690721649485, 227: 79.63917525773195, 229: 79.89690721649485, 223: 79.89690721649485, 232: 79.89690721649485, 218: 79.6443601957218, 203: 0.0, 1: 0.0, 2: 0.0, 259: 79.89690721649485, 243: 79.89690721649485, 240: 79.89690721649485, 213: 79.648331524688, 219: 79.64499107041462, 217: 79.6438237299341, 233: 79.89690721649485, 241: 79.89690721649485, 224: 79.89690721649485, 244: 79.89690721649485, 248: 79.89690721649485, 245: 79.89690721649485, 216: 79.64951824656458, 214: 79.63917525773195, 5: 0.0, 4: 0.0, 247: 79.89690721649485, 204: 5.7802627203192545, 258: 79.89690721649485, 20: 79.89690721649485, 242: 79.89690721649485, 205: 0.0, 6: 0.0}
After dropping rows, the dataset has the following percentage of missing for each value of HUFINAL:
 {201: 0.0006854764158332043, 203: 
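
Since the loop-based check can be slow on the full dataset, a vectorized variant may be worth trying. The sketch below is an assumption about how that could look, not the notebook's method: `missing_perc_by_group` is a hypothetical helper that computes the same per-value averages with a single `groupby`, shown on toy data.

```python
import numpy as np
import pandas as pd

def missing_perc_by_group(df, col):
    """Average percentage of missing cells per row, for each distinct value of `col`."""
    # share of missing cells in each row, as a percentage of all columns
    row_missing_pct = df.isnull().mean(axis=1) * 100
    # average the row-level percentages within each group
    return row_missing_pct.groupby(df[col]).mean().to_dict()

toy = pd.DataFrame({
    'HUFINAL': [201, 201, 226, 226],
    'a': [1.0, 2.0, np.nan, np.nan],
    'b': [1.0, np.nan, np.nan, 4.0],
    'c': [1.0, 2.0, 3.0, np.nan],
})
result = missing_perc_by_group(toy, 'HUFINAL')
print(result)
```

Every row has the same number of columns, so averaging the row-level percentages gives the same figure as dividing the total missing count by rows times columns.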

### **1.3 Select Variables of Interest**

a. Link Datasets Variables
- In order to link the same individuals across months, three variables must be used: HRHHID, HRHHID2, and PULINENO.

b. Geographic and Demographic Variables

c. Labor Force Variables
- Currently selected variables may be changed in the future.

d. Weights
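
To make the role of the link variables concrete, here is a minimal sketch of matching persons across two months with the three keys. The two toy dataframes and all their values are made up, and real linking may need further consistency checks (for example on sex or age) that are not shown.

```python
import pandas as pd

LINK_KEYS = ['hrhhid', 'hrhhid2', 'PULINENO']

# Toy extracts for two consecutive months (all values made up)
jan = pd.DataFrame({
    'hrhhid': [100, 100, 200], 'hrhhid2': [9011, 9011, 9012],
    'PULINENO': [1, 2, 1], 'pemlr': [1, 5, 4],
})
feb = pd.DataFrame({
    'hrhhid': [100, 200, 300], 'hrhhid2': [9011, 9012, 9011],
    'PULINENO': [1, 1, 1], 'pemlr': [1, 1, 7],
})

# Inner join keeps only persons observed in both months;
# validate='one_to_one' raises an error if the keys are not unique within a month
linked = jan.merge(feb, on=LINK_KEYS, how='inner',
                   suffixes=('_jan', '_feb'), validate='one_to_one')
print(linked)
```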

In [173]:
# Original columns
original_cols = raw_df.columns.values
original_cols

array(['hrhhid2', 'HUFINAL', 'OCCURNUM', 'HUINTTYP', 'HURESPLI',
       'HUPRSCNT', 'HUTYPEA', 'HUTYPB', 'HUTYPC', 'HUBUS', 'HUBUSL1',
       'HUBUSL2', 'HUBUSL3', 'HUBUSL4', 'HRMIS', 'HRMONTH', 'HRYEAR4',
       'HRLONGLK', 'qstnum', 'gereg', 'gestfips', 'gediv', 'hehousut',
       'hxhousut', 'hephoneo', 'hxphoneo', 'hetelavl', 'hxtelavl',
       'hetelhhd', 'hxtelhhd', 'hrhtype', 'hrintsta', 'hrnumhou',
       'hefaminc', 'hxfaminc', 'hwhhwgt', 'hwhhwtln', 'PULINENO',
       'PUCHINHH', 'PUWK', 'PUBUS1', 'PUDIS', 'PULAY', 'PUHROFF1',
       'PUHROFF2', 'PUHROT1', 'PUHROT2', 'PUABSOT', 'PUBUSCK1',
       'PUBUSCK2', 'PUBUSCK3', 'PUBUSCK4', 'PURETOT', 'PUHRCK1',
       'PUHRCK2', 'PUHRCK3', 'PUHRCK4', 'PUHRCK5', 'PUHRCK6', 'PUHRCK7',
       'PUHRCK12', 'PULAYDT', 'PULAY6M', 'PULAYAVR', 'PULK', 'PULKAVR',
       'PULAYCK1', 'PULAYCK2', 'PULAYCK3', 'PUDWCK1', 'PUDWCK2',
       'PUDWCK3', 'PUDWCK4', 'PUDWCK5', 'PUJHCK1', 'PUJHCK2', 'PUJHDP1O',
       'PUJHCK3', 'PUJHCK4', 'PUJHCK5', 'PUL

In [174]:
Link_VARS_Label = {
    'hrhhid': 'Household ID',
    'hrhhid2': 'Household ID2',
    'PULINENO': 'Line number of person'
}
Link_VARS = list(Link_VARS_Label.keys())


GeoDemographic_VARS_Label = {
    'HRYEAR4': 'year of interview', 
    'HRMONTH': 'month of interview', 
    'HRMIS': 'month in sample (1-8); can be used to work out the month of the previous interview when linking datasets',
    'gestfips': 'state fips code',
    'gtco': 'county fips code',
    'gtmetsta': 'metropolitan status', 
    'prtage': 'age, 0-79 in years, 80 for 80-84, 85 for 85+',
    'pemaritl': 'marital status',
    'pesex': 'sex',
    'peeduca': 'highest level of school completed',
    'ptdtrace': 'race',
    'pehspnon': 'hispanic or non-hispanic',
    'prcitshp': 'citizenship status',
    'hefaminc': 'family income'
}
GeoDemographic_VARS = list(GeoDemographic_VARS_Label.keys())


Labor_VARS_Label = {
    'pemlr': 'monthly labor force status recode. EDITED UNIVERSE: PRPERTYP = 2 (already satisfied)',
    'prwkstat': 'full-time or part-time work status. EDITED UNIVERSE: PEMLR = 1-7',
    'pternwa': 'weekly earnings recode. Calculated for all workers. Collected for one-quarter of the sample. EDITED UNIVERSE: PRERELG = 1',
    'prerelg': 'flag, if earnings eligible for editing',
    'pternhly': 'hourly earnings recode. Only calculated for hourly-paid workers. Collected for one-quarter of the sample. EDITED UNIVERSE: PEERNPER = 1 OR PEERNRT = 1',
    'peernper': 'earnings periodicity',
    'peernrt': 'confirm hourly paid',
    'pemjot': 'do you have more than one job. EDITED UNIVERSE: PEMLR = 1, 2',
    'pemjnum': 'number of jobs. EDITED UNIVERSE: PEMJOT = 1',
    'pehruslt': 'usual weekly hours worked. -4 means varies. EDITED UNIVERSE: PEMLR = 1 OR 2',
    'pehractt': 'actual weekly hours worked during survey week. EDITED UNIVERSE: PEMLR = 1',
    'peernlab': 'member of union. EDITED UNIVERSE: (PEIO1COW = 1-5 AND PEMLR = 1-2 AND HRMIS = 4, 8)',
    'prmjocgr': 'major occupational category. EDITED UNIVERSE: PRMJOCC = 1-11',
    # adjacent string literals are joined, so no indentation whitespace ends up inside the label
    'peio1cow': ('class of worker on main job. EDITED UNIVERSE: (PEMLR = 1-3) OR (PEMLR = 4 AND PELKLWO = 1-2) '
                 'OR (PEMLR = 5 AND (PENLFJH = 1 OR PEJHWKO = 1)) OR (PEMLR = 6 AND PENLFJH = 1) '
                 'OR (PEMLR = 7 AND (PENLFJH = 1 OR PEJHWKO = 1))'),
}
Labor_VARS = list(Labor_VARS_Label.keys())


Weights_VAR_Label = {
    'hwhhwgt': 'household weight',
    'pwlgwgt': 'longitudinal weight',
    'pworwgt': 'outgoing rotation weight',
    'pwvetwgt': 'veteran weight',
    'pwsswgt': 'second stage weight',
    'pwcmpwgt': 'composite final weight'
}
Weights_VARS = list(Weights_VAR_Label.keys())

In [175]:
useful_vars = Link_VARS + GeoDemographic_VARS + Labor_VARS + Weights_VARS
raw_df = raw_df[useful_vars].copy()  # copy so later in-place edits do not act on a slice of the original
raw_df.shape

(1029862, 37)
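
As a hedged illustration of why the weight variables are kept, the sketch below computes a weighted unemployment rate on made-up rows. It assumes the weights carry four implied decimal places (as in the public-use record layout) and uses the composite weight only as an example; which weight suits a given estimate depends on the analysis.

```python
import pandas as pd

# Toy rows; pemlr codes 1-2 = employed, 3-4 = unemployed, 5-7 = not in labor force
toy = pd.DataFrame({
    'pemlr':    [1, 2, 3, 4, 5, 1],
    'pwcmpwgt': [20000000.0, 10000000.0, 5000000.0, 5000000.0, 30000000.0, 10000000.0],
})

# Assumption: weights are stored with four implied decimal places
weight = toy['pwcmpwgt'] / 10_000

employed = weight[toy['pemlr'].isin([1, 2])].sum()
unemployed = weight[toy['pemlr'].isin([3, 4])].sum()

# Unemployment rate = unemployed / labor force, where labor force = employed + unemployed
unemployment_rate = unemployed / (employed + unemployed) * 100
print(round(unemployment_rate, 2))
```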

In [176]:
# As we might need to link datasets in the future, make sure there is no missing value in the link variables
print(raw_df[Link_VARS].isnull().sum())
raw_df = raw_df.dropna(subset=Link_VARS)

### **1.4 Rename Variables and Add Dictionary for Categorical Variables**

In [177]:
rename_dict = {
    'hrhhid': 'hrhhid', 'hrhhid2': 'hrhhid2', 'PULINENO': 'PULINENO',
    'HRYEAR4': 'Year', 'HRMONTH': 'Month', 'HRMIS': 'MonthInSample', 'gestfips': 'State', 'gtco': 'County', 'gtmetsta': 'Metropolitan', 'prtage': 'Age', 
    'pemaritl': 'Marital', 'pesex': 'Gender', 'peeduca': 'Educ','ptdtrace': 'Race', 'pehspnon': 'Hispanic', 'prcitshp': 'Citizen','hefaminc': 'FamIncome',
    'pemlr': 'LaborForce', 'prwkstat': 'FullPartTime', 'pternwa': 'WeeklyEarning', 'prerelg': 'flagWeekEarn', 'pternhly': 'HourlyEarning', 
    'peernper': 'flagPeriodicityEarn', 'peernrt': 'flagConfirmHourly', 'pemjot': 'SingleJob', 'pemjnum': 'NumJobs', 'pehruslt': 'UsualHours', 'pehractt': 'ActualHours', 
    'peernlab': 'Union', 'prmjocgr': 'Occupation', 'peio1cow': 'WorkerClass',
}

raw_df.rename(columns=rename_dict, inplace=True)

In [178]:
raw_df.columns

Index(['hrhhid', 'hrhhid2', 'PULINENO', 'Year', 'Month', 'MonthInSample',
       'State', 'County', 'Metropolitan', 'Age', 'Marital', 'Gender', 'Educ',
       'Race', 'Hispanic', 'Citizen', 'FamIncome', 'LaborForce',
       'FullPartTime', 'WeeklyEarning', 'flagWeekEarn', 'HourlyEarning',
       'flagPeriodicityEarn', 'flagConfirmHourly', 'SingleJob', 'NumJobs',
       'UsualHours', 'ActualHours', 'Union', 'Occupation', 'WorkerClass',
       'hwhhwgt', 'pwlgwgt', 'pworwgt', 'pwvetwgt', 'pwsswgt', 'pwcmpwgt'],
      dtype='object')
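
The renamed columns still hold numeric codes. One way a code-to-label dictionary of the kind defined in this section could be applied is sketched below; the two-variable `categorical_val` dictionary and the toy dataframe are stand-ins for this example only.

```python
import pandas as pd

# Small stand-in dictionary (a subset, for illustration only)
categorical_val = {
    'Hispanic': {1: 'hispanic', 2: 'non-hispanic'},
    'Metropolitan': {1: 'metropolitan', 2: 'nonmetropolitan', 3: 'not identified'},
}

toy = pd.DataFrame({'Hispanic': [1, 2, 2], 'Metropolitan': [1, 3, 2], 'Age': [30, 41, 67]})

labeled = toy.copy()
for col, mapping in categorical_val.items():
    # codes absent from the mapping become NaN, which makes gaps easy to spot
    labeled[col] = labeled[col].map(mapping).astype('category')

print(labeled)
```

Working on a copy keeps the numeric codes available in case a label needs to be checked against the data dictionary.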

Original Values of Categorical Variables

In [179]:
GeoDemographic_Categorical_Val = {
    'State': {1: 'AL', 2: 'AK', 4: 'AZ', 5: 'AR', 6: 'CA', 8: 'CO', 9: 'CT', 10: 'DE', 
                11: 'DC', 12: 'FL', 13: 'GA', 15: 'HI', 16: 'ID', 17: 'IL', 18: 'IN', 19: 'IA', 20: 'KS', 
                21: 'KY', 22: 'LA', 23: 'ME', 24: 'MD', 25: 'MA', 26: 'MI', 27: 'MN', 28: 'MS', 29: 'MO', 30: 'MT', 
                31: 'NE', 32: 'NV', 33: 'NH', 34: 'NJ', 35: 'NM', 36: 'NY', 37: 'NC', 38: 'ND', 39: 'OH', 40: 'OK', 
                41: 'OR', 42: 'PA', 44: 'RI', 45: 'SC', 46: 'SD', 47: 'TN', 48: 'TX', 49: 'UT', 50: 'VT', 
                51: 'VA', 53: 'WA', 54: 'WV', 55: 'WI', 56: 'WY', },
    'Metropolitan': {1: 'metropolitan', 2: 'nonmetropolitan', 3: 'not identified'},
    'Marital': {1: 'married - spouse present', 2: 'married - spouse absent', 3: 'widowed', 4: 'divorced', 5: 'separated', 6: 'never married'},
    'Gender': {1: 'male', 2: 'female'},
    'Educ': {31: 'less than 1st grade', 32: '1st, 2nd, 3rd or 4th grade', 33: '5th or 6th grade', 
                34: '7th or 8th grade', 35: '9th grade', 36: '10th grade', 37: '11th grade', 38: '12th grade no diploma',
                39: 'high school grad-diploma or equiv (ged)', 40: 'some college', 
                41: 'associate degree-occupational/vocational', 42: 'associate degree-academic program', 43: 'bachelor\'s degree',
                44: 'master\'s degree', 45: 'professional school degree', 46: 'doctorate degree'},
    'Race': {1: 'white only', 2: 'black only', 3: 'american indian, alaskan native only', 4: 'asian only',
                5: 'hawaiian/pacific islander only', 6: 'white-black', 7: 'white-ai', 8: 'white-asian', 9: 'white-hp',
                10: 'black-ai', 11: 'black-asian', 12: 'black-hp', 13: 'ai-asian', 14: 'ai-hp', 15: 'asian-hp',
                16: 'w-b-ai', 17: 'w-b-a', 18: 'w-b-hp', 19: 'w-ai-a', 20: 'w-ai-hp', 21: 'w-a-hp', 22: 'b-ai-a',
                23: 'w-b-ai-a', 24: 'w-ai-a-hp', 25: 'other 3 race combinations', 26: 'other 4 and 5 race combinations'},
    'Hispanic': {1: 'hispanic', 2: 'non-hispanic'},
    'Citizen': {1: 'native, born in US', 2: 'native, born in Puerto Rico or other US island areas',
                3: 'native, born abroad', 4: 'foreign born, US citizen by naturalization', 5: 'foreign born, not a citizen'},
    'FamIncome': {1: 'less than 5,000', 2: '5,000 to 7,499', 3: '7,500 to 9,999', 4: '10,000 to 12,499', 5: '12,500 to 14,999',
                6: '15,000 to 19,999', 7: '20,000 to 24,999', 8: '25,000 to 29,999', 9: '30,000 to 34,999', 10: '35,000 to 39,999',
                11: '40,000 to 49,999', 12: '50,000 to 59,999', 13: '60,000 to 74,999', 14: '75,000 to 99,999', 
                15: '100,000 to 149,999', 16: '150,000 or more'}
}

GeoDemographic_Categorical_VARS = list(GeoDemographic_Categorical_Val.keys())

Employment_Categorical_Val = {
    'LaborForce': {1: 'employed - at work', 2: 'employed - absent', 3: 'unemployed - on layoff', 4: 'unemployed - looking',
                5: 'not in labor force - retired', 6: 'not in labor force - disabled', 7: 'not in labor force - other'},
    'FullPartTime': {1: 'not in labor force', 2: 'FT, usually FT', 3: 'PT for economic reasons, usually FT', 4: 'PT for non-economic reasons, usually FT',
                    5: 'not at work, usually FT', 6: 'PT hours, usually PT for economic reasons', 7: 'PT hours, usually PT for non-economic reasons',
                    8: 'FT hours, usually PT for economic reasons', 9: 'FT hours, usually PT for non-economic reasons', 10: 'not at work, usually PT',
                    11: 'unemployed FT', 12: 'unemployed PT'},
    'flagWeekEarn': {0: 'not eligible for edit', 1: 'eligible for edit'},
    'flagPeriodicityEarn': {1: 'hourly', 2: 'weekly', 3: 'bi-weekly', 4: 'twice monthly', 5: 'monthly', 6: 'annually', 7: 'other'},
    'flagConfirmHourly': {1: 'yes', 2: 'no'},
    'SingleJob': {1: 'yes', 2: 'no'},
    'NumJobs': {2: '2 jobs', 3: '3 jobs', 4: '4 or more jobs'},
    'Union': {1: 'yes', 2: 'no'},
    'Occupation': {1: 'management, business, science, and arts occupations', 2: 'service occupations', 3: 'sales and office occupations',
                4: 'farming, fishing, and forestry occupations', 5: 'construction, extraction, and maintenance occupations',
                6: 'production, transportation, and material moving occupations', 7: 'armed forces'},
    'WorkerClass': {1: 'gov - federal', 2: 'gov - state', 3: 'gov - local', 4: 'private - profit', 5: 'private - nonprofit', 6: 'self-employ, incorporated',
                7: 'self-employ, unincorporated', 8: 'without pay'},
}

Employment_Categorical_VARS = list(Employment_Categorical_Val.keys())
Categorical_Vars = GeoDemographic_Categorical_VARS + Employment_Categorical_VARS
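A minimal sketch of how these value dictionaries could be used to decode a coded column into readable labels. The toy frame `demo`, the `value_labels` excerpt, and the `_label` column suffix are hypothetical and only illustrate `Series.map`; the real dictionaries above would be used in the same way.

```python
import pandas as pd

# Small excerpt of the value dictionary defined above (illustration only).
value_labels = {
    'Metropolitan': {1: 'metropolitan', 2: 'nonmetropolitan', 3: 'not identified'},
    'Hispanic': {1: 'hispanic', 2: 'non-hispanic'},
}

# Hypothetical toy data holding the raw survey codes.
demo = pd.DataFrame({'Metropolitan': [1, 2, 3], 'Hispanic': [2, 1, 2]})

# Decode each coded column into a parallel "<name>_label" column.
for col, mapping in value_labels.items():
    demo[col + '_label'] = demo[col].map(mapping)

print(demo)
```

Codes that are absent from a mapping become NaN under `map`, which makes unexpected values easy to spot.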

## **Part 2: Clean Variables**

### **2.1 Check overall information**

In [180]:
raw_df.head()

Unnamed: 0,hrhhid,hrhhid2,PULINENO,Year,Month,MonthInSample,State,County,Metropolitan,Age,Marital,Gender,Educ,Race,Hispanic,Citizen,FamIncome,LaborForce,FullPartTime,WeeklyEarning,flagWeekEarn,HourlyEarning,flagPeriodicityEarn,flagConfirmHourly,SingleJob,NumJobs,UsualHours,ActualHours,Union,Occupation,WorkerClass,hwhhwgt,pwlgwgt,pworwgt,pwvetwgt,pwsswgt,pwcmpwgt
0,110024067491,11011,1.0,2020,2,3,1,0,1,40.0,6.0,1.0,43.0,1.0,2.0,1.0,13,1.0,2.0,-1.0,0.0,-1.0,-1.0,-1.0,2.0,-1.0,40.0,40.0,-1.0,1.0,2.0,18525008.0,26774269.0,0.0,18566761.0,18525008.0,18786224.0
1,761077501690006,9012,1.0,2020,2,8,1,3,1,21.0,2.0,1.0,39.0,1.0,1.0,1.0,10,1.0,2.0,44000.0,1.0,1100.0,1.0,-1.0,2.0,-1.0,40.0,40.0,2.0,2.0,4.0,27647640.0,39959244.0,111229844.0,28425098.0,27647640.0,28518067.0
2,177750666901000,9011,1.0,2020,2,8,1,3,1,51.0,1.0,2.0,43.0,1.0,2.0,1.0,14,2.0,10.0,-1.0,0.0,-1.0,-1.0,-1.0,2.0,-1.0,20.0,-1.0,-1.0,1.0,7.0,17330305.0,25077749.0,69894376.0,17477571.0,17330305.0,17479668.0
3,177750666901000,9011,2.0,2020,2,8,1,3,1,34.0,1.0,1.0,39.0,1.0,2.0,1.0,14,1.0,2.0,45000.0,1.0,1250.0,2.0,1.0,2.0,-1.0,35.0,35.0,2.0,5.0,4.0,17330305.0,34449350.0,96708345.0,23908329.0,23835366.0,24269730.0
9,210005930580241,9011,1.0,2020,2,8,1,3,1,67.0,4.0,2.0,44.0,1.0,2.0,1.0,8,1.0,7.0,-1.0,0.0,-1.0,-1.0,-1.0,2.0,-1.0,30.0,30.0,-1.0,3.0,7.0,13988189.0,20241553.0,56139150.0,13089234.0,13988189.0,14039666.0


In [181]:
columns = raw_df.columns.values
columns

array(['hrhhid', 'hrhhid2', 'PULINENO', 'Year', 'Month', 'MonthInSample',
       'State', 'County', 'Metropolitan', 'Age', 'Marital', 'Gender',
       'Educ', 'Race', 'Hispanic', 'Citizen', 'FamIncome', 'LaborForce',
       'FullPartTime', 'WeeklyEarning', 'flagWeekEarn', 'HourlyEarning',
       'flagPeriodicityEarn', 'flagConfirmHourly', 'SingleJob', 'NumJobs',
       'UsualHours', 'ActualHours', 'Union', 'Occupation', 'WorkerClass',
       'hwhhwgt', 'pwlgwgt', 'pworwgt', 'pwvetwgt', 'pwsswgt', 'pwcmpwgt'],
      dtype=object)

In [182]:
# Check missing
# If a column contains missing (NaN), take care of it later.
# Theoretically, we should not have any missing values.

for col in columns:
    if raw_df[col].isnull().sum() > 0:
        print(col, raw_df[col].isnull().sum())
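The same check can be written without an explicit loop. The sketch below uses a hypothetical toy frame `demo` and also counts the `-1` sentinel, since blank answers in these data are stored as `-1` rather than NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one NaN in Age and one -1 sentinel in Union.
demo = pd.DataFrame({
    'Age': [40.0, np.nan, 51.0],
    'State': [1, 1, 6],
    'Union': [-1.0, 1.0, 2.0],
})

# Count NaN per column and keep only the columns that have any.
missing = demo.isnull().sum()
missing = missing[missing > 0]
print(missing)

# Count the -1 sentinel per column in the same way.
sentinel = (demo == -1).sum()
sentinel = sentinel[sentinel > 0]
print(sentinel)
```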

In [183]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1029845 entries, 0 to 1574207
Data columns (total 37 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   hrhhid               1029845 non-null  int64  
 1   hrhhid2              1029845 non-null  int64  
 2   PULINENO             1029845 non-null  float64
 3   Year                 1029845 non-null  int64  
 4   Month                1029845 non-null  int64  
 5   MonthInSample        1029845 non-null  int64  
 6   State                1029845 non-null  int64  
 7   County               1029845 non-null  int64  
 8   Metropolitan         1029845 non-null  int64  
 9   Age                  1029845 non-null  float64
 10  Marital              1029845 non-null  float64
 11  Gender               1029845 non-null  float64
 12  Educ                 1029845 non-null  float64
 13  Race                 1029845 non-null  float64
 14  Hispanic             1029845 non-null  float64
 15

In [184]:
Link_VARS_Label = {
    'hrhhid': 'Household ID',
    'hrhhid2': 'Household ID2',
    'PULINENO': 'Line number of person'
}

Link_VARS = ['hrhhid', 'hrhhid2', 'PULINENO']

GeoDemographic_VARS_Label = {
    'Year': 'year of interview', 
    'Month': 'month of interview', 
    'MonthInSample': 'month in sample (1-8); can be used to link datasets across months',
    'State': 'state fips code',
    'County': 'county fips code',
    'Metropolitan': 'metropolitan status', 
    'Age': 'age, 0-79 in years, 80 for 80-84, 85 for 85+',
    'Marital': 'marital status',
    'Gender': 'sex',
    'Educ': 'highest level of school completed',
    'Race': 'race',
    'Hispanic': 'hispanic or non-hispanic',
    'Citizen': 'citizenship status',
    'FamIncome': 'family income'
}

GeoDemographic_VARS = list(GeoDemographic_VARS_Label.keys())

Labor_VARS_Label = {
    'LaborForce': 'monthly labor force status recode. EDITED UNIVERSE: PRPERTYP = 2 (already satisfied)',
    'FullPartTime': 'full-time or part-time work status. EDITED UNIVERSE: PEMLR = 1-7',
    'WeeklyEarning': ('weekly earnings recode. Calculated for all workers. Collected for one-quarter of the sample. '
                      'EDITED UNIVERSE: flagWeekEarn/PRERELG = 1'),
    'flagWeekEarn': 'prerelg, flag, if earnings eligible for editing',
    'HourlyEarning': ('hourly earnings recode. Only calculated for hourly-paid workers. '
                      'Collected for one-quarter of the sample. EDITED UNIVERSE: PEERNPER = 1 OR PEERNRT = 1'),
    'flagPeriodicityEarn': 'peernper, earning periodicity',
    'flagConfirmHourly': 'peernrt, confirm hourly paid',
    'SingleJob': 'do you have more than one job. EDITED UNIVERSE: PEMLR = 1, 2',
    'NumJobs': 'number of jobs. EDITED UNIVERSE: PEMJOT = 1',
    'UsualHours': 'usual weekly hours worked. -4 means varies. EDITED UNIVERSE: PEMLR = 1 OR 2',
    'ActualHours': 'actual weekly hours worked during survey week. EDITED UNIVERSE: PEMLR = 1',
    'Union': 'member of union. EDITED UNIVERSE: (PEIO1COW = 1-5 AND PEMLR = 1-2 AND HRMIS = 4, 8)',
    'Occupation': 'major occupational category. EDITED UNIVERSE: PRMJOCC = 1-11',
    'WorkerClass': ('class of worker on main job. EDITED UNIVERSE: (PEMLR = 1-3) OR (PEMLR = 4 AND PELKLWO = 1-2) '
                    'OR (PEMLR = 5 AND (PENLFJH = 1 OR PEJHWKO = 1)) OR (PEMLR = 6 AND PENLFJH = 1) '
                    'OR (PEMLR = 7 AND (PENLFJH = 1 OR PEJHWKO = 1))'),
}

Labor_VARS = list(Labor_VARS_Label.keys())

Weights_VAR_Label = {
    'hwhhwgt': 'household weight',
    'pwlgwgt': 'longitudinal weight',
    'pworwgt': 'outgoing rotation weight',
    'pwvetwgt': 'veteran weight',
    'pwsswgt': 'second stage weight',
    'pwcmpwgt': 'composite final weight'
}

Weights_VARS = list(Weights_VAR_Label.keys())
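A minimal sketch of how the three link variables could be combined into one key for following a person across months. The column name `person_id` and the toy rows are hypothetical; in practice a key like this would typically be checked against demographic variables before it is trusted.

```python
import pandas as pd

link_vars = ['hrhhid', 'hrhhid2', 'PULINENO']

# Hypothetical toy data: the same household seen in two months.
demo = pd.DataFrame({
    'hrhhid': [177750666901000, 177750666901000, 177750666901000],
    'hrhhid2': [9011, 9011, 9011],
    'PULINENO': [1.0, 2.0, 1.0],
    'Month': [2, 2, 3],
})

# Within one month, the link variables should identify a person uniquely.
has_duplicates = demo.duplicated(subset=link_vars + ['Month']).any()

# Join the three link variables into a single string key.
demo['person_id'] = (
    demo['hrhhid'].astype(str) + '_'
    + demo['hrhhid2'].astype(str) + '_'
    + demo['PULINENO'].astype(int).astype(str)
)

# Number of distinct months in which each person appears.
months_observed = demo.groupby('person_id')['Month'].nunique()
print(months_observed)
```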

### **2.2 Geographic and Demographic Variables**

In [185]:
# 1. Metropolitan
print('============================\nBefore cleaning, Metropolitan variable has values:')
print(raw_df['Metropolitan'].value_counts())

print('============================\nAfter cleaning, Metropolitan variable has values:')
# replace 2 (nonmetropolitan) with 0; replace 3 (not identified) with NaN, then impute
raw_df['Metropolitan'] = raw_df['Metropolitan'].replace([2, 3], [0, np.nan])
# forward fill copies the value from the previous row, so the imputed value depends on row order
raw_df['Metropolitan'] = raw_df['Metropolitan'].ffill()
print(raw_df['Metropolitan'].value_counts(dropna=False))

Before cleaning, Metropolitan variable has values:
1    824706
2    194778
3     10361
Name: Metropolitan, dtype: int64
After cleaning, Metropolitan variable has values:
1.0    832635
0.0    197210
Name: Metropolitan, dtype: int64
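Forward filling assigns each "not identified" row the status of whichever row happens to precede it. The sketch below, on a hypothetical toy series, shows that behavior next to one possible alternative: keeping an indicator for the originally unidentified rows so they can be excluded or modeled separately later. The name `not_identified` is an assumption, not part of the original cleaning.

```python
import numpy as np
import pandas as pd

# Hypothetical toy codes: 1 = metropolitan, 2 = nonmetropolitan, 3 = not identified.
codes = pd.Series([1, 3, 2, 3], name='Metropolitan')

# Same recode as in the cell above: 2 -> 0, 3 -> NaN.
recoded = codes.replace({2: 0, 3: np.nan})

# Forward fill: each NaN takes the value of the row before it.
filled = recoded.ffill()

# Alternative: remember which rows were imputed.
not_identified = recoded.isna().astype(int)

print(pd.DataFrame({'raw': codes, 'filled': filled, 'not_identified': not_identified}))
```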


In [186]:
# 2. Age
print('============================\nAge variable:')
print(raw_df['Age'].describe().apply(lambda x: format(x, 'f')))
# NOTE: Age variable is numeric for 16-79, but categorical for 80:80-84, 85:85+

print('============================\nAgeGroup variable:')
# create a new column AgeGroup with 4 categories: young, prime, old, retired
# pd.cut bins are right-inclusive, i.e. (15, 24], (24, 54], (54, 64], (64, 100]
# young age workers: 16-24, prime age workers: 25-54, old age workers: 55-64, retired: 65+
raw_df['AgeGroup'] = pd.cut(raw_df['Age'], bins=[15, 24, 54, 64, 100], labels=['young', 'prime', 'old', 'retired'])
print(raw_df['AgeGroup'].value_counts(dropna=False))

Age variable:
count    1029845.000000
mean          48.896109
std           18.954524
min           16.000000
25%           33.000000
50%           49.000000
75%           64.000000
max           85.000000
Name: Age, dtype: object
AgeGroup variable:
prime      469743
retired    252515
old        178894
young      128693
Name: AgeGroup, dtype: int64
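To make the bin edges concrete, here is a small sketch that applies the same `pd.cut` call to hypothetical boundary ages. With the default `right=True`, each bin includes its upper edge and excludes its lower edge.

```python
import pandas as pd

# Hypothetical ages chosen to sit on the bin boundaries.
ages = pd.Series([16, 24, 25, 54, 55, 64, 65, 80, 85])

groups = pd.cut(
    ages,
    bins=[15, 24, 54, 64, 100],
    labels=['young', 'prime', 'old', 'retired'],
)

print(pd.DataFrame({'Age': ages, 'AgeGroup': groups}))
```

An age of 15 or below would fall outside the first bin and become NaN, which is harmless here because the minimum age in the sample is 16.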


In [187]:
# 3. Marital
print('============================\nBefore cleaning, Marital variable has values:')
print(raw_df['Marital'].value_counts())

print('============================\nAfter cleaning, Marital variable has values:')
# replace 6 (never married) with 0; replace all other values (any condition after marriage) with 1
raw_df['Marital'] = raw_df['Marital'].replace([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 0])
print(raw_df['Marital'].value_counts(dropna=False))

Before cleaning, Marital variable has values:
1.0    533299
6.0    290885
4.0    109789
3.0     66614
5.0     16007
2.0     13251
Name: Marital, dtype: int64
After cleaning, Marital variable has values:
1.0    738960
0.0    290885
Name: Marital, dtype: int64


In [188]:
# 4. Gender
print('============================\nBefore cleaning, Gender variable has values:')
print(raw_df['Gender'].value_counts())

print('============================\nAfter cleaning, Gender variable has values:')
# replace 2 (female) with 0
raw_df['Gender'] = raw_df['Gender'].replace(2, 0)
print(raw_df['Gender'].value_counts())

Before cleaning, Gender variable has values:
2.0    536729
1.0    493116
Name: Gender, dtype: int64
After cleaning, Gender variable has values:
0.0    536729
1.0    493116
Name: Gender, dtype: int64


In [189]:
# 5. Educ
print('============================\nBefore cleaning, Educ variable has values:')
print(raw_df['Educ'].value_counts())

print('============================\nAfter cleaning, Educ variable has values:')
# Simplify Educ somehow, but leave space for future simplification
# replace 31-38 with 0 (less than high school); replace 39 with 1 (high school); 
raw_df['Educ'] = raw_df['Educ'].replace([31, 32, 33, 34, 35, 36, 37, 38], 0)
# replace 40 with 2 (some college); replace 41-42 with 3 (associate);
# replace 43 with 4 (bachelor's); replace 44-45 with 5 (master's or professional school); replace 46 with 6 (doctorate)
raw_df['Educ'] = raw_df['Educ'].replace([39, 40, 41, 42, 43, 44, 45, 46], [1, 2, 3, 3, 4, 5, 5, 6])
print(raw_df['Educ'].value_counts())

Before cleaning, Educ variable has values:
39.0    290410
43.0    216950
40.0    173341
44.0     94424
42.0     58377
41.0     44838
37.0     31150
36.0     25828
46.0     19461
35.0     16055
38.0     14962
45.0     14537
34.0     12997
33.0      9296
32.0      4636
31.0      2583
Name: Educ, dtype: int64
After cleaning, Educ variable has values:
1.0    290410
4.0    216950
2.0    173341
0.0    117507
5.0    108961
3.0    103215
6.0     19461
Name: Educ, dtype: int64
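The recode can also be expressed as one explicit mapping applied in a single pass, which avoids any risk of a later `replace` call touching values produced by an earlier one. This is a sketch on hypothetical toy codes; `educ_map` is an illustrative name and the groups are the same as in the cell above.

```python
import pandas as pd

# 31-38 -> 0 (less than high school); the remaining codes follow the cell above.
educ_map = {code: 0 for code in range(31, 39)}
educ_map.update({39: 1, 40: 2, 41: 3, 42: 3, 43: 4, 44: 5, 45: 5, 46: 6})

# Hypothetical raw codes, stored as floats as in raw_df.
raw_codes = pd.Series([31.0, 38.0, 39.0, 42.0, 45.0, 46.0])

# Cast to int so the lookup keys match the dictionary keys exactly.
recoded = raw_codes.astype(int).map(educ_map)

print(recoded.tolist())
```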


In [190]:
# 6. Race
print('============================\nBefore cleaning, Race variable has values:')
print(raw_df['Race'].value_counts())

print('============================\nAfter cleaning, Race variable has values:')
# replace 1 with 0 (white only), replace 2 with 1 (black only), replace 4 with 2 (asian only)
# replace all other single-race or mixed-race codes with 3
raw_df['Race'] = raw_df['Race'].replace([1, 2, 4], [0, 1, 2])
raw_df['Race'] = raw_df['Race'].replace([3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26], 3)
print(raw_df['Race'].value_counts(dropna=False))

Before cleaning, Race variable has values:
1.0     835863
2.0     103478
4.0      57381
3.0      12057
7.0       5826
6.0       4294
5.0       4023
8.0       3186
10.0       823
9.0        748
15.0       536
21.0       459
16.0       425
11.0       230
17.0       125
19.0        77
20.0        66
13.0        62
12.0        47
26.0        42
25.0        25
18.0        20
14.0        19
23.0        15
22.0        10
24.0         8
Name: Race, dtype: int64
After cleaning, Race variable has values:
0.0    835863
1.0    103478
2.0     57381
3.0     33123
Name: Race, dtype: int64


In [191]:
# 7. Hispanic
print('============================\nBefore cleaning, Hispanic variable has values:')
print(raw_df['Hispanic'].value_counts())

print('============================\nAfter cleaning, Hispanic variable has values:')
# replace 2 (non-hispanic) with 0
raw_df['Hispanic'] = raw_df['Hispanic'].replace(2, 0)
print(raw_df['Hispanic'].value_counts())

Before cleaning, Hispanic variable has values:
2.0    899336
1.0    130509
Name: Hispanic, dtype: int64
After cleaning, Hispanic variable has values:
0.0    899336
1.0    130509
Name: Hispanic, dtype: int64


In [192]:
# 8. Citizen
print('============================\nBefore cleaning, Citizen variable has values:')
print(raw_df['Citizen'].value_counts())

print('============================\nAfter cleaning, Citizen variable has values:')
# replace 1 2 3 4 with 1 as citizen
# replace 5 with 0 as non-citizen
raw_df['Citizen'] = raw_df['Citizen'].replace([1, 2, 3, 4], 1)
raw_df['Citizen'] = raw_df['Citizen'].replace(5, 0)
print(raw_df['Citizen'].value_counts(dropna=False))

Before cleaning, Citizen variable has values:
1.0    881562
4.0     71774
5.0     62421
3.0      9111
2.0      4977
Name: Citizen, dtype: int64
After cleaning, Citizen variable has values:
1.0    967424
0.0     62421
Name: Citizen, dtype: int64


In [193]:
# 9. FamIncome
print('============================\nBefore cleaning, FamIncome variable has values:')
print(raw_df['FamIncome'].value_counts().sort_index())

print('============================\nAfter cleaning, FamIncome variable has values:')
print('No clean action FamIncome variable')

Before cleaning, FamIncome variable has values:
1      16384
2       7831
3      13415
4      18929
5      19153
6      32666
7      42356
8      42366
9      50652
10     50860
11     80316
12     84391
13    110968
14    140481
15    159682
16    159395
Name: FamIncome, dtype: int64
After cleaning, FamIncome variable has values:
No clean action FamIncome variable


### **2.3. Labor Force Variables**

`LaborForce` is the ultimate outcome of a bunch of questions in the survey. The value ranges from employed (at work/absent), unemployed (lay-off/looking), to not in labor force (retired/disabled/other). There should be no missing values (NaN or -1) for this variable after the previous cleaning.

`FullPartTime` is another representation of labor force, indicating full/part time status. We keep this variable because it is the edited universe of `Union`. When we calculate the percentage of members in union, we will refer to this variable.

In [194]:
# 10. LaborForce
print('============================\nLaborForce variable has values:')
print(raw_df['LaborForce'].value_counts(dropna=False).sort_index())

# 11. FullPartTime
print('============================\nBefore cleaning, FullPartTime variable has values:')
print(raw_df['FullPartTime'].value_counts(dropna=False).sort_index())

LaborForce variable has values:
1.0    550971
2.0     25826
3.0     21793
4.0     22886
5.0    224449
6.0     51367
7.0    132553
Name: LaborForce, dtype: int64
Before cleaning, FullPartTime variable has values:
1.0     408369
2.0     410237
3.0      13240
4.0      36975
5.0      18488
6.0      12525
7.0      74876
8.0        587
9.0       2531
10.0      7338
11.0     35077
12.0      9602
Name: FullPartTime, dtype: int64


`WeeklyEarning` is collected for one-quarter of the sample. Its edited universe is `flagWeekEarn/PRERELG=1` (eligible for editing).

`HourlyEarning` is collected for one-quarter of the sample. Only calculated for hourly-paid workers. EDITED UNIVERSE: `flagPeriodicityEarn/PEERNPER = 1` **OR** `flagConfirmHourly/PEERNRT = 1`

In [195]:
# 12. WeeklyEarning and its flag
print('============================\nBefore cleaning, flagWeekEarn variable has values:')
print(raw_df['flagWeekEarn'].value_counts(dropna=False))

print('============================\nWhen flagWeekEarn = 0, WeeklyEarning variable has values:')
print(raw_df.loc[raw_df['flagWeekEarn'] == 0, 'WeeklyEarning'].value_counts(dropna=False).sort_index())

print('============================\nWhen flagWeekEarn = 1, WeeklyEarning variable has values:')
print(raw_df.loc[raw_df['flagWeekEarn'] == 1, 'WeeklyEarning'].describe().apply(lambda x: format(x, 'f')))
print('with {} missing values in WeeklyEarning variable'.format(raw_df.loc[raw_df['flagWeekEarn'] == 1, 'WeeklyEarning'].isnull().sum()))

print('============================\nAfter cleaning, WeeklyEarning variable has values:')
print('No clean action WeeklyEarning variable')

Before cleaning, flagWeekEarn variable has values:
0.0    896174
1.0    133671
Name: flagWeekEarn, dtype: int64
When flagWeekEarn = 0, WeeklyEarning variable has values:
-1.0    896174
Name: WeeklyEarning, dtype: int64
When flagWeekEarn = 1, WeeklyEarning variable has values:
count    133671.000000
mean     106899.663854
std       72878.905592
min           0.000000
25%       54000.000000
50%       86750.000000
75%      142307.000000
max      288461.000000
Name: WeeklyEarning, dtype: object
with 0 missing values in WeeklyEarning variable
After cleaning, WeeklyEarning variable has values:
No clean action WeeklyEarning variable
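A minimal sketch of how the edited universe could be applied when the earnings are eventually used. It assumes that `WeeklyEarning` is stored with two implied decimal places (so 44000 would mean $440.00), which should be confirmed against the data dictionary; the column name `WeeklyEarningUSD` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: one outside the edited universe, two inside it.
demo = pd.DataFrame({
    'flagWeekEarn': [0.0, 1.0, 1.0],
    'WeeklyEarning': [-1.0, 44000.0, 288461.0],
})

# Edited universe: flagWeekEarn/PRERELG = 1.
in_universe = demo['flagWeekEarn'] == 1

# Assumption: two implied decimals, so divide by 100 to get dollars.
# Rows outside the universe become NaN instead of keeping the -1 sentinel.
demo['WeeklyEarningUSD'] = np.where(in_universe, demo['WeeklyEarning'] / 100, np.nan)

print(demo)
```

Using NaN outside the universe keeps the `-1` sentinel from leaking into means or regressions.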


In [196]:
# 13. HourlyEarning and its flag
print('============================\nBefore cleaning, HourlyEarning variable has values:')
print('HourlyEarning variable has {} missing values'.format(raw_df['HourlyEarning'].isnull().sum()))
print('HourlyEarning variable has {} negative values'.format(raw_df.loc[raw_df['HourlyEarning'] < 0, 'HourlyEarning'].shape[0]))

print('============================\nWhen flagPeriodicityEarn = 1:')
print('HourlyEarning variable has {} missing values'.format(raw_df.loc[raw_df['flagPeriodicityEarn'] == 1, 'HourlyEarning'].isnull().sum()))
print('HourlyEarning variable has {} negative values'.format(raw_df.loc[(raw_df['flagPeriodicityEarn'] == 1) & (raw_df['HourlyEarning'] < 0), 
                                                                        'HourlyEarning'].shape[0]))

print('============================\nWhen flagConfirmHourly = 1:')
print('HourlyEarning variable has {} missing values'.format(raw_df.loc[raw_df['flagConfirmHourly'] == 1, 'HourlyEarning'].isnull().sum()))
print('HourlyEarning variable has {} negative values'.format(raw_df.loc[(raw_df['flagConfirmHourly'] == 1) & (raw_df['HourlyEarning'] < 0), 
                                                                        'HourlyEarning'].shape[0]))

print('============================\nWhen or flagPeriodicityEarn = 1 or flagConfirmHourly = 1:')
print('HourlyEarning variable has {} missing values'.format(raw_df.loc[(raw_df['flagPeriodicityEarn'] == 1) | (raw_df['flagConfirmHourly'] == 1), 
                                                                       'HourlyEarning'].isnull().sum()))
print('HourlyEarning variable has {} negative values'.format(raw_df.loc[((raw_df['flagPeriodicityEarn'] == 1) | 
                                                                         (raw_df['flagConfirmHourly'] == 1)) & 
                                                                         (raw_df['HourlyEarning'] < 0), 
                                                                        'HourlyEarning'].shape[0]))

# negate edited universe
print('============================\nWhen negating edited universe:')
print('HourlyEarning variable has {} missing values'.format(raw_df.loc[(raw_df['flagPeriodicityEarn'] != 1) & (raw_df['flagConfirmHourly'] != 1), 
                                                                       'HourlyEarning'].isnull().sum()))

print('============================\nAfter cleaning, HourlyEarning variable has values:')
print('No clean action HourlyEarning variable')
# hourly earning may be used to calculate weekly earning if weekly earning is missing or negative

Before cleaning, HourlyEarning variable has values:
HourlyEarning variable has 0 missing values
HourlyEarning variable has 955744 negative values
When flagPeriodicityEarn = 1:
HourlyEarning variable has 0 missing values
HourlyEarning variable has 9 negative values
When flagConfirmHourly = 1:
HourlyEarning variable has 0 missing values
HourlyEarning variable has 2 negative values
When or flagPeriodicityEarn = 1 or flagConfirmHourly = 1:
HourlyEarning variable has 0 missing values
HourlyEarning variable has 11 negative values
When negating edited universe:
HourlyEarning variable has 0 missing values
After cleaning, HourlyEarning variable has values:
No clean action HourlyEarning variable


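`SingleJob` is binary (1 = has more than one job, 2 = does not) with edited universe `LaborForce/PEMLR = 1, 2`.

`NumJobs` is the number of jobs held, with edited universe `SingleJob/PEMJOT = 1`.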

In [197]:
# 14. SingleJob
print('============================\nBefore cleaning, SingleJob variable has values:')
print(raw_df['SingleJob'].value_counts(dropna=False).sort_index())

print('============================\nWhen LaborForce = 1,2, SingleJob variable has values:')
print(raw_df.loc[raw_df['LaborForce'].isin([1, 2]), 'SingleJob'].value_counts(dropna=False).sort_index())
print('otherwise, SingleJob variable has values:', raw_df[raw_df['LaborForce'].isin([3, 4, 5, 6, 7])]['SingleJob'].value_counts())

print('============================\nAfter cleaning, SingleJob variable has values:')
# replace 2 (no) with 0
raw_df['SingleJob'] = raw_df['SingleJob'].replace(2, 0)
print(raw_df['SingleJob'].value_counts(dropna=False).sort_index())

Before cleaning, SingleJob variable has values:
-1.0    453048
 1.0     29072
 2.0    547725
Name: SingleJob, dtype: int64
When LaborForce = 1,2, SingleJob variable has values:
1.0     29072
2.0    547725
Name: SingleJob, dtype: int64
otherwise, SingleJob variable has values: -1.0    453048
Name: SingleJob, dtype: int64
After cleaning, SingleJob variable has values:
-1.0    453048
 0.0    547725
 1.0     29072
Name: SingleJob, dtype: int64


In [198]:
# 15. NumJobs
print('============================\nBefore cleaning, NumJobs variable has values:')
print(raw_df['NumJobs'].value_counts(dropna=False).sort_index())
print('NumJobs variable has {} positive values'.format(raw_df.loc[raw_df['NumJobs'] >= 0, 'NumJobs'].shape[0]))

print('============================\nWhen SingleJob = 1, NumJobs variable has values:')
print(raw_df.loc[raw_df['SingleJob'] == 1, 'NumJobs'].value_counts(dropna=False).sort_index())

print('============================\nWhen SingleJob = 0, NumJobs variable has values:')
print(raw_df.loc[raw_df['SingleJob'] == 0, 'NumJobs'].value_counts(dropna=False).sort_index())

print('============================\nAfter cleaning, NumJobs variable has values:')
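# SingleJob = 1 means "more than one job", so those rows already have NumJobs in 2-4 (see output above);
# the next line therefore matches no rows and is kept only as a safeguard
raw_df.loc[(raw_df['SingleJob'] == 1) & (raw_df['NumJobs'] == -1), 'NumJobs'] = 1
# replace the remaining -1 (blank) with 0; NOTE: this also codes employed single-job holders (SingleJob = 0) as 0
raw_df.loc[raw_df['NumJobs'] == -1, 'NumJobs'] = 0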
print(raw_df['NumJobs'].value_counts(dropna=False).sort_index())


Before cleaning, NumJobs variable has values:
-1.0    1000773
 2.0      26555
 3.0       2117
 4.0        400
Name: NumJobs, dtype: int64
NumJobs variable has 29072 positive values
When SingleJob = 1, NumJobs variable has values:
2.0    26555
3.0     2117
4.0      400
Name: NumJobs, dtype: int64
When SingleJob = 0, NumJobs variable has values:
-1.0    547725
Name: NumJobs, dtype: int64
After cleaning, NumJobs variable has values:
0.0    1000773
2.0      26555
3.0       2117
4.0        400
Name: NumJobs, dtype: int64
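After this cleaning, `NumJobs = 0` covers both people who are not employed and employed people with exactly one job. If a true job count is wanted, one possible alternative is sketched below on hypothetical toy rows. The column `NumJobsAlt` and the coding choice (1 for single-job holders, 0 for everyone outside the `SingleJob` universe) are assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows using the cleaned SingleJob coding:
# -1 = outside universe (not employed), 0 = one job, 1 = more than one job.
demo = pd.DataFrame({
    'SingleJob': [-1.0, 0.0, 1.0, 1.0],
    'NumJobs': [-1.0, -1.0, 2.0, 4.0],
})

conditions = [
    demo['SingleJob'] == 0,    # employed with exactly one job
    demo['SingleJob'] == -1,   # not employed
]
choices = [1, 0]

# Multiple-job holders keep their reported NumJobs (2, 3, or 4 = four or more).
demo['NumJobsAlt'] = np.select(conditions, choices, default=demo['NumJobs'])

print(demo)
```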


`UsualHours`: -4 means the usual working hours per week vary. EDITED UNIVERSE: PEMLR = 1 OR 2

`ActualHours`: EDITED UNIVERSE: PEMLR = 1

In [199]:
# 16. UsualHours
print('============================\nBefore cleaning, UsualHours variable has values:')
print(raw_df['UsualHours'].describe().apply(lambda x: format(x, 'f')))
print('with {} missing values'.format(raw_df['UsualHours'].isnull().sum()))
print('also with {} -1 values and {} -4 values'.format(raw_df.loc[raw_df['UsualHours'] == -1, 'UsualHours'].shape[0],
                                                                      raw_df.loc[raw_df['UsualHours'] == -4, 'UsualHours'].shape[0]))

print('============================\nWhen LaborForce = 1,2, UsualHours variable has values:')
print(raw_df.loc[raw_df['LaborForce'].isin([1, 2]), 'UsualHours'].describe().apply(lambda x: format(x, 'f')))
# TODO: solve -4

print('============================\nAfter cleaning, UsualHours variable has values:')
print('No clean action UsualHours variable')

Before cleaning, UsualHours variable has values:
count    1029845.000000
mean          20.025610
std           21.956629
min           -4.000000
25%           -1.000000
50%           15.000000
75%           40.000000
max          179.000000
Name: UsualHours, dtype: object
with 0 missing values
also with 453048 -1 values and 39576 -4 values
When LaborForce = 1,2, UsualHours variable has values:
count    576797.000000
mean         36.540277
std          15.517432
min          -4.000000
25%          36.000000
50%          40.000000
75%          40.000000
max         179.000000
Name: UsualHours, dtype: object
After cleaning, UsualHours variable has values:
No clean action UsualHours variable
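One possible way to resolve the `-4` TODO, sketched on hypothetical toy values: keep an indicator for "hours vary" and turn both sentinels into NaN so they do not distort summary statistics. The names `hours_varies` and `hours_clean` are illustrative, and whether to impute the varying-hours cases instead is left open.

```python
import pandas as pd

# Hypothetical toy values: -4 = hours vary, -1 = outside the edited universe.
hours = pd.Series([40.0, -4.0, -1.0, 20.0], name='UsualHours')

# Flag respondents whose usual hours vary.
hours_varies = (hours == -4).astype(int)

# Keep only non-negative hours; both sentinels become NaN.
hours_clean = hours.where(hours >= 0)

# Summary statistics now skip the sentinels.
mean_hours = hours_clean.mean()
print(mean_hours)
```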


In [200]:
# 17. ActualHours
print('============================\nBefore cleaning, ActualHours variable has values:')
print(raw_df['ActualHours'].describe().apply(lambda x: format(x, 'f')))
print('with {} missing values'.format(raw_df['ActualHours'].isnull().sum()))
print('also with {} -1 values'.format(raw_df.loc[raw_df['ActualHours'] == -1, 'ActualHours'].shape[0]))

print('============================\nWhen LaborForce != 1:')
print('ActualHours variable has {} missing values'.format(raw_df.loc[raw_df['LaborForce'] != 1, 'ActualHours'].isnull().sum()))
print('ActualHours variable has {} -1 values'.format(raw_df.loc[(raw_df['LaborForce'] != 1) & (raw_df['ActualHours'] == -1), 
                                                                'ActualHours'].shape[0]))

print('============================\nWhen LaborForce = 1, ActualHours variable has values:')
print(raw_df.loc[raw_df['LaborForce'] == 1, 'ActualHours'].describe().apply(lambda x: format(x, 'f')))

print('============================\nAfter cleaning, ActualHours variable has values:')
print('No clean action ActualHours variable')


Before cleaning, ActualHours variable has values:
count    1029845.000000
mean          20.022292
std           21.822384
min           -1.000000
25%           -1.000000
50%           16.000000
75%           40.000000
max          192.000000
Name: ActualHours, dtype: object
with 0 missing values
also with 478874 -1 values
When LaborForce != 1:
ActualHours variable has 0 missing values
ActualHours variable has 478874 -1 values
When LaborForce = 1, ActualHours variable has values:
count    550971.000000
mean         38.293723
std          13.121156
min           1.000000
25%          35.000000
50%          40.000000
75%          40.000000
max         192.000000
Name: ActualHours, dtype: object
After cleaning, ActualHours variable has values:
No clean action ActualHours variable


`Union`: EDITED UNIVERSE: (PEIO1COW = 1-5 AND PEMLR = 1-2 AND HRMIS = 4, 8).

In [201]:
# 18. Union
print('============================\nBefore cleaning, Union variable has values:')
print(raw_df['Union'].value_counts(dropna=False).sort_index())

print('============================\nWhen edited universe, Union variable has values:')
print(raw_df[raw_df['LaborForce'].isin([1, 2]) & 
             raw_df['WorkerClass'].isin([1,2,3,4,5]) & 
             raw_df['MonthInSample'].isin([4,8])]['Union'].value_counts())
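# The edited universe is correct. HRMIS is limited to 4 or 8 because the union and earnings
# questions are asked only in the outgoing rotation months (months in sample 4 and 8).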

print('============================\nAfter cleaning, Union variable has values:')
# replace 2 (no) with 0
raw_df['Union'] = raw_df['Union'].replace(2, 0)
print(raw_df['Union'].value_counts(dropna=False).sort_index())

Before cleaning, Union variable has values:
-1.0    896174
 1.0     14122
 2.0    119549
Name: Union, dtype: int64
When edited universe, Union variable has values:
2.0    119549
1.0     14122
Name: Union, dtype: int64
After cleaning, Union variable has values:
-1.0    896174
 0.0    119549
 1.0     14122
Name: Union, dtype: int64
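A minimal sketch of how a weighted union membership rate could be computed once `Union` is coded 0/1. It restricts the calculation to rows inside the edited universe and uses the outgoing rotation weight `pworwgt`; choosing that weight is an assumption based on the universe being months in sample 4 and 8, and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: the first is outside the Union universe.
demo = pd.DataFrame({
    'Union': [-1.0, 1.0, 0.0, 0.0],
    'pworwgt': [0.0, 111229844.0, 69894376.0, 96708345.0],
})

# Keep only rows inside the edited universe (Union answered yes or no).
in_universe = demo['Union'].isin([0, 1])
sub = demo[in_universe]

# Weighted share of union members; any constant scaling of the weights cancels out.
union_rate = np.average(sub['Union'], weights=sub['pworwgt'])
print(round(union_rate, 4))
```

Leaving the `-1` rows in would bias the rate, which is why the filter comes before the average.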


`Occupation`: EDITED UNIVERSE: PRMJOCC = 1-11

`WorkerClass`: serves mainly as an edited universe for `Union`. We do not need to care about `WorkerClass` because we have `Occupation` already.

In [202]:
# 19. Occupation
print('============================\nBefore cleaning, Occupation variable has values:')
print(raw_df['Occupation'].value_counts(dropna=False).sort_index())

# One issue left: there is no PRMJOCC in the data dictionary, but there are PRMJOCC1 and PRMJOCC2.
# The edited universes for the different job categories are linked to each other.

Before cleaning, Occupation variable has values:
-1.0    401092
 1.0    265170
 2.0    100996
 3.0    126864
 4.0      5188
 5.0     52772
 6.0     77683
 7.0        80
Name: Occupation, dtype: int64


### **2.4. Export Yearly Data**

In [203]:
# save file to csv in PROCESSED_DATA_PATH
raw_df.to_csv(GLOBAL_PATH + PROCESSED_DATA_PATH + str(CPS_YEAR) + '.csv', index=False)
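The monthly export described in the introduction could follow the same pattern. The sketch below writes one CSV per month from a hypothetical toy frame into a temporary directory; the `YYYY_MM.csv` file naming and the use of `tempfile` as a stand-in for `GLOBAL_PATH + PROCESSED_DATA_PATH` are assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data covering two months of one year.
demo = pd.DataFrame({
    'Year': [2020, 2020, 2020],
    'Month': [2, 2, 3],
    'Age': [40.0, 21.0, 51.0],
})

# Stand-in for the processed data folder.
out_dir = Path(tempfile.mkdtemp())

written = []
for (year, month), month_df in demo.groupby(['Year', 'Month']):
    # Zero-pad the month so files sort chronologically.
    path = out_dir / f'{int(year)}_{int(month):02d}.csv'
    month_df.to_csv(path, index=False)
    written.append(path.name)

print(written)
```

Building paths with `pathlib` also avoids missing or doubled separators when folder names are concatenated as strings.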