<h1><center> Basic Monthly CPS Data </center></h1>

data dictionary [here](https://www2.census.gov/programs-surveys/cps/datasets/2022/basic/2022_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt)

data instruction [here](https://www2.census.gov/programs-surveys/cps/methodology/PublicUseDocumentation_final.pdf)

**Part 1: Select Observations and Variables**

The main purpose of the following code is to select the samples and variables that might be used later. The data cleaning process is described in the following steps:

1. Import all datasets in the raw data folder and concatenate them into one dataframe
2. Drop observations/rows based on degree of survey completeness and the definition of the civilian noninstitutional population
3. Select variables of interest
    - link variables: keep track of individual in different months
    - geographic and demographic variables
    - labor force variables: check edited universe
    - weight variables: used for estimation and inference later
4. Rename variables, provide dictionary of variable labels and values
5. Split the dataframe into different dataframes based on the month of the survey
6. Export the dataframes (the concat dataframe + dataframe for each month) into the data folder
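
Steps 5 and 6 can be sketched as follows. This is a minimal illustration on a made-up toy dataframe, not the notebook's actual export code; the output file name in the comment is hypothetical.

```python
import pandas as pd

# Toy stand-in for the concatenated CPS dataframe (values are made up).
toy_df = pd.DataFrame({
    'HRYEAR4': [2021, 2021, 2021, 2021],
    'HRMONTH': [1, 1, 2, 3],
    'pemlr':   [1, 5, 1, 2],
})

# One dataframe per survey month, keyed by the month number.
monthly_dfs = {month: group.reset_index(drop=True)
               for month, group in toy_df.groupby('HRMONTH')}

for month, month_df in monthly_dfs.items():
    print(month, month_df.shape)
    # Hypothetical file name; adjust to the project's naming scheme.
    # month_df.to_csv(f'cleaned1_month{month:02d}.csv', index=False)
```

Grouping on `HRMONTH` keeps the split in a single pass over the data, which fits the goal of cleaning once and splitting afterwards.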

**Part 2: Clean Variables** 

After selecting the sample and variables, we can investigate the values of each variable. This file deals with missing values, categorical variables, and numeric variables. The data cleaning process is described in the following steps:

1. Geographic and demographic variables
    - Is there NaN?
    - How to change values of categorical variables?
    - Convert numerical values to categorical values? (e.g. age --> age group)
2. Labor Force variables
    - Is there NaN?
    - How to work with negative values? (check the edited universe in the data instruction and in the variable labels in clean1)
3. We have already checked the linking variables, and we should not modify the weights.
4. Export the whole-year dataframe to a csv file (replacing the one from clean1_variables.ipynb), and export each month's dataframe to a csv file (replacing the ones from clean1_variables.ipynb)

This file can be used to automate the data cleaning for all 12 months in a single year.
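
The questions in the Part 2 list can be illustrated on a toy dataframe. This is only a sketch: the age bins are hypothetical, and treating -1 as "not in universe" is an assumption that should be checked against the edited universe of each variable in the data dictionary.

```python
import numpy as np
import pandas as pd

# Toy data using the renamed columns (values are made up).
toy_df = pd.DataFrame({
    'Age': [16, 24, 25, 54, 55, 85],
    'Gender': [1, 2, 2, 1, 2, 1],
    'UsualHours': [40, -1, 35, -4, 20, -1],
})

# Assumption: -1 means "not in universe" here, so treat it as missing.
# -4 ("hours vary", per the variable label) is kept so it can be handled separately.
toy_df['UsualHours'] = toy_df['UsualHours'].replace(-1, np.nan)

# Translate numeric codes into labels with a dictionary.
toy_df['Gender'] = toy_df['Gender'].map({1: 'male', 2: 'female'})

# Hypothetical age bins; right=False makes each bin closed on the left, e.g. [16, 25).
age_bins = [16, 25, 55, np.inf]
age_labels = ['16-24', '25-54', '55+']
toy_df['AgeGroup'] = pd.cut(toy_df['Age'], bins=age_bins, labels=age_labels, right=False)
print(toy_df)
```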

In [20]:
import os
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# !! Your current working directory should be ./Fondren/Code,
#     which is the location of the code files (current files)

GLOBAL_PATH = '../' # this leads to directory ./Fondren, which is the root directory of the project
DATA_PATH = '../Data/' # this leads to directory ./Fondren/Data
OUTPUT_PATH = '../Output/' # this leads to directory ./Fondren/Output

# later on modify only the year to get the data for different years
CPS_YEAR = 'CPS2021'

## **Part 1: Select Observations and Variables**

### **1. Read All Files into A Single Dataframe**

In [21]:
raw_dir = DATA_PATH + 'Raw' + CPS_YEAR
path, dirs, raw_files = next(os.walk(raw_dir))
# skip the macOS metadata file if it is present
raw_files = [file for file in raw_files if file != '.DS_Store']
raw_file_count = len(raw_files)
# dataframe_list is a list of dataframes, not a list of file names
dataframe_list = [pd.read_csv(os.path.join(raw_dir, file)) for file in raw_files]

for i in range(raw_file_count):
    print(raw_files[i], ':', dataframe_list[i].shape)

jul21pub.csv : (129092, 388)
mar21pub.csv : (131357, 388)
sep21pub.csv : (127872, 388)
apr21pub.csv : (133449, 388)
may21pub.csv : (132077, 388)
feb21pub.csv : (132681, 388)
oct21pub.csv : (129103, 388)
jun21pub.csv : (129272, 388)
dec21pub.csv : (127489, 388)
aug21pub.csv : (129408, 388)
nov21pub.csv : (127375, 388)
jan21pub.csv : (133553, 388)


In [22]:
# Concatenate all monthly dataframes into one; after cleaning, split it back by month and year.
# This avoids looping over each monthly dataframe during cleaning.
raw_df = pd.concat(dataframe_list, axis=0, ignore_index=True)
# save this file
raw_df.to_csv(DATA_PATH + 'Clean' + CPS_YEAR + '/cleaned1' + CPS_YEAR + '.csv', index=False)

print('Shape of concatenated dataset:', raw_df.shape)
raw_df.head(10)

Shape of concatenated dataset: (1562728, 397)


Unnamed: 0,hrhhid2,HUFINAL,OCCURNUM,HUINTTYP,HURESPLI,HUPRSCNT,HUTYPEA,HUTYPB,HUTYPC,HUBUS,HUBUSL1,HUBUSL2,HUBUSL3,HUBUSL4,HRMIS,HRMONTH,HRYEAR4,HRLONGLK,qstnum,gereg,gestfips,gediv,hehousut,hxhousut,hephoneo,hxphoneo,hetelavl,hxtelavl,hetelhhd,hxtelhhd,hrhtype,hrintsta,hrnumhou,hefaminc,hxfaminc,hwhhwgt,hwhhwtln,PULINENO,PUCHINHH,PUWK,PUBUS1,PUDIS,PULAY,PUHROFF1,PUHROFF2,PUHROT1,PUHROT2,PUABSOT,PUBUSCK1,PUBUSCK2,PUBUSCK3,PUBUSCK4,PURETOT,PUHRCK1,PUHRCK2,PUHRCK3,PUHRCK4,PUHRCK5,PUHRCK6,PUHRCK7,PUHRCK12,PULAYDT,PULAY6M,PULAYAVR,PULK,PULKAVR,PULAYCK1,PULAYCK2,PULAYCK3,PUDWCK1,PUDWCK2,PUDWCK3,PUDWCK4,PUDWCK5,PUJHCK1,PUJHCK2,PUJHDP1O,PUJHCK3,PUJHCK4,PUJHCK5,PULKM2,PULKM3,PULKM4,PULKM5,PULKM6,PULKDK1,PULKDK2,PULKDK3,PULKDK4,PULKDK5,PULKDK6,PULKPS1,PULKPS2,PULKPS3,PULKPS4,PULKPS5,PULKPS6,PUIOCK1,PUIOCK2,PUIOCK3,PUIODP1,PUIODP2,PUIODP3,PUIO1MFG,PUIO2MFG,ptern2,pternh1c,PUNLFCK1,PUNLFCK2,PUSLFPRX,PUDIS1,PUDIS2,PUBUS2OT,perrp,pxrrp,pxage,peafnow,pxafnow,pesex,pxsex,pemaritl,pxmaritl,pxrace1,pehspnon,pxhspnon,peeduca,pxeduca,peafever,pxafever,peafwhn1,pxafwhn1,peafwhn2,peafwhn3,peafwhn4,pespouse,pxspouse,penatvty,pxnatvty,pemntvty,pxmntvty,pefntvty,pxfntvty,pxinusyr,pedipged,pxdipged,pehgcomp,pxhgcomp,pecyc,pxcyc,pepar1,pxpar1,pepar2,pxpar2,pepar1typ,pxpar1typ,pepar2typ,pxpar2typ,prdasian,prmarsta,ptdtrace,prdthsp,prpertyp,prfamnum,prfamtyp,prfamrel,prnmchld,prchld,prcitflg,prcitshp,prinuyer,prtage,prtfage,pecohab,pxcohab,peabspdo,peabsrsn,pedwavl,pedwavr,pedwlko,pedwlkwk,pedwrsn,pedwwk,pedwwnto,pedw4wk,pehractt,pehract1,pehract2,pehravl,pehrftpt,pehrrsn1,pehrrsn2,pehrrsn3,pehruslt,pehrusl1,pehrusl2,pehrwant,pejhrsn,pejhwant,pejhwko,pelayavl,pelaydur,pelayfto,pelaylk,pelkavl,pelkdur,pelkfto,pelkll1o,pelkll2o,pelklwo,pelkm1,pemjnum,pemjot,pemlr,penlfact,penlfjh,penlfret,peret1,pxabspdo,pxabsrsn,pxdwavl,pxdwavr,pxdwlko,pxdwlkwk,pxdwrsn,pxdwwk,pxdwwnto,pxdw4wk,pxhractt,pxhract1,pxhract2,pxhravl,pxhrftpt,pxhrrsn1,pxhrrsn2,pxhrrsn3,pxhruslt,pxhrusl1,pxhrusl2,pxhrwant,pxjhrsn,pxjhw
ant,pxjhwko,pxlayavl,pxlaydur,pxlayfto,pxlaylk,pxlkavl,pxlkdur,pxlkfto,pxlkll1o,pxlkll2o,pxlklwo,pxlkm1,pxmjnum,pxmjot,pxmlr,pxnlfact,pxnlfjh,pxnlfret,pxret1,prabsrea,prcivlf,prdisc,premphrs,prempnot,prexplf,prftlf,prhrusl,prjobsea,prpthrs,prptrea,prunedur,pruntype,prwksch,prwkstat,prwntjob,peio1icd,peio2icd,ptio1ocd,ptio2ocd,peio1cow,peio2cow,pepdemp1,pepdemp2,pxio1icd,pxio2icd,pxio1ocd,pxio2ocd,pxio1cow,pxio2cow,pxpdemp1,pxpdemp2,pxnmemp1,pxnmemp2,prioelg,premp,prcow1,prcow2,prnagws,prnagpws,prdtcow1,prdtcow2,prmjind1,prmjind2,primind1,primind2,prmjocc1,prmjocc2,prdtind1,prdtind2,prdtocc1,prdtocc2,pragna,prsjmj,prcowpg,prmjocgr,peernper,peernhry,peernuot,peernwkp,peernrt,pternh2,pternh1o,peernhro,ptern,peernlab,peerncov,pxernper,pxernhry,pxernuot,pxernwkp,pxernrt,pxernh2,pxernh1o,pxernhro,pxern,pxernlab,pxerncov,prerelg,pternwa,pternhly,prwernal,prhernal,peschlvl,peschenr,peschft,pxschlvl,pxschenr,pxschft,prnlfsch,pedisear,pediseye,pedisrem,pedisphy,pedisdrs,pedisout,pxdisear,pxdiseye,pxdisrem,pxdisphy,pxdisdrs,pxdisout,prdisflg,pecert1,pecert2,pecert3,pxcert1,pxcert2,pxcert3,pwsswgt,pwlgwgt,pwvetwgt,pworwgt,pwfmwgt,pwcmpwgt,pthr,ptwk,ptot,ptnmemp1,ptnmemp2,hrhhid,gtcbsa,gtco,gtcbsast,gtcbsasz,gtcsa,gtmetsta,gtindvpc,PUERN2,PUERNH1C,peio1ocd,peio2ocd,peernh2,peernh1o,peern,prernwa,prernhly
0,11011,201,2,2,2,0,-1,-1,-1,2,-1,-1,-1,-1,8,7,2021,2,1,3,1,6,1,0,1,0,-1,1,1,0,7,1,1,12,23,19202002.0,2.0,2.0,9.0,3.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,41.0,0.0,0.0,2.0,0.0,2.0,0.0,5.0,43.0,0.0,2.0,0.0,43.0,0.0,2.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,57.0,-1.0,57.0,-1.0,57.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,-1,1.0,-1,1.0,-1,1.0,-1,1.0,-1.0,6.0,1.0,-1.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,61.0,0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,5.0,-1.0,2.0,-1.0,2.0,-1.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,1.0,0.0,1.0,0.0,-1.0,2.0,-1.0,0.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,1.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,-1.0,-1.0,20.0,0.0,0.0,19202002.0,28699695.0,18945420.0,76121915.0,19202002.0,18976311.0,0,0,0,-1.0,-1.0,505019880110916,19300,3,4,2,380,1,0,,,,,,,,,
1,11011,201,1,2,1,0,-1,-1,-1,2,-1,-1,-1,-1,6,7,2021,2,2,3,1,6,1,0,1,0,1,0,1,0,7,1,1,9,23,20812987.0,1.0,1.0,9.0,1.0,-1.0,-1.0,-1.0,2.0,-1.0,1.0,10.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,5.0,5.0,-1.0,2.0,3.0,5.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,4.0,3.0,3.0,1.0,2.0,1.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,41.0,0.0,0.0,2.0,0.0,2.0,0.0,4.0,0.0,0.0,2.0,0.0,43.0,0.0,2.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,57.0,-1.0,57.0,-1.0,57.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,-1,1.0,-1,1.0,-1,1.0,-1,1.0,-1.0,5.0,1.0,-1.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,53.0,0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,60.0,60.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,50.0,50.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,22.0,1.0,1.0,1.0,6.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,2.0,-1.0,8191.0,-1.0,350.0,-1.0,5.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,1.0,4.0,-1.0,1.0,1.0,6.0,-1.0,10.0,-1.0,16.0,-1.0,1.0,-1.0,41.0,-1.0,1.0,-1.0,2.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,-1.0,1.0,0.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,20.0,20.0,20.0,20812987.0,31107507.0,21153559.0,0.0,20812987.0,21204165.0,0,0,0,-1.0,-1.0,610100690316751,19300,3,4,2,380,1,0,,,,,,,,,
2,13011,226,1,2,-1,0,-1,1,-1,-1,-1,-1,-1,-1,4,7,2021,2,3,3,1,6,1,0,1,0,-1,1,-1,1,0,3,0,-1,1,,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,-1.0,-1,-1.0,-1,-1.0,-1,-1.0,,,,,,,,,,,,,,,0,-1.0,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,,,,,,,0,0,0,-1.0,-1.0,71212006900967,19300,3,4,2,380,1,0,,,,,,,,,
3,13011,226,1,-1,-1,0,-1,1,-1,-1,-1,-1,-1,-1,4,7,2021,2,4,3,1,6,5,0,0,0,-1,1,-1,1,0,3,0,-1,1,,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,-1.0,-1,-1.0,-1,-1.0,-1,-1.0,,,,,,,,,,,,,,,0,-1.0,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,,,,,,,0,0,0,-1.0,-1.0,201967201670009,19300,3,4,2,380,1,0,,,,,,,,,
4,13011,201,1,2,1,0,-1,-1,-1,2,-1,-1,-1,-1,2,7,2021,2,5,3,1,6,1,0,1,0,-1,1,1,0,7,1,1,8,0,18287603.0,1.0,1.0,9.0,3.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,1.0,-1.0,-1.0,-1.0,41.0,0.0,0.0,2.0,0.0,2.0,0.0,2.0,0.0,0.0,2.0,0.0,39.0,0.0,2.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,57.0,-1.0,57.0,-1.0,57.0,-1.0,-1.0,1.0,0.0,-1.0,1.0,-1.0,1.0,-1,1.0,-1,1.0,-1,1.0,-1,1.0,-1.0,3.0,1.0,-1.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,77.0,0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,5.0,-1.0,-1.0,-1.0,2.0,-1.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,1.0,1.0,1.0,0.0,-1.0,2.0,-1.0,0.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,1.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,-1.0,-1.0,20.0,0.0,0.0,18287603.0,27333017.0,18351717.0,0.0,18287603.0,18396932.0,0,0,0,-1.0,-1.0,180314039113,13820,0,2,5,0,1,0,,,,,,,,,
5,11011,226,1,-1,-1,0,-1,1,-1,-1,-1,-1,-1,-1,8,7,2021,2,6,3,1,6,1,0,0,0,-1,1,-1,1,0,3,0,-1,1,,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,-1.0,-1,-1.0,-1,-1.0,-1,-1.0,,,,,,,,,,,,,,,0,-1.0,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,,,,,,,0,0,0,-1.0,-1.0,901005347110188,0,0,3,0,0,2,0,,,,,,,,,
6,11011,201,1,2,1,0,-1,-1,-1,2,-1,-1,-1,-1,8,7,2021,2,7,3,1,6,1,0,1,0,-1,1,1,0,7,1,1,6,23,18424472.0,1.0,1.0,9.0,3.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,41.0,0.0,0.0,2.0,0.0,2.0,0.0,3.0,0.0,0.0,2.0,0.0,36.0,0.0,2.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,57.0,-1.0,57.0,-1.0,57.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,-1,1.0,-1,1.0,-1,1.0,-1,1.0,-1.0,4.0,1.0,-1.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,85.0,1,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,5.0,-1.0,2.0,-1.0,2.0,-1.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,1.0,0.0,1.0,0.0,-1.0,2.0,-1.0,0.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,1.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,-1.0,-1.0,20.0,0.0,0.0,18424472.0,27537584.0,18489066.0,74350105.0,18424472.0,18534619.0,0,0,0,-1.0,-1.0,310588190701104,0,0,3,0,0,2,0,,,,,,,,,
7,11011,225,1,-1,-1,0,-1,2,-1,-1,-1,-1,-1,-1,7,7,2021,2,8,3,1,6,1,0,0,0,-1,1,-1,1,0,3,0,-1,1,,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,-1.0,-1,-1.0,-1,-1.0,-1,-1.0,,,,,,,,,,,,,,,0,-1.0,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,,,,,,,0,0,0,-1.0,-1.0,102009027110311,0,0,3,0,0,2,0,,,,,,,,,
8,11011,201,1,2,1,0,-1,-1,-1,2,-1,-1,-1,-1,7,7,2021,2,9,3,1,6,1,0,1,0,-1,1,1,0,6,1,1,10,23,17561071.0,1.0,1.0,9.0,1.0,-1.0,-1.0,-1.0,2.0,-1.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,5.0,5.0,-1.0,2.0,3.0,5.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,4.0,3.0,3.0,1.0,2.0,1.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,41.0,0.0,0.0,2.0,0.0,1.0,0.0,4.0,0.0,0.0,2.0,0.0,39.0,0.0,2.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,57.0,-1.0,57.0,-1.0,57.0,-1.0,-1.0,1.0,0.0,-1.0,1.0,-1.0,1.0,-1,1.0,-1,1.0,-1,1.0,-1,1.0,-1.0,5.0,2.0,-1.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,73.0,0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,40.0,40.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,40.0,40.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,18.0,1.0,1.0,1.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,2.0,-1.0,6290.0,-1.0,3945.0,-1.0,4.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,1.0,4.0,-1.0,1.0,1.0,6.0,-1.0,6.0,-1.0,8.0,-1.0,3.0,-1.0,23.0,-1.0,12.0,-1.0,2.0,1.0,1.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,-1.0,-1.0,20.0,0.0,0.0,17561071.0,26267955.0,17099389.0,0.0,17561071.0,17124931.0,0,0,0,-1.0,-1.0,2790110012113,0,0,3,0,0,2,0,,,,,,,,,
9,13011,201,1,2,1,0,-1,-1,-1,2,-1,-1,-1,-1,4,7,2021,2,10,3,1,6,1,0,1,0,-3,3,1,0,1,1,4,12,23,21582300.0,2.0,1.0,9.0,1.0,-1.0,-1.0,-1.0,2.0,-1.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,5.0,5.0,-1.0,2.0,3.0,5.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,4.0,3.0,3.0,1.0,2.0,1.0,2.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,40.0,0.0,0.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,2.0,0.0,39.0,0.0,2.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,2.0,0.0,57.0,-1.0,57.0,-1.0,57.0,-1.0,-1.0,1.0,0.0,-1.0,1.0,-1.0,1.0,-1,1.0,-1,1.0,-1,1.0,-1,1.0,-1.0,1.0,1.0,-1.0,2.0,1.0,1.0,1.0,1.0,4.0,0.0,1.0,0.0,40.0,0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,40.0,40.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,40.0,40.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,18.0,1.0,1.0,1.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,2.0,-1.0,4470.0,-1.0,4850.0,-1.0,4.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,1.0,-1.0,1.0,-1.0,1.0,1.0,4.0,-1.0,1.0,1.0,6.0,-1.0,5.0,-1.0,6.0,-1.0,4.0,-1.0,21.0,-1.0,16.0,-1.0,2.0,1.0,1.0,3.0,2.0,2.0,1.0,-1.0,2.0,-1.0,-1.0,-1.0,-1.0,2.0,2.0,13.0,42.0,41.0,1.0,11.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,144200.0,-1.0,1.0,-1.0,-1.0,2.0,-1.0,1.0,0.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,-1.0,-1.0,20.0,0.0,0.0,29304632.0,43834044.0,28807565.0,121191507.0,21582300.0,29215219.0,0,0,0,-1.0,-1.0,166280010011857,0,0,3,0,0,2,0,,,,,,,,,
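
Each monthly file has 388 columns, but the concatenated dataframe has 397. One plausible explanation is that column names are not identical across the monthly files, since `pd.concat` keeps the union of all columns and fills the gaps with NaN. The sketch below reproduces that behavior on two made-up frames and lists the columns that are not shared by every file; the toy column choice is illustrative only.

```python
import pandas as pd

# Toy monthly files: the second one names a column differently (hypothetical example).
jan = pd.DataFrame({'hrhhid': [1, 2], 'ptern': [500, 600]})
feb = pd.DataFrame({'hrhhid': [3, 4], 'peern': [700, 800]})
frames = {'jan': jan, 'feb': feb}

combined = pd.concat(list(frames.values()), axis=0, ignore_index=True)
print(combined.shape)  # more columns than either input

# Columns that are not shared by every file
all_cols = set().union(*(df.columns for df in frames.values()))
common_cols = set.intersection(*(set(df.columns) for df in frames.values()))
print(sorted(all_cols - common_cols))
```

Running the same check on `dataframe_list` before concatenating would show which names differ, so they can be harmonized first if needed.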


### **2. Remove Rows/obs**

#### **2.1 Remove Non-Civilian Noninstitutional Population**

Definition of Civilian Noninstitutional Population: Persons 16 years of age and older residing in the 50 states and the District of Columbia, who are not inmates of institutions (e.g., penal and mental facilities, homes for the aged), and who are not on active duty in the Armed Forces [reference](https://webapps.dol.gov/dolfaq/go-dol-faq.asp?faqid=111&faqsub=Employment+%2F+Unemployment&faqtop=Statistics&topicid=6).

Two variables are used here: `prtage` and `prpertyp`. We remove observations with `prtage` less than 16 or with `prpertyp` equal to 1 (child household member) or 3 (Armed Forces member). For `prpertyp` equal to 1 or 3, the labor force status `pemlr` is recorded as -1.

In [23]:
# We can see that pemlr is -1 and tells nothing about the employment status when prpertyp is 1 and 3
a = raw_df[raw_df['pemlr']==-1]['prpertyp'].value_counts()
b = raw_df[raw_df['prtage'] < 16].shape[0]
c = raw_df[raw_df['pemlr']==-1].shape[0]
print('Value counts for person type when pemlr is -1: \n', a) # this makes sense, since pemlr could not be -1 when prpertyp is 2
print('Counts of person who is less than 16 years old:', b)
print('Counts of pemlr is -1:', c)
# In some cases age is less than 16 but pemlr is not -1 (b > c), so both prtage and prpertyp are needed to remove rows

Value counts for person type when pemlr is -1: 
 1.0    225076
3.0      4408
Name: prpertyp, dtype: int64
Counts of person who is less than 16 years old: 241276
Counts of pemlr is -1: 229484


In [24]:
raw_df.drop(raw_df[(raw_df['prtage'] < 16) | (raw_df['prpertyp'].isin([1,3]))].index, inplace = True)

In [25]:
# check outcome after removing rows, all should be 0
a = raw_df[raw_df['pemlr']==-1]['prpertyp'].value_counts()
b = raw_df[raw_df['prtage'] < 16].shape[0]
c = raw_df[raw_df['pemlr']==-1].shape[0]
print('Value counts for person type when pemlr is -1: ', a)
print('Counts of person who is less than 16 years old:', b)
print('Counts of pemlr is -1:', c)
print('Shape of dataset:', raw_df.shape)

Value counts for person type when pemlr is -1:  Series([], Name: prpertyp, dtype: int64)
Counts of person who is less than 16 years old: 0
Counts of pemlr is -1: 0
Shape of dataset: (1317044, 397)


#### **2.2 Remove Rows with Missing Values**

The variable `HUFINAL` records the final outcome of the interview (its values indicate whether the survey was completed and, if not, the reason). The function `missing_perc_row` computes the average percentage of missing values (NaN) per row for each value of `HUFINAL`. The function `remove_missing_row` then drops all rows whose `HUFINAL` value has an average missing percentage above a given threshold (50% in the call below).


Running this step can take more than a minute, depending on your computer's performance.

In [26]:
def missing_perc_row(df, col):
    """
    Input: dataframe, column name (e.g. 'HUFINAL')
    Output: dictionary mapping each distinct value of `col` to the average
            percentage of missing cells per row among rows with that value
    """
    category = df[col].unique()
    dct = {}
    for cat in category:
        subset = df[df[col] == cat]  # filter once and reuse
        # average number of missing cells per row in this category
        avg_missing = subset.isnull().sum().sum() / subset.shape[0]
        # express as a percentage of the number of columns
        perc = avg_missing / df.shape[1] * 100
        dct[cat] = perc
    return dct

missing_row_dict = missing_perc_row(raw_df, 'HUFINAL')
print('The raw dataset has the following percentage of missing for each value of HUFINAL:\n', missing_row_dict)


The raw dataset has the following percentage of missing for each value of HUFINAL:
 {201: 2.2676601194457335, 226: 80.10169330249852, 225: 80.19090871964943, 230: 80.35973660640411, 231: 80.35264483627203, 228: 80.35264483627203, 259: 80.35264483627203, 227: 80.10271844024992, 229: 80.35264483627203, 223: 80.35264483627203, 232: 80.35264483627203, 203: 2.279279136107557, 219: 80.10893569062748, 218: 80.1047626912124, 241: 80.35264483627203, 240: 80.35264483627203, 1: 2.2670025188916876, 2: 2.337568024919443, 213: 80.11385131748705, 216: 80.10428525327859, 247: 80.35264483627203, 233: 80.35643581038391, 248: 80.35264483627203, 243: 80.35264483627203, 217: 80.10379590333852, 214: 80.10075566750629, 244: 80.35264483627203, 245: 80.35264483627203, 242: 80.35264483627203, 4: 2.2670025188916876, 224: 80.35264483627203, 204: 4.680452228926249, 5: 2.2670025188916876, 205: 2.2670025188916876, 258: 80.35264483627203, 6: 2.2670025188916876, 202: 80.10075566750629, 20: 80.35264483627203}


In [27]:
# remove the rows given a specified threshold of missing percentage
def remove_missing_row(df, col, threshold):
    """
    Drop every row whose `col` value has an average missing percentage above `threshold`.
    Relies on the global `missing_row_dict` computed in the previous cell.
    """
    drop_keys = [key for key, value in missing_row_dict.items() if value > threshold]
    # drop all flagged categories in one pass
    df.drop(df[df[col].isin(drop_keys)].index, inplace=True)
    return df

# remove rows with missing percentage > threshold
raw_df = remove_missing_row(raw_df, 'HUFINAL', 50)
print('After dropping rows, the dataset has the following percentage of missing for each value of HUFINAL:\n', missing_perc_row(raw_df, 'HUFINAL'))
print('---------------------------------------------')
print('Shape of dataset:', raw_df.shape)

---------------------------------------------
After dropping rows, the dataset has the following percentage of missing for each value of HUFINAL:
 {201: 2.2676601194457335, 203: 2.279279136107557, 1: 2.2670025188916876, 2: 2.337568024919443, 4: 2.2670025188916876, 204: 4.680452228926249, 5: 2.2670025188916876, 205: 2.2670025188916876, 6: 2.2670025188916876}
---------------------------------------------
Shape of dataset: (1034408, 397)
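
Because the loop in `missing_perc_row` filters the dataframe once per `HUFINAL` value, it can be slow on the full dataset. A vectorized alternative is sketched below on a made-up toy dataframe; it is meant to compute the same per-value averages with a single `groupby`, and is offered as an option rather than as the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy data: two HUFINAL codes with different amounts of missing values.
toy_df = pd.DataFrame({
    'HUFINAL': [201, 201, 226, 226],
    'x': [1.0, 2.0, np.nan, np.nan],
    'y': [1.0, np.nan, np.nan, 4.0],
    'z': [1.0, 1.0, np.nan, np.nan],
})

# Share of missing cells in each row, as a percentage of all columns.
row_missing_pct = toy_df.isnull().mean(axis=1) * 100

# Average of that share within each HUFINAL value.
missing_by_code = row_missing_pct.groupby(toy_df['HUFINAL']).mean()
print(missing_by_code.to_dict())

# Keep only rows whose HUFINAL group is at or below the threshold.
threshold = 50
keep_codes = missing_by_code[missing_by_code <= threshold].index
filtered = toy_df[toy_df['HUFINAL'].isin(keep_codes)]
print(filtered.shape)
```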


### **3. Select Variables of Interest**

a. Link Datasets Variables
- In order to link the same individuals across months, three variables must be used: HRHHID, HRHHID2, and PULINENO.

b. Geographic and Demographic Variables

c. Labor Force Variables
- Currently selected variables may be changed in the future.

d. Weights
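
To show why the three link variables matter, here is a minimal sketch of matching individuals across two months with a merge on those keys. The two monthly frames and their values are made up, and the suffix names are arbitrary choices for the illustration.

```python
import pandas as pd

link_keys = ['hrhhid', 'hrhhid2', 'PULINENO']

# Toy person records for two consecutive months (values are made up).
jan = pd.DataFrame({'hrhhid': [10, 10, 20], 'hrhhid2': [1, 1, 1],
                    'PULINENO': [1, 2, 1], 'pemlr': [1, 5, 1]})
feb = pd.DataFrame({'hrhhid': [10, 20, 30], 'hrhhid2': [1, 1, 1],
                    'PULINENO': [1, 1, 1], 'pemlr': [1, 3, 1]})

# Inner merge keeps only people observed in both months.
linked = jan.merge(feb, on=link_keys, how='inner', suffixes=('_jan', '_feb'))
print(linked)
```

A missing value in any of the three keys would prevent a match, which is why rows with missing link variables are dropped later in this part.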

In [28]:
# Original column
original_cols = raw_df.columns.values
original_cols

array(['hrhhid2', 'HUFINAL', 'OCCURNUM', 'HUINTTYP', 'HURESPLI',
       'HUPRSCNT', 'HUTYPEA', 'HUTYPB', 'HUTYPC', 'HUBUS', 'HUBUSL1',
       'HUBUSL2', 'HUBUSL3', 'HUBUSL4', 'HRMIS', 'HRMONTH', 'HRYEAR4',
       'HRLONGLK', 'qstnum', 'gereg', 'gestfips', 'gediv', 'hehousut',
       'hxhousut', 'hephoneo', 'hxphoneo', 'hetelavl', 'hxtelavl',
       'hetelhhd', 'hxtelhhd', 'hrhtype', 'hrintsta', 'hrnumhou',
       'hefaminc', 'hxfaminc', 'hwhhwgt', 'hwhhwtln', 'PULINENO',
       'PUCHINHH', 'PUWK', 'PUBUS1', 'PUDIS', 'PULAY', 'PUHROFF1',
       'PUHROFF2', 'PUHROT1', 'PUHROT2', 'PUABSOT', 'PUBUSCK1',
       'PUBUSCK2', 'PUBUSCK3', 'PUBUSCK4', 'PURETOT', 'PUHRCK1',
       'PUHRCK2', 'PUHRCK3', 'PUHRCK4', 'PUHRCK5', 'PUHRCK6', 'PUHRCK7',
       'PUHRCK12', 'PULAYDT', 'PULAY6M', 'PULAYAVR', 'PULK', 'PULKAVR',
       'PULAYCK1', 'PULAYCK2', 'PULAYCK3', 'PUDWCK1', 'PUDWCK2',
       'PUDWCK3', 'PUDWCK4', 'PUDWCK5', 'PUJHCK1', 'PUJHCK2', 'PUJHDP1O',
       'PUJHCK3', 'PUJHCK4', 'PUJHCK5', 'PUL

In [29]:
Link_VARS_Label = {
    'hrhhid': 'Household ID',
    'hrhhid2': 'Household ID2',
    'PULINENO': 'Line number of person'
}

Link_VARS = ['hrhhid', 'hrhhid2', 'PULINENO']

In [30]:
GeoDemographic_VARS_Label = {
    'HRYEAR4': 'year of interview', 
    'HRMONTH': 'month of interview', 
    'HRMIS': 'month in sample; can be used to link datasets across months',
    'gestfips': 'state fips code',
    'gtco': 'county fips code',
    'gtmetsta': 'metropolitan status', 
    'prtage': 'age, 0-79 in years, 80 for 80-84, 85 for 85+',
    'pemaritl': 'marital status',
    'pesex': 'sex',
    'peeduca': 'highest level of school completed',
    'ptdtrace': 'race',
    'pehspnon': 'hispanic or non-hispanic',
    'prcitshp': 'citizenship status',
    'hefaminc': 'family income'
}

GeoDemographic_VARS = list(GeoDemographic_VARS_Label.keys())

In [31]:
Labor_VARS_Label = {
    'pemlr': 'monthly labor force status recode. EDITED UNIVERSE: PRPERTYP = 2 (already satisfied)',
    'prwkstat': 'full-time or part-time work status. EDITED UNIVERSE: PEMLR = 1-7',
    'pternwa': 'weekly earnings recode. Calculated for all workers. Collected for one-quarter of the sample. EDITED UNIVERSE: PRERELG = 1',
    'prerelg': 'flag, if earnings eligible for editing',
    'pternhly': 'hourly earnings recode. Only calculated for hourly-paid workers. Collected for one-quarter of the sample. EDITED UNIVERSE: PEERNPER = 1 OR PEERNRT = 1',
    'peernper': 'earnings periodicity',
    'peernrt': 'confirm hourly paid',
    'pemjot': 'do you have more than one job. EDITED UNIVERSE: PEMLR = 1, 2',
    'pemjnum': 'number of jobs. EDITED UNIVERSE: PEMJOT = 1',
    'pehruslt': 'usual weekly hours worked. -4 means varies. EDITED UNIVERSE: PEMLR = 1 OR 2',
    'pehractt': 'actual weekly hours worked during survey week. EDITED UNIVERSE: PEMLR = 1',
    'peernlab': 'member of union. EDITED UNIVERSE: (PEIO1COW = 1-5 AND PEMLR = 1-2 AND HRMIS = 4, 8)',
    'prmjocgr': 'major occupational category. EDITED UNIVERSE: PRMJOCC = 1-11',
    'peio1cow': 'class of worker on main job. EDITED UNIVERSE: (PEMLR = 1-3) OR (PEMLR = 4 AND PELKLWO = 1-2) '
                'OR (PEMLR = 5 AND (PENLFJH = 1 OR PEJHWKO = 1)) OR (PEMLR = 6 AND PENLFJH = 1) '
                'OR (PEMLR = 7 AND (PENLFJH = 1 OR PEJHWKO = 1))',
}

Labor_VARS = list(Labor_VARS_Label.keys())

In [32]:
print(Labor_VARS)

['pemlr', 'prwkstat', 'pternwa', 'prerelg', 'pternhly', 'peernper', 'peernrt', 'pemjot', 'pemjnum', 'pehruslt', 'pehractt', 'peernlab', 'prmjocgr', 'peio1cow']


In [33]:
Weights_VAR_Label = {
    'hwhhwgt': 'household weight',
    'pwlgwgt': 'longitudinal weight',
    'pworwgt': 'outgoing rotation weight',
    'pwvetwgt': 'veteran weight',
    'pwsswgt': 'second stage weight',
    'pwcmpwgt': 'composite final weight'
}

Weights_VARS = list(Weights_VAR_Label.keys())

In [34]:
useful_vars = Link_VARS + GeoDemographic_VARS + Labor_VARS + Weights_VARS
raw_df = raw_df[useful_vars]
raw_df.shape

(1034408, 37)

Because the datasets will be linked later, make sure there are no missing values in the link variables.

In [35]:
# count missing values in each link variable
print(raw_df[Link_VARS].isnull().sum())
# drop rows with missing values in Link_VARS (most of the time the missing value is in PULINENO)
raw_df = raw_df.dropna(subset=Link_VARS)

### **4. Rename Variables and Dictionary for Categorical Variables**

In [36]:
rename_dict = {
    'hrhhid': 'hrhhid', 'hrhhid2': 'hrhhid2', 'PULINENO': 'PULINENO',
    'HRYEAR4': 'Year', 'HRMONTH': 'Month', 'HRMIS': 'MonthInSample', 'gestfips': 'State', 'gtco': 'County', 'gtmetsta': 'Metropolitan', 'prtage': 'Age', 
    'pemaritl': 'Marital', 'pesex': 'Gender', 'peeduca': 'Educ','ptdtrace': 'Race', 'pehspnon': 'Hispanic', 'prcitshp': 'Citizen','hefaminc': 'FamIncome',
    'pemlr': 'LaborForce', 'prwkstat': 'FullPartTime', 'pternwa': 'WeeklyEarning', 'prerelg': 'flagWeekEarn', 'pternhly': 'HourlyEarning', 
    'peernper': 'flagPeriodicityEarn', 'peernrt': 'flagConfirmHourly', 'pemjot': 'SingleJob', 'pemjnum': 'NumJobs', 'pehruslt': 'UsualHours', 'pehractt': 'ActualHours', 
    'peernlab': 'Union', 'prmjocgr': 'Occupation', 'peio1cow': 'WorkerClass',
}

raw_df.rename(columns=rename_dict, inplace=True)

In [37]:
raw_df.columns

Index(['hrhhid', 'hrhhid2', 'PULINENO', 'Year', 'Month', 'MonthInSample',
       'State', 'County', 'Metropolitan', 'Age', 'Marital', 'Gender', 'Educ',
       'Race', 'Hispanic', 'Citizen', 'FamIncome', 'LaborForce',
       'FullPartTime', 'WeeklyEarning', 'flagWeekEarn', 'HourlyEarning',
       'flagPeriodicityEarn', 'flagConfirmHourly', 'SingleJob', 'NumJobs',
       'UsualHours', 'ActualHours', 'Union', 'Occupation', 'WorkerClass',
       'hwhhwgt', 'pwlgwgt', 'pworwgt', 'pwvetwgt', 'pwsswgt', 'pwcmpwgt'],
      dtype='object')

Original Values of Categorical Variables

In [38]:
GeoDemographic_Categorical_Val = {
    'State': {1: 'AL', 2: 'AK', 4: 'AZ', 5: 'AR', 6: 'CA', 8: 'CO', 9: 'CT', 10: 'DE', 
                11: 'DC', 12: 'FL', 13: 'GA', 15: 'HI', 16: 'ID', 17: 'IL', 18: 'IN', 19: 'IA', 20: 'KS', 
                21: 'KY', 22: 'LA', 23: 'ME', 24: 'MD', 25: 'MA', 26: 'MI', 27: 'MN', 28: 'MS', 29: 'MO', 30: 'MT', 
                31: 'NE', 32: 'NV', 33: 'NH', 34: 'NJ', 35: 'NM', 36: 'NY', 37: 'NC', 38: 'ND', 39: 'OH', 40: 'OK', 
                41: 'OR', 42: 'PA', 44: 'RI', 45: 'SC', 46: 'SD', 47: 'TN', 48: 'TX', 49: 'UT', 50: 'VT', 
                51: 'VA', 53: 'WA', 54: 'WV', 55: 'WI', 56: 'WY', },
    'Metropolitan': {1: 'metropolitan', 2: 'nonmetropolitan', 3: 'not identified'},
    'Marital': {1: 'married - spouse present', 2: 'married - spouse absent', 3: 'widowed', 4: 'divorced', 5: 'separated', 6: 'never married'},
    'Gender': {1: 'male', 2: 'female'},
    'Educ': {31: 'less than 1st grade', 32: '1st, 2nd, 3rd or 4th grade', 33: '5th or 6th grade', 
                34: '7th or 8th grade', 35: '9th grade', 36: '10th grade', 37: '11th grade', 38: '12th grade no diploma',
                39: 'high school grad-diploma or equiv (ged)', 40: 'some college', 
                41: 'associate degree-occupational/vocational', 42: 'associate degree-academic program', 43: 'bachelor\'s degree',
                44: 'master\'s degree', 45: 'professional school degree', 46: 'doctorate degree'},
    'Race': {1: 'white only', 2: 'black only', 3: 'american indian, alaskan native only', 4: 'asian only',
                5: 'hawaiian/pacific islander only', 6: 'white-black', 7: 'white-ai', 8: 'white-asian', 9: 'white-hp',
                10: 'black-ai', 11: 'black-asian', 12: 'black-hp', 13: 'ai-asian', 14: 'ai-hp', 15: 'asian-hp',
                16: 'w-b-ai', 17: 'w-b-a', 18: 'w-b-hp', 19: 'w-ai-a', 20: 'w-ai-hp', 21: 'w-a-hp', 22: 'b-ai-a',
                23: 'w-b-ai-a', 24: 'w-ai-a-hp', 25: 'other 3 race combinations', 26: 'other 4 and 5 race combinations'},
    'Hispanic': {1: 'hispanic', 2: 'non-hispanic'},
    'Citizen': {1: 'native, born in US', 2: 'native, born in Puerto Rico or other US island areas',
                3: 'native, born abroad', 4: 'foreign born, US citizen by naturalization', 5: 'foreign born, not a citizen'},
    'FamIncome': {1: 'less than 5,000', 2: '5,000 to 7,499', 3: '7,500 to 9,999', 4: '10,000 to 12,499', 5: '12,500 to 14,999',
                6: '15,000 to 19,999', 7: '20,000 to 24,999', 8: '25,000 to 29,999', 9: '30,000 to 34,999', 10: '35,000 to 39,999',
                11: '40,000 to 49,999', 12: '50,000 to 59,999', 13: '60,000 to 74,999', 14: '75,000 to 99,999', 
                15: '100,000 to 149,999', 16: '150,000 or more'}
}

GeoDemographic_Categorical_VARS = list(GeoDemographic_Categorical_Val.keys())

Employment_Categorical_Val = {
    'LaborForce': {1: 'employed - at work', 2: 'employed - absent', 3: 'unemployed - on layoff', 4: 'unemployed - looking',
                5: 'not in labor force - retired', 6: 'not in labor force - disabled', 7: 'not in labor force - other'},
    'FullPartTime': {1: 'not in labor force', 2: 'FT hours, usually FT', 3: 'PT for economic reasons, usually FT', 4: 'PT for non-economic reasons, usually FT',
                    5: 'not at work, usually FT', 6: 'PT hours, usually PT for economic reasons', 7: 'PT hours, usually PT for non-economic reasons',
                    8: 'FT hours, usually PT for economic reasons', 9: 'FT hours, usually PT for non-economic reasons', 10: 'not at work, usually PT',
                    11: 'unemployed FT', 12: 'unemployed PT'},
    'flagWeekEarn': {0: 'not eligible for edit', 1: 'eligible for edit'},
    'flagPeriodicityEarn': {1: 'hourly', 2: 'weekly', 3: 'bi-weekly', 4: 'twice monthly', 5: 'monthly', 6: 'annually', 7: 'other'},
    'flagConfirmHourly': {1: 'yes', 2: 'no'},
    'SingleJob': {1: 'yes', 2: 'no'},
    'NumJobs': {2: '2 jobs', 3: '3 jobs', 4: '4 or more jobs'},
    'Union': {1: 'yes', 2: 'no'},
    'Occupation': {1: 'management, business, science, and arts occupations', 2: 'service occupations', 3: 'sales and office occupations',
                4: 'farming, fishing, and forestry occupations', 5: 'construction, extraction, and maintenance occupations',
                6: 'production, transportation, and material moving occupations', 7: 'armed forces'},
    'WorkerClass': {1: 'gov - federal', 2: 'gov - state', 3: 'gov - local', 4: 'private - profit', 5: 'private - nonprofit', 6: 'self-employ, incorporated',
                7: 'self-employ, unincorporated', 8: 'without pay'},
}

Employment_Categorical_VARS = list(Employment_Categorical_Val.keys())

Categorical_Vars = GeoDemographic_Categorical_VARS + Employment_Categorical_VARS
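
One way such value dictionaries might be used is to turn numeric codes into readable labels with `Series.map`. The sketch below uses small excerpts of the dictionaries and a made-up frame; `gender_labels`, `labor_labels`, and `labelled` are illustrative names, not objects defined in the notebook.

```python
import pandas as pd

# Excerpts of the value dictionaries (not the full mappings)
gender_labels = {1: 'male', 2: 'female'}
labor_labels = {1: 'employed - at work', 2: 'employed - absent',
                5: 'not in labor force - retired'}

# Made-up rows; -1 stands for a blank (out-of-universe) code
toy = pd.DataFrame({'Gender': [1, 2, 2], 'LaborForce': [1, 5, -1]})

labelled = toy.copy()
labelled['Gender'] = toy['Gender'].map(gender_labels)
# Codes that are absent from the dictionary (here -1) become NaN under map()
labelled['LaborForce'] = toy['LaborForce'].map(labor_labels)
```

Because `map` returns NaN for unmapped codes, it also doubles as a quick check for codes the dictionary does not cover.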

### **5. Export concatenated dataframe**

In [39]:
raw_df.to_csv(DATA_PATH + 'Clean' + CPS_YEAR + '/cleaned2' + CPS_YEAR + '.csv', index=False)

## **Part 2: Clean Variables**

### **1. Import Yearly Data and Variable Dictionary**

In [40]:
raw_df = pd.read_csv(DATA_PATH + 'Clean' + CPS_YEAR + '/cleaned2' + CPS_YEAR + '.csv')
raw_df.head()

Unnamed: 0,hrhhid,hrhhid2,PULINENO,Year,Month,MonthInSample,State,County,Metropolitan,Age,Marital,Gender,Educ,Race,Hispanic,Citizen,FamIncome,LaborForce,FullPartTime,WeeklyEarning,flagWeekEarn,HourlyEarning,flagPeriodicityEarn,flagConfirmHourly,SingleJob,NumJobs,UsualHours,ActualHours,Union,Occupation,WorkerClass,hwhhwgt,pwlgwgt,pworwgt,pwvetwgt,pwsswgt,pwcmpwgt
0,505019880110916,11011,2.0,2021,7,8,1,3,1,61.0,5.0,2.0,43.0,1.0,2.0,1.0,12,5.0,1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,19202002.0,28699695.0,76121915.0,18945420.0,19202002.0,18976311.0
1,610100690316751,11011,1.0,2021,7,6,1,3,1,53.0,4.0,2.0,43.0,1.0,2.0,1.0,9,1.0,2.0,-1.0,0.0,-1.0,-1.0,-1.0,2.0,-1.0,50.0,60.0,-1.0,1.0,5.0,20812987.0,31107507.0,0.0,21153559.0,20812987.0,21204165.0
2,180314039113,13011,1.0,2021,7,2,1,0,1,77.0,2.0,2.0,39.0,1.0,2.0,1.0,8,5.0,1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,18287603.0,27333017.0,0.0,18351717.0,18287603.0,18396932.0
3,310588190701104,11011,1.0,2021,7,8,1,0,2,85.0,3.0,2.0,36.0,1.0,2.0,1.0,6,5.0,1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,18424472.0,27537584.0,74350105.0,18489066.0,18424472.0,18534619.0
4,2790110012113,11011,1.0,2021,7,7,1,0,2,73.0,4.0,1.0,39.0,2.0,2.0,1.0,10,1.0,2.0,-1.0,0.0,-1.0,-1.0,-1.0,2.0,-1.0,40.0,40.0,-1.0,2.0,4.0,17561071.0,26267955.0,0.0,17099389.0,17561071.0,17124931.0


In [41]:
columns = raw_df.columns.values
columns

array(['hrhhid', 'hrhhid2', 'PULINENO', 'Year', 'Month', 'MonthInSample',
       'State', 'County', 'Metropolitan', 'Age', 'Marital', 'Gender',
       'Educ', 'Race', 'Hispanic', 'Citizen', 'FamIncome', 'LaborForce',
       'FullPartTime', 'WeeklyEarning', 'flagWeekEarn', 'HourlyEarning',
       'flagPeriodicityEarn', 'flagConfirmHourly', 'SingleJob', 'NumJobs',
       'UsualHours', 'ActualHours', 'Union', 'Occupation', 'WorkerClass',
       'hwhhwgt', 'pwlgwgt', 'pworwgt', 'pwvetwgt', 'pwsswgt', 'pwcmpwgt'],
      dtype=object)

In [42]:
# Check for missing values
# If a column contains missing values (NaN), take care of it later.
# Theoretically, we should not have any missing values.

for col in columns:
    if raw_df[col].isnull().sum() > 0:
        print(col, raw_df[col].isnull().sum())

WeeklyEarning 177492
HourlyEarning 177492
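
The same check can be written without an explicit loop. The following is a minimal sketch on invented data, not a replacement for the cell above; `toy` and `cols_with_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up frame: -1 is the survey's blank code, NaN is a true missing value
toy = pd.DataFrame({
    'Age': [61, 53, 77],
    'WeeklyEarning': [np.nan, 90000, -1],
    'HourlyEarning': [np.nan, np.nan, -1],
})

# Count NaN per column, then keep only the columns that have any
null_counts = toy.isnull().sum()
cols_with_missing = null_counts[null_counts > 0]
```

Note that this counts only NaN; the -1 blank code is a regular number as far as pandas is concerned and has to be handled separately.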


In [43]:
Link_VARS_Label = {
    'hrhhid': 'Household ID',
    'hrhhid2': 'Household ID2',
    'PULINENO': 'Line number of person'
}

Link_VARS = ['hrhhid', 'hrhhid2', 'PULINENO']

GeoDemographic_VARS_Label = {
    'Year': 'year of interview', 
    'Month': 'month of interview', 
    'MonthInSample': 'month in sample; can be used in calculations when linking datasets across months',
    'State': 'state fips code',
    'County': 'county fips code',
    'Metropolitan': 'metropolitan status', 
    'Age': 'age, 0-79 in years, 80 for 80-84, 85 for 85+',
    'Marital': 'marital status',
    'Gender': 'sex',
    'Educ': 'highest level of school completed',
    'Race': 'race',
    'Hispanic': 'hispanic or non-hispanic',
    'Citizen': 'citizenship status',
    'FamIncome': 'family income'
}

GeoDemographic_VARS = list(GeoDemographic_VARS_Label.keys())

Labor_VARS_Label = {
    'LaborForce': 'monthly labor force status recode. EDITED UNIVERSE: PRPERTYP = 2 (already satisfied)',
    'FullPartTime': 'full-time or part-time work status. EDITED UNIVERSE: PEMLR = 1-7',
    'WeeklyEarning': 'weekly earnings recode. Calculated for all workers. Collected for one-quarter of the sample. \
                        EDITED UNIVERSE: flagWeekEarn/PRERELG = 1',
    'flagWeekEarn': 'prerelg, flag, if earnings eligible for editing',
    'HourlyEarning': 'hourly earnings recode. Only calculated for hourly-paid workers. \
                    Collected for one-quarter of the sample. EDITED UNIVERSE: PEERNPER = 1 OR PEERNRT = 1',
    'flagPeriodicityEarn': 'peernper, earning periodicity',
    'flagConfirmHourly': 'peernrt, confirm hourly paid',
    'SingleJob': 'do you have more than one job. EDITED UNIVERSE: PEMLR = 1, 2',
    'NumJobs': 'number of jobs. EDITED UNIVERSE: PEMJOT = 1',
    'UsualHours': 'usual weekly hours worked. -4 means varies. EDITED UNIVERSE: PEMLR = 1 OR 2',
    'ActualHours': 'actual weekly hours worked during survey week. EDITED UNIVERSE: PEMLR = 1',
    'Union': 'member of union. EDITED UNIVERSE: (PEIO1COW = 1-5 AND PEMLR = 1-2 AND HRMIS = 4, 8)',
    'Occupation': 'major occupational category. EDITED UNIVERSE: PRMJOCC = 1-11',
    'WorkerClass': 'class of worker on main job. EDITED UNIVERSE: (PEMLR = 1-3) OR (PEMLR = 4 AND PELKLWO = 1-2) \
                                OR (PEMLR = 5 AND (PENLFJH = 1 OR PEJHWKO = 1)) OR (PEMLR = 6 AND PENLFJH = 1) \
                                OR (PEMLR = 7 AND (PENLFJH = 1 OR PEJHWKO = 1))',
}

Labor_VARS = list(Labor_VARS_Label.keys())

Weights_VAR_Label = {
    'hwhhwgt': 'household weight',
    'pwlgwgt': 'longitudinal weight',
    'pworwgt': 'outgoing rotation weight',
    'pwvetwgt': 'veteran weight',
    'pwsswgt': 'second stage weight',
    'pwcmpwgt': 'composite final weight'
}

Weights_VARS = list(Weights_VAR_Label.keys())

### **2. Geographic and demographic variables**

In [44]:
raw_df['Metropolitan'].value_counts()

1    831167
2    193104
3     10123
Name: Metropolitan, dtype: int64

In [45]:
# recode 2 (nonmetropolitan) as 0 and 3 (not identified) as NaN,
# then impute the NaN values by forward fill (each NaN takes the value of the previous row)
raw_df['Metropolitan'] = raw_df['Metropolitan'].replace([2, 3], [0, np.nan])
raw_df['Metropolitan'] = raw_df['Metropolitan'].ffill()

In [46]:
raw_df['Metropolitan'].value_counts(dropna=False)

1.0    839076
0.0    195318
Name: Metropolitan, dtype: int64
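
To make the effect of the forward fill concrete, here is a small sketch on an invented series. It shows that each "not identified" row inherits the value of the row before it, so the result depends on row order, and that a missing value in the first row would stay missing.

```python
import numpy as np
import pandas as pd

# Invented codes: 1 = metropolitan, 2 = nonmetropolitan, 3 = not identified
metro = pd.Series([1, 3, 2, 3, 3])

# Same recoding as in the notebook: 2 -> 0, 3 -> NaN
recoded = metro.replace({2: 0, 3: np.nan})

# Forward fill: every NaN takes the last non-missing value above it
filled = recoded.ffill()

# A NaN in the very first row has nothing above it and is left as NaN
leading = pd.Series([3, 1, 2]).replace({2: 0, 3: np.nan}).ffill()
```

Whether borrowing the previous row's value is a reasonable imputation depends on how the rows are sorted, for example whether neighbouring rows belong to the same household or area.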

In [47]:
raw_df['Age'].describe().apply(lambda x: format(x, 'f'))
# format as fixed-point; otherwise the result is shown in scientific notation
# Age is top-coded: 0-79 in years, 80 means 80-84, 85 means 85+. This does not affect the age groups below.

count    1034394.000000
mean          48.813059
std           19.013689
min           16.000000
25%           33.000000
50%           49.000000
75%           64.000000
max           85.000000
Name: Age, dtype: object

In [48]:
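# create a new column AgeGroup with four groups: young, prime, old, retired
# pd.cut bins are right-closed, so (15, 24] covers ages 16-24 (the minimum age in the sample is 16)
# young workers: 16-24, prime-age workers: 25-54, older workers: 55-64, retired: 65+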
raw_df['AgeGroup'] = pd.cut(raw_df['Age'], bins=[15, 24, 54, 64, 100], labels=['young', 'prime', 'old', 'retired'])
raw_df['AgeGroup'].value_counts(dropna=False)

prime      475420
retired    255620
old        173600
young      129754
Name: AgeGroup, dtype: int64
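
The bin edges deserve a quick check, since `pd.cut` uses right-closed intervals by default. The sketch below applies the same bins and labels to a few boundary ages; the `ages` series is made up for illustration.

```python
import pandas as pd

# Boundary ages around each bin edge
ages = pd.Series([15, 16, 24, 25, 54, 55, 64, 65, 85])

# Same bins and labels as in the notebook: (15, 24], (24, 54], (54, 64], (64, 100]
groups = pd.cut(ages, bins=[15, 24, 54, 64, 100],
                labels=['young', 'prime', 'old', 'retired'])

# Age 15 sits on the open left edge of the first bin, so it is not assigned a group
age_15_unassigned = pd.isna(groups.iloc[0])
```

With a minimum age of 16 in this sample, the open left edge at 15 does not drop anyone, but it would if younger respondents were present.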

In [49]:
raw_df['Marital'].value_counts()

1.0    524042
6.0    302211
4.0    110711
3.0     67125
5.0     17104
2.0     13201
Name: Marital, dtype: int64

In [50]:
# replace 6 (never married) with 0; replace all other values (ever married) with 1
raw_df['Marital'] = raw_df['Marital'].replace([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 0])
raw_df['Marital'].value_counts(dropna=False)

1.0    732183
0.0    302211
Name: Marital, dtype: int64

In [51]:
raw_df['Gender'].value_counts()

2.0    537745
1.0    496649
Name: Gender, dtype: int64

In [52]:
# replace 2 (female) with 0
raw_df['Gender'] = raw_df['Gender'].replace(2, 0)
raw_df['Gender'].value_counts()

0.0    537745
1.0    496649
Name: Gender, dtype: int64

In [53]:
raw_df['Educ'].value_counts()

39.0    295373
43.0    218510
40.0    169727
44.0     95660
42.0     57591
41.0     44665
37.0     30675
36.0     25810
46.0     20008
35.0     16509
38.0     15264
45.0     14463
34.0     13430
33.0      9248
32.0      4750
31.0      2711
Name: Educ, dtype: int64

In [54]:
# Collapse Educ into seven levels, leaving room for further simplification later
# replace 31-38 with 0 (less than high school)
raw_df['Educ'] = raw_df['Educ'].replace([31, 32, 33, 34, 35, 36, 37, 38], 0)
# replace 39 with 1 (high school); replace 40 with 2 (some college); replace 41-42 with 3 (associate);
# replace 43 with 4 (bachelor's); replace 44-45 with 5 (master's or professional school); replace 46 with 6 (doctorate)
raw_df['Educ'] = raw_df['Educ'].replace([39, 40, 41, 42, 43, 44, 45, 46], [1, 2, 3, 3, 4, 5, 5, 6])

In [55]:
raw_df['Educ'].value_counts().sort_index()

0.0    118397
1.0    295373
2.0    169727
3.0    102256
4.0    218510
5.0    110123
6.0     20008
Name: Educ, dtype: int64
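
The same recoding could also be expressed as a single dictionary applied in one pass, which keeps the old and new codes side by side. This is a sketch on invented values; `educ_recode` is a hypothetical name and not part of the notebook.

```python
import pandas as pd

# Build the mapping from original Educ codes to the seven collapsed levels
educ_recode = {code: 0 for code in range(31, 39)}   # 31-38: less than high school
educ_recode.update({39: 1,          # high school
                    40: 2,          # some college
                    41: 3, 42: 3,   # associate
                    43: 4,          # bachelor's
                    44: 5, 45: 5,   # master's or professional school
                    46: 6})         # doctorate

# Invented sample of original codes
toy_educ = pd.Series([31, 38, 39, 40, 42, 43, 45, 46])

# map() applies the whole dictionary at once; unmapped codes would become NaN
recoded = toy_educ.map(educ_recode)
```

A single mapping avoids any risk of one replacement step accidentally touching values produced by an earlier step.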

In [56]:
raw_df['Race'].value_counts()

1.0     832984
2.0     107840
4.0      58969
3.0      12363
7.0       5699
6.0       4685
5.0       4634
8.0       3212
10.0       772
9.0        660
15.0       586
21.0       582
16.0       512
11.0       273
17.0       120
19.0        82
13.0        79
26.0        79
12.0        64
18.0        44
25.0        42
14.0        33
22.0        26
20.0        26
23.0        21
24.0         7
Name: Race, dtype: int64

In [57]:
# replace 1 with 0 (white only), replace 2 with 1 (black only), replace 4 with 2 (asian only)
# replace all other single-race and mixed-race codes with 3
raw_df['Race'] = raw_df['Race'].replace([1, 2, 4], [0, 1, 2])
raw_df['Race'] = raw_df['Race'].replace([3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26], 3)
raw_df['Race'].value_counts(dropna=False)

0.0    832984
1.0    107840
2.0     58969
3.0     34601
Name: Race, dtype: int64

In [58]:
raw_df['Hispanic'].value_counts()

2.0    896072
1.0    138322
Name: Hispanic, dtype: int64

In [59]:
# replace 2 (non-hispanic) with 0
raw_df['Hispanic'] = raw_df['Hispanic'].replace(2, 0)
raw_df['Hispanic'].value_counts()

0.0    896072
1.0    138322
Name: Hispanic, dtype: int64

In [60]:
raw_df['Citizen'].value_counts()

1.0    880905
4.0     73850
5.0     65767
3.0      8666
2.0      5206
Name: Citizen, dtype: int64

In [61]:
# replace 1 2 3 4 with 1 as citizen
# replace 5 with 0 as non-citizen
raw_df['Citizen'] = raw_df['Citizen'].replace([1, 2, 3, 4], 1)
raw_df['Citizen'] = raw_df['Citizen'].replace(5, 0)
raw_df['Citizen'].value_counts(dropna=False)

1.0    968627
0.0     65767
Name: Citizen, dtype: int64

In [62]:
raw_df['FamIncome'].value_counts().sort_index()

1      18106
2       8812
3      12804
4      20357
5      18338
6      33363
7      43938
8      43939
9      51773
10     51759
11     79200
12     83469
13    109253
14    140261
15    155361
16    163661
Name: FamIncome, dtype: int64

### **3. Labor Force Variables**

`LaborForce` is the final recode derived from a series of survey questions. Its values range from employed (at work/absent) through unemployed (on layoff/looking) to not in labor force (retired/disabled/other). After the earlier cleaning, this variable should have no missing values (NaN or -1).

`FullPartTime` is another representation of labor force, indicating full/part time status. We keep this variable because it is the edited universe of `Union`. When we calculate the percentage of members in union, we will refer to this variable.
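
For later analysis, the seven `LaborForce` codes could be collapsed into the three broad states described above. The sketch below is one possible grouping on invented data; `status_map` and the unweighted rate are illustrative only, and any real estimate would need the survey weights.

```python
import pandas as pd

# Collapse the seven LaborForce codes into three broad states
status_map = {1: 'employed', 2: 'employed',
              3: 'unemployed', 4: 'unemployed',
              5: 'nilf', 6: 'nilf', 7: 'nilf'}   # nilf = not in labor force

# One invented respondent per code
toy = pd.Series([1, 2, 3, 4, 5, 6, 7])
status = toy.map(status_map)

# Unweighted unemployment rate: unemployed / (employed + unemployed)
in_labor_force = status.isin(['employed', 'unemployed'])
unemp_rate = (status == 'unemployed').sum() / in_labor_force.sum()
```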

In [63]:
raw_df['LaborForce'].value_counts()

1.0    568768
5.0    228687
7.0    130866
6.0     52076
4.0     24856
2.0     23151
3.0      5990
Name: LaborForce, dtype: int64

In [64]:
raw_df['FullPartTime'].value_counts()

2.0     436533
1.0     411629
7.0      78849
4.0      32553
11.0     25799
5.0      16740
6.0      11329
10.0      6411
3.0       6278
12.0      5047
9.0       2681
8.0        545
Name: FullPartTime, dtype: int64

`WeeklyEarning` is collected for one-quarter of the sample. Its edited universe is `flagWeekEarn/PRERELG = 1` (eligible for editing).

`HourlyEarning` is also collected for one-quarter of the sample and is calculated only for hourly-paid workers. EDITED UNIVERSE: flagPeriodicityEarn/PEERNPER = 1 **OR** flagConfirmHourly/PEERNRT = 1

In [65]:
a = raw_df['flagWeekEarn'].value_counts()
b = raw_df[raw_df['WeeklyEarning'] == -1]['flagWeekEarn'].value_counts()
c = raw_df[raw_df['WeeklyEarning'] == -1]['WeeklyEarning'].value_counts()
print(a)
print(b)
print(c)
# Every row with WeeklyEarning == -1 (blank) has flagWeekEarn == 0 (not eligible for editing),
# so rows with flagWeekEarn == 1 (eligible for editing) never carry the -1 code

0.0    900217
1.0    134177
Name: flagWeekEarn, dtype: int64
0.0    745473
Name: flagWeekEarn, dtype: int64
-1.0    745473
Name: WeeklyEarning, dtype: int64


In [66]:
# describe weekly earnings, excluding the -1 (blank) code; describe() skips NaN rows
raw_df[raw_df['WeeklyEarning'] != -1]['WeeklyEarning'].describe()

count    111429.000000
mean     109551.908408
std       73440.111281
min           0.000000
25%       57600.000000
50%       90000.000000
75%      144230.000000
max      288461.000000
Name: WeeklyEarning, dtype: float64
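
The magnitudes above are consistent with earnings being stored with two implied decimals, as noted in the record layout, so 288461 would read as 2,884.61 dollars. Under that assumption, a minimal sketch of converting in-universe values to dollars could look like this; the frame and the name `weekly_dollars` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows: -1 is blank, NaN is missing, the rest are raw earnings codes
toy = pd.DataFrame({'WeeklyEarning': [-1, 57600, 288461, np.nan],
                    'flagWeekEarn': [0, 1, 1, 0]})

# Keep only rows inside the edited universe with a non-negative value
in_universe = toy['flagWeekEarn'] == 1
valid = in_universe & (toy['WeeklyEarning'] >= 0)

# Assumption: two implied decimals, so divide by 100 to get dollars
weekly_dollars = toy['WeeklyEarning'].where(valid) / 100
```

Rows outside the universe end up as NaN rather than -1, which keeps them out of means and other summary statistics.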

In [67]:
a = raw_df[raw_df['HourlyEarning'] == -1]['HourlyEarning'].shape[0]
# negate the edited universe for HourlyEarning
b = raw_df[(raw_df['flagPeriodicityEarn'] != 1) & (raw_df['flagConfirmHourly'] != 1)]['HourlyEarning'].shape[0]
# the parentheses around each condition are necessary; otherwise: TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
# https://stackoverflow.com/questions/60654781/typeerror-cannot-perform-rand-with-a-dtyped-float64-array-and-scalar-of-ty
print("Count for HourlyEarning if -1:", a)
print("Count for HourlyEarning if both edited universe flags are not 1:", b)
# HourlyEarning should have meaningful values when flagPeriodicityEarn is 1 OR flagConfirmHourly is 1, and be -1 otherwise.
# Note that b counts rows outside the edited universe regardless of their HourlyEarning value (NaN included), so it need not equal a.

Count for HourlyEarning if -1: 794576
Count for HourlyEarning if both edited universe flags are not 1: 959368


In [68]:
raw_df[raw_df['HourlyEarning'] != -1]['HourlyEarning'].describe()

count    62326.000000
mean      2029.775712
std       1083.811052
min          0.000000
25%       1400.000000
50%       1700.000000
75%       2400.000000
max       9999.000000
Name: HourlyEarning, dtype: float64

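`SingleJob` (PEMJOT) records whether the respondent has more than one job (1 = yes, 2 = no). EDITED UNIVERSE: LaborForce/PEMLR = 1, 2

`NumJobs` (PEMJNUM) is the number of jobs held. EDITED UNIVERSE: SingleJob/PEMJOT = 1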

In [69]:
raw_df['SingleJob'].value_counts()

 2.0    562358
-1.0    442475
 1.0     29561
Name: SingleJob, dtype: int64

In [70]:
# EDITED UNIVERSE: PEMLR = 1, 2
raw_df[raw_df['LaborForce'].isin([3, 4, 5, 6, 7])]['SingleJob'].value_counts()

-1.0    442475
Name: SingleJob, dtype: int64

In [71]:
raw_df[raw_df['LaborForce'].isin([1, 2])]['SingleJob'].value_counts()

2.0    562358
1.0     29561
Name: SingleJob, dtype: int64

In [72]:
# EDITED UNIVERSE: PEMJOT = 1
raw_df['NumJobs'].value_counts()

-1.0    1004833
 2.0      26729
 3.0       2362
 4.0        470
Name: NumJobs, dtype: int64

`UsualHours`: -4 means the usual working hours per week varies. EDITED UNIVERSE: PEMLR = 1 OR 2

`ActualHours`: EDITED UNIVERSE: PEMLR = 1

In [73]:
a = raw_df[raw_df['UsualHours'] == -1]['UsualHours'].shape[0]
b = raw_df[~raw_df['LaborForce'].isin([1, 2])]['UsualHours'].value_counts()
print("Number of UsualHours if -1:", a)
print("Number of UsualHours if not employed:", b)
# UsualHours is -1 (blank) exactly for people who are not employed (neither at work nor absent)
c = raw_df[raw_df['UsualHours'] == -4]['UsualHours'].shape[0]
print("Number of UsualHours if -4:", c)
# -4 means usual hours vary; could we fall back on actual hours worked for these people?

Number of UsualHours if -1: 442475
Number of UsualHours if not employed: -1.0    442475
Name: UsualHours, dtype: int64
Number of UsualHours if -4: 42271
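
One possible answer to the question above is to fall back on `ActualHours` when `UsualHours` is -4 and an actual value exists. This is only a sketch of that idea on invented data, not a recommendation; actual hours in the survey week may differ from what is usual, and `hours` is a hypothetical new variable.

```python
import numpy as np
import pandas as pd

# Invented rows: -4 = hours vary, -1 = blank (outside the edited universe)
toy = pd.DataFrame({'UsualHours': [40, -4, -4, -1],
                    'ActualHours': [38, 25, -1, -1]})

hours = toy['UsualHours'].astype(float)
varies = toy['UsualHours'] == -4
has_actual = toy['ActualHours'] > 0

# Where usual hours vary and actual hours are available, use actual hours
hours = hours.mask(varies & has_actual, toy['ActualHours'])

# Any remaining negative code (-4 or -1) is set to NaN
hours = hours.mask(hours < 0)
```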


In [74]:
a = raw_df[raw_df['ActualHours'] == -1]['ActualHours'].shape[0]
b = raw_df[raw_df['LaborForce'] != 1]['ActualHours'].value_counts()
print("Number of ActualHours if -1:", a)
print("Number of ActualHours if not employed(at work):", b)
# The edited universe holds: ActualHours is -1 (blank) exactly when the person is not "employed - at work" (LaborForce != 1)

Number of ActualHours if -1: 465626
Number of ActualHours if not employed(at work): -1.0    465626
Name: ActualHours, dtype: int64


In [75]:
raw_df[['UsualHours', 'ActualHours']].describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))

Unnamed: 0,UsualHours,ActualHours
count,1034390.0,1034390.0
mean,20.3898,20.8552
std,21.9589,21.9656
min,-4.0,-1.0
25%,-1.0,-1.0
50%,20.0,20.0
75%,40.0,40.0
max,198.0,198.0


In [76]:
# Error:
# raw_df[[raw_df['UsualHours'] != -1]['UsualHours'], [raw_df['ActualHours'] != -1]['ActualHours']].describe()

# No error:
raw_df[raw_df[['UsualHours', 'ActualHours']] != -1][['UsualHours', 'ActualHours']].describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))

# check:
# raw_df[raw_df['UsualHours'] != -1]['UsualHours'].describe()
# raw_df[raw_df['ActualHours'] != -1]['ActualHours'].describe()

Unnamed: 0,UsualHours,ActualHours
count,591919.0,568768.0
mean,36.3793,38.7471
std,15.6516,12.8967
min,-4.0,1.0
25%,36.0,35.0
50%,40.0,40.0
75%,40.0,40.0
max,198.0,198.0


`Union`: EDITED UNIVERSE: (PEIO1COW = 1-5 AND PEMLR = 1-2 AND HRMIS = 4, 8).

In [77]:
raw_df['Union'].value_counts()

-1.0    900217
 2.0    120875
 1.0     13302
Name: Union, dtype: int64

In [78]:
raw_df[raw_df['LaborForce'].isin([1, 2]) & raw_df['WorkerClass'].isin([1,2,3,4,5]) & raw_df['MonthInSample'].isin([4,8])]['Union'].value_counts()
# The edited universe is confirmed. HRMIS is limited to 4 and 8 because the union question is asked only of the outgoing rotation groups.

2.0    120875
1.0     13302
Name: Union, dtype: int64

In [79]:
raw_df['Union'] = raw_df['Union'].replace(2, 0)
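
With `Union` coded as 0/1 inside its edited universe, a weighted membership share can be computed. The sketch below uses invented numbers and assumes the outgoing rotation weight `pworwgt` is the appropriate weight for this item; that choice should be confirmed against the CPS documentation.

```python
import numpy as np
import pandas as pd

# Invented rows: Union is 1 (member), 0 (not a member), or -1 (outside the universe)
toy = pd.DataFrame({'Union': [1, 0, 0, -1],
                    'pworwgt': [100.0, 300.0, 100.0, 0.0]})

# Restrict to the edited universe before averaging
in_universe = toy['Union'].isin([0, 1])

# Weighted share of union members (assumes pworwgt is the right weight)
union_rate = np.average(toy.loc[in_universe, 'Union'],
                        weights=toy.loc[in_universe, 'pworwgt'])
```

Leaving the -1 rows in would bias the average downward, which is why the universe filter comes first.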

`Occupation`: EDITED UNIVERSE: PRMJOCC = 1-11

`WorkerClass`: serves mainly as part of the edited universe for `Union`. We do not need to examine `WorkerClass` further because we already have `Occupation`.

In [80]:
raw_df['Occupation'].value_counts()
# Open issue: the data dictionary has no PRMJOCC, only PRMJOCC1 and PRMJOCC2.
# The edited universes of the job-category variables refer to each other.

-1.0    406251
 1.0    263838
 3.0    124532
 2.0    101211
 6.0     79604
 5.0     53709
 4.0      5160
 7.0        89
Name: Occupation, dtype: int64

### **4. Export Yearly Data**

In [81]:
raw_df.to_csv(DATA_PATH + 'Clean' + CPS_YEAR + '/cleaned3' + CPS_YEAR + '.csv', index=False)

In [125]:
# # export the dataframe for each month into separate csv files
# month_abbr = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
#               7: 'Jul', 8: 'Aug', 9: 'Sept', 10: 'Oct', 11: 'Nov', 12: 'Dec'}

# for month, month_df in raw_df.groupby('Month'):
#     print('Month:', month, ',', 'shape:', month_df.shape)
#     month_df.to_csv('./CleanCPS2020/clean_' + month_abbr[month] + '20cps.csv', index=False)