# **MCO1 - Labor Force Survey 2016**

#### Group: **K-Means Business**

**Dy**, Harmony

**Hernandez**, Christa

**Sanchez**, Matthew

**Uy**, Justine

## Section 1. Introduction to the problem/task and dataset

The dataset the group has chosen is the Labor Force Survey 2016. This dataset mainly covers the employment and labor force participation of household members in the Philippines, while also capturing key demographic and socio-economic characteristics of the population. The target task is a **classification task** that aims to predict whether or not the respondent did any work for at least one hour during the past week.

## Section 2. Description of the dataset

<b>Description of the Dataset</b>

The Labor Force Survey (LFS), April 2016 data set describes the demographic and socio-economic characteristics of the population mainly through the estimation of levels of employment, unemployment, and underemployment in the 17 administrative regions of the Philippines. It aims to provide a quantitative framework for the preparation of plans and formulation of policies affecting the labor market.

<b> Data Collection </b>

The data collection was conducted face-to-face within a total national sample of 42,768 sample households (rounds with the Batanes sample) or 42,576 sample households (rounds without the Batanes sample) per quarterly survey round. This specific data set was collected from April 8, 2016 until April 30, 2016. To ensure proportional representation among the samples, the Philippine Statistics Authority (PSA) designed a master sample consisting of randomly assigned and selected geographic areas with non-overlapping and discernible boundaries, known as primary sampling units (PSUs).

Because the data collection used a randomized sampling approach, the collected data is representative at the national and regional levels. The systematic survey execution achieved a high response rate of 95.7%, which minimizes nonresponse bias.

However, because the survey only includes participants from private households and excludes the institutional population, it may underestimate labor force statistics and produce less accurate results.

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("PHL-PSA-LFS-2016-Q2-PUF/LFS PUF April 2016.CSV")

display(df.head())
display(df.info())

Unnamed: 0,PUFREG,PUFPRV,PUFPRRCD,PUFHHNUM,PUFURB2K10,PUFPWGTFIN,PUFSVYMO,PUFSVYYR,PUFPSU,PUFRPL,...,PUFC33_WEEKS,PUFC34_WYNOT,PUFC35_LTLOOKW,PUFC36_AVAIL,PUFC37_WILLING,PUFC38_PREVJOB,PUFC40_POCC,PUFC41_WQTR,PUFC43_QKB,PUFNEWEMPSTAT
0,1,28,2800,1,2,405.2219,4,2016,217,1,...,,,,,,,,1,1,1
1,1,28,2800,1,2,388.828,4,2016,217,1,...,,,,,,,,1,1,1
2,1,28,2800,1,2,406.1194,4,2016,217,1,...,,,,,,,,1,1,1
3,1,28,2800,2,2,405.2219,4,2016,217,1,...,,,,,,,,1,1,1
4,1,28,2800,2,2,384.3556,4,2016,217,1,...,,,,,,,,1,96,1


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180862 entries, 0 to 180861
Data columns (total 50 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   PUFREG           180862 non-null  int64  
 1   PUFPRV           180862 non-null  int64  
 2   PUFPRRCD         180862 non-null  int64  
 3   PUFHHNUM         180862 non-null  int64  
 4   PUFURB2K10       180862 non-null  int64  
 5   PUFPWGTFIN       180862 non-null  float64
 6   PUFSVYMO         180862 non-null  int64  
 7   PUFSVYYR         180862 non-null  int64  
 8   PUFPSU           180862 non-null  int64  
 9   PUFRPL           180862 non-null  int64  
 10  PUFHHSIZE        180862 non-null  int64  
 11  PUFC01_LNO       180862 non-null  int64  
 12  PUFC03_REL       180862 non-null  int64  
 13  PUFC04_SEX       180862 non-null  int64  
 14  PUFC05_AGE       180862 non-null  int64  
 15  PUFC06_MSTAT     180862 non-null  object 
 16  PUFC07_GRADE     180862 non-null  obje

None

<b>Row and Column Representation </b>

Each row represents an individual respondent; together, the rows cover all members of the sample households who meet the survey's criteria for inclusion. Each column represents a feature that describes a demographic or socio-economic characteristic of a respondent.

<b>Instances</b>

The data set contains 180,862 instances, one per individual respondent. These respondents come from the sampled households, of which there are 42,768 per survey round when the Batanes sample is included or 42,576 when it is excluded.

<b>Features</b>

There are 50 features in the dataset.

<b>Important Files</b>

- LFS PUF April 2016.CSV: Data set file containing instances of the responses from the survey.
- lfs_april_2016_metadata(dictionary).xlsx: A PDF file containing a study description. From this, Coverage, Sampling, Data Collection, and Data Processing were mainly used.

<b>Feature Descriptions</b>

<b>PUFREG</b>: An integer value representing the administrative region where the respondent resides.

<b>PUFPRV</b>: An integer value representing the numerical code of the respondent's province, also called Province Code. 

<b>PUFPRRCD</b>: An integer value representing a recoded version of the province code, also called Province Recode.

<b>PUFHHNUM</b>: An integer representing a unique identifier for each household.

<b>PUFURB2K10</b>: A binary value indicating whether a household is in an urban or rural area based on the FIES 2010 Survey. (1 = Urban and 2 = Rural)

<b>PUFPWGTFIN</b>: A float value corresponding to the final weight factor of each respondent which is used to scale or project survey results to the national level. 

<b>PUFSVYMO</b>: An integer value corresponding to the month the survey was conducted. It is a categorical value which means that 1 corresponds to January, 2 to February and so on.

<b>PUFSVYYR</b>: An integer value corresponding to the year the survey was conducted. 

<b>PUFPSU</b>: An integer value corresponding to a respondent's Primary Sampling Unit (PSU) identifier.

<b>PUFRPL</b>: An integer value corresponding to a replicate of a respondent's Primary Sampling Unit (PSU) identifier.

<b>PUFHHSIZE</b>: An integer value corresponding to the total number of members in a household, also called Household Size.

<b>PUFC01_LNO</b>: An integer value corresponding to a respondent's unique identifier within a household.

<b>PUFC03_REL</b>: An integer value representing the relationship of the respondent to the household head. It is a categorical value which means that 1 corresponds to Head, 2 is Spouse, 3 is Son/Daughter and so on.

<b>PUFC04_SEX</b>: A binary value corresponding to the gender of the respondent. (1 = Male and 2 = Female)

<b>PUFC05_AGE</b>: An integer value corresponding to the respondent's age at the last birthday.

<b>PUFC06_MSTAT</b>: An integer value corresponding to the respondent's marital status. It is a categorical value which means that 1 corresponds to Single, 2 is Married, 3 is Widowed and so on.

<b>PUFC07_GRADE</b>: An integer value corresponding to the highest level of education obtained. It is a categorical value which means that 000 corresponds to No grade completed, 010 is Preschool, 210 is Grade 1 and so on.

<b>PUFC08_CURSCH</b>: A binary value corresponding to whether the respondent is currently attending school. (1 = Yes and 2 = No)

<b>PUFC09_GRADTECH</b>: A binary value corresponding to whether the respondent is a graduate of a technical/vocational course. (1 = Yes and 2 = No)

<b>PUFC10_CONWR</b>: An integer value corresponding to the category of OFW.  It is a categorical value which means that 1 corresponds to OCW, 2 is Workers other than OCW, 3 is Employees in Phil. Embassy, Consulates & other missions and so on.

<b>PUFC11_WORK</b>: A binary value corresponding to whether the respondent did any work for at least one hour during the past week. (1 = Yes and 2 = No)

<b>PUFC12_JOB</b>: A binary value corresponding to whether the respondent had a job or business during the past week despite not working during that week. (1 = Yes and 2 = No)

<b>PUFC14_PROCC</b>: An integer value representing the respondent's primary occupation during the past week. It is a categorical value. (No legend provided)

<b>PUFC16_PKB</b>: An integer value representing the kind of business or industry the respondent is employed in. It is a categorical value. (No legend provided)

<b>PUFC17_NATEM</b>: An integer value representing the respondent's nature of employment. It is a categorical value which means that 1 corresponds to permanent, 2 is short-term or seasonal, 3 is different employer on day to day or week to week basis.

<b>PUFC18_PNWHRS</b>: An integer value corresponding to the respondent's normal working hours per day. 

<b>PUFC19_PHOURS</b>: An integer value corresponding to the respondent's total number of hours worked during the past week including non-paid hours. 

<b>PUFC20_PWMORE</b>: A binary value representing whether the respondent wanted more hours of work during the past week. (1 = Yes and 2 = No)

<b>PUFC21_PLADDW</b>: A binary value representing whether the respondent looked for additional work during the past week. (1 = Yes and 2 = No)

<b>PUFC22_PFWRK</b>: A binary value representing whether it was the first time the respondent worked. (1 = Yes and 2 = No)

<b>PUFC23_PCLASS</b>: An integer value corresponding to the relationship of the respondent to where he works. It is a categorical value which means that 0 corresponds to worked for private household, 1 is worked for private establishment, 2 is worked for government corporation and so on.

<b>PUFC24_PBASIS</b>: An integer value corresponding to the method of payment for the respondent's primary occupation. It is a categorical value which means that 0 corresponds to in kind/imputed, 1 is per piece, 2 is per hour and so on.

<b>PUFC25_PBASIC</b>: An integer value corresponding to the basic pay for normal time of the respondent prior to deductions. 

<b>PUFC26_OJOB</b>: A binary value representing whether or not the respondent had another job or business during the past week. (1 = Yes and 2 = No)

<b>PUFC27_NJOBS</b>: An integer value corresponding to the total number of jobs held by the respondent during the past week. 

<b>PUFC28_THOURS</b>: An integer value corresponding to the total hours the respondent worked across all jobs during the past week.

<b>PUFC29_WWM48H</b>: An integer value representing the main reason for the respondent working more than 48 hours in the past week. It is a categorical value which means that 1 corresponds to wanted more earnings, 2 is requirements of the job, 3 is exceptional week and so on.  

<b>PUFC30_LOOKW</b>: A binary value indicating whether the respondent looked for work or tried to establish a business in the past week. (1 = Yes and 2 = No)

<b>PUFC31_FLWRK</b>: A binary value indicating whether the respondent was looking for work or trying to establish a business for the first time. (1 = Yes and 2 = No)

<b>PUFC32_JOBSM</b>: An integer value representing the respondent's methods to find work. It is a categorical value which means that 1 corresponds to registered in public employment agency, 2 is registered in private employment agency, 3 is approached employer directly and so on.  

<b>PUFC33_WEEKS</b>: An integer value corresponding to the number of weeks the respondent has been looking for work.

<b>PUFC34_WYNOT</b>: An integer value representing the reason the respondent isn't looking for work. It is a categorical value which means that 1 corresponds to Tired/Believed no work available, 2 is Awaiting results of previous job application, 3 is Temporary illness/disability and so on.  

<b>PUFC35_LTLOOKW</b>: An integer value representing the last time the respondent looked for work. It is a categorical value which means that 1 corresponds to Within the last month, 2 is One to six months ago, 3 is More than six months ago.

<b>PUFC36_AVAIL</b>: A binary value indicating whether the respondent would have been available for an opportunity last week or within two weeks. (1 = Yes and 2 = No)

<b>PUFC37_WILLING</b>: A binary value indicating whether the respondent was willing to work in the past week or within 2 weeks. (1 = Yes and 2 = No)

<b>PUFC38_PREVJOB</b>: A binary value indicating whether the respondent has worked before. (1 = Yes and 2 = No)

<b>PUFC40_POCC</b>: An integer value representing the respondent's last occupation. It is a categorical value. (No legend provided)

<b>PUFC41_WQTR</b>: A binary value indicating whether the respondent worked at all or had a job or business during the past quarter. (1 = Yes and 2 = No)

<b>PUFC43_QKB</b>: An integer value representing the respondent's kind of business for the past quarter. It is a categorical value. (No legend provided)

<b>PUFNEWEMPSTAT</b>: An integer value representing the respondent's employment status based on the new classification criteria used in the LFS. It is a categorical value. (No legend provided)
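
The description of `PUFPWGTFIN` says the weight is used to project survey results to the national level. A minimal sketch of what that means for a proportion is shown here. The rows are made-up toy values, not survey data, and the coding of `PUFC11_WORK` (1 = worked, 2 = did not work) is an assumption based on the values that appear in the data preview.

```python
import pandas as pd

# Toy rows standing in for survey respondents (values are made up for illustration).
toy = pd.DataFrame({
    "PUFC11_WORK": [1, 2, 1, 1],                 # assumed coding: 1 = worked, 2 = did not work
    "PUFPWGTFIN": [400.0, 100.0, 300.0, 200.0],  # final weight per respondent
})

worked = toy["PUFC11_WORK"].eq(1)

# Unweighted share: every respondent counts once.
unweighted_share = worked.mean()

# Weighted share: each respondent counts as many people as the weight indicates.
weighted_share = toy.loc[worked, "PUFPWGTFIN"].sum() / toy["PUFPWGTFIN"].sum()

print(f"unweighted: {unweighted_share:.3f}, weighted: {weighted_share:.3f}")
```

The two shares differ whenever the weights are unequal across groups, which is why population-level statements would normally use the weighted version.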


## Section 3. List of requirements

## Section 4. Data preprocessing and cleaning

It is good practice to first make a copy of the dataframe in order to preserve the original data for any future comparison and analysis.

First, check for any duplicate instances in the dataset in order to drop any repetitions.

In [4]:
clean_df = df.copy()

duplicate_count = clean_df.duplicated().sum()
print(f"Duplicated instances: {duplicate_count}")

Duplicated instances: 0


The output shows no duplicated entries.
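
Preserving the original data depends on how the working frame is created. The sketch below, on a toy frame with made-up values, illustrates the difference between plain assignment, which only creates a second name for the same DataFrame, and `DataFrame.copy()`, which creates an independent one.

```python
import pandas as pd

original = pd.DataFrame({"PUFC05_AGE": [49, 61, 19]})

alias = original            # same object, no data is copied
snapshot = original.copy()  # independent copy of the data

# Modify the frame through the alias.
alias.loc[0, "PUFC05_AGE"] = -1

print(alias is original)               # both names point to one frame
print(original.loc[0, "PUFC05_AGE"])   # the change is visible through `original`
print(snapshot.loc[0, "PUFC05_AGE"])   # the copy keeps the old value
```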

Next, we look for null values in the dataset. The `.isnull()` function reports no null values, which is inconsistent with what we see when opening the CSV file. We realized that some of the values are actually whitespace, which reflects the nature of a dataset that comes from a survey questionnaire.

In [5]:
print(clean_df.isnull().sum())
print((clean_df == ' ').sum()) 

PUFREG             0
PUFPRV             0
PUFPRRCD           0
PUFHHNUM           0
PUFURB2K10         0
PUFPWGTFIN         0
PUFSVYMO           0
PUFSVYYR           0
PUFPSU             0
PUFRPL             0
PUFHHSIZE          0
PUFC01_LNO         0
PUFC03_REL         0
PUFC04_SEX         0
PUFC05_AGE         0
PUFC06_MSTAT       0
PUFC07_GRADE       0
PUFC08_CURSCH      0
PUFC09_GRADTECH    0
PUFC10_CONWR       0
PUFC11_WORK        0
PUFC12_JOB         0
PUFC14_PROCC       0
PUFC16_PKB         0
PUFC17_NATEM       0
PUFC18_PNWHRS      0
PUFC19_PHOURS      0
PUFC20_PWMORE      0
PUFC21_PLADDW      0
PUFC22_PFWRK       0
PUFC23_PCLASS      0
PUFC24_PBASIS      0
PUFC25_PBASIC      0
PUFC26_OJOB        0
PUFC27_NJOBS       0
PUFC28_THOURS      0
PUFC29_WWM48H      0
PUFC30_LOOKW       0
PUFC31_FLWRK       0
PUFC32_JOBSM       0
PUFC33_WEEKS       0
PUFC34_WYNOT       0
PUFC35_LTLOOKW     0
PUFC36_AVAIL       0
PUFC37_WILLING     0
PUFC38_PREVJOB     0
PUFC40_POCC        0
PUFC41_WQTR  

Before we process any possible missing values further, it is easier to work with them if they share one representation. To unify empty-like values (whitespace, tabs, newlines, etc.), we convert them into null values.

In [6]:
clean_df.replace(r'^\s*$', np.nan, regex=True, inplace=True)

print(clean_df.isnull().sum())

#clean_df.to_csv(r"C:\Users\Matthew Sanchez\Desktop\3rdYr\T2\STINTSY\cleaned_LFS_PUF_April_2016.csv", index=False)

PUFREG                  0
PUFPRV                  0
PUFPRRCD                0
PUFHHNUM                0
PUFURB2K10              0
PUFPWGTFIN              0
PUFSVYMO                0
PUFSVYYR                0
PUFPSU                  0
PUFRPL                  0
PUFHHSIZE               0
PUFC01_LNO              0
PUFC03_REL              0
PUFC04_SEX              0
PUFC05_AGE              0
PUFC06_MSTAT        18339
PUFC07_GRADE        18339
PUFC08_CURSCH      107137
PUFC09_GRADTECH     57782
PUFC10_CONWR        57782
PUFC11_WORK         21894
PUFC12_JOB          93306
PUFC14_PROCC       108360
PUFC16_PKB         108360
PUFC17_NATEM       109507
PUFC18_PNWHRS      109507
PUFC19_PHOURS      109507
PUFC20_PWMORE      109507
PUFC21_PLADDW      109507
PUFC22_PFWRK       109507
PUFC23_PCLASS      109507
PUFC24_PBASIS      138947
PUFC25_PBASIC      144274
PUFC26_OJOB        109507
PUFC27_NJOBS       174924
PUFC28_THOURS      109507
PUFC29_WWM48H      163629
PUFC30_LOOKW       132692
PUFC31_FLWRK
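
Columns that contained blank strings were read with the `object` dtype, and replacing the blanks with NaN does not change that by itself. As a hedged sketch on a toy column (made-up values), `pd.to_numeric` can be applied after the replacement so that the codes become numbers; whether this step is needed for every column in the real data is an assumption.

```python
import numpy as np
import pandas as pd

# Toy column mixing numeric strings and blank answers, as a CSV reader might return it.
toy = pd.DataFrame({"PUFC06_MSTAT": ["2", " ", "1", "  ", "6"]})

# Whitespace-only strings are not NaN, so isnull() reports nothing missing yet.
missing_before = int(toy["PUFC06_MSTAT"].isnull().sum())

# Turn whitespace-only cells into NaN, then convert the column to a numeric dtype.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
cleaned["PUFC06_MSTAT"] = pd.to_numeric(cleaned["PUFC06_MSTAT"], errors="coerce")

missing_after = int(cleaned["PUFC06_MSTAT"].isnull().sum())
print(missing_before, missing_after, cleaned["PUFC06_MSTAT"].dtype)
```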

Before doing further processes, it is best to determine the relevant features needed for the classification task.

Since the classification task will determine if the household member has done any work for at least an hour over the past week, the relevant features we have selected to help with the prediction are:

_Demographic Factors_
- PUFC05_AGE
- PUFC04_SEX
- PUFC06_MSTAT

_Education and Training_
- PUFC07_GRADE
- PUFC08_CURSCH
- PUFC09_GRADTECH

_Overseas Work & Employment History_
- PUFC10_CONWR
- PUFC38_PREVJOB
- PUFC17_NATEM

_Job-Seeking Efforts_
- PUFC30_LOOKW
- PUFC31_FLWRK
- PUFC37_WILLING

_Availability & Motivation_
- PUFC36_AVAIL
- PUFC20_PWMORE
- PUFC21_PLADDW

In [7]:
relevant_features = [
    "PUFC11_WORK",                                          # label to predict
    "PUFC05_AGE", "PUFC04_SEX", "PUFC06_MSTAT",             # Demographic Factors
    "PUFC07_GRADE", "PUFC08_CURSCH", "PUFC09_GRADTECH",     # Education & Training
    "PUFC10_CONWR", "PUFC38_PREVJOB", "PUFC17_NATEM",       # Overseas Work & Employment History
    "PUFC30_LOOKW", "PUFC31_FLWRK", "PUFC37_WILLING",       # Job-Seeking Efforts
    "PUFC36_AVAIL", "PUFC20_PWMORE", "PUFC21_PLADDW"        # Job Availability & Motivation
]

# Create a new DataFrame with only the selected features
focused_df = clean_df[relevant_features].copy()

display(focused_df)
print(focused_df.isnull().sum())

Unnamed: 0,PUFC11_WORK,PUFC05_AGE,PUFC04_SEX,PUFC06_MSTAT,PUFC07_GRADE,PUFC08_CURSCH,PUFC09_GRADTECH,PUFC10_CONWR,PUFC38_PREVJOB,PUFC17_NATEM,PUFC30_LOOKW,PUFC31_FLWRK,PUFC37_WILLING,PUFC36_AVAIL,PUFC20_PWMORE,PUFC21_PLADDW
0,1,49,1,2,350,,2,5,,1,,,,,1,1
1,1,61,2,2,350,,2,5,,2,,,,,2,2
2,1,19,1,1,350,2,2,5,,2,,,,,1,1
3,1,48,1,2,320,,2,5,,1,,,,,1,1
4,1,41,2,2,350,,2,5,,1,,,,,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180857,1,29,1,2,350,,2,5,,1,,,,,2,2
180858,2,29,2,2,830,,2,5,2,,2,,,,,
180859,,4,2,,,,,,,,,,,,,
180860,,2,2,,,,,,,,,,,,,


PUFC11_WORK         21894
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT        18339
PUFC07_GRADE        18339
PUFC08_CURSCH      107137
PUFC09_GRADTECH     57782
PUFC10_CONWR        57782
PUFC38_PREVJOB     132692
PUFC17_NATEM       109507
PUFC30_LOOKW       132692
PUFC31_FLWRK       178569
PUFC37_WILLING     174893
PUFC36_AVAIL       174893
PUFC20_PWMORE      109507
PUFC21_PLADDW      109507
dtype: int64
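
The raw counts above are easier to compare as shares of the 180,862 rows; for instance, `PUFC31_FLWRK` is missing in roughly 98.7% of them. A small sketch of how such shares could be computed, shown on a toy frame with made-up values rather than on `focused_df`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for focused_df (values are made up for illustration).
toy = pd.DataFrame({
    "PUFC05_AGE": [49, 61, 19, 4],
    "PUFC08_CURSCH": [np.nan, np.nan, 2, np.nan],
    "PUFC31_FLWRK": [np.nan, np.nan, np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing entries in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is almost entirely missing carries little information once it is imputed with a single value, which is worth keeping in mind when interpreting the fills that follow.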


Empty entries for `PUFC06_MSTAT` (marital status) will be defaulted to a value of 6 since the questionnaire associated 6 with "Unknown".

In [8]:
focused_df['PUFC06_MSTAT'] = focused_df['PUFC06_MSTAT'].fillna(6)

print(focused_df.isnull().sum())

PUFC11_WORK         21894
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT            0
PUFC07_GRADE        18339
PUFC08_CURSCH      107137
PUFC09_GRADTECH     57782
PUFC10_CONWR        57782
PUFC38_PREVJOB     132692
PUFC17_NATEM       109507
PUFC30_LOOKW       132692
PUFC31_FLWRK       178569
PUFC37_WILLING     174893
PUFC36_AVAIL       174893
PUFC20_PWMORE      109507
PUFC21_PLADDW      109507
dtype: int64
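
Calling `fillna(..., inplace=True)` on a single selected column is a chained operation that recent pandas versions flag as deprecated. The sketch below, on a toy frame with made-up values, shows two patterns that avoid it: assigning the filled column back, or passing a column-to-value mapping to `DataFrame.fillna`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PUFC06_MSTAT": [2.0, np.nan, 1.0],
    "PUFC08_CURSCH": [np.nan, 2.0, 1.0],
})

# Pattern 1: assign the filled column back to the frame.
toy["PUFC06_MSTAT"] = toy["PUFC06_MSTAT"].fillna(6)

# Pattern 2: pass a {column: value} mapping to DataFrame.fillna.
toy = toy.fillna({"PUFC08_CURSCH": 2})

print(toy.isnull().sum())
```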


Empty entries for `PUFC07_GRADE` (highest grade completed) will be defaulted to a value of the mode to preserve the most common educational attainment.

In [9]:
mode_value = focused_df["PUFC07_GRADE"].mode()[0]
focused_df["PUFC07_GRADE"] = focused_df["PUFC07_GRADE"].fillna(mode_value)

print(focused_df.isnull().sum())

PUFC11_WORK         21894
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT            0
PUFC07_GRADE            0
PUFC08_CURSCH      107137
PUFC09_GRADTECH     57782
PUFC10_CONWR        57782
PUFC38_PREVJOB     132692
PUFC17_NATEM       109507
PUFC30_LOOKW       132692
PUFC31_FLWRK       178569
PUFC37_WILLING     174893
PUFC36_AVAIL       174893
PUFC20_PWMORE      109507
PUFC21_PLADDW      109507
dtype: int64


Empty entries for `PUFC08_CURSCH` (currently attending school) will be defaulted to a value of 2 since the questionnaire associated 2 with NO. We assume skip by omission which implies a NO.

In [10]:
focused_df['PUFC08_CURSCH'] = focused_df['PUFC08_CURSCH'].fillna(2)

print(focused_df.isnull().sum())

PUFC11_WORK         21894
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT            0
PUFC07_GRADE            0
PUFC08_CURSCH           0
PUFC09_GRADTECH     57782
PUFC10_CONWR        57782
PUFC38_PREVJOB     132692
PUFC17_NATEM       109507
PUFC30_LOOKW       132692
PUFC31_FLWRK       178569
PUFC37_WILLING     174893
PUFC36_AVAIL       174893
PUFC20_PWMORE      109507
PUFC21_PLADDW      109507
dtype: int64
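
Treating a skipped question as NO is an assumption, and after the fill an imputed 2 is indistinguishable from an answered 2. One possible way to keep that information, sketched on a toy column with made-up values, is to record a flag before filling. The column name `PUFC08_CURSCH_WAS_BLANK` is hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"PUFC08_CURSCH": [1.0, np.nan, 2.0, np.nan]})

# Hypothetical helper column: 1 where the answer was blank before imputation.
toy["PUFC08_CURSCH_WAS_BLANK"] = toy["PUFC08_CURSCH"].isnull().astype(int)

# Apply the skip-by-omission assumption: a blank is treated as 2 (No).
toy["PUFC08_CURSCH"] = toy["PUFC08_CURSCH"].fillna(2)

print(toy)
```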

Empty entries for `PUFC09_GRADTECH` (graduate of technical/vocational course) will be defaulted to a value of 2 since the questionnaire associated 2 with NO. We assume skip by omission which implies a NO.

In [11]:
focused_df['PUFC09_GRADTECH'] = focused_df['PUFC09_GRADTECH'].fillna(2)

print(focused_df.isnull().sum())

PUFC11_WORK         21894
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT            0
PUFC07_GRADE            0
PUFC08_CURSCH           0
PUFC09_GRADTECH         0
PUFC10_CONWR        57782
PUFC38_PREVJOB     132692
PUFC17_NATEM       109507
PUFC30_LOOKW       132692
PUFC31_FLWRK       178569
PUFC37_WILLING     174893
PUFC36_AVAIL       174893
PUFC20_PWMORE      109507
PUFC21_PLADDW      109507
dtype: int64


Empty entries for `PUFC10_CONWR` (overseas Filipino indicator) are imputed. Several rows with an empty value in this column still have answers in the succeeding columns, which indicates that the person must belong to either category 4 (student/tourist) or category 5 (others). Since 5 is the mode of the column, empty entries default to 5.

In the same column, we drop entries whose value is 1, 2, or 3, since these indicate overseas Filipino workers, who are not the primary focus of the dataset. Moreover, the questionnaire indicates that if the household member's answer belongs to any of these three categories, the interviewer may move on to the next member, which leaves the succeeding columns blank and retains only minimal information. A total of 3,555 entries were removed.

In [12]:
focused_df["PUFC10_CONWR"] = focused_df["PUFC10_CONWR"].fillna(5)

focused_df["PUFC10_CONWR"] = focused_df["PUFC10_CONWR"].astype(str).str.strip().astype(int)
focused_df = focused_df[~focused_df["PUFC10_CONWR"].isin([1, 2, 3])].copy()

print(focused_df["PUFC10_CONWR"].value_counts())  # Should only contain 4 and 5
print(f"Remaining dataset size: {focused_df.shape[0]} rows")
print(focused_df.isnull().sum())


PUFC10_CONWR
5    177278
4        29
Name: count, dtype: int64
Remaining dataset size: 177307 rows
PUFC11_WORK         18339
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT            0
PUFC07_GRADE            0
PUFC08_CURSCH           0
PUFC09_GRADTECH         0
PUFC10_CONWR            0
PUFC38_PREVJOB     129137
PUFC17_NATEM       105952
PUFC30_LOOKW       129137
PUFC31_FLWRK       175014
PUFC37_WILLING     171338
PUFC36_AVAIL       171338
PUFC20_PWMORE      105952
PUFC21_PLADDW      105952
dtype: int64
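
The row filter used for the overseas categories can be checked on a small example. The sketch below uses a toy frame with made-up values; it builds the mask with `isin`, keeps the remaining rows as an independent frame with `.copy()`, and counts how many rows were removed.

```python
import pandas as pd

toy = pd.DataFrame({
    "PUFC10_CONWR": [5, 1, 4, 3, 5, 2],
    "PUFC05_AGE": [49, 35, 19, 41, 61, 28],
})

overseas_codes = [1, 2, 3]  # categories the notebook treats as overseas workers

# Boolean mask of rows to drop; .copy() makes the result an independent frame
# so later column assignments do not act on a slice of the old one.
is_overseas = toy["PUFC10_CONWR"].isin(overseas_codes)
kept = toy[~is_overseas].copy()

removed = len(toy) - len(kept)
print(f"removed {removed} rows, kept {len(kept)}")
```

Applied to the real data, the same subtraction gives 180,862 - 177,307 = 3,555 removed rows, which matches the count stated above.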


Empty entries for `PUFC38_PREVJOB` (previous job indicator) will be defaulted to the mode to preserve the most common response.

In [13]:
mode_value = focused_df["PUFC38_PREVJOB"].mode()[0]
focused_df["PUFC38_PREVJOB"] = focused_df["PUFC38_PREVJOB"].fillna(mode_value)

print(focused_df.isnull().sum())

PUFC11_WORK         18339
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT            0
PUFC07_GRADE            0
PUFC08_CURSCH           0
PUFC09_GRADTECH         0
PUFC10_CONWR            0
PUFC38_PREVJOB          0
PUFC17_NATEM       105952
PUFC30_LOOKW       129137
PUFC31_FLWRK       175014
PUFC37_WILLING     171338
PUFC36_AVAIL       171338
PUFC20_PWMORE      105952
PUFC21_PLADDW      105952
dtype: int64


Empty entries for `PUFC37_WILLING` (willingness to take up work during the past 2 weeks) will be defaulted to a value of 2 since the questionnaire associated 2 with NO. We assume skip by omission which implies a NO. 

In [14]:
focused_df['PUFC37_WILLING'] = focused_df['PUFC37_WILLING'].fillna(2)

print(focused_df.isnull().sum())

PUFC11_WORK         18339
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT            0
PUFC07_GRADE            0
PUFC08_CURSCH           0
PUFC09_GRADTECH         0
PUFC10_CONWR            0
PUFC38_PREVJOB          0
PUFC17_NATEM       105952
PUFC30_LOOKW       129137
PUFC31_FLWRK       175014
PUFC37_WILLING          0
PUFC36_AVAIL       171338
PUFC20_PWMORE      105952
PUFC21_PLADDW      105952
dtype: int64


Empty entries for `PUFC36_AVAIL` (availability for work) will be defaulted to the mode to preserve the most common response.

In [15]:
mode_value = focused_df["PUFC36_AVAIL"].mode()[0]
focused_df["PUFC36_AVAIL"] = focused_df["PUFC36_AVAIL"].fillna(mode_value)

print(focused_df.isnull().sum())

PUFC11_WORK         18339
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT            0
PUFC07_GRADE            0
PUFC08_CURSCH           0
PUFC09_GRADTECH         0
PUFC10_CONWR            0
PUFC38_PREVJOB          0
PUFC17_NATEM       105952
PUFC30_LOOKW       129137
PUFC31_FLWRK       175014
PUFC37_WILLING          0
PUFC36_AVAIL            0
PUFC20_PWMORE      105952
PUFC21_PLADDW      105952
dtype: int64


Empty entries for `PUFC31_FLWRK` (first time looking for work or trying to establish a business) will be defaulted to the mode to preserve the most common response.

In [16]:
mode_value = focused_df["PUFC31_FLWRK"].mode()[0]
focused_df["PUFC31_FLWRK"] = focused_df["PUFC31_FLWRK"].fillna(mode_value)

print(focused_df.isnull().sum())

PUFC11_WORK         18339
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT            0
PUFC07_GRADE            0
PUFC08_CURSCH           0
PUFC09_GRADTECH         0
PUFC10_CONWR            0
PUFC38_PREVJOB          0
PUFC17_NATEM       105952
PUFC30_LOOKW       129137
PUFC31_FLWRK            0
PUFC37_WILLING          0
PUFC36_AVAIL            0
PUFC20_PWMORE      105952
PUFC21_PLADDW      105952
dtype: int64


Empty entries for `PUFC30_LOOKW` (looked for work or tried to establish a business) will be defaulted to the mode to preserve the most common response.

In [17]:
mode_value = focused_df["PUFC30_LOOKW"].mode()[0]
focused_df["PUFC30_LOOKW"] = focused_df["PUFC30_LOOKW"].fillna(mode_value)

print(focused_df.isnull().sum())

PUFC11_WORK         18339
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT            0
PUFC07_GRADE            0
PUFC08_CURSCH           0
PUFC09_GRADTECH         0
PUFC10_CONWR            0
PUFC38_PREVJOB          0
PUFC17_NATEM       105952
PUFC30_LOOKW            0
PUFC31_FLWRK            0
PUFC37_WILLING          0
PUFC36_AVAIL            0
PUFC20_PWMORE      105952
PUFC21_PLADDW      105952
dtype: int64

Empty entries for `PUFC20_PWMORE` (want more hours of work) will be defaulted to the mode to preserve the most common response.

In [18]:
mode_value = focused_df["PUFC20_PWMORE"].mode()[0]
focused_df["PUFC20_PWMORE"] = focused_df["PUFC20_PWMORE"].fillna(mode_value)

print(focused_df.isnull().sum())

PUFC11_WORK         18339
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT            0
PUFC07_GRADE            0
PUFC08_CURSCH           0
PUFC09_GRADTECH         0
PUFC10_CONWR            0
PUFC38_PREVJOB          0
PUFC17_NATEM       105952
PUFC30_LOOKW            0
PUFC31_FLWRK            0
PUFC37_WILLING          0
PUFC36_AVAIL            0
PUFC20_PWMORE           0
PUFC21_PLADDW      105952
dtype: int64


Empty entries for `PUFC17_NATEM` (nature of employment) will be filled with the column's mode, so that the most common response is kept as the default.

In [19]:
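mode_value = focused_df["PUFC17_NATEM"].mode()[0]
# Assign the result back; a chained fillna(inplace=True) will not update focused_df in pandas 3.0.
focused_df["PUFC17_NATEM"] = focused_df["PUFC17_NATEM"].fillna(mode_value)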

print(focused_df.isnull().sum())

PUFC11_WORK         18339
PUFC05_AGE              0
PUFC04_SEX              0
PUFC06_MSTAT            0
PUFC07_GRADE            0
PUFC08_CURSCH           0
PUFC09_GRADTECH         0
PUFC10_CONWR            0
PUFC38_PREVJOB          0
PUFC17_NATEM            0
PUFC30_LOOKW            0
PUFC31_FLWRK            0
PUFC37_WILLING          0
PUFC36_AVAIL            0
PUFC20_PWMORE           0
PUFC21_PLADDW      105952
dtype: int64


Empty entries for `PUFC21_PLADDW` (look for additional work) will be filled with the column's mode, so that the most common response is kept as the default.

In [20]:
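mode_value = focused_df["PUFC21_PLADDW"].mode()[0]
# Assign the result back; a chained fillna(inplace=True) will not update focused_df in pandas 3.0.
focused_df["PUFC21_PLADDW"] = focused_df["PUFC21_PLADDW"].fillna(mode_value)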

print(focused_df.isnull().sum())

PUFC11_WORK        18339
PUFC05_AGE             0
PUFC04_SEX             0
PUFC06_MSTAT           0
PUFC07_GRADE           0
PUFC08_CURSCH          0
PUFC09_GRADTECH        0
PUFC10_CONWR           0
PUFC38_PREVJOB         0
PUFC17_NATEM           0
PUFC30_LOOKW           0
PUFC31_FLWRK           0
PUFC37_WILLING         0
PUFC36_AVAIL           0
PUFC20_PWMORE          0
PUFC21_PLADDW          0
dtype: int64


Lastly, empty entries for `PUFC11_WORK` (work indicator) will be filled with the column's mode, so that the most common response is kept as the default.

In [21]:
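mode_value = focused_df["PUFC11_WORK"].mode()[0]
# Assign the result back; a chained fillna(inplace=True) will not update focused_df in pandas 3.0.
focused_df["PUFC11_WORK"] = focused_df["PUFC11_WORK"].fillna(mode_value)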

print(focused_df.isnull().sum())

PUFC11_WORK        0
PUFC05_AGE         0
PUFC04_SEX         0
PUFC06_MSTAT       0
PUFC07_GRADE       0
PUFC08_CURSCH      0
PUFC09_GRADTECH    0
PUFC10_CONWR       0
PUFC38_PREVJOB     0
PUFC17_NATEM       0
PUFC30_LOOKW       0
PUFC31_FLWRK       0
PUFC37_WILLING     0
PUFC36_AVAIL       0
PUFC20_PWMORE      0
PUFC21_PLADDW      0
dtype: int64
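
The mode-filling cells in this section all follow one pattern, so it can be written once. This is a minimal sketch, not the notebook's own code: the helper `fill_with_mode` and the made-up `toy_df` are assumptions for illustration. It assigns each filled column back rather than relying on a chained `inplace=True` call.

```python
import numpy as np
import pandas as pd

def fill_with_mode(df, columns):
    """Return a copy of df with NaNs in each listed column replaced by that column's mode."""
    filled = df.copy()
    for col in columns:
        modes = filled[col].mode()  # mode() ignores NaN by default
        if modes.empty:             # column is entirely NaN: nothing to fill with
            continue
        filled[col] = filled[col].fillna(modes.iloc[0])
    return filled

# Hypothetical toy frame; column names mirror the notebook, values are invented.
toy_df = pd.DataFrame({
    "PUFC17_NATEM": [1.0, 1.0, np.nan, 2.0],
    "PUFC20_PWMORE": [2.0, np.nan, 2.0, 1.0],
})
filled_df = fill_with_mode(toy_df, ["PUFC17_NATEM", "PUFC20_PWMORE"])
print(filled_df.isnull().sum())
```

Returning a copy leaves the input frame untouched, so the pre-imputation null counts remain available for comparison.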


In [None]:
#focused_df.to_csv(r"C:\Users\Matthew Sanchez\Desktop\3rdYr\T2\STINTSY\cleaned_LFS_PUF_April_2016.csv", index=False)
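
An absolute path to one user's desktop resolves only on that machine. Below is a minimal sketch of the same export using a path relative to the working directory. The `data` folder name and the stand-in `demo_df` are assumptions for illustration, not part of the notebook.

```python
from pathlib import Path

import pandas as pd

# Stand-in for focused_df so the snippet runs on its own (values are invented).
demo_df = pd.DataFrame({"PUFC11_WORK": [1, 2], "PUFC05_AGE": [30, 45]})

# Hypothetical output folder, created in the working directory if missing.
out_dir = Path("data")
out_dir.mkdir(parents=True, exist_ok=True)

out_path = out_dir / "cleaned_LFS_PUF_April_2016.csv"
demo_df.to_csv(out_path, index=False)  # index=False matches the original call
```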

## Section 5. Exploratory data analysis

## Section 6. Initial model training

## Section 7. Error analysis

## Section 8. Improving model performance

## Section 9. Model performance summary

## Section 10. Insights and conclusions

## Section 11. References