# The goal

To assess the two files (the OEWS estimates and the cleaned PUMS data) for completeness and consistency.

# Import Libraries

In [6]:
import pandas as pd
import numpy as np

# Load Datasets

In [7]:
oews = pd.read_excel('oes_research_2021_sec_55-56.xlsx')
cleaned_pums = pd.read_csv('cleaned_pums_2021.csv')

In [8]:
oews.head(4)

Unnamed: 0,AREA,AREA_TITLE,NAICS,NAICS_TITLE,I_GROUP,OCC_CODE,OCC_TITLE,O_GROUP,TOT_EMP,EMP_PRSE,...,H_MEDIAN,H_PCT75,H_PCT90,A_PCT10,A_PCT25,A_MEDIAN,A_PCT75,A_PCT90,ANNUAL,HOURLY
0,1,Alabama,55,Management of Companies and Enterprises,sector,00-0000,All Occupations,total,21920,0.0,...,35.6,56.94,79.49,35470,47040,74050,118440,165330,,
1,1,Alabama,55,Management of Companies and Enterprises,sector,11-0000,Management Occupations,major,4820,4.1,...,61.13,92.03,#,61600,94020,127140,191420,#,,
2,1,Alabama,55,Management of Companies and Enterprises,sector,11-1021,General and Operations Managers,detailed,1600,7.0,...,60.5,#,#,60010,78520,125850,#,#,,
3,1,Alabama,55,Management of Companies and Enterprises,sector,11-2021,Marketing Managers,detailed,140,13.6,...,61.13,99.23,#,65240,98680,127140,206410,#,,


In [9]:
cleaned_pums.head(4)

Unnamed: 0,WRK,SEX,SOCP
0,1,2,119151
1,2,1,119111
2,1,2,113121
3,1,1,1110XX


# 1. Inspecting the completeness

Looking for any missing or incomplete values

## 1.1 Creating a subset of the dataset

The goal here is to create a subset of the dataset containing only the `AREA_TITLE`, `OCC_CODE`, `OCC_TITLE` and `H_MEAN` columns.

In [16]:
oews_subset = oews[['AREA_TITLE','OCC_CODE','OCC_TITLE','H_MEAN']]

oews_subset

Unnamed: 0,AREA_TITLE,OCC_CODE,OCC_TITLE,H_MEAN
0,Alabama,00-0000,All Occupations,42.88
1,Alabama,11-0000,Management Occupations,70.9
2,Alabama,11-1021,General and Operations Managers,72.76
3,Alabama,11-2021,Marketing Managers,69.97
4,Alabama,11-2022,Sales Managers,62.97
...,...,...,...,...
71503,Puerto Rico,53-0000,Transportation and Material Moving Occupations,11.15
71504,Puerto Rico,53-1047,First-Line Supervisors of Transportation and M...,18.02
71505,Puerto Rico,53-3032,Heavy and Tractor-Trailer Truck Drivers,11.08
71506,Puerto Rico,53-7081,Refuse and Recyclable Material Collectors,10.06


In [19]:
oews_subset.isna().sum()

0

No missing values are reported at this point.

## 1.2 Checking summary statistics

The summary statistics can be checked using the `.describe()` and `.info()` methods. The most important column to look at here is `H_MEAN`.

In [20]:
oews_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71508 entries, 0 to 71507
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   AREA_TITLE  71508 non-null  object
 1   OCC_CODE    71508 non-null  object
 2   OCC_TITLE   71508 non-null  object
 3   H_MEAN      71508 non-null  object
dtypes: object(4)
memory usage: 2.2+ MB


In [22]:
oews_subset['H_MEAN'].describe()

count     71508
unique     6818
top           *
freq        536
Name: H_MEAN, dtype: object

The missing values in the `H_MEAN` column are represented by the '*' character.
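A quick way to list which non-numeric placeholders hide in an object column is to coerce it to numbers and look at what failed to parse. The sketch below uses a small made-up series (`h_mean` is a hypothetical stand-in for `oews_subset['H_MEAN']`, and its values are illustrative only).

```python
import pandas as pd

# Hypothetical stand-in for oews_subset['H_MEAN']: numbers mixed with placeholder strings
h_mean = pd.Series([42.88, "*", 70.9, "#", "*"], dtype="object", name="H_MEAN")

# Values that cannot be parsed as numbers become NaN under errors="coerce"
parsed = pd.to_numeric(h_mean, errors="coerce")

# Keep the original entries that failed to parse and count each distinct placeholder
placeholders = h_mean[parsed.isna() & h_mean.notna()]
placeholder_counts = placeholders.value_counts()
print(placeholder_counts)
```

Running the same check on the real column would likely surface every placeholder in one pass, rather than discovering them one at a time during conversion.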

## 1.3 Looking into the dtype of the dataset

Because '\*' is used to represent missing values, the entire column was read as the object data type. The column should instead be a float (float64, to be exact), with the '\*' character replaced by `np.nan`.

In [23]:
pd.options.mode.chained_assignment = None  # suppress the SettingWithCopyWarning raised when modifying the subset

Replace every '*' character with `np.nan` and keep all other values as they are. While converting to a float, a second placeholder character, '#', was found; it is also treated as a missing value and replaced with `np.nan`.

In [32]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
oews_subset['H_MEAN'] = oews_subset['H_MEAN'].apply(lambda x: np.nan if x in ("*", "#") else x)

In [34]:
pd.to_numeric(oews_subset['H_MEAN'])

0        42.88
1        70.90
2        72.76
3        69.97
4        62.97
         ...  
71503    11.15
71504    18.02
71505    11.08
71506    10.06
71507    30.01
Name: H_MEAN, Length: 71508, dtype: float64
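An alternative that avoids silencing the chained-assignment warning is to take an explicit copy of the subset and assign the converted column back. This is a minimal sketch on a made-up frame (`oews_demo` and `subset_demo` are hypothetical names, not the real data). Note that `errors="coerce"` turns any non-numeric value into NaN, which is broader than replacing only '*' and '#'.

```python
import pandas as pd

# Hypothetical miniature of the OEWS data, not the real file
oews_demo = pd.DataFrame({
    "AREA_TITLE": ["Alabama", "Alabama", "Alabama"],
    "OCC_CODE": ["00-0000", "11-0000", "11-1021"],
    "H_MEAN": [42.88, "*", "#"],
})

# .copy() makes the subset an independent frame, so assigning to it does not warn
subset_demo = oews_demo[["AREA_TITLE", "OCC_CODE", "H_MEAN"]].copy()

# errors="coerce" converts anything non-numeric to NaN; the result is stored back
subset_demo["H_MEAN"] = pd.to_numeric(subset_demo["H_MEAN"], errors="coerce")
print(subset_demo.dtypes)
```

Assigning the result back makes the dtype change explicit instead of relying on the type inferred by `.apply`.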

## 1.4 Checking the number of missing values again

In [37]:
oews_subset.isna().sum()

AREA_TITLE      0
OCC_CODE        0
OCC_TITLE       0
H_MEAN        562
dtype: int64

There are now 562 missing entries in `H_MEAN`: the 536 '*' values counted earlier plus 26 '#' values.
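To see where the missing values are concentrated, the count can be broken down per area. The sketch below uses a made-up frame (`demo` and its values are hypothetical) and only illustrates the grouping pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two areas with some missing hourly means
demo = pd.DataFrame({
    "AREA_TITLE": ["Alabama", "Alabama", "Puerto Rico", "Puerto Rico", "Puerto Rico"],
    "H_MEAN": [42.88, np.nan, 11.15, np.nan, np.nan],
})

# isna() gives a boolean series; summing it per area counts the missing entries
missing_by_area = demo["H_MEAN"].isna().groupby(demo["AREA_TITLE"]).sum()
print(missing_by_area.sort_values(ascending=False))
```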

# 2. Inspecting the consistency

The consistency will be checked for the `AREA_TITLE` and `OCC_CODE`/`SOCP` columns between the OEWS subset and the PUMS data.

## 2.1 Is the area consistent between the two datasets?

In [41]:
oews_subset.head()

Unnamed: 0,AREA_TITLE,OCC_CODE,OCC_TITLE,H_MEAN
0,Alabama,00-0000,All Occupations,42.88
1,Alabama,11-0000,Management Occupations,70.9
2,Alabama,11-1021,General and Operations Managers,72.76
3,Alabama,11-2021,Marketing Managers,69.97
4,Alabama,11-2022,Sales Managers,62.97


In [40]:
cleaned_pums.head()

Unnamed: 0,WRK,SEX,SOCP
0,1,2,119151
1,2,1,119111
2,1,2,113121
3,1,1,1110XX
4,1,1,113051


In [42]:
oews_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71508 entries, 0 to 71507
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   AREA_TITLE  71508 non-null  object 
 1   OCC_CODE    71508 non-null  object 
 2   OCC_TITLE   71508 non-null  object 
 3   H_MEAN      70946 non-null  float64
dtypes: float64(1), object(3)
memory usage: 2.2+ MB


In [43]:
oews_subset.describe()

Unnamed: 0,H_MEAN
count,70946.0
mean,30.538687
std,17.569752
min,8.18
25%,18.01
50%,24.77
75%,38.43
max,172.36


In [44]:
cleaned_pums.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22353 entries, 0 to 22352
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   WRK     22353 non-null  int64 
 1   SEX     22353 non-null  int64 
 2   SOCP    22353 non-null  object
dtypes: int64(2), object(1)
memory usage: 524.0+ KB


In [45]:
cleaned_pums.describe()

Unnamed: 0,WRK,SEX
count,22353.0,22353.0
mean,1.169015,1.439538
std,0.374774,0.496342
min,1.0,1.0
25%,1.0,1.0
50%,1.0,1.0
75%,1.0,2.0
max,2.0,2.0


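The areas are not consistent between the two datasets: the OEWS subset has an `AREA_TITLE` column, while the PUMS data contains only `WRK`, `SEX` and `SOCP`, so there is no area column to compare against.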
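The same conclusion can be reached programmatically by comparing the column labels of the two frames. This sketch hard-codes the column names shown in the `.info()` outputs above instead of loading the files.

```python
import pandas as pd

# Column labels as reported by .info() for each dataset
oews_cols = pd.Index(["AREA_TITLE", "OCC_CODE", "OCC_TITLE", "H_MEAN"])
pums_cols = pd.Index(["WRK", "SEX", "SOCP"])

# Columns present in both datasets (none share a name here)
common_cols = oews_cols.intersection(pums_cols)

# Whether PUMS carries an area column under the OEWS name
pums_has_area = "AREA_TITLE" in pums_cols
print(list(common_cols), pums_has_area)
```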

## 2.2 Are the occupation codes consistent?

Checking whether the occupation codes are consistent between the two datasets (the `OCC_CODE` and `SOCP` columns), using the `.sample()` method to look at random samples from each dataset.

In [47]:
oews_subset['OCC_CODE'].sample(5)

14544    11-2021
43139    47-2031
32086    43-4161
43465    53-3032
24233    53-7051
Name: OCC_CODE, dtype: object

In [48]:
cleaned_pums['SOCP'].sample(5)

14604    1191XX
18073    119051
10044    119141
13744    119041
12678    113131
Name: SOCP, dtype: object

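The difference between the two datasets is that the OEWS dataset contains a '-' in its `OCC_CODE` column while the PUMS dataset does not. The PUMS `SOCP` column also contains codes ending in 'XX' (for example `1191XX`), which do not appear in the OEWS sample.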
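One possible way to make the codes comparable is to strip the hyphen from `OCC_CODE` and then compare the two sets of codes. The sketch below uses a handful of illustrative values (the series `occ_code` and `socp` are hypothetical stand-ins for the real columns), and it does not attempt to resolve the 'XX' codes.

```python
import pandas as pd

# Hypothetical stand-ins for oews_subset['OCC_CODE'] and cleaned_pums['SOCP']
occ_code = pd.Series(["11-2021", "47-2031", "11-9051"], name="OCC_CODE")
socp = pd.Series(["1191XX", "119051", "112021"], name="SOCP")

# Remove the literal '-' so both columns use the same six-character format
occ_normalized = occ_code.str.replace("-", "", regex=False)

# Compare the distinct codes in each dataset
oews_codes = set(occ_normalized)
pums_codes = set(socp)
shared = oews_codes & pums_codes
only_in_pums = pums_codes - oews_codes
print(sorted(shared))
print(sorted(only_in_pums))
```

Codes left in `only_in_pums` would need a separate decision, for instance how the 'XX' values should map onto OEWS codes.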