# Import Libraries

In [1]:
import pandas as pd
import numpy as np

# Load Dataset

In [3]:
df = pd.read_excel('./oes_research_2021_sec_55-56.xlsx')

In [4]:
df.head()

Unnamed: 0,AREA,AREA_TITLE,NAICS,NAICS_TITLE,I_GROUP,OCC_CODE,OCC_TITLE,O_GROUP,TOT_EMP,EMP_PRSE,...,H_MEDIAN,H_PCT75,H_PCT90,A_PCT10,A_PCT25,A_MEDIAN,A_PCT75,A_PCT90,ANNUAL,HOURLY
0,1,Alabama,55,Management of Companies and Enterprises,sector,00-0000,All Occupations,total,21920,0.0,...,35.6,56.94,79.49,35470,47040,74050,118440,165330,,
1,1,Alabama,55,Management of Companies and Enterprises,sector,11-0000,Management Occupations,major,4820,4.1,...,61.13,92.03,#,61600,94020,127140,191420,#,,
2,1,Alabama,55,Management of Companies and Enterprises,sector,11-1021,General and Operations Managers,detailed,1600,7.0,...,60.5,#,#,60010,78520,125850,#,#,,
3,1,Alabama,55,Management of Companies and Enterprises,sector,11-2021,Marketing Managers,detailed,140,13.6,...,61.13,99.23,#,65240,98680,127140,206410,#,,
4,1,Alabama,55,Management of Companies and Enterprises,sector,11-2022,Sales Managers,detailed,140,14.7,...,49.56,77.94,#,59390,79010,103080,162110,#,,


# Handling Missing Data

## Creating the subset of the dataset to work with

In [6]:
df_sub = df[['AREA_TITLE','OCC_CODE','OCC_TITLE','H_MEAN']]

## 1.1 Checking the Rows

Checking the rows that contain `*` or `#` in the `H_MEAN` column

In [7]:
df_sub.loc[df_sub['H_MEAN'].isin(['*','#'])]

Unnamed: 0,AREA_TITLE,OCC_CODE,OCC_TITLE,H_MEAN
15,Alabama,11-9141,"Property, Real Estate, and Community Associati...",*
31,Alabama,13-2052,Personal Financial Advisors,*
51,Alabama,17-3011,Architectural and Civil Drafters,*
298,Arkansas,11-9141,"Property, Real Estate, and Community Associati...",*
301,Arkansas,13-1031,"Claims Adjusters, Examiners, and Investigators",*
...,...,...,...,...
71019,New Mexico,13-0000,Business and Financial Operations Occupations,*
71103,North Dakota,47-4071,Septic Tank Servicers and Sewer Pipe Cleaners,*
71264,Tennessee,11-9021,Construction Managers,*
71293,Tennessee,53-7062,"Laborers and Freight, Stock, and Material Move...",*
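To see how many rows carry each marker rather than just listing them, the filtered column can be tallied. The snippet below is a minimal sketch on a hypothetical miniature of `df_sub`; the frame `toy` and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of df_sub; values are invented for illustration only.
toy = pd.DataFrame({
    "OCC_CODE": ["11-9141", "13-2052", "11-1021", "11-2021"],
    "H_MEAN": ["*", "#", 60.5, 61.13],
})

# Boolean mask of the rows whose H_MEAN holds a marker instead of a wage.
marker_mask = toy["H_MEAN"].isin(["*", "#"])

# Tally how often each marker occurs among the flagged rows.
marker_counts = toy.loc[marker_mask, "H_MEAN"].value_counts()

print(marker_counts)
print("rows with a marker:", int(marker_mask.sum()))
```

The same two lines (`isin` followed by `value_counts`) should apply unchanged to `df_sub['H_MEAN']`.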


## 1.2 Handling the Missing Data

The missing data can either be removed or imputed.

### Setting missing data to NaN

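The following options suppress pandas' chained-assignment warning and opt in to the future no-silent-downcasting behavior, which silences the downcasting `FutureWarning` that `replace()` would otherwise raise.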

In [29]:
pd.options.mode.chained_assignment = None 
pd.set_option("future.no_silent_downcasting", True)

In [12]:
missing_wages = df_sub.copy()

missing_wages['H_MEAN'] = missing_wages['H_MEAN'].replace({'#':np.nan,'*':np.nan})

In [13]:
missing_wages.describe()

Unnamed: 0,AREA_TITLE,OCC_CODE,OCC_TITLE,H_MEAN
count,71508,71508,71508,70946.0
unique,54,596,596,6816.0
top,California,00-0000,All Occupations,20.65
freq,3223,1161,1161,57.0
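The summary above reports `unique`, `top`, and `freq` for `H_MEAN`, which is what `describe()` produces for an object column: after `replace()`, the column still has object dtype. If a numeric summary (mean, std, quartiles) is wanted, one option is to convert the column with `pd.to_numeric`. This is a sketch on a hypothetical frame `toy`, not a step the notebook performs; note that `errors="coerce"` turns every unparseable value into NaN, not only `*` and `#`.

```python
import pandas as pd

# Hypothetical miniature of the H_MEAN column; values invented for illustration.
toy = pd.DataFrame({"H_MEAN": ["*", "#", 60.5, 61.13]})

# errors="coerce" converts anything that cannot be parsed as a number to NaN
# and returns a float64 column.
toy["H_MEAN_NUM"] = pd.to_numeric(toy["H_MEAN"], errors="coerce")

print(toy.dtypes)

# With a numeric dtype, describe() reports mean, std, and quartiles,
# skipping the NaN entries.
numeric_summary = toy["H_MEAN_NUM"].describe()
print(numeric_summary)
```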


### 1.2.1 Removing the NaN Rows

Creating a copy

In [17]:
drop_nan = missing_wages.copy()

Dropping any rows that contain NaN values

In [18]:
drop_nan.dropna(inplace=True)

Seeing the summary statistics after dropping NaN values

In [20]:
drop_nan.describe()

Unnamed: 0,AREA_TITLE,OCC_CODE,OCC_TITLE,H_MEAN
count,70946,70946,70946,70946.0
unique,54,589,589,6816.0
top,California,00-0000,All Occupations,20.65
freq,3209,1161,1161,57.0


Seeing the number of duplicate rows in the dataframe after dropping missing values

In [21]:
drop_nan.duplicated().sum()

24390

### 1.2.2 Imputing rows with NA values

#### Creating a copy

In [102]:
impute_nan = missing_wages.copy()

#### Imputing the missing values

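Calling `fillna(..., inplace=True)` directly on the selected column, as in `impute_nan['H_MEAN'].fillna(..., inplace=True)`, is not reliable here because the fill may be applied to a temporary copy of the column rather than to `impute_nan` itself. Either of the following two approaches resolves the issue.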

In [106]:
impute_nan.fillna({'H_MEAN':impute_nan['H_MEAN'].mean()},inplace=True)

In [107]:
impute_nan['H_MEAN'] = impute_nan['H_MEAN'].fillna(impute_nan['H_MEAN'].mean())

In [108]:
impute_nan.isna().sum()

AREA_TITLE    0
OCC_CODE      0
OCC_TITLE     0
H_MEAN        0
dtype: int64
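The two approaches used above should give the same result. The sketch below checks this on a hypothetical three-row frame; the names `toy`, `a`, and `b` and the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing wage; values invented for illustration.
toy = pd.DataFrame({"H_MEAN": [10.0, np.nan, 30.0]})

# mean() skips NaN by default, so this is the mean of 10.0 and 30.0.
mean_value = toy["H_MEAN"].mean()

# Approach 1: DataFrame-level fillna with a {column: value} mapping.
a = toy.copy()
a.fillna({"H_MEAN": mean_value}, inplace=True)

# Approach 2: fill the column and assign the result back.
b = toy.copy()
b["H_MEAN"] = b["H_MEAN"].fillna(mean_value)

print(a.equals(b))
print(a["H_MEAN"].tolist())
```

Both variants operate on the DataFrame itself (or assign back to it), so neither depends on whether the column selection returns a view or a copy.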

In [112]:
impute_nan.describe()

Unnamed: 0,AREA_TITLE,OCC_CODE,OCC_TITLE,H_MEAN
count,71508,71508,71508,71508.0
unique,54,596,596,6817.0
top,California,00-0000,All Occupations,30.538687
freq,3223,1161,1161,562.0


#### Seeing the duplicated rows after imputing

In [110]:
impute_nan.duplicated().sum()

24670
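The difference between the two duplicate counts can be reproduced on a small example. This is a sketch on a hypothetical frame `toy` (invented values, not rows from the dataset) in which two of the rows with a missing wage are otherwise identical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; values invented for illustration only.
toy = pd.DataFrame({
    "AREA_TITLE": ["Alabama", "Alabama", "Alabama", "Alabama", "Texas"],
    "OCC_CODE": ["11-1021", "11-1021", "11-9141", "11-9141", "13-2052"],
    "H_MEAN": [60.5, 60.5, np.nan, np.nan, np.nan],
})

# Strategy 1: drop rows with a missing wage, then count duplicates.
dropped_dupes = int(toy.dropna().duplicated().sum())

# Strategy 2: impute the column mean, then count duplicates.
imputed = toy.copy()
imputed["H_MEAN"] = imputed["H_MEAN"].fillna(imputed["H_MEAN"].mean())
imputed_dupes = int(imputed.duplicated().sum())

print("duplicates after dropping:", dropped_dupes)
print("duplicates after imputing:", imputed_dupes)
```

Dropping removes the pair of identical rows that have a missing wage, so only the pair with a recorded wage is counted. Imputing keeps both pairs, which is why its count is higher.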

## 1.3 Answering Some Questions

**1.3.1** Which strategy results in the least changes from the results of the `describe()` function and why?

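Removing the rows with missing values changes the `H_MEAN` summary the least, because `describe()` skips missing values by default and therefore already treats them as if they were removed: the `H_MEAN` count (70946), unique (6816), top (20.65), and freq (57) are identical before and after dropping. Dropping does change the other columns, whose count falls from 71508 to 70946 and whose number of unique occupations falls from 596 to 589. Imputing keeps all 71508 rows, including the other columns' values in the affected rows, but it inserts the mean (30.538687) into 562 rows, which makes the mean the most frequent `H_MEAN` value and skews that column's summary.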
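The same effect can be illustrated on a numeric column. The sketch below assumes `H_MEAN` has been converted to a float dtype, which the notebook itself does not do, and uses a hypothetical three-value series.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric wages with one missing value; invented for illustration.
wages = pd.Series([10.0, np.nan, 30.0], name="H_MEAN")

original = wages.describe()                      # NaN is skipped
dropped = wages.dropna().describe()              # NaN removed explicitly
imputed = wages.fillna(wages.mean()).describe()  # NaN replaced by the mean

# Side-by-side view of the three summaries.
comparison = pd.DataFrame(
    {"original": original, "dropped": dropped, "imputed": imputed}
)
print(comparison)
```

In this sketch the `dropped` column matches `original`, while `imputed` has a higher count, the same mean, and a smaller standard deviation.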

**1.3.2** In the context of uniqueness as a data quality attribute, which strategy results in an increase in the number of duplicated rows?

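Imputation leaves more duplicated rows than removal: 24670 after imputing versus 24390 after dropping. Removing the rows that contain missing data reduces the number of duplicated rows, most likely because some of the rows with missing data are themselves duplicates. Another possible reason is that duplicated records are more likely to contain missing data in the first place.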