# Quality Control

Controlling the quality of both inputs and outputs is essential for producing valid results in any data analysis.

* <a href=#bookmark1>Missing Values</a>
* <a href=#bookmark2>Outliers</a>

In [1]:
# Import necessary packages
import smv
import sys
from pandas import *
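from pyspark.sql import HiveContext, DataFrame
from pyspark.sql.window import Window
# Column functions used in later cells (note: this max is Spark's aggregate and shadows the built-in)
from pyspark.sql.functions import col, max, ntile, when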

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

raw = openCsv("data/input/employment/CB1200CZ11.csv")

In [2]:
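# Convenience helper: plot a pandas histogram of one column (collects the whole DataFrame to the driver)
DataFrame.smvPdHist = lambda df,col,n: df.toPandas()[col].hist(bins=n, alpha=0.3, color='k')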

<a name='bookmark1'/>
## Missing Values    

In real-life datasets, some observations often have missing values for certain variables. Analysts should look into why the values are missing and what the missingness means, since that may suggest an appropriate treatment in later analyses.

### Reasons for Missing Values   
Values can be missing for many different reasons, and the reason may differ from column to column. Do not assume a reason before examining the data in detail and understanding the business meaning of the column.

#### 1. Systematic    
As discussed in the column cross-check example in the input data profile, the missing values in the data suppression flag "PAYANN_F" are systematic: for all observations with a positive "PAYANN" value, "PAYANN_F" is null.

In [3]:
raw.smvHist("PAYANN_F")

Histogram of PAYANN_F: String sort by Key
key                      count      Pct    cumCount   cumPct
null                     31729   81.74%       31729   81.74%
D                         7089   18.26%       38818  100.00%
-------------------------------------------------


In [4]:
raw.where(col("PAYANN") > 0).smvEdd("PAYANN_F")

PAYANN_F             Non-Null Count         0
PAYANN_F             Min Length             null
PAYANN_F             Max Length             null
PAYANN_F             Approx Distinct Count  0
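
The same cross-check can be written as a small Spark-free sketch. The rows below are made-up values that only mimic the "PAYANN"/"PAYANN_F" pattern, and `missing_wherever` is a hypothetical helper, not part of SMV.

```python
def missing_wherever(rows, flag_col, predicate):
    """Return True when flag_col is missing on every row where predicate holds."""
    return all(row[flag_col] is None for row in rows if predicate(row))

# Made-up rows that mimic the pattern described above
rows = [
    {"PAYANN": 1200, "PAYANN_F": None},
    {"PAYANN": 0, "PAYANN_F": "D"},
    {"PAYANN": 56, "PAYANN_F": None},
    {"PAYANN": 0, "PAYANN_F": "D"},
]

# Is the flag null for all rows with a positive PAYANN?
systematic = missing_wherever(rows, "PAYANN_F", lambda r: r["PAYANN"] > 0)
print(systematic)
```

A result of `True` only says that the pattern holds in the data at hand. Whether it reflects a real business rule still has to be confirmed from the column's documentation.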


#### 2. Random    
Values are missing for a random share of observations. Suppose we send the employment data to a data vendor to append the average age of paid employees in each county. When the age data comes back, it covers only about 50% of counties, and there is no systematic pattern in which counties have a value and which do not.

In [5]:
# Note: this is dummy data created only for this tutorial
county_demo_stats = openCsv("../dummy_data/county_demo_stats.csv")

# Append to employment data
raw_with_age = raw.smvJoinByKey(county_demo_stats, ["ZIPCODE"], "leftouter").cache()

# Check missing rate
raw_with_age.smvBinHist(("AVG_EMP_AGE",5))

Histogram of AVG_EMP_AGE: with BIN size 5.0
key                      count      Pct    cumCount   cumPct
null                     19365   49.89%       19365   49.89%
35.0                      7553   19.46%       26918   69.34%
40.0                      7538   19.42%       34456   88.76%
45.0                      4362   11.24%       38818  100.00%
-------------------------------------------------


When we use this data, "AVG_EMP_AGE" and any new variable derived from it will be missing for roughly half of the observations or more.
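
To illustrate how missingness carries over into derived variables, here is a minimal plain-Python sketch. The ages are made up, and `years_to_65` is a hypothetical derived variable used only for illustration.

```python
def missing_rate(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)

# Made-up ages; None marks a record the vendor could not cover
avg_emp_age = [38.5, None, 41.0, None, 44.2, None]

# Hypothetical derived variable: it is missing whenever its input is missing
years_to_65 = [None if age is None else 65 - age for age in avg_emp_age]

print(missing_rate(avg_emp_age))
print(missing_rate(years_to_65))
```

Both rates are identical here because the derived variable depends on a single input. A variable built from several partly missing inputs can only have a higher missing rate.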

### Treatment for Missing Values    
Missing values usually need to be treated or recoded in the later analytic view before further analysis or modeling. For categorical variables, such as the "PAYANN_F" flag in the earlier example, one can impute missing values with a string value (for example, "na"). For numerical variables, the imputation method depends on what the missing value means. Imputing zero does not always make sense: a missing age, for example, should not be set to 0.
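
A minimal sketch of both treatments in plain Python follows. The median fill and the extra missing-indicator list are assumptions chosen for illustration (one common option, not a method prescribed by this tutorial), and both helper names are hypothetical.

```python
from statistics import median

def impute_categorical(values, fill="na"):
    """Replace missing categories with an explicit string label."""
    return [fill if v is None else v for v in values]

def impute_numeric_with_median(values):
    """Fill missing numbers with the median of the observed values.

    Also returns a parallel list of flags so the information
    "this value was originally missing" is not lost.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing

# Made-up values for illustration
flags = ["D", None, None, "D"]
ages = [35.0, None, 45.0, 40.0]

flags_imputed = impute_categorical(flags)
ages_imputed, ages_missing = impute_numeric_with_median(ages)
print(flags_imputed)
print(ages_imputed, ages_missing)
```

Keeping the indicator alongside the imputed value lets a later model distinguish a real 40.0 from a filled-in one.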

<a name='bookmark2'/>
## Outliers 

Numerical variables may contain outliers: observations whose values lie very far from the rest.

### Check for Outliers

In [6]:
# take a sample since toPandas will load all data to memory, which is risky when data size is large
raw_sample = raw.sample(False, 0.3, 99).cache()
raw_sample.count()

11629

In [7]:
raw_sample.select("ESTAB").toPandas().quantile([0,0.01,0.05,0.5,0.95,0.99,1])

quantile      ESTAB
0.00           1.00
0.01           1.00
0.05           2.00
0.50          30.00
0.95         912.80
0.99        1642.72
1.00        5933.00


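From the quantiles we can compare the percentiles (P1, P5, P95, P99) with the min and max to see whether "ESTAB" has a serious outlier problem. Here the maximum (5933) is more than three times P99 (about 1643), which points to a long right tail.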
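
Fractional results such as 912.8 are consistent with linear interpolation between neighbouring observations, which is the default behaviour of pandas' `quantile`. The sketch below reimplements that rule in plain Python on made-up values. `quantile_linear` is a hypothetical helper written for illustration, not a pandas function.

```python
import math

def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)   # fractional index into the sorted data
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac

# Made-up counts with one very large value
estab = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 5933]

for q in (0, 0.25, 0.5, 1):
    print(q, quantile_linear(estab, q))
```

Comparing the maximum with the median or an upper quantile, as done above for "ESTAB", is a quick way to spot a long tail.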

Alternatively, one can define a window specification and use `ntile` to assign each "ESTAB" value to a percentile bucket, then check the buckets for outliers.

In [8]:
# define a windowSpec
order_estab = Window.orderBy("ESTAB")

# calculate the percentile
raw_sample_percentile = raw_sample.select(
    "ST",
    "ZIPCODE",
    "ESTAB",
    ntile(100).over(order_estab).alias("ESTAB_percentile")
)

# check percentiles and the corresponding values
raw_sample_percentile.where((col("ESTAB_percentile")>=95)|(col("ESTAB_percentile")<=5)).\
    groupBy("ESTAB_percentile").agg(max("ESTAB")).show()

+----------------+----------+
|ESTAB_percentile|max(ESTAB)|
+----------------+----------+
|               1|         1|
|               2|         1|
|               3|         1|
|               4|         1|
|               5|         2|
|              95|       914|
|              96|      1016|
|              97|      1163|
|              98|      1353|
|              99|      1643|
|             100|      5933|
+----------------+----------+
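
To make the bucket assignment concrete, the sketch below mimics standard SQL NTILE semantics in plain Python, assuming Spark's `ntile` follows them: rows are split into `n` buckets as evenly as possible, and the first `len(rows) % n` buckets receive one extra row. `ntile_buckets` is a hypothetical helper and the values are made up.

```python
def ntile_buckets(sorted_values, n):
    """Return a bucket number (1..n) for each value of an already sorted list."""
    base, extra = divmod(len(sorted_values), n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets get one more row than the rest
        size = base + 1 if bucket <= extra else base
        buckets.extend([bucket] * size)
    return buckets

values = sorted([1, 1, 2, 3, 5, 8, 13, 21, 34, 5933])
buckets = ntile_buckets(values, 4)

# Largest value per bucket, analogous to groupBy(...).agg(max(...))
bucket_max = {}
for value, bucket in zip(values, buckets):
    bucket_max[bucket] = value if bucket not in bucket_max else max(bucket_max[bucket], value)

print(buckets)
print(bucket_max)
```

Under these semantics, 11629 rows split into 100 buckets give 29 buckets of 117 rows followed by 71 buckets of 116 rows, so the top bucket holds 116 rows.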



### Treatment for Outliers
There are two common ways to handle outliers:

#### 1. Discard observations with outliers from the data

In [9]:
# discard the observations whose number of establishments falls above the 99th percentile
raw_sample_filter = raw_sample_percentile.where(col("ESTAB_percentile")<=99)

# check output
raw_sample_filter.count()

11513

116 records have been filtered out from the data.

#### 2. Recode outliers using P1/P99

In [10]:
# get p99: filter to bucket 99 explicitly, since the row order of a groupBy result is not guaranteed
estab_p99 = raw_sample_percentile.where(col("ESTAB_percentile") == 99).agg(max("ESTAB")).collect()[0][0]
estab_p99

1643

In [11]:
# recode values above p99 with p99
raw_sample_recode = raw_sample.smvSelectPlus(
    when(col("ESTAB") > estab_p99, estab_p99).otherwise(col("ESTAB")).alias("ESTAB_recode")
)

# check output
raw_sample_recode.select("ESTAB_recode").toPandas().quantile([0,0.01,0.05,0.5,0.95,0.99,1])

quantile      ESTAB_recode
0.00                  1.00
0.01                  1.00
0.05                  2.00
0.50                 30.00
0.95                912.80
0.99               1642.72
1.00               1643.00
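
The cell above caps only the upper tail. A two-sided version of the same recoding can be sketched in plain Python as below. `recode_to_bounds` is a hypothetical helper, the sample values are illustrative, and the cut-offs 1 and 1643 are taken from the percentile table earlier in this section.

```python
def recode_to_bounds(values, lower, upper):
    """Replace values below `lower` with `lower` and values above `upper` with `upper`."""
    recoded = []
    for v in values:
        if v < lower:
            recoded.append(lower)
        elif v > upper:
            recoded.append(upper)
        else:
            recoded.append(v)
    return recoded

# Illustrative values; P1 = 1 and P99 = 1643 come from the percentile table
estab = [1, 1, 2, 30, 914, 1643, 5933]
estab_recode = recode_to_bounds(estab, lower=1, upper=1643)
print(estab_recode)
```

In this data P1 equals the minimum, so the lower bound changes nothing and only the upper cap has an effect.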
