# Data Preprocessing

The following notebook is a modified version of the talk given by April Chen at Depy 2016, titled "Pre-Modeling: Data Preprocessing and Feature Exploration in Python". More information can be found in the original repo: https://github.com/aprilypchen/depy2016


### The objective of the tutorial is to show the importance of data manipulation, and how to apply common data pre-processing techniques to improve model performance.

In [2]:
# Bread and butter libraries to deal with dataframes and matrices
import numpy as np
import pandas as pd

##### For the following workshop, we will be using an edited version of the "adult" dataset from the public UCI repository. The dataset consists of information on various individuals, including age, education, marital status, gender, and income.

In [14]:
# Machine learning models generally cannot handle null values; we will go over techniques for dealing with them.
# Here, the spreadsheet error string '#NAME?' is read in as a missing value.
na_values = ['#NAME?']
df = pd.read_csv('./dataset/adult.csv', na_values=na_values)
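To illustrate what the `na_values` argument does, here is a minimal sketch that reads a small in-memory CSV. The three toy rows are made up for illustration and are not taken from adult.csv.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; '#NAME?' stands in for a spreadsheet export error.
toy_csv = io.StringIO(
    "age,workclass\n"
    "39,State-gov\n"
    "#NAME?,Private\n"
    "50,#NAME?\n"
)

toy = pd.read_csv(toy_csv, na_values=['#NAME?'])

# Both '#NAME?' cells are now NaN, so 'age' parses as a float column
# instead of a column of strings.
print(toy.isnull().sum())
print(toy.dtypes)
```

Without `na_values`, the `age` column would be read as text, and the numeric-only filtering used later in the notebook would silently drop it.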

In [15]:
# Peek at the first 10 rows
df.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39.0,State-gov,77516.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53.0,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,,0,0,40,United-States,<=50K
4,28.0,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37.0,Private,284582.0,Masters,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49.0,Private,160187.0,9th,5.0,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52.0,Self-emp-not-inc,209642.0,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31.0,Private,45781.0,Masters,14.0,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42.0,Private,159449.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


#### Binary Classification Problem: Predict, based on various features from the dataset, if someone's income is greater or less than 50k

In [16]:
# Observe the class distribution. In practice, when an imbalanced dataset is not handled properly,
# performance metrics can be very misleading
df['income'].value_counts()

<=50K    3779
>50K     1221
Name: income, dtype: int64
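One way to see why accuracy can mislead on imbalanced data is to compute the accuracy of a trivial model that always predicts the majority class. The sketch below uses only the class counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
n_low, n_high = 3779, 1221

# Accuracy of a trivial model that always predicts the majority class '<=50K'.
majority_baseline = n_low / (n_low + n_high)
print(round(majority_baseline, 4))  # 0.7558
```

Any accuracy reported later in the notebook is best read against this figure of roughly 75.6%, not against 50%.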

In [17]:
# Encode as 0 if income <=50K and as 1 if income >50K
df['income'] = [0 if x == '<=50K' else 1 for x in df['income']]
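Note that the list comprehension above maps every value other than '<=50K' to 1, including missing or misspelled labels. As a hedged alternative, an explicit mapping leaves unexpected labels as NaN so they can be inspected. The toy series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy labels, including one missing value, for illustration only.
income = pd.Series(['<=50K', '>50K', np.nan, '<=50K'])

# Same logic as the cell above: everything that is not '<=50K' becomes 1.
as_list_comp = pd.Series([0 if x == '<=50K' else 1 for x in income])

# An explicit mapping leaves unexpected or missing labels as NaN instead.
as_map = income.map({'<=50K': 0, '>50K': 1})

print(as_list_comp.tolist())  # [0, 1, 1, 0]
print(as_map.isnull().sum())  # 1
```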

In [18]:
# Check the results of the encoding
df.income.head(10)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    1
8    1
9    1
Name: income, dtype: int64

## 1. Benchmark performance with unprocessed data

In [19]:
# Drop rows with missing values so the model does not throw errors
df_unprocessed = df
df_unprocessed = df_unprocessed.dropna(axis=0, how='any')

print(df.shape)
print(df_unprocessed.shape)

(5000, 15)
(4496, 15)


In [20]:
# Remove non-numeric columns so the model doesn't throw errors
# Dropping the categorical features discards potentially useful information

for col_name in df_unprocessed.columns:
    if df_unprocessed[col_name].dtypes not in ['int32','int64','float32','float64']:
        # Pass axis as a keyword; the positional form was removed in pandas 2.0
        df_unprocessed = df_unprocessed.drop(col_name, axis=1)
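The same filtering can be sketched more compactly with `DataFrame.select_dtypes`. The toy frame below is an assumption for illustration; on the real data the call would be applied to `df_unprocessed`.

```python
import numpy as np
import pandas as pd

# Toy frame with one text column, for illustration only.
toy = pd.DataFrame({
    'age': [39.0, 50.0],
    'workclass': ['State-gov', 'Private'],
    'hours_per_week': [40, 13],
})

# Keep only columns whose dtype is numeric (ints and floats of any width).
numeric_only = toy.select_dtypes(include=[np.number])
print(list(numeric_only.columns))  # ['age', 'hours_per_week']
```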

In [21]:
# Split into features and target variable
x_unprocessed=df_unprocessed.drop('income', axis=1)
y_unprocessed=df_unprocessed.income

In [23]:
# Check the X features 
x_unprocessed.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
0,39.0,77516.0,13.0,2174,0,40
1,50.0,83311.0,13.0,0,0,13
2,38.0,215646.0,9.0,0,0,40
4,28.0,338409.0,13.0,0,0,40
5,37.0,284582.0,14.0,0,0,40


In [24]:
# Check the Y targets 
y_unprocessed.head()

0    0
1    0
2    0
4    0
5    0
Name: income, dtype: int64

### Import the algorithm and tools to measure baseline accuracy

In [26]:
# Import common ML tools from sklearn
from sklearn.linear_model import LogisticRegression #Binary classification problem.
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score 

### Split data into train and test sets

In [28]:
x_train_unproc, x_test_unproc, y_train, y_test = train_test_split(
    x_unprocessed, y_unprocessed, train_size=0.70, test_size=0.30, random_state=1)
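Because the classes are imbalanced, one option (not used in this notebook) is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. A minimal sketch on synthetic labels, with all names below chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (3:1), for illustration only.
y_toy = np.array([0] * 750 + [1] * 250)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.70, test_size=0.30,
    random_state=1, stratify=y_toy)

# With stratify, both splits keep the 25% positive rate of the full data.
print(y_tr.mean(), y_te.mean())
```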

In [29]:
# Function that returns model accuracy
def find_model_perf(X_train, y_train, X_test, y_test):
    model = LogisticRegression()
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    
    return acc

In [35]:
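# Train the model and get the baseline accuracy
# Note: despite its name, acc_processed holds the accuracy on the *unprocessed* data;
# it is compared against acc_proc at the end of the notebook
acc_processed=find_model_perf(x_train_unproc, y_train, x_test_unproc, y_test)
print(acc_processed)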

0.7954040029651593




#### NOTE: when feeding test features into your prediction model, ensure the test set has gone through the same preprocessing as your training set
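In scikit-learn, a common way to follow this advice is to `fit` a transformer on the training data only and then call `transform` on the test data. The sketch below uses median imputation on made-up arrays; it is an illustration of the pattern, not the exact workflow of this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrices with missing values, for illustration only.
X_train_toy = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test_toy = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy='median')
imputer.fit(X_train_toy)                       # learn the median (3.0) from training data only
X_test_filled = imputer.transform(X_test_toy)  # reuse that median on the test set

print(X_test_filled.ravel())
```

The missing test value is filled with the training median, so the extreme test value of 100 has no influence on the imputation.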

## 2. Explore feature space to determine how to perform data pre-processing, then feed processed data into model to evaluate performance difference

In [37]:
# Separate features from target var.
y=df.income
X=df.drop(['income'],axis=1)

In [38]:
y.head(10)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    1
8    1
9    1
Name: income, dtype: int64

In [39]:
X.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country
0,39.0,State-gov,77516.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53.0,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,,0,0,40,United-States
4,28.0,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba
5,37.0,Private,284582.0,Masters,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States
6,49.0,Private,160187.0,9th,5.0,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica
7,52.0,Self-emp-not-inc,209642.0,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States
8,31.0,Private,45781.0,Masters,14.0,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States
9,42.0,Private,159449.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States


### Dealing with Categorical Data: One-hot encoding

#### One simple strategy to convert categorical data to numerical data is to create one dummy variable for each possible categorical value, then flag it with a 1 when the value is present

In [40]:
pd.get_dummies(X['education']).head(5)

Unnamed: 0,10th,11th,12th,1st-4th,5th-6th,7th-8th,9th,?,Assoc-acdm,Assoc-voc,Bachelors,Doctorate,HS-grad,Masters,Preschool,Prof-school,Some-college
0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
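A small, self-contained version of the same idea is sketched below. Recent pandas versions return boolean columns from `get_dummies` by default, so the sketch passes `dtype=int` to get the 0/1 flags shown above. The toy series is made up for illustration.

```python
import pandas as pd

# Toy categorical column, for illustration only.
toy = pd.Series(['Bachelors', 'HS-grad', 'Bachelors'], name='education')

# dtype=int requests 0/1 columns rather than True/False.
dummies = pd.get_dummies(toy, dtype=int)
print(dummies)
```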


#### Determine how many possible values there are for each feature

In [41]:
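# Decide which categorical variables you want to use in the model
# Note: unique() counts NaN as its own value, so columns with missing entries (e.g. 'sex') report one extra category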
for col_name in X.columns:
    if X[col_name].dtypes == 'object':
        unique_cat = len(X[col_name].unique())
        print("Feature '{col_name}' has {unique_cat} unique categories".format(col_name=col_name, unique_cat=unique_cat))

Feature 'workclass' has 8 unique categories
Feature 'education' has 17 unique categories
Feature 'marital_status' has 7 unique categories
Feature 'occupation' has 15 unique categories
Feature 'relationship' has 6 unique categories
Feature 'race' has 6 unique categories
Feature 'sex' has 3 unique categories
Feature 'native_country' has 40 unique categories


In [42]:
# Investigate why there is a high number of unique values for 'native_country'
X['native_country'].value_counts().sort_values(ascending=False).head(10)

United-States    4465
Mexico            104
?                  97
Canada             28
Philippines        22
Germany            22
Puerto-Rico        16
England            16
El-Salvador        16
China              15
Name: native_country, dtype: int64

In [43]:
# In this case, bucket all low-frequency categories (everything except 'United-States') as "Other"
X['native_country']=['United-States' if x=='United-States' else 'Other' for x in X['native_country']]

In [44]:
X['native_country'].value_counts().sort_values(ascending=False).head(10)

United-States    4465
Other             535
Name: native_country, dtype: int64
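The bucketing above hard-codes a single category to keep. A more general sketch keeps every category that appears at least a given number of times. The helper name `bucket_rare` and its parameters are hypothetical, introduced here for illustration only.

```python
import pandas as pd

def bucket_rare(series, min_count, other_label='Other'):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    # Values not in the frequent set (including NaN) are replaced.
    return series.where(series.isin(frequent), other_label)

# Toy column, for illustration only.
toy = pd.Series(['United-States'] * 5 + ['Mexico'] * 2 + ['Cuba'])
bucketed = bucket_rare(toy, min_count=2)
print(bucketed.value_counts())
```

As with the list comprehension above, missing values also end up in the "Other" bucket, which may or may not be what you want.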

#### Create list of important categorical features to encode

In [46]:
# Create a list of features to dummy
todummy_list = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']

In [47]:
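# Function to dummy all the categorical variables used for modeling
def dummy_df(df, todummy_list):
    for x in todummy_list:
        # One 0/1 column per category, named '<column>_<category>'; NaN gets no column
        dummies = pd.get_dummies(df[x], prefix=x, dummy_na=False)
        # Pass axis as a keyword; the positional form was removed in pandas 2.0
        df = df.drop(x, axis=1)
        df = pd.concat([df, dummies], axis=1)
    return df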
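For comparison, `pd.get_dummies` can encode several columns in one call through its `columns` argument, which should give a result similar to the loop above. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Toy frame, for illustration only.
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'sex': ['Male', 'Female', 'Male'],
    'race': ['White', 'Black', 'White'],
})

# Encode the listed columns; each dummy is prefixed with its source column name.
encoded = pd.get_dummies(toy, columns=['sex', 'race'], dtype=int)
print(list(encoded.columns))
```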

In [49]:
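# Convert the listed categorical columns to dummy variables
# Note: this cell is not idempotent. Running it a second time raises KeyError: 'workclass',
# because the original categorical columns have already been replaced by their dummies
X=dummy_df(X,todummy_list)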

In [50]:
X.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Private,...,relationship_Wife,race_Amer-Indian-Eskimo,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Female,sex_Male,native_country_Other,native_country_United-States
0,39.0,77516.0,13.0,2174,0,40,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1
1,50.0,83311.0,13.0,0,0,13,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1
2,38.0,215646.0,9.0,0,0,40,0,0,0,1,...,0,0,0,0,0,1,0,1,0,1
3,53.0,234721.0,7.0,0,0,40,0,0,0,1,...,0,0,0,1,0,0,0,0,0,1
4,28.0,338409.0,13.0,0,0,40,0,0,0,1,...,1,0,0,1,0,0,1,0,1,0


### Investigate null values

In [53]:
X.isnull().sum().sort_values(ascending=False).head()

fnlwgt                 107
education_num           57
age                     48
education_Doctorate      0
education_7th-8th        0
dtype: int64

In [54]:
# Impute missing values using SimpleImputer from sklearn.impute
# (the older sklearn.preprocessing.Imputer was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(X)
X = pd.DataFrame(data=imp.transform(X), columns=X.columns)



In [55]:
# Sanity check
X.isnull().sum().sort_values(ascending=False).head()

native_country_United-States    0
education_Bachelors             0
education_5th-6th               0
education_7th-8th               0
education_9th                   0
dtype: int64

### PCA to find most important components

In [56]:
# Use PCA from sklearn.decomposition to find principal components
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
X_pca = pd.DataFrame(pca.fit_transform(X))

In [57]:
X_pca.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-113214.310906,1139.462583,-93.36956,-0.764738,-0.325478,-2.871539,0.442253,1.098011,0.090037,-0.359008
1,-107419.290228,-1034.479985,-97.56377,6.772028,-28.321001,-3.87387,-1.44881,0.084301,0.304153,-0.315497
2,24915.70985,-1033.217954,-95.300258,-0.025758,-0.16784,0.999422,0.342504,0.463981,-0.896809,-0.429981
3,43990.709694,-1033.034501,-94.951613,15.022832,-2.349191,2.91133,-0.287956,-0.449674,0.0674,0.465092
4,147678.709918,-1032.049805,-93.253485,-8.647558,1.266781,-3.30605,0.256099,-1.342177,0.144298,0.427318
5,93851.709836,-1032.561311,-94.140176,-0.282919,0.078516,-4.179262,0.270684,-1.181192,-0.027927,0.203821
6,-30543.290226,-1033.747924,-96.271763,6.910263,-25.570403,4.218942,0.879974,-0.165318,-0.12882,0.80178
7,18911.709703,-1033.27277,-95.361617,14.50254,2.836137,1.1092,-0.791688,-0.036183,0.271056,-0.850998
8,-144949.424631,13049.13768,-70.621113,-9.332694,9.415334,-3.016165,1.377927,0.113162,-0.293002,0.118906
9,-31281.339662,4144.237526,-86.116203,2.651723,-1.111631,-2.862347,-0.992528,-0.18815,-0.329091,0.261547
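In the output above, the first component takes values on the same scale as `fnlwgt`, which suggests that the unscaled, large-valued features dominate the decomposition. The sketch below, on synthetic data, shows how standardizing features before PCA changes the explained variance. Scaling is not part of the original notebook; it is shown here only as an assumption-labeled illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic, independent features on very different scales.
X_toy = np.column_stack([
    rng.normal(0, 100000, size=500),  # large scale, similar in magnitude to fnlwgt
    rng.normal(0, 10, size=500),      # small scale, similar in magnitude to age
])

# Fraction of total variance captured by each component, without scaling.
raw_ratio = PCA(n_components=2).fit(X_toy).explained_variance_ratio_

# Same, after standardizing each feature to zero mean and unit variance.
scaled_ratio = PCA(n_components=2).fit(
    StandardScaler().fit_transform(X_toy)).explained_variance_ratio_

print(raw_ratio)     # first component carries almost all the variance
print(scaled_ratio)  # variance is shared far more evenly after scaling
```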


### Evaluate the same algorithm, but with preprocessed dataset

In [58]:
# Split into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_pca, y, train_size=0.70, test_size=0.30,random_state=1)

In [59]:
# Train the model and get the accuracy
acc_proc = find_model_perf(X_train, y_train, X_test, y_test)
print(acc_proc)

0.856




In [61]:
# Relative improvement of the preprocessed accuracy over the baseline accuracy, in percent
improvement = np.round(((acc_proc - acc_processed)/acc_processed),3)*100

In [62]:
improvement

7.6
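The figure above is a relative improvement. The arithmetic can be checked directly from the two printed accuracies, alongside the absolute gain in percentage points. The variable names below are illustrative.

```python
acc_baseline = 0.7954040029651593  # accuracy printed for the unprocessed data
acc_new = 0.856                    # accuracy printed for the preprocessed data

# Relative gain: improvement as a share of the baseline accuracy.
relative_gain_pct = (acc_new - acc_baseline) / acc_baseline * 100
# Absolute gain: difference between the two accuracies, in percentage points.
absolute_gain_points = (acc_new - acc_baseline) * 100

print(round(relative_gain_pct, 1))     # 7.6
print(round(absolute_gain_points, 1))  # 6.1
```

Keep in mind that the two accuracies were measured on different test sets: the baseline used the 4496 rows that survived `dropna`, while the preprocessed run used all 5000 rows, so the comparison is indicative, not exact.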