<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">Module 2
:</span> Base Table Construction</h1>


<br><hr id="toc">

### Build data ...

1. [Drop unwanted observations](#drop)
2. [Fix structural errors](#structural)
3. [Handle missing data](#missing-data)
4. [Engineer features](#engineer-features)
5. [Save the ABT](#save-abt)

<br><hr>

In [1]:
# NumPy for numerical computing
import numpy as np
# Pandas for DataFrames
import pandas as pd

# Matplotlib for visualization
import matplotlib.pyplot as plt
# display plots in the notebook
%matplotlib inline
# Seaborn for easier visualization
import seaborn as sns

In [2]:
data = pd.read_csv('HR_comma_sep.csv')

FileNotFoundError: File b'HR_comma_sep.csv' does not exist

In [None]:
data.head()

<span id="drop"></span>
## Data preprocessing

In [None]:
# drop unwanted observations
print(data.shape)
data.drop_duplicates()
print(data.shape)

In [None]:
# Unique classes of 'department'
data.Department.unique()

In [None]:
# Print unique values of 'filed_complaint'
print(data.Work_accident.unique())
# Print unique values of 'recently_promoted'
print(data.promotion_last_5years.unique())
# Print unique values of 'left'
print(data.left.unique())

<span id="missing-data"></span>
## Handle missing data

In [None]:
# check for missing values
data.isnull().sum()

The data features has no missing values

<span id="engineer-features"></span>

## Engineer features

In [None]:
sns.lmplot(y='last_evaluation', x='satisfaction_level', data=data[data['left']==1], fit_reg=False)

These roughly translate to 3 **indicator features** we can engineer:

* <code style="color:steelblue">'underperformer'</code> - last_evaluation < 0.6
* <code style="color:steelblue">'unhappy'</code> - satisfaction_level < 0.2
* <code style="color:steelblue">'overachiever'</code> - last_evaluation > 0.8 and satisfaction > 0.7

In [None]:
# Create indicator features
data['underperformer'] = (data.last_evaluation < 0.6).astype(int)
data['unhappy'] = (data.satisfaction_level < 0.2).astype(int)
data['overachiever'] = ((data.last_evaluation > 0.8) & (data.satisfaction_level > 0.7)).astype(int)

In [None]:
# The proportion of observations belonging to each group
data[['underperformer', 'unhappy', 'overachiever']].mean()

<span id="save-abt"></span>
## Base Table


In [None]:
# The proportion of observations who 'Left'
sum(data.left)/len(data)

In [None]:
# Create new dataframe with dummy features
cate_columns = data.select_dtypes(include=['object']).columns
btable = pd.get_dummies(data, columns=cate_columns)
# Display first 10 rows
pd.set_option('display.max_columns', 50)
btable.head()

## Model Training

In [None]:
# Pickle for saving model files
import pickle

# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Function for splitting training and test set
from sklearn.model_selection import train_test_split
# Function for creating model pipelines
from sklearn.pipeline import make_pipeline
# For standardization
from sklearn.preprocessing import StandardScaler
# Helper for cross-validation
from sklearn.model_selection import GridSearchCV
# Classification metrics (added later)
from sklearn.metrics import roc_curve, auc

#### Split the data

In [None]:
# Create separate object for target variable
y = btable.left

# Create separate object for input features
X = btable.drop('left', axis=1)

In [None]:
# Split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size= 0.2,
                                                    random_state=1234,
                                                    stratify=btable.left)

# Print number of observations in X_train, X_test, y_train, and y_test
print(len(X_train), len(X_test), len(y_train), len(y_test))

In [None]:
print(btable.average_montly_hours.min())
print(btable.average_montly_hours.max())
print(310-96)

In [None]:
# min_max scalar
def scaler(x):
    return round((x-96)/214,2)

<span id="pipelines"></span>
## Build model pipelines

In [None]:
pipeline = make_pipeline(RandomForestClassifier(random_state=123))

In [None]:
# Random Forest hyperparameters
rf_hyperparameters = {
    'randomforestclassifier__n_estimators': [100, 200],
    'randomforestclassifier__max_features': ['auto', 'sqrt', 0.33]
}

<span id="fit-tune"></span>
### Fit and tune models with cross-validation

In [None]:
model = GridSearchCV(pipeline, rf_hyperparameters, cv=10, n_jobs=-1)

model.fit(X_train, y_train)

In [None]:
# model best score
model.best_score_

<span id="evaluate"></span>
## Evaluate metrics

In [None]:
# Classification metrics
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix

In [None]:
pred = model.predict(X_test)

In [None]:
print(confusion_matrix(y_test, pred))

In [None]:
pred_prob = model.predict_proba(X_test)

pred_prob = [p[1] for p in pred_prob]

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, pred_prob)

# Initialize figure
plt.figure(figsize=(8,8))
plt.title('Receiver Operating Characteristic')

# Plot ROC curve
plt.plot(fpr, tpr, label='l1')
plt.legend(loc='lower right')

# Diagonal 45 degree line
plt.plot([0,1],[0,1],'k--')

# Axes limits and labels
plt.xlim([-0.1,1.1])
plt.ylim([-0.1,1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Calculate AUROC
print(auc(fpr, tpr))

In [None]:
# Save winning model as final_model.pkl
with open('final_model.pkl', 'wb') as f:
    pickle.dump(model.best_estimator_, f)

#### Reading the pickle file to get the stored model to perform predictions and calculate auc

In [69]:
objects = []
with open('final_model.pkl', 'rb') as f:
    objects.append(pickle.load(f))

In [70]:
pred = objects[0].predict_proba(X_test)
pred = [p[1] for p in pred]

fpr, tpr, threshold = roc_curve(y_test, pred)
print(auc(fpr, tpr)) 

0.991072194407


The score matches the previous AUROC score

In [71]:
X_train.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident',
       'promotion_last_5years', 'underperformer', 'unhappy', 'overachiever',
       'Department_IT', 'Department_RandD', 'Department_accounting',
       'Department_hr', 'Department_management', 'Department_marketing',
       'Department_product_mng', 'Department_sales', 'Department_support',
       'Department_technical', 'salary_high', 'salary_low', 'salary_medium'],
      dtype='object')

In [72]:
data[data.last_evaluation==0]

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department,salary,underperformer,unhappy,overachiever


In [74]:
data[data.satisfaction_level==0]

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department,salary,underperformer,unhappy,overachiever
