<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">Step 5:</span> The Delivery</h1>
<hr>

Now, we'll show you how to use your model to make predictions on brand new (**raw**) data and how to package your work into an executable script.


Essential steps of project delivery:

1. [Confirm your model was saved correctly](#confirm)
2. [Write pre-modeling functions](#pre-model)
3. [Construct a model class](#model-class)
4. [Method 1: Jupyter notebook](#jupyter)
5. [Method 2: Executable script](#executable)


In [None]:
# Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Pickle for reading model files
import pickle

Import the shortcut function <code style="color:steelblue">roc_auc_score()</code>, which computes the **area under ROC curve** metric.
- For verification, there is no need to plot the ROC curve itself.

In [None]:
# Area under ROC curve
from sklearn.metrics import roc_auc_score

Load our winning model.

In [None]:
# Load the saved final model as model
with open('final_model/project2_final_model.pkl', 'rb') as f:
    model = pickle.load(f)

Great, let's begin.

<span id="confirm"></span>
# 1. Confirm our model was saved correctly

A quick sanity check to confirm we are proceeding with the correct model.

**First, confirm a few key details:**
* It should be a model <code style="color:steelblue">Pipeline</code>.
* Features should be scaled.
    * The first step should be a <code style="color:steelblue">StandardScaler</code> preprocessing step.
* The type of model should also be appropriate.
    * In this case, a <code style="color:steelblue">RandomForestClassifier</code> model.

**Then, verify that the model reproduces its original score.**
Methodology:

1. Load the original analytical base table (ABT) that was used to train the model.
2. Split it into the same training and test sets (with the same random seed).
3. Check that we get the same AUROC on the test set.


In [None]:
# Display model object
model

1. Feature scaled: check.
2. Random Forest model: check.
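The two checks above can also be done programmatically rather than by reading the printed object. The sketch below is one way to do it; `check_pipeline` is a hypothetical helper, and `demo_model` is an unfitted stand-in for the pickled model, built only so the example runs on its own.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler


def check_pipeline(model):
    """Return True if model is a Pipeline that scales first and ends in a random forest."""
    if not isinstance(model, Pipeline):
        return False
    # Pipeline.steps is a list of (name, estimator) tuples
    first_step = model.steps[0][1]
    last_step = model.steps[-1][1]
    return (isinstance(first_step, StandardScaler)
            and isinstance(last_step, RandomForestClassifier))


# Stand-in pipeline for illustration only (assumption: the real model has this shape)
demo_model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
pipeline_ok = check_pipeline(demo_model)
print(pipeline_ok)
```

With the real model loaded, `check_pipeline(model)` would replace the visual inspection.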

**Let's see whether the model reproduces its original score.**

In [None]:
# Load analytical base table used in Module 4
abt = pd.read_csv('project_files/analytical_base_table.csv')

**Split it into training and test sets.**
* Remember to stratify on the target variable: <code style="color:steelblue">stratify=abt.status</code>

In [None]:
from sklearn.model_selection import train_test_split
# Create separate object for target variable
y = abt.status
# Create separate object for input features
X = abt.drop('status', axis=1)
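# Split X and y into train and test sets
# NOTE: also pass random_state=<the seed used when the model was trained>;
# without it the split changes on every run and the AUROC will not match exactly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)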
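Reproducing "the same training and test sets" depends on the random seed. The sketch below uses a made-up toy table (not the ABT) and an arbitrary seed of 0 to illustrate that a fixed `random_state` gives an identical split on every call, and that `stratify` keeps the class balance in the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 20 rows, perfectly balanced target
toy = pd.DataFrame({'x': np.arange(20), 'status': [0, 1] * 10})
X_toy = toy.drop('status', axis=1)
y_toy = toy['status']


def split(seed):
    # Same arguments as the notebook's split, plus an explicit seed
    return train_test_split(X_toy, y_toy, test_size=0.2,
                            stratify=y_toy, random_state=seed)


_, X_test_a, _, y_test_a = split(seed=0)
_, X_test_b, _, y_test_b = split(seed=0)

# Identical seed, identical rows in the test set
same_split = X_test_a.index.equals(X_test_b.index)
print(same_split)
```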

**Predict with our model and print the <code style="color:steelblue">roc_auc_score</code>.**

In [None]:
# Predict X_test
pred = model.predict_proba(X_test)
# Get just the prediction for the positive class (1)
pred = [p[1] for p in pred]
# Print AUROC
score = roc_auc_score(y_test, pred)
print('AUROC', score)
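To make the positive-class step concrete, here is a small worked example with made-up labels and probabilities. `predict_proba` returns one column per class, and `roc_auc_score` expects only the scores for class 1; slicing the second column with `[:, 1]` gives the same values as the list comprehension used above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up values for illustration: columns are P(class 0), P(class 1)
y_true = np.array([0, 0, 1, 1])
proba = np.array([[0.90, 0.10],
                  [0.60, 0.40],
                  [0.65, 0.35],
                  [0.20, 0.80]])

# Keep only the probability of the positive class
pos_proba = proba[:, 1]

# 3 of the 4 (positive, negative) pairs are ranked correctly, so AUROC = 0.75
auroc = roc_auc_score(y_true, pos_proba)
print('AUROC', auroc)
```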

The score matches, so we are good to go. One last thing:

Let's load some brand new, **raw data** that we've never seen before.

In [None]:
raw_data = pd.read_csv('project_files/unseen_raw_data.csv')

print( raw_data.shape )
raw_data.head()

**When we try to apply our model to this raw dataset, we'll get an error.**

How could we make our model work for any raw dataset with the same layout?

In [None]:
pred = model.predict_proba(raw_data)
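The error most likely comes from the raw data not having the columns the model was trained on. A quick way to see the gap is to compare column names. The sketch below uses a hypothetical helper, `column_diff`, and made-up column lists standing in for the ABT and the raw file.

```python
import pandas as pd


def column_diff(expected_columns, df):
    """Return (missing, unexpected) column names relative to expected_columns."""
    expected = list(expected_columns)
    missing = [c for c in expected if c not in df.columns]
    unexpected = [c for c in df.columns if c not in expected]
    return missing, unexpected


# Illustrative stand-ins, not the real ABT or raw data
abt_columns = ['satisfaction', 'last_evaluation', 'last_evaluation_missing',
               'department_IT', 'salary_low']
raw = pd.DataFrame({'satisfaction': [0.5],
                    'last_evaluation': [None],
                    'department': ['information_technology'],
                    'salary': ['low']})

missing, unexpected = column_diff(abt_columns, raw)
print('Missing from raw data:', missing)
print('Not expected by the model:', unexpected)
```

With the real objects, `column_diff(X.columns, raw_data)` would list what the pre-modeling functions need to create or remove.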

<span id="pre-model"></span>
# 2. Write pre-modeling functions

Let's write a function that automatically **converts the raw data to the same format as the analytical base table**.
It encapsulates our code from Step 2: ABT construction.

In [None]:
def clean_data(df):
    # Drop duplicates
    df = df.drop_duplicates()
    
    # Drop temporary workers; copy so the assignments below never touch the caller's data
    df = df[df.department != 'temp'].copy()
    
    # Missing filed_complaint values should be 0
    df['filed_complaint'] = df.filed_complaint.fillna(0)

    # Missing recently_promoted values should be 0
    df['recently_promoted'] = df.recently_promoted.fillna(0)
    
    # 'information_technology' should be 'IT'
    # (assign back instead of inplace=True, which is unreliable on a selected column)
    df['department'] = df.department.replace('information_technology', 'IT')

    # Fill missing values in department with 'Missing'
    df['department'] = df.department.fillna('Missing')

    # Indicator variable for missing last_evaluation
    df['last_evaluation_missing'] = df.last_evaluation.isnull().astype(int)
    
    # Fill missing values in last_evaluation with 0
    df['last_evaluation'] = df.last_evaluation.fillna(0)
    
    # Return cleaned dataframe
    return df

Let's check how well our function cleans the raw data.

In [None]:
# Create cleaned_data from the raw data
cleaned_data = clean_data(raw_data)
# Display first 10 observations
cleaned_data.head(10)

**The data is clean, but we still need to engineer our features.
The function <code style="color:steelblue">engineer_features()</code> encapsulates all of the feature engineering steps.**

In [None]:
def engineer_features(df):
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    
    # Create indicator features
    df['underperformer'] = ((df.last_evaluation < 0.6) & 
                            (df.last_evaluation_missing == 0)).astype(int)

    df['unhappy'] = (df.satisfaction < 0.2).astype(int)

    df['overachiever'] = ((df.last_evaluation > 0.8) & (df.satisfaction > 0.7)).astype(int)
        
    # Create new dataframe with dummy features
    df = pd.get_dummies(df, columns=['department', 'salary'])
    
    # Return augmented DataFrame
    return df

Create a new DataFrame named <code style="color:steelblue">augmented_data</code> that is both clean & engineered.

In [None]:
# Create augmented_data from the cleaned DataFrame (not the clean_data function)
augmented_data = engineer_features(cleaned_data)
# Display first 5 rows
augmented_data.head()

Our raw data has now been transformed into clean & engineered data.
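One caveat: `pd.get_dummies` only creates columns for the categories present in the data it is given, so a new file that happens to contain no employees from some department would produce fewer columns than the model saw in training. A minimal sketch of one way to guard against this is shown below. `align_columns` is a hypothetical helper, and the column names are made up for illustration.

```python
import pandas as pd


def align_columns(df, training_columns):
    """Reorder df to training_columns, adding absent columns as 0 and dropping extras."""
    return df.reindex(columns=list(training_columns), fill_value=0)


# Assumed training layout (illustrative only)
training_columns = ['satisfaction', 'department_IT', 'department_sales',
                    'salary_high', 'salary_low']

# New data that contains only one department and one salary level
new = pd.DataFrame({'satisfaction': [0.3, 0.9],
                    'department': ['IT', 'IT'],
                    'salary': ['low', 'low']})
new = pd.get_dummies(new, columns=['department', 'salary'])

aligned = align_columns(new, training_columns)
print(list(aligned.columns))
```

In the notebook, `X.columns` from the analytical base table could serve as `training_columns`.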


<br id="model-class">
# 3. Construct a class for our model

Next, let's package these functions together into a **model class**.

A class is a convenient way to keep all of the logic for a given model in one place.
- It includes the logic for cleaning data, feature engineering, and predicting new observations.




In [None]:
class EmployeeRetentionModel:
    
    def __init__(self, model_location):
        with open(model_location, 'rb') as f:
            self.model = pickle.load(f)
    
    def predict_proba(self, X_new, clean=True, augment=True):
        if clean:
            X_new = self.clean_data(X_new)
        
        if augment:
            X_new = self.engineer_features(X_new)
        
        # Return the transformed data together with the predicted probabilities
        return X_new, self.model.predict_proba(X_new)
    
    
    def clean_data(self, df):
        # Drop duplicates
        df = df.drop_duplicates()

        # Drop temporary workers; copy so the assignments below never touch the caller's data
        df = df[df.department != 'temp'].copy()

        # Missing filed_complaint values should be 0
        df['filed_complaint'] = df.filed_complaint.fillna(0)

        # Missing recently_promoted values should be 0
        df['recently_promoted'] = df.recently_promoted.fillna(0)

        # 'information_technology' should be 'IT'
        # (assign back instead of inplace=True, which is unreliable on a selected column)
        df['department'] = df.department.replace('information_technology', 'IT')

        # Fill missing values in department with 'Missing'
        df['department'] = df.department.fillna('Missing')

        # Indicator variable for missing last_evaluation
        df['last_evaluation_missing'] = df.last_evaluation.isnull().astype(int)

        # Fill missing values in last_evaluation with 0
        df['last_evaluation'] = df.last_evaluation.fillna(0)

        # Return cleaned dataframe
        return df
        
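    def engineer_features(self, df):
        # Work on a copy so the caller's DataFrame is not modified in place
        df = df.copy()

        # Create indicator features
        # (names follow the standalone engineer_features() above; they must match
        # the column names in the analytical base table the model was trained on)
        df['underperformer'] = ((df.last_evaluation < 0.6) & 
                                (df.last_evaluation_missing == 0)).astype(int)

        df['unhappy'] = (df.satisfaction < 0.2).astype(int)

        df['overachiever'] = ((df.last_evaluation > 0.8) & (df.satisfaction > 0.7)).astype(int)

        # Create new dataframe with dummy features
        df = pd.get_dummies(df, columns=['department', 'salary'])

        # Return augmented DataFrame
        return df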

<span id="jupyter"></span>
# 4. Method 1: Jupyter notebook

The 2 simplest ways to deploy our model:
1. Keep it in Jupyter Notebook
2. Port it to an executable script

For the Jupyter notebook:
First, simply initialize an instance of our model:

In [None]:
# Initialize an instance
retention_model = EmployeeRetentionModel('final_model/project2_final_model.pkl')

If implemented correctly, these next three statements should all work.

In [None]:
# Predict raw data
_, pred1 = retention_model.predict_proba(raw_data, clean=True, augment=True)

# Predict cleaned data
_, pred2 = retention_model.predict_proba(cleaned_data, clean=False, augment=True)

# Predict cleaned and augmented data
_, pred3 = retention_model.predict_proba(augmented_data, clean=False, augment=False)

<code style="color:steelblue">_, pred1 =</code> simply means we're throwing away the first object that's returned (which was <code style="color:steelblue">X_new</code>).

Their predictions should all be equivalent.

In [None]:
# Should be true
np.array_equal(pred1, pred2) and np.array_equal(pred2, pred3)
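`np.array_equal` requires every element to match exactly. That is the right check here, because the same model is scoring the same rows. If two prediction arrays ever came from slightly different numeric paths, a tolerance-based comparison with `np.allclose` would be the more forgiving alternative. The sketch below uses made-up numbers to show the difference.

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in floating point
a = np.array([0.1 + 0.2, 0.5])
b = np.array([0.3, 0.5])

exact_match = np.array_equal(a, b)   # strict element-by-element equality
close_match = np.allclose(a, b)      # equality within a small tolerance

print(exact_match, close_match)
```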

<span id="executable"></span>
# 5. Method 2: Executable script (optional)

There is an example script called <code style="color:crimson">retention_model.py</code> in the <code style="color:crimson">project_files/</code> directory of this project.

To run the script, call it from the command line:

<pre style="color:crimson; margin-bottom:30px">
$ python project_files/retention_model.py project_files/unseen_raw_data.csv predictions.csv final_model.pkl True True
</pre>

This will save a new file that includes the predictions. It'll look like this:

In [None]:
# Will only work after running the command above
predictions = pd.read_csv('predictions.csv')

predictions.head()
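The contents of `retention_model.py` are not shown here, so the following is only a sketch of how such a script might read the five positional arguments used in the command above. The argument names (`data_location`, `output_location`, `model_location`, `clean`, `augment`) and the `str_to_bool` helper are assumptions, not the script's actual code. Note that command-line values arrive as strings, and `bool('False')` is `True` in Python, so the flags need explicit parsing.

```python
import argparse


def str_to_bool(value):
    """Interpret 'True'/'False' style command-line strings as booleans."""
    lowered = value.lower()
    if lowered in ('true', '1', 'yes'):
        return True
    if lowered in ('false', '0', 'no'):
        return False
    raise argparse.ArgumentTypeError(f'expected a boolean, got {value!r}')


def parse_args(argv):
    # Hypothetical argument names mirroring the order in the command above
    parser = argparse.ArgumentParser(
        description='Score raw employee data with a saved model.')
    parser.add_argument('data_location', help='CSV file with raw observations')
    parser.add_argument('output_location', help='where to write the predictions CSV')
    parser.add_argument('model_location', help='path to the pickled model')
    parser.add_argument('clean', type=str_to_bool, help='clean the data first?')
    parser.add_argument('augment', type=str_to_bool, help='engineer features first?')
    return parser.parse_args(argv)


args = parse_args(['project_files/unseen_raw_data.csv', 'predictions.csv',
                   'final_model.pkl', 'True', 'True'])
print(args.data_location, args.clean, args.augment)
```

In a real script, `parse_args(sys.argv[1:])` would feed these values to the model class, which would then write its predictions to `args.output_location`.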

Essential steps to professionally deliver the model:
* Confirm your model was saved correctly by reproducing its test score.
* Compile the data cleaning and feature engineering functions we used in ABT construction.
* Package all of that logic together in a custom model class.
* Apply your model to raw data in Jupyter Notebook.