# Part 3: Unbiased Evaluation using a New Test Set

In this part, we are given a new test set (`/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`). We can now take advantage of the entire smart sample that we created in Part I. 

* Retrain the pipeline using the optimal hyperparameters found in Part II. We don't need to repeat the grid search here.

## Import modules as needed

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd
import joblib

In [2]:
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, recall_score, fbeta_score
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from pprint import pprint

## Load smart sample and the best pipeline from Part II

In [3]:
sampled_X, sampled_y= joblib.load('sampled_data.pkl')

bestmodel = joblib.load('best_model.joblib')

In [4]:
iso_forest= joblib.load('iso_forest.joblib')

In [5]:
bestmodel

Pipeline(steps=[('Lsvc',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('rf',
                 RandomForestClassifier(max_depth=20, max_features='sqrt',
                                        n_estimators=600))])

In [6]:
iso_forest

IsolationForest(contamination=0.08)


## Retrain a pipeline using the full sampled training data set

Use the full sampled training data set to train the pipeline.
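
Refitting with hyperparameters that are already chosen can be sketched as follows. This is a minimal illustration, not the notebook's own code: the data comes from `make_classification` as a synthetic stand-in for the smart sample, `n_estimators` is reduced for speed, and the names `tuned`, `refit`, `X_demo`, and `y_demo` are hypothetical. `sklearn.base.clone` copies an estimator's hyperparameters without its fitted state, so no search is repeated.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for the smart sample (assumption; not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Same step names and settings as the pipeline printed above,
# except n_estimators, which is lowered here only to keep the demo fast.
tuned = Pipeline(steps=[
    ('Lsvc', SelectFromModel(estimator=LinearSVC(dual=False, penalty='l1'))),
    ('rf', RandomForestClassifier(max_depth=20, max_features='sqrt',
                                  n_estimators=50, random_state=0)),
])

# clone() keeps the hyperparameters and discards any fitted state,
# so a single fit on the full training data is all that is needed.
refit = clone(tuned).fit(X_demo, y_demo)
train_accuracy = refit.score(X_demo, y_demo)
print(f"Training accuracy: {train_accuracy:.3f}")
```

Calling `fit` directly on a loaded pipeline, as this notebook does, also refits every step from scratch; `clone` is only a way to make that explicit and to leave the loaded object untouched.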

In [7]:
X=sampled_X
y=sampled_y

In [8]:
# Add code below this comment  (Question #E301)
# ----------------------------------
iso_forest.fit(X, y)

IsolationForest(contamination=0.08)

In [9]:
iso_outliers = iso_forest.predict(X)==-1

print(f"Num of outliers = {np.sum(iso_outliers)}")
X_iso = X[~iso_outliers]
y_iso = y[~iso_outliers]

Num of outliers = 1807


In [10]:
bm=bestmodel.fit(X_iso, y_iso)

In [11]:
bm

Pipeline(steps=[('Lsvc',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('rf',
                 RandomForestClassifier(max_depth=20, max_features='sqrt',
                                        n_estimators=600))])

In [12]:
y_pred_train = bestmodel.predict(X_iso)

In [13]:
print(classification_report(y_iso, y_pred_train)) 

              precision    recall  f1-score   support

           0       0.99      0.97      0.98     10398
           1       0.97      0.99      0.98     10381

    accuracy                           0.98     20779
   macro avg       0.98      0.98      0.98     20779
weighted avg       0.98      0.98      0.98     20779



In [14]:
print(confusion_matrix(y_iso,y_pred_train))

[[10111   287]
 [  137 10244]]


In [15]:
print("Accuracy Score: ", accuracy_score(y_iso,y_pred_train))
print("Recall Score: ",recall_score(y_iso,y_pred_train))
print("F1 Score: ",f1_score(y_iso,y_pred_train))

Accuracy Score:  0.9795947831945715
Recall Score:  0.9868028128311338
F1 Score:  0.9797245600612089


### Save the trained model with the pickle library.

In [16]:
# Add code below this comment  
# -----------------------------
import pickle
# A context manager closes the file handle once the model is written.
with open('finalmodel.pkl', 'wb') as f:
    pickle.dump(bm, f)



## Load the Testing Data and evaluate your model

* `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`
* We need to preprocess this test data (**follow** steps similar to those in Part I).
* **If you fitted any normalizer/standardizer in Part II, the test data must be transformed with that same fitted normalizer/standardizer!**
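
The point about reusing a fitted scaler can be sketched as below. This is a minimal example on made-up arrays; `StandardScaler` and the names `train`, `test`, and `scaler` are assumptions for illustration, since the pipeline loaded in this notebook may not contain a scaler at all.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up arrays standing in for training and test features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler().fit(train)

# Reuse those statistics on the test data; do not call fit() again here.
test_scaled = scaler.transform(test)
print(test_scaled)
```

Fitting a new scaler on the test set would put the two sets on different scales, so the model would see inputs unlike those it was trained on.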

In [17]:
# Preprocess the given test set  (Question #E302)
# ----------------------------------

# Dataset location
DATASET = '/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)

dataset.head().transpose()





                        0        1        2        3        4
sku               3290188  3453759  3512840  3515426  3520962
national_inv        135.0     38.0     27.0     -4.0     61.0
lead_time             2.0      2.0      8.0      8.0      NaN
in_transit_qty       67.0      0.0      0.0      0.0      0.0
forecast_3_month    144.0      0.0      0.0    288.0      0.0
forecast_6_month    324.0      0.0      0.0    288.0      0.0
forecast_9_month    504.0      0.0      0.0    288.0      0.0
sales_1_month        27.0      0.0      0.0      1.0      1.0
sales_3_month        95.0      0.0      1.0    153.0      4.0
sales_6_month       194.0      0.0      5.0    231.0      7.0


In [18]:
dataset.shape

(242076, 23)

In [19]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242076 entries, 0 to 242075
Data columns (total 23 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   sku                242076 non-null  object 
 1   national_inv       242075 non-null  float64
 2   lead_time          227351 non-null  float64
 3   in_transit_qty     242075 non-null  float64
 4   forecast_3_month   242075 non-null  float64
 5   forecast_6_month   242075 non-null  float64
 6   forecast_9_month   242075 non-null  float64
 7   sales_1_month      242075 non-null  float64
 8   sales_3_month      242075 non-null  float64
 9   sales_6_month      242075 non-null  float64
 10  sales_9_month      242075 non-null  float64
 11  min_bank           242075 non-null  float64
 12  potential_issue    242075 non-null  object 
 13  pieces_past_due    242075 non-null  float64
 14  perf_6_month_avg   242075 non-null  float64
 15  perf_12_month_avg  242075 non-null  float64
 16  lo

In [20]:
# The unique identifier is not needed; it is a mix of integer and string values.
dataset.drop(['sku'], axis=1, inplace=True)
# Source performance over the past 6 and 12 months seems irrelevant: values range from -0.99 to 1.0, which is ambiguous.
dataset.drop(['perf_6_month_avg'], axis=1, inplace=True)
dataset.drop(['perf_12_month_avg'], axis=1, inplace=True)

In [21]:
dataset.columns

Index(['national_inv', 'lead_time', 'in_transit_qty', 'forecast_3_month',
       'forecast_6_month', 'forecast_9_month', 'sales_1_month',
       'sales_3_month', 'sales_6_month', 'sales_9_month', 'min_bank',
       'potential_issue', 'pieces_past_due', 'local_bo_qty', 'deck_risk',
       'oe_constraint', 'ppap_risk', 'stop_auto_buy', 'rev_stop',
       'went_on_backorder'],
      dtype='object')

In [22]:
# All the column names of these yes/no columns
yes_no_columns = list(filter(lambda i: dataset[i].dtype!=np.float64, dataset.columns))
print(yes_no_columns)

# Add code below this comment  (Question #E102)
# ----------------------------------
print('potential_issue',dataset['potential_issue'].unique())
print('deck_risk',dataset['deck_risk'].unique())
print('oe_constraint',dataset['oe_constraint'].unique())
print('ppap_risk',dataset['ppap_risk'].unique())
print('stop_auto_buy',dataset['stop_auto_buy'].unique())
print('rev_stop',dataset['rev_stop'].unique())
print('went_on_backorder',dataset['went_on_backorder'].unique())

['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk', 'stop_auto_buy', 'rev_stop', 'went_on_backorder']
potential_issue ['No' 'Yes' nan]
deck_risk ['No' 'Yes' nan]
oe_constraint ['No' 'Yes' nan]
ppap_risk ['No' 'Yes' nan]
stop_auto_buy ['Yes' 'No' nan]
rev_stop ['No' 'Yes' nan]
went_on_backorder ['No' 'Yes' nan]


In [23]:
for column_name in yes_no_columns:
    mode = dataset[column_name].apply(str).mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
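    # Assign the result back; in-place fillna on a column selection is deprecated in recent pandas.
    dataset[column_name] = dataset[column_name].fillna(mode)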

Filling missing values of potential_issue with No
Filling missing values of deck_risk with No
Filling missing values of oe_constraint with No
Filling missing values of ppap_risk with No
Filling missing values of stop_auto_buy with Yes
Filling missing values of rev_stop with No
Filling missing values of went_on_backorder with No


Once preprocessing is complete, we can predict and evaluate on the test set. It would be interesting to compare performance with and without outlier removal from the test set. We can report the confusion matrix, precision, recall, F1 score, accuracy, and any other relevant measures.
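
One way to compare the two settings is sketched below. It is an illustration under stated assumptions rather than the notebook's method: `evaluate` is a hypothetical helper, the data is synthetic, and the classifier is a small random forest standing in for the saved pipeline. The detector is fitted on training data only and then used to filter test rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


def evaluate(model, X, y, detector=None):
    """Return the confusion matrix, optionally after dropping flagged outliers."""
    if detector is not None:
        # IsolationForest.predict returns 1 for inliers and -1 for outliers.
        keep = detector.predict(X) == 1
        X, y = X[keep], y[keep]
    return confusion_matrix(y, model.predict(X), labels=[0, 1])


# Synthetic, imbalanced stand-in for the back-order data (assumption).
X_all, y_all = make_classification(n_samples=600, n_features=8,
                                   weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.5, stratify=y_all, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
detector = IsolationForest(contamination=0.08, random_state=0).fit(X_train)

cm_all = evaluate(model, X_test, y_test)
cm_filtered = evaluate(model, X_test, y_test, detector=detector)
print("All test rows:\n", cm_all)
print("Outliers removed:\n", cm_filtered)
```

Note that the filtered matrix covers fewer rows, so the two results describe different populations; in production every incoming row needs a prediction, which makes the unfiltered numbers the more conservative estimate.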

In [24]:
for colname in yes_no_columns:
    # Map the labels to integers and assign back instead of replacing in place.
    dataset[colname] = dataset[colname].map({'Yes': 1, 'No': 0})

In [25]:
print('potential_issue',dataset['potential_issue'].unique())
print('deck_risk',dataset['deck_risk'].unique())
print('oe_constraint',dataset['oe_constraint'].unique())
print('ppap_risk',dataset['ppap_risk'].unique())
print('stop_auto_buy',dataset['stop_auto_buy'].unique())
print('rev_stop',dataset['rev_stop'].unique())
print('went_on_backorder',dataset['went_on_backorder'].unique())

potential_issue [0 1]
deck_risk [0 1]
oe_constraint [0 1]
ppap_risk [0 1]
stop_auto_buy [1 0]
rev_stop [0 1]
went_on_backorder [0 1]


In [26]:
num_backorder = np.sum(dataset['went_on_backorder']==1)
print('backorder ratio:', num_backorder, '/', len(dataset), '=', num_backorder / len(dataset))

backorder ratio: 2688 / 242076 = 0.01110395082536063


In [27]:
dataset.isna().sum()

national_inv             1
lead_time            14725
in_transit_qty           1
forecast_3_month         1
forecast_6_month         1
forecast_9_month         1
sales_1_month            1
sales_3_month            1
sales_6_month            1
sales_9_month            1
min_bank                 1
potential_issue          0
pieces_past_due          1
local_bo_qty             1
deck_risk                0
oe_constraint            0
ppap_risk                0
stop_auto_buy            0
rev_stop                 0
went_on_backorder        0
dtype: int64

In [28]:
lt_median = dataset['lead_time'].median()
dataset['lead_time'] = dataset['lead_time'].fillna(lt_median)

In [29]:
# Drop the single row in which all of these feature values are missing
dataset=dataset.dropna(subset=['national_inv','in_transit_qty','forecast_3_month','forecast_6_month','forecast_9_month',
                              'sales_1_month','sales_3_month','sales_6_month','sales_9_month','min_bank','pieces_past_due',
                              'local_bo_qty'])

In [30]:
dataset.shape

(242075, 20)

In [31]:
# Reload the saved model; the context manager closes the file handle.
with open('finalmodel.pkl', 'rb') as f:
    pickled_model = pickle.load(f)

In [32]:
X_df = dataset.iloc[:, 0:-1]
y_df = dataset.iloc[:, -1].astype(int)

In [33]:
# Add code below this comment  (Question #E303)
# ----------------------------------


y_predicted = pickled_model.predict(X_df)



In [34]:
print(classification_report(y_df, y_predicted)) 

              precision    recall  f1-score   support

           0       1.00      0.89      0.94    239387
           1       0.08      0.81      0.14      2688

    accuracy                           0.89    242075
   macro avg       0.54      0.85      0.54    242075
weighted avg       0.99      0.89      0.93    242075



In [35]:
print(confusion_matrix(y_df,y_predicted))

[[213205  26182]
 [   505   2183]]


In [37]:
print("Accuracy Score: ", accuracy_score(y_df,y_predicted))
print("Recall Score: ",recall_score(y_df,y_predicted))
print("F1 Score: ",f1_score(y_df,y_predicted))

Accuracy Score:  0.8897573066198492
Recall Score:  0.8121279761904762
F1 Score:  0.14059833188419799
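
The reported scores can be reproduced by hand from the test-set confusion matrix, which helps explain why accuracy and recall look healthy while F1 is low. The sketch below uses only the four counts printed above; the variable names are illustrative.

```python
# Counts copied from the test-set confusion matrix printed above.
# scikit-learn lays it out as [[TN, FP], [FN, TP]] for labels 0 and 1.
tn, fp = 213205, 26182
fn, tp = 505, 2183

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted back orders that were real
recall = tp / (tp + fn)      # share of real back orders that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Precision works out to roughly 0.077, so only about one in thirteen flagged items actually went on back order. With back orders making up about 1% of the test set, the 26,182 false positives dominate the 2,183 true positives, and that is what pulls F1 down despite the high recall.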


## Conclusion

## Reflect

Imagine you are a data scientist who has been tasked with developing a system to save your 
company money by predicting and preventing back orders of parts in the supply chain.

Write a **brief summary** for "management" that details your findings, 
your level of certainty and trust in the models, 
and recommendations for operationalizing these models for the business.

# Save your notebook!
## Then `File > Close and Halt`