# Part 3: Unbiased Evaluation using a New Test Set

In this part, we are given a new test set (`/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`). We can now take advantage of the entire smart sample that we created in Part I. 

* Retrain a pipeline using the optimal parameters that the pipeline learned. We don't need to repeat GridSearch here. 
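
Reusing tuned parameters without a second search can be sketched as follows. This is a minimal illustration, assuming a fitted `GridSearchCV` object is available; the toy data, the step names `scaler` and `clf`, and the `LogisticRegression` estimator are stand-ins chosen for the example, not the pipeline from Part II.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the smart sample (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=43)
X_part, X_hold, y_part, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=43
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_part, y_part)  # the search runs once, on part of the data

# clone() copies the hyperparameters but drops the fitted state,
# so the copy can be refit on the full sample without searching again.
final_pipeline = clone(search.best_estimator_)
final_pipeline.fit(X, y)
```

In this notebook the loaded `pipeline1` already carries its tuned parameters, so calling `fit` on it directly serves the same purpose.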

## Import modules as needed

In [16]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd
import joblib

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

## Load smart sample and the best pipeline from Part II

In [17]:
X_train, y_train, pipeline1 = joblib.load('pipeline-1.pkl')


## Retrain a pipeline using the full sampled training data set

Use the full sampled training data set to train the pipeline.

In [18]:
# Add code below this comment  (Question #E301)
# ----------------------------------


In [19]:
from sklearn.ensemble import IsolationForest

# Create an IsolationForest object
clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=43)

# Fit the model to the data
clf.fit(X_train)

# Predict outliers/anomalies in the data
y_pred_train = clf.predict(X_train)

In [20]:
pipeline1

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['national_inv', 'lead_time',
                                                   'in_transit_qty',
                                                   'forecast_3_month',
                                                   'forecast_6_month',
                                                   'forecast_9_month',
                                                   'sales_1_month',
                                                   'sales_3_month',
                                                   'sales_6_month',
                                                   'sales_9_month', 'min_bank',
                                                   'pieces_past_due',
                             

In [21]:
pipeline1.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['national_inv', 'lead_time',
                                                   'in_transit_qty',
                                                   'forecast_3_month',
                                                   'forecast_6_month',
                                                   'forecast_9_month',
                                                   'sales_1_month',
                                                   'sales_3_month',
                                                   'sales_6_month',
                                                   'sales_9_month', 'min_bank',
                                                   'pieces_past_due',
                             

### Save the trained model with the joblib library.

In [22]:
# Add code below this comment  
# -----------------------------

joblib.dump([X_train, y_train, pipeline1], 'pipeline-1-fully-trained.pkl')


['pipeline-1-fully-trained.pkl']
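
A quick way to check that a saved file can be restored is a round trip through `joblib.dump` and `joblib.load`. The sketch below uses a toy model and a temporary file named `demo-model.pkl`; both are hypothetical and only illustrate the pattern.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data and model standing in for the real pipeline (assumption).
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=43)
fitted = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.pkl")  # hypothetical file name
    joblib.dump([X_demo, y_demo, fitted], path)
    X_back, y_back, restored = joblib.load(path)

# A restored model should reproduce the original predictions exactly.
round_trip_ok = (restored.predict(X_demo) == fitted.predict(X_demo)).all()
print("Round trip OK:", round_trip_ok)
```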


## Load the Testing Data and evaluate your model

 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`
 
* We need to preprocess this test data (**follow** the steps similar to Part I)
* **If you have fitted any normalizer/standardizer in Part 2, then we have to transform this test data using the fitted normalizer/standardizer!**
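
The point about reusing the fitted standardizer can be illustrated with a toy pipeline. The data and the names `demo`, `X_train_demo`, and `X_test_demo` are made up for this sketch; the behavior shown is that `predict` applies the scaler statistics learned during `fit` and does not refit them on test rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(43)
X_train_demo = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
y_train_demo = (X_train_demo[:, 0] > 10.0).astype(int)
# Test rows drawn from a deliberately shifted distribution.
X_test_demo = rng.normal(loc=12.0, scale=5.0, size=(50, 3))

demo = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
demo.fit(X_train_demo, y_train_demo)
train_means = demo.named_steps["scaler"].mean_.copy()

# predict() calls transform() on the scaler, never fit(), so the
# training statistics are applied unchanged to the test rows.
y_pred_demo = demo.predict(X_test_demo)
same_stats = np.allclose(demo.named_steps["scaler"].mean_, train_means)
print("Scaler statistics unchanged after predict:", same_stats)
```

Because the scaler sits inside the pipeline, passing the raw preprocessed test features to `predict` is enough; scaling the test set separately beforehand would apply the transform twice.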

In [23]:
# Preprocess the given test set  (Question #E302)
# ----------------------------------

# Dataset location
DATASET = '/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)

dataset.head().transpose()




| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| sku | 3394011 | 3339400 | 3400245 | 3473283 | 3318401 |
| national_inv | 30.0 | 0.0 | 145.0 | 33.0 | 14.0 |
| lead_time | 2.0 | 8.0 | 8.0 | 8.0 | 8.0 |
| in_transit_qty | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| forecast_3_month | 0.0 | 0.0 | 0.0 | 0.0 | 59.0 |
| forecast_6_month | 0.0 | 0.0 | 102.0 | 0.0 | 101.0 |
| forecast_9_month | 0.0 | 0.0 | 102.0 | 12.0 | 152.0 |
| sales_1_month | 0.0 | 0.0 | 14.0 | 1.0 | 14.0 |
| sales_3_month | 0.0 | 0.0 | 49.0 | 6.0 | 49.0 |
| sales_6_month | 0.0 | 0.0 | 117.0 | 26.0 | 97.0 |


In [24]:
dataset = dataset.drop('sku', axis=1)

In [25]:
# All the column names of these yes/no columns
yes_no_columns = list(filter(lambda i: dataset[i].dtype!=np.float64, dataset.columns))
print(yes_no_columns)

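# Replace missing values in each yes/no column with that column's mode
for column_name in yes_no_columns:
    mode = dataset[column_name].apply(str).mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
    # Assign the result back; fillna(inplace=True) on a selected column
    # may not update the DataFrame in recent pandas versions
    dataset[column_name] = dataset[column_name].fillna(mode)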

['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk', 'stop_auto_buy', 'rev_stop', 'went_on_backorder']
Filling missing values of potential_issue with No
Filling missing values of deck_risk with No
Filling missing values of oe_constraint with No
Filling missing values of ppap_risk with No
Filling missing values of stop_auto_buy with Yes
Filling missing values of rev_stop with No
Filling missing values of went_on_backorder with No


In [26]:
# Convert yes/no columns into binary (0s and 1s)
dataset = dataset.replace({'Yes': 1, 'No': 0})

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242076 entries, 0 to 242075
Data columns (total 22 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   national_inv       242075 non-null  float64
 1   lead_time          227351 non-null  float64
 2   in_transit_qty     242075 non-null  float64
 3   forecast_3_month   242075 non-null  float64
 4   forecast_6_month   242075 non-null  float64
 5   forecast_9_month   242075 non-null  float64
 6   sales_1_month      242075 non-null  float64
 7   sales_3_month      242075 non-null  float64
 8   sales_6_month      242075 non-null  float64
 9   sales_9_month      242075 non-null  float64
 10  min_bank           242075 non-null  float64
 11  potential_issue    242076 non-null  int64  
 12  pieces_past_due    242075 non-null  float64
 13  perf_6_month_avg   242075 non-null  float64
 14  perf_12_month_avg  242075 non-null  float64
 15  local_bo_qty       242075 non-null  float64
 16  de

In [27]:
# Check which columns have missing values
missing_cols = dataset.isna().sum()[dataset.isna().sum() > 0].index.tolist()
print(missing_cols)

# Replace missing values in each column with that column's own mean
dataset[missing_cols] = dataset[missing_cols].fillna(dataset[missing_cols].mean())

['national_inv', 'lead_time', 'in_transit_qty', 'forecast_3_month', 'forecast_6_month', 'forecast_9_month', 'sales_1_month', 'sales_3_month', 'sales_6_month', 'sales_9_month', 'min_bank', 'pieces_past_due', 'perf_6_month_avg', 'perf_12_month_avg', 'local_bo_qty']


In [28]:
# Count number of NaN values
num_nans = dataset.isna().sum().sum()
print('Number of NaN values:', num_nans)

Number of NaN values: 0


In [30]:
# Creating a smart sample of the dataset

y = dataset.went_on_backorder
X = dataset.drop('went_on_backorder', axis=1)

from imblearn.under_sampling import RandomUnderSampler

# create RandomUnderSampler object
rus = RandomUnderSampler(random_state=43)

# fit and resample data
X_test, y_test = rus.fit_resample(X, y)

We can now predict and evaluate with the preprocessed test set. It would be interesting to compare performance with and without outlier removal on the test set. We can report the confusion matrix, precision, recall, F1-score, accuracy, and any other relevant measures.
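
One possible way to set up the with/without comparison is sketched below on toy data. It assumes an `IsolationForest` fitted on the training features is used to flag test rows; the variable names, the `LogisticRegression` model, and the synthetic data are illustrative and are not the notebook's pipeline or results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the back order data (assumption).
X_all, y_all = make_classification(n_samples=1000, n_features=6, random_state=43)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=43
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Fit the detector on training features, then flag test rows:
# predict() returns 1 for inliers and -1 for outliers.
detector = IsolationForest(
    n_estimators=100, contamination=0.05, random_state=43
).fit(X_tr)
inlier_mask = detector.predict(X_te) == 1

acc_all = accuracy_score(y_te, model.predict(X_te))
acc_inliers = accuracy_score(y_te[inlier_mask], model.predict(X_te[inlier_mask]))
print(f"All test rows: {len(y_te)}, accuracy {acc_all:.3f}")
print(f"Inliers only:  {inlier_mask.sum()}, accuracy {acc_inliers:.3f}")
```

Scores on the inlier-only subset describe performance on "typical" rows only, so they are best reported next to the full-test-set scores rather than in place of them.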

In [31]:
# Add code below this comment  (Question #E303)
# ----------------------------------

from sklearn.metrics import classification_report, confusion_matrix

# Make predictions on the test set using the trained model
y_pred = pipeline1.predict(X_test)

# Generate a confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Generate a classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)

from sklearn.metrics import accuracy_score

# Calculate the overall accuracy of the model
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)


Confusion Matrix:
 [[2414  274]
 [ 580 2108]]
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.90      0.85      2688
           1       0.88      0.78      0.83      2688

    accuracy                           0.84      5376
   macro avg       0.85      0.84      0.84      5376
weighted avg       0.85      0.84      0.84      5376

Accuracy: 0.8411458333333334
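
The figures in the classification report follow directly from the confusion matrix. The short check below recomputes them for class 1 (went on back order) from the four counts printed above; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[2414, 274],
               [580, 2108]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)  # of predicted back orders, the share that were real
recall_1 = tp / (tp + fn)     # of real back orders, the share that were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

Read this way, the model missed 580 of the 2688 actual back orders in the balanced test sample and raised 274 false alarms.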


## Conclusion

## Reflect

Imagine you are a data scientist who has been tasked with developing a system to save your 
company money by predicting and preventing back orders of parts in the supply chain.

Write a **brief summary** for "management" that details your findings, 
your level of certainty and trust in the models, 
and recommendations for operationalizing these models for the business.

# Save your notebook!
## Then `File > Close and Halt`