# Part 3: Unbiased Evaluation using a New Test Set

In this part, we are given a new test set (`/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`). We can now take advantage of the entire smart sample that we created in Part I. 

* Retrain a pipeline using the optimal parameters that the pipeline learned. We don't need to repeat GridSearch here. 

## Import modules as needed

In [55]:
import joblib
import os
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.feature_selection import SelectPercentile


## Load smart sample and the best pipeline from Part II

In [56]:
# Load smart sample and preprocessed full training dataset
X_train, y_train, train_undersamp = joblib.load('data/sample-data-v4.pkl')

# Load pipeline
pipe = joblib.load('data/pipeline-v5.pkl')



##  Retrain a pipeline using the full sampled training data set

Use the full sampled training data set to train the pipeline.

In [57]:
# Add code below this comment  (Question #E301)
# ----------------------------------
training_model = pipe.fit(X_train, y_train)
# Display full model score
training_model.score(X_train, y_train)


0.8950677410785443

In [58]:
# Make prediction using full test data
predicted_y = training_model.predict(X_train)

# Display confusion matrix
print('Confusion Matrix:\n',pd.DataFrame(confusion_matrix(y_train, predicted_y)))

# Create classification report
print('\nClassification Report:\n',classification_report(y_train, predicted_y))


Confusion Matrix:
       0      1
0  9895   1398
1   972  10321

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.88      0.89     11293
           1       0.88      0.91      0.90     11293

    accuracy                           0.90     22586
   macro avg       0.90      0.90      0.90     22586
weighted avg       0.90      0.90      0.90     22586



### Save the trained model with the pickle library.

In [59]:
# Add code below this comment  
# -----------------------------
# Pickle the best model
joblib.dump(training_model, 'data/model-v5.pkl')


['data/model-v5.pkl']


## Load the Testing Data and evaluate your model

 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`
 
* We need to preprocess this test data (follow the steps similar to Part I)
* If we have fitted any normalizer/standardizer in Part 2, then we have to transform this test data using the fitted normalizer/standardizer

In [20]:
# Preprocess the given test set  (Question #E302)
# ----------------------------------
# Dataset location
DATASET = '/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv'
assert os.path.exists(DATASET)

# Load and shuffle
test_dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)

# Remove sku feature
test_dataset.drop('sku', axis = 1, inplace = True)

# Create correlation matrix
corr_matrix = test_dataset.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(np.bool))

# Find highly-correlated features to drop
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop highly-correlated features 
test_dataset.drop(to_drop, axis = 1, inplace = True)

# Get all the column names of yes/no columns
yes_no_columns = list(filter(lambda i: test_dataset[i].dtype!=np.float64, test_dataset.columns))

# Fill missing values in discrete features with value that occurred most often in each column
for column_name in yes_no_columns:
    mode = test_dataset[column_name].apply(str).mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
    test_dataset[column_name].fillna(mode, inplace=True)

# Fill missing lead time data with mean 
test_dataset['lead_time'].fillna((test_dataset['lead_time'].mean()), inplace = True)

# Fill missing perf_6_month_avg data (-99) with mode
mode_value = test_dataset['perf_6_month_avg'].mode()
test_dataset['perf_6_month_avg'].mask(test_dataset['perf_6_month_avg'] == -99, mode_value, inplace=True)

# Remove any rows with any remaining NaN values
test_dataset = test_dataset.dropna(how = 'any')
test_dataset = test_dataset.reset_index(drop = True)

print(test_dataset.isnull().sum()) # view nan counts in columms

# Fill in yes, no features with 1, 0 values
for col in ['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk',
            'stop_auto_buy', 'rev_stop', 'went_on_backorder']:
    test_dataset[col] = (test_dataset[col] == 'Yes').astype(int)
    
test_dataset.info()


  interactivity=interactivity, compiler=compiler, result=result)
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations


Filling missing values of potential_issue with No
Filling missing values of deck_risk with No
Filling missing values of oe_constraint with No
Filling missing values of ppap_risk with No
Filling missing values of stop_auto_buy with Yes
Filling missing values of rev_stop with No
Filling missing values of went_on_backorder with No
national_inv         0
lead_time            0
in_transit_qty       0
forecast_3_month     0
sales_1_month        0
sales_3_month        0
min_bank             0
potential_issue      0
pieces_past_due      0
perf_6_month_avg     0
local_bo_qty         0
deck_risk            0
oe_constraint        0
ppap_risk            0
stop_auto_buy        0
rev_stop             0
went_on_backorder    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222974 entries, 0 to 222973
Data columns (total 17 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   national_inv       222974 non-null  float64
 1   lea

In [21]:
# Split into X and y
X = test_dataset.iloc[:, :-1]
y = test_dataset.went_on_backorder


We can now predict and evaluate with the preprocessed test set. It would be interesting to see the performance with and without outliers removal from the test set. We can report confusion matrix, precision, recall, f1-score, accuracy, and other measures (if any). 

In [60]:
# Add code below this comment  (Question #E303)
# ----------------------------------
# Load retrained pipeline
full_model = joblib.load('data/model-v5.pkl')


In [62]:
# Fit training dataset to retrained model
test_model = full_model.fit(X, y)
# Display test model score
test_model.score(X, y)

# Make prediction using test dataset
predicted_y = test_model.predict(X)

# Display confusion matrix
print('\nConfusion Matrix:\n',pd.DataFrame(confusion_matrix(y, predicted_y)))
# Create classification report
print('\nClassification Report:\n',classification_report(y, predicted_y))



Confusion Matrix:
       0     1
0  9502  1361
1   934  9929

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.87      0.89     10863
           1       0.88      0.91      0.90     10863

    accuracy                           0.89     21726
   macro avg       0.89      0.89      0.89     21726
weighted avg       0.89      0.89      0.89     21726



## Conclusion

## Reflect

Imagine you are data scientist that has been tasked with developing a system to save your 
company money by predicting and preventing back orders of parts in the supply chain.

Write a **brief summary** for "management" that details your findings, 
your level of certainty and trust in the models, 
and recommendations for operationalizing these models for the business.

# Save your notebook!
## Then `File > Close and Halt`