# Part 3: Unbiased Evaluation using a New Test Set

In this part, we are given a new test set (`/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`). We can now take advantage of the entire smart sample that we created in Part I. 

* Retrain the pipeline using the optimal parameters that the grid search found in Part II. We don't need to repeat GridSearch here. 

## Import modules as needed

In [1]:
import joblib
import os
import pandas as pd
import numpy as np


## Load smart sample and the best pipeline from Part II

In [2]:
X, y, train_undersamp = joblib.load('data/sample-data-v1.pkl')
pipe, CV_rfc, CV_rfc_model = joblib.load('data/pipeline-v3.pkl')



##  Retrain a pipeline using the full sampled training data set

Use the full sampled training data set to train the pipeline.

In [None]:
# Add code below this comment  (Question #E301)
# ----------------------------------
model = CV_rfc_model.fit(X, y)
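
One way to reuse tuned hyperparameters without rerunning the search is to clone the search's best estimator and fit the clone once on all rows. The sketch below illustrates this on synthetic data. The names `demo_pipe`, `demo_grid`, `search`, and `final_model` are hypothetical, and the pipeline steps are assumptions, since the actual contents of `pipeline-v3.pkl` are not shown here.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the smart sample (not the back-order data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical pipeline and grid; the real Part II steps may differ
demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rfc", RandomForestClassifier(random_state=0)),
])
demo_grid = {"rfc__n_estimators": [10, 20], "rfc__max_depth": [3, 5]}

# The search runs on a subset, as it would on a Part II training split
search = GridSearchCV(demo_pipe, demo_grid, cv=3)
search.fit(X_demo[:150], y_demo[:150])

# clone() copies the tuned hyperparameters but none of the fitted state,
# so this fit trains once on all rows without repeating the search
final_model = clone(search.best_estimator_)
final_model.fit(X_demo, y_demo)

print(search.best_params_)
```

If the loaded object is itself a `GridSearchCV`, calling `fit` on it directly would rerun the whole search, which is what the clone-and-fit step avoids.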


### Save the trained model with joblib.

In [None]:
# Add code below this comment  
# -----------------------------
# Pickle the best model
joblib.dump(model, 'data/model-v1.pkl')
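
A quick round-trip check can confirm that a saved model reloads and predicts identically. This is a minimal sketch on synthetic data. The classifier, the temporary path, and the name `same_predictions` are illustrative assumptions, not part of the assignment.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the trained pipeline
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model-demo.pkl")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```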



## Load the Testing Data and evaluate your model

* Test set: `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`
* We need to preprocess this test data, following the same steps as in Part I.
* If we fitted any normalizer/standardizer in Part II, we have to transform this test data with that fitted object rather than fitting a new one.
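
The last point can be illustrated with a small sketch, assuming a `StandardScaler` (the actual scaler used in Part II, if any, may differ). The scaler learns its mean and standard deviation from the training rows only, and the test rows are transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test rows (hypothetical values)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training rows only
scaler = StandardScaler().fit(train)

# Reuse the fitted statistics; do not call fit on the test set
test_scaled = scaler.transform(test)
print(test_scaled)
```

When the scaler is a step inside a scikit-learn `Pipeline`, calling `predict` on the fitted pipeline applies this transform automatically, so no separate call is needed.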

In [None]:
# Preprocess the given test set  (Question #E302)
# ----------------------------------
# Dataset location
DATASET = '/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset_test = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset_test.head().transpose()

# Remove sku feature
dataset_test.drop('sku', axis = 1, inplace = True)

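# Create correlation matrix (numeric columns only; the yes/no columns are still strings here)
corr_matrix = dataset_test.select_dtypes(include = 'number').corr().abs()

# Select upper triangle of correlation matrix
# (np.bool was removed from recent NumPy releases; the builtin bool works everywhere)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))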

# Find highly-correlated features to drop
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop highly-correlated features 
dataset_test.drop(to_drop, axis = 1, inplace = True)

# Get all the column names of yes/no columns
yes_no_columns = list(filter(lambda i: dataset_test[i].dtype!=np.float64, dataset_test.columns))
print(yes_no_columns)

# Fill missing values in discrete features with value that occurred most often in each column
for column_name in yes_no_columns:
    mode = dataset_test[column_name].apply(str).mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
    dataset_test[column_name] = dataset_test[column_name].fillna(mode)

# Fill missing lead time data with mean 
dataset_test['lead_time'] = dataset_test['lead_time'].fillna(dataset_test['lead_time'].mean())

# Fill missing perf_6_month_avg data (-99) with mode
# mode() returns a Series, so take its first value to get a scalar
mode_value = dataset_test['perf_6_month_avg'].mode()[0]
dataset_test['perf_6_month_avg'] = dataset_test['perf_6_month_avg'].mask(dataset_test['perf_6_month_avg'] == -99, mode_value)

# Remove any rows with any remaining NaN values
dataset = dataset_test.dropna(how = 'any')
dataset = dataset.reset_index(drop = True)

print(dataset.isnull().sum()) # view NaN counts in columns

# Fill in yes, no features with 1, 0 values
for col in ['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk',
            'stop_auto_buy', 'rev_stop', 'went_on_backorder']:
    dataset[col] = (dataset[col] == 'Yes').astype(int)
    
dataset.info()
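
The correlation filter above is recomputed on the test set, so it may drop a different set of columns than Part I did. A model can only predict on the features it was trained on, so one hedged safeguard is to select the test columns from the training feature list. In the sketch below, `train_columns` is a hypothetical list and the data frame holds toy values.

```python
import pandas as pd

# Hypothetical list of feature names kept after Part I's preprocessing
train_columns = ["lead_time", "perf_6_month_avg", "deck_risk"]

# Toy test frame with an extra column and a different column order
test_df = pd.DataFrame({
    "deck_risk": [0, 1],
    "ppap_risk": [1, 0],
    "perf_6_month_avg": [0.9, 0.5],
    "lead_time": [8.0, 2.0],
})

# Fail loudly if the test set lacks a feature the model expects
missing = [c for c in train_columns if c not in test_df.columns]
if missing:
    raise ValueError(f"Test set lacks training features: {missing}")

# Keep only the training features, in the training order
X_test_demo = test_df[train_columns]
print(list(X_test_demo.columns))
```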


    

We can now predict and evaluate with the preprocessed test set. It would be interesting to compare performance with and without outlier removal from the test set. We can report the confusion matrix, precision, recall, f1-score, accuracy, and any other relevant measures. 
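
As a reminder of how these measures relate, here is a minimal sketch on hand-made labels (hypothetical values, not model output). Precision and recall are computed for the positive class, which here stands for "went on backorder".

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels (hypothetical): 1 = went on backorder, 0 = did not
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total

print(cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

On a test set where backorders are rare, accuracy alone can look high even when recall on the positive class is poor, so the per-class figures deserve the most attention.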

In [None]:
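# Add code below this comment  (Question #E303)
# ----------------------------------
from sklearn.metrics import confusion_matrix, classification_report

# Separate the test features from the target
# (X_test must contain the same feature columns the model was trained on)
X_test = dataset.drop('went_on_backorder', axis = 1)
y_test = dataset['went_on_backorder']

# Make predictions on the test dataset with the retrained model
predicted_y = model.predict(X_test)

# Display confusion matrix
print('\nConfusion Matrix:\n',pd.DataFrame(confusion_matrix(y_test, predicted_y)))
# Create classification report
print('\nClassification Report:\n',classification_report(y_test, predicted_y))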


## Conclusion

## Reflect

Imagine you are a data scientist who has been tasked with developing a system to save your 
company money by predicting and preventing back orders of parts in the supply chain.

Write a **brief summary** for "management" that details your findings, 
your level of certainty and trust in the models, 
and recommendations for operationalizing these models for the business.

# Save your notebook!
## Then `File > Close and Halt`