# Part 3: Unbiased Evaluation using a New Test Set

In this part, we are given a new test set that serves as the "truly unseen data" (`/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`). We can now take advantage of the entire smart sample that we created in Part I. 

* Load your best pipeline model and anomaly detector from Part 2. 
* Load your balanced (smart) sample from Part 1. 
* Retrain the model with the entire balanced sample. (do NOT repeat the grid search)
* Save the model. 
* Test it with the "unseen" data. 

## Import modules as needed

In [1]:

%matplotlib inline
import random, time
random.seed(10)
import numpy as np
import pandas as pd
import joblib
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

---

## Load the balanced sample, the best pipeline, and the anomaly detector

In [2]:
X_bal, y_bal = joblib.load('training-data.pkl')

iso   = joblib.load('best_anomaly_detection.pkl')
model = joblib.load('best_pipeline.pkl') 

print(X_bal.shape)
print(y_bal.value_counts())


print("iso model:", iso)
print("best pipe:  ", model)

(15824, 21)
0    7912
1    7912
Name: went_on_backorder, dtype: int64
iso model: IsolationForest(contamination=0.02, random_state=42)
best pipe:   Pipeline(steps=[('sc', StandardScaler()),
                ('pca',
                 PCA(n_components=15, random_state=42,
                     svd_solver='randomized')),
                ('clf',
                 RandomForestClassifier(max_depth=20, n_estimators=400,
                                        n_jobs=-1))])


---

##  Retrain pipeline using the full balanced sample 

Use the full balanced sample to train the pipeline.

In [3]:
inlier_mask = iso.fit_predict(X_bal) == 1
X_train = X_bal[inlier_mask]
y_train = y_bal[inlier_mask]

In [4]:
# Add code below this comment  (Question #E301)
# ----------------------------------
model.fit(X_train, y_train)
print(model)

Pipeline(steps=[('sc', StandardScaler()),
                ('pca',
                 PCA(n_components=15, random_state=42,
                     svd_solver='randomized')),
                ('clf',
                 RandomForestClassifier(max_depth=20, n_estimators=400,
                                        n_jobs=-1))])


## Pickle and save the trained model and the anomaly detector 

In [5]:
# Add code below this comment  
# -----------------------------

import os
os.makedirs("final_models", exist_ok=True)  # make sure the output folder exists
joblib.dump(iso, "final_models/iso_detector.pkl")
joblib.dump(model, "final_models/pipeline_iso_pca_rf.pkl")


['final_models/pipeline_iso_pca_rf.pkl']


---

## Load the test data and evaluate your model

 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`

Remember:  
* We need to preprocess this test data (**follow** steps similar to those in Part 1)


* If you have fitted any normalizer/standardizer in Part 2, then you have to transform this test data using the same fitted normalizer/standardizer. Do NOT retrain anything. 
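
As a minimal sketch of why no separate scaling step is needed at test time: when the standardizer lives inside a fitted `Pipeline`, calling `predict` only applies `transform` on the fitted steps. The data below is synthetic and the pipeline settings are illustrative assumptions, not the Part 2 model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the back-order dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_fit, y_fit = X[:300], y[:300]
X_new = X[300:] * 5.0 + 3.0  # shift the "new" data on purpose

pipe = Pipeline(steps=[
    ("sc", StandardScaler()),
    ("pca", PCA(n_components=5, random_state=0)),
    ("clf", RandomForestClassifier(n_estimators=25, random_state=0)),
])
pipe.fit(X_fit, y_fit)

# Record the scaler statistics learned during fit.
mean_before = pipe.named_steps["sc"].mean_.copy()

# predict() transforms with the already-fitted steps; nothing is refit.
preds = pipe.predict(X_new)
mean_after = pipe.named_steps["sc"].mean_

scaler_unchanged = bool(np.allclose(mean_before, mean_after))
print("scaler statistics unchanged after predict:", scaler_unchanged)
```

The same reasoning applies to the loaded pipeline: passing the preprocessed test frame to its `predict` method reuses the fitted scaler and PCA.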

In [6]:
# Preprocess the given test set  (Question #E302)
# ----------------------------------

df_test= pd.read_csv("/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv")


# Replicate the Part 1 preprocessing on the new dataset.
df_test = df_test.drop(columns=["sku"])
num_cols = df_test.select_dtypes(include="number").columns
df_test[num_cols] = df_test[num_cols].replace(-99, np.nan)  # treat -99 as missing
# Note: the median imputer is fitted on the test set here, not reused from Part 1.
imp = SimpleImputer(strategy="median")
df_test[num_cols] = imp.fit_transform(df_test[num_cols])
df_test.head()



|   | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | ... | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty | deck_risk | oe_constraint | ppap_risk | stop_auto_buy | rev_stop | went_on_backorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.85 | 0.83 | 0.0 | Yes | No | No | Yes | No | No |
| 1 | 9.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.85 | 0.83 | 0.0 | No | No | Yes | No | No | No |
| 2 | 17.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.92 | 0.95 | 0.0 | No | No | No | Yes | No | No |
| 3 | 9.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.78 | 0.75 | 0.0 | No | No | Yes | Yes | No | No |
| 4 | 2.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.54 | 0.71 | 0.0 | No | No | No | Yes | No | No |
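
One alternative to fitting the imputer on the test set is to learn the medians from the training data and only apply them here. The sketch below assumes the Part 1 training frame (or a saved imputer) is available; the two tiny frames are hypothetical stand-ins, and only the `-99` sentinel and the median strategy come from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature frames standing in for the Part 1 training data
# and the new test data; -99 is treated as missing, as in the notebook.
train = pd.DataFrame({"lead_time": [2.0, 8.0, 8.0, 12.0],
                      "perf_6_month_avg": [0.5, -99.0, 0.9, 0.7]})
test = pd.DataFrame({"lead_time": [np.nan, 4.0],
                     "perf_6_month_avg": [-99.0, 0.8]})

num_cols = ["lead_time", "perf_6_month_avg"]
train[num_cols] = train[num_cols].replace(-99, np.nan)
test[num_cols] = test[num_cols].replace(-99, np.nan)

# Fit on the training frame only, then reuse the learned medians on the test frame.
imp = SimpleImputer(strategy="median")
imp.fit(train[num_cols])
test[num_cols] = imp.transform(test[num_cols])

print(test)
```

With this arrangement the test rows never influence the fill values, which keeps the evaluation closer to how the model would meet new data.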


In [7]:
# Do the yes/no conversion
yes_no_columns = list(filter(lambda i: df_test[i].dtype!=np.float64, df_test.columns))

for column_name in yes_no_columns:
    mode = df_test[column_name].apply(str).mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
    # Assign the result back instead of using inplace=True on a column selection
    df_test[column_name] = df_test[column_name].fillna(mode)
    
df_test[yes_no_columns]=df_test[yes_no_columns].apply(lambda s: s.str.strip().str.upper().map({'YES':1,'NO':0}))

Filling missing values of potential_issue with No
Filling missing values of deck_risk with No
Filling missing values of oe_constraint with No
Filling missing values of ppap_risk with No
Filling missing values of stop_auto_buy with Yes
Filling missing values of rev_stop with No
Filling missing values of went_on_backorder with No


In [8]:
# Without Isolation Forest filtering
X_test = df_test.drop("went_on_backorder",axis=1)
y_test = df_test["went_on_backorder"]

In [9]:
y_pred = model.predict(X_test)

We can now predict and evaluate with the preprocessed test set. It would be interesting to see the performance with and without outlier removal from the test set. 

Report confusion matrix, precision, recall, f1-score, accuracy, and other measures (if any). 

In [10]:
# Add code below this comment  (Question #E303)
# ----------------------------------
# WITHOUT ISO FOREST

print("----- WITHOUT ISO FOREST  -----\n")
print(pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    index=["No_Backorder","Yes_Backorder"],
    columns=["No_Backorder","Yes_Backorder"]
))
print(classification_report(
    y_test, y_pred,
    target_names=["No_Backorder","Yes_Backorder"]
))


----- WITHOUT ISO FOREST  -----

               No_Backorder  Yes_Backorder
No_Backorder         197193          42195
Yes_Backorder           692           1996
               precision    recall  f1-score   support

 No_Backorder       1.00      0.82      0.90    239388
Yes_Backorder       0.05      0.74      0.09      2688

     accuracy                           0.82    242076
    macro avg       0.52      0.78      0.49    242076
 weighted avg       0.99      0.82      0.89    242076
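
As a quick cross-check, the headline metrics can be recomputed by hand from the four confusion-matrix counts printed above. This is only arithmetic on those counts; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 197193, 42195   # actual No_Backorder
fn, tp = 692, 1996       # actual Yes_Backorder

recall = tp / (tp + fn)                      # share of real back-orders caught
precision = tp / (tp + fp)                   # share of alerts that are real
specificity = tn / (tn + fp)                 # share of non-back-orders cleared
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.4f} precision={precision:.4f} "
      f"specificity={specificity:.4f} accuracy={accuracy:.4f} f1={f1:.4f}")
```

Rounded to two decimals, these agree with the `classification_report` rows for `Yes_Backorder`.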



In [11]:
#With Iso Forest
mask_inlier = iso.predict(X_test) == 1
X_in, y_in = X_test[mask_inlier], y_test[mask_inlier]
y_in_pred = model.predict(X_in)

In [12]:


print("\n----- WITH ISO FOREST  -----\n")
print(pd.DataFrame(
    confusion_matrix(y_in, y_in_pred),
    index=["No_Backorder","Yes_Backorder"],
    columns=["No_Backorder","Yes_Backorder"]
))
print(classification_report(
    y_in, y_in_pred,
    target_names=["No_Backorder","Yes_Backorder"]
))


----- WITH ISO FOREST  -----

               No_Backorder  Yes_Backorder
No_Backorder         193119          41345
Yes_Backorder           677           1953
               precision    recall  f1-score   support

 No_Backorder       1.00      0.82      0.90    234464
Yes_Backorder       0.05      0.74      0.09      2630

     accuracy                           0.82    237094
    macro avg       0.52      0.78      0.49    237094
 weighted avg       0.99      0.82      0.89    237094
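
To see how little the filter changes, the two confusion matrices can be compared directly. The sketch below uses only the counts printed in the two reports; the variable names are illustrative.

```python
# (tn, fp, fn, tp) copied from the two confusion matrices.
without_iso = (197193, 42195, 692, 1996)
with_iso = (193119, 41345, 677, 1953)

def recall_precision(counts):
    """Return (recall, precision) for the Yes_Backorder class."""
    tn, fp, fn, tp = counts
    return tp / (tp + fn), tp / (tp + fp)

removed = sum(without_iso) - sum(with_iso)     # rows flagged as outliers
removed_frac = removed / sum(without_iso)

recall_without, precision_without = recall_precision(without_iso)
recall_with, precision_with = recall_precision(with_iso)

print(f"rows removed by the filter: {removed} ({removed_frac:.2%})")
print(f"recall    without={recall_without:.4f}  with={recall_with:.4f}")
print(f"precision without={precision_without:.4f}  with={precision_with:.4f}")
```

The filter drops about 2 % of the test rows, in line with `contamination=0.02`, and recall and precision move only in the fourth or fifth decimal place.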



In [13]:
# Base rate: percentage of test rows that actually went on back-order
base_rate = y_test.mean()
base_rate*100

1.1103950825360631
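
The base rate is a useful yardstick for precision: dividing the model's precision by the base rate gives the lift over flagging rows at random. A small sketch using the counts from the unfiltered confusion matrix (variable names are illustrative):

```python
# Counts from the unfiltered test-set confusion matrix.
true_backorders = 2688          # 692 missed + 1996 caught
total_rows = 242076
tp, fp = 1996, 42195

base_rate = true_backorders / total_rows   # chance a random row is a back-order
precision = tp / (tp + fp)                 # chance a flagged row is a back-order
lift = precision / base_rate

print(f"base rate={base_rate:.4%}  precision={precision:.4%}  lift={lift:.2f}x")
```

A flagged row is therefore about four times as likely to be a real back-order as a row picked at random.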

---

## Conclusion

Comment on the performance of your model: take a look at the project notes to see what you should report here. 

# Write a summary of your processing and an analysis of the model performance  
# (Question #E304)
# ----------------------------------
# I am also writing this for management (sorry, I read the project notes wrong)


**My model has performed!**  
Today, we have **no** system in place to anticipate back‐orders—every stock‐out comes as a surprise. By implementing this pipeline, we can shift from reactive firefighting to proactive planning.

1. **Current State**  
   - No systematic way to flag at-risk SKUs.  
   - Stock-outs hurt customer satisfaction and increase expedited freight costs.

2. **Future State**  
   - Daily “heads-up” on likely back-order items.  
   - Inventory planners can focus on a targeted list instead of scanning all 240 k SKUs.


### What is my model?  
A simple three-step pipeline:
1. **StandardScaler** – scales each feature so none dominates.  
2. **PCA (15 components)** – condenses the 21 input features into 15 “super-features,” speeding up training.  
3. **RandomForest** (400 trees, max depth 20) – an ensemble of decision trees votes on back-order risk.  
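
One way to check how much information the PCA step keeps is to look at the explained variance of the retained components. The sketch below runs the first two steps on synthetic data with 21 columns; the data is a hypothetical stand-in, so the printed ratio says nothing about the back-order features themselves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 21-feature data (hypothetical; not the balanced sample).
X_demo, _ = make_classification(n_samples=500, n_features=21,
                                n_informative=10, random_state=0)

# Same first two steps and PCA settings as the pipeline printed earlier.
reducer = Pipeline(steps=[
    ("sc", StandardScaler()),
    ("pca", PCA(n_components=15, svd_solver="randomized", random_state=42)),
])
X_reduced = reducer.fit_transform(X_demo)

# Fraction of the standardized variance kept by the 15 components.
retained = float(reducer.named_steps["pca"].explained_variance_ratio_.sum())
print(X_reduced.shape, f"variance retained: {retained:.3f}")
```

Running the same two lines against the fitted pipeline (`model.named_steps["pca"].explained_variance_ratio_.sum()`) would report the figure for the real data.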

> We tuned hundreds of hyperparameter combinations and saved the best version.


### How does it perform on unseen data?  

| Condition                               | Rate   |  
|-----------------------------------------|--------|  
| **Caught back-orders** (recall)         | 74 %   |  
| **Missed back-orders** (false negatives)| 26 %   |  
| **Correctly cleared** (specificity)     | 82 %   |  
| **False alarms** (false positives)      | 18 %   |  
| **Overall accuracy**                    | 82 %   |  

> Filtering test-set outliers with IsolationForest (2 % contamination) made **no difference** to these rates, so we’ve omitted it from the table to keep things simple.



---

## Reflection

Imagine you are a data scientist who has been tasked with developing a system to save your 
company money by predicting and preventing back orders of parts in the supply chain.

Write a **brief summary** for "management" that details your findings, 
your level of certainty and trust in the models, 
and recommendations for operationalizing these models for the business. Take a look at the project notes to see what you should report here. 

# Write your answer here:  
# (Question #E305)
# ----------------------------------
**Model Certainty**

- Recall: Catches 74% of true back-orders (1,996 / 2,688) vs. 0 % with no model.
- Precision: Only about 4.5% of alerts are true back-orders (1,996 / 44,191), roughly a 4× lift over the 1.1 % raw rate.

**Trade-offs & Model Limits**

- Miss rate: 26% of real back-orders still slip through.
- False-alarm share: About 95% of alerts (42,195 of 44,191) are noise.
- Static view: Doesn’t capture seasonality, promotions, or sudden demand spikes.

**Operational Recommendations**
- Dedicated triage: Assign 1 FTE to vet the alerts from each run (~44 k on this test set) and log outcomes.

- Feature enhancement: Add supplier details to the dataset.
- Ongoing monitoring: Track recall/precision monthly; retrain quarterly.

**Bottom Line**
- Preventing 74% of 2,688 annual back-orders at $320 each would save about $636,518/year.
- Even with roughly 95% false alarms, a review process could yield net cost savings and happier customers.
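
The savings figure can be reproduced and turned into a rough break-even test for the review process. The sketch below takes the 74 % recall, the 2,688 back-orders, and the $320 cost from the bullets above, and the alert count from the test-set confusion matrix; the break-even cost per alert is an illustrative derived quantity, not a measured one.

```python
# Inputs taken from the summary above and the test-set confusion matrix.
recall = 0.74
backorders_per_year = 2688
cost_per_backorder = 320.0       # dollars, as assumed in the summary
alerts = 1996 + 42195            # every row the model flags on the test set

gross_savings = recall * backorders_per_year * cost_per_backorder

# Reviewing alerts pays for itself only if each review costs less than this.
break_even_cost_per_alert = gross_savings / alerts

print(f"gross savings: ${gross_savings:,.0f}")
print(f"break-even review cost per alert: ${break_even_cost_per_alert:.2f}")
```

If vetting one alert costs more than the break-even figure, the review process would erase the savings, which is why reducing false alarms matters.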

**Emphasis on Model improvement**
- Focus on reducing false alarms.
- Re-evaluate the loss function. (It currently focuses on catching all back-orders, but it needs to account for false alarms.)
- Estimate the cost of each item's back-order.
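
One possible way to trade recall for fewer false alarms without retraining is to raise the decision threshold on the predicted probability. The sketch below is run on synthetic imbalanced data with an illustrative classifier; the thresholds and settings are assumptions, not values tuned for the back-order model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (hypothetical; about 5 % positives).
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                             random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

# Flag a row only when its probability reaches the threshold.
results = {}
for threshold in (0.3, 0.5, 0.7):
    flagged = (proba >= threshold).astype(int)
    results[threshold] = (
        int(flagged.sum()),
        precision_score(y_te, flagged, zero_division=0),
        recall_score(y_te, flagged),
    )
    n, p, r = results[threshold]
    print(f"threshold={threshold}: alerts={n} precision={p:.2f} recall={r:.2f}")
```

A higher threshold never increases the number of alerts or the recall, so the threshold can be chosen by weighing the cost of a missed back-order against the cost of reviewing an alert.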

# Save your notebook!

## Commit and push. 

