# Part II: Model Development

In this part, we develop three unique pipelines for predicting backorder. We use the smart sample from Part I to fit and evaluate these pipelines. 

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd
import joblib

from sklearn.model_selection import train_test_split

In [3]:
numeric_features=['national_inv', 'lead_time', 'in_transit_qty', 'forecast_3_month',
       'forecast_6_month', 'forecast_9_month', 'sales_1_month',
       'sales_3_month', 'sales_6_month', 'sales_9_month', 'min_bank',
       'potential_issue', 'pieces_past_due', 'perf_6_month_avg',
       'perf_12_month_avg', 'local_bo_qty']
binary_features=['deck_risk', 'oe_constraint',
       'ppap_risk', 'stop_auto_buy', 'rev_stop']

## Reload the smart sample here

In [4]:

# Reload your smart sampling from local file 
# ----------------------------------

X_resampled, y_resampled = joblib.load('sample-data.pkl')

## Normalize/standardize the data if required; otherwise ignore. You can perform this step inside the pipeline (if required). 

## Split the data into Train/Test

In [5]:
#y = sampled_df.went_on_backorder
#X = sampled_df.drop('went_on_backorder', axis=1)

# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.25, random_state=43)

## Developing Pipeline

In this section, we design an operationalized machine learning pipeline, which includes:

* Anomaly detection
* Dimensionality Reduction
* Train a classification model


We are free to use any of the models that we learned in the past or we can use new models. Here is a pool of methods: 

### Pool of Anomaly Detection Methods (Discussed in M4)
1. IsolationForest
2. EllipticEnvelope
3. LocalOutlierFactor
4. OneClassSVM
5. SGDOneClassSVM

### Pool of Feature Selection Methods (Discussed in M3)

1. VarianceThreshold
1. SelectKBest with any scoring method (e.g, chi, f_classif, mutual_info_classif)
1. SelectKPercentile
3. SelectFpr, SelectFdr, or  SelectFwe
1. GenericUnivariateSelect
2. PCA
3. Factor Analysis
4. Variance Threshold
5. RFE
7. SelectFromModel


### Classification Methods (Discussed in M1-M2
1. Decision Tree
2. Random Forest
3. Logistic Regression
4. Naive Bayes
5. Linear SVC
6. SVC with kernels
7. KNeighborsClassifier
8. GradientBoostingClassifier
9. XGBClassifier
10. LGBM Classifier



It is difficult to fit an anomaly detection method in the sklearn pipeline without writing custom codes. For simplicity, we avoid fitting an anomaly detection method within a pipeline. So we can create the workflow in two steps. 
* Step I: fit an outlier with the training set
* Step II: define a pipeline using a feature selection and a classification method. Then cross-validate this pipeline using the training data without outliers. 
* Note: if your smart sample is somewhat imbalanced, you might want to change the scoring method in GridSearchCV (see the [doc](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)).


Once we fit the pipeline with gridsearch, we identify the best model and give an unbiased evaluation using the test set that we created in Part II. For unbiased evaluation we report confusion matrix, precision, recall, f1-score, accuracy, and other measures if you like. 

**Optional: Those who are interested in writing custom codes for adding an outlier detection method into the sklearn pipeline, please follow this discussion [thread](https://stackoverflow.com/questions/52346725/can-i-add-outlier-detection-and-removal-to-scikit-learn-pipeline).**


**Note:** <span style='background:yellow'>We will be using Grid Search to find the optimal parameters of the pipelines.</span>

You can add more notebook cells or import any Python modules as needed.

In [6]:
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

from sklearn.pipeline import Pipeline


### Your 1st pipeline 
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation
  
Add cells as needed. 

Isolation Forest -> Standard Scalar -> Select K Best -> Random Forest

In [7]:
# Anamoly Detection

In [8]:
# Add anomaly detection code  (Question #E201)
# ----------------------------------

from sklearn.ensemble import IsolationForest

# Create an IsolationForest object
clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=43)

# Fit the model to the data
clf.fit(X_train)

# Predict outliers/anomalies in the data
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)


In [9]:
# Pipeline

In [10]:
# Add codes for feature selection and classification pipeline with grid search  (Question #E202)
# ----------------------------------

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

numeric_features = ['national_inv', 'lead_time', 'in_transit_qty', 'forecast_3_month', 
                    'forecast_6_month', 'forecast_9_month', 'sales_1_month', 'sales_3_month', 
                    'sales_6_month', 'sales_9_month', 'min_bank', 'pieces_past_due', 
                    'perf_6_month_avg', 'perf_12_month_avg', 'local_bo_qty']
binary_features = ['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk', 'stop_auto_buy', 
                   'rev_stop']


# define the pipeline
numeric_transformer = Pipeline([
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('bin', 'passthrough', binary_features),
])

feature_selection = Pipeline([
    ('selector', SelectKBest(f_classif, k=5))
])

binary_classification = Pipeline([
    ('classifier', RandomForestClassifier())
])

pipeline1 = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selection),
    ('binary_classification', binary_classification)
])

pipeline1.fit(X_train, y_train)

  f = msb / msw


Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['national_inv', 'lead_time',
                                                   'in_transit_qty',
                                                   'forecast_3_month',
                                                   'forecast_6_month',
                                                   'forecast_9_month',
                                                   'sales_1_month',
                                                   'sales_3_month',
                                                   'sales_6_month',
                                                   'sales_9_month', 'min_bank',
                                                   'pieces_past_due',
                             

In [11]:
# Hyper-parameter tuning

In [12]:
RandomForestClassifier().get_params().keys()

dict_keys(['bootstrap', 'ccp_alpha', 'class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])

In [13]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search over
param_grid = {
    'feature_selection__selector__k': [5, 10, 15],
    'binary_classification__classifier__n_estimators': [50, 100, 200],
    'binary_classification__classifier__max_depth': [10, 20, None],
    'binary_classification__classifier__min_samples_split': [2, 5, 10],
    'binary_classification__classifier__min_samples_leaf': [1, 2, 4],
}

# Create a GridSearchCV object
grid_search = GridSearchCV(pipeline1, param_grid=param_grid, cv=5, n_jobs=-1)

# Fit the GridSearchCV object to the data
grid_search.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print('Best parameters found by grid search:', grid_search.best_params_)


  f = msb / msw


Best parameters found by grid search: {'binary_classification__classifier__max_depth': None, 'binary_classification__classifier__min_samples_leaf': 1, 'binary_classification__classifier__min_samples_split': 2, 'binary_classification__classifier__n_estimators': 200, 'feature_selection__selector__k': 15}


In [14]:
# Given an unbiased evaluation  (Question #E203)
# ----------------------------------

best_params = grid_search.best_params_

pipeline1.set_params(**best_params)
pipeline1.fit(X_test, y_test)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['national_inv', 'lead_time',
                                                   'in_transit_qty',
                                                   'forecast_3_month',
                                                   'forecast_6_month',
                                                   'forecast_9_month',
                                                   'sales_1_month',
                                                   'sales_3_month',
                                                   'sales_6_month',
                                                   'sales_9_month', 'min_bank',
                                                   'pieces_past_due',
                             

In [15]:
# Model Evaluation Metrics

In [16]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions on the test set using the trained model
y_pred = pipeline1.predict(X_test)

# Generate a confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Generate a classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)


Confusion Matrix:
 [[2793   15]
 [  10 2829]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.99      1.00      2808
           1       0.99      1.00      1.00      2839

    accuracy                           1.00      5647
   macro avg       1.00      1.00      1.00      5647
weighted avg       1.00      1.00      1.00      5647



In [17]:
from sklearn.metrics import accuracy_score

# Calculate the overall accuracy of the model
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)


Accuracy: 0.995572870550735


#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## <span style="background: yellow;">Commit your code!</span> 

### Your 2nd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

EllipticEnvelope -> Standard Scalar -> PCA -> Logistic Regression

In [18]:
# Add anomaly detection code  (Question #E205)
# ----------------------------------

from sklearn.covariance import EllipticEnvelope
import numpy as np

# Fit the elliptic envelope model
clf = EllipticEnvelope(contamination=0.1)
clf.fit(X_train)

# Predict the labels for the data points
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)

# Print the predicted labels and the number of inliers and outliers
print("Predicted labels:", y_pred_train)
print("Number of inliers:", len(y_pred_train[y_pred_train == 1]))
print("Number of outliers:", len(y_pred_train[y_pred_train == -1]))



Predicted labels: [ 1 -1  1 ...  1  1  1]
Number of inliers: 15245
Number of outliers: 1694


In [19]:
# Add codes for feature selection and classification pipeline with grid search  (Question #E206)
# ----------------------------------

from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# define the pipeline
numeric_transformer = Pipeline([
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('bin', 'passthrough', binary_features),
])

feature_selection = Pipeline([
    ('pca', PCA(n_components=5))
])

binary_classification = Pipeline([
    ('classifier', LogisticRegression())
])

pipeline2 = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selection),
    ('binary_classification', binary_classification)
])

pipeline2.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['national_inv', 'lead_time',
                                                   'in_transit_qty',
                                                   'forecast_3_month',
                                                   'forecast_6_month',
                                                   'forecast_9_month',
                                                   'sales_1_month',
                                                   'sales_3_month',
                                                   'sales_6_month',
                                                   'sales_9_month', 'min_bank',
                                                   'pieces_past_due',
                             

In [20]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search over
param_grid = {
    'binary_classification__classifier__C': [0.1, 1, 10],
    'binary_classification__classifier__penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'binary_classification__classifier__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'feature_selection__pca__n_components': [2, 5, 10],
    'feature_selection__pca__whiten': [True, False]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(pipeline2, param_grid=param_grid, cv=5, n_jobs=-1)

# Fit the GridSearchCV object to the data
grid_search.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print('Best parameters found by grid search:', grid_search.best_params_)


        nan        nan        nan        nan        nan        nan
 0.55080044 0.55133169 0.54932404 0.55534557 0.60434384 0.60487561
        nan        nan        nan        nan        nan        nan
 0.55877013 0.55457859 0.53716244 0.54430573 0.57854663 0.57701196
 0.55398829 0.55286659 0.54483763 0.5496197  0.58704758 0.58964526
 0.55398829 0.55286659 0.54483763 0.54973776 0.58710662 0.58964526
 0.55369311 0.55274853 0.54477858 0.54973778 0.58710662 0.58958623
 0.55505091 0.5532799  0.54105889 0.54507336 0.58297421 0.58025894
 0.55894728 0.55475573 0.5373395  0.54489603 0.57860573 0.57831069
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
 0.55298462 0.55298462 0.5552866  0.5552866  0.62435608 0.6243

Best parameters found by grid search: {'binary_classification__classifier__C': 10, 'binary_classification__classifier__penalty': 'l1', 'binary_classification__classifier__solver': 'liblinear', 'feature_selection__pca__n_components': 10, 'feature_selection__pca__whiten': False}


In [21]:
# Given an unbiased evaluation  (Question #E207)
# ----------------------------------

best_params = grid_search.best_params_

pipeline2.set_params(**best_params)
pipeline2.fit(X_test, y_test)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['national_inv', 'lead_time',
                                                   'in_transit_qty',
                                                   'forecast_3_month',
                                                   'forecast_6_month',
                                                   'forecast_9_month',
                                                   'sales_1_month',
                                                   'sales_3_month',
                                                   'sales_6_month',
                                                   'sales_9_month', 'min_bank',
                                                   'pieces_past_due',
                             

In [22]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions on the test set using the trained model
y_pred = pipeline2.predict(X_test)

# Generate a confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Generate a classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)

from sklearn.metrics import accuracy_score

# Calculate the overall accuracy of the model
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)


Confusion Matrix:
 [[1426 1382]
 [ 419 2420]]
Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.51      0.61      2808
           1       0.64      0.85      0.73      2839

    accuracy                           0.68      5647
   macro avg       0.70      0.68      0.67      5647
weighted avg       0.70      0.68      0.67      5647

Accuracy: 0.6810695944749424


#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## <span style="background: yellow;">Commit your code!</span> 

### Your 3rd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

IsolationForest -> Standard Scalar -> Variance Threshold -> Decision Tree

In [23]:
# Add anomaly detection code  (Question #E209)
# ----------------------------------

from sklearn.ensemble import IsolationForest

# Create an IsolationForest object
clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=43)

# Fit the model to the data
clf.fit(X_train)

# Predict outliers/anomalies in the data
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)


In [24]:
# Add codes for feature selection and classification pipeline with grid search  (Question #E210)
# ----------------------------------

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.tree import DecisionTreeClassifier

numeric_features = ['national_inv', 'lead_time', 'in_transit_qty', 'forecast_3_month', 
                    'forecast_6_month', 'forecast_9_month', 'sales_1_month', 'sales_3_month', 
                    'sales_6_month', 'sales_9_month', 'min_bank', 'pieces_past_due', 
                    'perf_6_month_avg', 'perf_12_month_avg', 'local_bo_qty']
binary_features = ['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk', 'stop_auto_buy', 
                   'rev_stop']

# define the pipeline
numeric_transformer = Pipeline([
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('bin', 'passthrough', binary_features),
])

feature_selection = Pipeline([
    ('selector', VarianceThreshold(threshold=0.01))
])

binary_classification = Pipeline([
    ('classifier', DecisionTreeClassifier())
])

pipeline3 = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selection),
    ('binary_classification', binary_classification)
])

pipeline3.fit(X_train, y_train)


Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['national_inv', 'lead_time',
                                                   'in_transit_qty',
                                                   'forecast_3_month',
                                                   'forecast_6_month',
                                                   'forecast_9_month',
                                                   'sales_1_month',
                                                   'sales_3_month',
                                                   'sales_6_month',
                                                   'sales_9_month', 'min_bank',
                                                   'pieces_past_due',
                             

In [26]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search over

param_grid = {
    'feature_selection__selector__threshold': [0.0, 0.01, 0.1, 1],
    'binary_classification__classifier__criterion': ['gini', 'entropy'],
    'binary_classification__classifier__max_depth': [None, 5, 10, 20],
    'binary_classification__classifier__min_samples_split': [2, 5, 10],
    'binary_classification__classifier__min_samples_leaf': [1, 2, 4],
}

# Create a GridSearchCV object
grid_search = GridSearchCV(pipeline3, param_grid=param_grid, cv=5, n_jobs=-1)

# Fit the GridSearchCV object to the data
grid_search.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print('Best parameters found by grid search:', grid_search.best_params_)


Best parameters found by grid search: {'binary_classification__classifier__criterion': 'gini', 'binary_classification__classifier__max_depth': 10, 'binary_classification__classifier__min_samples_leaf': 1, 'binary_classification__classifier__min_samples_split': 2, 'feature_selection__selector__threshold': 0.1}


In [27]:
# Given an unbiased evaluation  (Question #E211)
# ----------------------------------

best_params = grid_search.best_params_

pipeline3.set_params(**best_params)
pipeline3.fit(X_test, y_test)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['national_inv', 'lead_time',
                                                   'in_transit_qty',
                                                   'forecast_3_month',
                                                   'forecast_6_month',
                                                   'forecast_9_month',
                                                   'sales_1_month',
                                                   'sales_3_month',
                                                   'sales_6_month',
                                                   'sales_9_month', 'min_bank',
                                                   'pieces_past_due',
                             

In [28]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions on the test set using the trained model
y_pred = pipeline3.predict(X_test)

# Generate a confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Generate a classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)

from sklearn.metrics import accuracy_score

# Calculate the overall accuracy of the model
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)


Confusion Matrix:
 [[2588  220]
 [ 131 2708]]
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.92      0.94      2808
           1       0.92      0.95      0.94      2839

    accuracy                           0.94      5647
   macro avg       0.94      0.94      0.94      5647
weighted avg       0.94      0.94      0.94      5647

Accuracy: 0.9378431025323181


#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## Compare these three pipelines and discuss your findings

## <span style="background: yellow;">Commit your code!</span> 

### Pickle the required pipeline/models for Part III.

In [29]:
import joblib

joblib.dump([X_resampled, y_resampled, pipeline1], 'pipeline-1.pkl')


['pipeline-1.pkl']

You should have made a few commits so far of this project.  
**Definitely make a commit of the notebook now!**  
Comment should be: `Final Project, Checkpoint - Pipelines done`


# Save your notebook!
## Then `File > Close and Halt`