# Part II: Model Development

In this part, we develop three distinct pipelines for predicting backorders. We use the smart sample from Part I to fit and evaluate these pipelines.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd

## Reload the smart sample here

In [2]:
# Reload the smart sample from a local file
# ----------------------------------

import joblib
sampled_X, sampled_y = joblib.load('sampled_data.pkl')

In [3]:
sampled_X.head()

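| Unnamed: 0 | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | min_bank | potential_issue | pieces_past_due | local_bo_qty | deck_risk | oe_constraint | ppap_risk | stop_auto_buy | rev_stop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57.0 | 8.0 | 0.0 | 37.0 | 84.0 | 131.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 15.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 13.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0.0 | 0.0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0.0 | 4.0 | 0.0 | 9.0 | 10.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 5.0 | 8.0 | 24.0 | 39.0 | 67.0 | 95.0 | 5.0 | 25.0 | 50.0 | 71.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 |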


In [4]:
sampled_y.head()

0    0
1    0
2    0
3    0
4    0
Name: went_on_backorder, dtype: int64

## Normalize/standardize the data if required; otherwise ignore. You can perform this step inside the pipeline (if required). 
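
If scaling is needed, one way to keep it inside the pipeline is sketched below, so that the scaler is re-fitted on the training folds during cross-validation. The synthetic data from `make_classification` and the step names `scale` and `clf` are assumptions made for illustration; they are not part of this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the smart sample (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is a pipeline step, so it is fitted only on each training fold
scaled_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Five-fold cross-validation of the whole pipeline (default scoring: accuracy)
scores = cross_val_score(scaled_pipe, X_demo, y_demo, cv=5)
print(scores.mean())
```

Tree-based models such as Random Forest and Gradient Boosting are largely insensitive to feature scaling, so this step matters mostly for distance-based or margin-based models such as SVC.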

## Split the data into Train/Test

In [5]:
X=sampled_X
y=sampled_y

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

## Developing Pipeline

In this section, we design an operationalized machine learning pipeline, which includes:

* Anomaly detection
* Dimensionality reduction
* Classification model training


We are free to use any of the models that we learned in the past or we can use new models. Here is a pool of methods: 

### Pool of Anomaly Detection Methods (Discussed in M4)
1. IsolationForest
2. EllipticEnvelope
3. LocalOutlierFactor
4. OneClassSVM
5. SGDOneClassSVM

### Pool of Feature Selection Methods (Discussed in M3)

1. VarianceThreshold
2. SelectKBest with any scoring function (e.g., chi2, f_classif, mutual_info_classif)
3. SelectPercentile
4. SelectFpr, SelectFdr, or SelectFwe
5. GenericUnivariateSelect
6. PCA
7. Factor Analysis
8. RFE
9. SelectFromModel


### Classification Methods (Discussed in M1-M2)
1. Decision Tree
2. Random Forest
3. Logistic Regression
4. Naive Bayes
5. Linear SVC
6. SVC with kernels
7. KNeighborsClassifier
8. GradientBoostingClassifier
9. XGBClassifier
10. LGBM Classifier



It is difficult to add an anomaly detection method to an sklearn pipeline without writing custom code. For simplicity, we fit the anomaly detector outside the pipeline, so the workflow has two steps:
* Step I: fit an outlier detector on the training set and remove the samples it flags.
* Step II: define a pipeline using a feature selection method and a classification method. Then cross-validate this pipeline using the training data without outliers.
* Note: if your smart sample is somewhat imbalanced, you might want to change the scoring method in GridSearchCV (see the [doc](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)).
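
The two steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the configuration used later in this notebook: the data, the names `detector`, `demo_pipe`, and `demo_search`, and the choice of `LogisticRegression` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic, mildly imbalanced stand-in for the training split (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.7, 0.3], random_state=0,
)

# Step I: fit an outlier detector on the training data and keep the inliers
detector = IsolationForest(contamination=0.05, random_state=0).fit(X_demo)
inlier_mask = detector.predict(X_demo) == 1  # +1 = inlier, -1 = outlier
X_inliers, y_inliers = X_demo[inlier_mask], y_demo[inlier_mask]

# Step II: feature selection + classifier, cross-validated on the inliers only
demo_pipe = Pipeline(steps=[
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])
demo_grid = {"select__k": [3, 5, 10], "clf__C": [0.1, 1.0, 10.0]}

# scoring="f1" replaces the default (accuracy), which can mislead on imbalanced data
demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="f1")
demo_search.fit(X_inliers, y_inliers)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

Parameter names in the grid follow the `<step name>__<parameter>` convention, so `select__k` addresses the `k` argument of the step named `select`.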


Once we fit the pipeline with grid search, we identify the best model and give an unbiased evaluation using the test set that we created in Part II. For this evaluation we report the confusion matrix, precision, recall, F1-score, accuracy, and any other measures you like.

**Optional: Those who are interested in writing custom code to add an outlier detection method to the sklearn pipeline can follow this discussion [thread](https://stackoverflow.com/questions/52346725/can-i-add-outlier-detection-and-removal-to-scikit-learn-pipeline).**
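
As a rough sketch of what such custom code could look like, the wrapper below removes outliers during `fit` and then trains an inner classifier on the remaining rows. The class name `OutlierFilteredClassifier` and its attributes are hypothetical, and this is only one possible pattern; it is not necessarily the approach proposed in the linked thread.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression


class OutlierFilteredClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: drop training outliers, then fit the classifier."""

    def __init__(self, detector, classifier):
        self.detector = detector
        self.classifier = classifier

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # fit_predict returns +1 for inliers and -1 for outliers
        keep = clone(self.detector).fit_predict(X) == 1
        self.classifier_ = clone(self.classifier).fit(X[keep], y[keep])
        self.classes_ = self.classifier_.classes_
        self.n_removed_ = int((~keep).sum())
        return self

    def predict(self, X):
        # No rows are dropped at prediction time; every sample gets a label
        return self.classifier_.predict(np.asarray(X))


# Synthetic data for illustration only (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
wrapped = OutlierFilteredClassifier(
    detector=IsolationForest(contamination=0.05, random_state=0),
    classifier=LogisticRegression(max_iter=1000),
)
wrapped.fit(X_demo, y_demo)
print(wrapped.n_removed_, wrapped.predict(X_demo[:5]))
```

Because the filtering happens inside `fit`, the detector is re-fitted on each training fold if the wrapper is cross-validated, and validation rows are never removed.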


**Note:** <span style='background:yellow'>We will be using Grid Search to find the optimal parameters of the pipelines.</span>

You can add more notebook cells or import any Python modules as needed.

In [7]:
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, recall_score, fbeta_score
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from pprint import pprint


### Your 1st pipeline 
  * Anomaly detection (Elliptic Envelope)
  * Dimensionality reduction (SelectKBest)
  * Model training/validation (Gradient Boosting Classifier)
  
Add cells as needed. 

In [8]:
X1_train=X_train
y1_train=y_train

In [9]:
# Add anomaly detection code  (Question #E201)
# ----------------------------------
envelope = EllipticEnvelope(support_fraction=1, contamination=0.1).fit(X1_train)

# Create a boolean indexing array to pick up outliers
outliers = envelope.predict(X1_train)==-1

# Re-slice X,y into a cleaned dataset with outliers excluded
X1_train__clean = X1_train[~outliers]
y1_train_clean = y1_train[~outliers]

print(f"Num of outliers = {np.sum(outliers)}")

Num of outliers = 1514


In [10]:
#import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier

In [11]:
gbc = GradientBoostingClassifier(random_state = 42)
print('Parameters currently in use:\n')
pprint(gbc.get_params())

Parameters currently in use:

{'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'deviance',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_iter_no_change': None,
 'random_state': 42,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}


In [22]:
# Add code for feature selection and classification pipeline with grid search  (Question #E202)
# ----------------------------------

In [11]:
pipe1xtra = Pipeline(steps=[('select', SelectKBest()),('gbc', GradientBoostingClassifier())])
pipe1xtra.fit(X1_train, y1_train)

param_gridxtra = {
    'select__k': [8,10,15],
    'gbc__max_features': ['auto','sqrt','log2'],
    'gbc__learning_rate': [0.01,0.1,0.2,0.5],
    'gbc__n_estimators': [50,100,500,1000]
}

In [12]:
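# Note: this search is fitted on the full training split (X1_train),
# not on the outlier-free X1_train__clean built above
model_gridxtra = GridSearchCV(pipe1xtra,param_gridxtra,cv=5)
model_gridxtra.fit(X1_train,y1_train)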

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('select', SelectKBest()),
                                       ('gbc', GradientBoostingClassifier())]),
             param_grid={'gbc__learning_rate': [0.01, 0.1, 0.2, 0.5],
                         'gbc__max_features': ['auto', 'sqrt', 'log2'],
                         'gbc__n_estimators': [50, 100, 500, 1000],
                         'select__k': [8, 10, 15]})

In [13]:
print(model_gridxtra.best_estimator_)

Pipeline(steps=[('select', SelectKBest(k=15)),
                ('gbc',
                 GradientBoostingClassifier(max_features='auto',
                                            n_estimators=500))])


In [15]:
# Give an unbiased evaluation  (Question #E203)
# ----------------------------------

In [14]:
model_gridxtra.best_score_

0.8846158532231725

In [34]:
train_pred= model_gridxtra.predict(X1_train)

##### Performance on entire training data

In [101]:
train_predw= model_gridxtra.best_estimator_.predict(X1_train)

In [102]:
print(classification_report(y1_train, train_predw)) 

              precision    recall  f1-score   support

           0       0.92      0.89      0.90      7566
           1       0.90      0.92      0.91      7566

    accuracy                           0.91     15132
   macro avg       0.91      0.91      0.91     15132
weighted avg       0.91      0.91      0.91     15132



In [103]:
print(confusion_matrix(y1_train,train_predw))

[[6756  810]
 [ 623 6943]]


In [104]:
print("Accuracy Score: ", accuracy_score(y1_train,train_predw))
print("Recall Score: ",recall_score(y1_train,train_predw))
print("F1 Score: ",f1_score(y1_train,train_predw))

Accuracy Score:  0.9053000264340471
Recall Score:  0.9176579434311393
F1 Score:  0.906456034989229


##### Performance on training data after outlier removal

In [97]:
train_pred= model_gridxtra.best_estimator_.predict(X1_train__clean)

In [98]:
print(classification_report(y1_train_clean, train_pred)) 

              precision    recall  f1-score   support

           0       0.91      0.88      0.89      6728
           1       0.89      0.91      0.90      6890

    accuracy                           0.90     13618
   macro avg       0.90      0.90      0.90     13618
weighted avg       0.90      0.90      0.90     13618



In [99]:
print(confusion_matrix(y1_train_clean,train_pred))

[[5933  795]
 [ 611 6279]]


In [100]:
print("Accuracy Score: ", accuracy_score(y1_train_clean,train_pred))
print("Recall Score: ",recall_score(y1_train_clean,train_pred))
print("F1 Score: ",f1_score(y1_train_clean,train_pred))

Accuracy Score:  0.8967542957849904
Recall Score:  0.9113207547169812
F1 Score:  0.8993125179031797


##### Performance on test data

In [23]:
y_pred=model_gridxtra.best_estimator_.predict(X_test)

In [24]:
print(classification_report(y_test, y_pred)) 
print(confusion_matrix(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.90      0.86      0.88      3727
           1       0.87      0.90      0.88      3727

    accuracy                           0.88      7454
   macro avg       0.88      0.88      0.88      7454
weighted avg       0.88      0.88      0.88      7454

[[3205  522]
 [ 363 3364]]


In [25]:
print("Accuracy Score: ", accuracy_score(y_test,y_pred))
print("Recall Score: ",recall_score(y_test,y_pred))
print("F1 Score: ",f1_score(y_test,y_pred))

Accuracy Score:  0.8812718003756372
Recall Score:  0.9026026294606923
F1 Score:  0.8837514777354525


#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## <span style="background: yellow;">Commit your code!</span> 

### Your 2nd pipeline
  * Anomaly detection (Isolation Forest)
  * Dimensionality reduction (SelectFromModel- LinearSVC)
  * Model training/validation (Random Forest)

In [26]:
X2_train=X_train
y2_train=y_train

In [27]:
# Add anomaly detection code  (Question #E205)
# ----------------------------------
# Construct IsolationForest 
iso_forest = IsolationForest(contamination=0.08).fit(X2_train, y2_train)

# Get labels from classifier and cull outliers #P4006
iso_outliers = iso_forest.predict(X2_train)==-1

print(f"Num of outliers = {np.sum(iso_outliers)}")
X2_iso = X2_train[~iso_outliers]
y2_iso = y2_train[~iso_outliers]


Num of outliers = 1211
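
The outlier counts reported so far can be sanity-checked against the `contamination` settings. For both `EllipticEnvelope` and `IsolationForest`, `contamination` sets the decision threshold so that roughly that fraction of the fitting data is flagged. The short check below uses only numbers printed in this notebook (15132 training rows, 1514 and 1211 flagged rows); the variable names are illustrative.

```python
n_train = 15132  # rows in the training split, from the classification reports

# (detector, contamination setting, number of outliers reported in this notebook)
runs = [
    ("EllipticEnvelope", 0.10, 1514),
    ("IsolationForest", 0.08, 1211),
]

for name, contamination, reported in runs:
    # Expected count is simply the requested fraction of the training rows
    expected = contamination * n_train
    print(f"{name}: expected about {expected:.1f}, reported {reported}")
```

In other words, the number of rows removed is largely determined by the chosen `contamination` value rather than discovered from the data, so that value deserves the same scrutiny as any other hyperparameter.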


In [28]:
rf = RandomForestClassifier(random_state = 42)
print('Parameters currently in use:\n')
pprint(rf.get_params())

Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}


In [29]:
# Add code for feature selection and classification pipeline with grid search  (Question #E206)
# ----------------------------------
pipe2 = Pipeline([
  ('Lsvc', SelectFromModel(LinearSVC(penalty="l1",dual=False))),
  ('rf', RandomForestClassifier())
])
pipe2.fit(X2_train, y_train)

param_grid2 = {
    'rf__n_estimators': [200,600,1000,1400],
    'rf__max_depth': [10,20,30,40,50],
    'rf__max_features' : ['auto','sqrt']
}



In [30]:
rand_model2 = RandomizedSearchCV(pipe2,param_distributions=param_grid2,n_jobs=3,cv=8, n_iter=10) 

In [31]:
rand_model2.fit(X2_iso, y2_iso)

# Check the best chosen params
print(rand_model2.best_estimator_)

Pipeline(steps=[('Lsvc',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('rf',
                 RandomForestClassifier(max_depth=20, max_features='sqrt',
                                        n_estimators=600))])


In [42]:
rand_model2.best_score_

0.9029513854502961

In [43]:
rand_model2.cv_results_

{'mean_fit_time': array([17.35040626,  5.74955598, 17.75475857, 12.36253399,  7.1678555 ,
         7.29288924, 16.99081963,  7.34370896,  5.43790588, 16.99004331]),
 'std_fit_time': array([0.33920847, 0.15687944, 0.42107386, 0.3626464 , 0.14952638,
        0.12155904, 0.70699019, 0.33995919, 0.15053448, 0.57646339]),
 'mean_score_time': array([0.67183232, 0.20813221, 0.66371647, 0.45752627, 0.27044716,
        0.27280495, 0.63566518, 0.27168438, 0.19322512, 0.63367063]),
 'std_score_time': array([0.01856905, 0.01038165, 0.01526353, 0.01663008, 0.00801413,
        0.00835922, 0.01772929, 0.00624072, 0.00734648, 0.01841053]),
 'param_rf__n_estimators': masked_array(data=[1400, 600, 1400, 1000, 600, 600, 1400, 600, 600, 1400],
              mask=[False, False, False, False, False, False, False, False,
                    False, False],
        fill_value='?',
             dtype=object),
 'param_rf__max_features': masked_array(data=['sqrt', 'sqrt', 'sqrt', 'sqrt', 'sqrt', 'auto', 'sqrt',
 

##### Performance on entire training data

In [105]:
train_pred2 = rand_model2.best_estimator_.predict(X2_train)

In [107]:
print(classification_report(y2_train, train_pred2)) 

              precision    recall  f1-score   support

           0       0.98      0.95      0.96      7566
           1       0.95      0.98      0.96      7566

    accuracy                           0.96     15132
   macro avg       0.96      0.96      0.96     15132
weighted avg       0.96      0.96      0.96     15132



In [108]:
print(confusion_matrix(y2_train,train_pred2))

[[7179  387]
 [ 156 7410]]


In [109]:
print("Accuracy Score: ", accuracy_score(y2_train,train_pred2))
print("Recall Score: ",recall_score(y2_train,train_pred2))
print("F1 Score: ",f1_score(y2_train,train_pred2))

Accuracy Score:  0.9641157811260904
Recall Score:  0.979381443298969
F1 Score:  0.964655340753759


##### Performance on training data after outlier removal 

In [91]:
train_pred2iso = rand_model2.best_estimator_.predict(X2_iso)

In [92]:
print(classification_report(y2_iso, train_pred2iso)) 

              precision    recall  f1-score   support

           0       0.99      0.98      0.98      6980
           1       0.98      0.99      0.98      6941

    accuracy                           0.98     13921
   macro avg       0.98      0.98      0.98     13921
weighted avg       0.98      0.98      0.98     13921



In [95]:
print(confusion_matrix(y2_iso,train_pred2iso))

[[6836  144]
 [  90 6851]]


In [96]:
print("Accuracy Score: ", accuracy_score(y2_iso,train_pred2iso))
print("Recall Score: ",recall_score(y2_iso,train_pred2iso))
print("F1 Score: ",f1_score(y2_iso,train_pred2iso))

Accuracy Score:  0.9831908627253789
Recall Score:  0.9870335686500504
F1 Score:  0.9832089552238806


##### Performance on test data

In [50]:
predicted_y2 = rand_model2.predict(X_test)
print(classification_report(y_test, predicted_y2)) 

              precision    recall  f1-score   support

           0       0.93      0.84      0.88      3727
           1       0.85      0.93      0.89      3727

    accuracy                           0.89      7454
   macro avg       0.89      0.89      0.89      7454
weighted avg       0.89      0.89      0.89      7454



In [51]:
print(confusion_matrix(y_test,predicted_y2))

[[3126  601]
 [ 244 3483]]


In [52]:
print("Accuracy Score: ", accuracy_score(y_test,predicted_y2))
print("Recall Score: ",recall_score(y_test,predicted_y2))
print("F1 Score: ",f1_score(y_test,predicted_y2))

Accuracy Score:  0.8866380466863429
Recall Score:  0.934531795009391
F1 Score:  0.8918192292920242


#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## <span style="background: yellow;">Commit your code!</span> 

### Your 3rd pipeline
  * Anomaly detection (Local Outlier Factor)
  * Dimensionality reduction (PCA)
  * Model training/validation (SVC)

In [53]:
X3_train= X_train
y3_train= y_train

In [39]:
# Add anomaly detection code  (Question #E209)
# ----------------------------------

In [54]:
lof_labels = LocalOutlierFactor(n_neighbors=10).fit_predict(X3_train, y3_train)
inliers = lof_labels == 1 # select inliers
X3_clean = X3_train[inliers]
y3_clean = y3_train[inliers]

In [55]:
svc = SVC(random_state = 42)
print('Parameters currently in use:\n')
pprint(svc.get_params())

Parameters currently in use:

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': 42,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}


In [56]:
clf_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('PCA', PCA()), 
    ('SVC', SVC()) 
])

In [57]:
param_grid = {'SVC__C': uniform(1000,100000),
              'SVC__gamma': uniform(0.1,0.001), 
              'PCA__n_components': [10],
              'SVC__kernel': ['rbf']}
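
A note on the two `uniform` distributions above: `scipy.stats.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, not from `[low, high]`. The sketch below prints the ranges that the search actually draws from; the variable names are illustrative.

```python
from scipy.stats import uniform

# scipy's uniform(loc, scale) covers [loc, loc + scale], not [low, high]
c_dist = uniform(1000, 100000)
gamma_dist = uniform(0.1, 0.001)

# ppf(0) and ppf(1) give the lower and upper end of each range
c_low, c_high = c_dist.ppf(0.0), c_dist.ppf(1.0)
gamma_low, gamma_high = gamma_dist.ppf(0.0), gamma_dist.ppf(1.0)

print(f"SVC__C range:     [{c_low:.0f}, {c_high:.0f}]")
print(f"SVC__gamma range: [{gamma_low:.3f}, {gamma_high:.3f}]")
```

So `SVC__gamma` is drawn from a very narrow band just above 0.1, which is why every sampled value in this notebook's search is close to 0.100. If a wider search over several orders of magnitude is wanted, `scipy.stats.loguniform` may be a better fit, although that would be a change from the setup used here.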

In [58]:
rand_model = RandomizedSearchCV(clf_pipe,param_distributions= param_grid, n_jobs=3,cv=5, n_iter=6)

In [59]:
rand_model.fit(X3_clean, y3_clean)

# Check the best chosen params
print(rand_model.best_estimator_)

Pipeline(steps=[('scale', MinMaxScaler()), ('PCA', PCA(n_components=10)),
                ('SVC', SVC(C=98495.71245065526, gamma=0.10055261454104605))])


In [60]:
rand_model.best_score_

0.7844019368636539

In [61]:
rand_model.cv_results_

{'mean_fit_time': array([12.12050505, 12.26125865, 12.18278093,  8.43253322, 12.1598495 ,
        12.36184649]),
 'std_fit_time': array([0.93559875, 0.82446408, 0.84118685, 0.28055503, 0.89801796,
        0.91048539]),
 'mean_score_time': array([1.01179118, 1.02409248, 0.98677077, 1.23256903, 0.97177997,
        0.98433337]),
 'std_score_time': array([0.01617578, 0.0266186 , 0.01985692, 0.01099956, 0.01438121,
        0.02082859]),
 'param_PCA__n_components': masked_array(data=[10, 10, 10, 10, 10, 10],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_SVC__C': masked_array(data=[68878.47400189278, 63272.143484310836,
                    87725.74118358923, 5209.443108255428,
                    98495.71245065526, 89434.60057516643],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_SVC__gamma': masked_array(data=[0.10042231518890962, 0.1004744

In [67]:
# Give an unbiased evaluation  (Question #E211)
# ----------------------------------

##### Performance on entire training data

In [62]:
train_pred3= rand_model.best_estimator_.predict(X3_train)

In [64]:
print(classification_report(y3_train, train_pred3)) 

              precision    recall  f1-score   support

           0       0.90      0.60      0.72      7566
           1       0.70      0.94      0.80      7566

    accuracy                           0.77     15132
   macro avg       0.80      0.77      0.76     15132
weighted avg       0.80      0.77      0.76     15132



In [66]:
print(confusion_matrix(y3_train,train_pred3))

[[4516 3050]
 [ 482 7084]]


In [71]:
print("Accuracy Score: ", accuracy_score(y3_train,train_pred3))
print("Recall Score: ",recall_score(y3_train,train_pred3))
print("F1 Score: ",f1_score(y3_train,train_pred3))

Accuracy Score:  0.7665873645255089
Recall Score:  0.936293946603225
F1 Score:  0.8004519774011298


##### Performance on training data after outlier removal

In [110]:
train_pred3c= rand_model.best_estimator_.predict(X3_clean)

In [111]:
print(classification_report(y3_clean, train_pred3c)) 

              precision    recall  f1-score   support

           0       0.91      0.61      0.73      6446
           1       0.72      0.94      0.82      7042

    accuracy                           0.78     13488
   macro avg       0.82      0.77      0.77     13488
weighted avg       0.81      0.78      0.78     13488



In [112]:
print(confusion_matrix(y3_clean,train_pred3c))

[[3916 2530]
 [ 405 6637]]


In [113]:
print("Accuracy Score: ", accuracy_score(y3_clean,train_pred3c))
print("Recall Score: ",recall_score(y3_clean,train_pred3c))
print("F1 Score: ",f1_score(y3_clean,train_pred3c))

Accuracy Score:  0.7823991696322657
Recall Score:  0.9424879295654643
F1 Score:  0.8189277561848355


##### Performance on test data

In [68]:
predicted_y3 = rand_model.predict(X_test)
print(classification_report(y_test, predicted_y3)) 

              precision    recall  f1-score   support

           0       0.90      0.58      0.71      3727
           1       0.69      0.93      0.79      3727

    accuracy                           0.76      7454
   macro avg       0.79      0.76      0.75      7454
weighted avg       0.79      0.76      0.75      7454



In [69]:
print(confusion_matrix(y_test,predicted_y3))

[[2164 1563]
 [ 243 3484]]


In [70]:
print("Accuracy Score: ", accuracy_score(y_test,predicted_y3))
print("Recall Score: ",recall_score(y_test,predicted_y3))
print("F1 Score: ",f1_score(y_test,predicted_y3))

Accuracy Score:  0.7577139790716394
Recall Score:  0.9348001073249262
F1 Score:  0.7941645771597902


#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## Compare these three pipelines and discuss your findings

### Performance on Test Data

### Pipeline 1

In [72]:
y_pred=model_gridxtra.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred)) 

              precision    recall  f1-score   support

           0       0.90      0.86      0.88      3727
           1       0.87      0.90      0.88      3727

    accuracy                           0.88      7454
   macro avg       0.88      0.88      0.88      7454
weighted avg       0.88      0.88      0.88      7454



In [73]:
print(confusion_matrix(y_test,y_pred))

[[3205  522]
 [ 363 3364]]


In [74]:
print("Accuracy Score: ", accuracy_score(y_test,y_pred))
print("Recall Score: ",recall_score(y_test,y_pred))
print("F1 Score: ",f1_score(y_test,y_pred))

Accuracy Score:  0.8812718003756372
Recall Score:  0.9026026294606923
F1 Score:  0.8837514777354525


### Pipeline 2

In [78]:
predicted_y2 = rand_model2.predict(X_test)
print(classification_report(y_test, predicted_y2)) 

              precision    recall  f1-score   support

           0       0.93      0.84      0.88      3727
           1       0.85      0.93      0.89      3727

    accuracy                           0.89      7454
   macro avg       0.89      0.89      0.89      7454
weighted avg       0.89      0.89      0.89      7454



In [79]:
print(confusion_matrix(y_test,predicted_y2))

[[3126  601]
 [ 244 3483]]


In [80]:
print("Accuracy Score: ", accuracy_score(y_test,predicted_y2))
print("Recall Score: ",recall_score(y_test,predicted_y2))
print("F1 Score: ",f1_score(y_test,predicted_y2))

Accuracy Score:  0.8866380466863429
Recall Score:  0.934531795009391
F1 Score:  0.8918192292920242


### Pipeline 3

In [81]:
predicted_y3 = rand_model.best_estimator_.predict(X_test)
print(classification_report(y_test, predicted_y3)) 

              precision    recall  f1-score   support

           0       0.90      0.58      0.71      3727
           1       0.69      0.93      0.79      3727

    accuracy                           0.76      7454
   macro avg       0.79      0.76      0.75      7454
weighted avg       0.79      0.76      0.75      7454



In [82]:
print(confusion_matrix(y_test,predicted_y3))

[[2164 1563]
 [ 243 3484]]


In [83]:
print("Accuracy Score: ", accuracy_score(y_test,predicted_y3))
print("Recall Score: ",recall_score(y_test,predicted_y3))
print("F1 Score: ",f1_score(y_test,predicted_y3))

Accuracy Score:  0.7577139790716394
Recall Score:  0.9348001073249262
F1 Score:  0.7941645771597902
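
To put the three pipelines side by side, the test-set metrics can be recomputed directly from the confusion matrices printed above. The snippet below is a small self-contained sketch; the helper `summarize` and the dictionary layout are illustrative, and the only data used are the three matrices already shown.

```python
# Test-set confusion matrices copied from the outputs above,
# laid out as [[TN, FP], [FN, TP]] (scikit-learn's convention)
test_confusions = {
    "Pipeline 1": [[3205, 522], [363, 3364]],
    "Pipeline 2": [[3126, 601], [244, 3483]],
    "Pipeline 3": [[2164, 1563], [243, 3484]],
}


def summarize(matrix):
    """Return accuracy, precision, recall and F1 for the positive class."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


results = {name: summarize(m) for name, m in test_confusions.items()}

for name, metrics in results.items():
    row = "  ".join(f"{key}={value:.4f}" for key, value in metrics.items())
    print(f"{name}: {row}")
```

On these numbers, Pipeline 2 has the highest test accuracy and F1 score. Pipeline 3 reaches about the same recall on the positive class, but with far more false positives (1563 against 601).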


## <span style="background: yellow;">Commit your code!</span> 

### Pickle the required pipeline/models for Part III.

In [85]:
iso_forest

IsolationForest(contamination=0.08)

In [84]:
rand_model2.best_estimator_

Pipeline(steps=[('Lsvc',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('rf',
                 RandomForestClassifier(max_depth=20, max_features='sqrt',
                                        n_estimators=600))])

In [90]:
best_pipe= rand_model2.best_estimator_

joblib.dump(best_pipe,'best_model.joblib')



['best_model.joblib']

In [89]:
joblib.dump(iso_forest,'iso_forest.joblib')

['iso_forest.joblib']
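
Before moving on, it can be worth checking that a dumped model reloads and predicts identically. The round trip below is a sketch on synthetic data with small stand-in models written to a temporary directory; the file names and the `demo_*` objects are assumptions, not the artifacts saved above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline

# Small stand-ins for the fitted detector and pipeline (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_detector = IsolationForest(contamination=0.08, random_state=0).fit(X_demo)
demo_pipe = Pipeline(steps=[
    ("select", SelectKBest(k=4)),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
]).fit(X_demo, y_demo)

# Dump both objects, then load them back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    pipe_path = os.path.join(tmp_dir, "demo_model.joblib")
    detector_path = os.path.join(tmp_dir, "demo_detector.joblib")
    joblib.dump(demo_pipe, pipe_path)
    joblib.dump(demo_detector, detector_path)
    loaded_pipe = joblib.load(pipe_path)
    loaded_detector = joblib.load(detector_path)

# The reloaded objects should reproduce the original predictions
same_predictions = bool((loaded_pipe.predict(X_demo) == demo_pipe.predict(X_demo)).all())
same_flags = bool((loaded_detector.predict(X_demo) == demo_detector.predict(X_demo)).all())
print(same_predictions, same_flags)
```

Pickled scikit-learn models are generally only safe to reload with the same library version that produced them, so it helps to record the environment alongside the `.joblib` files.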

You should have made a few commits of this project so far.  
**Definitely make a commit of the notebook now!**  
Comment should be: `Final Project, Checkpoint - Pipelines done`


# Save your notebook!
## Then `File > Close and Halt`