# Machine Learning - Logistic Regression and XGBoost
# Approach 3

This Jupyter notebook walks through the machine learning process for the Logistic Regression and XGBoost classification models. The models were trained on the train dataset that was exported from Approach 3 (Part 2 - NLP). The random forest model is in a separate Jupyter notebook.

In [1]:
# Some basic packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# For splitting our data
from sklearn.model_selection import train_test_split

# For some simple model building
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# This gets rid of those annoying default solver messages when fitting logistic regression
import warnings
warnings.filterwarnings('ignore')

# For cross-validation
from sklearn.model_selection import cross_val_score

# For setting up a temporary directory for caching pipeline results
from tempfile import mkdtemp

# Pipeline
from sklearn.pipeline import Pipeline

# Some scalers we'll try later
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

# For trying PCA later
from sklearn.decomposition import PCA

# For cross-validated grid search
from sklearn.model_selection import GridSearchCV

### 1. Import Data/Setup Train, Validation, Test set

In [2]:
train = pd.read_csv('data/model_train.csv')
test = pd.read_csv('data/model_test.csv')

In [3]:
# This is the train dataset
X = train.drop('points', axis = 1)
y = train['points']

# This is the test dataset
X_test = test.drop('points', axis = 1)
y_test = test['points']

In [4]:
# Import train_test_split package
from sklearn.model_selection import train_test_split

# Split the training data into train and validation sets: test_size=0.30 holds out 30% for validation, leaving 70% for training
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=42, stratify = y)
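The `stratify = y` argument asks for the class proportions of `y` to be preserved in both splits. The following is a hand-rolled sketch of that idea using only the standard library. It is not scikit-learn's implementation, and the `stratified_split` helper and the toy labels are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, val_idx) with every class split in the same ratio."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Toy labels: 60% class 0 and 40% class 1 (made-up data, not the wine dataset).
labels = [0] * 60 + [1] * 40
train_idx, val_idx = stratified_split(labels, test_size=0.30, seed=42)

# The validation split keeps the same 60/40 class balance as the full data.
val_counts = Counter(labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), dict(val_counts))
```

Without stratification, a random 30% sample could by chance over- or under-represent one class, which makes validation scores harder to compare with train scores.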

### 2. Create a Baseline Model

In [6]:
# Baseline logistic regression
baseline_logreg = LogisticRegression(random_state=1).fit(X_train, y_train)

print(f'Accuracy on train set: {baseline_logreg.score(X_train, y_train)}')
print(f'Accuracy on validation set: {baseline_logreg.score(X_val, y_val)}')
print(f'Accuracy on test set: {baseline_logreg.score(X_test, y_test)}')

Accuracy on train set: 0.8256575331710764
Accuracy on validation set: 0.8214992306001319
Accuracy on test set: 0.8213479688141158


#### CELEBRATION!

As a baseline model, with no scaling, dimension reduction, or hyperparameter tuning applied, the accuracy score looks pretty good. One thing to note is that the validation (remainder) and test set accuracy scores are close to the train set accuracy score, which is a good indicator that the model is not overfitting.
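The overfitting check can be made explicit by computing the gap between the train accuracy and each held-out accuracy. This is a minimal sketch using the scores reported above. The `MAX_GAP` tolerance is a hypothetical value chosen only for illustration, not a standard threshold.

```python
# Accuracy values copied from the baseline output above.
baseline_scores = {
    "train": 0.8256575331710764,
    "validation": 0.8214992306001319,
    "test": 0.8213479688141158,
}

# Hypothetical tolerance, chosen only for illustration.
MAX_GAP = 0.01

# Gap between the train score and each held-out score.
gaps = {
    name: baseline_scores["train"] - score
    for name, score in baseline_scores.items()
    if name != "train"
}

for name, gap in gaps.items():
    status = "ok" if gap <= MAX_GAP else "possible overfitting"
    print(f"train - {name}: {gap:.4f} ({status})")
```

Both gaps are well under half a percentage point, which is consistent with the reading that the baseline generalizes about as well as it fits.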

### 3. Scaling, Dimension Reduction, Hyperparameter Tuning

In [5]:
# This package allows us to save the model so that we can import it and use it later
# (sklearn.externals.joblib was removed in scikit-learn 0.23, so import joblib directly)
import joblib

#### A. Logistic Regression

In [17]:
# Set up a directory to cache the pipeline results
cachedir = mkdtemp()

# Set up a pipeline
# The steps here act as placeholders and will be changed when we pass the pipeline into the grid search later
my_pipeline = Pipeline([('scaler', StandardScaler()), ('dim_reducer', PCA()), ('model', LogisticRegression())], memory=cachedir)

# Let's try the same range of C values from earlier
c_values = [.00001, .0001, .001, .1, 1, 10, 100, 1000, 10000]

# Parameter grid
param_grid = [

    {'scaler': [StandardScaler(), RobustScaler()],
     'dim_reducer': [PCA()],
     'dim_reducer__n_components': [10, 50, 100, 200, 400, 500, 600, 700, 800, 900, 1000],
     'model': [LogisticRegression(solver='lbfgs', random_state=1, n_jobs=-1)],
     'model__C': c_values}
]

# Instantiate the log reg grid search
logreg_gs = GridSearchCV(my_pipeline, param_grid=param_grid, n_jobs = -1, cv=5, verbose=10)

# Fit the log reg grid search
fitted_logreg_gs = logreg_gs.fit(X_train, y_train)

Fitting 5 folds for each of 198 candidates, totalling 990 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:  4.9min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:  7.3min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:  8.0min
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed:  8.8min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  9.5min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed: 10
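The "198 candidates, totalling 990 fits" line in the output follows directly from the parameter grid: every combination of scaler, `n_components`, and `C` is one candidate, and each candidate is fitted once per fold. A short sketch of that arithmetic, with the scalers represented by their names as plain strings:

```python
from itertools import product

# Grid values copied from the logistic regression param_grid above.
scalers = ["StandardScaler", "RobustScaler"]
n_components = [10, 50, 100, 200, 400, 500, 600, 700, 800, 900, 1000]
c_values = [.00001, .0001, .001, .1, 1, 10, 100, 1000, 10000]
cv_folds = 5

# Each (scaler, n_components, C) combination is one candidate pipeline.
candidates = list(product(scalers, n_components, c_values))
total_fits = len(candidates) * cv_folds

print(f"{len(candidates)} candidates, {total_fits} fits")
```

The same arithmetic is useful before launching a search: adding one more hyperparameter with a handful of values multiplies the total number of fits.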

In [21]:
# Save to file in the current working directory
logreg_ml = "logreg_ml.pkl"
joblib.dump(fitted_logreg_gs, logreg_ml)

['logreg_ml.pkl']

In [19]:
fitted_logreg_gs.best_estimator_

Pipeline(memory='/var/folders/5y/d61vfhqx3bv7m_5qfsqw1g8c0000gn/T/tmp16bxk1xl',
         steps=[('scaler',
                 RobustScaler(copy=True, quantile_range=(25.0, 75.0),
                              with_centering=True, with_scaling=True)),
                ('dim_reducer',
                 PCA(copy=True, iterated_power='auto', n_components=1000,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('model',
                 LogisticRegression(C=1, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=-1, penalty='l2',
                                    random_state=1, solver='lbfgs', tol=0.0001,
                                    verbose=0, warm_start=False))],
         verbose=False)

For logistic regression, the best estimator uses RobustScaler as the scaler, n_components = 1000 for PCA, and C = 1.

#### Logistic Regression Model Accuracy

In [20]:
print('Logistic Regression Accuracy Score for Train Set:', fitted_logreg_gs.score(X_train, y_train))
print('Logistic Regression Accuracy Score for Validation Set:', fitted_logreg_gs.score(X_val, y_val))
print('Logistic Regression Accuracy Score for Test Set:', fitted_logreg_gs.score(X_test, y_test))

Logistic Regression Accuracy Score for Train Set: 0.8269137159456701
Logistic Regression Accuracy Score for Validation Set: 0.8235143254927823
Logistic Regression Accuracy Score for Test Set: 0.8231175625769389


The accuracy scores improved only slightly over the baseline (test accuracy 0.8231 vs. 0.8213), which is surprising given the size of the grid search. The train set still scores slightly higher than the validation and test sets. Now let's look at the confusion matrix.

In [16]:
## CONFUSION MATRIX
from sklearn.metrics import confusion_matrix

In [22]:
y_pred = fitted_logreg_gs.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[17458,  3198],
       [ 3699, 14637]])

The confusion matrix shows how many True Positives, True Negatives, False Positives, and False Negatives the model produced on the dataset. For the test dataset, using the best logistic regression estimator:

- Correctly predicted a bad score (TN): 17458
- Incorrectly predicted a bad score (FN): 3699
- Correctly predicted a good score (TP): 14637
- Incorrectly predicted a good score (FP): 3198

As expected (from Part 2 NLP with Wine Reviews), the model is better at predicting bad scores (scores set to 0) than good scores. Having said that, the model predicted more TNs and TPs than FPs and FNs.
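To check how much better the model handles bad scores, the matrix cells can be unpacked and a per-class hit rate computed. scikit-learn's `confusion_matrix` puts actual classes on the rows and predicted classes on the columns, in sorted label order. This sketch assumes label 0 is a bad score and label 1 is a good score.

```python
# Counts copied from the logistic regression confusion matrix above.
# Rows are actual classes, columns are predicted classes, label 0 (bad) first.
matrix = [[17458, 3198],
          [3699, 14637]]

(tn, fp), (fn, tp) = matrix

# Share of actual bad scores that were predicted as bad.
bad_recall = tn / (tn + fp)
# Share of actual good scores that were predicted as good.
good_recall = tp / (tp + fn)

print(f"Bad scores correctly identified:  {bad_recall:.4f}")
print(f"Good scores correctly identified: {good_recall:.4f}")
```

The hit rate for bad scores comes out several percentage points higher than for good scores, which matches the observation above.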

In [19]:
# Accuracy Score
from sklearn.metrics import accuracy_score

# Precision Score
from sklearn.metrics import precision_score

# Recall Score
from sklearn.metrics import recall_score

# F1 Score
from sklearn.metrics import f1_score

In [23]:
print('Model Evaluation for Logistic Regression:')
print('Accuracy Score for Test data:', accuracy_score(y_test, y_pred))
print('Precision Score for Test data:', precision_score(y_test, y_pred))
print('Recall Score for Test data:', recall_score(y_test, y_pred))
print('F1 Score for Test data:', f1_score(y_test, y_pred))

Model Evaluation for Logistic Regression:
Accuracy Score for Test data: 0.8231175625769389
Precision Score for Test data: 0.8206896551724138
Recall Score for Test data: 0.7982657068062827
F1 Score for Test data: 0.8093223853363191


Accuracy is the percentage of Positive and Negative scores that the model classified correctly.

$$Accuracy = \frac{TN+TP}{TN+FP+FN+TP}$$

Precision describes how reliable the model's positive predictions are: of all the scores predicted as positive (in our case, good wine scores), the share that are actually positive.

$$Precision = \frac{TP}{TP+FP}$$
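Recall and F1 are reported above as well. Using the standard definitions, recall is TP / (TP + FN) and F1 is the harmonic mean of precision and recall. As a sanity check, all four metrics can be recomputed by hand from the logistic regression confusion matrix counts. This is a sketch in plain Python, not a replacement for the `sklearn.metrics` functions.

```python
# Counts copied from the logistic regression confusion matrix on the test set.
tn, fp, fn, tp = 17458, 3198, 3699, 14637

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

The values agree with the `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` outputs printed above.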

#### B. XGBoost

Note: hyperparameter optimization can also be applied to XGBoost; however, I did not have time to add different parameters to the grid, for example max_depth, booster type, num_features, and gamma.

In [10]:
# Set up a directory to cache the pipeline results
cachedir = mkdtemp()

# Set up a pipeline
# The steps here act as placeholders and will be changed when we pass the pipeline into the grid search later
my_pipeline = Pipeline([('scaler', StandardScaler()), ('dim_reducer', PCA()), ('model', LogisticRegression())], memory = cachedir)

# Parameter grid
param_grid = [
    
    # XGBoost
    {'scaler': [StandardScaler(), RobustScaler()],
     'dim_reducer': [PCA()],
     'dim_reducer__n_components': [10, 200, 400, 600, 800, 1000],
     'model': [XGBClassifier()]}
    
]

# Instantiate the XGBoost grid search
xgboost_gs = GridSearchCV(my_pipeline, param_grid= param_grid, cv=5, verbose=10)

# Fit the XGBoost grid search
fitted_xgboost_gs = xgboost_gs.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=StandardScaler(copy=True, with_mean=True, with_std=True) 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=StandardScaler(copy=True, with_mean=True, with_std=True), score=0.718, total=  23.6s
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              col

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.7s remaining:    0.0s


[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=StandardScaler(copy=True, with_mean=True, with_std=True), score=0.719, total=  27.4s
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsa

[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   51.1s remaining:    0.0s


[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=StandardScaler(copy=True, with_mean=True, with_std=True), score=0.721, total=  31.0s
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsa

[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.4min remaining:    0.0s


[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=StandardScaler(copy=True, with_mean=True, with_std=True), score=0.722, total=  23.9s
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsa

[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.8min remaining:    0.0s


[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=StandardScaler(copy=True, with_mean=True, with_std=True), score=0.725, total=  27.7s
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsa

[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  2.2min remaining:    0.0s


[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True), score=0.740, total=  21.5s
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='

[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  2.6min remaining:    0.0s


[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True), score=0.728, total=  21.0s
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='

[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  2.9min remaining:    0.0s


[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True), score=0.732, total=  20.7s
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='

[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  3.3min remaining:    0.0s


[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True), score=0.736, total=  22.4s
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='

[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  3.7min remaining:    0.0s


[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=10, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True), score=0.735, total=  21.3s
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=200, model=XGBClassifier(base_score=0.5, booster=

[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=200, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=200, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True), score=0.780, total= 2.8min
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=200, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=200, model=XGBClassifier(base_score=0.5, boost

[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=400, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=400, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=StandardScaler(copy=True, with_mean=True, with_std=True), score=0.779, total= 4.3min
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=400, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=400, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              c

[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=400, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=400, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True), score=0.784, total= 4.2min
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=400, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=400, model=XGBClassifier(base_score=0.5, boost

[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=600, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=600, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=StandardScaler(copy=True, with_mean=True, with_std=True), score=0.776, total= 5.7min
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=600, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=600, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              c

[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=600, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=600, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True), score=0.784, total= 6.6min
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=600, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=800, model=XGBClassifier(base_score=0.5, boost

[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=800, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=800, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True), score=0.789, total= 8.0min
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=800, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=800, model=XGBClassifier(base_score=0.5, boost

[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=1000, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=1000, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=StandardScaler(copy=True, with_mean=True, with_std=True), score=0.780, total=10.7min
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=1000, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=1000, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
           

[CV]  dim_reducer=PCA(copy=True, iterated_power='auto', n_components=1000, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=1000, model=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), scaler=RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True), score=0.788, total= 9.6min
[CV] dim_reducer=PCA(copy=True, iterated_power='auto', n_components=1000, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False), dim_reducer__n_components=1000, model=XGBClassifier(base_score=0.5, b

[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed: 360.1min finished


In [11]:
# Save to file in the current working directory
xgboost_ml = "xgboost_ml.pkl"
joblib.dump(fitted_xgboost_gs, xgboost_ml)

['xgboost_ml.pkl']

In [12]:
# Best estimator for XGBoost
fitted_xgboost_gs.best_estimator_

Pipeline(memory='/var/folders/5y/d61vfhqx3bv7m_5qfsqw1g8c0000gn/T/tmp6u9jb5dt',
         steps=[('scaler',
                 RobustScaler(copy=True, quantile_range=(25.0, 75.0),
                              with_centering=True, with_scaling=True)),
                ('dim_reducer',
                 PCA(copy=True, iterated_power='auto', n_components=800,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('model',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               max_delta_step=0, max_depth=3,
                               min_child_weight=1, missing=None,
                               n_estimators=100, n_jobs=1, nthread=None,
                               objective='binary:logistic', random_state=0,
                       

In [17]:
# Confusion Matrix
y_pred = fitted_xgboost_gs.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[16790,  3866],
       [ 4238, 14098]])


- Correctly predicted a bad score (TN): 16790
- Incorrectly predicted a bad score (FN): 4238
- Correctly predicted a good score (TP): 14098
- Incorrectly predicted a good score (FP): 3866

In [20]:
print('Model Evaluation for XGBoost:')
print('Accuracy Score for Test data:', accuracy_score(y_test, y_pred))
print('Precision Score for Test data:', precision_score(y_test, y_pred))
print('Recall Score for Test data:', recall_score(y_test, y_pred))
print('F1 Score for Test data:', f1_score(y_test, y_pred))

Model Evaluation for XGBoost:
Accuracy Score for Test data: 0.7921624948707428
Precision Score for Test data: 0.78479180583389
Recall Score for Test data: 0.768869982547993
F1 Score for Test data: 0.7767493112947659
