# CPSC 330 - Applied Machine Learning 

## Homework 5: Putting it all together 
### Associated lectures: All material till lecture 11

**Feb 26, 11:59pm: See the [Calendar](https://github.com/UBC-CS/cpsc330-2023W2/tree/main?tab=readme-ov-file#deliverable-due-dates-tentative).**

## Submission instructions <a name="si"></a>
<hr>

_points: 4_

You will receive marks for correctly submitting this assignment. To submit this assignment, follow the instructions below:

- **You may work on this assignment in a group (group size <= 4) and submit your assignment as a group.** 
- Below are some instructions on working as a group.  
    - The maximum group size is 4.
    - You can choose your own group members. 
    - Use group work as an opportunity to collaborate and learn new things from each other. 
    - Be respectful to each other and make sure you understand all the concepts in the assignment well. 
    - It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- Be sure to follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2023W2/blob/main/docs/homework_instructions.md).

## Imports

In [1]:
import os
import re
import string
import sys
from hashlib import sha1

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# import tests_hw5
from sklearn import datasets
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
    precision_score,
    recall_score,
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
# My imports:
import warnings

from catboost import CatBoostClassifier
from scipy.stats import uniform
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
from sklearn.svm import SVC

## Introduction <a name="in"></a>

In this homework you will be working on an open-ended mini-project, where you will put all the different things you have learned so far together to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips
1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code, try to organize it in functions. Clear presentation of your code, experiments, and results is the key to being successful in this lab. You may use code from lecture notes or previous lab solutions with appropriate attributions. 

#### Assessment
We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.


#### A final note
Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours". Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and I hope you enjoy it as well. 

<br><br>

<!-- BEGIN QUESTION -->

## 1. Pick your problem and explain the prediction problem <a name="1"></a>
<hr>

_points: 3_

In this mini project, you will be working on a classification problem of predicting whether a credit card client will default or not. 
For this problem, you will use [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with [the associated research paper](https://www.sciencedirect.com/science/article/pii/S0957417407006719), which is available through [the UBC library](https://www.library.ubc.ca/). 

**Your tasks:**

1. Spend some time understanding the problem and what each feature means. You can find this information in the documentation on [the dataset page on Kaggle](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). Write a few sentences on your initial thoughts on the problem and the dataset. 
2. Download the dataset and read it as a pandas dataframe. 

<div class="alert alert-warning">
    
Solution_1
    
</div>

The data was collected from credit card clients in Taiwan from April 2005 to September 2005. The features are diverse. Many are numerical, such as 'LIMIT_BAL' and 'PAY_0'. There are also ordinal and non-ordinal categorical features: 'EDUCATION' is an example of an ordinal categorical feature, while 'MARRIAGE' and 'SEX' are non-ordinal categorical features.

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 2. Data splitting <a name="2"></a>
<hr>

_points: 2_

**Your tasks:**

1. Split the data into train (70%) and test (30%) portions with `random_state=76`.

> If your computer cannot handle training on 70% training data, make the test split bigger.  

<div class="alert alert-warning">
    
Solution_2
    
</div>

In [2]:
credit_card = pd.read_csv('data/UCI_Credit_Card.csv')
# Hold out 30% of the rows as a test set; random_state makes the split reproducible
train_cc, test_cc = train_test_split(credit_card, test_size=0.30, random_state=76)

train_cc

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
8958,8959,340000.0,1,1,2,44,0,0,0,0,...,59324.0,156094.0,110234.0,20000.0,5000.0,2000.0,112000.0,4234.0,4000.0,1
22752,22753,200000.0,2,2,2,34,0,0,0,-1,...,1078.0,1598.0,171700.0,5504.0,1526.0,1078.0,1598.0,173026.0,6000.0,0
25882,25883,80000.0,2,2,1,26,0,0,0,0,...,75443.0,57735.0,58139.0,2800.0,2800.0,2400.0,2100.0,2100.0,2100.0,0
12925,12926,80000.0,2,2,1,45,0,0,0,0,...,79295.0,81142.0,80672.0,3130.0,3107.0,2847.0,3134.0,3072.0,3010.0,0
23598,23599,80000.0,2,2,1,40,-1,-1,-1,-1,...,1729.0,590.0,9628.0,2035.0,32194.0,1729.0,590.0,9628.0,16059.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22810,22811,150000.0,2,3,1,42,-1,-1,-1,-1,...,7306.0,18726.0,13839.0,6151.0,6427.0,7306.0,18738.0,0.0,7154.0,0
6528,6529,90000.0,2,5,2,23,0,0,0,0,...,42726.0,43802.0,42761.0,2000.0,2000.0,1502.0,1737.0,1500.0,5050.0,1
9607,9608,80000.0,2,1,1,35,-1,2,-1,-1,...,303.0,662.0,3295.0,0.0,3723.0,891.0,662.0,3295.0,1088.0,0
12279,12280,80000.0,2,2,1,27,-1,-1,-1,-1,...,2862.0,5539.0,0.0,0.0,680.0,2862.0,5539.0,0.0,5775.0,0


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 3. EDA <a name="3"></a>
<hr>

_points: 10_

**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data. 
4. Pick appropriate metric/metrics for assessment. 

<div class="alert alert-warning">
    
Solution_3
    
</div>

In [3]:
pd.set_option('display.max_columns', None)  # to show all columns
train_cc.describe(include='all')

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
count,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0,21000.0
mean,15011.587619,167563.508571,1.604381,1.843905,1.554667,35.412952,-0.01219,-0.132714,-0.168333,-0.223143,-0.265762,-0.287381,51100.594571,48965.462714,46841.51,43039.813952,40121.88981,38623.497095,5601.265286,6059.441,5204.302571,4889.281333,4782.900857,5162.918714,0.221857
std,8658.232639,129919.112502,0.488995,0.789845,0.52197,9.136302,1.121864,1.196554,1.195375,1.16549,1.13421,1.152388,73651.958111,71005.547417,69398.29,63817.41498,60400.798292,59055.005208,16239.423781,24074.7,16865.645456,16486.840852,15431.523094,17170.608569,0.415505
min,1.0,10000.0,1.0,0.0,0.0,21.0,-2.0,-2.0,-2.0,-2.0,-2.0,-2.0,-165580.0,-69777.0,-157264.0,-170000.0,-81334.0,-209051.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7493.75,50000.0,1.0,1.0,1.0,28.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,3526.75,2946.0,2632.75,2300.0,1800.0,1266.25,1000.0,820.0,390.0,291.0,257.75,150.0,0.0
50%,15041.0,140000.0,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,0.0,22004.5,20756.0,19994.0,18990.0,18091.0,17127.0,2112.5,2009.0,1801.5,1500.0,1500.0,1500.0,0.0
75%,22505.75,240000.0,2.0,2.0,2.0,41.0,0.0,0.0,0.0,0.0,0.0,0.0,67124.75,63781.25,60286.25,54740.0,50065.25,48950.5,5012.0,5000.0,4531.25,4048.5,4078.0,4001.0,0.0
max,29999.0,800000.0,2.0,6.0,3.0,79.0,8.0,8.0,8.0,8.0,8.0,8.0,746814.0,743970.0,1664089.0,706864.0,823540.0,568638.0,873552.0,1684259.0,889043.0,621000.0,426529.0,528666.0,1.0


In [4]:
# We do not have missing data
train_cc.describe().loc["count"].unique()

array([21000.])

You can tell by visually inspecting the table that it does not have any missing data. The cell above confirms this. 

Something else that is interesting is that PAY_0 to PAY_6 have a minimum of -2. According to the dataset description, these columns should only hold the values -1 and 1 to 9. The discussion tab of the dataset reveals that the columns may also contain -2 and 0, where -2 means there was no balance to pay and 0 means the client made the minimum payment.

Comparing the mean (167,563.51) and the median (the 50% row, 140,000) of 'LIMIT_BAL', we can see that it is right-skewed, and its standard deviation (129,919.11) is very large relative to the mean.
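
The skew can also be quantified from the summary statistics alone. The following is a rough sketch using Pearson's second skewness coefficient, 3 × (mean − median) / std, with the 'LIMIT_BAL' values copied from the `describe()` output above. The variable names are illustrative and not part of the notebook.

```python
# Values copied from the describe() table for LIMIT_BAL (train split).
mean_limit = 167563.508571
median_limit = 140000.0
std_limit = 129919.112502

# Pearson's second skewness coefficient: a positive value indicates right skew.
pearson_skew = 3 * (mean_limit - median_limit) / std_limit

# Coefficient of variation: how large the spread is relative to the mean.
coef_variation = std_limit / mean_limit

print(f"Pearson skewness: {pearson_skew:.3f}")
print(f"Coefficient of variation: {coef_variation:.3f}")
```

A positive coefficient agrees with the mean sitting above the median. On the full column, `train_cc['LIMIT_BAL'].skew()` would give the usual moment-based estimate instead.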

In [5]:
train_cc.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21000 entries, 8958 to 2721
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          21000 non-null  int64  
 1   LIMIT_BAL                   21000 non-null  float64
 2   SEX                         21000 non-null  int64  
 3   EDUCATION                   21000 non-null  int64  
 4   MARRIAGE                    21000 non-null  int64  
 5   AGE                         21000 non-null  int64  
 6   PAY_0                       21000 non-null  int64  
 7   PAY_2                       21000 non-null  int64  
 8   PAY_3                       21000 non-null  int64  
 9   PAY_4                       21000 non-null  int64  
 10  PAY_5                       21000 non-null  int64  
 11  PAY_6                       21000 non-null  int64  
 12  BILL_AMT1                   21000 non-null  float64
 13  BILL_AMT2                   21000 

Every feature is numerical as far as pandas is concerned; however, some of the features are actually categorical or ordinal. We will need to do some preprocessing to ensure our model handles these features properly.

In summary, I observed that this dataset is not missing any values. The 'PAY_i' columns range from -2 to 8 in the training data (the documentation allows values up to 9), with -2 meaning there was no balance to pay and 0 meaning the minimum payment was made. The 'LIMIT_BAL' column is right-skewed and has a very large standard deviation. Lastly, although pandas treats every column as numerical, the dataset is mostly numerical with some ordinal and non-ordinal categorical features.

The metric I will focus on is accuracy. I will try to build a model that can predict whether or not someone will default next month based on the rest of the features. The data is not perfectly balanced (about 78% of clients do not default), but it is balanced enough for improvements in classification to be meaningful and visible without using things like a confusion matrix. For the final model I will also calculate a ROC-AUC value to get an idea of how well my model differentiates between the two classes.
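
As a possible complement to tracking accuracy alone, several metrics can be collected in one cross-validation run. The sketch below assumes a synthetic dataset with a similar class balance (it does not use the credit data), and the names `X_demo`, `y_demo`, and `scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the credit data: roughly 78% / 22% class balance.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=76
)

# Passing a list of scorer names makes cross_validate report each metric.
scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=5,
    scoring=["accuracy", "roc_auc", "recall"],
)

for name in ["test_accuracy", "test_roc_auc", "test_recall"]:
    print(f"{name}: {scores[name].mean():.3f}")
```

Reporting recall for the positive class next to accuracy shows how many actual defaulters are caught, which accuracy alone can hide when one class dominates.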

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 4. Preprocessing and transformations <a name="5"></a>
<hr>

_points: 10_

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type. 
2. Define a column transformer, if necessary. 

<div class="alert alert-warning">
    
Solution_4
    
</div>

In [6]:
# Split the data into X and y train
train_cc.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default.payment.next.month'],
      dtype='object')

In [7]:
X_train = train_cc.drop(columns='default.payment.next.month')
y_train = train_cc['default.payment.next.month']

X_test = test_cc.drop(columns='default.payment.next.month')
y_test = test_cc['default.payment.next.month']

In [8]:
numeric_features = ['LIMIT_BAL', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
categorical_features = ['SEX', 'MARRIAGE']
ordinal_features = ['EDUCATION']
drop_features = ['ID']

In [9]:
# Check that I account for each column

categorized_columns_set = set(numeric_features + categorical_features + ordinal_features + drop_features)

if categorized_columns_set == set(X_train.columns):
    print("All X_train columns are accounted for.")
else:
    print("The columns do not match.")

All X_train columns are accounted for.


In [10]:
X_train['EDUCATION'].unique()

array([1, 2, 3, 5, 4, 6, 0])

In [11]:
# CustomMapper for education ordinal variables
class CustomMapper(BaseEstimator, TransformerMixin):
    def __init__(self):
        # Rank order (low to high): undocumented (0), unknown (5), unknown (6), other (4), high school (3), university (2), graduate school (1)
        self.mapping = {0: 0, 5: 1, 6: 2, 4: 3, 3: 4, 2: 5, 1: 6}

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.apply(lambda col: col.map(self.mapping))
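
One thing worth checking with a dictionary-based mapper is how codes that are missing from the dictionary are handled. The sketch below reuses the same mapping on a hypothetical toy column. The code 7 is made up for illustration and does not appear in the dataset.

```python
import pandas as pd

# Same mapping as CustomMapper above: raw EDUCATION code -> ordinal rank.
mapping = {0: 0, 5: 1, 6: 2, 4: 3, 3: 4, 2: 5, 1: 6}

# Hypothetical toy column; 7 is NOT a documented code and is included on purpose.
toy = pd.DataFrame({"EDUCATION": [1, 2, 3, 4, 5, 6, 0, 7]})

# Same operation as CustomMapper.transform.
mapped = toy.apply(lambda col: col.map(mapping))
print(mapped["EDUCATION"].tolist())

# Codes missing from the dict become NaN, which many estimators reject.
n_unmapped = int(mapped["EDUCATION"].isna().sum())
print("unmapped values:", n_unmapped)
```

Since all seven codes observed in `X_train` are in the mapping, this only matters if the test set or new data contains a code not seen here.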

In [12]:
numeric_transformer = make_pipeline(
    StandardScaler()
)

categorical_transformer = make_pipeline(
    OneHotEncoder()
)

ordinal_transformer = make_pipeline(
    CustomMapper()
)

preprocessor = make_column_transformer(
    ('drop', drop_features),
    (numeric_transformer, numeric_features),
    (ordinal_transformer, ordinal_features),
    (categorical_transformer, categorical_features),
)

In [13]:
preprocessor.fit(X_train)
preprocessor

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 5. Baseline model <a name="6"></a>
<hr>

_points: 2_

**Your tasks:**
1. Try `scikit-learn`'s baseline model and report results.

<div class="alert alert-warning">
    
Solution_5
    
</div>

In [14]:
pipeline = make_pipeline(preprocessor, DummyClassifier())
pd.DataFrame(cross_validate(pipeline, X_train, y_train, return_train_score=True)).mean()

fit_time       0.010621
score_time     0.003735
test_score     0.778143
train_score    0.778143
dtype: float64

Our dummy classifier got 78% accuracy. This score is high, but not overwhelmingly so, so we should be able to see clear improvements in accuracy without resorting to more advanced evaluation. The cell below confirms that our dummy classifier is working as expected: its accuracy equals the proportion of the majority class.

In [15]:
train_cc["default.payment.next.month"].value_counts(normalize=True)

default.payment.next.month
0    0.778143
1    0.221857
Name: proportion, dtype: float64
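
The link between the baseline accuracy and the class proportions can be reproduced on a small example. This is a minimal sketch using hypothetical labels with the same 77.8% / 22.2% split; `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with the same 77.8% / 22.2% split as the training set.
y_demo = np.array([0] * 778 + [1] * 222)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

# Accuracy of always predicting the majority class...
dummy_accuracy = dummy.score(X_demo, y_demo)
# ...equals the share of the majority class in the labels.
majority_share = np.bincount(y_demo).max() / len(y_demo)
print(dummy_accuracy, majority_share)
```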

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 6. Linear models <a name="7"></a>
<hr>

_points: 10_

**Your tasks:**

1. Try a linear model as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the complexity hyperparameter. 
3. Report cross-validation scores along with standard deviation. 
4. Summarize your results.

<div class="alert alert-warning">
    
Solution_6
    
</div>

In [16]:
# Logistic Regression with default hyperparameters:
log_reg_pipeline = make_pipeline(preprocessor, LogisticRegression())
pd.DataFrame(cross_validate(log_reg_pipeline, X_train, y_train, return_train_score=True)).mean()

fit_time       0.059306
score_time     0.003329
test_score     0.810524
train_score    0.810333
dtype: float64

In [17]:
# DISCLAIMER: Not all combinations of solver and penalty in param_distributions are valid,
#             so I suppress warnings and let the search run through the invalid combinations.
warnings.filterwarnings("ignore")

# Logistic regression with hyperparameter optimization

param_distributions = {
    'logisticregression__solver': ['lbfgs', 'liblinear', 'newton-cholesky'],
    'logisticregression__penalty': ['l1', 'l2', 'none'],
    'logisticregression__C': uniform(loc=0, scale=4),
    'logisticregression__max_iter': [1000]
}


random_search = RandomizedSearchCV(
    # LogisticRegression Specific:
    log_reg_pipeline,
    param_distributions=param_distributions,
    # RandomizedSearch Specific:
    n_iter=100,
    cv=5,
    verbose=1,
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

# Re enable warnings
warnings.filterwarnings("default")

Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [18]:
print("Best parameters found: ", random_search.best_params_)
print("Best cross-validation score: ", random_search.best_score_)
print("Best model's stddev:", random_search.cv_results_['std_test_score'][random_search.best_index_])

best_model = random_search.best_estimator_
display(best_model)

Best parameters found:  {'logisticregression__C': 3.085081386743783, 'logisticregression__max_iter': 1000, 'logisticregression__penalty': 'l1', 'logisticregression__solver': 'liblinear'}
Best cross-validation score:  0.8106190476190477
Best model's stddev: 0.0039111446092154395


With my best model I got a cross-validation score of 0.8106 with a standard deviation of 0.0039. With the default parameters I got a score of 0.8105. The gain of roughly 0.0001 is far smaller than the standard deviation across folds, so it is not a meaningful improvement. The search did select solver='liblinear' with penalty='l1' over the defaults, but the scores suggest that the choice of solver and penalty makes little difference for this problem.

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 7. Different models <a name="8"></a>
<hr>

_points: 12_

**Your tasks:**
1. Try at least 3 other models aside from a linear model. One of these models should be a tree-based ensemble model. 
2. Summarize your results in terms of overfitting/underfitting and fit and score times. Can you beat a linear model? 

<div class="alert alert-warning">
    
Solution_7
    
</div>

I tried three other models: SVC, gradient boosted trees, and CatBoost.

In [19]:
# SVC
svc_pipeline = make_pipeline(preprocessor, SVC())
pd.DataFrame(cross_validate(svc_pipeline, X_train, y_train, return_train_score=True)).mean()

fit_time       5.911079
score_time     1.690791
test_score     0.817667
train_score    0.820929
dtype: float64

In [20]:
# Gradient boosted trees
gbt_pipeline = make_pipeline(preprocessor, GradientBoostingClassifier())
pd.DataFrame(cross_validate(gbt_pipeline, X_train, y_train, return_train_score=True)).mean()

fit_time       5.160182
score_time     0.007164
test_score     0.819000
train_score    0.827440
dtype: float64

In [21]:
# CatBoost
cat_pipeline = make_pipeline(preprocessor, CatBoostClassifier(verbose=False))
pd.DataFrame(cross_validate(cat_pipeline, X_train, y_train, return_train_score=True)).mean()

fit_time       3.488288
score_time     0.021626
test_score     0.817524
train_score    0.858738
dtype: float64

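SVC: The train score (0.821) is slightly higher than the test score (0.818), meaning we could be slightly overfitting. It has by far the slowest score time (about 1.7 s per fold).

Gradient Boosted Trees: Again the train score (0.827) is slightly higher than the test score (0.819), meaning we could be overfitting slightly. Fitting is slow (about 5.2 s per fold) but scoring is fast.

CatBoost: This one has a slightly worse test score (0.818) than the gradient boosted trees but a much better train score (0.859) than the other two models, meaning we are probably overfitting.

From these three models I can tell that I can beat a linear model! All three performed slightly better than logistic regression (0.811), with the gradient boosted trees scoring the best (0.819), although the margins are small.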
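
Since the same `cross_validate` call is repeated for every model, the comparison could be wrapped in a small helper that also reports the standard deviation. The sketch below uses a hypothetical helper `mean_std_cv_scores` and synthetic data. In the notebook the pipelines and `X_train` / `y_train` would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate


def mean_std_cv_scores(model, X, y, **kwargs):
    """Return 'mean (+/- std)' strings for each cross_validate output."""
    scores = pd.DataFrame(cross_validate(model, X, y, **kwargs))
    return pd.Series(
        {col: f"{scores[col].mean():.3f} (+/- {scores[col].std():.3f})" for col in scores}
    )


# Synthetic stand-in data; in the notebook this would be X_train / y_train.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

demo_models = {
    "dummy": DummyClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# One row per model, one column per cross_validate output.
comparison = pd.DataFrame(
    {
        name: mean_std_cv_scores(model, X_demo, y_demo, return_train_score=True)
        for name, model in demo_models.items()
    }
).T
print(comparison)
```

Putting fit time, score time, and train/test scores side by side makes the overfitting and runtime comparison easier to read than separate cell outputs.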

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 8. Hyperparameter optimization <a name="10"></a>
<hr>

_points: 10_

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. In at least one case you should be optimizing multiple hyperparameters for a single model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize) 

<div class="alert alert-warning">
    
Solution_8
    
</div>

In [22]:
# Optimize SVC

svc_param_grid = {
    'svc__C': [0.5, 100, 1000],
    'svc__gamma': [0.01, 0.001, 0.0001],
    'svc__kernel': ['rbf', 'poly'],
    'svc__degree': [2, 3]
}


svc_random_search = RandomizedSearchCV(
    # Pipeline to tune:
    svc_pipeline,
    param_distributions=svc_param_grid,
    # RandomizedSearch Specific:
    n_iter=36,
    cv=3,
    n_jobs=-1,
    random_state=42
)

svc_random_search.fit(X_train, y_train)

In [23]:
print("Best parameters found: ", svc_random_search.best_params_)
print("Best cross-validation score: ", svc_random_search.best_score_)
print("Best model's stddev:", svc_random_search.cv_results_['std_test_score'][svc_random_search.best_index_])

Best parameters found:  {'svc__kernel': 'poly', 'svc__gamma': 0.01, 'svc__degree': 2, 'svc__C': 100}
Best cross-validation score:  0.8182857142857142
Best model's stddev: 0.0030394235042348586


In [24]:
# Optimize Gradient Boosted Trees

gbt_param_grid = {
    'gradientboostingclassifier__n_estimators': [70, 100, 150],
    'gradientboostingclassifier__learning_rate': [0.01, 0.1, 1]
}


gbt_random_search = RandomizedSearchCV(
    # Pipeline to tune:
    gbt_pipeline,
    param_distributions=gbt_param_grid,
    # RandomizedSearch Specific:
    n_iter=9,
    cv=3,
    n_jobs=-1,
    random_state=42
)

gbt_random_search.fit(X_train, y_train)

In [25]:
print("Best parameters found: ", gbt_random_search.best_params_)
print("Best cross-validation score: ", gbt_random_search.best_score_)
print("Best model's stddev:", gbt_random_search.cv_results_['std_test_score'][gbt_random_search.best_index_])

Best parameters found:  {'gradientboostingclassifier__n_estimators': 150, 'gradientboostingclassifier__learning_rate': 0.01}
Best cross-validation score:  0.8189047619047619
Best model's stddev: 0.00268615329069913


In [26]:
# Optimize Cat Boost

catboost_param_grid = {
    'catboostclassifier__iterations': [100, 500, 1000],
    'catboostclassifier__learning_rate': [0.01, 0.05, 0.1],
    'catboostclassifier__depth': [3, 6, 10],
    'catboostclassifier__l2_leaf_reg': [1, 3, 5],
    'catboostclassifier__border_count': [32, 64, 128],
    'catboostclassifier__bootstrap_type': ['Bayesian', 'Bernoulli', 'MVS']
}

catboost_random_search = RandomizedSearchCV(
    # Pipeline to tune:
    cat_pipeline,
    param_distributions=catboost_param_grid,
    # RandomizedSearch Specific:
    n_iter=729,
    cv=3,
    n_jobs=-1,
    random_state=42
)

catboost_random_search.fit(X_train, y_train)

In [27]:
print("Best parameters found: ", catboost_random_search.best_params_)
print("Best cross-validation score: ", catboost_random_search.best_score_)
print("Best model's stddev:", catboost_random_search.cv_results_['std_test_score'][catboost_random_search.best_index_])

Best parameters found:  {'catboostclassifier__learning_rate': 0.05, 'catboostclassifier__l2_leaf_reg': 1, 'catboostclassifier__iterations': 100, 'catboostclassifier__depth': 6, 'catboostclassifier__border_count': 64, 'catboostclassifier__bootstrap_type': 'Bayesian'}
Best cross-validation score:  0.8214285714285715
Best model's stddev: 0.003512852979334859


With SVC I had to make some concessions because my computer really struggled to run the parameter search. The result was still surprising, though: I did not expect the poly kernel with a degree of 2 to beat the default. The score improved from the default 0.8177 to 0.8183, which is a very small improvement.

With gradient boosted trees my result was very surprising. To make the runtimes manageable I searched through a small subset of possible parameters, and the best score from the search (0.8189) was marginally lower than the 0.8190 I got with the defaults. The two numbers are not strictly comparable, though, because the search used 3-fold cross-validation while the earlier score used the default 5-fold.

With CatBoost I applied what I learned from the gradient boosted trees search and picked better values to search through. This improved the score from 0.8175 to 0.8214, which makes it the best classifier yet.
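
It may help to note how `n_iter` relates to the size of each grid. When every hyperparameter is given as a list, `RandomizedSearchCV` samples without replacement, so setting `n_iter` to the total number of combinations amounts to an exhaustive grid search. The sketch below counts the combinations for the three searches above. The dictionary names are illustrative.

```python
from math import prod

# Number of values listed per hyperparameter in each search above.
grid_sizes = {
    "SVC": {"C": 3, "gamma": 3, "kernel": 2, "degree": 2},
    "GradientBoosting": {"n_estimators": 3, "learning_rate": 3},
    "CatBoost": {
        "iterations": 3, "learning_rate": 3, "depth": 3,
        "l2_leaf_reg": 3, "border_count": 3, "bootstrap_type": 3,
    },
}
n_iter_used = {"SVC": 36, "GradientBoosting": 9, "CatBoost": 729}
cv_folds = 3

# Total combinations per model, and whether n_iter covers all of them.
totals = {name: prod(sizes.values()) for name, sizes in grid_sizes.items()}
for name, total in totals.items():
    exhaustive = n_iter_used[name] >= total
    n_fits = min(n_iter_used[name], total) * cv_folds
    print(f"{name}: {total} combinations, exhaustive={exhaustive}, fits={n_fits}")
```

A smaller `n_iter` would trade coverage for runtime, which is usually the reason to choose a randomized search in the first place.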

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 9. Results on the test set <a name="12"></a>
<hr>

_points: 10_

**Your tasks:**

1. Try your best performing model on the test data: report and explain test scores.
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias?

<div class="alert alert-warning">
    
Solution_9
    
</div>

In [28]:
best_model = catboost_random_search.best_estimator_

y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

roc_auc = roc_auc_score(y_test, y_pred_proba[:, 1])
print(f"ROC-AUC: {roc_auc}")

Accuracy: 0.8227777777777778
              precision    recall  f1-score   support

           0       0.84      0.95      0.89      7023
           1       0.68      0.37      0.48      1977

    accuracy                           0.82      9000
   macro avg       0.76      0.66      0.69      9000
weighted avg       0.81      0.82      0.80      9000

ROC-AUC: 0.7875973452643604


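I got a test accuracy of 82.28%, which is slightly higher than my best validation score (82.14%). The difference is well within the cross-validation standard deviation (about 0.35 percentage points), so the test and validation scores agree and I see no sign of optimization bias or of overfitting to the validation folds. I also got a ROC-AUC score of 0.788, which means that the model is reasonably good at differentiating between the classes and that the accuracy does not come solely from the class imbalance. The classification report does show, however, that recall for the default class is only 0.37, so the model misses most of the clients who actually default.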
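
To make the difference between the two reported metrics concrete, here is a tiny hand-made example (not the credit data). Accuracy is computed from thresholded labels, while ROC-AUC is computed from the predicted probabilities themselves. The 0.5 threshold and the variable names are assumptions for illustration.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Tiny hand-made example (not the credit data).
y_true = [0, 0, 1, 1]
proba_positive = [0.1, 0.4, 0.35, 0.8]  # predicted P(class = 1)

# Accuracy needs hard labels, here obtained with a 0.5 threshold.
y_hard = [int(p >= 0.5) for p in proba_positive]
demo_accuracy = accuracy_score(y_true, y_hard)

# ROC-AUC uses the scores directly: the fraction of (negative, positive)
# pairs in which the positive example receives the higher score.
demo_auc = roc_auc_score(y_true, proba_positive)

print(demo_accuracy, demo_auc)
```

This is why `predict_proba(X_test)[:, 1]` is passed to `roc_auc_score` in the cell above, while `accuracy_score` receives the output of `predict`.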

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 10. Summary of results <a name="13"></a>
<hr>

_points: 12_

Imagine that you want to present the summary of these results to your boss and co-workers. 

**Your tasks:**

1. Create a table (printed `DataFrame`) summarizing important results. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but could potentially improve the performance/interpretability.
4. Report your final test score along with the metric you used at the top of this notebook in the [Submission instructions section](#si).

<div class="alert alert-warning">
    
Solution_10
    
</div>

In [29]:
metrics = ["Validation Score", "Standard deviation", "Hyperparameter Optimized", "Test Accuracy"]
models = ["Baseline", "Best Linear Model", "Gradient Boosted Trees", "Best Model (CatBoost)"]


def best_cv(search):
    """Mean and standard deviation of the best candidate's cross-validation score."""
    return [search.best_score_, search.cv_results_["std_test_score"][search.best_index_]]


results_df = pd.DataFrame(index=models, columns=metrics)

# Baseline score copied from the DummyClassifier cell; the rest is read from the fitted objects
results_df.loc["Baseline"] = [0.778143, "NA", "N", "NA"]
results_df.loc["Best Linear Model"] = best_cv(random_search) + ["Y", "NA"]
results_df.loc["Gradient Boosted Trees"] = best_cv(gbt_random_search) + ["Y", "NA"]
results_df.loc["Best Model (CatBoost)"] = best_cv(catboost_random_search) + ["Y", accuracy]
display(results_df)

Unnamed: 0,Validation Score,Standard deviation,Hyperparameter Optimized,Test Accuracy
Baseline,0.778143,,N,
Best Linear Model,0.810619,0.003911,Y,
Gradient Boosted Trees,0.818905,0.002686,Y,
Best Model (CatBoost),0.821429,0.003513,Y,0.822778


2. When I went from my baseline to a simple linear model, accuracy jumped by about 3 percentage points (0.778 to 0.811). As I moved from a linear model to more sophisticated ones, the improvements were minimal. I am happy that I was able to explore many types of models to find the best. I am underwhelmed by the best model's performance, though: an improvement of only about 4.5 percentage points over the baseline is small.
3. My hyperparameter optimization left a lot to be desired. The time it took to search was limiting since I am using a laptop. With a better computer or more time, I would specify distributions of values instead of hardcoding some good guesses. I would also have liked to search over more parameters. In particular, I would have liked to try a 'sigmoid' kernel for SVC and more values for the poly degree.
4. Final test score: 0.8228. The metric I wanted to measure was the classification accuracy of default.payment.next.month given the rest of the table's features. I also mentioned ROC-AUC, for which I got a score of 0.788. This tells me that the model is able to differentiate between the classes and is not just exploiting the class imbalance.
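
As a sketch of the "distributions instead of hardcoded guesses" idea from item 3, `ParameterSampler` shows which candidates a randomized search would draw from continuous or integer distributions. The ranges below are illustrative assumptions for the gradient boosted trees pipeline, not tuned values.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

# Hypothetical ranges for the gradient boosted trees pipeline; the bounds
# are illustrative guesses, not tuned values.
gbt_distributions = {
    "gradientboostingclassifier__n_estimators": randint(50, 301),  # integers 50..300
    "gradientboostingclassifier__learning_rate": loguniform(1e-3, 1.0),
}

# ParameterSampler draws the kind of candidates RandomizedSearchCV would try.
candidates = list(ParameterSampler(gbt_distributions, n_iter=5, random_state=42))

for candidate in candidates:
    print(candidate)
```

The same dictionary could be passed as `param_distributions` to `RandomizedSearchCV`, with `n_iter` chosen to fit the available compute time.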

<!-- END QUESTION -->

<br><br>

**PLEASE READ:** When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using [PrairieLearn](https://ca.prairielearn.com/pl/course_instance/6697). Don't forget to rename your file `hw4_sol.ipynb`.

This was a tricky one but you did it!

![](img/eva-well-done.png)

In [30]:
# Extracurricular: save the model to a joblib file
from joblib import dump, load

dump(best_model, 'cat_boost_predictor.joblib')

['cat_boost_predictor.joblib']