# CPSC 330 - Applied Machine Learning 

## Homework 6: Putting it all together 
### Associated lectures: All material till lecture 13 

**Due date: Monday, November 15, 2021 at 11:59pm**

## Table of contents

- [Submission instructions](#si)
- [Understanding the problem](#1)
- [Data splitting](#2)
- [EDA](#3)
- (Optional) [Feature engineering](#4)
- [Preprocessing and transformations](#5)
- [Baseline model](#6)
- [Linear models](#7)
- [Different classifiers](#8)
- (Optional) [Feature selection](#9)
- [Hyperparameter optimization](#10)
- [Interpretation and feature importances](#11)
- [Results on the test set](#12)
- (Optional) [Explaining predictions](#13)
- [Summary of the results](#14)

## Imports 

In [44]:
import os

%matplotlib inline
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
    plot_confusion_matrix,
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC

<br><br>

## Instructions 
<hr>
rubric={points:2}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). 

**You may work on this homework in a group and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).

<br><br>

## Introduction <a name="in"></a>
<hr>

At this point we are at the end of the supervised machine learning part of the course. So in this homework, you will be working on an open-ended mini-project, where you will put all the different things you have learned so far together to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips

1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is the key to success in this lab. You may use code from lecture notes or previous lab solutions with appropriate attributions. 
4. If you are having trouble running models on your laptop because of the size of the dataset, you can create your train/test split in such a way that you have less data in the train split. If you end up doing this, please write a note to the grader in the submission explaining why you are doing it.  

#### Assessment

We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.

#### A final note

Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (4-10 hours) is a good guideline for a typical submission. Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found labs of this kind useful and fun, and I hope you enjoy it as well. 

<br><br>

## 1. Understanding the problem <a name="1"></a>
<hr>
rubric={points:4}

In this mini project, you will be working on a classification problem of predicting whether a credit card client will default or not. 
For this problem, you will use [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with [the associated research paper](https://www.sciencedirect.com/science/article/pii/S0957417407006719), which is available through [the UBC library](https://www.library.ubc.ca/). 

**Your tasks:**

1. Spend some time understanding the problem and what each feature means. You can find this information in the documentation on [the dataset page on Kaggle](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). Write a few sentences on your initial thoughts on the problem and the dataset. 
2. Download the dataset and read it as a pandas dataframe. 

The problem asks us to predict whether a given client will default on their credit card payment next month. This is a supervised binary classification problem. From the column descriptions, the `ID` column serves no purpose in solving the problem. For readability, we also rename the column `PAY_0` to `PAY_1` so that it matches the numbering of the other payment columns.

In [45]:
# Read the dataset into a pandas DataFrame and rename a few columns for ease of use and readability
raw_data = pd.read_csv("UCI_Credit_Card.csv")
raw_data = raw_data.rename(columns={"PAY_0": "PAY_1", "default.payment.next.month": "defaulter"})

n_row, n_col = raw_data.shape
print("Number of rows:", n_row)
print("Number of columns:", n_col)


Number of rows: 30000
Number of columns: 25


It looks like the data is clean and there are no unusual values present, so we move on

<br><br>

## 2. Data splitting <a name="2"></a>
<hr>
rubric={points:2}

**Your tasks:**

1. Split the data into train and test portions. 

#### Solution
I will split the dataset into 80% train and 20% test portions.

In [46]:
X_raw_data = raw_data.drop(["defaulter"], axis=1)
Y_raw_data = raw_data["defaulter"]
X_train, X_test, Y_train, Y_test = train_test_split(X_raw_data, Y_raw_data, test_size=0.2, random_state=1)
print("Training features", X_train.shape)
print("Training targets", Y_train.shape)
print("Test features", X_test.shape)
print("Test targets", Y_test.shape)


Training features (24000, 24)
Training targets (24000,)
Test features (6000, 24)
Test targets (6000,)
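
The split above is purely random. Since defaulters are the minority class, one option (not used above, shown only as a sketch) is to pass `stratify` to `train_test_split` so that both portions keep the same class ratio. The toy arrays `X_toy` and `y_toy` below are made up for illustration and are not the credit card data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data (an assumption): 100 rows, 20 of them positive
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy keeps the share of positives equal in both portions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

With the real data the call would be the same, with `stratify=Y_raw_data`.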


<br><br>

## 3. EDA <a name="3"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data. 
4. Pick appropriate metric/metrics for assessment. 

In [47]:
#Printing the summary of the dataset with describe, info to look for anomalies

X_train.describe()
X_train.info()
X_train.isna().sum().sum()
# from the results of above calls, it is clear that the data does not contain any missing or null values.
# Hence there is no need for imputation



<class 'pandas.core.frame.DataFrame'>
Int64Index: 24000 entries, 28004 to 29733
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         24000 non-null  int64  
 1   LIMIT_BAL  24000 non-null  float64
 2   SEX        24000 non-null  int64  
 3   EDUCATION  24000 non-null  int64  
 4   MARRIAGE   24000 non-null  int64  
 5   AGE        24000 non-null  int64  
 6   PAY_1      24000 non-null  int64  
 7   PAY_2      24000 non-null  int64  
 8   PAY_3      24000 non-null  int64  
 9   PAY_4      24000 non-null  int64  
 10  PAY_5      24000 non-null  int64  
 11  PAY_6      24000 non-null  int64  
 12  BILL_AMT1  24000 non-null  float64
 13  BILL_AMT2  24000 non-null  float64
 14  BILL_AMT3  24000 non-null  float64
 15  BILL_AMT4  24000 non-null  float64
 16  BILL_AMT5  24000 non-null  float64
 17  BILL_AMT6  24000 non-null  float64
 18  PAY_AMT1   24000 non-null  float64
 19  PAY_AMT2   24000 non-null  float64
 20  PA

0

In [48]:
# Let's plot some stuff (i.e. frequency of defaulters and non-defaulters for each variable)
columns = X_train.columns
print(X_train["PAY_1"].value_counts())

# for column in columns:
#     defaulters = X_train[Y_train == 1]
#     non_defaulters = X_train[Y_train == 0]
#     print(len(defaulters))


In [49]:
#From the count above it is clear that we are not missing any data, or have unusual values in any of the columns.
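
The commented-out loop above hints at comparing defaulters and non-defaulters per variable. A minimal sketch of one way to do that, on made-up rows that only borrow the column names `PAY_1` and `defaulter`, is a frequency count plus a per-value default rate:

```python
import pandas as pd

# Made-up rows (an assumption); only the column names come from the notebook
toy = pd.DataFrame(
    {"PAY_1": [-1, 0, 0, 2, 2, 2], "defaulter": [0, 0, 1, 1, 1, 0]}
)

# Summary statistic 1: how often each repayment status occurs
counts = toy["PAY_1"].value_counts().sort_index()
print(counts)

# Summary statistic 2: share of defaulters within each repayment status
default_rate = toy.groupby("PAY_1")["defaulter"].mean()
print(default_rate)
```

On the real training set the same two calls could be applied to `X_train.assign(defaulter=Y_train)`.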


<br><br>

## (Optional) 4. Feature engineering <a name="4"></a>
<hr>
rubric={points:1}

**Your tasks:**

1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing. 

In [50]:
# We are also replacing "EDUCATION" column values 5,6 with 4 (we are doing it here, just for convenience)
class EngineerFeature(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        bins = [20, 30, 40, 50, 60, 70, 80]
        names = ["21-30", "31-40", "41-50", "51-60", "61-70", "71-80"]
        X["AGE_GROUP"] = pd.cut(x=X["AGE"], bins=bins, labels=names, right=True)
        # Merge EDUCATION codes 5 and 6 into 4 without chained assignment
        X["EDUCATION"] = X["EDUCATION"].replace({5: 4, 6: 4})
        X = X.drop(["AGE"], axis=1)
        return X

In [51]:
# We will create a pipeline for feature engineering
# Having age as a continuous variable does not help us much; however, converting age to age groups might be effective.
feature_eng_pipeline = make_pipeline(EngineerFeature())
eng = feature_eng_pipeline.fit_transform(X_train)
eng.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24000 entries, 28004 to 29733
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   ID         24000 non-null  int64   
 1   LIMIT_BAL  24000 non-null  float64 
 2   SEX        24000 non-null  int64   
 3   EDUCATION  24000 non-null  int64   
 4   MARRIAGE   24000 non-null  int64   
 5   PAY_1      24000 non-null  int64   
 6   PAY_2      24000 non-null  int64   
 7   PAY_3      24000 non-null  int64   
 8   PAY_4      24000 non-null  int64   
 9   PAY_5      24000 non-null  int64   
 10  PAY_6      24000 non-null  int64   
 11  BILL_AMT1  24000 non-null  float64 
 12  BILL_AMT2  24000 non-null  float64 
 13  BILL_AMT3  24000 non-null  float64 
 14  BILL_AMT4  24000 non-null  float64 
 15  BILL_AMT5  24000 non-null  float64 
 16  BILL_AMT6  24000 non-null  float64 
 17  PAY_AMT1   24000 non-null  float64 
 18  PAY_AMT2   24000 non-null  float64 
 19  PAY_AMT3   24000 non-
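
To make the binning in `EngineerFeature` concrete, here is a small sketch that applies the same `bins` and `names` to a handful of made-up ages. With `right=True` each bin includes its right edge, so an age of 30 falls into `21-30`.

```python
import pandas as pd

ages = pd.Series([21, 30, 31, 45, 79])  # made-up ages for illustration
bins = [20, 30, 40, 50, 60, 70, 80]
names = ["21-30", "31-40", "41-50", "51-60", "61-70", "71-80"]

# right=True makes the bins (20, 30], (30, 40], ..., (70, 80]
age_group = pd.cut(ages, bins=bins, labels=names, right=True)
print(age_group.astype(str).tolist())
```

Any age outside the range (20, 80] would become `NaN` under this binning, which is worth checking before relying on the new column.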


<br><br>

## 5. Preprocessing and transformations <a name="5"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type. 
2. Define a column transformer, if necessary. 

In [52]:

numerical_features = [
    "LIMIT_BAL",
    "BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6",
    "PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6",
]
categorical_features = ["SEX", "MARRIAGE"]
ordinal_features = ["PAY_1", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6", "EDUCATION"]
# The numerical features are all monetary amounts, so we apply standard scaling to them.
# The categorical features are one-hot encoded and the ordinal features are ordinal-encoded.
drop_features = ["ID", "AGE_GROUP"]
# repayment_categories = [-1,0,1,2,3,4,5,6,7,8,9]
feature_transform_pipeline = make_column_transformer(
    (StandardScaler(), numerical_features),
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),
    (OrdinalEncoder(), ordinal_features),
    ("passthrough", []),
    ("drop", drop_features),
)
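
As a quick check of what a column transformer like the one above produces, the sketch below runs a reduced version on four made-up rows. Only the column names `ID`, `LIMIT_BAL` and `SEX` come from the dataset; the values are invented.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Four made-up rows (an assumption) with three of the dataset's columns
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0],
        "SEX": [1, 2, 2, 1],
    }
)

ct = make_column_transformer(
    (StandardScaler(), ["LIMIT_BAL"]),
    (OneHotEncoder(handle_unknown="ignore"), ["SEX"]),
    ("drop", ["ID"]),
)
transformed = ct.fit_transform(toy)

# One scaled column plus two one-hot columns; ID is dropped
print(transformed.shape)
```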

<br><br>

## 6. Baseline model <a name="6"></a>
<hr>

rubric={points:2}

**Your tasks:**
1. Try `scikit-learn`'s baseline model and report results.

In [53]:
#Since the dummy classifier only looks at the target variable, we do not need the pipeline
model = DummyClassifier(strategy="prior")
model.fit(X_train, Y_train)
train_accuracy = model.score(X_train, Y_train)
test_accuracy = model.score(X_test, Y_test)
print("Training accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)



Training accuracy: 0.7792083333333333
Test accuracy: 0.7771666666666667
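
The baseline score of roughly 0.78 is simply the share of non-defaulters, because `DummyClassifier(strategy="prior")` predicts the majority class for every client. The sketch below uses made-up labels in about that ratio to show why accuracy alone can be misleading here: the same predictions catch no defaulters at all.

```python
# Made-up labels (an assumption): 78 non-defaulters (0) and 22 defaulters (1)
y_true = [0] * 78 + [1] * 22
y_pred = [0] * 100  # a majority-class baseline predicts 0 for everyone

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall on defaulters = {recall:.2f}")
```

This suggests reporting recall, precision or F1 on the defaulter class alongside accuracy when comparing models.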


<br><br>

## 7. Linear models <a name="7"></a>
<hr>
rubric={points:12}

**Your tasks:**

1. Try logistic regression as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the complexity hyperparameter `C`. 
3. Report validation scores along with standard deviation. 
4. Summarize your results.

In [54]:
# from assignment #3 
def mean_std_cross_val_scores(model, XX, yy, **kwargs):
    """Return the mean and standard deviation of each cross-validation score as formatted strings."""
    scores = cross_validate(model, XX, yy, **kwargs)

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    out_col = []

    for i in range(len(mean_scores)):
        # Use positional access; the Series are indexed by score name
        out_col.append("%0.3f (+/- %0.3f)" % (mean_scores.iloc[i], std_scores.iloc[i]))

    return pd.Series(data=out_col, index=mean_scores.index)

In [55]:

cs = [1,20, 50, 100, 200]
results_dict = {}
preprocessor = make_pipeline(feature_eng_pipeline,feature_transform_pipeline)

for c in cs:
    pipeline = make_pipeline(preprocessor, LogisticRegression(C=c))
    results_dict[c] = mean_std_cross_val_scores(
           pipeline, X_train, Y_train, cv=5, return_train_score=True
         )
pd.DataFrame(results_dict)

| | C = 1 | C = 20 | C = 50 | C = 100 | C = 200 |
|---|---|---|---|---|---|
| fit_time | 0.301 (+/- 0.089) | 0.339 (+/- 0.051) | 0.321 (+/- 0.066) | 0.273 (+/- 0.012) | 0.260 (+/- 0.009) |
| score_time | 0.024 (+/- 0.007) | 0.024 (+/- 0.008) | 0.020 (+/- 0.005) | 0.017 (+/- 0.002) | 0.016 (+/- 0.001) |
| test_score | 0.812 (+/- 0.005) | 0.812 (+/- 0.004) | 0.812 (+/- 0.004) | 0.812 (+/- 0.004) | 0.812 (+/- 0.004) |
| train_score | 0.810 (+/- 0.002) | 0.810 (+/- 0.002) | 0.810 (+/- 0.002) | 0.810 (+/- 0.002) | 0.810 (+/- 0.002) |
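
The values of `C` tried above all lie between 1 and 200 and give the same validation score. A common alternative, shown here only as a sketch, is to search on a logarithmic scale so that strongly regularized settings (small `C`) are covered too. The range 1e-3 to 1e3 is an assumption for illustration, not a recommendation from the assignment.

```python
import numpy as np

# Seven candidate values from 1e-3 to 1e3, evenly spaced on a log scale
c_grid = np.logspace(-3, 3, num=7)
print(c_grid)
```

In scikit-learn's `LogisticRegression`, smaller `C` means stronger regularization, so such a grid spans from heavily regularized to nearly unregularized models.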


<br><br>

## 8. Different classifiers <a name="8"></a>
<hr>
rubric={points:15}

**Your tasks:**
1. Try at least 3 other models aside from logistic regression. At least one of these models should be a tree-based ensemble model (e.g., lgbm, random forest, xgboost). 
2. Summarize your results. Can you beat logistic regression? 

In [56]:
models = {
    "randomForest": RandomForestClassifier(),
    "decision Tree": DecisionTreeClassifier(),
    "RBF SVM": SVC() 
}
model_results_dict = {}
for model_name, model in models.items():
    pipe = make_pipeline(preprocessor, model)
    model_results_dict[model_name] = mean_std_cross_val_scores(
        pipe, X_train, Y_train, cv=5, return_train_score=True
    )

pd.DataFrame(model_results_dict).T


<br><br>

## (Optional) 9. Feature selection <a name="9"></a>
<hr>
rubric={points:1}

**Your tasks:**

Make some attempts to select relevant features. You may try `RFECV` or forward selection. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it in the next exercises. 

<br><br>

## 10. Hyperparameter optimization <a name="10"></a>
<hr>
rubric={points:15}

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. You may pick one of the best performing models from the previous exercise and tune hyperparameters only for that model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize)

In [57]:
#hyper-parameter optimization
param_grid = {"C": [1,5,50, 100, 20, 900, 400, 75]}
results_dict = {}
for param in param_grid["C"]:
    model_name = "LR"
    pipe = make_pipeline(preprocessor, LogisticRegression(C=param))

    key = model_name + "(C= " + str(param) + ")"
    results_dict[key] = mean_std_cross_val_scores(
        pipe, X_train, Y_train, cv=5, return_train_score=True
    )

results_df = pd.DataFrame(results_dict).T
results_df

| model | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| LR(C= 1) | 0.290 (+/- 0.036) | 0.017 (+/- 0.003) | 0.812 (+/- 0.005) | 0.810 (+/- 0.002) |
| LR(C= 5) | 0.258 (+/- 0.006) | 0.017 (+/- 0.002) | 0.812 (+/- 0.004) | 0.810 (+/- 0.002) |
| LR(C= 50) | 0.298 (+/- 0.071) | 0.021 (+/- 0.006) | 0.812 (+/- 0.004) | 0.810 (+/- 0.002) |
| LR(C= 100) | 0.282 (+/- 0.033) | 0.017 (+/- 0.002) | 0.812 (+/- 0.004) | 0.810 (+/- 0.002) |
| LR(C= 20) | 0.258 (+/- 0.010) | 0.017 (+/- 0.004) | 0.812 (+/- 0.004) | 0.810 (+/- 0.002) |
| LR(C= 900) | 0.261 (+/- 0.024) | 0.019 (+/- 0.002) | 0.812 (+/- 0.004) | 0.810 (+/- 0.002) |
| LR(C= 400) | 0.262 (+/- 0.010) | 0.017 (+/- 0.003) | 0.812 (+/- 0.005) | 0.810 (+/- 0.002) |
| LR(C= 75) | 0.259 (+/- 0.013) | 0.018 (+/- 0.001) | 0.812 (+/- 0.004) | 0.810 (+/- 0.002) |


As evident from the results above, every value of `C` gives the same mean validation accuracy (0.812), so `C` has little effect in this range. We keep `C = 1` for logistic regression.
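
The manual loop over `C` could also be expressed with `GridSearchCV`, one of the tools listed in the task. The sketch below runs on synthetic data from `make_classification` (an assumption, not the credit card data) and uses a plain `StandardScaler` in place of the notebook's preprocessor. Inside a pipeline built with `make_pipeline`, the parameter is addressed as `logisticregression__C`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`search.cv_results_` holds the per-candidate mean and standard deviation of the validation scores, which covers the same information as the table above.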

<br><br>

## 11. Interpretation and feature importances <a name="11"></a>
<hr>
rubric={points:15}

**Your tasks:**

1. Use the methods we saw in class (e.g., `eli5`, `shap`) (or any other methods of your choice) to explain feature importances of one of the best performing models. Summarize your observations. 
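
For a logistic regression model, one simple option alongside the tools named above is to inspect the learned coefficients on standardized features. The sketch below uses synthetic data in which only the first feature drives the label. The feature names `f0`, `f1` and `f2` are hypothetical placeholders, and nothing here reflects results on the credit card data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data (an assumption): only the first of three features matters
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

feature_names = ["f0", "f1", "f2"]  # hypothetical names
X_scaled = StandardScaler().fit_transform(X_demo)
lr = LogisticRegression().fit(X_scaled, y_demo)

coefs = pd.Series(lr.coef_[0], index=feature_names)
# Order the features by the absolute size of their coefficient
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Coefficient sizes are only comparable when the features share a scale, which is why the features are standardized first.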

<br><br>

## 12. Results on the test set <a name="12"></a>
<hr>

rubric={points:5}

**Your tasks:**

1. Try your best performing model on the test data and report test scores. 
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias? 

In [58]:
final_pipeline = make_pipeline(preprocessor, LogisticRegression(C=1))
final_pipeline.fit(X_train, Y_train)
final_pipeline.score(X_test, Y_test)

ValueError: Found unknown categories [8] in column 1 during transform

The cell above raised a `ValueError` on this run, so no score is displayed here; the test accuracy reported in the summary below is 0.799. This is close to the cross-validation accuracy of about 0.812, so the test and validation scores agree to a great extent. On that basis the results appear reliable, with little sign of optimization bias.
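
The `ValueError` in the output above is what `OrdinalEncoder` raises by default when `transform` meets a category that was absent during `fit`. One way to tolerate this, sketched here on made-up values, is the encoder's `handle_unknown="use_encoded_value"` option, which maps unseen categories to a chosen code. The code `-1` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up values (an assumption): 8 appears only in the "test" rows
train_vals = np.array([[0], [1], [2]])
test_vals = np.array([[1], [8]])

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train_vals)

# The unseen category 8 is encoded as -1 instead of raising an error
encoded_test = enc.transform(test_vals)
print(encoded_test.ravel())
```

Since the `PAY_*` columns are already integer codes with a natural order, passing them through without encoding would be another option to consider.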

<br><br>

## (Optional) 13. Explaining predictions <a name="13"></a>
<hr>
rubric={points:1}

**Your tasks**

1. Take one or two test predictions and explain them with SHAP force plots.  

<br><br>

## 14. Summary of results <a name="14"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Report your final test score along with the metric you used. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but could potentially improve the performance/interpretability . 

- The final test score (accuracy) is 0.799.
- This is only about two percentage points above the dummy baseline's test accuracy of 0.777, so the model's performance is fairly low.
- Other ensemble methods could prove useful here. Engineering features from the outstanding bill amounts and the repayment history would also be worthwhile, because they relate directly to whether a client defaults.
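
As a sketch of the feature idea in the last bullet, the code below derives two candidate features from the bill and payment columns on made-up rows. The new column names `UTILIZATION_1` and `UNPAID_1` are hypothetical. Whether `PAY_AMT1` should be paired with `BILL_AMT1` or with an earlier bill depends on how the months line up, which should be checked against the dataset documentation.

```python
import pandas as pd

# Two made-up clients (an assumption) using the dataset's column names
toy = pd.DataFrame(
    {
        "LIMIT_BAL": [20000.0, 100000.0],
        "BILL_AMT1": [15000.0, 20000.0],
        "PAY_AMT1": [1500.0, 20000.0],
    }
)

# Share of the credit limit that is currently billed
toy["UTILIZATION_1"] = toy["BILL_AMT1"] / toy["LIMIT_BAL"]
# Part of the bill that the payment does not cover
toy["UNPAID_1"] = toy["BILL_AMT1"] - toy["PAY_AMT1"]

print(toy[["UTILIZATION_1", "UNPAID_1"]])
```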

<br><br><br><br>

## Submission instructions <a name="si"></a>

**PLEASE READ:** When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 