#  Applied Machine Learning 

## Homework 6: Putting it all together 



## Table of contents

- [Submission instructions](#si)
- [Understanding the problem](#1)
- [Data splitting](#2)
- [EDA](#3)
- (Optional) [Feature engineering](#4)
- [Preprocessing and transformations](#5)
- [Baseline model](#6)
- [Linear models](#7)
- [Different classifiers](#8)
- (Optional) [Feature selection](#9)
- [Hyperparameter optimization](#10)
- [Interpretation and feature importances](#11)
- [Results on the test set](#12)
- (Optional) [Explaining predictions](#13)
- [Summary of the results](#14)

## Imports 

In [1]:
# pip install xgboost

Collecting xgboost
  Using cached xgboost-1.7.4-py3-none-win_amd64.whl (89.1 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.7.4
Note: you may need to restart the kernel to use updated packages.


In [33]:
# pip install catboost

Collecting catboostNote: you may need to restart the kernel to use updated packages.
  Downloading catboost-1.1.1-cp39-none-win_amd64.whl (74.0 MB)
Collecting graphviz
  Downloading graphviz-0.20.1-py3-none-any.whl (47 kB)

Installing collected packages: graphviz, catboost
Successfully installed catboost-1.1.1 graphviz-0.20.1


In [36]:
# pip install lightgbm

Collecting lightgbm
  Downloading lightgbm-3.3.5-py3-none-win_amd64.whl (1.0 MB)
Installing collected packages: lightgbm
Successfully installed lightgbm-3.3.5
Note: you may need to restart the kernel to use updated packages.


In [47]:
# pip install eli5

Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
Collecting jinja2>=3.0.0
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
Building wheels for collected packages: eli5
  Building wheel for eli5 (setup.py): started
  Building wheel for eli5 (setup.py): finished with status 'done'
  Created wheel for eli5: filename=eli5-0.13.0-py2.py3-none-any.whl size=107748 sha256=c1598be2a458373e3eefc5dc4343f3ca84eb3f3a343c9f3d2a25196ee67b3602
  Stored in directory: c:\users\dhruv\appdata\local\pip\cache\wheels\7b\26\a5\8460416695a992a2966b41caa5338e5e7fcea98c9d032d055c
Successfully built eli5
Installing collected packages: jinja2, eli5
  Attempting uninstall: jinja2
    Found existing installation: Jinja2 2.11.3
    Uninstalling Jinja2-2.11.3:
      Successfully uninstalled Jinja2-2.11.3
Successfully installed eli5-0.13.0 jinja2-3.1.2
Note: you may need to restart the kernel to use updated packages.


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
anaconda-project 0.10.2 requires ruamel-yaml, which is not installed.
jupyter-server 1.13.5 requires pywinpty<2; os_name == "nt", but you have pywinpty 2.0.2 which is incompatible.


In [37]:
import os

%matplotlib inline
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
import catboost as ctb
import lightgbm as ltb
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
    plot_confusion_matrix,
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC

<br><br>

<br><br>

## Introduction <a name="in"></a>
<hr>

At this point we are at the end of supervised machine learning part of the course. So in this homework, you will be working on an open-ended mini-project, where you will put all the different things you have learned so far together to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips

1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code try to organize it in functions. Clear presentation of your code, experiments, and results is the key to be successful in this lab. You may use code from lecture notes or previous lab solutions with appropriate attributions. 
4. If you are having trouble running models on your laptop because of the size of the dataset, you can create your train/test split in such a way that you have less data in the train split. If you end up doing this, please write a note to the grader in the submission explaining why you are doing it.  

#### Assessment

We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.

#### A final note

Finally, this style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (2-8 hours???) is a good guideline for a typical submission. Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or getting the best possible grade. Do it because you're learning and enjoying it. Students from the past cohorts have found such kind of labs useful and fun and I hope you enjoy it as well. 

<br><br>

## 1. Understanding the problem <a name="1"></a>
<hr>
rubric={points:4}

In this mini project, you will be working on a classification problem of predicting whether a credit card client will default or not. 
For this problem, you will use [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with [the associated research paper](https://www.sciencedirect.com/science/article/pii/S0957417407006719), which is available through [the UBC library](https://www.library.ubc.ca/). 

**Your tasks:**

1. Spend some time understanding the problem and what each feature means. You can find this information in the documentation on [the dataset page on Kaggle](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). Write a few sentences on your initial thoughts on the problem and the dataset. 
2. Download the dataset and read it as a pandas dataframe. 

In [3]:
df = pd.read_csv('UCI_Credit_Card.csv')

In [4]:
df

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,29996,220000.0,1,3,1,39,0,0,0,0,...,88004.0,31237.0,15980.0,8500.0,20000.0,5003.0,3047.0,5000.0,1000.0,0
29996,29997,150000.0,1,3,2,43,-1,-1,-1,-1,...,8979.0,5190.0,0.0,1837.0,3526.0,8998.0,129.0,0.0,0.0,0
29997,29998,30000.0,1,2,2,37,4,3,2,-1,...,20878.0,20582.0,19357.0,0.0,0.0,22000.0,4200.0,2000.0,3100.0,1
29998,29999,80000.0,1,3,1,41,1,-1,0,0,...,52774.0,11855.0,48944.0,85900.0,3409.0,1178.0,1926.0,52964.0,1804.0,1


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          30000 non-null  int64  
 1   LIMIT_BAL                   30000 non-null  float64
 2   SEX                         30000 non-null  int64  
 3   EDUCATION                   30000 non-null  int64  
 4   MARRIAGE                    30000 non-null  int64  
 5   AGE                         30000 non-null  int64  
 6   PAY_0                       30000 non-null  int64  
 7   PAY_2                       30000 non-null  int64  
 8   PAY_3                       30000 non-null  int64  
 9   PAY_4                       30000 non-null  int64  
 10  PAY_5                       30000 non-null  int64  
 11  PAY_6                       30000 non-null  int64  
 12  BILL_AMT1                   30000 non-null  float64
 13  BILL_AMT2                   300

In [6]:
df.drop('ID',axis=1,inplace=True)

In [7]:
df

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,20000.0,2,2,1,24,2,2,-1,-1,-2,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,120000.0,2,2,2,26,-1,2,0,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,90000.0,2,2,2,34,0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,50000.0,2,2,1,37,0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,50000.0,1,2,1,57,-1,0,-1,0,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,220000.0,1,3,1,39,0,0,0,0,0,...,88004.0,31237.0,15980.0,8500.0,20000.0,5003.0,3047.0,5000.0,1000.0,0
29996,150000.0,1,3,2,43,-1,-1,-1,-1,0,...,8979.0,5190.0,0.0,1837.0,3526.0,8998.0,129.0,0.0,0.0,0
29997,30000.0,1,2,2,37,4,3,2,-1,0,...,20878.0,20582.0,19357.0,0.0,0.0,22000.0,4200.0,2000.0,3100.0,1
29998,80000.0,1,3,1,41,1,-1,0,0,0,...,52774.0,11855.0,48944.0,85900.0,3409.0,1178.0,1926.0,52964.0,1804.0,1


<br><br>

## 2. Data splitting <a name="2"></a>
<hr>
rubric={points:2}

**Your tasks:**

1. Split the data into train and test portions. 

In [14]:
x = df.drop(columns = 'default.payment.next.month')
y = df['default.payment.next.month']
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size = 0.80)

In [15]:
X_train.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
26050,160000.0,2,2,1,49,-1,-1,2,0,0,...,20723.0,21873.0,25999.0,26377.0,4980.0,0.0,1500.0,4520.0,939.0,0.0
17816,50000.0,1,2,2,23,0,0,0,0,0,...,49362.0,47907.0,18145.0,17495.0,1781.0,3264.0,1400.0,3000.0,350.0,133.0
11596,200000.0,2,1,2,31,-1,-1,-1,-1,-1,...,1209.0,0.0,1440.0,0.0,2286.0,2712.0,0.0,1440.0,0.0,0.0
11434,220000.0,2,2,3,50,0,0,0,0,0,...,55863.0,57079.0,51202.0,53195.0,5200.0,2000.0,2000.0,2000.0,3200.0,2001.0
13133,260000.0,2,2,2,41,0,0,0,0,0,...,254767.0,190432.0,105466.0,2312.0,9725.0,10617.0,8555.0,4722.0,2146.0,199485.0


<br><br>

## 3. EDA <a name="3"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data. 
4. Pick appropriate metric/metrics for assessment. 

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   LIMIT_BAL                   30000 non-null  float64
 1   SEX                         30000 non-null  int64  
 2   EDUCATION                   30000 non-null  int64  
 3   MARRIAGE                    30000 non-null  int64  
 4   AGE                         30000 non-null  int64  
 5   PAY_0                       30000 non-null  int64  
 6   PAY_2                       30000 non-null  int64  
 7   PAY_3                       30000 non-null  int64  
 8   PAY_4                       30000 non-null  int64  
 9   PAY_5                       30000 non-null  int64  
 10  PAY_6                       30000 non-null  int64  
 11  BILL_AMT1                   30000 non-null  float64
 12  BILL_AMT2                   30000 non-null  float64
 13  BILL_AMT3                   300

In [16]:
X_train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
LIMIT_BAL,24000.0,167714.986667,130022.175975,10000.0,50000.0,140000.0,240000.0,1000000.0
SEX,24000.0,1.602917,0.489304,1.0,1.0,2.0,2.0,2.0
EDUCATION,24000.0,1.855292,0.791282,0.0,1.0,2.0,2.0,6.0
MARRIAGE,24000.0,1.54975,0.52205,0.0,1.0,2.0,2.0,3.0
AGE,24000.0,35.494875,9.207045,21.0,28.0,34.0,41.0,79.0
PAY_0,24000.0,-0.020333,1.129221,-2.0,-1.0,0.0,0.0,8.0
PAY_2,24000.0,-0.133,1.20286,-2.0,-1.0,0.0,0.0,7.0
PAY_3,24000.0,-0.168958,1.198494,-2.0,-1.0,0.0,0.0,8.0
PAY_4,24000.0,-0.223708,1.170778,-2.0,-1.0,0.0,0.0,8.0
PAY_5,24000.0,-0.269583,1.132826,-2.0,-1.0,0.0,0.0,8.0


<br><br>

## (Optional) 4. Feature engineering <a name="4"></a>
<hr>
rubric={points:1}

**Your tasks:**

1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing. 

<br><br>

## 5. Preprocessing and transformations <a name="5"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type. 
2. Define a column transformer, if necessary. 

<br><br>

## 6. Baseline model <a name="6"></a>
<hr>

rubric={points:2}

**Your tasks:**
1. Try `scikit-learn`'s baseline model and report results.

In [17]:
dummy = DummyClassifier()
dummy.fit(X_train, y_train)
dummy.score(X_test, y_test)

0.7851666666666667

<br><br>

## 7. Linear models <a name="7"></a>
<hr>
rubric={points:12}

**Your tasks:**

1. Try logistic regression as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the complexity hyperparameter `C`. 
3. Report validation scores along with standard deviation. 
4. Summarize your results.

In [18]:
params = {"logisticregression__C": np.arange(0, 1, 0.1)}
pipe = make_pipeline(StandardScaler(), LogisticRegression())
grid = RandomizedSearchCV(estimator = pipe, param_distributions = params, n_jobs = -1)

In [19]:
grid.fit(X_train, y_train)

5 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Dhruv\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Dhruv\anaconda3\lib\site-packages\sklearn\pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\Users\Dhruv\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1589, in fit
    fold_coefs_ = Parallel(
  File "C:\Users\Dhruv\anaconda3\lib\site-packages\joblib\parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator

RandomizedSearchCV(estimator=Pipeline(steps=[('standardscaler',
                                              StandardScaler()),
                                             ('logisticregression',
                                              LogisticRegression())]),
                   n_jobs=-1,
                   param_distributions={'logisticregression__C': array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])})

In [20]:
grid.best_score_

0.8099583333333333

In [21]:
grid.best_params_

{'logisticregression__C': 0.1}

<br><br>

## 8. Different classifiers <a name="8"></a>
<hr>
rubric={points:15}

**Your tasks:**
1. Try at least 3 other models aside from logistic regression. At least one of these models should be a tree-based ensemble model (e.g., lgbm, random forest, xgboost). 
2. Summarize your results. Can you beat logistic regression? 

In [38]:
models = {
    "xgb" : xgb.XGBClassifier(),
    "catBoost": ctb.CatBoostClassifier(),
    "lightGBM": ltb.LGBMClassifier()
}
result = {}

In [39]:
for i in models:
    model = models[i]
    model.fit(X_train, y_train)
    result[i] = model.score(X_test, y_test)

Learning rate set to 0.040021
0:	learn: 0.6701410	total: 180ms	remaining: 2m 59s
1:	learn: 0.6495378	total: 207ms	remaining: 1m 43s
2:	learn: 0.6309252	total: 228ms	remaining: 1m 15s
3:	learn: 0.6136149	total: 247ms	remaining: 1m 1s
4:	learn: 0.5987235	total: 293ms	remaining: 58.2s
5:	learn: 0.5844637	total: 338ms	remaining: 55.9s
6:	learn: 0.5716022	total: 368ms	remaining: 52.2s
7:	learn: 0.5599672	total: 391ms	remaining: 48.5s
8:	learn: 0.5490501	total: 410ms	remaining: 45.2s
9:	learn: 0.5396452	total: 450ms	remaining: 44.5s
10:	learn: 0.5309286	total: 495ms	remaining: 44.5s
11:	learn: 0.5229486	total: 533ms	remaining: 43.9s
12:	learn: 0.5153540	total: 557ms	remaining: 42.3s
13:	learn: 0.5082952	total: 581ms	remaining: 40.9s
14:	learn: 0.5021617	total: 602ms	remaining: 39.5s
15:	learn: 0.4963395	total: 628ms	remaining: 38.6s
16:	learn: 0.4909349	total: 671ms	remaining: 38.8s
17:	learn: 0.4862661	total: 712ms	remaining: 38.8s
18:	learn: 0.4816576	total: 736ms	remaining: 38s
19:	learn:

164:	learn: 0.4135353	total: 4.95s	remaining: 25s
165:	learn: 0.4134837	total: 4.97s	remaining: 25s
166:	learn: 0.4133956	total: 5s	remaining: 24.9s
167:	learn: 0.4132753	total: 5.02s	remaining: 24.9s
168:	learn: 0.4131849	total: 5.04s	remaining: 24.8s
169:	learn: 0.4130509	total: 5.05s	remaining: 24.7s
170:	learn: 0.4129618	total: 5.07s	remaining: 24.6s
171:	learn: 0.4128355	total: 5.09s	remaining: 24.5s
172:	learn: 0.4127885	total: 5.11s	remaining: 24.4s
173:	learn: 0.4127538	total: 5.13s	remaining: 24.3s
174:	learn: 0.4126621	total: 5.14s	remaining: 24.3s
175:	learn: 0.4126100	total: 5.16s	remaining: 24.2s
176:	learn: 0.4125185	total: 5.18s	remaining: 24.1s
177:	learn: 0.4124200	total: 5.2s	remaining: 24s
178:	learn: 0.4123101	total: 5.22s	remaining: 24s
179:	learn: 0.4121763	total: 5.25s	remaining: 23.9s
180:	learn: 0.4120684	total: 5.27s	remaining: 23.8s
181:	learn: 0.4119604	total: 5.29s	remaining: 23.8s
182:	learn: 0.4118279	total: 5.31s	remaining: 23.7s
183:	learn: 0.4116848	to

324:	learn: 0.3986867	total: 9.51s	remaining: 19.7s
325:	learn: 0.3985522	total: 9.53s	remaining: 19.7s
326:	learn: 0.3984457	total: 9.57s	remaining: 19.7s
327:	learn: 0.3983002	total: 9.62s	remaining: 19.7s
328:	learn: 0.3982536	total: 9.64s	remaining: 19.7s
329:	learn: 0.3981985	total: 9.67s	remaining: 19.6s
330:	learn: 0.3980694	total: 9.69s	remaining: 19.6s
331:	learn: 0.3980076	total: 9.72s	remaining: 19.6s
332:	learn: 0.3978826	total: 9.76s	remaining: 19.6s
333:	learn: 0.3978011	total: 9.8s	remaining: 19.6s
334:	learn: 0.3976637	total: 9.84s	remaining: 19.5s
335:	learn: 0.3976038	total: 9.86s	remaining: 19.5s
336:	learn: 0.3974864	total: 9.88s	remaining: 19.4s
337:	learn: 0.3973803	total: 9.91s	remaining: 19.4s
338:	learn: 0.3973090	total: 9.93s	remaining: 19.4s
339:	learn: 0.3971552	total: 9.95s	remaining: 19.3s
340:	learn: 0.3970440	total: 9.97s	remaining: 19.3s
341:	learn: 0.3969543	total: 9.99s	remaining: 19.2s
342:	learn: 0.3968576	total: 10s	remaining: 19.2s
343:	learn: 0.3

483:	learn: 0.3843298	total: 14.3s	remaining: 15.2s
484:	learn: 0.3842563	total: 14.3s	remaining: 15.2s
485:	learn: 0.3842056	total: 14.4s	remaining: 15.2s
486:	learn: 0.3841485	total: 14.4s	remaining: 15.2s
487:	learn: 0.3840832	total: 14.4s	remaining: 15.1s
488:	learn: 0.3840244	total: 14.5s	remaining: 15.1s
489:	learn: 0.3839338	total: 14.5s	remaining: 15.1s
490:	learn: 0.3838885	total: 14.5s	remaining: 15s
491:	learn: 0.3838106	total: 14.6s	remaining: 15s
492:	learn: 0.3837468	total: 14.6s	remaining: 15s
493:	learn: 0.3836766	total: 14.6s	remaining: 15s
494:	learn: 0.3835404	total: 14.6s	remaining: 14.9s
495:	learn: 0.3834209	total: 14.7s	remaining: 14.9s
496:	learn: 0.3833228	total: 14.7s	remaining: 14.9s
497:	learn: 0.3832110	total: 14.8s	remaining: 14.9s
498:	learn: 0.3831322	total: 14.8s	remaining: 14.8s
499:	learn: 0.3830352	total: 14.8s	remaining: 14.8s
500:	learn: 0.3830142	total: 14.8s	remaining: 14.8s
501:	learn: 0.3829515	total: 14.8s	remaining: 14.7s
502:	learn: 0.382896

645:	learn: 0.3717116	total: 19.2s	remaining: 10.5s
646:	learn: 0.3716000	total: 19.3s	remaining: 10.5s
647:	learn: 0.3715848	total: 19.3s	remaining: 10.5s
648:	learn: 0.3715021	total: 19.3s	remaining: 10.5s
649:	learn: 0.3714383	total: 19.4s	remaining: 10.4s
650:	learn: 0.3714022	total: 19.4s	remaining: 10.4s
651:	learn: 0.3713291	total: 19.4s	remaining: 10.4s
652:	learn: 0.3712795	total: 19.4s	remaining: 10.3s
653:	learn: 0.3711999	total: 19.5s	remaining: 10.3s
654:	learn: 0.3710900	total: 19.5s	remaining: 10.3s
655:	learn: 0.3710089	total: 19.6s	remaining: 10.3s
656:	learn: 0.3709351	total: 19.6s	remaining: 10.2s
657:	learn: 0.3708784	total: 19.6s	remaining: 10.2s
658:	learn: 0.3708141	total: 19.6s	remaining: 10.2s
659:	learn: 0.3708052	total: 19.7s	remaining: 10.1s
660:	learn: 0.3707360	total: 19.7s	remaining: 10.1s
661:	learn: 0.3706298	total: 19.7s	remaining: 10.1s
662:	learn: 0.3705805	total: 19.8s	remaining: 10s
663:	learn: 0.3705161	total: 19.8s	remaining: 10s
664:	learn: 0.37

807:	learn: 0.3602892	total: 24.1s	remaining: 5.73s
808:	learn: 0.3602406	total: 24.1s	remaining: 5.7s
809:	learn: 0.3601443	total: 24.2s	remaining: 5.67s
810:	learn: 0.3600666	total: 24.2s	remaining: 5.64s
811:	learn: 0.3599822	total: 24.2s	remaining: 5.61s
812:	learn: 0.3599204	total: 24.3s	remaining: 5.58s
813:	learn: 0.3598686	total: 24.3s	remaining: 5.55s
814:	learn: 0.3598037	total: 24.3s	remaining: 5.52s
815:	learn: 0.3597081	total: 24.3s	remaining: 5.49s
816:	learn: 0.3596173	total: 24.4s	remaining: 5.46s
817:	learn: 0.3595821	total: 24.4s	remaining: 5.44s
818:	learn: 0.3595204	total: 24.5s	remaining: 5.41s
819:	learn: 0.3594108	total: 24.5s	remaining: 5.38s
820:	learn: 0.3593429	total: 24.5s	remaining: 5.34s
821:	learn: 0.3592788	total: 24.5s	remaining: 5.31s
822:	learn: 0.3591878	total: 24.6s	remaining: 5.29s
823:	learn: 0.3591627	total: 24.6s	remaining: 5.26s
824:	learn: 0.3591195	total: 24.6s	remaining: 5.23s
825:	learn: 0.3590807	total: 24.7s	remaining: 5.2s
826:	learn: 0.

968:	learn: 0.3495030	total: 28.8s	remaining: 922ms
969:	learn: 0.3494501	total: 28.9s	remaining: 893ms
970:	learn: 0.3494028	total: 28.9s	remaining: 863ms
971:	learn: 0.3493913	total: 28.9s	remaining: 833ms
972:	learn: 0.3493347	total: 28.9s	remaining: 803ms
973:	learn: 0.3493077	total: 29s	remaining: 773ms
974:	learn: 0.3492640	total: 29s	remaining: 743ms
975:	learn: 0.3492384	total: 29s	remaining: 714ms
976:	learn: 0.3491996	total: 29.1s	remaining: 684ms
977:	learn: 0.3491830	total: 29.1s	remaining: 654ms
978:	learn: 0.3491330	total: 29.1s	remaining: 624ms
979:	learn: 0.3490756	total: 29.1s	remaining: 595ms
980:	learn: 0.3490041	total: 29.2s	remaining: 565ms
981:	learn: 0.3488875	total: 29.2s	remaining: 536ms
982:	learn: 0.3488432	total: 29.2s	remaining: 506ms
983:	learn: 0.3487903	total: 29.3s	remaining: 476ms
984:	learn: 0.3487249	total: 29.3s	remaining: 446ms
985:	learn: 0.3486468	total: 29.3s	remaining: 416ms
986:	learn: 0.3486083	total: 29.4s	remaining: 387ms
987:	learn: 0.3485

In [40]:
result

{'xgb': 0.818, 'catBoost': 0.8193333333333334, 'lightGBM': 0.8203333333333334}

<br><br>

## (Optional) 9. Feature selection <a name="9"></a>
<hr>
rubric={points:1}

**Your tasks:**

Make some attempts to select relevant features. You may try `RFECV` or forward selection. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it in the next exercises. 

<br><br>

## 10. Hyperparameter optimization <a name="10"></a>
<hr>
rubric={points:15}

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. You may pick one of the best performing models from the previous exercise and tune hyperparameters only for that model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize)

In [41]:
param = {
    "max_depth" : [3,4,5,6],
    "num_leaves" : [5,7,9,11]
}

In [42]:
random = RandomizedSearchCV(estimator = ltb.LGBMClassifier(), param_distributions = param, n_jobs = -1)
random.fit(X_train, y_train)

RandomizedSearchCV(estimator=LGBMClassifier(), n_jobs=-1,
                   param_distributions={'max_depth': [3, 4, 5, 6],
                                        'num_leaves': [5, 7, 9, 11]})

In [43]:
random.best_score_

0.8215416666666666

In [44]:
random.best_params_

{'num_leaves': 9, 'max_depth': 5}

<br><br>

## 11. Interpretation and feature importances <a name="1"></a>
<hr>
rubric={points:15}

**Your tasks:**

1. Use the methods we saw in class (e.g., `eli5`, `shap`) (or any other methods of your choice) to explain feature importances of one of the best performing models. Summarize your observations. 

In [45]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

In [48]:
import eli5 

eli5.show_weights(lr)

Weight?,Feature
+0.311,<BIAS>
+0.096,x5
+0.022,x6
+0.009,x7
+0.007,x9
+0.005,x8
+0.001,x4
+0.000,x16
+0.000,x13
+0.000,x12


<br><br>

## 12. Results on the test set <a name="12"></a>
<hr>

rubric={points:5}

**Your tasks:**

1. Try your best performing model on the test data and report test scores. 
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias? 

In [49]:
model = ltb.LGBMClassifier(num_leaves = 5, max_depth = 3)
model.fit(X_train, y_train)
final = model.score(X_test, y_test)

<br><br>

## (Optional) 13. Explaining predictions 
rubric={points:1}

**Your tasks**

1. Take one or two test predictions and explain them with SHAP force plots.  

<br><br>

## 14. Summary of results <a name="13"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Report your final test score along with the metric you used. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but could potentially improve the performance/interpretability . 

In [50]:
print(f'final score was = {final}\nmodel was :- LGBMClassifier\nmetrix were :- num_leaves = 5, max_depth = 3')

final score was = 0.822
model was :- LGBMClassifier
metrix were :- num_leaves = 5, max_depth = 3


<br><br><br><br>