#  Applied Machine Learning 

## Homework 6: Putting it all together 



## Table of contents

- [Submission instructions](#si)
- [Understanding the problem](#1)
- [Data splitting](#2)
- [EDA](#3)
- (Optional) [Feature engineering](#4)
- [Preprocessing and transformations](#5)
- [Baseline model](#6)
- [Linear models](#7)
- [Different classifiers](#8)
- (Optional) [Feature selection](#9)
- [Hyperparameter optimization](#10)
- [Interpretation and feature importances](#11)
- [Results on the test set](#12)
- (Optional) [Explaining predictions](#13)
- [Summary of the results](#14)

## Imports 

In [1]:
import os

%matplotlib inline
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
import catboost as ctb
import lightgbm as ltb
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
# ConfusionMatrixDisplay replaces plot_confusion_matrix, which was removed in scikit-learn 1.2.
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

<br><br>

## Introduction <a name="in"></a>
<hr>

At this point we are at the end of the supervised machine learning part of the course. In this homework, you will work on an open-ended mini-project, where you will put together the different things you have learned so far to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips

1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is the key to success in this lab. You may use code from lecture notes or previous lab solutions with appropriate attribution. 
4. If you are having trouble running models on your laptop because of the size of the dataset, you can create your train/test split in such a way that you have less data in the train split. If you end up doing this, please write a note to the grader in the submission explaining why you are doing it.  

#### Assessment

We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.

#### A final note

Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (2-8 hours???) is a good guideline for a typical submission. Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found labs of this kind useful and fun, and I hope you enjoy it as well. 

<br><br>

## 1. Understanding the problem <a name="1"></a>
<hr>
rubric={points:4}

In this mini project, you will work on a classification problem: predicting whether a credit card client will default or not. 
For this problem, you will use the [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). The data set has 30,000 examples and 24 columns besides the ID (23 features plus the target), and the goal is to estimate whether a person will default on (fail to pay) their credit card bills; the target column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas from, and compare your results with, [the associated research paper](https://www.sciencedirect.com/science/article/pii/S0957417407006719), which is available through [the UBC library](https://www.library.ubc.ca/). 

**Your tasks:**

1. Spend some time understanding the problem and what each feature means. You can find this information in the documentation on [the dataset page on Kaggle](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). Write a few sentences on your initial thoughts on the problem and the dataset. 
2. Download the dataset and read it as a pandas dataframe. 

In [2]:
card = pd.read_csv('UCI_Credit_Card.csv', index_col = 0)

In [3]:
card

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
1,20000.0,2,2,1,24,2,2,-1,-1,-2,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
2,120000.0,2,2,2,26,-1,2,0,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
3,90000.0,2,2,2,34,0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
4,50000.0,2,2,1,37,0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
5,50000.0,1,2,1,57,-1,0,-1,0,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29996,220000.0,1,3,1,39,0,0,0,0,0,...,88004.0,31237.0,15980.0,8500.0,20000.0,5003.0,3047.0,5000.0,1000.0,0
29997,150000.0,1,3,2,43,-1,-1,-1,-1,0,...,8979.0,5190.0,0.0,1837.0,3526.0,8998.0,129.0,0.0,0.0,0
29998,30000.0,1,2,2,37,4,3,2,-1,0,...,20878.0,20582.0,19357.0,0.0,0.0,22000.0,4200.0,2000.0,3100.0,1
29999,80000.0,1,3,1,41,1,-1,0,0,0,...,52774.0,11855.0,48944.0,85900.0,3409.0,1178.0,1926.0,52964.0,1804.0,1


<br><br>

## 2. Data splitting <a name="2"></a>
<hr>
rubric={points:2}

**Your tasks:**

1. Split the data into train and test portions. 

In [4]:
X = card.drop(columns = 'default.payment.next.month')
y = card['default.payment.next.month']
# A fixed random_state makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.80, random_state = 123)
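
A minimal sketch of a stratified, seeded split, run on a toy frame rather than the real data. The toy values and the seed `123` are illustrative assumptions. Stratifying on the target keeps the share of defaults roughly equal in the train and test portions, which matters when one class is much rarer than the other.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the credit card data (illustrative values only).
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "LIMIT_BAL": rng.integers(10_000, 500_000, size=1_000),
        "AGE": rng.integers(21, 80, size=1_000),
        "default.payment.next.month": rng.choice([0, 1], size=1_000, p=[0.78, 0.22]),
    }
)

X = toy.drop(columns="default.payment.next.month")
y = toy["default.payment.next.month"]

# stratify=y preserves the class proportions; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.80, stratify=y, random_state=123
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```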

In [5]:
X_train.head()

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
15988,50000.0,2,2,2,31,2,2,2,2,2,...,30895.0,31589.0,32071.0,32709.0,1900.0,1600.0,1500.0,1300.0,1300.0,1500.0
27912,180000.0,2,2,1,26,-1,0,0,2,-1,...,9227.0,3235.0,2655.0,7251.0,5000.0,9000.0,0.0,2668.0,7251.0,13365.0
28713,180000.0,2,2,1,42,0,0,0,0,0,...,109319.0,111343.0,91966.0,82280.0,6167.0,5800.0,5700.0,3600.0,3000.0,3234.0
509,80000.0,2,2,1,30,0,0,0,0,0,...,13997.0,10914.0,10685.0,5515.0,1700.0,1400.0,200.0,1000.0,0.0,500.0
18266,50000.0,1,3,1,51,0,0,0,0,0,...,46538.0,45465.0,18654.0,19042.0,2516.0,2106.0,2048.0,783.0,806.0,652.0


<br><br>

## 3. EDA <a name="3"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data. 
4. Pick appropriate metric/metrics for assessment. 
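
To make the choice of metric concrete, here is a small sketch in plain Python, using made-up labels, of why accuracy alone can mislead when defaults are the minority class. The helper `precision_recall_f1` is a hypothetical name introduced for this illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative labels: 8 non-defaults followed by 2 defaults.
y_true = [0] * 8 + [1, 1]
always_zero = [0] * 10                          # never predicts a default
one_caught = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]     # catches one default, one false alarm

for name, y_pred in [("always_zero", always_zero), ("one_caught", one_caught)]:
    print(name, accuracy(y_true, y_pred), precision_recall_f1(y_true, y_pred))
```

Both sets of predictions have the same accuracy, but only the second one identifies any default, which is what precision, recall, and F1 on the positive class pick up.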

In [6]:
card.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   LIMIT_BAL                   30000 non-null  float64
 1   SEX                         30000 non-null  int64  
 2   EDUCATION                   30000 non-null  int64  
 3   MARRIAGE                    30000 non-null  int64  
 4   AGE                         30000 non-null  int64  
 5   PAY_0                       30000 non-null  int64  
 6   PAY_2                       30000 non-null  int64  
 7   PAY_3                       30000 non-null  int64  
 8   PAY_4                       30000 non-null  int64  
 9   PAY_5                       30000 non-null  int64  
 10  PAY_6                       30000 non-null  int64  
 11  BILL_AMT1                   30000 non-null  float64
 12  BILL_AMT2                   30000 non-null  float64
 13  BILL_AMT3                   300

In [7]:
X_train.describe().T

Feature,count,mean,std,min,25%,50%,75%,max
LIMIT_BAL,24000.0,167788.236667,129749.484967,10000.0,50000.0,140000.0,240000.0,1000000.0
SEX,24000.0,1.601833,0.48953,1.0,1.0,2.0,2.0,2.0
EDUCATION,24000.0,1.8535,0.790715,0.0,1.0,2.0,2.0,6.0
MARRIAGE,24000.0,1.553125,0.521318,0.0,1.0,2.0,2.0,3.0
AGE,24000.0,35.442167,9.197206,21.0,28.0,34.0,41.0,79.0
PAY_0,24000.0,-0.016792,1.128116,-2.0,-1.0,0.0,0.0,8.0
PAY_2,24000.0,-0.132542,1.199528,-2.0,-1.0,0.0,0.0,7.0
PAY_3,24000.0,-0.167833,1.195153,-2.0,-1.0,0.0,0.0,8.0
PAY_4,24000.0,-0.222583,1.165982,-2.0,-1.0,0.0,0.0,8.0
PAY_5,24000.0,-0.267667,1.127382,-2.0,-1.0,0.0,0.0,8.0
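
Two further summary statistics that are often useful here are the class proportions of the target and the per-class feature means. The sketch below uses a tiny hand-made frame (illustrative values, not rows from the real data); with the real data the same calls would be applied to the training portion only.

```python
import pandas as pd

# Tiny illustrative sample; the real analysis would use the training split.
train = pd.DataFrame(
    {
        "PAY_0": [2, -1, 0, 0, -1, 1, 0, 2],
        "LIMIT_BAL": [20000, 120000, 90000, 50000, 50000, 30000, 220000, 80000],
        "default.payment.next.month": [1, 1, 0, 0, 0, 1, 0, 1],
    }
)
target = "default.payment.next.month"

# Summary statistic 1: share of each class (reveals class imbalance, if any).
class_share = train[target].value_counts(normalize=True)

# Summary statistic 2: mean of each feature within each class.
means_by_class = train.groupby(target).mean()

print(class_share)
print(means_by_class)
```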


<br><br>

## (Optional) 4. Feature engineering <a name="4"></a>
<hr>
rubric={points:1}

**Your tasks:**

1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing. 
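
As one possible direction, the sketch below derives two new columns from existing ones. The column names `UTILIZATION` and `MONTHS_LATE` are hypothetical, the sample values are made up, and the reading of positive `PAY_*` codes as months of delay is an assumption taken from the dataset documentation.

```python
import pandas as pd

# Illustrative rows using column names from the dataset.
df = pd.DataFrame(
    {
        "LIMIT_BAL": [20000.0, 120000.0, 90000.0],
        "BILL_AMT1": [5000.0, 30000.0, 45000.0],
        "PAY_0": [2, -1, 0],
        "PAY_2": [2, 2, 0],
    }
)
pay_cols = ["PAY_0", "PAY_2"]

# Share of the credit limit used by the most recent bill.
df["UTILIZATION"] = df["BILL_AMT1"] / df["LIMIT_BAL"]

# Number of months with a positive repayment-status code (assumed to mean a delay).
df["MONTHS_LATE"] = (df[pay_cols] > 0).sum(axis=1)

print(df[["UTILIZATION", "MONTHS_LATE"]])
```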

<br><br>

## 5. Preprocessing and transformations <a name="5"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type. 
2. Define a column transformer, if necessary. 

In [8]:
# column transformer is not necessary
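
If the integer-coded columns such as `SEX` and `MARRIAGE` were instead treated as categorical, a column transformer could look roughly like the sketch below. The feature grouping is an assumption made for illustration, and the frame is a toy stand-in for the real data.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for a few columns of the dataset.
X_toy = pd.DataFrame(
    {
        "LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0],
        "AGE": [24, 26, 34, 37],
        "SEX": [2, 2, 1, 1],
        "MARRIAGE": [1, 2, 2, 3],
    }
)

numeric_feats = ["LIMIT_BAL", "AGE"]      # scaled
categorical_feats = ["SEX", "MARRIAGE"]   # one-hot encoded (assumed categorical)

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_feats),
    (OneHotEncoder(handle_unknown="ignore"), categorical_feats),
)

X_transformed = preprocessor.fit_transform(X_toy)
print(X_transformed.shape)
```

The preprocessor can then be placed as the first step of a pipeline so that it is refit inside every cross-validation fold.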

<br><br>

## 6. Baseline model <a name="6"></a>
<hr>

rubric={points:2}

**Your tasks:**
1. Try `scikit-learn`'s baseline model and report results.

In [9]:
dummy = DummyClassifier()
dummy.fit(X_train, y_train)
dummy.score(X_test, y_test)

0.7673333333333333
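
An alternative way to report the baseline is with cross-validation on the training portion, which leaves the test set untouched until the end. A minimal sketch on synthetic data (the data and the 22% positive rate are assumptions for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced toy data.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 4))
y_toy = (rng.random(500) < 0.22).astype(int)

dummy = DummyClassifier(strategy="most_frequent")
scores = cross_validate(dummy, X_toy, y_toy, cv=5, scoring=["accuracy", "recall"])

print("accuracy: %.3f +/- %.3f" % (scores["test_accuracy"].mean(), scores["test_accuracy"].std()))
print("recall:   %.3f" % scores["test_recall"].mean())
```

The baseline's accuracy tracks the share of the majority class, while its recall on the positive class is zero, which gives later models two reference points.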

<br><br>

## 7. Linear models <a name="7"></a>
<hr>
rubric={points:12}

**Your tasks:**

1. Try logistic regression as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the complexity hyperparameter `C`. 
3. Report validation scores along with standard deviation. 
4. Summarize your results.

In [10]:
# C must be strictly positive, so the grid starts at 0.1 rather than 0.
params = {"logisticregression__C": np.arange(1, 10) / 10}
pipe = make_pipeline(StandardScaler(), LogisticRegression())
# The grid has 9 candidates, so n_iter = 9 tries each of them once.
grid = RandomizedSearchCV(estimator = pipe, param_distributions = params, n_iter = 9, n_jobs = -1)

In [11]:
grid.fit(X_train, y_train);

In [12]:
grid.best_score_

0.813875

In [13]:
grid.best_params_

{'logisticregression__C': 0.8}
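
To report validation scores together with their standard deviation, the `cv_results_` attribute of a fitted search can be turned into a table. The sketch below uses synthetic data and a log-spaced grid for `C`; both are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# Log-spaced values cover several orders of magnitude of regularization strength.
param_grid = {"logisticregression__C": np.logspace(-3, 2, 6)}

search = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True)
search.fit(X_toy, y_toy)

results = pd.DataFrame(search.cv_results_)[
    [
        "param_logisticregression__C",
        "mean_test_score",
        "std_test_score",
        "mean_train_score",
    ]
].sort_values("mean_test_score", ascending=False)
print(results)
```

Comparing `mean_train_score` with `mean_test_score` across values of `C` also shows where the model starts to overfit or underfit.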

<br><br>

## 8. Different classifiers <a name="8"></a>
<hr>
rubric={points:15}

**Your tasks:**
1. Try at least 3 other models aside from logistic regression. At least one of these models should be a tree-based ensemble model (e.g., lgbm, random forest, xgboost). 
2. Summarize your results. Can you beat logistic regression? 

In [14]:
models = {
    "xgb" : xgb.XGBClassifier(),
    "catBoost": ctb.CatBoostClassifier(),
    "lightGBM": ltb.LGBMClassifier()
}
result = {}

In [15]:
# Fit each model on the training split and record its accuracy on the test split.
for name, model in models.items():
    model.fit(X_train, y_train)
    result[name] = model.score(X_test, y_test)

In [16]:
result

{'xgb': 0.8038333333333333, 'catBoost': 0.8085, 'lightGBM': 0.8111666666666667}
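
A comparison based on cross-validation scores, rather than the test split, keeps the test set for the final evaluation only. The sketch below is one way to do this; it swaps in scikit-learn classifiers and synthetic data so that it runs without the gradient boosting libraries, and the choice of F1 as the metric is an assumption.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced toy data.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.78, 0.22], random_state=0
)

candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "RBF SVM": make_pipeline(StandardScaler(), SVC()),
}

rows = {}
for name, model in candidates.items():
    cv = cross_validate(model, X_toy, y_toy, cv=5, scoring="f1")
    rows[name] = {"mean_f1": cv["test_score"].mean(), "std_f1": cv["test_score"].std()}

# One row per model, one column per statistic.
summary = pd.DataFrame(rows).T
print(summary)
```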

<br><br>

## (Optional) 9. Feature selection <a name="9"></a>
<hr>
rubric={points:1}

**Your tasks:**

Make some attempts to select relevant features. You may try `RFECV` or forward selection. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it in the next exercises. 
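
A minimal sketch of `RFECV` inside a pipeline, on synthetic data with three informative features out of ten. The data, the logistic regression estimator used for ranking, and the final classifier are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# RFECV drops features one at a time and uses cross-validation
# to choose how many to keep.
pipe = make_pipeline(
    StandardScaler(),
    RFECV(LogisticRegression(max_iter=1000), cv=5),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_toy, y_toy)

selector = pipe.named_steps["rfecv"]
print("features kept:", selector.n_features_)
print("mask:", selector.support_)
```

Keeping the selector inside the pipeline means the selection is redone within each fold whenever the pipeline itself is cross-validated.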

<br><br>

## 10. Hyperparameter optimization <a name="10"></a>
<hr>
rubric={points:15}

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. You may pick one of the best performing models from the previous exercise and tune hyperparameters only for that model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize)

In [17]:
param = {
    "max_depth" : [3,4,5,6],
    "num_leaves" : [5,7,9,11]
}

In [18]:
random = RandomizedSearchCV(estimator = ltb.LGBMClassifier(), param_distributions = param, n_jobs = -1)
random.fit(X_train, y_train)

RandomizedSearchCV(estimator=LGBMClassifier(), n_jobs=-1,
                   param_distributions={'max_depth': [3, 4, 5, 6],
                                        'num_leaves': [5, 7, 9, 11]})

In [19]:
random.best_score_

0.8249583333333333

In [20]:
random.best_params_

{'num_leaves': 11, 'max_depth': 3}

<br><br>

## 11. Interpretation and feature importances <a name="11"></a>
<hr>
rubric={points:15}

**Your tasks:**

1. Use the methods we saw in class (e.g., `eli5`, `shap`), or any other methods of your choice, to explain the feature importances of one of the best performing models. Summarize your observations. 

In [21]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

In [22]:
import eli5 

eli5.show_weights(lr)

Weight?,Feature
+0.303,<BIAS>
+0.093,x5
+0.024,x6
+0.010,x7
+0.006,x9
+0.006,x8
+0.001,x4
+0.000,x12
+0.000,x16
… 4 more negative …,… 4 more negative …


<br><br>

## 12. Results on the test set <a name="12"></a>
<hr>

rubric={points:5}

**Your tasks:**

1. Try your best performing model on the test data and report test scores. 
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias? 

In [23]:
model = ltb.LGBMClassifier(num_leaves = 5, max_depth = 3)
model.fit(X_train, y_train)
final = model.score(X_test, y_test)
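
One way to compare validation and test scores is to score the refit best estimator from the search directly on the test portion. A minimal sketch on synthetic data, with logistic regression standing in for the tuned model (both are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validation score of the selected candidate;
# search.score uses the estimator refit on the full training portion.
cv_score = search.best_score_
test_score = search.score(X_te, y_te)

print(f"validation accuracy: {cv_score:.3f}")
print(f"test accuracy:       {test_score:.3f}")
print(classification_report(y_te, search.predict(X_te)))
```

A test score far below the validation score would suggest optimization bias from the search; a small gap is expected from sampling noise alone.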

<br><br>

## (Optional) 13. Explaining predictions <a name="13"></a>
<hr>
rubric={points:1}

**Your tasks:**

1. Take one or two test predictions and explain them with SHAP force plots.  
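
The idea behind a force plot, splitting one prediction into per-feature pushes, can be illustrated without extra libraries for a linear model, where the log-odds are the intercept plus a sum of coefficient times value terms. This is a rough stand-in under that assumption, not a SHAP computation; the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

x = X_toy[0]                                # the example to explain
contributions = model.coef_[0] * x          # per-feature push on the log-odds
log_odds = model.intercept_[0] + contributions.sum()
probability = 1.0 / (1.0 + np.exp(-log_odds))

for i, c in enumerate(contributions):
    print(f"feature {i}: {c:+.3f}")
print(f"intercept: {model.intercept_[0]:+.3f}, predicted probability: {probability:.3f}")
```

Positive terms push the prediction toward class 1 and negative terms toward class 0; the terms plus the intercept add up to the model's decision function for that example.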

<br><br>

## 14. Summary of results <a name="14"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Report your final test score along with the metric you used. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but that could potentially improve the performance or interpretability. 

In [24]:
print(f'Final test accuracy: {final}\nModel: LGBMClassifier\nHyperparameters: num_leaves = 5, max_depth = 3')

Final test accuracy: 0.8091666666666667
Model: LGBMClassifier
Hyperparameters: num_leaves = 5, max_depth = 3


<br><br><br><br>