# Model Selection and Assessment

[1 - Model Selection](#selection)
* [1.1 - Hold-Out Method](#hold)
* [1.2 - K-Fold](#kfold)
* [1.3 - Repeated K-Fold](#r_kfold)
* [1.4 - Leave-one-Out](#loo)
* [1.5 - Stratified K-Fold and others](#other)

[2 - Performance measures](#measures)
* [2.1 - Regression Problems](#regress)
* [2.2 - Classification Problems](#class)

<a class="anchor" id="selection">

## `1. Model Selection`
    
</a>

__`Step 1`__ Import the needed libraries

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

__`Step 2`__ Read the dataset __diabetes.csv__

In [2]:
diabetes = pd.read_csv(r'./Datasets/diabetes.csv')
diabetes

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In this dataset we have only 768 rows and 9 columns. For data of this size we would usually apply K-Fold Cross-Validation, but let's look at other approaches first.

__`Step 3`__ Create an object named __data__ that will contain your independent variables and another object named __target__ that will contain your dependent variable / target (the last column in the dataset)

In [3]:
# DO IT
data = diabetes.iloc[:,:-1]
target = diabetes.iloc[:,-1]

<a class="anchor" id="hold">

## 1.1. The Hold-Out Method

</a>
    
In this approach we randomly split the full dataset into a training set and a test set, typically in a 70:30 or 80:20 ratio. We fit the model on the training set and use the test set to evaluate it. With limited data this approach can give a biased and unstable estimate, because the model never sees the held-out observations during training and the result depends on a single random split. If the dataset is large, this approach is acceptable.

In this exercise, we are going to split our dataset into train, validation and test. <br> <br>
sklearn provides a function named `train_test_split` that splits a dataset into two subsets.

__`Step 4`__ Import the function `train_test_split` from `sklearn.model_selection`

In [4]:
# DO IT
from sklearn.model_selection import train_test_split

__`Step 5`__ Divide the `data` into `X_train_val` and `X_test`, the `target` into `y_train_val` and `y_test`, and define the following arguments: `test_size = 0.15`, `random_state = 15`, `shuffle = True` and `stratify = target`.

In [5]:
X_train_val, X_test, y_train_val, y_test = train_test_split(data, 
                                                    target, 
                                                    test_size=0.15, 
                                                    random_state=15, 
                                                    shuffle=True, 
                                                    stratify=target
                                                   )

This will allow me to create two different datasets, one for train and validation (85% of the data) and one for test (15% of the data). <br>
The stratification will allow me to have the same proportion of each label of the dependent variable in both datasets.


### How to create the three datasets: train, validation and test?
To create three datasets (train, validation and test) we are going to use the function train_test_split twice. <br><br>
We already created in Step 5 two sets of datasets, one for test (X_test and y_test) and another one that includes the data for training and validation (X_train_val and y_train_val). <br>
Now it is time to split our biggest dataset into train and validation.




<img src="Division.png" alt="Drawing" style="width: 500px;"/> <br>

__`Step 6`__ Divide the `X_train_val` into `X_train` and `X_val`, the `y_train_val` into `y_train` and `y_val`, and define the following arguments: `test_size = 0.18`, `random_state = 15`, `shuffle = True` and `stratify = y_train_val`.

In [6]:
# DO IT
X_train, X_val, y_train, y_val = train_test_split(X_train_val,
                                                  y_train_val,
                                                  test_size = 0.18,
                                                  random_state = 15,
                                                  shuffle=True,
                                                  stratify=y_train_val
)
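The value `test_size = 0.18` in Step 6 follows from simple arithmetic: the validation set should hold about 15% of the full dataset, but the second split only sees the 85% that remained after Step 5. A small sketch of that calculation (the variable names are illustrative and not part of the exercise):

```python
# Fractions of the FULL dataset we want in the test and validation sets.
test_fraction = 0.15
val_fraction = 0.15

# After the first split, only (1 - test_fraction) of the data remains.
remaining = 1 - test_fraction

# Fraction of the REMAINING data that must go to validation
# so that it represents about 15% of the full dataset.
val_size_second_split = val_fraction / remaining

print(round(val_size_second_split, 3))  # 0.176, rounded up to 0.18 in Step 6
```
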

__`Step 7`__ Check the proportion of data for each dataset. _(written for you)_

In [7]:
print('train:{:.0%} | validation:{:.0%} | test:{:.0%}'.format(len(y_train)/len(target),
                                                               len(y_val)/len(target),
                                                               len(y_test)/len(target)
                                                              ))

train:70% | validation:15% | test:15%


Now we have three different datasets, namely:
- Training dataset, with 70% of the data, that will allow me to build the model;
- Validation dataset, with 15% of the data, that will allow me to fine tune the model and check some problems like overfitting;
- Test dataset, with 15% of the data, that will allow me to evaluate the performance of the final model.

Since we stratified on the target, the three datasets have similar proportions of '0s' and '1s'.

In [8]:
print('Training Data')
print(y_train.value_counts()/len(y_train))
print('Validation Data')
print(y_val.value_counts()/len(y_val))
print('Test Data')
print(y_test.value_counts()/len(y_test))

Training Data
0    0.649813
1    0.350187
Name: Outcome, dtype: float64
Validation Data
0    0.652542
1    0.347458
Name: Outcome, dtype: float64
Test Data
0    0.655172
1    0.344828
Name: Outcome, dtype: float64


__`Step 8`__ What if I didn't apply stratification? Let's see an example:

In [9]:
X_train_not_strat, X_test_not_strat, y_train_not_strat, y_test_not_strat = train_test_split(data, 
                                                    target, 
                                                    test_size=0.15, 
                                                    random_state=15, 
                                                    shuffle=True, 
                                                   )

In [10]:
print('Training Data')
print(y_train_not_strat.value_counts()/len(y_train_not_strat))
print('Test Data')
print(y_test_not_strat.value_counts()/len(y_test_not_strat))

Training Data
0    0.639571
1    0.360429
Name: Outcome, dtype: float64
Test Data
0    0.715517
1    0.284483
Name: Outcome, dtype: float64


In this case, the class proportions of the two datasets do not match: without stratification, the share of each class varies from one split to the other.

<a class="anchor" id="kfold">

## 1.2. K-Fold

</a>

Now we are going to apply K-Fold. K-Fold is the most used strategy when splitting our data, and more appropriate when we have a medium-sized dataset.
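To make the mechanics concrete, here is a minimal sketch of how `KFold` partitions the row indices. It uses a toy array of 6 samples and 3 folds (both hypothetical, chosen only so the output is easy to read), not the diabetes data:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 6 samples, so that the folds are easy to read.
X_toy = np.arange(6).reshape(-1, 1)

kf_toy = KFold(n_splits=3)  # no shuffling: folds are consecutive blocks

folds = []
for train_index, test_index in kf_toy.split(X_toy):
    folds.append((train_index.tolist(), test_index.tolist()))
    print('train:', train_index, '| test:', test_index)

# train: [2 3 4 5] | test: [0 1]
# train: [0 1 4 5] | test: [2 3]
# train: [0 1 2 3] | test: [4 5]
```

Every sample lands in the test fold exactly once. Because the default is `shuffle=False`, the folds are consecutive blocks of rows; if a file happens to be ordered by the target, passing `shuffle=True` (with a `random_state`) is usually a safer choice.
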

__`Step 9`__ Import __KFold__ from __sklearn.model_selection__

In [11]:
# DO IT
from sklearn.model_selection import KFold

__`Step 10`__ Create a KFold Instance where the number of splits is 10 (*n_splits*) and name it as __kf__

In [12]:
# DO IT
kf = KFold(n_splits=10)

This time we also want to fit a machine learning algorithm on the data, so that we can compare the results of the different models built during the K-Fold.

__`Step 11`__ Import __LogisticRegression__ from __sklearn.linear_model__

In [13]:
# DO IT
from sklearn.linear_model import LogisticRegression

__`Step 12`__ Create a function named __avg_score_LR__ that prints the average score and its standard deviation for the train and the test sets, using a Logistic Regression. Its parameters are the splitting technique you are going to use, your independent variables and your target.

In [14]:
def avg_score_LR(split_method,X,y):
    score_train = []
    score_test = []
    for train_index, test_index in split_method.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        model = LogisticRegression().fit(X_train, y_train)
        value_train = model.score(X_train, y_train)
        value_test = model.score(X_test,y_test)
        score_train.append(value_train)
        score_test.append(value_test)

    
    print('Training mean accuracy for each model:', score_train)
    print('\nTest mean accuracy for each model:', score_test)
    print('\nTrain average value:' +  str(round(np.mean(score_train),2)) + '+/-' + str(round(np.std(score_train),2)))
    print('\nTest average value:' +  str(round(np.mean(score_test),2)) + '+/-' + str(round(np.std(score_test),2)))

__`Step 13`__ Call the function __avg_score_LR__ and check the average score for the train and the test sets using the split technique __kf__

In [15]:
avg_score_LR(kf, data, target)

Training mean accuracy for each model: [0.7988422575976846, 0.7756874095513748, 0.7756874095513748, 0.788712011577424, 0.7829232995658466, 0.7901591895803184, 0.768451519536903, 0.7742402315484804, 0.7933526011560693, 0.7803468208092486]

Test mean accuracy for each model: [0.7012987012987013, 0.8441558441558441, 0.7532467532467533, 0.6883116883116883, 0.7922077922077922, 0.7402597402597403, 0.8571428571428571, 0.8181818181818182, 0.7368421052631579, 0.8026315789473685]

Train average value:0.78+/-0.01

Test average value:0.77+/-0.06


<a class="anchor" id="r_kfold">

## 1.3. Repeated K-Fold

</a>

We can also apply Repeated K-Fold. This is a technique that, as the name says, is going to repeat the process of K-Fold several times.
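As a rough illustration, the number of models trained is the number of folds times the number of repetitions. The sketch below uses a toy array of 12 samples and sets `random_state`, which is an addition made here only to keep the splits reproducible:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Toy data: 12 samples (hypothetical, for illustration only).
X_toy = np.arange(12).reshape(-1, 1)

# random_state is an assumption added here to make the splits reproducible.
rkf_toy = RepeatedKFold(n_splits=6, n_repeats=2, random_state=0)

n_models = rkf_toy.get_n_splits(X_toy)
print(n_models)  # 6 folds x 2 repetitions = 12 train/test pairs

# Each test fold holds 12 / 6 = 2 samples.
test_sizes = [len(test_index) for _, test_index in rkf_toy.split(X_toy)]
print(test_sizes)
```

Each repetition reshuffles the data before splitting, so the second set of 6 folds differs from the first.
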

__`Step 14`__ Import __RepeatedKFold__ from __sklearn.model_selection__

In [16]:
# DO IT
from sklearn.model_selection import RepeatedKFold

__`Step 15`__ Create a RepeatedKFold Instance where the number of splits is 6 (`n_splits=6`) and the number of times cross-validator needs to be repeated is 2 (`n_repeats=2`)  and name it as __rkf__

In [17]:
# DO IT
rkf = RepeatedKFold(n_splits=6, n_repeats=2)

__`Step 16`__ Call the function __avg_score_LR__ and check the average score for the train and the test sets using __rkf__

In [18]:
# DO IT
avg_score_LR(rkf, data, target)

Training mean accuracy for each model: [0.78125, 0.778125, 0.76875, 0.775, 0.7765625, 0.7875, 0.778125, 0.771875, 0.78125, 0.8, 0.7671875, 0.7796875]

Test mean accuracy for each model: [0.7890625, 0.765625, 0.7890625, 0.7421875, 0.7890625, 0.7578125, 0.78125, 0.796875, 0.7734375, 0.6953125, 0.828125, 0.7109375]

Train average value:0.78+/-0.01

Test average value:0.77+/-0.04


<a class="anchor" id="loo">

## 1.4. Leave One Out

</a>

__`Step 17`__ Do the same steps you applied on the previous techniques, but this time using the Leave One Out. For that, you need to import __LeaveOneOut__ from __sklearn.model_selection__

In [19]:
# DO IT
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
avg_score_LR(loo, data, target)

Training mean accuracy for each model: [0.7770534550195567, 0.7796610169491526, 0.7757496740547588, 0.7770534550195567, 0.7809647979139505, 0.7796610169491526, 0.7848761408083442, 0.7835723598435462, 0.7809647979139505, 0.7822685788787483, 0.7809647979139505, 0.7809647979139505, 0.7809647979139505, 0.7822685788787483, 0.7835723598435462, 0.7796610169491526, 0.7822685788787483, 0.7835723598435462, 0.7783572359843546, 0.7822685788787483, 0.7809647979139505, 0.7822685788787483, 0.7783572359843546, 0.7835723598435462, 0.7822685788787483, 0.7835723598435462, 0.7822685788787483, 0.7757496740547588, 0.7809647979139505, 0.7809647979139505, 0.7848761408083442, 0.7822685788787483, 0.7822685788787483, 0.7861799217731421, 0.7822685788787483, 0.7796610169491526, 0.7861799217731421, 0.7835723598435462, 0.7822685788787483, 0.7809647979139505, 0.7848761408083442, 0.7848761408083442, 0.788787483702738, 0.7783572359843546, 0.7770534550195567, 0.7835723598435462, 0.7822685788787483, 0.7861799217731421, 0

<a class="anchor" id="other">

## 1.5. Stratified k-fold and others

</a>


scikit-learn offers several other splitting strategies, and they are applied in much the same way as the ones we saw previously.

<img src="model_selection.png" alt="Drawing" style="width: 800px;"/> <br>
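As one example of these alternatives, the sketch below shows `StratifiedKFold`, which keeps the class proportions roughly constant across folds. The target is a toy, imbalanced array invented for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy, imbalanced target: 8 negatives and 4 positives (hypothetical data).
y_toy = np.array([0] * 8 + [1] * 4)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=4)

positives_per_fold = []
# Unlike KFold, the split needs the target in order to balance the classes.
for train_index, test_index in skf.split(X_toy, y_toy):
    positives_per_fold.append(int(y_toy[test_index].sum()))
    print('test:', test_index, '| positives in test fold:', y_toy[test_index].sum())

print(positives_per_fold)  # one positive in each of the four test folds
```

Note that `StratifiedKFold.split` requires the target as a second argument, so helper functions that call `split(X)` only, such as `avg_score_LR`, would need to pass `y` as well before being used with it.
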

## Comparing models

Don't forget that the purpose of this notebook is to compare different models. In the following steps, you are going to fit a Decision Tree model to the same data and use __KFold__ to compare its performance with that of the Logistic Regression.

__`Step 21`__ Import __DecisionTreeClassifier__ from __sklearn.tree__

In [20]:
# DO IT
from sklearn.tree import DecisionTreeClassifier

__`Step 22`__ Similarly to step 12, create a function named __avg_score_DT__ that will return the average score value for the train and the test set and the standard deviation by using a Decision Tree Classifier. This will have as parameters the splitting technique you are going to use, your independent variables and your target.


In [21]:
# DO IT
def avg_score_DT(method,X,y):
    score_train = []
    score_test = []
    for train_index, test_index in method.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        model = DecisionTreeClassifier().fit(X_train, y_train)
        value_train = model.score(X_train, y_train)
        value_test = model.score(X_test,y_test)
        score_train.append(value_train)
        score_test.append(value_test)

    print('Training mean accuracy for each model:', score_train)
    print('\nTest mean accuracy for each model:', score_test)
    print('\nTrain average value:' +  str(round(np.mean(score_train),2)) + '+/-' + str(round(np.std(score_train),2)))
    print('\nTest average value:' +  str(round(np.mean(score_test),2)) + '+/-' + str(round(np.std(score_test),2)))

__`Step 24`__ Apply KFold to the data using `n_splits = 10` and check the performance of the DecisionTree you created by calling the function __avg_score_DT__

In [22]:
# DO IT

print('Logistic Regression')
kf_lr = KFold(n_splits=10)
avg_score_LR(kf_lr, data, target)

print('\nDecision Tree')

kf_dt = KFold(n_splits=10)
avg_score_DT(kf_dt, data, target)

Logistic Regression
Training mean accuracy for each model: [0.7988422575976846, 0.7756874095513748, 0.7756874095513748, 0.788712011577424, 0.7829232995658466, 0.7901591895803184, 0.768451519536903, 0.7742402315484804, 0.7933526011560693, 0.7803468208092486]

Test mean accuracy for each model: [0.7012987012987013, 0.8441558441558441, 0.7532467532467533, 0.6883116883116883, 0.7922077922077922, 0.7402597402597403, 0.8571428571428571, 0.8181818181818182, 0.7368421052631579, 0.8026315789473685]

Train average value:0.78+/-0.01

Test average value:0.77+/-0.06

Decision Tree
Training mean accuracy for each model: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Test mean accuracy for each model: [0.6363636363636364, 0.7792207792207793, 0.6883116883116883, 0.5974025974025974, 0.6753246753246753, 0.7142857142857143, 0.7272727272727273, 0.8051948051948052, 0.6447368421052632, 0.6710526315789473]

Train average value:1.0+/-0.0

Test average value:0.69+/-0.06


<a class="anchor" id="measures">

## `2. Performance Measures`
    
</a>

Until this point we have only used the __'score()'__ method, which is available for every model in sklearn. <br>
This __'score()'__ method returns the mean accuracy for Classification problems and the R^2 for Regression problems. <br>
But there are other metrics that may suit our problem better.

* [2.1 - Regression Problems](#regress)
* [2.2 - Classification Problems](#class)

<a class="anchor" id="regress">

## 2.1. Regression Problems

</a>

* [2.1.1 - $R^{2}$ Score](#rsquare)
* [2.1.2 - Adjusted $R^{2}$ Score](#adjusted)
* [2.1.3 - MAE](#mae)
* [2.1.4 - RMSE](#mse)
* [2.1.5 - MedAE](#medae)

Import the needed libraries:

In [23]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, median_absolute_error

__`Step 1`__ Import the dataset __Boston.csv__ and define as data the independent variables and target the dependent variable (last column) 

In [24]:
boston = pd.read_csv(r'./Datasets/Boston.csv')
data_boston = boston.iloc[:,:-1]
target_boston = boston.iloc[:,-1]

__`Step 2`__ By using the method train_test_split from sklearn.model_selection, split your dataset into train(80%) and validation(20%).

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(data_boston, 
                                                    target_boston, 
                                                    test_size=0.2, 
                                                    random_state=15, 
                                                    shuffle=True, 
                                                   )

__`Step 3`__ Create an instance of LinearRegression named as lr with the default parameters and fit to your train data.

In [26]:
lr = LinearRegression().fit(X_train,y_train)

__`Step 4`__ Now that you have your model created, assign the predictions to y_pred, using the method predict().

In [27]:
y_pred = lr.predict(X_val)
y_pred 

array([28.80383915, 40.26470606, 23.1629741 , 22.73903454, 26.60248165,
        6.78850771, 17.98737207, 12.90645395, 28.13880473, 15.96300504,
       17.58476405, 22.54993315, 15.57583198, 16.42813922, 20.85701954,
       14.4238478 ,  8.59570996,  7.00268049, 21.90974047, 10.41836313,
       38.99970045, 13.10069505, 23.60170542, 19.36745226, 19.4704504 ,
       19.44473926, 26.81161139, 21.94644687, 19.910743  , 19.55839769,
       21.33408116,  7.97494586, 20.91117634, 20.17838513, 23.55157079,
       19.3060909 , 24.34755999, 28.33114956, 20.98210245, 18.08903855,
       28.5614124 , 36.5386986 , 20.20828082, 27.06956955, 26.23745421,
       21.00792914, 21.1962516 , 30.55209364, 24.88050603, 20.75515688,
       30.57871029, 15.35275076, 14.12154202, 13.92054419, 17.58306333,
       30.23390841,  7.78156918, 29.50907892, 16.69885153, 26.35705786,
       17.51457779, 27.86328712, 18.91817276, 29.7953683 , 34.10098499,
       20.5512098 , 23.29515547, 18.68891381, 25.08112538, 19.35

<a class="anchor" id="rsquare">
    
### 2.1.1. $R^{2}$ Score

</a>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score'>sklearn.metrics.r2_score(y_true, y_pred, ... )</a>

__Definition:__ <br>
R^2 (coefficient of determination) regression score function.

__Interpretation:__ <br>
Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). 

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated target values; <br>
...
</div>

__`Step 5`__ Check the R^2 score of the model you created previously

In [28]:
r2_score(y_val, y_pred)

0.6920749038652123

__When to use?__ <br>
When we want to measure the amount of variance in the target variable that can be explained by our model. <br>
It gives the degree of variability in the target variable that is explained by the model or the independent variables. <br>
If this value is 0.7, then it means that the independent variables explain 70% of the variation in the target variable.
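The interpretation above can be checked by computing R^2 by hand as one minus the ratio between the residual and the total sum of squares. This is a minimal sketch on made-up numbers, not on the Boston data:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))  # both are about 0.925
```

A model that always predicts the mean of `y_true` would get `ss_res == ss_tot` and therefore a score of 0; a model worse than that gets a negative score.
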

<a class="anchor" id="adjusted">
    
### 2.1.2. Adjusted $R^{2}$ Score

</a>

There is no direct way to obtain the adjusted R^2 using sklearn, but we can apply the formula:
<img src="adj_r2.png" alt="Drawing" style="width: 300px;"/> <br>


where n stands for the sample size and p for the number of the regressors.
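For reference, the formula can be written out as follows, with $n$ and $p$ as defined above; this form is consistent with the `adj_r2` function used in Step 6:

```latex
R^{2}_{adj} = 1 - \left(1 - R^{2}\right)\frac{n - 1}{n - p - 1}
```

Since $\frac{n-1}{n-p-1} \geq 1$ whenever $p \geq 0$, the adjusted value is never larger than $R^{2}$ when $R^{2} \leq 1$, and the gap grows as more regressors are added for the same sample size.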

__`Step 6`__ Calculate the Adjusted R^2 Score for your model.

In [29]:
# DO IT
r2 = r2_score(y_val, y_pred)
n = len(y_val)
p = len(X_train.columns)

def adj_r2 (r2,n,p):
    return 1-(1-r2)*(n-1)/(n-p-1)

adj_r2(r2,n,p)

0.6465859692089369

__When to use?__ <br>
When we want to measure the amount of variance in the target variable that can be explained by our model. <br>
This is a form of R-squared that is adjusted for the number of terms in the model. <br>
Tries to avoid the problem associated with R-squared:  even if we are adding redundant variables to the data, the value of R-squared does not decrease - it either remains the same or increases with the addition of new independent variables.

__Then what is the advantage of $R^{2}$?__ <br>
It has a direct interpretation as the proportion of variance in the dependent variable that is accounted for by the model.



***
    
However, in some cases we are more interested in quantifying the error in the same unit as the target variable. We can use metrics like MAE, RMSE and MedAE for that.
    
***

<a class="anchor" id="mae">
    
### 2.1.3. MAE (Mean absolute error)

</a>

<img src="mae.png" alt="Drawing" style="width: 200px;"/>

__`Step 7`__ Check the MAE of the model you created previously

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error'>sklearn.metrics.mean_absolute_error(y_true, y_pred, ... )</a>

__Definition:__ <br>
Mean absolute error regression loss.

__Interpretation:__ <br>
Best possible value is 0.0. MAE is always non-negative.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated target values; <br>
...
</div>

In [30]:
mean_absolute_error(y_val, y_pred)

3.686086823380285

__When to use?__ <br>
It measures the average magnitude of the errors in a set of predictions, without considering their direction.

<a class="anchor" id="mse">
    
### 2.1.4. RMSE (Root Mean squared error)

</a>

<img src="rmse.png" alt="Drawing" style="width: 250px;"/>

__`Step 8`__ Check the RMSE of the model you created previously

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error'>sklearn.metrics.mean_squared_error(y_true, y_pred, ... )</a>

__Definition:__ <br>
Mean squared error regression loss.

__Interpretation:__ <br>
Best possible value is 0.0. MSE is always non-negative.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated target values; <br>
...
</div>

In [31]:
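mean_squared_error(y_val, y_pred)

23.81224546508083

Note that this call returns the MSE, expressed in squared units of the target. The RMSE is its square root, approximately 4.88 here; it can be obtained with `np.sqrt(mean_squared_error(y_val, y_pred))` or, in recent scikit-learn versions, with `sklearn.metrics.root_mean_squared_error`.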

__When to use?__ <br>
Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE should be more useful when large errors are particularly undesirable.

__MAE vs. RMSE__ <br>
RMSE has the benefit of penalizing large errors more so can be more appropriate in some cases, for example, if being off by 20 is more than twice as bad as being off by 10. But if being off by 20 is just twice as bad as being off by 10, then MAE is more appropriate.
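The difference is easy to see on a small, made-up example: two sets of predictions with the same total absolute error, one spread evenly and one concentrated in a single large miss. This is only an illustrative sketch, not part of the exercise:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.zeros(4)
# Two hypothetical sets of predictions with the same total absolute error (20).
y_even = np.array([5.0, 5.0, 5.0, 5.0])      # four errors of 5
y_outlier = np.array([0.0, 0.0, 0.0, 20.0])  # a single error of 20

for name, y_hat in [('even', y_even), ('outlier', y_outlier)]:
    mae = mean_absolute_error(y_true, y_hat)
    rmse = np.sqrt(mean_squared_error(y_true, y_hat))
    print(name, '| MAE:', mae, '| RMSE:', rmse)

# even    -> MAE 5.0, RMSE 5.0
# outlier -> MAE 5.0, RMSE 10.0
```

The MAE cannot tell the two cases apart, while the RMSE doubles when the error is concentrated in one observation.
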

<a class="anchor" id="medae">
    
### 2.1.5. MedAE (Median absolute error)

</a>

__`Step 9`__ Check the MedAE score of the model you created previously

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.median_absolute_error.html#sklearn.metrics.median_absolute_error'>sklearn.metrics.median_absolute_error(y_true, y_pred, ... )</a>

__Definition:__ <br>
Median absolute error regression loss

__Interpretation:__ <br>
Best possible value is 0.0. MedAE is always non-negative.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated target values; <br>
...
</div>

In [32]:
median_absolute_error(y_val, y_pred)

2.82096468962707

__When to use?__ <br>
Because it uses the median instead of the mean, MedAE is robust to outliers: a few very large errors barely change its value.
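A short sketch on invented numbers shows how the median reacts to one very large error compared with the mean:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, median_absolute_error

# Hypothetical values: four small errors and one very large one.
y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_hat = np.array([11.0, 11.0, 15.0, 15.0, 68.0])  # absolute errors: 1, 1, 1, 1, 50

mae = mean_absolute_error(y_true, y_hat)
medae = median_absolute_error(y_true, y_hat)

print('MAE:', mae)      # 10.8, pulled up by the single large error
print('MedAE:', medae)  # 1.0, unaffected by it
```

This robustness is useful when a few extreme observations should not dominate the evaluation, but it also means MedAE will not warn you about occasional very large misses.
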

<a class="anchor" id="class">

## 2.2. Classification Problems
</a>

* [2.2.1 - The confusion matrix](#confusion)
* [2.2.2 - The accuracy Score](#accuracy)
* [2.2.3 - The precision](#precision)
* [2.2.4 - The recall](#recall)
* [2.2.5 - The F1 Score](#f1)
* [2.2.6 - The Classification Report](#cr)


Now we are going to apply some classification metrics to the diabetes dataset. For that, we are going to import the needed packages from `sklearn.metrics`. <br>
<br>The sklearn library offers a wide range of metrics for this situation. We are going to see the most used ones. 

In [33]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

__`Step 10`__ By using the method train_test_split from sklearn.model_selection, split your dataset `diabetes` into train(70%) and validation(30%).

In [34]:
X_train, X_val, y_train, y_val = train_test_split(data, 
                                                  target, 
                                                  test_size = 0.3, 
                                                  random_state=5, 
                                                  stratify = target)

__`Step 11`__ Create an instance of LogisticRegression named as __log_model__ with the default parameters and fit to your train data.

In [35]:
log_model = LogisticRegression().fit(X_train, y_train)

__`Step 12`__ Now that you have your model created, assign the predictions to y_pred, using the method predict().

In [36]:
# DO IT
y_pred = log_model.predict(X_val)
y_pred

array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0], dtype=int64)

<a class="anchor" id="confusion">
    
### 2.2.1. The confusion matrix

</a>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix'>sklearn.metrics.confusion_matrix(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute confusion matrix to evaluate the accuracy of a classification

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values.; <br>
_y_pred_: Estimated targets as returned by a classifier.; <br>
...
</div>

__`Step 13`__ Obtain the confusion matrix

In [37]:
confusion_matrix(y_val, y_pred)

array([[133,  17],
       [ 38,  43]], dtype=int64)

The confusion matrix in sklearn is presented in the following format: <br>
[ [ TN  FP  ] <br>
    [ FN  TP ] ]
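Given this layout, the four cells can be unpacked and used to compute the classification metrics by hand. The sketch below reuses the matrix printed in Step 13; the results should agree with the scores returned by the sklearn functions in the following subsections:

```python
import numpy as np

# Confusion matrix obtained in Step 13.
cm = np.array([[133, 17],
               [38, 43]])

# For a binary problem, ravel() flattens the matrix row by row.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # correct predictions over all predictions
precision = tp / (tp + fp)          # correct positives over predicted positives
recall = tp / (tp + fn)             # correct positives over actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.7619 0.7167 0.5309 0.6099
```
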

<a class="anchor" id="accuracy">
    
### 2.2.2. The accuracy score

</a>

<img src="accuracy.png" alt="Drawing" style="width: 300px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score'>sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True,...)</a>

__Definition:__ <br>
Accuracy classification score.

__Interpretation:__ <br>
If normalize is True, then the best performance is 1. When normalize = False, then the best performance is the number of samples.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values.; <br>
_y_pred_: Estimated targets as returned by a classifier.; <br>
_normalize_: If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples. <br>
...
</div>

__`Step 14`__ Get the accuracy score

In [38]:
# DO IT
accuracy_score(y_val, y_pred)

0.7619047619047619

Is accuracy always a good option? Let's check with an example:

<img src="example_1.png" alt="Drawing" style="width: 400px;"/>

In this case, what is the accuracy?

<img src="example_2.png" alt="Drawing" style="width: 300px;"/>

We have an accuracy of 99.1%, which is very high! That is great, right? <br>
Well, not really...<br>
Imagine that we are testing people who potentially have covid. A positive person is someone who is sick and carrying a virus that can spread very quickly, so the cost of misclassifying an actual positive (a false negative) is very high!
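The same effect can be reproduced with a few lines of code. The numbers below are hypothetical (1000 people, 9 of them positive), chosen only to give an accuracy similar to the one above; the "model" simply predicts negative for everyone:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical screening data: 1000 people, only 9 of them actually positive.
y_true = np.array([1] * 9 + [0] * 991)

# A useless "model" that predicts negative for everyone.
y_all_negative = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_all_negative)
rec = recall_score(y_true, y_all_negative)

print('accuracy:', acc)  # 0.991
print('recall:', rec)    # 0.0
```

The accuracy looks excellent even though not a single sick person is detected, which is why accuracy alone can be misleading on imbalanced data.
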

<a class="anchor" id="precision">
    
### 2.2.3. The precision

</a>

<img src="precision.png" alt="Drawing" style="width: 200px;"/>


<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score'>sklearn.metrics.precision_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the precision.

__Interpretation:__ <br>
The best value is 1, and the worst value is 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values.; <br>
_y_pred_: Estimated targets as returned by a classifier.; <br>
...
</div>

__`Step 15`__ Get the precision score

In [39]:
precision_score(y_val, y_pred)

0.7166666666666667

If you look at the confusion matrix, you can verify that precision is only concerned with the observations that were predicted as positive:
    
<img src="example_3.png" alt="Drawing" style="width: 400px;"/>

So precision tells us how precise our model is: out of all the observations predicted as positive, how many are actually positive.

__When to use?__

`When the cost of False Positives is high.` <br>
For example, in email spam detection, where a negative is considered not spam and a positive is a spam email. <br>
A false positive is an email that is classified as spam when in reality it is not: the user can lose potentially important information if the precision of the spam detection model is not high.

<a class="anchor" id="recall">
    
### 2.2.4. The recall

</a>
<img src="recall.png" alt="Drawing" style="width: 180px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score'>sklearn.metrics.recall_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the recall.

__Interpretation:__ <br>
The best value is 1 and the worst value is 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values.; <br>
_y_pred_: Estimated targets as returned by a classifier.; <br>
...
</div>

__`Step 16`__ Get the recall score

In [40]:
recall_score(y_val, y_pred)

0.5308641975308642

Looking at the confusion matrix:
    
<img src="example_4.png" alt="Drawing" style="width: 400px;"/>

Recall measures how many of the actual positives our model captures by labeling them as positive (true positives).

__When to use?__

`When the cost of False Negatives is high.` <br>
For example, in the Covid testing example above, if a sick patient (actual positive) takes the test and is predicted as not sick (predicted negative), the risk is extremely high because the disease is contagious.

<a class="anchor" id="f1">
    
### 2.2.5. The F1 Score

</a>

<img src="f1.png" alt="Drawing" style="width: 270px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score'>sklearn.metrics.f1_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the F1 score, also known as balanced F-score or F-measure.

__Interpretation:__ <br>
F1 score reaches its best value at 1 and worst score at 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values.; <br>
_y_pred_: Estimated targets as returned by a classifier.; <br>
...
</div>

__`Step 17`__ Get the F1 Score

In [41]:
f1_score(y_val, y_pred)

0.6099290780141844

__When to use?__

F1 Score should be used when you want to seek a balance between Precision and Recall and if there is an uneven class distribution (large number of Actual Negatives).

<a class="anchor" id="cr">
    
### 2.2.6. The Classification Report

</a>

__`Step 18`__ To evaluate the results, we are also going to use the classification report. <br>
Import __classification_report__ from __sklearn.metrics__

In [42]:
# DO IT
from sklearn.metrics import classification_report

__`Step 19`__ Create a function named `metrics` that prints the classification report and the confusion matrix for both datasets (train and validation)

In [43]:
def metrics(y_train, pred_train , y_val, pred_val):
    print('___________________________________________________________________________________________________________')
    print('                                                     TRAIN                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_train, pred_train))
    print(confusion_matrix(y_train, pred_train))


    print('___________________________________________________________________________________________________________')
    print('                                                VALIDATION                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_val, pred_val))
    print(confusion_matrix(y_val, pred_val))

__`Step 20`__ Create an object named __labels_train__ that will contain the predicted values for the train set and another one named __labels_val__ that will contain the predicted values for the validation set.

In [44]:
labels_train = log_model.predict(X_train)
labels_val = log_model.predict(X_val)

__`Step 21`__ Call the function metrics() defined previously, and define the arguments: <br> (`y_train = y_train`, `pred_train = labels_train` , `y_val = y_val`, `pred_val = labels_val`)

In [45]:
# DO IT
metrics(y_train = y_train, pred_train = labels_train, y_val = y_val, pred_val = labels_val)

___________________________________________________________________________________________________________
                                                     TRAIN                                                 
-----------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.81      0.90      0.85       350
           1       0.76      0.60      0.67       187

    accuracy                           0.80       537
   macro avg       0.78      0.75      0.76       537
weighted avg       0.79      0.80      0.79       537

[[314  36]
 [ 74 113]]
___________________________________________________________________________________________________________
                                                VALIDATION                                                 
-----------------------------------------------------------------------------------------------------------
  