# Model Evaluation

In order to properly determine whether your model is good, it's important to use the appropriate <b>performance measures</b>. Naturally, the measures used in regression problems are different from the ones used in classification problems:

* [1 - Regression Problems](#regression)
* [2 - Classification Problems](#classification)

But first, let's import pandas, train_test_split, the Linear Regression model from sklearn, and the metrics to be used in regression:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

#Regression problem model and metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, \
    mean_squared_error, median_absolute_error



<a class="anchor" id="regression">

## 1. Regression Problems
</a>

The objective in regression problems is to predict the values of a continuous feature, and the metrics used to measure a model's performance gauge the difference between the real values and the predicted values in several different ways. The measures discussed in this notebook are the following:

* $R^{2}$ Score
* Adjusted $R^{2}$ Score
* MAE
* RMSE
* MedAE
 
<b>1. Import the data located in 'Datasets/insurance.csv' with pandas and store it in a variable.</b>

In [3]:
insurance = pd.read_csv('Datasets/insurance.csv')

In [5]:
insurance.head()

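| | age | bmi | children | female | northeast | northwest | southeast | smoker | charges |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 27.9 | 0 | 1 | 0 | 0 | 0 | 1 | 16.884924 |
| 1 | 18 | 33.77 | 1 | 0 | 0 | 0 | 1 | 0 | 1.725552 |
| 2 | 28 | 33.0 | 3 | 0 | 0 | 0 | 1 | 0 | 4.449462 |
| 3 | 33 | 22.705 | 0 | 0 | 0 | 1 | 0 | 0 | 21.984471 |
| 4 | 32 | 28.88 | 0 | 0 | 0 | 1 | 0 | 0 | 3.866855 |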


<b>2. Store the independent variables in a variable called `data`, and store the dependent variable (last column) in a variable called `target`</b>

In [8]:

data = insurance.iloc[:,:-1]
target = insurance.iloc[:,-1]

<b>3. By using the method train_test_split from sklearn.model_selection, split your dataset into a training set and a validation set, with the training set having 80% of the data.</b>

In [10]:

X_train, X_val, y_train, y_val = train_test_split(data, 
                                                    target, 
                                                    train_size=0.8, 
                                                    random_state=15, 
                                                    shuffle=True, 
                                                   )

<b>4. Create an instance of LinearRegression named `lr` with the default parameters and fit to the training dataset.</b>

In [13]:
lr = LinearRegression().fit(X_train,y_train)

<b>5. Assign the predictions to `y_pred`, using the method `.predict()`.</b>

In [16]:
y_pred = lr.predict(X_val)
y_pred

array([33.89021746, 25.23316771,  3.7913586 ,  3.2364895 ,  2.7537859 ,
        8.03319434,  0.9868359 , 34.99635894,  8.44293274,  8.82010937,
        3.89551369,  6.21106335, 35.977644  , 32.76835405,  5.51916539,
       37.35753637, 27.19740132,  9.38117556, 30.15399798,  8.23776932,
        5.44336484,  9.75778818,  3.38066877, 18.56202604, 11.84739888,
        8.69127558,  7.68321386, 39.06950138,  3.7146851 , -0.04833778,
        6.88226407,  9.27097592,  5.48424601, 41.11911931,  6.97111106,
        5.57771534,  5.45822851,  3.82722533,  6.3289568 , 11.56151731,
        7.1672574 , 10.78062088, 15.75070435,  2.72546684, 11.14961901,
       11.23945479, 11.66962825,  6.45120447,  9.67410638, 28.65483353,
        1.28648507,  2.15791909,  9.00472208, 10.26646054, 12.9356007 ,
       26.76668774,  1.04292582, 12.36693775,  4.9768755 , 15.60985497,
        6.43266019, 30.69092701, 25.07199352,  5.49482916, 34.37535533,
       36.9069539 ,  4.98718735, 11.00109822, 11.49822537, 33.64


### 1.1. $R^{2}$ Score


sklearn documentation: <a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score'>sklearn.metrics.r2_score(y_true, y_pred, ... )</a>

The $R^{2}$ score, also known as the coefficient of determination, measures how much of the variance in the target feature is explained by the model. Its value corresponds to the proportion of the target feature's variability that the model accounts for through the independent variables. Its formula is the following:<br>

$$R^{2} = 1-\frac{\sum_i{(y_{i}-\hat{y_{i}})^{2}}}{\sum_i{(y_{i}-\bar{y})^{2}}} = 1-\frac{SS_{Error}}{SS_{Total}}$$

* $y_{i}$ - a real value
* $\hat{y_{i}}$ - a predicted value
* $\bar{y}$ - average of all real values<br>


$R^{2}$'s highest possible value is 1: when $R^{2}$ = 1, the entirety of the target feature's variance is accounted for. It can also be negative, since a model can be arbitrarily worse than one that always predicts the mean $\bar{y}$. <br>
In Python, to calculate the $R^{2}$ we call the following <i>sklearn</i> function: `r2_score(real values,predictions)`
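
To make the formula concrete, here is a minimal sketch that computes $R^{2}$ by hand on a few made-up values. The function name `r2_from_formula` and the toy lists are assumptions for illustration only; they are not part of sklearn or of the insurance dataset.

```python
def r2_from_formula(y_true, y_pred):
    """R^2 = 1 - SS_Error / SS_Total, written out term by term."""
    mean_true = sum(y_true) / len(y_true)
    ss_error = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_total = sum((y - mean_true) ** 2 for y in y_true)
    return 1 - ss_error / ss_total

# Hypothetical toy values (mean of the real values is 6)
y_true_toy = [3, 5, 7, 9]

good = r2_from_formula(y_true_toy, [2.5, 5.5, 7, 10])  # close predictions
baseline = r2_from_formula(y_true_toy, [6, 6, 6, 6])   # always predict the mean
bad = r2_from_formula(y_true_toy, [9, 7, 5, 3])        # predictions in reverse order

print(good, baseline, bad)
```

Predicting the mean for every observation gives a score of 0, and predictions that are worse than the mean push the score below 0, which is why the metric has no lower bound.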

<b>6. Check the R^2 score of the model you created previously.</b>

In [8]:
r2_score(y_val, y_pred)

0.7709928565663494


### 1.2. Adjusted $R^{2}$ Score


There is no direct way to obtain the adjusted R^2 using sklearn, but we can apply the formula:

$$\bar{R^{2}} = 1-(1-R^{2})*\frac{n-1}{n-p-1}$$
with 

* $n$ - amount of observations
* $p$ - amount of features <br>

The adjusted $R^{2}$ score is a better option when we want to measure the amount of variance in the target variable that can be explained by our model. <b><i>But why?</i></b> <br>
If extra features are added to our data, the regular $R^{2}$ score of a linear regression, measured on the data it was fitted to, may increase but never decreases, even if the new features are redundant. The adjusted $R^{2}$ score accounts for this since the number of features in the model is used in its calculation. Therefore, it's a more accurate measure of the proportion of variance of the dependent variable that is accounted for by the model. <br>
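
The penalty can be seen by keeping $R^{2}$ and $n$ fixed and only changing $p$. The sketch below uses made-up numbers (`r2_fixed`, `n_obs`) and a helper named `adjusted_r2`, all of which are assumptions for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p features."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical values: the same R^2 obtained with more and more features
r2_fixed = 0.80
n_obs = 50

for p in (1, 5, 20):
    print(p, round(adjusted_r2(r2_fixed, n_obs, p), 4))
```

With the same $R^{2}$, the adjusted score drops as the number of features grows, so a feature only pays off if it raises $R^{2}$ by more than the penalty it adds.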

<b>7. Calculate the Adjusted R^2 Score for your model.</b>

In [9]:
#single row
print(1 - ((1 - r2_score(y_val, y_pred))*(len(y_val) - 1))/(len(y_val) - len(X_train.columns) - 1))

0.763919276846391


In [10]:
# Using a function:
r2 = r2_score(y_val,y_pred)
n = len(y_val)
p = len(X_train.columns)

def adjr2(r2,n,p):
    return 1-(1-r2)*(n-1)/(n-p-1)

adjr2(r2,n,p)

0.763919276846391

****
However, in some cases we are more interested in quantifying the error in the same unit of measurement as the target variable. For that we can use metrics like the MAE, RMSE and MedAE. These metrics aren't appropriate for telling how well the independent features explain the dependent feature.



### 1.3. MAE (Mean absolute error)


sklearn documentation: <a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error'>sklearn.metrics.mean_absolute_error(y_true, y_pred, ... )</a>

The MAE measures the average magnitude of the errors in a set of predictions, without considering their direction.<br>
The MAE is always non-negative. The lower its value, the better the model.<br>
In Python, to calculate the MAE we call the following <i>sklearn</i> function: `mean_absolute_error(real values,predictions)`

$$MAE = \frac{1}{n}*\sum_i{| y_{i}-\hat{y_{i}}|}$$

<b>8. Check the MAE of the model you created previously.</b>

In [11]:
mean_absolute_error(y_val, y_pred)

3.9285478094354755

   
### 1.4. RMSE (Root Mean squared error)

sklearn documentation: <a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error'>sklearn.metrics.mean_squared_error(y_true, y_pred, ... )</a>

The RMSE, like the MAE, measures the average magnitude of the errors in a set of predictions, without considering their direction.<br>
It's also always non-negative. The lower its value, the better the model.<br>
In Python, to calculate the RMSE we call the following <i>sklearn</i> function and take the square root of its result, since on its own it returns the MSE: `mean_squared_error(real values,predictions)`<br><br>
In situations where **large errors** are particularly undesirable, it's preferable to use the RMSE over the MAE: the errors are squared before being averaged, so large errors weigh more heavily on the result. If large errors don't deserve extra weight, then the MAE is the better choice because it's easier to interpret.<br>

$$RMSE = \sqrt{\frac{1}{n}*\sum_i{( y_{i}-\hat{y_{i}})^{2}}}$$
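
The difference between the MAE and the RMSE shows up when the same total error is spread differently. The following is a minimal sketch with hand-written `mae` and `rmse` helpers and made-up predictions; none of these names come from the notebook.

```python
import math

def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the mean of the squared errors."""
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Hypothetical toy values
y_true_toy = [10, 10, 10, 10]
pred_even = [12, 12, 12, 12]   # four errors of size 2
pred_spike = [10, 10, 10, 18]  # a single error of size 8

print(mae(y_true_toy, pred_even), rmse(y_true_toy, pred_even))
print(mae(y_true_toy, pred_spike), rmse(y_true_toy, pred_spike))
```

Both sets of predictions have the same MAE, but the one that concentrates the error in a single observation has twice the RMSE.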

<b>9. Check the RMSE of the model you created previously </b>

In [31]:
mean_squared_error(y_val, y_pred)**0.5

5.4641281603458545


### 1.5. MedAE (Median absolute error)


sklearn documentation: <a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.median_absolute_error.html#sklearn.metrics.median_absolute_error'>sklearn.metrics.median_absolute_error(y_true, y_pred, ... )</a>

Like the last two metrics, the MedAE measures the typical magnitude of the errors in a set of predictions, without considering their direction, but it uses the median of the absolute errors instead of a mean.<br>
It's always non-negative. The lower its value, the better the model.<br>
In Python, to calculate the MedAE we call the following <i>sklearn</i> function: `median_absolute_error(real values,predictions)`<br><br>
This metric is used when the assessment of the model's performance should be robust to outliers, since a few very large errors barely change the median.<br>

$$MedAE = median(| y_{i}-\hat{y_{i}}|)$$
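
As a small illustration of that robustness, the sketch below compares the median and the mean of the absolute errors when one prediction is far off. The helper names and the toy values are assumptions made for this example.

```python
import statistics

def absolute_errors(y_true, y_pred):
    return [abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)]

def medae(y_true, y_pred):
    """Median of the absolute errors."""
    return statistics.median(absolute_errors(y_true, y_pred))

# Hypothetical toy values: the last prediction is an outlier
y_true_toy = [10, 20, 30, 40, 50]
y_pred_toy = [11, 22, 27, 44, 150]

errors = absolute_errors(y_true_toy, y_pred_toy)  # [1, 2, 3, 4, 100]
print(medae(y_true_toy, y_pred_toy))              # median absolute error
print(sum(errors) / len(errors))                  # mean absolute error
```

The single error of 100 dominates the mean but leaves the median at the middle error.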

<b>10. Check the MedAE score of the model you created previously.</b>

In [13]:
median_absolute_error(y_val, y_pred)

2.4817072426177704

<a class="anchor" id="classification">

## 2. Classification Problems
</a>

The objective in classification problems is to predict the discrete values of a non-continuous feature, and the metrics used to measure a model's performance are based on determining the amount of (in)correctly classified results. The measures discussed in this notebook are the following:

* Confusion matrix
* Accuracy Score
* Precision
* Recall
* F1-Score
* Classification report (includes most of the above)

Before going further, let's review and take note of a few concepts:

* <b>True Positive(TP)</b> - observation correctly classified as positive
* <b>True Negative(TN)</b> - observation correctly classified as negative 
* <b>False Positive(FP)</b> - observation incorrectly classified as positive
* <b>False Negative(FN)</b> - observation incorrectly classified as negative


<b>1. Import the Logistic Regression base model, and the metrics to be used for classification problems.</b>

In [33]:
#Classification problem model and metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, \
    precision_score, recall_score, f1_score, classification_report

<b>2. Import the dataset from 'Datasets/winequality.csv', and define the independent variables as `data_w` and the dependent variable ('quality') as `target_w`. </b>

In [36]:
winequality = pd.read_csv('Datasets/winequality.csv')
data_w = winequality.drop(['quality'], axis=1)
target_w = winequality['quality']

<b>3. By using the method train_test_split from sklearn.model_selection, split your dataset into a training set and a validation set, with the training set having 80% of the data. Set the random state to 15, and add `stratify = target_w`</b>

In [39]:
X_train, X_val, y_train, y_val = train_test_split(data_w, 
                                                  target_w, 
                                                  train_size = 0.8, 
                                                  random_state=15, 
                                                  stratify = target_w)

<b>4. Create an instance of LogisticRegression named as `log_model` with the default parameters, and fit it to your training data.</b>

In [42]:
log_model = LogisticRegression().fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


<b>5. Now that you have your model created, assign the predictions to `y_pred`, using the method `.predict()`.</b>

In [44]:
y_pred = log_model.predict(X_val)
y_pred

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,


### 2.1. The confusion matrix



sklearn documentation: <a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix'>sklearn.metrics.confusion_matrix(y_true, y_pred, ...)</a>

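The confusion matrix is composed of the counts of TP, TN, FP, and FN. In sklearn, the rows correspond to the real classes and the columns to the predicted classes, so for a binary problem the counts are positioned like this:

$$\begin{bmatrix}TN & FP\\
FN & TP
\end{bmatrix}$$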


<b>6. Obtain the confusion matrix</b>

In [50]:
confusion_matrix(y_val, y_pred)


array([[576,  30],
       [118,  60]], dtype=int64)
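
To see where each count comes from, here is a minimal sketch that tallies the four outcomes by hand for labels coded as 0 and 1, and arranges them with real classes as rows and predicted classes as columns. The function name `binary_confusion_matrix` and the toy labels are assumptions for illustration.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels coded as 0 (negative) and 1 (positive)."""
    tn = fp = fn = tp = 0
    for real, predicted in zip(y_true, y_pred):
        if real == 1 and predicted == 1:
            tp += 1
        elif real == 0 and predicted == 0:
            tn += 1
        elif real == 0 and predicted == 1:
            fp += 1
        else:  # real == 1 and predicted == 0
            fn += 1
    return [[tn, fp], [fn, tp]]

# Hypothetical toy labels
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]

matrix = binary_confusion_matrix(y_true_toy, y_pred_toy)
print(matrix)
```

Each row sums to the number of observations that really belong to that class, which is a quick way to check the orientation of any confusion matrix.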


### 2.2. The accuracy score


sklearn documentation: <a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score'>sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True,...)</a>

The accuracy score is the proportion of correct predictions from the model to the total amount of observations.  <br>
In Python, to calculate the accuracy we call the following <i>sklearn</i> function: `accuracy_score(real values,predictions, normalize = True)`. If normalize is True, the function returns the fraction of correctly classified samples, so the best performance is 1. When normalize is False, it returns the number of correctly classified samples, so the best performance is the number of samples. Either way, the higher the value, the better.<br>
The accuracy is generally a good performance measure, but it's inadequate in situations where the cost of certain kinds of errors is too high:
* Detection of viral illnesses: The objective is detecting positive cases, and the price of false negatives is too high, since said cases can spread the illness when they go undetected. In these cases the <i>recall</i> metric is very useful
* Detection of spam e-mails: E-mails incorrectly flagged as spam by the model (false positives) won't show up, and important information may potentially be lost. In these cases the <i>precision</i> metric is very useful

$$accuracy = \frac{TP+TN}{TP+TN+FP+FN}$$
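
The limitation is easy to reproduce with made-up data. In the sketch below, a hypothetical model labels every observation as negative on a dataset where only 5 of 100 cases are positive; the variable names and values are assumptions for illustration.

```python
# Hypothetical toy labels: 5 positive cases, 95 negative cases
y_true_toy = [1] * 5 + [0] * 95
# A "model" that never predicts the positive class
y_pred_toy = [0] * 100

correct = sum(real == predicted for real, predicted in zip(y_true_toy, y_pred_toy))
accuracy = correct / len(y_true_toy)

tp = sum(real == 1 and predicted == 1 for real, predicted in zip(y_true_toy, y_pred_toy))
fn = sum(real == 1 and predicted == 0 for real, predicted in zip(y_true_toy, y_pred_toy))
detected_share = tp / (tp + fn)  # share of real positives that were found

print(accuracy, detected_share)
```

The accuracy looks high even though not a single positive case is detected, which is why the cost of each kind of error has to be considered when choosing a metric.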

<b>7. Get the accuracy score</b>

In [56]:
accuracy_score(y_val,y_pred)

0.8112244897959183

  
### 2.3. The precision



sklearn documentation: <a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score'>sklearn.metrics.precision_score(y_true, y_pred, ...)</a>

The precision score is the proportion of correct positive predictions from the model to the total amount of positive predictions.  <br>
In Python, to calculate the precision we call the following <i>sklearn</i> function: `precision_score(real values,predictions)`.<br>
This metric is particularly good in the assessment of models where false positives are particularly costly, since true positives and false positives are the only two quantities that enter its formula.

$$precision = \frac{TP}{TP+FP}$$


<b>8. Get the precision score</b>

In [58]:
precision_score(y_val, y_pred)

0.6666666666666666


### 2.4. The recall


sklearn documentation: <a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score'>sklearn.metrics.recall_score(y_true, y_pred, ...)</a>

The recall score is the proportion of correct positive predictions from the model to the total amount of observations that are positive (true positives + false negatives).  <br>
In Python, to calculate the recall we call the following <i>sklearn</i> function: `recall_score(real values,predictions)`.<br>
This metric is particularly good in the assessment of models where false negatives are particularly costly, since true positives and false negatives are the only two quantities that enter its formula.

$$recall = \frac{TP}{TP+FN}$$
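
Precision and recall share the same numerator and differ only in the denominator. A minimal sketch with made-up counts follows; the helper names are assumptions, and returning 0.0 when a denominator is zero is a choice made for this sketch only.

```python
def precision_from_counts(tp, fp):
    """Share of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall_from_counts(tp, fn):
    """Share of real positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: a cautious model that rarely predicts "positive"
tp, fp, fn = 40, 10, 60

print(precision_from_counts(tp, fp))  # 40 of 50 positive predictions are right
print(recall_from_counts(tp, fn))     # 40 of 100 real positives are found
```

A model can therefore be precise while still missing most positive cases, which is why the two metrics are usually read together.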


<b>9. Get the recall score</b>

In [22]:
recall_score(y_val, y_pred)

0.34831460674157305


### 2.5. The F1 Score



sklearn documentation: <a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score'>sklearn.metrics.f1_score(y_true, y_pred, ...)</a>

The F1-score is the most balanced of the common performance measures for classification problems.  <br>
It's useful in two distinct, but not mutually exclusive situations:
* When you seek a balance between precision and recall (i.e. when both false positives and false negatives are important)
* When you have an uneven class distribution (a large number of negative cases)<br>

It's obvious how the F1-score helps with the first situation. In the second situation, the F1-score is helpful because, unlike the accuracy, it doesn't involve observations that are correctly classified as negative (true negatives). True negatives often have little to no significance in classification problems, but greatly influence the accuracy. So in cases where there's an uneven class distribution with lots of true negatives, the F1-score is advised because it's not influenced by them.<br><br>
In Python, to calculate the F1-score we call the following <i>sklearn</i> function: `f1_score(real values,predictions)`.<br>

$$F1 = 2*\frac{precision*recall}{precision+recall}$$
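
Two properties of the formula can be checked numerically: the F1-score is pulled towards the lower of the two inputs, and it can be rewritten using only TP, FP and FN. The sketch below uses made-up values, and the helper names are assumptions for illustration.

```python
def f1_from_scores(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_from_counts(tp, fp, fn):
    """Equivalent form of the F1-score; true negatives do not appear."""
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical, very unbalanced precision and recall
print((0.9 + 0.1) / 2)           # arithmetic mean
print(f1_from_scores(0.9, 0.1))  # F1-score is much lower

# Hypothetical counts: both forms agree
tp, fp, fn = 40, 10, 60
print(f1_from_scores(tp / (tp + fp), tp / (tp + fn)), f1_from_counts(tp, fp, fn))
```

Because the second form contains no TN term, adding more correctly classified negatives changes the accuracy but leaves the F1-score untouched.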


<b>10. Get the F1 Score</b>

In [60]:
f1_score(y_val, y_pred)

0.44776119402985076


### 2.6. The Classification report

sklearn documentation: <a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report'>sklearn.metrics.classification_report(y_true, y_pred, ...)</a>

The classification report shows the precision, recall, and F1-scores, rounded to 2 decimals, for our model. It shows these values for both possibilities: predicting the majority class and predicting the minority class. It also shows the overall accuracy.

<b>11. _Print_ the classification report</b>

In [62]:
print(classification_report(y_val, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.95      0.89       606
           1       0.67      0.34      0.45       178

    accuracy                           0.81       784
   macro avg       0.75      0.64      0.67       784
weighted avg       0.79      0.81      0.79       784
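
The `macro avg` row is the plain mean of the per-class values, while the `weighted avg` row weights each class by its support. As a rough check, the sketch below recomputes both for the precision column from the rounded values printed above; since the inputs are already rounded, the results are only approximate, and the dictionary names are assumptions for illustration.

```python
# Precision and support per class, copied from the report above (rounded values)
precision_by_class = {0: 0.83, 1: 0.67}
support_by_class = {0: 606, 1: 178}

# Macro average: every class counts the same
macro_avg = sum(precision_by_class.values()) / len(precision_by_class)

# Weighted average: every class counts in proportion to its support
total_support = sum(support_by_class.values())
weighted_avg = sum(
    precision_by_class[label] * support_by_class[label] for label in precision_by_class
) / total_support

print(round(macro_avg, 2), round(weighted_avg, 2))
```

When the classes are unbalanced, the weighted average sits closer to the majority class, so the macro average is the more telling of the two for the minority class.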



### 2.7. Whole analysis

Now, let's run a complete analysis, on training data and on validation data, using the classification report:

<b>12. Run the cell below to create a function named `metrics` that will print the results of the classification report and the confusion matrix for both datasets (train and validation)</b>


In [76]:
def metrics(y_train, pred_train , y_val, pred_val):
    print('_'*107)
    print('                                                     TRAIN                                                 \n')
    print(classification_report(y_train, pred_train))
    print('Confusion matrix:\n')
    print(confusion_matrix(y_train, pred_train))
    print('___________________________________________________________________________________________________________')
    
    print('___________________________________________________________________________________________________________')
    print('                                                VALIDATION                                                 \n')
    print(classification_report(y_val, pred_val))
    print('Confusion matrix:\n')
    print(confusion_matrix(y_val, pred_val))
    print('___________________________________________________________________________________________________________')

<b>13. Create an object named ``y_pred_train`` that will contain the predicted values for the training set. The predicted values for the validation set are already stored in ``y_pred``.</b>

In [79]:
y_pred_train = log_model.predict(X_train)

<b>14. Call the function `metrics()` defined previously, and define the arguments: <br> (`y_train = y_train`, `pred_train = y_pred_train` , `y_val = y_val`, `pred_val = y_pred`)</b>

In [82]:
metrics(y_train,y_pred_train,y_val,y_pred)

___________________________________________________________________________________________________________
                                                     TRAIN                                                 

              precision    recall  f1-score   support

           0       0.82      0.95      0.88      2424
           1       0.61      0.29      0.40       712

    accuracy                           0.80      3136
   macro avg       0.72      0.62      0.64      3136
weighted avg       0.77      0.80      0.77      3136

Confusion matrix:

[[2292  132]
 [ 504  208]]
___________________________________________________________________________________________________________
___________________________________________________________________________________________________________
                                                VALIDATION                                                 

              precision    recall  f1-score   support

           0       0.83      0

Sources: <br>
https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d <br>
https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

<br><br>
### That's all!