# Evaluating Models
by Evgeny Sushko

---
## Table of Contents:
1. Model evaluation applications
   - Generalization performance
   - Model selection
   - Algorithm selection
2. Model evaluation techniques
   - Holdout validation
   - K-fold cross-validation
3. Classification metrics
   - Accuracy
   - Confusion matrix
   - Precision & Recall
   - F1-score
   - Classification report
4. Choice of metrics
---

### 1. Model Evaluation Applications
Let's start with a question: **"Why do we care about performance estimates at all?"**

1.1. **Generalization performance** - We want to estimate the predictive performance of our model on future (unseen) data.
- Ideally, the estimated performance of a model tells **how well it performs on unseen data** – making predictions on future data is often the main problem we want to solve.

1.2. **Model selection** - We want to increase the predictive performance by tweaking the learning algorithm and selecting the best performing model from a given hypothesis space.
- Typically, machine learning involves a lot of experimentation. Running a learning algorithm over a training dataset with different hyperparameter settings and different features will result in different models. Since we are typically interested in **selecting the best-performing model** from this set, we need to find a way to estimate their respective performances in order to rank them against each other.

1.3. **Algorithm selection** - We want to compare different ML algorithms, selecting the best-performing one.
- We are usually not only experimenting with the one single algorithm that we think would be the “best solution” under the given circumstances. More often than not, we want to **compare different algorithms to each other**, oftentimes in terms of predictive and computational performance.

Although these three sub-tasks all share the goal of estimating a model's performance, each requires a different approach.

This tutorial will focus on **supervised learning**, a subcategory of machine learning where our target values are known in our available dataset. Although many concepts also apply to regression analysis, we will focus on **classification**, the assignment of categorical target labels to the samples.

---
### 2. Model Evaluation Techniques
#### 2.1. Holdout validation
The holdout method is the simplest model evaluation technique. We take our labeled dataset and split it randomly into two parts: a **training set** and a **test set**.
![](https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part1/testing_01.png)
Then, we fit a model to the training data and predict the labels of the test set.
![](https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part1/testing_02.png)
And the fraction of correct predictions constitutes our estimate of the prediction accuracy.
![](https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part1/testing_03.png)
We don’t want to train and evaluate our model on the same dataset, since the resulting estimate would be optimistically biased and would hide **overfitting**. In other words, we couldn’t tell whether the model simply memorized the training data or whether it generalizes well to new, unseen data.

In [107]:
# import data
import pandas as pd

df = pd.read_csv('movie_reviews.csv')
len(df)

152610

In [108]:
# split dataset
from sklearn.model_selection import train_test_split

X, y = df.text, df.label
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

In [111]:
# fit a model to the training data
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

vectorizer = CountVectorizer(binary=True)
classifier = LogisticRegression()

pipeline = Pipeline([('vectorizer', vectorizer),
                     ('classifier', classifier)])

model = pipeline.fit(X_train, y_train)

In [112]:
# predict the labels of the test set
y_pred = model.predict(X_test)

In [113]:
# calculate prediction accuracy
from sklearn import metrics

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.820064215975
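
To illustrate why a model should not be scored on its own training data, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the decision tree, and every `_demo`-style name are assumptions chosen for the demonstration and are not part of the movie review pipeline above. An unconstrained tree is used because it can memorize its training set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# An unconstrained tree keeps splitting until it fits the training set perfectly
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Score the same model on the data it has seen and on held-out data
train_acc = accuracy_score(y_tr, tree.predict(X_tr))
test_acc = accuracy_score(y_te, tree.predict(X_te))
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

The training accuracy should come out noticeably higher than the test accuracy, which is why only the held-out score is a usable estimate of generalization performance.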


#### 2.2. K-fold Cross-validation
K-fold Cross-validation is probably the most common technique for model evaluation and model selection. 
- We split the dataset into *K* parts (folds) and iterate over the dataset *K* times.
- In each round, one part is used for validation, and the remaining *K-1* parts are merged into a training subset for fitting the model.
![](https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part3/kfold.png)
- We compute the cross-validation performance as the arithmetic mean over the *K* performance estimates from the validation sets.
- It runs roughly *K* times slower than a simple train/test split, because the model is fitted *K* times.

In [79]:
## a cross-validation example should go here
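
A minimal sketch of K-fold cross-validation with scikit-learn could look like the following. It uses synthetic data from `make_classification` as a stand-in for the movie review dataset, and the names `X_demo`, `y_demo`, `kfold`, and `scores` are assumptions introduced for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption; the tutorial itself uses movie_reviews.csv)
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     random_state=42)

# K = 5: each round trains on 4 folds and validates on the remaining one
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_demo, y_demo, cv=kfold, scoring="accuracy")

# The cross-validation estimate is the arithmetic mean over the K folds
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())
```

Passing a `Pipeline` instead of the bare classifier should work the same way, and it keeps steps such as vectorization inside each training fold.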

### 3. Classification metrics overview
Classification problems are probably the most common type of ML problem and as such there are many metrics that can be used to evaluate predictions for these problems. We will review some of them.

#### 3.1. Accuracy
Accuracy simply measures *what percent of your predictions were correct*. It's the ratio between the number of correct predictions and the total number of predictions.

$$ \mbox{accuracy} = \frac{\mbox{# correct}}{\mbox{# predictions}} $$

This is the most common evaluation metric for classification problems and the easiest to understand.

In [80]:
# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred))

0.818032894306


Accuracy is also the most misused metric. It is really **only suitable** when there is an *equal number of observations in each class* (which is rarely the case) and when all *predictions and prediction errors are equally important* (which is often not the case).
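
A small sketch shows how accuracy can mislead on imbalanced data. The labels below are hypothetical and not taken from the movie review dataset: a "model" that always predicts the majority class scores high accuracy while catching no positive cases at all.

```python
# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1)
y_true_demo = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class
y_pred_demo = [0] * 100

# Accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
accuracy_demo = correct / len(y_true_demo)

# Number of positive cases the model actually caught
caught = sum(t == 1 and p == 1 for t, p in zip(y_true_demo, y_pred_demo))

print("Accuracy:", accuracy_demo)   # 0.95
print("Positives caught:", caught)  # 0
```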

#### 3.2. Confusion Matrix
The confusion matrix is a handy presentation of the accuracy of a model with two or more classes. In scikit-learn's layout, the table presents the **actual classes** as rows and the **predicted classes** as columns. Each cell holds the number of predictions that fall into that combination of actual and predicted class.

In [97]:
# first argument is true values, second argument is predicted values
# this produces a 2x2 numpy array (matrix)
conf = metrics.confusion_matrix(y_test, y_pred)
print(conf)

[[ 9324  3266]
 [ 2288 15644]]


|                | Predicted Negative | Predicted Positive |
|:--------------:|--------------------|--------------------|
| **Negative Cases** |      TN: 9324      |      FP: 3266      |
| **Positive Cases** |      FN: 2288      |      TP: 15644     |

- ##### True Positives (TP):
We correctly predicted that the reviews are positive: **15644**
- ##### True Negatives (TN):
We correctly predicted that the reviews are negative: **9324**
- ##### False Positives (FP):
We incorrectly predicted that the reviews are positive: **3266**
- ##### False Negatives (FN):
We incorrectly predicted that the reviews are negative: **2288**



The confusion matrix allows you to compute various classification metrics, and these metrics can guide your model selection.

In [105]:
# slice confusion matrix into four pieces for future use
TP = conf[1, 1]
TN = conf[0, 0]
FP = conf[0, 1]
FN = conf[1, 0]
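
As a quick consistency check, accuracy can be recomputed from the four cells. The sketch below copies the counts from the confusion matrix printed above; the `_demo` names are introduced here only so that the notebook's own variables are left untouched.

```python
# Counts copied from the confusion matrix printed above
TP_demo, TN_demo, FP_demo, FN_demo = 15644, 9324, 3266, 2288

# Every test sample lands in exactly one of the four cells
total = TP_demo + TN_demo + FP_demo + FN_demo

# Accuracy is the share of samples on the main diagonal (TN and TP)
accuracy_from_counts = (TP_demo + TN_demo) / total

print(total)
print(accuracy_from_counts)
```

The result should agree with the value that `metrics.accuracy_score` reported in section 3.1.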

You can learn more in the [Confusion matrix article on Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix).

#### 3.3. Precision & Recall
Precision and recall are two separate metrics, but they are often used together.

**Precision** answers the question: *What percent of positive predictions were correct?*

$$\mbox{precision} = \frac{\mbox{# true positive}}{\mbox{# true positive} + \mbox{# false positive}}$$

**Recall** answers the question: *What percent of the positive cases did you catch?*

$$\mbox{recall} = \frac{\mbox{# true positive}}{\mbox{# true positive} + \mbox{# false negative}}$$
 

In [101]:
# calculate precision
precision = TP / float(TP + FP)

print(precision)
print(metrics.precision_score(y_test, y_pred))

0.827287149656
0.827287149656


In [102]:
# calculate recall
recall = TP / float(FN + TP)

print(recall)
print(metrics.recall_score(y_test, y_pred))

0.872406870399
0.872406870399


See also the very good explanation of [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) on Wikipedia.
![](http://www.kdnuggets.com/images/precision-recall-relevant-selected.jpg)

#### 3.4. F1-score
The F1-score (sometimes known as the balanced F-beta score) is a single metric that combines both precision and recall via their harmonic mean:

$$F_1 = 2 \frac{\mbox{precision} * \mbox{recall}}{\mbox{precision + recall}}$$

Unlike the arithmetic mean, the harmonic mean tends toward the smaller of the two elements. Hence the F1 score will be small if either precision or recall is small.
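
The difference between the two means is easiest to see with lopsided values. The sketch below uses hypothetical scores (high precision, very low recall) that are not taken from the movie review model.

```python
# Hypothetical scores: high precision but very low recall
precision_demo, recall_demo = 0.9, 0.1

# The arithmetic mean hides the weak recall
arithmetic_mean = (precision_demo + recall_demo) / 2

# The harmonic mean (F1) is pulled toward the smaller value
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(round(arithmetic_mean, 2))  # 0.5
print(round(f1_demo, 2))          # 0.18
```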

In [103]:
# calculate f1-score
f1 = 2 * precision * recall / (precision + recall)

print(f1)
print(metrics.f1_score(y_test, y_pred))

0.849248140709
0.849248140709


#### 3.5. Classification Report
Scikit-learn provides a convenience report for classification problems that gives you a quick overview of a model's performance using several measures.

The **classification_report()** function displays the precision, recall, f1-score and support for each class. (*support* is the number of occurrences of each class in *y_true*)

In [104]:
# print a report on the binary classification problem
print(metrics.classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.80      0.74      0.77     12590
          1       0.83      0.87      0.85     17932

avg / total       0.82      0.82      0.82     30522



### 4. Choice of Metrics
Depending on your application, you may want to consider different performance metrics. Choice of metric depends on your business objective and on the data you have at hand.

In many cases **accuracy** alone will be enough. It is suitable when the data is balanced (equal number of observations in each class) and when minimizing *False Positives* and *False Negatives* is equally important.

If that is not the case:

- Identify whether FP or FN is more important to reduce
- Choose a metric with the relevant variable (FP or FN) in its equation

##### Case 1: Spam filter (positive class is "spam")
FN (spam goes to the inbox) are more acceptable than FP (non-spam is caught by the spam filter) => Choose **FP** as a variable, optimize for **precision**

##### Case 2: Fraudulent transaction detector (positive class is "fraud")
FP (normal transactions that are flagged as possible fraud) are more acceptable than FN (fraudulent transactions that are not detected) => Choose **FN** as a variable, optimize for **recall**
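
The two cases can be sketched as a model selection step. The confusion counts, the candidate names `A` and `B`, and the helper functions `precision_of` and `recall_of` are all hypothetical and serve only to show how the chosen metric changes which model wins.

```python
def precision_of(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)


def recall_of(tp, fn):
    """Share of actual positive cases that were caught."""
    return tp / (tp + fn)


# Hypothetical confusion counts for two candidate models
candidates = {
    "A": {"tp": 80, "fp": 2, "fn": 20},
    "B": {"tp": 95, "fp": 15, "fn": 5},
}

# Case 1 (spam filter): false positives are costly, so rank by precision
best_for_spam = max(
    candidates,
    key=lambda name: precision_of(candidates[name]["tp"], candidates[name]["fp"]),
)

# Case 2 (fraud detector): false negatives are costly, so rank by recall
best_for_fraud = max(
    candidates,
    key=lambda name: recall_of(candidates[name]["tp"], candidates[name]["fn"]),
)

print(best_for_spam, best_for_fraud)  # A B
```

Model A makes fewer false alarms and wins on precision, while model B misses fewer positives and wins on recall, so the same pair of models is ranked differently depending on the application.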

---
#### References
- Sebastian Raschka, [Model evaluation, model selection, and algorithm selection in machine learning](https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html)
- Jason Brownlee, [Metrics To Evaluate Machine Learning Algorithms in Python](http://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/)
- Ritchie Ng, [Evaluating a Classification Model](http://www.ritchieng.com/machine-learning-evaluate-classification-model/)
- [Turi Machine Learning Platform User Guide](https://turi.com/learn/userguide/evaluation/classification.html)
- Gregory Piatetsky, [21 Must-Know Data Science Interview Questions and Answers](http://www.kdnuggets.com/2016/02/21-data-science-interview-questions-answers.html/2)