# Evaluating Models

### 0. What valuation is for? Competitions perspective

**0.1** Validation is your **key** to success in competitions

**0.2** Validation helps to evaluate model performance, its quality, its ability to generalize. Validation can be used to select best model to perform on unseen data.

**0.3** While in real life overfit leads to incosistent and poor performance of the model on future data, in competitions its impact can be directly seen on the post deadline leaderboard shuffle

**0.4** In terms of competitions we call model to overfit if its performance on leaderboard worse than it was expected (will walk through this further)

**0.5** Validation is your **MOST VALUABLE key** to success in competitions

Model evaluation is not just the end point of our machine learning pipeline. Before we handle any data, we want to plan ahead and use techniques and metrics that are suited for our purposes.

### <a name="1"></a> 1. Model Evaluation Applications
Let's start with a question: **"Why do we care about performance estimates at all?"**

<a name="1.1"></a>**Generalization performance** - We want to estimate the predictive performance of our model on future (unseen) data.
- Ideally, the estimated performance of a model tells how well it performs on unseen data – making predictions on future data is often the main problem we want to solve.

<a name="1.2"></a>**Model selection** - We want to increase the predictive performance by tweaking the learning algorithm and selecting the best performing model from a given hypothesis space.
- Typically, machine learning involves a lot of experimentation. Running a learning algorithm over a training dataset with different hyperparameter settings and different features will result in different models. Since we are typically interested in selecting the best-performing model from this set, we need to find a way to estimate their respective performances in order to rank them against each other.

<a name="1.3"></a>**Algorithm selection** - We want to compare different ML algorithms, selecting the best-performing one.
- We are usually not only experimenting with the one single algorithm that we think would be the “best solution” under the given circumstances. More often than not, we want to compare different algorithms to each other, oftentimes in terms of predictive and computational performance.

Although these three sub-tasks have all in common that we want to estimate the performance of a model, they all require different approaches. 

This tutorial will focus on **supervised learning**, a subcategory of machine learning where our target values are known in our available dataset.

### <a name="2"></a>2. Model Evaluation Techniques
#### <a name="2.1"></a>Holdout method (simple train/test split)
The holdout method is the simplest model evaluation technique. We take our labeled dataset and split it randomly into two parts: A **training set** and a **test set**
<img src="https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part1/testing_01.png" width="500">
Then, we fit a model to the training data and predict the labels of the test set.
<img src="https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part1/testing_02.png" width="500">
And the fraction of correct predictions constitutes our estimate of the prediction accuracy.
<img src="https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part1/testing_03.png" width="500">
We really don’t want to train and evaluate our model on the same training dataset, since it would introduce **overfitting**. In other words, we can’t tell whether the model simply memorized the training data or not, or whether it generalizes well to new, unseen data.

##### Pros:
    + Simple
    + Fast

##### Cons:
    - Not so precise estimate of out-of-sample performance comparing to more advanced techniques

### Be aware.

As it was said, you want your validation to mimic your test set as close as possible. And you can make a fair assumprion (that is not always true), that distribution of target on train and not seen data is the same. Then you have to use stratification. Stratification ensures stable distributions across split. That is more than just useful if:

    + Dataset is small
    + Dataset is unbalanced (target average for binary classification this means average target close to 0 or to 1)
    + You have multiclassification task

See example below.

In [1]:
# import data
import pandas as pd

train = pd.read_csv('../data/train_titanic.csv')

# check number of rows & columns
train.shape

(891, 14)

In [2]:
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,relatives,not_alone,Deck,Title,Age_Class,Fare_Per_Person
0,0,3,0,2,1,0,0,0,1,0,8,1,6,0
1,1,1,1,5,1,0,3,1,1,0,3,3,5,1
2,1,3,1,3,0,0,0,0,0,1,8,2,9,0
3,1,1,1,5,1,0,3,0,1,0,3,3,5,1
4,0,3,0,5,0,0,1,0,0,1,8,1,15,1


In [3]:
# split dataset to Train and Test parts
from sklearn.model_selection import train_test_split

X, y = train.drop('Survived', axis = 1), train.Survived
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

In [4]:
y_train.value_counts()/y_train.count()

0    0.623596
1    0.376404
Name: Survived, dtype: float64

In [5]:
y_test.value_counts()/y_test.count()

0    0.586592
1    0.413408
Name: Survived, dtype: float64

In [6]:
# fit a model to the training data
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

classifier = LogisticRegression()

pipeline = Pipeline([('classifier', classifier)])

%time model = pipeline.fit(X_train, y_train)

Wall time: 12 ms


In [7]:
# predict the labels of the test set
y_pred = model.predict(X_test)

In [8]:
# compute prediction accuracy
from sklearn import metrics
y_test.value_counts()/y_test.count()
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Train Accuracy:", metrics.accuracy_score(y_train, model.predict(X_train)))

Accuracy: 0.7988826815642458
Train Accuracy: 0.8188202247191011


In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42, stratify = y)

In [10]:
y_train.value_counts()/y_train.count()

0    0.616573
1    0.383427
Name: Survived, dtype: float64

In [11]:
y_test.value_counts()/y_test.count()

0    0.614525
1    0.385475
Name: Survived, dtype: float64

In [12]:
# fit a model to the training data
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

classifier = LogisticRegression()

pipeline = Pipeline([('classifier', classifier)])

y_pred = model.predict(X_test)

y_test.value_counts()/y_test.count()
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Train Accuracy:", metrics.accuracy_score(y_train, model.predict(X_train)))

Accuracy: 0.8100558659217877
Train Accuracy: 0.8160112359550562


In [13]:
test = pd.read_csv('../data/test_titanic.csv')
test.set_index('PassengerId', inplace=True)
test['Survived'] = model.predict(test)
#test['Survived'].reset_index().to_csv('pred/pred.csv', index = False)

### <a name="2.2"></a>K-fold Cross-validation
K-fold Cross-validation is probably the most common technique for model evaluation and model selection. 
- We split the dataset into *K* parts and iterate over a dataset set *K* times
- In each round one part is used for validation, and the remaining *K-1* parts are merged into a training subset for model evaluation
- We compute the cross-validation performance as the arithmetic mean over the *K* performance estimates from the validation sets.
<img src="https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part3/kfold.png" width="500">

##### Pros:
    + Better estimate of out-of-sample performance than simple train/test split

##### Cons:
    - Runs "K" times slower than simple train/test split

If we have **little data** and **enough time**, it's better to always do cross-validation for a more precise estimate of performance.

In the following example we will apply k-fold cross validation for Model Selection using *GridSearchCV* function.

> #### GridSearchCV main parameters
>*sklearn.model_selection.GridSearchCV*

>**param_grid**: dict or list of dictionaries.
Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

>**cv**: int, cross-validation generator or an iterable, optional.
Determines the cross-validation splitting strategy.

>**scoring**: string, callable or None, default=None.
Controls what metric to apply to the estimators evaluated

### <a name="2.2"></a>LOO or Leave One Out validation
LOO validation is a corner case of K-fold cross-validation, where *K* is equal to *N* - number of examples in the dataset.  
- We split the dataset into *N* parts, where *i-th* part is the original dataset sans i-th example
- In each round i-th example is used for validation, and the remaining *N-1* examples creates a training for model 
- We compute the cross-validation performance as the arithmetic mean over the same as in K_Fold

You can use LOO validation in case you have a small dataset and/or very easy model to train


In [14]:
# fit model 
from sklearn.model_selection import GridSearchCV

params = dict(classifier__C=[0.1, 0.01, 0.001, 0.0001])
grid_search = GridSearchCV(pipeline, param_grid=params, cv=3)

%time grid_search.fit(X,y)

Wall time: 124 ms


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('classifier', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'classifier__C': [0.1, 0.01, 0.001, 0.0001]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [15]:
# Best parameters found:
grid_search.best_params_

{'classifier__C': 0.1}

In [16]:
# Average accuracy over K folds for best parameters set
print("Validation Accuracy", grid_search.best_score_)

Validation Accuracy 0.8013468013468014


In [17]:
#test = pd.read_csv('input/test_final.csv')
#test.set_index('PassengerId', inplace=True)
#test['Survived'] = model.predict(test)
#test['Survived'].reset_index().to_csv('input/pred_cv.csv', index = False)

## Splitting dataset into train and validation

### Row validation. Random
This assumes that rows are independent such as loan default prediction where each row represents a client. This is not always true as if there are family members, you can assume that they also will be able to pay off a loan. Although this dependency can lead to interesting leaks/feature generation depending on whether family members were splitted to different train and test by organizers


Another type ov validation construction - is by group. Suppose you have a task to build a model to predict a weather in cities based on previous dates. Then if you know that in test set there are only new unseen cities, you should split yor dataset on train and validation such as there is no records for any city present in both train and validation.


### Time Validation

Doing **Time validation** in correct way is very important. Suppose you have a task to predict Wikipedia page viewers as in on of previous Kaggle competitions (https://www.kaggle.com/c/web-traffic-time-series-forecasting). What are possible ways to do a validation? Again, it is best to mimic split made by organizers and they split this by date. All before January, 1st, 2017 went to train, all after that date (2 months) - to test. The correct way to perform a split is with **sliding window**(credit for picture to Uber blogpost):
 

<img src="http://eng.uber.com/wp-content/uploads/2018/01/image3-4.png" width="500">




In [18]:
from dateutil.relativedelta import relativedelta
import math

In [19]:
sunspots = pd.read_csv('../data/sunspots_2014-2016.csv', parse_dates=['date'])
sunspots.head(3)

Unnamed: 0,date,value
0,2014-01-01,124
1,2014-01-02,133
2,2014-01-03,153


In [20]:
sunspots.tail(3)

Unnamed: 0,date,value
879,2016-05-29,33
880,2016-05-30,41
881,2016-05-31,36


In [21]:
sunspots['Month']     = sunspots["date"].dt.month
sunspots['Day']       = sunspots["date"].dt.day
sunspots['DayOfWeek'] = sunspots["date"].dt.dayofweek
#sunspots.set_index('date', inplace = True)

In [22]:
sunspots_train, sunspots_test = sunspots[sunspots['date']<'2016-01-01'], sunspots[sunspots['date']>='2016-01-01']

In [23]:
def create_validation(df, start_date):
    return df.loc[(df['date'] >= pd.to_datetime(start_date) - relativedelta(days=0)) & \
                  (df['date'] <  pd.to_datetime(start_date) + relativedelta(months=6))].index, \
           df.loc[(df['date'] >= pd.to_datetime(start_date) + relativedelta(months=6)) & \
                  (df['date'] <  pd.to_datetime(start_date) + relativedelta(months=12))].index

In [24]:
train_dates = ['2014-01-01', '2014-07-01', '2015-01-01']

In [25]:
myCViterator = []
for i in train_dates:
    trainIndices, valIndices = create_validation(sunspots_train, i)
    myCViterator.append( (trainIndices, valIndices) )

In [26]:
for x,y in myCViterator:
    print (min(x), min(y))

0 181
181 365
365 546


In [27]:
X = sunspots_train.drop(['value','date'], axis = 1)
y = sunspots_train['value']

X_test = sunspots_test.drop(['value','date'], axis = 1)
y_test = sunspots_test['value']

In [28]:
from sklearn.svm import SVR
regressor = SVR()

pipeline_r = Pipeline([('regressor', regressor)])
param_grid = [
  {'regressor__C': [0.01, 0.1, 1, 10, 100, 1000], 'regressor__kernel': ['linear']},
  {'regressor__C': [0.01, 0.1, 1, 10, 100, 1000], 'regressor__gamma': [0.001, 0.0001], 'regressor__kernel': ['rbf']},
 ]
grid_search = GridSearchCV(pipeline_r, param_grid=param_grid, cv=myCViterator)

%time grid_search.fit(X,y)

Wall time: 2 s


GridSearchCV(cv=[(Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180],
           dtype='int64', length=181), Int64Index([181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
            ...
            355, 356, 357, 358, 359, ...          720, 721, 722, 723, 724, 725, 726, 727, 728, 729],
           dtype='int64', length=184))],
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('regressor', SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'regressor__C': [0.01, 0.1, 1, 10, 100, 1000], 'regressor__kernel': ['linear']}, {'regressor__C': [0.01, 0.1, 1, 10, 100, 1000], 'regressor__gamma': [0.001, 0.0001], 'regressor__kernel': ['rbf']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
      

In [29]:
grid_search.best_params_

{'regressor__C': 1, 'regressor__gamma': 0.0001, 'regressor__kernel': 'rbf'}

In [30]:
y_pred = grid_search.predict(X_test)

In [31]:
grid_search.score(X_test, y_test)

-2.495360532003474

In [32]:
grid_search = GridSearchCV(pipeline_r, param_grid=param_grid, cv=3)

%time grid_search.fit(X,y)
grid_search.best_params_

Wall time: 8.09 s


{'regressor__C': 0.01, 'regressor__gamma': 0.0001, 'regressor__kernel': 'rbf'}

In [33]:
grid_search.score(X_test, y_test)

-2.502739658909923

In [34]:
def smape_fast(y_true, y_pred):
    out = 0
    for i in range(y_true.shape[0]):
        a = y_true.iloc[i]
        b = y_pred[i]
        if b < 1:
            b = 0
        c = a+b
        if c == 0:
            continue
        out += math.fabs(a - b) / c
    out *= (200.0 / y_true.shape[0])
    return out

In [35]:
smape_fast(y_test, y_pred)

60.714144145392616

### Group Validation

Group can refer to user id, store, city or any other entity. Another type of validation construction - is by group. Suppose you have a task to build a model to predict a weather in cities based on previous dates. Then if you know that in test set there are only new unseen cities, you should split yor dataset on train and validation such as there is no records for any city present in both train and validation.

#### You can  combine splitting strategies but you ALWAYS want to make split as similiar to organizer's split as possible

**IN ANY CASE** not only your performance on LB will depend on your validation but also feature generation will!

In the end, when you did your validation, you should compare your submission results on Kaggle or similiar platform leaderboard with the results on your validation. If you see that the scores are not correlated or consistantly significantly higher/lower - EDA is your friend to find the root cause of the problem.

In [36]:
import pandas_profiling as pp
train = pd.read_csv("../data/train (3).csv")
test = pd.read_csv("../data/test (2).csv")
test['Age'] = test['Age']*1.3
test["Embarked"] = test["Embarked"].map({"Q": "S", "S": "S", "C":"C"})
pp.ProfileReport(train)


0,1
Number of variables,12
Number of observations,891
Total Missing (%),8.1%
Total size in memory,83.6 KiB
Average record size in memory,96.1 B

0,1
Numeric,6
Categorical,4
Boolean,1
Date,0
Text (Unique),1
Rejected,0
Unsupported,0

0,1
Distinct count,89
Unique (%),10.0%
Missing (%),19.9%
Missing (n),177
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,29.699
Minimum,0.42
Maximum,80
Zeros (%),0.0%

0,1
Minimum,0.42
5-th percentile,4.0
Q1,20.125
Median,28.0
Q3,38.0
95-th percentile,56.0
Maximum,80.0
Range,79.58
Interquartile range,17.875

0,1
Standard deviation,14.526
Coef of variation,0.48912
Kurtosis,0.17827
Mean,29.699
MAD,11.323
Skewness,0.38911
Sum,21205
Variance,211.02
Memory size,7.0 KiB

Value,Count,Frequency (%),Unnamed: 3
24.0,30,3.4%,
22.0,27,3.0%,
18.0,26,2.9%,
28.0,25,2.8%,
19.0,25,2.8%,
30.0,25,2.8%,
21.0,24,2.7%,
25.0,23,2.6%,
36.0,22,2.5%,
29.0,20,2.2%,

Value,Count,Frequency (%),Unnamed: 3
0.42,1,0.1%,
0.67,1,0.1%,
0.75,2,0.2%,
0.83,2,0.2%,
0.92,1,0.1%,

Value,Count,Frequency (%),Unnamed: 3
70.0,2,0.2%,
70.5,1,0.1%,
71.0,2,0.2%,
74.0,1,0.1%,
80.0,1,0.1%,

0,1
Distinct count,148
Unique (%),16.6%
Missing (%),77.1%
Missing (n),687

0,1
G6,4
B96 B98,4
C23 C25 C27,4
Other values (144),192
(Missing),687

Value,Count,Frequency (%),Unnamed: 3
G6,4,0.4%,
B96 B98,4,0.4%,
C23 C25 C27,4,0.4%,
F33,3,0.3%,
F2,3,0.3%,
E101,3,0.3%,
D,3,0.3%,
C22 C26,3,0.3%,
C65,2,0.2%,
C83,2,0.2%,

0,1
Distinct count,4
Unique (%),0.4%
Missing (%),0.2%
Missing (n),2

0,1
S,644
C,168
Q,77
(Missing),2

Value,Count,Frequency (%),Unnamed: 3
S,644,72.3%,
C,168,18.9%,
Q,77,8.6%,
(Missing),2,0.2%,

0,1
Distinct count,248
Unique (%),27.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,32.204
Minimum,0
Maximum,512.33
Zeros (%),1.7%

0,1
Minimum,0.0
5-th percentile,7.225
Q1,7.9104
Median,14.454
Q3,31.0
95-th percentile,112.08
Maximum,512.33
Range,512.33
Interquartile range,23.09

0,1
Standard deviation,49.693
Coef of variation,1.5431
Kurtosis,33.398
Mean,32.204
MAD,28.164
Skewness,4.7873
Sum,28694
Variance,2469.4
Memory size,7.0 KiB

Value,Count,Frequency (%),Unnamed: 3
8.05,43,4.8%,
13.0,42,4.7%,
7.8958,38,4.3%,
7.75,34,3.8%,
26.0,31,3.5%,
10.5,24,2.7%,
7.925,18,2.0%,
7.775,16,1.8%,
26.55,15,1.7%,
0.0,15,1.7%,

Value,Count,Frequency (%),Unnamed: 3
0.0,15,1.7%,
4.0125,1,0.1%,
5.0,1,0.1%,
6.2375,1,0.1%,
6.4375,1,0.1%,

Value,Count,Frequency (%),Unnamed: 3
227.525,4,0.4%,
247.5208,2,0.2%,
262.375,2,0.2%,
263.0,4,0.4%,
512.3292,3,0.3%,

First 3 values
"Strom, Miss. Telma Matilda"
"Johnston, Miss. Catherine Helen ""Carrie"""
"Devaney, Miss. Margaret Delia"

Last 3 values
"Johansson, Mr. Karl Johan"
"Richard, Mr. Emile"
"Newell, Miss. Marjorie"

Value,Count,Frequency (%),Unnamed: 3
"Abbing, Mr. Anthony",1,0.1%,
"Abbott, Mr. Rossmore Edward",1,0.1%,
"Abbott, Mrs. Stanton (Rosa Hunt)",1,0.1%,
"Abelson, Mr. Samuel",1,0.1%,
"Abelson, Mrs. Samuel (Hannah Wizosky)",1,0.1%,

Value,Count,Frequency (%),Unnamed: 3
"de Mulder, Mr. Theodore",1,0.1%,
"de Pelsmaeker, Mr. Alfons",1,0.1%,
"del Carlo, Mr. Sebastiano",1,0.1%,
"van Billiard, Mr. Austin Blyler",1,0.1%,
"van Melkebeke, Mr. Philemon",1,0.1%,

0,1
Distinct count,7
Unique (%),0.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.38159
Minimum,0
Maximum,6
Zeros (%),76.1%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,2
Maximum,6
Range,6
Interquartile range,0

0,1
Standard deviation,0.80606
Coef of variation,2.1123
Kurtosis,9.7781
Mean,0.38159
MAD,0.58074
Skewness,2.7491
Sum,340
Variance,0.64973
Memory size,7.0 KiB

Value,Count,Frequency (%),Unnamed: 3
0,678,76.1%,
1,118,13.2%,
2,80,9.0%,
5,5,0.6%,
3,5,0.6%,
4,4,0.4%,
6,1,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0,678,76.1%,
1,118,13.2%,
2,80,9.0%,
3,5,0.6%,
4,4,0.4%,

Value,Count,Frequency (%),Unnamed: 3
2,80,9.0%,
3,5,0.6%,
4,4,0.4%,
5,5,0.6%,
6,1,0.1%,

0,1
Distinct count,891
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,446
Minimum,1
Maximum,891
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,45.5
Q1,223.5
Median,446.0
Q3,668.5
95-th percentile,846.5
Maximum,891.0
Range,890.0
Interquartile range,445.0

0,1
Standard deviation,257.35
Coef of variation,0.57703
Kurtosis,-1.2
Mean,446
MAD,222.75
Skewness,0
Sum,397386
Variance,66231
Memory size,7.0 KiB

Value,Count,Frequency (%),Unnamed: 3
891,1,0.1%,
293,1,0.1%,
304,1,0.1%,
303,1,0.1%,
302,1,0.1%,
301,1,0.1%,
300,1,0.1%,
299,1,0.1%,
298,1,0.1%,
297,1,0.1%,

Value,Count,Frequency (%),Unnamed: 3
1,1,0.1%,
2,1,0.1%,
3,1,0.1%,
4,1,0.1%,
5,1,0.1%,

Value,Count,Frequency (%),Unnamed: 3
887,1,0.1%,
888,1,0.1%,
889,1,0.1%,
890,1,0.1%,
891,1,0.1%,

0,1
Distinct count,3
Unique (%),0.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2.3086
Minimum,1
Maximum,3
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,2
Median,3
Q3,3
95-th percentile,3
Maximum,3
Range,2
Interquartile range,1

0,1
Standard deviation,0.83607
Coef of variation,0.36215
Kurtosis,-1.28
Mean,2.3086
MAD,0.76197
Skewness,-0.63055
Sum,2057
Variance,0.69902
Memory size,7.0 KiB

Value,Count,Frequency (%),Unnamed: 3
3,491,55.1%,
1,216,24.2%,
2,184,20.7%,

Value,Count,Frequency (%),Unnamed: 3
1,216,24.2%,
2,184,20.7%,
3,491,55.1%,

Value,Count,Frequency (%),Unnamed: 3
1,216,24.2%,
2,184,20.7%,
3,491,55.1%,

0,1
Distinct count,2
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0

0,1
male,577
female,314

Value,Count,Frequency (%),Unnamed: 3
male,577,64.8%,
female,314,35.2%,

0,1
Distinct count,7
Unique (%),0.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.52301
Minimum,0
Maximum,8
Zeros (%),68.2%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,3
Maximum,8
Range,8
Interquartile range,1

0,1
Standard deviation,1.1027
Coef of variation,2.1085
Kurtosis,17.88
Mean,0.52301
MAD,0.71378
Skewness,3.6954
Sum,466
Variance,1.216
Memory size,7.0 KiB

Value,Count,Frequency (%),Unnamed: 3
0,608,68.2%,
1,209,23.5%,
2,28,3.1%,
4,18,2.0%,
3,16,1.8%,
8,7,0.8%,
5,5,0.6%,

Value,Count,Frequency (%),Unnamed: 3
0,608,68.2%,
1,209,23.5%,
2,28,3.1%,
3,16,1.8%,
4,18,2.0%,

Value,Count,Frequency (%),Unnamed: 3
2,28,3.1%,
3,16,1.8%,
4,18,2.0%,
5,5,0.6%,
8,7,0.8%,

0,1
Distinct count,2
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.38384

0,1
0,549
1,342

Value,Count,Frequency (%),Unnamed: 3
0,549,61.6%,
1,342,38.4%,

0,1
Distinct count,681
Unique (%),76.4%
Missing (%),0.0%
Missing (n),0

0,1
1601,7
347082,7
CA. 2343,7
Other values (678),870

Value,Count,Frequency (%),Unnamed: 3
1601,7,0.8%,
347082,7,0.8%,
CA. 2343,7,0.8%,
CA 2144,6,0.7%,
3101295,6,0.7%,
347088,6,0.7%,
382652,5,0.6%,
S.O.C. 14879,5,0.6%,
2666,4,0.4%,
349909,4,0.4%,

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [37]:
pp.ProfileReport(test)

0,1
Number of variables,11
Number of observations,418
Total Missing (%),9.0%
Total size in memory,36.0 KiB
Average record size in memory,88.2 B

0,1
Numeric,6
Categorical,4
Boolean,0
Date,0
Text (Unique),1
Rejected,0
Unsupported,0

0,1
Distinct count,80
Unique (%),19.1%
Missing (%),20.6%
Missing (n),86
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,39.354
Minimum,0.221
Maximum,98.8
Zeros (%),0.0%

0,1
Minimum,0.221
5-th percentile,10.4
Q1,27.3
Median,35.1
Q3,50.7
95-th percentile,74.1
Maximum,98.8
Range,98.579
Interquartile range,23.4

0,1
Standard deviation,18.436
Coef of variation,0.46845
Kurtosis,0.083783
Mean,39.354
MAD,14.523
Skewness,0.45736
Sum,13066
Variance,339.87
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
31.200000000000003,17,4.1%,
27.3,17,4.1%,
28.6,16,3.8%,
39.0,15,3.6%,
23.400000000000002,13,3.1%,
33.800000000000004,12,2.9%,
35.1,12,2.9%,
32.5,11,2.6%,
29.900000000000002,11,2.6%,
37.7,10,2.4%,

Value,Count,Frequency (%),Unnamed: 3
0.221,1,0.2%,
0.429,1,0.2%,
0.975,1,0.2%,
1.079,1,0.2%,
1.1960000000000002,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
80.60000000000001,1,0.2%,
81.9,2,0.5%,
83.2,3,0.7%,
87.10000000000001,1,0.2%,
98.8,1,0.2%,

0,1
Distinct count,77
Unique (%),18.4%
Missing (%),78.2%
Missing (n),327

0,1
B57 B59 B63 B66,3
F4,2
C31,2
Other values (73),84
(Missing),327

Value,Count,Frequency (%),Unnamed: 3
B57 B59 B63 B66,3,0.7%,
F4,2,0.5%,
C31,2,0.5%,
A34,2,0.5%,
C55 C57,2,0.5%,
C80,2,0.5%,
C101,2,0.5%,
B45,2,0.5%,
C116,2,0.5%,
C6,2,0.5%,

0,1
Distinct count,2
Unique (%),0.5%
Missing (%),0.0%
Missing (n),0

0,1
S,316
C,102

Value,Count,Frequency (%),Unnamed: 3
S,316,75.6%,
C,102,24.4%,

0,1
Distinct count,170
Unique (%),40.7%
Missing (%),0.2%
Missing (n),1
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,35.627
Minimum,0
Maximum,512.33
Zeros (%),0.5%

0,1
Minimum,0.0
5-th percentile,7.2292
Q1,7.8958
Median,14.454
Q3,31.5
95-th percentile,151.55
Maximum,512.33
Range,512.33
Interquartile range,23.604

0,1
Standard deviation,55.908
Coef of variation,1.5692
Kurtosis,17.922
Mean,35.627
MAD,33.294
Skewness,3.6872
Sum,14857
Variance,3125.7
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
7.75,21,5.0%,
26.0,19,4.5%,
8.05,17,4.1%,
13.0,17,4.1%,
7.8958,11,2.6%,
10.5,11,2.6%,
7.775,10,2.4%,
7.2292,9,2.2%,
7.225,9,2.2%,
7.8542,8,1.9%,

Value,Count,Frequency (%),Unnamed: 3
0.0,2,0.5%,
3.1708,1,0.2%,
6.4375,2,0.5%,
6.4958,1,0.2%,
6.95,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
227.525,1,0.2%,
247.5208,1,0.2%,
262.375,5,1.2%,
263.0,2,0.5%,
512.3292,1,0.2%,

First 3 values
"Snyder, Mr. John Pillsbury"
"Aronsson, Mr. Ernst Axel Algot"
"Johansson Palmquist, Mr. Oskar Leander"

Last 3 values
"Hansen, Mrs. Claus Peter (Jennie L Howard)"
"Buckley, Mr. Daniel"
"Katavelas, Mr. Vassilios (Catavelas Vassilios"")"""

Value,Count,Frequency (%),Unnamed: 3
"Abbott, Master. Eugene Joseph",1,0.2%,
"Abelseth, Miss. Karen Marie",1,0.2%,
"Abelseth, Mr. Olaus Jorgensen",1,0.2%,
"Abrahamsson, Mr. Abraham August Johannes",1,0.2%,
"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
"de Brito, Mr. Jose Joaquim",1,0.2%,
"de Messemaeker, Mr. Guillaume Joseph",1,0.2%,
"del Carlo, Mrs. Sebastiano (Argenia Genovesi)",1,0.2%,
"van Billiard, Master. James William",1,0.2%,
"van Billiard, Master. Walter John",1,0.2%,

0,1
Distinct count,8
Unique (%),1.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.39234
Minimum,0
Maximum,9
Zeros (%),77.5%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,2
Maximum,9
Range,9
Interquartile range,0

0,1
Standard deviation,0.98143
Coef of variation,2.5014
Kurtosis,31.413
Mean,0.39234
MAD,0.60823
Skewness,4.6545
Sum,164
Variance,0.9632
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,324,77.5%,
1,52,12.4%,
2,33,7.9%,
3,3,0.7%,
9,2,0.5%,
4,2,0.5%,
6,1,0.2%,
5,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0,324,77.5%,
1,52,12.4%,
2,33,7.9%,
3,3,0.7%,
4,2,0.5%,

Value,Count,Frequency (%),Unnamed: 3
3,3,0.7%,
4,2,0.5%,
5,1,0.2%,
6,1,0.2%,
9,2,0.5%,

0,1
Distinct count,418
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1100.5
Minimum,892
Maximum,1309
Zeros (%),0.0%

0,1
Minimum,892.0
5-th percentile,912.85
Q1,996.25
Median,1100.5
Q3,1204.8
95-th percentile,1288.2
Maximum,1309.0
Range,417.0
Interquartile range,208.5

0,1
Standard deviation,120.81
Coef of variation,0.10978
Kurtosis,-1.2
Mean,1100.5
MAD,104.5
Skewness,0
Sum,460009
Variance,14595
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
1023,1,0.2%,
1128,1,0.2%,
1156,1,0.2%,
1157,1,0.2%,
1158,1,0.2%,
1159,1,0.2%,
1160,1,0.2%,
1161,1,0.2%,
1162,1,0.2%,
1163,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
892,1,0.2%,
893,1,0.2%,
894,1,0.2%,
895,1,0.2%,
896,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
1305,1,0.2%,
1306,1,0.2%,
1307,1,0.2%,
1308,1,0.2%,
1309,1,0.2%,

0,1
Distinct count,3
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2.2656
Minimum,1
Maximum,3
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,1
Median,3
Q3,3
95-th percentile,3
Maximum,3
Range,2
Interquartile range,2

0,1
Standard deviation,0.84184
Coef of variation,0.37158
Kurtosis,-1.3827
Mean,2.2656
MAD,0.76608
Skewness,-0.53417
Sum,947
Variance,0.70869
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
3,218,52.2%,
1,107,25.6%,
2,93,22.2%,

Value,Count,Frequency (%),Unnamed: 3
1,107,25.6%,
2,93,22.2%,
3,218,52.2%,

Value,Count,Frequency (%),Unnamed: 3
1,107,25.6%,
2,93,22.2%,
3,218,52.2%,

0,1
Distinct count,2
Unique (%),0.5%
Missing (%),0.0%
Missing (n),0

0,1
male,266
female,152

Value,Count,Frequency (%),Unnamed: 3
male,266,63.6%,
female,152,36.4%,

0,1
Distinct count,7
Unique (%),1.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.44737
Minimum,0
Maximum,8
Zeros (%),67.7%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,2
Maximum,8
Range,8
Interquartile range,1

0,1
Standard deviation,0.89676
Coef of variation,2.0045
Kurtosis,26.499
Mean,0.44737
MAD,0.60577
Skewness,4.1683
Sum,187
Variance,0.80418
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,283,67.7%,
1,110,26.3%,
2,14,3.3%,
4,4,1.0%,
3,4,1.0%,
8,2,0.5%,
5,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0,283,67.7%,
1,110,26.3%,
2,14,3.3%,
3,4,1.0%,
4,4,1.0%,

Value,Count,Frequency (%),Unnamed: 3
2,14,3.3%,
3,4,1.0%,
4,4,1.0%,
5,1,0.2%,
8,2,0.5%,

0,1
Distinct count,363
Unique (%),86.8%
Missing (%),0.0%
Missing (n),0

0,1
PC 17608,5
CA. 2343,4
113503,4
Other values (360),405

Value,Count,Frequency (%),Unnamed: 3
PC 17608,5,1.2%,
CA. 2343,4,1.0%,
113503,4,1.0%,
16966,3,0.7%,
SOTON/O.Q. 3101315,3,0.7%,
220845,3,0.7%,
C.A. 31029,3,0.7%,
347077,3,0.7%,
PC 17483,3,0.7%,
CA 31352,2,0.5%,

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,44.85,0,0,330911,7.8292,,S
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,61.1,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,80.6,0,0,240276,9.6875,,S
3,895,3,"Wirz, Mr. Albert",male,35.1,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,28.6,1,1,3101298,12.2875,,S


## Metrics overview

Every competition can rely on different metrics that usually is dictated from business needs. It is important to understand the competition metric and optimize only this metric and not any other.

### <a name="3"></a>3. Classification metrics overview
Classification problems are probably the most common type of ML problem and as such there are many metrics that can be used to evaluate predictions for these problems. We will review some of them.

First note, that many of classifiers return soft labels, or scores for each class, such as probability, while others - hard labels i.e class where target belongs. Soft label can transformed to hard labels for example using threshold for binary classification

### <a name="3.1"></a>LogLoss

For binary classification, works with soft labels

$$ LogLoss = {-\frac{1}{N} \sum_{i=1}^{N}(y_{i}log(\hat{y_{i}})+(1-y_i)log(1-\hat{y_{i}}))}$$

### <a name="3.1"></a>Accuracy
Accuracy simply measures *what percent of your predictions were correct*. It's the ratio between the number of correct predictions and the total number of predictions. The downside is that it is hard to optimize and it cares about hard labels

$$accuracy = {\frac{\#\ correct}{\#\ predictions}}$$

In [38]:
# calculate accuracy
#print(metrics.accuracy_score(y_test, y_pred))

Accuracy is also the most misused metric. It is really **only suitable** when there are an *equal number of observations in each class* (which is rarely the case) and that all *predictions and prediction errors are equally important*, which is often not the case.

### <a name="3.2"></a>Confusion Matrix
The confusion matrix is a handy presentation of the accuracy of a model with 2 or more classes. The table **presents predictions** on the x-axis and **accuracy outcomes** on the y-axis. The cells of the table are the number of predictions made by a machine learning algorithm.

In [39]:
# first argument is true values, second argument is predicted values
# this produces a 2x2 numpy array (matrix)
#conf = metrics.confusion_matrix(y_test, y_pred)
#print(conf)

|                | Predicted Negative | Predicted Positive |
|:--------------:|--------------------|--------------------|
| **Negative Cases** |      TN: 9324      |      FP: 3266      |
| **Positive Cases** |      FN: 2288      |      TP: 15644     |

- **True Positives (TP)**:
We correctly predicted that the reviews are positive: **15644**
- **True Negatives (TN)**:
We correctly predicted that the reviews are negative: **9324**
- **False Positives (FP)**:
We incorrectly predicted that the reviews are positive: **3266**
- **False Negatives (FN)**:
We incorrectly predicted that the reviews are negative: **2288**



Confusion matrix allows you to compute various classification metrics, and these metrics can guide your model selection. 

In [40]:
# slice confusion matrix into four pieces for future use
TP = conf[1, 1]
TN = conf[0, 0]
FP = conf[0, 1]
FN = conf[1, 0]

NameError: name 'conf' is not defined

You can learn more about the [Confusion Matrix on the Wikipedia article](https://en.wikipedia.org/wiki/Confusion_matrix).


### <a name="3.3"></a>Precision & Recall
Precision and recall are actually two metrics. But they are often used together.

**Precision** answers the question: *What percent of positive predictions were correct?*

$$precision = {\frac{\#\ true\ positive}{\#\ true\ positive + \#\ false\ positive}}$$

**Recall** answers the question: *What percent of the positive cases did you catch?*


$$recall = {\frac{\#\ true\ positive}{\#\ true\ positive + \#\ false\ negative}}$$

![](http://www.kdnuggets.com/images/precision-recall-relevant-selected.jpg)

See also a very good explanation of [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) in Wikipedia.

[To the table of contents](#0)

In [None]:
# calculate precision
precision = TP / float(TP + FP)

#print(precision)
#print(metrics.precision_score(y_test, y_pred))

### <a name="3.4"></a>F1-score
The F1-score (sometimes known as the balanced F-beta score) is a single metric that combines both precision and recall via their harmonic mean:

$$F_1 = 2 {\frac{precision * recall}{precision + recall}}$$

Unlike the arithmetic mean, the harmonic mean tends toward the smaller of the two elements. Hence the F1 score will be small if either precision or recall is small.


In [None]:
# calculate f1-score
#f1 = 2 * precision * recall / (precision + recall)

#print(f1)
#print(metrics.f1_score(y_test, y_pred))

### <a name="3.4"></a>ROC AUC
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

More details here:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

![](https://i.stack.imgur.com/5x3Xj.png)



### <a name="3.5"></a>Classification Report
Scikit-learn does provide a convenience report when working on classification problems to give you a quick idea of the accuracy of a model using a number of measures.

The **classification_report()** function displays the precision, recall, f1-score and support for each class. (*support* is the number of occurrences of each class in *y_true*)

[To the table of contents](#0)

In [None]:
# print a report on the binary classification problem
#print(metrics.classification_report(y_test, y_pred))

### <a name="4"></a>4. Regression metrics overview


### <a name="3.1"></a>MSE (L2 Loss) and RMSE
MSE measures your mean square error from target:


$$ MSE = {\frac{1}{N} \sum_{i=1}^{N}(y_{i}-\hat{y_{i}})^2}$$

$$ RMSE = {\sqrt{MSE}}$$

RMSE and MSE is similiar in terms of minimizers - value minimizes RMSE **if and only if** it minimizes MSE. This means that in terms of competitions we can optimize MSE instead of RMSE. In fact it is easier to work with MSE. But there is a little bit of difference between the two for gradient-based models. The gradient of RMSE with respect to i-th prediction is basically equal to gradient of MSE multiplied by some value. The value doesn't depend on the index I. It means that travelling along MSE gradient is equivalent to traveling along RMSE gradient but with a different flowing rate and the flowing rate depends on MSE score itself. So, it is kind of dynamic.So even though RMSE and MSE are really similar in terms of models scoring, they can be not immediately interchangeable for gradient based methods. We will probably need to adjust some parameters like the learning rate.

To see model performance in terms of baseline mean usually R-squared is used. Or Adjusted R-squared to penalize for model parameters/features

$$ R^2 = {\frac{MSE}{\frac{1}{N} \sum_{i=1}^{N}(y_{i}-\bar{y_{i}})^2}}$$

R_squared is between 0 and 1.

In finance usually is used MAE metric

$$ MAE = {\frac{1}{N} \sum_{i=1}^{N}|y_{i}-\hat{y_{i}}|} $$

It is not differentiable in 0, but one can simply overcome that by coding simple *if else* condition. LightGBM can use MAE while xgboost **cannot**

If you care more about **relative** error  - **MSPE** or **MAPE** can be used. They are quite similar to MSE and MAE but incorporate error to relative values rather than absolute

If you care more about error for different values, you can apply a function to prediction and target before going into MSE. For example taking a **log(y+1)** will introduce (R)MSLE metric that penalizes more for mistakes for smaller number and less for larger

### <a name="4"></a>5. What to do with all these metrics?

#### OPTIMIZE

In fact there is often the case that model optimizes different metric from what you want it to optimize. Your possible actions are:
 + Find the model that optimizes your metric. LogLoss, MSE are present in most libraries
 + Create your own loss function and pass it to the model such as xgboost. You need to write your own derivatives
 + Preprocess your original target, for example use log(y+1) and RMSE instead of RMSLE
 + Postprocess your output predictions if you need accuracy
 + Run desired model with early stopping. This means optimize default loss, monitor your metric and stop training if you see your metric is not improving

## Final thoughts

 + Try to reproduce train-test split made by organizers **at all costs**.
 + It is not always bad to have overvit for your validation if its score is similar to lb score
 + However, this may be a signal to review your EDA
 + If there is a huge gap between validation and lb, while train and val scores are similar - you have a leak in data. Carefuly review your EDA. Try to remove most predictive features and compare the results
 + Always try to look not only at F1 score but also on precision and recall (FP, FN) to find out when your model is wrong
 + Holdout validation works very good when you have a lot of data points in Neural Networks for example
 + Don't waste your time on building complex models to see if your validation is working. You should see this even submitting a constant value. Concentrate on very simple model such as linear/logistic regression for this
 + If you have time and ran out of ideas, you can use the next trick - concatenate train and test sets. Create a variable that will have a value 'train' for examples in train set or 'test' otherwise. Build a classifier that will try to predict whether an example belongs to either of two groups. After that from select top examples with highest probability to be included in test and make them your validation.
 + There is a possibility to GridSearch and compare algorithms using statistical significance of Student criteria. The link to review this idea https://youtu.be/HT3QpRp2ewA?t=1071 