## Your model versus the population

A sample is a **subset** of a population.

You will likely **never** have data that covers the entire population.

That means that you will likely **never** be able to represent the entire population!

Your model will lie!

## Populations

![](images/pop1.png)

## The problem of overfitting

Occam's razor implies that any given complex function is a priori less probable than any given simple function. If a more complicated function is chosen over a simple one without a gain in training-data fit large enough to offset the added complexity, the complex function "overfits" the data. An overfitted function will likely perform worse than the simpler one on validation data outside the training dataset, even if it performed as well, or even better, on the training dataset.

![](images/overfitting.png)
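
A minimal sketch of the idea, assuming invented data (a straight line plus noise) and two arbitrarily chosen polynomial degrees. The degree-9 polynomial can pass through all ten training points, so its training error is close to zero, yet that says little about how it behaves on the validation points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_line(x):
    # Hypothetical ground truth: y = 2x plus Gaussian noise.
    return 2 * x + rng.normal(scale=0.2, size=x.size)

# Ten training points and ten separate validation points (invented data).
x_train = np.linspace(0, 1, 10)
y_train = noisy_line(x_train)
x_val = np.linspace(0.05, 0.95, 10)
y_val = noisy_line(x_val)

def mse(model, x, y):
    # Mean squared error of a fitted polynomial on (x, y).
    return float(np.mean((model(x) - y) ** 2))

simple_model = Polynomial.fit(x_train, y_train, deg=1)   # simple function
complex_model = Polynomial.fit(x_train, y_train, deg=9)  # complex function

for name, model in [("degree 1", simple_model), ("degree 9", complex_model)]:
    print(name,
          "train MSE:", round(mse(model, x_train, y_train), 4),
          "validation MSE:", round(mse(model, x_val, y_val), 4))
```

Compare the two printed rows: a lower training error alone is not evidence of a better model.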

## Car owners and voters

In 1936, *millions* of mock ballots were mailed to car owners across the USA to learn who would win the presidential election.

The Republican candidate was the *clear* winner in the mock ballots, but the Democrat won the election.

What went wrong?

## The problem of generalisation

If X% of the sample has Y, it does **not** follow that X% of the population has Y!

**Always** ask yourself: is your data representative?
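
A small simulation can make this concrete. Everything below is an assumption made for illustration: the population size, the share of car owners, and the voting probabilities are invented, not historical figures. The sketch compares a random sample with a sample drawn only from car owners.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical population: 30% own a car (invented number).
owns_car = rng.random(n) < 0.30
# Assumed preferences: car owners vote for party A with probability 0.65,
# everyone else with probability 0.35 (invented numbers).
votes_a = rng.random(n) < np.where(owns_car, 0.65, 0.35)

population_share = votes_a.mean()

# Representative sample: every voter is equally likely to be picked.
random_sample = rng.choice(n, size=2_000, replace=False)
random_share = votes_a[random_sample].mean()

# Biased sample: only car owners are polled.
car_owner_idx = np.flatnonzero(owns_car)
biased_sample = rng.choice(car_owner_idx, size=2_000, replace=False)
biased_share = votes_a[biased_sample].mean()

print("population:", round(population_share, 3))
print("random sample:", round(random_share, 3))
print("car owners only:", round(biased_share, 3))
```

Under these assumptions the car-owner sample predicts a majority for party A even though the population as a whole does not give it one.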

## Training and testing data

We now split the data into two parts:
* **Training data**: the data that the model sees
* **Testing data**: the data that the model is tested against

Note: the model should **never** train on the testing data. Otherwise the test score no longer tells you how the model handles unseen data.
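
To show what such a split amounts to, here is a hand-rolled sketch using only NumPy. The toy arrays and the choice of three test rows are assumptions for illustration; in practice a library helper does this for you.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 10 rows with 2 features, and one target per row.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle the row indices, then cut them into two disjoint groups.
indices = rng.permutation(len(X))
n_test = 3
test_idx, train_idx = indices[:n_test], indices[n_test:]

x_train, y_train = X[train_idx], y[train_idx]
x_test, y_test = X[test_idx], y[test_idx]

# No row may appear in both sets.
overlap = set(train_idx.tolist()) & set(test_idx.tolist())
print(len(x_train), "training rows,", len(x_test), "testing rows,",
      len(overlap), "rows shared")
```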

## Sklearn `train_test_split`

Splitting the data into training and testing sets lets you check how well your model generalises to data it has not seen.

But a good test score **does not guarantee** that it generalises to the whole population!

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
from sklearn.datasets import load_iris

In [3]:
X = load_iris().data
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [4]:
y = load_iris().target
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [5]:
load_iris().target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [6]:
from sklearn.model_selection import train_test_split
train_test_split(X, y)

[array([[6.6, 3. , 4.4, 1.4],
        [6.1, 2.6, 5.6, 1.4],
        [5.7, 4.4, 1.5, 0.4],
        [4.8, 3. , 1.4, 0.3],
        [5.5, 2.3, 4. , 1.3],
        [7.2, 3.2, 6. , 1.8],
        [6.7, 3.1, 4.4, 1.4],
        [4.7, 3.2, 1.3, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [7.7, 2.8, 6.7, 2. ],
        [6.3, 3.4, 5.6, 2.4],
        [5.9, 3. , 5.1, 1.8],
        [6.6, 2.9, 4.6, 1.3],
        [4.4, 3. , 1.3, 0.2],
        [5.7, 2.8, 4.1, 1.3],
        [6. , 2.7, 5.1, 1.6],
        [5.4, 3.9, 1.7, 0.4],
        [5.8, 2.7, 5.1, 1.9],
        [5.2, 2.7, 3.9, 1.4],
        [5.2, 3.5, 1.5, 0.2],
        [4.9, 2.5, 4.5, 1.7],
        [5.4, 3. , 4.5, 1.5],
        [7. , 3.2, 4.7, 1.4],
        [6.7, 3.3, 5.7, 2.1],
        [5.1, 3.8, 1.6, 0.2],
        [5.8, 2.6, 4. , 1.2],
        [6.8, 2.8, 4.8, 1.4],
        [5.7, 2.9, 4.2, 1.3],
        [5.6, 3. , 4.1, 1.3],
        [7.4, 2.8, 6.1, 1.9],
        [5.5, 2.6, 4.4, 1.2],
        [5. , 3.2, 1.2, 0.2],
        [6.9, 3.1, 5.4, 2.1],
        [6

In [7]:
# split X and y into training data (3/4, the default) and test data (1/4)
# the model learns from the training data; the test data checks how well it learned
x_train, x_test, y_train, y_test = train_test_split(X, y)
print(y_train)

[0 2 0 1 1 0 0 1 2 2 2 0 1 0 1 1 1 1 0 1 0 1 1 0 2 2 1 2 2 1 0 1 2 1 1 2 1
 2 0 0 1 1 2 0 1 2 1 1 0 2 0 2 0 2 2 2 1 1 0 1 1 1 0 1 0 2 0 2 2 2 0 1 0 2
 2 1 2 0 1 2 0 1 2 0 0 2 0 0 0 1 0 2 0 1 2 0 2 1 1 2 1 0 0 2 0 1 0 0 0 0 2
 0]
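
The call above relies on the defaults, so the split changes on every run. The sketch below shows three optional parameters of `train_test_split` that are often useful; the particular values (50 test rows, seed 0) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

x_train, x_test, y_train, y_test = train_test_split(
    X, y,
    test_size=50,    # hold out 50 of the 150 rows instead of the default quarter
    random_state=0,  # fixed seed so the same split comes back on every run
    stratify=y,      # keep the three species in similar proportions in both parts
)

print(x_train.shape, x_test.shape)
print("test rows per class:", np.bincount(y_test))
```

With `stratify=y`, no single species should dominate the test set by chance.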


In [8]:
# fit a linear regression model (this treats the class labels 0, 1, 2 as numbers)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [9]:
targets = model.predict(x_train)
targets

array([-5.09203649e-02,  1.73872686e+00,  7.02235363e-03,  1.15141064e+00,
        1.35947531e+00, -4.98495578e-02, -1.12150083e-02,  1.00542706e+00,
        2.13858964e+00,  1.85130215e+00,  1.97666289e+00, -1.54734287e-02,
        1.52797289e+00,  5.64339201e-02,  1.15929090e+00,  1.36034526e+00,
        1.23891543e+00,  1.35464829e+00,  1.66855438e-02,  1.55589825e+00,
       -3.86947655e-03,  1.14675577e+00,  8.59997760e-01,  2.61726899e-02,
        1.86950946e+00,  2.02964031e+00,  1.20850162e+00,  1.90493683e+00,
        1.99669480e+00,  1.19705550e+00, -1.79694377e-01,  1.20094999e+00,
        1.83588847e+00,  1.26448536e+00,  1.20043872e+00,  1.71134942e+00,
        1.20723521e+00,  2.18758129e+00,  6.94673646e-02, -2.76694411e-02,
        1.18269441e+00,  9.34753597e-01,  1.44729576e+00, -3.33951250e-02,
        1.29436310e+00,  2.03785580e+00,  1.23869637e+00,  1.07717132e+00,
       -1.28919797e-01,  2.05717034e+00, -9.58991199e-03,  1.71134942e+00,
        1.40489013e-01,  

In [10]:
model.score(x_train, y_train)

0.9332149878895715

In [11]:
model.score(x_test, y_test)

0.9154465450719517
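
Note that for `LinearRegression`, `score` returns R² (the coefficient of determination), not the share of correctly classified flowers. The sketch below checks that against `r2_score` and, as an assumption about what one might try next, fits a `LogisticRegression` classifier whose `score` is classification accuracy. The seed and `max_iter` value are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Regression: score() is the R^2 of the predictions.
regressor = LinearRegression().fit(x_train, y_train)
r2_from_score = regressor.score(x_test, y_test)
r2_from_metric = r2_score(y_test, regressor.predict(x_test))

# Classification: score() is the fraction of correct predictions.
classifier = LogisticRegression(max_iter=1000).fit(x_train, y_train)
acc_from_score = classifier.score(x_test, y_test)
acc_from_metric = accuracy_score(y_test, classifier.predict(x_test))

print("R^2:", r2_from_score, r2_from_metric)
print("accuracy:", acc_from_score, acc_from_metric)
```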

## Evaluating a model

* Models are supposed to be as accurate as possible
  * `model.score` (for `LinearRegression` this is R², not classification accuracy)
  * Read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score)

* But not *too* accurate on the training data
  * Overfitting

## The overfitting curve
[Overfitting](https://en.wikipedia.org/wiki/Overfitting)  

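The curve shows the number of training cycles on the x-axis and the error on the y-axis. At a certain point the gap between the training error (blue) and the validation error (red) starts to widen: the training error keeps falling while the validation error rises again. That point is where the validation error has its global minimum.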

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Overfitting_svg.svg/1280px-Overfitting_svg.svg.png" style="width:40%"/>
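
A curve of this kind can be computed rather than only looked at. The sketch below is one possible way, under stated assumptions: it uses synthetic noisy data from `make_classification`, and it treats the boosting rounds of a `GradientBoostingClassifier` as a stand-in for "training cycles". It is not the method behind the figure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with 20% randomly flipped labels (invented for illustration).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)
model.fit(x_train, y_train)

# staged_predict yields the predictions after each boosting round.
train_error = [np.mean(pred != y_train) for pred in model.staged_predict(x_train)]
val_error = [np.mean(pred != y_val) for pred in model.staged_predict(x_val)]

# The round with the lowest validation error corresponds to the marked point.
best_round = int(np.argmin(val_error)) + 1
print("best round:", best_round)
print("final train error:", train_error[-1], "final validation error:", val_error[-1])
```

Plotting `train_error` and `val_error` against the round number gives the two lines of the figure for this particular dataset.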

## Exercise

* Import `science.csv` to a pandas DataFrame
* Split the input (X) and target (y) using `train_test_split`
* Train the model on the training data
* Score the model based on the testing data

## Self study: Other sklearn metrics

Each model's `score` method uses a *default* metric. But there are numerous others.

https://sklearn.org/modules/classes.html#module-sklearn.metrics

Metrics usually depend on the type of your model (classification, regression, etc.)

Read this article [here](https://towardsdatascience.com/understanding-data-science-classification-metrics-in-scikit-learn-in-python-3bc336865019)

In [12]:
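import sklearn.metrics
# SCORERS was removed in scikit-learn 1.3; on newer versions use sklearn.metrics.get_scorer_names()
sklearn.metrics.SCORERS.keys()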

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_weighted'])
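
The metrics can also be called directly as functions. A minimal sketch with hand-made labels and values (invented for illustration), covering two classification metrics and one regression metric:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_squared_error)

# Hypothetical true and predicted class labels.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

accuracy = accuracy_score(y_true, y_pred)             # 5 of 6 correct
macro_f1 = f1_score(y_true, y_pred, average="macro")  # mean of the per-class F1 scores
matrix = confusion_matrix(y_true, y_pred)             # rows: true class, columns: predicted

# Hypothetical true and predicted values for a regression model.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # (0 + 0 + 4) / 3

print(accuracy, macro_f1, mse)
print(matrix)
```

Accuracy and F1 only make sense for classification, while mean squared error belongs to regression.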