In [None]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

## Overfitting and Regularization

- Population and samples
- Different populations (domains)
- Generalization and overfitting
- Underfitting and overfitting (bias vs variance)
- Regularizations (Lasso, Ridge, Dropout)

## Population and sample

For every machine learning task, there is a target population to which our model will be applied:
- Images to classify
- Security cameras for video analytics
- Images to segment (body and foot measurement)
- Cameras for traffic regulation
- Text for sentiment analysis
- Topic modeling
- Images for search
- Graphs to classify (whole-graph classification)
- Graphs whose nodes we classify (node classification)
- etc



Population does not mean all images for a computer vision task
<br>

Or all text for NLP tasks
<br>

Or all possible graphs for geometric machine learning tasks
<br>

Or all possible sounds for sound recognition, speech2text, text2speech, etc.
<br>

It means the specific domain in which our model will be applied

Population (domain) and sample (dataset):
<img src="images/bias_variance/population_sample_1.png" height="800" width="800">

For example, security cameras and video analytics (from personal experience):
<img src="images/bias_variance/security_camera_1.jpeg" height="800" width="800">

Most tasks are:
- Person count (visitors)
- Person tracking
- Line analysis
- Load analysis for different services
- Anomaly detection (behaviour)
- etc

Example of security camera view:
<img src="images/bias_variance/security_camera_2.jpeg" height="800" width="800">

Or even more complicated:
<img src="images/bias_variance/security_camera_3.jpeg" height="800" width="800">

For instance, this data is not a representative sample of our population:
<img src="images/bias_variance/security_camera_4.jpeg" height="800" width="800">

Representative sample:
- A sample should imitate the population
- This is because ML models are statistical models and have to learn the distribution
    - For instance, if we have a sample of men only, we cannot estimate the average height of adult humans
    - Or if we have only 15 minutes of video of a street, we cannot estimate the traffic or the mix of vehicles on this road
        - Cars
        - Trucks
        - Buses
        - Trains
        - etc
    - A more complex case: we cannot detect and track persons with high accuracy on security cameras, which are mostly mounted high up, if we train on images of persons taken by people at street level
- We need text from a similar domain to estimate sentiment on a scientific forum, instead of using data from general social network comments
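The height example above can be sketched with a small simulation. The population below is synthetic, and the means and spreads are illustrative assumptions rather than real survey data; the point is only that a men-only sample gives a biased estimate no matter how many people it contains.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: adult heights in cm, half men and half women.
# The means (178, 165) and spread (7) are assumptions chosen for illustration.
men = [random.gauss(178, 7) for _ in range(50_000)]
women = [random.gauss(165, 7) for _ in range(50_000)]
population = men + women

true_mean = statistics.fmean(population)

# Non-representative sample: drawn from men only.
biased_sample = random.sample(men, 500)
# Representative sample: drawn uniformly from the whole population.
representative_sample = random.sample(population, 500)

biased_error = abs(statistics.fmean(biased_sample) - true_mean)
representative_error = abs(statistics.fmean(representative_sample) - true_mean)

print(f"population mean:       {true_mean:.1f} cm")
print(f"men-only estimate:     {statistics.fmean(biased_sample):.1f} cm")
print(f"representative sample: {statistics.fmean(representative_sample):.1f} cm")
```

Collecting more men would not help here: the error comes from which part of the population was sampled, not from the sample size.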

Here is an example of a non-representative sample:
<img src="images/bias_variance/representative_sample_1.png" height="800" width="800">

Representative sample:
<img src="images/bias_variance/representative_sample_2.jpeg" height="800" width="800">

Representative sample (distribution)
<img src="images/bias_variance/representative_sample_3.png" height="800" width="800">

Sometimes there are straightforward methods to measure how representative a sample is
<br>

But in most cases it is almost impossible without a human in the loop and rough estimation
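One straightforward check, when a trusted reference measurement of a single numeric feature is available, is to compare the two empirical distributions. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the empirical CDFs). The helper `ks_statistic` and the synthetic data are assumptions made for illustration, and a single feature says nothing about the rest of the data.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples.

    0.0 means the empirical distributions coincide, 1.0 means they do not overlap.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The largest gap between two step functions is reached at a data point.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

random.seed(1)
reference = [random.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in for the population
similar = [random.gauss(0.0, 1.0) for _ in range(2000)]    # sample from the same domain
shifted = [random.gauss(1.5, 1.0) for _ in range(2000)]    # sample from a different domain

d_similar = ks_statistic(reference, similar)
d_shifted = ks_statistic(reference, shifted)
print(f"same domain:      {d_similar:.3f}")
print(f"different domain: {d_shifted:.3f}")
```

A small statistic does not prove that a sample is representative; a large one is a cheap warning sign that it is not.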

Back to our security cameras:
<br>

First we need person detection
<br>

Datasets and pre-trained models that can be found online are mostly collected from cars (for self-driving) or by people taking photos or videos
<br>

This data won't be applicable to our population
<br>

The generalization ability of models, and even of humans (who are sometimes overestimated), has its limits

Person detection from profile:
<img src="images/bias_variance/person_detection_1.png" height="800" width="800">

Person detection from the front:
<img src="images/bias_variance/person_detection_2.jpeg" height="800" width="800">

In many cases a model can detect persons from different views, but the performance is not good enough
<br>
There are several ways to solve the problem:
- The best way would be to collect appropriate data, label it and train a model (or use transfer learning)
- Collect a significant amount of data, label it and mix it with existing data (depending on how close the existing data is to our population / domain)
- Collect a significant amount of data, label it and use it for validation
- Find a closer, more representative sample of our population
- Collect a significant amount of data, label it, find a closer, more representative sample of our population, mix them and use the result for validation
- Collect a significant amount of data, label it, find a closer, more representative sample of our population, mix them and train a model (or use transfer learning)

Person detection closer to our population:
<img src="images/bias_variance/person_detection_3.jpeg" height="800" width="800">

Person detection, better fit for our population:
<img src="images/bias_variance/person_detection_4.png" height="800" width="800">

## Underfitting (bias) and overfitting (variance)

Underfitting is when the model has low accuracy even on the sample (training set); overfitting is when the model has high accuracy on the sample (training set) but cannot generalize to the population:
<br>
<img src="images/bias_variance/underfitting_and_overfitting_1.png" height="800" width="800">
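The same effect can be reproduced numerically. The sketch below is an illustration under stated assumptions: the data is synthetic ($y = \sin(x)$ plus noise), and it uses k-nearest-neighbour averaging instead of the curve-fitting models in the figure, only because that needs no libraries. With $k = 1$ the model memorizes the training set (overfitting); with $k$ equal to the training set size it predicts a constant (underfitting).

```python
import math
import random

random.seed(0)

def make_data(n):
    """Noisy samples of y = sin(x) on [0, 2*pi]."""
    xs = [random.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + random.gauss(0.0, 0.5)) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

train = make_data(200)
held_out = make_data(500)  # fresh draws from the same population

errors = {}
for k in (1, 15, len(train)):
    errors[k] = (mse(train, train, k), mse(train, held_out, k))
    print(f"k={k:3d}  train MSE {errors[k][0]:.3f}  held-out MSE {errors[k][1]:.3f}")
```

The $k = 1$ model has zero training error yet does worse on held-out data than the moderate model, while the constant model is poor on both sets.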

Example for classification:
<img src="images/bias_variance/underfitting_and_overfitting_2.png" height="800" width="800">

More precise view of outliers:
<img src="images/bias_variance/underfitting_and_overfitting_3.jpeg" height="800" width="800">

Examples here are shown for sample only (interpolation):
<br>
<img src="images/bias_variance/underfitting_and_overfitting_4.png" height="800" width="800">

The reason might be a non-representative sample:
<br>
<img src="images/bias_variance/underfitting_and_overfitting_5.png" height="800" width="800">

Or the model might rely too heavily on particular features:
$$
\begin{aligned}
f(x) &= W \cdot x + b \\
f(x) &= w_1 \cdot x_1^2 + w_2 \cdot x_2^2 + b \\
f(x) &= w_1 \cdot x_1^2 + w_2 \cdot x_2^2 + w_3 \cdot x_3^3 + w_4 \cdot x_4^4 + b
\end{aligned}
$$
<img src="images/bias_variance/underfitting_and_overfitting_6.png" height="800" width="800">

If we shrink $w_3$ and $w_4$ until they are almost $0$, our polynomial model approximately becomes a quadratic model and fits our data
<br>

There are many techniques for feature engineering and feature selection, and we will hopefully talk about them in the future
<br>

Instead of manually reducing the influence of particular features, we let the model decide which parameters should be reduced
<br>

Penalize the cost function: 
$$
C(W, b) = C_0(W, b) + \lambda \cdot \sum_{i=1}^{n}|w_i|
$$
<br>

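Differentiating the penalty with respect to $w_i$ gives $\lambda \cdot \mathrm{sign}(w_i)$, whose magnitude is the constant $\lambda$
<br>

This method is called $L_1$ (or lasso) regularization: it pulls every weight toward $0$ with the same constant strength, regardless of the weight's size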

Penalize the cost function: 
$$
C(W, b) = C_0(W, b) + \lambda \cdot \sum_{i=1}^{n}w_i^2
$$
<br>

Differentiating the penalty with respect to $w_i$ gives $2 \cdot \lambda \cdot w_i$
<br>

This method is called $L_2$ (or ridge) regularization: it pulls each weight toward $0$ with a strength proportional to the weight itself
<br>
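To make the difference between the two penalties concrete, here is a minimal sketch of their gradients for the penalties $\lambda \cdot \sum |w_i|$ and $\lambda \cdot \sum w_i^2$. The function names and the example weights are hypothetical; at $w_i = 0$ the $L_1$ gradient is not defined, and the code returns $0$ there, which is one valid subgradient.

```python
def l1_penalty_grad(weights, lam):
    """Gradient of lam * sum(|w_i|): magnitude lam, sign of each weight."""
    return [lam * ((w > 0) - (w < 0)) for w in weights]

def l2_penalty_grad(weights, lam):
    """Gradient of lam * sum(w_i ** 2): proportional to each weight."""
    return [2.0 * lam * w for w in weights]

weights = [4.0, 0.5, -0.01]
lam = 0.1

l1_grad = l1_penalty_grad(weights, lam)
l2_grad = l2_penalty_grad(weights, lam)
print("L1:", l1_grad)
print("L2:", l2_grad)
```

The $L_1$ gradient has the same magnitude for the large weight and the tiny one, which is why lasso tends to push small weights all the way to $0$. The $L_2$ gradient almost vanishes for the tiny weight, so ridge shrinks large weights strongly but rarely makes any weight exactly $0$.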

We can replace $\lambda$ with $\frac{\lambda}{2m}$, where $m$ is the number of training examples, and get
$$
C(W, b) = C_0(W, b) + \frac{\lambda}{2m} \cdot \sum_{i=1}^{n}w_i^2
$$

Differentiating the penalty with respect to $w_i$ now gives $\frac{\lambda}{m} \cdot w_i$
<br>

All of these methods keep each $w_i$ small in order to lower the total cost, while the original cost function $C_0$ decides how the remaining weight is distributed among them

So we can now tune the value of $\lambda$ to regularize our model:
- If $\lambda$ is too large, then to keep the cost small the weights become almost $0$, which gives an almost constant function $f(x) = b$ and causes underfitting
- If $\lambda$ is too small, the penalty has no effect on the cost, and the model can still overfit
<br>

Finding an appropriate $\lambda$ depends on the data
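The effect of $\lambda$ can be seen on the smallest possible case: a one-feature model $f(x) = w \cdot x + b$ with a squared-error cost and an $L_2$ penalty on $w$ only. This is a sketch under assumptions: the data is synthetic, the helper `fit_ridge_1d` is hypothetical, and it uses the closed-form solution for this special case rather than gradient descent.

```python
import random
import statistics

random.seed(0)

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(50)]
ys = [3.0 * x + 2.0 + random.gauss(0.0, 0.3) for x in xs]

def fit_ridge_1d(xs, ys, lam):
    """Minimise sum((y - w*x - b)**2) + lam * w**2; the bias b is not penalised."""
    x_mean, y_mean = statistics.fmean(xs), statistics.fmean(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w = s_xy / (s_xx + lam)  # lam in the denominator shrinks w toward 0
    b = y_mean - w * x_mean
    return w, b

fits = {}
for lam in [0.0, 1.0, 100.0, 1e6]:
    fits[lam] = fit_ridge_1d(xs, ys, lam)
    print(f"lambda={lam:>9}: w={fits[lam][0]:.4f}, b={fits[lam][1]:.4f}")
```

With $\lambda = 0$ the fit is ordinary least squares; as $\lambda$ grows, $w$ shrinks toward $0$ and the model approaches the constant function $f(x) = b$, where $b$ is the mean of the targets.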

## Train / Validation / Test split

How can we be sure that training is going well, with no overfitting and no underfitting?
<br>

We need data that is representative and can be used to monitor the model
<br>

Suggestion:

We split our dataset into training and validation parts
- The training part is used for model fitting
- The validation part is used for monitoring: we measure performance on it after every few training cycles

Training and validation performance curves:
<br>
<img src="images/bias_variance/train_valid_1.png" height="800" width="800">

Here is an example of overfitting:
<br>
<img src="images/bias_variance/train_valid_2.png" height="800" width="800">

Early stopping:
<br>
<img src="images/bias_variance/train_valid_3.jpeg" height="800" width="800">
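A minimal sketch of the early stopping rule, assuming the validation loss is recorded once per epoch. The helper name `early_stopping_epoch`, the `patience` parameter and the loss values are made up for illustration; a real training loop would also save the model weights at the best epoch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch seen before stopping.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Made-up validation curve: improves, then rises as the model starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.48, 0.47, 0.49, 0.52, 0.56, 0.40]

best = early_stopping_epoch(val_losses, patience=3)
print("stop and keep the model from epoch", best)
```

The choice of `patience` is a trade-off: in this made-up curve a later epoch is better still, and a small `patience` never reaches it, while a large `patience` spends more training time after the first minimum.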


The gap between training and validation cost might grow, but the validation cost should not go up:
<br>
<img src="images/bias_variance/train_valid_4.png" height="800" width="800">

A test split is also used
<br>
The model never trains on validation data directly, but it is still fitted to it indirectly through hyperparameter tuning
<br>

Test data should not be used for training at all (not even for validation)
<br>

After training is done, the test set should be used for the final performance estimate
<br>

The test set should be representative as well


The benchmark might be the test set
<br>

Or it might be provided by the client
<br>

Because of the probabilistic nature of models, measuring on such held-out data is the best way to estimate performance

Typical split (training / validation / test):
- For small datasets: $60$% / $20$% / $20$%
- For bigger datasets:  $80$% / $10$% / $10$%
- For large datasets:  $90$% / $5$% / $5$%
- But in the end it depends on the data
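A split with these proportions can be sketched as follows. The helper `train_val_test_split` and its parameters are hypothetical names, and the plain random shuffle assumes the examples are independent of each other.

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of `items` and cut it into train / validation / test lists."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = round(len(shuffled) * val_frac)
    n_test = round(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_val - n_test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# 60% / 20% / 20% split of 1000 example indices.
train, val, test = train_val_test_split(range(1000), val_frac=0.2, test_frac=0.2)
print(len(train), len(val), len(test))
```

When examples are correlated, for instance consecutive frames from the same camera, a random shuffle can put near-duplicates into both training and test parts; splitting by camera or by time period is then the safer choice.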

## Thank you

## Questions

<img src="images/intro2/questions_2.jpg" height="800" width="800">
