# The broader context

A widespread misunderstanding when viewing progress in the field of AI (or any field of science, for that matter) is to assume a linear trend of "progress", where the current paradigm is the pinnacle of development. In fact, it is closer to the truth that at any given point multiple paradigms (and with them networks of researchers) exist in parallel, with some more dominant than others.

Carlos E. Perez summarizes this well in his essay [The many tribes of artificial intelligence](https://medium.com/intuitionmachine/the-many-tribes-problem-of-artificial-intelligence-ai-1300faba5b60).

<img src="https://cdn-images-1.medium.com/max/1400/1*ggoNLbTbRQZjcBwMBD6rGg.jpeg" width=75%>

Since multiple paradigms are being worked on at any point, progress can come from any of them, as well as from their combinations or "cross-fertilization". We argue that some of the success of deep learning can be understood by looking at it from the angle of other methods. Most prominently, we will use **kernel SVM methods** and **random forests** as our viewing lens.

# The "big problems"

## Complex decision surfaces

### Most basic decision boundary

<img src="http://drive.google.com/uc?export=view&id=1iKinTe1NnMeXFik4SLMAU22oi1C2s4jw"  width=900 heigth=900>


**Task more precisely: for logistic regression** 

- We are looking for a weight vector that defines the decision boundary separating the two areas.
- The decision boundary is perpendicular to the weight vector.
- A "step function" (a hard "unit step" or its smooth counterpart, the sigmoid) is applied on top of the linear score; together they define the whole model.
- The weights are learned by iterative approximation (e.g. gradient descent); unlike linear regression, logistic regression has no closed-form solution in general.
- Confusingly enough, we talk about "logistic regression" in this case, though we are learning the most basic of classifiers: the problem has been converted into a continuous one (predicting a class probability).
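The points above can be made concrete with a minimal sketch, assuming a toy dataset of two Gaussian blobs and plain batch gradient descent. The function names, learning rate and number of epochs are illustrative choices, not part of the original material.

```python
import math
import random

def sigmoid(z):
    """Smooth counterpart of the unit step; written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_regression(X, y, lr=0.1, epochs=300):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = sigmoid(score) - yi  # derivative of the log-loss w.r.t. the score
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Hard decision: on which side of the hyperplane w.x + b = 0 does x fall?"""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Hypothetical toy data: two well-separated Gaussian blobs.
rng = random.Random(0)
X = [[rng.gauss(-2, 1), rng.gauss(-2, 1)] for _ in range(50)]
X += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50

w, b = train_logistic_regression(X, y)
accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)
print("weights:", w, "bias:", b, "training accuracy:", accuracy)
```

The learned `w` is the normal vector of the decision boundary, so the boundary itself is the line perpendicular to it.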

**We will get back to the detailed analysis of this type of model in the class about Perceptrons.**

### Non-trivial data

- The dominant portion of interesting problems behaves radically differently, so linear models, though appealingly interpretable, are severely limited.

Take for example the case of the super-trivial "two crescents":

<img src="http://drive.google.com/uc?export=view&id=1q6TEXhcZ0hU9nv4CycGcNJyUb9RqC_Xy" width=50%>

It is absolutely obvious, from a human perspective, how to separate the two distributions, but the two classes are not linearly separable.

<img src="http://drive.google.com/uc?export=view&id=1tQu8JagtQKjd7xVbB5uDBA0CebjQcZ2B" width=50%>

We need a more interesting decision surface. 
We will see two different approaches for this below (though more exist).

## Overfitting

### Basic assumptions

- ["Empirical risk minimization"](https://en.wikipedia.org/wiki/Empirical_risk_minimization): we have a dataset (a sample drawn from a data distribution) from which we would like to learn something.
- Statistical algorithm: a function that returns an estimator for what we want to learn from the data (e.g. maximum likelihood).
- Model selection: how to select a particular algorithm (from the class of algorithms provided).

So for model selection we have to select a statistical algorithm.

#### Important notes again:


<font size="5" color="red">"All models are wrong but some are useful" - <a href="https://en.wikipedia.org/wiki/All_models_are_wrong">George Box</a></font>

And there are infinitely many models for a given situation.
 

### What model to choose?

- Performance
- Interpretability
- Deployability / Maintainability (see: ["never deploy a machine learning model once"](https://www.youtube.com/watch?v=zbS9jBB8fz8))
- **"Stability", robustness, <font color='red'>generalization power</font>**


### Data = "signal" + "noise"

- We would like to set up a model and learn its parameters so that it captures the underlying distribution (mechanism) behind the data in the most concise manner, without fitting the noise.

**If the model is fitted to the underlying distribution _and_ the distribution does not change over time (i.e. there is no covariate shift), the model will keep its explanatory = predictive power.**


### Model complexity

**"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."** - John von Neumann


#### [VC (Vapnik–Chervonenkis) dimensionality ](https://datascience.stackexchange.com/questions/32557/what-is-the-exact-definition-of-vc-dimension)

If there exists a set of $n$ points that can be shattered by the classifier, and there is no set of $n+1$ points that can be shattered by it, then the VC dimension of the classifier is $n$. (A set of points is shattered if the classifier can realize every possible labeling of those points.)

It is a measure of the "flexibility", "elasticity", **"capacity"** of a model.
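As a small illustration of shattering, the sketch below uses a hypothetical hypothesis class chosen only for this example: intervals on the real line, $h_{a,b}(x) = 1$ iff $a \le x \le b$. It enumerates all labelings the class can realize on a given set of distinct points. Two points can be shattered, three cannot, so the VC dimension of this class is 2.

```python
def interval_labelings(points):
    """All labelings of `points` realizable by h(x) = 1 iff a <= x <= b.

    It is enough to try interval endpoints taken from the points themselves,
    plus the empty interval that labels everything 0.
    """
    realizable = {tuple(0 for _ in points)}
    for a in points:
        for b in points:
            if a <= b:
                realizable.add(tuple(1 if a <= x <= b else 0 for x in points))
    return realizable

def is_shattered(points):
    """True if every one of the 2^n labelings is realizable."""
    return len(interval_labelings(points)) == 2 ** len(points)

print(is_shattered([1.0, 2.0]))
print(is_shattered([1.0, 2.0, 3.0]))
```

For any three distinct points the labeling "outer points 1, middle point 0" is out of reach of a single interval, which is why the check on one triple generalizes to all triples.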


#### [Rademacher complexity ](https://en.wikipedia.org/wiki/Rademacher_complexity)

A measure of the "flexibility", "elasticity", **"capacity"** of a model.



Suppose we have a distribution $D$, a sample $S$ of $m$ points drawn from it, and a single hypothesis $h$. We can approximate the generalization gap as
$$
L_{S}(h)-L_{D}(h) \approx L_{S_{1}}(h)-L_{S_{2}}(h)
$$

where $S$ is split into $S_{1}$ (train) and $S_{2}$ (held-out fold). If we define $l\left(h, x_{i}\right)$ as an indicator function which is 1 if $h$ errs on $x_{i}$ and 0 otherwise, we have
$$
L_{S}(h)-L_{D}(h) \approx \frac{1}{\left|S_{1}\right|} \sum_{x_{i} \in S_{1}} l\left(h, x_{i}\right)-\frac{1}{\left|S_{2}\right|} \sum_{x_{i} \in S_{2}} l\left(h, x_{i}\right)
$$
If we set $\left|S_{1}\right|=\left|S_{2}\right|=m/2$, we obtain the Rademacher average:
$$
L_{S}(h)-L_{D}(h) \approx \frac{2}{m}\left[\sum_{i} \sigma_{i} l\left(h, x_{i}\right)\right]
$$
where $\sigma_{i}=+1$ if $x_{i} \in S_{1}$ and $\sigma_{i}=-1$ if $x_{i} \in S_{2}$. If we split $S$ into the two halves randomly, this is equivalent to picking the signs $\sigma_{i}$ at random.

The Rademacher complexity with respect to $S$ is defined as:
$$
R_{S}(H):=\mathbb{E}_{\sigma} \sup _{h \in H} \frac{1}{m}\left[\sum_{i} \sigma_{i} l\left(h, x_{i}\right)\right]
$$
The concept can be extended from loss functions of hypotheses to any class $F$ of real-valued functions. Furthermore, by taking the expectation over samples $S$ of size $m$ drawn from $D$, we obtain a Rademacher complexity $R_{m}(F)$ that depends only on the sample size $m$ (and on $D$).
This ultimately gives the upper bound, which holds with probability at least $1-\delta$:
$$
\sup _{f \in F}\left[E_{D}(f)-\frac{1}{m} \sum_{i=1}^{m} f\left(x_{i}\right)\right] \leq 2 R_{m}(F)+O\left(\sqrt{\frac{\log (1 / \delta)}{m}}\right)
$$
where the second term is a Chernoff-type bound for a single $f$.


In a nutshell, the Rademacher complexity measures the ability of a hypothesis class $H$ to fit random ±1 binary labels. Compared to the VC dimension, Rademacher complexity is distribution dependent and defined for any class of real-valued functions (not only discrete-valued functions).
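The "ability to fit random ±1 labels" can be estimated by simulation. The following is a minimal Monte Carlo sketch, assuming a finite function class whose members are given by their ±1 outputs on a fixed sample of 20 ordered points. The two classes compared (a single constant function versus all threshold functions) are hypothetical examples, and the helper name `empirical_rademacher` is an assumption of this sketch.

```python
import random

def empirical_rademacher(function_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_f (1/m) sum_i sigma_i f(x_i).

    function_values: one row per function f, holding f(x_i) on the sample.
    """
    rng = random.Random(seed)
    m = len(function_values[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]  # random signs
        best = max(sum(s * v for s, v in zip(sigma, row)) for row in function_values)
        total += best / m
    return total / n_draws

m = 20
xs = list(range(m))  # only the ordering of the points matters for thresholds

# A class with a single function cannot adapt to the random signs at all.
constant_class = [[1] * m]
# Threshold functions: +1 to the right of t, -1 to the left of t.
threshold_class = [[1 if x >= t else -1 for x in xs] for t in range(m + 1)]

r_constant = empirical_rademacher(constant_class)
r_threshold = empirical_rademacher(threshold_class)
print("constant class:", r_constant, "threshold class:", r_threshold)
```

The richer class correlates better with pure noise, so its estimate is clearly larger; the single-function class stays near zero.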




#### [Occam](https://en.wikipedia.org/wiki/Occam%27s_razor) has a good shave

**„surplus should not be introduced without necessity”**

**In case of same performance, we should choose the simpler model!**

In Bayesian model comparison there is an "Occam factor" which balances the precision of a model against its complexity, so even increased accuracy is penalized if it mobilizes more capacity.

### "Restraining" the model (Regularization)
Source: Lecture series of Michael C. Mozer at DeepLearn2017 Bilbao

<img src="http://drive.google.com/uc?export=view&id=1UQK1IvY7mjcg_sODGA4x58cMh-A5jHyC">

### Bias Variance tradeoff

**Approximation error (also known as bias):** the minimum generalization error achievable by a predictor from the allowed hypothesis class. The approximation error does not depend on the sample size (Shalev-Shwartz and Ben-David, 2014).

**Estimation error (also known as variance):** the difference between the approximation error and the error achieved by the predictor in the hypothesis class that minimizes the training error. The estimation error arises because the training error is only an estimate of the generalization error, so the predictor minimizing it does not take the variance of this estimate into account (Shalev-Shwartz and Ben-David, 2014).

**Proof 1: Total error = approximation error (bias) + estimation error (variance), see [here](http://drive.google.com/file/d/1pbxVq-xtRCtIgUdMaoI41e9sjkI-2GKy/view?usp=sharing)**


**Proof 2: Resampling leads to bias, see [here](http://drive.google.com/file/d/1pbxVq-xtRCtIgUdMaoI41e9sjkI-2GKy/view?usp=sharing)**

- In a sense, we need a measure of how well the model fits/predicts, adjusted for model complexity
- AIC and BIC are two such measures for linear regression
- The problem of coefficient shrinkage is closely related (another way of underestimating the generalization error is to overestimate the coefficients, i.e. the effect of individual variables)
- Traditional statistical fitting and significance testing leads to erroneous conclusions


Source: Lecture series of Michael C. Mozer at DeepLearn2017 Bilbao
<img src="http://drive.google.com/uc?export=view&id=1ihsguAwFoHx2NQAgY2Ov0tgd3H4J2lox">

<img src="http://drive.google.com/uc?export=view&id=1WmnjSlsH18z5IgEUNJvdhqy1KFYLC2hb">

### Adding data
Source: Lecture series of Michael C. Mozer at DeepLearn2017 Bilbao

<img src="http://drive.google.com/uc?export=view&id=17IyTGGXib9u9cXDbADGXyDsLUwUIuwOc">

<font size="4" color="black">Adding more data "constrains" the set of well fitting solutions, but may require additional degrees of freedom!</font>


## Example of overfitting:

<img src="https://i.ytimg.com/vi/dBLZg-RqoLg/maxresdefault.jpg" width=600 heigth=600>

**Please observe that there are data points quite close to the decision boundary, so the _"margin"_ is low.**

We will get back to this later on...


## [Akaike](https://en.wikipedia.org/wiki/Akaike_information_criterion) vs Generalization error

(https://rdrr.io/github/profpetrie/regclass/man/overfit.demo.html)
- Akaike Information Criterion: "When a statistical model is used to represent the process that generated the data, the model will almost never be exact; so some information will be lost by using the model to represent the process. AIC estimates the relative information lost by a given model: the less information a model loses, the higher the quality of that model. (In making an estimate of the information lost, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model.)"

$$  AIC = 2k - 2\ln \hat{L} $$

where:
- $k$: number of estimated parameters in the model
- $\hat{L}$: maximum value of the likelihood function of the model
- Rewards goodness of fit: $-2\ln \hat{L}$
- Penalty for increasing number of parameters: $2k$
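To see the two terms at work, here is a minimal sketch that computes the AIC of two Gaussian regression models on hypothetical toy data with a linear trend. It assumes i.i.d. Gaussian noise with the variance estimated by maximum likelihood, in which case $\ln \hat{L} = -\frac{n}{2}\left(\ln(2\pi\hat{\sigma}^2) + 1\right)$. The helper `gaussian_aic` and the data are assumptions of this example, not taken from the referenced tutorial.

```python
import math
import random

def gaussian_aic(residuals, n_params):
    """AIC = 2k - 2 ln(L_hat) for a model with i.i.d. Gaussian noise."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # ML estimate of the noise variance
    log_likelihood = -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)
    return 2 * n_params - 2.0 * log_likelihood

# Hypothetical data: y = 1 + 0.5 x + noise
rng = random.Random(0)
xs = [float(i) for i in range(30)]
ys = [1.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# Model A: constant mean (k = 2: mean and noise variance)
aic_constant = gaussian_aic([y - mean_y for y in ys], n_params=2)
# Model B: straight line (k = 3: intercept, slope and noise variance)
aic_linear = gaussian_aic(
    [y - (intercept + slope * x) for x, y in zip(xs, ys)], n_params=3
)
print("AIC constant:", aic_constant, "AIC linear:", aic_linear)
```

The line pays a penalty of 2 for its extra parameter, but the gain in likelihood is far larger, so it gets the lower (better) AIC.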





- Nice example on linear models
- R tutorial, nicely commented, [here]
- The example gradually adds polynomial components to a regression
- As AIC rises, the model is getting "worse"
- Overfitting is gradually increasing after an optimal value, showing that the model mainly learns noise
- AIC has difficulties in high dimensional spaces with non-linear functions

<img src="http://drive.google.com/uc?export=view&id=1guMqUljUwZ7xq-vILKhTw8L2TkNvcyTi" width=400 heigth=400>


## Conclusion

**1. Always validate your performance**
(Ideally with cross-validation - see later)

**2. Traditional statistics tells us:**

<font size="5" color="red">Use a model with enough capacity but not bigger!</font>

**This will prove to be a tricky assumption later on, stay tuned! :-)**


# The solutions

Our two "contenders" for solving the above mentioned problems will be **Kernel-SVM**-s and **Random Forests** (or **Gradient Boosted Trees**, for that matter), since they were considered some of the most advanced forms of machine learning before deep models, and they are **strongly competitive in many areas even today**!!

## Non-linear decision surfaces

Two approaches for constructing non-trivial decision surfaces can be:
- Application of "kernels"
- "Piecewise construction" of surfaces

### Kernels

#### The advantages of "exploding the feature space"

As the case of polynomial regression demonstrates, "exploding the feature space", that is, transforming the data into a higher dimensional feature space can also serve us when we would like to use linear models (e.g., linear regression or classification methods with linear decision boundaries) to find nonlinear patterns (e.g. nonlinear decision boundaries). The trick is, of course, to apply a nonlinear transformation to the data and use the method to find parameters (e.g. decision boundaries) that are linear in the new feature space but nonlinear in the original one:

<img  src="https://journals.plos.org/ploscompbiol/article/figure/image?download&size=large&id=info:doi/10.1371/journal.pcbi.1000173.g006">

In its most general form, transforming the feature space means using a new set of feature vectors $\{\phi(\mathbf x_1),\dots,\phi(\mathbf x_N)\}$ for training instead of the original data $\{\mathbf x_1,\dots,\mathbf x_N\}$, where $\phi$ is any function mapping vectors to vectors. In the case of a one-variable polynomial regression, $\phi$ is simply the mapping

$$
\phi(\langle x\rangle) = \langle x, x^2,\dots,x^m \rangle.
$$
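A minimal sketch of this mapping in action, on hypothetical one-dimensional data where the positive class lies on both sides of the negative class: no single threshold on $x$ separates the classes, but a linear rule on $\phi(x) = \langle x, x^2 \rangle$ does. The weights of the linear rule are set by hand for the illustration, not learned.

```python
def poly_features(x, m):
    """phi(<x>) = <x, x^2, ..., x^m>"""
    return [x ** p for p in range(1, m + 1)]

def linear_rule(features, w, b):
    """A linear classifier in whatever feature space it is given."""
    return 1 if sum(wj * fj for wj, fj in zip(w, features)) + b > 0 else 0

# Hypothetical data: class 1 lies on both sides of class 0.
xs = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
labels = [1 if abs(x) > 1.0 else 0 for x in xs]

def best_threshold_accuracy(xs, labels):
    """Best accuracy any linear rule (a threshold) can reach in the original 1-D space."""
    best = 0.0
    for t in [min(xs) - 1.0] + xs:
        for sign in (1.0, -1.0):
            preds = [linear_rule([x], [sign], -sign * t) for x in xs]
            acc = sum(p == l for p, l in zip(preds, labels)) / len(xs)
            best = max(best, acc)
    return best

# In the mapped space the rule x^2 - 1 > 0 is linear: w = (0, 1), b = -1.
mapped_preds = [linear_rule(poly_features(x, 2), [0.0, 1.0], -1.0) for x in xs]

print("best accuracy in original space:", best_threshold_accuracy(xs, labels))
print("mapped predictions match labels:", mapped_preds == labels)
```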


<img src="https://pbs.twimg.com/media/DJJKZR2XgAAliTi.jpg" width=40%>

#### Popular kernels

Some of the widely used kernels are

* **Linear kernel**: $K(\mathbf x, \mathbf y)=\mathbf x \cdot \mathbf y$: This is the kernel without any feature mapping (or with the identity feature mapping), which is used with kernelized algorithms when no feature mapping is needed.
* **Polynomial kernels**: kernels of the form $K(\mathbf x, \mathbf y) = (1+ \mathbf x \cdot \mathbf y)^n$ where $n$ is an integer -- these kernels correspond to polynomial feature mappings (we have seen an instance as an example above).
* **Gaussian or RBF (Radial Basis Function) kernels**: kernels of the form 
$$K(\mathbf x, \mathbf y) = \exp\left(-\frac{\|\mathbf x-\mathbf y\|^2}{2\sigma^2}\right).$$
These can be seen as inducing a nonlinear, Gaussian weighted distance metric on the original feature space. On the other hand, the "implicit feature mapping" behind them is infinite dimensional, as can be shown by using the Taylor series expansion of the exponential function. (See, e.g., [these slides](https://www.csie.ntu.edu.tw/~cjlin/talks/kuleuven_svm.pdf) for details.)
* **String kernels**: These kernels operate on strings and measure their similarity in various ways, e.g., they can measure the number of substrings that occur in both of them (strings are from alphabet $A$):
$$K(\mathbf x, \mathbf y) = \sum_{s\in A^*}w_s c_s(\mathbf x)c_s(\mathbf y)$$
where $c_s(\mathbf x)$ is the number of occurrences of $s$ in $\mathbf x$ as a substring, and $w_s$ is a weight belonging to $s$. Similarly to the Gaussian kernel, the underlying feature space has an infinite number of dimensions, but here, in contrast to the Gaussian, the feature mapping used is fairly obvious.
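The first three kernels are short enough to write out directly. The sketch below also checks numerically that, for two-dimensional inputs, the degree-2 polynomial kernel equals an ordinary dot product after an explicit feature mapping. The helper `phi_poly2` is written out by hand for this example and covers only the 2-D, $n = 2$ case.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, n=2):
    return (1.0 + dot(x, y)) ** n

def rbf_kernel(x, y, sigma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def phi_poly2(x):
    """Explicit feature map behind (1 + x.y)^2 for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

x, y = [1.0, 2.0], [3.0, -1.0]
print(polynomial_kernel(x, y), dot(phi_poly2(x), phi_poly2(y)))
print(rbf_kernel(x, x), rbf_kernel(x, y))
```

The kernel evaluates the 6-dimensional dot product without ever constructing the 6-dimensional vectors, which is the computational point of the kernel trick.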

<img src="https://qph.fs.quoracdn.net/main-qimg-c7f5c6f1fc6d4be7daaaf82d975e226e" width=60%>

**The application of kernels can transform a non-linear decision case to a linear one - given we find the right kernel.**

Kernelized Support Vector Machines for example rely heavily on this trick to achieve high performance.

### "Piecewise construction" of decision surfaces

Example: Bagging (Bootstrap aggregation)

<img src="https://www.kdnuggets.com/wp-content/uploads/bagging.jpg" width=40%>

- Goal is to decrease variance
- Trains models in parallel, capitalizes on their independence
- Variance decreases because of averaging (approximation)

A good simple illustration can be found [here](https://machinelearningmastery.com/implement-bagging-scratch-python/).
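The variance-reduction effect can also be sketched in a few lines. The example below is an illustration under stated assumptions, not the linked implementation: the base model is a 1-nearest-neighbour regressor (chosen because it has high variance), the data are hypothetical noisy samples of $\sin(x)$, and the number of bootstrap models is an arbitrary choice.

```python
import math
import random
import statistics

rng = random.Random(0)

def one_nn_predict(train, x):
    """1-nearest-neighbour regression: returns the target of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

# Hypothetical toy task: noisy samples of sin(x) on [0, 2*pi].
train = []
for _ in range(200):
    x = rng.uniform(0.0, 2.0 * math.pi)
    train.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))
test_xs = [2.0 * math.pi * i / 99 for i in range(100)]

# Bagging: each model sees its own bootstrap sample (drawn with replacement).
n_models = 30
bootstrap_samples = [[rng.choice(train) for _ in train] for _ in range(n_models)]

def bagged_predict(x):
    """Average the predictions of the independently trained models."""
    return statistics.mean(one_nn_predict(sample, x) for sample in bootstrap_samples)

def mse(predict):
    """Mean squared error against the noise-free function."""
    return statistics.mean((predict(x) - math.sin(x)) ** 2 for x in test_xs)

mse_single = mse(lambda x: one_nn_predict(train, x))
mse_bagged = mse(bagged_predict)
print("single model MSE:", mse_single, "bagged MSE:", mse_bagged)
```

Averaging smooths out the noise that each individual nearest-neighbour model copies from its training points, so the bagged error against the true function is lower.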

#### Random Forest

<img src="https://www.researchgate.net/profile/Mariana_Recamonde-Mendoza/publication/280533599/figure/fig5/AS:267770621329410@1440852899493/Random-forest-model-Example-of-training-and-classification-processes-using-random.png" width=65%>

#### Random forest algorithm

<img src="http://drive.google.com/uc?export=view&id=14TV7kY7Y1cKbsLLTW_9i5WEwdKOc4RDl" width=55%>

#### Effects

It decreases variance by training decision trees in a randomized manner:
- Each tree learns on a bootstrap sample of the data (sampling with replacement)
- Each tree uses only a part of the variables (a random subset of features is considered when splitting)
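The two sources of randomness can be sketched as follows. This is a simplified illustration: the function name and parameters are hypothetical, and the feature subset is drawn once per tree for brevity, whereas the standard random forest algorithm redraws it at every split.

```python
import random

def draw_tree_inputs(n_samples, n_features, max_features, rng):
    """Pick the rows and columns one tree gets to see (simplified sketch)."""
    # Bootstrap sample: n_samples row indices drawn with replacement.
    rows = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Rows never drawn are "out-of-bag" for this tree and can serve as its validation set.
    out_of_bag = sorted(set(range(n_samples)) - set(rows))
    # Random subset of the variables, drawn without replacement.
    features = sorted(rng.sample(range(n_features), max_features))
    return rows, out_of_bag, features

rng = random.Random(0)
rows, out_of_bag, features = draw_tree_inputs(
    n_samples=1000, n_features=10, max_features=3, rng=rng
)
oob_fraction = len(out_of_bag) / 1000
print("features used:", features, "out-of-bag fraction:", oob_fraction)
```

For large samples roughly $1/e \approx 37\%$ of the rows end up out-of-bag for any given tree, which is what makes validation without a separate hold-out set possible.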

Pro:

 - Parallelizable
 - Pretty high performance
 - Helps interpretation (relevance of input variables can be gauged)
 - Drastically decreases variance
 - Less need for separate validation: has "built-in cross-validation" (out-of-bag error estimates)
 
Con:
 - Adds a little bias
 - **In regression, can only predict within the range of target values seen during training!**

Detailed description [here](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#workings)


It is worth noting that RF has a classification as well as a regression variant, so it is quite universal.

#### Sidenote: Stacking

<img src="https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier_files/stackingclassification_overview.png" width=55%>

- Heterogeneous, hierarchical method
- We train different models on the whole available training data then use the output of these models as input to train a meta-model on top of them. **This will be highly relevant in deep learning!**

#### Decision surfaces of ensembles

One of the main properties of a model is the shape of the decision boundary it can represent. In this regard, ensembles have some distinctive advantages.

Even for an ensemble of classifiers with low degrees of freedom ("simple classifiers"), the final decision boundary of the ensemble can become non-trivially shaped:

<img src="https://images.slideplayer.com/25/8236676/slides/slide_4.jpg" width=65%>

[Source](https://www3.nd.edu/~rjohns15/cse40647.sp14/www/content/lectures/31%20-%20Decision%20Tree%20Ensembles.pdf)


In case of tree based models the process can look somewhat like this:

<img src="https://shapeofdata.files.wordpress.com/2013/07/randforest.png" width=55%>

This results in a highly non-trivial boundary, e.g. in the case of a random forest model:

<img src="https://i.stack.imgur.com/HXGXw.png" width=55%>

Takeaway: **ensemble methods can "puzzle together" complex shapes for decision surfaces**

We can conclude that the ability of these models to construct complex decision boundaries in a "local", "piecewise" manner is of great advantage, especially if the built-in cross-validation approaches prevent them from overfitting.

We will come back to this ability of **"piecewise construction" of decision surfaces** in case of deep learning later on.

There is also some research on the behavior of "ensemble margins", like [here](https://tel.archives-ouvertes.fr/tel-01662444/document) (p12-p15) and [here](http://www.cs.man.ac.uk/~stapenr5/publications/tech_report2012.pdf), which investigates whether there is any connection between ensemble decision boundaries and "large margin" methods (see below).

## Ensuring stability

### Stability of boundary: Max margin classification (Support vectors)

There are potentially infinitely many decision boundaries that can separate this data. Which one to choose?

<img src="https://cdn-images-1.medium.com/max/1500/1*UGsHP6GeQmLBeteRz80OPw.png" width=600 heigth=600>
[Support Vector Machines — A Brief Overview](https://towardsdatascience.com/support-vector-machines-a-brief-overview-37e018ae310f)

**Solution: Choose the classifier with the maximum "margin".**

<img src="https://nlp.stanford.edu/IR-book/html/htmledition/img1260.png" width=400 heigth=400>

Intuition:
- Most "confident" classification
- Most tolerant to noise: needs great amount of perturbation to switch classes (this notion is still important for deep learning, especially ["adversarial examples"](http://www.mdpi.com/1999-5903/10/3/26/htm), which we may detail later)
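The "amount of perturbation needed to switch classes" is exactly the geometric margin, which is easy to compute for a linear classifier. The sketch below compares two hand-picked separating boundaries on hypothetical toy data with labels in $\{-1, +1\}$. The function name is an assumption of this example, and no SVM training is performed.

```python
import math

def geometric_margin(w, b, X, y):
    """Smallest signed distance of any point from the hyperplane w.x + b = 0.

    Positive only if every point lies on the correct side; labels are -1 / +1.
    """
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) / norm
        for xi, yi in zip(X, y)
    )

# Hypothetical toy data on a line: two negatives on the left, two positives on the right.
X = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
y = [-1, -1, 1, 1]

centered = geometric_margin([1.0, 0.0], -2.0, X, y)  # boundary at x1 = 2.0
skewed = geometric_margin([1.0, 0.0], -1.2, X, y)    # boundary at x1 = 1.2
print("centered boundary margin:", centered, "skewed boundary margin:", skewed)
```

Both boundaries classify the training points perfectly, but the centered one tolerates five times as much perturbation before any point changes class.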

#### Support vector machines (Linear)

[This](http://www.saedsayad.com/support_vector_machine.htm) is a good description.


<img src="http://www.saedsayad.com/images/SVM_optimize_3.png" width=600 heigth=600>

**One way to ensure the stability of classification is the enforcement of a large decision margin.**

Support vector machines use this technique to get reliable results.

### K-Fold-Cross-validation

- Iterative estimate of model performance
- Repeatedly ($k$ times) leave out a part of the dataset, train on the rest, and evaluate on the left-out part
- Finally, estimate the performance as the average "goodness" across the $k$ rounds

<img src="https://i.stack.imgur.com/LttqQ.png" weights=500 heigth=500>

Cross-validation can also be used with a train-valid-test split:

<img src="https://i.stack.imgur.com/0SQJq.png" weights=500 heigth=500>

[source](https://stats.stackexchange.com/questions/338044/what-is-exact-way-to-do-k-fold-validation)

**Definition:**

Divide the data (randomly) into $K$ equal groups, called folds. Let $A_k$ denote the set of data points $(Y_i, \mathbf{X}_i)$ placed into the $k$-th fold, where

- $\mathbf{Y}=(Y_1,\dots,Y_n)$ is the sample of all observations (bold indicates a vector or a matrix),
- $\mathbf{X}_i=(X_{i1},\dots,X_{ip})$ is the vector of all $X$ values belonging to a particular $Y_i$,
- $\mathbf{X}$ is the matrix whose rows are the $\mathbf{X}_i$.

For $k=1,\dots,K$, train the model on all folds except the $k$-th. Let $\hat{f}^{-k}$ denote the resulting fitted model. The estimated prediction error is the average across the folds:
$$Err_{CV} = \frac{1}{K} \sum\limits_{k=1}^{K} \bigg( \frac{1}{n/K}  \sum\limits_{i \in A_k} {(Y_i - {\hat{f}}^{-k} ( \mathbf{X}_i))^2} \bigg) $$



**Just calculating the predictive accuracy across the folds is not sufficient. You should also calculate the standard deviation of this predictive accuracy.**

K-fold is effective for two reasons:
- Predictive accuracy: a single hypothesis with fixed parameters is evaluated on an unseen dataset
- Stability: we learn how stable the relationship is across the folds (so don't just take the total error value)
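The definition above translates into a short sketch. It assumes a hypothetical toy regression task and a simple least-squares line as the model; all function names are illustrative. It reports both the mean and the standard deviation of the per-fold error, as recommended above.

```python
import random
import statistics

def k_fold_indices(n, k, rng):
    """Randomly partition the indices 0..n-1 into k folds of (nearly) equal size."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def cross_validate(xs, ys, k=5, seed=0):
    """Return the mean squared error measured on each held-out fold."""
    folds = k_fold_indices(len(xs), k, random.Random(seed))
    fold_errors = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in range(len(xs)) if i not in held_out_set]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(
            statistics.mean((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out)
        )
    return fold_errors

# Hypothetical data: y = 2x + 1 + Gaussian noise with unit variance.
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

fold_errors = cross_validate(xs, ys)
cv_mean, cv_std = statistics.mean(fold_errors), statistics.stdev(fold_errors)
print("per-fold MSE:", fold_errors)
print("CV error:", cv_mean, "+/-", cv_std)
```

A large spread across folds relative to the mean would signal that the fitted relationship is unstable, even if the average error looks acceptable.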

**Bagging ensemble models, such as Random Forests, use built-in cross-validation (out-of-bag estimates) to ensure the efficient usage of capacity and to prevent overfitting.**

This is an important aspect ensuring their success!

# Limitations

## No transformation

Though the decision surfaces of these models are complex, they **still operate inside the topology of the original data distribution**. Deep learning methods will behave in a markedly different manner: the presence of successive (hierarchic) transformations of the data will be crucial.

**Intuition:** what if we did not go for a complex surface, but just transformed ("systematically rearranged") the data so that points of the same class get near each other, and then only needed to learn a simple decision boundary on top?

It is worth noting that this shortcoming of e.g. random forests was attacked by a line of research proposing some kind of hierarchic structure for forests like in [this paper](https://arxiv.org/abs/1702.08835), or hybrid solutions like [here](https://www.cs.cmu.edu/~mfiterau/papers/2016/IJCAI_2016_DNDF.pdf) and [here](https://arxiv.org/abs/1807.06699).

Stacking is in a sense also a rudimentary solution pointing in this direction.

## "Manual" transformation

In case of kernel methods, the single biggest disadvantage is that we have to **rely on human intuition/expertise to design the kernels used**; that is to say, in a sense, feature engineering scales with the human factor.
(See for example [this](https://www.kdnuggets.com/2016/06/select-support-vector-machine-kernels.html) article, where the author basically admits that you have to try many kernels and see for yourself.)

**The biggest promise of deep learning will be the "automatic" feature (kernel) learning.**

# Conclusion

In _many_ problem domains, ensemble models, especially RandomForest and XGBoost, achieve dominant performance. In fact, many of the open competitions at Kaggle.com are won by these methods, and a dedicated paper titled [Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?](http://www.jmlr.org/papers/volume15/delgado14a/delgado14a.pdf) also concluded that RandomForest is a _very_ good first guess for solving problems.

Nonetheless, the big advantage of deep learning based methods will shine in case of huge dimensionality of input data (think hundreds or thousands of features in a really complex shape) that can be disentangled by the right (learned) representations. But more on that later.

Moreover, we can assume that for a generally strong "learner" there are some properties that define its learning ability. "Cross-validation" and the "piecewise construction" of a high-capacity (non-linear) decision surface are important characteristics, and are _shared_ by ensemble methods and deep learning, if we analyze them deeply enough.