# Polynomial features & curse of dimensionality

## Prerequisites

- Gradient-based methods

## Introduction

In the previous chapter we've learnt how to train linear regression and what it's good for.

A major limitation of linear/logistic regression is that, as the name suggests, it __is linear__.

> __Most real-life relationships in data are non-linear__

So our $X \rightarrow Y$ function could be something like:
- $f(x, y, z) = x^2 + y^3 - \frac{1}{z}$
- $f(x) = (x+8)^3$
- $f(x, y) = e^{x} - e^{(x+y)}$

Our `boston` regression task is most probably non-linear as well; we will go over that soon.

## Modelling more advanced functions

We can make more complex models by assuming that they should contain more, different mathematical terms. 

For example, instead of our model just having a term for $x$, it could also have a term for $x^2$, $x^3$, and so on (that is, it could include higher order terms). We could make all kinds of models by including any kind of mathematical terms such as (but not limited to):
- trigonometric terms
- exponential terms
- Gaussian terms

Simple linear model: 

$$
y = b + w_1x
$$

Higher order polynomial model of single variable: 

$$
y = b + w_1x + w_2x^2 + w_3x^3
$$

To do that, we simply __augment__ our train, validation and test sets with polynomial features.
    
Our X variable would look like this now:

![title](images/X_matrix.jpg)
    
We will train these more complex polynomial models to learn to perform the same task as we did previously.

Let's try fitting more complex curves than just a straight line by using a polynomial model for multivariate linear regression.
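As a minimal sketch of the augmentation step (not the course's own implementation), the snippet below builds the columns $x$, $x^2$, $x^3$ with NumPy and fits the cubic model from above by ordinary least squares. The helper name `augment_with_powers` and the noise-free synthetic data are assumptions made purely for illustration.

```python
import numpy as np


def augment_with_powers(x, degree):
    """Stack x, x^2, ..., x^degree as the columns of a feature matrix."""
    x = np.asarray(x, dtype=float).reshape(-1)
    return np.column_stack([x ** power for power in range(1, degree + 1)])


# Hypothetical data: a noise-free cubic, so the fit can recover the weights
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + 0.25 * x ** 3

X_poly = augment_with_powers(x, degree=3)  # shape (200, 3)

# Prepend a column of ones so the first weight plays the role of the bias b
X_design = np.column_stack([np.ones(len(x)), X_poly])
weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(X_poly.shape)
print(np.round(weights, 3))  # approximately [b, w_1, w_2, w_3]
```

The model is still linear in the weights $b, w_1, w_2, w_3$, so the same linear-regression machinery applies; only the inputs have changed.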

## Loading data

We will use some `aicore` utilities to avoid repeating ourselves.

```python
from aicore.ml import data
from sklearn import datasets, model_selection, preprocessing


def standard_scaler(*datasets):
    # Fit the scaler on the first dataset (train) only, then apply it to all of them
    scaler = preprocessing.StandardScaler().fit(datasets[0])
    return [scaler.transform(dataset) for dataset in datasets]


# Synthetic 5-class classification problem with 8 features
(X_train, y_train), (X_validation, y_validation), (X_test, y_test) = data.split(
    datasets.make_classification(
        n_samples=10000,
        n_features=8,
        n_informative=4,
        n_redundant=2,
        n_repeated=1,
        n_classes=5,
        n_clusters_per_class=3,
    )
)

X_train, X_validation, X_test = standard_scaler(X_train, X_validation, X_test)
```

## Multiple polynomial features

As we have `8` features, the polynomial expansion contains every feature plus every product of features up to the chosen degree (each product is computed column-wise).

Let's assume we have only two features, `a` and `b`. In this case, the polynomial combinations up to degree `2` would be:

$$
    [1, a, b, a^2, ab, b^2]
$$
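To make the two-feature case concrete, here is a small sketch using `sklearn`'s `PolynomialFeatures` on a single sample with `a = 2` and `b = 3` (values chosen arbitrarily for illustration). The `powers_` attribute lists the exponent of each input feature in every output column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

sample = np.array([[2.0, 3.0]])  # one row with a = 2, b = 3

poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(sample)

# Columns are [1, a, b, a^2, ab, b^2]
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]

# One row per output column: exponents of (a, b)
print(poly.powers_)
```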

## Exercise

Let's use `sklearn` to make polynomial features for us ([here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) is the documentation).

Create a `polynomial_datasets` function taking `(degree, *datasets)` as arguments.

- First, create a `PolynomialFeatures` object with the specified degree
- Return a list (via list comprehension) of the transformed datasets, using the `fit_transform` method of the `PolynomialFeatures` object

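```python
from sklearn import preprocessing


def polynomial_datasets(degree: int, *datasets):
    polynomial = preprocessing.PolynomialFeatures(degree=degree)
    # fit_transform on every dataset is safe here: fitting PolynomialFeatures
    # only records the number of input features, not any data statistics
    return [polynomial.fit_transform(dataset) for dataset in datasets]


X_train_poly, X_validation_poly, X_test_poly = polynomial_datasets(
    2, X_train, X_validation, X_test
)

# Rescale the expanded features, again using statistics from the train set only
X_train_poly, X_validation_poly, X_test_poly = standard_scaler(
    X_train_poly, X_validation_poly, X_test_poly
)
```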

```python
X_train_poly.shape
```

```
(6000, 45)
```

You can see how our features expanded __drastically__: from `8` columns to `45`. The higher the degree, the more features you are going to have.
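The growth can be counted directly. With the default settings of `PolynomialFeatures` (bias column included, all interaction terms kept), the number of output columns for $n$ input features and degree $d$ is the binomial coefficient $\binom{n + d}{d}$. The sketch below evaluates it for the 8 features used here; the helper name `n_polynomial_features` is hypothetical.

```python
from math import comb


def n_polynomial_features(n_features: int, degree: int) -> int:
    """Count monomials of total degree <= `degree` in `n_features` variables (bias included)."""
    return comb(n_features + degree, degree)


for degree in range(1, 6):
    print(degree, n_polynomial_features(8, degree))
# 1 9
# 2 45
# 3 165
# 4 495
# 5 1287
```

Degree `2` gives `45`, which matches the shape printed above.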

## Testing on polynomial and non-polynomial features


As previously, we will compare how an algorithm performs with polynomial and non-polynomial features.

```python
from sklearn import linear_model
from sklearn.metrics import accuracy_score


def check(X_train, y_train, X_test, y_test):
    # Fit a logistic regression classifier and report accuracy on both splits
    clf = linear_model.LogisticRegression()
    clf.fit(X_train, y_train)
    print(f"Train accuracy: {accuracy_score(y_train, clf.predict(X_train))}")
    print(f"Test accuracy: {accuracy_score(y_test, clf.predict(X_test))}")


print("NON-POLYNOMIAL")
check(X_train, y_train, X_test, y_test)
print("POLYNOMIAL")
check(X_train_poly, y_train, X_test_poly, y_test)
```

```
NON-POLYNOMIAL
Train accuracy: 0.4841666666666667
Test accuracy: 0.4935
POLYNOMIAL
Train accuracy: 0.64
Test accuracy: 0.648
```


## What happened?

As this task is a little harder, a plain linear model (here, logistic regression) is unable to fit our data well. When presented with a non-linear transformation of the features, the task becomes a little easier.

### Why not degree `100`?

We will get to that in the next chapters, but too high a degree leads to overfitting on the training dataset (where the model memorizes `train` but is unable to generalize to `test`).
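A rough sketch of this effect, on a hypothetical one-dimensional toy problem (a noisy sine wave with only 20 training points) rather than the dataset above: as the degree grows, the training error can only go down, while the error on held-out points typically starts to rise. The data, degrees, and helper names are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def noisy_sine(n_samples):
    # Hypothetical toy data: sin(3x) plus Gaussian noise
    x = rng.uniform(-1.0, 1.0, size=n_samples)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n_samples)
    return x, y


def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))


x_train, y_train = noisy_sine(20)
x_test, y_test = noisy_sine(200)

results = {}
for degree in (1, 3, 15):
    # Least-squares polynomial fit of the given degree on the train set
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    train_mse, test_mse = results[degree]
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

The exact numbers depend on the random seed, so compare the trend between train and test error rather than any single value.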

# The Curse of Dimensionality

Another important aspect of `deep learning` is:

> the higher-dimensional the problem, the harder it is to find the solution

The search becomes computationally infeasible, and it becomes really hard to __find manifolds in high-dimensional space__.

## Manifolds

> Manifolds are lower-dimensional spaces that describe data living in a higher-dimensional space

The easiest way to see what a manifold is, is to look at the picture below:

![](images/manifold.jpg)
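As a small numerical illustration (a hypothetical example, not necessarily the shape shown in the picture), the points of a helix live in three-dimensional space, yet a single number `t` is enough to describe each of them. The helix is therefore a one-dimensional manifold embedded in 3D.

```python
import numpy as np

# One intrinsic coordinate t ...
t = np.linspace(0.0, 4.0 * np.pi, 200)

# ... mapped to three observed coordinates (x, y, z)
points = np.column_stack([np.cos(t), np.sin(t), t / (4.0 * np.pi)])
print(points.shape)  # (200, 3)

# All points sit on a cylinder of radius 1, so they fill only a thin
# curve of the surrounding 3D space
radius = np.hypot(points[:, 0], points[:, 1])
print(np.allclose(radius, 1.0))  # True
```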


## Random search

Let's assume we train our model by trying random parameters. If we test enough possibilities, we can find a good approximation (even a perfect one in some cases).

> Such an approach is called random search; it is sometimes used when we have no prior knowledge about a task or no better way to solve it

To model more complex functions we'll need more complex models.

But the time taken for random search scales **exponentially** with the number of parameters. This is because it has to search the whole parameter space, which has as many dimensions as the number of parameters.
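A minimal sketch of random search, assuming a two-parameter linear model, a mean squared error criterion, and parameters drawn uniformly from $[-5, 5]$ (all of these are illustrative choices, not part of the original material):

```python
import numpy as np


def random_search(X, y, n_trials, rng, low=-5.0, high=5.0):
    """Try `n_trials` random weight vectors and keep the one with the lowest MSE."""
    best_weights, best_loss = None, np.inf
    for _ in range(n_trials):
        weights = rng.uniform(low, high, size=X.shape[1])
        loss = float(np.mean((X @ weights - y) ** 2))
        if loss < best_loss:
            best_weights, best_loss = weights, loss
    return best_weights, best_loss


# Hypothetical data generated by known weights
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0])

# Same seed for both runs, so the long run starts with the short run's samples
_, loss_few = random_search(X, y, 10, np.random.default_rng(1))
best_weights, loss_many = random_search(X, y, 10_000, np.random.default_rng(1))

print(f"best loss after     10 trials: {loss_few:.4f}")
print(f"best loss after 10,000 trials: {loss_many:.4f}")
print("best weights found:", np.round(best_weights, 2))
```

With only two parameters, a few thousand trials usually land close to the true weights. Each added parameter multiplies the volume that has to be covered, which is why the same budget becomes hopeless in high dimensions.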

Imagine our parameters can only take the integer values $0$ or $1$:

- If we have one parameter, the entire parameter space is $\{0, 1\}$.
That is, we have to check the criterion for $2 \ (=2^1)$ possible parameterisations.
- If we have two parameters, the entire parameter space is $\{(0, 0), (0, 1), (1, 0), (1, 1)\}$.
That is, we have to check the criterion for $4 \ (=2^2)$ possible parameterisations.
- If we have three parameters, the entire parameter space is $\{(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)\}$.
That is, we have to check the criterion for $8 \ (=2^3)$ possible parameterisations.

As you should be able to tell from the pattern, the number of possible parameterisations is given by:

$$
n^d
$$

where $n$ is the number of values each parameter can take and $d$ is the number of parameters.

This is the case where parameters can take only a few integer values, which massively simplifies the problem; it is optimistic, because most models' parameters can take continuous values.
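The counting argument can be checked by enumerating the parameter space with `itertools.product`. The helper name `parameter_space` is hypothetical; for larger numbers of parameters, only the count is printed, since listing every combination quickly becomes impractical.

```python
from itertools import product


def parameter_space(values, n_parameters):
    """List every combination of `values` across `n_parameters` parameters."""
    return list(product(values, repeat=n_parameters))


for n_parameters in (1, 2, 3):
    space = parameter_space((0, 1), n_parameters)
    print(n_parameters, len(space), space)

# The count is len(values) ** n_parameters, which grows exponentially
for n_parameters in (10, 20, 30):
    print(n_parameters, 2 ** n_parameters)
# 10 1024
# 20 1048576
# 30 1073741824
```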

Such a space cannot be covered by random search, and it becomes increasingly hard to optimize with heuristics such as optimizers (though the limit is unknown, and models with billions of parameters have been optimized successfully).

## Challenges

- Play around with the polynomial degree and find the best one for the presented task on the `test` set
- Find tasks where polynomial features do more harm than good (__tip__: those should be quite simple tasks without a lot of data samples)
- Try the [RBF kernel](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.RBF.html) instead of polynomial features. How does it work, and does it work better for our task?
- Is there a way to generate such functions automatically (learn them from data)? Does such a thing exist?
- What is [manifold learning](https://scikit-learn.org/stable/modules/manifold.html)? Build some intuition about it

## Summary

- Polynomial features can give linear models a way to model non-linear relationships
- Too high a degree may cause overfitting, that is, learning noise instead of real-life relationships