<a href="https://colab.research.google.com/github/Richish/hands_on_ml/blob/master/5_SVM_regression_and_svm_detailed_insights.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SVM Regression
SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (instances off the street).

The width of the street is controlled by the hyperparameter ϵ.
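To make "on the street" concrete, here is a minimal sketch of the ϵ-insensitive loss, which is the default loss of `LinearSVR`. The arrays `y_true` and `y_pred` are made-up values for illustration only.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the street (|error| <= epsilon), linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Made-up targets and predictions (not from the notebook)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 2.0, 5.0, 2.0])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=1.5)
print(losses)  # the first two errors are within 1.5, so their loss is 0
```

Only the last two instances, whose errors exceed ϵ = 1.5, contribute to the loss.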

In [None]:
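import numpy as np
from sklearn.svm import LinearSVR

# Assumed toy data (not in the original notebook): noisy linear target.
X = 2 * np.random.rand(50, 1)
y = (4 + 3 * X + np.random.randn(50, 1)).ravel()
svm_reg = LinearSVR(epsilon=1.5)  # epsilon sets the width of the street
svm_reg.fit(X, y)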

In [None]:
from sklearn.svm import SVR

# Kernelized SVM regression with a 2nd-degree polynomial kernel; X, y are the training data.
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)

## Hyperparameter tuning

`epsilon` defines the width of the street. Adding more training instances within the margin does not affect the model's predictions, hence SVM Regression is called ϵ-insensitive.

`C` controls regularization in the `SVR` class. A greater `C` corresponds to less regularization.
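A small sketch of what ϵ-insensitivity means in practice: instances strictly inside the street are not support vectors, so a wider street leaves fewer instances that influence the model. The data (`X_demo`, `y_demo`) and the chosen values of `C` and `epsilon` are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVR

# Made-up noisy linear data (noise std 0.3), not from the notebook
rng = np.random.default_rng(0)
X_demo = 2 * rng.random((100, 1))
y_demo = 4 + 3 * X_demo.ravel() + rng.normal(scale=0.3, size=100)

# Same model, two street widths
narrow = SVR(kernel="linear", C=100, epsilon=0.1).fit(X_demo, y_demo)
wide = SVR(kernel="linear", C=100, epsilon=1.5).fit(X_demo, y_demo)

# support_ holds the indices of the instances on or outside the street
n_narrow = len(narrow.support_)
n_wide = len(wide.support_)
print(n_narrow, n_wide)
```

With the wide street almost every instance sits inside it, so only a handful of support vectors remain.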

# Under the hood - SVC and SVR

Decision function: $w^T x + b = w_1 x_1 + \cdots + w_n x_n + b$

Classification:
$\hat{y} = 1$ if the decision function is $> 0$, else $\hat{y} = 0$.
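The two formulas above can be sketched in a few lines of NumPy. The weights `w`, bias `b` and instances `X_new` are hypothetical values, not a trained model.

```python
import numpy as np

def decision_function(X, w, b):
    # One score per row of X: w^T x + b
    return X @ w + b

def predict(X, w, b):
    # Class 1 where the score is positive, class 0 otherwise
    return (decision_function(X, w, b) > 0).astype(int)

w = np.array([1.0, -2.0])  # hypothetical weights
b = 0.5                    # hypothetical bias
X_new = np.array([[3.0, 1.0], [0.0, 1.0], [1.0, 1.0]])

scores = decision_function(X_new, w, b)
labels = predict(X_new, w, b)
print(scores, labels)
```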


## Decision function for a 2-feature dataset

The decision function that corresponds to the model is a two-dimensional plane if the dataset has two features. The decision boundary is the set of points where the decision function is equal to 0: it is the intersection of two planes (the decision function plane and the feature plane z = 0), which is a straight line (usually drawn as a thick solid line).

The dashed lines represent the points where the decision function is equal to 1 or -1: they are parallel and at equal distance to the decision boundary, forming a margin around it. Training a linear SVM classifier means finding the values of w and b that make this margin as wide as possible while avoiding margin violations (hard margin) or limiting them (soft margin).



### My intuition of how to get the dashed lines:
Consider the line on the decision function plane where z = 1, and similarly the line on the decision function plane where z = -1.
Now project these lines onto the original feature plane (the x-y plane) by dropping them along the z-axis.
These 2 projected lines are the dashed lines. They are the edges of the street, and they are parallel to the solid line in the centre.

Now consider the slope of the decision plane with respect to the feature plane. As the slope approaches 90 degrees, the dashed lines (i.e., the projections of the z = 1 and z = -1 lines from the decision function plane onto the x-y plane) move onto the solid line, so the street width approaches 0.
If you start decreasing the slope, the width of the street starts increasing, and it grows without bound as the decision plane approaches being parallel to the feature plane.

## Training objective

Consider the slope of the decision function: it is equal to the norm of the weight vector,
$\lVert w \rVert$. If we divide this slope by 2, the points where the decision function is equal
to ±1 are going to be twice as far away from the decision boundary. In other words,
dividing the slope by 2 will multiply the margin by 2. The smaller the weight vector w, the larger the margin
(see my intuition above).
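A quick numeric check of this claim, as a sketch: the lines where the decision function equals ±1 lie at distance 1/∥w∥ from the decision boundary, so the street is 2/∥w∥ wide. The vector `w` and bias `b` are arbitrary illustrative values.

```python
import numpy as np

def street_width(w):
    # Distance between the lines w^T x + b = 1 and w^T x + b = -1
    return 2.0 / np.linalg.norm(w)

w = np.array([3.0, 4.0])  # arbitrary weights with norm 5
b = -1.0                  # arbitrary bias

width = street_width(w)
width_half_slope = street_width(w / 2)

# Geometric check: start on the boundary, walk 1/||w|| along the unit normal
x_on_boundary = -b * w / (w @ w)
x_on_margin = x_on_boundary + (w / np.linalg.norm(w)) / np.linalg.norm(w)
score_on_boundary = w @ x_on_boundary + b
score_on_margin = w @ x_on_margin + b
print(width, width_half_slope, score_on_boundary, score_on_margin)
```

Halving `w` doubles the width, and the point reached after walking 1/∥w∥ from the boundary has a decision function value of 1.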

So we want to minimize ∥ w ∥ to get a large margin. However, if we also want to avoid
any margin violation (hard margin), then we need the decision function to be greater
than 1 for all positive training instances, and lower than –1 for negative training
instances.

If we define $t^{(i)} = -1$ for negative instances (if $y^{(i)} = 0$) and $t^{(i)} = 1$ for positive
instances (if $y^{(i)} = 1$), then we can express this constraint as $t^{(i)}(w^T x^{(i)} + b) \geq 1$ for all
instances.
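A minimal sketch of this constraint check, on a made-up four-point dataset (`X_toy`, `y_toy`) with two hand-picked candidate models:

```python
import numpy as np

# Made-up linearly separable toy set (not from the notebook)
X_toy = np.array([[0.0, 0.0], [-1.0, 0.0], [2.0, 2.0], [3.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])

# Map labels {0, 1} to targets {-1, +1}
t = 2 * y_toy - 1

def satisfies_hard_margin(w, b, X, t):
    # True when t_i * (w^T x_i + b) >= 1 holds for every instance
    return bool(np.all(t * (X @ w + b) >= 1.0))

ok = satisfies_hard_margin(np.array([0.5, 0.5]), -1.0, X_toy, t)
too_flat = satisfies_hard_margin(np.array([0.25, 0.25]), -0.5, X_toy, t)
print(ok, too_flat)
```

The second candidate has the same decision boundary but half the slope, so its street is too wide and some instances end up inside it.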

### Hard margin linear SVM classifier objective
$$\min_{w, b} \; \frac{1}{2} w^T w \quad \text{subject to} \quad t^{(i)}(w^T x^{(i)} + b) \geq 1 \;\text{ for } i = 1, 2, \cdots, m$$

We are minimizing $\frac{1}{2} w^T w$, which is equal to $\frac{1}{2} \lVert w \rVert^2$, rather than
minimizing $\lVert w \rVert$, because $\lVert w \rVert^2$ is differentiable at $w = 0$, whereas $\lVert w \rVert$ is not.
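A short worked version of the differentiability argument (standard vector calculus, added here as a sketch rather than taken from the original notes):

```latex
\nabla_w \left( \tfrac{1}{2} w^T w \right) = w
\qquad \text{(defined everywhere, equal to } 0 \text{ at } w = 0 \text{)}

\nabla_w \lVert w \rVert = \frac{w}{\lVert w \rVert}
\qquad \text{(undefined at } w = 0 \text{)}
```

Since squaring is monotonically increasing for non-negative values, both objectives are minimized by the same w and b under the same constraints.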

### To get the soft margin objective:
We need to introduce a slack variable $\zeta^{(i)} \geq 0$ for each instance: $\zeta^{(i)}$ measures how much the $i$-th instance is allowed to violate the margin. We now have two conflicting objectives: making the slack variables as small as possible to reduce the margin violations, and making $\frac{1}{2} w^T w$ as small as possible to increase the
margin.

This is where the C hyperparameter comes in: it allows us to define the trade-off between these two objectives. This gives us the following constrained optimization problem.

Soft margin linear SVM classifier objective:
$$\min_{w, b, \zeta} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \zeta^{(i)} \quad \text{subject to} \quad t^{(i)}(w^T x^{(i)} + b) \geq 1 - \zeta^{(i)} \;\text{ and }\; \zeta^{(i)} \geq 0 \;\text{ for } i = 1, 2, \cdots, m$$
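For a fixed w and b, the smallest slack that satisfies both constraints is max(0, 1 − t(wᵀx + b)). The sketch below uses that to evaluate the soft margin objective on a made-up dataset where one negative instance sits on the wrong side. The helper name `soft_margin_objective` and all values are illustrative assumptions.

```python
import numpy as np

def soft_margin_objective(w, b, X, t, C):
    # Smallest feasible slack per instance for this (w, b)
    slack = np.maximum(0.0, 1.0 - t * (X @ w + b))
    return 0.5 * (w @ w) + C * slack.sum(), slack

# Made-up data: the second negative instance lies beyond the boundary
X_toy = np.array([[0.0, 0.0], [1.5, 1.5], [2.0, 2.0], [3.0, 2.0]])
t = np.array([-1.0, -1.0, 1.0, 1.0])
w = np.array([0.5, 0.5])
b = -1.0

obj_small_C, slack = soft_margin_objective(w, b, X_toy, t, C=1.0)
obj_large_C, _ = soft_margin_objective(w, b, X_toy, t, C=100.0)
print(slack, obj_small_C, obj_large_C)
```

The same violation costs far more with a large `C`, which is why a large `C` pushes the optimizer towards fewer margin violations and a narrower street.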




## How to solve hard and soft margin problems

### Quadratic Programming
The hard margin and soft margin problems are both convex quadratic optimization
problems with linear constraints. Such problems are known as Quadratic Programming
(QP) problems. Many off-the-shelf solvers are available to solve QP problems
using a variety of techniques. 

A QP solver typically minimizes $\frac{1}{2} p^T H p + f^T p$ subject to $A p \leq c$. So one way to train a hard margin linear SVM classifier is just to use an off-the-shelf
QP solver, passing it H, f, A and c chosen so that this general form matches the hard margin objective. The resulting vector p will contain
the bias term $b = p_0$ and the feature weights $w_i = p_i$ for i = 1, 2, ⋯, n. Similarly, you
can use a QP solver to solve the soft margin problem.
However, to use the kernel trick we are going to look at a different constrained optimization
problem.
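As a sketch of the QP route, the hard margin problem can be written with p = [b, w], H the identity matrix with a zero in its top-left cell (so the bias is not penalized), f = 0, and one constraint row −t⁽ⁱ⁾[1, x⁽ⁱ⁾] ≤ −1 per instance. The example below assumes SciPy is available and uses `scipy.optimize.minimize` with SLSQP, a general constrained solver standing in for a dedicated QP solver. The toy data is made up.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable toy set
X_toy = np.array([[0.0, 0.0], [-1.0, 0.0], [2.0, 2.0], [3.0, 2.0]])
t = np.array([-1.0, -1.0, 1.0, 1.0])
m, n = X_toy.shape

# QP parameters for p = [b, w_1, ..., w_n]
H = np.eye(n + 1)
H[0, 0] = 0.0                       # no penalty on the bias term
f = np.zeros(n + 1)
X_bias = np.c_[np.ones(m), X_toy]   # prepend the bias feature x_0 = 1
A = -t[:, None] * X_bias            # row i is -t_i * [1, x_i]
c = -np.ones(m)                     # A p <= c  <=>  t_i (w^T x_i + b) >= 1

def objective(p):
    return 0.5 * p @ H @ p + f @ p

def gradient(p):
    return H @ p + f

# SLSQP expects inequality constraints in the form fun(p) >= 0
constraint = {"type": "ineq", "fun": lambda p: c - A @ p, "jac": lambda p: -A}
result = minimize(objective, x0=np.zeros(n + 1), jac=gradient,
                  method="SLSQP", constraints=[constraint])

b_hat, w_hat = result.x[0], result.x[1:]
print(b_hat, w_hat)
```

For this dataset the closest opposite-class points are (0, 0) and (2, 2), so the solver should land near w = (0.5, 0.5) and b = −1.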

### The Dual Problem
Given a constrained optimization problem, known as the primal problem, it is possible
to express a different but closely related problem, called its dual problem. The solution
to the dual problem typically gives a lower bound to the solution of the primal
problem, but under some conditions it can even have the same solutions as the primal
problem. Luckily, the SVM problem happens to meet these conditions, so you
can choose to solve the primal problem or the dual problem; both will have the same
solution.
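One way to see the primal and dual agree, as a sketch: scikit-learn's `SVC` solves the dual problem, and for a linear kernel the primal weights can be rebuilt from the dual solution as `dual_coef_ @ support_vectors_`. The toy data and the very large `C` (used here to approximate a hard margin) are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up linearly separable toy set
X_toy = np.array([[0.0, 0.0], [-1.0, 0.0], [2.0, 2.0], [3.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates the hard margin problem
svc = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

# dual_coef_ holds one signed dual coefficient per support vector
w_from_dual = svc.dual_coef_ @ svc.support_vectors_
print(w_from_dual, svc.coef_, svc.intercept_)
```

The weights rebuilt from the dual coefficients match `coef_`, the primal weight vector that scikit-learn exposes for linear kernels.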

### Kernelized SVM (Kernel Trick)
Suppose you want to apply a 2nd-degree polynomial transformation to a two-dimensional training set (such as the moons training set), then train a linear SVM classifier on the transformed training set.

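Transformed x: $\phi(x) = \phi([x_1, x_2]) = [x_1^2, \sqrt{2}\, x_1 x_2, x_2^2]$

Dot product of 2 transformed instances (a dot product of two 3-dimensional vectors):

$$\begin{aligned}
\phi(a)^T \phi(b) &= [a_1^2, \sqrt{2}\, a_1 a_2, a_2^2]^T [b_1^2, \sqrt{2}\, b_1 b_2, b_2^2] \\
&= a_1^2 b_1^2 + 2 a_1 b_1 a_2 b_2 + a_2^2 b_2^2 \\
&= (a_1 b_1 + a_2 b_2)^2 \\
&= ([a_1, a_2]^T [b_1, b_2])^2 \\
&= (a^T b)^2
\end{aligned}$$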

Hence in the end, we can just take the dot product of the original 2-dimensional vectors and square the resulting scalar.
We don't need to compute polynomial features at all, even though we get the effect of calculating them.

In the end we don't even need to know the actual ϕ to get the dot product ϕ(a)·ϕ(b). We just need to know the vectors a and b. This is the essence of the kernel trick.

The function $K(a, b) = (a^T b)^2$ is called a 2nd-degree polynomial kernel. In Machine Learning, a kernel is a function capable of computing the dot product $\phi(a)^T \phi(b)$ based only on the original vectors a and b, without having to compute (or even to know about) the transformation ϕ.
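A numeric sanity check of the identity, as a sketch. Note the √2 factor on the cross term of ϕ, which is what makes the expansion match (aᵀb)². The vectors `a` and `b` are arbitrary.

```python
import numpy as np

def phi(x):
    # Explicit 2nd-degree polynomial mapping of a 2-dimensional vector
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    # Same result without ever building phi(a) or phi(b)
    return (a @ b) ** 2

a = np.array([1.0, 2.0])   # arbitrary vectors
b = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)
via_kernel = poly2_kernel(a, b)
print(explicit, via_kernel)
```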

Common kernels:

- Linear: $K(a, b) = a^T b$
- Polynomial: $K(a, b) = (\gamma a^T b + r)^d$
- Gaussian RBF: $K(a, b) = \exp(-\gamma \lVert a - b \rVert^2)$
- Sigmoid: $K(a, b) = \tanh(\gamma a^T b + r)$
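These four kernels translate directly into NumPy. The function names and default parameter values below are illustrative choices, not scikit-learn's API.

```python
import numpy as np

def linear_kernel(a, b):
    return a @ b

def polynomial_kernel(a, b, gamma=1.0, r=0.0, d=2):
    return (gamma * (a @ b) + r) ** d

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def sigmoid_kernel(a, b, gamma=1.0, r=0.0):
    return np.tanh(gamma * (a @ b) + r)

a = np.array([1.0, 2.0])   # arbitrary vectors
b = np.array([3.0, -1.0])

k_lin = linear_kernel(a, b)
k_poly = polynomial_kernel(a, b, gamma=1.0, r=1.0, d=3)
k_rbf = rbf_kernel(a, b, gamma=0.5)
k_sig = sigmoid_kernel(a, b)
print(k_lin, k_poly, k_rbf, k_sig)
```

Each function only touches the original vectors, which is the point of the kernel trick.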

### Mercer’s theorem:
According to Mercer’s theorem, if a function K(a, b) respects a few mathematical conditions
called Mercer’s conditions (K must be continuous, symmetric in its arguments
so K(a, b) = K(b, a), etc.), then there exists a function ϕ that maps a and b into
another space (possibly with much higher dimensions) such that $K(a, b) = \phi(a)^T \phi(b)$.
So you can use K as a kernel since you know ϕ exists, even if you don’t know what ϕ
is. In the case of the Gaussian RBF kernel, it can be shown that ϕ actually maps each
training instance to an infinite-dimensional space, so it’s a good thing you don’t need
to actually perform the mapping!
Note that some frequently used kernels (such as the Sigmoid kernel) don’t respect all
of Mercer’s conditions, yet they generally work well in practice.
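One consequence of Mercer's conditions, not spelled out above, is that the matrix of pairwise kernel values (the Gram matrix) has no negative eigenvalues. The sketch below checks this for a Gaussian RBF kernel and shows a sigmoid kernel failing it. The points and the kernel parameters (γ = 0.5 for RBF, γ = 1 and r = −1 for sigmoid) are chosen for illustration.

```python
import numpy as np

def gram_matrix(kernel, X):
    # Matrix of kernel values for every pair of rows in X
    return np.array([[kernel(a, b) for b in X] for a in X])

def rbf(a, b):
    return np.exp(-0.5 * np.sum((a - b) ** 2))

def sigmoid(a, b):
    return np.tanh(1.0 * (a @ b) - 1.0)

# Corners of the unit square (illustrative points)
X_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

rbf_min_eig = np.linalg.eigvalsh(gram_matrix(rbf, X_pts)).min()
sig_min_eig = np.linalg.eigvalsh(gram_matrix(sigmoid, X_pts)).min()
print(rbf_min_eig, sig_min_eig)
```

With r = −1 the sigmoid kernel gives K(0, 0) = tanh(−1) < 0 on the diagonal, which already rules out a positive semi-definite Gram matrix.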

# Online SVMs

For linear SVM classifiers, one way to learn online (incrementally, as new instances arrive) is to use Gradient Descent (e.g., using
`SGDClassifier`) to minimize the cost function below, which is derived
from the primal problem. Unfortunately it converges much more slowly than the
methods based on QP.

Linear SVM classifier cost function:
$$J(w, b) = \frac{1}{2} w^T w + C \sum_{i=1}^{m} \max\left(0,\; 1 - t^{(i)}(w^T x^{(i)} + b)\right)$$

The first term in the cost function will push the model to have a small weight vector
w, leading to a larger margin. The second term computes the total of all margin violations.
An instance's margin violation is equal to 0 if it is located off the street and on
the correct side, or else it is proportional to the distance to the correct side of the
street. Minimizing this term ensures that the model makes the margin violations as
small and as few as possible.
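A minimal sketch of online training with `SGDClassifier` and hinge loss, feeding it one mini-batch at a time through `partial_fit`. The data stream (`make_batch`) is a made-up pair of Gaussian blobs, and `alpha=0.01` is an arbitrary choice. Note that `alpha` is a regularization strength, so it works in the opposite direction to `C`.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)

def make_batch(n_per_class=10):
    # Hypothetical data stream: two blobs centred at (-3, -3) and (3, 3)
    neg = rng.normal(loc=-3.0, scale=1.0, size=(n_per_class, 2))
    pos = rng.normal(loc=3.0, scale=1.0, size=(n_per_class, 2))
    X_batch = np.vstack([neg, pos])
    y_batch = np.array([0] * n_per_class + [1] * n_per_class)
    return X_batch, y_batch

online_svm = SGDClassifier(loss="hinge", alpha=0.01, random_state=42)

# Each call updates the model using only the current batch
for _ in range(20):
    X_batch, y_batch = make_batch()
    online_svm.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

X_test, y_test = make_batch(n_per_class=50)
accuracy = online_svm.score(X_test, y_test)
print(accuracy)
```

The `classes` argument is needed on the first `partial_fit` call because the model cannot infer the full label set from a single batch.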

## Hinge loss:
The function max(0, 1 - t) is called the hinge loss function. It is
equal to 0 when t ≥ 1. Its derivative (slope) is equal to -1 if t < 1 and 0 if t > 1. It is not
differentiable at t = 1, but just like for Lasso Regression you can still use Gradient Descent using any subderivative at t = 1 (i.e., any
value between -1 and 0).
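A sketch of the hinge loss, one valid subderivative, and a plain batch subgradient descent on the linear SVM cost function J(w, b) = ½wᵀw + C Σ max(0, 1 − t(wᵀx + b)). The toy data, learning rate and iteration count are arbitrary assumptions, and the value 0 is picked as the subderivative at exactly 1.

```python
import numpy as np

def hinge(s):
    return np.maximum(0.0, 1.0 - s)

def hinge_subderivative(s):
    # -1 where s < 1, 0 where s > 1; at s == 1 we pick 0 (any value in [-1, 0] is valid)
    return np.where(s < 1.0, -1.0, 0.0)

def cost(w, b, X, t, C):
    return 0.5 * (w @ w) + C * hinge(t * (X @ w + b)).sum()

def cost_subgradient(w, b, X, t, C):
    d = hinge_subderivative(t * (X @ w + b))  # one value per instance
    grad_w = w + C * (d * t) @ X              # chain rule: d * t * x, summed over instances
    grad_b = C * np.sum(d * t)
    return grad_w, grad_b

# Made-up linearly separable toy set
X_toy = np.array([[0.0, 0.0], [-1.0, 0.0], [2.0, 2.0], [3.0, 2.0]])
t = np.array([-1.0, -1.0, 1.0, 1.0])

w, b, C, learning_rate = np.zeros(2), 0.0, 1.0, 0.01
J_initial = cost(w, b, X_toy, t, C)
for _ in range(500):
    grad_w, grad_b = cost_subgradient(w, b, X_toy, t, C)
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b
J_final = cost(w, b, X_toy, t, C)
print(J_initial, J_final)
```

With a fixed learning rate the cost does not decrease at every single step, but it ends well below its starting value.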

# Large-scale non-linear problems
For these, neural networks are usually a better choice than SVMs :)