# Support Vector Machine (SVM)
- First developed in the 1960s. SVMs are popular because they are powerful and work differently from most other machine learning algorithms.
- Imagine a plot with two main clusters of points. How can you separate them? One option is to draw a line between the two clusters and classify each point by the side it falls on. Many different lines separate the training data, but they classify new data differently, so we need to find the best one.
- The best line is the one with the maximum margin: the line whose distance to the nearest point of each cluster is as large as possible.
- The SVM looks only at the nearest points of the different clusters and computes the margin from them. These points are called support vectors (in 2D/3D we think of them as points, but in higher dimensions a point is represented as a vector), which is why the method is called a support vector machine.
- In 2D the boundary is a line and in 3D it is a plane. In higher dimensions it is a hyperplane.

- Points from different clusters that lie very close to each other are similar, and therefore hard to tell apart. In contrast, points in the center of a cluster, or far away from the other clusters, are easy for the model to recognize as part of that cluster. The SVM focuses on the hard cases: it looks for the points of each cluster that are closest to the other cluster and places the boundary so that the distance to them is maximized.
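
The maximum-margin idea can be sketched on a tiny made-up dataset. The points in `X_toy`, the labels in `y_toy`, and the very large `C` (used here to approximate a hard margin) are assumptions for illustration and are not part of the dataset used in this notebook. For a linear SVM the margin width is 2 / ||w||, where w is the learned weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2D data (made up for illustration).
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                  [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates a hard-margin SVM.
svm = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = svm.coef_[0]                            # normal vector of the separating line
b = svm.intercept_[0]                       # offset of the separating line
margin_width = 2.0 / np.linalg.norm(w)      # distance between the two margin lines

# Only the points nearest to the other cluster end up as support vectors.
print("support vectors:\n", svm.support_vectors_)
print("margin width:", margin_width)
```

Points far from the boundary, such as (1, 1) or (5, 4), could be moved or removed without changing the fitted line, which is the sense in which only the support vectors matter.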

## Importing the libraries

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

The dataset contains 400 rows. Each row records a customer's age and estimated salary, and whether that customer purchased an SUV (1 = yes, 0 = no).

In [4]:
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [5]:
dataset.head()

Unnamed: 0,Age,EstimatedSalary,Purchased
0,19,19000,0
1,35,20000,0
2,26,43000,0
3,27,57000,0
4,19,76000,0


## Splitting the dataset into the Training set and Test set

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Feature Scaling

In [7]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

## Training the SVM model on the Training set

In [8]:
from sklearn.svm import SVC

classifier = SVC(kernel='linear', random_state=0) # classic model with the linear kernel
model = classifier.fit(X_train, y_train)
print(model)

SVC(kernel='linear', random_state=0)


## Predicting the Test set results

In [9]:
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1])

## Making the Confusion Matrix

Confusion matrix (rows are the actual class, columns are the predicted class)
- 0,0: True negatives - predicted negative and actually negative
- 1,0: False negatives - predicted negative but actually positive
- 0,1: False positives - predicted positive but actually negative
- 1,1: True positives - predicted positive and actually positive
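
As a small sketch of how these four cells can be read out of the array that `confusion_matrix` returns, the snippet below hard-codes the matrix reported in the Results section of this notebook. The variable name `cm` is an assumption for illustration.

```python
import numpy as np

# Confusion matrix from the Results section (rows: actual, columns: predicted).
cm = np.array([[57, 1],
               [6, 16]])

# ravel() flattens row by row, so the order is TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```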

Accuracy
- The fraction of all predictions that were correct: (true positives + true negatives) / (total number of predictions).

F1 Score
- Combines the precision and recall of a model into a single number. It is more informative than accuracy when classes are imbalanced, because accuracy then no longer reflects the predictive power of the model. For example, if 80% of the samples are positive and only 20% are negative, a model can learn to always predict positive. Its accuracy (correct predictions / all predictions) is then 0.80, which looks high for a model that outputs the same answer every time.

- In classification problems, precision and recall usually trade off: improving one tends to come at the cost of the other.
- Precision: (true positives) / (true positives + false positives) -> Of all the positives you predicted, how many were correct. A model tuned for precision is cautious and only predicts positive when it is confident, because it does not want to be wrong, so recall tends to drop.
- Recall: (true positives) / (true positives + false negatives) -> Of all the actual positives, what fraction you predicted as positive. A model tuned for recall is less strict because it wants to catch every positive, so precision tends to drop.

- F1 Score: The harmonic mean of precision and recall: (2 * precision * recall) / (precision + recall). The harmonic mean is pulled toward the smaller of the two values, so a high F1 score requires both precision and recall to be high.
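
The formulas above can be checked by hand. This sketch plugs in the four counts from the confusion matrix reported in the Results section (TN = 57, FP = 1, FN = 6, TP = 16). The `_manual` variable names are assumptions chosen to avoid clashing with the notebook's own variables.

```python
# Counts taken from the confusion matrix in the Results section.
tn, fp, fn, tp = 57, 1, 6, 16

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)   # correct / all predictions
precision_manual = tp / (tp + fp)                   # correct among predicted positives
recall_manual = tp / (tp + fn)                      # found among actual positives
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("accuracy :", accuracy_manual)
print("precision:", precision_manual)
print("recall   :", recall_manual)
print("f1       :", f1_manual)
```

The accuracy and F1 values computed this way agree with the `accuracy_score` and `f1_score` outputs shown in the Results section. Precision comes out higher than recall, so most of this model's errors are missed buyers, not false alarms.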

In [10]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
confusion = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

## Results
- Print the confusion matrix, accuracy, and the F1 score.
- The results may improve with a non-linear kernel. Here the SVM uses a linear kernel, and a straight decision boundary may not be the best fit for this data.

In [11]:
confusion

array([[57,  1],
       [ 6, 16]])

In [12]:
accuracy

0.9125

In [13]:
f1

0.8205128205128205

## Kernel SVM

The goal is to find a decision boundary that clearly separates the data, and often this cannot be done with a linear function. The kernel you choose encodes an assumption about the shape of that boundary: linear, polynomial, and so on.

The idea is to create a mapping function that takes data that is not linearly separable and moves it to a higher-dimensional space. For example, data plotted in 2D can be transformed to 3D with a mapping function. In the new space the data may become separable by a linear function (a plane), even though this was not possible before. The problem with mapping to higher dimensions is that it is computationally expensive. For this reason we use the kernel trick, which achieves an equivalent result without explicitly computing the mapping, at a much lower computational cost.
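
A minimal sketch of this idea, assuming synthetic data from scikit-learn's `make_circles` (one class forms a small circle inside the other) instead of the notebook's dataset. The extra feature z = x1^2 + x2^2 is one possible mapping chosen for illustration. The sketch compares a linear SVM on the raw 2D data, a linear SVM on the explicitly mapped 3D data, and an RBF-kernel SVM on the raw data.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic, non-linearly-separable data: two concentric circles.
X_c, y_c = make_circles(n_samples=200, factor=0.3, noise=0.03, random_state=0)

# 1) Linear SVM on the original 2D data: no straight line separates the circles.
raw_acc = SVC(kernel="linear").fit(X_c, y_c).score(X_c, y_c)

# 2) Explicit mapping to 3D: append z = x1^2 + x2^2 (squared distance to origin).
z = (X_c ** 2).sum(axis=1, keepdims=True)
X_mapped = np.hstack([X_c, z])
mapped_acc = SVC(kernel="linear").fit(X_mapped, y_c).score(X_mapped, y_c)

# 3) Kernel trick: an RBF kernel on the original 2D data, no explicit mapping.
rbf_acc = SVC(kernel="rbf").fit(X_c, y_c).score(X_c, y_c)

print("linear, raw 2D   :", raw_acc)
print("linear, mapped 3D:", mapped_acc)
print("rbf, raw 2D      :", rbf_acc)
```

The scores here are training accuracies, used only to show separability. A real evaluation would use a held-out test set as in the rest of this notebook.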

### Kernel Trick
- Gaussian RBF kernel: K(x, l) = exp(-||x - l||^2 / (2 * sigma^2)). A landmark l is placed on a cluster of points. The kernel value is close to 1 for points near the landmark and falls toward 0 for points far from it, so points inside a roughly circular region around the landmark can be told apart from points outside it. Sigma controls the width of that region: a larger sigma gives a wider circle. This separates the data without explicitly computing coordinates in a higher-dimensional space. For more complex shapes, such as an infinity-symbol region of one class inside a circle of the other class, two such kernels with different landmarks can be combined to fit the data.
- Sigmoid kernel: K(x, l) = tanh(gamma * (x . l) + r). It still compares a point with a landmark, but through their dot product and not their distance, and its values lie between -1 and 1.
- Polynomial kernel: K(x, l) = (gamma * (x . l) + r)^d. The kernel behaves like a polynomial function controlled by three parameters: gamma, r, and the degree d.
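
The three kernels can be written out in a few lines of NumPy and compared with scikit-learn's pairwise kernel functions. The point `x`, the landmark, and the parameter values (`sigma`, `coef0`, `degree`) below are assumptions for illustration. Note that scikit-learn parameterizes the RBF kernel with `gamma`, which corresponds to 1 / (2 * sigma^2) in the formula above.

```python
import numpy as np
from sklearn.metrics.pairwise import (polynomial_kernel, rbf_kernel,
                                      sigmoid_kernel)

# One data point and one landmark, as 2D row vectors (made-up values).
x = np.array([[1.0, 2.0]])
landmark = np.array([[2.0, 0.0]])

sigma = 1.5
gamma = 1.0 / (2.0 * sigma ** 2)   # scikit-learn's gamma for this sigma
coef0 = 1.0                        # the constant r in the formulas
degree = 3                         # the degree d of the polynomial kernel

sq_dist = np.sum((x - landmark) ** 2)   # squared Euclidean distance
dot = (x @ landmark.T).item()           # dot product x . l

k_rbf = np.exp(-gamma * sq_dist)
k_sigmoid = np.tanh(gamma * dot + coef0)
k_poly = (gamma * dot + coef0) ** degree

# The same values from scikit-learn's implementations.
sk_rbf = rbf_kernel(x, landmark, gamma=gamma)[0, 0]
sk_sigmoid = sigmoid_kernel(x, landmark, gamma=gamma, coef0=coef0)[0, 0]
sk_poly = polynomial_kernel(x, landmark, degree=degree,
                            gamma=gamma, coef0=coef0)[0, 0]

print("rbf    :", k_rbf, sk_rbf)
print("sigmoid:", k_sigmoid, sk_sigmoid)
print("poly   :", k_poly, sk_poly)
```

In `SVC`, the same choices are made through `kernel='rbf'`, `kernel='sigmoid'`, or `kernel='poly'` together with the `gamma`, `coef0`, and `degree` arguments.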