<a href="https://colab.research.google.com/github/D4ve39/pythonProg/blob/master/MachineLearningBasics_SVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import

In [None]:
from sklearn.svm import SVC
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data
The iris dataset consists of several features for three different kinds of irises: Setosa, Versicolour, and Virginica. Based on a subset of these features, we will try to predict the category of a given iris sample.

In [None]:
iris = datasets.load_iris()

# Let's select only the first two features (sepal length, sepal width)
features = iris.data[:, 0:2]
labels = iris.target

# Shuffle the data points and split them into a training set (70%) and a test set (30%)
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.3, random_state=42)

features.shape, train_features.shape, test_features.shape

In [None]:
def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy

In [None]:
xx, yy = make_meshgrid(train_features[:, 0], train_features[:,1])

plt.scatter(train_features[:, 0], train_features[:, 1], c=train_labels, cmap=plt.cm.coolwarm, s=30, edgecolors='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xticks(())
plt.yticks(())

plt.show()

# Support Vector Machine

# `TODO`  Why?

The Support Vector Machine (SVM) is, again, a **supervised learning** model. It is popular because it has been applied successfully to tasks such as image classification and natural language processing, often outperforming earlier models. It also comes with strong theoretical guarantees (unlike deep learning models), and it is efficient in both computation and memory. Its main strength is the so-called **kernel trick**, which lets you implicitly map the input samples to a high-dimensional space in which data points that are not linearly separable in the original space can become linearly separable.
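To make the kernel trick more concrete, here is a minimal sketch. It assumes a degree-2 polynomial kernel on 2D inputs; the helper names `phi` and `poly_kernel` are made up for this illustration and are not part of scikit-learn. The point is that the kernel gives the same inner product as the explicit high-dimensional mapping, without ever computing that mapping.

```python
import numpy as np


def phi(v):
    """Hypothetical explicit feature map: 2D input -> 3D feature space."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


def poly_kernel(a, b):
    """Degree-2 homogeneous polynomial kernel, evaluated in the original 2D space."""
    return np.dot(a, b) ** 2


a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

# Inner product computed after explicitly mapping both points to 3D
explicit = np.dot(phi(a), phi(b))
# Same quantity computed directly from the 2D points via the kernel
implicit = poly_kernel(a, b)

print("explicit mapping:", explicit)
print("kernel function: ", implicit)
```

Both values agree (up to floating-point rounding), which is why an SVM can work in the high-dimensional space while only ever evaluating the kernel on the original inputs.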


In [None]:
# Now it's your turn! Train and score your SVM model (Hint: the code is very similar to the previous task!)

# Create the SVM model
svm = ...
# Fit the SVM model
...

In [None]:
# Plotting the results

fig, sub = plt.subplots(1, 1)


Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
out = sub.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.6)

sub.scatter(train_features[:, 0], train_features[:, 1], c=train_labels, cmap=plt.cm.coolwarm, s=30, edgecolors='k')
sub.set_xlim(xx.min(), xx.max())
sub.set_ylim(yy.min(), yy.max())
sub.set_xlabel('Sepal length')
sub.set_ylabel('Sepal width')
sub.set_xticks(())
sub.set_yticks(())

plt.show()



## `TODO`  Predicting new data points!
Now let's use our model for prediction! Use the `test_features` array defined above to make predictions.

In [None]:
# Insert your code here, how well does our model perform on unseen data?

predictions = ...

In [None]:
# Balanced accuracy is the average of the recall obtained on each class
accuracy = balanced_accuracy_score(test_labels, predictions)
print("The balanced prediction accuracy is", accuracy)
# Plotting the results

fig, sub = plt.subplots(1, 1)


Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
out = sub.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.6)

sub.scatter(train_features[:, 0], train_features[:, 1], c=train_labels, cmap=plt.cm.coolwarm, s=30, edgecolors='k')

sub.scatter(test_features[:, 0], test_features[:, 1], c=predictions, cmap=plt.cm.coolwarm, s=60, edgecolors='y')

sub.set_xlim(xx.min(), xx.max())
sub.set_ylim(yy.min(), yy.max())
sub.set_xlabel('Sepal length')
sub.set_ylabel('Sepal width')
sub.set_xticks(())
sub.set_yticks(())

plt.show()

# `TODO` Adjusting the parameters of the model
The scikit-learn documentation of support vector machines lists a lot of parameters that can be adjusted. The most important one is the so-called "regularisation parameter" C.

– Try setting C to different values (for example, 0.01, 1, 100, 10000, 1000000).

– What do you observe (about the decision boundary and the prediction accuracy)?
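One possible way to run this experiment is sketched below. It repeats the data loading from earlier in the notebook so that it runs on its own; the names `c_values`, `scores`, and `model` are introduced only for this illustration. In scikit-learn the strength of the regularisation is inversely proportional to C, so small values regularise strongly and large values penalise training errors more heavily.

```python
from sklearn import datasets
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same data preparation as earlier in the notebook
iris = datasets.load_iris()
features = iris.data[:, 0:2]
labels = iris.target
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.3, random_state=42)

# Values of C suggested in the exercise
c_values = [0.01, 1, 100, 10000, 1000000]
scores = {}

for c in c_values:
    # Train one model per value of C (default RBF kernel)
    model = SVC(C=c)
    model.fit(train_features, train_labels)
    # Score on the held-out test set
    scores[c] = balanced_accuracy_score(test_labels, model.predict(test_features))
    print(f"C={c:>9}: balanced accuracy = {scores[c]:.3f}, "
          f"support vectors = {model.n_support_.sum()}")
```

To see the effect on the decision boundary, reuse the plotting cell above with `svm` set to each fitted model in turn.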


# `TODO` Imbalanced classes
– What happens if there are many more samples from one class than from another?

– Can you think of ways to solve the problems arising from classes not being equally represented?

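Remark: For classification models such as support vector machines and logistic regression, scikit-learn does not correct for imbalanced data by default (`class_weight=None`). Passing `class_weight='balanced'` weights each class inversely proportional to its frequency in the training data.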
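One common way to handle imbalanced classes is to reweight them. The sketch below is only an illustration under stated assumptions: the toy data (90 samples of class 0, 10 of class 1, drawn from two Gaussians) is made up and is not part of the iris exercise. It uses scikit-learn's `compute_class_weight` to show which weights the `'balanced'` option corresponds to, and then fits one `SVC` with and one without `class_weight='balanced'`.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy data: 90 samples of class 0 and 10 samples of class 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
               rng.normal(1.5, 1.0, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print("class weights:", dict(zip([0, 1], weights)))

# Fit once without and once with class reweighting
plain = SVC(C=1).fit(X, y)
weighted = SVC(C=1, class_weight="balanced").fit(X, y)

# Compare how often each model predicts the minority class
print("minority predictions (default): ", int((plain.predict(X) == 1).sum()))
print("minority predictions (balanced):", int((weighted.predict(X) == 1).sum()))
```

The rare class receives the larger weight, so errors on it cost more during training. Other options include resampling the data or, as in this notebook, scoring with a metric such as balanced accuracy.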

## Solution

In [None]:

# Define the model
svm = SVC(C=1)
# Fit the model to the training set
svm.fit(train_features, train_labels)
# Predict on the test set
predictions = svm.predict(test_features)
