In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import matplotlib

matplotlib.rcParams['animation.embed_limit'] = 30000000.0
plt.rcParams['figure.dpi'] = 120

# Support Vector Machines

The term Support Vector Machine (SVM) is sometimes used loosely to refer to three methods: the maximal margin classifier, the support vector classifier, and the support vector machine. Each is an extension of the previous method, allowing it to be applied to a broader range of cases.

Support Vector Machines (SVMs) are a common discriminative algorithm, well suited to *complex* small- to medium-sized datasets<sup>2</sup>. They aim to find a hyperplane that provides the maximum margin of separation between classes of objects, and they can be used for both classification and regression.

**NOTES**
- Images from Hands-On Machine Learning (using the petal data)
- Some explanation from Python Machine Learning
- Structure from An Introduction to Statistical Learning, plus some algebra
- Some algebra from Machine Learning: A Probabilistic Perspective.

---
1. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: Springer, 2013.
2. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc.

SVMs were "developed in the computer science community in the 1990s"<sup>1</sup>.

The SVM is a "generalisation of the maximal margin classifier", which requires a linear boundary between the classes. The support vector classifier is an extension that can be applied to a broader range of cases<sup>1</sup>.

---
1. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: Springer, 2013.

# Classification
## Linear

The maximal margin classifier induces a linear decision boundary on the feature space. SVMs simply divide the space to outline a decision boundary. Let's have a look at a decision boundary.

In [None]:
#from mlxtend.plotting import plot_decision_regions

#vis_data = X_train[:,[feature_list.index(x_axis_label),
#                      feature_list.index(y_axis_label)]]

#pipe_svc_linear.fit(vis_data, y_train)

#plot_decision_regions(vis_data,
#                      y_train,
#                      clf = pipe_svc_linear)

#plt.xlabel(x_axis_label) 
#plt.ylabel(y_axis_label)
#plt.xlim(0,.6)
#plt.ylim(0,1.)

#plt.savefig('svm_linear_boundary.png')
#plt.show()

In the SVM context, two classes in a K-dimensional feature space are (perfectly) separable if a (K − 1)-dimensional hyperplane divides them. In one dimension the separator is a point, in two dimensions a line, in three a plane, and so on.
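
As a minimal sketch of this idea, a hyperplane can be written as the set of points where $w \cdot x + b = 0$, and the sign of $w \cdot x + b$ says which side a point falls on. The weights `w`, offsets `b`, sample points and the helper `side_of_hyperplane` below are hypothetical values chosen for illustration only.

```python
import numpy as np


def side_of_hyperplane(X, w, b):
    """Return +1 or -1 for each row of X, depending on the sign of w.x + b."""
    return np.where(X @ w + b >= 0, 1, -1)


# One dimension: the separator is the point x = 2 (w = 1, b = -2).
X1 = np.array([[0.5], [3.0]])
labels_1d = side_of_hyperplane(X1, np.array([1.0]), -2.0)

# Two dimensions: the separator is the line x1 + x2 = 1.
X2 = np.array([[0.0, 0.0], [1.0, 1.0]])
labels_2d = side_of_hyperplane(X2, np.array([1.0, 1.0]), -1.0)

# Three dimensions: the separator is the plane x3 = 0.
X3 = np.array([[0.0, 0.0, -1.0], [0.0, 0.0, 2.0]])
labels_3d = side_of_hyperplane(X3, np.array([0.0, 0.0, 1.0]), 0.0)

print(labels_1d, labels_2d, labels_3d)
```

In each case the first point lies on the negative side and the second on the positive side of the separator.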

In [None]:
# One Dimension

In [None]:
# Two Dimension

In [None]:
# Three Dimension

[More description on what a Hyperplane is and demonstration]

A subset of the training data, known as support vectors, is selected by the algorithm to compute the optimal separating hyperplane between classes.

If the data can be linearly separated, then a 'hard margin' of separation can be used, whereby the points on the edge of each class act as the support vectors for the decision boundary.

However, this method is sensitive to outliers, so a more flexible method may be preferable: a soft margin computes a hyperplane that still maximises the margin of separation while allowing for some errors.
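
The sketch below shows how the support vectors can be inspected after fitting. It assumes a tiny hand-made, linearly separable dataset, and it approximates a hard margin by giving Scikit-Learn's `SVC` a very large `C`; both choices are illustrative assumptions rather than part of the original material.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy dataset (hypothetical values for illustration).
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates a hard margin.
hard_margin_svc = SVC(kernel='linear', C=1e6)
hard_margin_svc.fit(X, y)

# The training points that define the boundary.
support_vectors = hard_margin_svc.support_vectors_

# For a linear kernel the margin width is 2 / ||w||.
w = hard_margin_svc.coef_[0]
margin_width = 2 / np.linalg.norm(w)

print("Support vectors:\n", support_vectors)
print("Margin width: %.3f" % margin_width)
```

Only the points closest to the other class should appear in `support_vectors_`; moving any of the remaining points (without crossing the margin) would leave the boundary unchanged.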

[Discussion of Maximising the Decision margin]

## Soft Margin

The data cannot always be perfectly separated by a (K − 1)-dimensional hyperplane. To overcome this problem we can either relax the constraints on the hyperplane to allow some points to be misclassified (soft margin), or transform the data so that it is separable by a hyperplane in another space (kernel method).

In Scikit-Learn's SVM classes the softness of the margin is controlled by the C hyperparameter: a smaller C creates a wider margin but allows more margin violations.
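
To make the effect of C concrete, here is a small sketch that fits a linear `SVC` with a small and a large C and compares the margin width ($2 / \lVert w \rVert$) and the number of support vectors. The synthetic overlapping blobs and the two C values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some margin violations are unavoidable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

results = {}
for C in (0.01, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    results[C] = {
        # For a linear kernel the margin width is 2 / ||w||.
        'margin_width': 2 / np.linalg.norm(clf.coef_[0]),
        # n_support_ holds the number of support vectors per class.
        'n_support_vectors': int(clf.n_support_.sum()),
    }
    print("C=%g: margin width %.2f, %d support vectors"
          % (C, results[C]['margin_width'], results[C]['n_support_vectors']))
```

The smaller C should report the wider margin, typically with more training points sitting inside or on it.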

## Non-linear
The decision boundary does not need to be linear in the original feature space: the inputs can be projected to a higher-dimensional space using a kernel (e.g. radial basis kernel<sup>2,3</sup>), where a hyperplane can be fitted to split the data into classes. Mapped back into the original feature space, this hyperplane becomes a nonlinear separation boundary.

---
2. Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE transactions on electronic computers, (3), 326-334.
3. Varsavsky, A., Mareels, I., & Cook, M. (2016). Epileptic seizures and the EEG: measurement, models, detection and prediction. CRC Press.
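
As a rough illustration of why a kernel helps, the sketch below compares a linear kernel with a radial basis function (RBF) kernel on concentric circles, a dataset that no straight line can separate. The dataset and its parameters are assumptions made for this example.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Same classifier, different kernels; accuracy is measured on the training data.
linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf', gamma='scale').fit(X, y).score(X, y)

print("Linear kernel accuracy: %.3f" % linear_acc)
print("RBF kernel accuracy:    %.3f" % rbf_acc)
```

The RBF kernel can wrap a closed boundary around the inner ring, which the linear kernel cannot do.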

[Describe Mercer's Theorem using Hands on Machine Learning]

[Describe and demonstrate various kernels]

# Regression

# Implementation

There are three classes for SVM classification in Scikit-Learn (table below adapted from Géron (2017)<sup>1</sup>), where *m* is the number of training instances and *n* is the number of features:

| Class         | Time Complexity                                        | Out-of-core Support | Kernel Trick |
|---------------|--------------------------------------------------------|---------------------|--------------|
| LinearSVC     | O(*m* × *n*)                                           | No                  | No           |
| SGDClassifier | O(*m* × *n*)                                           | Yes                 | No           |
| SVC           | O(*m*<sup>2</sup> × *n*) to O(*m*<sup>3</sup> × *n*)   | No                  | Yes          |

First let's make a pipeline with two steps:

1. Standardize the features
2. SVM

For the SVM we'll just use SVC with the kernel set to linear, so we can compare the decision boundary to that of logistic regression as we did before. The data in our examples is quite small, so although SVC takes longer than the other two classes, it is fine for this dataset.
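
A minimal sketch of such a pipeline is given below. It assumes the iris petal measurements as stand-in data and an arbitrary train/validation split; the name `pipe_svc_linear` follows the commented-out cells in this notebook, while everything else is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: petal length and petal width from the iris dataset.
X, y = load_iris(return_X_y=True)
X = X[:, 2:]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Step 1 standardizes the features, step 2 fits a linear-kernel SVC.
pipe_svc_linear = Pipeline([
    ('scl', StandardScaler()),
    ('clf', SVC(kernel='linear', random_state=42)),
])

pipe_svc_linear.fit(X_train, y_train)
val_accuracy = pipe_svc_linear.score(X_val, y_val)
print('Validation Accuracy: %.3f' % val_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the validation set leaks into the standardization.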

## Dimensionality Reduction and SVM
In order to reduce a model's complexity, run time, and potential for overfitting to the training data, dimension reduction techniques can be used. Broadly, they can be grouped into methods that select a subset of the original features (Feature Selection) and methods that create new synthetic features by combining the original features and discarding less important ones (Feature Extraction). Essentially we want to remove "uninformative information" and retain the useful bits<sup>1</sup>. If you have too many features, some of them may be highly correlated and therefore redundant. We can then either select just some of them, or compress them onto a lower-dimensional subspace<sup>2</sup>.

---
1. Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc.

2. Raschka, 2016

## Filtering

A computationally efficient method of selecting features is to use a filter method. Filter methods aim to remove features with a low potential to predict outputs, usually through univariate analysis before classification. A filter could be a threshold set on each feature's variance, or on the correlation or mutual information between each feature and the response variable. Although filters are computationally efficient compared with other feature selection methods, they are independent of the chosen model, so they should be used conservatively to ensure that data a model may find useful is not removed<sup>1</sup>.

### Variance Threshold

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold.

---

1. Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc.
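
As a small self-contained sketch of a variance filter, the example below applies `VarianceThreshold` to a hand-made array in which one column is constant. The array values and the feature names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: the middle column is constant, so its variance is zero.
X = np.array([[0.0, 2.0, 1.0],
              [1.0, 2.0, 3.0],
              [2.0, 2.0, 5.0]])
feature_names = np.array(['feat_a', 'feat_b', 'feat_c'])

# Keep only features whose variance is strictly greater than the threshold.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

# Boolean mask of the features that were kept.
keep_mask = selector.get_support()
removed = feature_names[~keep_mask].tolist()

print("Kept:   ", feature_names[keep_mask].tolist())
print("Removed:", removed)
```

Note that variance depends on the scale of each feature, so a threshold chosen for one dataset will not necessarily transfer to another.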

In [None]:
#from sklearn.feature_selection import VarianceThreshold
#from itertools import compress
#from collections import defaultdict
#import pprint
#pp = pprint.PrettyPrinter()
#
#sel = VarianceThreshold(threshold=.8)
#sel.fit(X_train)
#
# get boolean list of which features are kept and which are not
#keep_bool = sel.get_support()
# get index of false values
#remove_index = [i for i, x in enumerate(keep_bool) if not x]
#
# merge multiindex feature labels into 1 label list
#feat_labels = reduced_features.columns
#remove_list = list(feat_labels[remove_index])
#
#print(color.BOLD+color.UNDERLINE+'Features and Channels Removed ('+str(len(remove_index))+')\n'+color.END)
#pp.pprint(remove_list)

### Embedded Methods

Instead of being independent, feature selection methods can be embedded in the model training process. An example would be the l1 regularizer for linear models, which imposes a sparsity constraint so that the model favours fewer features. These methods are efficient and specific to the chosen model, but are not as powerful as wrapper methods (discussed next)<sup>1</sup>.

Below is just an example of how you could implement it in a pipeline using a Support Vector Machine.

---

1. Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc.
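
One possible way to embed l1-based selection in a pipeline is sketched here: an l1-penalised `LinearSVC` wrapped in `SelectFromModel` picks the features, and an RBF `SVC` is then trained on what remains. The synthetic data, the value of `C`, and the choice of `SelectFromModel` are assumptions for illustration, not necessarily the approach originally intended for this section.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic data: 20 features, of which only 3 are informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# The l1 penalty drives many coefficients to zero; SelectFromModel keeps
# the features whose coefficients remain non-zero.
l1_svc = LinearSVC(penalty='l1', dual=False, C=0.1, max_iter=5000,
                   random_state=0)

pipe_embedded = Pipeline([
    ('scl', StandardScaler()),
    ('sel', SelectFromModel(l1_svc)),
    ('clf', SVC(kernel='rbf', gamma='scale', random_state=0)),
])

pipe_embedded.fit(X, y)
n_kept = int(pipe_embedded.named_steps['sel'].get_support().sum())
train_accuracy = pipe_embedded.score(X, y)
print("Features kept: %d of %d" % (n_kept, X.shape[1]))
print("Training accuracy: %.3f" % train_accuracy)
```

A smaller `C` in the l1-penalised model means stronger regularisation and therefore fewer selected features.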

## Wrapper Methods
Wrapper methods are also specific to the chosen model, as they directly optimise the accuracy of a classifier by trying subsets of features. This enables keeping features that are useful in combination with others, even if uninformative in isolation<sup>1</sup>. Wrapper methods are the most computationally expensive, especially when used with nonlinear classifiers.

---

1. Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc.
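
A wrapper method could be sketched with Scikit-Learn's `SequentialFeatureSelector`, which greedily adds the feature that most improves the cross-validated score of the wrapped classifier. The synthetic data, the number of features to select, and the use of forward selection are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: 8 features, of which 3 are informative.
X, y = make_classification(n_samples=150, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Forward selection: start with no features and repeatedly add the one
# that gives the best 3-fold cross-validated accuracy for the SVC.
wrapper = SequentialFeatureSelector(SVC(kernel='linear'),
                                    n_features_to_select=3,
                                    direction='forward',
                                    scoring='accuracy',
                                    cv=3)

pipe_wrapper = Pipeline([
    ('scl', StandardScaler()),
    ('sel', wrapper),
    ('clf', SVC(kernel='linear')),
])

pipe_wrapper.fit(X, y)
selected_mask = pipe_wrapper.named_steps['sel'].get_support()
print("Selected feature indices:", selected_mask.nonzero()[0].tolist())
```

Every candidate subset requires refitting the classifier across all folds, which is where the computational cost of wrapper methods comes from.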

## Imbalanced Data and SVM

## Multi-class

Tree-based classifiers are inherently multiclass whereas other machine learning algorithms are able to be extended to multi-class classification using techniques such as the One-versus-Rest or One-versus-One methods<sup>1</sup>.

**One-vs-the-rest (or one-versus-all)** is where you train a classifier for each class and select the class from the classifier that outputs the highest score<sup>1</sup>. Because each class is represented by a single classifier fitted against all the other classes, the approach is relatively interpretable<sup>2</sup>.

---
1. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc.
2. https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html

In [None]:
#from sklearn.multiclass import OneVsRestClassifier
#
#multi_pipe_svc_rbf = Pipeline([
#    ('scl', StandardScaler()),
#    ('clf', OneVsRestClassifier(SVC(C=100,
#                                    kernel='rbf',
#                                    gamma = 'auto',
#                                    class_weight = 'balanced',
#                                    random_state=RANDOM_STATE)))])

#multi_pipe_svc_rbf.fit(multi_X_train, multi_y_train)
#print('Validation Accuracy: %.3f' % multi_pipe_svc_rbf.score(multi_X_val, multi_y_val))

Another strategy is to use a **OneVsOneClassifier**. This trains $N \times (N-1) / 2$ classifiers, one for each pair of classes, and when a prediction is made the class that is selected the most is chosen<sup>1</sup> (we'll get more onto *Bagging* next week). It is useful where algorithms do not scale well with data size (such as SVM), because each classifier only needs to be trained on the subset of the data belonging to the two classes it must distinguish<sup>1,2</sup>.

---
1. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc.
2. https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
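
To check the number of classifiers each strategy trains, the sketch below fits both wrappers on Scikit-Learn's digits dataset (10 classes) and counts the fitted estimators. The dataset and the SVC settings are assumptions made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Handwritten digits: N = 10 classes.
X, y = load_digits(return_X_y=True)
n_classes = len(set(y))

# One-vs-rest trains one classifier per class.
ovr = OneVsRestClassifier(SVC(kernel='rbf', gamma='scale')).fit(X, y)

# One-vs-one trains one classifier per pair of classes.
ovo = OneVsOneClassifier(SVC(kernel='rbf', gamma='scale')).fit(X, y)

print("Classes:", n_classes)
print("One-vs-rest classifiers:", len(ovr.estimators_))
print("One-vs-one classifiers: ", len(ovo.estimators_))
```

With 10 classes, one-vs-one needs $10 \times 9 / 2 = 45$ classifiers compared with 10 for one-vs-rest, but each of the 45 sees only the samples from two classes.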

In [None]:
#from sklearn.multiclass import OneVsOneClassifier
#
#multi_pipe_svc_rbf = Pipeline([
#    ('scl', StandardScaler()),
#    ('clf', OneVsOneClassifier(SVC(C=100,
#                                   kernel='rbf',
#                                   gamma = 'auto',
#                                   class_weight = 'balanced',
#                                   random_state=RANDOM_STATE)))])
#
#multi_pipe_svc_rbf.fit(multi_X_train, multi_y_train)
#print('Validation Accuracy: %.3f' % multi_pipe_svc_rbf.score(multi_X_val, multi_y_val))

In [None]:
#multi_pipe_svc_rbf.fit(multi_vis_data, multi_y_train)
#
#plot_decision_regions(multi_vis_data,
#                      multi_y_train,
#                      clf = multi_pipe_svc_rbf)
#
#plt.xlabel(x_axis_label) 
#plt.ylabel(y_axis_label)
#plt.xlim(0,.6)
#plt.ylim(0,1.)
#plt.show()

## Majority Voting

The classifiers in a group don't all have to be SVMs. Indeed, Scikit-Learn has a VotingClassifier where multiple classification pipelines can be combined to create an even better classifier that aggregates their predictions. This aggregation can be done by simply selecting the class label that has been predicted by the majority of the classifiers (more than 50% of votes), known as 'hard voting'. Majority vote refers to binary class decisions but can be generalized to a multi-class setting using 'plurality voting'. Some classifiers return the probability of a predicted class label via the predict_proba method, and these probabilities can be used for 'soft voting' instead of class labels<sup>1</sup>.

Ensemble methods work best when the predictors are as independent as possible, so one way of achieving this is to use diverse classifiers. This increases the chance that they each make different types of errors, which in combination will improve the overall accuracy<sup>2</sup>.

As can be seen below, the soft majority voter has better scores than the hard voting method, and better than most of the other methods individually, when all are on their default settings. Soft voting often achieves a higher performance than hard voting because highly confident votes are given more weight<sup>2</sup>.

**NOTE**
- With some hyper-parameter optimisation it's likely we could increase the performance of the soft majority vote.

---
1. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017
2. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc.
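
A compact, runnable sketch of hard versus soft voting is given here. It assumes synthetic data and a hypothetical `RANDOM_STATE` value, and uses the same three kinds of classifier as the commented-out cell that follows; the scores it prints are specific to this toy data and say nothing about the dataset used elsewhere in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42  # hypothetical seed for reproducibility

X, y = make_classification(n_samples=300, n_features=10,
                           random_state=RANDOM_STATE)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=RANDOM_STATE)


def make_estimators():
    """Build a fresh list of (name, estimator) pairs for a voting ensemble."""
    return [
        ('SVM', Pipeline([('scl', StandardScaler()),
                          # probability=True is needed for soft voting
                          ('clf', SVC(kernel='rbf', gamma='auto',
                                      probability=True,
                                      random_state=RANDOM_STATE))])),
        ('LR', Pipeline([('scl', StandardScaler()),
                         ('clf', LogisticRegression(
                             solver='liblinear',
                             random_state=RANDOM_STATE))])),
        ('DT', DecisionTreeClassifier(random_state=RANDOM_STATE)),
    ]


# Hard voting counts predicted labels; soft voting averages probabilities.
hard_mv_clf = VotingClassifier(estimators=make_estimators(), voting='hard')
soft_mv_clf = VotingClassifier(estimators=make_estimators(), voting='soft')

hard_mv_clf.fit(X_train, y_train)
soft_mv_clf.fit(X_train, y_train)

hard_pred = hard_mv_clf.predict(X_val)
soft_proba = soft_mv_clf.predict_proba(X_val)
hard_accuracy = hard_mv_clf.score(X_val, y_val)
soft_accuracy = soft_mv_clf.score(X_val, y_val)

print("Hard voting accuracy: %.3f" % hard_accuracy)
print("Soft voting accuracy: %.3f" % soft_accuracy)
```

Only the soft voter exposes averaged class probabilities, which is why the hard voter has to be dropped before plotting ROC curves.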

In [None]:
# TODO: Replace the Tree with one of the other classifiers they have already learnt

#%%time
#from sklearn.preprocessing import StandardScaler
#from sklearn.ensemble import VotingClassifier
#from sklearn.pipeline import Pipeline
#from sklearn.linear_model import LogisticRegression
#from sklearn.svm import SVC
#from sklearn.tree import DecisionTreeClassifier
#from sklearn.model_selection import cross_val_score
#from imblearn.under_sampling import NeighbourhoodCleaningRule
#from sklearn.decomposition import PCA
#import timeit
#from sklearn.model_selection import StratifiedKFold
#from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score, make_scorer
#
#clf1 = Pipeline([('scl', StandardScaler()),
#                 ('clf', SVC(kernel='rbf', 
#                             gamma='auto',
#                             random_state=RANDOM_STATE, 
#                             probability = True))])

#clf2 = Pipeline([('scl', StandardScaler()),
#                 ('clf', LogisticRegression(solver='liblinear',
#                                            random_state=RANDOM_STATE))
#])

#clf3 = DecisionTreeClassifier(random_state=RANDOM_STATE)
#
#clf_labels = ['SVM', # Support Vector Machine
#              'LR', # LogisticRegression
#              'DT'] # Decision Tree
#
# Majority Rule Voting
#hard_mv_clf = VotingClassifier(estimators=[(clf_labels[0],clf1),
#                                           (clf_labels[1],clf2),
#                                           (clf_labels[2],clf3)],
#                              voting='hard')
#
#soft_mv_clf = VotingClassifier(estimators=[(clf_labels[0],clf1),
#                                           (clf_labels[1],clf2),
#                                           (clf_labels[2],clf3)],
#                               voting='soft')
#
#clf_labels += ['Hard Majority Voting', 'Soft Majority Voting']
#all_clf = [clf1, clf2, clf3, hard_mv_clf, soft_mv_clf]
#
#print(color.BOLD+color.UNDERLINE+'Validation Scores\n'+color.END)
#for clf, label in zip(all_clf, clf_labels):
#    start = timeit.default_timer() # TIME STUFF
#    
#    clf.fit(X_train, y_train)
#
#    y_pred = clf.predict(X_val)
#    scores = f1_score(y_val, y_pred)
#    print(color.BOLD+label+color.END)
#    print("Score: %0.3f"
#          % scores)
    # TIME STUFF
#    stop = timeit.default_timer()
#    print("Run time:", np.round((stop-start)/60,2),"minutes")
#    print()

In [None]:
#%%time
#from sklearn.metrics import roc_curve
#from sklearn.metrics import auc
#
# remove the hard voting because doesnt have predict proba
#del clf_labels[3], all_clf[3]
#
#colors = ['black', 'orange', 'blue', 'green']
#linestyles = [':', '--', '-.', '-']
#for clf, label, clr, ls \
#        in zip(all_clf,
#               clf_labels, colors, linestyles):
#
    # assuming the label of the positive class is 1
#    y_pred = clf.fit(X_train, 
#                          y_train).predict_proba(X_test)[:, 1]
#    fpr, tpr, thresholds = roc_curve(y_true=y_test,
#                                     y_score=y_pred)
#    roc_auc = auc(x=fpr, y=tpr)
#    plt.plot(fpr, tpr,
#             color=clr,
#             linestyle=ls,
#             label='%s (auc = %0.2f)' % (label, roc_auc))
#
#plt.legend(loc='lower right')
#plt.plot([0, 1], [0, 1],
#         linestyle='--',
#        color='gray',
#         linewidth=2)

#plt.xlim([-0.1, 1.1])
#plt.ylim([-0.1, 1.1])
#plt.grid(alpha=0.5)
#plt.xlabel('False positive rate (FPR)')
#plt.ylabel('True positive rate (TPR)')

#plt.savefig(os.path.join(IMAGE_DIR, 'Pipeline_Rocs.png'), dpi=300)
#plt.show()