# SVM

**Description:**   

A support vector machine (SVM) selects a hyperplane (a boundary that splits the input variable space) that best separates the points by their class, either class 0 or class 1. In two dimensions, you can visualize the hyperplane as a line. The distance between the hyperplane and the closest data points is referred to as the **margin**, and the best or optimal hyperplane is the one with the largest margin. An optimization algorithm finds the coefficient values that maximize the margin. Only the closest points, called the **support vectors**, are relevant in defining the hyperplane and in constructing the classifier.

**Output Type:** binary

**Pros:**  
- Effective in high-dimensional spaces, due to feature transformation and because only the support vectors, not the distances between all data points, determine the boundary
- Works really well when there is a clear margin of separation
- Effective in cases where the number of dimensions is greater than the number of samples
- Uses a subset of training points in the decision function (the support vectors), so it is also memory efficient
- SVM might be one of the most powerful out-of-the-box classifiers

**Cons:**   
- Does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
- Prone to overfitting
- Poor performance with large data sets, because the required training time is higher
- Poor performance with very noisy data sets, especially where the target classes overlap

**Examples:**  
- Sentiment Analysis
https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/


**Testing Methods:**  
- Cross-validation
- Confusion matrix
- Precision and recall
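
As a rough illustration of these three checks, the sketch below trains an SVC on a synthetic binary data set. The data from `make_classification`, the split sizes, and the variable names are assumptions chosen for the example; they are not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic two-class data (an assumption for illustration only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = SVC(kernel="linear", C=1.0)

# Cross-validation: five accuracy scores, one per fold of the training data.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Confusion matrix, precision, and recall on the held-out test split.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("CV accuracy per fold:", cv_scores)
print("Confusion matrix:\n", cm)
print("Precision:", precision, "Recall:", recall)
```

Rows of the confusion matrix are the true classes and columns are the predicted classes, so the off-diagonal cells count the mistakes that precision and recall summarize.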

## SVM from sklearn

## Parameters

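`sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)`

Note: older scikit-learn releases used `gamma=0.0` as the default, which meant 1 / n_features; recent releases use `gamma='scale'`.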

Parameters with higher impact on model performance:  “kernel”, “gamma” and “C”


In [2]:
# source:  https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
# Import library
from sklearn import svm
# Assumes you have X (predictors) and y (target) for the training set and x_test (predictors) for the test set
# Create the SVM classification object (names are case-sensitive: svm.SVC and C, not svm.svc and c)
model = svm.SVC(kernel='linear', C=1, gamma=1)
# Other options include changing the kernel, gamma, and C values; these are discussed in the sections below.
# Train the model on the training set and check its score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)

### Kernels: a way to transform features

Instead of being forced to live in a coordinate system such as <x0,⋯, x1>, we can transform our data into a new coordinate system in which the problem is easier to solve.  
(Plot of two circles, one inside the other: try to draw a straight line that separates the two circles.)
https://www.safaribooksonline.com/library/view/thoughtful-machine-learning/9781491924129/assets/tmlp_0704.png

These look like regular circles, so there doesn’t appear to be a line that could separate them. This is true in a 2D Cartesian coordinate system, but if you project the points into a 3D Cartesian coordinate system, <x, y> → <x², √2·xy, y²>, you will find that the problem in fact becomes linear.  
Now you can see that the two circles are separate and you can easily draw a plane between them. If you mapped that plane back to the original 2D space, it would appear as a third circle in the middle, between the other two.
https://www.safaribooksonline.com/library/view/thoughtful-machine-learning/9781491924129/assets/tmlp_0705.png
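
The projection can be checked numerically. The sketch below is a minimal example that assumes circle radii of 1 and 3 and a separating plane chosen by hand; the function name `phi` is hypothetical. It also shows why this mapping matters for kernels: the dot product in the 3D space equals the squared dot product in the original 2D space, so it can be computed without ever building the 3D points.

```python
import numpy as np


def phi(points):
    """Map 2-D points <x, y> to 3-D points <x^2, sqrt(2)*x*y, y^2>."""
    x, y = points[:, 0], points[:, 1]
    return np.column_stack([x**2, np.sqrt(2) * x * y, y**2])


# Two concentric circles (radii 1 and 3 are assumed for illustration).
angles = np.linspace(0, 2 * np.pi, 50, endpoint=False)
unit_circle = np.column_stack([np.cos(angles), np.sin(angles)])
inner = 1.0 * unit_circle
outer = 3.0 * unit_circle

# In the mapped space the first and third coordinates sum to x^2 + y^2,
# so the plane z1 + z3 = 4 lies between the two circles.
normal = np.array([1.0, 0.0, 1.0])
inner_side = phi(inner) @ normal - 4.0
outer_side = phi(outer) @ normal - 4.0
print("Inner circle below plane:", bool(inner_side.max() < 0))
print("Outer circle above plane:", bool(outer_side.min() > 0))

# phi(a) . phi(b) equals (a . b)^2, a degree-2 homogeneous polynomial kernel.
a = np.array([[1.0, 2.0]])
b = np.array([[3.0, -1.0]])
lhs = (phi(a) @ phi(b).T).item()
rhs = (a @ b.T).item() ** 2
print("Kernel identity holds:", bool(np.isclose(lhs, rhs)))
```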

As a side note, there are many different types of projections (or kernels) such as:  
- Polynomial kernel (heterogeneous and homogeneous)
- Radial basis functions
- Gaussian kernels

Options available with sklearn.svm.SVC:  
- “linear”, “**rbf**” (the default), “poly”, and others  
- “rbf” and “poly” are useful for non-linear hyperplanes.
- The iris example below uses a linear kernel on two features of the iris data set to classify each sample’s class.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

In [None]:
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features. We could avoid this ugly slicing by using a two-dim dataset
y = iris.target

In [None]:
# Create an instance of SVM and fit our data. We do not scale the data since we want to plot the support vectors
C = 1.0 # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)  # gamma is not used by the linear kernel, so it is omitted

In [None]:
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100  # mesh step size: 1/100 of the x-axis range
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

In [None]:
plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title('SVC with linear kernel')
plt.show()

Change the kernel type to rbf and look at the impact.

In [None]:
svc = svm.SVC(kernel='rbf', C=1, gamma='auto').fit(X, y)  # gamma='auto' uses 1 / n_features; older releases wrote this as gamma=0

Use a linear kernel if you have a large number of features (>1000), because the data is more likely to be linearly separable in a high-dimensional space. You can also use RBF, but do not forget to cross-validate its parameters to avoid over-fitting.

### gamma
Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. The higher the value of gamma, the more closely the model tries to fit the training data set exactly, which increases the generalization error and causes over-fitting.

Example: Explore the difference between gamma values such as 'auto' (written as 0 in older scikit-learn releases), 10, and 100.

In [None]:
svc = svm.SVC(kernel='rbf', C=1, gamma='auto').fit(X, y)  # then try gamma=10 and gamma=100
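
One way to see the effect of gamma is to compare training accuracy with cross-validated accuracy for each value. The sketch below is a hedged example on the same two iris features; the gamma values 0.1, 10, and 100 and the five-fold split are assumptions chosen for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X = iris.data[:, :2]  # the same two features used in the plots
y = iris.target

results = {}
for gamma in (0.1, 10, 100):  # values assumed for illustration
    model = svm.SVC(kernel="rbf", C=1.0, gamma=gamma)
    train_acc = model.fit(X, y).score(X, y)
    cv_acc = cross_val_score(model, X, y, cv=5).mean()
    results[gamma] = (train_acc, cv_acc)
    print(f"gamma={gamma}: train accuracy={train_acc:.3f}, CV accuracy={cv_acc:.3f}")
```

A gap that widens between training accuracy and cross-validated accuracy as gamma grows is the over-fitting pattern described above.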


### C
Penalty parameter of the error term.  
It also controls the trade-off between a smooth decision boundary and classifying the training points correctly.  
Compare C = 1, 100, and 1000.
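
The comparison can be sketched as follows. This is a minimal example on the two iris features used earlier; the choice of an RBF kernel with `gamma='auto'` and the `summary` dictionary are assumptions made for illustration. It reports how many support vectors each model keeps and how well it fits the training points.

```python
from sklearn import datasets, svm

iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

summary = {}
for C in (1, 100, 1000):
    model = svm.SVC(kernel="rbf", C=C, gamma="auto").fit(X, y)
    # n_support_ holds the number of support vectors for each class.
    n_support = int(model.n_support_.sum())
    train_acc = model.score(X, y)
    summary[C] = (n_support, train_acc)
    print(f"C={C}: support vectors={n_support}, train accuracy={train_acc:.3f}")
```

Training accuracy alone says nothing about generalization, so these numbers are best read alongside a cross-validation score.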

### Parameter Tuning Summary
We should always look at the cross-validation score to find an effective combination of these parameters and avoid over-fitting.
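
A cross-validated grid search is one common way to do this. The sketch below uses scikit-learn's `GridSearchCV`; the candidate values in `param_grid` are assumptions for illustration and would need adjusting for real data.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# Candidate values are assumptions for illustration; adjust them to your data.
# gamma is only searched for the RBF kernel because the linear kernel ignores it.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": ["scale", 0.1, 1, 10]},
]

# Each combination is scored with five-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 3))
```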

### Our Challenge
Are our customers happy or not? We understand that happy customers generally say nice things while unhappy ones don’t. This is their sentiment.
There are two tiers to this problem:
1. We need to figure out whether customers are happy or not, or whether their sentiment is positive or negative in what they say.
2. Does overall customer sentiment correlate with our bottom line?

We also assume that a happy customer means more money, but is that actually true? How can we even build an algorithm to test something like that?

To start solving this two-tiered problem, we will figure out a way to map customers to sentiment. There are many ways to approach this problem, such as clustering customers into two groups or using KNN to find the closest neighbors to people we know are unhappy or happy. Or we could use SVMs.



### Data
We first need data that indicates sentiment. 
- A support system that allows access to verbatim text from our customers
- Social media posts from customers
- Free text survey responses

### Labels
How do we determine whether they are happy or not, i.e. label the data? 
- Have support agents tag each individual ticket with a sentiment (positive or negative).
- Have support agents tag a subset of tickets (X% of all tickets).
- Use an existing tagged database (such as movie reviews or some academic data set).

#### Data Collection Notes:

Having a group of people tag a subset is generally the right way to approach this problem.
We want to achieve the best results for the least amount of work.  We could have support agents collect a subset of tickets (say 30% of all tickets) and as a group tag them either negative or positive.
But because humans differ in their opinions on problems like this, it’s important to apply some sort of voting mechanism, whether it’s a mean or a mode. So if we were to tag 30% of all tickets, we would want at least three people tagging each ticket so that we could either average the answers or find the most common answer.
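
The voting step can be sketched in a few lines. The ticket IDs and tags below are hypothetical, and the helper names `majority_label` and `mean_score` are invented for this example. With three annotators and two possible tags, the mode can never tie.

```python
from collections import Counter

# Hypothetical tags from three support agents per ticket.
tags_by_ticket = {
    "ticket-1": ["positive", "positive", "negative"],
    "ticket-2": ["negative", "negative", "negative"],
    "ticket-3": ["negative", "positive", "negative"],
}


def majority_label(tags):
    """Return the most common tag (the mode) among the annotators."""
    label, _count = Counter(tags).most_common(1)[0]
    return label


def mean_score(tags):
    """Average the tags after mapping positive to +1 and negative to -1."""
    values = [1 if tag == "positive" else -1 for tag in tags]
    return sum(values) / len(values)


labels = {ticket: majority_label(tags) for ticket, tags in tags_by_ticket.items()}
scores = {ticket: mean_score(tags) for ticket, tags in tags_by_ticket.items()}
print(labels)
print(scores)
```

The sign of the mean gives the same answer as the mode here; the mean additionally shows how strongly the annotators agreed.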


### Algorithm Selection
Many algorithms would work for this problem.   
SVMs would work especially well here because:
- They cope well with the curse of dimensionality, meaning we can use lots of dimensions (features)
- They have been shown to work well with sentiment analysis, which is pertinent to the issues discussed next

### The Theory Behind SVMs

https://www.safaribooksonline.com/library/view/thoughtful-machine-learning/9781491924129/ch07.html#support_vector_machines

Let’s imagine we have data from our customers, in this case support tickets. In this example, let’s say the customer is either happy or unhappy with the ticket.
Conceptually, if we were to build a model of what makes a customer happy or unhappy, we could take our inputs (in this case features from the text) and determine customer groupings, like in KNN.  
The problem here is that textual features are generally high in number, which can incur the curse of dimensionality. For instance, given a set of support tickets there might be 4,000 dimensions, each indicating whether the customer used a particular word from the corpus. So instead of relying on KNN, we should approach this model via a decision boundary.  
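
To make the dimension count concrete, the sketch below turns three hypothetical support tickets into word-presence features with scikit-learn's `CountVectorizer`. The ticket texts are invented for illustration. Each distinct word becomes one dimension, which is how a real ticket archive reaches thousands of dimensions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical support tickets, invented for illustration.
tickets = [
    "The app crashes every time I log in",
    "Thanks, the new update works great",
    "Still waiting for a refund, very unhappy",
]

# binary=True records whether a word occurs, not how often.
vectorizer = CountVectorizer(binary=True)
features = vectorizer.fit_transform(tickets)

print("Tickets x dimensions:", features.shape)
print("Vocabulary:", sorted(vectorizer.vocabulary_))
print("Non-zero entries:", features.nnz)
```

The result is a sparse matrix: most entries are zero because each ticket uses only a small fraction of the vocabulary.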

### Decision Boundaries  

*show image of points where a natural decision boundary can be drawn*   
Looking at these data points you could think about splitting the data into two pieces by drawing a line down the middle, which will likely give a good solution.  
There are many algorithms that use the decision boundary method, including rules-based algorithms, decision trees, and random forests.  

But for sentiment analysis with 4,000 dimensions, given what we see here, how can we find the best boundary that splits the data into two parts?

### Maximizing Boundaries  

To find the optimal line between the two sets of data, SVMs find the widest margin between them.  
They aim to maximize the breadth of the margin between two (or more) classes. 
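
For a linear SVM the margin can be read off the fitted model: with weight vector w, the margin width is 2 / ‖w‖. The sketch below is a minimal example on six hand-picked, linearly separable points; the points and the large value of C are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Six hand-picked, linearly separable points (assumed for illustration).
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A large C leaves little room for margin violations on separable data.
model = SVC(kernel="linear", C=1000).fit(X, y)

w = model.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:\n", model.support_vectors_)
print("Margin width:", margin_width)
```

For these points the closest approach between the two classes is from the line x + y = 1 to the point (3, 3), so the margin width should come out near 5 / √2 ≈ 3.54.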





### Sentiment Analyzer  

- Goal:  build a sentiment analyzer that determines the sentiment of movie reviews. The example we’ll use also applies to working with support tickets. 
- Class diagram for movie-review sentiment analyzer to visualize what this tool will look like. https://www.safaribooksonline.com/library/view/thoughtful-machine-learning/9781491924129/assets/tmlp_0707.png
- Build a Corpus: transforming the text into numerical information
- Build a CorpusSet: Turn features into sparse matrix
- Build a SentimentClassifier: SentimentClassifier is where we will then use the SVM algorithm to build this sentiment analyzer.

### Vocab:

**Corpus**:  like corpse, means a body, but in this case it’s a body of writings. This word is used heavily in the natural-language processing community to signal a big group of previous writings that can be used to infer knowledge. In our example, we are using corpus to refer to a body of writings around a certain sentiment. **Corpora** is the plural of corpus.

### Strategy

1. Corpus class:  Parse sentiment text and store as a corpus with frequencies in it.  
    - Tokenizing text: extracting stems, frequency of letters, emoticons, and words (defined as strings between nonalpha characters). StringIO makes strings look like IO objects, which makes it easy to test file IO–type operations on strings.
    - Sentiment leaning, whether :negative or :positive
    - Mapping from sentiment leaning to a numerical value
    - Returning a unique set of words from the corpus
    - image of text tokenization: https://www.safaribooksonline.com/library/view/thoughtful-machine-learning/9781491924129/assets/tmlp_0708.png


2. CorpusSet class:  
    - Takes in multiple corpora, transforms all of that into a matrix of features, and has the properties words, x’s and y’s. 
    - This doesn’t do much except store all of the words in a set for later use, by iterating the corpora and storing all the unique words. 
    - From here we need to calculate the sparse vectors we will use in the SVM, which depends on building a feature matrix composed of feature vectors

3. Model Validation and the Sentiment Classifier
    - Cross-validation will determine how well our classification works.
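
The three steps above can be sketched end to end. This is a simplified illustration, not the book's implementation: the class bodies, the review texts, and the choice of a linear `SVC` are assumptions, and the tokenizer only extracts words (no stems or emoticons).

```python
import io
import re

from scipy.sparse import lil_matrix
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


class Corpus:
    """Text with one sentiment leaning (hypothetical sketch of step 1)."""

    SENTIMENT_CODES = {"positive": 1, "negative": -1}

    def __init__(self, stream, sentiment):
        self.sentiment_code = self.SENTIMENT_CODES[sentiment]
        # Each line of the stream is treated as one document.
        self.documents = [self.tokenize(line) for line in stream.read().splitlines()]

    @staticmethod
    def tokenize(text):
        """Words are runs of letters between non-alpha characters."""
        return re.findall(r"[a-z]+", text.lower())

    @property
    def words(self):
        """Unique set of words in this corpus."""
        return {word for tokens in self.documents for word in tokens}


class CorpusSet:
    """Combine corpora into a sparse feature matrix (sketch of step 2)."""

    def __init__(self, corpora):
        self.words = sorted(set().union(*(corpus.words for corpus in corpora)))
        column_of = {word: index for index, word in enumerate(self.words)}
        rows = [(tokens, c.sentiment_code) for c in corpora for tokens in c.documents]
        xs = lil_matrix((len(rows), len(self.words)), dtype=float)
        self.ys = []
        for row, (tokens, code) in enumerate(rows):
            for word in tokens:
                xs[row, column_of[word]] = 1.0  # 1 marks that the word occurs
            self.ys.append(code)
        self.xs = xs.tocsr()


# Hypothetical reviews; StringIO makes each string behave like a file.
positive = Corpus(io.StringIO(
    "loved this movie\ngreat acting and a great story\n"
    "wonderful and fun\nloved the cast\na great fun film\nwonderful story"
), "positive")
negative = Corpus(io.StringIO(
    "hated this movie\nterrible acting and a boring story\n"
    "awful and dull\nhated the cast\na boring dull film\nterrible story"
), "negative")

corpus_set = CorpusSet([positive, negative])

# Step 3: cross-validation estimates how well the classifier generalizes.
scores = cross_val_score(SVC(kernel="linear"), corpus_set.xs, corpus_set.ys, cv=3)
print("Vocabulary size:", len(corpus_set.words))
print("Accuracy per fold:", scores)
```

With only twelve short reviews the fold scores are noisy; the point is the shape of the pipeline, not the numbers.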
