# Support Vector Machines
Corresponds with modu

In [17]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix  
import re
import string

from sklearn.model_selection import train_test_split
from sklearn import svm

## Prepare data

In [2]:
#load data that is saved locally
data = pd.read_csv("microfinance_tweets.csv", encoding="ISO-8859-1")

In [3]:
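# Recode the sentiment labels as integers: negative = -1, neutral = 0, positive = 1
data.loc[data['Sentiment'] == 'negative', 'Sentiment'] = -1
data.loc[data['Sentiment'] == 'neutral', 'Sentiment'] = 0
data.loc[data['Sentiment'] == 'positive', 'Sentiment'] = 1
# Cast from object to int so scikit-learn treats the column as class labels
data['Sentiment'] = data['Sentiment'].astype(int)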

In [4]:
data.head()

Unnamed: 0,Comments,Date,Favorites,User,Polarity,Sentiment
0,RT @atmadiprayET: Here's why Janalakshmi Finan...,3/22/2018 5:40,0,Saloni Shukla,-0.1,-1
1,RT @ecosmob: Ecosmob's #Mobility solutions for...,3/22/2018 5:36,0,Sindhav Bhageerath,-0.0625,0
2,Project have big future! Microfinance is belie...,3/22/2018 5:27,0,Konstantin #savedroidICO,0.166667,1
3,#Online #Banking- Yako Microfinance Bank prov...,3/22/2018 5:21,0,YakoMicrofinance,0.5,1
4,MICROFINANCE EVENT: 3rd BoP Global Network Sum...,3/22/2018 5:19,0,MicroCapital,0.045455,1


In [11]:
train, test = train_test_split(data, test_size=0.2, random_state=42)

In [12]:
vectorizer = CountVectorizer()
train_features = vectorizer.fit_transform(train['Comments'])
test_features =  vectorizer.transform(test['Comments'])

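The data are now vectorized as a bag of words: each column index corresponds to one word in the vocabulary learned from the training tweets, and each value is the number of times that word appears in a tweet. In the sparse output below, each line reads (row, column index) followed by the count.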

In [52]:
print(train_features[0])

  (0, 585)	2
  (0, 778)	1
  (0, 788)	1
  (0, 1301)	1
  (0, 1302)	1
  (0, 1940)	1
  (0, 1994)	1
  (0, 2088)	1
  (0, 2230)	1
  (0, 3106)	1
  (0, 3381)	2
  (0, 3573)	1
  (0, 3770)	2
  (0, 4161)	1
  (0, 4516)	1
  (0, 5257)	1
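
To make the (row, column) pairs above easier to read, the column indices can be mapped back to words. The sketch below uses a made-up two-sentence corpus in place of train['Comments'] (an assumption for illustration) and inverts the vectorizer's vocabulary_ dictionary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for train['Comments']
docs = ["microfinance helps small business", "small loans help small farmers"]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(docs)

# vocabulary_ maps word -> column index; invert it to look words up by index
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# Decode the second document: nonzero column indices and their counts
row = features[1]
counts = {index_to_word[i]: int(c) for i, c in zip(row.indices, row.data)}
print(counts)
```

The same lookup should work on the notebook's own vectorizer and train_features once they have been fitted.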


## Linear SVM

There are many types of SVMs, but we will first try a linear SVM, the most basic. This means that the decision boundary will be linear. <br>

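Another input is decision_function_shape, which has two options: one versus rest ('ovr') and one versus one ('ovo'). These describe how a multi-class problem is split into binary ones: one versus rest separates each class from all the others (negative versus everyone else), while one versus one separates each pair of classes (negative versus neutral, and so on) (https://pythonprogramming.net/support-vector-machine-parameters-machine-learning-tutorial/). The default is one versus rest. As a general strategy, one versus rest trains fewer classifiers and so takes less computational power, but it may be thrown off by outliers and does not do well on imbalanced data sets, e.g. more of one class than another. Note that scikit-learn's SVC always trains its models one versus one internally; decision_function_shape only changes the shape of the values returned by decision_function.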
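
As a small illustration of what decision_function_shape changes, the sketch below fits the same linear SVC with both settings and compares the output of decision_function. It uses synthetic four-class data from make_blobs (an assumption; with only three classes, as in the tweets, both shapes happen to have three columns).

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Synthetic four-class data, used only for illustration
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

clf_ovr = svm.SVC(kernel='linear', decision_function_shape='ovr').fit(X, y)
clf_ovo = svm.SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)

ovr_shape = clf_ovr.decision_function(X).shape  # one column per class
ovo_shape = clf_ovo.decision_function(X).shape  # one column per pair of classes
same_predictions = np.array_equal(clf_ovr.predict(X), clf_ovo.predict(X))
print(ovr_shape, ovo_shape, same_predictions)
```

With the default settings the predicted labels are the same either way; only the layout of the decision values differs.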

In [36]:
clf = svm.SVC(kernel='linear')  
clf.fit(train_features, train['Sentiment'])

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [56]:
y_train = clf.predict(train_features)  

In [57]:
print(confusion_matrix(train['Sentiment'],y_train)) 
print(classification_report(train['Sentiment'],y_train))  

[[ 188   19    0]
 [   6 1540    0]
 [   0    0  839]]
             precision    recall  f1-score   support

         -1       0.97      0.91      0.94       207
          0       0.99      1.00      0.99      1546
          1       1.00      1.00      1.00       839

avg / total       0.99      0.99      0.99      2592



In [58]:
y_pred = clf.predict(test_features)  

In [59]:
print(confusion_matrix(test['Sentiment'],y_pred)) 
print(classification_report(test['Sentiment'],y_pred))  

[[ 41   8   3]
 [  8 386   3]
 [  1   9 190]]
             precision    recall  f1-score   support

         -1       0.82      0.79      0.80        52
          0       0.96      0.97      0.96       397
          1       0.97      0.95      0.96       200

avg / total       0.95      0.95      0.95       649



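What do you think of the performance of the SVM? Training performance (0.99) is only slightly higher than test performance (0.95), so it doesn't look like we've overfit too much. If we had, we could lower the regularization parameter C; gamma has no effect on a linear kernel and only matters for the rbf, poly, and sigmoid kernels.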
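
One way to probe overfitting is to compare training and test accuracy while varying the regularization parameter C of a linear SVM. The sketch below does this on synthetic data from make_classification (an assumption standing in for the tweet features); the list of C values is arbitrary.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the tweet features
X, y = make_classification(n_samples=400, n_features=50, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scores = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    model = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)
    # A large gap between the two accuracies suggests overfitting
    scores[C] = (model.score(X_train, y_train), model.score(X_test, y_test))

for C, (train_acc, test_acc) in scores.items():
    print(f"C={C}: train={train_acc:.2f}, test={test_acc:.2f}")
```

Smaller values of C regularize more strongly, so they are the direction to try if the training score is far above the test score.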

Remember that support vectors are the data points that lie closest to the decision surface (or hyperplane). We can figure out what those data points are below for each class we are classifying, noting that we have three classes for negative, neutral, and positive.

In [39]:
print(clf.support_vectors_)

  (0, 531)	1.0
  (0, 1440)	1.0
  (0, 2371)	1.0
  (0, 2769)	1.0
  (0, 2775)	2.0
  (0, 2780)	1.0
  (0, 3106)	1.0
  (0, 3157)	1.0
  (0, 3312)	1.0
  (0, 3381)	1.0
  (0, 3496)	1.0
  (0, 3729)	1.0
  (0, 4796)	1.0
  (0, 4864)	1.0
  (0, 4964)	1.0
  (0, 5021)	1.0
  (0, 5059)	1.0
  (0, 5092)	1.0
  (0, 5156)	2.0
  (0, 5638)	1.0
  (1, 374)	2.0
  (1, 585)	1.0
  (1, 1885)	1.0
  (1, 2484)	2.0
  (1, 2485)	1.0
  :	:
  (1299, 3729)	1.0
  (1299, 3861)	1.0
  (1299, 3999)	1.0
  (1299, 4102)	1.0
  (1299, 5156)	2.0
  (1299, 5370)	1.0
  (1300, 614)	1.0
  (1300, 934)	1.0
  (1300, 1213)	1.0
  (1300, 1401)	1.0
  (1300, 1473)	1.0
  (1300, 1518)	1.0
  (1300, 1684)	1.0
  (1300, 1925)	1.0
  (1300, 2097)	1.0
  (1300, 2501)	1.0
  (1300, 3106)	1.0
  (1300, 3487)	1.0
  (1300, 4358)	1.0
  (1300, 4913)	1.0
  (1300, 5104)	1.0
  (1300, 5156)	1.0
  (1300, 5158)	1.0
  (1300, 5573)	1.0
  (1300, 5627)	1.0


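We can check the number of support vectors in each class with the n_support_ attribute. The counts follow the order of the class labels (-1, 0, 1), so most support vectors are in the second class, the neutral class, which is also the largest class in the training data.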

In [40]:
clf.n_support_

array([152, 713, 436])
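
To avoid guessing which count belongs to which class, n_support_ can be paired with classes_, which lists the labels in the same order. The sketch below uses synthetic three-class data with labels shifted to -1, 0, 1 to mirror the sentiment coding (an assumption; clf_demo is a stand-in name so the notebook's clf is not overwritten).

```python
from sklearn import svm
from sklearn.datasets import make_blobs

# Synthetic three-class data; labels shifted to -1, 0, 1 to mirror the sentiment coding
X, y = make_blobs(n_samples=150, centers=3, random_state=0)
y = y - 1

clf_demo = svm.SVC(kernel='linear').fit(X, y)

# classes_ is sorted, and n_support_ follows the same order
support_per_class = {int(c): int(n) for c, n in zip(clf_demo.classes_, clf_demo.n_support_)}
print(support_per_class)
```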

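We can also find the support vectors in our training data using the row indices provided by clf.support_. For example, the first index is 8, and row 8 of train_features matches the first support vector printed above.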

In [41]:
clf.support_

array([   8,   21,   34, ..., 2573, 2585, 2587])

In [49]:
print(train_features[8])

  (0, 531)	1
  (0, 1440)	1
  (0, 2371)	1
  (0, 2769)	1
  (0, 2775)	2
  (0, 2780)	1
  (0, 3106)	1
  (0, 3157)	1
  (0, 3312)	1
  (0, 3381)	1
  (0, 3496)	1
  (0, 3729)	1
  (0, 4796)	1
  (0, 4864)	1
  (0, 4964)	1
  (0, 5021)	1
  (0, 5059)	1
  (0, 5092)	1
  (0, 5156)	2
  (0, 5638)	1
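
The relationship between support_ and support_vectors_ can be checked for every row at once rather than by eye. This is a minimal sketch on synthetic data (an assumption; clf_demo is a stand-in name), confirming that indexing the training matrix with support_ reproduces the stored support vectors.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=150, centers=3, random_state=0)
clf_demo = svm.SVC(kernel='linear').fit(X, y)

# Rows of the training matrix selected by support_ are the support vectors themselves
rows_match = np.allclose(X[clf_demo.support_], clf_demo.support_vectors_)
print(rows_match)
```

Because train_features keeps the row order of train, something like train.iloc[clf.support_] should return the tweets behind the support vectors in this notebook.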


## Non-linear SVM

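We can also try different kernel types: rbf is the Gaussian (radial basis function) kernel, and sigmoid is based on the hyperbolic tangent, which has the same S shape as the sigmoid function in logistic regression. The visualization below is the simplest way to understand the difference: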

<img src="svm shapes.png">
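
As a rough sketch of what the rbf and sigmoid kernels compute, the block below writes out both formulas with NumPy and checks them against scikit-learn's pairwise kernel helpers. The toy points and the values of gamma and coef0 are assumptions chosen for illustration, not settings from this notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, sigmoid_kernel

rng = np.random.default_rng(0)
X = rng.random((5, 3))   # five toy points with three features
gamma, coef0 = 0.5, 0.0  # illustrative values

# RBF: exp(-gamma * ||x - y||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
rbf_manual = np.exp(-gamma * sq_dists)

# Sigmoid: tanh(gamma * <x, y> + coef0)
sigmoid_manual = np.tanh(gamma * (X @ X.T) + coef0)

rbf_ok = np.allclose(rbf_manual, rbf_kernel(X, gamma=gamma))
sigmoid_ok = np.allclose(sigmoid_manual, sigmoid_kernel(X, gamma=gamma, coef0=coef0))
print(rbf_ok, sigmoid_ok)
```

Both kernels depend on gamma, which is why that parameter matters for the models in this section but not for the linear kernel.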

In [30]:
clf = svm.SVC(kernel='rbf')  
clf.fit(train_features, train['Sentiment'])

y_pred = clf.predict(test_features)  

In [31]:
print(confusion_matrix(test['Sentiment'],y_pred)) 
print(classification_report(test['Sentiment'],y_pred))  

[[  0  52   0]
 [  0 397   0]
 [  0 200   0]]
             precision    recall  f1-score   support

         -1       0.00      0.00      0.00        52
          0       0.61      1.00      0.76       397
          1       0.00      0.00      0.00       200

avg / total       0.37      0.61      0.46       649





In [34]:
clf = svm.SVC(kernel='sigmoid')  
clf.fit(train_features, train['Sentiment'])

y_pred = clf.predict(test_features)  

In [35]:
print(confusion_matrix(test['Sentiment'],y_pred)) 
print(classification_report(test['Sentiment'],y_pred))  

[[  0  52   0]
 [  0 397   0]
 [  0 200   0]]
             precision    recall  f1-score   support

         -1       0.00      0.00      0.00        52
          0       0.61      1.00      0.76       397
          1       0.00      0.00      0.00       200

avg / total       0.37      0.61      0.46       649





It looks like the linear SVM performs best on this data set from both a precision and a recall perspective; the rbf and sigmoid models predict the neutral class for every test tweet. Remember that precision is the share of predictions for a class that are correct, and recall is the share of the true members of a class that we capture.
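
To make the two definitions concrete, precision and recall can be recomputed by hand from the linear SVM's test confusion matrix printed earlier. The sketch below assumes scikit-learn's layout, with rows as true classes and columns as predicted classes.

```python
import numpy as np

# Test-set confusion matrix of the linear SVM (rows: true class, columns: predicted class)
cm = np.array([[41, 8, 3],
               [8, 386, 3],
               [1, 9, 190]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # divide by column totals (predicted counts)
recall = true_positives / cm.sum(axis=1)     # divide by row totals (true counts)

for label, p, r in zip([-1, 0, 1], precision, recall):
    print(f"class {label}: precision={p:.2f}, recall={r:.2f}")
```

The rounded values agree with the precision and recall columns of the classification report for the linear model.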

What does this mean about our underlying data?

Source: https://stackabuse.com/implementing-svm-and-kernel-svm-with-pythons-scikit-learn/, https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html, https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805