
# Support Vector Machines

One thing SVMs are very good at is text classification.

The goal here is to determine whether a tweet was written by a Democratic or Republican politician, using just the text of the tweet.

The `sklearn` library is used in this exercise.

The data has three fields:

| Feature 	| Description               	|
|---------	|---------------------------	|
|   Party 	| Democrat or Republican    	|
|  Handle 	| The author's Twitter name 	|
|   Tweet 	| The text of the tweet     	|

In [4]:
import pandas as pd

data = pd.read_csv('data/tweets.csv')

# Training an SVM on a lot of data with a lot of features can take a few minutes,
# so to keep things speedy here we will use a subset of the data.
data = data.sample(5000, random_state=5)

data.head()

Unnamed: 0,Party,Handle,Tweet
4098,Democrat,RepStephMurphy,"RT @SSNAlerts: .@RepStephMurphy's, Mike Kelly'..."
39638,Democrat,cbrangel,America is greater thx to contributions of Lat...
44277,Republican,RepPaulMitchell,Speaking to @therealnmma this morning about my...
70181,Republican,RepDavid,RT @DodieLondenEIPS: Congratulations to Britta...
32440,Democrat,BennieGThompson,"Republicans control the House, Senate and Whit..."


## Looking at the data

How many politicians do we have tweets from, per party?

(Not how many tweets per party!)

In [5]:
data.drop_duplicates(subset='Handle')['Party'].value_counts()

Republican    222
Democrat      211
Name: Party, dtype: int64

And how many tweets per politician?

In [6]:
data.groupby('Handle')['Tweet'].agg('count')

Handle
AGBecerra          16
AlanGrayson        18
AnthonyBrownMD4     5
AustinScottGA08     9
BennieGThompson     6
                   ..
reppittenger       15
repsandylevin      11
rosadelauro        12
sethmoulton         9
virginiafoxx        7
Name: Tweet, Length: 433, dtype: int64

## Working with text data

The features for an SVM can't be words or whole tweets. We need a numerical representation for the words in the texts. One method is to transform the text into TF-IDF vectors.

`TfidfVectorizer` takes the tweets, tokenises them into words (using a special tokeniser that knows how best to split up tweets), and removes stop words (very common words like "the" and "and", which contribute little to the meaning of a tweet). It then creates a sparse matrix representation of all the tweets: each row is a single tweet, and each column is a word in the vocabulary of all the tweets.

We keep only the 5000 most frequent words (`max_features=5000`), because training a model on all ~200k words would take a long time.

(This will take a few seconds!)

In [13]:
# twokenize must be installed first. You only need to run this cell once.
!pip install twokenize



In [14]:
import twokenize
from sklearn.feature_extraction.text import TfidfVectorizer

transformer = TfidfVectorizer(tokenizer=twokenize.tokenizeRawTweetText, stop_words='english', max_features=5000)
tweet_vecs = transformer.fit_transform(data['Tweet'])

tweet_vecs

<5000x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 53445 stored elements in Compressed Sparse Row format>

In [15]:
# Some words in the vocabulary and their IDs

list(transformer.vocabulary_.items())[0:10]

[('rt', 4015),
 (':', 503),
 ('.', 325),
 (',', 318),
 ('mike', 3219),
 ('cracking', 1671),
 ('house', 2625),
 ('…', 4977),
 ('america', 985),
 ('greater', 2422)]
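To see what the vectoriser produces on a smaller scale, here is a minimal sketch using three made-up sentences (the `toy_tweets` list and the `toy_` variable names are illustrative assumptions, not part of the dataset). It uses the default `sklearn` tokeniser rather than `twokenize`, so it runs without extra installs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example sentences (assumption: not taken from the tweet data).
toy_tweets = [
    "the senate passed the healthcare bill",
    "the house voted on taxes and healthcare",
    "taxes and jobs matter to the voters",
]

# Same idea as above, but with the default tokeniser and no vocabulary cap.
toy_transformer = TfidfVectorizer(stop_words='english')
toy_vecs = toy_transformer.fit_transform(toy_tweets)

# One row per sentence, one column per remaining vocabulary word.
print(toy_vecs.shape)
print(sorted(toy_transformer.vocabulary_))

# Dense view of the sparse matrix: zero means the word is absent.
print(toy_vecs.toarray().round(2))
```

Stop words such as "the" and "and" do not appear in the vocabulary, and words that occur in fewer sentences receive a higher weight than words shared across sentences.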

## Setting up the data

As is standard, we will use some of our data for training the model and some of it for evaluating it. Evaluating on held-out data shows how well the model generalises to unseen data, rather than how well it has fit (or overfit) the data it has already seen.

First, we set up a variable `y` to store the Party labels we want to predict.

Then, we use the `train_test_split` function to split up the `tweet_vecs` and the `y` data into train/test portions, using an 80:20 train:test ratio.

In [16]:
from sklearn.model_selection import train_test_split

y = data['Party']

X_train, X_test, y_train, y_test = train_test_split(tweet_vecs, y, test_size=0.2)
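Note that the split above is random, so the exact accuracies reported later will vary from run to run. As a hedged aside, the sketch below shows two optional `train_test_split` arguments on toy data: `random_state` to make the split repeatable and `stratify` to keep the class balance the same in both portions. The arrays `X_toy` and `y_toy` are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 rows, two balanced classes.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(['Democrat'] * 50 + ['Republican'] * 50)

# random_state fixes the shuffle; stratify keeps the 50:50 class ratio
# in both the training and the test portion.
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

print(Xt_train.shape, Xt_test.shape)
print(np.unique(yt_test, return_counts=True))
```

Whether to use these options on the tweet data is a design choice: a fixed seed makes results reproducible, at the cost of hiding how much the scores depend on the particular split.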

## Task 1: Train SVMs with different kernels

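SVMs in `sklearn` have a few configurable options. The key ones are the kernel (which lets the model separate classes that are not linearly separable in the original feature space) and the regularization value $C$ (which relaxes or tightens the margin: a smaller $C$ tolerates more misclassified training points, a larger $C$ tolerates fewer).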

Use a `for` loop to try different values for the kernel: `['linear', 'rbf', 'poly', 'sigmoid']`

On each iteration, create a classifier with that `kernel`, and call `.fit()` with the training data.

Then, use the `.score()` method to see how well the model did on both the seen and the unseen data. What do you observe?
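As a rough illustration of the loop described above, here is a sketch on synthetic data generated with `make_classification` (an assumption made so the example is self-contained; it is not the tweet data, and the `demo_` names are hypothetical). The same pattern should apply to `X_train`, `y_train`, `X_test` and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the tweet vectors).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

demo_scores = {}
for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
    # One classifier per kernel, all other settings left at their defaults.
    clf = SVC(kernel=kernel)
    clf.fit(Xd_train, yd_train)

    # .score() returns mean accuracy on the data it is given.
    train_acc = clf.score(Xd_train, yd_train)
    test_acc = clf.score(Xd_test, yd_test)
    demo_scores[kernel] = (train_acc, test_acc)

    print(f"Kernel: {kernel}")
    print(f"\tTraining accuracy:\t{train_acc:.3f}")
    print(f"\tTest accuracy:\t\t{test_acc:.3f}")
```

Comparing the two numbers for each kernel is the point of the exercise: a large gap between training and test accuracy is a sign of overfitting.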

In [17]:
from sklearn.svm import SVC

# Your code here...

###
###
###



Kernel: linear
	Training accuracy:	0.911
	Test accuracy:		0.700
Kernel: rbf
	Training accuracy:	0.989
	Test accuracy:		0.706
Kernel: poly
	Training accuracy:	0.997
	Test accuracy:		0.636
Kernel: sigmoid
	Training accuracy:	0.838
	Test accuracy:		0.687

In the run above, the models generally do very well on the training data (the RBF and polynomial kernels reach about 0.99), suggesting that the kernels are very good at finding a separating hyperplane for this high-dimensional data.
This does not translate directly into generalisability, though: accuracy on unseen data is much lower, between about 0.64 and 0.71.


## Task 2: Find the best model parameters

In addition to testing different kernels, try different values for $C$, the regularization hyperparameter.

Rather than doing this in a loop, one model at a time, we can parallelise it using `GridSearchCV` in `sklearn`.

The `GridSearchCV` class takes a model and a dictionary that maps each hyperparameter name to the list of values to try. It trains and cross-validates one model for every combination of values. You fit it as usual, using the training data from before.

Try the following:

1. Different kernels
2. A few values for $C$

Below, create a `GridSearchCV` in the same way you would do with a model: assign it to a variable named `gcv`, pass it the `classifier` as your basic model without parameters set, and also pass it `params`.

To speed things up, set `n_jobs=-1` to use all available CPU cores. Set `verbose=1` so you get updates as it proceeds - useful for making sure it is actually working!
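The general shape of a grid search can be sketched as follows. This uses synthetic data and a deliberately small grid so it finishes quickly; the `demo_` names and the grid values are assumptions for illustration, not the `params` defined for this task.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the tweet vectors).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# A small illustrative grid: 2 kernels x 3 values of C = 6 candidates.
demo_params = dict(kernel=['linear', 'rbf'], C=[0.1, 1.0, 10.0])

# The estimator is passed without hyperparameters set; the grid supplies them.
demo_gcv = GridSearchCV(SVC(), demo_params, n_jobs=-1, verbose=1)
demo_gcv.fit(X_demo, y_demo)

# cv_results_ holds one entry per candidate combination.
print(len(demo_gcv.cv_results_['params']))
```

With the default 5-fold cross-validation, each candidate is fitted five times, which is why the number of fits grows quickly as the grid gets larger.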

In [18]:
from sklearn.model_selection import GridSearchCV

params = dict(kernel=['linear', 'rbf', 'poly', 'sigmoid'],
              C=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.1, 1.3, 1.5, 1.7, 2.0],
             )

classifier = SVC()

# Your code here...

###
###



Fitting 5 folds for each of 44 candidates, totalling 220 fits


GridSearchCV(estimator=SVC(), n_jobs=-1,
             param_grid={'C': [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.1, 1.3, 1.5, 1.7,
                               2.0],
                         'kernel': ['linear', 'rbf', 'poly', 'sigmoid']},
             verbose=1)

## What was the best model?

`GridSearchCV` evaluated each possible model using the accuracy metric.

The best model is stored inside `gcv` as `best_estimator_`. Its score is in `gcv.best_score_` and the actual hyperparameters used are in `gcv.best_params_`.

(The score here is not the score on the full training set. It is the mean cross-validated score: the accuracy on the held-out fold, averaged across the 5 folds of the training set.)

Take a look at these and then evaluate the best model using the test set.

How does it compare to the four models you trained before?

In [None]:
# Your code here...

###
###



## What is easier to classify? Democrat or Republican?

Accuracy only gives one impression. We have two classes here, so print a classification report for each of the baseline models to see how precision and recall differ between the classes.

`sklearn.metrics.classification_report` takes two arguments: the true labels and a model's predictions.

You can get predictions for `X_test` by using the `.predict()` method of a trained model.
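As a minimal sketch of the call itself, the example below uses two short hand-written label lists (invented for illustration; in the exercise they would be `y_test` and the output of `.predict(X_test)`).

```python
from sklearn.metrics import classification_report

# Hand-written labels (assumption: made up, not model output).
true_labels = ['Democrat', 'Democrat', 'Democrat',
               'Republican', 'Republican', 'Republican']
predicted = ['Democrat', 'Republican', 'Democrat',
             'Republican', 'Republican', 'Democrat']

# True labels come first, predictions second.
report = classification_report(true_labels, predicted)
print(report)
```

The report lists precision, recall and F1-score per class, which shows whether one party's tweets are systematically easier to identify than the other's.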

How does the best model do at predicting the two classes?

In [None]:
# Your code here...

###
###

