
# Support Vector Machines

One thing SVMs are very good at is text classification.

The goal here is to determine whether a tweet was written by a Democratic or Republican politician, using just the text of the tweet.

The `sklearn` library is used in this exercise.

The data has three fields:

| Feature 	| Description               	|
|---------	|---------------------------	|
|   Party 	| Democrat or Republican    	|
|  Handle 	| The author's Twitter name 	|
|   Tweet 	| The text of the tweet     	|

In [2]:
import pandas as pd

data = pd.read_csv('data/tweets.csv')

# Training an SVM on a lot of data with a lot of features can take a few minutes,
# so to keep things speedy here we will use a subset of the data.
data = data.sample(5000, random_state=5)

data.head()

| | Party | Handle | Tweet |
|---|---|---|---|
| 4098 | Democrat | RepStephMurphy | RT @SSNAlerts: .@RepStephMurphy's, Mike Kelly'... |
| 39638 | Democrat | cbrangel | America is greater thx to contributions of Lat... |
| 44277 | Republican | RepPaulMitchell | Speaking to @therealnmma this morning about my... |
| 70181 | Republican | RepDavid | RT @DodieLondenEIPS: Congratulations to Britta... |
| 32440 | Democrat | BennieGThompson | Republicans control the House, Senate and Whit... |


## Looking at the data

How many politicians do we have tweets from, per party?

(Not how many tweets per party!)

In [3]:
data.drop_duplicates(subset='Handle')['Party'].value_counts()

Party
Republican    222
Democrat      211
Name: count, dtype: int64

And how many tweets per politician?

In [4]:
data.groupby('Handle')['Tweet'].agg('count')

Handle
AGBecerra          16
AlanGrayson        18
AnthonyBrownMD4     5
AustinScottGA08     9
BennieGThompson     6
                   ..
reppittenger       15
repsandylevin      11
rosadelauro        12
sethmoulton         9
virginiafoxx        7
Name: Tweet, Length: 433, dtype: int64

## Working with text data

The features for an SVM can't be words or whole tweets. We need a numerical representation for the words in the texts. One method is to transform the text into TF-IDF vectors.

The vectoriser tokenises each tweet into words (using a tokeniser designed for tweets), removes stop words (very common words like "the" and "and", which contribute little to a tweet's meaning), and builds a sparse matrix representation of all the tweets. Each row is a single tweet and each column is a word in the vocabulary of all the tweets.

Only the 5000 most frequent words are kept: training a model on all ~200k words would take a long time.

(This will take a few seconds!)

In [5]:
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

def tok(text):
    tt = TweetTokenizer()
    return tt.tokenize(text)

transformer = TfidfVectorizer(tokenizer=tok, stop_words='english', max_features=5000)
tweet_vecs = transformer.fit_transform(data['Tweet'])

tweet_vecs



<5000x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 55941 stored elements in Compressed Sparse Row format>

In [6]:
# Some words in the vocabulary and their IDs

list(transformer.vocabulary_.items())[0:10]

[('rt', 4044),
 (':', 439),
 ('.', 291),
 ('@repstephmurphy', 778),
 ("'", 280),
 ('s', 4063),
 (',', 287),
 ('mike', 3198),
 ('cracking', 1728),
 ('house', 2629)]
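
To make the row/column layout concrete, here is a minimal sketch on three made-up sentences (not taken from the dataset). It assumes the default word tokeniser of `TfidfVectorizer` rather than the `TweetTokenizer` used above, and the names `toy_tweets`, `toy_transformer` and `toy_vecs` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "tweets", purely for illustration (not from the dataset).
toy_tweets = [
    "the house passed the budget",
    "the senate blocked the budget",
    "great town hall today",
]

# Assumption: default word tokeniser here, not the TweetTokenizer used above.
toy_transformer = TfidfVectorizer(stop_words='english')
toy_vecs = toy_transformer.fit_transform(toy_tweets)

# One row per tweet, one column per word kept in the vocabulary.
print(toy_vecs.shape)
print(sorted(toy_transformer.vocabulary_))

# Dense view: a zero means the word does not occur in that tweet.
print(toy_vecs.toarray().round(2))
```

Stop words such as "the" never get a column, and words that appear in several tweets receive a lower weight than words unique to one tweet.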

## Setting up the data

As is standard, we will use some of our data for training the model and some of it for evaluating it. This gives a better idea of how well the model can generalise to unseen data, rather than simply overfitting to the data it has seen.

First, we set up a variable `y` to store the Party labels we want to predict.

Then, we use the `train_test_split` function to split up the `tweet_vecs` and the `y` data into train/test portions, using an 80:20 train:test ratio.

In [7]:
from sklearn.model_selection import train_test_split

y = data['Party']

X_train, X_test, y_train, y_test = train_test_split(tweet_vecs, y, test_size=0.2)
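
Because the split above is shuffled differently on every run, the scores further down will vary slightly between runs. The sketch below shows one possible way to make a split repeatable and keep the party ratio the same in both portions. It uses toy data, and both `random_state=0` and `stratify` are assumptions added for illustration, not settings used in this exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for tweet_vecs and y: 10 rows, 2 features, balanced labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(['Democrat', 'Republican'] * 5)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape)               # 8 training rows, 2 test rows
print(np.unique(y_te, return_counts=True))  # one label of each party
```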

## Task 1: Train SVMs with different kernels

SVMs in `sklearn` have a few configurable options. The key ones are the kernel (which lets the model handle classes that are not linearly separable) and the regularization value $C$ (which relaxes or tightens the margin).
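
As a rough illustration of what $C$ does, the sketch below fits a linear SVM on small synthetic data (not the tweets) for three values of $C$ and counts the support vectors. The dataset, the values of $C$ and the names `X_demo`, `y_demo` and `sv_counts` are all assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data, for illustration only (not the tweet data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Smaller C tolerates more margin violations; larger C penalises them harder.
sv_counts = {}
for C in [0.01, 1.0, 100.0]:
    demo_clf = SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    sv_counts[C] = int(demo_clf.n_support_.sum())
    print(f'C={C}: {sv_counts[C]} support vectors, '
          f'train accuracy {demo_clf.score(X_demo, y_demo):.3f}')
```

A softer margin (small $C$) usually leaves more points on or inside the margin, so the number of support vectors tends to be higher.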

Use a `for` loop to try different values for the kernel: `['linear', 'rbf', 'poly', 'sigmoid']`

On each iteration, create a classifier with that `kernel`, and call `.fit()` with the training data.

Then, use the `.score()` method to see how well the model did on both the seen and the unseen data. What do you observe?

$\color{red}{\textbf{TO DO :}}$

In [9]:
from sklearn.svm import SVC

# Your code here...
kernels_list = ['linear', 'rbf', 'poly', 'sigmoid']
scores_list = []
for kernel in kernels_list:
    clf = SVC(kernel=kernel)
    clf.fit(X_train, y_train)

    score_seen = clf.score(X_train, y_train)
    score_unseen = clf.score(X_test, y_test)
    scores_list.append((score_seen, score_unseen))


results = pd.DataFrame({
    'kernel': kernels_list,
    'train_score': [x[0] for x in scores_list],
    'test_score': [x[1] for x in scores_list]
})
results.sort_values('test_score')

| | kernel | train_score | test_score |
|---|---|---|---|
| 2 | poly | 0.99675 | 0.642 |
| 3 | sigmoid | 0.84875 | 0.664 |
| 0 | linear | 0.9125 | 0.67 |
| 1 | rbf | 0.99175 | 0.674 |


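The best score on unseen data comes from the rbf kernel (0.674), closely followed by linear (0.670) and sigmoid (0.664); the poly kernel performs worst (0.642). All the kernels score much higher on the training data, in the order poly, rbf, linear, sigmoid. This points to overfitting: poly and rbf fit the training set almost perfectly (above 0.99) yet generalise no better than the others.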
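
The difference between the train and test scores gives a quick measure of how much each kernel overfits. The sketch below recomputes it from the scores reported in the table above; the `reported` frame and the `gap` column are names introduced here for illustration.

```python
import pandas as pd

# Scores copied from the results table above.
reported = pd.DataFrame({
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'train_score': [0.9125, 0.99175, 0.99675, 0.84875],
    'test_score': [0.67, 0.674, 0.642, 0.664],
})

# A large train/test gap suggests overfitting to the training tweets.
reported['gap'] = reported['train_score'] - reported['test_score']
print(reported.sort_values('gap', ascending=False))
```

The poly kernel has the largest gap and sigmoid the smallest, even though their test scores differ by only about two percentage points.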

## Task 2: Find the best model parameters

In addition to testing different kernels, try different values for $C$, the regularization hyperparameter.

Rather than doing this in a loop, one model at a time, we can parallelise it using `GridSearchCV` in `sklearn`.

The `GridSearchCV` class takes a model and a dictionary mapping each hyperparameter to the values to try. Then you just fit/train it as usual, using the training data from before.

Try the following:

1. Different kernels
2. A few values for $C$

Below, create a `GridSearchCV` in the same way you would do with a model: assign it to a variable named `gcv`, pass it the `classifier` as your basic model without parameters set, and also pass it `params`.

To speed things up, set `n_jobs=-1` to use all available CPU cores. Set `verbose=1` so you get updates as it proceeds - useful for making sure it is actually working!

$\color{red}{\textbf{TO DO :}}$

In [10]:
from sklearn.model_selection import GridSearchCV

params = dict(kernel=['linear', 'rbf', 'poly', 'sigmoid'],
              C=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.1, 1.3, 1.5, 1.7, 2.0],
             )

classifier = SVC()

# Your code here...

gcv = GridSearchCV(classifier, params, n_jobs=-1, verbose=1)
gcv.fit(X_train, y_train)



Fitting 5 folds for each of 44 candidates, totalling 220 fits
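
The numbers in that log line follow directly from the grid: every combination of kernel and $C$ is one candidate, and each candidate is fitted once per fold. A minimal sketch of the arithmetic, using `ParameterGrid` from `sklearn` (the names `demo_params`, `n_candidates` and `n_folds` are introduced here for illustration):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
demo_params = dict(kernel=['linear', 'rbf', 'poly', 'sigmoid'],
                   C=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.1, 1.3, 1.5, 1.7, 2.0])

n_candidates = len(ParameterGrid(demo_params))  # 4 kernels x 11 values of C
n_folds = 5                                     # as reported in the log above
print(n_candidates, n_candidates * n_folds)     # 44 candidates, 220 fits
```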


### What was the best model?

`GridSearchCV` evaluated each possible model using the accuracy metric.

The best model is stored inside `gcv` as `best_estimator_`. Its score is in `gcv.best_score_` and the actual hyperparameters used are in `gcv.best_params_`.

(The score here is not the score on the training set, but the mean cross-validated score: the average accuracy over the held-out folds of the training set.)

Take a look at these and then evaluate the best model using the test set.

How does it compare to the four models you trained before?

$\color{red}{\textbf{TO DO :}}$ Evaluate the model

In [24]:
# Your code here...

print('Best Params Search')
best_results = pd.DataFrame(gcv.cv_results_).sort_values('rank_test_score')
best_results.head(5)

Best Params Search


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 0.921203 | 0.015345 | 0.228709 | 0.008103 | 1.1 | rbf | {'C': 1.1, 'kernel': 'rbf'} | 0.67 | 0.7075 | 0.70125 | 0.67625 | 0.68 | 0.687 | 0.014676 | 1 |
| 21 | 0.973823 | 0.046533 | 0.225165 | 0.01939 | 1.0 | rbf | {'C': 1.0, 'kernel': 'rbf'} | 0.675 | 0.70375 | 0.7 | 0.6675 | 0.685 | 0.68625 | 0.013964 | 2 |
| 29 | 0.908539 | 0.020407 | 0.216973 | 0.006103 | 1.3 | rbf | {'C': 1.3, 'kernel': 'rbf'} | 0.66625 | 0.70875 | 0.69375 | 0.67375 | 0.68875 | 0.68625 | 0.015 | 2 |
| 41 | 0.859384 | 0.033586 | 0.22607 | 0.023479 | 2.0 | rbf | {'C': 2.0, 'kernel': 'rbf'} | 0.66 | 0.70125 | 0.6975 | 0.66625 | 0.69875 | 0.68475 | 0.017808 | 4 |
| 37 | 0.866323 | 0.022759 | 0.206112 | 0.008901 | 1.7 | rbf | {'C': 1.7, 'kernel': 'rbf'} | 0.66 | 0.70125 | 0.69375 | 0.66875 | 0.7 | 0.68475 | 0.017055 | 4 |
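
The `mean_test_score` column is simply the average of the five per-fold scores, and for the top-ranked row it is also the value stored in `gcv.best_score_`. A small check using the fold scores reported above for `C=1.1` with the rbf kernel (the `fold_scores` array is introduced here for illustration):

```python
import numpy as np

# Per-fold scores of the top-ranked candidate (C=1.1, rbf), copied from the table above.
fold_scores = np.array([0.67, 0.7075, 0.70125, 0.67625, 0.68])

# Mean over the held-out folds: this is what mean_test_score reports.
print(round(float(fold_scores.mean()), 5))

# Population standard deviation over the folds: this is what std_test_score reports.
print(round(float(fold_scores.std()), 6))
```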


In [25]:
print(f'Best params in params search: {gcv.best_params_}')

# The best estimator has already been refitted on the whole training set.
best_model = gcv.best_estimator_
print(f'Best Model Score on test dataset: {best_model.score(X_test, y_test)}')

Best params in params search: {'C': 1.1, 'kernel': 'rbf'}
Best Model Score on test dataset: 0.675


The tuned model's accuracy on the test set is 0.675, essentially the same as the 0.674 obtained in Task 1 with the default rbf kernel, so tuning $C$ brought almost no improvement.

### What is easier to classify? Democrat or Republican?

Accuracy only gives one impression. We have two classes here, so print a classification report to see how the model performs on each class.

`sklearn.metrics.classification_report` takes two arguments: the true labels and a model's predictions.

You can get predictions for `X_test` by using the `.predict()` method of a trained model.

How does the best model do at predicting the two classes?

$\color{red}{\textbf{TO DO :}}$

In [31]:
# Your code here...
import sklearn.metrics

print(sklearn.metrics.classification_report(
    y_test,
    best_model.predict(X_test),
))


              precision    recall  f1-score   support

    Democrat       0.68      0.65      0.66       491
  Republican       0.67      0.70      0.69       509

    accuracy                           0.68      1000
   macro avg       0.68      0.67      0.67      1000
weighted avg       0.68      0.68      0.67      1000



The best model performs similarly on both classes, with an f1-score of 0.66 for Democrat and 0.69 for Republican.

The model has slightly better precision for Democrats (0.68 vs 0.67) and better recall for Republicans (0.70 vs 0.65).
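
The f1-score in the report is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. The sketch below checks this on a handful of made-up labels (not model output), using `precision_recall_fscore_support` from `sklearn.metrics`; `y_true_toy` and `y_pred_toy` are hypothetical names.

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels for illustration only (not predictions from the model above).
y_true_toy = ['Democrat', 'Democrat', 'Democrat',
              'Republican', 'Republican', 'Republican']
y_pred_toy = ['Democrat', 'Republican', 'Republican',
              'Republican', 'Republican', 'Democrat']

labels = ['Democrat', 'Republican']
precision, recall, f1, support = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, labels=labels
)

for name, p, r, f in zip(labels, precision, recall, f1):
    # F1 is the harmonic mean of precision and recall.
    harmonic = 2 * p * r / (p + r)
    print(f'{name}: precision={p:.2f} recall={r:.2f} '
          f'f1={f:.2f} (harmonic mean {harmonic:.2f})')
```

Because the report rounds each value to two decimals, recomputing F1 from the printed precision and recall can differ from the printed f1-score in the last digit.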