Pre-process a dataset and learn an SVM

The dataset D2 is not pre-processed. It consists of a label (ham or spam) and the content of an SMS text.
Your task in this part is to pre-process this data into a processable format. Using OneHotEncoding might
not help, so you have to use other means of converting text data into features. You can look at the
scikit-learn text feature extraction utilities, e.g. TF-IDF or count vectorization. You might also want to
get rid of stop words, e.g. "this", "the", "is", "a", which appear in almost all the documents. After
pre-processing, use the SVM implementation provided by scikit-learn. Here you will experiment with
different hyperparameters and two kernels (linear and RBF). As usual, you will perform 5-fold
cross-validation and present the scores using plots and tables. You might also want to look at the
sklearn.pipeline.Pipeline utility to streamline your workflow.
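
The workflow described above (text features, stop-word removal, an SVM with two kernels, 5-fold cross-validation, all wrapped in a Pipeline) can be sketched roughly as follows. This is a minimal illustration and not the notebook's own solution: it assumes a recent scikit-learn (sklearn.model_selection), uses TfidfVectorizer with the built-in English stop-word list, and runs on a tiny hypothetical corpus rather than the real D2 file. The names build_pipeline, texts, labels and results are made up for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the SMS data (NOT the real D2 file).
texts = [
    "win a free prize now", "free cash call now", "claim your free ticket",
    "urgent prize waiting call", "win cash prize today",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "lunch at noon sounds good",
    "i will be home late",
]
labels = ["spam"] * 5 + ["ham"] * 5


def build_pipeline(kernel):
    """TF-IDF features (English stop words removed) followed by an SVM."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("svm", SVC(kernel=kernel, C=1.0)),
    ])


# Stratified 5-fold CV keeps the ham/spam ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for kernel in ("linear", "rbf"):
    # The vectorizer is refit on each training fold, so no test text leaks in.
    scores = cross_val_score(build_pipeline(kernel), texts, labels, cv=cv)
    results[kernel] = scores
    print(kernel, scores.round(3), "mean=%.3f" % scores.mean())
```

Putting the vectorizer inside the Pipeline means the vocabulary is learned from the training folds only, which is the main reason to prefer it over vectorizing the whole dataset once up front.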

In [1]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import csv
from sklearn.pipeline import Pipeline
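# scikit-learn < 0.20 only; later versions provide KFold in sklearn.model_selection with a different signature
from sklearn.cross_validation import KFold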
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn import metrics
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn.svm import SVC
import warnings


tokenizer = RegexpTokenizer(r'\w+')
stop = set(stopwords.words('english'))  # set gives fast membership tests

data = {"text":[], "class":[]}

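# Raw string so the backslashes in the Windows path are never read as escapes.
with open(r"C:\Users\saikiran\Desktop\SMSSpamCollection.txt", "r") as f:
    reader = csv.reader(f, delimiter='\t')
    for target, value in reader:
        # Tokenize the message and drop English stop words
        # (case-sensitive match, so capitalized forms such as "The" are kept).
        tokens = [t for t in tokenizer.tokenize(value) if t not in stop]

        # Python 2: decode the byte string to unicode, skipping invalid bytes.
        value = " ".join(tokens).decode('utf-8', 'ignore')
        data["text"].append(value)
        data["class"].append(target)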

length = len(data["text"])
sample = random.sample(range(0, length), length)
data["text"] = [data["text"][i] for i in sample]
data["class"] = [data["class"][i] for i in sample]

pipeline1 = Pipeline([
    ('vectorizer',  CountVectorizer(ngram_range=(1,20))),
    ('classifier',  SVC(kernel='rbf')) ])
pipeline2 = Pipeline([
    ('vectorizer',  CountVectorizer(ngram_range=(1,20))),
    ('classifier',  SVC(kernel='linear')) ])
k_fold = KFold(n=len(data["text"]), n_folds=5)
new_data_text = np.asarray(data['text'])
new_data_class = np.asarray(data['class'])
scores_rbf = []
scores_linear = []

for train_indices, test_indices in k_fold:
    train_text = new_data_text[train_indices]
    train_y = new_data_class[train_indices]
    
    test_text = new_data_text[test_indices]
    test_y = new_data_class[test_indices]
    pipeline1.fit(train_text, train_y)
    pipeline2.fit(train_text, train_y)
    predicted_rbf = pipeline1.predict(test_text)
    predicted_linear = pipeline2.predict(test_text)
    score_rbf = pipeline1.score(test_text, test_y)
    score_linear = pipeline2.score(test_text, test_y)
    scores_rbf.append(score_rbf)
    scores_linear.append(score_linear)
# Silence metric warnings, e.g. undefined precision when a class is never predicted.
warnings.filterwarnings("ignore")

# Average the per-fold accuracies before reporting them.
mean_rbf = sum(scores_rbf) / len(scores_rbf)
mean_linear = sum(scores_linear) / len(scores_linear)

print("scores on rbf kernel are: " + str(scores_rbf))
print("scores on linear svm kernel are: " + str(scores_linear))
# The classification reports below cover the last fold only.
print(metrics.classification_report(test_y, predicted_rbf, target_names=['ham', 'spam']))
print("Mean Accuracy: of rbf kernel " + str(mean_rbf))
print(metrics.classification_report(test_y, predicted_linear, target_names=['ham', 'spam']))
print("Mean Accuracy: of linear kernel " + str(mean_linear))



scores on rbf kernel are: [0.87623318385650228, 0.87174887892376685, 0.85906642728904847, 0.86445242369838415, 0.85816876122082586]
scores on linear svm kernel are: [0.97040358744394617, 0.96771300448430497, 0.96678635547576297, 0.96588868940754036, 0.95960502692998206]
             precision    recall  f1-score   support

        ham       0.86      1.00      0.92       956
       spam       0.00      0.00      0.00       158

avg / total       0.74      0.86      0.79      1114

Mean Accuracy: of rbf kernel 0.865933934998
             precision    recall  f1-score   support

        ham       0.96      1.00      0.98       956
       spam       1.00      0.72      0.83       158

avg / total       0.96      0.96      0.96      1114

Mean Accuracy: of linear kernel 0.966079332748
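
The task asks for the scores to be presented with tables and plots, which the raw output above does not do. One possible way is sketched below, assuming pandas and matplotlib are available. The per-fold accuracies and the support counts (956 ham, 158 spam) are copied from the output above; the means are recomputed from those fold scores. The output file name svm_cv_scores.png and the variable names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs headless
import pandas as pd

# Per-fold accuracies copied from the cell output above.
scores_rbf = [0.87623318385650228, 0.87174887892376685, 0.85906642728904847,
              0.86445242369838415, 0.85816876122082586]
scores_linear = [0.97040358744394617, 0.96771300448430497, 0.96678635547576297,
                 0.96588868940754036, 0.95960502692998206]

# One row per fold, one column per kernel.
table = pd.DataFrame({"rbf": scores_rbf, "linear": scores_linear},
                     index=["fold %d" % i for i in range(1, 6)])
table.loc["mean"] = table.mean()            # mean over the 5 folds
table.loc["std"] = table.iloc[:5].std()     # spread over the same 5 folds
print(table.round(4))

# Accuracy of always predicting "ham" in the last fold (support: 956 ham, 158 spam).
majority_baseline = 956.0 / (956 + 158)
print("majority-class baseline: %.4f" % majority_baseline)

# Grouped bar chart of the per-fold accuracies.
ax = table.iloc[:5].plot(kind="bar", ylim=(0.8, 1.0), rot=0)
ax.set_ylabel("accuracy")
ax.set_title("5-fold CV accuracy by kernel")
ax.figure.savefig("svm_cv_scores.png")  # hypothetical output file name
```

The baseline is worth printing next to the table: in the last fold the RBF accuracy is exactly the share of ham messages, which matches the 0.00 spam recall in its classification report.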


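# The linear kernel gives a higher mean accuracy (about 0.966) than the RBF kernel on this dataset. In the last fold the RBF model, run with default hyperparameters, labels every message as ham (spam recall 0.00), so its accuracy there equals the share of ham messages.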
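
The comparison above uses default hyperparameters for both kernels, while the task also asks to experiment with different settings. A hedged sketch of such a search is given below, assuming a recent scikit-learn and using GridSearchCV over C (and gamma for the RBF kernel). The grid values, the toy corpus, and the names pipeline, param_grid and search are illustrative assumptions, not values taken from the notebook, and no claim is made about what the search would find on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the SMS data (NOT the real D2 file).
texts = [
    "win a free prize now", "free cash call now", "claim your free ticket",
    "urgent prize waiting call", "win cash prize today",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "lunch at noon sounds good",
    "i will be home late",
]
labels = ["spam"] * 5 + ["ham"] * 5

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC()),
])

# Illustrative grid: 3 linear settings + 3 x 3 RBF settings = 12 candidates.
param_grid = [
    {"svm__kernel": ["linear"], "svm__C": [0.1, 1, 10]},
    {"svm__kernel": ["rbf"], "svm__C": [0.1, 1, 10],
     "svm__gamma": [0.01, 0.1, 1]},
]

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(texts, labels)

print("best parameters:", search.best_params_)
print("best mean CV accuracy: %.3f" % search.best_score_)
```

With imbalanced classes such as ham versus spam, swapping scoring="accuracy" for a metric like "f1_macro" may give a fairer comparison, since accuracy alone rewards a model that always predicts the majority class.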
