## **Exercise 2: Support Vector Machine**


In this exercise, you will build and tune a model that detects spam SMS messages using a support vector machine (SVM).

**A)** Read the CSV file "SMSSpamCollection.csv".

**B)** Create a function that preprocesses the SMS text and apply it to the SMS column.

**C)** Extract features from the SMS text using CountVectorizer.

**D)** Split the data into training and testing datasets.

**E)** Instantiate an SVC classifier and fit it on the training data.

**F)** Predict the output for the test data and calculate the accuracy. 

**G)** Create a function to calculate the specificity. 

**H)** Optimize the parameters based on the accuracy.

**I)** Optimize the parameters based on the specificity. 

In [2]:
# Import needed libraries
import pandas as pd
import numpy as np
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

### **A)** 

In [3]:
# Read the csv file "SMSSpamCollection.csv"
sms_data = pd.read_csv('Datasets/SMSSpamCollection.csv')
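
Before preprocessing, it is usually worth checking the shape of the data and how the two classes are distributed. The sketch below assumes the file has a `Label` column and an `SMS` column, as the later cells suggest; the three rows are invented and read from an in-memory string so the example runs without the real file.

```python
import io
import pandas as pd

# Invented stand-in for the real file; only the column names follow the notebook
toy_csv = io.StringIO(
    "Label,SMS\n"
    "ham,See you at lunch\n"
    "spam,WINNER!! Claim your prize now\n"
    "ham,Running late\n"
)
toy_data = pd.read_csv(toy_csv)

print(toy_data.shape)                    # (rows, columns)
print(toy_data["Label"].value_counts())  # messages per class
```

Running the same two checks on `sms_data` shows how imbalanced the real labels are, which matters when interpreting accuracy later on.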

### **B)** 

In [48]:
# Define a function that takes the message string as input and:
# 1. Converts all characters to lower case
# 2. Removes all punctuation (string.punctuation is a string of punctuation characters)
# 3. Removes all digits (string.digits is the string "0123456789")
# 4. Returns the processed text as a single string
def text_process(message):
    # Keep only characters that are neither punctuation nor digits, lower-casing each one
    kept_chars = [char.lower() for char in message
                  if char not in string.punctuation and char not in string.digits]
    return ''.join(kept_chars)

In [49]:
sms_data['SMS_processed'] = sms_data['SMS'].apply(text_process)
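
To see what the preprocessing step does, it can help to run it on a single message. The sketch below repeats `text_process` so it runs on its own and adds a variant based on `str.translate`, which should give the same result for ordinary English text. The sample message and the name `text_process_translate` are made up for illustration.

```python
import string

def text_process(message):
    # Same logic as the notebook: drop punctuation and digits, lower-case the rest
    return ''.join(
        char.lower() for char in message
        if char not in string.punctuation and char not in string.digits
    )

# Translation table that deletes every punctuation character and digit
DELETE_TABLE = str.maketrans('', '', string.punctuation + string.digits)

def text_process_translate(message):
    # Hypothetical alternative: delete the unwanted characters, then lower-case
    return message.translate(DELETE_TABLE).lower()

sample = "WINNER!! Call 0800-123 NOW to claim your prize."
print(repr(text_process(sample)))
print(text_process_translate(sample) == text_process(sample))
```

Removing the phone number leaves two consecutive spaces behind. That is harmless here, because `CountVectorizer` splits on word boundaries rather than on single spaces.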

### **C)** 

In [50]:
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sms_data['SMS_processed'])
y = sms_data['Label']
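
`CountVectorizer` builds a vocabulary from the corpus and turns each message into a row of word counts. A minimal sketch on two invented messages is shown below; `stop_words="english"` is left out here so that the toy vocabulary is easy to predict, which is a simplification compared with the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented messages, already lower-cased and free of punctuation
toy_messages = [
    "free prize call now",
    "call me now now",
]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_messages)  # sparse matrix: messages x words

# vocabulary_ maps each word to its column index; sort the words by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)
print(toy_X.toarray())
```

Each column corresponds to one vocabulary word and each entry is the number of times that word occurs in the message, so the second row has a 2 in the `now` column.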

### **D)** 

In [51]:
# Split into training (75%) and test (25%) sets; stratify=y keeps the spam/ham ratio the same in both
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=0)
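
The effect of `stratify` is easiest to see on a small, deliberately imbalanced label list. The sketch below uses invented labels (80 `ham`, 20 `spam`) and counts the classes on each side of the split.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented, imbalanced labels with a dummy one-column feature matrix
toy_y = ["ham"] * 80 + ["spam"] * 20
toy_X = [[i] for i in range(len(toy_y))]

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, stratify=toy_y, test_size=0.25, random_state=0
)

train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print(train_counts)
print(test_counts)
```

Both parts keep the 4:1 ratio of the full label list. Without `stratify`, a random split could leave the test set with noticeably more or fewer spam messages.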


### **E)** 

In [52]:
# Instantiate an SVC classifier with default parameters and fit it on the training data.
model_svc = SVC()
model_svc.fit(X_train, y_train)

### **F)** 

In [53]:
# Calculate the accuracy (fraction of correctly classified messages) on the test set
model_svc.score(X_test, y_test)

0.9634146341463414
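
Accuracy is the fraction of predictions that match the true labels. On imbalanced data it can look high even when the minority class is never detected, which is one reason the exercise also asks for a second metric. A small sketch with invented labels, where the helper name `accuracy` is an assumption for illustration:

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Invented labels: nine ham messages and one spam message
toy_true = ["ham"] * 9 + ["spam"]

# A useless "classifier" that always answers ham
always_ham = ["ham"] * 10

baseline_acc = accuracy(toy_true, always_ham)
print(baseline_acc)  # 0.9, although no spam message is caught
```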

### **G)** 

In [54]:
# Percentage of actual spam messages that the model labels as spam.
# Note: this equals the specificity (true negative rate) only if "spam" is treated as
# the negative class; with "spam" as the positive class it is the sensitivity (recall).
def specificity(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_true = np.array(y_test)

    is_spam = (y_true == "spam")
    hit = int(np.sum(is_spam & (y_pred == "spam")))  # spam correctly predicted as spam
    spam_total = int(np.sum(is_spam))                # all actual spam messages

    # specificity = correctly predicted spam / total actual spam, in percent
    spec = round(hit / spam_total * 100, 2)
    return spec


spec = specificity(model_svc, X_test, y_test)
print(f"Specificity of the model = {spec}%")

Specificity of the model = 72.73%
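
The function above returns the share of actual spam messages that the model labels as spam. Whether that number is called specificity or sensitivity depends on which class is treated as positive. The sketch below counts the four confusion-matrix cells by hand on invented labels; the helper name `class_rates` and its `positive` parameter are assumptions for illustration.

```python
def class_rates(y_true, y_pred, positive="spam"):
    # Count the four confusion-matrix cells relative to the chosen positive class
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity

# Invented labels: 4 spam and 6 ham messages, with one mistake in each class
toy_true = ["spam"] * 4 + ["ham"] * 6
toy_pred = ["spam", "spam", "spam", "ham"] + ["ham"] * 5 + ["spam"]

print(class_rates(toy_true, toy_pred, positive="spam"))
print(class_rates(toy_true, toy_pred, positive="ham"))
```

With `spam` as the positive class, 3 of 4 spam messages are caught, so the sensitivity is 0.75 and the specificity is 5/6. Switching the positive class to `ham` swaps the two values, so the spam detection rate of 0.75 becomes the specificity.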


### **H)** 

In [55]:
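# Parts H and I: manual grid search over kernel, C and gamma, tracking the best
# accuracy (H) and the best specificity (I) in the same loop.
# Note: gamma has no effect for the linear kernel, and because every candidate is
# scored on the test set, the best scores reported here are optimistic.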
c_values = np.arange(0.05, 1.05, 0.05)
params = {"kernel": ["linear", "rbf", "poly", "sigmoid"], "C": c_values, "gamma": ["auto", "scale"]}

opt_params = {}
best_score = 0
best_spec = 0
opt_params_spec = {}
for ktype in params["kernel"]:
    for c in params["C"]:
        for gamma in params["gamma"]:
            svc = SVC(kernel=ktype, C=c, gamma=gamma)
            svc.fit(X_train, y_train)
            acc = svc.score(X_test, y_test)
            spec = specificity(svc, X_test, y_test)
            if spec > best_spec:
                best_spec = spec
                opt_params_spec["ktype"] = ktype
                opt_params_spec["C"] = c
                opt_params_spec["gamma"] = gamma
            if acc > best_score:
                best_score = acc
                opt_params["ktype"] = ktype
                opt_params["C"] = c
                opt_params["gamma"] = gamma
            
print("Highest accuracy score: ", best_score)
print("Parameters used: ", opt_params)
print("Highest specificity score: ", best_spec)
print("Parameters used: ", opt_params_spec)

Highest accuracy score:  0.9784791965566715
Parameters used:  {'ktype': 'linear', 'C': 0.3, 'gamma': 'auto'}
Highest specificity score:  84.49
Parameters used:  {'ktype': 'linear', 'C': 0.3, 'gamma': 'auto'}
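
The manual loop above scores every parameter combination on the test set, so the reported best scores are likely to be optimistic. One common alternative is to select parameters by cross-validation on the training data only and keep the test set for a final check. The sketch below uses scikit-learn's `GridSearchCV` with two metrics on synthetic data. The data set, the reduced parameter grid, the label `1` standing in for spam, and the scorer name `spam_recall` are all assumptions for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the SMS features (label 1 plays the role of spam)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.25, random_state=0
)

# Smaller grid than in the notebook, to keep the sketch fast
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 0.5, 1.0],
    "gamma": ["scale", "auto"],
}

# Two metrics: plain accuracy and the detection rate of the class labelled 1
scoring = {
    "accuracy": "accuracy",
    "spam_recall": make_scorer(recall_score, pos_label=1),
}

# 5-fold cross-validation on the training data; refit the best-accuracy model
search = GridSearchCV(SVC(), param_grid, scoring=scoring, refit="accuracy", cv=5)
search.fit(X_tr, y_tr)

# Best parameters by mean cross-validated accuracy
print("Best by accuracy:", search.best_params_)

# Best parameters by mean cross-validated detection rate
results = search.cv_results_
best_recall_idx = results["mean_test_spam_recall"].argmax()
best_recall_params = results["params"][best_recall_idx]
print("Best by spam recall:", best_recall_params)

# The test set is used once, after model selection
test_acc = search.best_estimator_.score(X_te, y_te)
print("Test accuracy:", test_acc)
```

Because the test set plays no part in choosing the parameters, the final test accuracy is a fairer estimate of how the selected model behaves on unseen messages.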
