# SVM Spam Classification (SMS)

Binary text classification on SMS messages using Support Vector Machines (SVM).
We compare **binary / TF / TF-IDF** features, tune **C**, and benchmark the linear SVM against **Naive Bayes** and **Decision Tree**.

## Overview
- Dataset: SMS spam (labels: ham/spam).  
- Features: Count (binary & term frequency) and TF-IDF.  
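- Models: linear-kernel SVM (`SVC(kernel='linear')`), RBF SVM (`SVC(kernel='rbf')`), Naive Bayes, Decision Tree.  
- Metrics: false alarm rate, miss rate, overall error rate (all derived from the confusion matrix); training and testing time.  
- Repro: fixed `random_state` for the 80/20 train/test split (the split is not stratified).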

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kristop/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
df = pd.read_csv('spam.csv', encoding='latin-1')

In [4]:
df = df[['v1', 'v2']]
df.columns = ['label', 'message']
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

In [5]:
def preprocess_text(text):
    # 1. Convert to lowercase
    text = text.lower()
    # 2. Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # 3 & 4. Tokenize, remove stopwords, apply stemming
    tokens = text.split()
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)


In [6]:
df['clean_message'] = df['message'].apply(preprocess_text)
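
To make the first two preprocessing steps concrete, here is a minimal stdlib-only sketch (no stopword removal or stemming) applied to a made-up message. The sample text and the helper name `strip_case_and_punct` are illustrative assumptions, not part of the dataset or the notebook.

```python
import string

def strip_case_and_punct(text):
    """Lowercase the text and delete every ASCII punctuation character."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

# Hypothetical message, not taken from spam.csv
sample = "I'm FREE now!! Call 0800-123-456, don't wait."
cleaned = strip_case_and_punct(sample)
print(cleaned)  # im free now call 0800123456 dont wait
```

Because punctuation is deleted rather than replaced by a space, contractions collapse into tokens such as `im` and `dont`, and hyphenated numbers are fused. Such tokens may no longer match entries in a stopword list, which is consistent with `im` surviving in the `X_train.head()` output.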

In [7]:
X_train, X_test, y_train, y_test = train_test_split(df['clean_message'],
                                                    df['label'],
                                                    test_size=0.2,
                                                    random_state=42)

In [8]:
print(X_train.head())

1978              im boat still mom check yo im half nake
3989    bank granit issu strongbuy explos pick member ...
3935                     r give second chanc rahul dengra
4078                           play smash bro ltgt religi
4086    privat 2003 account statement 07973788240 show...
Name: clean_message, dtype: object
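
The split above does not pass `stratify`, so the ham/spam ratio in the test set is left to chance. A minimal sketch of a stratified split on toy labels follows. The toy lists are made up, and passing `stratify=df['label']` on the real data would change the numbers reported in the rest of this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 ham and 20 spam messages
messages = [f"msg {i}" for i in range(100)]
labels = ['ham'] * 80 + ['spam'] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,   # keep the 80/20 class ratio in both splits
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```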


In [10]:
binary_vectorizer = CountVectorizer(binary=True)
X_train_binary = binary_vectorizer.fit_transform(X_train)
X_test_binary = binary_vectorizer.transform(X_test)

In [11]:
tf_vectorizer = CountVectorizer()  # raw term counts (term frequency)
X_train_tf = tf_vectorizer.fit_transform(X_train)
X_test_tf = tf_vectorizer.transform(X_test)

In [12]:
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

In [13]:
print("Binary Vector Shape:", X_train_binary.shape)
print("TF Vector Shape:", X_train_tf.shape)
print("TF-IDF Vector Shape:", X_train_tfidf.shape)


Binary Vector Shape: (4457, 7113)
TF Vector Shape: (4457, 7113)
TF-IDF Vector Shape: (4457, 7113)
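
All three matrices have the same shape because the vectorizers build the same vocabulary from the same text; only the cell values differ. The sketch below shows this on a two-document toy corpus (the corpus is made up for illustration), looking at the token `free`, which occurs twice in the first document.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical); "free" appears twice in the first document
docs = ["free free prize", "call me later"]

binary_vec = CountVectorizer(binary=True)
tf_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

X_bin = binary_vec.fit_transform(docs)
X_tf = tf_vec.fit_transform(docs)
X_tfidf = tfidf_vec.fit_transform(docs)

col = tf_vec.vocabulary_["free"]     # column index of the token "free"
bin_free = X_bin[0, col]             # presence/absence only
tf_free = X_tf[0, col]               # raw count
tfidf_free = X_tfidf[0, col]         # count weighted by IDF, then L2-normalized per row
row_norm = np.sqrt(X_tfidf[0].multiply(X_tfidf[0]).sum())

print(bin_free, tf_free, round(tfidf_free, 4), round(row_norm, 4))
```

With scikit-learn's defaults, each TF-IDF row is scaled to unit L2 norm, so the values are no longer integer counts.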


In [14]:
from sklearn.svm import SVC
import time

In [15]:
results = {}

In [16]:
feature_sets = {
    'Binary': (X_train_binary, X_test_binary),
    'TF': (X_train_tf, X_test_tf),
    'TF-IDF': (X_train_tfidf, X_test_tfidf)
}

In [20]:
for feature_name, (X_tr, X_te) in feature_sets.items():
    print(f"\n--- {feature_name} Representation ---")
    results[feature_name] = {}
    
    # Train the linear-kernel SVM and time the fit
    linear_svm = SVC(kernel='linear')
    start_time = time.time()
    linear_svm.fit(X_tr, y_train)
    end_time = time.time()
    linear_time = end_time - start_time
    results[feature_name]['Linear_SVM_Time'] = linear_time
    print(f"Linear SVM Training Time: {linear_time:.4f} seconds")
    
    # Train the RBF-kernel SVM and time the fit
    rbf_svm = SVC(kernel='rbf')
    start_time = time.time()
    rbf_svm.fit(X_tr, y_train)
    end_time = time.time()
    rbf_time = end_time - start_time
    results[feature_name]['RBF_SVM_Time'] = rbf_time
    print(f"RBF SVM Training Time: {rbf_time:.4f} seconds")


--- Binary Representation ---
Linear SVM Training Time: 0.1826 seconds
RBF SVM Training Time: 0.4074 seconds

--- TF Representation ---
Linear SVM Training Time: 0.1746 seconds
RBF SVM Training Time: 0.4022 seconds

--- TF-IDF Representation ---
Linear SVM Training Time: 0.2573 seconds
RBF SVM Training Time: 0.5146 seconds
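
`SVC(kernel='linear')` is backed by libsvm. scikit-learn also offers `LinearSVC`, backed by liblinear, which is generally reported to scale better on large sparse text matrices. The sketch below times both on a made-up toy corpus using `time.perf_counter`, which is better suited to short intervals than `time.time`. The corpus and the helper `timed_fit` are hypothetical, and `LinearSVC` defaults to a squared hinge loss, so its predictions need not match `SVC` exactly.

```python
import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC, LinearSVC

def timed_fit(model, X, y):
    """Fit the model and return (fitted_model, elapsed_seconds)."""
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start

# Toy corpus (hypothetical), repeated to give the solvers something to do
texts = ["win free prize now", "call me later", "free cash offer",
         "see you at lunch", "claim your free reward", "are we still meeting"] * 20
labels = ["spam", "ham", "spam", "ham", "spam", "ham"] * 20

X = CountVectorizer(binary=True).fit_transform(texts)

svc, svc_time = timed_fit(SVC(kernel='linear'), X, labels)
lin, lin_time = timed_fit(LinearSVC(), X, labels)
print(f"SVC(kernel='linear'): {svc_time:.4f}s, LinearSVC: {lin_time:.4f}s")
```

Timings on a corpus this small are dominated by overhead, so the comparison only becomes meaningful on the full training matrix.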


In [22]:
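def calculate_metrics(y_true, y_pred):
    """Return (false alarm rate, miss rate, overall error rate), with 'spam' as the positive class."""
    # With labels=['ham', 'spam'], ravel() yields tn, fp, fn, tp in that order
    cm = confusion_matrix(y_true, y_pred, labels=['ham', 'spam'])
    tn, fp, fn, tp = cm.ravel()
    
    false_alarm_rate = fp / (fp + tn)   # ham flagged as spam
    miss_rate = fn / (fn + tp)          # spam classified as ham
    overall_error = (fp + fn) / (fp + fn + tp + tn)
    
    return round(false_alarm_rate, 4), round(miss_rate, 4), round(overall_error, 4)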
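
The unpacking order `tn, fp, fn, tp` depends on the `labels` argument: rows of the confusion matrix are true classes and columns are predicted classes, both in the order given. A small sketch on made-up labels (the two lists are hypothetical) shows how the counts line up when `'spam'` is listed second and therefore treated as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical)
y_true = ['ham', 'ham', 'spam', 'spam', 'ham']
y_pred = ['ham', 'spam', 'spam', 'ham', 'ham']

# Rows are true classes, columns are predicted classes, both in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=['ham', 'spam'])
tn, fp, fn, tp = cm.ravel()

false_alarm_rate = fp / (fp + tn)   # ham wrongly flagged as spam
miss_rate = fn / (fn + tp)          # spam that slipped through as ham
print(cm.tolist(), tn, fp, fn, tp)
```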

In [23]:
metrics_results = {}

for feature_name, (X_tr, X_te) in feature_sets.items():
    print(f"\nEvaluating Linear SVM on {feature_name} Features")
    
    # Refit the linear SVM on this representation, then predict on the test set
    linear_svm = SVC(kernel='linear')
    linear_svm.fit(X_tr, y_train)
    y_pred = linear_svm.predict(X_te)
    
    # Calculate metrics
    fa_rate, miss_rate, error_rate = calculate_metrics(y_test, y_pred)
    
    metrics_results[feature_name] = {
        'False Alarm Rate': fa_rate,
        'Miss Rate': miss_rate,
        'Overall Error Rate': error_rate
    }
    
    print(f"False Alarm Rate: {fa_rate}, Miss Rate: {miss_rate}, Overall Error Rate: {error_rate}")


Evaluating Linear SVM on Binary Features
False Alarm Rate: 0.0021, Miss Rate: 0.16, Overall Error Rate: 0.0233

Evaluating Linear SVM on TF Features
False Alarm Rate: 0.001, Miss Rate: 0.14, Overall Error Rate: 0.0197

Evaluating Linear SVM on TF-IDF Features
False Alarm Rate: 0.0021, Miss Rate: 0.14, Overall Error Rate: 0.0206


In [24]:
metrics_df = pd.DataFrame(metrics_results).T
metrics_df.index.name = 'Feature Representation'
print(metrics_df)

                        False Alarm Rate  Miss Rate  Overall Error Rate
Feature Representation                                                 
Binary                            0.0021       0.16              0.0233
TF                                0.0010       0.14              0.0197
TF-IDF                            0.0021       0.14              0.0206
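
False alarm rate and miss rate map directly onto the more familiar accuracy, precision, recall, and F1: recall is one minus the miss rate, and accuracy is one minus the overall error rate. The sketch below derives all of them from confusion-matrix counts. The helper `summarize` is hypothetical, and the counts are an assumption: they were back-calculated from the rounded Binary row and a test set of 1115 messages, and are not printed anywhere in the notebook.

```python
def summarize(tn, fp, fn, tp):
    """Derive common metrics from confusion-matrix counts (spam = positive)."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # equals 1 - miss rate
    return {
        'accuracy': (tp + tn) / total,   # equals 1 - overall error rate
        'precision': precision,
        'recall': recall,
        'f1': 2 * precision * recall / (precision + recall),
        'false_alarm_rate': fp / (fp + tn),
        'miss_rate': fn / (fn + tp),
    }

# Assumed counts, back-calculated from the rounded Binary row above
# (not taken from notebook output)
metrics = summarize(tn=963, fp=2, fn=24, tp=126)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```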


In [25]:
c_values = [0.01, 0.1, 0.5, 1, 5, 10, 50, 100, 200, 500, 1000]

In [26]:
c_metrics = []

In [27]:
for c in c_values:
    svm_model = SVC(kernel='linear', C=c)
    svm_model.fit(X_train_binary, y_train)
    y_pred = svm_model.predict(X_test_binary)
    
    fa_rate, miss_rate, error_rate = calculate_metrics(y_test, y_pred)
    
    c_metrics.append({
        'C': c,
        'False Alarm Rate': fa_rate,
        'Miss Rate': miss_rate,
        'Overall Error Rate': error_rate
    })

In [28]:
c_metrics_df = pd.DataFrame(c_metrics)
print(c_metrics_df)

          C  False Alarm Rate  Miss Rate  Overall Error Rate
0      0.01            0.0000     0.3733              0.0502
1      0.10            0.0000     0.1667              0.0224
2      0.50            0.0010     0.1467              0.0206
3      1.00            0.0021     0.1600              0.0233
4      5.00            0.0021     0.1733              0.0251
5     10.00            0.0021     0.1733              0.0251
6     50.00            0.0021     0.1733              0.0251
7    100.00            0.0021     0.1733              0.0251
8    200.00            0.0021     0.1733              0.0251
9    500.00            0.0021     0.1733              0.0251
10  1000.00            0.0021     0.1733              0.0251
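
The sweep above scores each `C` on the test set, so the best-looking value is chosen with knowledge of the test data. A common alternative is to select `C` by cross-validation on the training split only and touch the test set once at the end. The following is a minimal sketch with `GridSearchCV`; the toy corpus, the shortened grid, and the 3-fold setting are assumptions made to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy training data (hypothetical); in the notebook this would be X_train / y_train
texts = ["win free prize now", "free cash offer today", "claim your free reward",
         "urgent prize waiting call", "free entry win cash", "call now to claim prize",
         "see you at lunch", "are we still meeting", "call me later please",
         "running late be there soon", "thanks for the help", "dinner at home tonight"]
labels = ["spam"] * 6 + ["ham"] * 6

pipeline = Pipeline([
    ("vectorizer", CountVectorizer(binary=True)),  # refit inside every fold
    ("svm", SVC(kernel="linear")),
])

c_grid = [0.01, 0.1, 0.5, 1, 5, 10]
search = GridSearchCV(
    pipeline,
    param_grid={"svm__C": c_grid},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(texts, labels)
best_c = search.best_params_["svm__C"]
print("Best C by cross-validation:", best_c)
```

Putting the vectorizer inside the pipeline means the vocabulary is learned from each fold's training portion only, which avoids leaking validation text into the features.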


In [30]:
comparison_results = {}

In [32]:
svm = SVC(kernel='linear')
start_train = time.time()
svm.fit(X_train_binary, y_train)
end_train = time.time()

start_test = time.time()
y_pred_svm = svm.predict(X_test_binary)
end_test = time.time()

fa_svm, miss_svm, error_svm = calculate_metrics(y_test, y_pred_svm)

comparison_results['SVM (Linear)'] = {
    'False Alarm Rate': fa_svm,
    'Miss Rate': miss_svm,
    'Training Time (s)': round(end_train - start_train, 4),
    'Testing Time (s)': round(end_test - start_test, 4)
}

In [33]:
nb = MultinomialNB()
start_train = time.time()
nb.fit(X_train_binary, y_train)
end_train = time.time()

start_test = time.time()
y_pred_nb = nb.predict(X_test_binary)
end_test = time.time()

fa_nb, miss_nb, error_nb = calculate_metrics(y_test, y_pred_nb)

comparison_results['Naive Bayes'] = {
    'False Alarm Rate': fa_nb,
    'Miss Rate': miss_nb,
    'Training Time (s)': round(end_train - start_train, 4),
    'Testing Time (s)': round(end_test - start_test, 4)
}

In [34]:
dt = DecisionTreeClassifier()
start_train = time.time()
dt.fit(X_train_binary, y_train)
end_train = time.time()

start_test = time.time()
y_pred_dt = dt.predict(X_test_binary)
end_test = time.time()

fa_dt, miss_dt, error_dt = calculate_metrics(y_test, y_pred_dt)

comparison_results['Decision Tree'] = {
    'False Alarm Rate': fa_dt,
    'Miss Rate': miss_dt,
    'Training Time (s)': round(end_train - start_train, 4),
    'Testing Time (s)': round(end_test - start_test, 4)
}

In [35]:
comparison_df = pd.DataFrame(comparison_results).T
comparison_df.index.name = 'Model'
print(comparison_df)

               False Alarm Rate  Miss Rate  Training Time (s)  Testing Time (s)
Model                                                                          
SVM (Linear)             0.0021     0.1600             0.1794            0.0315
Naive Bayes              0.0041     0.1267             0.0474            0.0064
Decision Tree            0.0104     0.2000             0.0632            0.0005
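
The three benchmark cells above repeat the same fit, predict, and timing pattern. One way to factor it out is a small helper, sketched below on made-up data. The helper name `benchmark` and the toy texts are hypothetical, and `random_state=42` on the tree is an added assumption for repeatability that the original cell does not set.

```python
import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def benchmark(model, X_tr, y_tr, X_te):
    """Fit and predict once, returning predictions and wall-clock timings."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    train_time = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_te)
    test_time = time.perf_counter() - start
    return y_pred, train_time, test_time

# Toy data (hypothetical)
train_texts = ["win free prize", "call me later", "free cash now", "lunch at noon"]
train_labels = ["spam", "ham", "spam", "ham"]
test_texts = ["free prize now", "see you later"]

vec = CountVectorizer(binary=True)
X_tr = vec.fit_transform(train_texts)
X_te = vec.transform(test_texts)

models = {
    "SVM (Linear)": SVC(kernel="linear"),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
results = {}
for name, model in models.items():
    y_pred, train_time, test_time = benchmark(model, X_tr, train_labels, X_te)
    results[name] = {"Training Time (s)": round(train_time, 4),
                     "Testing Time (s)": round(test_time, 4),
                     "Predictions": list(y_pred)}
print(results)
```

Single-run timings at this scale are noisy; averaging several runs would give a steadier comparison.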
