# CEM300 SMS Spam Analysis - Comparative Study

**Module**: CE4145 - Natural Language Processing  
**Student**: MOHAMED ABDELLAH - 2118775  
**Date**: 2025  
**Word Count**: 


## 0. Generative AI Declaration

Generative AI was used to support completion of this assessment. Comments have been provided against relevant cells.


## 1. Introduction 

SMS spam has become a significant problem in mobile communication: unsolicited messages annoy users, create security risks, and cause economic losses. The SMS Spam Collection dataset is a widely used benchmark for developing and comparing Natural Language Processing (NLP) systems that address this challenge.

This comparative study aims to evaluate different machine learning approaches for SMS spam classification, focusing on two distinct methodologies: similarity-based learning and neural network-based approaches. The dataset contains 5,572 SMS messages with a significant class imbalance (86.6% ham, 13.4% spam), presenting both opportunities and challenges for classification algorithms.

The business context for this NLP system is clear: mobile service providers and users need reliable spam filtering to maintain communication quality and security. Comparing different approaches helps identify the most effective strategy for real-world deployment, considering factors such as accuracy, computational efficiency, and interpretability.

This study will implement two significantly different pipelines: a similarity-based approach using k-Nearest Neighbors (kNN) and Case-Based Reasoning (CBR), and a neural network approach using Multi-Layer Perceptrons (MLP) and Convolutional Neural Networks (CNN). The comparative evaluation will provide insights into the strengths and limitations of each approach for SMS spam detection.


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
import random
np.random.seed(42)
random.seed(42)

# Upload the file (this step only works inside Google Colab)
from google.colab import files
uploaded = files.upload()


## 2. Dataset Overview 

The SMS Spam Collection v.1 dataset consists of 5,572 SMS messages collected from multiple sources, including the Grumbletext website, NUS SMS Corpus, and academic research. The dataset poses an imbalanced binary classification task, with 4,825 legitimate messages (86.6%) and 747 spam messages (13.4%).

The dataset presents several characteristics that make it suitable for comparative NLP analysis. Messages vary significantly in length, with spam messages typically being longer and containing promotional content, while legitimate messages are often shorter and conversational. The text contains informal language, abbreviations, and various formatting styles typical of SMS communication.

The class imbalance (13.4% spam) reflects real-world scenarios where spam represents a minority of messages, making this dataset particularly valuable for evaluating algorithm performance on imbalanced data. The dataset's diversity in message sources ensures robust evaluation across different writing styles and content types.
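
To make the effect of the imbalance concrete, the short sketch below uses only the counts quoted in this section. It computes the accuracy of a hypothetical baseline that labels every message as ham. The baseline is an illustration introduced here, not a model from this study; it shows why accuracy alone can be misleading on this dataset.

```python
# Counts quoted in this section
n_ham, n_spam = 4825, 747
n_total = n_ham + n_spam

# Hypothetical baseline: predict "ham" for every message
baseline_accuracy = n_ham / n_total
baseline_spam_recall = 0 / n_spam  # it never flags a single spam message

print(f"Total messages: {n_total}")
print(f"Majority-class accuracy: {baseline_accuracy:.1%}")
print(f"Spam recall of that baseline: {baseline_spam_recall:.1%}")
```

A classifier therefore has to beat roughly 86.6% accuracy before it shows any real skill, which is why precision, recall, and F1 on the spam class are more informative here.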

This dataset is ideal for comparing similarity-based and neural network approaches because it provides sufficient data for training while maintaining computational feasibility. The binary classification task is well-defined, allowing for clear performance evaluation and meaningful comparison between different algorithmic approaches.


In [None]:
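# Load the dataset (tab-separated, no header row).
# Adjust the path if the file was uploaded to the working directory instead of data/.
sms_df = pd.read_csv('data/SMSSpamCollection', sep='\t', names=['label', 'text'])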

# Display basic information
print("Dataset Shape:", sms_df.shape)
print("\nLabel Distribution:")
print(sms_df['label'].value_counts())
print(f"\nSpam Percentage: {(sms_df['label'] == 'spam').mean() * 100:.1f}%")

# Display sample messages
print("\nSample Ham Messages:")
for i, text in enumerate(sms_df[sms_df['label'] == 'ham']['text'].head(3)):
    print(f"{i+1}. {text[:100]}...")

print("\nSample Spam Messages:")
for i, text in enumerate(sms_df[sms_df['label'] == 'spam']['text'].head(3)):
    print(f"{i+1}. {text[:100]}...")


In [None]:
# Data exploration and visualization
sms_df['text_length'] = sms_df['text'].str.len()
sms_df['word_count'] = sms_df['text'].str.split().str.len()

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Text length distribution
axes[0, 0].hist(sms_df[sms_df['label'] == 'ham']['text_length'], alpha=0.7, label='Ham', bins=50)
axes[0, 0].hist(sms_df[sms_df['label'] == 'spam']['text_length'], alpha=0.7, label='Spam', bins=50)
axes[0, 0].set_xlabel('Text Length (characters)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()
axes[0, 0].set_title('Text Length Distribution')

# Word count distribution
axes[0, 1].hist(sms_df[sms_df['label'] == 'ham']['word_count'], alpha=0.7, label='Ham', bins=50)
axes[0, 1].hist(sms_df[sms_df['label'] == 'spam']['word_count'], alpha=0.7, label='Spam', bins=50)
axes[0, 1].set_xlabel('Word Count')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()
axes[0, 1].set_title('Word Count Distribution')

# Label distribution pie chart
label_counts = sms_df['label'].value_counts()
axes[1, 0].pie(label_counts.values, labels=label_counts.index, autopct='%1.1f%%')
axes[1, 0].set_title('Label Distribution')

# Box plot for text length by label
sms_df.boxplot(column='text_length', by='label', ax=axes[1, 1])
fig.suptitle('')  # drop the figure-wide "Boxplot grouped by label" title pandas adds
axes[1, 1].set_title('Text Length by Label')
axes[1, 1].set_xlabel('Label')
axes[1, 1].set_ylabel('Text Length')

plt.tight_layout()
plt.show()

# Statistical summary
print("\nStatistical Summary:")
print(sms_df.groupby('label')['text_length'].describe())


## 3. Representation Learning (200 words)

Text representation learning is crucial for converting SMS messages into numerical vectors suitable for machine learning algorithms. This study implements a comprehensive preprocessing pipeline followed by multiple feature extraction approaches.

The preprocessing pipeline includes text normalization (lowercasing, punctuation removal), tokenization using NLTK's word_tokenize, stopword removal using English stopwords, and stemming using Porter Stemmer. This pipeline ensures consistent text formatting while preserving meaningful content.

For feature extraction, three approaches are implemented: TF-IDF vectorization with n-gram features (1-2 grams), Word2Vec embeddings for semantic similarity, and neural network embeddings. TF-IDF captures term frequency importance, Word2Vec provides semantic relationships between words, and neural embeddings learn task-specific representations.

The representation learning approach addresses the unique challenges of SMS text, including informal language, abbreviations, and varying message lengths. By implementing multiple feature extraction methods, we can evaluate which representation best captures the distinguishing characteristics between spam and legitimate messages.


In [None]:
# Download required NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data needed by newer NLTK releases
nltk.download('stopwords')

# Build these once instead of on every call to preprocess_text
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_text(text):
    """Lowercase, strip punctuation, tokenize, drop stopwords, and stem."""
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords, then stem the remaining tokens
    tokens = [STEMMER.stem(word) for word in tokens if word not in STOP_WORDS]

    return ' '.join(tokens)

# Apply preprocessing
sms_df['processed_text'] = sms_df['text'].apply(preprocess_text)

# Display preprocessing results
print("Original text:", sms_df['text'].iloc[0])
print("Processed text:", sms_df['processed_text'].iloc[0])


In [None]:
# Label encoding
le = LabelEncoder()
y = le.fit_transform(sms_df['label'])

# Train-test split on the processed text, before any vectorizer is fitted
text_train, text_test, y_train, y_test = train_test_split(
    sms_df['processed_text'], y, test_size=0.2, random_state=42, stratify=y
)

# Feature extraction using TF-IDF: fit on the training split only, so the
# vocabulary and IDF weights are not influenced by the test messages
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=2, max_df=0.95)
X_train = tfidf.fit_transform(text_train)
X_test = tfidf.transform(text_test)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Feature matrix sparsity: {1 - X_train.nnz / (X_train.shape[0] * X_train.shape[1]):.3f}")

# Display feature names
feature_names = tfidf.get_feature_names_out()
print(f"\nTotal features: {len(feature_names)}")
print(f"Sample features: {feature_names[:10]}")


## 4. Algorithms (500 words)

This comparative study implements two fundamentally different approaches to SMS spam classification: similarity-based learning and neural network-based learning. Each approach represents a distinct paradigm in machine learning, allowing for comprehensive evaluation of their effectiveness.

**Similarity-Based Learning Approach:**

The first pipeline employs k-Nearest Neighbors (kNN) and Case-Based Reasoning (CBR) algorithms. kNN operates on the principle that similar instances should have similar labels, making it particularly suitable for text classification where message similarity can indicate spam likelihood. The algorithm calculates distances between test instances and training examples using cosine similarity, which is effective for high-dimensional text data.
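
A minimal sketch of cosine-based kNN on TF-IDF vectors is shown below. The toy messages, labels, and `n_neighbors=3` are assumptions chosen for illustration and are not the tuned settings of this study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy messages and labels (hypothetical; 1 = spam, 0 = ham)
train_texts = [
    "win a free prize now",
    "free cash prize claim now",
    "urgent claim your free reward",
    "are we still meeting for lunch",
    "see you at home tonight",
    "can you call me after lunch",
]
train_labels = [1, 1, 1, 0, 0, 0]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(train_texts)

# metric="cosine" ranks neighbours by cosine distance (1 - cosine similarity);
# brute-force search is used because tree indexes do not support this metric
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine", algorithm="brute")
knn.fit(X_toy, train_labels)

queries = vec.transform(["claim your free prize", "lunch at home tonight"])
predictions = knn.predict(queries)
print(predictions)
```

The first query shares words only with the spam examples and the second only with the ham examples, so the three nearest neighbours of each query all carry the same label.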

CBR extends similarity-based learning by implementing a four-stage cycle: retrieve similar cases, reuse their solutions, revise solutions if necessary, and retain new cases. This approach mimics human problem-solving and provides interpretable decisions by referencing similar historical cases.
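
The four stages can be sketched as a small class. `ToyCaseBase` and its method names are hypothetical and introduced only for illustration, and the revise step simply accepts optional external feedback. This is a sketch of the cycle under those assumptions, not the CBR implementation used in this study.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class ToyCaseBase:
    """Hypothetical minimal case base illustrating the four CBR stages."""

    def __init__(self, texts, labels):
        self.vectorizer = TfidfVectorizer()
        self.case_vectors = self.vectorizer.fit_transform(texts)
        self.case_labels = list(labels)

    def retrieve(self, text, k=3):
        """Return indices of the k stored cases most similar to the query."""
        query = self.vectorizer.transform([text])
        sims = cosine_similarity(query, self.case_vectors)[0]
        return np.argsort(sims)[::-1][:k]

    def reuse(self, neighbour_idx):
        """Propose the majority label of the retrieved cases."""
        votes = [self.case_labels[i] for i in neighbour_idx]
        return max(set(votes), key=votes.count)

    def revise(self, proposed, confirmed=None):
        """Override the proposal when external feedback is available."""
        return proposed if confirmed is None else confirmed

    def retain(self, text, label):
        """Store the solved case; the TF-IDF vocabulary stays fixed."""
        new_vector = self.vectorizer.transform([text])
        self.case_vectors = vstack([self.case_vectors, new_vector])
        self.case_labels.append(label)


cases = ToyCaseBase(
    ["win a free prize now", "free cash prize claim now",
     "urgent claim your free reward", "are we still meeting for lunch",
     "see you at home tonight", "can you call me after lunch"],
    ["spam", "spam", "spam", "ham", "ham", "ham"],
)

message = "claim your free prize"
neighbours = cases.retrieve(message)          # 1. retrieve
proposed_label = cases.reuse(neighbours)      # 2. reuse
final_label = cases.revise(proposed_label)    # 3. revise (no feedback here)
cases.retain(message, final_label)            # 4. retain
print(final_label, len(cases.case_labels))
```

Because `retrieve` returns the indices of the matching cases, the stored messages behind a decision can be shown to the user, which is the source of the interpretability mentioned above.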

**Neural Network-Based Learning Approach:**

The second pipeline implements Multi-Layer Perceptrons (MLP) and Convolutional Neural Networks (CNN). MLPs learn non-linear decision boundaries through multiple hidden layers, automatically discovering complex patterns in text representations. The network architecture includes input, hidden, and output layers with ReLU activation functions and dropout regularization.
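
As a rough illustration, the sketch below trains a small ReLU network with scikit-learn's `MLPClassifier`, which is imported at the top of this notebook. `MLPClassifier` has no dropout option, so this sketch substitutes L2 regularization through `alpha`. Dropout as described above would need a framework such as Keras or PyTorch. The layer sizes, solver, and toy data are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Toy messages and labels (hypothetical; 1 = spam, 0 = ham)
mlp_texts = [
    "win a free prize now",
    "free cash prize claim now",
    "urgent claim your free reward",
    "are we still meeting for lunch",
    "see you at home tonight",
    "can you call me after lunch",
]
mlp_labels = [1, 1, 1, 0, 0, 0]

mlp_vec = TfidfVectorizer()
X_mlp = mlp_vec.fit_transform(mlp_texts)

mlp = MLPClassifier(
    hidden_layer_sizes=(32, 16),  # two hidden layers (assumed sizes)
    activation="relu",
    alpha=1e-3,                   # L2 penalty, used here in place of dropout
    solver="lbfgs",               # suits very small datasets
    max_iter=500,
    random_state=42,
)
mlp.fit(X_mlp, mlp_labels)

mlp_probs = mlp.predict_proba(mlp_vec.transform(["claim your free prize"]))
print("Class probabilities:", mlp_probs)
print("Weight matrix shapes:", [w.shape for w in mlp.coefs_])
```

The three weight matrices correspond to the input-to-hidden, hidden-to-hidden, and hidden-to-output connections.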

CNNs, originally designed for image processing, are adapted for text classification by treating text as 1D sequences. Convolutional layers capture local patterns and n-gram features, while pooling layers reduce dimensionality and extract the most important features.
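
The NumPy sketch below shows what a single convolution-plus-pooling step computes for one embedded message. All sizes and the random values are hypothetical. A real model would learn the embeddings and filters during training, typically in a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sizes: 7 tokens, 4-dimensional embeddings,
# 3 filters that each span 2 consecutive tokens (bigram-like patterns)
seq_len, embed_dim, n_filters, kernel_size = 7, 4, 3, 2
embedded = rng.normal(size=(seq_len, embed_dim))  # one embedded message
filters = rng.normal(size=(n_filters, kernel_size, embed_dim))
bias = np.zeros(n_filters)

# Convolution: slide every filter over each window of consecutive tokens
n_windows = seq_len - kernel_size + 1
feature_map = np.empty((n_windows, n_filters))
for start in range(n_windows):
    window = embedded[start:start + kernel_size]  # (kernel_size, embed_dim)
    feature_map[start] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias

feature_map = np.maximum(feature_map, 0.0)  # ReLU activation

# Global max pooling: keep each filter's strongest response across positions
pooled = feature_map.max(axis=0)
print(feature_map.shape, pooled.shape)
```

The pooled vector has one value per filter regardless of message length, which is how the network copes with SMS messages of varying size.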

**Algorithm Selection Rationale:**

These algorithms were selected to represent fundamentally different approaches: instance-based vs. model-based learning, interpretable vs. black-box methods, and traditional vs. deep learning approaches. The comparison will reveal trade-offs between accuracy, interpretability, computational efficiency, and robustness to class imbalance.

The implementation follows best practices including proper train-test splitting, cross-validation, hyperparameter tuning, and performance evaluation using multiple metrics. This comprehensive approach ensures fair comparison and provides insights into the strengths and limitations of each method for SMS spam detection.
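
One way to combine cross-validation and hyperparameter tuning without leaking information between folds is to wrap the vectorizer and classifier in a scikit-learn `Pipeline`. The sketch below is an assumption about how this could be set up, using a toy corpus, a hypothetical parameter grid, and F1 on the spam class as the score. It is not the exact procedure of this study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical; 1 = spam, 0 = ham)
texts = [
    "win a free prize now", "free cash prize claim now",
    "urgent claim your free reward", "free entry win cash now",
    "claim your prize call now", "urgent free cash reward call",
    "are we still meeting for lunch", "see you at home tonight",
    "can you call me after lunch", "meeting moved see you tonight",
    "home after lunch call me", "are you still at home",
]
labels = [1] * 6 + [0] * 6

# The vectorizer is refitted inside every fold, so validation messages
# never influence the vocabulary or the IDF weights
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("knn", KNeighborsClassifier(metric="cosine", algorithm="brute")),
])

# Stratified folds keep the class ratio stable across splits
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, {"knn__n_neighbors": [1, 3]}, scoring="f1", cv=cv)
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Mean cross-validated F1:", round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy keeps the search focused on the minority spam class.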


<a href="https://colab.research.google.com/github/Mody2828/CEM300-SMS-Spam-Analysis/blob/main/CEM300_SMS_Spam_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>