# Spam Classification with Naive Bayes Theorem

## Overview
This project focuses on classifying SMS messages as either spam or ham (legitimate) using a Naive Bayes classifier. The goal is to build an efficient classifier that can distinguish unwanted messages from legitimate ones based on historical data.

## Dataset
We are utilizing the **SMS Spam Collection** dataset from the UC Irvine Machine Learning Repository:

- **Dataset Link:** [SMS Spam Collection](https://archive.ics.uci.edu/dataset/228/sms+spam+collection)
- The dataset contains **5,574** SMS messages, each labeled as either **spam** or **ham**.
- This dataset has been widely used in mobile phone spam research and serves as a strong benchmark for spam classification tasks.

## Background
**Naive Bayes** is a probabilistic classification approach based on Bayes' theorem. It is called "naive" because it assumes that each feature is conditionally independent of the others given the class. Despite this simplification, it performs well in text classification, where the feature space (the vocabulary) is high-dimensional and sparse.
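
As a sketch of the underlying rule (the notation here is an assumption for illustration: $c$ is a class, $w_1, \dots, w_n$ are the words of a message), the independence assumption lets the class posterior factor into per-word terms, and the classifier picks the class with the highest score:

```latex
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} \;=\; \arg\max_{c \,\in\, \{\text{spam},\, \text{ham}\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

In practice, implementations usually sum log-probabilities rather than multiply raw probabilities, which avoids numerical underflow on long messages.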

### Why Naive Bayes?
- **Fast and Efficient:** Works well with large text datasets.
- **Performs well with small training data:** Has few parameters to estimate and trains far faster than deep learning models.
- **Handles noisy data:** Suitable for spam detection where wording can be highly variable.

## Methodology
1. **Data Preprocessing**
   - Load and clean the dataset.
   - Tokenization, lowercasing, and removing unnecessary characters.
   - Converting text into numerical feature vectors using techniques like TF-IDF or Count Vectorization.

2. **Model Training**
   - Implementing Naive Bayes classification (MultinomialNB or BernoulliNB from scikit-learn).
   - Splitting the dataset into training and testing subsets.

3. **Evaluation**
   - Measuring accuracy, precision, recall, and F1-score.
   - Analyzing false positives and false negatives.

4. **Deployment and Future Enhancements**
   - Exploring additional preprocessing techniques to improve accuracy.
   - Extending to real-world applications such as email spam filtering.
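
The split and evaluation steps listed above can be sketched as follows. This is a minimal illustration on a hypothetical toy corpus (the `texts` and `labels` values are invented for the example, not taken from the SMS dataset), assuming scikit-learn is installed. It holds out a stratified test set and reports the confusion matrix along with precision, recall, and F1-score.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the cleaned messages (1 = spam, 0 = ham)
texts = [
    "win free prize now", "free entry win cash", "claim free gift card now",
    "urgent win cash prize", "free cash claim now", "win gift card free",
    "see you at lunch", "call me later today", "are we meeting tomorrow",
    "thanks for the help", "lunch tomorrow at noon", "see you later today",
]
labels = [1] * 6 + [0] * 6

# Hold out a stratified test set so both classes appear in each split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit the vectorizer and classifier on the training portion only
model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print(classification_report(
    y_test, y_pred, target_names=["ham", "spam"], zero_division=0
))
```

The off-diagonal cells of the confusion matrix are the false positives (ham flagged as spam) and false negatives (spam that slipped through), which is where the error analysis in the evaluation step would start.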

## Conclusion
This project demonstrates the effectiveness of the **Naive Bayes** approach in spam classification and highlights its practical applications in AI-driven security solutions. As part of the **AI Red Teamer Job Role Path**, understanding spam detection techniques strengthens adversarial AI defense mechanisms, preparing security professionals for real-world AI security challenges.

---

# Environment Setup and Dataset Cleaning

#### Download the Dataset
----

In [10]:
import requests
import zipfile
import io

# URL of the dataset
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"

# Download the dataset
response = requests.get(url)
if response.status_code == 200:
    print("Download Successful.")
else:
    print("Failed to download the dataset.")

Download Successful.


#### Extracting the DataSet
---

In [20]:
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall("sms_spam_collection")
    print("Extraction Complete")

# Verify complete extraction
import os
extracted_files = os.listdir("sms_spam_collection")
print("Extracted files: ", extracted_files)

Extraction Complete
Extracted files:  ['readme', 'SMSSpamCollection']


#### Loading the DataSet
---

In [30]:
import pandas as pd

# Load the dataset
df = pd.read_csv(
    "sms_spam_collection/SMSSpamCollection",
    sep="\t",
    header=None,
    names=["label", "message"],
)

In [34]:
# Displaying basic information about the dataset
print("-------------------- HEAD --------------------")
print(df.head()) # First few rows
print("-------------------- DESCRIBE --------------------")
print(df.describe()) # Statistical Summary
print("-------------------- INFO --------------------")
df.info() # Concise summary, including non-null counts and data types of each column (prints directly and returns None)

-------------------- HEAD --------------------
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
-------------------- DESCRIBE --------------------
       label                 message
count   5572                    5572
unique     2                    5169
top      ham  Sorry, I'll call later
freq    4825                      30
-------------------- INFO --------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [36]:
# Check for missing values that can skew and reduce the quality of data
print("Missing values:\n", df.isnull().sum())

Missing values:
 label      0
message    0
dtype: int64


In [38]:
# Check for duplicate values for the same purpose
print("Duplicate entries:\n", df.duplicated().sum())

Duplicate entries:
 403


In [42]:
# Remove the duplicates (you can run the above to confirm)
df = df.drop_duplicates()

# Preprocessing the DataSet

Preprocessing standardizes the text, reduces noise, and extracts meaningful features, all of which improve model performance. We rely on the NLTK library for tokenization, stop word removal, and stemming.

---

In [60]:
import nltk

# Download the necessary NLTK data files
#nltk.download("punkt") # tokenization
#nltk.download("punkt_tab")
#nltk.download("stopwords") # Stop words

print("=== BEFORE ANY PREPROCESSING ===")
print(df.head(5))

=== BEFORE ANY PREPROCESSING ===
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


#### Lowercasing the text for word equality
Regardless of original casing, this will consider "Word" and "word" as the same token, improving performance and uniformity across the dataset.

---

In [68]:
# Convert all message text to lowercase
df["message"] = df["message"].str.lower()
print("\n=== AFTER LOWERCASING ===")
print(df["message"].head(5))


=== AFTER LOWERCASING ===
0    go until jurong point, crazy.. available only ...
1                        ok lar... joking wif u oni...
2    free entry in 2 a wkly comp to win fa cup fina...
3    u dun say so early hor... u c already then say...
4    nah i don't think he goes to usf, he lives aro...
Name: message, dtype: object


#### Remove Punctuation and Numbers
Simplify the dataset by focusing on meaningful words. Symbols such as '$' and '!' can still be informative in spam messages, whether they signal a monetary amount or added emphasis, so we keep them.

The code removes every character except lowercase letters, whitespace, dollar signs, and exclamation marks, which makes the two classes easier to tell apart.
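
To see what this character filter keeps and drops, here is a small standalone sketch. The helper name `strip_noise` and the sample message are hypothetical; the pattern is the same one used in the cell that follows.

```python
import re

# Same character class as the notebook: drop anything that is not
# a lowercase letter, whitespace, '$', or '!'
NOISE_PATTERN = re.compile(r"[^a-z\s$!]")

def strip_noise(text):
    """Lowercase the text, then remove digits and non-essential punctuation."""
    return NOISE_PATTERN.sub("", text.lower())

sample = "WIN $1000 now!!! Call 0800-123."
cleaned = strip_noise(sample)
print(repr(cleaned))  # 'win $ now!!! call '
```

Note that the digits are removed while the dollar sign survives, so "$1000" is reduced to "$". The amount itself is lost, but the presence of a monetary symbol remains as a signal.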

---

In [71]:
import re

# Remove non-essential punctuation and numbers, keep useful symbols like $ and !
df["message"] = df["message"].apply(lambda x: re.sub(r"[^a-z\s$!]", "", x))
print("\n=== AFTER REMOVING PUNCTUATION & NUMBERS (except $ and !) ===")
print(df["message"].head(5))


=== AFTER REMOVING PUNCTUATION & NUMBERS (except $ and !) ===
0    go until jurong point crazy available only in ...
1                              ok lar joking wif u oni
2    free entry in  a wkly comp to win fa cup final...
3          u dun say so early hor u c already then say
4    nah i dont think he goes to usf he lives aroun...
Name: message, dtype: object


#### Tokenizing the Text
Tokenization divides each message into individual words (tokens). Converting unstructured text into a sequence of words prepares it for stop word removal and stemming.

----

In [83]:
from nltk.tokenize import word_tokenize

# Split each message into individual tokens
df["message"] = df["message"].apply(word_tokenize)
print("\n=== AFTER TOKENIZATION ===")
print(df["message"].head(5))


=== AFTER TOKENIZATION ===
0    [go, until, jurong, point, crazy, available, o...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, in, a, wkly, comp, to, win, fa, ...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, dont, think, he, goes, to, usf, he, l...
Name: message, dtype: object


#### Removing Stop Words
Stop words such as 'and', 'the', and 'is' add little meaningful context, so we remove them to reduce noise. This helps the model distinguish SPAM (bad) from HAM (good), and the shorter token lists give later text transformations a cleaner input.

---

In [90]:
from nltk.corpus import stopwords

# Define a set of English stop words and remove them from the tokens
stop_words = set(stopwords.words("english"))
df["message"] = df["message"].apply(lambda x: [word for word in x if word not in stop_words])
print("\n=== AFTER REMOVING STOP WORDS ===")
print(df["message"].head(5))


=== AFTER REMOVING STOP WORDS ===
0    [go, jurong, point, crazy, available, bugis, n...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, wkly, comp, win, fa, cup, final,...
3        [u, dun, say, early, hor, u, c, already, say]
4    [nah, dont, think, goes, usf, lives, around, t...
Name: message, dtype: object


#### Stemming the Text
Stemming normalizes words by reducing them to a common root (e.g. "eating" becomes "eat"). The stem is not always a dictionary word, as "crazi" and "entri" in the output below show, but mapping related word forms to a single token shrinks the vocabulary and reduces variance in the text representation. This improves the model's ability to generalize.

---

In [96]:
from nltk.stem import PorterStemmer

# Stem each token to reduce words to their base form
stemmer = PorterStemmer()
df["message"] = df["message"].apply(lambda x: [stemmer.stem(word) for word in x])
print("\n=== AFTER STEMMING ===")
print(df["message"].head(5))


=== AFTER STEMMING ===
0    [go, jurong, point, crazi, avail, bugi, n, gre...
1                         [ok, lar, joke, wif, u, oni]
2    [free, entri, wkli, comp, win, fa, cup, final,...
3        [u, dun, say, earli, hor, u, c, alreadi, say]
4    [nah, dont, think, goe, usf, live, around, tho...
Name: message, dtype: object


#### Joining Tokens Back into a Single String
Scikit-learn's vectorizers (e.g. CountVectorizer or the TF-IDF vectorizer) expect each document as a single string by default. Rejoining the tokens into a space-separated string restores that format and prepares the dataset for the feature extraction phase.

---

In [100]:
# Rejoin tokens into a single string for feature extraction
df["message"] = df["message"].apply(lambda x: " ".join(x))
print("\n=== AFTER JOINING TOKENS BACK INTO STRINGS ===")
print(df["message"].head(5))


=== AFTER JOINING TOKENS BACK INTO STRINGS ===
0    go jurong point crazi avail bugi n great world...
1                                ok lar joke wif u oni
2    free entri wkli comp win fa cup final tkt st m...
3                  u dun say earli hor u c alreadi say
4            nah dont think goe usf live around though
Name: message, dtype: object


# Feature Extraction
The purpose of feature extraction is to transform preprocessed SMS messages into numerical vectors that work with machine learning algorithms. Models cannot directly process raw text data, so we transform the data into numerically represented information the model can consume.

---

#### Count Vectorization for the Bag-of-Words Approach
Scikit-learn's CountVectorizer implements a bag-of-words approach. It converts a collection of documents into a matrix of term counts, where each row represents a message and each column corresponds to a term (a unigram or bigram). Internally, CountVectorizer tokenizes the text, builds a vocabulary, and maps each document to a numeric vector.

This step prepares 'X' below to become a numerical feature matrix ready to be fed into a classifier like Naive Bayes.
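
The idea behind the term-count matrix can be illustrated in plain Python. This is a simplified sketch, not scikit-learn's implementation: the helper `ngram_counts` is hypothetical, it splits on whitespace only, and unlike CountVectorizer's default tokenizer it does not drop single-character tokens.

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 2)):
    """Count word n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Two hypothetical preprocessed messages
docs = ["free entri win", "win free free"]
per_doc = [ngram_counts(doc) for doc in docs]

# The vocabulary is the sorted union of all terms seen in any document
vocabulary = sorted(set().union(*per_doc))

# One row per document, one column per vocabulary term
matrix = [[counts[term] for term in vocabulary] for counts in per_doc]

print(vocabulary)
# ['entri', 'entri win', 'free', 'free entri', 'free free', 'win', 'win free']
for row in matrix:
    print(row)
# [1, 1, 1, 1, 0, 1, 0]
# [0, 0, 2, 0, 1, 1, 1]
```

Most entries in a real matrix are zero because each message uses only a tiny part of the vocabulary, which is why CountVectorizer returns a sparse matrix.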

---

In [110]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with unigrams and bigrams; min_df=1 keeps every term, max_df=0.9 drops terms found in more than 90% of messages
vectorizer = CountVectorizer(min_df=1, max_df=0.9, ngram_range=(1, 2))

# Fit and transform the message column
X = vectorizer.fit_transform(df["message"])

# Labels (target variable)
y = df["label"].apply(lambda x: 1 if x == "spam" else 0) # Encode labels: spam -> 1, ham -> 0

# Training and Evaluation (Spam Detection)

#### Training using a 'Multinomial Naive Bayes' classifier
We employ a pipeline to chain together vectorization and modeling steps, ensuring data transformation is applied before feeding the transformed data into the classifier. This encapsulates the feature extraction and model training into a single workflow.

---

In [117]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Build the pipeline by combining vectorization and classification
pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("classifier", MultinomialNB())
])

#### Integrating Hyperparameter Tuning for Performance Boost
GridSearchCV searches for the parameter values that give the best cross-validated score, which helps the model generalize and avoid overfitting. Here we tune 'alpha', the additive smoothing factor that controls how the model handles words unseen in a class and prevents their probabilities from being zero. Choosing it well balances bias and variance and improves the model's robustness.
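
The effect of 'alpha' can be shown with a small worked example. The sketch below uses the standard additive (Lidstone) smoothing formula, (count + alpha) / (total + alpha * vocabulary size), on hypothetical word counts. The function name `smoothed_likelihood` and the counts are assumptions for illustration only.

```python
from collections import Counter

def smoothed_likelihood(word, class_counts, vocab_size, alpha):
    """Estimate P(word | class) with additive smoothing."""
    total = sum(class_counts.values())
    return (class_counts[word] + alpha) / (total + alpha * vocab_size)

# Hypothetical word counts observed in spam messages
spam_counts = Counter({"free": 3, "win": 2, "prize": 1})
vocabulary = {"free", "win", "prize", "lunch"}

# "lunch" never appears in spam, so its unsmoothed probability is zero
results = {}
for alpha in (0.0, 0.25, 1.0):
    results[alpha] = smoothed_likelihood("lunch", spam_counts, len(vocabulary), alpha)
    print(alpha, round(results[alpha], 4))
# 0.0 0.0
# 0.25 0.0357
# 1.0 0.1
```

With alpha at zero, a single unseen word would zero out the whole product of word probabilities for that class. Larger values shift more probability mass toward unseen words, which is the trade-off the grid search explores.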

---

In [120]:
# Define the parameter grid for hyperparameter tuning
param_grid = {
    "classifier__alpha": [0.01, 0.1, 0.15, 0.2, 0.25, 0.5, 0.75, 1.0]
}

# Perform the grid search with 5-fold cross-validation and the F1-score as metric
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="f1"
)

# Fit the grid search on the full dataset
grid_search.fit(df["message"], y)

# Extract the best model identified by the grid search
best_model = grid_search.best_estimator_
print("Best model parameters:", grid_search.best_params_)

Best model parameters: {'classifier__alpha': 0.25}


#### Setting up Evaluation Messages
We provide a list of new SMS messages for evaluation. These messages represent the type of inputs the model may receive in real-world use, including different types of SPAM.

---

In [123]:
# Example SMS messages for evaluation
new_messages = [
    "Congratulations! You've won a $1000 Walmart gift card. Go to http://bit.ly/1234 to claim now.",
    "Hey, are we still meeting up for lunch today?",
    "Urgent! Your account has been compromised. Verify your details here: www.fakebank.com/verify",
    "Reminder: Your appointment is scheduled for tomorrow at 10am.",
    "FREE entry in a weekly competition to win an iPad. Just text WIN to 80085 now!",
]

#### Preprocessing New Messages
Before predicting with the model, we must preprocess new messages using the same steps applied during training. A preprocess_message function takes care of this for us.

---

In [130]:
import numpy as np
import re

# Preprocess function that mirrors the training-time preprocessing
def preprocess_message(message):
    message = message.lower()
    message = re.sub(r"[^a-z\s$!]", "", message)
    tokens = word_tokenize(message)
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [stemmer.stem(word) for word in tokens]
    return " ".join(tokens)

# Applying this function to preprocess and vectorize messages
processed_messages = [preprocess_message(msg) for msg in new_messages]

#### Vectorizing the Processed Messages
The model requires numerical input features, so we need to apply the same vectorization used during training. The fitted CountVectorizer stored in the pipeline does this.

---

In [136]:
# Transform preprocessed messages into feature vectors
X_new = best_model.named_steps["vectorizer"].transform(processed_messages)

#### Making Predictions
With the data preprocessed and vectorized, we can feed the new messages into the trained MultinomialNB classifier, which outputs both a predicted label (spam or not spam) and class probabilities indicating the model's confidence in its decision.

---

In [139]:
# Predict with the trained classifier
predictions = best_model.named_steps["classifier"].predict(X_new)
prediction_probabilities = best_model.named_steps["classifier"].predict_proba(X_new)

#### Displaying Predictions and Probabilities
We can now present the evaluation results and display the following:
- The original text of the message.
- The predicted label (Spam or Not-Spam).
- The probability that the message is spam.
- The probability that the message is not spam.

On these five examples, the model flags the three spam-like messages as spam and labels the two legitimate messages as not spam.

----

In [151]:
# Display predictions and probabilities for each evaluated message
for i, msg in enumerate(new_messages):
    prediction = "Spam" if predictions[i] == 1 else "Not-Spam"
    spam_probability = prediction_probabilities[i][1]  # Probability of being spam
    ham_probability = prediction_probabilities[i][0]   # Probability of being not spam
    
    print(f"Message: {msg}")
    print(f"Prediction: {prediction}")
    print(f"Spam Probability: {spam_probability:.2f}")
    print(f"Not-Spam Probability: {ham_probability:.2f}")
    print("-" * 50)

Message: Congratulations! You've won a $1000 Walmart gift card. Go to http://bit.ly/1234 to claim now.
Prediction: Spam
Spam Probability: 1.00
Not-Spam Probability: 0.00
--------------------------------------------------
Message: Hey, are we still meeting up for lunch today?
Prediction: Not-Spam
Spam Probability: 0.00
Not-Spam Probability: 1.00
--------------------------------------------------
Message: Urgent! Your account has been compromised. Verify your details here: www.fakebank.com/verify
Prediction: Spam
Spam Probability: 0.96
Not-Spam Probability: 0.04
--------------------------------------------------
Message: Reminder: Your appointment is scheduled for tomorrow at 10am.
Prediction: Not-Spam
Spam Probability: 0.00
Not-Spam Probability: 1.00
--------------------------------------------------
Message: FREE entry in a weekly competition to win an iPad. Just text WIN to 80085 now!
Prediction: Spam
Spam Probability: 1.00
Not-Spam Probability: 0.00
--------------------------------------------------

#### Saving Models with Joblib
We can preserve the model for re-use later by saving it to a file. Joblib is a Python library designed to serialize and deserialize Python objects, like large arrays (NumPy) or Scikit-learn models.
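
A save-and-load round trip can be sketched as follows, assuming Joblib is installed. The dictionary and the file name `example.joblib` are hypothetical stand-ins for a trained model and its path. A temporary directory keeps the example from leaving files behind.

```python
import os
import tempfile

import joblib

# Hypothetical stand-in for a trained model object
original = {"alpha": 0.25, "vocabulary": ["free", "win"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.joblib")
    joblib.dump(original, path)    # serialize the object to disk
    restored = joblib.load(path)   # deserialize it back into memory

print(restored == original)  # True
```

Joblib files are pickle-based, so loading one can execute arbitrary code. Only load files from sources you trust.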

"Serialization converts an in-memory object into a format that can be stored on disk or transmitted across networks. Deserialization involves converting the stored representation back into an in-memory object with the exact same state it had when saved."

---

In [163]:
import joblib

# Save the trained model to a file for future use
model_filename = 'spam_detection_model.joblib'
joblib.dump(best_model, model_filename)

print(f"Model saved to {model_filename}")

Model saved to spam_detection_model.joblib


In [157]:
# You can reuse the model later by loading it from disk
#loaded_model = joblib.load(model_filename)
#predictions = loaded_model.predict(processed_messages)  # pass preprocessed text, as in training

# Uploading the Model to Hack The Box's Endpoint for flag
I cleared the output of the flag in order to encourage those who may land here to do the work themselves.

---

In [None]:
import requests
import json

# Define the URL of the API endpoint
url = "http://10.129.51.30:8000/api/upload"

# Path to the model file you want to upload
model_file_path = "spam_detection_model.joblib"

# Open the file in binary mode and send the POST request
with open(model_file_path, "rb") as model_file:
    files = {"model": model_file}
    response = requests.post(url, files=files)

# Pretty print the response from the server
print(json.dumps(response.json(), indent=4))