# Naive Bayes for Spam Detection (Classification)

In this example, we solve a text classification problem: labelling SMS messages as either "ham" (non-spam) or "spam". Specifically, we use a multinomial Naive Bayes classifier to predict whether a given message is spam based on its content. First, we load a dataset of messages labelled "spam" or "ham". Next, we convert the raw text (the SMS messages) into a numerical representation using a bag-of-words model. Finally, we train the Naive Bayes classifier on this numerical data so that it learns the relationship between word frequencies and the two classes (ham or spam).
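
For reference, the decision rule that a multinomial Naive Bayes classifier applies can be written as below. This is the standard textbook formulation with additive smoothing; the symbols ($x_j$, $\theta_{cj}$, $N_{cj}$, $N_c$, $V$, $\alpha$) are notation introduced here for illustration and do not appear in the code.

$$
\hat{y}(x) = \arg\max_{c \in \{\text{ham},\, \text{spam}\}} \left[ \log P(c) + \sum_{j=1}^{\lvert V \rvert} x_j \log \theta_{cj} \right],
\qquad
\theta_{cj} = \frac{N_{cj} + \alpha}{N_c + \alpha \lvert V \rvert}
$$

Here $x_j$ is the count of word $j$ in the message, $N_{cj}$ is the total count of word $j$ over the training messages of class $c$, $N_c = \sum_j N_{cj}$, $\lvert V \rvert$ is the vocabulary size, and $\alpha$ is the smoothing parameter (scikit-learn's `MultinomialNB` uses $\alpha = 1$ by default).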

The core step before applying Naive Bayes is converting the text into numbers: the messages are tokenized and then encoded with the bag-of-words model mentioned above. To do this, we use the `CountVectorizer` class from `sklearn.feature_extraction.text`. Bag-of-words is one of several ways to represent text numerically for machine learning tasks. `CountVectorizer` first splits each message into individual tokens (words); by default it lowercases the text and discards punctuation along the way. It then builds a vocabulary, i.e., the set of unique tokens found across all messages in the dataset. Finally, it counts how often each vocabulary word appears in each document (SMS message). This produces a document-term matrix (DTM), where:



*   each row represents a message (or document),
*   each column represents a word in the vocabulary,
*   and the cell value at position $(i, j)$ is the count (frequency) of word $j$ in document $i$.

That way, we create discrete feature vectors whose components are the counts (frequencies) of specific words (tokens) from the vocabulary.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load the tab-separated SMS dataset (fetched from a URL here; a local path also works)
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
data = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

# Map labels to binary: "ham" -> 0, "spam" -> 1
data['label'] = data['label'].map({'ham': 0, 'spam': 1})

# Extract features and labels
X = data['message']
y = data['label']

# Convert text data to a bag-of-words representation,
# dropping common English stop words
# Note: the vocabulary is built from all messages, including those
# that later end up in the test set
vectorizer = CountVectorizer(stop_words='english')
X_vectorized = vectorizer.fit_transform(X)

# Split into training (80%) and testing (20%) datasets
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)

# Initialize and train the Naive Bayes classifier
# Since the features are discrete word counts, we use the multinomial model
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

# Make predictions on the held-out test set
y_pred = nb_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

# Test on a custom message (reuse the fitted vectorizer; do not refit it)
custom_message = ["Congratulations! You've won a free trip to Bahamas! Reply now to claim."]
custom_vectorized = vectorizer.transform(custom_message)
prediction = nb_classifier.predict(custom_vectorized)
print("\nCustom Message Prediction: ", "Spam" if prediction[0] == 1 else "Ham")
```


```
Accuracy: 0.98

Classification Report:
              precision    recall  f1-score   support

         Ham       0.99      0.99      0.99       966
        Spam       0.92      0.96      0.94       149

    accuracy                           0.98      1115
   macro avg       0.96      0.97      0.96      1115
weighted avg       0.98      0.98      0.98      1115


Custom Message Prediction:  Spam
```
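
One caveat of the code above: the vectorizer is fit on all messages before the train/test split, so the vocabulary also contains words that occur only in test messages. A common way to avoid this is to split the raw text first and wrap both steps in a scikit-learn pipeline. The sketch below illustrates the pattern on a tiny made-up corpus (the `toy_*` names and the messages are invented for illustration). It is not the code that produced the results above, and on the real dataset the scores may differ slightly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (1 = spam, 0 = ham), for illustration only
toy_train_messages = [
    "win a free prize now",
    "free cash prize claim now",
    "are we meeting for lunch",
    "see you at the meeting tomorrow",
]
toy_train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer on the training text only,
# then trains the classifier on the resulting word counts
spam_pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),
    MultinomialNB(),
)
spam_pipeline.fit(toy_train_messages, toy_train_labels)

# Unseen messages are encoded with the training vocabulary before prediction
toy_predictions = spam_pipeline.predict(
    ["claim your free prize", "lunch meeting tomorrow"]
).tolist()
print(toy_predictions)  # [1, 0]
```

With this arrangement, the real dataset would be split with `train_test_split` on the raw `message` column, and the pipeline would be fit on the training portion alone.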


The classification report above shows that, despite the strong (and unrealistic) Naive Bayes assumption that the features are conditionally independent given the class, the classifier reaches 98% accuracy on the held-out test set, with a precision of 0.92 and a recall of 0.96 on the spam class. Note that without the Naive Bayes assumption, the classification problem using the "bag-of-words" encoding would be significantly more computationally intensive, since we would need to model the dependence among the features. In general this means considering all possible combinations of feature values, which leads to a combinatorial explosion in the number of model parameters.
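
The parameter explosion can be made concrete with a back-of-the-envelope count. The sketch below makes a simplifying assumption that is not in the text above: each feature is binary (word present or absent) rather than a count, and there are two classes. The function names are illustrative, and class priors are left out of both counts.

```python
def naive_bayes_param_count(num_features: int, num_classes: int = 2) -> int:
    """One Bernoulli parameter per (class, feature) pair."""
    return num_classes * num_features


def full_joint_param_count(num_features: int, num_classes: int = 2) -> int:
    """One probability per combination of binary feature values,
    minus one per class because the probabilities sum to 1."""
    return num_classes * (2 ** num_features - 1)


for d in (2, 10, 20):
    print(d, naive_bayes_param_count(d), full_joint_param_count(d))
# 2 4 6
# 10 20 2046
# 20 40 2097150
```

Even at 20 binary features the unrestricted model needs about two million parameters, while a bag-of-words vocabulary typically has thousands of features.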