## SMS Spam Detection

In this project, we explore a practical application of the Naive Bayes algorithm by building a spam filter for SMS messages. The goal is to develop a model that automatically classifies incoming messages as either "spam" or "ham" (not spam).

We will walk through each step of the process, including data preprocessing, feature extraction, model training, and evaluation, to understand how spam detection works in practice.

In [2]:
import pandas as pd

data = pd.read_csv('/Users/nurbekkhujaev/Jupyter/SMSSpamCollection', sep='\t', header=None, names=['Label', 'Message'])

In [3]:
data.shape

(5572, 2)

In [4]:
data.head()

| | Label | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [5]:
data['Label'].value_counts(normalize=True)

Label
ham     0.865937
spam    0.134063
Name: proportion, dtype: float64

### Data Randomization and Splitting

We randomized the dataset to eliminate any ordering bias. After shuffling, we split the data into training and test sets using an 80:20 ratio. This ensures that the model has sufficient data to learn from while reserving a portion for unbiased evaluation.

We also verified that the class distribution of "ham" and "spam" messages remains approximately the same in both the training and test sets as in the original dataset. This helps maintain the integrity of the dataset and ensures the model is trained and evaluated on representative samples.

In [7]:
data_randomized = data.sample(frac=1, random_state=1)

In [8]:
split_point = int(0.8 * len(data_randomized))

train_df = data_randomized[:split_point].reset_index(drop=True)
test_df = data_randomized[split_point:].reset_index(drop=True)

In [9]:
train_df['Label'].value_counts(normalize=True)

Label
ham     0.86538
spam    0.13462
Name: proportion, dtype: float64

In [10]:
test_df['Label'].value_counts(normalize=True)

Label
ham     0.868161
spam    0.131839
Name: proportion, dtype: float64

In [11]:
print(train_df.shape)
print(test_df.shape)

(4457, 2)
(1115, 2)
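
The shuffle-then-slice split above relies on chance to keep the class proportions similar in both parts. As a minimal sketch of an alternative, a stratified split shuffles each class separately so that both parts get nearly the same ham/spam ratio by construction. The function name `stratified_split` and the toy rows are assumptions for illustration and are not part of the original notebook.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_fraction=0.2, seed=1):
    """Split rows so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_test = round(len(label_rows) * test_fraction)
        test.extend(label_rows[:n_test])
        train.extend(label_rows[n_test:])

    # Shuffle again so the classes are not stored in blocks.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy data: 80 ham and 20 spam rows (hypothetical, not the SMS dataset).
toy_rows = [("ham", f"h{i}") for i in range(80)] + [("spam", f"s{i}") for i in range(20)]
toy_train, toy_test = stratified_split(toy_rows, label_of=lambda row: row[0])

print(len(toy_train), len(toy_test))
print(sum(label == "spam" for label, _ in toy_test))
```

With 20% spam in the toy data, the test part receives exactly 4 spam rows out of 20, matching the overall proportion.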


### Letter Case and Punctuation

As part of preprocessing, we cleaned the text data by removing punctuation and converting all characters to lowercase. This step helps standardize the messages and reduces noise in the data, making it easier for the model to identify meaningful patterns.
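
To make the cleaning step concrete, here is a minimal sketch on a single string using the standard `re` module. The helper name `clean_message` is an assumption for illustration. Note that `\W` also replaces apostrophes, so a word like "don't" would be split into "don" and "t".

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r"\W", " ", text).lower()

cleaned = clean_message("Ok lar... Joking wif u oni...")
tokens = cleaned.split()
print(tokens)  # ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```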

In [13]:
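# regex=True is required in pandas 2.0+, where str.replace treats the pattern
# as a literal string by default. The outputs shown in this notebook appear to
# have been produced without it, which is why punctuation still shows up in them.
train_df['Message'] = train_df['Message'].str.replace(r'\W', ' ', regex=True)
train_df['Message'] = train_df['Message'].str.lower()

test_df['Message'] = test_df['Message'].str.replace(r'\W', ' ', regex=True)
test_df['Message'] = test_df['Message'].str.lower()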

In [14]:
test_df.head()

| | Label | Message |
|---|---|---|
| 0 | ham | wherre's my boytoy ? :-( |
| 1 | ham | later i guess. i needa do mcat study too. |
| 2 | ham | but i haf enuff space got like 4 mb... |
| 3 | spam | had your mobile 10 mths? update to latest oran... |
| 4 | ham | all sounds good. fingers . makes it difficult ... |


In [15]:
train_df.head()

| | Label | Message |
|---|---|---|
| 0 | ham | yep, by the pretty sculpture |
| 1 | ham | yes, princess. are you going to make me moan? |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent. |
| 4 | ham | i forgot 2 ask ü all smth.. there's a card on ... |


### Text Vectorization

After cleaning the text data, we transformed the SMS messages into numerical format using a technique called Bag of Words (BoW). In this approach, we create a vocabulary of all unique words in the training set and represent each message as a vector holding the frequency of each vocabulary word.

For example, the message:

"SECRET PRIZE! CLAIM SECRET PRIZE NOW!!"

is transformed into a vector that counts how many times each word appears (e.g., "secret": 2, "prize": 2, "claim": 1, etc.). This step converts raw text into a format that can be used by machine learning models, allowing them to identify patterns in the word usage between spam and ham messages.
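
The word counts for the example message can be reproduced with a short sketch based on `collections.Counter`. This is only an illustration of the counting idea; the notebook itself builds the counts with pandas.

```python
import re
from collections import Counter

message = "SECRET PRIZE! CLAIM SECRET PRIZE NOW!!"

# Same cleaning as in the preprocessing step: strip punctuation, lowercase, split.
tokens = re.sub(r"\W", " ", message).lower().split()
word_frequencies = Counter(tokens)

print(dict(word_frequencies))  # {'secret': 2, 'prize': 2, 'claim': 1, 'now': 1}
```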

### Building the Vocabulary

To prepare for vectorization, we first created a list of all unique words that appear in the messages of the training set. This list forms the vocabulary for our Bag of Words model.

By scanning through each message in the training data, we collected every distinct word after cleaning, ensuring that each word appears only once in the vocabulary. This vocabulary will later be used to construct the word frequency vectors for each message, enabling the model to interpret and learn from the text data.
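
As a small worked example of this idea, the sketch below builds a vocabulary and a count matrix for two already-tokenized toy messages. The names `toy_messages`, `toy_vocabulary`, and `count_matrix` are assumptions for illustration. Sorting the vocabulary is a design choice made here so the column order is reproducible; a plain `set` has no guaranteed order.

```python
# Two hypothetical, already cleaned and tokenized messages.
toy_messages = [
    ["secret", "prize", "claim", "secret", "prize", "now"],
    ["coming", "to", "my", "secret", "party"],
]

# Each distinct word appears exactly once in the vocabulary.
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One row per message, one column per vocabulary word.
column_of = {word: i for i, word in enumerate(toy_vocabulary)}
count_matrix = [[0] * len(toy_vocabulary) for _ in toy_messages]
for row, message in enumerate(toy_messages):
    for word in message:
        count_matrix[row][column_of[word]] += 1

print(toy_vocabulary)
for counts in count_matrix:
    print(counts)
```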

In [18]:
train_df['Message'].head()

0                         yep, by the pretty sculpture
1        yes, princess. are you going to make me moan?
2                           welp apparently he retired
3                                              havent.
4    i forgot 2 ask ü all smth.. there's a card on ...
Name: Message, dtype: object

In [19]:
train_df['Message'] = train_df['Message'].str.split()

# Collect every token from every training message (duplicates included).
vocabulary = []

for message in train_df['Message']:
    for word in message:
        vocabulary.append(word)

In [20]:
vocabulary = list(set(vocabulary))

In [21]:
word_counts_per_message = {unique_word: [0] * len(train_df['Message']) for unique_word in vocabulary}

for index, message in enumerate(train_df['Message']):
    for word in message:
        word_counts_per_message[word][index] += 1

In [23]:
word_counts = pd.DataFrame(word_counts_per_message)

In [25]:
word_counts.head()

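| | chef | satthen | idk | sayhey! | headin | other. | paris | thought- | home! | nap.. | ... | os | millers | yes.he | empty | heater | strange. | appreciated | subscriptions. | nyc, | standing...\| |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |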


In [27]:
train_df_clean = pd.concat([train_df, word_counts], axis=1)
train_df_clean.head()

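| | Label | Message | chef | satthen | idk | sayhey! | headin | other. | paris | thought- | ... | os | millers | yes.he | empty | heater | strange. | appreciated | subscriptions. | nyc, | standing...\| |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep,, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes,, princess., are, you, going, to, make, m... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent.] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth.., there's, a... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |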


### Building the Spam Filter with Naive Bayes

With the data cleaned and the training set prepared, we can now begin constructing our spam filter. We’ll use the Naive Bayes algorithm, a popular and effective method for text classification tasks like spam detection.

We will use the training data to calculate the probabilities needed for the algorithm to classify new, unseen messages as either spam or ham.

### Calculating Constants for Naive Bayes

Before we can apply the Naive Bayes algorithm, we need to calculate a few constants that will be used repeatedly when classifying new messages:

- **p_spam**: The probability that any given message is spam.
- **p_ham**: The probability that any given message is ham (not spam).
- **n_spam**: The total number of words in all spam messages.
- **n_ham**: The total number of words in all ham messages.
- **n_vocabulary**: The total number of unique words in the training set vocabulary.

We also apply Laplace smoothing to handle zero probabilities for words that may not appear in some classes. For this, we set the smoothing parameter **alpha** = 1.

These constants are essential for calculating the conditional probabilities of words given a class, which will then be used to compute the overall probability of a message being spam or ham.
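
To see what these constants look like on a dataset small enough to check by hand, here is a minimal sketch with three hypothetical tokenized messages. All `toy_` names are assumptions for illustration; the notebook computes the real values from `train_df_clean`.

```python
# (label, tokens) pairs: one spam message and two ham messages (hypothetical).
toy_training = [
    ("spam", ["secret", "prize", "claim", "now"]),
    ("ham", ["coming", "to", "my", "secret", "party"]),
    ("ham", ["see", "you", "at", "my", "party"]),
]

toy_spam = [words for label, words in toy_training if label == "spam"]
toy_ham = [words for label, words in toy_training if label == "ham"]

# Class priors: share of messages in each class.
toy_p_spam = len(toy_spam) / len(toy_training)
toy_p_ham = len(toy_ham) / len(toy_training)

# Total number of words (not unique words) in each class.
toy_n_spam = sum(len(words) for words in toy_spam)
toy_n_ham = sum(len(words) for words in toy_ham)

# Number of unique words across the whole training set.
toy_n_vocabulary = len({word for _, words in toy_training for word in words})

toy_alpha = 1  # Laplace smoothing parameter

print(toy_p_spam, toy_p_ham, toy_n_spam, toy_n_ham, toy_n_vocabulary)
```

Here the priors are 1/3 and 2/3, the spam class has 4 words, the ham class has 10, and the vocabulary holds 11 unique words.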

In [31]:
spam_messages = train_df_clean[train_df_clean['Label'] == 'spam']
ham_messages = train_df_clean[train_df_clean['Label'] == 'ham']

In [33]:
p_spam = len(spam_messages) / len(train_df)
p_ham = len(ham_messages) / len(train_df)

n_words_per_spam_message = spam_messages['Message'].apply(len)
n_spam = n_words_per_spam_message.sum()

n_words_per_ham_message = ham_messages['Message'].apply(len)
n_ham = n_words_per_ham_message.sum()

n_vocabulary = len(vocabulary)

alpha = 1

### Calculating Parameters

Next, we calculate the conditional probability of each word given a class, that is, how likely the word is to appear in spam versus ham messages. These probabilities form the core of the Naive Bayes model.

For each word w in the vocabulary, we compute:

- P(w | Spam) = (Number of occurrences of w in spam messages + alpha) / (n_spam + alpha × n_vocabulary)
- P(w | Ham) = (Number of occurrences of w in ham messages + alpha) / (n_ham + alpha × n_vocabulary)

This step ensures that every word, even those not seen in one of the classes, has a non-zero probability thanks to smoothing. These word probabilities will be used later when classifying new messages.
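
The effect of smoothing can be checked on a toy example. The sketch below applies the formula above to one hypothetical spam message and one hypothetical ham message, using `fractions.Fraction` so the results are exact. The function name `smoothed_word_probabilities` is an assumption for illustration.

```python
from collections import Counter
from fractions import Fraction

def smoothed_word_probabilities(class_messages, vocabulary, alpha=1):
    """Return P(w | class) for every vocabulary word, with Laplace smoothing."""
    counts = Counter(word for message in class_messages for word in message)
    n_class = sum(counts.values())  # total words in this class
    denominator = n_class + alpha * len(vocabulary)
    return {word: Fraction(counts[word] + alpha, denominator) for word in vocabulary}

toy_spam = [["secret", "prize", "claim", "now"]]
toy_ham = [["coming", "to", "my", "secret", "party"]]
toy_vocabulary = sorted({w for msgs in (toy_spam, toy_ham) for m in msgs for w in m})

spam_probs = smoothed_word_probabilities(toy_spam, toy_vocabulary)

print(spam_probs["secret"])      # 1/6  -> seen once in spam: (1 + 1) / (4 + 8)
print(spam_probs["party"])       # 1/12 -> never seen in spam: (0 + 1) / (4 + 8)
print(sum(spam_probs.values()))  # 1    -> still a valid distribution
```

The word "party" never occurs in the toy spam message, yet it receives a small non-zero probability, and the probabilities over the vocabulary still sum to 1.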

In [36]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()  
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham

In [38]:
import re

def classify(message):
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [40]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.168667088083531e-26
P(Ham|message): 6.093223663290184e-28
Label: Spam


In [42]:
classify("hello are you coming today")

P(Spam|message): 2.8003668602200727e-18
P(Ham|message): 7.788307098631482e-15
Label: Ham
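
The scores printed above are already as small as 1e-28 for a short message. Multiplying many small probabilities can underflow to 0.0 for long messages, in which case both scores tie at zero. A common remedy is to add log-probabilities instead of multiplying probabilities. The sketch below is one possible variant, not the notebook's implementation: `classify_log` takes the priors and parameter dictionaries as arguments, and the toy parameter values are hypothetical.

```python
import math

def classify_log(tokens, p_spam, p_ham, parameters_spam, parameters_ham):
    """Compare log-scores instead of raw products to avoid underflow."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in tokens:
        # Words outside the vocabulary are skipped, as in classify().
        if word in parameters_spam:
            log_spam += math.log(parameters_spam[word])
        if word in parameters_ham:
            log_ham += math.log(parameters_ham[word])
    if log_ham > log_spam:
        return "ham"
    if log_spam > log_ham:
        return "spam"
    return "needs human classification"

# Hypothetical word probabilities, chosen only to trigger underflow.
toy_spam_params = {"prize": 0.05}
toy_ham_params = {"prize": 0.0005}
long_message = ["prize"] * 400

# The plain product underflows to exactly 0.0 for this long message.
naive_spam_score = 0.13
for word in long_message:
    naive_spam_score *= toy_spam_params[word]

print(naive_spam_score)
print(classify_log(long_message, 0.13, 0.87, toy_spam_params, toy_ham_params))
```

Because the logarithm is monotonic, comparing log-scores gives the same decision as comparing the products whenever the products are representable.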


### Evaluating the Spam Filter on the Test Set

With our Naive Bayes spam filter fully trained, we now evaluate its performance using the test set. For each message in the test set, we:

1. Calculate the probability that the message is spam.
2. Calculate the probability that the message is ham.
3. Compare the two and classify the message based on the higher probability.

Once all messages are classified, we compare the predicted labels with the actual labels to measure the model’s accuracy.

Accuracy is calculated as:

$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
$$

This metric gives us a clear indication of how well the spam filter generalizes to new, unseen data.
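
Because roughly 87% of the messages are ham, accuracy alone can hide how well spam itself is caught. As a hedged sketch, the helper below computes accuracy together with precision and recall for the spam class. The function name `evaluate` and the toy label lists are assumptions for illustration and do not come from the notebook's results.

```python
def evaluate(actual, predicted, positive="spam"):
    """Return accuracy, plus precision and recall for the positive class."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    true_pos = sum(a == positive and p == positive for a, p in pairs)
    false_pos = sum(a != positive and p == positive for a, p in pairs)
    false_neg = sum(a == positive and p != positive for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }

# Hypothetical labels: 6 of 8 predictions are correct.
toy_actual = ["ham", "ham", "ham", "spam", "spam", "ham", "ham", "ham"]
toy_predicted = ["ham", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]

metrics = evaluate(toy_actual, toy_predicted)
print(metrics)  # {'accuracy': 0.75, 'precision': 0.5, 'recall': 0.5}
```

In this toy case the accuracy is 75%, but only half of the spam messages are caught, which accuracy by itself would not reveal.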

In [56]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [58]:
test_df['Predicted'] = test_df['Message'].apply(classify_test_set)
test_df.head()

| | Label | Message | Predicted |
|---|---|---|---|
| 0 | ham | wherre's my boytoy ? :-( | ham |
| 1 | ham | later i guess. i needa do mcat study too. | ham |
| 2 | ham | but i haf enuff space got like 4 mb... | ham |
| 3 | spam | had your mobile 10 mths? update to latest oran... | spam |
| 4 | ham | all sounds good. fingers . makes it difficult ... | ham |


In [60]:
correct = (test_df['Label'] == test_df['Predicted']).sum()
total = len(test_df)

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct / total)

Correct: 1091
Incorrect: 24
Accuracy: 0.97847533632287


### Results

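The spam filter classified 1,091 of the 1,115 test messages correctly, an accuracy of 97.85%. For comparison, labeling every message as ham would score about 86.8% on this test set, since that is the share of ham messages in it.

This indicates that the Naive Bayes model separates spam from ham well on held-out messages from the same dataset. Messages returned as "needs human classification" are counted as incorrect.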

In [65]:
incorrect = test_df[test_df['Label'] != test_df['Predicted']]

In [67]:
incorrect

| | Label | Message | Predicted |
|---|---|---|---|
| 115 | spam | not heard from u4 a while. call me now am here... | ham |
| 136 | spam | more people are dogging in your area now. call... | ham |
| 153 | ham | unlimited texts. limited minutes. | spam |
| 160 | ham | 26th of july | spam |
| 285 | ham | nokia phone is lovly.. | spam |
| 294 | ham | a boy loved a gal. he propsd bt she didnt mind... | needs human classification |
| 320 | ham | we have sent jd for customer service cum accou... | spam |
| 364 | spam | email alertfrom: jeri stewartsize: 2kbsubject:... | ham |
| 399 | ham | hasn't that been the pattern recently crap wee... | spam |
| 493 | ham | madam,regret disturbance.might receive a refer... | spam |


In [79]:
for message in incorrect['Message']:
    print(message)
    classify(message)  # classify() prints its results and returns None
    print(" ")

not heard from u4 a while. call me now am here all night with just my knickers on. make me beg for it like u did last time 01223585236 xx luv nikiyu4.net
P(Spam|message): 2.535810010954546e-99
P(Ham|message): 2.8500294939879786e-87
Label: Ham
 
more people are dogging in your area now. call 09090204448 and join like minded guys. why not arrange 1 yourself. there's 1 this evening. a£1.50 minapn ls278bb
P(Spam|message): 8.869058556592091e-86
P(Ham|message): 6.065005527292867e-83
Label: Ham
 
unlimited texts. limited minutes.
P(Spam|message): 6.046145907395024e-12
P(Ham|message): 8.116382971423812e-13
Label: Spam
 
26th of july
P(Spam|message): 2.3882276334210343e-12
P(Ham|message): 1.176163567437907e-12
Label: Spam
 
nokia phone is lovly..
P(Spam|message): 1.5083924809769105e-09
P(Ham|message): 2.2566962084967214e-10
Label: Spam
 
a boy loved a gal. he propsd bt she didnt mind. he gv lv lttrs, bt her frnds threw thm. again d boy decided 2 aproach d gal , dt time 

### Model Evaluation and Misclassification Analysis

Laplace smoothing (α = 1) only prevents zero probabilities; it does not prevent misclassifications. Several spam messages were still labeled "Ham", including those with:
- Adult content
- Premium-rate numbers
- Promotional phrases

#### Key Observations:
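- For the two misclassified spam texts printed above, P(Spam|message) was roughly 12 and 3 orders of magnitude lower than P(Ham|message). The printed values are unnormalized scores, so only their ratio is meaningful.
- Errors also run the other way: short ham messages such as "26th of july" were labeled spam by a narrow margin.
- This suggests the spam word probabilities are estimated from too few examples for the vocabulary of these messages, even with smoothing.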

#### Likely Causes:
- Training data limitations: The model may not have seen enough spam examples with relevant vocabulary.
- Weak feature representation:
  - Words common in spam may also appear in ham messages, making them less discriminative.
  - Lack of bigram/trigram features means context is lost (e.g., "free" vs. "free tonight").
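- Imbalanced dataset: Ham makes up about 87% of the training set, so the prior p_ham tilts predictions toward ham.
- Preprocessing limitations:
  - Phone numbers and URLs are treated as generic words.
  - No normalization of word forms (stemming/lemmatization).
  - Inconsistent tokenization: the training outputs above show tokens that keep punctuation (e.g., "yep," and "havent."), while `classify` strips punctuation with `re.sub`, so such vocabulary entries cannot match at prediction time.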
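
Two of the limitations listed above, generic handling of phone numbers and URLs and the lack of bigram features, can be addressed in the tokenizer. The sketch below is one possible approach and has not been evaluated on this dataset. The placeholder tokens `phonetoken` and `urltoken`, the regular expressions, and the function names are all assumptions for illustration.

```python
import re

# Simple illustrative patterns; real-world URLs and numbers vary more widely.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
PHONE_PATTERN = re.compile(r"\b\d{5,}\b")  # any run of 5 or more digits

def tokenize_with_placeholders(text):
    """Lowercase, map URLs and long digit runs to shared tokens, then split."""
    text = text.lower()
    text = URL_PATTERN.sub(" urltoken ", text)
    text = PHONE_PATTERN.sub(" phonetoken ", text)
    return re.sub(r"\W", " ", text).split()

def add_bigrams(tokens):
    """Append adjacent word pairs so some local context is kept."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

tokens = tokenize_with_placeholders("Call 09090204448 now or visit www.example.com")
features = add_bigrams(tokens)

print(tokens)    # ['call', 'phonetoken', 'now', 'or', 'visit', 'urltoken']
print(features)
```

Mapping every phone number to one shared token lets the model learn that "a message contains a long number" is informative, instead of treating each distinct number as a word it has never seen.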