# Jupyter notebook sample

## Introduction

To build a system to classify messages as spam or not spam, we need to start with a labelled dataset of what people already consider spam. There is a labelled dataset at 

https://dq-content.s3.amazonaws.com/433/SMSSpamCollection

We'll read it in and start building a system to use a naive bayes spam filter

In [20]:
# Import libraries
import pandas as pd
import re
import matplotlib.pyplot as plt


In [21]:
# Read in the data, process it a little
data_set = "https://dq-content.s3.amazonaws.com/433/SMSSpamCollection"
data = pd.read_csv(data_set, header=None, names=['Label', 'SMS'], sep='\t')
# We want a random seed
seed = 1


With the data set loaded, lets explore it

In [22]:
print(data.shape)
print(data.head(5))
data["Label"].value_counts()


(5572, 2)
  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


Label
ham     4825
spam     747
Name: count, dtype: int64

In [23]:
spam_ratio = data["Label"].value_counts(normalize=True)["spam"] / data["Label"].value_counts(normalize=True)[
    "ham"] * 100
print(f'The ratio of spam messages is {spam_ratio:.2f}%')

The ratio of spam messages is 15.48%


There's 747 spam messages and 4825 'ham' (non-spam) messages

## Building training and test sets
We need to split this labelled data set into a training set using about 80% of the data, and a test set, by which we'll validate the classification filter. We'll aim for 80% accuracy of the spam filter. Fortunate the scikit-learn module has a function for exactly this purpose


In [24]:

from sklearn.model_selection import train_test_split

# Assuming 'data' is your dataset DataFrame and the target column is labeled as 'label'

# Split the dataset into training (80%) and test (20%) sets
train_set, test_set = train_test_split(data, test_size=0.2, random_state=seed)

train_set = train_set.reset_index(drop=True)
test_set = test_set.reset_index(drop=True)

# Calculate percentage of 'spam' and 'ham' in the full dataset
full_spam_percentage = (data['Label'].value_counts()['spam'] / len(data)) * 100
full_ham_percentage = (data['Label'].value_counts()['ham'] / len(data)) * 100

# Calculate percentage of 'spam' and 'ham' in the training set
train_spam_percentage = (train_set['Label'].value_counts()['spam'] / len(train_set)) * 100
train_ham_percentage = (train_set['Label'].value_counts()['ham'] / len(train_set)) * 100

# Calculate percentage of 'spam' and 'ham' in the test set
test_spam_percentage = (test_set['Label'].value_counts()['spam'] / len(test_set)) * 100
test_ham_percentage = (test_set['Label'].value_counts()['ham'] / len(test_set)) * 100

# Output the percentages
percentages_df = pd.DataFrame({
        "Full dataset": [full_spam_percentage, full_ham_percentage],
        "Training set": [train_spam_percentage, train_ham_percentage],
        "Test set"    : [test_spam_percentage, test_ham_percentage]
}, index=["spam", "ham"]
)

print(percentages_df)


      Full dataset  Training set   Test set
spam     13.406317      13.46197  13.183857
ham      86.593683      86.53803  86.816143


We can see we've generated the data sets and the ratio of spam to non spam remains about the same, so it should be fine for our purposes

## Splitting messages into discrete words
Now we need to process all the strings in the message to split them out into discrete words for the naive bayes filter.

In [25]:
def sms_string_to_words(message):
    # Remove punctuation and convert to lowercase
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    return message.split()


# Extract sample messages from the 'sms' column of the training set
sample_sms = train_set['SMS'].head(5)  # Extract the first 5 messages as test cases

# Apply the function to each sample and print the result
for i, sms in enumerate(sample_sms):
    print(f"Original SMS {i + 1}: {sms}")
    words = sms_string_to_words(sms)
    print(f"Processed SMS {i + 1}: {words}\n")

Original SMS 1: Hi , where are you? We're at  and they're not keen to go out i kind of am but feel i shouldn't so can we go out tomo, don't mind do you?
Processed SMS 1: ['hi', 'where', 'are', 'you', 'we', 're', 'at', 'and', 'they', 're', 'not', 'keen', 'to', 'go', 'out', 'i', 'kind', 'of', 'am', 'but', 'feel', 'i', 'shouldn', 't', 'so', 'can', 'we', 'go', 'out', 'tomo', 'don', 't', 'mind', 'do', 'you']

Original SMS 2: If you r @ home then come down within 5 min
Processed SMS 2: ['if', 'you', 'r', 'home', 'then', 'come', 'down', 'within', '5', 'min']

Original SMS 3: When're you guys getting back? G said you were thinking about not staying for mcr
Processed SMS 3: ['when', 're', 'you', 'guys', 'getting', 'back', 'g', 'said', 'you', 'were', 'thinking', 'about', 'not', 'staying', 'for', 'mcr']

Original SMS 4: Tell my  bad character which u Dnt lik in me. I'll try to change in  &lt;#&gt; . I ll add tat 2 my new year resolution. Waiting for ur reply.Be frank...good morning.
Processed SMS

The function is successfully extracting the words, but it's splitting words when an apostrophe would be used. I'm becomes I m

This shouldn't be much of a problem, but could present some bias in the training data if the randomly selected 80% of the messages contain more people that use contractions compared to those that don't

In [26]:
train_set['SMS'] = train_set['SMS'].str.replace(r'[^\w]', ' ', regex=True)
train_set['SMS'] = train_set['SMS'].str.lower()

vocabulary = []

for message in train_set['SMS']:
    for word in sms_string_to_words(message):
        vocabulary.append(word)

vocabulary_set = set(vocabulary)
vocabulary_list = list(vocabulary_set)

print(f"Vocabulary size: {len(vocabulary_list)}")
print(f"Sample vocabulary words: {vocabulary_list[:20]}")

Vocabulary size: 7753
Sample vocabulary words: ['87077', '82277', 'difficult', 'bimbo', 'testing', 'cn', 'imma', 'q', 'slippery', 'timings', 'stopbcm', 'bcm4284', 'mon', 'weaseling', 'argh', 'stylish', 'previews', 'sweatter', 'act', 'gift']


In [27]:
word_counts_per_sms = {unique_word: [0] * len(train_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train_set['SMS']):
    for word in sms_string_to_words(sms):
        word_counts_per_sms[word][index] += 1

# Now transforming the word_counts_per_sms dictionary into a DataFrame
word_counts_df = pd.DataFrame(word_counts_per_sms).fillna(0)

# Concatenate this DataFrame with the original train_set DataFrame
train_set_clean = pd.concat([train_set, word_counts_df], axis=1)

train_set_clean.shape


(4457, 7755)

Now that the training data is cleaned and ready, need to write the functions that calculate probabilities of spam and non-spam. First lets define all the constants going to be used

In [29]:
total_messages = len(train_set_clean)
spam_messages = train_set_clean[train_set_clean['Label'] == 'spam']
ham_messages = train_set_clean[train_set_clean['Label'] == 'ham']

P_spam = len(spam_messages) / total_messages
P_ham = len(ham_messages) / total_messages

# Step 2: Calculate NSpam (total number of words in spam messages)
N_words_per_spam = train_set_clean[train_set_clean['Label'] == 'spam']['SMS'].apply(len)
N_spam = N_words_per_spam.sum()

# Step 3: Calculate NHam (total number of words in ham messages)
N_words_per_ham = train_set_clean[train_set_clean['Label'] == 'ham']['SMS'].apply(len)
N_ham = N_words_per_ham.sum()

N_vocabulary = len(vocabulary)

# Step 5: Initialize alpha
alpha = 1

# Results summary
results = {
        "P(Spam)"    : P_spam,
        "P(Ham)"     : P_ham,
        "NSpam"      : N_spam,
        "NwSpam"     : N_words_per_spam,
        "NwHam"      : N_words_per_ham,
        "NHam"       : N_ham,
        "NVocabulary": N_vocabulary,
        "Alpha"      : alpha
}

In [32]:
# Initialize two dictionaries for P(wi|Spam) and P(wi|Ham) with vocabulary words as keys and 0 as the initial value
P_wi_given_spam = {word: 0 for word in vocabulary}
P_wi_given_ham = {word: 0 for word in vocabulary}

# Isolate spam and ham messages into two different DataFrames
spam_messages_df = train_set_clean[train_set_clean['Label'] == 'spam']
ham_messages_df = train_set_clean[train_set_clean['Label'] == 'ham']


# Function to count the occurrences of a word in a DataFrame of messages
def count_word_occurrences(word, df):
    return df['SMS'].apply(lambda message: message.count(word)).sum()


# Calculate P(wi|Spam) and P(wi|Ham) for each word in the vocabulary
for word in vocabulary:
    Nwi_spam = count_word_occurrences(word, spam_messages_df)
    Nwi_ham = count_word_occurrences(word, ham_messages_df)

    # Calculate P(wi|Spam) and P(wi|Ham) using the formulas
    P_wi_given_spam[word] = (Nwi_spam + alpha) / (N_spam + alpha * N_vocabulary)
    P_wi_given_ham[word] = (Nwi_ham + alpha) / (N_ham + alpha * N_vocabulary)

# Output the first few values of each dictionary for inspection
P_wi_given_spam_sample = {k: P_wi_given_spam[k] for k in list(P_wi_given_spam)[:5]}
P_wi_given_ham_sample = {k: P_wi_given_ham[k] for k in list(P_wi_given_ham)[:5]}

In [34]:
import re


def classify_message(message):
    # clean the message
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = P_spam
    p_ham_given_message = P_ham
    
    for word in message:
        if word in P_wi_given_spam:
            p_spam_given_message *= P_wi_given_spam[word]
        
        if word in P_wi_given_ham:
            p_ham_given_message *= P_wi_given_ham[word]
    
    if p_spam_given_message > p_ham_given_message:
        return "spam"
    else:
        return "ham"


classify_message("WINNER!! This is the secret code to unlock the money: C3421.")



'spam'

In [35]:
classify_message("Sounds good, Tom, then see u there")

'ham'

In [37]:
def classify_test_set(message):
    # clean the message
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()

    p_spam_given_message = P_spam
    p_ham_given_message = P_ham

    for word in message:
        if word in P_wi_given_spam:
            p_spam_given_message *= P_wi_given_spam[word]

        if word in P_wi_given_ham:
            p_ham_given_message *= P_wi_given_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [38]:
test_set['Prediction'] = test_set['SMS'].apply(classify_test_set)
test_set.head(10)

Unnamed: 0,Label,SMS,Prediction
0,ham,"Yep, by the pretty sculpture",ham
1,ham,"Yes, princess. Are you going to make me moan?",ham
2,ham,Welp apparently he retired,ham
3,ham,Havent.,ham
4,ham,I forgot 2 ask ü all smth.. There's a card on ...,ham
5,ham,Ok i thk i got it. Then u wan me 2 come now or...,ham
6,ham,I want kfc its Tuesday. Only buy 2 meals ONLY ...,ham
7,ham,No dear i was sleeping :-P,ham
8,ham,Ok pa. Nothing problem:-),ham
9,ham,Ill be there on &lt;#&gt; ok.,ham


In [40]:
correct = 0
total = len(test_set)

for index, row in test_set.iterrows():
    prediction = row['Prediction']
    actual_label = row['Label']
    
    if prediction == actual_label:
        correct += 1

accuracy = correct / total
print(accuracy)

0.9883408071748879


The filter was 98.8% accurate, 