<h1><b style="font-size: 50px; color: black">Spam Detection Filter</b></h1>

Using the SMS Spam Collection dataset from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection, which contains more than 5,000 SMS messages labeled as spam or ham (legitimate), we can train a basic machine learning model to estimate whether a message is spam or ham.

In [70]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Import the basic data science modules.

In [71]:
df = pd.read_csv("./spam.csv", encoding='latin-1')
# Drop the three empty, unnamed columns in a single call.
clean = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
clean.rename(columns = {'v1' : 'Category', 'v2' : 'Message'}, inplace=True)
clean["Message"] = clean["Message"].str.replace('[^a-zA-Z]',' ', regex = True)
clean["Message"] = clean["Message"].str.lower()

We read the dataset, drop the three unnamed columns, and rename the remaining columns v1 and v2 to Category and Message. We then replace every non-letter character with a space and convert the text to lower case, which makes the messages easier to split into words and analyze.
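
As a quick illustration of what the cleaning step does to individual messages, the sketch below applies the same two string operations to a made-up two-row Series. The sample text is hypothetical and does not come from the dataset.

```python
import pandas as pd

# Hypothetical sample messages, for illustration only.
sample = pd.Series(["WINNER!! Call 0900-123 now", "Ok, see u at 5pm :)"])

# Same two steps as the notebook: non-letters become spaces, then lower case.
cleaned = sample.str.replace('[^a-zA-Z]', ' ', regex=True).str.lower()

# split() with no arguments collapses the runs of spaces left behind.
tokens = cleaned.str.split()
print(tokens.tolist())
```

Digits and punctuation disappear entirely, so a token such as "5pm" is reduced to "pm".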

In [74]:

train_data, test_data = train_test_split(clean, test_size=0.25, random_state=42)

We separate the data into two parts: a training set (75%), used to fit the model that predicts whether a message is spam or ham, and a test set (25%), held out to measure how accurate the model is.

In [75]:
train_data['Message'] = train_data['Message'].str.split()
words = []
for i in train_data['Message']:
    for word in i:
        words.append(word)
words = list(set(words))
word_freq = pd.DataFrame(words)
word_freq['#Spam'] = 0
word_freq['#Ham'] = 0
word_freq.rename(columns={word_freq.columns[0]: "Word" }, inplace = True)
for i in range(0,len(train_data['Message'])):
    spam_ham = train_data.iloc[i]['Category']
    for word in train_data.iloc[i]['Message']:
        location = word_freq.loc[word_freq['Word'] == word]
        if spam_ham == "spam":
            word_freq.at[location.index[0],'#Spam']+=1
        elif spam_ham == "ham":
            word_freq.at[location.index[0],'#Ham']+=1

Create a separate table that records, for each distinct word in the training set, the number of times it appears in <em>spam</em> messages and the number of times it appears in <em>ham</em> messages.
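
The same per-word tallies can be sketched more compactly with `collections.Counter`. This is an alternative illustration on a hypothetical toy training set, not the notebook's own code. Like the loop above, it counts every occurrence of a word, so a word repeated within one message is counted more than once.

```python
from collections import Counter
import pandas as pd

# Hypothetical toy training data: (category, already-tokenised message).
toy_train = [
    ("spam", ["free", "prize", "call", "free"]),
    ("ham", ["call", "me", "later"]),
    ("spam", ["prize", "now"]),
]

# One counter per class; update() adds one per word occurrence.
spam_counts, ham_counts = Counter(), Counter()
for category, tokens in toy_train:
    if category == "spam":
        spam_counts.update(tokens)
    elif category == "ham":
        ham_counts.update(tokens)

# Build a table with one row per distinct word (a Counter returns 0 for missing words).
vocab = sorted(set(spam_counts) | set(ham_counts))
toy_freq = pd.DataFrame({
    "Word": vocab,
    "#Spam": [spam_counts[w] for w in vocab],
    "#Ham": [ham_counts[w] for w in vocab],
})
print(toy_freq)
```

Each message is visited once and each word update is a dictionary operation, which avoids searching the whole word table for every token.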

In [76]:
word_prob = word_freq
total_spam = (train_data["Category"] == "spam").sum()
total_ham = (train_data["Category"] == "ham").sum()
word_prob["P(E|S)"] = (word_prob["#Spam"] + 0.5) / (total_spam + 1)
word_prob["P(E|¬S)"] = (word_prob["#Ham"] + 0.5) / (total_ham + 1)

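Calculate the probability of each word appearing given that the message is spam, P(E|S), or ham, P(E|¬S). Adding 0.5 to each count and 1 to each message total keeps the estimate above zero for words that never appear in one of the classes. Because the counts are word occurrences while the totals are message counts, these values are rough estimates rather than strict probabilities.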
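
To make the smoothing concrete, here is a small sketch of the same formula written as a function. The name `smoothed_prob` and the example counts are hypothetical; only the `+ 0.5` and `+ 1` constants come from the cell above.

```python
def smoothed_prob(word_count, total_messages):
    """(count + 0.5) / (total + 1): never zero, even for unseen words."""
    return (word_count + 0.5) / (total_messages + 1)

# Hypothetical numbers: 99 spam messages, a word counted 0 or 9 times.
p_unseen = smoothed_prob(0, 99)   # 0.5 / 100
p_seen = smoothed_prob(9, 99)     # 9.5 / 100
print(p_unseen, p_seen)
```

Without the added constants, a word never seen in spam would get probability 0, and taking `math.log` of it later would raise a `ValueError`.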

In [77]:
def spam_ham_prob_check_message_log(message):
    # Classify a cleaned message by comparing log scores for spam and ham.
    message_split = message.split()
    prob_spam = []
    prob_ham = []
    # Priors: the share of spam and ham messages in the training set.
    prior_value_spam = total_spam / (total_ham + total_spam)
    prior_value_ham = total_ham / (total_ham + total_spam)
    for i in message_split:
        # Words that never occurred in the training set are skipped.
        if i in words:
            location = word_freq.loc[word_freq['Word'] == i]
            prob_spam.append(location['P(E|S)'].values[0])
            prob_ham.append(location['P(E|¬S)'].values[0])
    # Start from the log prior and add the log of each word probability.
    log_probability_spam = math.log(prior_value_spam)
    log_probability_ham = math.log(prior_value_ham)
    for i in prob_spam:
        log_probability_spam += math.log(i)
    for i in prob_ham:
        log_probability_ham += math.log(i)
    # The class with the larger log score wins; an exact tie is reported as '50/50'.
    if log_probability_spam > log_probability_ham:
        return 'spam'
    elif log_probability_ham > log_probability_spam:
        return 'ham'
    else:
        return '50/50'
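
The function adds logarithms rather than multiplying the raw probabilities. A likely reason (an assumption, since the notebook does not state it) is numerical: a product of many small probabilities underflows to zero in floating point, whereas the sum of their logs stays finite. A minimal sketch with made-up values:

```python
import math

# Hypothetical: 400 word probabilities of 0.01 each.
probs = [0.01] * 400

# The true product, 1e-800, is far below the smallest positive float.
product = math.prod(probs)

# The equivalent quantity in log space is an ordinary negative number.
log_sum = sum(math.log(p) for p in probs)

print(product)   # underflows to 0.0
print(log_sum)   # finite and negative
```

Because the logarithm is increasing, comparing the two log scores picks the same class as comparing the two products would, as long as neither product underflows.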

In [80]:
match_spam = 0
match_ham = 0
thought_ham_is_spam = 0
thought_spam_is_ham = 0
for i in test_data.index:
    message = test_data['Message'][i]
    category = test_data['Category'][i]
    test_category = spam_ham_prob_check_message_log(message)
    if test_category == category and category == "spam":
        match_spam += 1
    elif test_category == category and category == "ham":
        match_ham += 1
    elif test_category != category and test_category == "spam":
        thought_ham_is_spam += 1
    elif test_category != category and test_category == "ham":
        thought_spam_is_ham += 1
accuracy = (match_spam + match_ham) / (match_spam + match_ham + thought_ham_is_spam + thought_spam_is_ham)

In [81]:
print("spam matched by spam = " + str(match_spam))
print("ham matched by ham = " + str(match_ham))
print("ham matched by spam = " + str(thought_ham_is_spam))
print("spam matched by ham = "+str(thought_spam_is_ham))
print("Accuracy: "+str(accuracy*100)+"%")

spam matched by spam = 184
ham matched by ham = 1106
ham matched by spam = 96
spam matched by ham = 7
Accuracy: 92.60588657573582%
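
Accuracy alone hides how the errors are split between the two classes. The sketch below recomputes accuracy from the four counts printed above and adds precision and recall for the spam class. These two metrics are an addition for illustration and are not computed in the notebook.

```python
# Counts copied from the printed output above.
match_spam = 184            # spam predicted as spam
match_ham = 1106            # ham predicted as ham
thought_ham_is_spam = 96    # ham predicted as spam
thought_spam_is_ham = 7     # spam predicted as ham

total = match_spam + match_ham + thought_ham_is_spam + thought_spam_is_ham
accuracy = (match_spam + match_ham) / total

# Precision: of everything flagged as spam, how much really was spam.
spam_precision = match_spam / (match_spam + thought_ham_is_spam)
# Recall: of all real spam, how much was flagged.
spam_recall = match_spam / (match_spam + thought_spam_is_ham)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {spam_precision:.4f}")
print(f"recall    = {spam_recall:.4f}")
```

With these counts, about two thirds of the messages flagged as spam are actually spam (184 of 280), while about 96% of the real spam is caught (184 of 191). Most of the model's errors are therefore ham messages misclassified as spam.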
