<b>Bayes Spam Filter</b>

On the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) page, there is a dataset of more than 5,000 SMS messages that are already labeled as spam or not spam ("ham"). My goal here is to write a classifier that labels new messages correctly with more than 80% accuracy. The classifier is based on the multinomial naive Bayes algorithm.

In [1]:
import pandas as pd
import re

msg  = pd.read_csv("msg", sep = "\t", header = None, names = ["Label","Text"])
print(msg.shape)
msg.head()

(5572, 2)


| | Label | Text |
| --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [2]:
pct_ham = msg["Label"].value_counts(normalize=True).loc["ham"]
pct_spam = msg["Label"].value_counts(normalize=True).loc["spam"]

print("Spam percentage : {}".format(round(pct_spam,4)))
print("No spam percentage : {}".format(round(pct_ham,4)))

Spam percentage : 0.1341
No spam percentage : 0.8659


The messages are shuffled and split into a training set (80%) and a test set (20%).

In [3]:
data_random_sort = msg.sample(frac=1,random_state=1) # random order

len_training = round(len(data_random_sort) * 0.8)

msg_training =  data_random_sort.iloc[:len_training,:].reset_index(drop=True)
msg_test = data_random_sort.iloc[len_training:,:].reset_index(drop=True)

In [4]:
pct_ham_training = msg_training["Label"].value_counts(normalize=True).loc["ham"]
pct_spam_training = msg_training["Label"].value_counts(normalize=True).loc["spam"]

pct_ham_test = msg_test["Label"].value_counts(normalize=True).loc["ham"]
pct_spam_test = msg_test["Label"].value_counts(normalize=True).loc["spam"]

print("Spam percentage in training : {}".format(round(pct_spam_training,4)))
print("Spam percentage in test: {}".format(round(pct_spam_test,4)))
print("\n")
print("No spam percentage in training: {}".format(round(pct_ham_training,4)))
print("No Spam percentage in test: {}".format(round(pct_ham_test,4)))

Spam percentage in training : 0.1346
Spam percentage in test: 0.132


No spam percentage in training: 0.8654
No Spam percentage in test: 0.868


**Efficient Formatting**

Formatting goal: each word that occurs in the training set gets its own column, which holds the number of times that word occurs in each message. A few example rows could look like this:

| Index | Label | Word1 | Word2 | Word3 | ... |
| --- | --- | --- | --- | --- | --- |
| 435 | ham | 1 | 0 | 2 | ... |
| 436 | ham | 0 | 1 | 1 | ... |

Formatted in this way, all the probabilities needed to determine the label can be computed easily using the multinomial naive Bayes technique.
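To make the target layout concrete, here is a minimal sketch on two made-up messages, using only the standard library. The messages and the helper name `count_table` are illustrative assumptions and are not part of the dataset or of the code used in this notebook.

```python
from collections import Counter

def count_table(messages):
    """Return (vocabulary, rows): one count per vocabulary word for each message."""
    tokenized = [m.split() for m in messages]
    # Vocabulary in order of first occurrence, without duplicates.
    vocabulary = list(dict.fromkeys(w for words in tokenized for w in words))
    rows = []
    for words in tokenized:
        counts = Counter(words)  # missing words count as 0
        rows.append([counts[w] for w in vocabulary])
    return vocabulary, rows

# Two hypothetical, already cleaned messages.
vocabulary, rows = count_table(["win cash win", "see you later"])
print(vocabulary)  # ['win', 'cash', 'see', 'you', 'later']
for row in rows:
    print(row)     # [2, 1, 0, 0, 0] and then [0, 0, 1, 1, 1]
```

Each inner list corresponds to one row of the table above, with one entry per word column.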

Adjustments: 

* remove punctuation
* format all letters to lowercase

Additional note: for possible later improvements, the number of uppercase letters and exclamation marks in each message is recorded before the text is cleaned.
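As a rough illustration of these two steps on a single string, the sketch below counts the surface features first and then cleans the text. The function names `normalize` and `surface_features` and the sample message are assumptions made for this example only.

```python
import re

def normalize(text):
    """Replace every non-word character with a space and lowercase the result."""
    return re.sub(r"\W", " ", text).lower()

def surface_features(text):
    """Count exclamation marks and ASCII uppercase letters in the raw text."""
    exclamations = text.count("!")
    uppercase = sum(1 for ch in text if "A" <= ch <= "Z")
    return exclamations, uppercase

raw = "WIN a prize!! Call now!"   # hypothetical message
features = surface_features(raw)  # must run before cleaning removes the information
tokens = normalize(raw).split()
print(features)  # (3, 4)
print(tokens)    # ['win', 'a', 'prize', 'call', 'now']
```

The order matters: once the text is lowercased and stripped of punctuation, the two counts can no longer be recovered.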

In [5]:
def count_exclamation_marks(text):
    # Number of "!" characters in the message
    return text.count("!")

In [6]:
def count_upper_cases(text):
    # Number of ASCII uppercase letters (A-Z) in the message
    return sum(1 for letter in text if "A" <= letter <= "Z")

In [7]:
exclamation_training = msg_training["Text"].apply(count_exclamation_marks).rename("exclamation")
exclamation_test = msg_test["Text"].apply(count_exclamation_marks).rename("exclamation")

In [8]:
upper_letter_training = msg_training["Text"].apply(count_upper_cases).rename("upper")
upper_letter_test = msg_test["Text"].apply(count_upper_cases).rename("upper")

In [9]:
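# Replace every non-word character with a space, then lowercase.
# regex=True is passed explicitly because recent pandas versions default to literal matching.
msg_training["Text"] = msg_training["Text"].str.replace(r"\W", " ", regex=True).str.lower()
original_test_msg = msg_test["Text"].reset_index(drop=True)
msg_test["Text"] = msg_test["Text"].str.replace(r"\W", " ", regex=True).str.lower()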



In [10]:
# List of unique words in the training set, in order of first occurrence
word_list = list(dict.fromkeys(
    word for text in msg_training["Text"].str.split() for word in text
))
len(word_list)

7783

Initialization of a dictionary that maps every word to a frequency of zero.

In [11]:
word_dict = dict(zip(word_list,[0] * len(word_list)))

In [12]:
def count_words(frame,in_col = "Text"):
    all_dicts = []
    for row_num in range(len(frame)):
        text_col = frame.columns.get_loc(in_col)
        text = frame.iloc[row_num,text_col].split()
        current_dic = word_dict.copy()
        for word in text:
            current_dic[word] += 1
        all_dicts.append(current_dic)
    return all_dicts 

In [13]:
word_count_frame = pd.DataFrame(count_words(msg_training))

In [14]:
training = pd.concat([msg_training,word_count_frame],axis=1)
training.shape
training.iloc[0:3,0:12]

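| | Label | Text | yep | by | the | pretty | sculpture | yes | princess | are | you | going |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ham | yep by the pretty sculpture | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | yes princess are you going to make me moan | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 2 | ham | welp apparently he retired | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |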


<b>Multinomial Bayes Algorithm</b>

By Bayes' theorem, the probability that a new message belongs to a given category (spam or not spam) is:

$$P(Label|Msg) = \frac{P(Msg|Label) * P(Label)}{P(Msg)}$$

The denominator of this fraction is the same for both categories, since it does not depend on the category, so only the numerator matters when the two categories are compared. The naive Bayes algorithm makes the simplifying assumption that, given the category, the words in a message occur independently of one another. This assumption rarely holds: if the word "winner" occurs in a message, the word "prize" is more likely to appear in the words that follow. Despite this strong assumption, naive Bayes produces good results. Dropping the common denominator and applying the independence assumption allows the following simplification:

$$ P(Label | Word_1, ... , Word_n)~{\displaystyle \propto}~P(Label)~*~\prod_{i = 1}^{n} P(Word_i | Label)   $$
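A small worked example may help here. The priors and per-word probabilities below are invented for illustration and have nothing to do with the values estimated from the SMS data; the sketch only shows how the proportional score is formed and compared.

```python
import math

# Hypothetical priors and per-word likelihoods, chosen only for illustration.
prior = {"spam": 0.2, "ham": 0.8}
likelihood = {
    "spam": {"win": 0.05, "prize": 0.04, "later": 0.001},
    "ham": {"win": 0.002, "prize": 0.001, "later": 0.03},
}

def unnormalized_posterior(words, label):
    """P(label) times the product of P(word | label) over the message's words."""
    return prior[label] * math.prod(likelihood[label][w] for w in words)

message = ["win", "prize"]
scores = {label: unnormalized_posterior(message, label) for label in prior}
prediction = max(scores, key=scores.get)

# Dividing by the sum of both scores restores a proper probability.
posterior_spam = scores["spam"] / sum(scores.values())
print(scores, prediction, posterior_spam)
```

Here the spam score is 0.2 * 0.05 * 0.04 and the ham score is 0.8 * 0.002 * 0.001, so the message is assigned to spam even though ham has the larger prior.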

Words that were never observed for a category in the training set are problematic. Without a correction, their estimated conditional probability is zero, which drives the whole product, and with it the estimated probability of that category, to zero. To avoid this, Laplace smoothing with $\alpha = 1$ is used. The conditional probability for the occurrence of a word given the category is then equal to:

$$ P(Word_i|Label) = \frac{N_{w_i|Label}+\alpha}{N_{Label} + \alpha *N_{total}} $$

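Here $N_{w_i|Label}$ is the number of times $Word_i$ occurs in the training messages with the given label, $N_{Label}$ is the total number of words in those messages, and $N_{total}$ is the number of unique words in the training set (the vocabulary size). First, the constants are calculated: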

In [15]:
training_ham = training[training["Label"]=="ham"]
training_spam = training[training["Label"]=="spam"]

In [16]:
p_ham = training_ham.shape[0]/msg_training.shape[0]
p_spam = training_spam.shape[0]/msg_training.shape[0]

In [17]:
n_unique = len(word_list)
n_ham = training_ham.iloc[:,2:].sum().sum()
n_spam = training_spam.iloc[:,2:].sum().sum()

In [18]:
alpha = 1

The remaining terms, the per-word counts for each category, are accumulated in dictionaries.

In [19]:
def full_dict_words(frame,in_col = "Text"):
    full = word_dict.copy()
    for row_num in range(len(frame)):
        text_col = frame.columns.get_loc(in_col)
        text = frame.iloc[row_num,text_col].split()
        for word in text:
            full[word] += 1
    return full

In [20]:
dict_words_ham = full_dict_words(training_ham)
dict_words_spam = full_dict_words(training_spam)

In [21]:
p_w_ham = { k : (dict_words_ham[k] + alpha)/(n_ham + alpha * n_unique) for k in dict_words_ham}
p_w_spam = { k : (dict_words_spam[k] + alpha)/(n_spam + alpha * n_unique) for k in dict_words_spam}
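The effect of the smoothing can be checked on a toy case. The counts below are hypothetical (a category with 100 words in total and a vocabulary of 50 words), and `smoothed_probability` is a helper name introduced only for this sketch.

```python
import math

def smoothed_probability(word_count, label_word_total, vocabulary_size, alpha=1):
    """(N_word|label + alpha) / (N_label + alpha * vocabulary_size)."""
    return (word_count + alpha) / (label_word_total + alpha * vocabulary_size)

# Hypothetical counts: 100 words in all messages of one category, 50 unique words overall.
seen = smoothed_probability(4, 100, 50)    # word seen 4 times in this category
unseen = smoothed_probability(0, 100, 50)  # word never seen in this category
print(seen, unseen)  # 5/150 and 1/150: the unseen word keeps a small non-zero probability

# If every one of the 50 words occurs twice (100 words in total),
# the smoothed probabilities still sum to 1.
total = sum(smoothed_probability(2, 100, 50) for _ in range(50))
print(math.isclose(total, 1.0))  # True
```

Adding alpha to every count and alpha times the vocabulary size to the denominator keeps the estimates a valid distribution over the vocabulary.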

The classify function computes the two scores (the numerators of the conditional probabilities) for a new message and assigns the message to the category with the higher score. Words that do not occur in the training vocabulary are ignored.

In [22]:
def classify(msg):
    
    text = re.sub(r"\W", " ", msg)  # raw string avoids an invalid escape sequence warning
    text = text.lower().split()
    p_spam_given_msg = p_spam
    p_ham_given_msg = p_ham
    
    for word in text:
        
        if word in dict_words_spam:
            p_spam_given_msg *= p_w_spam[word]
        if word in dict_words_ham:
            p_ham_given_msg *= p_w_ham[word]
    
    print("P(Ham|Msg)={}".format(p_ham_given_msg))
    print("P(Spam|Msg)={}".format(p_spam_given_msg))
    if p_ham_given_msg >= p_spam_given_msg:
        return "ham"
    else:
        return "spam"

In [23]:
classify("Hey. Whats up? Wanna get some burgers later?") # Example1

P(Ham|Msg)=2.488613716142424e-21
P(Spam|Msg)=1.088052485084291e-26


'ham'

In [30]:
classify("Congratulations! You are customer number 100.000. Call 123123 to claim your price") # Example2

P(Ham|Msg)=8.575406141432293e-41
P(Spam|Msg)=4.969879250082834e-32


'spam'
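The scores printed above are already as small as 1e-41 for a short message. For much longer messages a product of many small factors can underflow to 0.0 in floating point, so that both categories tie. A common remedy, which this notebook does not use, is to add logarithms instead of multiplying probabilities. The sketch below assumes a hypothetical helper `log_score` and invented per-word probabilities.

```python
import math

def log_score(words, prior, word_probs):
    """log P(label) + sum of log P(word | label), skipping unknown words."""
    total = math.log(prior)
    for word in words:
        if word in word_probs:
            total += math.log(word_probs[word])
    return total

# Hypothetical per-word probabilities, for illustration only.
probs_spam = {"call": 1e-3, "now": 2e-3}
probs_ham = {"call": 5e-4, "now": 1e-3}
words = ["call", "now"] * 100  # an artificially long message

# The plain product underflows to exactly 0.0.
product_spam = 0.13 * math.prod(probs_spam[w] for w in words)
print(product_spam)

# The log scores stay finite and can still be compared.
log_spam = log_score(words, 0.13, probs_spam)
log_ham = log_score(words, 0.87, probs_ham)
print("spam" if log_spam > log_ham else "ham")
```

Because the logarithm is monotonic, comparing log scores gives the same decision as comparing the products whenever the products are representable.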

In [25]:
def classify_test_set(msg):
    
    text = re.sub(r"\W", " ", msg)  # raw string avoids an invalid escape sequence warning
    text = text.lower().split()
    p_spam_given_msg = p_spam
    p_ham_given_msg = p_ham
    
    for word in text:
        
        if word in dict_words_spam:
            p_spam_given_msg *= p_w_spam[word]
        if word in dict_words_ham:
            p_ham_given_msg *= p_w_ham[word]
    
    if p_ham_given_msg >= p_spam_given_msg:
        return "ham"
    else:
        return "spam"
    

In [26]:
msg_test["Prediction"]=msg_test["Text"].apply(classify_test_set)
msg_test["Correctly classified"] = (msg_test["Prediction"] == msg_test["Label"])

In [27]:
round(msg_test["Correctly classified"].sum()/len(msg_test) * 100,2)


98.83

Over 98% of the messages in the test set are classified correctly. The misclassified messages are mainly very short ones or ones that are hard to recognize as "spam" or "not spam" even for a human reader. An obvious improvement would be to include capitalization and the number of exclamation marks in the classification; however, this improves the result only marginally for this test set. For the time being, I'm satisfied with a rate of over 98% and may look for further improvements later.

In [28]:
msg_test.loc[(msg_test["Prediction"] == "ham") & (msg_test["Label"] == "spam"),:]

| | Label | Text | Prediction | Correctly classified |
| --- | --- | --- | --- | --- |
| 114 | spam | not heard from u4 a while call me now am here... | ham | False |
| 135 | spam | more people are dogging in your area now call... | ham | False |
| 504 | spam | oh my god i ve found your number again i m s... | ham | False |
| 546 | spam | hi babe its chloe how r u i was smashed on s... | ham | False |
| 741 | spam | 0a networks allow companies to bill for sms s... | ham | False |
| 876 | spam | rct thnq adrian for u text rgds vatian | ham | False |
| 885 | spam | 2 2 146tf150p | ham | False |
| 953 | spam | hello we need some posh birds and chaps to us... | ham | False |


In [29]:
msg_test.loc[(msg_test["Prediction"] == "spam") & (msg_test["Label"] == "ham"),:]

| | Label | Text | Prediction | Correctly classified |
| --- | --- | --- | --- | --- |
| 152 | ham | unlimited texts limited minutes | spam | False |
| 159 | ham | 26th of july | spam | False |
| 284 | ham | nokia phone is lovly | spam | False |
| 302 | ham | no calls messages missed calls | spam | False |
| 319 | ham | we have sent jd for customer service cum accou... | spam | False |
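
The two error listings show 8 spam messages predicted as ham and 5 ham messages predicted as spam. Assuming these listings are complete, the accuracy can be re-derived by hand, and precision and recall for the spam class can be estimated. The count of 147 spam messages in the test set is my own derivation from the reported spam share of 0.132 and the test-set size, not a value printed by the notebook.

```python
# Counts taken or derived from the figures reported in this notebook.
n_test = 5572 - round(5572 * 0.8)  # 1114 messages in the test set
n_spam = 147                       # derived from the 0.132 spam share (assumption)
false_negatives = 8                # spam predicted as ham
false_positives = 5                # ham predicted as spam

true_positives = n_spam - false_negatives
accuracy = (n_test - false_negatives - false_positives) / n_test
precision = true_positives / (true_positives + false_positives)
recall = true_positives / n_spam

print(round(accuracy * 100, 2))  # 98.83, matching the value computed earlier
print(round(precision, 4), round(recall, 4))
```

Since only about 13% of the messages are spam, precision and recall for the spam class say more about the filter's behavior than accuracy alone.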
