# SMS messages classification
This project uses the multinomial Naive Bayes algorithm to classify 5,572 SMS messages as either spam or non-spam. The process comprises three steps: 
1. Teaching the computer how to classify messages using a dataset of messages already classified by humans
2. Using this human knowledge to estimate probabilities for new messages (P_spam and P_non_spam)
3. Comparing the probabilities (i.e. P_spam > P_non_spam => the message is spam)
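
The comparison in step 3 rests on the usual Naive Bayes assumption that the words of a message are conditionally independent given its class. As a sketch, for a message made of the words $w_1, \dots, w_n$ (this notation is introduced here for illustration only), the two quantities being compared are proportional to:

```latex
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(\text{Ham} \mid w_1, \dots, w_n) \propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
```

Both expressions share the same normalizing constant, so it can be ignored when only the larger of the two is needed.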

In [32]:
import pandas as pd
import re

messages = pd.read_csv('SMSSpamCollection', sep = '\t', header = None, names = ['Label','SMS'])


print(messages.info(verbose = True), '\n',
      messages.head(),'\n',
      messages.tail(),'\n',
      round(messages['Label'].value_counts(normalize = True) * 100, 1))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB
None 
   Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro... 
      Label                                                SMS
5567  spam  This is the 2nd time we have tried 2 contact u...
5568   ham               Will ü b going to esplanade fr home?
5569   ham  Pity, * was in mood for that. So...any other s...
5570   ham  The guy did some bitching but I acted like i'd...
5571   ham                         Rofl. Its true to its name 
 ham     86.6
spam    13.4
Name: Label, dtype: float64


From the initial exploration we can see that our dataset contains 5,572 messages, of which 13.4% are labeled 'spam' and 86.6% 'ham' (non-spam). The dataset does not contain any missing values.

In [33]:
messages_randomized = messages.sample(frac = 1, random_state = 1)

train = messages_randomized.iloc[:round(5572 * 0.8),:].reset_index(drop = True)
test = messages_randomized.iloc[round(5572 * 0.8):,:].reset_index(drop = True)
print(train['Label'].value_counts(normalize = True) * 100)
print(test['Label'].value_counts(normalize = True) * 100)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64
ham     86.804309
spam    13.195691
Name: Label, dtype: float64


To assess the performance of the classifier, the data has been split into a training set and a test set in an 80% to 20% ratio. The training data will be used to teach the model how to classify unseen messages (in this case, the test set), whilst the already labeled test set will serve as a benchmark for our classification model. Inspection reveals that the proportions of 'spam' and 'ham' (non-spam) messages are nearly the same in both sets (about 13.5% spam in training versus 13.2% in test).
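
The same shuffle-and-cut split can be sketched on a tiny made-up dataframe. This variant derives the cutoff from the number of rows instead of a hard-coded row count. The names `toy`, `toy_train`, `toy_test` and the messages themselves are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical messages).
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': [f'message {i}' for i in range(10)],
})

# Shuffle all rows reproducibly, then cut at 80% of the actual row count.
shuffled = toy.sample(frac=1, random_state=1)
cutoff = round(len(shuffled) * 0.8)

toy_train = shuffled.iloc[:cutoff].reset_index(drop=True)
toy_test = shuffled.iloc[cutoff:].reset_index(drop=True)

print(len(toy_train), len(toy_test))
```

Because the split is random rather than stratified, the class proportions in the two parts are only approximately equal, which is why they are checked after splitting.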

In [34]:
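# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)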
train['SMS'] = train['SMS'].str.lower()
train.head()

  Label                                              SMS
0   ham                      yep by the pretty sculpture
1   ham       yes princess are you going to make me moan
2   ham                       welp apparently he retired
3   ham                                           havent
4   ham  i forgot 2 ask ü all smth there s a card on ...


Punctuation is replaced with spaces and the text is lowercased so that the 'SMS' column can be turned into word frequency columns (our vocabulary).

In [35]:
train['SMS'] = train['SMS'].str.split()
train['SMS'].head()

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, ü, all, smth, there, s, a,...
Name: SMS, dtype: object
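
The cleaning and splitting steps can be illustrated on a single string. This is a minimal sketch: the helper name `clean` and the sample message are made up for the example and are not part of the notebook.

```python
import re

def clean(message):
    """Replace non-word characters with spaces, lowercase, and split into words."""
    message = re.sub(r'\W', ' ', message)
    return message.lower().split()

print(clean("Free entry!! Text WIN to 80086"))
# ['free', 'entry', 'text', 'win', 'to', '80086']
```

Note that an apostrophe also counts as a non-word character, so a contraction such as "Don't" becomes the two tokens 'don' and 't'.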

In [36]:
vocabulary = []
for row in train['SMS']:
    for word in row:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))
print(len(vocabulary))

7783


The list 'vocabulary' contains the unique words from the training messages. It will be used to create a new dataframe (a word frequency table) with one column per word and one row per message.
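
To make the shape of that table concrete, here is a small sketch on two made-up, already tokenized messages. The names `toy_sms`, `toy_vocabulary` and `toy_counts` are hypothetical; the vocabulary is sorted only so that the column order is reproducible.

```python
import pandas as pd

# Two hypothetical, already tokenized messages.
toy_sms = [['free', 'prize', 'free'], ['see', 'you', 'there']]

# Unique words across all messages, sorted for a stable column order.
toy_vocabulary = sorted({word for sms in toy_sms for word in sms})

# One list of counts per word, with one slot per message.
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocabulary}
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

print(pd.DataFrame(toy_counts))
```

Each row of the resulting dataframe corresponds to one message and each column to one vocabulary word, so 'free' is counted twice in the first row and zero times in the second.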

In [37]:
word_count_sms = {unique_word: [0] * len(train['SMS']) 
                  for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_count_sms[word][index] += 1

In [38]:
df_word_count_sms = pd.DataFrame(word_count_sms)
# df_word_count_sms.head()

In [39]:
train_clean = pd.concat([train, df_word_count_sms], axis = 1)
# train_clean.head()

In [40]:
spam_messages = train_clean[train_clean['Label'] == 'spam']
ham_messages = train_clean[train_clean['Label'] == 'ham']

P_spam = len(spam_messages) / len(train_clean)
P_ham = len(ham_messages) / len(train_clean)

N_vocabulary = len(vocabulary)

N_spam = spam_messages['SMS'].apply(len).sum()

N_ham = ham_messages['SMS'].apply(len).sum()

alpha = 1

In [41]:
dict_spam = {}
dict_ham = {}

for word in vocabulary:
    dict_spam[word] = 0
    dict_ham[word] = 0
    
for word in vocabulary:
    n_w_spam = spam_messages[word].sum()
    n_w_ham = ham_messages[word].sum()

    p_w_spam = (n_w_spam + alpha) / (N_spam + alpha * N_vocabulary)
    p_w_ham = (n_w_ham + alpha) / (N_ham + alpha * N_vocabulary)
    
    dict_spam[word] = p_w_spam
    dict_ham[word] = p_w_ham
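
Written as a formula, the loop above computes the smoothed conditional probability of each word. The subscripted symbols below are just a transcription of the variables `n_w_spam`, `N_spam`, `N_vocabulary` and `alpha` from the code:

```latex
P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\qquad
P(w \mid \text{Ham}) = \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
```

With $\alpha = 1$ (Laplace smoothing), a vocabulary word that never occurs in spam still receives a small non-zero probability, so a single such word cannot force the whole product for that class to zero.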

In [45]:
def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = P_spam 
    p_ham_given_message = P_ham
    
    for word in message:
        if word in dict_spam:
            p_spam_given_message *= dict_spam[word]
        if word in dict_ham:
            p_ham_given_message *= dict_ham[word]

        
#     print('P(Spam|message):', p_spam_given_message)
#     print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
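
Multiplying many small probabilities can underflow to zero for very long messages. One common workaround, shown here only as a sketch and not as what the notebook does, is to add log-probabilities instead. The function name `classify_log`, its parameters and the toy probabilities are hypothetical.

```python
import math
import re

def classify_log(message, p_spam, p_ham, word_probs_spam, word_probs_ham):
    """Variant of classify that sums log-probabilities instead of multiplying."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify.
        if word in word_probs_spam:
            log_spam += math.log(word_probs_spam[word])
        if word in word_probs_ham:
            log_ham += math.log(word_probs_ham[word])

    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

# Hypothetical toy probabilities, not values taken from the dataset.
toy_spam = {'free': 0.05, 'see': 0.01}
toy_ham = {'free': 0.005, 'see': 0.04}
print(classify_log('FREE free!', 0.13, 0.87, toy_spam, toy_ham))  # spam
print(classify_log('see you', 0.13, 0.87, toy_spam, toy_ham))     # ham
```

Since the logarithm is monotonic, comparing the sums gives the same decision as comparing the products whenever the products do not underflow. The smoothing step keeps every word probability above zero, so the logarithm is always defined.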

In [46]:
print(classify("Sounds good, Tom, then see u there"))

ham


In [47]:
print(classify('WINNER!! This is the secret code to unlock the money: C3421.'))

spam


In [48]:
test['predicted'] = test['SMS'].apply(classify)
test.head()

  Label                                                SMS  predicted
0   ham          Later i guess. I needa do mcat study too.        ham
1   ham             But i haf enuff space got like 4 mb...        ham
2  spam  Had your mobile 10 mths? Update to latest Oran...       spam
3   ham  All sounds good. Fingers . Makes it difficult ...        ham
4   ham  All done, all handed in. Don't know if mega sh...        ham


In [49]:
test['correct'] = test['Label'] == test['predicted']
test.head()

  Label                                                SMS  predicted  correct
0   ham          Later i guess. I needa do mcat study too.        ham     True
1   ham             But i haf enuff space got like 4 mb...        ham     True
2  spam  Had your mobile 10 mths? Update to latest Oran...       spam     True
3   ham  All sounds good. Fingers . Makes it difficult ...        ham     True
4   ham  All done, all handed in. Don't know if mega sh...        ham     True


In [60]:
# Summing a boolean column counts the True values; this avoids
# looking up the labels True/False with the integers 1/0.
correct = test['correct'].sum()
incorrect = len(test) - correct
accuracy = correct / len(test)
print('Correctly classified:', correct, '\n',
      'Incorrectly classified:', incorrect, '\n',
      'Accuracy:', accuracy * 100)

Correctly classified: 1100 
 Incorrectly classified: 14 
 Accuracy: 98.74326750448833
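
Since about 86.8% of the test messages are 'ham', a classifier that always answers 'ham' would already reach roughly that accuracy, so the result is best read against this baseline. The sketch below shows the comparison on a made-up results table; `toy_results` and its contents are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical results table with true and predicted labels.
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'ham', 'spam', 'spam'],
    'predicted': ['ham', 'ham', 'ham', 'spam', 'ham'],
})

# Share of rows where the prediction matches the true label.
accuracy = (toy_results['Label'] == toy_results['predicted']).mean()

# Accuracy of always predicting the most frequent label.
baseline = toy_results['Label'].value_counts(normalize=True).max()

print('Accuracy:', accuracy, 'Baseline:', baseline)
```

In the toy table the classifier gets 4 of 5 rows right, while always predicting the majority label would get 3 of 5.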


It seems that with accuracy of 98.74% on the