### Practice 4 - separate numbers from letters

This version builds on the 'practice 2' version, which modifies the original by allowing the algorithm to recognize currency symbols.


### This notebook compiles the core structure of the algorithm built in Project 12

Algorithm's purpose(s):  

- 1st, determine whether an SMS (cellphone message) is spam or not.
- 2nd, return the percentage accuracy when applied to a test data set made up of SMSs (pre-labelled as spam or not spam).
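
Both purposes can be illustrated on a toy scale. The sketch below assumes a multinomial Naive Bayes classifier with additive smoothing, which is what the notebook's own code computes further down. The toy messages and the names `train`, `test`, and `toy_classify` are hypothetical and not part of the notebook, and unlike the notebook's function this sketch does not handle ties.

```python
from collections import Counter

# Hypothetical toy data: (label, tokens) pairs, not taken from the SMS data set.
train = [
    ('spam', ['win', 'cash', 'now']),
    ('spam', ['free', 'cash', 'prize']),
    ('ham', ['see', 'you', 'now']),
    ('ham', ['call', 'me', 'later']),
]
test = [('spam', ['free', 'cash']), ('ham', ['see', 'you', 'later'])]

alpha = 1  # additive (Laplace) smoothing parameter
vocabulary = {word for _, tokens in train for word in tokens}

# Word counts and message counts per label.
counts = {'spam': Counter(), 'ham': Counter()}
n_messages = Counter()
for label, tokens in train:
    counts[label].update(tokens)
    n_messages[label] += 1


def toy_classify(tokens):
    """Return the label with the larger prior * product of smoothed word probabilities."""
    scores = {}
    for label in counts:
        n_words = sum(counts[label].values())
        score = n_messages[label] / len(train)  # prior P(label)
        for word in tokens:
            if word in vocabulary:  # words never seen in training are ignored
                score *= (counts[label][word] + alpha) / (n_words + alpha * len(vocabulary))
        scores[label] = score
    return max(scores, key=scores.get)


# 1st purpose: classify each message. 2nd purpose: percentage of correct labels.
predictions = [toy_classify(tokens) for _, tokens in test]
accuracy = 100 * sum(p == label for p, (label, _) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```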

Notes:    

- In this notebook we tweak the three sections of the algorithm where the input messages (strings) are divided into a list of words (also strings):
    - Section 1 - in the training section of the algorithm
    - Section 2 - inside the `classify` function (1st purpose mentioned above)
    - Section 3 - inside the `classify_test_set` function (2nd purpose mentioned above)
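
As a rough sketch of the splitting step these three sections share, the helper below applies the same regular expressions the notebook uses. The function name `tokenize` and the sample message are hypothetical, chosen only for illustration.

```python
import re


def tokenize(message):
    """Split an SMS into lowercase tokens, isolating symbols and runs of digits."""
    # Surround every non-word character (punctuation, currency symbols) with spaces.
    message = re.sub(r'(\W)', r' \1 ', message)
    # Surround every run of digits with spaces, separating numbers from letters.
    message = re.sub(r'(\d+)', r' \1 ', message)
    # `split()` with no argument collapses repeated whitespace and drops empty strings.
    return message.lower().split()


print(tokenize('Win £5000 now! Txt 2day to 81010Nyt'))
# ['win', '£', '5000', 'now', '!', 'txt', '2', 'day', 'to', '81010', 'nyt']
```

Note how `2day` and `81010Nyt` are each broken into a number and a word, and `£` becomes a token of its own.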

In [2]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline

In [7]:
sms_spam_full = pd.read_csv('SMSSpamCollection.txt',
                   sep='\t',
                   names=['Label', 'SMS'])


#-----------------------------------------

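# 1. Shuffle the whole data set (`frac=1`) reproducibly.
random_sms_spam = sms_spam_full.sample(n=None, frac=1, random_state=1).reset_index(drop=True)

# 2. Split into training (first 4458 rows) and testing (remaining rows) sets.
# `.iloc` slices exclude the end position, so `[:4458]` and `[4458:]` do not overlap;
# `[:4458+1]` would place row 4458 in both sets.
training_set = random_sms_spam.iloc[:4458].copy()

testing_set = random_sms_spam.iloc[4458:].copy()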

# 3. Percentage of ham vs spam messages in each set.

# Training set.
count_label_training = training_set.Label.value_counts(normalize=True).round(3)*100

count_label_training = count_label_training.rename('ham vs spam (%)')

# Testing set.
count_label_testing = testing_set.Label.value_counts(normalize=True).round(3)*100

count_label_testing = count_label_testing.rename('ham vs spam (%)')

#---------------------------------------

# Saving RAM 1
del sms_spam_full
del random_sms_spam


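### Tweak section 1 (start) - how a message is split into a list of words


# Surround every non-word character (punctuation, currency symbols, whitespace) with spaces.
ts_cleaned_SMS = training_set.SMS.copy().str.replace(r'(\W)', r' \1 ', regex=True)
# Surround every run of digits with spaces, separating numbers from letters.
ts_cleaned_SMS = ts_cleaned_SMS.str.replace(r'(\d+)', r' \1 ', regex=True)
# ts_cleaned_SMS = ts_cleaned_SMS.str.replace(r'\.', '', regex=True)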

#-------------------------------------------------------------


# `\s+` collapses any run of one or more whitespace characters into a single space.
ts_cleaned_SMS = ts_cleaned_SMS.str.replace(r'\s+', ' ', regex=True)


# Strip leading and trailing spaces.
ts_cleaned_SMS = ts_cleaned_SMS.str.replace(r'(\A +| +\Z)', '', regex=True)


# Lower case for every string.
ts_cleaned_SMS = ts_cleaned_SMS.str.lower()


### Tweak section 1 (end)

#-------------------------------------------------------------

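# 1. Split each message into one column per word (shorter messages are padded with NaN/None).
ts_cleaned_SMS_split = ts_cleaned_SMS.str.split(' ', expand=True) 

# 2. Concatenate all rows into a single Series of words.
ts_cleaned_SMS_split_cat = pd.concat([ts_cleaned_SMS_split.iloc[i, :] for i in range(0, ts_cleaned_SMS_split.shape[0])], ignore_index=True)
# 3. Drop the padding values.
ts_cleaned_SMS_split_cat = ts_cleaned_SMS_split_cat.dropna()

# 4. Convert to a plain list.
ts_cleaned_SMS_split_cat_to_list = ts_cleaned_SMS_split_cat.to_list()

# 5. Remove duplicates.
vocabulary_set = set(ts_cleaned_SMS_split_cat_to_list)

# 6. Back to a list, dropping the empty string if present.
# Deleting items from a list while iterating over it can skip elements, so filter instead.
vocabulary = [word for word in vocabulary_set if word != '']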

#-------------------------------------------------------------
        
# Free RAM 2

del ts_cleaned_SMS_split_cat 
del ts_cleaned_SMS_split_cat_to_list
del vocabulary_set

#-------------------------------------------------------------

def remove_elements(list_x, list_strings):
    """Return a copy of list_x without any of the strings
    found in list_strings.
    """
    
    # Build a new list rather than deleting in place: deleting while
    # iterating over the same list can skip elements.
    return [el for el in list_x if el not in list_strings]


# `expand=False` by default.
ts_cleaned_SMS_split_listed = ts_cleaned_SMS.str.split(' ') 


strings_to_remove = ['']

ts_cleaned_SMS_split_listed_1 = ts_cleaned_SMS_split_listed.copy().apply(lambda x: remove_elements(x, strings_to_remove))

#-------------------------------------------------------------

# From the tutorial.

# 1.
word_counts_per_sms = {unique_word: [0] * len(ts_cleaned_SMS_split_listed_1) for unique_word in vocabulary}

for index, sms in enumerate(ts_cleaned_SMS_split_listed_1):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
              
# 2. Convert dictionary into DataFrame
word_counts_per_sms_df = pd.DataFrame(word_counts_per_sms)


# 3. 
# `sort=False` is required to preserve the order of the columns in a 'first in' fashion.
training_set_2 = pd.concat([training_set, word_counts_per_sms_df], axis=1, sort=False)

#-------------------------------------------------------------

# Free RAM 3

del ts_cleaned_SMS_split_listed
del word_counts_per_sms

#-------------------------------------------------------------

label_counts_ts2 = training_set_2.Label.value_counts(normalize=True)

p_spam = label_counts_ts2.spam

p_ham = label_counts_ts2.ham

n_vocabulary = len(vocabulary)

#-------------------------------------------------------------

# 1.
training_set_2['sum_words_sms'] = training_set_2.iloc[:, 2:].copy().sum(axis=1)

# 2.

n_spam = training_set_2.loc[training_set_2.Label=='spam', 'sum_words_sms'].sum()

n_ham = training_set_2.loc[training_set_2.Label=='ham', 'sum_words_sms'].sum()

alpha = 1 

#-------------------------------------------------------------

# 1. The last column, 'sum_words_sms', is excluded from these
# DataFrames, hence `.iloc[:, 2:-1]`.
sms_spam = training_set_2.iloc[:, 2:-1].copy()[training_set_2.Label=='spam']

sms_ham = training_set_2.iloc[:, 2:-1].copy()[training_set_2.Label=='ham']


# 2.
sms_spam_sum = sms_spam.sum().transpose()

sms_ham_sum = sms_ham.sum().transpose()

# 3.

p_wi_given_spam_dict = {}

p_wi_given_ham_dict = {}

# P(w_i|Spam)
for i in range(0, sms_spam_sum.size):
    index = sms_spam_sum.index[i]
    dividend = sms_spam_sum[index] + alpha
    divisor = n_spam + (alpha*n_vocabulary)
    p_wi_given_spam_dict[index] =  dividend / divisor

    
# P(w_i|Ham)
for i in range(0, sms_ham_sum.size):
    index = sms_ham_sum.index[i]
    dividend = sms_ham_sum[index] + alpha
    divisor = n_ham + (alpha*n_vocabulary)
    p_wi_given_ham_dict[index] =  dividend / divisor
    
#-------------------------------------------------------------

# Free RAM 4

del sms_spam
del sms_ham


#-------------------------------------------------------------


def classify_test_set(message):
    """Takes in a string - a cellphone message (SMS), and returns a classification of whether 
    the message is spam, not spam (ham), or if a human is required to classify the message.
    """
    
    ### Tweak section 3 (start) - how a message is split into a list of words

    message = re.sub(r'(\W)', r' \1 ', message)
    message = re.sub(r'(\d+)', r' \1 ', message)
    message = re.sub(r'\s+', ' ', message)
    message = message.lower() 
    message = message.split() # now a list of strings
    
    ### Tweak section 3 (end)

    
    # Calculating P(Spam|w_1, w_2, ..., w_n) and P(Ham|w_1, w_2, ..., w_n).
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    # Note: if `word` is in neither the spam nor the ham dictionary, the loop simply skips it.
    for word in message:
        
        if word in p_wi_given_spam_dict.keys():
            p_spam_given_message *= p_wi_given_spam_dict[word]
            
        if word in p_wi_given_ham_dict.keys():
            p_ham_given_message *= p_wi_given_ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'Requires human classification.'
    

testing_set['Test'] = testing_set['SMS'].apply(classify_test_set)

#-------------------------------------------------------------

# The comparison yields True/False; multiplying by one converts True to 1 and False to 0.
testing_set['Correct'] = (testing_set['Label'] == testing_set['Test'])*1

test_accuracy = (testing_set['Correct'].sum() / testing_set.shape[0])*100

test_accuracy = test_accuracy.round(2)

print(f'When applied to the messages in the `testing_set` ({testing_set.shape[0]} entries) the test accuracy was approx. {test_accuracy}%.')

When applied to the messages in the `testing_set` (1114 entries) the test accuracy was approx. 98.47%.


In [4]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

ts_cleaned_SMS[60:80]

60    call from 08702490080 - tells u 2 call 09066358152 to claim £ 5000 prize u have 2 enter all ur mobile & personal details @ the prompts careful !

In [5]:
cond_incorrect = testing_set.Correct == 0

incorrect = testing_set[cond_incorrect].copy().reset_index()

pd.options.display.max_colwidth = 500

incorrect[['Label', 'SMS']]

| | Label | SMS |
|---|---|---|
| 0 | spam | Had your mobile 10 mths? Update to latest Orange camera/video phones for FREE. Save £s with Free texts/weekend calls. Text YES for a callback orno to opt out |
| 1 | spam | Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm |
| 2 | spam | URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18 |
| 3 | spam | Get your garden ready for summer with a FREE selection of summer bulbs and seeds worth £33:50 only with The Scotsman this Saturday. To stop go2 notxt.co.uk |
| 4 | spam | 25p 4 alfie Moon's Children in need song on ur mob. Tell ur m8s. Txt Tone charity to 8007 for Nokias or Poly charity for polys: zed 08701417012 profit 2 charity. |
| 5 | spam | FreeMsg: Hey - I'm Buffy. 25 and love to satisfy men. Home alone feeling randy. Reply 2 C my PIX! QlynnBV Help08700621170150p a msg Send stop to stop txts |
| 6 | spam | Want 2 get laid tonight? Want real Dogging locations sent direct 2 ur Mob? Join the UK's largest Dogging Network by txting MOAN to 69888Nyt. ec2a. 31p.msg@150p |
| 7 | spam | As one of our registered subscribers u can enter the draw 4 a 100 G.B. gift voucher by replying with ENTER. To unsubscribe text STOP |
| 8 | spam | Show ur colours! Euro 2004 2-4-1 Offer! Get an England Flag & 3Lions tone on ur phone! Click on the following service message for info! |
| 9 | spam | goldviking (29/M) is inviting you to be his friend. Reply YES-762 or NO-762 See him: www.SMS.ac/u/goldviking STOP? Send STOP FRND to 62468 |
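
The rows listed above are spam messages that the classifier labelled otherwise. A confusion table makes that kind of pattern visible without reading individual messages. The sketch below assumes a frame with `Label` and `Test` columns shaped like `testing_set`; the frame `toy` and its rows are hypothetical, not results from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for `testing_set`: true labels and predicted labels.
toy = pd.DataFrame({
    'Label': ['spam', 'spam', 'ham', 'ham', 'spam'],
    'Test': ['spam', 'ham', 'ham', 'ham', 'ham'],
})

# Rows are the true labels, columns are the predictions.
confusion = pd.crosstab(toy['Label'], toy['Test'])
print(confusion)

# Spam messages predicted as ham (missed spam) in the toy data.
missed_spam = confusion.loc['spam', 'ham']
```

Applied to the real data, the same call would be `pd.crosstab(testing_set['Label'], testing_set['Test'])`.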
