### Contents <a class="anchor" id="sections"></a>

* [1. Getting started](#section1)
* [2. Preprocess message data](#section2)
* [3. Categorise message text](#section3) 
    - `spam_tokens`  `ham_tokens`
* [4. Classify message](#section4)
    - `spam_counter`  `ham_counter`
* [5. Test spam filter](#section5)
    * [5.1 Using custom data](#section5.1)
    * [5.2 Using a custom pandas DataFrame](#section5.2)
* [6. Test observations](#section6)
* [7. Investigating the classification errors](#section7)
    * [7.1 Ham tokens with large values](#section7.1)
    * [7.2 Excluding ham tokens with a value of 100 or more](#section7.2)
    * [7.3 Test solution](#section7.3)
    * [7.4 Test observations II](#section7.4)
* [8. A new classification error](#section8)  
    * [8.1 The 'U' problem](#section8.1)
    * [8.2 Excluding spam tokens with a value of 100 or more](#section8.2)
    * [8.3 Test solution II](#section8.3)
    * [8.4 Further testing](#section8.4)

### 1. Getting started <a class="anchor" id="section1"></a>

In [1]:
import nltk
import pandas as pd
import string

In [2]:
data = pd.read_csv('data/spam.csv', 
                   encoding='ISO-8859-1',  
                   header=0, 
                   usecols=range(2), 
                   names=['label','sms'])

print(data['label'].value_counts())
display(data.head(), data.tail())

ham     4825
spam     747
Name: label, dtype: int64


Unnamed: 0,label,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Unnamed: 0,label,sms
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


### 2. Preprocess message data <a class="anchor" id="section2"></a><a class="anchor" id="function1"></a>

In [3]:
# Function 1
def preprocessMessage(message):
    '''
    Prepares message data for further analysis
    
    Parameters
    ----------
    message : str
        the message
        
    Returns
    -------
    list of str
        the processed message
    '''
    stopwords = nltk.corpus.stopwords.words('english')
    symbols = string.punctuation
    
    stepOne = ''.join([character.lower() for character in message if character not in symbols]) 
    stepTwo = nltk.tokenize.word_tokenize(stepOne)
    stepThree = [word for word in stepTwo if word not in stopwords]
    
    return stepThree


# Apply function to return a new column
data['sms_processed'] = data['sms'].apply(preprocessMessage)
display(data.head(), data.tail())

Unnamed: 0,label,sms,sms_processed
0,ham,"Go until jurong point, crazy.. Available only ...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...","[nah, dont, think, goes, usf, lives, around, t..."


Unnamed: 0,label,sms,sms_processed
5567,spam,This is the 2nd time we have tried 2 contact u...,"[2nd, time, tried, 2, contact, u, u, å£750, po..."
5568,ham,Will Ì_ b going to esplanade fr home?,"[ì, b, going, esplanade, fr, home]"
5569,ham,"Pity, * was in mood for that. So...any other s...","[pity, mood, soany, suggestions]"
5570,ham,The guy did some bitching but I acted like i'd...,"[guy, bitching, acted, like, id, interested, b..."
5571,ham,Rofl. Its true to its name,"[rofl, true, name]"
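
Dropping punctuation character by character joins words that were separated only by punctuation, which is why `So...any` appears as `soany` in row 5569. The sketch below contrasts that with one possible alternative, mapping each punctuation character to a space. It is an illustration only: the sample string is a hypothetical fragment, and `str.split` stands in for the NLTK tokeniser so the example stays self-contained.

```python
import string

text = 'So...any ideas?'  # hypothetical fragment, not taken from the dataset

# Approach used in preprocessMessage: drop punctuation characters outright
dropped = ''.join(ch.lower() for ch in text if ch not in string.punctuation)

# Alternative (an assumption, not the notebook's method):
# map every punctuation character to a space before splitting
to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = text.lower().translate(to_space)

print(dropped.split())  # ['soany', 'ideas']
print(spaced.split())   # ['so', 'any', 'ideas']
```

The trade-off is that contractions would also be split (`don't` would become `don` and `t`), so neither option is strictly better.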


### 3. Categorise message text <a class="anchor" id="section3"></a><a class="anchor" id="function2"></a>

In [4]:
# Function 2
# - detecting patterns is a central part of NLP
def categoriseWords():
    '''
    Categorises words/tokens found in (the now processed) messages
    
    Returns
    -------
    tuple of (list of str, list of str)
        the words associated with the spam message label
        the words associated with the ham message label
    '''
    
    spam_tokens=[]
    ham_tokens=[]
    
    # Spam tokens
    for message in data['sms_processed'][data['label']=='spam']:
        for each_word in message:
            spam_tokens.append(each_word)
            
    # Ham tokens
    for message in data['sms_processed'][data['label']=='ham']:
        for each_word in message:
            ham_tokens.append(each_word)        
    
    return spam_tokens, ham_tokens

# Call function & assign each list to a variable 
spam_tokens, ham_tokens = categoriseWords()

In [5]:
# Spam tokens
print('Spam token total: %s' %len(spam_tokens))
print('Unique spam tokens: %s' %len(set(spam_tokens)))
print('Spam token examples: %s' %spam_tokens[:10])
# free 
print('Example spam_tokens.count(token): {free: %s' %spam_tokens.count('free')+'}')

# Ham tokens
print('\nHam token total: %s' %len(ham_tokens))
print('Unique ham tokens: %s' %len(set(ham_tokens)))
print('Ham token examples: %s' %ham_tokens[-10:])
# free
print('Example ham_tokens.count(token): {free: %s' %ham_tokens.count('free')+'}')

Spam token total: 12516
Unique spam tokens: 2926
Spam token examples: ['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts']
Example spam_tokens.count(token): {free: 216}

Ham token total: 39918
Unique ham tokens: 7427
Ham token examples: ['something', 'else', 'next', 'week', 'gave', 'us', 'free', 'rofl', 'true', 'name']
Example ham_tokens.count(token): {free: 59}
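
Each call to `list.count` rescans the whole token list, which adds up once every token of every message is looked up. A minimal sketch of one alternative, assuming the token lists are summarised once with `collections.Counter`; the helper name `count_tokens` and the toy messages are hypothetical and not part of this notebook.

```python
from collections import Counter

def count_tokens(messages):
    """Return a Counter mapping each token to its total number of occurrences."""
    counts = Counter()
    for tokens in messages:
        counts.update(tokens)
    return counts

# Hypothetical toy data standing in for data['sms_processed'], split by label
toy_spam = [['free', 'entry', 'win'], ['free', 'call', 'now']]
toy_ham = [['ok', 'lar'], ['call', 'later', 'ok']]

spam_counts = count_tokens(toy_spam)
ham_counts = count_tokens(toy_ham)

print(spam_counts['free'])  # 2
print(ham_counts['free'])   # 0, because a Counter returns zero for missing tokens
```

A lookup such as `spam_counts['free']` then takes the place of `spam_tokens.count('free')`.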


### 4. Classify message <a class="anchor" id="section4"></a><a class="anchor" id="function3"></a>

In [6]:
# Function 3
# - this is our spam filter
def classifyMessage(message):
    '''
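    Classifies the message as "Is Spam", "Not Spam" or "Might be Spam" and prints the result
    
    - The printed "accuracy" is the winning counter's share of the combined
      spam and ham token counts, not a measured classification accuracy
    
    Parameters
    ----------
    message : list of str
        tokens of the preprocessed message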
    '''

    spam_counter = 0
    ham_counter = 0
    
    # Spam counter
    for each_word in message:
        spam_counter += spam_tokens.count(each_word) 
    
    # Ham counter
    for each_word in message:
        ham_counter += ham_tokens.count(each_word) 
    
    # Ham share of the combined token count, as a percentage
    if ham_counter > spam_counter:
        accuracy = (ham_counter / (ham_counter + spam_counter)) * 100
        print('Not Spam, with {:.2f}% accuracy.\n'.format(accuracy))
        
    # Spam share of the combined token count, as a percentage
    elif spam_counter > ham_counter:
        accuracy = (spam_counter / (ham_counter + spam_counter)) * 100
        print('Is Spam, with {:.2f}% accuracy.\n'.format(accuracy))
              
    else:
        print('Might be Spam, with 50% accuracy.\n')

### 5. Test spam filter <a class="anchor" id="section5"></a>

#### 5.1 Using custom data <a class="anchor" id="section5.1"></a> 

In [7]:
# Custom spam message
spamMessage = ('''Congratulations! You have won two free tickets to the game next
                weekend! Please message this number: +123456789 to claim your prize!''')

# Step 1. preprocess the message
processedSpamMessage = preprocessMessage(spamMessage)
print(processedSpamMessage,'\n')

# Step 2. classify the processed message
classifyMessage(processedSpamMessage)


# Custom ham message
hamMessage = ('''Hey - I managed to win some free tickets to the game next 
               weekend! Want to come with me and the crew???''')

# Step 1. preprocess the message
processedHamMessage = preprocessMessage(hamMessage)
print(processedHamMessage, '\n')

# Step 2. classify the processed message
classifyMessage(processedHamMessage)

['congratulations', 'two', 'free', 'tickets', 'game', 'next', 'weekend', 'please', 'message', 'number', '123456789', 'claim', 'prize'] 

Is Spam, with 61.15% accuracy.

['hey', 'managed', 'win', 'free', 'tickets', 'game', 'next', 'weekend', 'want', 'come', 'crew'] 

Not Spam, with 65.40% accuracy.



#### 5.2 Using a custom pandas DataFrame<a class="anchor" id="section5.2"></a>

In [8]:
# Define DataFrame with only spam messages
spamMessages = data[data['label']=='spam'].head(5)
display(spamMessages)

# Test spam filter
for x in spamMessages.index:
    print('>',x)
    classifyMessage(spamMessages.loc[x]['sms_processed'])

Unnamed: 0,label,sms,sms_processed
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, 2, wkly, comp, win, fa, cup, fin..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...,"[freemsg, hey, darling, 3, weeks, word, back, ..."
8,spam,WINNER!! As a valued network customer you have...,"[winner, valued, network, customer, selected, ..."
9,spam,Had your mobile 11 months or more? U R entitle...,"[mobile, 11, months, u, r, entitled, update, l..."
11,spam,"SIX chances to win CASH! From 100 to 20,000 po...","[six, chances, win, cash, 100, 20000, pounds, ..."


> 2
Is Spam, with 64.18% accuracy.

> 5
Not Spam, with 81.84% accuracy.

> 8
Is Spam, with 76.61% accuracy.

> 9
Not Spam, with 52.81% accuracy.

> 11
Is Spam, with 65.08% accuracy.



In [9]:
# Define DataFrame with only ham messages
hamMessages = data[data['label']=='ham'].head(5)
display(hamMessages)

# Test spam filter
for x in hamMessages.index:
    print('>',x)
    classifyMessage(hamMessages.loc[x]['sms_processed'])

Unnamed: 0,label,sms,sms_processed
0,ham,"Go until jurong point, crazy.. Available only ...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...","[nah, dont, think, goes, usf, lives, around, t..."
6,ham,Even my brother is not like to speak with me. ...,"[even, brother, like, speak, treat, like, aids..."


> 0
Not Spam, with 93.05% accuracy.

> 1
Not Spam, with 89.67% accuracy.

> 3
Not Spam, with 88.28% accuracy.

> 4
Not Spam, with 94.90% accuracy.

> 6
Not Spam, with 93.01% accuracy.



### 6. Test observations <a class="anchor" id="section6"></a>

In [10]:
# An incorrect classification of "Not Spam" found for 2 spam messages

# View labels
for x in (5, 9):
    display(data.loc[x]['label'])

# View spam filter classifications
for x in (5,9):
    print('>',x)
    classifyMessage(spamMessages.loc[x]['sms_processed'])

'spam'

'spam'

> 5
Not Spam, with 81.84% accuracy.

> 9
Not Spam, with 52.81% accuracy.



### 7. Investigating the classification errors <a class="anchor" id="section7"></a>

In [11]:
# Function 4
# - for a detailed look at each misclassified message
def messageSummary(index):
    '''
    Prints a detailed summary of the selected message
    
    Parameters
    ----------
    index : int
        index number of selected message    
    '''
    print('Original message:\n%s' %data.loc[index]['sms'])
    print('\nProcessed message:\n%s' %data.loc[index]['sms_processed'])
    
    spam_counter = 0
    print('\nSpam Token Values:')
    for word in data.loc[index]['sms_processed']:
        spam_counter += spam_tokens.count(word)
        print('{' + word + ': ' +str(spam_tokens.count(word)), end='} ')
    
    ham_counter = 0
    print('\n\nHam Token Values:')
    for word in data.loc[index]['sms_processed']:
        ham_counter += ham_tokens.count(word)
        print('{' + word + ': ' +str(ham_tokens.count(word)), end='} ')
        
    print('\n\nSpam Token Total: %i' %spam_counter)
    print('\nHam Token Total: %i' %ham_counter)
    
    if spam_counter > ham_counter:
        print('\nTrue Positive: Is Spam (%s' %spam_counter + ')' + 
              ' | True Negative : Not Spam (%s' %ham_counter + ')\n')
    else:
        print('\nTrue Positive: Not Spam (%s' %ham_counter + ')' + 
              ' | True Negative : Is Spam (%s' %spam_counter + ')\n')
        
    # The spam filter output 
    return classifyMessage(data.loc[index]['sms_processed'])

In [12]:
# Index 5
messageSummary(5)

Original message:
FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv

Processed message:
['freemsg', 'hey', 'darling', '3', 'weeks', 'word', 'back', 'id', 'like', 'fun', 'still', 'tb', 'ok', 'xxx', 'std', 'chgs', 'send', 'å£150', 'rcv']

Spam Token Values:
{freemsg: 12} {hey: 5} {darling: 2} {3: 22} {weeks: 13} {word: 22} {back: 23} {id: 3} {like: 13} {fun: 9} {still: 7} {tb: 1} {ok: 5} {xxx: 11} {std: 9} {chgs: 1} {send: 67} {å£150: 27} {rcv: 2} 

Ham Token Values:
{freemsg: 0} {hey: 106} {darling: 3} {3: 44} {weeks: 6} {word: 12} {back: 129} {id: 29} {like: 229} {fun: 22} {still: 146} {tb: 3} {ok: 272} {xxx: 21} {std: 0} {chgs: 0} {send: 123} {å£150: 0} {rcv: 0} 

Spam Token Total: 254

Ham Token Total: 1145

True Positive: Not Spam (1145) | True Negative : Is Spam (254)

Not Spam, with 81.84% accuracy.



In [13]:
# Index 9
messageSummary(9)

Original message:
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030

Processed message:
['mobile', '11', 'months', 'u', 'r', 'entitled', 'update', 'latest', 'colour', 'mobiles', 'camera', 'free', 'call', 'mobile', 'update', 'co', 'free', '08002986030']

Spam Token Values:
{mobile: 123} {11: 5} {months: 4} {u: 147} {r: 22} {entitled: 8} {update: 19} {latest: 36} {colour: 17} {mobiles: 12} {camera: 33} {free: 216} {call: 347} {mobile: 123} {update: 19} {co: 5} {free: 216} {08002986030: 2} 

Ham Token Values:
{mobile: 15} {11: 4} {months: 9} {u: 972} {r: 131} {entitled: 0} {update: 5} {latest: 3} {colour: 4} {mobiles: 0} {camera: 3} {free: 59} {call: 229} {mobile: 15} {update: 5} {co: 2} {free: 59} {08002986030: 0} 

Spam Token Total: 1354

Ham Token Total: 1515

True Positive: Not Spam (1515) | True Negative : Is Spam (1354)

Not Spam, with 52.81% accuracy.
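
The percentage in each summary is the larger token total divided by the sum of both totals. A small sketch of that arithmetic as a pure function, checked against the totals printed above for indices 5 and 9; the function name `score` is hypothetical, and it returns its result instead of printing it.

```python
def score(spam_counter, ham_counter):
    """Return (label, percentage) using the same rule as classifyMessage."""
    total = spam_counter + ham_counter
    if spam_counter == ham_counter:
        return 'Might be Spam', 50.0
    if ham_counter > spam_counter:
        return 'Not Spam', ham_counter / total * 100
    return 'Is Spam', spam_counter / total * 100

# Totals copied from the two summaries above
for index, spam_total, ham_total in [(5, 254, 1145), (9, 1354, 1515)]:
    label, pct = score(spam_total, ham_total)
    print('> {}: {}, {:.2f}%'.format(index, label, pct))
# > 5: Not Spam, 81.84%
# > 9: Not Spam, 52.81%
```

In both cases a handful of very frequent ham tokens (`u`, `ok`, `like`) dominate the ham total, which is what the next section examines.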



#### 7.1 Ham tokens with large values<a class="anchor" id="section7.1"></a>

In [14]:
# Ham tokens with a value between 100 & 200
for x in set(ham_tokens):
    if ham_tokens.count(x) >= 100 and ham_tokens.count(x) <= 200:
        print('{' + x, ':', ham_tokens.count(x), end='} ')

{today : 125} {love : 185} {send : 123} {day : 187} {one : 162} {ì : 117} {night : 107} {happy : 105} {want : 163} {well : 126} {need : 156} {n : 134} {hey : 106} {still : 146} {much : 112} {take : 112} {back : 129} {hi : 117} {da : 131} {think : 128} {r : 131} {tell : 121} {time : 189} {great : 100} {way : 100} {lor : 160} {going : 167} {later : 134} {sorry : 153} {see : 137} {cant : 118} {4 : 168} {oh : 111} {home : 160} 

In [15]:
# Ham tokens with a value between 200 & 500
for x in set(ham_tokens):
    if ham_tokens.count(x) >= 200 and ham_tokens.count(x) <= 500:
        print('{' + x, ':', ham_tokens.count(x), end='} ')

{dont : 257} {2 : 305} {get : 303} {ok : 272} {go : 247} {come : 224} {ill : 236} {got : 243} {ur : 240} {good : 222} {im : 449} {ltgt : 276} {call : 229} {like : 229} {know : 232} 

In [16]:
# Ham tokens with a value greater than 500
for x in set(ham_tokens):
    if ham_tokens.count(x) > 500:
        print('{' + x, ':', ham_tokens.count(x), end='} ')

{u : 972} 

#### 7.2 Excluding ham tokens with a value of 100 or more <a class="anchor" id="section7.2"></a><a class="anchor" id="function5"></a>

In [17]:
# Update Function 3
def classifyMessageUpdated(message):
    '''
    Classifies the message as either "Spam" or "Not Spam" alongside an accuracy measure
    
    - Ham tokens with a value greater than 100 will be omitted from the ham counter
      (tokens with a value of exactly 100 are still counted)
    
    Parameters
    ----------
    message : list of str
        content of the original message
    '''
    spam_counter = 0
    ham_counter = 0
    
    for token in message:
        spam_counter += spam_tokens.count(token)
        
    for token in message:
        if ham_tokens.count(token) <= 100: # < Logic adjustment made here
            ham_counter += ham_tokens.count(token)
    
    if ham_counter > spam_counter:
        accuracy = (ham_counter / (ham_counter + spam_counter)) * 100
        print('Not Spam, with {:.2f}% accuracy.\n'.format(accuracy))
        
    elif spam_counter > ham_counter:
        accuracy = (spam_counter / (ham_counter + spam_counter)) * 100
        print('Is Spam, with {:.2f}% accuracy.\n'.format(accuracy))
              
    else:
        print('Might be Spam, with 50% accuracy.\n')

#### 7.3 Test solution <a class="anchor" id="section7.3"></a>

In [18]:
# Spam messages
display(spamMessages)

# Test 'new' spam filter
for x in spamMessages.index:
    print('>',x)
    classifyMessageUpdated(spamMessages.loc[x]['sms_processed'])

Unnamed: 0,label,sms,sms_processed
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, 2, wkly, comp, win, fa, cup, fin..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...,"[freemsg, hey, darling, 3, weeks, word, back, ..."
8,spam,WINNER!! As a valued network customer you have...,"[winner, valued, network, customer, selected, ..."
9,spam,Had your mobile 11 months or more? U R entitle...,"[mobile, 11, months, u, r, entitled, update, l..."
11,spam,"SIX chances to win CASH! From 100 to 20,000 po...","[six, chances, win, cash, 100, 20000, pounds, ..."


> 2
Is Spam, with 81.72% accuracy.

> 5
Is Spam, with 64.47% accuracy.

> 8
Is Spam, with 95.66% accuracy.

> 9
Is Spam, with 88.09% accuracy.

> 11
Is Spam, with 88.09% accuracy.



In [19]:
# Index 5: before function adjustment
classifyMessage(spamMessages.loc[5]['sms_processed'])
print('New classification:')
# Index 5: after function adjustment
classifyMessageUpdated(spamMessages.loc[5]['sms_processed'])

# Index 9: before function adjustment
classifyMessage(spamMessages.loc[9]['sms_processed'])
print('New classification:')
# Index 9: after function adjustment
classifyMessageUpdated(spamMessages.loc[9]['sms_processed'])

Not Spam, with 81.84% accuracy.

New classification:
Is Spam, with 64.47% accuracy.

Not Spam, with 52.81% accuracy.

New classification:
Is Spam, with 88.09% accuracy.



In [20]:
# Update Function 4
def messageSummaryUpdated(index):
    '''
    Prints a detailed summary of the selected message
    
    Parameters
    ----------
    index : int
        index number of selected message    
    '''
    print('Original message:\n%s' %data.loc[index]['sms'])
    print('\nProcessed message:\n%s' %data.loc[index]['sms_processed'])
    
    spam_counter = 0
    print('\nSpam Token Values:')
    for word in data.loc[index]['sms_processed']:
        spam_counter += spam_tokens.count(word)
        print('{' + word + ': ' +str(spam_tokens.count(word)), end='} ')
    
    ham_counter = 0
    print('\n\nHam Token Values:')
    for word in data.loc[index]['sms_processed']:
        if ham_tokens.count(word) <= 100: # < Logic adjustment made here
            ham_counter += ham_tokens.count(word)
            print('{' + word + ': ' +str(ham_tokens.count(word)), end='} ')
        
    print('\n\nSpam Token Total: %i' %spam_counter)
    print('\nHam Token Total: %i' %ham_counter)
    
    if spam_counter > ham_counter:
        print('\nTrue Positive: Is Spam (%s' %spam_counter + ')' + 
              ' | True Negative : Not Spam (%s' %ham_counter + ')\n')
    else:
        print('\nTrue Positive: Not Spam (%s' %ham_counter + ')' + 
              ' | True Negative : Is Spam (%s' %spam_counter + ')\n')
        
    # The spam filter output 
    return classifyMessageUpdated(data.loc[index]['sms_processed'])  # < Function adjustment made here

In [21]:
# Index 5: before code adjustment
messageSummary(5)
# Index 5: after code adjustment
messageSummaryUpdated(5)

Original message:
FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv

Processed message:
['freemsg', 'hey', 'darling', '3', 'weeks', 'word', 'back', 'id', 'like', 'fun', 'still', 'tb', 'ok', 'xxx', 'std', 'chgs', 'send', 'å£150', 'rcv']

Spam Token Values:
{freemsg: 12} {hey: 5} {darling: 2} {3: 22} {weeks: 13} {word: 22} {back: 23} {id: 3} {like: 13} {fun: 9} {still: 7} {tb: 1} {ok: 5} {xxx: 11} {std: 9} {chgs: 1} {send: 67} {å£150: 27} {rcv: 2} 

Ham Token Values:
{freemsg: 0} {hey: 106} {darling: 3} {3: 44} {weeks: 6} {word: 12} {back: 129} {id: 29} {like: 229} {fun: 22} {still: 146} {tb: 3} {ok: 272} {xxx: 21} {std: 0} {chgs: 0} {send: 123} {å£150: 0} {rcv: 0} 

Spam Token Total: 254

Ham Token Total: 1145

True Positive: Not Spam (1145) | True Negative : Is Spam (254)

Not Spam, with 81.84% accuracy.

Original message:
FreeMsg Hey there darling it's been 3 week's now and no word back!

In [22]:
# Index 9: before code adjustment
messageSummary(9)
# Index 9: after code adjustment
messageSummaryUpdated(9)

Original message:
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030

Processed message:
['mobile', '11', 'months', 'u', 'r', 'entitled', 'update', 'latest', 'colour', 'mobiles', 'camera', 'free', 'call', 'mobile', 'update', 'co', 'free', '08002986030']

Spam Token Values:
{mobile: 123} {11: 5} {months: 4} {u: 147} {r: 22} {entitled: 8} {update: 19} {latest: 36} {colour: 17} {mobiles: 12} {camera: 33} {free: 216} {call: 347} {mobile: 123} {update: 19} {co: 5} {free: 216} {08002986030: 2} 

Ham Token Values:
{mobile: 15} {11: 4} {months: 9} {u: 972} {r: 131} {entitled: 0} {update: 5} {latest: 3} {colour: 4} {mobiles: 0} {camera: 3} {free: 59} {call: 229} {mobile: 15} {update: 5} {co: 2} {free: 59} {08002986030: 0} 

Spam Token Total: 1354

Ham Token Total: 1515

True Positive: Not Spam (1515) | True Negative : Is Spam (1354)

Not Spam, with 52.81% accuracy.

Original message:
Had your

In [23]:
# Ham messages
display(hamMessages)

# Test 'new' spam filter
for x in hamMessages.index:
    print('>',x)
    classifyMessageUpdated(hamMessages.loc[x]['sms_processed'])

Unnamed: 0,label,sms,sms_processed
0,ham,"Go until jurong point, crazy.. Available only ...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...","[nah, dont, think, goes, usf, lives, around, t..."
6,ham,Even my brother is not like to speak with me. ...,"[even, brother, like, speak, treat, like, aids..."


> 0
Not Spam, with 83.18% accuracy.

> 1
Is Spam, with 66.96% accuracy.

> 3
Not Spam, with 57.28% accuracy.

> 4
Not Spam, with 82.93% accuracy.

> 6
Not Spam, with 72.61% accuracy.



#### 7.4 Test observations II <a class="anchor" id="section7.4"></a>

In [24]:
# An incorrect classification of "Is Spam" found for 1 ham message

# View label
display(data.loc[1]['label'])

# View spam filter classifications
print('> 1')
classifyMessageUpdated(hamMessages.loc[1]['sms_processed'])

'ham'

> 1
Is Spam, with 66.96% accuracy.



### 8. A new classification error <a class="anchor" id="section8"></a>

In [25]:
# Index 1: before code adjustment
classifyMessage(hamMessages.loc[1]['sms_processed'])

# Index 1: after code adjustment
classifyMessageUpdated(hamMessages.loc[1]['sms_processed'])

Not Spam, with 89.67% accuracy.

Is Spam, with 66.96% accuracy.



In [26]:
# Index 1
# - a detailed summary of the problem
messageSummaryUpdated(1)

Original message:
Ok lar... Joking wif u oni...

Processed message:
['ok', 'lar', 'joking', 'wif', 'u', 'oni']

Spam Token Values:
{ok: 5} {lar: 0} {joking: 0} {wif: 0} {u: 147} {oni: 0} 

Ham Token Values:
{lar: 38} {joking: 6} {wif: 27} {oni: 4} 

Spam Token Total: 152

Ham Token Total: 75

True Positive: Is Spam (152) | True Negative : Not Spam (75)

Is Spam, with 66.96% accuracy.



#### 8.1 The 'U' problem <a class="anchor" id="section8.1"></a>

In [27]:
print('u spam token count:',spam_tokens.count('u'))
print('u ham token count:',ham_tokens.count('u'))

u spam token count: 147
u ham token count: 972


In [28]:
# Spam tokens with a value between 100 & 200
for x in set(spam_tokens):
    if spam_tokens.count(x) >= 100 and spam_tokens.count(x) <= 200:
        print('{' + x, ':', spam_tokens.count(x), end='} ')

{reply : 101} {mobile : 123} {text : 120} {claim : 113} {ur : 144} {u : 147} {4 : 119} {txt : 150} {stop : 113} {2 : 173} 

In [29]:
# Spam tokens with a value greater than 200
for x in set(spam_tokens):
    if spam_tokens.count(x) > 200:
        print('{' + x, ':', spam_tokens.count(x), end='} ')

{free : 216} {call : 347} 

#### 8.2 Excluding spam tokens with a value of 100 or more <a class="anchor" id="section8.2"></a>

In [30]:
# Update Function 3 II
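def classifyMessageUpdated2(message):
    '''
    Classifies the message as "Is Spam", "Not Spam" or "Might be Spam" and prints the result
    
    - Spam and ham tokens with a value greater than 100 are omitted from their counters
    
    Parameters
    ----------
    message : list of str
        tokens of the preprocessed message
    '''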
    spam_counter = 0
    ham_counter = 0
    
    for token in message:
        if spam_tokens.count(token) <= 100: # < Logic adjustment made here
            spam_counter += spam_tokens.count(token)
        
    for token in message:
        if ham_tokens.count(token) <= 100: # < Logic adjustment made here
            ham_counter += ham_tokens.count(token)
    
    if ham_counter > spam_counter:
        accuracy = (ham_counter / (ham_counter + spam_counter)) * 100
        print('Not Spam, with {:.2f}% accuracy.\n'.format(accuracy))
        
    elif spam_counter > ham_counter:
        accuracy = (spam_counter / (ham_counter + spam_counter)) * 100
        print('Is Spam, with {:.2f}% accuracy.\n'.format(accuracy))
              
    else:
        print('Might be Spam, with 50% accuracy.\n')

#### 8.3 Test solution II <a class="anchor" id="section8.3"></a>

In [31]:
display(hamMessages)

# Iteration 1: original spam filter & index 1
print(hamMessages.loc[1]['label'])
print('Original filter classification:')
classifyMessage(hamMessages.loc[1]['sms_processed'])

# Iteration 2: initial filter adjustment (omit ham tokens > 100) & index 1
print(hamMessages.loc[1]['label'])
print('Second filter classification')
classifyMessageUpdated(hamMessages.loc[1]['sms_processed'])

# Iteration 3: second filter adjustment (omit spam tokens > 100) & index 1
print(hamMessages.loc[1]['label'])
print('Third filter classification')
classifyMessageUpdated2(hamMessages.loc[1]['sms_processed'])

Unnamed: 0,label,sms,sms_processed
0,ham,"Go until jurong point, crazy.. Available only ...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...","[nah, dont, think, goes, usf, lives, around, t..."
6,ham,Even my brother is not like to speak with me. ...,"[even, brother, like, speak, treat, like, aids..."


ham
Original filter classification:
Not Spam, with 89.67% accuracy.

ham
Second filter classification
Is Spam, with 66.96% accuracy.

ham
Third filter classification
Not Spam, with 93.75% accuracy.
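
The three filters differ only in which token counts they ignore, so they could be written as one function with two optional caps. A minimal sketch under that assumption: the name `classify` and the parameters `spam_cap` and `ham_cap` are hypothetical, dictionary lookups replace `list.count`, and the counts for index 1 are copied from outputs earlier in this notebook.

```python
def classify(tokens, spam_counts, ham_counts, spam_cap=None, ham_cap=None):
    """Return (label, percentage); token counts above a cap are ignored."""
    def capped_total(counts, cap):
        values = (counts.get(token, 0) for token in tokens)
        return sum(v for v in values if cap is None or v <= cap)

    spam_total = capped_total(spam_counts, spam_cap)
    ham_total = capped_total(ham_counts, ham_cap)

    if ham_total > spam_total:
        return 'Not Spam', ham_total / (ham_total + spam_total) * 100
    if spam_total > ham_total:
        return 'Is Spam', spam_total / (ham_total + spam_total) * 100
    return 'Might be Spam', 50.0

# Token counts for index 1, copied from the notebook outputs
tokens = ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
spam_counts = {'ok': 5, 'u': 147}
ham_counts = {'ok': 272, 'lar': 38, 'joking': 6, 'wif': 27, 'u': 972, 'oni': 4}

# (no caps), (ham cap only), (both caps) mirror the three iterations
results = []
for spam_cap, ham_cap in [(None, None), (None, 100), (100, 100)]:
    label, pct = classify(tokens, spam_counts, ham_counts, spam_cap, ham_cap)
    results.append((label, round(pct, 2)))
    print('{}, {:.2f}%'.format(label, pct))
# Not Spam, 89.67%
# Is Spam, 66.96%
# Not Spam, 93.75%
```

Keeping the caps as parameters makes it straightforward to try thresholds other than 100 without defining another copy of the function.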



#### 8.4 Further testing <a class="anchor" id="section8.4"></a>

In [32]:
# Spam messages
display(spamMessages)

# Iteration 1: original spam filter
for x in spamMessages.index:
    print('>',x)
    classifyMessage(spamMessages.loc[x]['sms_processed'])

Unnamed: 0,label,sms,sms_processed
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, 2, wkly, comp, win, fa, cup, fin..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...,"[freemsg, hey, darling, 3, weeks, word, back, ..."
8,spam,WINNER!! As a valued network customer you have...,"[winner, valued, network, customer, selected, ..."
9,spam,Had your mobile 11 months or more? U R entitle...,"[mobile, 11, months, u, r, entitled, update, l..."
11,spam,"SIX chances to win CASH! From 100 to 20,000 po...","[six, chances, win, cash, 100, 20000, pounds, ..."


> 2
Is Spam, with 64.18% accuracy.

> 5
Not Spam, with 81.84% accuracy.

> 8
Is Spam, with 76.61% accuracy.

> 9
Not Spam, with 52.81% accuracy.

> 11
Is Spam, with 65.08% accuracy.



In [33]:
# Iteration 2: initial filter adjustment (omit ham tokens > 100)
for x in spamMessages.index:
    print('>',x)
    classifyMessageUpdated(spamMessages.loc[x]['sms_processed'])

> 2
Is Spam, with 81.72% accuracy.

> 5
Is Spam, with 64.47% accuracy.

> 8
Is Spam, with 95.66% accuracy.

> 9
Is Spam, with 88.09% accuracy.

> 11
Is Spam, with 88.09% accuracy.



In [34]:
# Iteration 3: second filter adjustment (omit spam tokens > 100)
for x in spamMessages.index:
    print('>',x)
    classifyMessageUpdated2(spamMessages.loc[x]['sms_processed'])

> 2
Is Spam, with 55.36% accuracy.

> 5
Is Spam, with 64.47% accuracy.

> 8
Is Spam, with 88.51% accuracy.

> 9
Not Spam, with 50.14% accuracy.

> 11
Is Spam, with 78.37% accuracy.



In [35]:
# Ham messages
display(hamMessages)

# Iteration 1: original spam filter
for x in hamMessages.index:
    print('>',x)
    classifyMessage(hamMessages.loc[x]['sms_processed'])

Unnamed: 0,label,sms,sms_processed
0,ham,"Go until jurong point, crazy.. Available only ...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...","[nah, dont, think, goes, usf, lives, around, t..."
6,ham,Even my brother is not like to speak with me. ...,"[even, brother, like, speak, treat, like, aids..."


> 0
Not Spam, with 93.05% accuracy.

> 1
Not Spam, with 89.67% accuracy.

> 3
Not Spam, with 88.28% accuracy.

> 4
Not Spam, with 94.90% accuracy.

> 6
Not Spam, with 93.01% accuracy.



In [36]:
# Iteration 2: initial filter adjustment (omit ham tokens > 100)
for x in hamMessages.index:
    print('>',x)
    classifyMessageUpdated(hamMessages.loc[x]['sms_processed'])

> 0
Not Spam, with 83.18% accuracy.

> 1
Is Spam, with 66.96% accuracy.

> 3
Not Spam, with 57.28% accuracy.

> 4
Not Spam, with 82.93% accuracy.

> 6
Not Spam, with 72.61% accuracy.



In [37]:
# Iteration 3: second filter adjustment (omit spam tokens > 100)
for x in hamMessages.index:
    print('>',x)
    classifyMessageUpdated2(hamMessages.loc[x]['sms_processed'])

> 0
Not Spam, with 83.18% accuracy.

> 1
Not Spam, with 93.75% accuracy.

> 3
Not Spam, with 95.46% accuracy.

> 4
Not Spam, with 82.93% accuracy.

> 6
Not Spam, with 72.61% accuracy.
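
The percentages printed by each filter describe token shares for a single message, not how often the filter is right. A rough tally of actual accuracy on the ten test messages can be sketched as below, with the predicted labels transcribed by hand from the outputs above (spam indices 2, 5, 8, 9, 11, then ham indices 0, 1, 3, 4, 6). The helper name `accuracy` is hypothetical, and ten messages is far too small a sample to generalise from.

```python
# True labels for indices 2, 5, 8, 9, 11 (spam) and 0, 1, 3, 4, 6 (ham)
actual = ['spam'] * 5 + ['ham'] * 5

# Predictions transcribed from the outputs of In [32] to In [37]
predictions = {
    'classifyMessage': ['spam', 'ham', 'spam', 'ham', 'spam'] + ['ham'] * 5,
    'classifyMessageUpdated': ['spam'] * 5 + ['ham', 'spam', 'ham', 'ham', 'ham'],
    'classifyMessageUpdated2': ['spam', 'spam', 'spam', 'ham', 'spam'] + ['ham'] * 5,
}

def accuracy(actual, predicted):
    """Return the percentage of predictions that match the true labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual) * 100

for name, predicted in predictions.items():
    print('{}: {:.0f}% correct'.format(name, accuracy(actual, predicted)))
# classifyMessage: 80% correct
# classifyMessageUpdated: 90% correct
# classifyMessageUpdated2: 90% correct
```

On this sample the third filter fixes the ham error at index 1 but reintroduces the spam error at index 9, so it is no more accurate overall than the second.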



[Return to contents](#sections)