<h2>Building a Spam Filter for SMS Using the Multinomial Naive Bayes Algorithm</h2>


In this notebook we will create a spam filter for SMS messages. We will use the multinomial Naive Bayes algorithm, which relies on the principles of Bayes' theorem.

Creating the filter rests on three key steps:

1.  Humans classify a set of messages as spam or non-spam.
2.  The computer uses that human knowledge to estimate probabilities for a new message, i.e. P(spam | new message) and P(non-spam | new message).
3.  The computer compares the two probabilities and classifies the message. In simple words, if P(spam | new message) is higher, then the message is marked as spam, and vice versa.

*If the probabilities are equal, then we may need a human to classify the message.*

---


So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). You can also download the dataset directly from this [link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection).


In [1]:
import os
for dirname, _, filenames in os.walk('../input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

../input/SMSSpamCollection


In [2]:
import pandas as pd
sms_spam_collection = pd.read_csv('../input/SMSSpamCollection', sep = '\t', header=None, names = ['Label', 'SMS'])

In [3]:
sms_spam_collection.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
sms_spam_collection.shape

(5572, 2)

In [5]:
sms_spam_collection.tail()

Unnamed: 0,Label,SMS
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


In [6]:
sms_spam_collection['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

**Ham here means non spam**

Before we begin designing our software, i.e. our spam filter, it is important that we create a test first. If we create the software first, it is tempting to come up with a biased test just to make sure the software passes it.

We can break up our dataset into two parts: a training set and a testing set. We will use the training set (80% of the data) to train the computer to classify messages. We will use the testing set (the remaining 20%) to test how good the filter is at classifying new messages.

To better understand the reasoning, think of it this way. We keep the testing set aside. Keep in mind that the testing set has already been classified by a human, so we have a reference point. Once the filter is created, we treat the testing set as new messages and have our filter classify them. Then we can compare the labels assigned by the computer against the labels assigned by a human to gauge how effective our filter is.

**We are hoping to reach an 80% accuracy when it comes to classification by our filter.**

Let's begin dividing the dataset into training and testing sets. First we need to randomize it. 

In [7]:
sms_data_randomized = sms_spam_collection.sample(frac=1, random_state=1)

In [8]:
training_data_length = round(len(sms_spam_collection) * 0.80)

In [9]:
training_data_length

4458

In [10]:
training_sms_dataset = sms_data_randomized[:training_data_length]

In [11]:
training_sms_dataset.shape

(4458, 2)

In [12]:
training_sms_dataset.head()

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


We see a slight issue with the indexes of our training dataset: they still carry the row numbers from before the shuffle. Let's fix that.

In [13]:
training_sms_dataset.reset_index(drop = True, inplace = True)

In [14]:
training_sms_dataset.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


What we did was an inefficient way of coding, because slicing and resetting the index took two separate steps. Method chaining does both in one statement. Let's see it on the testing dataset.

In [15]:
testing_sms_dataset = sms_data_randomized[training_data_length:].reset_index(drop = True)

In [16]:
testing_sms_dataset.shape

(1114, 2)

In [17]:
testing_sms_dataset.head()

Unnamed: 0,Label,SMS
0,ham,Later i guess. I needa do mcat study too.
1,ham,But i haf enuff space got like 4 mb...
2,spam,Had your mobile 10 mths? Update to latest Oran...
3,ham,All sounds good. Fingers . Makes it difficult ...
4,ham,"All done, all handed in. Don't know if mega sh..."


You might have wondered why we didn't just take 4,458 random samples from the dataset using DataFrame.sample(), and why we used the training length as the starting point for our testing dataset.

The answer lies in one key detail: the two sets must not overlap. Calling sample(frac=1) shuffles every row exactly once (sampling without replacement), so slicing the shuffled data at the training length guarantees that an SMS in the training dataset cannot also be in the testing dataset. Remember our premise that we will be treating the testing dataset as entirely new messages.
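
The shuffle-then-slice idea can be checked on a tiny example. This is only a sketch: the five messages and the names `toy`, `train_part` and `test_part` are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Toy stand-in for the SMS dataset; the messages are invented.
toy = pd.DataFrame({
    "Label": ["ham", "spam", "ham", "ham", "spam"],
    "SMS": ["see you soon", "win cash now", "ok", "call me later", "free prize"],
})

# Shuffle every row once (sampling without replacement), then cut at 80%.
shuffled = toy.sample(frac=1, random_state=1)
cut = round(len(shuffled) * 0.80)
train_part = shuffled[:cut]
test_part = shuffled[cut:]

# The original row indexes show that no message lands in both parts.
overlap = set(train_part.index) & set(test_part.index)
print(len(train_part), len(test_part), overlap)
```

Because every row appears exactly once in the shuffled frame, the two slices together cover the whole dataset and share nothing.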

Let us now check if the percentages of spam and non spam in our testing and training datasets are similar to the original dataset. 

In [18]:
testing_sms_dataset['Label'].value_counts(normalize = True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

In [19]:
training_sms_dataset['Label'].value_counts(normalize = True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

We can safely say that our datasets are representative of the population. The two variables that we have are nominal variables. If they were ratio or ordinal variables, we could also have used other ways to check representativeness.

Now that we have our training and testing datasets, it is time to train the computer to classify spam and non-spam messages. As we stated earlier, our algorithm is based on Naive Bayes. It is called naive because it works on the assumption of conditional independence: given the label, each word is treated as independent of the other words in the message.

The formula that we will be using to calculate the probabilities is :

---



$$P(\text{spam} \mid w_1, w_2, \ldots, w_n) \propto P(\text{spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{spam})$$

$$P(\text{non-spam} \mid w_1, w_2, \ldots, w_n) \propto P(\text{non-spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{non-spam})$$


---
To calculate P(w_n | spam) or P(w_n | non-spam) we will use the formula below. We will use additive smoothing, specifically Laplace smoothing (where alpha = 1). You can read more on additive smoothing, but in a nutshell it prevents a word that never appears in one class from getting a probability of 0, which would wipe out the whole product.


---
$$P(w_n \mid \text{spam}) = \frac{N_{w_n \mid \text{spam}} + \alpha}{N_{\text{spam}} + \alpha \cdot N_{\text{vocab}}}$$

- $N_{w_n \mid \text{spam}}$ = the number of times the word $w_n$ occurs in the spam messages
- $\alpha$ = 1 (smoothing parameter)
- $N_{\text{spam}}$ = the total number of words in the spam messages
- $N_{\text{vocab}}$ = the total number of unique words in the vocabulary
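
To make the smoothing concrete, here is a small sketch that assumes the additive-smoothing estimate (word count + alpha) / (N_spam + alpha * N_vocab). The two spam messages, the six-word vocabulary and every `toy_` name are hypothetical.

```python
from collections import Counter

# Made-up spam messages, already lower-cased and split into words.
toy_spam_messages = [["win", "cash", "now"], ["win", "a", "prize"]]
# Hypothetical vocabulary covering spam and ham; "meeting" never occurs in spam.
toy_vocabulary = ["win", "cash", "now", "a", "prize", "meeting"]

toy_alpha = 1
toy_spam_counts = Counter(word for message in toy_spam_messages for word in message)
toy_n_spam = sum(toy_spam_counts.values())   # total words in spam messages
toy_n_vocab = len(toy_vocabulary)            # unique words in the vocabulary

def p_word_given_spam(word):
    """Smoothed estimate (N_word|spam + alpha) / (N_spam + alpha * N_vocab)."""
    return (toy_spam_counts[word] + toy_alpha) / (toy_n_spam + toy_alpha * toy_n_vocab)

print(p_word_given_spam("win"))      # (2 + 1) / (6 + 6) = 0.25
print(p_word_given_spam("meeting"))  # (0 + 1) / (6 + 6), small but not zero
```

Without smoothing, "meeting" would get a probability of 0 and any message containing it could never be scored as spam.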

A similar formula can be built for P(w_n | non-spam). You might be wondering whether these formulas are incorrect or go against the rules of probability. It is true that the formulas for P(spam | message) and P(non-spam | message) drop the denominator of Bayes' theorem, so the results are not true probabilities. Keep the objective in mind, though: the filter does not need exact probabilities, it only needs to compare the two values to decide between spam and non-spam. The missing denominator is the same for both classes, so leaving it out does not change which value is larger.
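
A quick numeric sketch shows why the denominator can be left out. The numbers below are invented purely for illustration; only the comparison matters.

```python
# Hypothetical unnormalised scores, i.e. prior times word likelihoods.
toy_score_spam = 0.13 * 0.02 * 0.01
toy_score_ham = 0.87 * 0.001 * 0.002

# Dividing both by the same evidence term turns them into true posteriors...
toy_evidence = toy_score_spam + toy_score_ham
toy_posterior_spam = toy_score_spam / toy_evidence
toy_posterior_ham = toy_score_ham / toy_evidence

# ...but the comparison, and therefore the label, is unchanged.
print(toy_score_spam > toy_score_ham, toy_posterior_spam > toy_posterior_ham)
```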












Before we begin, though, we need to perform a few data cleaning steps.
- First, convert all text to lowercase, so that 'secret' and 'SECRET' are not counted separately.
- Second, remove punctuation such as commas, full stops and question marks.
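
As a minimal sketch, the two cleaning steps look like this on a single made-up message (`toy_raw_sms` is not from the dataset):

```python
import re

toy_raw_sms = "SECRET prize!! Call 0800-123 now..."  # invented message

# Replace every non-word character with a space, then lower-case and split.
toy_cleaned = re.sub(r"\W", " ", toy_raw_sms).lower()
toy_words = toy_cleaned.split()
print(toy_words)
```

Note that `\W` also removes the hyphen inside the phone number, so "0800-123" becomes two separate tokens.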

In [20]:
training_sms_dataset['SMS'].head() 

0                         Yep, by the pretty sculpture
1        Yes, princess. Are you going to make me moan?
2                           Welp apparently he retired
3                                              Havent.
4    I forgot 2 ask ü all smth.. There's a card on ...
Name: SMS, dtype: object

In [21]:
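# regex=True is required on recent pandas versions, where str.replace treats the pattern as a literal string by default.
training_sms_dataset['SMS'] = training_sms_dataset['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()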

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [22]:
training_sms_dataset.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


Not particularly relevant, but SMS at index 1 needs a lot more explaining ! 

The next step is to create a vocabulary. As mentioned earlier, the vocabulary is the set of unique words across both spam and ham (non-spam) messages. Put simply, we are collecting every value that our formula (mentioned above) will need.

In [23]:
test_training_dataset = training_sms_dataset.iloc[0:10]
# Before we begin working with the actual dataset, it is always recommended to first test out your plan of action. 
# We have decided to build a custom function that will take in the series as input and return a set of all unique words in spam and ham messages.
# Thus building our vocabulary.

In [24]:
test_training_dataset

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
5,ham,ok i thk i got it then u wan me 2 come now or...
6,ham,i want kfc its tuesday only buy 2 meals only ...
7,ham,no dear i was sleeping p
8,ham,ok pa nothing problem
9,ham,ill be there on lt gt ok


In [25]:
def creating_vocab(series):
  vocab = []
  for x in series:
    x_list = x.split()
    for i in x_list: 
      vocab.append(i)
  return list(set(vocab))

# The reason we are converting it back to a list is so that we can use that list as column labels later on. 

In [26]:
test_vocab = creating_vocab(test_training_dataset['SMS'].iloc[0:2])

In [27]:
test_vocab

['pretty',
 'to',
 'by',
 'going',
 'moan',
 'yes',
 'the',
 'sculpture',
 'are',
 'make',
 'you',
 'me',
 'princess',
 'yep']

In [28]:
training_sms_dataset['SMS'] = training_sms_dataset['SMS'].str.split()

vocabulary = []
for sms in training_sms_dataset['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [29]:
len(vocabulary)

7783

You can see above that we used a different approach to get the vocabulary. We showcased two approaches to show that there are multiple ways to complete a task. The option you choose is entirely up to you, as long as the results are accurate.
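
A third option, shown here only as a sketch on made-up messages, is a set comprehension. Sorting the result is an extra step that the notebook does not take; it makes the word order the same on every run, whereas the order of `list(set(...))` can differ between Python sessions.

```python
toy_messages = [["yep", "by", "the"], ["yes", "the", "princess"]]

# Union of all words across messages; a set drops duplicates automatically.
toy_sorted_vocabulary = sorted({word for sms in toy_messages for word in sms})
print(toy_sorted_vocabulary)
```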

---


Now that we have built our vocabulary, our next task is to count how often each word occurs in each of our SMS messages. For example, if our messages are "Hello There" and "Hello Tom", then the result should be

       Label | 'Hello' | 'There' | 'Tom'
    0  spam       1         1        0
    1  ham        1         0        1
       

The final result should be a dataframe where the columns are all the unique words and the values are the count of each unique word in each SMS. So let's begin. The plan of action is to:
- Create a dictionary where each key is a unique word in the vocabulary.
- Make each key's value a list holding the count of that word in each SMS.
- Loop over the SMS column and populate the dictionary.
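
The plan can be tried on the "Hello There" / "Hello Tom" example first. This is a small sketch with hypothetical `toy_` names, kept separate from the real training data.

```python
import pandas as pd

toy_sms = [["hello", "there"], ["hello", "tom"]]   # already cleaned and split
toy_vocab = sorted({word for sms in toy_sms for word in sms})

# One list of zeros per unique word, one slot per message.
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocab}
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

toy_counts_df = pd.DataFrame(toy_counts)
print(toy_counts_df)
```

Each dictionary key becomes a column and each list position becomes a row, which is the shape the notebook needs before joining the counts to the labels.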

In [30]:
word_counts_per_sms = {unique_word: [0] * len(training_sms_dataset['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_sms_dataset['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [31]:
words_df = pd.DataFrame(word_counts_per_sms)

In [32]:
words_df.head()

Unnamed: 0,3mobile,radio,10ppm,devils,sudden,clothes,beth,woken,pics,burn,...,rightly,destination,cheery,alwys,walks,records,administrator,9ja,times,awww
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
training_dataset_joined = pd.concat([training_sms_dataset,words_df], axis = 1)

In [34]:
training_dataset_joined.head()

Unnamed: 0,Label,SMS,3mobile,radio,10ppm,devils,sudden,clothes,beth,woken,...,rightly,destination,cheery,alwys,walks,records,administrator,9ja,times,awww
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [35]:
training_dataset_joined['welp'].head()

0    0
1    0
2    1
3    0
4    0
Name: welp, dtype: int64

In [36]:
training_dataset_joined['all'].head()

0    0
1    0
2    0
3    0
4    2
Name: all, dtype: int64

Now that we have cleaned and formatted the data, let us move on to the calculations. Let's recap the formulas from earlier.
  
![alt text](https://render.githubusercontent.com/render/math?math=P%28Spam%20%7C%20w_1%2Cw_2%2C%20...%2C%20w_n%29%20%5Cpropto%20P%28Spam%29%20%5Ccdot%20%5Cprod_%7Bi%3D1%7D%5E%7Bn%7DP%28w_i%7CSpam%29&mode=display)

![alt text](https://render.githubusercontent.com/render/math?math=P%28Ham%20%7C%20w_1%2Cw_2%2C%20...%2C%20w_n%29%20%5Cpropto%20P%28Ham%29%20%5Ccdot%20%5Cprod_%7Bi%3D1%7D%5E%7Bn%7DP%28w_i%7CHam%29&mode=display)

![alt text](https://render.githubusercontent.com/render/math?math=P%28w_i%7CSpam%29%20%3D%20%5Cfrac%7BN_%7Bw_i%7CSpam%7D%20%2B%20%5Calpha%7D%7BN_%7BSpam%7D%20%2B%20%5Calpha%20%5Ccdot%20N_%7BVocabulary%7D%7D&mode=display)

  ![alt text](https://render.githubusercontent.com/render/math?math=P%28w_i%7CHam%29%20%3D%20%5Cfrac%7BN_%7Bw_i%7CHam%7D%20%2B%20%5Calpha%7D%7BN_%7BHam%7D%20%2B%20%5Calpha%20%5Ccdot%20N_%7BVocabulary%7D%7D&mode=display)

Let's begin by calculating the individual values and then plug them into the formulas. We will calculate:

- P(spam)
- P(ham)
- Nspam
- Nham
- Nvocabulary
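
Before running this on the full training set, the same quantities can be worked out on a four-message toy dataset. The data and all `toy_` names are invented; reading the priors straight from `value_counts` is one possible way to avoid typing rounded values by hand.

```python
import pandas as pd

toy_training = pd.DataFrame({
    "Label": ["ham", "spam", "ham", "ham"],
    "SMS": [["see", "you", "soon"], ["win", "cash"], ["ok"], ["see", "you"]],
})

# Priors: the share of each label among the training messages.
toy_priors = toy_training["Label"].value_counts(normalize=True)
toy_p_spam = toy_priors["spam"]
toy_p_ham = toy_priors["ham"]

# Word totals per class and the vocabulary size.
toy_is_spam = toy_training["Label"] == "spam"
toy_n_spam = toy_training.loc[toy_is_spam, "SMS"].apply(len).sum()
toy_n_ham = toy_training.loc[~toy_is_spam, "SMS"].apply(len).sum()
toy_n_vocab = len({word for sms in toy_training["SMS"] for word in sms})

print(toy_p_spam, toy_p_ham, toy_n_spam, toy_n_ham, toy_n_vocab)
```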

In [37]:
training_dataset_joined['Label'].value_counts(normalize = True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [38]:
p_ham = 0.86541
p_spam = 0.13459

In [39]:
spam = training_dataset_joined[training_dataset_joined['Label'] == 'spam']

In [40]:
spam.head()

Unnamed: 0,Label,SMS,3mobile,radio,10ppm,devils,sudden,clothes,beth,woken,...,rightly,destination,cheery,alwys,walks,records,administrator,9ja,times,awww
16,spam,"[freemsg, why, haven, t, you, replied, to, my,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18,spam,"[congrats, 2, mobile, 3g, videophones, r, your...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
56,spam,"[free, message, activate, your, 500, free, tex...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60,spam,"[call, from, 08702490080, tells, u, 2, call, 0...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
61,spam,"[someone, has, conacted, our, dating, service,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [41]:
ham = training_dataset_joined[training_dataset_joined['Label'] == 'ham']

In [42]:
ham.head()

Unnamed: 0,Label,SMS,3mobile,radio,10ppm,devils,sudden,clothes,beth,woken,...,rightly,destination,cheery,alwys,walks,records,administrator,9ja,times,awww
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
#calculating Nspam
spam_words = [len(x) for x in spam['SMS'] ]

In [44]:
n_spam = sum(spam_words)

In [45]:
n_spam

15190

In [46]:
#calculating Nham
ham_words = [len(x) for x in ham['SMS']]

In [47]:
n_ham = sum(ham_words)

In [48]:
n_ham

57237

In [49]:
#calculating Nvocab
n_vocab = len(vocabulary)
n_vocab

7783

In [50]:
#introduce alpha for laplace smoothing 
alpha = 1 

We have now calculated the constants. You might be wondering why P(spam) and P(ham) are constants. Remember one of our objectives: we train the computer to classify, and we train it on the training dataset. When a new message comes in, the knowledge gained from training is used, but the new message is NOT ADDED to the training dataset. Hence the values remain constant.

We will now look at the calculations for P(wi | spam) and P(wi | ham). It is important to take a moment to understand why we performed all of the tasks above beforehand. Let's say you received the message "Secret santa is coming!". Imagine you didn't have the probabilities ready. You would have to calculate P('secret' | spam), P('secret' | ham) and so on for every word at that moment. Now imagine having 1,000,000 SMS messages in the pipeline waiting to be classified.

What we do instead is calculate the probability of each unique word given spam and given ham in our training dataset beforehand. Then a new message comes in, say 'Secret santa is coming!'. We pick up the word 'secret' in the message and find that we already have P('secret' | spam) and P('secret' | ham) calculated. The process goes much faster because these values were calculated beforehand.

The only catch is that this works only as long as the words in the new message are in the training dataset; the filter has no precomputed probability for a word it has never seen. All of the above logic also holds only if we don't make any changes to the training dataset.
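
The lookup idea, including what happens to an unseen word, can be sketched as follows. The probabilities in `toy_word_probs` are invented, and skipping unknown words is one simple policy rather than the only possible one.

```python
# Hypothetical precomputed word probabilities for one class.
toy_word_probs = {"secret": 0.004, "santa": 0.001, "coming": 0.002}

toy_message = ["secret", "santa", "is", "coming"]
toy_score = 1.0
toy_skipped = []
for word in toy_message:
    if word in toy_word_probs:
        toy_score *= toy_word_probs[word]   # constant-time dictionary lookup
    else:
        toy_skipped.append(word)            # unseen word: leave the score untouched

print(toy_score, toy_skipped)
```

Here "is" has no stored probability, so it is ignored and the score is built from the three known words only.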

Let's get back to calculating the probability of words. 

In [51]:
spam_words_prob = {} 
ham_words_prob = {}

In [52]:
for word in vocabulary: 
  spam_words_prob[word] = 0 
  ham_words_prob[word] = 0 

In [53]:
len(spam_words_prob.keys())

7783

In [54]:
len(ham_words_prob.keys())

7783

In [55]:
ham.shape[0]

3858

In [56]:
spam.shape[0]

600

In [57]:
#calculating P(Wi | ham)
for word in vocabulary:
    numerator = ham[word].sum() + alpha
    denominator = n_ham + (n_vocab * alpha)
    prob = numerator / denominator
    ham_words_prob[word] = prob

In [58]:
#calculating P(Wi | spam)
for word in vocabulary: 
  numerator = spam[word].sum() + alpha
  denominator = n_spam + (n_vocab * alpha)
  prob = numerator / denominator
  spam_words_prob[word] = prob


That was fun! It might be confusing at first. Take your time and carefully calculate each value. Name your variables clearly so that you can refer to them easily in functions or formulas.

The hard part is almost done. Now all that remains is the filter. In essence, the filter will be a function that takes the message as an argument and returns whether the message is spam or not, based on the probability calculations.

Keep in mind that we have only calculated the probability of each word given spam or ham. We still have to calculate P(spam | new message) and P(ham | new message). The multinomial Naive Bayes calculation will happen inside our function.

That being said, take a break. Stretch out your limbs. Drink a hot cup of coffee and then let's get back to coding !

In [59]:
def classify_message(sms): 
  # The first step is to clean the message exactly as we cleaned the training data.
  import re
  sms_message = re.sub(r'\W', ' ', sms)
  sms_message = sms_message.lower()
  sms_message = sms_message.split()

  # Remember the formula: P(spam | w1...wn) is proportional to P(spam) * P(w1|spam) * ... * P(wn|spam),
  # where n is the number of words in the sms. P(spam) and P(ham) are constants in that formula,
  # so we start each running product from them.
  p_ham_given_message = p_ham
  p_spam_given_message = p_spam

  #now let's start calculating the probability for each word in the sms.

  for word in sms_message: 
    if word in ham_words_prob:
      p_ham_given_message *= ham_words_prob[word]

    if word in spam_words_prob: 
      p_spam_given_message *= spam_words_prob[word]

    
  print("P(ham | new message) :", p_ham_given_message)
  print("P(spam | new message) :", p_spam_given_message)

  if p_ham_given_message > p_spam_given_message: 
      print('Label : Ham. This is a non spam sms.')
  elif p_spam_given_message > p_ham_given_message: 
      print('Label : Spam. This is a spam sms')
  else: 
      print('We need a human to classify this message due to equal probabilities.')
    
    



In [60]:
#Let's test out our function
classify_message('WINNER!! This is the secret code to unlock the money: C3421.')

P(ham | new message) : 1.9368037883678308e-27
P(spam | new message) : 1.3481340092074626e-25
Label : Spam. This is a spam sms


In [61]:
classify_message("Sounds good, Tom, then see u there")

P(ham | new message) : 3.687528313102144e-21
P(spam | new message) : 2.437246584367808e-25
Label : Ham. This is a non spam sms.
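
The scores above are already tiny for short messages. For very long messages a plain product can underflow to 0.0 for both classes, which would make them look equal. A common workaround, which this notebook does not use, is to add logarithms instead of multiplying. The sketch below uses invented per-word probabilities and priors.

```python
import math

# Hypothetical per-word probabilities for a long message (200 words).
toy_probs_spam = [1e-3] * 200
toy_probs_ham = [2e-3] * 200

# Plain products underflow to 0.0, so both classes look equal.
toy_product_spam = math.prod(toy_probs_spam)
toy_product_ham = math.prod(toy_probs_ham)

# Sums of logarithms keep the ordering intact.
toy_log_spam = math.log(0.13) + sum(math.log(p) for p in toy_probs_spam)
toy_log_ham = math.log(0.87) + sum(math.log(p) for p in toy_probs_ham)

print(toy_product_spam, toy_product_ham, toy_log_ham > toy_log_spam)
```

Since the logarithm is increasing, comparing the log scores gives the same label as comparing the products whenever the products are still representable.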


Now that our filter is ready, let's use it on the testing dataset that we created earlier. Remember that we have not used any values from the testing dataset, hence each message in it can be treated as a new message.

The idea here is to create a column in our testing dataset that has the labels from our function. We can then compare the actual labels vs generated labels to measure accuracy.
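
On a toy table the comparison can be sketched in one vectorised step. The four rows below are invented, and the column names simply mirror the ones described in the paragraph above.

```python
import pandas as pd

toy_results = pd.DataFrame({
    "Label": ["ham", "spam", "ham", "spam"],
    "Generated_Label": ["ham", "spam", "spam", "spam"],
})

# Comparing the columns gives True/False per row; the mean is the hit rate.
toy_matches = toy_results["Label"] == toy_results["Generated_Label"]
toy_accuracy = toy_matches.mean() * 100
print(toy_accuracy)
```

Three of the four labels agree, so the toy accuracy is 75%.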

In [62]:
#let's make changes to our function so this time it outputs a label instead of printing it out. 
def classify_message(sms): 
  # The first step is to clean the message exactly as we cleaned the training data.
  import re
  sms_message = re.sub(r'\W', ' ', sms)
  sms_message = sms_message.lower()
  sms_message = sms_message.split()

  # Remember the formula: P(spam | w1...wn) is proportional to P(spam) * P(w1|spam) * ... * P(wn|spam),
  # where n is the number of words in the sms. P(spam) and P(ham) are constants in that formula,
  # so we start each running product from them.
  p_ham_given_message = p_ham
  p_spam_given_message = p_spam

  #now let's start calculating the probability for each word in the sms.

  for word in sms_message: 
    if word in ham_words_prob:
      p_ham_given_message *= ham_words_prob[word]

    if word in spam_words_prob: 
      p_spam_given_message *= spam_words_prob[word]

    
  if p_ham_given_message > p_spam_given_message: 
      return 'ham'
  elif p_spam_given_message > p_ham_given_message: 
      return 'spam'
  else: 
      return 'hcr'
#hcr stands for human classification required

In [63]:
testing_sms_dataset.head()

Unnamed: 0,Label,SMS
0,ham,Later i guess. I needa do mcat study too.
1,ham,But i haf enuff space got like 4 mb...
2,spam,Had your mobile 10 mths? Update to latest Oran...
3,ham,All sounds good. Fingers . Makes it difficult ...
4,ham,"All done, all handed in. Don't know if mega sh..."


In [64]:
testing_sms_dataset['Generated_Label'] = testing_sms_dataset['SMS'].apply(classify_message)

In [65]:
testing_sms_dataset.head()

Unnamed: 0,Label,SMS,Generated_Label
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [66]:
#let us now test the accuracy of our model.
correct_labels = 0
total_labels = testing_sms_dataset.shape[0]
# itertuples() yields (Index, Label, SMS, Generated_Label), so x[1] and x[3] are the two labels.
for x in testing_sms_dataset.itertuples():
    if x[1] == x[3]:
        correct_labels += 1




In [67]:
accuracy = (correct_labels / total_labels) * 100
print("The accuracy of our filter is :", accuracy)

The accuracy of our filter is : 98.74326750448833


---

**I appreciate the time you have taken to read this notebook. This is but a learner's attempt to step into the vast and beautiful field that is Data Science. I would highly appreciate any and all feedback on this notebook, whether it comes from my fellow learners, my seniors or industry experts.**