John Olijnyk & John Machado

CSI6160 HW 2: Naive Bayes and Bag of Words Implementation

This notebook uses the Bag of Words method with the Naive Bayes machine learning algorithm to classify text data, here labeling SMS messages as spam or ham.

Data Set Source: 

Spam/Ham SMS data was obtained from the Kaggle project
https://www.kaggle.com/sid321axn/sms-spam-classifier-naive-bayes-ml-algo.
The data was loaded into a dataframe and the messages were extracted for cleaning.

We split the data set into a training set and a test set; the run shown below sets `_test_size = 0.5`, which is an even 50%/50% split.

The Naive Bayes algorithm uses Bayes' theorem but assumes that the features are conditionally independent given the class. This is why the approach is called naive. In practice the assumption is rarely strictly true, yet Naive Bayes is found to perform well on many data sets. Assuming conditional independence also decreases the number of probability values the algorithm needs to calculate and track.
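
To make the saving concrete, the sketch below counts the probability values needed per class with and without the independence assumption. It is a simplification that treats each word as a binary present/absent feature; the function names and the vocabulary size of 20 are illustrative assumptions, not values from this assignment.

```python
def full_joint_parameters(num_features: int) -> int:
    # Without independence, every combination of binary features needs its
    # own probability (minus one, because they must sum to 1).
    return 2 ** num_features - 1


def naive_bayes_parameters(num_features: int) -> int:
    # With conditional independence, one probability per feature is enough.
    return num_features


vocabulary_size = 20  # hypothetical vocabulary size
print(full_joint_parameters(vocabulary_size))   # 1048575
print(naive_bayes_parameters(vocabulary_size))  # 20
```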

The bag of words approach assumes that the order of the words in a sample text does not matter. It builds a histogram of the words used in each sample and in each output label class. These histogram values are then used to calculate the probabilities for each class given the words; the comparison between classes is made at a relative level. A bag of words approach requires some clean up of the data set to remove punctuation and capitalization and to isolate each word. More advanced clean up that can improve accuracy includes removing stop words (the, a, and, etc.) and handling abbreviations and other cases that cause duplicate entries for the same word.
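
As a small illustration of the idea, the sketch below builds a word histogram with `collections.Counter` and shows that two messages with the same words in a different order produce the same bag. The helper name `bag_of_words` and the sample messages are made up for this example; the notebook itself builds its counts with scikit-learn's CountVectorizer.

```python
from collections import Counter


def bag_of_words(message: str) -> Counter:
    # Lower-case and split on whitespace; word order is discarded.
    return Counter(message.lower().split())


a = bag_of_words("free entry win free")
b = bag_of_words("win free free entry")

print(a["free"])  # 2
print(a == b)     # True: same words, same counts, order ignored
```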

One challenge with this approach arises when a word does not appear for a given label category.  This results in a zero word count, and the zero then multiplies through Bayes' theorem to give a probability of zero.  In practice a better model is to add one to the count of every word.  This gives every word at least a small, nonzero probability when the Bayes calculation is applied.  Since we add one to all words and are doing a relative comparison of probabilities, this does not distort the model.  Rather, it makes the model more robust.  We employed this technique of adding one to all words in our model.
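
The effect of the zero count can be seen in a minimal sketch. The toy counts and the message are invented for illustration; the smoothed formula follows the same shape as the one used later in `generate_tfm`, (count + 1) / (total words + unique words + 1).

```python
from collections import Counter


def word_probability(word, counts, smoothing=True):
    total = sum(counts.values())
    vocabulary = len(counts)
    if not smoothing:
        return counts[word] / total
    # Add one to every count; the extra +1 in the denominator leaves room
    # for a word that was never seen in this class.
    return (counts[word] + 1) / (total + vocabulary + 1)


spam_counts = Counter({"free": 3, "win": 2, "call": 1})  # toy counts
message = ["free", "lunch"]  # "lunch" never appears in the toy spam counts

unsmoothed = 1.0
smoothed = 1.0
for w in message:
    unsmoothed *= word_probability(w, spam_counts, smoothing=False)
    smoothed *= word_probability(w, spam_counts)

print(unsmoothed)    # 0.0, a single unseen word wipes out the product
print(smoothed > 0)  # True
```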

In [1]:
# imports
import pandas as pd
import numpy as np
from nltk.corpus import stopwords as nltk_stopwords
from nltk.tokenize import word_tokenize as nltk_tokenize
from collections import Counter as collections_counter
import string
from sklearn.feature_extraction.text import CountVectorizer as sk_feature_extract_text_CV
from sklearn.model_selection import train_test_split as sk_model_select_tts
from random import randint as random_randint
from random import seed as random_seed

In [2]:
# load data

_data = pd.read_csv('./spam.csv')

In [3]:
# here are the first 5 elements
# Category is our output label.  ham are good messages and spam are bad messages
# Messages are the feature we will analyze

_data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
def preprocess(df):

    """
        Modifies data dataframe in place.
            - to lower case all the words
            - removes stop words     
            - removes punctuation       
        Parameters:
            df (dataframe): dataframe, column 1 = labels, column 2 = message text
        Returns:
            (None)
    """

    _stopwords = set(nltk_stopwords.words('english'))

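    _punctuation = str.maketrans("", "", string.punctuation)

    def _clean(text):
        # lower case, strip punctuation, tokenize, then drop stop words
        _tokens = nltk_tokenize(text.lower().translate(_punctuation))
        return ' '.join(_w for _w in _tokens if _w not in _stopwords)

    # write the cleaned text back to the message column; assigning to the
    # row copies yielded by iterrows() is not guaranteed to update df
    df[df.columns[1]] = df.iloc[:, 1].map(_clean)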

In [5]:
# preprocess the text to clean it up
# compared with the earlier output, the text below is now lower case, has no punctuation, etc

preprocess(_data)
_data.head()

Unnamed: 0,Category,Message
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor u c already say
4,ham,nah dont think goes usf lives around though


The data needs to be separated by spam/ham category for term/document and term frequency analysis.

In [6]:
# separate based on label

_spam = _data[_data['Category'] == 'spam']
_ham = _data[_data['Category'] == 'ham']

In [7]:
#here you see a copy of the spam category

_spam

Unnamed: 0,Category,Message
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
5,spam,freemsg hey darling 3 weeks word back id like ...
8,spam,winner valued network customer selected receiv...
9,spam,mobile 11 months u r entitled update latest co...
11,spam,six chances win cash 100 20000 pounds txt csh1...
...,...,...
5537,spam,want explicit sex 30 secs ring 02073162414 cos...
5540,spam,asked 3mobile 0870 chatlines inclu free mins i...
5547,spam,contract mobile 11 mnths latest motorola nokia...
5566,spam,reminder o2 get 250 pounds free call credit de...


In [8]:
# here you see a copy of the ham category

_ham

Unnamed: 0,Category,Message
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
3,ham,u dun say early hor u c already say
4,ham,nah dont think goes usf lives around though
6,ham,even brother like speak treat like aids patent
...,...,...
5565,ham,huh lei
5568,ham,ü b going esplanade fr home
5569,ham,pity mood soany suggestions
5570,ham,guy bitching acted like id interested buying s...


In [9]:
# splitting each class's data

random_seed()
_test_size = 0.5

_spam_train_data, _spam_test_data, _spam_train_labels, _spam_test_labels = sk_model_select_tts(
    _spam.iloc[:,1],
    _spam.iloc[:,0],
    test_size=_test_size,
    random_state=random_randint(0,100)
)

_ham_train_data, _ham_test_data, _ham_train_labels, _ham_test_labels = sk_model_select_tts(
    _ham.iloc[:,1],
    _ham.iloc[:,0],
    test_size=_test_size,
    random_state=random_randint(0,100)
)

Since the classes are imbalanced, with 4825 entries for "ham" and 747 entries for "spam", we first separated the data by class before doing the train/test split.  We then randomly split each class into train and test portions and recombined the two classes to form the actual train and test sets.  That way both sets keep the same distribution of classes, and we do not inadvertently end up with a training set even more dominated by "ham" entries, as could happen with a purely random split.

Next, we used train_test_split from sklearn to split each class's data. <br>
Spam's and ham's training data would result in a term frequency matrix for each <br>
of the two classes. These matrices represent the trained algorithm; any subsequent <br>
message could be processed by using the word probabilities saved in these matrices <br>
to make a prediction.
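
The per-class split described above can be sketched without any libraries. The helper `split_per_class`, the toy rows, and the fixed seed are assumptions made for this illustration. scikit-learn's `train_test_split` also accepts a `stratify` argument that preserves class proportions in a single call; that is an alternative not used in this notebook.

```python
import random


def split_per_class(rows, test_fraction, seed):
    """rows: list of (label, message) pairs. Returns (train, test)."""
    rng = random.Random(seed)
    by_label = {}
    for label, message in rows:
        by_label.setdefault(label, []).append((label, message))

    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)  # shuffle within the class only
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# toy data: 80 ham and 20 spam messages
rows = [("ham", f"h{i}") for i in range(80)] + [("spam", f"s{i}") for i in range(20)]
train, test = split_per_class(rows, test_fraction=0.5, seed=0)

spam_share = sum(1 for label, _ in train if label == "spam") / len(train)
print(len(train), len(test))  # 50 50
print(spam_share)             # 0.2, the same share as in the full toy data
```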

In [10]:
# organize data

_spam_train = pd.DataFrame.from_dict(
    {
        'label': _spam_train_labels,
        'message': _spam_train_data
    }
)

_spam_test = pd.DataFrame.from_dict(
    {
        'label': _spam_test_labels,
        'message': _spam_test_data
    }
)

_ham_train = pd.DataFrame.from_dict(
    {
        'label': _ham_train_labels,
        'message': _ham_train_data
    }
)

_ham_test = pd.DataFrame.from_dict(
    {
        'label': _ham_test_labels,
        'message': _ham_test_data
    }    
)

In [11]:
# check spam train

print(f'{_spam_train.shape[0]} messages')
_spam_train.head()

373 messages


Unnamed: 0,label,message
4586,spam,u secret admirer looking 2 make contact ufind ...
660,spam,88800 89034 premium phone services call 087187...
3229,spam,six chances win cash 100 20000 pounds txt csh1...
1129,spam,ur hmv quiz cashbalance currently £500 maximiz...
2850,spam,chance reality fantasy show call 08707509020 2...


In [12]:
# check spam test

print(f'{_spam_test.shape[0]} messages')
_spam_test.head()

374 messages


Unnamed: 0,label,message
2269,spam,88066 88066 lost 3pound help
4061,spam,weeks savamob member offers accessible call 08...
4587,spam,mila age23 blonde new uk look sex uk guys u li...
2774,spam,come takes little time child afraid dark becom...
15,spam,xxxmobilemovieclub use credit click wap link n...


In [13]:
# check ham train

print(f'{_ham_train.shape[0]} messages')
_ham_train.head()

2412 messages


Unnamed: 0,label,message
2969,ham,mostly sports typelyk footblcrckt
1631,ham,going film 2day da 6pm sorry da
1994,ham,eh den sat u book e kb liao huh
4040,ham,cant pick phone right pls send message
4593,ham,right wasnt phoned someone number like


In [14]:
# check ham test

print(f'{_ham_test.shape[0]} messages')
_ham_test.head()

2413 messages


Unnamed: 0,label,message
5495,ham,good afternoon love goes day sleep hope well b...
3842,ham,howz painit come todaydo said ystrdayice medicine
1508,ham,wen ur lovable bcums angry wid u dnt take seri...
1479,ham,think far find check google maps place dorm
2532,ham,whats happening gotten job begun registration ...


We implemented our bag of words by creating a term/document (sparse) matrix and a term frequency (condensed) matrix in order to conduct Bayesian analysis. <br>
We used a method to create term/document matrices from the following site: https://www.kaggle.com/sid321axn/sms-spam-classifier-naive-bayes-ml-algo <br>
This method involved using the CountVectorizer class from sklearn to create a list of counts of words found in the data as well as a list of the words themselves. <br>
Each row of the dataframe holds the word counts for one message. <br>
Each column corresponds to one word found in the messages.
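
To show the layout of a term/document matrix on a tiny input, here is a plain-Python sketch. The function `term_document_matrix` and the two sample messages are hypothetical; it splits on whitespace only, so its tokens can differ from CountVectorizer's, whose default token pattern keeps only tokens of two or more characters.

```python
def term_document_matrix(messages):
    # Columns: the sorted vocabulary. Rows: one list of counts per message.
    vocabulary = sorted({word for m in messages for word in m.split()})
    rows = [[m.split().count(word) for word in vocabulary] for m in messages]
    return vocabulary, rows


vocabulary, rows = term_document_matrix(["free entry free", "call free"])
print(vocabulary)  # ['call', 'entry', 'free']
print(rows)        # [[0, 1, 2], [1, 0, 1]]
```

Summing each column of `rows` gives the per-word totals that the term frequency matrix is built from.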

In [15]:
def generate_tdm(df):
    
    """
        generates a term document matrix from input data
        Parameters:
            df (DataFrame): dataframe with column 1 = labels, column 2 = messages
        Returns:
            DataFrame: a term document matrix
    """
    
    _count_vectorizer = sk_feature_extract_text_CV()
    _count_vectorizer.fit(df.iloc[:,1])
    _features = _count_vectorizer.get_feature_names_out()
    _counts = _count_vectorizer.transform(df.iloc[:,1]).toarray()
    return pd.DataFrame(_counts, columns=_features)


In [16]:
# generate tdm for spam and ham

_spam_tdm = generate_tdm(_spam_train)
_ham_tdm = generate_tdm(_ham_train)

In [17]:
# tdm for spam category

_spam_tdm

Unnamed: 0,008704050406,0089my,0121,01223585236,02,0207,02070836089,02073162414,02085076972,020903,...,ymca,youll,youre,yourinclusive,youto,youve,yr,yrs,zebra,zed
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
370,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
371,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# tdm for ham category

_ham_tdm

Unnamed: 0,0125698789,10,100,1030,11,12,120,1230,1405,1526,...,yuou,yup,zac,zeros,zindgi,zoe,zoom,zyada,üll,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2407,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2408,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2409,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2410,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The frequency matrices were dataframes with one row for each word found in the dataset and, alongside the word itself, two columns, <br> 
one for the number of times the word appeared and the other for the word's frequency among the messages, <br>
i.e. a histogram of counts and the corresponding probability.


In [19]:
def generate_tfm(df):
    
    """
        generates a term frequency matrix from a term document matrix
        Parameters:
            df (DataFrame): should be a term document matrix
        Returns:
            DataFrame: a term frequency matrix
    """
    _counts = []
    _probability = []
    for _label, _data in df.items():  # items() replaces iteritems(), which was removed in pandas 2.0
        _sum = _data.sum()
        _counts.append(_sum)
    _sum_of_all_words = sum(_counts)
    for _c in _counts:
        _probability.append((_c + 1) / (_sum_of_all_words + df.shape[1] + 1)) # shape[1] is the number of unique words in the class
    _out_data = {
        'word': df.columns,
        'count': _counts,
        'probability': _probability
    }
    return pd.DataFrame().from_dict(_out_data)


In [20]:
# generated term frequency matrix

_spam_tfm = generate_tfm(_spam_tdm)
_ham_tfm = generate_tfm(_ham_tdm)

In [21]:
# look at spam tfm histogram

_spam_tfm.sort_values(by=['probability'], ascending=False)

Unnamed: 0,word,count,probability
627,call,164,0.020744
914,free,112,0.014207
1784,txt,78,0.009932
1815,ur,76,0.009681
1228,mobile,66,0.008423
...,...,...,...
287,2stop,1,0.000251
1053,increments,1,0.000251
1052,inclusive,1,0.000251
288,2stoptx,1,0.000251


In [22]:
# look at ham tfm histogram

_ham_tfm.sort_values(by=['probability'], ascending=False)

Unnamed: 0,word,count,probability
2112,im,248,0.010529
1256,dont,155,0.006596
1757,get,142,0.006047
4569,ur,141,0.006004
2373,know,121,0.005159
...,...,...,...
2071,hunt,1,0.000085
2072,hunting,1,0.000085
2073,hurricanes,1,0.000085
2074,hurried,1,0.000085


We needed the prior probability of each class, that is, the probability that a given message was spam or ham before looking at its words.

In [23]:
def calculate_class_probability(df):

    """
        spam = #_spam/#_messages; ham = #_ham/#_messages
        Parameters:
            df (DataFrame): the data set; can be pre or post processed
        Returns: 
            (dict): {spam: (float), ham: (float)}
    """

    _count = df.shape[0]
    _p = {
        # count the rows of each label; .count()[0] relied on deprecated positional Series indexing
        'spam': (df['Category'] == 'spam').sum() / _count,
        'ham': (df['Category'] == 'ham').sum() / _count
    }
    return _p

In [24]:
# calculate probabilities for each class

_p_classes = calculate_class_probability(_data)
print(_p_classes)

{'spam': 0.13406317300789664, 'ham': 0.8659368269921034}
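
These priors can be checked by hand from the class counts reported earlier (747 spam and 4825 ham entries). The variable names below are chosen for this check only.

```python
spam_count = 747
ham_count = 4825
total = spam_count + ham_count

p_spam = spam_count / total
p_ham = ham_count / total

print(total)             # 5572
print(round(p_spam, 6))  # 0.134063
print(round(p_ham, 6))   # 0.865937
```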


In [25]:
def calculate_probability(message, p_class, class_tfm):

    """
        calculates the probability that a given message belongs to 
            the class of the class_tfm
        Parameters:
            message (string): a single message from the data set
            p_class (float): probability of class in data set
            class_tfm (DataFrame): term frequency matrix of the class
                the message is scored against
        Returns:
            float: the unnormalized probability (class prior times the
                word probabilities) that the message belongs to the 
                label of the supplied class_tfm
    """

    _words = message.split()
    _words_in_class = class_tfm['count'].sum()
    _p_word_not_in_class = 1 / (1 + _words_in_class + class_tfm.shape[0])  # shape[0] is the number of unique words in the class
    _p = 1
    
    for _w in _words:
        
        # word not in class_tfm
        _row = class_tfm[class_tfm['word'] == _w]
        if _row.empty:
            _p_word = _p_word_not_in_class

        # word in class_tfm
        else:
            _p_word = _row['probability'].iloc[0]

        # product
        _p *= _p_word

    return _p * p_class
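
The product of many small word probabilities gets very small very quickly, and for a long enough message it can underflow to exactly 0.0 in floating point. A common alternative, not what this notebook does, is to add log probabilities instead of multiplying probabilities; the class with the larger log score is still the predicted class. The sketch below uses made-up probabilities, and `log_score` is a hypothetical helper.

```python
import math


def log_score(words, log_prior, log_word_probability, log_unseen):
    # Sum of logs replaces the product of probabilities.
    total = log_prior
    for w in words:
        total += log_word_probability.get(w, log_unseen)
    return total


# toy word probabilities for one class (hypothetical values)
word_p = {"free": 0.01, "call": 0.02}
log_word_p = {w: math.log(p) for w, p in word_p.items()}
words = ["free", "call"] * 200  # an artificially long message

# plain product, as in calculate_probability
product = 0.13
for w in words:
    product *= word_p[w]

score = log_score(words, math.log(0.13), log_word_p, math.log(1e-4))

print(product)               # 0.0, the product has underflowed
print(math.isfinite(score))  # True, the log score is still usable
```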


In [26]:
# recombine the train data frames to run them through the trained model

_train_data = pd.concat([_spam_train, _ham_train])


In [27]:
#NOTE:  This cell can take a minute or two to run
# make predictions
# run our training data set through the model to see how accurate it is on the data it was trained on

_predictions = []
print('Prediction, Probability Ham, Probability Spam')
for _message in _train_data['message']:
    _p_spam = calculate_probability(_message, _p_classes['spam'], _spam_tfm)
    _p_ham = calculate_probability(_message, _p_classes['ham'], _ham_tfm)
    _prediction = 'spam'
    if _p_ham > _p_spam:
        _prediction = 'ham'
    _predictions.append(_prediction)
    print(_prediction, _p_ham, _p_spam)


Prediction, Probability Ham, Probability Spam
spam 7.189182208397568e-51 2.651503462268494e-42
spam 1.0734592422332983e-27 5.218801321694282e-23
spam 3.1407464381435687e-85 8.918895017173444e-64
spam 4.919128642594028e-51 5.035744216683025e-40
spam 5.449907676513799e-84 3.7827195746073724e-63
spam 2.1787863024162587e-81 1.1062480410264456e-68
spam 6.296052414008574e-72 3.641586095899106e-58
spam 1.3089063860126498e-76 2.0772326374133972e-60
spam 5.191337989105381e-60 7.721165359686796e-43
spam 2.671331204511014e-64 3.840219300930297e-48
spam 7.200458774070013e-59 1.100821732991224e-48
spam 2.068299716759566e-88 1.5549793500547441e-69
spam 6.713749579676807e-68 2.2757866658702887e-55
spam 2.4734548189916792e-65 3.2775347616496015e-47
spam 1.1575768552881058e-64 1.6904476656572273e-49
spam 3.257586100144175e-72 7.841238836687558e-62
spam 1.2847500429911984e-84 2.351110409163255e-71
spam 1.5422273897257352e-60 3.621500687226019e-54
spam 6.94732122618832e-91 9.301101649263084e-73
spam 1.562

spam 2.8254150724601015e-60 7.117828999023617e-50
spam 4.9806565468395815e-83 7.58975183154872e-76
spam 1.8109646802729488e-61 5.983024922339346e-47
spam 3.1358474627785404e-55 1.5711894153044808e-46
spam 6.122817078894587e-53 1.2111818050575365e-47
spam 2.680861588036621e-66 3.7759367214495102e-53
spam 1.8011425026669794e-34 3.7705718006642624e-31
spam 1.8634618060785574e-62 4.764302907452337e-49
spam 1.3288212482709445e-79 2.937530356578768e-66
spam 1.394183420848152e-61 5.242791402045272e-42
spam 3.217033905643947e-65 6.717728689363076e-53
spam 2.2781803099585218e-68 2.049223841689967e-52
spam 8.280844826481069e-100 5.361366307941959e-85
spam 7.189182208397568e-51 2.651503462268494e-42
spam 2.1587179631751144e-60 7.136032109635319e-47
spam 1.3720770175518024e-77 4.0907056639238264e-63
spam 3.2944535644075234e-68 3.2124830529043236e-50
spam 1.0029174375229337e-51 4.844727220230146e-47
spam 7.503831817071062e-106 3.848653386534848e-79
spam 9.359553035064515e-61 2.3230976547289043e-45


spam 6.116487421052415e-89 1.1723938676512519e-76
spam 7.107872567690211e-53 1.0358976200931154e-38
spam 1.104340025461299e-58 3.6987610228505125e-46
spam 1.5691758727072467e-43 4.5899051243096946e-36
spam 1.5921917550647274e-100 5.680070224560639e-74
spam 7.978361325697087e-72 9.294605353281868e-54
spam 1.9192762184260242e-53 3.044185709298351e-44
spam 1.3930036213139634e-72 2.3694452154520198e-52
spam 4.062448472514395e-47 3.715384625840411e-36
spam 3.1407464381435687e-85 8.918895017173444e-64
spam 3.6971083776862787e-91 4.108860967774305e-68
spam 2.8208123420908352e-24 2.452884064844691e-19
spam 1.84136157246371e-79 8.904327770756371e-67
spam 8.777922617164243e-21 8.489304053608249e-18
spam 1.8532983114918906e-88 1.743034903631335e-68
spam 3.4394248458174326e-62 3.672356646270283e-52
spam 5.205289346423711e-76 3.5911016909802464e-66
spam 2.649136086406144e-66 9.713665798113369e-56
spam 2.167148264096965e-45 4.053077140359368e-38
spam 7.20092914634425e-60 9.634939719466796e-47
spam 7

ham 1.488627561396062e-43 2.3792702670384718e-54
ham 1.6853611424955344e-18 1.6843857249222715e-20
ham 5.3602861019552156e-36 2.116282351570786e-39
spam 1.3505439231073225e-64 4.495747500951216e-64
ham 3.4588504783458074e-109 2.7291911549064186e-109
ham 1.2816577644657645e-11 1.1488445478047223e-14
ham 5.289885920585496e-55 6.017174228399854e-60
ham 2.878690462556956e-16 6.698802028015874e-17
ham 1.7426634213403827e-16 1.1521198358468336e-17
ham 3.1141768256682817e-21 5.294146734103192e-25
ham 4.58015125554048e-11 1.3397604056031747e-15
ham 3.2958965260219225e-22 5.294146734103192e-25
ham 1.7532996418794977e-12 2.1436166489650795e-14
ham 3.8026885261304995e-36 7.139805614883563e-41
ham 2.260450547144117e-21 8.470634774565108e-23
ham 5.086502712670867e-109 2.611017876307628e-115
ham 1.0518292074727339e-16 6.698802028015874e-17
ham 8.265716787776522e-14 3.0144609126071437e-16
ham 7.117691696230004e-61 3.0085871141999256e-61
ham 6.643131836669899e-17 3.349401014007937e-17
ham 5.1715397362

ham 1.1248001118528626e-61 2.146556293425904e-65
ham 1.3439913455783043e-31 3.787404710488658e-35
ham 7.267761641838194e-47 3.0533968426993724e-51
ham 1.3173082309503763e-14 2.4053028151890033e-16
ham 1.4890912216289548e-47 9.569399395207449e-51
ham 2.157262262394284e-17 5.0531571747668146e-20
ham 2.9415587094833415e-07 1.0595179654137187e-08
ham 1.7723875740235291e-12 1.641206496863889e-14
ham 7.586500562149559e-22 2.647073367051596e-23
ham 3.122212755868431e-18 1.415125422025783e-20
ham 8.590659126848467e-25 1.6066675769187617e-29
ham 1.5124462540091934e-09 4.262581706467061e-11
ham 7.864788065606909e-29 4.820002730756284e-30
ham 2.956531422124541e-32 5.681107065732985e-35
ham 7.478790069823933e-18 1.2969770081901492e-18
ham 1.893138333916717e-45 4.1637229673173264e-53
ham 8.549710673794159e-14 1.2057843650428572e-14
ham 5.914798089942044e-21 1.905892824277149e-23
ham 8.388719797065864e-21 2.7000148343926273e-21
spam 1.3587407660429113e-22 3.388253909826043e-22
ham 1.5011856687175563

ham 2.8779845170700718e-70 1.026509123455448e-81
ham 1.5713977143063425e-44 4.6241911998925856e-45
ham 2.2143772788899663e-16 3.349401014007937e-17
ham 3.481238134666182e-56 3.133944910624923e-63
ham 1.5187306567266836e-10 2.664113566541913e-13
ham 7.4313062134316e-08 2.1190359308274375e-09
ham 1.4401756227580618e-11 5.328227133083826e-13
ham 6.321604255775078e-12 2.6795208112063494e-15
ham 1.3119817100864337e-22 4.313058943548991e-25
ham 8.843848327188193e-16 2.6950171598756344e-19
ham 2.7375627656856843e-56 9.945051849716421e-59
ham 2.832808217735095e-19 1.2705952161847661e-23
ham 7.336230303621047e-41 9.977443824981704e-44
ham 5.493474765683326e-80 1.0692803369327585e-83
ham 6.546252830718463e-12 2.664113566541913e-13
ham 4.580027793931895e-35 3.6664591740963864e-37
ham 2.7407109389861742e-24 5.990359643818045e-27
ham 1.9323844076400407e-60 4.021895968635319e-61
ham 2.225725962444277e-11 3.4633476365044874e-12
ham 2.303198165923522e-11 8.038562433619047e-15
ham 1.0411340874192967e-3

ham 2.9098906265779136e-22 5.294146734103192e-24
ham 4.878267609451399e-10 1.864879496579339e-12
ham 7.124758894828466e-14 1.6747005070039684e-16
ham 3.993214226738264e-09 1.0656454266167653e-12
ham 2.04360920869369e-07 2.330939523910181e-08
ham 3.5870456065198534e-41 1.6149090734100487e-47
ham 3.777580658494397e-07 2.1190359308274375e-09
ham 2.5405728290518393e-65 1.7902846632081588e-71
ham 2.683439960368113e-09 4.262581706467061e-12
ham 2.7913609586207334e-13 6.698802028015874e-16
spam 2.0784933046985093e-23 6.035327276877639e-23
ham 3.056121538391903e-16 1.2632892936917036e-20
ham 1.0692973842533053e-22 1.449667033803967e-25
ham 1.6471645389023018e-12 6.2449581906178e-13
ham 3.1422013587448622e-12 2.664113566541913e-13
ham 3.4512080274610803e-28 1.928001092302514e-29
ham 3.2729588677664997e-19 8.968284567570807e-22
ham 2.62403707548461e-15 6.698802028015874e-17
ham 3.777580658494397e-07 2.1190359308274375e-09
ham 7.811142465518521e-63 2.1013795394350335e-66
ham 2.002619037608799e-65

ham 2.41517451936527e-07 2.1190359308274375e-09
ham 2.0738528967716092e-10 1.0656454266167653e-12
ham 1.6996032682673864e-27 5.623336519215665e-30
ham 2.168827180253765e-47 2.4784065281650746e-54
ham 1.9021384020774517e-14 5.053157174766815e-20
ham 7.78636847832937e-18 8.421928624611358e-21
ham 1.1783255095293231e-12 2.664113566541913e-13
ham 9.688454189463325e-12 2.664113566541913e-13
ham 1.0280702969222762e-18 4.210964312305679e-21
ham 2.6276095266495077e-58 6.700062699528399e-70
ham 7.659115811940602e-10 1.0656454266167653e-12
ham 1.7924926259719956e-21 1.0588293468206383e-24
ham 5.154783053782293e-28 1.874445506405221e-29
ham 1.4090809218121495e-09 1.5984681399251478e-12
ham 1.8101912695344676e-24 4.659168611858478e-28
ham 4.516101747295237e-32 3.787404710488657e-35
ham 1.8897111810231187e-18 7.327077903411882e-19
ham 3.3287397336529377e-43 8.314536520818088e-44
ham 4.228505614832434e-54 1.4666862181724636e-58
ham 5.507970883818218e-18 3.368771449844543e-20
ham 2.6622328741612285e-

ham 1.3950576857006785e-13 1.6747005070039684e-16
ham 7.455382276178162e-19 1.6843857249222715e-20
ham 6.346508275590684e-17 6.737542899689086e-20
ham 5.0508923409701825e-19 1.5300084061558223e-21
ham 3.9277516984310775e-11 6.926695273008974e-12
ham 4.5066095055093025e-46 2.090655398747319e-48
ham 4.9640296413040144e-67 2.1357140119907487e-77
ham 8.337727427990593e-21 4.235317387282553e-24
ham 1.7106415596329675e-18 1.5159471524300443e-19
ham 4.674024521132982e-10 1.9048412000774678e-10
ham 1.4044676187462786e-20 4.210964312305679e-21
ham 4.871630013557926e-14 8.038562433619049e-16
ham 8.310677448687674e-06 9.323758095640724e-08
ham 1.5720674212500012e-18 4.210964312305678e-20
ham 2.888042545072965e-67 1.5233600744595347e-74
ham 4.161257782490025e-12 3.014460912607143e-16
ham 8.056240049011423e-21 4.235317387282553e-24
ham 9.58371414417183e-10 1.0656454266167653e-12
ham 1.1098725100515142e-31 8.368060296451884e-32
ham 2.4274580497727983e-41 6.690097275991421e-45
ham 3.153847337091291e-

ham 2.1191465025051003e-43 4.758540534076943e-52
ham 5.574425609978565e-16 4.210964312305679e-21
ham 6.132026536773228e-51 2.393030190634621e-55
ham 1.6534276279773105e-46 1.3746894876222285e-51
ham 9.408904572109181e-71 6.263816095639535e-77
ham 4.9542041422877325e-08 1.2714215584964626e-08
ham 8.619217776245915e-14 1.0948507211994766e-19
ham 3.022624985684804e-13 1.2727723853230159e-14
ham 2.8102132854173567e-31 3.3665819648788065e-35
ham 3.601266899434628e-32 8.432062494539847e-37
ham 1.0130260049518471e-13 3.90777488181967e-18
ham 4.0301666475797393e-17 7.579735762150221e-20
ham 2.4180999885478435e-14 3.014460912607143e-16
ham 2.8062043301580125e-62 2.626724424293792e-67
ham 1.4808221126274998e-50 3.5689054005577093e-54
ham 9.583714144171831e-11 1.0656454266167653e-12
ham 1.5497097948583543e-12 2.0096406084047622e-16
ham 1.3070197347274435e-43 5.828812436839395e-47
ham 3.4340892745894376e-23 1.1315123771656307e-26
ham 3.0299085547556226e-15 1.6675418676730488e-17
ham 2.798614670693

ham 6.915407509035352e-27 2.5104180889355648e-31
ham 5.171670661324201e-09 6.180743474377238e-11
ham 1.6716511228522665e-09 6.154102338711818e-11
ham 3.808752885662268e-21 1.394023248668057e-23
ham 1.9338703361298053e-33 4.9807399574308665e-39
ham 2.935994394577231e-10 6.180743474377238e-11
ham 1.8595586766036005e-22 2.1176586936412767e-24
ham 1.2381786526867195e-16 2.35814001489118e-19
ham 3.1422013587448622e-12 2.664113566541913e-13
ham 5.8916275476466164e-12 2.664113566541913e-12
ham 1.877532694895659e-21 5.717678472831447e-23
ham 8.857509115559863e-17 3.349401014007937e-17
ham 5.8916275476466164e-12 2.664113566541913e-13
spam 3.142201358744862e-12 3.99617034981287e-12
ham 2.4967950310626677e-21 7.337687373467023e-22
ham 5.9242068660371304e-24 1.5974292383514784e-26
ham 2.0180505893435822e-12 1.3397604056031748e-16
ham 1.6138452314524573e-28 2.5104180889355647e-32
ham 1.6326354559811847e-10 2.3977022098877215e-12
ham 4.034072218414299e-34 1.333257881489595e-36
ham 2.0329644037384474

ham 1.236269700851565e-28 3.938900898908203e-33
ham 3.4276179821641876e-10 2.7173958378727508e-11
ham 1.355198894680659e-12 2.177110659105159e-15
ham 1.0154924431527003e-21 5.241205266762161e-22
ham 6.556178478354157e-24 7.521229330571543e-26
ham 1.636563207679616e-11 5.328227133083826e-13
spam 4.445563247110823e-55 1.4024079862332272e-50
ham 6.502392936752649e-08 6.357107792482313e-09
ham 8.472366642029802e-16 5.0531571747668146e-20
ham 5.296066525488496e-37 2.5143158438953897e-41
ham 1.571100679372431e-11 2.664113566541913e-13
ham 2.7494261889017542e-12 2.664113566541913e-13
ham 4.453131625454001e-40 2.2206522492359068e-49
ham 1.0927951871321981e-13 2.6795208112063496e-16
ham 7.639601612170385e-16 1.0048203042023811e-16
ham 3.36322070410834e-54 1.4956438691466384e-57
ham 7.016535634077278e-09 2.0513674462372727e-11
ham 1.4037766991946176e-72 1.1274868972151166e-76
ham 2.428245530066738e-18 1.6941269549130213e-23
ham 6.554397759835149e-36 2.4584421584754914e-39
ham 6.98373750378699e-2

spam 8.70987670540328e-24 1.8000098895950852e-23
ham 1.1676223336572719e-26 1.5974292383514787e-27
ham 4.4287545577799314e-17 3.349401014007937e-17
ham 3.425364853282916e-14 1.3397604056031748e-16
ham 1.1248001118528626e-61 2.146556293425904e-65
ham 3.5371180345073186e-21 1.4376863145163308e-26
ham 4.91142131993765e-35 4.2649370134445305e-43
ham 2.9415587094833416e-08 2.1190359308274375e-09
ham 2.3353534740714734e-51 1.3478470271615665e-57
ham 9.436030680310821e-09 7.459517986317355e-11
ham 1.0863658632978112e-24 4.313058943548992e-26
ham 4.298531458762972e-09 7.033259815670651e-10
spam 2.902970697953144e-47 1.8815898588725865e-46
ham 9.895528997328789e-29 1.716956802088191e-31
ham 1.806324222651591e-45 2.102746189336001e-49
ham 1.1379775799523688e-37 1.330325843330894e-43
ham 1.9595578135217038e-13 6.028921825214286e-16
ham 3.39397739784705e-34 2.104113728049254e-36
ham 9.341904145317045e-13 3.349401014007937e-17
ham 3.164864947017945e-65 1.3473768734119526e-68
ham 7.431306213431599e-

ham 6.919715343280361e-30 2.6777792948646026e-31
ham 2.3222831916973746e-08 2.1190359308274375e-09
ham 1.520857591396264e-59 3.404234853884755e-63
ham 4.189601811659816e-11 2.664113566541913e-12
ham 1.0739572927705097e-34 2.328070225829064e-42
ham 1.0252871349270817e-30 3.4717876512812694e-33
ham 1.2701441697112424e-20 1.4935963378586323e-23
ham 3.31270840921939e-14 9.110370758101587e-15
ham 7.215341043453397e-23 1.0588293468206383e-24
ham 3.465340122523916e-42 2.129030516702701e-48
ham 4.0718616723576845e-45 2.1027461893360008e-51
ham 2.6930001016508835e-40 2.6453529394634826e-40
ham 1.2558173615105539e-15 2.7792364461217482e-18
ham 2.3710409602023844e-16 4.210964312305679e-21
ham 2.0055797595696857e-18 1.4317278661839309e-19
ham 1.5044332491151115e-23 5.294146734103192e-25
spam 1.6104332965022402e-34 1.6832909824394032e-34
ham 6.00679240753211e-37 4.2498842945735487e-44
ham 7.322932997819055e-05 1.685481179380144e-05
ham 1.834471312242476e-24 2.259376280042008e-30
ham 1.2793763695032

spam 4.169233333238373e-72 3.1388107731571604e-70
ham 4.5448387694134365e-31 4.208227456098508e-35
ham 2.618501132287385e-12 2.664113566541913e-13
ham 3.384412713481446e-10 6.393872559700591e-12
ham 4.652548568447189e-34 1.634305298532003e-39
spam 5.52529144517847e-49 6.306294609874554e-49
ham 2.6687745221645438e-25 9.58457543010887e-27
ham 7.86501866497916e-19 4.210964312305679e-21
ham 1.1263974811302794e-23 4.1932517506726315e-27
ham 8.847291652256006e-65 2.5216554473220403e-65
ham 5.139147653981766e-36 1.4473945175440125e-40
ham 9.799585732809798e-15 1.6843857249222715e-20
ham 8.850016642075781e-21 4.235317387282553e-24
ham 1.815256230826065e-39 3.746454474555195e-45
ham 1.7438221071258482e-15 3.349401014007937e-17
ham 2.5403079116490503e-27 3.347224118580753e-32
ham 4.1564469137640733e-17 4.404730082773856e-22
ham 2.4670045130597992e-27 4.142189846743682e-30
ham 9.632855472561516e-55 1.1700060999666378e-61
ham 5.567588032526054e-10 2.664113566541913e-12
ham 4.3172563980814865e-56 2

ham 1.7889125888158834e-67 1.3049616865915697e-78
ham 7.203934721548996e-83 5.387293055848673e-88
ham 4.3070483980081436e-47 3.446312094545712e-52
ham 2.4865347192717274e-77 2.087938698546512e-77
ham 3.714544517335165e-71 5.118783583311161e-81
ham 1.013466642622552e-14 1.414884008934708e-18
ham 3.1578416440858114e-42 1.2616477136016003e-49
ham 1.203213472928177e-32 2.3986896499761496e-34
ham 3.019895208608924e-32 5.079077643769886e-37
ham 1.1196458956245389e-10 2.2106046692452386e-14
ham 2.3513045508480633e-28 1.6736120592903766e-32
ham 1.0459464863899606e-24 3.9994993742760215e-30
ham 7.089591815668095e-11 5.328227133083826e-13
ham 6.333499613720113e-10 5.328227133083826e-13
ham 1.2074767679628957e-39 1.128953915323552e-46
ham 1.920021499353474e-33 3.787404710488658e-35
ham 9.127746659912461e-55 5.014311856999878e-61
ham 1.2406928926022756e-26 6.655955159797827e-29
ham 4.9252658628012e-89 1.3509029906021374e-96
ham 2.7965592092829276e-10 5.5946384897380175e-12
ham 3.565922108664443e-5

ham 5.159232780945836e-09 2.664113566541913e-13
ham 1.7048752253505037e-20 2.1176586936412767e-24
ham 3.8523305076466093e-44 1.61911456578872e-46
ham 4.5936798940438674e-60 4.137239576006113e-65
ham 1.062411757157735e-18 6.776507819652085e-23


In [28]:
# compare predictions against the true labels (column 0 of each row)

_sum = 0
for _i, (_, _row) in enumerate(_train_data.iterrows()):
    # count a hit when the stored class matches the prediction
    if _row.iloc[0] == _predictions[_i]:
        _sum += 1
_score = _sum / _train_data.shape[0]
print('Accuracy using TRAINING data set')
print(f'accuracy: {_score:.1%}')

Accuracy using TRAINING data set
accuracy: 96.5%


In [29]:
# recombine the test data frames for testing

_test_data = pd.concat([_spam_test, _ham_test])


In [30]:
# check some data from the recombined frame

random_seed()
for _ in range(5):
    # subtract 1 so the index stays in range if random_randint includes its upper bound
    _index = random_randint(0, _test_data.shape[0] - 1)
    print(f'message # {_index}')
    print(f'class: {_test_data.iloc[_index,0]}\nmessage: {_test_data.iloc[_index,1]}')
    print()

message # 1716
class: ham
message: im movie collect car oredi

message # 1095
class: ham
message: world suffers lot violence bad people silence good people gud night

message # 1965
class: ham
message: sorry ill call later

message # 1790
class: ham
message: somewhere beneath pale moon light someone think u dreams come true goodnite amp sweet dreams

message # 1484
class: ham
message: carlos took leave minute



In [31]:
# NOTE: This cell can take a minute or two to run
# make predictions
# run our TEST data set through to find the predictions and accuracy

_predictions = []
print('Prediction, Probability Ham, Probability Spam')
for _, _row in _test_data.iterrows():
    # column 1 holds the cleaned message text
    _p_spam = calculate_probability(_row.iloc[1], _p_classes['spam'], _spam_tfm)
    _p_ham = calculate_probability(_row.iloc[1], _p_classes['ham'], _ham_tfm)
    # predict ham only when it is strictly more probable; ties go to spam
    _prediction = 'ham' if _p_ham > _p_spam else 'spam'
    _predictions.append(_prediction)
    print(_prediction, _p_ham, _p_spam)

_predictions

Prediction, Probability Ham, Probability Spam
spam 9.363117458308525e-21 3.789867881075111e-20
spam 1.150444101856595e-69 3.70373203211873e-56
spam 2.9866376643437563e-82 2.4384792815728806e-71
ham 1.5731950834240891e-41 1.3547446983882627e-45
spam 1.4254205166717182e-46 3.9331625455957223e-41
spam 9.665487650954514e-50 1.9682483122085618e-41
spam 3.217033905643947e-65 8.397160861703844e-54
spam 4.442131616251323e-59 7.999827250964648e-47
spam 3.343091446763462e-89 3.978452476827766e-73
spam 1.3617074732884429e-67 2.871156728533239e-50
spam 8.905735935565143e-28 7.987146191757392e-27
spam 1.1633728397674626e-61 3.60617954748807e-51
spam 3.4027008763016167e-60 1.8840201278514237e-48
spam 5.493063342320914e-57 3.494774791854634e-54
spam 1.074969702213262e-62 8.015160024490105e-48
spam 8.044745006200806e-60 6.385354675243106e-53
spam 6.176488485370476e-80 1.483451964726187e-60
spam 8.4726874651254e-30 7.190841573947033e-26
spam 4.580632527148014e-52 3.3411710478578046e-42
spam 3.2944535644

spam 9.593448779554703e-65 4.303061790825653e-53
spam 4.3769168784271385e-67 9.833168545441435e-57
spam 5.231466321377544e-16 6.698802028015874e-16
spam 6.108191897644941e-71 4.7583332212913746e-55
spam 4.257922227090452e-76 1.3825571367789363e-68
spam 9.363117458308525e-21 4.911668773873343e-17
spam 1.3243013262395174e-87 3.3400334923149808e-80
spam 1.0620780079990093e-80 9.25031131443493e-72
spam 2.6240114810147305e-28 4.9815830797990864e-23
spam 2.3912146906722258e-79 1.3908146471464031e-66
spam 2.149843532741468e-49 1.3996610728333696e-40
spam 7.428023212711482e-39 7.365509096406962e-35
spam 2.2308962372036726e-48 5.771225420683102e-36
spam 4.9919366830256885e-45 1.0228719864609576e-38
spam 1.2060322999255245e-81 3.34420155975289e-61
spam 4.05385640831232e-59 1.0946469908089156e-48
spam 2.5623115711637218e-53 8.361003162480965e-46
spam 6.199122330176483e-69 7.596364664591957e-53
spam 1.0315590318246365e-60 3.606517793969146e-47
spam 1.849467500983427e-74 7.7986544403549e-59
ham 4.6

spam 3.4616557245045176e-81 1.2544904037781186e-70
spam 5.80438531190173e-57 7.367049258672423e-48
spam 3.947862342608306e-50 4.285896421219121e-40
spam 9.992039339660703e-81 1.8950432478260023e-65
spam 1.3270414754143308e-79 8.12455738204054e-76
spam 2.671331204511014e-64 1.9201096504651486e-48
spam 2.2465574522250478e-86 1.2585642836464576e-65
spam 2.437676737092504e-67 2.960731074955058e-57
spam 1.9184989056716282e-100 1.1236263548268769e-81
spam 1.6555831164227756e-62 1.4788709959849736e-55
spam 1.4156323965535087e-59 6.81710960793945e-49
spam 2.947533659298419e-65 2.197129914527395e-45
spam 3.145809106706833e-98 3.5475383457873055e-88
spam 7.480800649517213e-60 1.0022358624270533e-48
spam 2.2106697968098703e-97 3.3981490023798063e-78
spam 1.0606454113492491e-62 2.4795695020640806e-56
spam 2.8322637745946043e-82 2.986881913757425e-71
spam 8.925585387072428e-61 2.2118596445774933e-45
spam 1.547599418100722e-48 7.705982927669003e-36
spam 9.934553241635168e-90 6.9040335984638394e-71
s

ham 3.570494094646301e-64 3.309672774610177e-65
ham 6.741444569982136e-20 4.210964312305679e-21
ham 3.616351652981077e-78 2.352578060890528e-84
ham 3.4060153478228166e-08 2.1190359308274375e-09
ham 3.097032214212904e-09 1.0656454266167653e-12
spam 3.561603166150212e-75 8.303732204119476e-75
ham 1.5174214061605398e-10 1.0656454266167653e-12
ham 3.3078447760670534e-25 5.324764127838262e-28
ham 2.9272924214854365e-39 9.25906786958302e-40
ham 2.3621389762788982e-17 8.421928624611356e-20
ham 2.352775858820589e-16 3.349401014007937e-17
ham 2.1273471021149888e-17 1.2632892936917036e-20
ham 9.772979949432139e-22 7.188431572581653e-26
ham 2.200464556743682e-38 5.290705878926965e-40
ham 4.6815587291542624e-21 4.210964312305679e-21
ham 4.939032872573865e-70 2.4911196612358426e-74
ham 7.099847150440954e-13 4.0192812168095244e-16
spam 1.353322177489202e-35 6.018707007867315e-35
ham 4.759387098750236e-49 1.0968807187187228e-58
ham 3.43414602354587e-19 4.235317387282553e-24
spam 6.88898097901395e-78 

ham 1.9253968855806402e-56 1.418431189118648e-63
ham 1.4401756227580618e-11 2.1312908532335306e-12
ham 1.1487560854929683e-06 1.2714215584964626e-08
spam 3.1422013587448622e-12 3.1969362798502957e-12
ham 5.472099765641677e-44 2.341534046596997e-45
ham 8.718527235924938e-20 1.8636674447433914e-26
ham 6.970340129800137e-19 9.825936338495524e-22
ham 4.491310715908575e-12 1.808676547564286e-15
ham 5.3211486011725886e-14 4.689161419611111e-16
ham 3.096377588929833e-09 2.1190359308274375e-09
ham 2.8762425130392598e-22 1.810419803465009e-25
ham 3.096377588929833e-09 2.1190359308274375e-09
ham 1.3396982537284298e-15 3.6843411154087304e-16
ham 2.617634078254027e-40 1.5770596420020003e-47
ham 1.3533517163340675e-33 5.290705878926965e-40
ham 4.66880404976185e-38 1.1640351129145322e-43
spam 2.767971598612457e-18 3.349401014007937e-17
ham 1.1514204103950705e-09 4.262581706467061e-11
spam 3.7452469833234096e-21 2.5265785873834076e-20
ham 4.312438573767029e-24 1.863667444743391e-27
spam 5.23297665908

ham 5.2388352153673715e-09 2.1312908532335306e-12
ham 1.5651806806298346e-31 3.3066911743293527e-37
spam 4.453914224353948e-26 5.294146734103192e-25
ham 2.1055367604722866e-07 5.509493420151338e-08
ham 1.806373006108453e-09 2.3977022098877216e-11
ham 5.883117418966683e-08 2.1190359308274375e-09
spam 6.891392122356605e-73 7.858652157978674e-71
ham 8.839966401320361e-46 2.0501775346026016e-50
ham 5.3115414004894756e-58 3.400330228028041e-60
ham 1.9638758492155387e-12 2.664113566541913e-13
spam 4.327306324333015e-51 9.981336179009772e-50
ham 0.00010984399496728583 3.370962358760288e-05
spam 7.776948362893534e-11 6.78283314041571e-10
ham 1.0630370867148338e-19 4.235317387282553e-24
ham 7.740943972324584e-08 2.1190359308274375e-09
ham 2.751002588226449e-23 3.6101900786743413e-25
ham 7.600978752654861e-18 2.6950171598756344e-19
ham 1.9071324314439836e-14 1.0048203042023811e-16
ham 1.9680278066134575e-14 6.698802028015874e-17
spam 4.112899591337111e-42 1.984014704597612e-39
ham 1.161745137279

ham 2.72092226059958e-35 1.3967463520367186e-37
ham 1.1939894848429897e-24 1.5531119910214693e-29
ham 1.0343079472535171e-10 2.664113566541913e-13
ham 3.4805843148146745e-19 1.602855865217082e-20
ham 7.828738253607445e-24 3.194858476702957e-26
spam 2.9493819993671853e-19 6.737542899689087e-19
ham 1.2211135611301553e-34 1.052056864024627e-34
ham 1.6198443270674055e-10 2.2106046692452386e-14
ham 4.799385262841241e-08 2.1190359308274375e-09
ham 2.610905303249333e-16 8.421928624611357e-19
ham 1.6571313433587347e-16 1.3896182230608743e-18
ham 1.9001347718878277e-108 1.1372461492703818e-113
ham 1.0253900303179314e-23 1.5882440202309576e-24
ham 1.4371716482197669e-30 6.694448237161506e-32
ham 2.7738235470239003e-19 2.5265785873834076e-20
ham 4.159757933803283e-22 4.7647320606928724e-24
ham 7.466467228538553e-22 7.188431572581653e-26
spam 2.767971598612457e-18 3.349401014007937e-17
ham 1.010376064811914e-12 5.359041622412699e-16
ham 1.018708226757915e-06 4.238071861654875e-09
ham 1.10249318721

ham 2.5812441115490234e-106 2.8999802125418468e-114
spam 1.6009625714515647e-76 7.516579314767441e-75
ham 9.230034090426106e-46 3.532613598084482e-48
ham 1.6198443270674055e-10 2.2106046692452386e-14
ham 8.088505722190354e-28 4.713214750830329e-33
ham 1.0527683802361433e-07 7.204722164813286e-08
ham 6.775892448456139e-39 1.871768461566567e-40
ham 1.0028885337987762e-25 3.0125017067226774e-31
ham 8.499860939041872e-44 7.526359435490349e-47
ham 9.964697755004847e-15 3.349401014007937e-17
ham 7.344595037749582e-41 6.397405520166795e-44
ham 2.331034231926524e-15 1.4738375093069871e-18
ham 9.464239126858257e-18 3.4108810929675997e-19
ham 1.1469219335153135e-32 3.787404710488657e-35
ham 3.73067049323658e-132 2.928537962153665e-139
ham 2.76294877348472e-50 6.780252206798091e-54
ham 2.223450269291729e-80 7.528249794849697e-84
ham 3.2713760013049183e-52 4.299477575840203e-55
ham 1.3009466513478552e-15 2.6795208112063496e-16
ham 1.7695671397505013e-32 8.368060296451883e-33
ham 6.665240263354544e

ham 2.0128225384888923e-32 5.049872947318209e-35
ham 1.3081156977882616e-12 4.099666841145714e-14
ham 9.359345249908339e-54 2.3265571297836596e-57
ham 5.221810606498665e-17 1.2632892936917038e-19
spam 1.956240056706821e-52 4.0477864144718023e-50
ham 1.7936455959008728e-14 6.698802028015874e-17
ham 2.942117074413019e-31 3.787404710488658e-35
ham 6.317242454206982e-32 6.312341184147761e-35
ham 3.1337476825898392e-27 1.3311910319595654e-27
ham 1.223069147627284e-07 4.238071861654875e-09
ham 7.852029561352444e-31 1.6832909824394033e-35
ham 2.8044147126797896e-10 2.664113566541913e-13
ham 6.524877444737152e-19 1.362713369358162e-20
ham 1.488444604584468e-67 4.755447788512775e-69
ham 3.049197513031484e-14 1.0048203042023811e-16
ham 2.1614756652505235e-16 8.421928624611358e-21
ham 3.0757840850543504e-19 8.421928624611356e-20
ham 4.2184429784897986e-48 8.92226350139427e-53
ham 1.2385510355719331e-08 2.1190359308274375e-09
ham 2.1674643122508833e-08 1.4833251515792061e-08
ham 1.5572424948704223

ham 2.794662044915573e-69 1.528475524270973e-72
spam 1.1190645591341167e-55 3.96545044506412e-54
ham 3.0127819402815586e-09 5.594638489738017e-12
spam 6.908906073140592e-59 1.6578117289138708e-56
ham 5.578676029810801e-22 2.1176586936412767e-24
ham 4.712594622755934e-44 1.0093181708812802e-47
ham 4.0192121960570045e-20 3.811785648554298e-23
ham 1.0303034004360827e-15 3.234020591850761e-17
ham 7.132470580534898e-74 5.3018483372915936e-83
ham 6.2438409588964e-31 1.8179542610345558e-33
ham 2.0192893982077128e-45 6.308238568008001e-49
spam 1.5481887944649164e-09 2.1190359308274375e-09
ham 1.2648523017019485e-13 6.377259530671111e-14
ham 7.201142053000652e-41 1.496616573747255e-41
ham 3.1488164012291566e-17 2.526578587383407e-19
ham 1.0017779504730699e-18 3.60001977919017e-22
ham 6.620518577400311e-30 1.2952924109871206e-32
ham 4.526163750257446e-16 2.269709764332761e-18
spam 5.212966460970852e-49 2.090655398747319e-48
ham 4.4287545577799314e-17 3.349401014007937e-17
ham 1.6430679409363552e

ham 7.86501866497916e-19 5.474253605997383e-20
ham 9.073955317617878e-62 6.116694717974553e-66
ham 1.959645237291794e-34 2.0633752927815163e-37
ham 1.234376256896268e-41 5.740497096887282e-48
ham 4.1403586629594307e-22 4.235317387282553e-24
ham 1.0560098394178688e-16 5.0531571747668146e-20
spam 8.51964272341506e-33 4.5448856525863874e-32
ham 2.2758375278634278e-07 4.238071861654875e-09
ham 4.45391422435395e-22 2.1176586936412767e-24
ham 2.832808217735095e-19 1.2705952161847661e-23
ham 7.701164109458761e-19 3.368771449844543e-20
ham 3.3368811678236187e-47 9.970959127644253e-55
ham 3.282382409612192e-36 7.618616465654829e-38
ham 3.2385267703765754e-16 2.6795208112063496e-16
ham 5.756141654533133e-48 2.628432736670001e-51
ham 1.1788732365740466e-52 2.2733786811028896e-56
ham 1.3142379008554143e-28 1.060473318936824e-31
spam 6.546252830718463e-14 2.664113566541913e-13
ham 2.649317948235205e-39 3.9909775299926815e-43
spam 1.860662686537717e-27 1.4376863145163306e-25
spam 1.5481887944649164e

ham 5.633375797496076e-13 2.612532790926191e-15
ham 4.038647691709775e-27 6.694448237161506e-32
ham 1.899259074115117e-33 2.116282351570786e-38
ham 2.0083886948071783e-17 5.558472892243496e-18
ham 2.19823170055526e-09 1.8648794965793387e-11
ham 3.37260722283934e-45 2.564853347867473e-51
ham 2.9783509257067006e-31 1.5465235901162015e-34
ham 3.9088997716067903e-38 1.851747057624438e-39
ham 1.288726388896573e-10 6.377259530671111e-13
ham 1.047400452914954e-12 2.664113566541913e-13
spam 1.588325718577057e-49 4.181310797494638e-48
ham 3.061376588065378e-14 2.0096406084047622e-16
ham 8.631917723779654e-64 2.747592055585158e-64
ham 1.6655500528390877e-30 2.188278277171224e-34
ham 0.8659368269921034 0.13406317300789664
ham 9.51118536230038e-22 1.9058928242771491e-22
ham 6.938982176791755e-06 2.542843116992925e-08
ham 4.1537182622305285e-40 1.0704155641586273e-45
ham 8.800105872118873e-23 4.68579243249767e-25
ham 3.29488103357877e-17 4.2951835985517923e-19
ham 1.024891191924804e-20 1.3235366835

spam 5.0545266195815925e-65 3.342874571333251e-61
ham 6.051919617946713e-31 8.368060296451883e-32
ham 2.7101486719174437e-10 2.664113566541913e-13
ham 4.2529048194789394e-33 1.9046541164137073e-38
ham 8.220875647879002e-14 3.349401014007937e-17
ham 8.609498860324188e-14 1.004820304202381e-15
ham 8.643956992335757e-28 4.443888193640023e-33
ham 1.670288446627729e-197 2.112222681074809e-212
ham 2.1217680878125986e-42 8.362621594989276e-47
ham 0.0009519812897164772 1.685481179380144e-05
ham 1.5819870540228487e-40 1.3177090297554875e-45
ham 1.0752197853885134e-10 5.359041622412699e-16
ham 1.5481887944649163e-08 2.1190359308274375e-09
ham 1.280604163756469e-08 2.3444199385568832e-11
ham 4.9048285004275426e-20 3.1764880404619145e-23
ham 3.1925657160169107e-22 5.294146734103192e-25
ham 9.952381811713373e-29 9.372227532026108e-31
ham 3.628224330580743e-30 9.089771305172777e-34
spam 4.34692282176998e-81 1.0291108533030371e-78
ham 2.4729081941980992e-18 3.5576666053173444e-21
ham 0.00366146649890

ham 8.091335386489872e-20 1.0455174365010426e-23
ham 8.580711955698619e-16 1.3397604056031748e-16
ham 4.8897880865131927e-29 6.733163929757613e-34
ham 2.9458137738233082e-12 5.328227133083826e-13
ham 5.115305145410405e-18 1.0527410780764196e-18
ham 5.429427787824358e-14 5.726911464735723e-17
ham 4.138954962035612e-52 1.9034162136307775e-53
spam 1.4303178543039973e-42 2.1950376414959748e-42
spam 2.430613046308744e-25 1.3544203154672597e-23
ham 1.7604299367175237e-14 5.5265116731130964e-15
ham 1.0573363158752147e-66 2.613212946327868e-74
ham 5.503772955694928e-49 5.317844868076936e-56
ham 5.586990003029592e-22 1.5882440202309576e-24
ham 1.1242270404732554e-38 2.0070291827974262e-45
ham 3.7077945134901755e-18 2.6950171598756344e-19
ham 2.190116476880718e-42 1.2616477136016007e-50
ham 2.412163243063139e-09 1.491903597263471e-11
ham 1.1412527448131672e-47 4.3619954895705325e-52
spam 2.505424702993165e-140 1.3677903435084618e-139
ham 2.9739249636719146e-44 7.526359435490349e-47
ham 2.1125234

spam 5.815715708020607e-48 6.271966196241957e-48
ham 3.0399015762479705e-50 1.1632785648918295e-58
ham 3.635615861158519e-29 2.5249364736591046e-35
spam 8.587673937866386e-27 9.58457543010887e-27
ham 9.426604076234585e-12 5.328227133083826e-13
ham 3.151726052105642e-37 7.183759553986827e-41
ham 3.096377588929833e-09 2.1190359308274375e-09
ham 6.648130880785869e-44 1.1354829422414405e-49
ham 6.856522331152728e-28 1.2050006826890712e-29
ham 4.712674399587028e-26 2.6623820639191312e-27
ham 5.7275898113154846e-21 5.260866958304203e-25
ham 1.8154269733972328e-40 1.2543932392483912e-46
ham 2.7518442888903028e-11 4.8767278763955564e-14
ham 3.2993114266821054e-11 2.664113566541913e-13
ham 4.684798266414912e-100 1.6492506466548867e-108
ham 4.079058020745807e-34 2.1041137280492542e-35
ham 4.1241392833526316e-11 2.397702209887722e-12
ham 3.8324572285045426e-38 2.298803057275785e-38
ham 8.472366642029802e-16 5.0531571747668146e-20
ham 4.3111157648389036e-14 4.689161419611111e-16
ham 3.543651099625

ham 6.294279533406941e-28 5.324764127838262e-28
ham 1.6408562321459408e-14 1.0308440636524303e-17
ham 3.057812017488866e-30 8.368060296451883e-33
ham 4.2550643399670014e-11 2.664113566541913e-13
ham 1.1497782999891744e-47 6.878063379605473e-53
ham 9.365898775345984e-36 2.3945865179956094e-41
ham 5.79239419931584e-43 8.362621594989276e-46
ham 6.391322136616007e-34 3.990977529992681e-42
ham 3.402209386874725e-11 1.2238711305185e-13
ham 2.5931572121739546e-54 1.5953534604230807e-56
ham 2.1912368803650898e-40 1.672524318997855e-45
ham 8.536839776921918e-34 1.42849058731028e-36
ham 1.708815751728598e-17 1.7096515107961056e-18
ham 4.660932015471546e-11 4.528993063121251e-12
ham 5.630003777848522e-47 2.0906553987473188e-47
ham 1.9323673166075475e-29 1.5062508533613387e-31
ham 5.335385076065229e-26 1.6026509079764643e-28
ham 3.0841871365576317e-22 4.235317387282553e-24
ham 5.9689873796716846e-21 4.210964312305679e-21
ham 5.381504249560051e-06 6.357107792482313e-09
ham 7.835567269448156e-38 2.1

['spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spa
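
The scores printed above shrink quickly as messages get longer (one is on the order of 1e-197), because each is a product of many small per-word probabilities. A common way to keep such scores clear of floating-point underflow is to add logarithms instead of multiplying. The following is a minimal sketch of that idea and is not the notebook's `calculate_probability`: the function `log_score`, the toy word counts, and the smoothing denominator (total word count plus vocabulary size) are assumptions made for illustration.

```python
import math
from collections import Counter


def log_score(message, prior, word_counts, vocab_size):
    """Return log P(class) + sum of log P(word | class), with add-one smoothing.

    Hypothetical helper: word_counts maps word -> count for one class.
    """
    total = sum(word_counts.values())
    score = math.log(prior)
    for word in message.split():
        # add one to every count so unseen words never give log(0)
        score += math.log((word_counts.get(word, 0) + 1) / (total + vocab_size))
    return score


# toy per-class histograms (illustrative only, not taken from the SMS data)
spam_counts = Counter({'free': 3, 'prize': 2, 'call': 1})
ham_counts = Counter({'call': 2, 'later': 2, 'sorry': 1})
vocab_size = len(set(spam_counts) | set(ham_counts))

msg = 'free prize call'
spam_lp = log_score(msg, 0.5, spam_counts, vocab_size)
ham_lp = log_score(msg, 0.5, ham_counts, vocab_size)
prediction = 'ham' if ham_lp > spam_lp else 'spam'

# a very long message: the plain product underflows to 0.0, the log score stays finite
long_msg = ' '.join(['free'] * 400)
long_ham_lp = log_score(long_msg, 0.5, ham_counts, vocab_size)
long_ham_product = 0.5 * (1 / 10) ** 400
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the original products whenever those products are representable.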

In [32]:
# compare predictions against the true labels (column 0 of each row)

_sum = 0
for _i, (_, _row) in enumerate(_test_data.iterrows()):
    # count a hit when the stored class matches the prediction
    if _row.iloc[0] == _predictions[_i]:
        _sum += 1
_score = _sum / _test_data.shape[0]
print('Accuracy Using the TEST Data Set')
print(f'accuracy: {_score:.1%}')

Accuracy Using the TEST Data Set
accuracy: 91.3%
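
With imbalanced classes, a single accuracy figure can hide how each class fares on its own. The sketch below shows one way to break the comparison down per class. The helper names `confusion_counts` and `per_class_recall` and the toy label lists are hypothetical and are not part of the notebook.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def per_class_recall(actual, predicted):
    """Fraction of each actual class that was predicted correctly."""
    counts = confusion_counts(actual, predicted)
    totals = Counter(actual)
    return {label: counts[(label, label)] / totals[label] for label in totals}


# toy labels for illustration only
actual = ['spam', 'spam', 'ham', 'ham', 'ham', 'ham']
predicted = ['spam', 'ham', 'ham', 'ham', 'spam', 'ham']

pairs = confusion_counts(actual, predicted)
recall = per_class_recall(actual, predicted)
```

In the notebook, the same helpers could presumably be called with `list(_test_data.iloc[:, 0])` as the actual labels and `_predictions` as the predicted ones.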


<b>Conclusion</b><br>
We have successfully implemented the Naive Bayes Algorithm with a Bag of Words approach to conduct text analysis.  We have demonstrated how to execute each step:  
1. Clean the data
2. Split the data set 80%/20% into training and test sets, taking care to handle the class imbalance in our data.
3. Use Bag of Words to create a histogram of how many times each word appears for a given class.
4. Apply the Naive Bayes algorithm to calculate the probabilities of the words in a given sample occurring in each class.
5. Take the maximum probability to determine the predicted class.
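
As a compact illustration of step 3, the following sketch builds one word histogram per class along with the class priors. It is a simplified stand-in rather than the notebook's own code: `build_histograms`, `class_priors`, and the sample messages are assumptions chosen for the example, and messages are assumed to be already cleaned and space-separated.

```python
from collections import Counter


def build_histograms(samples):
    """Map each class label to a Counter of word frequencies.

    samples: iterable of (label, cleaned_message) pairs.
    """
    histograms = {}
    for label, message in samples:
        histograms.setdefault(label, Counter()).update(message.split())
    return histograms


def class_priors(samples):
    """Estimate P(class) as the fraction of samples carrying each label."""
    labels = [label for label, _ in samples]
    return {label: n / len(labels) for label, n in Counter(labels).items()}


# toy training samples for illustration only
samples = [
    ('ham', 'sorry ill call later'),
    ('ham', 'call later'),
    ('spam', 'free prize call'),
]

histograms = build_histograms(samples)
priors = class_priors(samples)
```

Steps 4 and 5 then score a new message against each class using these counts and priors, and the class with the larger score is the prediction.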


Our model reached 96.5% accuracy on the training data set and 91.3% on the test data set. The gap of roughly five percentage points suggests some overfitting to the training data, but the model still performed well on data it had not seen.
