John Olijnyk & John Machado

CSI6160 HW 2: Naive Bayes and Bag of Words Implementation

This notebook uses the Bag of Words method with the Naive Bayes machine learning algorithm to classify SMS text messages as spam or ham.

Data Set Source: 

Spam/Ham SMS data was obtained from the Kaggle project
https://www.kaggle.com/sid321axn/sms-spam-classifier-naive-bayes-ml-algo.
The data was loaded into a dataframe and the messages were extracted for cleaning.

We split the data set into a training set and a test set. In the run shown here `_test_size` is 0.5, so each class is split 50%/50%.

The Naive Bayes algorithm applies Bayes' theorem under the assumption that the features are conditionally independent given the class. This assumption is why the approach is called naive: in practice it rarely holds exactly, yet Naive Bayes still performs well on many data sets. Assuming conditional independence also decreases the number of probability values the algorithm needs to calculate and track.
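
In symbols, for a message made of words $w_1, \dots, w_n$ and a class $c$ (spam or ham), the conditional independence assumption lets the posterior be written as a product of per-word terms. The notation here is ours and is only meant as a sketch of the decision rule:

```latex
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c \in \{\text{spam},\, \text{ham}\}} P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

Without the assumption, the model would need a probability for every combination of words; with it, one probability per word per class is enough.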

The bag of words approach assumes that the order of the words in a sample text does not matter. It builds a histogram of the words used in the samples of each output label class, and these counts are then converted into the probability of each word given the class. Only the relative size of the resulting class scores matters when they are compared. A bag of words approach requires some cleanup of the data set: removing punctuation, normalizing capitalization, and splitting the text into individual words. More advanced cleanup that can improve accuracy includes removing stop words (the, a, and, etc.) and handling abbreviations and other cases that cause the same word to be counted under several spellings.
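
As a small illustration of the order-independence idea, the sketch below counts words with `collections.Counter`. The helper name `bag_of_words` and the sample messages are made up for this example and assume the text has already been cleaned:

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a cleaned, whitespace-separated message."""
    return Counter(text.split())

# Two messages with the same words in a different order...
first = bag_of_words("free entry win free prize")
second = bag_of_words("win free prize free entry")

# ...produce identical bags, because word order is discarded.
print(first == second)   # True
print(first["free"])     # 2
```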

One challenge in this approach arises when a word does not appear for a label category. This results in a zero word count, and the zero then multiplies through Bayes' theorem and forces the whole probability to zero. In practice a better model is to add one to the count of every word (add-one, or Laplace, smoothing). Every word then has at least a small, nonzero probability when the Bayes calculation is applied. Since we add one to all words and only compare the class probabilities relative to each other, this has little effect on the comparison; rather, it makes the model more robust. We employed this technique of adding one to all words in our model.
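
A minimal sketch of add-one smoothing follows. The function name `smoothed_probability` and the toy counts are hypothetical; the formula is written to match the one used later in `generate_tfm`, where the extra `+ 1` in the denominator reserves a slot for words never seen in the class:

```python
def smoothed_probability(count, total_words, vocabulary_size):
    """Add-one estimate of P(word | class).

    The extra +1 in the denominator reserves one slot for any word
    that was never seen in this class.
    """
    return (count + 1) / (total_words + vocabulary_size + 1)

# Hypothetical class: 3 distinct words, 10 word occurrences in total.
counts = {"free": 6, "call": 3, "txt": 1}
total = sum(counts.values())
vocab = len(counts)

p_seen = {w: smoothed_probability(c, total, vocab) for w, c in counts.items()}
p_unseen = smoothed_probability(0, total, vocab)

print(p_seen["free"])   # 7 / 14 = 0.5
print(p_unseen)         # 1 / 14, small but not zero
```

With this choice the probabilities of the seen words plus the single unseen slot add up to one, so an unseen word lowers a class score without eliminating the class.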

In [30]:
# imports
import pandas as pd
import numpy as np
from nltk.corpus import stopwords as nltk_stopwords
from nltk.tokenize import word_tokenize as nltk_tokenize
from collections import Counter as collections_counter
import string
from sklearn.feature_extraction.text import CountVectorizer as sk_feature_extract_text_CV
from sklearn.model_selection import train_test_split as sk_model_select_tts
from random import randint as random_randint
from random import seed as random_seed

In [31]:
# load data

_data = pd.read_csv('./spam.csv')

In [32]:
# here are the first 5 elements
# Category is our output label.  ham are good messages and spam are bad messages
# Messages are the feature we will analyze

_data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [33]:
def preprocess(df):

    """
        Modifies data dataframe in place.
            - lower cases all the words
            - removes punctuation
            - removes stop words
        Parameters:
            df (dataframe): dataframe, column 1 = labels, column 2 = message text
        Returns:
            (None)
    """

    _stopwords = set(nltk_stopwords.words('english'))
    _strip_punctuation = str.maketrans("", "", string.punctuation)

    def _clean(text):
        # lower case, drop punctuation, tokenize, drop stop words, re-join
        _words = nltk_tokenize(text.lower().translate(_strip_punctuation))
        return ' '.join(_w for _w in _words if _w not in _stopwords)

    # write back through the dataframe; rows yielded by iterrows() may be
    # copies, so assigning to them is not guaranteed to change df
    _column = df.columns[1]
    df[_column] = df[_column].map(_clean)

In [34]:
# preprocess the text to clean it up
# compared with the output above, the messages below are now lower case, without punctuation or stop words

preprocess(_data)
_data.head()

Unnamed: 0,Category,Message
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor u c already say
4,ham,nah dont think goes usf lives around though


The data needs to be separated by spam/ham category for term/document and term frequency analysis.

In [35]:
# separate based on label

_spam = _data[_data['Category'] == 'spam']
_ham = _data[_data['Category'] == 'ham']

In [36]:
#here you see a copy of the spam category

_spam

Unnamed: 0,Category,Message
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
5,spam,freemsg hey darling 3 weeks word back id like ...
8,spam,winner valued network customer selected receiv...
9,spam,mobile 11 months u r entitled update latest co...
11,spam,six chances win cash 100 20000 pounds txt csh1...
...,...,...
5537,spam,want explicit sex 30 secs ring 02073162414 cos...
5540,spam,asked 3mobile 0870 chatlines inclu free mins i...
5547,spam,contract mobile 11 mnths latest motorola nokia...
5566,spam,reminder o2 get 250 pounds free call credit de...


In [37]:
# here you see a copy of the ham category

_ham

Unnamed: 0,Category,Message
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
3,ham,u dun say early hor u c already say
4,ham,nah dont think goes usf lives around though
6,ham,even brother like speak treat like aids patent
...,...,...
5565,ham,huh lei
5568,ham,ü b going esplanade fr home
5569,ham,pity mood soany suggestions
5570,ham,guy bitching acted like id interested buying s...


In [38]:
# splitting each class's data

random_seed()
_test_size = 0.5

_spam_train_data, _spam_test_data, _spam_train_labels, _spam_test_labels = sk_model_select_tts(
    _spam.iloc[:,1],
    _spam.iloc[:,0],
    test_size=_test_size,
    random_state=random_randint(0,100)
)

_ham_train_data, _ham_test_data, _ham_train_labels, _ham_test_labels = sk_model_select_tts(
    _ham.iloc[:,1],
    _ham.iloc[:,0],
    test_size=_test_size,
    random_state=random_randint(0,100)
)

Since we noted that the classes are imbalanced, with 4825 entries for "ham" and 747 entries for "spam", we first separated the data by class and then randomly split each class into train and test portions (50%/50% here, since `_test_size = 0.5`).  The two classes are then recombined to form the actual train and test sets.  That way both sets keep the same class distribution as the full data set, and we do not inadvertently end up with a training set even more dominated by "ham" entries, as could happen with a purely random split.
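
The per-class split can be sketched with the standard library alone. The function `split_per_class`, its parameters, and the toy rows are assumptions made for illustration; the notebook itself calls `train_test_split` once per class:

```python
import random

def split_per_class(rows, test_fraction, seed=0):
    """Split (label, message) rows into train/test sets class by class."""
    rng = random.Random(seed)

    # group the rows by their label
    by_label = {}
    for label, message in rows:
        by_label.setdefault(label, []).append((label, message))

    # shuffle and cut each class separately, then pool the pieces
    train, test = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test

# Toy data with the same kind of imbalance: 8 ham, 2 spam
rows = [("ham", f"ham message {i}") for i in range(8)]
rows += [("spam", f"spam message {i}") for i in range(2)]

train, test = split_per_class(rows, test_fraction=0.5)
print(sum(label == "spam" for label, _ in train))  # 1
print(sum(label == "spam" for label, _ in test))   # 1
```

sklearn's `train_test_split` also accepts a `stratify` argument, which preserves class proportions in a single call and could replace the manual separation.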

We used train_test_split from sklearn to split each class's data. <br>
The training data for spam and for ham each produce a term frequency matrix, <br>
one per class. These matrices represent the trained model; any subsequent <br>
message can be classified by using the word probabilities saved in these matrices <br>
to make a prediction.

In [39]:
# organize data

_spam_train = pd.DataFrame.from_dict(
    {
        'label': _spam_train_labels,
        'message': _spam_train_data
    }
)

_spam_test = pd.DataFrame.from_dict(
    {
        'label': _spam_test_labels,
        'message': _spam_test_data
    }
)

_ham_train = pd.DataFrame.from_dict(
    {
        'label': _ham_train_labels,
        'message': _ham_train_data
    }
)

_ham_test = pd.DataFrame.from_dict(
    {
        'label': _ham_test_labels,
        'message': _ham_test_data
    }    
)

In [40]:
# check spam train

print(f'{_spam_train.shape[0]} messages')
_spam_train.head()

373 messages


Unnamed: 0,label,message
1380,spam,1 nokia tone 4 ur mob every week txt nok 87021...
4091,spam,tried call reply sms video mobile 750 mins unl...
333,spam,call germany 1 pence per minute call fixed lin...
2954,spam,urgent mobile awarded £1500 bonus caller prize...
4060,spam,moby pub quizwin £100 high street prize u know...


In [41]:
# check spam test

print(f'{_spam_test.shape[0]} messages')
_spam_test.head()

374 messages


Unnamed: 0,label,message
9,spam,mobile 11 months u r entitled update latest co...
3424,spam,mobile 10 mths update latest orange cameravide...
368,spam,discount code rp176781 stop messages reply sto...
4499,spam,latest nokia mobile ipod mp3 player £400 proze...
4048,spam,thanks ringtone order reference number x49your...


In [42]:
# check ham train

print(f'{_ham_train.shape[0]} messages')
_ham_train.head()

2412 messages


Unnamed: 0,label,message
5415,ham,get chicken broth want ramen unless theres don...
2542,ham,dont send plus hows mode
2599,ham,okie thanx
66,ham,today song dedicated day song u dedicate send ...
5010,ham,mobile numberpls sms ur mail idconvey regards ...


In [43]:
# check ham test

print(f'{_ham_test.shape[0]} messages')
_ham_test.head()

2413 messages


Unnamed: 0,label,message
5435,ham,im wif buying tix lar
2338,ham,tell friends plan valentines day lturlgt
1865,ham,call ok said call
1668,ham,dad gon na call gets work ask crazy questions
2744,ham,family responding anything room went home diwa...


We implemented our bag of words by creating a term/document (sparse) matrix and a term frequency (condensed) matrix in order to conduct Bayesian analysis. <br>
We used a method to create term/document matrices from the following site: https://www.kaggle.com/sid321axn/sms-spam-classifier-naive-bayes-ml-algo <br>
This method involved using the CountVectorizer class from sklearn to create a list of counts of words found in the data as well as a list of the words themselves. <br>
Each row of the dataframe represents one message and holds the count of every word in that message. <br>
Each column represents a word found in the messages.
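
To make the layout concrete, here is a hand-rolled sketch of a term/document matrix using only the standard library. The helper `term_document_matrix` and the two sample messages are hypothetical; it is not a substitute for CountVectorizer, whose default tokenizer, for instance, ignores single-character tokens:

```python
from collections import Counter

def term_document_matrix(messages):
    """Return (vocabulary, matrix): one row per message, one column per word."""
    bags = [Counter(m.split()) for m in messages]
    # every distinct word across all messages, in sorted order
    vocabulary = sorted(set().union(*bags))
    # a Counter returns 0 for words that a message does not contain
    matrix = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, matrix

vocabulary, matrix = term_document_matrix(["free call free", "call now"])
print(vocabulary)  # ['call', 'free', 'now']
print(matrix)      # [[1, 2, 0], [1, 0, 1]]
```

Summing each column of such a matrix gives the per-word counts that the term frequency matrix is built from.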

In [44]:
def generate_tdm(df):
    
    """
        generates a term document matrix from input data
        Parameters:
            df (DataFrame): dataframe with column 1 = labels, column 2 = messages
        Returns:
            DataFrame: a term document matrix
    """
    
    _count_vectorizer = sk_feature_extract_text_CV()
    _count_vectorizer.fit(df.iloc[:,1])
    _features = _count_vectorizer.get_feature_names_out()
    _counts = _count_vectorizer.transform(df.iloc[:,1]).toarray()
    return pd.DataFrame(_counts, columns=_features)


In [45]:
# generate tdm for spam and ham

_spam_tdm = generate_tdm(_spam_train)
_ham_tdm = generate_tdm(_ham_train)

In [46]:
# tdm for spam category

_spam_tdm

Unnamed: 0,008704050406,01223585334,02,020603,0207,02072069400,02073162414,020903,0578,07008009200,...,yo,yohere,youll,youre,youve,yr,yrs,zebra,zed,zouk
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
370,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
371,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [47]:
# tdm for ham category

_ham_tdm

Unnamed: 0,10,100,101mega,1030,11,1148,12,12000pes,128,130,...,yunny,yuou,yup,zealand,zindgi,zoe,zoom,zyada,üll,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2407,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2408,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2409,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2410,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The term frequency matrices are dataframes that hold, for each class, a histogram of word counts <br>
together with the class-based Bayesian probability of each word.


In [48]:
def generate_tfm(df):
    
    """
        generates a term frequency matrix from a term document matrix
        Parameters:
            df (DataFrame): should be a term document matrix
        Returns:
            DataFrame: a term frequency matrix
    """
    _counts = []
    _probability = []
    for _label, _data in df.items():  # iteritems() was removed in pandas 2.0
        _sum = _data.sum()
        _counts.append(_sum)
    _sum_of_all_words = sum(_counts)
    for _c in _counts:
        _probability.append((_c + 1) / (_sum_of_all_words + df.shape[1] + 1)) # shape[1] is the number of unique words in the class
    _out_data = {
        'word': df.columns,
        'count': _counts,
        'probability': _probability
    }
    return pd.DataFrame().from_dict(_out_data)


In [49]:
# generated term frequency matrix

_spam_tfm = generate_tfm(_spam_tdm)
_ham_tfm = generate_tfm(_ham_tdm)

In [50]:
# look at spam tfm histogram

_spam_tfm.sort_values(by=['probability'], ascending=False)

Unnamed: 0,word,count,probability
640,call,164,0.020950
904,free,109,0.013966
1731,txt,84,0.010792
1757,ur,82,0.010538
1208,mobile,65,0.008380
...,...,...,...
783,daytime,1,0.000254
781,day2find,1,0.000254
777,dates,1,0.000254
775,dartboard,1,0.000254


In [51]:
# look at ham tfm histogram

_ham_tfm.sort_values(by=['probability'], ascending=False)

Unnamed: 0,word,count,probability
2106,im,223,0.009359
1741,get,158,0.006644
3045,ok,138,0.005808
1254,dont,137,0.005766
2587,ltgt,131,0.005515
...,...,...,...
735,carpark,1,0.000084
734,carolina,1,0.000084
2589,lttrs,1,0.000084
2590,lturlgt,1,0.000084


We needed the prior probability of each class: the probability that an arbitrary message is spam or ham.

In [52]:
def calculate_class_probability(df):

    """
        spam = #_spam/#_messages; ham = #_ham/#_messages
        Parameters:
            df (DataFrame): the data set; can be pre or post processed
        Returns: 
            (dict): {spam: (float), ham: (float)}
    """

    _count = df.shape[0]
    # count the rows of each class with a boolean mask instead of count()[0]
    _p = {
        'spam': (df['Category'] == 'spam').sum() / _count,
        'ham': (df['Category'] == 'ham').sum() / _count
    }
    return _p

In [53]:
# calculate probabilities for each class

_p_classes = calculate_class_probability(_data)
print(_p_classes)

{'spam': 0.13406317300789664, 'ham': 0.8659368269921034}


Next, we wrote a function to calculate the probability that a <br>
given message belongs to a given class.
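
The notebook's own function multiplies the word probabilities directly. For long messages such a product can become too small for a float to represent, so a common variant sums logarithms instead. The sketch below shows that variant under stated assumptions: `log_score`, `word_probabilities`, `p_unseen`, and all the numbers are hypothetical and are not taken from the trained model:

```python
import math

def log_score(message, p_class, word_probabilities, p_unseen):
    """Log of: p_class * product of P(word | class) over the message's words."""
    score = math.log(p_class)
    for word in message.split():
        # fall back to the smoothed "unseen word" probability
        score += math.log(word_probabilities.get(word, p_unseen))
    return score

# Hypothetical smoothed probabilities for two classes
spam_probs = {"free": 0.5, "call": 0.3}
ham_probs = {"free": 0.1, "call": 0.2}

message = "free call free"
spam = log_score(message, p_class=0.13, word_probabilities=spam_probs, p_unseen=0.05)
ham = log_score(message, p_class=0.87, word_probabilities=ham_probs, p_unseen=0.05)

# The larger log score wins, just as the larger product would
prediction = "spam" if spam > ham else "ham"
print(prediction)  # spam
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the products.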

In [54]:
def calculate_probability(message, p_class, class_tfm):

    """
        calculates the probability that a given message belongs to 
            the class of the class_tfm
        Parameters:
            message (string): a single message from the data set
            p_class (float): probability of class in data set
            class_tfm (DataFrame): term frequency matrix of the class
                the message is being scored against
        Returns:
            float: the probability that the message belongs to the 
                label of the supplied class_tfm
    """

    _words = message.split()
    _words_in_class = class_tfm['count'].sum()
    _p_word_not_in_class = 1 / (1 + _words_in_class + class_tfm.shape[0])  # shape[0] is the number of unique words in the class
    _p = 1
    
    for _w in _words:
        
        # word not in class_tfm
        _row = class_tfm[class_tfm['word'] == _w]
        if _row.empty:
            _p_word = _p_word_not_in_class

        # word in class_tfm
        else:
            _p_word = _row['probability'].iloc[0]

        # product
        _p *= _p_word

    return _p * p_class


In [55]:
# recombine the train data frames to run them through the trained model

_train_data = pd.concat([_spam_train, _ham_train])


In [56]:
#NOTE:  This cell can take a minute or two to run
# make predictions
# run our training data set through the model to see how accurate it is on data it has already seen

_predictions = []
print('Prediction, Probability Ham, Probability Spam')
for _row in _train_data.iterrows():
    _p_spam = calculate_probability(_row[1][1], _p_classes['spam'], _spam_tfm)
    _p_ham = calculate_probability(_row[1][1], _p_classes['ham'], _ham_tfm)
    _prediction = 'spam'
    if _p_ham > _p_spam:
        _prediction = 'ham'
    _predictions.append(_prediction)
    print(_prediction, _p_ham, _p_spam)



Prediction, Probability Ham, Probability Spam
spam 1.0599082340898416e-88 1.7519937076300138e-68
spam 2.945091787179128e-52 1.5209051094811032e-38
spam 1.028426506139921e-73 1.933347930534633e-61
spam 1.8841085497360816e-61 3.5702181378446195e-45
spam 8.5817793053081095e-78 5.5278983992342735e-65
spam 1.3117121518474444e-73 5.562647836664167e-60
spam 8.244580701792376e-64 1.4169380612393522e-49
spam 2.1938596012620714e-57 3.1959416721709024e-53
spam 2.691132280645182e-54 3.979641060992718e-42
spam 3.049949518092634e-52 1.822240210336493e-43
spam 4.952308542498057e-30 4.8792528095080085e-24
spam 1.7087593535326582e-69 4.1787858761568354e-52
spam 3.512228703190825e-85 2.2533411081396532e-63
spam 2.5256880569372087e-53 2.7688360271019822e-46
spam 1.558671618418031e-60 3.8160653267690084e-45
spam 1.015551569674136e-91 6.593302257826583e-67
spam 1.967454121831863e-80 1.1340912772369875e-71
spam 1.2369716930139665e-36 5.475458249739022e-31
spam 1.5753391177628517e-58 3.46891104531811e-48
...

In [57]:
# compare predictions

_sum = 0
for _i,_row in enumerate(_train_data.iterrows()):
    if _row[1][0] != _predictions[_i]:
        continue
    _sum += 1
_score = _sum / _train_data.shape[0]
print('Accuracy using TRAINING data set')
print(f'accuracy: {_score:.1%}')

Accuracy using TRAINING data set
accuracy: 96.9%


In [58]:
# recombine the test data frames for testing

_test_data = pd.concat([_spam_test, _ham_test])


In [59]:
# check some data from the recombined frame

random_seed()
for _ in range(5):
    _index = random_randint(0, _test_data.shape[0] - 1)  # randint includes both endpoints
    print(f'message # {_index}')
    print(f'class: {_test_data.iloc[_index,0]}\nmessage: {_test_data.iloc[_index,1]}')
    print()

message # 1678
class: ham
message: grandmas oh dear u still ill felt shit morning think hungover another night leave sat

message # 2450
class: ham
message: much ur hdd casing cost

message # 2034
class: ham
message: aiyah wait lor u entertain hee

message # 1014
class: ham
message: dont think dont need going late school night especially one class one missed last wednesday probably failed test friday

message # 555
class: ham
message: thats ebay might less elsewhere



In [60]:
#NOTE:  This cell can take a minute or two to run
# make predictions
# run our TEST data set through to find the predictions and accuracy

_predictions = []
print('Prediction, Probability Ham, Probability Spam')
for _row in _test_data.iterrows():
    _p_spam = calculate_probability(_row[1][1], _p_classes['spam'], _spam_tfm)
    _p_ham = calculate_probability(_row[1][1], _p_classes['ham'], _ham_tfm)
    _prediction = 'spam'
    if _p_ham > _p_spam:
        _prediction = 'ham'
    _predictions.append(_prediction)
    print(_prediction, _p_ham, _p_spam)

_predictions

Prediction, Probability Ham, Probability Spam
spam 8.827574838023937e-69 2.632760806753764e-53
spam 8.14862936192902e-73 1.102540127177922e-58
spam 2.2608350414641904e-42 1.5412077797719252e-34
spam 6.464369823237304e-57 1.0451058110422283e-44
spam 1.1003767043325558e-60 1.3101554490474088e-50
spam 9.569052379782883e-68 5.836295540982773e-53
spam 3.6863227123057235e-26 2.695978886088822e-23
spam 4.427673870088305e-82 1.2018117128963424e-79
spam 2.3354881859948538e-20 1.1009703776065226e-18
spam 3.578069328359981e-65 1.0221196750124023e-61
spam 8.694341442599778e-76 3.7481694396836763e-69
spam 6.151533672717568e-70 3.0359286908240878e-52
spam 3.2013509367964346e-103 1.0685957955234543e-97
spam 2.383846959562193e-88 6.909715137928162e-69
spam 2.7549658846636847e-68 2.1758353774824494e-55
spam 4.577893966044828e-74 1.0461356425899922e-64
spam 1.590960601028864e-95 2.4173690398809955e-84
spam 1.8845141469220617e-55 1.5468357693307164e-50
spam 8.287465823763774e-80 1.0164880737716522e-68
...

['spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 ...

In [61]:
# compare predictions

_sum = 0
for _i,_row in enumerate(_test_data.iterrows()):
    if _row[1][0] != _predictions[_i]:
        continue
    _sum += 1
_score = _sum / _test_data.shape[0]
print('Accuracy using TEST data set')
print(f'accuracy: {_score:.1%}')

Accuracy using TEST data set
accuracy: 90.1%


<b>Conclusion</b><br>
We have successfully implemented the Naive Bayes algorithm with a Bag of Words approach to classify text messages.  We have demonstrated how to execute each step:  
1. Clean the data.
2. Split the data set into train and test sets (50%/50% in the run shown).  We took care to deal with the class imbalance in our data.
3. Use Bag of Words to create a histogram of how many times each word appears for a given class.  
4. Apply the Naive Bayes algorithm to calculate the probabilities of the words in a given sample occurring in each class.
5. Take the maximum probability to determine the predicted class.


We saw our model predict with 96.9% accuracy on the training data set and 90.1% accuracy on the test data set, using the model built from the training data. The gap suggests some overfitting to the training data, but our model still did well at 90+% on the test data set.
