John Olijnyk & John Machado

CSI6160 HW 2: Naive Bayes and Bag of Words Implementation

This notebook uses the Bag of Words method with the Naive Bayes machine learning algorithm to classify SMS text messages as spam or ham.

Data Set Source: 

Spam/Ham SMS data was obtained from the Kaggle project
https://www.kaggle.com/sid321axn/sms-spam-classifier-naive-bayes-ml-algo.
The data was loaded into a dataframe and the messages were extracted for cleaning.

We split the data set into a training set and a test set, 80% and 20% of the messages respectively.

The Naive Bayes algorithm uses Bayes' theorem but assumes that the features are conditionally independent given the class. This is why the approach is called naive. In practice the assumption is rarely strictly true, yet Naive Bayes is found to perform well for many data sets. The assumption also decreases the number of probability values the algorithm needs to calculate and track.
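
As a sketch of what the independence assumption buys, the rule can be written out in notation introduced here for illustration only: $c$ is a class label (spam or ham) and $w_1, \dots, w_n$ are the words of one message. Bayes' theorem gives the posterior, and the naive assumption factors the likelihood into one term per word:

$$
P(c \mid w_1, \dots, w_n) = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)},
\qquad
P(w_1, \dots, w_n \mid c) \approx \prod_{i=1}^{n} P(w_i \mid c)
$$

The denominator is the same for every class, so it can be dropped when classes are only compared with each other:

$$
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

Only one probability per word and class has to be stored, rather than one per possible combination of words.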

The bag of words approach assumes that the order of the words in a sample text does not matter. The approach builds a histogram of the words used across the samples of each output label class. These histogram counts are then converted into the probability of each word given the class, and the classes are compared on a relative basis. Employing a bag of words approach requires some clean-up of the data set to remove punctuation and capitalization and to isolate each word. More advanced clean-up that can improve accuracy includes removing stop words (the, a, and, etc.) and handling abbreviations and other cases that cause duplicate entries for the same word.
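
The per-class histogram idea can be illustrated with a minimal sketch. The messages below are made up and assumed to be already cleaned; `samples` and `histograms` are names chosen for this illustration and are not part of the notebook's code.

```python
from collections import Counter

# Hypothetical, already-cleaned messages paired with their labels.
samples = [
    ("spam", "free entry win cash"),
    ("spam", "win free prize"),
    ("ham", "ok see you later"),
    ("ham", "call you later ok"),
]

# One histogram (word -> count) per label; word order is ignored.
histograms = {}
for label, text in samples:
    histograms.setdefault(label, Counter()).update(text.split())

print(histograms["spam"])
print(histograms["ham"])
```

A `Counter` returns 0 for a word it has never seen, which is exactly the zero-count case that smoothing has to handle.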

One challenge with this approach arises when a word does not appear in the training data for a given label category. This results in a zero word count, and that zero then multiplies through the Bayes theorem product, making the whole probability zero. In practice a better model adds one to the count of every word (add-one, or Laplace, smoothing). This gives every word at least a small, nonzero probability when the Bayes calculation is applied. Because the same adjustment is applied to every word and we only compare the relative probabilities of the classes, it does not distort the comparison; rather, it makes the model more robust. We employed this add-one technique in our model.
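
A small worked example may make the add-one adjustment concrete. It mirrors the formula used later in this notebook, (count + 1) / (total words + unique words + 1); the function name and the toy counts are assumptions made for illustration.

```python
def smoothed_probability(count, total_words, vocabulary_size):
    """Add-one estimate of P(word | class).

    The extra +1 in the denominator reserves one slot for any word
    that never appeared in the class.
    """
    return (count + 1) / (total_words + vocabulary_size + 1)

# Hypothetical class with 3 distinct words and 6 word occurrences in total.
counts = {"free": 3, "win": 2, "cash": 1}
total = sum(counts.values())
vocab = len(counts)

p_seen = {w: smoothed_probability(c, total, vocab) for w, c in counts.items()}
p_unseen = smoothed_probability(0, total, vocab)

print(p_seen["free"])  # (3 + 1) / (6 + 3 + 1) = 0.4
print(p_unseen)        # (0 + 1) / (6 + 3 + 1) = 0.1
```

With this denominator the probabilities of the seen words plus the single unseen slot sum to one, so an unseen word lowers a message's score without forcing it to zero.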


In [89]:
# imports
import pandas as pd
import numpy as np
from nltk.corpus import stopwords as nltk_stopwords
from nltk.tokenize import word_tokenize as nltk_tokenize
from collections import Counter as collections_counter
import string
from sklearn.feature_extraction.text import CountVectorizer as sk_feature_extract_text_CV
from sklearn.model_selection import train_test_split as sk_model_select_tts
from random import randint as random_randint
from random import seed as random_seed


In [90]:
# load data

_data = pd.read_csv('./spam.csv')

In [91]:
# here are the first 5 elements
# Category is our output label.  ham are good messages and spam are bad messages
# Messages are the feature we will analyze

_data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [92]:
def preprocess(df):

    """
        Modifies data dataframe in place.
            - to lower case all the words
            - removes stop words     
            - removes punctuation       
        Parameters:
            df (dataframe): dataframe, column 1 = labels, column 2 = message text
        Returns:
            (None)
    """

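    _stopwords = set(nltk_stopwords.words('english'))
    _strip_punctuation = str.maketrans("", "", string.punctuation)

    def _clean(text):
        # lower-case, strip punctuation, tokenize, then drop stop words
        _tokens = nltk_tokenize(text.lower().translate(_strip_punctuation))
        return ' '.join(_w for _w in _tokens if _w not in _stopwords)

    # assign back to the column itself; rows yielded by iterrows() may be
    # copies, so mutating them is not guaranteed to change df
    _message_column = df.columns[1]
    df[_message_column] = df[_message_column].apply(_clean)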

In [93]:
# preprocess the text to clean it up
# in the output below if you compare above, you will see the data is now lower case, no punctuation, etc

preprocess(_data)
_data.head()

Unnamed: 0,Category,Message
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor u c already say
4,ham,nah dont think goes usf lives around though


The data needs to be separated by spam/ham category for term/document and term frequency analysis.

In [94]:
# separate based on label

_spam = _data[_data['Category'] == 'spam']
_ham = _data[_data['Category'] == 'ham']

In [95]:
#here you see a copy of the spam category

_spam

Unnamed: 0,Category,Message
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
5,spam,freemsg hey darling 3 weeks word back id like ...
8,spam,winner valued network customer selected receiv...
9,spam,mobile 11 months u r entitled update latest co...
11,spam,six chances win cash 100 20000 pounds txt csh1...
...,...,...
5537,spam,want explicit sex 30 secs ring 02073162414 cos...
5540,spam,asked 3mobile 0870 chatlines inclu free mins i...
5547,spam,contract mobile 11 mnths latest motorola nokia...
5566,spam,reminder o2 get 250 pounds free call credit de...


In [96]:
# here you see a copy of the ham category

_ham

Unnamed: 0,Category,Message
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
3,ham,u dun say early hor u c already say
4,ham,nah dont think goes usf lives around though
6,ham,even brother like speak treat like aids patent
...,...,...
5565,ham,huh lei
5568,ham,ü b going esplanade fr home
5569,ham,pity mood soany suggestions
5570,ham,guy bitching acted like id interested buying s...


Next, we used train_test_split from sklearn to split each class's data. <br>
The spam and ham training data each produce a term frequency matrix, one per class. <br>
These matrices represent the trained model; any subsequent <br>
message can be classified by using the word probabilities saved in these matrices <br>
to make a prediction.

In [97]:
# splitting each class's data

random_seed()
_test_size = 0.2

_spam_train_data, _spam_test_data, _spam_train_labels, _spam_test_labels = sk_model_select_tts(
    _spam.iloc[:,1],
    _spam.iloc[:,0],
    test_size=_test_size,
    random_state=random_randint(0,100)
)

_ham_train_data, _ham_test_data, _ham_train_labels, _ham_test_labels = sk_model_select_tts(
    _ham.iloc[:,1],
    _ham.iloc[:,0],
    test_size=_test_size,
    random_state=random_randint(0,100)
)

In [98]:
# organize data

_spam_train = pd.DataFrame.from_dict(
    {
        'label': _spam_train_labels,
        'message': _spam_train_data
    }
)

_spam_test = pd.DataFrame.from_dict(
    {
        'label': _spam_test_labels,
        'message': _spam_test_data
    }
)

_ham_train = pd.DataFrame.from_dict(
    {
        'label': _ham_train_labels,
        'message': _ham_train_data
    }
)

_ham_test = pd.DataFrame.from_dict(
    {
        'label': _ham_test_labels,
        'message': _ham_test_data
    }    
)

In [99]:
# check spam train

print(f'{_spam_train.shape[0]} messages')
_spam_train.head()

597 messages


Unnamed: 0,label,message
4841,spam,private 2003 account statement shows 800 unred...
5278,spam,urgent mobile number awarded £2000 prize guara...
878,spam,sunshine quiz wkly q win top sony dvd player u...
4236,spam,freemsg records indicate may entitled 3750 pou...
1318,spam,win newest “ harry potter order phoenix book 5...


In [100]:
# check spam test

print(f'{_spam_test.shape[0]} messages')
_spam_test.head()

150 messages


Unnamed: 0,label,message
4577,spam,congratulations ur awarded 500 cd vouchers 125...
3382,spam,complimentary 4 star ibiza holiday £10000 cash...
4985,spam,goldviking 29m inviting friend reply yes762 no...
4760,spam,thanks 4 continued support question week enter...
5294,spam,xmas iscoming ur awarded either £500 cd gift v...


In [101]:
# check ham train

print(f'{_ham_train.shape[0]} messages')
_ham_train.head()

3860 messages


Unnamed: 0,label,message
834,ham,thank much skyped wit kz sura didnt get pleasu...
582,ham,ok anyway need change said
4956,ham,masters buy bb cos sale hows bf
4014,ham,ok
1152,ham,sorry ill call later


In [102]:
# check ham test

print(f'{_ham_test.shape[0]} messages')
_ham_test.head()

965 messages


Unnamed: 0,label,message
3031,ham,also sir sent email log usc payment portal ill...
4932,ham,good morning boytoy hows yummy lips wheres sex...
5059,ham,geeeee internet really bad today eh
4008,ham,im reaching home 5 min
4230,ham,bookedthe hut also time way


We implemented our bag of words by creating a term/document (sparse) matrix and a term frequency (condensed) matrix in order to conduct Bayesian analysis. <br>
We used a method to create term/document matrices from the following site: https://www.kaggle.com/sid321axn/sms-spam-classifier-naive-bayes-ml-algo <br>
This method involved using the CountVectorizer class from sklearn to produce the word counts for each message as well as the list of the words themselves. <br>
Each row of the resulting dataframe represents one message. <br>
Each column represents a word found in the messages, and each cell holds the number of times that word occurs in that message.
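
To show the shape of a term/document matrix without any library, here is a hand-rolled sketch on two made-up messages. It is not how CountVectorizer works internally; in particular, CountVectorizer's default tokenizer ignores single-character tokens, which this sketch does not replicate. The function name `term_document_matrix` is hypothetical.

```python
def term_document_matrix(messages):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in messages[i]."""
    vocabulary = sorted({word for text in messages for word in text.split()})
    column = {word: j for j, word in enumerate(vocabulary)}
    rows = []
    for text in messages:
        row = [0] * len(vocabulary)
        for word in text.split():
            row[column[word]] += 1
        rows.append(row)
    return vocabulary, rows

messages = ["free entry win", "win win cash"]
vocabulary, rows = term_document_matrix(messages)
print(vocabulary)  # ['cash', 'entry', 'free', 'win']
print(rows)        # [[0, 1, 1, 1], [1, 0, 0, 2]]

# Summing each column gives the per-word totals used for term frequencies.
print([sum(col) for col in zip(*rows)])  # [1, 1, 1, 3]
```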

In [103]:
def generate_tdm(df):
    
    """
        generates a term document matrix from input data
        Parameters:
            df (DataFrame): dataframe with column 1 = labels, column 2 = messages
        Returns:
            DataFrame: a term document matrix
    """
    
    _count_vectorizer = sk_feature_extract_text_CV()
    _count_vectorizer.fit(df.iloc[:,1])
    _features = _count_vectorizer.get_feature_names_out()
    _counts = _count_vectorizer.transform(df.iloc[:,1]).toarray()
    return pd.DataFrame(_counts, columns=_features)


In [104]:
# generate tdm for spam and ham

_spam_tdm = generate_tdm(_spam_train)
_ham_tdm = generate_tdm(_ham_train)

In [105]:
# tdm for spam category

_spam_tdm

Unnamed: 0,008704050406,0089my,0121,01223585236,01223585334,02,020603,0207,02070836089,02072069400,...,youll,youre,yourinclusive,youto,youve,yr,yrs,zed,zoe,zouk
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
592,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
593,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
594,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
595,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [106]:
# tdm for ham category

_ham_tdm

Unnamed: 0,10,1000s,1010,101mega,1030,11,1120,1148,12,120,...,zac,zahers,zhong,zindgi,zoe,zogtorius,zoom,zyada,üll,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3855,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3856,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3857,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3858,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The frequency matrices were dataframes with one row for each word found in the class's training messages and three columns: <br>
the word itself, the number of times the word appeared, and the word's smoothed relative frequency among all word occurrences in that class,
i.e. a histogram of counts and the corresponding probabilities.


In [107]:
def generate_tfm(df):
    
    """
        generates a term frequency matrix from a term document matrix
        Parameters:
            df (DataFrame): should be a term document matrix
        Returns:
            DataFrame: a term frequency matrix
    """
    _counts = []
    _probability = []
    for _label, _data in df.items():  # DataFrame.iteritems() was removed in pandas 2.0
        _sum = _data.sum()
        _counts.append(_sum)
    _sum_of_all_words = sum(_counts)
    for _c in _counts:
        _probability.append((_c + 1) / (_sum_of_all_words + df.shape[1] + 1)) # shape[1] is the number of unique words in the class
    _out_data = {
        'word': df.columns,
        'count': _counts,
        'probability': _probability
    }
    return pd.DataFrame().from_dict(_out_data)


In [108]:
# generated term frequency matrix

_spam_tfm = generate_tfm(_spam_tdm)
_ham_tfm = generate_tfm(_ham_tdm)

In [109]:
# look at spam tfm histogram

_spam_tfm.sort_values(by=['probability'], ascending=False)

Unnamed: 0,word,count,probability
867,call,275,0.022722
1223,free,172,0.014242
2347,txt,121,0.010044
2383,ur,115,0.009550
2263,text,102,0.008479
...,...,...,...
1071,died,1,0.000165
1070,dick,1,0.000165
1069,dialling,1,0.000165
1068,dial,1,0.000165


In [110]:
# look at ham tfm histogram

_ham_tfm.sort_values(by=['probability'], ascending=False)

Unnamed: 0,word,count,probability
2821,im,365,0.009924
2342,get,255,0.006941
3437,ltgt,229,0.006236
4055,ok,215,0.005857
1721,dont,204,0.005558
...,...,...,...
3562,mca,1,0.000054
3564,mcr,1,0.000054
3566,meals,1,0.000054
1026,careabout,1,0.000054


We also needed the prior probability of each class: the probability that any given message is spam or ham.

In [111]:
def calculate_class_probability(df):

    """
        spam = #_spam/#_messages; ham = #_ham/#_messages
        Parameters:
            df (DataFrame): the data set; can be pre or post processed
        Returns: 
            (dict): {spam: (float), ham: (float)}
    """

    _count = df.shape[0]
    _p = {
        'spam': df[df['Category'] == 'spam'].shape[0] / _count,  # number of spam rows / all rows
        'ham': df[df['Category'] == 'ham'].shape[0] / _count  # number of ham rows / all rows
    }
    return _p

In [112]:
# calculate probabilites for each class

_p_classes = calculate_class_probability(_data)
print(_p_classes)

{'spam': 0.13406317300789664, 'ham': 0.8659368269921034}


In [113]:
def calculate_probability(message, p_class, class_tfm):

    """
        calculates the probability that a given message belongs to 
            the class of the class_tfm
        Parameters:
            message (string): a single message from the data set
            p_class (float): probability of class in data set
            class_tfm (DataFrame): term frequency matrix of the class
                being scored
        Returns:
            float: the probability that the message belongs to the 
                label of the supplied class_tfm
    """

    _words = message.split()
    _words_in_class = class_tfm['count'].sum()
    _p_word_not_in_class = 1 / (1 + _words_in_class + class_tfm.shape[0])  # shape[0] is the number of unique words in the class
    _p = 1
    
    for _w in _words:
        
        # word not in class_tfm
        _row = class_tfm[class_tfm['word'] == _w]
        if _row.empty:
            _p_word = _p_word_not_in_class

        # word in class_tfm
        else:
            _p_word = _row['probability'].iloc[0]

        # product
        _p *= _p_word

    return _p * p_class
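
One practical caveat with a running product: multiplying many probabilities below 1 can underflow to 0.0 in floating point for long messages. A common remedy is to add logarithms instead, which ranks the classes the same way. The sketch below assumes the word probabilities are held in a plain dict (a hypothetical structure, not the notebook's DataFrame), and all numbers are made up.

```python
import math

def log_score(words, p_class, word_probability, p_unseen):
    """Sum of log-probabilities; ranks classes the same way as the product."""
    total = math.log(p_class)
    for word in words:
        total += math.log(word_probability.get(word, p_unseen))
    return total

# Hypothetical smoothed probabilities for two classes.
spam_probs = {"free": 0.4, "win": 0.3}
ham_probs = {"ok": 0.5, "later": 0.2}
message = ["free", "win", "cash"]

spam_score = log_score(message, 0.13, spam_probs, p_unseen=0.1)
ham_score = log_score(message, 0.87, ham_probs, p_unseen=0.1)
print("spam" if spam_score > ham_score else "ham")

# A plain product of 200 small factors underflows; the log sum stays finite.
print(0.001 ** 200)
print(200 * math.log(0.001))
```

If this were adapted to the notebook, a dict could be built once per class from the `word` and `probability` columns, which would also avoid filtering the DataFrame for every word.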


In [114]:
# recombine the test data frames for testing

_test_data = pd.concat([_spam_test, _ham_test])


In [115]:
# check some data from the recombined frame

random_seed()
for _ in range(5):
    _index = random_randint(0, _test_data.shape[0] - 1)  # randint includes both endpoints
    print(f'message # {_index}')
    print(f'class: {_test_data.iloc[_index,0]}\nmessage: {_test_data.iloc[_index,1]}')
    print()

message # 941
class: ham
message: problem renewal ill right away dont know details

message # 113
class: spam
message: recpt 13 ordered ringtone order processed

message # 236
class: ham
message: well im desperate ill call armand

message # 804
class: ham
message: hes apparently bffs carly quick

message # 1071
class: ham
message: usual u call ard 10 smth



In [116]:
# make predictions
# run our data set through

_predictions = []
for _, _row in _test_data.iterrows():
    _message = _row.iloc[1]  # column 2 = message text
    _p_spam = calculate_probability(_message, _p_classes['spam'], _spam_tfm)
    _p_ham = calculate_probability(_message, _p_classes['ham'], _ham_tfm)
    _prediction = 'spam'
    if _p_ham > _p_spam:
        _prediction = 'ham'
    _predictions.append(_prediction)

In [117]:
# compare predictions

_sum = 0
for _i, (_, _row) in enumerate(_test_data.iterrows()):
    if _row.iloc[0] != _predictions[_i]:  # column 1 = true label
        continue
    _sum += 1
_score = _sum / _test_data.shape[0]
print(f'accuracy: {_score:.1%}')

accuracy: 93.7%
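
Because roughly 87% of the messages are ham (see the class probabilities printed earlier), a classifier that always answered ham would already score close to 87% accuracy. Per-class counts can therefore be more informative than accuracy alone. The following is a minimal sketch on made-up label lists, not the notebook's results; `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="spam"):
    """Count true/false positives and negatives for the positive label."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

# Hypothetical labels, not the notebook's results.
actual = ["spam", "spam", "ham", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "spam", "ham", "spam"]

c = confusion_counts(actual, predicted)
precision = c["tp"] / (c["tp"] + c["fp"])  # share of predicted spam that is spam
recall = c["tp"] / (c["tp"] + c["fn"])     # share of actual spam that was caught
print(dict(c), precision, recall)
```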


<b>Conclusion</b><br>
We have successfully implemented the Naive Bayes Algorithm with a Bag of Words approach to conduct text analysis.  We have demonstrated how to execute each step:  
1. Clean the data
2. Use Bag of Words to create a histogram of how many times each word appears for a given class.  
3. Apply the Naive Bayes algorithm to calculate the probabilities of the words in a given sample occurring in each class set
4. Take the maximum probability to determine the predicted class.


In the run shown above, our model classified the test data set with 93.7% accuracy based on the training data set; another run gave 91.5%. Each figure comes from a single test, and because train_test_split draws a different random sample on every run, the result will vary. Our split was 80% training and 20% test.
