John Olijnyk & John Machado

CSI6160 HW 2:  Naive Bayes and Bag of Words Implementation

This notebook uses the Bag of Words method with the Naive Bayes machine learning algorithm to perform sentiment analysis on text data.  The data set is a collection of movie reviews.
Data Set Source:  https://www.kaggle.com/c/word2vec-nlp-tutorial/data

The labeled data set consists of 25,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment label is binary: an IMDB rating < 5 gives a sentiment score of 0, and a rating >= 7 gives a sentiment score of 1.
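
As a rough illustration of this labeling rule, the sketch below maps a numeric rating to the binary sentiment label. The function name `rating_to_sentiment` is hypothetical, and returning `None` for ratings between the two thresholds is an assumption, since the rule above only defines the two ends of the scale.

```python
def rating_to_sentiment(rating):
    """Map an IMDB rating to a binary sentiment label.

    Hypothetical helper: ratings below 5 map to 0 and ratings of 7 or
    more map to 1. Returning None for the ratings in between is an
    assumption, since the labeling rule does not cover them.
    """
    if rating < 5:
        return 0
    if rating >= 7:
        return 1
    return None


labels = [rating_to_sentiment(r) for r in (1, 4, 5, 6, 7, 10)]
print(labels)
```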

We split the data set into a training set and a test set with an 80%/20% split, respectively.

The Naive Bayes algorithm applies Bayes' theorem under the assumption that the features are conditionally independent given the class.  This assumption is why the approach is called naive.  In practice the assumption rarely holds exactly, but Naive Bayes still performs well on many data sets.  The assumption also decreases the number of probability values the algorithm needs to calculate and track.
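
To make the independence assumption concrete, here is a minimal sketch that scores each class as the prior times the product of per-word probabilities. All probability values and the names `priors`, `word_given_class`, `naive_bayes_scores` and `param_counts` are hypothetical and chosen only for illustration; they are not taken from the IMDB data. The parameter count comparison assumes binary (present/absent) features.

```python
# Hypothetical toy values, for illustration only (not from the IMDB data).
priors = {"neg": 0.5, "pos": 0.5}
word_given_class = {
    "neg": {"great": 0.1, "boring": 0.6},
    "pos": {"great": 0.7, "boring": 0.1},
}


def naive_bayes_scores(words):
    """Return the unnormalized score P(c) * prod_i P(w_i | c) for each class."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for w in words:
            # Conditional independence: each word contributes its own factor.
            score *= word_given_class[c][w]
        scores[c] = score
    return scores


def param_counts(n_features, n_classes):
    """Compare parameter counts for binary features (an assumption of this sketch).

    A full joint table needs 2**n - 1 values per class, while the naive
    model needs one value per feature per class.
    """
    full_joint = n_classes * (2 ** n_features - 1)
    naive = n_classes * n_features
    return full_joint, naive


scores = naive_bayes_scores(["great", "great", "boring"])
predicted = max(scores, key=scores.get)
print(predicted, scores)
print(param_counts(10, 2))
```

With only 10 binary features and 2 classes, the full joint table already needs 2046 values against 20 for the naive model, which is the saving the paragraph above refers to.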

The bag of words approach assumes that the order of the words in a sample text does not matter.  It builds a histogram of the words used in the samples of each output label class.  These histogram counts are then used to calculate the probability of each class given the words; only the relative values across classes matter.  A bag of words approach requires some cleanup of the data set to remove punctuation and capitalization and to isolate each word.  More advanced cleanup to improve accuracy includes removing stop words (the, a, and, etc.) and handling abbreviations and other cases that cause the same word to be counted under several forms.
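
A minimal sketch of the per-class histogram, using only the standard library. The two sample reviews and the `tokenize` helper are hypothetical; the cleanup here (lowercase, letters only, split on whitespace) is one simple choice among many.

```python
import re
from collections import Counter, defaultdict

# Hypothetical mini data set of (text, label) pairs, for illustration only.
samples = [
    ("Great movie, great cast!", 1),
    ("Boring plot. Boring movie.", 0),
]


def tokenize(text):
    """Lowercase the text, replace non-letters with spaces, split on whitespace."""
    return re.sub(r"[^a-z\s]+", " ", text.lower()).split()


# One word histogram per output label class.
histograms = defaultdict(Counter)
for text, label in samples:
    histograms[label].update(tokenize(text))

print(histograms[1])
print(histograms[0])
```

Word order is discarded: only how often each word appears in each class is kept.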

References:

https://towardsdatascience.com/na%C3%AFve-bayes-from-scratch-using-python-only-no-fancy-frameworks-a1904b37222d



In [1]:
#imports needed for code
import pandas as pd 
import numpy as np 
from collections import defaultdict
import re

In [2]:
#load the training set data
#the output label is sentiment; the movie review is the text that we will need to process
#the output label is 1 for a known positive review and 0 for a known negative review
dataset_full=pd.read_csv('labeledTrainData.tsv',sep='\t') # reading the training data-set
dataset_full

            id  sentiment  review
0       5814_8          1  With all this stuff going down at the moment w...
1       2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2       7759_3          0  The film starts with a manager (Nicholas Bell)...
3       3630_4          0  It must be assumed that those who praised this...
4       9495_8          1  Superbly trashy and wondrously unpretentious 8...
...        ...        ...  ...
24995   3453_3          0  It seems like more consideration has gone into...
24996   5064_1          0  I don't believe they made this film. Completel...
24997  10905_3          0  Guy is a loser. Can't get girls, needs to buil...
24998  10194_3          0  This 30 minute documentary Buñuel made in the ...


In [3]:
#extract the labels from the training set
#see example output
y_label=dataset_full['sentiment'].values
y_label


array([1, 1, 0, ..., 0, 0, 1], dtype=int64)

In [4]:
#extract each review
#see example output
x_feature=dataset_full['review'].values
x_feature

array(["With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it fina

In [5]:
#here we split the data into a training set and a test set using the train_test_split function of sklearn
#we set test_size to 20%; hence the training set will be 80%
from sklearn.model_selection import train_test_split

train_data,test_data,train_labels,test_labels=train_test_split(x_feature,y_label,shuffle=True,test_size=0.20,random_state=42,stratify=y_label)


In [6]:
#original set was 25000 records
y_label.size

25000

In [7]:
#the train data set is 80%, i.e. 20000
train_data.size

20000

In [8]:
#the test data set is 20% of 25000 i.e. 5000
test_data.size

5000

In [9]:
#verify the unique classes.  should be 0 and 1 for bad and good sentiment.  we see that is the case
classes=np.unique(train_labels)
classes

array([0, 1], dtype=int64)

Create a class to perform Naive Bayes with a Bag of Words.
Several support functions are needed to clean up the data, generate a bag of words, calculate probabilities, check a test phrase against the trained model to predict a result, etc.
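
Before the full class, here is a small sketch of the probability calculation at its core: an add-one (Laplace) smoothed word likelihood, summed in log space. The function name `smoothed_log_likelihood` and the toy counts are hypothetical. This sketch uses the textbook denominator count(c) + |V|, whereas the class that follows adds small extra constants to its denominator.

```python
import math


def smoothed_log_likelihood(tokens, class_counts, vocab_size):
    """Return the sum over tokens of log[(count(w, c) + 1) / (count(c) + |V|)]."""
    total_words = sum(class_counts.values())  # count(c)
    denom = total_words + vocab_size          # textbook add-one denominator
    return sum(
        math.log((class_counts.get(tok, 0) + 1) / denom) for tok in tokens
    )


# Hypothetical word counts for one class, for illustration only.
counts = {"great": 2, "movie": 1}
log_lik = smoothed_log_likelihood(["great", "awful"], counts, vocab_size=4)

# Multiplying many small probabilities underflows to 0.0,
# while the equivalent sum of logs stays finite.
tiny_product = math.prod([1e-5] * 100)
tiny_log_sum = sum(math.log(1e-5) for _ in range(100))
print(log_lik, tiny_product, tiny_log_sum)
```

The smoothing gives the unseen word "awful" a small non-zero probability instead of zeroing out the whole score, and working with logs keeps long reviews from underflowing.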

In [10]:
class NaiveBayes:
    
    def __init__(self,unique_classes):
        
        self.classes=unique_classes # the constructor simply receives the array of unique class labels of the training set
        

 
    def preprocess_string(self,str_arg):
    
        """
            We need a utility function to clean up the data by removing punctuation, extra spacing, 
            and need to change everything to lower case.  That way we are focusing only on the words 
            and not causing the same word to be counted as a separate instance.
            
            Parameters:
            ----------
            str_arg: example string to be preprocessed
        
            What the function does?
            -----------------------
            Preprocess the string argument - str_arg - such that :
            1. everything apart from letters is excluded
            2. multiple spaces are replaced by single space
            3. str_arg is converted to lower case 
        
            Example:
            --------
            Input :  Menu is absolutely perfect,loved it!
            Output:  menu is absolutely perfect loved it
        
            Returns:
            ---------
            Preprocessed string 
        
        """
    
        cleaned_str=re.sub(r'[^a-z\s]+',' ',str_arg,flags=re.IGNORECASE) #every character except letters and whitespace is replaced by a space
        cleaned_str=re.sub(r'\s+',' ',cleaned_str) #runs of whitespace are replaced by a single space
        cleaned_str=cleaned_str.lower() #converting the cleaned string to lower case
    
        return cleaned_str # returning the preprocessed string 
        

    def addToBagOfWords(self,example,dict_index):
        
        '''
          creates a Bag of Words 
          
            Parameters:
            1. example 
            2. dict_index - implies to which bag of words category this example belongs to
            What the function does?
            -----------------------
            It simply splits the example on the basis of space as a tokenizer and adds every tokenized word to
            its corresponding dictionary/BoW
            Returns:
            ---------
            Nothing
        
       '''
        
        if isinstance(example,np.ndarray): example=example[0]
     
        for token_word in example.split(): #for every word in preprocessed example
          
            self.bow_dicts[dict_index][token_word]+=1 #increment in its count
            
            
    def generate_BagOfWords(self,dataset,labels):
        
        '''
            Parameters:
            1. dataset - shape = (m X d)
            2. labels - shape = (m,)
            What the function does?
            -----------------------
            This is the training function which will train the Naive Bayes Model
            It will compute a Bag of Words for each category/class. 
            
            Returns:
            ---------
            Nothing
        
        '''
    
        self.examples=dataset #store the parameters on the instance
        self.labels=labels
        
        #create a dictionary to hold the bag of words for each class
        self.bow_dicts=np.array([defaultdict(lambda:0) for index in range(self.classes.shape[0])])
        
                 
        #constructing Bag of Words for each category
        for cat_index,cat in enumerate(self.classes):
                        
            #filter all examples of category == cat
            all_cat_examples=self.examples[self.labels==cat] 
                        
            #get examples cleaned up and preprocessed
            cleaned_examples=[self.preprocess_string(cat_example) for cat_example in all_cat_examples]
            
            cleaned_examples=pd.DataFrame(data=cleaned_examples)
            
            
            #now construct the Bag of Words of this particular category
            np.apply_along_axis(self.addToBagOfWords,1,cleaned_examples,cat_index)
         

            
                
    def train(self,dataset,labels):
              
        '''
            
            We are done with constructing of BoW for each category. But we need to precompute a few 
            other calculations at training time too:
            1. prior probability of each class - p(c)
            2. vocabulary |V| 
            3. denominator value of each class - [ count(c) + |V| + 1 ] 
            
                       ---------------------
            We could do all 3 of these calculations at test time too, BUT that would mean re-computing them 
            every time the test function is called, which would significantly increase the computation 
            time, especially when we have a lot of test examples to classify.  
            It does not make sense to repeatedly compute the same thing.
            
            So we will precompute all of them & use them during test time to speed up predictions.
            
        '''
        
      
        prob_classes=np.empty(self.classes.shape[0])
        all_words=[]
        cat_word_counts=np.empty(self.classes.shape[0])
        for cat_index,cat in enumerate(self.classes):
           
            #Calculating prior probability p(c) for each class
            prob_classes[cat_index]=np.sum(self.labels==cat)/float(self.labels.shape[0]) 
            
            #Calculating total counts of all the words of each class, plus 1 
            #|V| + 1 is added later in the denominator, so the effective denominator is count(c) + |V| + 2
            cat_word_counts[cat_index]=np.sum(np.array(list(self.bow_dicts[cat_index].values())))+1 
            
            #get all words of this category                                
            all_words+=self.bow_dicts[cat_index].keys()
                                                     
        
        #combine all words of every category & make them unique to get vocabulary -V- of entire training set
        
        self.vocab=np.unique(np.array(all_words))
        self.vocab_length=self.vocab.shape[0]
        
                                          
        #computing denominator value                                      
        denoms=np.array([cat_word_counts[cat_index]+self.vocab_length+1 for cat_index,cat in enumerate(self.classes)])                                                                          
    
        '''
            Now that we have everything precomputed as well, it's better to organize everything in a tuple 
            rather than to have a separate list for everything.
            
            Every element of self.cats_info has a tuple of values
            Each tuple has a dict at index 0, prior probability at index 1, denominator value at index 2
        '''
        
        self.cats_info=[(self.bow_dicts[cat_index],prob_classes[cat_index],denoms[cat_index]) for cat_index,cat in enumerate(self.classes)]                               
        self.cats_info=np.array(self.cats_info)                                 
        
                                              
                                              
    def getExampleProb(self,test_example):                                
        
        '''
            Parameters:
            -----------
            1. a single test example 
            What the function does?
            -----------------------
            Function that estimates posterior probability of the given test example
            Returns:
            ---------
            probability of test example in ALL CLASSES
        '''                                      
                                              
        likelihood_prob=np.zeros(self.classes.shape[0]) #to store probability w.r.t each class
        
        #finding probability w.r.t each class of the given test example
        for cat_index,cat in enumerate(self.classes): 
                             
            for test_token in test_example.split(): #split the test example and get p of each test word
                
                ####################################################################################
                                              
                #This loop computes : for each word w [ count(w|c)+1 ] / [ count(c) + |V| + 1 ]                               
                                              
                ####################################################################################                              
                
                #get the count of this test token from its class's training dict to get the numerator value
                test_token_counts=self.cats_info[cat_index][0].get(test_token,0)+1
                
                #now get likelihood of this test_token word                              
                test_token_prob=test_token_counts/float(self.cats_info[cat_index][2])                              
                
                #remember why taking log? To prevent underflow!
                likelihood_prob[cat_index]+=np.log(test_token_prob)
                                              
        # we have the log likelihood of the given example for every class, but we need the log posterior, so add the log prior
        post_prob=np.empty(self.classes.shape[0])
        for cat_index,cat in enumerate(self.classes):
            post_prob[cat_index]=likelihood_prob[cat_index]+np.log(self.cats_info[cat_index][1])                                  
      
        return post_prob
    
    def make_prediction(self, test_example):
        
        '''
            Parameters:
            -----------
            1. test_example - singular string
            
            What the function does?
            -----------------------
            
            This will clean the test_example string with the preprocessing function.
            Then it will calculate the posterior score of the example for each class.
            Finally, it will take the max score to determine which class is most likely.
            
            Returns:
            ---------
            The predicted Class
            The posterior probabilities of each class.
        '''       
       
        #preprocess the test example the same way we did for the training set examples
        cleaned_example=self.preprocess_string(test_example) 
             
        #get the log posterior score of this example for each class
        post_prob=self.getExampleProb(cleaned_example)
        #pick the index of the max value and map it to self.classes
        return self.classes[np.argmax(post_prob)], post_prob
   
    def test(self,test_set):
      
        '''
            Parameters:
            -----------
            1. A complete test set of shape (m,)
            
            What the function does?
            -----------------------
            Determines probability of each test example against all classes and predicts the label
            against which the class probability is maximum
            Returns:
            ---------
            Predictions of test examples - A single prediction against every test example
        '''       
       
        predictions=[] #to store prediction of each test example
        for example in test_set: 
            predicted_val, post_prob = self.make_prediction(example)
            predictions.append(predicted_val)
                
        return np.array(predictions) 

In [11]:
#Now to start training our model.
#first declare an instance of our NaiveBayes class for the given set of output label classes

nb=NaiveBayes(classes)



In [12]:
#Next demonstrate the text preprocessing
#see how it removed punctuation, extra spacing, and made all lower case
#that way we just have a string of words that can then be tokenized
nb.preprocess_string('Menu is     absolutely perfect,loved it!')

'menu is absolutely perfect loved it '

In [13]:
#now to demonstrate the training of the model
#build the bag of words and store it in a dictionary of each word and its histogram count of appearances, by class.
nb.generate_BagOfWords(train_data,train_labels)

#demonstrate the bag of words
print('Here is the bag of words which lists each word and its occurrence.')
print(' ')
nb.bow_dicts


Here is the bag of words which lists each word and its occurrence.
 


      dtype=object)

In [14]:
#now calculate the probabilities
nb.train(train_data,train_labels)


In [15]:
#here is the tuple that holds the dictionary, the prior probability and the denominator
#this is for the 0 label
nb.cats_info[0]

       0.5, 2451114.0], dtype=object)

In [16]:
#this is the prior probability of 0 label
nb.cats_info[0][1]

0.5

In [17]:
#this is the denominator of the 0 label
nb.cats_info[0][2]

2451114.0

In [18]:
#this is the denominator of the 1 label
nb.cats_info[1][2]

2513380.0

In [19]:
#this is the prior probability of 1 label
nb.cats_info[1][1]

0.5

In [20]:
#the training model is built.
# now to testing phase 
#this will take the test data set and apply it to result in a set of predicted classes

pclasses=nb.test(test_data)
pclasses



array([1, 1, 0, ..., 0, 1, 1], dtype=int64)

In [21]:
#now we need to compare the set of predicted classes with the known test labels and see where they are the same
#we can then calculate the accuracy from the total in the test set
test_acc=np.sum(pclasses==test_labels)/float(test_labels.shape[0])

print ("Test Set Accuracy: ",test_acc) 

Test Set Accuracy:  0.843


In [22]:
#you can use this cell to test your own movie review string against the trained model to see how the model does.
#a 1 is a positive and a 0 is a negative review prediction
#the array holds the log posterior score of each class; the class with the larger (less negative) score is predicted
nb.make_prediction('What a terrible move that was.  Avoid it!!!')

(0, array([-48.69348903, -52.41721751]))
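
The two values in the array are log scores, so they are not probabilities and do not sum to 1. If normalized probabilities are wanted, one option is the sketch below. The helper name `log_scores_to_probs` is hypothetical; it assumes the scores are log posteriors up to a shared constant, which is what the class above returns. Naive Bayes probabilities obtained this way tend to be overconfident, so they are best read as a relative comparison.

```python
import math


def log_scores_to_probs(log_scores):
    """Normalize log scores into probabilities that sum to 1."""
    top = max(log_scores)
    # Subtracting the max before exponentiating avoids underflow.
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Log scores copied from the output above (class 0, class 1).
probs = log_scores_to_probs([-48.69348903, -52.41721751])
print(probs)
```

For this example, class 0 (negative) receives most of the probability mass, which matches the predicted label.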

<b>Conclusion</b><br>
We have successfully implemented the Naive Bayes algorithm with a Bag of Words approach to conduct sentiment analysis on text.  We have demonstrated how to execute each step:  
1. Clean the data
2. Use Bag of Words to create a histogram of how many times each word appears for a given class.  
3. Apply the Naive Bayes algorithm to calculate the probabilities of the words in a given sample occurring in each class
4. Take the maximum probability to determine the predicted class.

Our model predicted the test data set with 84% accuracy after training on the training data set.  Our split was 20% test and 80% training.
We varied the random_state of the sklearn train_test_split to see if differently selected train and test sets produce different results.  This was indeed the case: the calculated probabilities were different.  However, the overall accuracy differed only on the order of 0.001.
We provided a method and a cell to test your own phrase.  Trying different phrases shows where the model is sometimes wrong, which is not surprising since it has only 84% accuracy.  It was still interesting to see those occurrences.
A potential future improvement would be to remove stop words and see whether that improves the model.
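
A minimal sketch of what stop-word removal could look like during tokenization. The stop-word list here is hypothetical and deliberately tiny, and the function name `tokenize_without_stop_words` is an assumption; whether this actually improves accuracy on the IMDB data has not been tested here.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "is", "it", "was", "that"}
)


def tokenize_without_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, keep letters only, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[^a-z\s]+", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in stop_words]


tokens = tokenize_without_stop_words("What a terrible movie that was. Avoid it!!!")
print(tokens)
```

The same filter would need to be applied to both the training and the test text so that the two are tokenized consistently.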