# 1. Naive Bayes

We are going to classify sentences with the Naive Bayes algorithm. The training data consists of abusive and non-abusive sentences, each given as a list of words. Our task is to predict whether a new sentence is abusive or not.


## 1.1 Prepare: transform sentences into vectors

In [1]:
def loadDataSet():
    ### Each sentence is given as a list of words.
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],       
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    ### Sentence labels: 1 is abusive, 0 is not abusive
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

# Create a list of the unique words across all sentences.
def createVocabList(dataSet):
    vocabSet = set()                      # Start with an empty set
    for document in dataSet:
        vocabSet |= set(document)         # Take the union with the words of this sentence
    return list(vocabSet)

In [2]:
postingList, classVec = loadDataSet()
myVocabList = createVocabList(postingList)
print('The vocabulary list is:\n', myVocabList)

The vocabulary list is:
 ['steak', 'is', 'quit', 'to', 'mr', 'problems', 'love', 'ate', 'help', 'cute', 'I', 'licks', 'posting', 'garbage', 'take', 'not', 'park', 'so', 'food', 'worthless', 'stop', 'dog', 'maybe', 'how', 'please', 'flea', 'has', 'my', 'dalmation', 'him', 'stupid', 'buying']


In [3]:
def setOfWords2Vec(vocabList, inputSet):
    '''
    Using the vocabulary list (vocabList), convert a word list (inputSet) into a vector of 1s and 0s of the 
    same length as the vocabulary list. 
    The $i$-th element of the output vector indicates whether the $i$-th word of the vocabulary list is present in 
    the word list.
    
    Args:
        vocabList - a vocabulary list
        inputSet - a word list
    Returns:
        returnVec - a vector of 1s and 0s of the same length as the vocabulary list
    '''
    returnVec = [0] * len(vocabList)                               # Create a vector of all 0s
    for word in inputSet:
        if word in vocabList:                                      # If the word is in the vocabulary list, set its entry to 1
            returnVec[vocabList.index(word)] = 1                   # Presence only (set-of-words), not a count
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

In [4]:
trainMat = []
for postinDoc in postingList:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
print('The 0-1 vector of the first sentence is:', trainMat[0])
print(len(trainMat[0]))

The 0-1 vector of the first sentence is: [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
32
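
The vector above records only whether a word is present. As a minimal sketch of the difference between this set-of-words encoding and a count-based (bag-of-words) encoding, the following uses two hypothetical helpers, `set_of_words` and `bag_of_words`, and a toy vocabulary that is not part of the data set above:

```python
from collections import Counter

def set_of_words(vocab, words):
    """Return 1 where a vocabulary word is present in `words`, 0 otherwise."""
    present = set(words)
    return [1 if w in present else 0 for w in vocab]

def bag_of_words(vocab, words):
    """Return how many times each vocabulary word occurs in `words`."""
    counts = Counter(words)          # missing words count as 0
    return [counts[w] for w in vocab]

vocab = ['dog', 'stupid', 'my']          # hypothetical toy vocabulary
sentence = ['stupid', 'dog', 'stupid']   # 'stupid' occurs twice

print(set_of_words(vocab, sentence))   # [1, 1, 0]
print(bag_of_words(vocab, sentence))   # [1, 2, 0]
```

None of the six training sentences repeats a word, so both encodings give the same vectors for this data set.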


## 1.2 Implement the Naive Bayes class

In [3]:
import numpy as np
class NaiveBayes():
    '''
    This is a class for Naive Bayes classification.
    
    The class stores the arrays of conditional probabilities for the abusive class and the non-abusive class.
    
    It also contains the methods for initializing the class, fitting the Naive Bayes classifier and using 
    the fitted model to predict test samples.
    
    Attributes:
        p0Vect (vector, num_vocablist) - array of conditional probabilities for the non-abusive class (label 0)
        p1Vect (vector, num_vocablist) - array of conditional probabilities for the abusive class (label 1)
        pAbusive (number in [0,1])     - the probability that a document belongs to the abusive class
        
    '''
    def __init__(self):
        self.p1Vect = 0
        self.p0Vect = 0
        self.pAbusive = 0
        
    def fit(self, trainMatrix, classVec):
        '''
        Fit the Naive Bayes classifier to the training data. To be specific, we estimate the class-conditional 
        probabilities $p(x_j|c)$ and the class prior $p(c)$.

        Args:
            trainMatrix (matrix, num_sentences * num_vocablist) : sentence matrix, built row by row with the function setOfWords2Vec()
            classVec (vector, num_sentences)                    : label vector, returned by the function loadDataSet()
        Returns:
            p0Vect (vector, num_vocablist) - array of conditional probabilities for the non-abusive class (label 0)
            p1Vect (vector, num_vocablist) - array of conditional probabilities for the abusive class (label 1)
            pAbusive (number in [0,1])     - the probability that a document belongs to the abusive class
        '''
        numTrainDocs = len(trainMatrix)                       
        numWords = len(trainMatrix[0])                        
        self.pAbusive = sum(classVec)/numTrainDocs     
        ### Word counts start at 1 (np.ones) because of Laplace smoothing
        p0Num = np.ones(numWords); p1Num = np.ones(numWords)
        ### The denominators start at 2 because of Laplace smoothing
        p0Denom = 2.0; p1Denom = 2.0
        ### Calculate the probabilities of appearance of all vocabulary words for the abusive and non-abusive classes.
        for i in range(numTrainDocs):
            ### Update p1Num, p1Denom, p0Num, p0Denom
            if classVec[i] == 1:   
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:                      
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        self.p1Vect = p1Num/p1Denom
        self.p0Vect = p0Num/p0Denom
        return self.p0Vect, self.p1Vect, self.pAbusive
    

    def predict(self, vec2Classify):
        '''
        Args:
            vec2Classify - the 0-1 vector of the sentence to be classified, as returned by setOfWords2Vec()
        Returns:
            0/1 - classified as not abusive/abusive
        '''
        ### Work with log-probabilities so that many small factors are summed instead of multiplied
        logp1Vect = np.log(self.p1Vect)
        logp0Vect = np.log(self.p0Vect)
        p1 = np.sum(vec2Classify * logp1Vect) + np.log(self.pAbusive)
        p0 = np.sum(vec2Classify * logp0Vect) + np.log(1.0 - self.pAbusive)
        if p1 > p0:
            return 1
        else:
            return 0
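
Read off the `fit` and `predict` methods above, the quantities they compute can be written as follows. The symbols $x_{ij}$ (entry $j$ of training vector $i$), $y_i$ (its label), $n$ (number of training sentences) and $x_j$ (entry $j$ of the vector to classify) are notation introduced here for illustration only.

$$
p(c=1) = \frac{1}{n}\sum_{i=1}^{n} y_i,
\qquad
p(w_j \mid c) = \frac{1 + \sum_{i:\, y_i = c} x_{ij}}{2 + \sum_{i:\, y_i = c} \sum_{k} x_{ik}}
$$

$$
\text{score}(c) = \log p(c) + \sum_{j} x_j \log p(w_j \mid c),
\qquad
\hat{c} =
\begin{cases}
1 & \text{if } \text{score}(1) > \text{score}(0) \\
0 & \text{otherwise}
\end{cases}
$$

The numerator and denominator offsets (1 and 2) are the smoothing constants with which `p0Num`, `p1Num`, `p0Denom` and `p1Denom` are initialized.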

## 1.3 Fit model

In [27]:
NBmodel = NaiveBayes()
p0V, p1V, pAb = NBmodel.fit(trainMat, classVec)
print('p0V:\n', p0V)
print('p1V:\n', p1V)
print('pAbusive:\n', pAb)

p0V:
 [0.07692308 0.07692308 0.07692308 0.03846154 0.07692308 0.07692308
 0.07692308 0.07692308 0.03846154 0.03846154 0.07692308 0.15384615
 0.07692308 0.03846154 0.07692308 0.07692308 0.07692308 0.07692308
 0.07692308 0.03846154 0.11538462 0.03846154 0.07692308 0.07692308
 0.03846154 0.07692308 0.07692308 0.03846154 0.03846154 0.03846154
 0.07692308 0.03846154]
p1V:
 [0.14285714 0.04761905 0.04761905 0.0952381  0.0952381  0.04761905
 0.04761905 0.04761905 0.0952381  0.0952381  0.04761905 0.04761905
 0.04761905 0.0952381  0.04761905 0.04761905 0.0952381  0.04761905
 0.04761905 0.0952381  0.0952381  0.0952381  0.04761905 0.04761905
 0.19047619 0.04761905 0.04761905 0.0952381  0.0952381  0.14285714
 0.04761905 0.0952381 ]
pAbusive:
 0.5
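
As a sanity check on the smoothed estimates, two of them can be recomputed by hand in plain Python. This is a minimal sketch: `smoothed_estimate` is a hypothetical helper, not part of the `NaiveBayes` class, and it relies on the fact that no training sentence repeats a word, so word counts and 0-1 presence coincide.

```python
# Training sentences and labels copied from loadDataSet() above.
posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = [0, 1, 0, 1, 0, 1]

def smoothed_estimate(word, cls):
    """(1 + occurrences of `word` in class `cls`) / (2 + total words in class `cls`)."""
    docs = [doc for doc, y in zip(posts, labels) if y == cls]
    word_count = sum(doc.count(word) for doc in docs)
    total_words = sum(len(doc) for doc in docs)
    return (1 + word_count) / (2 + total_words)

print(smoothed_estimate('stupid', 1))  # (1 + 3) / (2 + 19) = 4/21, about 0.1905
print(smoothed_estimate('my', 0))      # (1 + 3) / (2 + 24) = 4/26, about 0.1538
```

Both values also appear in the printed `p1V` and `p0V` arrays. Their positions depend on the vocabulary order, which can change between runs because the vocabulary is built from a Python set.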


## 1.4 Predict new sentences

In [28]:
testEntry = ['love', 'my', 'dalmation']
thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
print('{} is classified as: {}'.format(testEntry, NBmodel.predict(thisDoc)))

testEntry = ['stupid', 'garbage']
thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
print('{} is classified as: {}'.format(testEntry, NBmodel.predict(thisDoc)))

['love', 'my', 'dalmation'] is classified as: 0
['stupid', 'garbage'] is classified as: 1
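
`predict` only compares the two log scores and returns a label. If a probability is wanted instead, the two scores can be normalized. The sketch below is one way to do that under stated assumptions: `posterior_from_log_scores` is a hypothetical helper, and the example scores are made-up numbers, not values produced by the fitted model.

```python
import math

def posterior_from_log_scores(log_p0, log_p1):
    """Turn two unnormalized log scores into P(class 1 | sentence)."""
    m = max(log_p0, log_p1)        # subtract the maximum before exponentiating
    e0 = math.exp(log_p0 - m)
    e1 = math.exp(log_p1 - m)
    return e1 / (e0 + e1)

# Hypothetical log scores for illustration.
print(posterior_from_log_scores(-7.0, -7.0))       # 0.5
# Exponentiating -800 and -805 directly would underflow to 0.0 for both scores;
# shifting by the maximum keeps the ratio well defined.
print(posterior_from_log_scores(-800.0, -805.0))
```

Subtracting the larger score first leaves the ratio unchanged and avoids dividing zero by zero when both scores are very negative.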
