# Naive Bayes

### 1.Classifying with Bayesian decision theory

Naive Bayes is a subset of Bayesian decision theory, so we need to talk about Bayesian decision theory quickly before we get to naive Bayes.
Assume for a moment that we have a dataset with two classes of data inside. A plot of this data is shown in figure 4.1.

![](picture/05.png)

We have an equation for the probability of a piece of data belonging to Class 1 (the circles), p1(x,y), and an equation for the probability of it belonging to Class 2 (the triangles), p2(x,y).
To classify a new measurement with features (x,y), we use the following rules:
- if p1(x,y) > p2(x,y), then the class is 1.
- if p1(x,y) < p2(x,y), then the class is 2.

With kNN, we would need about 1000 distance calculations. With a decision tree, we would split the data once along the x-axis and once along the y-axis, which would not separate these two classes well. So the best choice is the probability comparison we just discussed.

### 2.What is Conditional probability?

Let’s spend a few minutes talking about probability and conditional probability. If you’re comfortable with the p(x,y|c1) symbol, you may want to skip this section.

![](picture/06.png)

Let’s assume for a moment that we have a jar containing seven stones. Three of these stones are gray and four are black, as shown in figure 4.2. If we stick a hand into this jar and randomly pull out a stone, what are the chances that the stone will be gray? There are seven possible stones and three are gray, so the probability is 3/7. What is the probability of grabbing a black stone? It’s 4/7. We write the probability of gray as P(gray). We calculated the probability of drawing a gray stone P(gray) by counting the number of gray stones and dividing this by the total number of stones.

What if the seven stones were in two buckets? This is shown in figure 4.3.

![](picture/07.png)

If you want to calculate P(gray) or P(black), would knowing the bucket change the answer? If you wanted to calculate the probability of drawing a gray stone from bucket B, you could probably figure out how to do that. This is known as conditional probability. We’re calculating the probability of a gray stone, given that the unknown stone comes from bucket B. We can write this as P(gray|bucketB), and this would be read as “the probability of gray given bucket B.” It’s not hard to see that P(gray|bucketA) is 2/4 and P(gray|bucketB) is 1/3.

To formalize how to calculate the conditional probability, we can say

P(gray|bucketB) = P(gray and bucketB)/P(bucketB)

Let’s see if that makes sense: P(gray and bucketB) = 1/7. This was calculated by taking the number of gray stones in bucket B and dividing by the total number of stones. Now, P(bucketB) is 3/7 because there are three stones in bucket B of the total seven stones. Finally,

P(gray|bucketB) = P(gray and bucketB)/P(bucketB) = (1/7) / (3/7) = 1/3.
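
The same calculation can be checked by counting. The following is a minimal sketch, assuming the bucket contents implied by the text (bucket A holds 2 gray and 2 black stones, bucket B holds 1 gray and 2 black); the variable names are illustrative only.

```python
from fractions import Fraction

# Stones as (bucket, color) pairs, matching the counts described in the text.
stones = [("A", "gray"), ("A", "gray"), ("A", "black"), ("A", "black"),
          ("B", "gray"), ("B", "black"), ("B", "black")]

total = len(stones)
# P(gray and bucketB): gray stones in bucket B over all stones
p_gray_and_b = Fraction(sum(1 for b, c in stones if b == "B" and c == "gray"), total)
# P(bucketB): stones in bucket B over all stones
p_b = Fraction(sum(1 for b, _ in stones if b == "B"), total)
# Conditional probability P(gray|bucketB)
p_gray_given_b = p_gray_and_b / p_b

print(p_gray_and_b, p_b, p_gray_given_b)  # 1/7 3/7 1/3
```

Using `Fraction` keeps the result exact, so it can be compared directly with the hand calculation.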

**Another useful way to manipulate conditional probabilities is known as Bayes’ rule.**
If we have P(x|c) but want P(c|x), Bayes’ rule tells us how to swap the two:
## $p(c|x) = \frac{p(x|c)p(c)}{p(x)}$
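
As a small worked example, Bayes’ rule can be applied to the stones: given P(gray|bucketB), P(bucketB) and P(gray), we can recover P(bucketB|gray). This is a sketch; the helper name `bayes` is hypothetical and not part of the original code.

```python
from fractions import Fraction

def bayes(p_x_given_c, p_c, p_x):
    """Return p(c|x) computed from p(x|c), p(c) and p(x)."""
    return p_x_given_c * p_c / p_x

p_gray_given_b = Fraction(1, 3)  # P(gray|bucketB)
p_b = Fraction(3, 7)             # P(bucketB): three of seven stones
p_gray = Fraction(3, 7)          # P(gray): three of seven stones

p_b_given_gray = bayes(p_gray_given_b, p_b, p_gray)
print(p_b_given_gray)  # 1/3
```

The result agrees with direct counting: one of the three gray stones sits in bucket B.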

### 3.Classifying with conditional probabilities

Given a point identified as x,y, what is the probability it came from class c1? What is the probability it came from class c2? We compare the two posterior probabilities:
- If P(c1|x, y) > P(c2|x, y), the class is c1. 
- If P(c1|x, y) < P(c2|x, y), the class is c2.
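
The comparison above can be sketched in a few lines. Because p(x,y) is the same for both classes, it is enough to compare p(x,y|c)p(c). The function name `classify` and all numbers below are hypothetical values chosen for illustration, not taken from the figure.

```python
def classify(likelihoods, priors):
    """Return the class label with the largest p(x,y|c) * p(c)."""
    scores = {c: likelihoods[c] * priors[c] for c in priors}
    return max(scores, key=scores.get)

# Hypothetical values for a single point (x, y).
likelihoods = {"c1": 0.02, "c2": 0.05}  # p(x,y|c)
priors = {"c1": 0.7, "c2": 0.3}         # p(c)

label = classify(likelihoods, priors)
print(label)  # c2, since 0.05 * 0.3 > 0.02 * 0.7
```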

**Note:** naive Bayes assumes "independence": the features are treated as conditionally independent of each other given the class.

### 4.Classifying text with Python

In order to get features from our text, we need to split up the text. But how do we do that? Our features are going to be tokens we get from the text. A token is any combination of characters. You can think of tokens as words, but we may use things that aren’t words such as URLs, IP addresses, or any string of characters. We’ll reduce every piece of text to a vector of tokens where 1 represents the token existing in the document and 0 represents that it isn’t present.

To see this in action, let’s make a quick filter for an online message board that flags a message as inappropriate if the author uses negative or abusive language. Filtering out this sort of thing is common because abusive postings make people not come back and can hurt an online community. We’ll have two categories: abusive and not. We’ll use 1 to represent abusive and 0 to represent not abusive.


### 4.1 Prepare: making word vectors from text

In [1]:
def loadDataSet():
    """
    Create dataset
    
    returns:
        posting list and classVec
    """
    postingList = [['my','dog','has','flea','problems','help','please'],
                  ['maybe','not','take','him','to','dog','park','stupid'],
                  ['my','dalmation','is','so','cute','I','love','him'],
                  ['stop','posting','stupid','worthless','grabage'],
                  ['mr','licks','ate','my','steak','how','to','stop','him'],
                  ['quit','buying','worthless','dog','food','stupid']]
    classVec = [0,1,0,1,0,1] # 1 is abusive, 0 is not
    
    return postingList,classVec

In [2]:
def createVocabList(dataSet):
    """
    Create a list of all the unique words in all of our documents.
    
    return:
        vocabSet 
    """
    vocabSet = set() # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) # create the union of two sets
    return list(vocabSet)

In [3]:
def setOfWords2Vec(vocabList,inputSet):
    """
    Mark which vocabulary words appear in inputSet: 1 if present, 0 if not.
    
    return:
        returnVec
    """
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("The word:{} is not in my Vocabulary !".format(word))
    return returnVec

In [4]:
listOPosts,listClasses = loadDataSet()
print("List0Post = ",listOPosts)
print("listClasses = ",listClasses)
myVocabList = createVocabList(listOPosts)
print("myVocaList = ",myVocabList)

List0Post =  [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'grabage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
listClasses =  [0, 1, 0, 1, 0, 1]
myVocaList =  ['buying', 'mr', 'not', 'dalmation', 'how', 'help', 'is', 'steak', 'ate', 'food', 'stupid', 'love', 'posting', 'my', 'so', 'take', 'I', 'dog', 'flea', 'licks', 'park', 'quit', 'problems', 'has', 'grabage', 'please', 'him', 'maybe', 'cute', 'to', 'stop', 'worthless']


If you examine this list, you’ll see that there are no repeated words.

In [5]:
returnVec = setOfWords2Vec(vocabList=myVocabList,inputSet=listOPosts[0])
print(returnVec)

[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]


In [6]:
returnVec = setOfWords2Vec(vocabList=myVocabList,inputSet=listOPosts[1])
print(returnVec)

[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0]


In [7]:
returnVec = setOfWords2Vec(vocabList=myVocabList,inputSet=listOPosts[3])
print(returnVec)
print(len(returnVec))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1]
32


### 4.2 Train: calculating probabilities from word vectors

Now that you’ve seen how to convert from words to numbers, let’s see how to calculate the probabilities with these numbers. You know whether a word occurs in a document, and you know which class the document belongs to. Bayes’ rule from section 2 is rewritten here, with x,y replaced by w. Here w is a vector; that is, we have many values, in our case as many values as words in our vocabulary.

## $p(c_i|w) = \frac{p(w|c_i)p(c_i)}{p(w)}$

- w: the word vector of a document
- i: the class index, 0 or 1

We’re going to use the right side of the formula to get the value on the left. We’ll do this for each class and compare the two probabilities. How do we get the stuff on the right? We can calculate p(ci) by adding up how many times we see class i (abusive posts or non-abusive posts) and then dividing by the total number of posts. How can we get p(w|ci)? This is where our naïve assumption comes in. If we expand w into individual features, we could rewrite this as p(w0,w1,w2,...,wN|ci). Our assumption that all the words are independently likely, something called conditional independence, says we can calculate this probability as p(w0|ci)p(w1|ci)p(w2|ci)...p(wN|ci). This makes our calculations a lot easier.

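**Note:** because a Python "set" is unordered, the order of the vocabulary list, and with it the argmax index printed in the training output, may change between runs. Either way, the word that best indicates class 1 (abusive) is "stupid".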
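
If stable indices matter, one option is to sort the vocabulary. The following is a minimal sketch of that idea, not the notebook's own function; the name `create_sorted_vocab` and the lookup dictionary are assumptions added for illustration.

```python
def create_sorted_vocab(documents):
    """Return a sorted vocabulary list and a word -> index lookup table."""
    vocab = sorted(set(word for doc in documents for word in doc))
    index = {word: i for i, word in enumerate(vocab)}
    return vocab, index

docs = [["my", "dog", "has", "flea"], ["stop", "stupid", "dog"]]
vocab, index = create_sorted_vocab(docs)
print(vocab)            # same order on every run
print(index["stupid"])  # position of "stupid" in the sorted vocabulary
```

Sorting costs a little extra time, but the same word then always maps to the same position, which makes printed vectors comparable across runs.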

### Example

If we need to solve for $p(c_1|w)$, then
## $p(c_{1}|w) = \frac{p(w|c_{1})p(c_{1})}{p(w)} = \frac{p(c_{1})}{p(w)}\, p(w_0,w_1,\ldots,w_n|c_1)$

Here $p(c_1) = \frac{3}{6} = 0.5$, because three of the six posts are abusive.

Under the independence hypothesis, $p(w) = p(w_0,w_1,\ldots,w_n) = p(w_0)p(w_1)\cdots p(w_n)$.

In practice $p(w)$ is the same for both classes, so it does not affect the comparison and we can ignore it.

**The important part** is $p(w_0,w_1,\ldots,w_n|c_1)$.

**Note:** Under the independence hypothesis, $p(w_0,w_1,\ldots,w_n|c_1) = p(w_0|c_1)p(w_1|c_1)\cdots p(w_n|c_1)$

Imagine we have a list called "c_1_list" that holds, for each vocabulary word, how many times that word appears in the documents of class "c_1" (abusive). Then

p(w_0|c_1) = first element of c_1_list / sum of all elements of c_1_list

#### For example:

If c_1_list = [1,0,1,0,1,1], then $p(w_0|c_1) = \frac{1}{4}$, which is how the training code divides the word counts by the total word count of the class. We can use the same method to calculate every $p(w_i|c_1)$.

Finally, we can calculate $p(w_0,w_1,\ldots,w_n|c_1) = p(w_0|c_1)p(w_1|c_1)\cdots p(w_n|c_1)$

**Ps:** Multiplying many small probabilities underflows in floating-point arithmetic, so we work with the natural logarithm $\ln(a)$ instead.

Do we lose anything by using the natural log of a number rather than the number itself? The answer is no.
 
Figure 4.4 plots two functions, f(x) and ln(f(x)). If you examine both of these plots, you’ll see that they increase and decrease in the same areas, and they have their peaks in the same areas. Their values are different, but that’s fine. 

![](picture/08.png)
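
To see why the logarithm helps, the following sketch multiplies 200 hypothetical word probabilities of 0.01 each. The direct product is too small for a 64-bit float, while the sum of logs stays well within range. The numbers are invented for illustration.

```python
import numpy as np

probs = np.full(200, 0.01)       # 200 hypothetical word probabilities

product = np.prod(probs)         # true value 1e-400 is below the float64 range
log_sum = np.sum(np.log(probs))  # the same quantity on a log scale

print(product)  # 0.0
print(log_sum)  # a finite negative number
```

Because the logarithm is monotonically increasing, the class with the larger log score is also the class with the larger probability.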

#### So,
$p(c_{1}|w)$ is rewritten as follows:
- We ignore $p(w)$, because it is the same for both classes.
    - $\ln(p(w|c_{1})p(c_{1})) = \ln p(w|c_1) + \ln p(c_1) = \ln p(w_0|c_1) + \ln p(w_1|c_1) + \cdots + \ln p(w_n|c_1) + \ln p(c_1)$
    
- In the code:
    - sum(list of $\ln p(w_i|c_1)$) + log($p(c_1)$)
    
**PPs:** 

Without the "$\ln()$" function, the score looks something like $p(w_0|c_1)p(w_1|c_1)p(w_2|c_1)$.

If any of these numbers is 0, then when we multiply them together we get 0. 

To lessen the impact of this, we’ll initialize all of our occurrence counts to 1 and initialize the denominators to 2 in trainNB() below.
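
The effect of this initialization can be sketched on a tiny, made-up count vector. The counts below are hypothetical; only the "start counts at 1, start the denominator at 2" rule comes from the text.

```python
import numpy as np

counts = np.array([0, 3, 1])  # hypothetical word counts within one class

# Without smoothing: a word that was never seen gets probability 0.
raw = counts / counts.sum()

# With smoothing: every count starts at 1 and the denominator starts at 2.
smoothed = (counts + 1) / (counts.sum() + 2)

print(raw)
print(smoothed)
print(np.log(smoothed))  # finite for every word
```

With this particular initialization the smoothed values need not sum to 1. That is acceptable here because the classifier only compares scores between classes.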

In [8]:
import numpy as np

In [9]:
def trainNB0(trainMatrix,trainCategory):
    """
    Estimate the naive Bayes probabilities from word vectors (no smoothing).
    
    returns:
        p0Vect: p(w_i|c_0) for every vocabulary word
        p1Vect: p(w_i|c_1) for every vocabulary word
        pAbusive: p(c_1), the fraction of abusive documents
    """
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = np.sum(trainCategory) / float(numTrainDocs) # two class problem calculate c_1, c_0 = 1 - c_1
    
    # initialize
    p0Num = np.zeros((1,numWords))
    p1Num = np.zeros((1,numWords))
    p0Denom = 0.
    p1Denom = 0.
    
    for i in range(numTrainDocs):
        # create "condition c_1_list" or "condition c_0_list"
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += np.sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += np.sum(trainMatrix[i])
    
    # calculate p(w_i|c_i)
    p1Vect = p1Num / p1Denom
    p0Vect = p0Num / p0Denom
    
    return p0Vect,p1Vect,pAbusive

In [10]:
listOPosts,listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat = []
for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
p0V,p1V,pAb = trainNB0(trainMat,listClasses)
print("pAb:",pAb)
print("p0V:",p0V)
print("p1V:",p1V)
print("argmax p1V: ",np.argmax(p1V[0]))

pAb: 0.5
p0V: [[0.         0.04166667 0.         0.04166667 0.04166667 0.04166667
  0.04166667 0.04166667 0.04166667 0.         0.         0.04166667
  0.         0.125      0.04166667 0.         0.04166667 0.04166667
  0.04166667 0.04166667 0.         0.         0.04166667 0.04166667
  0.         0.04166667 0.08333333 0.         0.04166667 0.04166667
  0.04166667 0.        ]]
p1V: [[0.05263158 0.         0.05263158 0.         0.         0.
  0.         0.         0.         0.05263158 0.15789474 0.
  0.05263158 0.         0.         0.05263158 0.         0.10526316
  0.         0.         0.05263158 0.05263158 0.         0.
  0.05263158 0.         0.05263158 0.05263158 0.         0.05263158
  0.05263158 0.10526316]]
argmax p1V:  10


First, you found the probability that a document was abusive: pAb; this is 0.5, which is correct. Next, you found the probabilities of the words from our vocabulary given the document class. Let’s see if this makes sense. In the run shown above, the word "dalmation" is at index 3 of our vocabulary. It appears once in the 0 class and never in the 1 class. The probabilities are 0.04166667 and 0.0. This makes sense. Let’s look for the largest probability. That’s 0.15789474 in the p1V array at index 10. If you look at the word in myVocabList at index 10, you’ll see that it’s the word "stupid". 

**This tells you that the word stupid is most indicative of a class 1 (abusive).**

Next we change "np.zeros" to "np.ones", initialize both denominators to 2, and return the "log" of the probabilities.

In [17]:
def trainNB(trainMatrix,trainCategory):
    """
    Estimate the naive Bayes log-probabilities with smoothed counts.
    returns:
        p0Vect: ln p(w_i|c_0) for every vocabulary word
        p1Vect: ln p(w_i|c_1) for every vocabulary word
        pAbusive: p(c_1), the fraction of abusive documents
    """
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = np.sum(trainCategory) / float(numTrainDocs) # two class problem calculate c_1, c_0 = 1 - c_1
    
    # initialize
    p0Num = np.ones((1,numWords))
    p1Num = np.ones((1,numWords))
    p0Denom = 2.
    p1Denom = 2.
    
    for i in range(numTrainDocs):
        # create "condition c_1_list" or "condition c_0_list"
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += np.sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += np.sum(trainMatrix[i])
    
    # calculate p(w_i|c_i)
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    
    return p0Vect,p1Vect,pAbusive

In [18]:
def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
    """
    Classify a word vector with naive Bayes.
    returns:
        1: abusive
        0: not
    """
    # score for class i: sum of ln p(w_j|c_i) over the words present, plus ln p(c_i)
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1. - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

In [19]:
def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
    p0V,p1V,PAb = trainNB(np.array(trainMat),np.array(listClasses))
    testEntry = ["love","my","dalmation"]
    thisDoc = np.array(setOfWords2Vec(myVocabList,testEntry))
    print(testEntry,"classified as: ",classifyNB(thisDoc,p0V,p1V,PAb))
    testEntry = ["stupid","grabage"]
    thisDoc = np.array(setOfWords2Vec(myVocabList,testEntry))
    print(testEntry,"classified as: ",classifyNB(thisDoc,p0V,p1V,PAb))

In [20]:
testingNB()

['love', 'my', 'dalmation'] classified as:  0
['stupid', 'grabage'] classified as:  1


### 4.3 Prepare: the bag-of-words document model


Up until this point we’ve treated the presence or absence of a word as a feature. This could be described as a set-of-words model. If a word appears more than once in a document, that might convey some sort of information about the document over just the word occurring in the document or not. This approach is known as a bag-of-words model. A bag of words can have multiple occurrences of each word, whereas a set of words can have only one occurrence of each word. To accommodate for this we need to slightly change the function setOfWords2Vec() and call it bagOfWords2VecMN().
The code to use the bag-of-words model is given in the following listing. It’s nearly identical to the function setOfWords2Vec() listed earlier, except every time it encounters a word, it increments the word vector rather than setting the word vector to 1 for a given index.

In [None]:
def bagOfWords2VecMN(vocabList,inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            # increment the count at this word's index instead of setting it to 1,
            # so the vector records how many times each word appears
            returnVec[vocabList.index(word)] += 1
    return returnVec
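
As an alternative sketch of the same bag-of-words idea, the counting can be delegated to `collections.Counter` from the standard library. The function name `bag_of_words_counter` is hypothetical, and the sketch assumes the vocabulary list contains no duplicates.

```python
from collections import Counter

def bag_of_words_counter(vocab_list, input_words):
    """Count how often each vocabulary word occurs in input_words."""
    counts = Counter(input_words)  # missing words count as 0
    return [counts[word] for word in vocab_list]

vocab = ["dog", "my", "stupid"]
vec = bag_of_words_counter(vocab, ["stupid", "dog", "stupid", "cat"])
print(vec)  # [1, 0, 2]
```

This avoids the repeated `vocabList.index(word)` lookups, which scan the list each time, and words outside the vocabulary ("cat" here) are simply ignored.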

### 4.4 Test: cross validation with naïve Bayes

In [22]:
import re

In [75]:
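def textParse(bigString):
    # split on runs of non-word characters; '\W*' can also match the empty string
    listOfToken = re.split(r'\W+',bigString)
    # keep lowercase tokens longer than two characters
    return [tok.lower() for tok in listOfToken if len(tok)>2]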
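
To illustrate the tokenizing step on its own, here is a small self-contained sketch. It uses the pattern `\W+` (one or more non-word characters); since Python 3.7, `re.split` also splits on empty matches, so a pattern that can match the empty string, such as `\W*`, would break the text into single characters. The function name `tokenize` and the sample sentence are made up for illustration.

```python
import re

def tokenize(text):
    """Split on runs of non-word characters; keep lowercase tokens longer than 2."""
    return [tok.lower() for tok in re.split(r"\W+", text) if len(tok) > 2]

tokens = tokenize("This book is the best book on Python!")
print(tokens)  # ['this', 'book', 'the', 'best', 'book', 'python']

# With a pattern that can match the empty string, no piece is longer than one character.
pieces = re.split(r"\W*", "best book")
print(max(len(p) for p in pieces))
```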

In [79]:
def spamTest():
    
    docList = []
    classList = []
    fullText = []
    # load and parse the 25 spam and 25 ham emails
    for i in range(1,26):
        with open('data_set/email/spam/{}.txt'.format(i),errors="ignore") as f:
            wordList = textParse(f.read())
        docList.append(wordList) # document list like [["hello","world"],["hey","name"]]
        fullText.extend(wordList) # all tokens in one flat list
        classList.append(1) # 1 = spam
        with open('data_set/email/ham/{}.txt'.format(i)) as f:
            wordList = textParse(f.read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0) # 0 = ham
    
    vocabList = createVocabList(docList) # vocabulary list like ["hello","world","hey","name"]
    
    # split into training set and test set ("hold-out cross validation"):
    # move 10 randomly chosen document indices from trainingSet to testSet
    trainingSet = list(range(50))
    testSet = []
    for i in range(10):
        randIndex = int(np.random.uniform(0,len(trainingSet))) # random position in trainingSet
        testSet.append(trainingSet[randIndex]) # add the chosen index to the test set
        del(trainingSet[randIndex]) # remove it from the training set
    
    # build the training matrix and train the classifier
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList,docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB(np.array(trainMat),np.array(trainClasses))
    
    # classify the test set and count the errors
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList,docList[docIndex])
        if classifyNB(np.array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1 # one more misclassified email
    print("The error rate is: ",float(errorCount / len(testSet)))
        

In [85]:
spamTest()

The error rate is:  0.0




In fact, the way this code draws its "hold-out cross validation" test set is not particularly clean.

We can instead shuffle the document indices with "np.random.shuffle()", take the first 10 shuffled indices as the test set, and use the remaining indices as the training set.
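
A shuffle-based split could look like the following sketch. It uses `permutation` from NumPy's `Generator` API rather than `np.random.shuffle()`, which returns a shuffled copy of the indices instead of shuffling in place. The function name `hold_out_split` and the `seed` parameter are assumptions added for illustration.

```python
import numpy as np

def hold_out_split(n_samples, n_test, seed=None):
    """Return (train_indices, test_indices) drawn from one random permutation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)  # shuffled copy of 0..n_samples-1
    return indices[n_test:].tolist(), indices[:n_test].tolist()

train_idx, test_idx = hold_out_split(50, 10, seed=0)
print(len(train_idx), len(test_idx))  # 40 10
```

Passing a fixed `seed` makes the split reproducible; leaving it as `None` gives a different split on each call.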

The function spamTest() displays the error rate from 10 randomly selected emails. Since these are randomly selected, the results may be different each time. To get a good estimate of the error rate, you should repeat this procedure multiple times, say 10, and average the results.
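
Averaging over repeated runs can be sketched as follows. Note that spamTest() as written prints the error rate instead of returning it, so this sketch assumes a trial function that returns the rate. Both `average_error` and the stand-in `fake_trial` are hypothetical.

```python
import numpy as np

def average_error(run_trial, n_trials=10):
    """Call run_trial() n_trials times and return the mean error rate."""
    return float(np.mean([run_trial() for _ in range(n_trials)]))

# Stand-in trial for illustration only; a real trial would train the
# classifier on a fresh random split and return its test error rate.
rng = np.random.default_rng(0)

def fake_trial():
    return rng.integers(0, 3) / 10  # 0, 1 or 2 errors out of 10 test emails

mean_error = average_error(fake_trial)
print(mean_error)
```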

### 4.5 Summary

Using probabilities can sometimes be more effective than using hard rules for classification. Bayesian probability and Bayes’ rule give us a way to estimate unknown probabilities from known values.
You can reduce the need for a lot of data by assuming conditional independence among the features in your data. The assumption we make is that the probability of one word doesn’t depend on any other words in the document. We know this assumption is a little simple. That’s why it’s known as naïve Bayes. Despite its incorrect assumptions, naïve Bayes is effective at classification.
There are a number of practical considerations when implementing naïve Bayes in a modern programming language. Underflow is one problem that can be addressed by using the logarithm of probabilities in your calculations. The bag-of-words model is an improvement on the set-of-words model when approaching document classification. There are a number of other improvements, such as removing stop words, and you can spend a long time optimizing a tokenizer.
The probability theory you learned in this chapter will be used again later in the book, and this chapter was a great introduction to the full power of Bayesian probability theory. We’re going to take a break from probability theory. You’ll next see a classification method called logistic regression and some optimization algorithms.