# Bayes?
The interpretation of probability used here is called Bayesian probability; it is popular
and works well in practice. It is named after Thomas Bayes, an eighteenth-century
theologian. Bayesian probability allows prior knowledge and logic to be applied to
uncertain statements. The other main interpretation, frequentist (or frequency)
probability, draws conclusions only from data and does not allow for logic and prior
knowledge.

# Theory:
The figure below is an abstract illustration of the procedure the naive Bayes classifier uses to choose the topic for a document. In the training corpus, most documents are automotive, so the classifier starts out at a point closer to the “automotive” label. It then considers the effect of each feature. In this example, the input document contains the word “dark,” which is a weak indicator for murder mysteries, but it also contains the word “football,” which is a strong indicator for sports documents. After every feature has made its contribution, the classifier checks which label it is closest to and assigns that label to the input.



![here](../images/naive_bayes/naive-bayes-triangle.png)

## Calculating label likelihoods with naive Bayes: 
Naive Bayes begins by calculating the prior probability of each label, based on how frequently each label occurs in the training data. Every feature then contributes to the likelihood estimate for each label, by multiplying it by the probability that input values with that label will have that feature. The resulting likelihood score can be thought of as an estimate of the probability that a randomly selected value from the training set would have both the given label and the set of features, assuming that the feature probabilities are all independent.



![](../images/naive_bayes/naive_bayes_bargraph.png)
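
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a full classifier: the label names follow the topic example, but the priors and per-feature probabilities are made-up numbers, and `label_likelihoods` is a hypothetical helper name.

```python
# Hypothetical toy numbers, chosen only for illustration.
# Prior: how often each label occurs in the (imaginary) training data.
priors = {"automotive": 0.5, "sports": 0.3, "murder mystery": 0.2}

# P(feature | label): chance that a document with this label contains the word.
feature_probs = {
    "automotive": {"dark": 0.10, "football": 0.05},
    "sports": {"dark": 0.05, "football": 0.60},
    "murder mystery": {"dark": 0.30, "football": 0.02},
}


def label_likelihoods(features, priors, feature_probs):
    """Start from each label's prior and multiply in P(feature | label)."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in features:
            # Naive assumption: features are independent given the label.
            score *= feature_probs[label][feature]
        scores[label] = score
    return scores


scores = label_likelihoods(["dark", "football"], priors, feature_probs)
best_label = max(scores, key=scores.get)
print(scores)
print(best_label)
```

With these numbers, “automotive” has the largest prior, but the strong “football” feature gives “sports” the highest final score.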

## Mathematics:
### Conditional Probability:
Let’s assume for a moment that we have a jar containing seven stones.

Three of the stones are gray and four are black.

If we stick a hand into this jar and randomly pull out a stone, what are the chances that the stone will be gray?

There are 7 possible stones and 3 are gray, so the probability is P(gray) = 3/7.

What is the probability of grabbing a black stone? It’s P(black) = 4/7.



![](../images/naive_bayes/cp.png)

What if the seven stones were in two buckets?

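If you want to calculate P(gray) or P(black), would knowing the bucket change the answer? Yes. The probability of an outcome given that some condition is known to hold (here, which bucket the stone came from) is called a conditional probability.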



![](../images/naive_bayes/cp1.png)

We can write this as P(gray|bucketB), which is read as “the probability of gray given bucket B.”

It’s not hard to see that P(gray|bucketA) is 2/4 and P(gray|bucketB) is 1/3.

To formalize how to calculate the conditional probability, we can say

P(gray|bucketB) = P(gray and bucketB) / P(bucketB)

Here P(gray and bucketB) = 1/7 and P(bucketB) = 3/7, so P(gray|bucketB) = (1/7) / (3/7) = 1/3. In general, P(x|condition) = P(x and condition) / P(condition).
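
The same arithmetic can be checked with exact fractions. The sketch below assumes the bucket contents implied by the example (bucket A holds 2 gray and 2 black stones, bucket B holds 1 gray and 2 black); the function names are illustrative only.

```python
from fractions import Fraction

# Stone counts per bucket, as implied by the example (7 stones in total).
buckets = {"A": {"gray": 2, "black": 2}, "B": {"gray": 1, "black": 2}}
total = sum(sum(counts.values()) for counts in buckets.values())


def p_joint(color, bucket):
    """P(color and bucket): stones of that color in that bucket, out of all stones."""
    return Fraction(buckets[bucket][color], total)


def p_bucket(bucket):
    """P(bucket): share of all stones that sit in this bucket."""
    return Fraction(sum(buckets[bucket].values()), total)


def p_color_given_bucket(color, bucket):
    """P(color | bucket) = P(color and bucket) / P(bucket)."""
    return p_joint(color, bucket) / p_bucket(bucket)


print(p_color_given_bucket("gray", "B"))  # 1/3
print(p_color_given_bucket("gray", "A"))  # 1/2
```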

Another useful way to manipulate conditional probabilities is known as Bayes’ rule.
Bayes’ rule tells us how to swap the symbols in a conditional probability statement. If
we have P(x|c) but want to have P(c|x) , we can find it with the following:

P(c|x) = P(x|c) P(c) / P(x)

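What is the probability that the bucket is A or B if the chosen stone is gray? Recall that P(gray) = 3/7.

P(bucketB|gray) = P(gray|bucketB) P(bucketB) / P(gray)

                                  = (1/3) (3/7) / (3/7) = 1/3 ≈ 0.3333

P(bucketA|gray) = P(gray|bucketA) P(bucketA) / P(gray)

                                  = (1/2) (4/7) / (3/7) = 2/3 ≈ 0.6667

The two results sum to 1, as they must, because a gray stone has to come from one of the two buckets.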
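
As a check on the Bayes' rule calculation, the sketch below recomputes both posteriors with exact fractions. It assumes the bucket contents from the example (bucket A: 2 gray out of 4 stones, bucket B: 1 gray out of 3) and derives P(gray) from the law of total probability instead of hard-coding it. The variable names are illustrative.

```python
from fractions import Fraction

# P(gray | bucket) and P(bucket), taken from the stone example.
p_gray_given = {"A": Fraction(2, 4), "B": Fraction(1, 3)}
p_bucket = {"A": Fraction(4, 7), "B": Fraction(3, 7)}

# Law of total probability: P(gray) = sum over buckets of P(gray | b) * P(b).
p_gray = sum(p_gray_given[b] * p_bucket[b] for b in p_bucket)

# Bayes' rule: P(b | gray) = P(gray | b) * P(b) / P(gray).
posterior = {b: p_gray_given[b] * p_bucket[b] / p_gray for b in p_bucket}

print(p_gray)          # 3/7
print(posterior["A"])  # 2/3
print(posterior["B"])  # 1/3
```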

## Pseudo Code:
### Document classification:

``` python
Count the number of documents in each class
for every training document:
    for each class:
        if a token appears in the document ➞ increment the count for that token
        increment the total token count for that class
for each class:
    for each token:
        divide the token count by the total token count to get conditional probabilities
return conditional probabilities for each class
```    

``` python
import numpy as np


def loadDataSet():
    """Return a toy corpus of tokenized posts and their class labels."""
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
    return postingList, classVec


def createVocabList(dataSet):
    """Return a list of the unique words across all documents."""
    vocabSet = set()  # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)


def setOfWords2Vec(vocabList, inputSet):
    """Encode a document as a 0/1 vector marking which vocabulary words it contains."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print(f"the word: {word} is not in my Vocabulary!")
    return returnVec


def trainNB0(trainMatrix, trainCategory):
    """Estimate log P(word | class) for both classes and the prior P(class = 1)."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / numTrainDocs
    # Start counts at 1 and denominators at 2 so no word gets probability zero.
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logs once, after every document has been counted, to avoid underflow
    # when many probabilities are combined later.
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Return 1 if the class-1 log score is higher than the class-0 score, else 0."""
    vec = np.asarray(vec2Classify)
    p1 = np.sum(vec * p1Vec) + np.log(pClass1)  # element-wise multiplication
    p0 = np.sum(vec * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    return 0
```
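
`trainNB0` starts its word counts at one and returns log probabilities instead of raw probabilities. The sketch below illustrates why both choices help, using made-up numbers (2000 word probabilities of 0.01 and a three-word count vector) that are assumptions for the demonstration, not data from the example above.

```python
import numpy as np

# Hypothetical document: 2000 words, each with probability 0.01 under some class.
word_probs = np.full(2000, 0.01)

# Multiplying many small probabilities underflows to exactly 0.0 in float64.
product = np.prod(word_probs)

# Summing logs keeps the score finite, and comparisons between classes still work.
log_sum = np.sum(np.log(word_probs))

print(product)  # 0.0
print(log_sum)  # roughly -9210.34

# Hypothetical word counts for one class; the second word was never seen.
counts = np.array([3.0, 0.0, 1.0])

# Without smoothing, the unseen word gets probability 0, which would zero out
# the whole product (or give -inf after taking the log).
unsmoothed = counts / counts.sum()

# Same smoothing as trainNB0: add 1 to every count and 2 to the denominator.
smoothed = (counts + 1.0) / (counts.sum() + 2.0)

print(unsmoothed)
print(smoothed)
```

With smoothing, every entry is strictly positive, so its logarithm is finite and a single unseen word can no longer rule out a class on its own.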

Type:  Supervised Learning   
Use cases: Text classification   
Pros: Works with a small amount of data, handles multiple classes  
Cons: Sensitive to how the input data is prepared   
Works with: Nominal values, i.e. categories or Boolean flags such as word presence  
Analyse: Use histograms to analyse the training data  
