# <center>Classifying with Probability Theory: Naive Bayes</center>

We ask the classifier for its best guess at the class, together with a probability estimate for that guess.

## 4.1 Classifying with Bayesian decision theory

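### Naive Bayes
* Pros: still effective when little data is available; handles multi-class problems
* Cons: fairly sensitive to how the input data is prepared
* Works with: nominal data
### Bayesian decision theory
Naive Bayes is one part of Bayesian decision theory.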
![%E6%88%AA%E5%9B%BE1504767131.png](attachment:%E6%88%AA%E5%9B%BE1504767131.png)
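p1(x, y) is the probability that the point (x, y) belongs to class 1, and p2(x, y) is the probability that it belongs to class 2.
 * If p1(x, y) > p2(x, y), then the class is 1.
 * If p2(x, y) > p1(x, y), then the class is 2.

The core idea of Bayesian decision theory: choose the decision with the highest probability.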

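## 4.2 Conditional probability
* Conditional probability
* The formula for computing a conditional probability

Bucket B holds 3 of the 7 stones, and 1 of those 3 is gray, so the probability of drawing a gray stone from bucket B can be read off directly as 1/3. The formula gives the same result: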
![image.png](attachment:image.png)
$$  p(\text{gray}|\text{bucketB}) = \frac{p(\text{gray and bucketB})}{p(\text{bucketB})} $$

$p(\text{gray and bucketB}) = 1/7$

$p(\text{bucketB}) = 3/7$

$p(\text{gray}|\text{bucketB}) = \frac{p(\text{gray and bucketB})}{p(\text{bucketB})} = \frac{1/7}{3/7} = 1/3$
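
The same number can be checked by counting stones. This is a minimal sketch in plain Python (Python 3 syntax). Bucket B's contents follow from the fractions above, but bucket A's contents (2 gray, 2 black) are an assumption made only so that the list has 7 stones.

```python
from fractions import Fraction

# Each stone is (color, bucket).
# ASSUMPTION: bucket A's contents are made up for illustration.
stones = [("gray", "A"), ("gray", "A"), ("black", "A"), ("black", "A"),
          ("gray", "B"), ("black", "B"), ("black", "B")]

total = len(stones)
p_gray_and_b = Fraction(sum(1 for color, bucket in stones
                            if color == "gray" and bucket == "B"), total)
p_b = Fraction(sum(1 for _, bucket in stones if bucket == "B"), total)

# Conditional probability: joint probability divided by the probability of the condition
p_gray_given_b = p_gray_and_b / p_b

print(p_gray_and_b, p_b, p_gray_given_b)  # 1/7 3/7 1/3
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared directly with the hand calculation.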
* __Bayes' rule__

$$  p(c|x) = \frac{p(x|c)p(c)}{p(x)} $$ 

It follows from writing the joint probability $p(c, x)$ in two ways and equating them:

$$  p(c, x) = p(x) \cdot  p(c|x)   $$ 
$$  p(c, x) = p(c) \cdot  p(x|c)   $$ 
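
As a quick check, Bayes' rule can be used to swap the condition in the bucket example: from $p(\text{gray}|\text{bucketB})$ to $p(\text{bucketB}|\text{gray})$. The sketch below is plain Python (Python 3 syntax). The value of `p_gray` is an assumption (3 gray stones out of 7), since the notes do not state it explicitly.

```python
from fractions import Fraction

p_gray_given_b = Fraction(1, 3)  # from the bucket example
p_b = Fraction(3, 7)             # bucket B holds 3 of the 7 stones
p_gray = Fraction(3, 7)          # ASSUMPTION: 3 of the 7 stones are gray

# Bayes' rule: p(B | gray) = p(gray | B) * p(B) / p(gray)
p_b_given_gray = p_gray_given_b * p_b / p_gray

# Both factorizations of the joint probability give the same value
joint_via_b = p_b * p_gray_given_b
joint_via_gray = p_gray * p_b_given_gray

print(p_b_given_gray, joint_via_b, joint_via_gray)  # 1/3 1/7 1/7
```

Under that assumption the two conditional probabilities happen to be equal, because $p(\text{gray})$ and $p(\text{bucketB})$ are both 3/7; in general they differ.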

## 4.3 Classifying with conditional probabilities

### Bayesian classification rule
$$  p(c_i|x,y) = \frac{p(x,y|c_i)p(c_i)}{p(x,y)} $$ 
![image.png](attachment:image.png)
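
The rule can be sketched as a small function: compute $p(x,y|c_i)p(c_i)$ for every class, normalise by their sum (which equals $p(x,y)$), and pick the largest. The function name `classify` and all numbers below are hypothetical, and classes are indexed from 0 here.

```python
def classify(likelihoods, priors):
    """Return (best class index, posteriors) for one point (x, y).

    likelihoods[i] is p(x, y | c_i) and priors[i] is p(c_i).
    """
    scores = [lk * pr for lk, pr in zip(likelihoods, priors)]
    evidence = sum(scores)                       # p(x, y)
    posteriors = [s / evidence for s in scores]  # p(c_i | x, y)
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best, posteriors

# Hypothetical values: class 0 explains the point better, but class 1 is far more common
best, posteriors = classify(likelihoods=[0.30, 0.10], priors=[0.2, 0.8])
print(best, posteriors)
```

Because $p(x,y)$ is the same for every class, comparing the unnormalised scores alone would give the same decision; the division is needed only when the posterior values themselves are wanted.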

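## 4.4 Document classification with naive Bayes

Naive Bayes is commonly used for document classification.
#### Naive Bayes classifiers (two implementations)
* Implementation based on the Bernoulli model (this section)
* Implementation based on the multinomial model (section 4.5.4)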

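## 4.5 Classifying text with Python

### 4.5.1 Prepare: making word vectors from text

This step turns text into numeric vectors.

Treat each text as a vector of words (tokens); that is, convert each sentence into a vector.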

In [None]:

def loadDataSet():
    # Toy training documents, already split into words ("flea" is the insect)
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]     # class labels: 1 is abusive, 0 not
    return postingList,classVec
                 
def createVocabList(dataSet):
    # Return a list of the unique words that appear across all documents
    vocabSet = set([])                       # create empty set
    for document in dataSet:
        # Add this document's new words; | is the union of the two sets
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    # Input:  vocabList - the vocabulary; inputSet - one document
    # Output: the document vector (1 if the word is present, 0 otherwise)
    returnVec = [0]*len(vocabList)           # start from an all-zero vector
    for word in inputSet:
        if word in vocabList:                # the word is in the vocabulary
            returnVec[vocabList.index(word)] = 1   # mark the word as present
        else: print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

In [6]:
import bayes

In [7]:
listOPosts,listClasses = bayes.loadDataSet()

In [8]:
myVocabList = bayes.createVocabList(listOPosts) # build the vocabulary

In [24]:
print myVocabList

['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']


In [25]:
print bayes.setOfWords2Vec(myVocabList,listOPosts[0]) # word vector for listOPosts[0]

[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]


In [26]:
print bayes.setOfWords2Vec(myVocabList,listOPosts[3]) # word vector for listOPosts[3]

[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]


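### 4.5.2 Train: calculating probabilities from word vectors

How to compute the probabilities from these numeric vectors.

$p(c_i)$ is computed by dividing the number of documents in class $i$ by the total number of documents.

$p( \textbf{w} |c_i)$           __comes from the naive Bayes assumption__
* Expand $\textbf{w}$ into its individual features: $p(w_0,w_1,w_2...w_N|c_i)$
* Assume the words are conditionally independent given the class  $\implies$  $p(w_0|c_i)p(w_1|c_i)p(w_2|c_i)...p(w_N|c_i)$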

$$  p(c_i| \textbf{w} ) = \frac{p( \textbf{w} |c_i)p(c_i)}{p(\textbf{w})} $$ 

$\textbf{w}$ denotes a vector (the word vector of a document).
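
To make the factorization concrete, here is a minimal sketch in plain Python (Python 3.8+ for `math.prod`). The per-word probabilities and the prior are made-up values, not trained ones. It computes only the numerator $p(\textbf{w}|c_i)p(c_i)$, since $p(\textbf{w})$ is the same for every class.

```python
import math

# Hypothetical p(w_j | c_i) for the words present in one document
word_probs = [0.05, 0.10, 0.02]
prior = 0.5  # hypothetical p(c_i)

# Naive Bayes numerator: product of the per-word terms times the prior
product_score = prior * math.prod(word_probs)

# The same quantity in log space: a sum of logs
log_score = math.log(prior) + sum(math.log(p) for p in word_probs)

print(product_score, math.exp(log_score))

# With many words the plain product underflows to 0.0, while the log sum stays usable
many = [0.01] * 200
underflowed = math.prod(many)
log_sum = sum(math.log(p) for p in many)
print(underflowed, log_sum)
```

The second part shows why a classifier would usually work with log probabilities: ranking classes by the log score gives the same decision as ranking by the product, without the risk of every score collapsing to zero.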

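### Pseudocode
```
Count the number of documents in each class
For every training document:
    For each class:
        If a token appears in the document -> increment the count for that token
        Increment the total count of tokens
For each class:
    For each token:
        Divide the token's count by the total token count to get its conditional probability
Return the conditional probabilities for each class
```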

In [None]:
from numpy import ones, log

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)  # number of training documents
    numWords = len(trainMatrix[0])   # vocabulary size (length of each word vector)
    # p(class 1): the fraction of abusive documents. With only two classes,
    # p(class 0) = 1 - pAbusive.
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    # Start counts at 1 and denominators at 2 so that no word gets probability 0
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones() 
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]                   # vector addition
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
            
    # Element-wise division; taking logs avoids underflow when the terms are combined later
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

In [12]:
from numpy import *

In [13]:
listOPosts,listClasses = bayes.loadDataSet()   # load the data

In [28]:
print listOPosts,'\n',listClasses

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'garbage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']] 
[0, 1, 0, 1, 0, 1]


In [14]:
myVocabList = bayes.createVocabList(listOPosts) # build myVocabList, a list of all the words

In [30]:
print myVocabList # show the vocabulary

['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']


In [16]:
trainmMat = []

In [17]:
# This for loop fills the trainmMat list with word vectors.
# postinDoc is one raw document from listOPosts.
for postinDoc in listOPosts:
    trainmMat.append(bayes.setOfWords2Vec(myVocabList,postinDoc))

In [31]:
print trainmMat

[[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]


In [18]:
# Returns the probability that a document is abusive, plus one
# probability vector per class.
# trainmMat:   the document word vectors
# listClasses: the class labels
p0v,p1v,pAb = bayes.trainNB0(trainmMat,listClasses)

In [37]:
print listClasses  # class labels
print len(trainmMat)
print len(trainmMat[0])
print ones(trainmMat[0])  # the list is read as a shape that contains zeros, so the array is empty
print  ones(trainmMat[0]) + [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print sum([0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1])

[0, 1, 0, 1, 0, 1]
6
32
[]
[]
7


In [19]:
pAb

0.5

In [21]:
p0v

array([-2.56494936, -2.56494936, -2.56494936, -3.25809654, -3.25809654,
       -2.56494936, -2.56494936, -2.56494936, -3.25809654, -2.56494936,
       -2.56494936, -2.56494936, -2.56494936, -3.25809654, -3.25809654,
       -2.15948425, -3.25809654, -3.25809654, -2.56494936, -3.25809654,
       -2.56494936, -2.56494936, -3.25809654, -2.56494936, -2.56494936,
       -2.56494936, -3.25809654, -2.56494936, -3.25809654, -2.56494936,
       -2.56494936, -1.87180218])

In [22]:
p1v

array([-3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
       -3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
       -3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
       -2.35137526, -2.35137526, -2.35137526, -3.04452244, -1.94591015,
       -3.04452244, -2.35137526, -2.35137526, -3.04452244, -1.94591015,
       -3.04452244, -1.65822808, -3.04452244, -2.35137526, -3.04452244,
       -3.04452244, -3.04452244])
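
One way to read these vectors is to ask which word has the highest log probability in each class. The sketch below recomputes the same smoothed estimates in plain Python (Python 3 syntax) instead of calling `trainNB0`, and looks words up by name because the order of a vocabulary built from a `set` can differ between runs. The helper name `log_probs` is hypothetical.

```python
import math

posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
vocab = sorted({word for doc in posts for word in doc})

def log_probs(cls):
    """Smoothed log p(word | class): counts start at 1, the denominator at 2."""
    docs = [set(d) for d, c in zip(posts, labels) if c == cls]
    total = 2.0 + sum(len(d) for d in docs)
    return {w: math.log((1 + sum(w in d for d in docs)) / total) for w in vocab}

p0, p1 = log_probs(0), log_probs(1)
top1 = max(p1, key=p1.get)  # most probable word in the abusive class
top0 = max(p0, key=p0.get)  # most probable word in the non-abusive class
print(top1, p1[top1])
print(top0, p0[top0])
```

The top abusive word is `stupid` with a value of about -1.658, and the top non-abusive word is `my` at about -1.872, which agree with the largest entries of `p1v` and `p0v` above.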

### 4.5.3 Test: modifying the classifier for real-world conditions

In [None]:
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0
    
def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

In [38]:
bayes.testingNB()

['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1
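
`classifyNB` returns only the winning class. To also get a probability estimate for that guess, the two log scores can be normalised. The sketch below is plain Python (Python 3 syntax) and does this by hand for `['stupid', 'garbage']`, copying the relevant entries of `p0v` and `p1v` printed earlier (vocabulary positions 26 and 3) together with `pAb = 0.5`. The variable names are hypothetical.

```python
import math

# log p(stupid | c) + log p(garbage | c) + log p(c), using the printed values
log_p1 = -1.65822808 + -2.35137526 + math.log(0.5)  # abusive class
log_p0 = -3.25809654 + -3.25809654 + math.log(0.5)  # non-abusive class

# Subtract the larger score before exponentiating to keep the numbers well scaled
m = max(log_p0, log_p1)
e0 = math.exp(log_p0 - m)
e1 = math.exp(log_p1 - m)

posterior_abusive = e1 / (e0 + e1)
print(posterior_abusive)
```

With these numbers the estimate comes out at roughly 0.92. Estimates like this should be read with some caution, because the independence assumption between words rarely holds exactly.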


### 4.5.4 The bag-of-words document model

#### Listing 4-4 Naive Bayes bag-of-words model

In [None]:
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
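
The difference from the set-of-words vector shows up only when a word repeats. This is a minimal sketch in plain Python (Python 3 syntax) with a tiny made-up vocabulary and document.

```python
vocab = ["dog", "my", "stupid"]    # hypothetical vocabulary
doc = ["stupid", "dog", "stupid"]  # "stupid" occurs twice

set_vec = [0] * len(vocab)
bag_vec = [0] * len(vocab)
for word in doc:
    if word in vocab:
        set_vec[vocab.index(word)] = 1   # set-of-words: presence only
        bag_vec[vocab.index(word)] += 1  # bag-of-words: number of occurrences

print(set_vec)  # [1, 0, 1]
print(bag_vec)  # [1, 0, 2]
```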

## 4.6 Example: filtering spam email with naive Bayes