## Naive Bayes

### Conditional probability

$$P(A|B) = \frac{P(AB)}{P(B)}$$
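
As a quick numeric check of the definition, the sketch below estimates P(A|B) from a handful of made-up observations. The events (A: a post is abusive, B: a post contains "stupid") and the records are illustrative assumptions, not data from this notebook.

```python
from fractions import Fraction

# Hypothetical observations: each record is (contains_stupid, is_abusive).
records = [
    (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True),
]

n = len(records)
# P(B): fraction of records in which B (contains "stupid") holds
p_b = Fraction(sum(1 for b, _ in records if b), n)
# P(AB): fraction of records in which both A (abusive) and B hold
p_ab = Fraction(sum(1 for b, a in records if a and b), n)

# Definition of conditional probability: P(A|B) = P(AB) / P(B)
p_a_given_b = p_ab / p_b
print(p_a_given_b)  # 2/3
```
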
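
Naive Bayes makes two assumptions:
1. The features are mutually independent. For example, bacon is assumed to be as likely to appear next to delicious as next to unhealthy.
2. Every feature is equally important.

Neither assumption holds in practice, yet naive Bayes works well in real applications despite ignoring these details.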

### Classifying posts with naive Bayes

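Processing steps:
1. Collect the set of all words that appear in any article (the vocabulary).
2. Convert each article into a vector over that vocabulary: 1 if the word appears in the article, 0 otherwise.
3. Apply the following formula: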
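$$
\begin{aligned}
P(c_i|w) &= \frac{P(w|c_i)P(c_i)}{P(w)} \\
w&: \text{the article's vector}  \\
c_i&: \text{the article's class}  \\
\end{aligned}
$$
where $$ P(c_i) = \frac{\text{number of articles in class } c_i}{\text{total number of articles}} $$
and $$ P(w|c_i) = P((w_0,w_1, \cdots, w_n)|c_i) $$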
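Under the naive Bayes independence assumption, this becomes:
$$ 
\begin{aligned}
P(w|c_i) &= P(w_0|c_i)P(w_1|c_i)\cdots P(w_n|c_i) \\ 
\text{where } P(w_0|c_i) &= \frac{\text{total occurrences of } w_0 \text{ in articles of class } c_i}{\text{total number of words in all articles of class } c_i}
\end{aligned}
$$
For a given input w, P(w) is the same for every class, so the denominator can be ignored when comparing classes.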
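
To make the formulas concrete, here is a minimal sketch that evaluates the numerator P(c_i) · ∏ P(w_j|c_i) for each class on a tiny corpus and picks the larger one. The corpus, the labels and the helper name `unnormalized_posterior` are assumptions made for illustration. No smoothing is applied, so the estimates follow the frequency ratio above literally.

```python
from collections import Counter
from math import prod

# Toy corpus (hypothetical): tokenized articles and their classes.
docs = [["dog", "cute"], ["dog", "stupid"], ["stupid", "garbage"], ["cute", "love"]]
labels = [0, 1, 1, 0]

def unnormalized_posterior(words, c):
    """Return P(c) * prod_j P(w_j | c), the numerator of Bayes' rule."""
    class_docs = [d for d, y in zip(docs, labels) if y == c]
    prior = len(class_docs) / len(docs)                 # P(c)
    counts = Counter(w for d in class_docs for w in d)  # word counts within class c
    total = sum(counts.values())                        # all words in class c
    return prior * prod(counts[w] / total for w in words)

query = ["dog", "stupid"]
scores = {c: unnormalized_posterior(query, c) for c in (0, 1)}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Class 0 scores exactly 0 here because "stupid" never occurs in it: a single zero count wipes out the whole product. The add-one initialization in the class code further down addresses that problem.
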

In [4]:
import numpy as np

In [7]:
def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

In [8]:
postingList,classVec = loadDataSet()

In [5]:
import numpy as np
class Navie_Bayes:
    '''token_list: each element is a list holding the tokens of one article
       classVec  : the class label of the corresponding article in token_list, e.g. 1 marks a junk article
    '''
    def __init__(self, token_list, classVec):
        self.token_list = token_list
        self.classVec = classVec
        self.vacabulary, self.vacabulary_index = self.__get_vacabulary()
        self.p0v, self.p1v, self.pclass1 = self.__train_probility()
        
    #Build the set of all words in the data, plus a word -> index lookup
    def __get_vacabulary(self):
        tokens = []
        for words in self.token_list:
            tokens += words 
        tokens = set(tokens)
        index_dict = {}
        for _a, word in enumerate(tokens):
            index_dict[word] = _a
        return tokens, index_dict
    
    #Vectorize inputset: 1 means the word occurs
    def setOfWord2Vec(self, inputset):
        articel_vec = [0]*len(self.vacabulary)  #vector with one slot per vocabulary word
        for word in inputset:
            if word in self.vacabulary:
                articel_vec[self.vacabulary_index[word]] = 1
            else:
                pass
                #print('the word {} is not in my vocabulary'.format(word))
        return articel_vec
    
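    #Compute P(w0|ci) and P(ci)
    #Improvement 1: to keep a zero numerator in P(w0|ci) from making the whole product P(w|ci) zero, start the occurrence counts at 1
    #Improvement 2: the probabilities are small and their product is smaller still, which can underflow, so store the natural log of each probability
    def __train_probility(self):
        #train_matrics is the list of article vectors; train_vec holds the class labels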
        train_matrics = []
        for words in self.token_list:
            train_matrics.append(self.setOfWord2Vec(words))
        train_vec = self.classVec
        num_train_article = len(train_matrics)
        num_word = len(train_matrics[0])
        prob_c1 = sum(train_vec) / num_train_article
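        prob_w_c0 = np.ones(num_word)   #numerators start at 1 instead of 0 (improvement 1)
        prob_w_c1 = np.ones(num_word)   #numerators start at 1 instead of 0 (improvement 1)
        num_c0 = 2  #denominator starts at 2 instead of 0 (improvement 1)
        num_c1 = 2  #denominator starts at 2 instead of 0 (improvement 1)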
        
        for i in range(num_train_article):
            if train_vec[i] == 1:
                prob_w_c1 += train_matrics[i]
                num_c1 += sum(train_matrics[i])
            else:
                prob_w_c0 += train_matrics[i]
                num_c0 += sum(train_matrics[i])                
        prob1_vec = np.log(prob_w_c1/num_c1)
        prob0_vec = np.log(prob_w_c0/num_c0)
        
        return prob0_vec, prob1_vec , prob_c1
    
    def classifyNB(self, input_article):
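        #Working with logs, so a sum of logs stands for a product of probabilities
        input_vec = self.setOfWord2Vec(input_article)
        #* multiplies element by element (shapes must match). p1v holds every log P(wi|c1),
        #so multiplying by input_vec keeps only the entries for words present in the input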
        p1 = sum(input_vec * self.p1v) + np.log(self.pclass1) 
        p2 = sum(input_vec * self.p0v) + np.log(1-self.pclass1)
        if p1 > p2:
            return 1
        else:
            return 0
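
The add-one initialization in `__train_probility` can be seen in isolation in the sketch below. The counts are made up. The smoothed estimate copies the class above: numerators start at 1 and the denominator starts at 2.

```python
from math import log

# Hypothetical per-word occurrence counts for one class over a 4-word vocabulary.
counts = [3, 0, 1, 0]
total = sum(counts)

# Unsmoothed estimate: a word never seen in the class gets probability 0,
# and log(0) is undefined.
raw = [c / total for c in counts]

# Smoothed as in the class above: numerators start at 1, the denominator at 2.
smoothed = [(c + 1) / (total + 2) for c in counts]
log_smoothed = [log(p) for p in smoothed]

print(raw)       # [0.75, 0.0, 0.25, 0.0]
print(smoothed)  # every entry is now strictly positive
```

With a denominator offset of 2 the smoothed values do not sum to 1 over the vocabulary. The textbook Laplace estimator adds the vocabulary size to the denominator instead. Both variants remove the zero-probability problem.
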
        

In [175]:
my_navie_byes = Navie_Bayes(postingList,classVec)

In [176]:
my_navie_byes.classifyNB(['love','my','dalmation'])

0

In [177]:
my_navie_byes.classifyNB(['stupid','garbage'])

1
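
The reason the class stores log-probabilities rather than raw probabilities can be shown with a short experiment. The word count (500) and the per-word probability (0.01) are arbitrary values chosen for illustration.

```python
from math import log

# Hypothetical: 500 words, each with conditional probability 0.01.
probs = [0.01] * 500

product = 1.0
for p in probs:
    product *= p  # the true value 1e-1000 is below the double-precision range

log_sum = sum(log(p) for p in probs)  # the same quantity in log space stays finite

print(product)  # 0.0
print(log_sum)
```

Since the logarithm is monotonic, comparing the log-sums of two classes gives the same decision as comparing the products would.
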

In [83]:
a = np.array([1,2,3,4])
b = np.array([[1,1,1,1],[2,2,2,2]])

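### Bag of Words model

1. The previous section obtains an article's vector with the setOfWord2Vec function.
2. That vector contains only 0s and 1s, so a word that occurs many times counts the same as a word that occurs once.
3. The bag-of-words model addresses this: setOfWord2Vec is modified so that each occurrence of a word adds 1 to the corresponding element, and the function is renamed bagOfWord2Vec.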
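
A minimal sketch of the difference between the two encodings, using a made-up three-word vocabulary and a made-up article:

```python
from collections import Counter

vocabulary = ["dog", "stupid", "cute"]           # hypothetical fixed vocabulary
article = ["stupid", "dog", "stupid", "stupid"]  # hypothetical tokenized article

counts = Counter(article)
set_vec = [1 if counts[w] > 0 else 0 for w in vocabulary]  # set-of-words: presence only
bag_vec = [counts[w] for w in vocabulary]                  # bag-of-words: occurrence counts

print(set_vec)  # [1, 1, 0]
print(bag_vec)  # [1, 3, 0]
```
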

In [30]:
import numpy as np
from collections import Counter
class Navie_Bayes_BOW:
    '''token_list: each element is a list holding the tokens of one article
       classVec  : the class label of the corresponding article in token_list, e.g. 1 marks a junk article
    '''
    def __init__(self, token_list, classVec):
        self.token_list = token_list
        self.classVec = classVec
        self.vacabulary, self.vacabulary_index = self.__get_vacabulary()
        self.p0v, self.p1v, self.pclass1 = self.__train_probility()
        
    #Build the set of all words in the data, plus a word -> index lookup
    def __get_vacabulary(self):
        tokens = []
        for words in self.token_list:
            tokens += words 
        #Drop the most frequent words: they are mostly connectives and carry little meaning. Unnecessary if the stop-word list is thorough
        #token_count = Counter(tokens)
        #token_count_sorted = list(sorted(token_count.items() , key=lambda x:x[1], reverse=True))
        #tokens = [token[0] for token in token_count_sorted[20:]]
        tokens = set(tokens)
        index_dict = {}
        for _a, word in enumerate(tokens):
            index_dict[word] = _a
        return tokens, index_dict       
    
    #Vectorize inputset: each element counts how often the word occurs
    def bagOfWord2Vec(self, inputset):
        articel_vec = [0]*len(self.vacabulary)  #vector with one slot per vocabulary word
        for word in inputset:
            if word in self.vacabulary:
                articel_vec[self.vacabulary_index[word]] += 1 # += 1 instead of = 1
            else:
                pass
                #print('the word {} is not in my vocabulary'.format(word))
        return articel_vec
    
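    #Compute P(w0|ci) and P(ci)
    #Improvement 1: to keep a zero numerator in P(w0|ci) from making the whole product P(w|ci) zero, start the occurrence counts at 1
    #Improvement 2: the probabilities are small and their product is smaller still, which can underflow, so store the natural log of each probability
    def __train_probility(self):
        #train_matrics is the list of article vectors; train_vec holds the class labels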
        train_matrics = []
        for words in self.token_list:
            train_matrics.append(self.bagOfWord2Vec(words))
        train_vec = self.classVec
        num_train_article = len(train_matrics)
        num_word = len(train_matrics[0])
        prob_c1 = sum(train_vec) / num_train_article
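        prob_w_c0 = np.ones(num_word)   #numerators start at 1 instead of 0 (improvement 1)
        prob_w_c1 = np.ones(num_word)   #numerators start at 1 instead of 0 (improvement 1)
        num_c0 = 2  #denominator starts at 2 instead of 0 (improvement 1)
        num_c1 = 2  #denominator starts at 2 instead of 0 (improvement 1)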
        
        #print(len(train_vec), num_train_article)
        for i in range(num_train_article):
            if train_vec[i] == 1:
                prob_w_c1 += train_matrics[i]
                num_c1 += sum(train_matrics[i])
            else:
                prob_w_c0 += train_matrics[i]
                num_c0 += sum(train_matrics[i])                
        prob1_vec = np.log(prob_w_c1/num_c1)
        prob0_vec = np.log(prob_w_c0/num_c0)
        
        return prob0_vec, prob1_vec , prob_c1
    
    def classifyNB(self, input_article):
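        #Working with logs, so a sum of logs stands for a product of probabilities
        input_vec = self.bagOfWord2Vec(input_article)
        #print(input_vec)
        #print(self.p1v)
        #* multiplies element by element (shapes must match). p1v holds every log P(wi|c1),
        #so multiplying by input_vec weights each entry by the word's count in the input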
        p1 = sum(input_vec * self.p1v) + np.log(self.pclass1) 
        p2 = sum(input_vec * self.p0v) + np.log(1-self.pclass1)     
        if p1 > p2:
            return 1
        else:
            return 0
        

### Classifying spam email with naive Bayes

In [31]:
import re
import os
import random
def textParse(bigstring):
    listOfTokens = re.split(r'\W+',bigstring)
    return [word.lower() for word in listOfTokens if len(word) > 2] #drop tokens of two characters or fewer

def spamtest():
    doclist = []
    classvec = []
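    path1 = os.path.join('.', 'email', 'ham')    #built with os.path.join: '\e' and '\s' are invalid escape sequences in a plain string literal
    path2 = os.path.join('.', 'email', 'spam')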
    for i in range(1,26):
        #print(os.path.join(path1,'{}.txt'.format(i)))
        word_list = textParse(open(os.path.join(path1,'{}.txt'.format(i))).read())    
        doclist.append(word_list)
        classvec.append(1)
        word_list = textParse(open(os.path.join(path2,'{}.txt'.format(i))).read())
        doclist.append(word_list)
        classvec.append(0)  
    num = len(classvec)
    #Hold out a random subset as test data and train on the rest (hold-out validation)
    randomindex = random.sample(range(num), 10)
    trainset = [article  for _i,article in enumerate(doclist) if _i not in randomindex]
    trainsetclass = [a  for _i,a in enumerate(classvec) if _i not in randomindex]
    
    #testset = [article  for _i,article in enumerate(word_list) if _i in randomindex]
    
    my_bayes = Navie_Bayes(trainset, trainsetclass)
    errorcount = 0
    #print(num,len(word_list) , len(classvec),randomindex)
    for i  in randomindex:
        if my_bayes.classifyNB(doclist[i]) != classvec[i]:
            errorcount +=1
    
    print('the error rate is {:6f}'.format(errorcount / len(randomindex)))

In [20]:
spamtest()

the error rate is 0.000000
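
A single split with only 10 test emails gives a noisy estimate, so one common refinement is to repeat the random split several times and average the error rate. The sketch below is one way to do that. The function `repeated_holdout`, its parameters and the stand-in majority-label learner are all hypothetical names introduced here so that the block runs on its own.

```python
import random

def repeated_holdout(docs, labels, train, n_test=10, rounds=20, seed=0):
    """Average the hold-out error rate over several random train/test splits.

    `train(train_docs, train_labels)` must return a function doc -> label.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(rounds):
        test_idx = set(rng.sample(range(len(docs)), n_test))
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [y for i, y in enumerate(labels) if i not in test_idx]
        classify = train(train_docs, train_labels)
        errors = sum(classify(docs[i]) != labels[i] for i in test_idx)
        rates.append(errors / n_test)
    return sum(rates) / rounds

# Stand-in learner (hypothetical): always predicts the majority training label.
def train_majority(train_docs, train_labels):
    majority = max(set(train_labels), key=train_labels.count)
    return lambda doc: majority

docs = [["w%d" % i] for i in range(50)]   # made-up documents
labels = [i % 2 for i in range(50)]       # made-up balanced labels
mean_error = repeated_holdout(docs, labels, train_majority)
print(mean_error)
```

With the classes from this notebook, the learner could be passed as `lambda d, y: Navie_Bayes(d, y).classifyNB`.
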


In [18]:
#path = os.path.join('E:\MYGIT\Self-learning\Machine Learning In Action\Ch04\email\ham', '23.txt')
#print(open(path, encoding='utf-8').read())

In [23]:
import re
#listoftokens = list(re.split(r'\W+', lines))

In [406]:
' '.join(listoftokens)

'Yay to you both doing fine I m working on an MBA in Design Strategy at CCA top art school It s a new program focusing on more of a right brained creative and strategic approach to management I m an 1 8 of the way done today '

### Classifying news with naive Bayes

In [32]:
path1 = './news/时政新闻.txt'
path2 = './news/娱乐新闻.txt'

In [33]:
import re
import jieba
import random
import time

In [34]:
def get_stopwords():
    stopwords = []
    with open('stopwords.txt') as f:
        linestr = f.readline()
        while linestr != '':
            stopwords.append(linestr.strip())
            linestr = f.readline()  
    return stopwords

def textPrase_chinese(bigstring, stopwords):
    bigstring = bigstring.strip().replace(' ','')
    jieba_list = jieba.cut(bigstring)
    words_list = []
    for word in jieba_list:
        if word not in stopwords:
            words_list.append(word)
    return  words_list   

def news_classify_test():
    time_point1 = time.time()
    doclist = []
    classvec = []
    path1 = './news/时政新闻.txt'  #political news, class 1; with more than two classes the prediction in Navie_Bayes_BOW would have to change
    path2 = './news/娱乐新闻.txt'  #entertainment news, class 0
    stopwords = get_stopwords()
    with open(path1, encoding='utf-8') as f:
        linestr = f.readline()
        for _i in range(500): #take the first 500 articles
            word_list = textPrase_chinese(linestr, stopwords)    
            doclist.append(word_list)
            classvec.append(1)
            linestr = f.readline()
    with open(path2, encoding='utf-8') as f:
        linestr = f.readline()
        for _i in range(500): #take the first 500 articles
            word_list = textPrase_chinese(linestr, stopwords)
            doclist.append(word_list)
            classvec.append(0) 
            linestr = f.readline()
            
    num = len(classvec)
    #Hold out a random subset as test data and train on the rest (hold-out validation)
    randomindex = random.sample(range(num), 30) #30 distinct random indices
    trainset = [article  for _i,article in enumerate(doclist) if _i not in randomindex]
    trainsetclass = [a  for _i,a in enumerate(classvec) if _i not in randomindex]
    
    #testset = [article  for _i,article in enumerate(word_list) if _i in randomindex]
    time_point2 = time.time()
    print('data preprocessing time: {:4f}s'.format(time_point2 -time_point1))
    
    my_bayes = Navie_Bayes_BOW(trainset, trainsetclass)
    errorcount = 0
    
    time_point3 = time.time()
    #print(len(my_bayes.vacabulary))
    print('model building time: {:4f}s'.format(time_point3 -time_point2))
    
    #print(num,len(word_list) , len(classvec),randomindex)
    for i  in randomindex:
        if my_bayes.classifyNB(doclist[i]) != classvec[i]:
            errorcount +=1
    print('prediction time: {:4f}s'.format(time.time()-time_point3))
    print('the error rate is {:4f}'.format(errorcount / len(randomindex)))    

In [36]:
news_classify_test()

data preprocessing time: 18.704432s
model building time: 5.007609s
prediction time: 0.624001s
the error rate is 0.033333


In [419]:
aa = np.random.rand(15000)
bb = np.random.rand(15000)

In [436]:
%%time
sum(aa * bb)

Wall time: 7 ms


3732.475282787691
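
The timing above uses Python's built-in `sum`, which walks the NumPy array one element at a time. `classifyNB` uses the same pattern. A hedged comparison with `np.dot`, which computes the same inner product in compiled code, is sketched below. The array size matches the cell above, and the seed and repetition count are arbitrary.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
aa = rng.random(15000)
bb = rng.random(15000)

# Built-in sum iterates over the array in Python;
# np.dot computes the same inner product inside NumPy.
slow = sum(aa * bb)
fast = np.dot(aa, bb)

t_builtin = timeit.timeit(lambda: sum(aa * bb), number=20)
t_dot = timeit.timeit(lambda: np.dot(aa, bb), number=20)
print(slow, fast)
print("built-in sum: {:.4f}s  np.dot: {:.4f}s".format(t_builtin, t_dot))
```

The two results agree up to floating-point rounding. Absolute timings depend on the machine.
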

## Naive Bayes with sklearn

In [42]:
import re
import jieba
import random
import time

def get_stopwords():
    stopwords = []
    with open('stopwords.txt') as f:
        linestr = f.readline()
        while linestr != '':
            stopwords.append(linestr.strip())
            linestr = f.readline()  
    return stopwords

def textPrase_chinese(bigstring, stopwords):
    bigstring = bigstring.strip().replace(' ','')
    jieba_list = jieba.cut(bigstring)
    words_list = []
    for word in jieba_list:
        if word not in stopwords:
            words_list.append(word)
    return  words_list   

def get_dataset():
    doclist = []
    classvec = []
    path1 = './news/时政新闻.txt'  #political news, class 1; with more than two classes the prediction in Navie_Bayes_BOW would have to change
    path2 = './news/娱乐新闻.txt'  #entertainment news, class 0
    stopwords = get_stopwords()
    with open(path1, encoding='utf-8') as f:
        linestr = f.readline()
        for _i in range(500): #take the first 500 articles
            word_list = textPrase_chinese(linestr, stopwords)    
            doclist.append(word_list)
            classvec.append(1)
            linestr = f.readline()
    with open(path2, encoding='utf-8') as f:
        linestr = f.readline()
        for _i in range(500): #take the first 500 articles
            word_list = textPrase_chinese(linestr, stopwords)
            doclist.append(word_list)
            classvec.append(0) 
            linestr = f.readline()
            
    return doclist, classvec


Build the document vectors with the bag-of-words model, without word embeddings.

In [38]:
def get_vec_of_dataset(dataset):
    tokens = []
    for words in dataset:
        tokens += words 
    #Drop the most frequent words: they are mostly connectives and carry little meaning. Unnecessary if the stop-word list is thorough
    #token_count = Counter(tokens)
    #token_count_sorted = list(sorted(token_count.items() , key=lambda x:x[1], reverse=True))
    #tokens = [token[0] for token in token_count_sorted[20:]]
    tokens = set(tokens)
    #tokens = list(tokens)
    tokens_dict = {} #word -> index lookup, so that list.index is not called for every word (which would slow the run down)
    for i,  word in enumerate(tokens):
        tokens_dict[word]= i
    print(len(tokens))
    
    #Vectorize every article: element j of row i counts the occurrences of word j in article i
    articel_vec = np.zeros((len(dataset), len(tokens)))
    for i in range(len(dataset)):
        for word in dataset[i]:
            if word in tokens: #with tokens as a list, this lookup would be at least ten-odd times slower
                articel_vec[i][tokens_dict[word]] += 1 # += 1 instead of = 1
            else:
                pass
                #print('the word {} is not in my vocabulary'.format(word))
    return articel_vec
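
The comment about list versus set membership can be checked with a small timing sketch. The vocabulary size and the probe word are assumptions, and the exact ratio varies by machine.

```python
import timeit

words = ["w%d" % i for i in range(20000)]  # hypothetical vocabulary
as_list = list(words)
as_set = set(words)
probe = "w19999"  # worst case for the list: the last element

# `in` on a list scans element by element; on a set it is a hash lookup.
t_list = timeit.timeit(lambda: probe in as_list, number=200)
t_set = timeit.timeit(lambda: probe in as_set, number=200)

print("list: {:.6f}s  set: {:.6f}s".format(t_list, t_set))
```
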

In [43]:
dataset,y = get_dataset() #500 articles from each of the two classes

In [44]:
print(len(dataset), len(y))

1000 1000


In [45]:
%%time
X = get_vec_of_dataset(dataset)

42613
Wall time: 718 ms


In [102]:
X.shape

(970, 41768)

In [31]:
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

In [113]:
# Gaussian Naive Bayes
def Naive_model_Gaussion(X,y,test_rate=0.3):
    X_train, X_test, y_train, y_test = train_test_split\
    (X, y, test_size=test_rate, random_state=0)
    gnb = GaussianNB()
    gnb.fit(X_train, y_train)
    predict_prob_y = gnb.predict(X_test)
    print(classification_report(y_test, predict_prob_y))
# Multinomial Naive Bayes   
def Naive_model_Multinomial(X,y,test_rate=0.3):
    X_train, X_test, y_train, y_test = train_test_split\
    (X, y, test_size=test_rate, random_state=0)
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    predict_prob_y = clf.predict(X_test)
    print(classification_report(y_test, predict_prob_y))
# Bernoulli Naive Bayes  
def Naive_model_Bernoulli(X,y,test_rate=0.3):
    X_train, X_test, y_train, y_test = train_test_split\
    (X, y, test_size=test_rate, random_state=0)
    clf = BernoulliNB()
    clf.fit(X_train, y_train)
    predict_prob_y = clf.predict(X_test)
    print(classification_report(y_test, predict_prob_y))

In [114]:
Naive_model_Gaussion(X,y)
Naive_model_Multinomial(X,y)
Naive_model_Bernoulli(X,y)

              precision    recall  f1-score   support

           0       0.92      1.00      0.96       148
           1       1.00      0.91      0.95       143

    accuracy                           0.96       291
   macro avg       0.96      0.95      0.96       291
weighted avg       0.96      0.96      0.96       291

              precision    recall  f1-score   support

           0       1.00      0.99      1.00       148
           1       0.99      1.00      1.00       143

    accuracy                           1.00       291
   macro avg       1.00      1.00      1.00       291
weighted avg       1.00      1.00      1.00       291

              precision    recall  f1-score   support

           0       1.00      0.80      0.89       148
           1       0.83      1.00      0.91       143

    accuracy                           0.90       291
   macro avg       0.91      0.90      0.90       291
weighted avg       0.91      0.90      0.90       291



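The tests above show that Multinomial Naive Bayes performs best on this two-class text task (accuracy 1.00), ahead of Gaussian Naive Bayes (0.96) and Bernoulli Naive Bayes (0.90).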
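
The reports above come from a single 70/30 split. For a steadier comparison, `cross_val_score` (imported earlier but not used) can average accuracy over several folds. The sketch below runs on a synthetic count matrix so that it is self-contained. `X_demo`, `y_demo` and the Poisson rates are assumptions, not the news data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic word-count matrix (an assumption standing in for X, y):
# class 0 uses the last 10 "words" more often, class 1 the first 10.
rng = np.random.default_rng(0)
n_per_class, n_words = 100, 20
rates0 = np.r_[np.full(10, 0.5), np.full(10, 2.0)]
rates1 = rates0[::-1]
X_demo = np.vstack([rng.poisson(rates0, (n_per_class, n_words)),
                    rng.poisson(rates1, (n_per_class, n_words))])
y_demo = np.r_[np.zeros(n_per_class, dtype=int), np.ones(n_per_class, dtype=int)]

results = {}
for model in (GaussianNB(), MultinomialNB(), BernoulliNB()):
    scores = cross_val_score(model, X_demo, y_demo, cv=5)  # 5-fold accuracy
    results[type(model).__name__] = scores.mean()
    print("{:15s} {:.3f}".format(type(model).__name__, scores.mean()))
```

To apply this to the news data, `X_demo` and `y_demo` would be replaced by `X` and `y`.
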

### Building document vectors with TF-IDF

In [115]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

In [124]:
vectorized = TfidfVectorizer(max_features = 20000) #cap the vocabulary at the 20000 most frequent terms

In [125]:
dataset_new = [' '.join(words) for words  in dataset]
X1 = vectorized.fit_transform(dataset_new).toarray()
y1 = np.array(y)
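
To show what a TF-IDF weight is, the sketch below computes one by hand in plain Python. It uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. To my knowledge this matches the default behaviour of `TfidfVectorizer`, but treat that as an assumption. The three documents are made up.

```python
from collections import Counter
from math import log, sqrt

docs = [["dog", "cute", "dog"], ["dog", "stupid"], ["stupid", "garbage"]]  # hypothetical
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Document frequency: number of documents that contain each word.
df = Counter(w for d in docs for w in set(d))
# Smoothed inverse document frequency: rarer words get larger weights.
idf = {w: log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf(doc):
    tf = Counter(doc)                         # raw term counts
    raw = [tf[w] * idf[w] for w in vocab]     # tf * idf per vocabulary word
    norm = sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]            # L2-normalize the row

vectors = [tfidf(d) for d in docs]
for v in vectors:
    print([round(x, 3) for x in v])
```

Words that occur in fewer documents ("garbage", "cute") receive a larger IDF than words shared across documents ("dog", "stupid").
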

In [126]:
X.shape

(970, 41768)

In [127]:
Naive_model_Gaussion(X1,y1)
Naive_model_Multinomial(X1,y1)
Naive_model_Bernoulli(X1,y1)

              precision    recall  f1-score   support

           0       0.92      0.99      0.95       148
           1       0.99      0.91      0.95       143

    accuracy                           0.95       291
   macro avg       0.96      0.95      0.95       291
weighted avg       0.95      0.95      0.95       291

              precision    recall  f1-score   support

           0       0.98      0.99      0.99       148
           1       0.99      0.98      0.99       143

    accuracy                           0.99       291
   macro avg       0.99      0.99      0.99       291
weighted avg       0.99      0.99      0.99       291

              precision    recall  f1-score   support

           0       1.00      0.87      0.93       148
           1       0.88      1.00      0.94       143

    accuracy                           0.93       291
   macro avg       0.94      0.94      0.93       291
weighted avg       0.94      0.93      0.93       291

