* Implement the Naive Bayes method in Python from scratch and validate the result on the given dataset.

### Relevant formulas
* $P(y|S)=\frac{P(y,S)}{P(S)}=\frac{P(S|y) \cdot P(y)}{P(S)}$
* $P(S|y=emotion)=P(w_1,w_2,w_3,...,w_n|y=emotion) \approx P(w_1|y=emotion) \cdots P(w_n|y=emotion)$
* $P(w_1|y=emotion) \approx \frac{Count(w_1,y=emotion)+1}{Count(y=emotion)+V+1}$, where $Count(y=emotion)$ is the total word count of the class and $V$ is the vocabulary size
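
Multiplying many small word probabilities underflows quickly, which is why the test loops later in this notebook add logarithms instead of multiplying. Taking the log of the formulas above gives the decision rule sketched here; $\hat{y}$ is notation introduced for the predicted label, and $P(S)$ is dropped because it is the same for every class.

$$\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{i=1}^{n} \log P(w_i \mid y) \right]$$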

### Show the system version

In [94]:
import sys
import platform

print("Mac system version: {} {}".format(platform.system(), platform.mac_ver()[0]))
print("Python version: {}".format(sys.version.split()[0]))

Mac system version: Darwin 10.12.6
Python version: 3.6.2


### Take a look at the data

In [129]:
!head -n 5 ../code/pos_train.txt| nl

     1	装了xp系统后，没有出现网友说的驱动不好装的情况
     2	总的来说,比较干净,而且地理位置很好,市区繁华地段.进出方便.
     3	2、散热很好，这个不用解释了
     4	温度控制的非常好，噪音也不大，
     5	早上6点多有"按摩"电话过来，^_^；不想被打扰的话拔掉电话插头吧


In [99]:
!head -n 5 ../code/neg_train.txt | nl

     1	光驱不大好，给人想散架的的感觉。哈哈，总体上还可以。装系统有点绕手，害的我装了一个小时。
     2	2，散热是有点问题，CPU cache 是小了点。不能运行很大的软件。
     3	上月入住，将近500元一天的房费，卫生间很小，像经济性酒店。走廊房间一股霉味。浴缸下水管漏水，淌的一地都是。窗户外面不到30米就是居民小区，喇叭不断。更可笑的是，结账时被告知，携程预定的房费是不含服务费的，又多收了10%的服务费。在携程定了这么多次，五星也订很多了，还是头一次遇到。总之，酒店差，房费不合理！
     4	CTRIP上怎么让它这么忽悠顾客的 ？！！！！！！！
     5	于丹教授讲《论语》不能很正确地反映儒家原来的思想，她除了讲《论语》不好外，别的还可以。


In [105]:
import jieba.posseg as pseg
from collections import Counter
from math import log

lm_pos = Counter()  # word-frequency table for the positive class

# Count word frequencies in the positive training set
with open("../code/pos_train.txt", encoding="utf-8") as f:
    for line in f:
        for word, flag in pseg.cut(line):
            if flag == "x":  # skip punctuation, which jieba tags "x"
                continue
            lm_pos[word] += 1

vec_pos = float(len(lm_pos))  # positive vocabulary size (distinct words)
s_pos = float(sum(lm_pos.values()))  # total word count in the positive class
print("vec_pos=%f\ts_pos=%f" % (vec_pos, s_pos))

vec_pos=27733.000000	s_pos=508469.000000


In [106]:
lm_neg = Counter()  # word-frequency table for the negative class

# Count word frequencies in the negative training set
with open("../code/neg_train.txt", encoding="utf-8") as f:
    for line in f:
        for word, flag in pseg.cut(line):
            if flag == "x":  # skip punctuation, which jieba tags "x"
                continue
            lm_neg[word] += 1

vec_neg = float(len(lm_neg))  # negative vocabulary size (distinct words)
s_neg = float(sum(lm_neg.values()))  # total word count in the negative class
print("vec_neg=%f\ts_neg=%f" % (vec_neg, s_neg))

vec_neg=21685.000000	s_neg=449708.000000
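
The two training cells differ only in the file they read and the table they fill. A minimal sketch of folding that into one helper follows; `count_words` and its `tokenize` parameter are hypothetical names, and the whitespace tokenizer and toy English lines merely stand in for `pseg.cut` and the training files.

```python
from collections import Counter

def count_words(lines, tokenize):
    """Return (frequency table, vocabulary size, total word count) for one class."""
    table = Counter()
    for line in lines:
        table.update(tokenize(line))
    return table, len(table), sum(table.values())

# Toy stand-ins: whitespace tokenisation and made-up lines
toy_lines = ["quiet room good location", "good service"]
table, vocab_size, total = count_words(toy_lines, str.split)
print(vocab_size, total)
```

With jieba, `tokenize` could be a small function that yields `word` for each `(word, flag)` pair from `pseg.cut(line)` whose flag is not `"x"`.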


In [134]:
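s_ = s_pos + s_neg  # total word count over both classes
vec_s = vec_pos + vec_neg  # combined vocabulary size: the two class vocabularies added together, so shared words count twice
pos_prob = s_pos / s_  # prior of the positive class, estimated from word counts
neg_prob = 1 - pos_prob  # prior of the negative class

print("pos_prob=%f\tneg_prob=%f" % (pos_prob, neg_prob))

pos_prob=0.530663	neg_prob=0.469337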
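
`vec_s` adds the two class vocabulary sizes, so a word that occurs in both classes is counted twice. If $V$ is meant to be the number of distinct words overall, one way to count it is sketched below; the two toy tables are made up for illustration and stand in for `lm_pos` and `lm_neg`.

```python
from collections import Counter

# Toy frequency tables, not taken from the review data
toy_pos = Counter({"good": 3, "quiet": 1})
toy_neg = Counter({"bad": 2, "good": 1})

summed = len(toy_pos) + len(toy_neg)         # counts "good" twice
distinct = len(set(toy_pos) | set(toy_neg))  # counts each word once
print(summed, distinct)
```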


In [108]:
# Load the positive test set, one review per line
with open("../code/pos_test.txt", encoding="utf-8") as f:
    pos_test_list = f.readlines()

In [109]:
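TP = 0  # positive reviews predicted positive
FP = 0  # positive reviews predicted negative (false negatives, in standard terms)

for test_str in pos_test_list:
    # s_pos / s_neg are reused here as running log scores, which overwrites
    # the class word totals computed in the training cells
    s_pos = 0.0  # log P(S | y=pos)
    s_neg = 0.0  # log P(S | y=neg)
    for word, flag in pseg.cut(test_str):
        if flag == "x":  # skip punctuation
            continue
        # Add-one smoothed word probabilities; the denominators use the
        # running log score plus vec_s, not the class word total
        p_pos_word = (lm_pos[word] + 1) / (s_pos + vec_s + 1)
        p_neg_word = (lm_neg[word] + 1) / (s_neg + vec_s + 1)
        s_pos += log(p_pos_word)
        s_neg += log(p_neg_word)
    s_pos = s_pos + log(pos_prob)  # add the log prior of the positive class
    s_neg = s_neg + log(neg_prob)  # add the log prior of the negative class
    if s_pos > s_neg:
        TP += 1
    else:
        FP += 1

print("TP=%d\tFP=%d" % (TP, FP))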

TP=4365	FP=600
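
In the loop above, `s_pos` and `s_neg` double as running log scores, so the smoothing denominators do not contain the class word totals that the formula section calls for. A minimal sketch that keeps the totals and the scores under separate names is shown below; `log_score`, `classify` and the toy counts are hypothetical and only illustrate the formula, not the results on the review data.

```python
from collections import Counter
from math import log

def log_score(words, counts, class_total, vocab_size, prior):
    """log P(y) + sum of log P(w|y), using (Count(w,y)+1) / (Count(y)+V+1)."""
    score = log(prior)
    for w in words:
        score += log((counts[w] + 1) / (class_total + vocab_size + 1))
    return score

def classify(words, pos, neg, vocab_size, pos_prior):
    """Return "pos" or "neg", whichever class has the higher log score."""
    pos_score = log_score(words, pos, sum(pos.values()), vocab_size, pos_prior)
    neg_score = log_score(words, neg, sum(neg.values()), vocab_size, 1 - pos_prior)
    return "pos" if pos_score > neg_score else "neg"

# Toy counts, made up for illustration only
toy_pos = Counter({"good": 3, "quiet": 1})
toy_neg = Counter({"bad": 2, "noisy": 2, "good": 1})
vocab_size = len(set(toy_pos) | set(toy_neg))

print(classify(["good", "quiet"], toy_pos, toy_neg, vocab_size, 0.5))
print(classify(["bad", "noisy"], toy_pos, toy_neg, vocab_size, 0.5))
```

Passing the class total in as an argument means the score accumulator can never overwrite it, whatever the variables are called at the call site.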


In [110]:
# Load the negative test set, one review per line
with open("../code/neg_test.txt", encoding="utf-8") as f:
    neg_test_list = f.readlines()

In [111]:
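TN = 0  # negative reviews predicted negative
FN = 0  # negative reviews predicted positive (false positives, in standard terms)

for test_str in neg_test_list:
    # s_pos / s_neg are running log scores here, as in the positive test loop
    s_pos = 0.0  # log P(S | y=pos)
    s_neg = 0.0  # log P(S | y=neg)
    for word, flag in pseg.cut(test_str):
        if flag == "x":  # skip punctuation
            continue
        # Add-one smoothed word probabilities; the denominators use the
        # running log score plus vec_s, not the class word total
        p_pos_word = (lm_pos[word] + 1) / (s_pos + vec_s + 1)
        p_neg_word = (lm_neg[word] + 1) / (s_neg + vec_s + 1)
        s_pos += log(p_pos_word)
        s_neg += log(p_neg_word)
    s_pos = s_pos + log(pos_prob)  # add the log prior of the positive class
    s_neg = s_neg + log(neg_prob)  # add the log prior of the negative class
    if s_pos < s_neg:  # the review is classified as negative
        TN += 1
    else:
        FN += 1

print("TN=%d\tFN=%d" % (TN, FN))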

TN=4142	FN=1431


In [112]:
import pandas as pd

# In the two test loops, `FP` counts positive reviews predicted negative and
# `FN` counts negative reviews predicted positive, so map them onto the
# standard definitions before building the matrix.
false_neg, false_pos = FP, FN
confusion_matrix = pd.DataFrame(
    {'predicted_no': [TN, false_neg], 'predicted_yes': [false_pos, TP]},
    index=['actual_no', 'actual_yes'])
confusion_matrix

| | predicted_no | predicted_yes |
|---|---|---|
| actual_no | 4142 | 1431 |
| actual_yes | 600 | 4365 |


In [113]:
total = float(TP + TN + false_pos + false_neg)
print("TP=%d, TN=%d, FP=%d, FN=%d" % (TP, TN, false_pos, false_neg))
print("Accuracy = %f" % ((TP + TN) / total))
print("ErrorRate = %f" % ((false_pos + false_neg) / total))
print("True Positive Rate = %f" % (TP / float(TP + false_neg)))
print("False Positive Rate = %f" % (false_pos / float(false_pos + TN)))
print("Specificity = %f" % (TN / float(TN + false_pos)))
print("Precision = %f" % (TP / float(TP + false_pos)))
print("Prevalence = %f" % ((TP + false_neg) / total))

TP=4365, TN=4142, FP=1431, FN=600
Accuracy = 0.807269
ErrorRate = 0.192731
True Positive Rate = 0.879154
False Positive Rate = 0.256774
Specificity = 0.743226
Precision = 0.753106
Prevalence = 0.471152
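
The two test loops and the metric formulas above can also be folded into a single pass over (actual, predicted) label pairs, which makes it harder to mix up the four outcomes. This is a sketch only: `confusion_counts`, `metrics` and the toy labels are hypothetical names and data, with 1 standing for a positive review and 0 for a negative one.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1:
            counts["TP" if p == 1 else "FN"] += 1
        else:
            counts["FP" if p == 1 else "TN"] += 1
    return counts

def metrics(c):
    """Derive a few of the rates printed above from the four counts."""
    total = c["TP"] + c["FN"] + c["FP"] + c["TN"]
    return {
        "accuracy": (c["TP"] + c["TN"]) / total,
        "true_positive_rate": c["TP"] / (c["TP"] + c["FN"]),
        "false_positive_rate": c["FP"] / (c["FP"] + c["TN"]),
        "precision": c["TP"] / (c["TP"] + c["FP"]),
    }

# Toy labels, made up for illustration
actual = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 1]
c = confusion_counts(actual, predicted)
print(c, metrics(c))
```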
