## Naive Bayes

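Naive Bayes is a classification algorithm built on probability theory. "Naive" refers to its assumption that the features are conditionally independent of one another given the class.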
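For reference, the independence assumption is what lets the class posterior factor into one term per feature. The notation below is introduced here for illustration and is not from the original: $y$ is the class and $x_1, \dots, x_n$ are the feature values.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```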

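Pros: remains effective when little data is available, and can handle multi-class problems.

Cons: sensitive to how the input data is prepared.

Applicable data type: nominal data.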

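### Problem Description

The dataset for this section comes from the battle records of a game. Columns 1-5 hold one side's lineup, columns 6-10 hold the other side's lineup, and column 11 records the outcome (胜 = win, 败 = loss). The goal is to build a model that predicts the outcome of a match.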

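### Example Overview

This section demonstrates:

- Using a bag of words
- Converting class labels to numeric codes
- Splitting the dataset into a training set and a test set
- Training a Naive Bayes model and predicting with it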

In [15]:
from pandas import DataFrame, read_table
import pandas as pd

df = pd.read_table('/home/pytest/data/battle.txt', 
                   sep=' ', names=['1','2','3','4','5','6','7','8','9','10','11'], 
                   encoding='utf-8', engine='python')
df[0:10]

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11
0,屠夫,大树,剑圣,见证,天狗,蝙蝠,超人,闪电,钢骨,白虎,败
1,蝙蝠,神谕,闪电,钢骨,白虎,大树,幽鬼,天狗,冰龙,黑龙,胜
2,幽鬼,石像,魔童,冰龙,黑龙,守护,斧王,屠夫,恶魔,魔童,胜
3,屠夫,大树,剑圣,见证,天狗,幽鬼,石像,骨法,纸人,魔童,败
4,蝙蝠,神谕,闪电,钢骨,白虎,大树,守护,屠夫,大树,蜘蛛,败
5,屠夫,大树,剑圣,见证,天狗,蝙蝠,超人,闪电,天狗,白虎,胜
6,蝙蝠,神谕,闪电,钢骨,白虎,屠夫,熊猫,末日,火猫,黑龙,胜
7,屠夫,大树,剑圣,见证,天狗,蝙蝠,超人,闪电,钢骨,白虎,败
8,蝙蝠,神谕,闪电,钢骨,白虎,屠夫,大树,神谕,纸人,见证,胜
9,幽鬼,石像,魔童,冰龙,黑龙,幽鬼,石像,魔童,天狗,冰龙,败


Build a bag of words containing every hero; it is used to convert the feature values to numbers.

In [16]:
# Series.append was removed in pandas 2.0; pd.concat stacks the ten lineup columns instead
feature_cols = [str(i) for i in range(1, 11)]
all_heros = list(pd.Categorical(pd.concat([df[c] for c in feature_cols])).categories)

Replace each word with its numeric index in the bag of words.

In [17]:
for key in all_heros:
    df.replace(key, all_heros.index(key), inplace=True)
df[0:1]

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11
0,8,4,3,24,6,23,26,29,28,18,败
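The same encoding can also be expressed with a dictionary lookup, which avoids one `replace` pass over the table per hero. The following is a minimal sketch on a toy table; the names `toy`, `vocab`, and `hero_to_id` are hypothetical and the two-column data is made up for illustration.

```python
import pandas as pd

# Toy lineup table (hypothetical data, two slots only for brevity)
toy = pd.DataFrame({
    '1': ['屠夫', '蝙蝠', '屠夫'],
    '2': ['大树', '神谕', '大树'],
})

# Vocabulary: sorted unique hero names across all feature columns
vocab = sorted(pd.unique(toy.values.ravel()))

# Name -> index lookup table
hero_to_id = {name: idx for idx, name in enumerate(vocab)}

# Map every cell through the lookup; a name missing from the
# vocabulary would become NaN instead of being silently kept
encoded = toy.apply(lambda col: col.map(hero_to_id))
print(vocab)
print(encoded)
```

Because an unknown name turns into `NaN`, a quick `encoded.isna().any()` check can catch heroes that were never added to the vocabulary.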


In [18]:
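# Encode the outcome as integer codes. In the output below, 败 (loss) is encoded as 1,
# so 胜 (win) is encoded as 0.
df['result'] = pd.Categorical(df['11']).codes
df.drop(['11'], axis=1, inplace=True)
df[0:1]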

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,result
0,8,4,3,24,6,23,26,29,28,18,1
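To read predictions back as labels later, it helps to keep the mapping from integer code to original label. A small sketch, assuming a label column with the same two values as column `11`; `labels` and `code_to_label` are hypothetical names.

```python
import pandas as pd

# Hypothetical label column with the same two values as column '11'
labels = pd.Series(['败', '胜', '胜', '败'])

cat = pd.Categorical(labels)

# cat.categories[i] is the original label behind integer code i
code_to_label = dict(enumerate(cat.categories))
print(code_to_label)
print(list(cat.codes))
```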


In [19]:
# Naive_Bayes Classifier  
def nb_classifier(train_x, train_y):  
    from sklearn.naive_bayes import GaussianNB
    model = GaussianNB()  
    model.fit(train_x, train_y)  
    return model
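One caveat: `GaussianNB` models each feature as a continuous, normally distributed value, so it treats hero index 10 as "close to" index 11 even though the indices are arbitrary IDs. For integer-encoded nominal features, scikit-learn also provides `CategoricalNB`. The sketch below is an alternative, not the classifier used in this section; the helper `categorical_nb_classifier`, its `n_categories` parameter, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

def categorical_nb_classifier(train_x, train_y, n_categories):
    # min_categories reserves a slot for every hero index, so a hero that
    # happens to be absent from the training split does not break predict()
    model = CategoricalNB(alpha=1.0, min_categories=n_categories)
    model.fit(train_x, train_y)
    return model

# Synthetic stand-in data: 20 matches, 10 slots, 30 possible heroes
rng = np.random.RandomState(0)
toy_x = rng.randint(0, 30, size=(20, 10))
toy_y = rng.randint(0, 2, size=20)

toy_model = categorical_nb_classifier(toy_x, toy_y, n_categories=30)
toy_pred = toy_model.predict(toy_x[:3])
print(toy_pred)
```

`alpha=1.0` applies Laplace smoothing, which matters here because most hero/slot combinations never appear in a dataset this small.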

In [20]:
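# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

# Split into a training set and a test set
train, test = train_test_split(df, test_size=0.2)

# Split features from labels (the last column, 'result', is the label).
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
# astype(int) guards against an object-dtype array after the replace loop.
train_x = train.to_numpy()[:, :-1].astype(int)
train_y = train.to_numpy()[:, -1].astype(int)

test_x = test.to_numpy()[:, :-1].astype(int)
test_y = test.to_numpy()[:, -1].astype(int)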
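The split above is random, so the reported numbers change from run to run. A possible refinement is to fix the seed and stratify on the label. The sketch below uses synthetic data shaped like the encoded table (24 rows, 10 features); the variable names and the seed value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded table: 24 rows, 10 features, binary label
rng = np.random.RandomState(0)
features = rng.randint(0, 30, size=(24, 10))
labels = np.array([0, 1] * 12)

x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels,
    test_size=0.2,
    random_state=42,   # fixed seed: the same split on every run
    stratify=labels,   # keep the win/loss ratio similar in both parts
)
print(x_tr.shape, x_te.shape)
```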

In [30]:
import numpy as np
import time
from sklearn import metrics

num_train, num_feat = train_x.shape
num_test, num_feat = test_x.shape

# Check whether the labels are binary (0/1)
is_binary_class = (len(np.unique(train_y)) == 2)

print('******************** Data Info *********************')
print('#training data: %d, #testing_data: %d, dimension: %d' % (num_train, num_test, num_feat))

start_time = time.time()
model = nb_classifier(train_x, train_y)
print('training took %fs!' % (time.time() - start_time))

predict = model.predict(test_x)

# For binary labels, precision and recall can also be computed
if is_binary_class:
    precision = metrics.precision_score(test_y, predict)
    recall = metrics.recall_score(test_y, predict)
    print('precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall))

# Compute the accuracy
accuracy = metrics.accuracy_score(test_y, predict)
print('accuracy: %.2f%%' % (100 * accuracy))

******************** Data Info *********************
#training data: 19, #testing_data: 5, dimension: 10
training took 0.001301s!
precision: 50.00%, recall: 100.00%
accuracy: 80.00%
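With only 5 test samples, a single prediction moves the accuracy by 20 percentage points, so the figures above are a rough indication at best. Cross-validation is one way to get a steadier estimate from a dataset this small. The following sketch runs on synthetic data with the same shape as the encoded table; the fold count and seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data with the same shape as the encoded table
rng = np.random.RandomState(0)
toy_x = rng.randint(0, 30, size=(24, 10))
toy_y = np.array([0, 1] * 12)

# 4 stratified folds: every sample is used for testing exactly once
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), toy_x, toy_y, cv=cv, scoring='accuracy')

print('fold accuracies:', scores)
print('mean accuracy: %.2f' % scores.mean())
```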


In [26]:
import numpy as np

# Encode a new lineup with the same bag of words used for training
realdata = DataFrame(np.array([['幽鬼', '大树', '剑圣', '见证', '天狗', '蝙蝠', '超人', '闪电', '天狗', '白虎']]))
for key in all_heros:
    realdata.replace(key, all_heros.index(key), inplace=True)
realdata

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,10,4,3,24,6,23,26,29,6,18


In [27]:
predict = model.predict(realdata)
print(predict)

[1]
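The output is a label code; given the encoding shown earlier (败 became 1), `[1]` corresponds to 败 (loss). To see how confident the model is, `predict_proba` returns one probability per class. A minimal sketch on synthetic data follows; `toy_model`, `sample`, and `code_to_label` are hypothetical names, and the stand-in model is not the one trained above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Label codes as produced in this section: 0 = 胜 (win), 1 = 败 (loss)
code_to_label = {0: '胜', 1: '败'}

# Synthetic stand-in for the trained model and one encoded lineup
rng = np.random.RandomState(0)
toy_x = rng.randint(0, 30, size=(24, 10))
toy_y = np.array([0, 1] * 12)
toy_model = GaussianNB().fit(toy_x, toy_y)

sample = toy_x[:1]
proba = toy_model.predict_proba(sample)[0]   # one probability per class
pred_code = int(toy_model.predict(sample)[0])

print(code_to_label[pred_code])
print(dict(zip(toy_model.classes_.tolist(), proba.tolist())))
```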
