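Assignment 3: Naive Bayes Sentiment Classification  
Task: train a naive Bayes sentiment classifier on the movie reviews in the NLTK corpus.  
Importing the reviews: from nltk.corpus import movie_reviews  
The files are located at …\nltk_data\corpora\movie_reviews and are already labeled by category, with 1000 negative and 1000 positive reviews.  
Steps: apply text preprocessing (noise removal, sentence splitting, tokenization, stop-word removal, stemming, pruning) and feature selection to build a feature vocabulary, then train a naive Bayes model. (Ideally, comment each step.)  
Use the first 80% (the first 800 negative and the first 800 positive reviews) as the training set and the remaining 20% as the test set.  
Output: Accuracy, Precision, Recall and F1, rounded to two decimal places, where F1 = ( 2 * Precision * Recall ) / ( Precision + Recall).  
For example:  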
Accuracy = 0.98  
Precision = 0.67  
Recall = 0.32  
F1 = 0.43  
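
As a minimal sketch of how the four metrics follow from confusion-matrix counts, the snippet below treats 'pos' as the positive class. The function name `classification_metrics` and the sample counts are hypothetical and only for illustration; they are not part of the assignment.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Fall back to 0.0 when a denominator is empty (e.g. nothing predicted positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = (2 * precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts, for illustration only
accuracy, precision, recall, f1 = classification_metrics(tp=8, fp=2, tn=7, fn=3)
for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1", f1)]:
    print("%s = %.2f" % (name, value))
```

Here a false positive is a negative review predicted as positive, and a false negative is a positive review predicted as negative.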


In [1]:
import string

from nltk import FreqDist
from nltk import NaiveBayesClassifier
from nltk.corpus import movie_reviews  # movie review corpus
from nltk.corpus import stopwords


def is_invalid(word):    # True for stop words and punctuation (sw and punctuation are defined below)
    return word in sw or word in punctuation


def doc_features(doc):  # Feature extractor: count of each feature word in the filtered document
    doc_words = FreqDist(w for w in doc if not is_invalid(w))
    features = {}   # maps feature name to count
    for word in word_features:
        features['count(%s)' % word] = (doc_words.get(word, 0))
    return features


#   Build a list of (tokens, label) pairs, one per review.
#   The corpus is already tokenized, so no manual sentence splitting or tokenization is needed.
labeled_docs = [(list(movie_reviews.words(fid)), cat)
                for cat in movie_reviews.categories()
                for fid in movie_reviews.fileids(cat)]

#   Select feature words
review_words = movie_reviews.words()

sw = set(stopwords.words('english'))  # stop words
punctuation = set(string.punctuation)  # punctuation marks

#   Drop stop words and punctuation
filtered = [w.lower() for w in review_words if not is_invalid(w.lower())]

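#   Use the first 5% of the distinct words as features.
#   In NLTK 3, FreqDist is a Counter subclass, so keys() is not sorted by frequency;
#   [w for w, _ in words.most_common(N)] would rank the words by frequency instead.
words = FreqDist(filtered)
N = int(0.05 * len(words))
word_features = list(words.keys())[:N]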

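# Extract features from each review
feature_sets = [(doc_features(d), c) for (d, c) in labeled_docs]
# Training and test sets: the first 800 reviews of each class for training, the last 200 of each for testing
# (labeled_docs holds the 1000 'neg' reviews followed by the 1000 'pos' reviews)
train_set = feature_sets[0:800] + feature_sets[1000:1800]
test_set = feature_sets[800:1000] + feature_sets[1800:2000]
# Train the model on the training set
classifier = NaiveBayesClassifier.train(train_set)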

# Confusion-matrix counts (TP, FP, TN, FN), with 'pos' as the positive class
True_Positive = 0
False_Positive = 0
True_Negative = 0
False_Negative = 0

results = classifier.classify_many([fs for (fs, l) in test_set])
for ((fs, l), r) in zip(test_set, results):
    if (l == 'pos') and (l == r):
        True_Positive += 1
    elif (l == 'pos') and (l != r):    # positive review predicted as negative
        False_Negative += 1
    elif (l == 'neg') and (l == r):
        True_Negative += 1
    elif (l == 'neg') and (l != r):    # negative review predicted as positive
        False_Positive += 1

print(True_Positive, False_Positive, True_Negative, False_Negative)
# Evaluate the trained model on the test set: Accuracy, Precision, Recall, F1
# Accuracy = (TP+TN)/total
Accuracy = (True_Positive+True_Negative)/len(test_set)
# Precision = TP/(TP+FP)
Precision = True_Positive/(True_Positive+False_Positive)
# Recall = TP/(TP+FN)
Recall = True_Positive/(True_Positive+False_Negative)
# F1 = ( 2 * Precision * Recall ) / ( Precision + Recall)
F1 = (2 * Precision * Recall) / (Precision + Recall)
print("Accuracy = ", "%.2f" % Accuracy)
print("Precision = ", "%.2f" % Precision)
print("Recall = ", "%.2f" % Recall)
print("F1 = ", "%.2f" % F1)

158 26 174 42
Accuracy =  0.83
Precision =  0.86
Recall =  0.79
F1 =  0.82
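
The feature-selection step in the cell above keeps a fixed fraction of the vocabulary as features. Below is a minimal sketch of ranking words by frequency before taking that fraction. It uses `collections.Counter`, which `FreqDist` subclasses in NLTK 3, so `most_common` should behave the same way on both. The helper name `top_fraction` and the toy token list are assumptions for illustration only.

```python
from collections import Counter


def top_fraction(tokens, fraction):
    """Return the most frequent `fraction` of the distinct tokens, most frequent first."""
    counts = Counter(tokens)
    n = int(fraction * len(counts))  # number of distinct words to keep
    return [word for word, _ in counts.most_common(n)]


# Toy tokens, for illustration only: 5 distinct words, so fraction 0.5 keeps 2
tokens = ["good", "bad", "good", "plot", "good", "bad", "film", "actor"]
print(top_fraction(tokens, 0.5))  # ['good', 'bad']
```

`most_common(n)` returns `(word, count)` pairs in descending order of count, so the list comprehension keeps only the words.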
