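This section introduces the classifiers that NLTK provides for text classification.

The following is a simple example of text classification with NLTK.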

In [2]:
import random
import nltk
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)  # the corpus is read category by category; shuffle so the training and test sets have similar label distributions

In [3]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
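# Note: in NLTK 3, FreqDist is a Counter subclass, so list(all_words) follows first-seen order, not frequency.
# To select the 2000 most frequent words, use [w for w, _ in all_words.most_common(2000)] instead.
word_features = list(all_words)[:2000]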

def document_features(document):
    """Feature extraction: map a document (list of words) to boolean contains(word) features."""
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
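
The vocabulary selection and feature extraction above can be tried on a tiny corpus without downloading `movie_reviews`. This is a minimal sketch under stated assumptions: the toy documents are made up, `collections.Counter` stands in for `nltk.FreqDist` (which subclasses it in NLTK 3), and `document_features` takes the vocabulary as an extra argument for convenience. It also shows how slicing `list(counts)` differs from `most_common`.

```python
from collections import Counter

# Hypothetical toy corpus standing in for movie_reviews.
toy_docs = [
    ["the", "plot", "was", "dull", "dull", "dull"],
    ["a", "great", "plot"],
]

# Counter stands in for nltk.FreqDist here.
counts = Counter(w.lower() for doc in toy_docs for w in doc)

first_seen = list(counts)[:2]                            # first-seen order
most_frequent = [w for w, _ in counts.most_common(2)]    # frequency order


def document_features(document, word_features):
    """Return a dict of boolean contains(word) features for one document."""
    document_words = set(document)
    return {"contains({})".format(w): (w in document_words) for w in word_features}


features = document_features(toy_docs[1], most_frequent)
print(first_seen)
print(most_frequent)
print(features)
```

Each document becomes a plain dict keyed by feature name, which is the shape NLTK classifiers expect when paired with a label.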

In [11]:
featuresets = [(document_features(d), c) for (d, c) in documents]  # NLTK expects feature sets as (feature dict, label) pairs
train_set, test_set = featuresets[100:], featuresets[:100]  # hold out the first 100 documents for testing
# Train a Naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)
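
Because the documents are shuffled without a seed, the split and every accuracy figure below change from run to run. One way to make the hold-out split reproducible is sketched here. The helper name `train_test_split` and the toy data are hypothetical and not part of NLTK.

```python
import random


def train_test_split(items, test_size, seed=0):
    """Shuffle a copy of items with a fixed seed and hold out test_size items."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical stand-in for featuresets: 10 negative items followed by 10 positive ones.
toy = [({"id": i}, "neg" if i < 10 else "pos") for i in range(20)]
toy_train, toy_test = train_test_split(toy, test_size=5, seed=42)
print(len(toy_train), len(toy_test))
```

Calling the helper again with the same seed returns the same split, so results from different classifiers can be compared on identical data.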

NLTK provides several metrics for evaluating a classifier's predictions.

In [17]:
test_labels = [label for (_, label) in test_set]
test_features = [feature for (feature, _) in test_set]
test_predict = [classifier.classify(test_feature) for test_feature in test_features]

In [21]:
print("accuracy:", nltk.classify.accuracy(classifier, test_set))  # accuracy on the test set
print(nltk.ConfusionMatrix(test_labels, test_predict))  # confusion matrix; arguments are (reference, test)
classifier.show_most_informative_features(5)  # the five most informative features

accuracy: 0.78
    |  n  p |
    |  e  o |
    |  g  s |
----+-------+
neg |<39> 9 |
pos | 13<39>|
----+-------+
(row = reference; col = test)

Most Informative Features
       contains(miscast) = True              neg : pos    =      8.2 : 1.0
 contains(unimaginative) = True              neg : pos    =      7.6 : 1.0
        contains(shoddy) = True              neg : pos    =      7.0 : 1.0
        contains(sexist) = True              neg : pos    =      7.0 : 1.0
     contains(atrocious) = True              neg : pos    =      7.0 : 1.0
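
Accuracy alone hides how the errors are distributed between the two labels. Per-label precision, recall, and F1 can be derived from the same reference and predicted label lists, as in this hand-rolled sketch. The function `per_label_scores` and the toy labels are hypothetical; NLTK's own `nltk.metrics` module offers `precision`, `recall`, and `f_measure`, which operate on sets instead.

```python
from collections import Counter


def per_label_scores(reference, predicted, label):
    """Precision, recall and F1 for one label, from parallel lists of labels."""
    pairs = Counter(zip(reference, predicted))  # (true label, predicted label) -> count
    tp = pairs[(label, label)]
    fp = sum(n for (ref, pred), n in pairs.items() if pred == label and ref != label)
    fn = sum(n for (ref, pred), n in pairs.items() if ref == label and pred != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy labels; with the notebook's variables the call would be
# per_label_scores(test_labels, test_predict, "pos").
reference = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "neg", "neg", "neg", "pos"]
precision, recall, f1 = per_label_scores(reference, predicted, "pos")
print(precision, recall, f1)
```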


NLTK ships implementations of three classic machine learning classifiers: decision tree, Naive Bayes, and maximum entropy.

In [22]:
decision_tree_classifier = nltk.classify.decisiontree.DecisionTreeClassifier.train(train_set)  # decision tree
print("decision tree accuracy:", nltk.classify.accuracy(decision_tree_classifier, test_set))  # accuracy on the test set

decision tree accuracy: 0.6


In [24]:
max_entropy_classifier = nltk.classify.maxent.MaxentClassifier.train(train_set)  # maximum entropy
print("max entropy accuracy:", nltk.classify.accuracy(max_entropy_classifier, test_set))  # accuracy on the test set

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.499


  exp_nf_delta = 2 ** nf_delta
  sum1 = numpy.sum(exp_nf_delta * A, axis=0)
  sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
  deltas -= (ffreq_empirical - sum1) / -sum2


         Final               nan        0.499
max entropy accuracy: 0.52

Note that training did not converge in this run: the log likelihood ends as nan and the training accuracy stays at 0.499, so the 0.52 test accuracy is chance level rather than a fair measure of the maximum entropy model. The four code lines in the middle of the output appear to be source lines from NLTK's maxent module echoed by NumPy runtime warnings whose header lines were not captured.


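Scikit-Learn is a machine learning library for Python that provides a large number of algorithms. NLTK can use scikit-learn classifiers through its SklearnClassifier wrapper.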

In [25]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(train_set)
print("MultinomialNB accuracy:", nltk.classify.accuracy(MNB_classifier, test_set))

MultinomialNB accuracy: 0.78
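
Scikit-learn estimators work on numeric arrays, not on feature dicts, so the wrapper has to convert between the two; to my knowledge it does this with scikit-learn's `DictVectorizer`. The pure-Python sketch below only illustrates the idea and is not the wrapper's actual code. The helper names `fit_vocabulary` and `to_vector` and the toy feature dicts are hypothetical.

```python
def fit_vocabulary(feature_dicts):
    """Assign a column index to every feature name seen in the training data."""
    names = sorted({name for d in feature_dicts for name in d})
    return {name: i for i, name in enumerate(names)}


def to_vector(feature_dict, vocabulary):
    """Turn one feature dict into a dense list of floats (True -> 1.0, False -> 0.0)."""
    vec = [0.0] * len(vocabulary)
    for name, value in feature_dict.items():
        if name in vocabulary:  # features unseen during fitting are ignored
            vec[vocabulary[name]] = float(value)
    return vec


# Hypothetical toy feature dicts in the same shape as document_features() output.
toy_features = [
    {"contains(plot)": True, "contains(dull)": False},
    {"contains(plot)": True, "contains(dull)": True},
]
vocabulary = fit_vocabulary(toy_features)
vectors = [to_vector(d, vocabulary) for d in toy_features]
print(vocabulary)
print(vectors)
```

With boolean `contains(...)` features, each document becomes a row of 0/1 values, which is why a count-oriented model such as MultinomialNB can be trained on them.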
