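- How can we identify the features of language data that are salient for classifying it?
- How can we construct language models that perform language processing tasks automatically?
- What can we learn about language from these models?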

In [1]:
def gender_features(word):
    # Represent a name by a single feature: its final letter
    return {'last_letter': word[-1]}

gender_features('Shrek')

{'last_letter': 'k'}
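A single last-letter feature is a deliberately small starting point. As a sketch of how the extractor might be extended, the hypothetical `gender_features_suffix` below (not part of the original notebook) also records the first letter and the last two letters. It lowercases the name so that 'K' and 'k' count as the same value, and it assumes a non-empty string.

```python
def gender_features_suffix(word):
    """Hypothetical richer extractor: first letter, last letter, last two letters."""
    w = word.lower()  # assumes word is a non-empty string
    return {
        'first_letter': w[0],
        'last_letter': w[-1],
        'suffix2': w[-2:],
    }

print(gender_features_suffix('Shrek'))
# {'first_letter': 's', 'last_letter': 'k', 'suffix2': 'ek'}
```

Whether the extra features help is an empirical question; more features can also lead to overfitting on a small training set.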

## Gender Identification

In [20]:
from nltk.corpus import names
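# Build each gender's list in its own comprehension, then concatenate them,
# so every name carries the label of the file it came from
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])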

In [21]:
import random, nltk
random.shuffle(labeled_names)

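# Feature extraction: turn each (name, gender) pair into (featureset, gender)
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
# Hold out the first 500 shuffled names for testing and train on the rest
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)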
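To make the training step less of a black box, here is a minimal pure-Python sketch of the idea behind a naive Bayes classifier: count how often each label occurs and how often each feature value occurs with each label, then pick the label with the highest combined log probability. This is not NLTK's implementation. The function names, the add-one smoothing, and the toy training pairs are all assumptions made for illustration, and NLTK uses its own probability estimator.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets):
    """Count label frequencies and per-label feature-value frequencies."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> Counter of values
    values_seen = defaultdict(set)       # feature name -> set of observed values
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[(label, fname)][fval] += 1
            values_seen[fname].add(fval)
    return label_counts, value_counts, values_seen

def classify_naive_bayes(model, features):
    """Return the label maximizing log P(label) + sum of log P(value | label)."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label, count in label_counts.items():
        score = math.log(count / total)
        for fname, fval in features.items():
            counts = value_counts[(label, fname)]
            # Add-one smoothing (an assumption of this sketch) avoids log(0);
            # the extra +1 reserves probability mass for unseen values.
            numerator = counts[fval] + 1
            denominator = sum(counts.values()) + len(values_seen[fname]) + 1
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy data, for illustration only
toy_train = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'e'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'o'}, 'male'),
    ({'last_letter': 'k'}, 'male'),
]
model = train_naive_bayes(toy_train)
print(classify_naive_bayes(model, {'last_letter': 'k'}))  # male
print(classify_naive_bayes(model, {'last_letter': 'a'}))  # female
```

The "naive" part is the assumption that features are independent given the label, which is why the per-feature log probabilities are simply added.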

In [22]:
classifier.classify(gender_features('Neo'))

'male'

In [10]:
classifier.classify(gender_features('Trinity'))

'male'

In [11]:
print(nltk.classify.accuracy(classifier, test_set))

1.0
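The accuracy score is the fraction of test items whose predicted label matches the gold label. The sketch below shows that computation on its own, using a hypothetical hand-written rule (`vowel_rule`) and invented toy data in place of the trained classifier. It also counts the gold labels, since a very high score from a one-feature model can be a sign that the test set contains only one label.

```python
from collections import Counter

def accuracy(classify, gold):
    """Fraction of (features, label) pairs where classify(features) == label."""
    if not gold:
        raise ValueError('gold must contain at least one labeled featureset')
    correct = sum(1 for features, label in gold if classify(features) == label)
    return correct / len(gold)

def vowel_rule(features):
    # Hypothetical stand-in classifier, not a trained model
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

# Invented toy gold data, for illustration only
toy_gold = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'y'}, 'male'),
    ({'last_letter': 'n'}, 'female'),
]

print(accuracy(vowel_rule, toy_gold))              # 0.5
print(Counter(label for _, label in toy_gold))     # both labels appear twice
```

Passing the classifier as a plain function keeps the helper independent of any particular library.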


In [12]:
classifier.show_most_informative_features(5)

Most Informative Features


## Document Classification

In [23]:
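import random
from nltk.corpus import movie_reviews

# Pair each review's word list with its category label
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
# Shuffle so the train/test split below does not fall along category boundaries
random.shuffle(documents)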


In [18]:
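all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Keep the 2000 most frequent words; most_common() sorts by count,
# whereas list(all_words) only reflects insertion order
word_features = [w for (w, _) in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)  # set membership is faster than list membership
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features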
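The feature extractor above depends on a global vocabulary built from the whole corpus. As a small self-contained sketch of the same idea, the hypothetical helpers below take the vocabulary as an explicit argument and use `collections.Counter` in place of `nltk.FreqDist`. The names `top_words` and `document_features_with_vocab`, and the toy tokens, are assumptions for illustration.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent lowercased tokens, most frequent first."""
    counts = Counter(w.lower() for w in tokens)
    return [w for w, _ in counts.most_common(n)]

def document_features_with_vocab(document, vocabulary):
    """Map each vocabulary word to whether it occurs in the document."""
    document_words = set(document)
    return {'contains({})'.format(w): (w in document_words) for w in vocabulary}

# Invented toy tokens, for illustration only
toy_tokens = ['a', 'great', 'film', 'a', 'dull', 'plot', 'a', 'great', 'cast']
vocab = top_words(toy_tokens, 2)
print(vocab)  # ['a', 'great']
print(document_features_with_vocab(['great', 'cast'], vocab))
# {'contains(a)': False, 'contains(great)': True}
```

Every document gets the same set of feature names, so the classifier sees a fixed-width representation regardless of document length.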

In [19]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)