## 6.1 Supervised Classification
- Text -> feature extraction (build a representation) -> machine learning model

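### Gender Identification
-  Features: characteristics of male and female names. Names ending in a, e, or i tend to be female; names ending in k, o, r, or s tend to be male
    - Feature values use simple types, so they are easy to compute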

In [1]:
from nltk.corpus import names
import random

In [8]:
import nltk
# nltk.download()

In [15]:
def gender_features(word):
    return {'last_letter':word[-1]}
gender_features('Shrek')

{'last_letter': 'k'}

In [3]:
# Note: this rebinds `names` from the corpus reader to a list of (name, gender) pairs
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)

In [9]:

# Feature extraction
featuresets = [(gender_features(n), g) for (n, g) in names]
train_set, test_set = featuresets[500:], featuresets[:500]

# Train the model
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Try the model on names that are not in the corpus
print(classifier.classify(gender_features("Neo")))
print(classifier.classify(gender_features("Trinity")))

In [11]:
# Evaluate the classifier on the held-out test set
nltk.classify.accuracy(classifier,test_set)

0.778
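
The figure reported by `nltk.classify.accuracy` is the fraction of test items whose predicted label matches the gold label. The sketch below reproduces that computation in plain Python. The `accuracy` helper, the `last_letter_rule` stand-in classifier, and the `toy_test` names are all hypothetical and made up for illustration; they are not part of NLTK or of the names corpus.

```python
def accuracy(classify, labeled_items):
    """Fraction of (item, gold) pairs for which classify(item) == gold."""
    correct = sum(1 for item, gold in labeled_items if classify(item) == gold)
    return correct / len(labeled_items)


# Hypothetical hand-written rule standing in for a trained classifier.
def last_letter_rule(name):
    return "female" if name[-1].lower() in "aei" else "male"


# Hypothetical held-out names, made up for illustration.
toy_test = [("Anna", "female"), ("Mark", "male"),
            ("Ruth", "female"), ("Luca", "male")]

print(accuracy(last_letter_rule, toy_test))
```

The rule gets "Anna" and "Mark" right and "Ruth" and "Luca" wrong, so half of the toy test set is classified correctly.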

In [13]:
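# Find which features are most effective for distinguishing names. In this run,
# names ending in 'a' are female about 37 times more often than male (36.8 : 1).
# The likelihood ratios make it easy to compare different feature-outcome relationships.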
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     36.8 : 1.0
             last_letter = 'k'              male : female =     31.9 : 1.0
             last_letter = 'p'              male : female =     18.8 : 1.0
             last_letter = 'f'              male : female =     14.7 : 1.0
             last_letter = 'v'              male : female =     10.6 : 1.0
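
Each ratio in this table compares how likely a feature value is under one label versus the other. A minimal sketch of that idea, using unsmoothed counts over a hypothetical toy list (`toy_names` is made up and is not the names corpus); NLTK's classifier smooths its probability estimates, so its ratios will generally differ from raw counts like these:

```python
from collections import Counter

# Hypothetical toy data, made up for illustration.
toy_names = [
    ("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
    ("Emma", "female"), ("Ruth", "female"), ("Alice", "female"),
    ("Luca", "male"), ("Mark", "male"), ("Jack", "male"),
    ("Erik", "male"), ("John", "male"), ("Peter", "male"),
]

label_counts = Counter(label for _, label in toy_names)
pair_counts = Counter((name[-1].lower(), label) for name, label in toy_names)


def likelihood(letter, label):
    """Unsmoothed estimate of P(last_letter = letter | label)."""
    return pair_counts[(letter, label)] / label_counts[label]


# How much more likely is a final 'a' among female names than among male names?
ratio = likelihood("a", "female") / likelihood("a", "male")
print("last_letter = 'a'   female : male = %.1f : 1.0" % ratio)
```

In the toy data four of six female names and one of six male names end in "a", so the ratio is 4 to 1.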


In [14]:
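from nltk.classify import apply_features
# apply_features returns a list-like object that builds feature sets lazily,
# which saves memory on large corpora
train_set = apply_features(gender_features, names[500:])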
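
To illustrate why a lazily built feature list saves memory, here is a rough sketch of the idea in plain Python. `LazyFeatureList` is a hypothetical class written for this note, not NLTK's implementation: it stores only the labeled tokens and computes each feature dictionary when an item is accessed. It supports integer indexing and iteration only.

```python
class LazyFeatureList:
    """List-like view that computes (features, label) pairs on access."""

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._tokens = labeled_tokens

    def __len__(self):
        return len(self._tokens)

    def __getitem__(self, index):
        # Raises IndexError past the end, which also ends a for-loop over the view.
        token, label = self._tokens[index]
        return (self._func(token), label)


def last_letter(word):
    return {"last_letter": word[-1]}


# Hypothetical toy input, made up for illustration.
toy = [("Neo", "male"), ("Trinity", "female")]
lazy = LazyFeatureList(last_letter, toy)
print(len(lazy), lazy[1])
```

No feature dictionary exists until an element is requested, so the memory cost stays close to that of the raw name list.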

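### Choosing the Right Features
- More features are not always better: with too many features the model tends to overfit a small dataset and generalize poorly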

In [6]:
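# A feature extractor that overfits gender features. The feature sets it returns
# contain a large number of very specific features, which leads to overfitting
# on the relatively small names corpus.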

def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
gender_features2('John')

{'count(a)': 0,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 0,
 'count(h)': 1,
 'count(i)': 0,
 'count(j)': 1,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 0,
 'count(n)': 1,
 'count(o)': 1,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 0,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 0,
 'count(x)': 0,
 'count(y)': 0,
 'count(z)': 0,
 'firstletter': 'j',
 'has(a)': False,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': False,
 'has(h)': True,
 'has(i)': False,
 'has(j)': True,
 'has(k)': False,
 'has(l)': False,
 'has(m)': False,
 'has(n)': True,
 'has(o)': True,
 'has(p)': False,
 'has(q)': False,
 'has(r)': False,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': False,
 'has(x)': False,
 'has(y)': False,
 'has(z)': False,
 'lastletter': 'n'}

In [11]:
featuresets = [(gender_features2(n), g) for (n,g) in names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)


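# Awkward: on this split the larger feature set actually scores higher
# than the last-letter extractor did earlier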

0.804

### Error Analysis

In [12]:
train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]
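
The three slices serve different purposes: the training set fits the model, the dev-test set is inspected during error analysis, and the test set is held back for a final evaluation. A small sketch of the same split as a reusable function; `three_way_split`, its `seed` parameter, and the toy data are hypothetical names introduced here, and the fixed seed is only there so the split is reproducible between runs:

```python
import random


def three_way_split(items, n_test, n_devtest, seed=0):
    """Shuffle a copy of items and cut it into (train, devtest, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test


# Hypothetical toy data, made up for illustration.
toy = [("name%d" % i, "male" if i % 2 else "female") for i in range(20)]
train, devtest, test = three_way_split(toy, n_test=4, n_devtest=6)
print(len(train), len(devtest), len(test))
```

Because the three parts are cut from one shuffled list, no item can appear in more than one of them.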


In [16]:
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, devtest_set) 

0.752

In [17]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

In [19]:
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=female   guess=male     name=Aimil                         
correct=female   guess=male     name=Alisun                        
correct=female   guess=male     name=Allison                       
correct=female   guess=male     name=Allyn                         
correct=female   guess=male     name=Allyson                       
correct=female   guess=male     name=Alyss                         
correct=female   guess=male     name=Anett                         
correct=female   guess=male     name=Annabal                       
correct=female   guess=male     name=Ardys                         
correct=female   guess=male     name=Aryn                          
correct=female   guess=male     name=Beret                         
correct=female   guess=male     name=Blair                         
correct=female   guess=male     name=Brandais                      
correct=female   guess=male     name=Calypso                       
correct=female   guess=male     name=Carlynn    
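
One way to scan an error list like this for patterns is to count the misclassified names by their two-letter suffix. The sketch below does this for the fifteen errors printed above; `suffix_counts` is a hypothetical name, and with so few names the counts are only suggestive.

```python
from collections import Counter

# The (correct, guess, name) triples shown in the error output above.
errors = [("female", "male", name) for name in [
    "Aimil", "Alisun", "Allison", "Allyn", "Allyson", "Alyss", "Anett",
    "Annabal", "Ardys", "Aryn", "Beret", "Blair", "Brandais", "Calypso",
    "Carlynn",
]]

# Count how often each two-letter suffix appears among the misclassified names.
suffix_counts = Counter(name[-2:].lower() for (_, _, name) in errors)

for suffix, count in suffix_counts.most_common(3):
    print("%-4s %d" % (suffix, count))
```

Suffixes that recur among the errors are candidates for new features, which is the motivation for adding a two-letter suffix to the extractor.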

In [21]:
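"""
Looking through this error list makes it clear that some multi-letter suffixes can
indicate gender. For example, names ending in "yn" are predominantly female, even
though names ending in "n" tend to be male; names ending in "ch" are usually male,
even though names ending in "h" tend to be female. So we adjust the feature
extractor to include two-letter suffix features.
"""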
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, devtest_set)

0.783