## Lecture 16 Homework

### Problem 1: Adjusting the stop-word list


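- Inspect the feature names `feature_names` that the fifth classifier in the courseware, `clf_5`, obtains after training. They include tokens such as
    - 13-year-old, 14-year-old, ...
    - 0x[a-f0-9]+
    - 1024x1024, 1024x1024x8, 1024x512, ...
    - 12-point, 12-step, 12-story
    - and so on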


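- Variant fragments like these (e.g. "year-old", "-step", "-story", "...and") may also be candidates for the stop-word list. Try to
    - find some of them
    - add them to the stop-word list
    - rerun the classification and analyze the results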
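
To see where tokens such as `13-year-old` come from, the following sketch applies the token pattern used by the classifiers in this notebook to a made-up sentence and groups the number-prefixed variants by their suffix. The sample sentence and the helper name `suffix_of` are assumptions for illustration, not part of the courseware.

```python
import re
from collections import defaultdict

# Token pattern used by the classifiers in this notebook
TOKEN_PATTERN = r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"

def suffix_of(token):
    """Return the part after a leading 'digits-' prefix, or None if there is none."""
    match = re.match(r"^\d+-([a-z][a-z\-]*)$", token)
    return match.group(1) if match else None

# Made-up sample text (an assumption for illustration)
text = "A 13-year-old and a 14-year-old followed a 12-step plan at 0x1f on a 1024x768 screen"
tokens = re.findall(TOKEN_PATTERN, text.lower())

# Group tokens such as '13-year-old' and '14-year-old' under their shared suffix
variants = defaultdict(list)
for token in tokens:
    suffix = suffix_of(token)
    if suffix is not None:
        variants[suffix].append(token)

print(tokens)
print(dict(variants))
```

Stop-word removal in scikit-learn's vectorizers compares whole tokens, so listing `year-old` on its own would not remove `13-year-old`; either each full variant has to be listed or the token pattern has to change.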

In [1]:
import IPython
import sklearn as sk
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import sem  # standard error of the mean

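### Define a function to extract candidate stop words
A word is kept if it meets all of the following conditions:
* it is longer than one character
* it is spelled correctly
* it is not already in the original stop-word list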
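
The three conditions can be sketched without the spell-checking dependency. In the sketch below a small hand-written set `known_words` stands in for the dictionary lookup, and the function name `filter_candidates` is hypothetical; the fragment-extraction regular expression is the one used in the next cell.

```python
import re

def filter_candidates(word_list, known_words, stop_words):
    """Return fragments that are known words and not yet stop words, in first-seen order."""
    output = []
    for word in word_list:
        # Letter/hyphen runs ending in a lowercase letter, so every fragment has length >= 2
        for fragment in re.findall(r"[a-zA-Z\-]+[a-z]", word):
            if (fragment in known_words
                    and fragment not in stop_words
                    and fragment not in output):
                output.append(fragment)
    return output

# Hypothetical inputs for illustration
known_words = {"the", "windows", "year", "old"}  # stand-in for a spell checker
stop_words = {"the"}
features = ["13-year-old", "abc123def", "the", "windows3.1", "x11"]

fragments = re.findall(r"[a-zA-Z\-]+[a-z]", "13-year-old")
candidates = filter_candidates(features, known_words, stop_words)
print(fragments)   # ['-year-old']
print(candidates)  # ['windows']
```

Note that a topical word such as `windows` satisfies all three conditions, so a filter of this kind can select far more than function words.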

In [2]:
import re
import enchant  # spell checking (pyenchant)

def get_stop_words(address='stopwords_en.txt'):
    """Read stop words from a txt file, one per line (function from the original courseware)."""
    with open(address, 'r') as f:
        return {line.strip() for line in f}

def Words_Filter(word_list):
    """Return the candidate stop words found in word_list, in first-seen order."""
    stop_words = get_stop_words()
    checker = enchant.Dict("en_US")
    output = []
    seen = set()  # fast membership test for fragments already accepted
    for word in word_list:
        # Letter/hyphen fragments ending in a lowercase letter (length >= 2)
        for fragment in re.findall(r'[a-zA-Z\-]+[a-z]', word):
            if fragment in seen or fragment in stop_words:
                continue
            if checker.check(fragment):  # keep correctly spelled fragments only
                seen.add(fragment)
                output.append(fragment)
    return output

### Load the 20 Newsgroups dataset

In [3]:
import sklearn.datasets as ds
news = ds.fetch_20newsgroups(subset='all')
SPLIT_PERC = 0.75
split_size = int(len(news.data)*SPLIT_PERC)
X_train = news.data[:split_size]    # samples: training set
X_test = news.data[split_size:]     # samples: test set
y_train = news.target[:split_size]  # targets: training set
y_test = news.target[split_size:]   # targets: test set
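
The split above is positional: the first 75% of the samples become the training set and the remainder the test set. A minimal sketch on a stand-in list (the helper `split_by_fraction` is hypothetical):

```python
def split_by_fraction(items, fraction):
    """Split a sequence positionally into (head, tail) at int(len(items) * fraction)."""
    split_size = int(len(items) * fraction)
    return items[:split_size], items[split_size:]

data = list(range(20))  # stand-in for news.data
train, test = split_by_fraction(data, 0.75)
print(len(train), len(test))  # 15 5
```

Because nothing is shuffled at this step, the split relies on the samples already being in random order; `fetch_20newsgroups` shuffles by default (`shuffle=True`).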

In [4]:
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

# Cross-validation evaluation function
def evaluate_cross_validation(clf, X, y, K):
    # Create a shuffled K-fold cross-validation iterator
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # Score the classifier on each fold
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))

stop_words = get_stop_words()
clf_5 = Pipeline([
    ('vect', TfidfVectorizer(
        stop_words=sorted(stop_words),  # a list, as newer scikit-learn versions expect
        token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB(alpha=0.01)),
])

### Score of the original classifier `clf_5`

In [5]:
evaluate_cross_validation(clf_5, news.data, news.target, 5)

[0.9204244  0.91960732 0.91828071 0.92677103 0.91854603]
Mean score: 0.921 (+/-0.002)
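
The summary line can be reproduced by hand from the five fold scores. The sketch below uses only the standard library and assumes the standard error is the sample standard deviation (with n - 1 in the denominator) divided by the square root of n, which is what `scipy.stats.sem` computes by default.

```python
import math
import statistics

# Fold scores printed above for clf_5
scores = [0.9204244, 0.91960732, 0.91828071, 0.92677103, 0.91854603]

def standard_error(values):
    """Sample standard deviation (n - 1 denominator) divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

summary = "{0:.3f} (+/-{1:.3f})".format(statistics.mean(scores), standard_error(scores))
print(summary)  # 0.921 (+/-0.002)
```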


In [6]:
clf_5.fit(X_train, y_train)
y_pred = clf_5.predict(X_test)

In [7]:
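# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
feature_names = clf_5.named_steps['vect'].get_feature_names_out()
new_list = Words_Filter(feature_names)
print('Original stop words:', len(stop_words))
print('Newly added stop words:', len(new_list))

Original stop words: 318
Newly added stop words: 42617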


### Write the newly added stop words to a txt file

In [8]:
with open('stopwords_en_add.txt', 'w') as file:
    for word in new_list:
        # Each entry contains only letters and hyphens, so it is written as-is
        file.write(word + '\n')

In [9]:
stop_words = get_stop_words()
stop_words_new = stop_words.copy()
with open('stopwords_en_add.txt', 'r') as file:
    for line in file:
        stop_words_new.add(line.strip())

In [10]:
clf_6_1 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=sorted(stop_words_new),  # a list, as newer scikit-learn versions expect
                token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",         
    )),
    ('clf', MultinomialNB(alpha=0.01)),
])
evaluate_cross_validation(clf_6_1, news.data, news.target, 5)


[0.86286472 0.86893075 0.86813478 0.86627753 0.86521624]
Mean score: 0.866 (+/-0.001)


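### Analysis
* The mean score drops from 0.921 to 0.866, which indicates that the stop words were poorly chosen.

#### Inspecting the txt file of extracted stop words suggests the following cause:
* Far too many words were selected (42617). A large share of them are words that help identify the topic of a text; putting them in the stop-word list removes them as evidence, which inevitably weakens the classifier.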
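
One way to check whether a candidate carries topical evidence is to look at how its documents are distributed over the classes. The sketch below is a heuristic on a made-up four-document corpus, not the method used in this notebook; the function name `class_concentration`, the `min_docs` cutoff, and the 0.5 threshold are all assumptions.

```python
from collections import Counter, defaultdict

def class_concentration(docs, labels, min_docs=2):
    """For each word found in at least min_docs documents, return the share of
    those documents that belong to the word's most frequent class."""
    per_word = defaultdict(Counter)
    for text, label in zip(docs, labels):
        for word in set(text.lower().split()):  # count each word once per document
            per_word[word][label] += 1
    return {
        word: max(counts.values()) / sum(counts.values())
        for word, counts in per_word.items()
        if sum(counts.values()) >= min_docs
    }

# Made-up corpus with two classes
docs = [
    "the gpu renders the image",
    "the image needs a gpu",
    "the pitcher threw the ball",
    "the ball hit the batter",
]
labels = ["graphics", "graphics", "baseball", "baseball"]

scores = class_concentration(docs, labels)
# Words spread evenly over the classes are the safer stop-word candidates
safe_candidates = sorted(w for w, s in scores.items() if s <= 0.5)
print(scores)
print(safe_candidates)  # ['the']
```

Words such as `gpu` or `ball` occur in a single class only (concentration 1.0), so removing them would discard exactly the evidence the classifier relies on.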

### Improvement:
* Use only the first 856 entries, which are clearly non-content words, as the new stop words

In [26]:
stop_words = get_stop_words()
stop_words_new = stop_words.copy()
with open('stopwords_en_add.txt', 'r') as file:
    # Keep only the first 856 entries, which are clearly non-content words
    for line in file.readlines()[0:856]:
        stop_words_new.add(line.strip())
print("Added", len(stop_words_new) - len(stop_words), "words as stop words.")

Added 856 words as stop words.


In [27]:
clf_6_2 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=sorted(stop_words_new),  # a list, as newer scikit-learn versions expect
                token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",         
    )),
    ('clf', MultinomialNB(alpha=0.01)),
])
evaluate_cross_validation(clf_6_2, news.data, news.target, 5)


[0.91777188 0.9169541  0.91668878 0.92544441 0.91668878]
Mean score: 0.919 (+/-0.002)


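### Analysis:
* The mean score (0.919) is slightly below the original (0.921) but well above that of `clf_6_1` (0.866).
* Some of the added stop words evidently still carried useful evidence for classification, so removing them lowers the score.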

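### Improvement:
* Suitable stop words are hard to pick by hand for two reasons: (1) the vocabulary in `feature_names` is very large, roughly 140,000 terms; (2) the boundary between stop words and words that are useful for classification is fuzzy.
* So the next attempt uses a larger, general-purpose stop-word list found online, one that is not tailored to this dataset.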

In [30]:
stop_words_long={'soon', 'wasnt', 'thereupon', 'hardly', 'herself', 'getting', 'means', 'against', 'nonetheless', 'per', 'recently', 'take', 'last', 'so', 'often', "shouldn't", 'during', 'around', 'therere', 'outside', 'selves', 'sent', 'whim', 'sec', 'significantly', 'same', 'end', 'whos', 'that', 'could', 'obtained', 'over', 'moreover', 'enough', 'five', 'g', 'through', 'particular', 'latterly', 'toward', 'willing', 'hid', 'and', 'both', 'm', 'biol', 'stop', 'act', 'thus', 'abst', 'thou', 'otherwise', 'id', 'anyone', 'somewhat', 'whereafter', 'self', 'theyre', 'y', "who'll", 'besides', 'j', 'whenever', 'important', 'nowhere', 'youd', 'about', 'within', 'thank', 'u', 'them', 'seemed', 'theres', 'affects', "i've", 'nearly', "didn't", 'if', 'regardless', 'without', 'her', 'inc', 'itd', 'miss', 'but', 'gave', 'this', 'an', 'ever', 'www', 'co', 'respectively', 'apparently', 'say', 'herein', 'tries', 'various', 'though', 'needs', 'specified', 'ts', 'yes', 'vs', 'anyways', 'she', 'awfully', 'nor', 'might', 'particularly', 'relatively', 'along', 'anyway', 'ord', 'beforehand', 'therein', 'due', 'regarding', 'old', 'couldnt', 'whatever', 'inward', 'que', 'let', 'present', 'need', 'amongst', 'show', 'unless', 'they', 'hereupon', 'know', 'sup', 'edu', 'unlike', 'thoughh', 'however', 'has', 'b', 'hers', 'invention', 'nay', 'readily', "there'll", 'below', 'taken', 'viz', 'twice', 'became', 'into', 'also', 'says', 'adj', 'each', 'rd', 'somebody', 'lately', 'possible', 'elsewhere', 'giving', 'him', 'sometimes', 'whether', 'especially', 'which', 'ourselves', 'beginnings', 'e', 'than', 'because', 'arise', 'whence', 'whereupon', "you'll", 'happens', 'similarly', 'between', 'pp', 'seven', 'almost', 'accordance', 'unfortunately', 'certain', 'whod', 'w', 'something', 'suggest', 'by', 'sorry', "you've", 'later', 'somethan', 'gone', 'pages', 'ed', "can't", 'way', 'further', 'me', 'when', 'of', "they'll", 'got', 'importance', 'obviously', 'myself', 'anybody', 'related', 'resulted', 'on', 
'merely', 'mostly', 'contain', 'give', 'necessary', 'ought', 'latter', 'past', 'un', 'noone', 'formerly', 'beside', 'research', 'sub', 'given', 'home', 'whereas', 'z', 'wouldnt', 'theyd', 'causes', "'ve", 'wed', 't', 'do', 'thereto', 'obtain', 'line', 'together', 'ending', 'usefully', 'makes', 'yourselves', 'meanwhile', 'is', 'although', 'million', 'in', 'shows', 'according', 'up', 'use', 'liked', 'your', 'no', 'beyond', 'available', 'what', 'et', 'another', 'very', 'whither', 'eight', 'behind', 'affecting', 'either', 'hi', "i'll", 'known', 'ask', 'as', 'he', 'plus', 'then', 'being', 'among', 'km', 'actually', 'provides', 'saying', 'doing', 'followed', 'make', 'truly', 'upon', 'page', 'whoever', 'refs', 'oh', 'etc', 'other', 'containing', 'results', 'others', 'largely', 'am', 'necessarily', 'hence', 'look', 'few', 'a', 'there', 'mrs', 'those', 'p', 'towards', 'vol', 'thru', 'im', 'thereafter', 'namely', 'quite', 'first', 'his', 'tends', 'n', 'shed', 'really', 'aside', 'index', 'youre', 'ah', 'f', 'did', 'world', 'thereof', 'words', 'lets', 'back', 'tell', 'two', 'we', 'meantime', 'com', 'six', "don't", 'for', 'contains', 'every', 'must', 'none', 'specify', 'thered', "we've", 'too', 'immediately', 'here', 'specifying', 'werent', 'showns', 'downwards', 'throughout', 'ml', 'former', 'near', 'name', 'came', 'should', 'who', 'announce', 'their', 'wont', 'anyhow', 'its', "haven't", 'immediate', 'uses', 'are', 'ok', 'ours', 'fix', 'predominantly', 'everything', 'itself', 'throug', 'aren', 'my', 'useful', "hasn't", 'added', 'heres', 'run', 'arent', 'down', 'thereby', 'poorly', 'any', 'na', 'the', 'always', 'wheres', 'til', 'certainly', 'come', 'affected', 'information', 'ref', 'specifically', 'hundred', 'anywhere', 'follows', 'still', 'whomever', 'previously', 'welcome', 'above', 'tried', 're', 'mr', 'okay', 'thats', 'everybody', 'mug', 'thanks', 'while', 'even', 'many', 'somewhere', 'eighty', 'ones', 'far', 'able', 'section', 'except', 'vols', 'l', 'alone', 'ran', 
'taking', 'fifth', 'where', 'these', 'auth', "they've", 'get', 'eg', 'et-al', 'yet', 'across', 'different', 'primarily', 'becomes', 'strongly', 'th', "that've", 'v', 'h', 'our', 'much', 'themselves', 'whom', 'took', 'new', 'potentially', 'likely', 'ex', 'several', 'less', 's', 'since', 'nobody', 'resulting', 'noted', 'similar', 'not', 'seeming', 'to', 'neither', 'off', 'out', 'maybe', 'may', 'furthermore', 'sufficiently', 'wants', 'approximately', "we'll", 'kept', 'himself', 'showed', 'had', 'whats', 'slightly', 'again', 'lest', 'knows', 'why', 'goes', 'least', 'afterwards', 'theirs', 'beginning', 'want', 'four', 'accordingly', 'following', 'more', 'some', 'saw', 'becoming', 'part', 'try', 'ie', 'made', 'anything', 'kg', 'ups', 'anymore', 'qv', 'until', 'briefly', "it'll", 'yours', 'cause', 'used', 'nine', 'one', 'thence', 'were', 'just', 'found', 'unlikely', 'gets', 'see', 'somehow', 'begin', 'own', 'widely', 'shes', 'yourself', 'go', 'have', 'mean', 'now', 'from', 'keep\tkeeps', 'wish', 'or', 'someone', 'looking', 'everyone', 'howbeit', 'rather', "what'll", 'become', 'normally', 'once', 'only', 'usefulness', 'be', 'said', 'hereby', 'usually', 'substantially', 'would', 'significant', "there've", 'wherever', 'probably', 'can', 'comes', 'it', 'seem', "'ll", 'o', 'how', 'been', 'k', 'possibly', 'shown', 'seems', 'next', 'hes', 'ninety', 'omitted', 'us', 'cannot', 'wherein', 'perhaps', 'you', 'everywhere', 'with', 'ltd', 'already', 'nothing', 'c', 'recent', 'hed', 'value', "doesn't", 'ff', 'overall', 'successfully', 'believe', 'nevertheless', 'having', 'never', 'thanx', 'most', 'owing', 'under', 'x', 'before', 'went', 'nd', 'think', 'all', 'was', 'onto', 'asking', 'at', 'thousand', 'seen', 'trying', 'quickly', 'mainly', "that'll", 'mg', 'whole', 'therefore', 'such', 'unto', 'brief', 'using', 'seeing', 'gotten', 'q', 'r', 'whose', 'after', 'little', 'non', "isn't", 'instead', 'done', 'shall', 'looks', 'else', 'hither', 'proud', 'hereafter', 'd', 'i', 'sometime', 
'placed', 'zero', 'indeed', 'nos', 'regards', 'via', 'date', 'ca', 'like', 'sure', 'please', 'effect', 'away', 'begins', 'gives', 'promptly', 'forth', 'put', 'tip', 'whereby', 'right', 'does', "she'll"}
print("The stop-word list downloaded from the web contains", len(stop_words_long), "stop words.")

The stop-word list downloaded from the web contains 666 stop words.


In [31]:
clf_6_3 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=sorted(stop_words_long),  # a list, as newer scikit-learn versions expect
                token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",         
    )),
    ('clf', MultinomialNB(alpha=0.01)),
])
evaluate_cross_validation(clf_6_3, news.data, news.target, 5)


[0.92175066 0.92013797 0.91960732 0.92517909 0.91960732]
Mean score: 0.921 (+/-0.001)


In [36]:
def Train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    print("Training accuracy:", clf.score(X_train, y_train))
    print("Test accuracy:", clf.score(X_test, y_test), '\n')
print("Accuracy of clf_5:")
Train_and_evaluate(clf_5, X_train, X_test, y_train, y_test)
print("Accuracy of clf_6_3:")
Train_and_evaluate(clf_6_3, X_train, X_test, y_train, y_test)

Accuracy of clf_5:
Training accuracy: 0.9969576906749682
Test accuracy: 0.9178692699490663 

Accuracy of clf_6_3:
Training accuracy: 0.9971699448139238
Test accuracy: 0.9199915110356537 



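### Analysis:
* A slight improvement over the original: test accuracy rises from 0.918 to 0.920, while the cross-validation mean stays at 0.921.
* Because this stop-word list is general-purpose rather than tailored to this dataset, the gain is small.

### The NLTK stopwords package does not help much either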

In [37]:
import nltk; nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\solit\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [38]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words_nltk = stopwords.words('english')
stop_words_nltk.extend(['from', 'subject', 're', 'use'])
clf_6_4 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=stop_words_nltk,
                token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",         
    )),
    ('clf', MultinomialNB(alpha=0.01)),
])
evaluate_cross_validation(clf_6_4, news.data, news.target, 5)

[0.9198939  0.919342   0.91748474 0.9265057  0.91881136]
Mean score: 0.920 (+/-0.002)


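### Problem 2: Tuning a classifier that uses `CountVectorizer`


- Review the parameter tuning applied to the text feature extractor `TfidfVectorizer()` in the courseware


- Use a `CountVectorizer()` object and adjust its parameters to try to improve the classifier
    - First tuning round
        - token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"
    - Second tuning round
        - token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"
        - stop_words: taken from the file 'data/stopwords_en.txt'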


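- Requirements
    - build the classifier object as a pipeline
    - evaluate it with cross-validation
    - take the highest-scoring classifier and validate it on the training and test sets
    - print the classification report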
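
The difference between the two feature extractors can be sketched in plain Python: a count vector stores raw term counts, while a tf-idf vector reweights them by how rare each term is across documents. The sketch assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization that `TfidfVectorizer` applies by default; the helper names and the three-document corpus are made up.

```python
import math
from collections import Counter

def count_vectors(docs):
    """Raw term counts per document (whitespace tokenization for simplicity)."""
    return [Counter(doc.split()) for doc in docs]

def tfidf_vectors(docs):
    """Term counts reweighted by smoothed idf, then L2-normalized per document."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc_counts in counts for term in doc_counts)
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}
    vectors = []
    for doc_counts in counts:
        weights = {term: tf * idf[term] for term, tf in doc_counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["the cat sat", "the dog sat", "the bird flew"]
first_counts = count_vectors(docs)[0]
first_tfidf = tfidf_vectors(docs)[0]
print(dict(first_counts))
print(first_tfidf)
```

In the first document every term has count 1, but after tf-idf weighting `cat` (one document) outweighs `sat` (two documents), which outweighs `the` (all three).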

In [32]:
def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    print("Training accuracy:")
    print(clf.score(X_train, y_train))
    print("Test accuracy:")
    print(clf.score(X_test, y_test))
    y_pred = clf.predict(X_test)
    print("Classification report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))

### First tuning round
* token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"

In [77]:
clf_7 = Pipeline([
    ('vect', CountVectorizer(token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")),
    ('clf', MultinomialNB()),  # Laplace smoothing with the default alpha=1.0
])
evaluate_cross_validation(clf_7, news.data, news.target, 5)

[0.88222812 0.87795171 0.8694614  0.88113558 0.85911382]
Mean score: 0.874 (+/-0.004)


### Second tuning round
* token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"
* stop_words: taken from the file 'data/stopwords_en.txt'

In [79]:
stop_words = get_stop_words()
clf_8 = Pipeline([
    ('vect', CountVectorizer(
                             stop_words=sorted(stop_words),  # a list, as newer scikit-learn versions expect
                             token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"
    )),
    ('clf', MultinomialNB()),  # Laplace smoothing with the default alpha=1.0
])
evaluate_cross_validation(clf_8, news.data, news.target, 5)

[0.89681698 0.89493234 0.88829928 0.8991775  0.88166622]
Mean score: 0.892 (+/-0.003)
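
`MultinomialNB()` is used here with its default smoothing parameter `alpha=1.0`, whereas `clf_5` in Problem 1 used `alpha=0.01`. The sketch below shows how additive smoothing changes the estimated probability of a word never seen in a class; the counts and the function name `smoothed_likelihood` are hypothetical.

```python
def smoothed_likelihood(word, class_counts, vocabulary_size, alpha):
    """Additive-smoothing estimate of P(word | class)."""
    total = sum(class_counts.values())
    return (class_counts.get(word, 0) + alpha) / (total + alpha * vocabulary_size)

# Hypothetical word counts for one class; the vocabulary is {ball, bat, gpu}
counts = {"ball": 3, "bat": 1}
vocabulary_size = 3

p_alpha_1 = smoothed_likelihood("gpu", counts, vocabulary_size, alpha=1.0)
p_alpha_001 = smoothed_likelihood("gpu", counts, vocabulary_size, alpha=0.01)
print(p_alpha_1)    # 1/7
print(p_alpha_001)  # 0.01/4.03
```

A larger `alpha` gives unseen words more probability mass. Since these count-based pipelines differ from `clf_5` in both the vectorizer and `alpha`, the score gap may not be due to the vectorizer alone.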


### Validate on the training and test sets and print the classification report

In [80]:
train_and_evaluate(clf_8, X_train, X_test, y_train, y_test)

Training accuracy:
0.9707089288241121
Test accuracy:
0.890704584040747
Classification report:
              precision    recall  f1-score   support

           0       0.91      0.87      0.89       216
           1       0.70      0.89      0.78       246
           2       0.97      0.61      0.75       274
           3       0.73      0.89      0.80       235
           4       0.89      0.91      0.90       231
           5       0.86      0.91      0.89       225
           6       0.90      0.74      0.81       248
           7       0.92      0.91      0.91       275
           8       0.96      0.96      0.96       226
           9       0.97      0.96      0.96       250
          10       0.98      0.99      0.99       257
          11       0.90      0.98      0.94       261
          12       0.91      0.88      0.90       216
          13       0.96      0.94      0.95       257
          14       0.92      0.95      0.93       246
          15       0.84      0.94      0.89       234
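
The f1-score column is the harmonic mean of the precision and recall columns. As a quick check, the sketch below recomputes it for classes 1 and 2 from the rounded values shown in the report (the helper name `f1_from` is hypothetical; because the inputs are already rounded, agreement to two decimals is not guaranteed for every row).

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) as printed in the report for classes 1 and 2
rows = {1: (0.70, 0.89), 2: (0.97, 0.61)}
recomputed = {label: round(f1_from(p, r), 2) for label, (p, r) in rows.items()}
print(recomputed)  # {1: 0.78, 2: 0.75}
```

Class 2 illustrates the trade-off: precision is very high (0.97) but recall is low (0.61), so its f1-score stays at 0.75.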
    

### Question for reflection (no submission required)


- Survey tools, modules, and case studies for Chinese text classification (Python only)

### Flipped classroom (self-study)


- Pipeline
- Regular expressions