# 第八章 应用机器学习于情感分析
本章包含以下内容：
·	清洗和准备文本数据
·	根据文本数据建立特征向量
·	训练机器学习模型来区分正面或者负面评论
·	用基于外存的学习方法来处理大型文本数据集
·	根据文档推断主题进行分类

In [1]:
#使用IMDB电影评论数据集，来源于http://ai.stanford.edu/~amaas/data/sentiment
import pandas as pd
import numpy as np

In [2]:
movie_data = pd.read_csv('./data/movie_data.csv', header=0)
np.random.seed(10)
movie_data = movie_data.reindex(np.random.permutation(movie_data.index))
movie_data = movie_data.reset_index()
movie_data.head()

Unnamed: 0,index,review,sentiment
0,27632,i was greatly moved when i watched the movie.h...,1
1,36119,"The warmest, most engaging movie of its genre,...",1
2,4796,"Wow, I think the overall average rating of thi...",1
3,3648,"I liked the quiet noir of the first part, the ...",1
4,24501,This is another Sci-Fi channel original movie ...,0


### 词袋模型

In [3]:
#统计每个单词出现的频率计算次袋模型
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array(
    ["I love my wife bin bin", "I love she so much!", "I can't lose my wife", "I want stay with you forever，bin bin"])
bag = count.fit_transform(docs)
print(count.vocabulary_)
print(bag.toarray())

{'love': 4, 'my': 6, 'wife': 11, 'bin': 0, 'she': 7, 'so': 8, 'much': 5, 'can': 1, 'lose': 3, 'want': 10, 'stay': 9, 'with': 12, 'you': 13, 'forever': 2}
[[2 0 0 0 1 0 1 0 0 0 0 1 0 0]
 [0 0 0 0 1 1 0 1 1 0 0 0 0 0]
 [0 1 0 1 0 0 1 0 0 0 0 1 0 0]
 [2 0 1 0 0 0 0 0 0 1 1 0 1 1]]


特征向量的每个索引位置对应于词汇存储在CountVectorizer字典项中的整数值。例如，索引位置0上的第一个特征等同于单词'bin'的词频，它只出现在最后一个和第一个文档中。这些值的特征向量也叫原词频率：tf（t，d），即t项在文档中出现的次数d。count.vocabulary_中每个单词对应的数字是该单词在特征向量中的位置索引。

### 通过词频逆反文档频率评估单词的相关性

![](picture/评估单词的相关性.png)

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True,
                         norm='l2',
                         smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())
#tfidf,就是可以把在所有文档中都大量出现的词的权重给降低了，比如这里的"I"

[[0.76 0.   0.   0.   0.38 0.   0.38 0.   0.   0.   0.   0.38 0.   0.  ]
 [0.   0.   0.   0.   0.41 0.53 0.   0.53 0.53 0.   0.   0.   0.   0.  ]
 [0.   0.56 0.   0.56 0.   0.   0.44 0.   0.   0.   0.   0.44 0.   0.  ]
 [0.58 0.   0.37 0.   0.   0.   0.   0.   0.   0.37 0.37 0.   0.37 0.37]]


![](picture/tfidf的l2标准化.png)

### 清洗文本数据
Porter分词算法,返回单词的词根

In [22]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]


tokenizer_porter("runers like runing and thus they run")


def tokenizer(text):
    return text.split()


tokenizer_porter("runers like runing and thus they run")

['runer', 'like', 'rune', 'and', 'thu', 'they', 'run']

In [17]:
#去除停用词
import nltk

nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /Users/apple/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [18]:
stop = stopwords.words("english")
[w for w in tokenizer_porter("a runer likes running and runs a lot"
                             ) if w not in stop]

['runer', 'like', 'run', 'run', 'lot']

In [42]:
#stop=stopwords.words("chinse"),没有中文
#训练逻辑回归模型
movie_data = movie_data[["review", "sentiment"]]
X_train = movie_data.loc[:25000, 'review'].values
y_train = movie_data.loc[:25000, 'sentiment'].values
X_test = movie_data.loc[25000:, 'review'].values
y_test = movie_data.loc[25000:, 'sentiment'].values
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

In [59]:
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [str.split, str.split],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

KeyboardInterrupt: 

In [57]:
#最优的参数
print("Best parameter set :%s" % gs_lr_tfidf.best_params_)
#分类准确率
print('CV Accuracy:%.3f' % gs_lr_tfidf.bset_score)
#测试集上的准确率
clf = gs_lr_tfidf.best_estimator_
print("Test Accuracy:%.3f" % clf.score(X_test, y_test))

['The',
 'warmest,',
 'most',
 'engaging',
 'movie',
 'of',
 'its',
 'genre,',
 'Those',
 'Lips,',
 'Those',
 'Eyes,',
 'made',
 'me',
 'smile',
 'and',
 'cry',
 'as',
 'it',
 'reminded',
 'me',
 'of',
 'the',
 'work',
 'it',
 'takes',
 'to',
 'pursue',
 'a',
 'dream',
 'and',
 'the',
 'pain',
 'of',
 'disappointment.',
 'Hulce',
 'and',
 'Langella',
 'are',
 'superb',
 'and',
 'the',
 'story',
 'seems',
 'to',
 'write',
 'itself.',
 'A',
 'brilliant',
 'screenplay',
 'by',
 'David',
 'Shaber',
 '(one',
 'of',
 'my',
 'favorites!',
 '-',
 'see',
 'The',
 'Warriors',
 'and',
 'Nighthawks',
 'for',
 'more...)',
 'and',
 'beautiful',
 'sets',
 'filmed',
 'on',
 'location',
 '(I',
 'think)',
 'at',
 'the',
 'actual',
 'summer',
 'theater',
 'in',
 'which',
 'the',
 'story',
 'takes',
 'place.',
 'You',
 "can't",
 'see',
 'this',
 'movie',
 'and',
 'not',
 'want',
 'to',
 'drop',
 'everything',
 'and',
 'get',
 'into',
 'the',
 'theater!',
 'Please',
 'check',
 'this',
 'video',
 'out',
 'i

### 处理更大的数据集
在线算法和核心学习
从上一节执行过的代码示例中你可能已经注意到，在做网格搜索时，为50 000个电影评论数据集构造特征向量的代价可能非常昂贵。在许多实际应用中，超出计算机内存的规模处理更大数据集的情况并不少见。但并不是每个人都能访问超级计算机，所以现在将采用一种被称为“核外学习”的技术，这种技术可以通过对数据集的小批增量来模拟分类器完成大型数据集的处理工作。本节将用scikit-learn的SGDClassifier的partial_fit函数从本地驱动器直接获取流式文件，并用文件的小批次文档训练逻辑回归模型。

In [1]:
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')


def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label  #生成器，每调用一次他就会产生一行数据，和标签

In [2]:
next(stream_docs(path="data/movie_data.csv"))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

In [3]:
def get_minibacth(doc_strem, size):
    docs, y = [], []
    try:
        for s in range(size):
            text, label = next(doc_strem)
            docs.append(text)
            y.append(label)
    except StopIteration:
        docs, y = None, None
    return docs, y

不幸的是，因为需要把全部的单词保存在内存，所以无法调用CountVectorizer函数做核心学习。另外，TfidfVectorizer需要把训练集的所有特征向量保存在内存，以计算逆文档频率。然而，scikit-learn实现的另一个有用的向量化工具是HashingVectorizer，该工具用于文本处理并且独立于数据，调用由奥斯汀·爱珀白提出的基于哈希技术的32位MurmurHash3函数（https://sites.google.com/site/murmurhash/）

In [4]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vec=HashingVectorizer(decode_error='ignore',
                      n_features=2*21,
                      preprocessor=None,
                      tokenizer=tokenizer)
clf=SGDClassifier(loss='log',random_state=1,n_iter_no_change=1)
doc_strem=stream_docs('data/movie_data.csv')

In [5]:
import pyprind
import numpy as np
pbar= pyprind.ProgBar(45)
from tqdm._tqdm_notebook import tqdm_notebook
classs=np.array([0,1])
for _ in tqdm_notebook(range(45)):
    X_train,y_train=get_minibacth(doc_strem,1000)
    if not X_train:
        break
        print(_)
    X_train=vec.transform(X_train)
    clf.partial_fit(X_train,y_train,classes=classs)
    pbar.update()



Please use `tqdm.notebook.*` instead of `tqdm._tqdm_notebook.*`
  from tqdm._tqdm_notebook import tqdm_notebook


  0%|          | 0/45 [00:00<?, ?it/s]

In [6]:
X_test,y_test=get_minibacth(doc_strem,size=50)
X_test=vec.transform(X_test)

In [7]:
print("Accuracy:%3.f"%clf.score(X_test,y_test))

Accuracy:  1


In [8]:
clf=clf.partial_fit(X_test,y_test)

### 文本LDA主题建模
![](picture/LDA.png)

In [10]:
import pandas as pd
df=pd.read_csv("data/movie_data.csv",encoding='utf-8')
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
count=CountVectorizer(stop_words="english",
                      max_df=0.1,#最大词频设置为0.1，删除那些词频很大的单词
                      max_features=5000)#最大的特征数量是5000，只做最常出现的5000个单词
X=count.fit_transform(df['review'].values)

In [15]:
from sklearn.decomposition import LatentDirichletAllocation
lda=LatentDirichletAllocation(n_components=10,
                              random_state=123,
                              learning_method="batch")#通过设置参数learning_method='batch'，让lda评估器在一次迭代中根据所有可用的训练数据（词袋矩阵）进行估计，这比在线学习方法慢，但可能会带来更准确的预测结果（设置参数learning_method='online'，则与第2章以及本章讨论的在线或小批量学习类似）。

In [16]:
X_topics=lda.fit_transform(X)

#### scikit-learn的LDA实现采用期望最大化（EM）算法来迭代更新参数估计。本章尚未讨论EM算法，但如果想了解更多信息，请参阅维基百科上的精彩概述（https://en.wikipedia.org/wiki/Expectation-maximization_algorithm）以及科罗拉多里德讲解的如何使用LDA的详细教程《LDA：迈向更深入的了解》，该教程可以免费从下述网站获得：http://obphio.us/pdfs/lda_tutorial.pdf。拟合LDA之后，现在可以访问lda实例的components_属性，该属性存储包含每10个主题的按升序排列的单词重要性（此处为5000）的矩阵。

In [17]:
lda.components_.shape
#每个主题有5000个按重要程度排序的单词，单词数量就是特征数量

(10, 5000)

In [19]:
n_top_words=5
feature_names=count.get_feature_names()#5000个单词，即是特征名字
for idx,topic in enumerate(lda.components_):#5000个单词对应的tf-idf
    print("Topics: %d"%(idx+1))
    print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]]))#排序后的倒数五个单词,因为默认是升序

Topics: 1
worst minutes awful script stupid
Topics: 2
family mother father children girl
Topics: 3
american war dvd music tv
Topics: 4
human audience cinema art sense
Topics: 5
police guy car dead murder
Topics: 6
horror house sex girl woman
Topics: 7
role performance comedy actor performances
Topics: 8
series episode war episodes tv
Topics: 9
book version original read novel
Topics: 10
action fight guy guys cool




In [33]:
X_topics#50000条影评，被分为10类电影的概率

array([[0.0011114 , 0.19959939, 0.00111146, ..., 0.00111146, 0.00111178,
        0.00111127],
       [0.9857103 , 0.00158776, 0.00158769, ..., 0.00158763, 0.0015877 ,
        0.00158774],
       [0.38619934, 0.43091153, 0.106478  , ..., 0.00119069, 0.00119094,
        0.00119079],
       ...,
       [0.44586093, 0.0027788 , 0.00277831, ..., 0.00277824, 0.10483994,
        0.09596916],
       [0.00384681, 0.53244645, 0.00384701, ..., 0.00384698, 0.00384675,
        0.00384695],
       [0.00833551, 0.48130427, 0.00833415, ..., 0.00833491, 0.00833449,
        0.45201744]])

In [34]:
#选出几段对第六个主题的电影的描述
horror=X_topics[:,5].argsort()[::-1]#倒序，就是最有可能分为恐怖电影的影评的概率从大到小
for idx,mo in enumerate(horror[:3]):
    print("\nHorror movie review #%d"%(idx+1))
    print(df["review"][mo][:300],'...')


Horror movie review #1
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie review #2
Okay, what the hell kind of TRASH have I been watching now? "The Witches' Mountain" has got to be one of the most incoherent and insane Spanish exploitation flicks ever and yet, at the same time, it's also strangely compelling. There's absolutely nothing that makes sense here and I even doubt there  ...

Horror movie review #3
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...


### 本章学习了如何利用机器学习算法根据文本文档的倾向性对其分类，这是NLP领域情感分析的基本任务。本章不仅学习了如何使用词袋模型将文档编码为特征向量，而且还学习了如何使用tf-idf通过相关性对词频进行加权。由于在该过程中创建了大量特征向量，处理这样的文本数据在计算上可能非常昂贵。上一节学习了如何利用外部存储或增量学习来训练机器学习算法，而不需要将整个数据集加载到计算机内存。最后引入了LDA主题建模的概念，以无监督的方式将电影评论分为不同的类别。下一章中我们将使用自己实现的文档分类器并学习如何将其嵌入到网络应用。