この章では**自然言語処理(Natural Language Processing:NLP)**の一分野である**感情(センチメント)分析(sentiment analysis）**を取り上げる。
極性(polarity)に基づいて文章を分類する。また以下の内容と取り上げる。
- テキストデータのクレンジングと準備
- テキスト文書からの特徴ベクトルの構築
- 映画レビューを肯定的な文と否定的な文に分類する機械学習のモデルのトレーニング
- アウトオブコア学習に基づく大規模なテキストデータセットの処理
- 文章コレクションからカテゴリのトピックを推定する。

## 8.1 IMDbの映画レビューデータセットでのテキスト処理
感情分析は**意見マイニング(opinion mining)**と呼ばれる

### 8.1.1 映画レビューデータセットを取得する。
http://ai.stanford.edu/~amaas/data/sentiment/

### 8.1.2 映画レビューデータセットをより便利なフォーマットに変換する。
映画レビューをpandasのDataFrameオブジェクトに読み込む。(約10分)進捗状況と完了までの推定時間を**PyPrind(Python Progress INDIcator)**パッケージを使用することにより、確認する事ができる。

In [20]:
import pyprind
import pandas as pd
import os

In [1]:
'''
# 'basepath'の値を展開したレビューデータセットのディレクトリに書き換える
basepath = 'C:/Users/zundo/Desktop/aclImdb'
labels = {'pos':1, 'neg':0}
pbar = pyprind.ProgBar(50000)
df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            df = df.append([[txt, labels[l]]], ignore_index = True)
            pbar.update()
            
df.columns = ['review', 'sentiment']
'''

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:02:44


データセットに組み込まれているクラスラベルはソート済みであるため、np.randomサブモジュールのpermutation関数を使って行の順番をシャッフルしたDataFrameオブジェクトを作成する。こうすると8.3章でデータをローカルドライブから直接ストリーミングするときに、トレーニングデータセットとテストデータセットに分割するにに役立つ。

作業を行いやすくすために、ひとまとめにしたうえでシャッフルした映画レビューデータセットをCSVファイルに保存する。

In [21]:
#import numpy as np
#np.random.seed(0)
#df = df.reindex(np.random.permutation(df.index))
#df.to_csv('movie_data.csv', index = False, encoding= 'utf-8')

NameError: name 'df' is not defined

In [22]:
df = pd.read_csv('movie_data.csv', encoding = 'utf-8')
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


## 8.2 Bowモデルの紹介
文章や単語などのカテゴリデータは、数値に変換しておく。テキストを数値の特徴ベクトルとして表現できる**BoW(bag-ofWorks)**モデルを紹介する。
1. 文章の集合全体から、たとえば単語という一意な**トークン(token)**からなる**語彙(vacaburary)**を作成する。
1. 各文書での各単語の出現回数を含んだ特徴ベクトルを構築する。

### 8.2.1 単語を特徴ベクトルに変換する

In [23]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)
# print(bag)

In [24]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [25]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


### 8.2.2 TF-IDFを使って単語の関連性を評価する
テキストデータを解析していると各クラスに分類される複数の文書において、同じ単語が出現する。そういいた頻繁に出現する単語はたいてい意味のある情報を含んでない。ｓこで**TF-IDF(Term Frequency-Inverse Document Frequency)**という手法を使用することで、そういった単語の重みを減らすことができる。

In [26]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf = True,
                        norm = 'l2', 
                        smooth_idf = True)

np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


### 8.2.3　テキストデータのクレンジング
不要な文字を取り除くことによって、テキストデータをクレンジング(洗浄）することが最初の重要な手順である。

In [27]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

感情分析に確実に役立つ**顔文字(emoticon)**だけを残し、それ以外の句読点は削除する。

In [28]:
import re
def preprocessor(text):
    # HTMLマークアップを削除
    text = re.sub('[^>]*>','',text) 
    
    # 顔文字を検索しemoticonsに格納
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|p)',text) 
    
    # 単語の一部ではない文字を削除し、テキストを小文字に変換する
    text = (re.sub('[\W]+',' ', text.lower()) + ''.join(emoticons).replace('-',''))
    return text

In [29]:
preprocessor(df.loc[0, 'review'][-50:])

'title brazil not available'

In [30]:
preprocessor("</a>This :) is :( atest :-)!")

'this is atest :):(:)'

In [31]:
df['review'] = df['review'].apply(preprocessor)

### 8.4.2 文章をトークン化する
文章をトークン化する1つの方法は、クレンジングした文章を空白文字(スペース、タブ、改行、リターン、改ページ）で区切り、個々の単語に分割することである。

In [32]:
def tokenizer(text):
    return text.split()

tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

トークン化の便利な手法の1つに、**ワードステミング(word stemming)**がある。これは単語を原型に変換することで、関連する単語を同じ語幹にマッピングすることである。

In [33]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

機械学習をする前に**ストップワードの除去(stop-word removal)**を行う。ストップワードとはあらゆる種類のテキストでみられるごくありふれた単語のことである。

In [34]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\zundo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [35]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer('a runner likes running and runs a lot')[-10:]
    if w not in stop]

['runner', 'likes', 'running', 'runs', 'lot']

### 8.2.5 文章を分類するロジスティック回帰モデルのトレーニング

In [36]:
X_train = df.loc[:25000,'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

In [37]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

In [38]:
tfidf = TfidfVectorizer(strip_accents=None,
                       lowercase = False,
                       preprocessor = None)
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
              'clf__penalty': ['l1', 'l2'],
              'clf__C': [1.0, 10.0, 100.0]},
             {'vect__ngram_range': [(1, 1)],
             'vect__stop_words': [stop, None],
             'vect__tokenizer': [tokenizer, tokenizer_porter],
             'vect__use_idf': [False],
             'vect__norm': [None],
             'clf__penalty': ['l1', 'l2'],
             'clf__C': [1.0, 10.0, 100.0]}]

lr_tfidf = Pipeline([('vect', tfidf),
                    ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf,
                          param_grid,
                          scoring = 'accuracy',
                          cv = 5, verbose = 1,
                          n_jobs = 1)

In [39]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed: 155.2min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's...se_idf': [False], 'vect__norm': [None], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}],
       pre_dispatch='2*n_jobs', refit=True, return_tra

In [48]:
print('Best parameter set : %s' % gs_lr_tfidf.best_params_)

Best parameter set : {'clf__C': 1.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x000001F11D908378>}


In [49]:
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

CV Accuracy: 0.817


In [50]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.822


In [46]:
from sklearn.externals import joblib

joblib.dump(gs_lr_tfidf,'gs_lr_tfidf.pkl')

['gs_lr_tfidf.pkl']

In [47]:
# model_new = joblib.load('model.pkl')

## 8.3 さらに大規模のデータの処理：オンラインアルゴリズムとアウトオグコア学習
コンピューターのメモリに収まらないほどの大規模なデータセットの処理を可能にするために**アウトオグコア学習(out-of-core learning)**という手法を適用する。データセットの小さなバッチを使って随時的に適合させる。

In [8]:
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')
def tokenizer(text):
    text = re.sub('[^>]*>','',text) 
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|p)',text.lower()) 
    text = (re.sub('[\W]+',' ', text.lower()) + ''.join(emoticons).replace('-',''))
    tokenizer = [w for w in text.split() if w not in stop]
    return tokenizer

In [9]:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)# ヘッダーを読み飛ばす
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [10]:
next(stream_docs(path='movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

In [14]:
# stream_docs関数から文章ストリームを受け取り、size引数により指定された文章の個数を返す
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append (text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

In [15]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(decode_error = 'ignore',
                        n_features = 2**21,
                        preprocessor = None,
                        tokenizer = tokenizer) # tokenizerでHashingVectorizerを初期化
clf = SGDClassifier(loss = 'log', # ロジスティク分類器を初期化
                   random_state = 1,
                   n_iter = 1)
doc_stream = stream_docs(path = 'movie_data.csv')

In [16]:
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size = 1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()



0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:38


In [17]:
X_test,y_test = get_minibatch(doc_stream, size =5000)
X_test = vect.transform(X_test)
print('Accuracy : %.3f' % clf.score(X_test, y_test))

Accuracy : 0.805


In [18]:
from sklearn.externals import joblib

joblib.dump(clf,'lr.pkl')

['lr.pkl']

### 8.4.1 潜在ディリクレ配分を使ってテキスト文章を分解する
潜在ディクレ配分（LDA)は生成的確率モデルであり、さまざまな文章に同時に出現する一連の単語を見つけ出そうとする。出現頻度の高い単語はトピックを表す。LDAへの入力はBoWモデルである。
- 文章からトピックへの行列
- 単語からトピックへの行列

LDAによるBoW行列の分解は、分解された2つの行列を掛け合わせた場合に元のBoW行列再現できるような方法で行われる。

### 8.4.2 scikit-learnの潜在ディリクレ配分

In [63]:
import pandas as pd
df = pd.read_csv('movie_data.csv', encoding='utf-8')

In [66]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english',
                       max_df = .1,
                       max_features = 5000)
X = count.fit_transform(df['review'].values)

In [67]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_topics = 10,
                               random_state = 123,
                               learning_method = 'batch')
X_topics = lda.fit_transform(X)



LDAの適合後は、LDAのcomponents_属性にアクセスできる。この属性には10種のトピック(昇順)ごとに、単語の重要ど（5000個）を含んだ行列が含まれている。

In [68]:
lda.components_.shape

(10, 5000)

In [69]:
lda.components_

array([[8.56e+01, 1.00e+02, 3.34e+02, ..., 4.44e+02, 2.83e+02, 3.19e+01],
       [3.46e+01, 1.01e+01, 6.42e+01, ..., 1.00e-01, 1.00e-01, 4.50e+00],
       [1.86e+01, 1.61e+02, 1.34e+02, ..., 1.00e-01, 1.00e-01, 5.68e+00],
       ...,
       [1.16e-01, 1.87e+01, 1.19e+01, ..., 1.00e-01, 1.00e-01, 1.96e+02],
       [7.51e+00, 2.04e+01, 5.45e+01, ..., 1.00e-01, 1.00e-01, 1.22e-01],
       [1.00e-01, 3.26e+01, 1.08e+02, ..., 1.00e-01, 1.00e-01, 3.04e+00]])

In [73]:
n_top_words = 5
feature_names = count.get_feature_names()
for topic_idx,topic in enumerate(lda.components_):
    print('Topic %d:' % (topic_idx +1))
    print(", ".join([feature_names[i]
                 for i in topic.argsort() \
                 [:-n_top_words-1:-1]]))

Topic 1:
worst, minutes, awful, script, stupid
Topic 2:
family, mother, father, children, girl
Topic 3:
american, war, dvd, music, tv
Topic 4:
human, audience, cinema, art, sense
Topic 5:
police, guy, car, dead, murder
Topic 6:
horror, house, sex, girl, woman
Topic 7:
role, performance, comedy, actor, performances
Topic 8:
series, episode, war, episodes, tv
Topic 9:
book, version, original, read, novel
Topic 10:
action, fight, guy, guys, cool


1. Generally bad movies
1. Movies about familiy
1. War Movies
1. Art Movies
1. Crime movies
1. Horror movies
1. Comedy movies
1. Movies someshow related to TV shows
1. Movies based on books
1. Action movies

In [74]:
horror = X_topics[:, 5].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx +1))
    print(df['review'][movie_idx][:300],'...')


Horror movie #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie #2:
Okay, what the hell kind of TRASH have I been watching now? "The Witches' Mountain" has got to be one of the most incoherent and insane Spanish exploitation flicks ever and yet, at the same time, it's also strangely compelling. There's absolutely nothing that makes sense here and I even doubt there  ...

Horror movie #3:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...


In [76]:
horror = X_topics[:, 4].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nCrime movie #%d:' % (iter_idx +1))
    print(df['review'][movie_idx][:300],'...')


Crime movie #1:
**SPOILERS** Extremely brutal police drama set in San Francisco involving a sting operation that goes terribly wrong. A cop Det. Falon, Sam Elliott,mistakenly and savagely beats to death an undercover policeman Winch, Mike Watson,thinking that he murdered his partner Det. Sam Levinson, Mike Burstyn. ...

Crime movie #2:
Two stars <br /><br />Amanda Plummer looking like a young version of her father, Christopher Plummer in drag, stars in this film along with Robert Forster--who really should have put a little shoe black on top of that bald spot.<br /><br />I've never seen Amanda Plummer in a good film. She always pl ...

Crime movie #3:
A film without conscience. Drifter agrees to kill a man for a mobster for money. Then they double cross him. Meanwhile he falls in love with the dead man's wife, and, without her knowing he's the killer, moves in with her. Then he "accidentally" kills her when she finds out. Then, in a WALKING TALL  ...
