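This notebook covers sentiment analysis, a subfield of natural language processing.<br>
We work with a dataset of 50,000 IMDb movie reviews and<br>
build a predictor that can classify each review as positive or negative.<br>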

The following topics are covered.<br>

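<ul>
    <li>Cleaning and preparing text data</li>
    <li>Building feature vectors from text documents</li>
    <li>Training a machine learning model to classify movie reviews as positive or negative</li>
    <li>Working with large datasets using out-of-core learning</li>
    <li>Inferring topics from a document collection for categorization</li>
</ul>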

### Below, we classify reviews in the movie review dataset as positive or negative.

### Converting the movie dataset into a more convenient format

In [1]:
import pyprind
import pandas as pd
import os
"""
basepath = '../../../Downloads/aclImdb'
labels = {'pos':1, 'neg':0}
pbar = pyprind.ProgBar(50000)
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            # collect rows in a list; DataFrame.append was removed in pandas 2.0
            rows.append([txt, labels[l]])
            pbar.update()

df = pd.DataFrame(rows, columns=['review', 'sentiment'])
"""


In [2]:
"""
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
"""

In [3]:
# df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [4]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

Unnamed: 0,review,sentiment
0,My family and I normally do not watch local mo...,1
1,"Believe it or not, this was at one time the wo...",0
2,"After some internet surfing, I found the ""Home...",0


### Introducing the bag-of-words (BoW) model

In [5]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two'
])
bag = count.fit_transform(docs)

In [6]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [7]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
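
The count matrix above can be reproduced by hand, which makes the meaning of each column explicit. The following is a minimal sketch using only the standard library; `simple_tokens` and `count_vector` are hypothetical helpers, and the tokenization is a simplification of what `CountVectorizer` does by default.

```python
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def simple_tokens(doc):
    # Lowercase, drop commas, split on whitespace (a simplified tokenizer).
    return doc.lower().replace(',', '').split()

# Each distinct token gets a column index, assigned in alphabetical order.
all_tokens = sorted({tok for doc in docs for tok in simple_tokens(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

def count_vector(doc):
    # One count per vocabulary entry, in column order.
    counts = Counter(simple_tokens(doc))
    return [counts.get(tok, 0) for tok in all_tokens]

bag_rows = [count_vector(doc) for doc in docs]
for row in bag_rows:
    print(row)
```

Each row is one document and each column is one vocabulary entry, so the value 3 in the second column of the last row says that "is" occurs three times in the third document.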


### Assessing word relevancy with TF-IDF

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]
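
As a worked check, the first row of the matrix above can be recomputed by hand. This sketch assumes the smoothed formula that scikit-learn documents for `smooth_idf=True`, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization; the variable names are illustrative only.

```python
import math

n_docs = 3
# Raw term counts in the first document, 'The sun is shining'
term_freq = {'is': 1, 'shining': 1, 'sun': 1, 'the': 1}
# Number of the three documents that contain each term
doc_freq = {'is': 3, 'shining': 2, 'sun': 2, 'the': 3}

def smooth_idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

# tf-idf before normalization
raw = {term: tf * smooth_idf(term) for term, tf in term_freq.items()}

# L2-normalize so the document vector has unit length
l2_norm = math.sqrt(sum(value ** 2 for value in raw.values()))
tfidf_doc1 = {term: value / l2_norm for term, value in raw.items()}

for term, value in sorted(tfidf_doc1.items()):
    print('%s: %.2f' % (term, value))
```

Terms that occur in every document ("is", "the") get the minimum idf of 1 and end up at 0.43, while the rarer "sun" and "shining" are weighted up to 0.56, matching the first row printed above.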


### Cleaning text data

In [9]:
import re
def preprocessor(text):
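    # strip HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # replace non-word characters with spaces, then append the emoticons without their '-' nose
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))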
    return text
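
To see what the cleaning step does, it helps to run it on a short string. The sketch below repeats the function so that it is self-contained; the sample string is made up for illustration.

```python
import re

def preprocessor(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Find emoticons in the tag-free text
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters to spaces, append emoticons
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

sample = '</a>This :) is :( a test :-)!'
cleaned = preprocessor(sample)
print(cleaned)  # this is a test :) :( :)
```

The HTML tag and punctuation are removed, the words are lowercased, and the emoticons are moved to the end of the string with the nose character dropped.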

In [10]:
df['review'] = df['review'].apply(preprocessor)

### Tokenizing documents

In [11]:
def tokenizer(text):
    return text.split()

tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

Word stemming reduces a word to its root form.<br>
We use the Porter stemming algorithm.<br>

In [12]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

### Removing English stop words

Stop words are regarded as carrying little or no information that helps distinguish between different classes of documents.<br>

In [13]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Takanori/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [15]:
[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

### Training a logistic regression model for document classification

In [16]:
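# .loc slices are label-based and include the end label, so stop at 24999
# to keep row 25000 out of the training set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values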

In [17]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
tfidf = TfidfVectorizer(
    strip_accents = None,
    lowercase = False,
    preprocessor = None
)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]}]

In [19]:
# liblinear supports both the 'l1' and 'l2' penalties searched in param_grid
lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

In [20]:
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring = 'accuracy', cv = 5, verbose=1, n_jobs = 1)

In [21]:
# gs_lr_tfidf.fit(X_train, y_train)

In [22]:
# Dump the model

# import pickle
# pickle.dump(gs_lr_tfidf, open('logisticregression.pkl','wb'), protocol=3 )

In [23]:
import pickle
# Load the pickled grid search object
with open('logisticregression.pkl', 'rb') as f:
    gs_lr_tfidf = pickle.load(f)

In [24]:
print('Best parameter set: %s' % gs_lr_tfidf.best_params_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x1137f5158>}


Print the mean 5-fold cross-validation accuracy on the training dataset<br>
and the classification accuracy on the test dataset.<br>

In [25]:
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

CV Accuracy: 0.893


In [26]:
clf = gs_lr_tfidf.best_estimator_

In [27]:
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.900


### Working with bigger data: online algorithms and out-of-core learning

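In out-of-core learning, a classifier is fit incrementally on small batches of the dataset.<br>
Here we use the partial_fit method of scikit-learn's SGDClassifier class to<br>
stream documents directly from the local drive and train a logistic regression model on small mini-batches of documents.<br>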
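
To make the idea of incremental fitting concrete, here is a toy logistic regression that is updated one mini-batch at a time with stochastic gradient descent. It is a sketch in plain Python and not scikit-learn's implementation: the class `OnlineLogisticRegression`, the `batches` generator, and the synthetic data (label 1 when the first feature exceeds the second) are all assumptions made for illustration.

```python
import math
import random

class OnlineLogisticRegression:
    """Toy logistic regression trained by SGD, one mini-batch at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.learning_rate = learning_rate

    def _prob(self, x):
        # Sigmoid of the linear score
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One gradient step per sample; earlier batches are never revisited
        for xi, yi in zip(X, y):
            error = self._prob(xi) - yi
            self.w = [wj - self.learning_rate * error * xj
                      for wj, xj in zip(self.w, xi)]
            self.b -= self.learning_rate * error
        return self

    def predict(self, X):
        return [1 if self._prob(xi) >= 0.5 else 0 for xi in X]


def batches(n_batches, batch_size, rng):
    # Stand-in for a file stream: yields one small batch at a time
    for _ in range(n_batches):
        X = [[rng.random(), rng.random()] for _ in range(batch_size)]
        y = [1 if a > b else 0 for a, b in X]
        yield X, y


rng = random.Random(0)
model = OnlineLogisticRegression(n_features=2)
for X_batch, y_batch in batches(n_batches=50, batch_size=20, rng=rng):
    model.partial_fit(X_batch, y_batch)

toy_predictions = model.predict([[0.9, 0.1], [0.1, 0.9]])
print(toy_predictions)
```

Only one batch is held in memory at any time, which is why the same pattern scales to datasets that do not fit in RAM.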

#### Defining the tokenizer function

In [28]:
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')
def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    # search the original-case text: after lowercasing, ':D' and ':P' could never match
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop] 
    return tokenized

In [40]:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [41]:
next(stream_docs(path='movie_data.csv'))

('"My family and I normally do not watch local movies for the simple reason that they are poorly made, they lack the depth, and just not worth our time.<br /><br />The trailer of ""Nasaan ka man"" caught my attention, my daughter in law\'s and daughter\'s so we took time out to watch it this afternoon. The movie exceeded our expectations. The cinematography was very good, the story beautiful and the acting awesome. Jericho Rosales was really very good, so\'s Claudine Barretto. The fact that I despised Diether Ocampo proves he was effective at his role. I have never been this touched, moved and affected by a local movie before. Imagine a cynic like me dabbing my eyes at the end of the movie? Congratulations to Star Cinema!! Way to go, Jericho and Claudine!!"',
 1)
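
The slicing in `stream_docs` relies on every line ending with a comma, a one-digit label, and a newline. The sketch below applies the same slices to a made-up line in that layout, and then shows the standard `csv` module as an alternative. The `csv` variant is not what the notebook uses, and it returns the text without the surrounding quotes.

```python
import csv

# Hypothetical line in the same layout as movie_data.csv:
# quoted review text, comma, label, newline
line = '"A touching film, well acted.",1\n'

# line[-1] is the newline, line[-2] the label digit, line[-3] the comma
text, label = line[:-3], int(line[-2])
print(repr(text), label)

# Alternative: let the csv module handle quoting and embedded commas
csv_text, csv_label = next(csv.reader([line]))
print(repr(csv_text), int(csv_label))
```

The slice-based version keeps the surrounding double quotes in the text, as seen in the streamed output above, whereas `csv.reader` strips them.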

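Next, define the get_minibatch function.<br>
It takes the document stream from stream_docs and returns the number of documents specified by the size argument. If the stream runs out before size documents have been read, it returns (None, None).<br>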

In [46]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
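
The behavior at the end of the stream is easy to check with a small in-memory iterator. In this sketch `toy_stream` is a hypothetical stand-in for `stream_docs`; the function body is repeated so the example runs on its own.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        # Stream exhausted before `size` documents were collected
        return None, None
    return docs, y

# Hypothetical stream of five (text, label) pairs
toy_stream = iter([('doc %d' % i, i % 2) for i in range(5)])

first = get_minibatch(toy_stream, size=2)
second = get_minibatch(toy_stream, size=2)
third = get_minibatch(toy_stream, size=2)  # only one document is left

print(first)
print(second)
print(third)
```

Note that a final, incomplete batch is discarded rather than returned, so the total number of documents should be a multiple of the batch size if every document is to be used.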

In [47]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(
    decode_error='ignore',
    n_features=2**21,
    preprocessor=None,
    tokenizer=tokenizer
)
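# loss='log_loss' selects logistic regression (it was named 'log' before scikit-learn 1.1);
# the old n_iter argument is omitted because each partial_fit call makes a single pass
clf = SGDClassifier(loss='log_loss', random_state=1)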
doc_stream = stream_docs(path='movie_data.csv')
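
`HashingVectorizer` needs no fitted vocabulary because it maps each token straight to a column index with a hash function. The sketch below illustrates that idea only: it uses MD5 from the standard library as a stand-in hash and a tiny hypothetical `n_features`, and it leaves out the sign alternation and normalization that scikit-learn applies.

```python
import hashlib

def hashed_counts(tokens, n_features=16):
    # Sparse vector as {column index: count}
    vec = {}
    for tok in tokens:
        # Deterministic hash of the token, reduced to a column index
        digest = hashlib.md5(tok.encode('utf-8')).hexdigest()
        idx = int(digest, 16) % n_features
        vec[idx] = vec.get(idx, 0) + 1
    return vec

vec_a = hashed_counts(['good', 'movie', 'good'])
vec_b = hashed_counts(['good', 'movie', 'good'])
print(vec_a)
```

Because the mapping depends only on the token itself, documents can be vectorized one batch at a time without a pass over the whole corpus. The price is that distinct tokens can collide in the same column, which a large feature space such as 2**21 makes rare.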

In [48]:
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes = classes)
    pbar.update()



0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:26


In [49]:
X_test, y_test = get_minibatch(doc_stream, size=5000)

In [52]:
X_test = vect.transform(X_test)

In [56]:
X_test.shape

(5000, 2097152)

In [57]:
# use '%.3f' to show three decimals; a '%3.f' format would round the score to a whole number
print('Accuracy: %.3f' % clf.score(X_test, y_test))


In [58]:
clf = clf.partial_fit(X_test, y_test)



### Topic modeling with Latent Dirichlet Allocation