<a href="https://colab.research.google.com/github/Takkar-915/movie_review/blob/main/BoW_Logistic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install the required libraries

In [None]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Load the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import tarfile

# Extract every file contained in the tar archive.
with tarfile.open('/content/drive/MyDrive/3年前期/知的情報システム開発/データセット/aclImdb_v1.tar.gz', 'r:gz') as tar:
  tar.extractall()

Build a dataset of review texts with positive/negative labels

In [None]:
import pandas as pd
import os

basepath = 'aclImdb'

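# Positive: 1, negative: 0
labels = {'pos': 1, 'neg': 0}

"""
aclImdb contains train and test folders, each of which has
pos and neg subfolders.
Join the paths as shown below to build a dataset that holds
only the review text and its sentiment label.
"""

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for i in ('test', 'train'):
  for j in ('pos', 'neg'):
    path = os.path.join(basepath, i, j)

    # List the files directly under path (sorted by file name)
    for file in sorted(os.listdir(path)):

      # Read the file
      with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
        txt = infile.read()

      # Collect the review text and its label
      rows.append([txt, labels[j]])

# Build the DataFrame and set the column names
df = pd.DataFrame(rows, columns=['review', 'sentiment'])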

Shuffle the DataFrame created above

In [None]:
import numpy as np

np.random.seed(0)
# Reorder the rows randomly
df = df.reindex(np.random.permutation(df.index))
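
As a small illustration of the shuffle above, the sketch below applies the same `reindex` + `np.random.permutation` pattern to a toy DataFrame (the toy data is made up). It shows that rows keep their original index labels after shuffling, so label-based slicing only lines up with row positions once the index has been reset. In this notebook, saving with `index=False` and reading the CSV back gives a fresh 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the review DataFrame (made up for illustration)
toy = pd.DataFrame({'review': ['a', 'b', 'c', 'd', 'e'],
                    'sentiment': [1, 0, 1, 0, 1]})

np.random.seed(0)
# Same pattern as above: reorder rows by a random permutation of the index
shuffled = toy.reindex(np.random.permutation(toy.index))

# The rows are reordered, but each row keeps its original index label
print(shuffled.index.tolist())

# Resetting the index makes labels match row positions again
reset = shuffled.reset_index(drop=True)
print(reset.index.tolist())  # [0, 1, 2, 3, 4]
```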

Save as a CSV file

In [None]:
df.to_csv('movie_data.csv', index = False, encoding='utf-8')

Check the contents of the CSV file

In [None]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(10)
#df.shape

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0
5,Leave it to Braik to put on a good show. Final...,1
6,Nathan Detroit (Frank Sinatra) is the manager ...,1
7,"To understand ""Crash Course"" in the right cont...",1
8,I've been impressed with Chavez's stance again...,1
9,This movie is directed by Renny Harlin the fin...,1


Use a regular expression to remove unneeded HTML markup (the function below strips tags only; punctuation is left in place)

In [None]:
import re

# Define a function for cleaning the data
def preprocessor(text):

  # Remove HTML markup
  text = re.sub('<[^>]*>', '', text)

  return text

# Apply the function above to every row
df['review'] = df['review'].apply(preprocessor)
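
To see what `preprocessor` does, the sketch below runs the same regular expression on a made-up review snippet. Because each tag is replaced with an empty string, words separated only by `<br />` end up joined. The `strip_html_keep_space` variant is a hypothetical alternative, not part of the original notebook, that substitutes a space instead.

```python
import re

def preprocessor(text):
    # Same regex as above: drop anything that looks like an HTML tag
    return re.sub('<[^>]*>', '', text)

def strip_html_keep_space(text):
    # Hypothetical variant: replace each tag with a space, then collapse whitespace
    text = re.sub('<[^>]*>', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = 'A great film.<br /><br />Loved it.'
print(preprocessor(sample))           # A great film.Loved it.
print(strip_html_keep_space(sample))  # A great film. Loved it.
```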

Check the result of the regular-expression cleanup

In [None]:
df.head(10)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0
5,Leave it to Braik to put on a good show. Final...,1
6,Nathan Detroit (Frank Sinatra) is the manager ...,1
7,"To understand ""Crash Course"" in the right cont...",1
8,I've been impressed with Chavez's stance again...,1
9,This movie is directed by Renny Harlin the fin...,1


Tokenize the documents so that the text data can be analyzed

In [None]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# Tokenize on whitespace
def tokenizer(text):
    return text.split()

# Reduce each token to its stem (Porter stemming algorithm)
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
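
A quick comparison of the two tokenizers on made-up sentences. This is only a sketch and assumes NLTK is installed. Whitespace splitting leaves punctuation attached to the neighbouring word, while the Porter stemmer reduces inflected forms to a common stem that is not always a dictionary word.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    # Split on whitespace only
    return text.split()

def tokenizer_porter(text):
    # Split on whitespace, then stem each token
    return [porter.stem(word) for word in text.split()]

# Punctuation stays attached to the tokens
print(tokenizer('Great movie, loved it!'))

# Inflected forms such as "runners" and "running" are reduced to stems
sentence = 'runners like running and thus they run'
print(tokenizer(sentence))
print(tokenizer_porter(sentence))
```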

Exclude words that are too common to be informative (used only for BoW)



In [None]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

# Exclude common words (not useful for telling positive from negative)
stop = stopwords.words('english') 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
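
The idea behind stop-word removal and a bag-of-words representation can be sketched with the standard library alone. The tiny stop list and the sentence below are made up for illustration. The notebook itself uses NLTK's English list and scikit-learn's `CountVectorizer`, whose tokenization and filtering differ in detail.

```python
from collections import Counter

# Hypothetical mini stop list (NLTK's English list is much longer)
mini_stop = {'a', 'and', 'is', 'it', 'the', 'this'}

text = 'this movie is great and the acting is great'

# Whitespace tokenization, then drop stop words
tokens = [w for w in text.split() if w not in mini_stop]
print(tokens)  # ['movie', 'great', 'acting', 'great']

# Bag of words: word order is discarded, only the counts remain
bag = Counter(tokens)
print(bag['great'], bag['movie'], bag['acting'])  # 2 1 1
```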


Split into training and test data

In [None]:
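# Use position-based slicing: .loc[:25000] includes label 25000,
# which would put that row in both the training and the test set
x_train = df['review'].iloc[:25000].values
y_train = df['sentiment'].iloc[:25000].values
x_test = df['review'].iloc[25000:].values
y_test = df['sentiment'].iloc[25000:].values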
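
When splitting by row ranges, it helps to keep in mind that `.loc` slices by label and includes the end label, whereas `.iloc` slices by position and excludes the end. The toy frame below is made up to show the difference on a default integer index.

```python
import pandas as pd

toy = pd.DataFrame({'review': list('abcdef')})

# Label-based slice: both endpoints are included
by_label = toy.loc[:3, 'review'].values
print(len(by_label))  # 4

# Position-based slice: the end position is excluded
by_pos = toy['review'].iloc[:3].values
print(len(by_pos))  # 3

# With .loc, [:3] and [3:] both contain the row labelled 3
overlap = set(toy.loc[:3].index.tolist()) & set(toy.loc[3:].index.tolist())
print(overlap)  # {3}
```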

Find the best parameters with a grid search.

In [None]:
# Pipeline
from sklearn.pipeline import Pipeline
# Logistic regression
from sklearn.linear_model import LogisticRegression
# Bag-of-words (BoW) term counts
from sklearn.feature_extraction.text import CountVectorizer
# Grid search
from sklearn.model_selection import GridSearchCV

# Extract features as BoW counts (CountVectorizer does not apply TF-IDF weighting)
bow = CountVectorizer()

# When combining Pipeline with GridSearchCV, name parameters as <step name>__<parameter name>
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],  # tokenization method
               'clf__penalty': ['l1', 'l2'],  # regularization type: L1 or L2
               'clf__C': [10.0, 50.0, 100.0]},  # inverse regularization strength (smaller C = stronger)
              ]

# random_state=0 for reproducibility
lr_bow = Pipeline([('vect', bow),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

# Instantiate GridSearchCV to run the grid search
gs_lr_bow = GridSearchCV(lr_bow,  # model to tune
                           param_grid,  # candidate parameter values
                           scoring='accuracy',  # evaluation metric
                           cv=3,        # 3-fold cross-validation
                           n_jobs=-1)  # use all available CPU cores
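
To get a feel for the cost of this search, the sketch below counts the parameter combinations in the grid using `itertools.product`. Placeholder strings stand in for the actual stop-word list and tokenizer functions; only the number of options per parameter matters here.

```python
from itertools import product

# Same shape as param_grid above; the strings are placeholders for the real objects
grid = {'vect__ngram_range': [(1, 1)],
        'vect__stop_words': ['stop', None],
        'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
        'clf__penalty': ['l1', 'l2'],
        'clf__C': [10.0, 50.0, 100.0]}

# Every combination of one value per parameter is a candidate
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_candidates = len(candidates)
n_folds = 3  # cv=3

print(n_candidates)            # 24
print(n_candidates * n_folds)  # 72 fits during cross-validation
```

With the default `refit=True`, `GridSearchCV` also refits the best candidate once on the full training data after the cross-validation fits.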

Training

In [None]:
gs_lr_bow.fit(x_train, y_train)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('vect', CountVectorizer()),
                                       ('clf',
                                        LogisticRegression(random_state=0,
                                                           solver='liblinear'))]),
             n_jobs=-1,
             param_grid=[{'clf__C': [10.0, 50.0, 100.0],
                          'clf__penalty': ['l1', 'l2'],
                          'vect__ngram_range': [(1, 1)],
                          'vect__stop_words': [['i', 'me', 'my', 'myself', 'we',
                                                'our', 'ours', 'ourselves',
                                                'you', "you're", "you've",
                                                "you'll", "you'd", 'your',
                                                'yours', 'yourself',
                                                'yourselves', 'he', 'him',
                                                'his', 'himself

Performance evaluation

In [None]:
print('Best parameter set: %s ' % gs_lr_bow.best_params_)
print('CV Accuracy: %.3f' % gs_lr_bow.best_score_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7f50ce201e60>} 
CV Accuracy: 0.870


In [None]:
clf = gs_lr_bow.best_estimator_

Evaluation metrics

In [None]:
# Import the metric functions
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

y_pred = clf.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', accuracy)

precision = precision_score(y_test, y_pred)
print('Precision:', precision)

recall = recall_score(y_test, y_pred)
print('Recall: ', recall)

f1 = f1_score(y_test, y_pred)
print('F1 score:', f1)


Accuracy:  0.8794
Precision: 0.8770531015786955
Recall:  0.8819049146155696
F1 score: 0.8794723166100339
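
The four scores are all derived from the counts of true/false positives and negatives. The sketch below computes them by hand for a made-up set of labels (not the notebook's data), treating label 1 as the positive class, which is the default for scikit-learn's binary metrics.

```python
# Made-up labels for illustration only
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(true_labels, pred_labels))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

toy_accuracy = (tp + tn) / len(pairs)
toy_precision = tp / (tp + fp)   # of the predicted positives, how many are correct
toy_recall = tp / (tp + fn)      # of the actual positives, how many are found
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)

print('%.3f %.3f %.3f %.3f' % (toy_accuracy, toy_precision, toy_recall, toy_f1))
# 0.700 0.600 0.750 0.667
```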


Save the trained model


In [None]:
import pickle

with open('bow_logistic.pickle', mode='wb') as f:
    pickle.dump(clf,f,protocol=2)
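
A minimal round-trip sketch with a toy pipeline, assuming scikit-learn is installed; the four training sentences are made up. One caveat worth keeping in mind: pickle stores plain Python functions by reference (module and name) rather than by value, so a pipeline whose vectorizer uses a custom `tokenizer` function can only be unpickled where a function of that name is available from the same module.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up training set, for illustration only
texts = ['great movie', 'wonderful acting', 'terrible plot', 'boring and bad']
targets = [1, 1, 0, 0]

toy_clf = Pipeline([('vect', CountVectorizer()),
                    ('clf', LogisticRegression(random_state=0, solver='liblinear'))])
toy_clf.fit(texts, targets)

# Serialize to bytes and restore, as a file-free stand-in for dump/load
restored = pickle.loads(pickle.dumps(toy_clf))

# The restored pipeline should give the same predictions as the original
sample = ['great acting', 'bad plot']
print(toy_clf.predict(sample).tolist() == restored.predict(sample).tolist())
```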

Use the saved model

In [None]:
with open('bow_logistic.pickle', mode='rb') as f:
    clf = pickle.load(f)

sample_review = ["Its finally over. The story that I learned so many things from, the story which had a deep impression in my soul, my very silver soul.It was the best farewell imaginable.The art was top notch as it was shown in the trailers.The story was heartbreaking yet heartwarming at the same time. I don't think there will be another anime which engages me more than Gintama for the rest of my life."]

# Predict with the model
ans = clf.predict(sample_review)

# predict returns an array with one label per input, so take the first element
if ans[0] == 1:
  print("positive")

elif ans[0] == 0:
  print("negative")

positive
