<a href="https://colab.research.google.com/github/Tonoyama/amazon_review/blob/master/amazon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Amazon Review Analysis

## Data Collection


In [None]:
!curl http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Magazine_Subscriptions.json.gz -o Magazine.gz

In [None]:
!gzip -d Magazine.gz

In [None]:
!mv Magazine Magazine.json
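
The decompress and rename steps above can also be folded into the load itself, since pandas reads gzip-compressed JSON Lines directly. A minimal sketch, assuming a small stand-in file: `sample_reviews.json.gz` is a hypothetical name and the two records are made up for illustration.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Two made-up records in JSON Lines layout (one JSON object per line).
records = [
    {"overall": 5, "reviewText": "Great magazine", "summary": "Love it"},
    {"overall": 2, "reviewText": "Too many ads", "summary": "Meh"},
]

# Hypothetical file name, written to a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# compression="gzip" decompresses on the fly; lines=True parses JSON Lines.
sample_df = pd.read_json(path, lines=True, compression="gzip")
print(sample_df.shape)
```

With this approach the `.gz` file can stay compressed on disk.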

## EDA (Exploratory Data Analysis)

In [None]:
import pandas as pd
import json

In [None]:
df_f = pd.read_json('Magazine.json', lines=True)
df_f.head()

In [None]:
df_f.columns

In [None]:
df_f.shape

### Extract reviews by star rating

In [None]:
review_5 = df_f[df_f['overall'] == 5]
review_43 = df_f[(df_f['overall'] == 4) | (df_f['overall'] == 3)]
review_21 = df_f[(df_f['overall'] == 2 ) | (df_f['overall'] == 1)]

In [None]:
review_43.head()
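
The three star groups can also be expressed as a single categorical column, which makes group-wise counts a one-liner. A sketch with `pd.cut` on made-up ratings; the column name `star_group` and the bin labels are illustrative assumptions.

```python
import pandas as pd

# Made-up ratings standing in for the `overall` column.
toy = pd.DataFrame({"overall": [5, 4, 3, 2, 1, 5]})

# Right-inclusive bins: (0, 2] -> "1-2", (2, 4] -> "3-4", (4, 5] -> "5".
toy["star_group"] = pd.cut(
    toy["overall"], bins=[0, 2, 4, 5], labels=["1-2", "3-4", "5"]
)

# Count reviews per group, ordered by group rather than by frequency.
group_counts = toy["star_group"].value_counts().sort_index()
print(group_counts)
```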

Extract `overall` as the target variable.


In [None]:
# Keep the target as a 1-D Series, which is the shape scikit-learn expects
y = df_f['overall']
y.value_counts()

Plot the distribution of star ratings.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='overall',data=df_f)
plt.show()

Use the following as explanatory variables: `vote` (votes), `verified` (verified), `reviewTime` (review time), `reviewerID` (reviewer ID), `asin` (product ID), `reviewText` (review text), and `summary` (summary).


In [None]:
# .copy() so that later column assignments do not touch a view of df_f
x = df_f.loc[:,['vote', 'verified','reviewTime', 'reviewerID', 'asin', 'reviewText', 'summary']].copy()
x.head()

In [None]:
x.describe()

Check for missing values (`NaN`).


In [None]:
y.isnull().sum()

In [None]:
x.isnull().sum()

Here, fill the missing values in `vote` with `0`.


In [None]:
x['vote'] = x['vote'].fillna(0)
x.head()
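
Filling with `0` leaves the remaining `vote` values as they are. If the counts are stored as strings with thousands separators such as `"1,234"` (an assumption about the raw data), they can be converted to integers in the same step. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up vote values: strings, a thousands separator, and a missing entry.
votes = pd.Series(["3", None, "1,234", "12"])

clean_votes = (
    pd.to_numeric(votes.str.replace(",", "", regex=False), errors="coerce")
    .fillna(0)       # missing votes count as zero
    .astype(int)
)
print(clean_votes.tolist())
```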

In [None]:
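# Combine review text and summary; fill NaN first so one missing field does not blank the whole row
x = x['reviewText'].fillna('') + ' ' + x['summary'].fillna('')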

Commas inside numbers cause errors, so replace them with an empty string.

In [None]:
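# Series.replace only matches whole cell values, so use the .str accessor for substrings
x = x.str.replace(',', '', regex=False)
# Strip a leading run of the digits 1-9
x = x.str.replace(r'^[1-9]+', '', regex=True)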
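
`Series.replace` and `Series.str.replace` behave differently, which matters when stripping commas: the former matches whole cell values unless `regex=True` is passed, while the latter substitutes inside each string. A small sketch on made-up strings:

```python
import pandas as pd

s = pd.Series(["1,234 pages", ",", "no commas"])

# Whole-value match: only the cell that is exactly "," changes.
whole_value = s.replace(",", "")

# Substring replacement: every comma inside every string is removed.
substring = s.str.replace(",", "", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```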

In [None]:
x.head()

In [None]:
x.shape

In [None]:
x = x.astype(str)

In [None]:
x_df = pd.DataFrame(data=x)
x_df.columns = ['review']
x_df.head()

### Skim the vocabulary with word clouds

In [None]:
from wordcloud import WordCloud

In [None]:
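# x is now a single text Series, so read the column from df_f.
# Join the reviews into one string; str() on an array yields an abbreviated repr.
review_text = ' '.join(df_f['reviewText'].dropna().astype(str))

wc = WordCloud(
    min_font_size=3,
    max_words=3000,
    background_color='white'
    )

review_wordcloud = wc.generate(review_text)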

plt.figure(figsize = (10,10))
plt.imshow(review_wordcloud, interpolation = 'bilinear')
plt.title('reviewText of all stars')
plt.axis("off")
plt.show()
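
When preparing text for a word cloud, note that `str()` of a large NumPy array is an abbreviated printout, not the full content: arrays with more than 1000 elements are summarized with `...`. A sketch on made-up strings that compares it with joining the elements:

```python
import numpy as np

# 2000 made-up "reviews" in an object array, similar to Series.values.
texts = np.array([f"review number {i}" for i in range(2000)], dtype=object)

as_repr = str(texts)         # summarized: first and last few items only
as_joined = " ".join(texts)  # full text, one long string

print("..." in as_repr)
print(len(as_repr), len(as_joined))
```

Passing the joined string to `WordCloud.generate` lets every review contribute to the word frequencies.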

In [None]:
# 5-star reviews only
summary_text = ' '.join(review_5['reviewText'].dropna().astype(str))

summary_wordcloud = wc.generate(summary_text)

plt.figure(figsize = (10,10))
plt.imshow(summary_wordcloud, interpolation = 'bilinear')
plt.title('5 star : reviewText')
plt.axis("off")
plt.show()

In [None]:
review_5[['reviewText','summary']].sample(10)

In [None]:
# 3- and 4-star reviews only
summary_text = ' '.join(review_43['reviewText'].dropna().astype(str))

summary_wordcloud = wc.generate(summary_text)

plt.figure(figsize = (10,10))
plt.imshow(summary_wordcloud, interpolation = 'bilinear')
plt.title('3, 4 star : reviewText')
plt.axis("off")
plt.show()

In [None]:
review_43[['reviewText','summary']].sample(10)

In [None]:
# 1- and 2-star reviews only
summary_text = ' '.join(review_21['reviewText'].dropna().astype(str))

summary_wordcloud = wc.generate(summary_text)

plt.figure(figsize = (10,10))
plt.imshow(summary_wordcloud, interpolation = 'bilinear')
plt.title('1, 2 star : reviewText')
plt.axis("off")
plt.show()

In [None]:
review_21[['reviewText','summary']].sample(10)

In [None]:
# x is now a single text Series, so read the summaries from df_f
summary_text = ' '.join(df_f['summary'].dropna().astype(str))

summary_wordcloud = wc.generate(summary_text)

plt.figure(figsize = (10,10))
plt.imshow(summary_wordcloud, interpolation = 'bilinear')
plt.title('summary')
plt.axis("off")
plt.show()

### Split into training and test data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1234)
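
If the ratings are imbalanced (the count plot above is the place to check), a stratified split keeps the class proportions the same in both partitions. A sketch on made-up labels: `stratify` is a real parameter of `train_test_split`, while the toy data and variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, skewed labels: 60% five-star, the rest spread over 1-4.
labels = np.array([5] * 60 + [4] * 16 + [3] * 8 + [2] * 8 + [1] * 8)
texts = np.array([f"doc {i}" for i in range(len(labels))])

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=1234, stratify=labels
)

# Each class keeps its share: 25% of every class lands in the test set.
print(np.unique(y_te, return_counts=True))
```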

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [None]:
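# An integer max_df is an absolute document count: max_df=1 keeps only terms
# that occur in a single document. Use max_df=1.0 to keep all terms.
# min_df=1 is the smallest valid integer and filters nothing.
cv=CountVectorizer(min_df=1,max_df=1,binary=False,ngram_range=(1,3))
# Fit on the training reviews and transform them
cv_train_reviews=cv.fit_transform(x_train)
# Transform the test reviews with the fitted vocabulary
cv_test_reviews=cv.transform(x_test)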

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)

In [None]:
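# As with CountVectorizer, integer max_df=1 keeps only terms found in a single
# document; max_df=1.0 would keep all terms.
tv=TfidfVectorizer(min_df=1,max_df=1,use_idf=True,ngram_range=(1,3))
# Fit on the training reviews and transform them
tv_train_reviews=tv.fit_transform(x_train)
# Transform the test reviews with the fitted vocabulary
tv_test_reviews=tv.transform(x_test)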
print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)
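
The `max_df` argument means different things as an integer and as a float: `max_df=1` is an absolute count (drop terms that appear in more than one document), whereas `max_df=1.0` is a proportion (drop nothing). A small sketch on a made-up corpus shows the effect on the vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "great" appears in two documents, every other word in one.
docs = ["great magazine", "great price", "late delivery"]

rare_only = CountVectorizer(max_df=1).fit(docs)    # absolute count
all_terms = CountVectorizer(max_df=1.0).fit(docs)  # proportion of documents

print(sorted(rare_only.vocabulary_))
print(sorted(all_terms.vocabulary_))
```

With the integer setting, words shared across reviews are discarded, which leaves a classifier little signal that carries over to unseen reviews.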

## Analysis

### Classification with logistic regression

In [None]:
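from sklearn.linear_model import LogisticRegression

# Use one estimator per feature set; refitting a single object would overwrite the first fit.
# Bag of words
lr_bow=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=0)
lr_bow.fit(cv_train_reviews,y_train.values.ravel())
print(lr_bow)
# TF-IDF features
lr_tfidf=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=0)
lr_tfidf.fit(tv_train_reviews,y_train.values.ravel())
print(lr_tfidf)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
# Predict with the bag-of-words model
lr_bow_predict=lr_bow.predict(cv_test_reviews)
# Predict with the TF-IDF model
lr_tfidf_predict=lr_tfidf.predict(tv_test_reviews)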

In [None]:
# Accuracy with bag of words
lr_bow_score=accuracy_score(y_test,lr_bow_predict)
print("lr_bow_score :",lr_bow_score)
# Accuracy with TF-IDF
lr_tfidf_score=accuracy_score(y_test,lr_tfidf_predict)
print("lr_tfidf_score :",lr_tfidf_score)

lr_bow_score : 0.6312714623377782

lr_tfidf_score : 0.5994737546269455

In [None]:
lr_bow_report=classification_report(y_test,lr_bow_predict, target_names=['1','2', '3', '4', '5'])
print(lr_bow_report)

#Classification report for tfidf features
lr_tfidf_report=classification_report(y_test,lr_tfidf_predict, target_names=['1','2', '3', '4', '5'])
print(lr_tfidf_report)

              precision    recall  f1-score   support

           1       0.83      0.10      0.18      2745
           2       0.90      0.07      0.14      1336
           3       0.91      0.07      0.12      1700
           4       0.85      0.08      0.15      3200
           5       0.62      1.00      0.77     13442

    accuracy                           0.63     22423
   macro avg       0.82      0.26      0.27     22423
weighted avg       0.72      0.63      0.52     22423

              precision    recall  f1-score   support

           1       0.00      0.00      0.00      2745
           2       0.00      0.00      0.00      1336
           3       0.00      0.00      0.00      1700
           4       0.00      0.00      0.00      3200
           5       0.60      1.00      0.75     13442

    accuracy                           0.60     22423
   macro avg       0.12      0.20      0.15     22423
weighted avg       0.36      0.60      0.45     22423
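
The second report can be read against a majority-class baseline: a classifier that always predicts 5 stars scores exactly the share of 5-star reviews in the test set. The sketch below recomputes that share from the `support` column of the reports above, using plain arithmetic only:

```python
# Test-set support per star rating, copied from the classification reports.
support = {1: 2745, 2: 1336, 3: 1700, 4: 3200, 5: 13442}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of always predicting the most frequent class.
baseline_accuracy = support[majority_class] / total

# Macro recall of that same strategy: 1.0 on one class, 0.0 on the others.
baseline_macro_recall = 1 / len(support)

print(total, majority_class)
print(round(baseline_accuracy, 4), baseline_macro_recall)
```

The baseline accuracy agrees with `lr_tfidf_score`, and the macro-average recall of 0.20 agrees as well, which suggests the TF-IDF model predicts 5 stars for every review.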

### Analysis with random forest

Skipped because it takes too long to run. The code is left here for reference.

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=1234)
clf.fit(tv_train_reviews, y_train)
```

```python
# Score on the TF-IDF test features (the model was fit on TF-IDF, not on raw text)
print("score : ", clf.score(tv_test_reviews, y_test))
```

```python
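from sklearn.metrics import confusion_matrix, accuracy_score

# Predict on the TF-IDF test features, then compare with the true labels
y_pred = clf.predict(tv_test_reviews)
cm = confusion_matrix(y_test, y_pred)
print(cm)
acc = accuracy_score(y_test, y_pred)
print(acc)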
```
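
If the full fit is too slow, one option is to cap the number and depth of the trees and to train on all CPU cores. A rough sketch on made-up data: `n_estimators`, `max_depth`, and `n_jobs` are real `RandomForestClassifier` parameters, but the values here are arbitrary assumptions, not tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews and star labels, repeated to get a few more rows.
toy_texts = [
    "great magazine",
    "love this magazine",
    "terrible service",
    "never arrived terrible",
] * 5
toy_labels = [5, 5, 1, 1] * 5

toy_features = TfidfVectorizer().fit_transform(toy_texts)

# Fewer, shallower trees trained in parallel keep the run time bounded.
small_rf = RandomForestClassifier(
    n_estimators=20, max_depth=10, n_jobs=-1, random_state=1234
)
small_rf.fit(toy_features, toy_labels)
print(small_rf.score(toy_features, toy_labels))
```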

### Analysis with LSTM

In [None]:
import keras
from keras.layers import Dense,LSTM
from keras.models import Sequential

In [None]:
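# Note: this is a feed-forward network of Dense layers on the bag-of-words
# features; no LSTM layer is used.
num_classes = 5  # star ratings 1-5
model = Sequential()
model.add(Dense(units=75, activation='relu', input_dim=cv_train_reviews.shape[1]))
model.add(Dense(units=50, activation='relu'))
model.add(Dense(units=25, activation='relu'))
model.add(Dense(units=10, activation='relu'))
# One softmax output per class, paired with a loss that accepts integer labels
model.add(Dense(units=num_classes, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
# Shift the 1-5 star labels to 0-4, as the sparse categorical loss expects
y_train_nn = y_train.values.ravel().astype('int32') - 1
y_test_nn = y_test.values.ravel().astype('int32') - 1
model.fit(cv_train_reviews, y_train_nn, epochs=10)

In [None]:
model.evaluate(cv_test_reviews, y_test_nn)[1]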