<a href="https://colab.research.google.com/github/Tonoyama/amazon_review/blob/master/amazon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Amazon Review Analysis

## Data Collection


In [None]:
!curl http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Magazine_Subscriptions.json.gz -o Magazine.gz

In [None]:
!gzip -d Magazine.gz

In [None]:
!mv Magazine Magazine.json
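
The decompress-and-rename steps above use shell commands. As a rough alternative, the same result can be obtained with Python's standard library. The helper name `gunzip_file` is an illustrative assumption and is not part of the original notebook.

```python
import gzip
import shutil

def gunzip_file(src_path, dst_path):
    """Decompress the gzip file at src_path into dst_path (hypothetical helper)."""
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    return dst_path

# Example, assuming Magazine.gz was fetched by the curl cell and not yet unpacked:
# gunzip_file('Magazine.gz', 'Magazine.json')
```

Writing directly to `Magazine.json` replaces both the `gzip -d` and the `mv` step.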

## EDA (Exploratory Data Analysis)

### Loading the Data

In [None]:
import pandas as pd
import json

In [None]:
df_f = pd.read_json('Magazine.json', lines=True)
df_f.head()

In [None]:
df_f.columns

In [None]:
df_f.shape

### Extracting Reviews by Star Rating

In [None]:
review_5 = df_f[df_f['overall'] == 5]
review_43 = df_f[(df_f['overall'] == 4) | (df_f['overall'] == 3)]
review_21 = df_f[(df_f['overall'] == 2 ) | (df_f['overall'] == 1)]

In [None]:
review_43.head()
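
The three groups above (5 stars, 3 to 4 stars, 1 to 2 stars) can also be expressed as a single label column instead of three separate frames. A minimal sketch using `pd.cut` on toy data; the column name `star_group` and the labels `low`/`mid`/`high` are assumptions made for illustration.

```python
import pandas as pd

# Toy ratings standing in for df_f['overall']
toy = pd.DataFrame({'overall': [5, 4, 3, 2, 1, 5]})

# Bins are right-inclusive: (0, 2] -> 'low', (2, 4] -> 'mid', (4, 5] -> 'high'
toy['star_group'] = pd.cut(toy['overall'], bins=[0, 2, 4, 5],
                           labels=['low', 'mid', 'high'])

print(toy['star_group'].value_counts())
```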

Extract `overall` as the target variable.


In [None]:
y = df_f.loc[:,['overall']]
y.value_counts()

Plot the distribution of star ratings.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='overall',data=df_f)
plt.show()

The explanatory variables are `vote` (votes), `verified` (verified purchase), `reviewTime` (review date), `reviewerID` (reviewer ID), `asin` (product ID), `reviewText` (review text), and `summary` (summary).


In [None]:
x = df_f.loc[:,['vote', 'verified','reviewTime', 'reviewerID', 'asin', 'reviewText', 'summary']]
x.head()

In [None]:
x.describe()

Check for missing values (`NaN`).


In [None]:
y.isnull().sum()

In [None]:
x.isnull().sum()

Here, missing values in `vote` are filled with `0`.


In [None]:
x['vote'] = x['vote'].fillna(0)
x.head()
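
If `vote` is stored as text with thousands separators (for example `'1,234'`; this is an assumption about the raw data, so check `x['vote'].dtype` first), filling with `0` leaves a column of mixed types. A hedged sketch of converting such a column to integers, shown on toy data:

```python
import pandas as pd

# Toy column: strings with thousands separators plus missing values (assumed format)
votes = pd.Series(['3', None, '1,234', '12'])

votes_clean = (
    votes.fillna('0')                        # treat "no votes recorded" as zero
         .str.replace(',', '', regex=False)  # '1,234' -> '1234'
         .astype(int)
)
print(votes_clean.tolist())
```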

In [None]:
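# fill missing text first; otherwise NaN in either column makes the whole row NaN
x = x['reviewText'].fillna('') + ' ' + x['summary'].fillna('')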

Commas inside numbers cause errors, so replace them with an empty string.

In [None]:
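# Series.replace only matches whole cell values by default, so use the .str accessor
x = x.str.replace(',', '', regex=False)        # drop every comma
x = x.str.replace(r'^[1-9]+', '', regex=True)  # strip leading digits 1-9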
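
`Series.replace` and `Series.str.replace` behave differently: the former matches whole cell values by default, while the latter works on substrings. A small illustration on toy strings:

```python
import pandas as pd

s = pd.Series(['1,000 issues, great value', ','])

# Whole-value matching: only the cell that is exactly ',' changes
whole = s.replace(',', '')

# Substring replacement: every comma in every cell is removed
sub = s.str.replace(',', '', regex=False)

print(whole.tolist())
print(sub.tolist())
```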

In [None]:
x.head()

In [None]:
x.shape

In [None]:
x = x.astype(str)

In [None]:
x_df = pd.DataFrame(data=x)
x_df.columns = ['review']
x_df.head()

### Skimming the Words with Word Clouds

In [None]:
from wordcloud import WordCloud

In [None]:
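# x is now a single Series of combined text, so read the column from df_f
review_text = ' '.join(df_f['reviewText'].dropna().astype(str))

wc = WordCloud(
    min_font_size=3,
    max_words = 3000,
    background_color='white'
    )

# pass the joined text; str() on a large array abbreviates it with '...'
review_wordcloud = wc.generate(review_text)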

plt.figure(figsize = (10,10))
plt.imshow(review_wordcloud, interpolation = 'bilinear')
plt.title('reviewText of all stars')
plt.axis("off")
plt.show()
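
When building the input text for a word cloud, note that calling `str()` on a large NumPy array can silently drop most of the content, because NumPy abbreviates long arrays when printing. Joining the elements keeps everything. A small demonstration on synthetic strings, assuming NumPy's default print options:

```python
import numpy as np

texts = np.array(['review %d' % i for i in range(2000)], dtype=object)

abbreviated = str(texts)   # NumPy summarizes arrays above its print threshold
joined = ' '.join(texts)   # every element is kept

print('...' in abbreviated)          # the middle of the array was elided
print('review 1000' in abbreviated)
print('review 1000' in joined)
```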

In [None]:
summary_text = ' '.join(review_5['reviewText'].dropna().astype(str))

summary_wordcloud = wc.generate(summary_text)

plt.figure(figsize = (10,10))
plt.imshow(summary_wordcloud, interpolation = 'bilinear')
plt.title('5 star : reviewText')
plt.axis("off")
plt.show()

In [None]:
review_5[['reviewText','summary']].sample(10)

In [None]:
summary_text = ' '.join(review_43['reviewText'].dropna().astype(str))

summary_wordcloud = wc.generate(summary_text)

plt.figure(figsize = (10,10))
plt.imshow(summary_wordcloud, interpolation = 'bilinear')
plt.title('3, 4 star : reviewText')
plt.axis("off")
plt.show()

In [None]:
review_43[['reviewText','summary']].sample(10)

In [None]:
summary_text = ' '.join(review_21['reviewText'].dropna().astype(str))

summary_wordcloud = wc.generate(summary_text)

plt.figure(figsize = (10,10))
plt.imshow(summary_wordcloud, interpolation = 'bilinear')
plt.title('1, 2 star : reviewText')
plt.axis("off")
plt.show()

In [None]:
review_21[['reviewText','summary']].sample(10)

In [None]:
# x no longer has a 'summary' column, so read it from df_f
summary_text = ' '.join(df_f['summary'].dropna().astype(str))

summary_wordcloud = wc.generate(summary_text)

plt.figure(figsize = (10,10))
plt.imshow(summary_wordcloud, interpolation = 'bilinear')
plt.title('summary')
plt.axis("off")
plt.show()

### Frequency Analysis

In [None]:
import re
import unicodedata
import nltk
from nltk.corpus import stopwords
# add appropriate words that will be ignored in the analysis
ADDITIONAL_STOPWORDS = ['covfefe']

import matplotlib.pyplot as plt

In [None]:
!python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')"

In [None]:
def basic_clean(text):
  """
  A simple function to clean up the data. All words that are
  not designated as stop words are lemmatized after encoding
  normalization and basic regex parsing are performed.
  """
  wnl = nltk.stem.WordNetLemmatizer()
  # a set makes the membership test in the list comprehension below fast
  stopwords = set(nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS)
  text = (unicodedata.normalize('NFKD', text)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  words = re.sub(r'[^\w\s]', '', text).split()
  return [wnl.lemmatize(word) for word in words if word not in stopwords]
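
To see what the normalization steps inside `basic_clean` do without downloading any NLTK data, here is a stripped-down sketch of the same pipeline. It skips lemmatization and uses a tiny hand-written stop-word set; `normalize_tokens` and `TOY_STOPWORDS` are illustrative names, not part of the notebook.

```python
import re
import unicodedata

TOY_STOPWORDS = {'the', 'is', 'a'}  # stand-in for NLTK's English stop words

def normalize_tokens(text):
    # NFKD splits accented characters into base letter + combining mark;
    # the ASCII round trip then drops the marks ('é' -> 'e')
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore')
            .lower())
    # remove punctuation, then split on whitespace
    words = re.sub(r'[^\w\s]', '', text).split()
    return [w for w in words if w not in TOY_STOPWORDS]

print(normalize_tokens('The café is GREAT, really!'))
```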

In [None]:
# join the reviews with spaces; str(list) would add list syntax and 'nan' tokens
words = basic_clean(' '.join(review_5['reviewText'].dropna().astype(str)))

In [None]:
words[:20]

In [None]:
(pd.Series(nltk.ngrams(words, 2)).value_counts())[:10]

In [None]:
(pd.Series(nltk.ngrams(words, 3)).value_counts())[:10]
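
For intuition, an n-gram is just a sliding window of `n` consecutive tokens. The sketch below reproduces the idea with the standard library only; the local function `ngrams` is a simplified stand-in written for illustration and is not `nltk.ngrams` itself.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-token tuples obtained by sliding a window over tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['great', 'magazine', 'great', 'magazine', 'great', 'price']
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(2))
```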

In [None]:
bigrams_series = (pd.Series(nltk.ngrams(words, 2)).value_counts())[:12]
trigrams_series = (pd.Series(nltk.ngrams(words, 3)).value_counts())[:12]
bigrams_series.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.title('12 Most Frequently Occurring Bigrams')
plt.ylabel('Bigram')
plt.xlabel('# of Occurrences')

https://towardsdatascience.com/from-dataframe-to-n-grams-e34e29df3460

### Splitting the Data for Training

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1234)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [None]:
# max_df=1 (an int) keeps only terms found in a single document; 1.0 means no upper limit
cv=CountVectorizer(min_df=1,max_df=1.0,binary=False,ngram_range=(1,3))
#transformed train reviews
cv_train_reviews=cv.fit_transform(x_train)
#transformed test reviews
cv_test_reviews=cv.transform(x_test)

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)

In [None]:
# max_df=1 (an int) keeps only terms found in a single document; 1.0 means no upper limit
tv=TfidfVectorizer(min_df=1,max_df=1.0,use_idf=True,ngram_range=(1,3))
#transformed train reviews
tv_train_reviews=tv.fit_transform(x_train)
#transformed test reviews
tv_test_reviews=tv.transform(x_test)

print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)
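
The `max_df` parameter of the vectorizers is interpreted differently depending on its type: an integer is an absolute document count, a float is a proportion of documents. A toy comparison (the three documents are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['good magazine', 'good price', 'bad magazine']

# Integer 1: keep only terms that occur in at most one document
rare_only = CountVectorizer(max_df=1).fit(docs)
# Float 1.0: keep terms that occur in up to 100% of documents (no upper limit)
everything = CountVectorizer(max_df=1.0).fit(docs)

print(sorted(rare_only.vocabulary_))
print(sorted(everything.vocabulary_))
```

With the integer setting, the frequent words `good` and `magazine` disappear from the vocabulary, which is rarely what a sentiment model needs.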

## Analysis

### Analysis with Random Forest

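Random forest is a machine learning algorithm based on ensemble learning. It trains multiple decision trees as weak learners and combines their predictions into a single, more reliable result.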

Training takes a long time, so it is not run here.

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=1234)
# y_train is a one-column DataFrame; scikit-learn expects a 1-D label array
clf.fit(tv_train_reviews, y_train.values.ravel())
```

```python
# score on the TF-IDF features the model was trained on, not on the raw text
print("score : ", clf.score(tv_test_reviews, y_test))
```

```python
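from sklearn.metrics import confusion_matrix, accuracy_score

# predict on the held-out TF-IDF features, then compare with the true labels
y_pred = clf.predict(tv_test_reviews)
cm = confusion_matrix(y_test, y_pred)
print(cm)
acc = accuracy_score(y_test, y_pred)
print(acc)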
```
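
The idea of combining several weak trees can be illustrated with a plain majority vote. This is a conceptual sketch only: the per-tree predictions are invented, and scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(predictions_per_tree):
    """Combine per-tree predictions (one list per tree) into one label per sample."""
    combined = []
    for votes in zip(*predictions_per_tree):  # all trees' votes for a single sample
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three hypothetical trees, each predicting star ratings for four reviews
tree_preds = [
    [5, 4, 1, 5],
    [5, 3, 1, 4],
    [4, 4, 2, 5],
]
print(majority_vote(tree_preds))
```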

### Analysis with LSTM

In [None]:
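import numpy as np
from keras import Sequential, Input
from keras.layers import Embedding, LSTM, Dense, Dropout

embedding_size=32
max_words=5000  # vocabulary size; index 0 is reserved for padding
max_len=200     # assumed cap on review length in tokens; longer reviews are truncated

# An LSTM expects sequences of integer token ids, not bag-of-words counts,
# so build a vocabulary from the training text and encode each review.
seq_cv = CountVectorizer(max_features=max_words - 1).fit(x_train)
analyze = seq_cv.build_analyzer()
vocab = seq_cv.vocabulary_

def to_padded_sequences(texts):
    seqs = np.zeros((len(texts), max_len), dtype='int32')
    for i, text in enumerate(texts):
        ids = [vocab[w] + 1 for w in analyze(text) if w in vocab][:max_len]
        seqs[i, :len(ids)] = ids
    return seqs

seq_train = to_padded_sequences(x_train)
seq_test = to_padded_sequences(x_test)

model = Sequential()
model.add(Input(shape=(max_len,)))
model.add(Embedding(max_words, embedding_size))
model.add(LSTM(100))
model.add(Dense(5,activation='softmax'))  # one output per star rating (1 to 5)

model.summary()

In [None]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
# get_dummies leaves numeric DataFrame columns untouched, so encode the Series
y_train_dummies = pd.get_dummies(y_train['overall']).values.astype('float32')
y_test_dummies = pd.get_dummies(y_test['overall']).values.astype('float32')
print('Shape of Label tensor: ', y_train_dummies.shape)

In [None]:
model.fit(seq_train, y_train_dummies, validation_data = (seq_test, y_test_dummies), epochs = 20, batch_size = 64, shuffle = True)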
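
One-hot encoding with `pd.get_dummies` depends on what is passed in: numeric columns of a DataFrame are returned unchanged, whereas a Series is expanded into one column per distinct value. A quick check on toy labels:

```python
import pandas as pd

labels = pd.DataFrame({'overall': [5, 4, 5, 1]})

as_frame = pd.get_dummies(labels)              # numeric column: returned unchanged
as_series = pd.get_dummies(labels['overall'])  # one column per distinct rating

print(as_frame.shape)
print(as_series.shape)
print(as_series.columns.tolist())
```

The number of columns in the encoded labels must match the number of units in the final `Dense` layer.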