<a href="https://colab.research.google.com/github/Tonoyama/amazon_review/blob/master/amazon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Amazon Review Analysis

## Data Collection


In [None]:
!curl http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Magazine_Subscriptions.json.gz -o Magazine.gz

In [None]:
!gzip -d Magazine.gz

In [None]:
!mv Magazine Magazine.json
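As an alternative to the decompress and rename steps above, pandas can read a gzip-compressed JSON Lines file directly. The following is a minimal sketch under that assumption; the file name `sample.json.gz` and the two records are made up for illustration and are not part of the dataset.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Hypothetical toy records standing in for the real review file
records = [
    {"overall": 5, "reviewText": "Great magazine"},
    {"overall": 3, "reviewText": "Okay"},
]

# Write them as gzip-compressed JSON Lines (one JSON object per line)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the compressed file directly, without decompressing it first
df_sample = pd.read_json(path, lines=True, compression="gzip")
print(df_sample.shape)
```

Reading the compressed file keeps the download as the only shell step, at the cost of decompressing on every load.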

## EDA (Exploratory Data Analysis)

In [None]:
import pandas as pd
import json

In [None]:
df_f = pd.read_json('Magazine.json', lines=True)
df_f.head()

In [None]:
df_f.shape

Extract `overall` as the target variable.


In [None]:
y = df_f.loc[:,['overall']]
y.value_counts()
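To judge class balance, relative frequencies can be easier to read than raw counts. The sketch below uses a made-up `toy` frame (hypothetical values, not the real ratings) and shows both views side by side.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the dataset
toy = pd.DataFrame({"overall": [5, 5, 5, 4, 1]})

# Selecting with a string returns a Series; selecting with a list returns a DataFrame
ratings = toy["overall"]

# Absolute counts and relative frequencies per star rating, ordered by rating
counts = ratings.value_counts().sort_index()
shares = ratings.value_counts(normalize=True).sort_index()
print(counts)
print(shares)
```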

Plot the distribution of star ratings.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='overall',data=df_f)
plt.show()

As explanatory variables, use `vote` (votes), `verified` (verified), `reviewTime` (time of the review), `reviewerID` (reviewer ID), `asin` (product ID), `reviewText` (review text), and `summary` (summary).


In [None]:
x = df_f.loc[:,['vote', 'verified','reviewTime', 'reviewerID', 'asin', 'reviewText', 'summary']]
x.head()

In [None]:
x.describe()

### Preprocessing

Check for missing values (`NaN`).


In [None]:
y.isnull().sum()

In [None]:
x.isnull().sum()
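Alongside the raw counts, the share of missing values per column can help decide whether filling or dropping is more reasonable. A small sketch on a made-up frame (the `toy` values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in both columns
toy = pd.DataFrame({
    "vote": ["2", np.nan, np.nan, "1,234"],
    "summary": ["good", "bad", None, "fine"],
})

# Number of missing entries per column
missing_count = toy.isnull().sum()

# Fraction of missing entries per column (the mean of a boolean mask)
missing_share = toy.isnull().mean()
print(missing_count)
print(missing_share)
```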

Here, fill the missing values of `vote` with `0`.


In [None]:
x['vote'] = x['vote'].fillna(0)
x.head()

Numbers in `vote` that contain `,` cause an error, so remove the commas with `replace`.

In [None]:
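# Remove thousands separators (e.g. "1,234") from `vote`, then convert to numbers
x['vote'] = pd.to_numeric(x['vote'].astype(str).str.replace(',', '', regex=False))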
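The comma removal described above can be sketched on a toy Series. The values below are made up for illustration. Note that this variant cleans the strings first and fills the missing entries afterwards, which is one possible ordering rather than the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical vote values: strings with thousands separators and a missing entry
votes = pd.Series(["2", "1,234", np.nan, "15"])

# Drop the commas (missing entries stay missing), then parse as numbers
numeric = pd.to_numeric(votes.str.replace(",", "", regex=False))

# Fill the missing votes with 0 and store as integers
cleaned = numeric.fillna(0).astype(int)
print(cleaned.tolist())  # [2, 1234, 0, 15]
```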

In [None]:
from wordcloud import WordCloud

In [None]:
# Join all non-missing reviews into one string for the word cloud
review_text = ' '.join(x['reviewText'].dropna().astype(str))

wc = WordCloud(
    min_font_size=3,
    max_words = 3000,
    background_color='white'
    )

review_wordcloud = wc.generate(review_text)

plt.figure(figsize = (10,10))
plt.imshow(review_wordcloud, interpolation = 'bilinear')
plt.axis("off")
plt.show()
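One caveat when turning a text column into a single string for `WordCloud.generate`: calling `str()` on a large NumPy array returns an abbreviated representation, so most of the text never reaches the word cloud. The sketch below illustrates this on made-up texts (the `texts` Series is hypothetical) and contrasts it with joining the strings.

```python
import numpy as np
import pandas as pd

# Hypothetical review texts, including one missing entry
texts = pd.Series(["great read"] * 1500 + [np.nan])

# str() on a large array yields an abbreviated repr containing "..."
as_repr = str(texts.values)

# Joining the non-missing strings keeps every review
joined = " ".join(texts.dropna().astype(str))

print("..." in as_repr)
print(len(joined.split()))
```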

In [None]:
# Join all non-missing summaries into one string for the word cloud
summary_text = ' '.join(x['summary'].dropna().astype(str))

summary_wordcloud = wc.generate(summary_text)

plt.figure(figsize = (10,10))
plt.imshow(summary_wordcloud, interpolation = 'bilinear')
plt.axis("off")
plt.show()

Extract only two features and use them as training data.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
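# Assumption: the two features are the numeric-compatible `vote` and `verified`
features = x[['vote', 'verified']]
x_train, x_test, y_train, y_test = train_test_split(features, y, train_size=0.8, test_size=0.2)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)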
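If the star ratings turn out to be imbalanced, a stratified split keeps the class proportions similar in both parts. The following is a sketch on made-up data, using the `stratify` and `random_state` parameters of `train_test_split`; the `toy_x` and `toy_y` objects are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy features and an imbalanced rating label (15 fives, 5 ones)
toy_x = pd.DataFrame({"vote": range(20), "verified": [True, False] * 10})
toy_y = pd.Series([5] * 15 + [1] * 5, name="overall")

# stratify keeps the 75/25 class ratio in both parts; random_state fixes the shuffle
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, train_size=0.8, random_state=0, stratify=toy_y
)
print(x_tr.shape, x_te.shape)
print(y_te.value_counts().to_dict())
```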

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
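# Convert to NumPy arrays; ravel() flattens y to the 1-D shape scikit-learn expects
x_train = x_train.values
y_train = y_train.values.ravel()
print(x_train)
clf = RandomForestClassifier(random_state=1234)
clf.fit(x_train, y_train)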

In [None]:
# score() reports mean accuracy on the held-out test set
print("score : ", clf.score(x_test.values, y_test.values.ravel()))
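Since `score` returns mean accuracy, it may help to compare the number against a majority-class baseline, especially if the ratings are skewed. The sketch below does this on made-up data. The arrays `X` and `y` and the use of `DummyClassifier` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one numeric feature and an imbalanced rating label
X = np.arange(20).reshape(-1, 1)
y = np.array([1] * 5 + [5] * 15)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)

# Fit the model and a baseline that always predicts the most frequent class
forest = RandomForestClassifier(random_state=1234).fit(X_tr, y_tr)
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# For classifiers, score() returns mean accuracy on the given data
forest_acc = forest.score(X_te, y_te)
baseline_acc = baseline.score(X_te, y_te)
print("forest  :", forest_acc)
print("baseline:", baseline_acc)
```

A model accuracy close to the baseline would suggest that the chosen features add little beyond the class frequencies.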