### Assignment goal: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify reviews by rating (text classification).

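The dataset rates each review from 1 to 5 stars. For this assignment we regroup the ratings into three classes: negative, neutral, and positive (1, 2 --> 1 negative; 3 --> 2 neutral; 4, 5 --> 3 positive).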
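
The regrouping described above can be written as an explicit lookup table. This is a minimal sketch: `RATING_TO_CLASS` and `to_sentiment_class` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical helper (not part of the original notebook): map a 1-5 star
# rating to the three classes used in this assignment.
RATING_TO_CLASS = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

def to_sentiment_class(overall):
    """Return 1 (negative), 2 (neutral), or 3 (positive) for a star rating."""
    return RATING_TO_CLASS[int(overall)]

print([to_sentiment_class(r) for r in (1.0, 2.0, 3.0, 4.0, 5.0)])
# prints [1, 1, 2, 3, 3]
```

A lookup table fails loudly with a `KeyError` on an unexpected rating, which makes data problems easier to spot than an arithmetic formula would.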

### Load packages

In [1]:
import json
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
import pandas as pd

### Data preprocessing
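The text data is fairly large, so we take the first 10,000 records for this exercise. Note that the loading loop below reads every line of the file, and the split output further down shows 5,269 reviews in total.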

In [2]:
#load json data
corpus = []
labels = []

with open('./D28_All_Beauty.json', 'r',encoding="utf-8") as f:
    for review in f:
        _dic = json.loads(review)
        
        #parse corpus(reviewText) and label(overall)
        corpus.append(_dic.get('reviewText', ''))
        
        #transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
        #an explicit mapping avoids relying on round() rounding halves to even
        labels.append({1: 1, 2: 1, 3: 2, 4: 3, 5: 3}[int(_dic['overall'])])
        
for data in zip(labels[:5], corpus[:5]):
    print(data, end='\n'*2)

(3, 'As advertised. Reasonably priced')

(3, 'Like the oder and the feel when I put it on my face.  I have tried other brands but the reviews from people I know they prefer the oder of this brand. Not hard on the face when dry.  Does not leave dry skin.')

(1, 'I bought this to smell nice after I shave.  When I put it on I smelled awful.  I am 19 and I smelled like a grandmother with too much perfume.')

(3, 'HEY!! I am an Aqua Velva Man and absolutely love this stuff, been using it for over 50 years. This is a true after shave lotion classic. Not quite sure how many women that have been attracted to me because of Aqua Velva,  I do know for sure that it\'s just to many to count. Ha.  Not sure how long this has been around but the Williams Company ran a paper advertisement, taken from a 1949 magazine, which features Ralph Bellamy of Detective Story and Ezio Pinza of South Pacific for Aqua Velva After Shave Lotion. I\'m sure you all remember Ralph Bellamy and Ezio Pinza from the 40\'s ri

In [3]:
#preprocessing data
#remove email addresses, punctuation, and newline markers (\n), then collapse repeated spaces
pattern = r'\S*@\S*|\\n|[^a-zA-Z0-9]'

for i, review in enumerate(corpus):
    #split() with no argument drops the empty strings left by consecutive separators
    fil_review = re.sub(pattern, ' ', review).split()
    corpus[i] = ' '.join(fil_review)

print('\n\n'.join(corpus[:5]))

As advertised Reasonably priced

Like the oder and the feel when I put it on my face I have tried other brands but the reviews from people I know they prefer the oder of this brand Not hard on the face when dry Does not leave dry skin

I bought this to smell nice after I shave When I put it on I smelled awful I am 19 and I smelled like a grandmother with too much perfume

HEY I am an Aqua Velva Man and absolutely love this stuff been using it for over 50 years This is a true after shave lotion classic Not quite sure how many women that have been attracted to me because of Aqua Velva I do know for sure that it s just to many to count Ha Not sure how long this has been around but the Williams Company ran a paper advertisement taken from a 1949 magazine which features Ralph Bellamy of Detective Story and Ezio Pinza of South Pacific for Aqua Velva After Shave Lotion I m sure you all remember Ralph Bellamy and Ezio Pinza from the 40 s right There slogan was Ther

In [4]:
#split corpus and label into train and test
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2)
print(len(x_train), len(x_test), len(y_train), len(y_test))

4215 1054 4215 1054
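
Because the three classes are very unevenly sized, a plain random split can leave the test set with very few neutral or negative reviews, and the result changes on every run. The sketch below shows one possible alternative using the `stratify` and `random_state` parameters of `train_test_split`. The toy data and the seed value are assumptions chosen for illustration, not values from the notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's `corpus` and `labels` (illustrative only).
toy_corpus = [f"review {i}" for i in range(100)]
toy_labels = [1] * 10 + [2] * 10 + [3] * 80

# stratify keeps the class ratio the same in both parts;
# random_state fixes the shuffle so the split is repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_corpus, toy_labels, test_size=0.2, stratify=toy_labels, random_state=0
)

test_counts = Counter(y_te)
print(len(x_tr), len(x_te))
print(sorted(test_counts.items()))
```

With stratification the 20 test items keep the 10/10/80 class ratio of the full toy set.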


In [5]:
#change corpus into vector
#you can use tfidf or BoW here
tfidf_vec = TfidfVectorizer()
tfidf_vec.fit(x_train)

#transform training and testing corpus into vector form
x_train = tfidf_vec.transform(x_train).toarray()
x_test = tfidf_vec.transform(x_test).toarray()
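
The comment in the cell above mentions bag-of-words (BoW) as an alternative to TF-IDF. A minimal sketch with `CountVectorizer` on made-up sentences follows; the toy texts are assumptions for illustration. It also leaves the result as a sparse matrix, which scikit-learn's tree-based estimators accept directly, so the `.toarray()` call is optional when memory is tight.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training and test texts (illustrative only, not from the dataset).
toy_train = ["great product love it", "awful smell", "love the smell"]
toy_test = ["love it", "unknown words only"]

bow_vec = CountVectorizer()
bow_train = bow_vec.fit_transform(toy_train)  # learn the vocabulary on training text only
bow_test = bow_vec.transform(toy_test)        # reuse that vocabulary for the test text

print(sorted(bow_vec.vocabulary_))
print(bow_train.shape, bow_test.shape)
# Words never seen during fit are ignored, so the second test row is all zeros.
print(bow_test[1].sum())
```

Fitting the vectorizer on the training text only, as the notebook does, keeps test-set vocabulary from leaking into the features.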

### 訓練與預測

In [6]:
#build classification model (decision tree, random forest, or adaboost)
#start training
tree = DecisionTreeClassifier(max_depth=10, min_samples_split=2)
tree.fit(x_train, y_train)

DecisionTreeClassifier(max_depth=10)

In [7]:
#start inference
y_pred = tree.predict(x_test)

In [8]:
#calculate accuracy
print(f'Accuracy: {tree.score(x_test,y_test)}')

Accuracy: 0.969639468690702


In [9]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.96      0.62      0.75        37
           2       1.00      0.15      0.26        20
           3       0.97      1.00      0.98       997

    accuracy                           0.97      1054
   macro avg       0.98      0.59      0.67      1054
weighted avg       0.97      0.97      0.96      1054

[[ 23   0  14]
 [  0   3  17]
 [  1   0 996]]
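
The per-class figures in the report can be recomputed by hand from the confusion matrix, which helps when reading it: rows are true classes and columns are predicted classes. The sketch below copies the matrix from the output above and also computes a majority-class baseline, which is an added comparison point and not something the notebook reports.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = [
    [23, 0, 14],
    [0, 3, 17],
    [1, 0, 996],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total

for i in range(3):
    predicted_as_i = sum(cm[r][i] for r in range(3))  # column sum
    truly_i = sum(cm[i])                              # row sum
    precision = cm[i][i] / predicted_as_i
    recall = cm[i][i] / truly_i
    print(f"class {i + 1}: precision={precision:.2f} recall={recall:.2f}")

# Baseline: always predict the majority class (3).
majority_baseline = sum(cm[2]) / total
print(f"accuracy={accuracy:.4f} majority_baseline={majority_baseline:.4f}")
# prints accuracy=0.9696 majority_baseline=0.9459
```

The baseline shows that predicting "positive" for every review would already reach about 0.946 accuracy, so per-class recall says more about this model than the headline accuracy does.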


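From the results above, the model handles positive reviews well (both precision and recall are high), while recall on negative reviews is weaker (0.62). Neutral reviews are mostly misclassified as positive (17 of 20). Because 997 of the 1,054 test reviews are positive, overall accuracy is dominated by that class.
Try the methods you have learned to improve the model's performance.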
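
One possible direction, sketched here under stated assumptions, is to reweight the classes and to judge the model by macro-averaged F1 instead of accuracy. The example uses `RandomForestClassifier` with `class_weight="balanced"`, which weights classes inversely to their frequency. The toy corpus, the number of trees, and the seed are illustrative choices, and the score is computed on the training texts only to show the API, so it says nothing about real performance.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Tiny imbalanced toy corpus (illustrative only): 1=negative, 2=neutral, 3=positive.
toy_texts = [
    "awful smell would not buy again",
    "terrible broke after one day",
    "it is okay nothing special",
    "love it works great",
    "great product highly recommend",
    "excellent value love the scent",
    "works great very happy",
]
toy_labels = [1, 1, 2, 3, 3, 3, 3]

# The pipeline vectorizes the text and trains the forest in one fit() call.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0),
)
model.fit(toy_texts, toy_labels)

pred = model.predict(toy_texts)
# Macro averaging gives each class the same weight regardless of its size.
print("macro F1 on the toy training data:", f1_score(toy_labels, pred, average="macro"))
```

On the real data, the same pipeline would be fitted on the training split and scored on the held-out test split.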