作業目的: 使用樹型模型進行文章分類

本次作業主利用Amazon Review data中的All Beauty來進行review評價分類(文章分類)

資料中將review分為1,2,3,4,5分，而在這份作業，我們將評論改分為差評價、普通評價、優良評價(1,2-->1差評、3-->2普通評價、4,5-->3優良評價)

In [3]:
import json
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

資料前處理
文本資料較為龐大，這裡我們取前10000筆資料來進行作業練習

In [4]:
#load json data
all_reviews = []
with open('All_Beauty.json', encoding='utf-8') as f:
    for review in f:
        all_reviews.append(json.loads(review))
        
all_reviews[0]

{'asin': '0143026860',
 'overall': 1.0,
 'reviewText': 'great',
 'reviewTime': '02 19, 2015',
 'reviewerID': 'A1V6B6TNIC10QE',
 'reviewerName': 'theodore j bigham',
 'summary': 'One Star',
 'unixReviewTime': 1424304000,
 'verified': True}

In [5]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

for i in range(10000):
    if 'overall' not in all_reviews[i] or 'reviewText' not in all_reviews[i]:
        continue
        #continue 強制跳出本次迴圈
    corpus.append(all_reviews[i]['reviewText'])
    labels.append(all_reviews[i]['overall'])
    
#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
label_map = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}
labels = [label_map[int(label)] for label in labels]

In [11]:
#preprocessing data
#remove email address, punctuations, and change line symbol(\n)
import string

for corpu in corpus:
    pattern = re.compile('\S*@\S*\s?')
    pattern.sub('', corpu)
    corpu = (corpu.strip()).translate(str.maketrans('', '', string.punctuation))

In [12]:
#split corpus and label into train and test

x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2, random_state=0, shuffle=True)
len(x_train), len(x_test), len(y_train), len(y_test)

(7996, 1999, 7996, 1999)

In [13]:
#change corpus into vector
#you can use tfidf or BoW here

vectorizer = TfidfVectorizer()
vectorizer.fit(x_train)

#transform training and testing corpus into vector form
x_train = vectorizer.transform(x_train)
x_test = vectorizer.transform(x_test)

In [15]:
#訓練與預測
#build classification model (decision tree, random forest, or adaboost)
#start training

#建立決策樹模型
decision_tree_cls = DecisionTreeClassifier(criterion='entropy', max_depth=6,
                                           min_samples_split=10, min_samples_leaf=5)

#使用決策樹模型進行訓練
decision_tree_cls.fit(x_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=6, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [16]:
#start inference
#以訓練好的決策樹進行預測
y_pred = decision_tree_cls.predict(x_test)

In [18]:
#calculate accuracy
print ("Accuracy: {}".format(decision_tree_cls.score(x_test,y_test)))

Accuracy: 0.9019509754877438


In [19]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.65      0.18      0.28       134
           2       0.14      0.01      0.03        73
           3       0.91      0.99      0.95      1792

    accuracy                           0.90      1999
   macro avg       0.57      0.39      0.42      1999
weighted avg       0.86      0.90      0.87      1999

[[  24    4  106]
 [   1    1   71]
 [  12    2 1778]]


由上述資訊可以發現, 模型在好評的準確度高(precision, recall都高), 而在差評的部分表現較不理想, 在普通評價的部分大部分跟差評搞混, 同學可以試著學習到的各種方法來提升模型的表現