### 作業目的: 使用樹型模型進行文章分類

本次作業主利用[Amazon Review data中的All Beauty](https://nijianmo.github.io/amazon/index.html)來進行review評價分類(文章分類)

資料中將review分為1,2,3,4,5分，而在這份作業，我們將評論改分為差評價、普通評價、優良評價(1,2-->1差評、3-->2普通評價、4,5-->3優良評價)

### 載入套件

In [1]:
import json
import re
import nltk
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

### 資料前處理
文本資料較為龐大，這裡我們取前10000筆資料來進行作業練習

In [2]:
#load json data
all_reviews = []
with open('All_Beauty.json', 'r') as f:
    for line in f.readlines():
        all_reviews.append(json.loads(line))        

In [3]:
all_reviews[0]

{'overall': 1.0,
 'verified': True,
 'reviewTime': '02 19, 2015',
 'reviewerID': 'A1V6B6TNIC10QE',
 'asin': '0143026860',
 'reviewerName': 'theodore j bigham',
 'reviewText': 'great',
 'summary': 'One Star',
 'unixReviewTime': 1424304000}

In [4]:
all_reviews[0]['overall']

1.0

In [5]:
all_reviews[0]['reviewText']

'great'

In [6]:
# 沒有'reviewText'
all_reviews[547]

{'overall': 5.0,
 'verified': True,
 'reviewTime': '03 4, 2017',
 'reviewerID': 'A3TQJ5AQXW6CZH',
 'asin': '1620213982',
 'style': {'Size:': ' Color42'},
 'reviewerName': 'mona',
 'summary': 'Five Stars',
 'unixReviewTime': 1488585600}

In [7]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3

for i in range(len(all_reviews)):
    if i==10000:
        break
    try:
        corpus.append(all_reviews[i]['reviewText'])
        if all_reviews[i]['overall'] == 1.0 or all_reviews[i]['overall'] == 2.0:
            labels.append(1)
        if all_reviews[i]['overall'] == 3.0:
            labels.append(2)
        if all_reviews[i]['overall'] == 4.0 or all_reviews[i]['overall'] == 5.0:
            labels.append(3)
    except:
        #corpus.append('None')
        #print(i)
        continue

#corpus = [all_reviews[i]['reviewText'] for i in range(len(all_reviews))]
#labels = [all_reviews[i]['overall'] for i in range(len(all_reviews))]

In [8]:
corpus[:5]

['great',
 "My  husband wanted to reading about the Negro Baseball and this a great addition to his library\n Our library doesn't haveinformation so this book is his start. Tthank you",
 'This book was very informative, covering all aspects of game.',
 'I am already a baseball fan and knew a bit about the Negro leagues, but I learned a lot more reading this book.',
 "This was a good story of the Black leagues. I bought the book to teach in my high school reading class. I found it very informative and exciting. I would recommend to anyone interested in the history of the black leagues. It is well written, unlike a book of facts. The McKissack's continue to write good books for young audiences that can also be enjoyed by adults!"]

In [9]:
#preprocessing data
#remove email address, punctuations, and change line symbol(\n)
pattern1 = r"(\w+@\w+\.\w+)"
pattern2 = r"([^a-zA-Z0-9\s])|[\n]"

corpus2 = []

for text in corpus:
    match = re.sub(pattern1, '', text)  
    match2 = re.sub(pattern2, '', match).lower()

    corpus2.append(match2)
    
print(corpus2[:5])

['great', 'my  husband wanted to reading about the negro baseball and this a great addition to his library our library doesnt haveinformation so this book is his start tthank you', 'this book was very informative covering all aspects of game', 'i am already a baseball fan and knew a bit about the negro leagues but i learned a lot more reading this book', 'this was a good story of the black leagues i bought the book to teach in my high school reading class i found it very informative and exciting i would recommend to anyone interested in the history of the black leagues it is well written unlike a book of facts the mckissacks continue to write good books for young audiences that can also be enjoyed by adults']


In [10]:
#split corpus and label into train and test
x_train, x_test, y_train, y_test = train_test_split(corpus2, labels, test_size = 0.2, random_state = 1)

len(x_train), len(x_test), len(y_train), len(y_test)

(7996, 1999, 7996, 1999)

In [11]:
#change corpus into vector
#you can use tfidf or BoW here
cv = CountVectorizer(max_features = 2000)

#transform training and testing corpus into vector form
x_train = cv.fit_transform(x_train).toarray()
x_test = cv.fit_transform(x_test).toarray()

### 訓練與預測

In [12]:
#build classification model (decision tree, random forest, or adaboost)
decision_tree_cls = DecisionTreeClassifier(criterion='entropy', max_depth=3,
                                           min_samples_split=10, min_samples_leaf=5)

#start training
decision_tree_cls.fit(x_train, y_train)

#start inference
y_pred = decision_tree_cls.predict(x_test)

In [13]:
#calculate accuracy
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')

Accuracy: 0.896448224112056


In [14]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.20      0.02      0.04       135
           2       0.00      0.00      0.00        64
           3       0.90      0.99      0.95      1800

    accuracy                           0.90      1999
   macro avg       0.37      0.34      0.33      1999
weighted avg       0.83      0.90      0.85      1999

[[   3    0  132]
 [   1    0   63]
 [  11    0 1789]]


  _warn_prf(average, modifier, msg_start, len(result))


由上述資訊可以發現, 模型在好評的準確度高(precision, recall都高), 而在差評的部分表現較不理想, 在普通評價的部分大部分跟差評搞混,
同學可以試著學習到的各種方法來提升模型的表現