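### Assignment goal: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/All_Beauty.json.gz) to classify review ratings (text classification).

Each review in the data is rated 1, 2, 3, 4, or 5. In this assignment we regroup the ratings into three classes: negative, neutral, and positive (1, 2 --> 1 negative; 3 --> 2 neutral; 4, 5 --> 3 positive).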
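The regrouping can be sketched as a lookup table. This is a minimal illustration, not the assignment's reference solution; the names `RATING_TO_LABEL` and `to_sentiment_label` are hypothetical.

```python
# Hypothetical helper: map a 1-5 star rating to the three classes used here.
RATING_TO_LABEL = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}


def to_sentiment_label(overall):
    """Return 1 (negative), 2 (neutral), or 3 (positive) for a star rating."""
    # The dataset stores ratings as floats such as 1.0, so cast to int first.
    return RATING_TO_LABEL[int(overall)]


print([to_sentiment_label(r) for r in (1.0, 2.0, 3.0, 4.0, 5.0)])
```

A dictionary keeps the mapping in one place, and an unexpected rating raises a `KeyError` instead of passing through silently.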

### Load packages

In [1]:
pip install scikit-learn



In [2]:
import json
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

### Data preprocessing
The text data is fairly large, so we use only the first 10,000 records for this exercise.

In [3]:
# get data
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/All_Beauty.json.gz

--2021-09-01 10:23:30--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/All_Beauty.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47350910 (45M) [application/octet-stream]
Saving to: ‘All_Beauty.json.gz’


2021-09-01 10:23:31 (66.5 MB/s) - ‘All_Beauty.json.gz’ saved [47350910/47350910]



In [4]:
# unzip data
!gunzip All_Beauty.json.gz

In [38]:
#load json data
all_reviews = []
###<your code>###
with open("All_Beauty.json", "r") as f:
  for reviews in f:
    all_reviews.append(json.loads(reviews))

all_reviews = all_reviews[:10000]
len(all_reviews), all_reviews[0], type(all_reviews), all_reviews[0]['overall']

(10000,
 {'asin': '0143026860',
  'overall': 1.0,
  'reviewText': 'great',
  'reviewTime': '02 19, 2015',
  'reviewerID': 'A1V6B6TNIC10QE',
  'reviewerName': 'theodore j bigham',
  'summary': 'One Star',
  'unixReviewTime': 1424304000,
  'verified': True},
 list,
 1.0)
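
The cell above parses the whole file and then slices it. As an alternative sketch, the first records can be read straight from the compressed file and parsing stopped early. This assumes the `.json.gz` file is still on disk (that is, the `gunzip` step was skipped); `load_first_reviews` is a hypothetical helper name.

```python
import gzip
import json
from itertools import islice


def load_first_reviews(path, limit=10000):
    """Read at most `limit` JSON-lines records from a gzip-compressed file."""
    # "rt" opens the gzip stream in text mode so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # islice stops after `limit` lines, so the rest is never parsed.
        return [json.loads(line) for line in islice(f, limit)]
```

For example, `all_reviews = load_first_reviews("All_Beauty.json.gz")` would give the same first 10,000 records without holding the full dataset in memory.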

In [54]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

###<your code>###
for entry in all_reviews:
  if 'overall' not in entry or 'reviewText' not in entry:
    continue
  labels.append(entry['overall'])
  corpus.append(entry['reviewText'])

#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
###<your code>###
for l in range(len(labels)):
  if labels[l] == 1.0 or labels[l] == 2.0:
    labels[l] = 1
  elif labels[l] == 3.0:
    labels[l] = 2
  elif labels[l] == 4.0 or labels[l] == 5.0:
    labels[l] = 3

In [55]:
corpus[:10]

['great',
 "My  husband wanted to reading about the Negro Baseball and this a great addition to his library\n Our library doesn't haveinformation so this book is his start. Tthank you",
 'This book was very informative, covering all aspects of game.',
 'I am already a baseball fan and knew a bit about the Negro leagues, but I learned a lot more reading this book.',
 "This was a good story of the Black leagues. I bought the book to teach in my high school reading class. I found it very informative and exciting. I would recommend to anyone interested in the history of the black leagues. It is well written, unlike a book of facts. The McKissack's continue to write good books for young audiences that can also be enjoyed by adults!",
 'Today I gave a book about the Negro Leagues of Baseball to a traveling friend. Its a book I\'ve read more than once and felt that my friend would truly enjoy. It felt like giving a gift that you wanted to keep for yourself. I parted with the book knowing that

In [59]:
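#preprocessing data
#remove email addresses, punctuation, and newline characters (\n)
#alternatives are tried left to right, so the email and literal "\n" patterns must come before \W+;
#real newline characters are already matched by \W+
regex_pattern = r'\S*@\S*|\\n|\W+'

for v in range(len(corpus)):
  corpus[v] = re.sub(regex_pattern, " ", corpus[v])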

corpus[:10]
###<your code>###

['great',
 'My husband wanted to reading about the Negro Baseball and this a great addition to his library Our library doesn t haveinformation so this book is his start Tthank you',
 'This book was very informative covering all aspects of game ',
 'I am already a baseball fan and knew a bit about the Negro leagues but I learned a lot more reading this book ',
 'This was a good story of the Black leagues I bought the book to teach in my high school reading class I found it very informative and exciting I would recommend to anyone interested in the history of the black leagues It is well written unlike a book of facts The McKissack s continue to write good books for young audiences that can also be enjoyed by adults ',
 'Today I gave a book about the Negro Leagues of Baseball to a traveling friend Its a book I ve read more than once and felt that my friend would truly enjoy It felt like giving a gift that you wanted to keep for yourself I parted with the book knowing that my friend would
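
To see what this kind of cleaning does to a single string, here is a small self-contained sketch. The sample sentence and the `clean_text` helper are made up for illustration, and collapsing repeated spaces is an extra step that the cell above does not perform.

```python
import re

# Email-like tokens first, then any run of non-word characters
# (punctuation, spaces, and real newline characters).
CLEAN_PATTERN = re.compile(r"\S*@\S*|\W+")


def clean_text(text):
    """Replace emails and punctuation with spaces, then collapse whitespace."""
    replaced = CLEAN_PATTERN.sub(" ", text)
    # split() with no arguments drops leading, trailing, and repeated spaces.
    return " ".join(replaced.split())


print(clean_text("mail me at foo@bar.com, thanks!\nBye"))
print(clean_text("doesn't"))
```

Note that apostrophes count as non-word characters, so contractions such as "doesn't" are split into two tokens, as the output above also shows.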

In [66]:
#split corpus and label into train and test
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size = 0.2, random_state = 0)

len(x_train), len(x_test), len(y_train), len(y_test)

(7996, 1999, 7996, 1999)

In [67]:
#convert the corpus into vectors
#you can use TF-IDF or bag-of-words (BoW) here

###<your code>###
vector = TfidfVectorizer()
vector.fit(x_train)
#transform training and testing corpus into vector form
x_train = vector.transform(x_train)
x_test = vector.transform(x_test)
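
The comment above mentions bag-of-words as an alternative to TF-IDF. A minimal side-by-side sketch on a made-up three-document corpus (the documents and variable names are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus; the real notebook uses x_train / x_test instead.
train_docs = ["great product", "bad product", "great great value"]
test_docs = ["great unknown"]

# Bag-of-words: raw token counts. Fit on the training documents only.
bow = CountVectorizer()
x_train_bow = bow.fit_transform(train_docs)
x_test_bow = bow.transform(test_docs)

# TF-IDF: counts reweighted by inverse document frequency, L2-normalised.
tfidf = TfidfVectorizer()
x_train_tfidf = tfidf.fit_transform(train_docs)
x_test_tfidf = tfidf.transform(test_docs)

print(sorted(bow.vocabulary_))   # vocabulary learned from the training docs
print(x_test_bow.toarray())      # "unknown" is not in the vocabulary, so it is ignored
print(x_test_tfidf.toarray())
```

In both cases the vocabulary is learned from the training split only, so words that appear only in the test split are dropped at transform time.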

### Training and prediction

In [69]:
#build classification model (decision tree, random forest, or adaboost)
#start training
decision_tree_class = DecisionTreeClassifier(max_depth=6)

decision_tree_class.fit(x_train, y_train)

###<your code>###

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=6, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [71]:
#start inference
y_pred = decision_tree_class.predict(x_test)

In [75]:
#calculate accuracy
print("Accuracy: {:.2f}\ndepth: {}\nleaves: {}".format(decision_tree_class.score(x_test, y_test),
                                                   decision_tree_class.get_depth(),
                                                   decision_tree_class.get_n_leaves()))

Accuracy: 0.90
depth: 6
leaves: 33


In [76]:
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.68      0.21      0.32       134
           2       0.00      0.00      0.00        73
           3       0.91      0.99      0.95      1792

    accuracy                           0.90      1999
   macro avg       0.53      0.40      0.42      1999
weighted avg       0.86      0.90      0.87      1999

[[  28    4  102]
 [   3    0   70]
 [  10    2 1780]]
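
The figures in the report can be recomputed from the confusion matrix by hand. The sketch below uses only the numbers printed above (rows are true classes, columns are predicted classes) and also derives a majority-class baseline for comparison.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted.
confusion = [
    [28, 4, 102],    # true class 1 (negative)
    [3, 0, 70],      # true class 2 (neutral)
    [10, 2, 1780],   # true class 3 (positive)
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Support is the number of true samples per class (the row sums).
support = [sum(row) for row in confusion]
recall = [confusion[i][i] / support[i] for i in range(len(confusion))]

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(support) / total

print("accuracy: {:.3f}".format(accuracy))
print("recall per class:", [round(r, 2) for r in recall])
print("majority-class baseline: {:.3f}".format(majority_baseline))
```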


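The report shows that the model does well on positive reviews (both precision and recall are high) but poorly on negative reviews, and it never correctly predicts a neutral review: 70 of the 73 neutral reviews are classified as positive, as are 102 of the 134 negative ones. Because 1,792 of the 1,999 test reviews are positive, the 0.90 accuracy is barely above the roughly 0.896 that always predicting "positive" would achieve.
Try the methods you have learned to improve the model's performance.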
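
One possible direction, sketched here on a tiny made-up corpus rather than the real data, is to stratify the split and reweight the classes so that the rare classes count more during training. The texts, the choice of `RandomForestClassifier`, and the hyperparameters are assumptions for illustration, not a tuned or recommended configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up, imbalanced toy corpus: 12 positive, 4 negative, 4 neutral.
positive = ["love it", "great product", "works well", "excellent quality"] * 3
negative = ["terrible smell", "broke quickly", "waste of money", "very bad"]
neutral = ["it is okay", "average product", "nothing special", "fine but pricey"]
texts = positive + negative + neutral
labels = [3] * len(positive) + [1] * len(negative) + [2] * len(neutral)

# stratify keeps the class proportions the same in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# class_weight="balanced" weights each class inversely to its frequency.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0),
)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

print(classification_report(y_test, y_pred, zero_division=0))
```

On the real data, macro-averaged recall or F1 is a more informative target than accuracy, since accuracy is dominated by the positive class.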