### Assignment goal: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify reviews by rating (a text classification task).

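The dataset rates each review from 1 to 5. In this assignment we regroup the ratings into three classes: negative, neutral, and positive (1,2 --> 1 negative; 3 --> 2 neutral; 4,5 --> 3 positive).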

### Load packages

In [1]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

### Data preprocessing
The full text dataset is fairly large, so we load only the first 10,000 records for this exercise.

In [2]:
#load json data
all_reviews = pd.read_json('All_Beauty.json', lines=True, nrows=10000)
all_reviews.shape

(10000, 12)
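As a rough illustration of what `lines=True, nrows=10000` amounts to, here is a minimal standard-library sketch; it is not how pandas implements the call. The helper name `read_first_records` and the in-memory sample records are hypothetical. The idea is only that each line of the file holds one JSON object and reading stops after the requested number of records.

```python
import io
import json
from itertools import islice


def read_first_records(lines, limit):
    """Parse at most `limit` non-empty JSON Lines records from an iterable of strings."""
    non_empty = (line for line in lines if line.strip())
    return [json.loads(line) for line in islice(non_empty, limit)]


# Hypothetical stand-in for the real file: three made-up reviews, one per line.
fake_file = io.StringIO(
    '{"overall": 5, "reviewText": "great"}\n'
    '{"overall": 1, "reviewText": "broke fast"}\n'
    '{"overall": 3, "reviewText": "ok"}\n'
)

records = read_first_records(fake_file, limit=2)
print(len(records), records[0]["overall"])
```

Reading lazily like this keeps memory use proportional to the number of records requested rather than to the size of the file.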

In [3]:
all_reviews.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,1,True,"02 19, 2015",A1V6B6TNIC10QE,143026860,theodore j bigham,great,One Star,1424304000,,,
1,4,True,"12 18, 2014",A2F5GHSXFQ0W6J,143026860,Mary K. Byke,My husband wanted to reading about the Negro ...,... to reading about the Negro Baseball and th...,1418860800,,,
2,4,True,"08 10, 2014",A1572GUYS7DGSR,143026860,David G,"This book was very informative, covering all a...",Worth the Read,1407628800,,,
3,5,True,"03 11, 2013",A1PSGLFK1NSVO,143026860,TamB,I am already a baseball fan and knew a bit abo...,Good Read,1362960000,,,
4,5,True,"12 25, 2011",A6IKXKZMTKGSC,143026860,shoecanary,This was a good story of the Black leagues. I ...,"More than facts, a good story read!",1324771200,5.0,,


In [4]:
all_reviews['overall'].unique()

array([1, 4, 5, 2, 3], dtype=int64)

In [5]:
all_reviews[all_reviews['reviewText'].isnull()]

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
547,5,True,"03 4, 2017",A3TQJ5AQXW6CZH,1620213982,mona,,Five Stars,1488585600,,{'Size:': ' Color42'},
3594,5,True,"05 21, 2015",A2CMCSBNYJETQY,1620213982,Bobby Hamrick,,Part of a great combo,1432166400,,{'Size:': ' 6.25 Inches'},[https://images-na.ssl-images-amazon.com/image...
4105,5,True,"01 7, 2015",A2W5DS4107108S,1620213982,Dimitry,,Just a good idea!,1420588800,3.0,{'Size:': ' 6.25 Inches'},[https://images-na.ssl-images-amazon.com/image...
6361,5,True,"11 19, 2016",A2MZYX8PMNV32V,B000050FDY,Amani albadawi,,Five Stars,1479513600,,{'Size:': ' 2 Count'},
6437,5,True,"03 20, 2016",ACEV4EGUYH56O,B000050FDY,Amazon Customer,,Five Stars,1458432000,,{'Size:': ' 2 Count'},


In [6]:
# reviewText has 5 NaN rows; drop them first
all_reviews = all_reviews[all_reviews['reviewText'].notnull()]
all_reviews[all_reviews['reviewText'].isnull()]

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image


In [7]:
# parse labels (overall) and corpus (reviewText)
corpus = all_reviews['reviewText'].tolist()

# transform labels: 1,2 --> 1 (negative), 3 --> 2 (neutral), 4,5 --> 3 (positive)
def transform_labels(x):
    if x in (1, 2):
        return 1
    if x == 3:
        return 2
    if x in (4, 5):
        return 3
    return x  # leave any unexpected value unchanged

labels = all_reviews['overall'].apply(transform_labels).tolist()

len(corpus), len(labels)

(9995, 9995)

In [8]:
corpus[0:5]

['great',
 "My  husband wanted to reading about the Negro Baseball and this a great addition to his library\n Our library doesn't haveinformation so this book is his start. Tthank you",
 'This book was very informative, covering all aspects of game.',
 'I am already a baseball fan and knew a bit about the Negro leagues, but I learned a lot more reading this book.',
 "This was a good story of the Black leagues. I bought the book to teach in my high school reading class. I found it very informative and exciting. I would recommend to anyone interested in the history of the black leagues. It is well written, unlike a book of facts. The McKissack's continue to write good books for young audiences that can also be enjoyed by adults!"]

In [9]:
labels[0:5]

[1, 3, 3, 3, 3]

In [10]:
#preprocessing data
# remove email addresses, punctuation, and newline characters (\n)

# check which reviews contain an email address
pattern = r"\S*@\S*\s?"
matches = []
for i,text in enumerate(corpus):
    match = re.findall(pattern, text)
    if len(match) != 0:
        matches.append((i, match))
matches

[(2726, ['@ ']),
 (4961, ['@ ']),
 (4972, ['ROBERTY....aaroberty@comcast.net.']),
 (6653, ['Youngbern@aol.com']),
 (6698, ['@ ']),
 (7963, ['leaned@Panasonic']),
 (9727, ['@ '])]

In [11]:
corpus[7963]

'Dead on arrival - tried to use this product but did not work - tried new batteries to no avail...past return date...sad, sad, sad...have several other Panasonic personal grooming shavers which prompted me to purchase this...lesson leaned@Panasonic'

In [12]:
pattern = r"\S*@\S*\s?|\n|\W"
' '.join(w for w in re.sub(pattern, ' ', corpus[7963]).lower().split())

'dead on arrival tried to use this product but did not work tried new batteries to no avail past return date sad sad sad have several other panasonic personal grooming shavers which prompted me to purchase this lesson'

In [13]:
# remove email addresses, punctuation, and newline characters (\n); lowercase everything
pattern = r"\S*@\S*\s?|\n|\W"
preprocess_text = lambda x: ' '.join(w for w in re.sub(pattern, ' ', x).lower().split())
corpus = [preprocess_text(text) for text in corpus]
corpus[7963]

'dead on arrival tried to use this product but did not work tried new batteries to no avail past return date sad sad sad have several other panasonic personal grooming shavers which prompted me to purchase this lesson'

In [14]:
corpus[:5]

['great',
 'my husband wanted to reading about the negro baseball and this a great addition to his library our library doesn t haveinformation so this book is his start tthank you',
 'this book was very informative covering all aspects of game',
 'i am already a baseball fan and knew a bit about the negro leagues but i learned a lot more reading this book',
 'this was a good story of the black leagues i bought the book to teach in my high school reading class i found it very informative and exciting i would recommend to anyone interested in the history of the black leagues it is well written unlike a book of facts the mckissack s continue to write good books for young audiences that can also be enjoyed by adults']
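To make the effect of the cleaning pattern easier to see, the sketch below applies the same regular expression to a made-up sample string (the email address is a placeholder, not taken from the data). Note one side effect visible in the notebook's own output: the first alternative removes any whole token containing `@`, so a word glued to an `@` (as in `leaned@Panasonic`) is dropped entirely.

```python
import re

# Same pattern as in the notebook: any token containing '@', newlines, and non-word characters.
PATTERN = r"\S*@\S*\s?|\n|\W"


def preprocess_text(text):
    """Replace every match of PATTERN with a space, lowercase, and collapse whitespace."""
    return ' '.join(re.sub(PATTERN, ' ', text).lower().split())


# Hypothetical sample text for illustration only.
sample = "Great razor!\nContact me: someone@example.com ...lesson leaned@Panasonic"
cleaned = preprocess_text(sample)
print(cleaned)
```

Since `\W` already matches `\n`, the middle alternative is redundant, though harmless.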

In [15]:
#split corpus and label into train and test
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2, random_state=0)
len(x_train), len(x_test), len(y_train), len(y_test)

(7996, 1999, 7996, 1999)
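Because the ratings are heavily skewed toward positive reviews, one option worth trying (an assumption here, not something this notebook does) is to pass `stratify=` to `train_test_split` so that the train and test sets keep the same class proportions. The toy texts and labels below are hypothetical.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 negative (1) and 16 positive (3) reviews.
toy_texts = [f"review {i}" for i in range(20)]
toy_labels = [1] * 4 + [3] * 16

# stratify=toy_labels keeps the 1:4 class ratio in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

Without stratification a rare class can end up under-represented in the test set by chance, which makes per-class precision and recall noisier.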

In [16]:
x_train[:5]

['great deal good price nice strong material i would buy many more if i needed to very fast delivery i just installed it and am hoping it last if it breaks or does not work i will update review',
 'got the 4 pack box for my husband and it is very good value cheaper than anywhere else that i ve seen so far',
 'great product and fast shipping',
 'quality finish heavy and long enough for a merkur 23c',
 'great product well made and stable provides my brush a great place to dry properly stylish appearance as well']

In [17]:
#change corpus into vector
#you can use tfidf or BoW here
vectorizer = TfidfVectorizer()
vectorizer.fit(x_train)

#transform training and testing corpus into vector form
x_train = vectorizer.transform(x_train)
x_test = vectorizer.transform(x_test)
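The comment in the cell above mentions bag-of-words (BoW) as an alternative to TF-IDF. A minimal sketch of that alternative, using scikit-learn's `CountVectorizer` on hypothetical toy sentences, might look like this. As with TF-IDF, the vocabulary is fitted on the training texts only, so words that appear only in the test texts are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus for illustration only.
toy_train = ["great product", "bad product bad price"]
toy_test = ["great price unknownword"]

bow = CountVectorizer()
bow.fit(toy_train)  # learn the vocabulary from the training texts only

X_train = bow.transform(toy_train)  # sparse matrix of raw term counts
X_test = bow.transform(toy_test)    # 'unknownword' is not in the vocabulary

vocab = sorted(bow.vocabulary_)     # column order is alphabetical
print(vocab)
print(X_train.toarray())
print(X_test.toarray())
```

Tree-based models split on thresholds, so they are fairly insensitive to the difference between raw counts and TF-IDF weights; comparing both is a cheap experiment.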

### Training and prediction

#### Decision Tree

In [18]:
#build classification model (decision tree, random forest, or adaboost)
#start training
decision_tree_cls = DecisionTreeClassifier(criterion='gini', max_depth=6)
decision_tree_cls.fit(x_train, y_train)

#start inference
y_pred = decision_tree_cls.predict(x_test)

#calculate accuracy
print(f'train accuracy score: {decision_tree_cls.score(x_train,y_train)}')
print(f'test accuracy score: {decision_tree_cls.score(x_test,y_test)}')

#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

train accuracy score: 0.9114557278639319
test accuracy score: 0.9044522261130565
              precision    recall  f1-score   support

           1       0.68      0.21      0.32       134
           2       0.00      0.00      0.00        73
           3       0.91      0.99      0.95      1792

    accuracy                           0.90      1999
   macro avg       0.53      0.40      0.42      1999
weighted avg       0.86      0.90      0.87      1999

[[  28    4  102]
 [   3    0   70]
 [  10    2 1780]]


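The results above show that the model does well on positive reviews (class 3: precision 0.91, recall 0.99) but poorly on negative reviews (class 1: recall 0.21). Neutral reviews (class 2) are never predicted correctly: 70 of the 73 are classified as positive and 3 as negative. Because the classes are heavily imbalanced (1,792 of the 1,999 test reviews are positive), the 0.90 test accuracy mostly reflects the majority class.
Try the methods you have learned to improve the model's performance.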
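One possible direction, offered as a sketch under assumptions rather than a recommended solution, is to weight the classes inversely to their frequency. scikit-learn's tree classifiers accept `class_weight='balanced'`, which uses the weights `n_samples / (n_classes * count_per_class)`. The toy labels and features below are hypothetical; on the real data the effect should be checked with the per-class metrics, not accuracy alone.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 2 negative, 1 neutral, 7 positive.
toy_labels = np.array([1, 1, 2, 3, 3, 3, 3, 3, 3, 3])
classes = np.unique(toy_labels)

# Show the weights that class_weight='balanced' would apply.
weights = compute_class_weight(class_weight='balanced', classes=classes, y=toy_labels)
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))
print(weight_by_class)

# Toy one-dimensional features, just so the tree has something to fit.
toy_features = np.arange(10).reshape(-1, 1)
balanced_tree = DecisionTreeClassifier(max_depth=6, class_weight='balanced', random_state=0)
balanced_tree.fit(toy_features, toy_labels)
train_predictions = balanced_tree.predict(toy_features)
```

The rarest class receives the largest weight, so misclassifying a neutral review costs the tree more than misclassifying a positive one.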

#### Random Forest

In [19]:
#start training
forest_cls = RandomForestClassifier(n_estimators=50, criterion='gini', max_depth=6)
forest_cls.fit(x_train, y_train)

#start inference
y_pred = forest_cls.predict(x_test)

#calculate accuracy
print(f'train accuracy score: {forest_cls.score(x_train,y_train)}')
print(f'test accuracy score: {forest_cls.score(x_test,y_test)}')

#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

train accuracy score: 0.892696348174087
test accuracy score: 0.896448224112056
              precision    recall  f1-score   support

           1       0.00      0.00      0.00       134
           2       0.00      0.00      0.00        73
           3       0.90      1.00      0.95      1792

    accuracy                           0.90      1999
   macro avg       0.30      0.33      0.32      1999
weighted avg       0.80      0.90      0.85      1999

[[   0    0  134]
 [   0    0   73]
 [   0    0 1792]]
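Here the forest predicts class 3 for every review, so precision for classes 1 and 2 is undefined (no predictions to divide by) and is reported as 0.00. The sketch below reproduces that situation on hypothetical toy labels and uses the `zero_division` argument of `classification_report` (available in scikit-learn 0.22 and later) to make the choice of 0 explicit.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical toy labels: the "model" always predicts the majority class 3.
y_true = [1, 2, 3, 3, 3, 3]
y_hat = [3, 3, 3, 3, 3, 3]

# zero_division=0 reports undefined precision as 0 without emitting a warning.
report = classification_report(
    y_true, y_hat, labels=[1, 2, 3], zero_division=0, output_dict=True
)
cm = confusion_matrix(y_true, y_hat, labels=[1, 2, 3])

print(report['1'], report['3'])
print(cm)  # rows are true classes, columns are predicted classes
```

A confusion matrix whose only non-zero column is the majority class is the clearest sign that the accuracy score is just the majority-class share.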




#### AdaBoost

In [20]:
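#start training
#the base learner is passed positionally: the keyword was `base_estimator`
#in older scikit-learn releases and is `estimator` from version 1.2 onward
adaboost_cls = AdaBoostClassifier(DecisionTreeClassifier(criterion='gini',
                                                         max_depth=6),
                                  n_estimators=50,
                                  learning_rate=0.8)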
adaboost_cls.fit(x_train, y_train)

#start inference
y_pred = adaboost_cls.predict(x_test)

#calculate accuracy
print(f'train accuracy score: {adaboost_cls.score(x_train,y_train)}')
print(f'test accuracy score: {adaboost_cls.score(x_test,y_test)}')

#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

train accuracy score: 0.9882441220610305
test accuracy score: 0.8894447223611806
              precision    recall  f1-score   support

           1       0.47      0.24      0.32       134
           2       0.05      0.01      0.02        73
           3       0.91      0.97      0.94      1792

    accuracy                           0.89      1999
   macro avg       0.48      0.41      0.43      1999
weighted avg       0.85      0.89      0.87      1999

[[  32    4   98]
 [   4    1   68]
 [  32   15 1745]]
