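### Assignment goal: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify review ratings (a text classification task).

Each review in the data is rated 1, 2, 3, 4, or 5. For this assignment we regroup the ratings into negative, neutral, and positive reviews (1,2 --> 1 negative; 3 --> 2 neutral; 4,5 --> 3 positive).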
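
The regrouping rule can be written as a small helper. This is a minimal sketch: the function name `to_sentiment_class` is an illustrative choice and is not used elsewhere in the notebook.

```python
def to_sentiment_class(overall):
    """Map a 1-5 star rating to 1 (negative), 2 (neutral), or 3 (positive)."""
    if overall in (1, 2):
        return 1
    if overall == 3:
        return 2
    if overall in (4, 5):
        return 3
    # Anything outside 1..5 is treated as bad input in this sketch.
    raise ValueError(f"unexpected rating: {overall!r}")


print([to_sentiment_class(r) for r in (1.0, 2.0, 3.0, 4.0, 5.0)])
```

Raising on an unexpected rating, rather than storing a placeholder, makes bad records visible immediately.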

In [1]:
# Download "All_Beauty.json.gz" from http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/All_Beauty.json.gz
#!gunzip All_Beauty.json.gz

### Load packages

In [2]:
import json
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer



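### Data preprocessing
The text data is fairly large, so the assignment suggests taking the first 10,000 records for practice. Note that the cells below load the whole file (371,345 records) rather than a 10,000-record subset.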
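
One way to keep only the first 10,000 records is to stop reading early with `itertools.islice`. The sketch below assumes the file holds one JSON object per line, as in the sample records; `load_first_reviews` and its `limit` parameter are hypothetical names.

```python
import json
from itertools import islice


def load_first_reviews(lines, limit=10000):
    """Parse at most `limit` JSON lines from an iterable of strings."""
    return [json.loads(line) for line in islice(lines, limit)]


# Usage with the notebook's file (assumes 'All_Beauty.json' has been unzipped):
# with open('All_Beauty.json', encoding='utf-8') as fp:
#     all_reviews = load_first_reviews(fp)

# Small in-memory stand-in for the file, so the sketch runs on its own.
sample = ['{"overall": 1.0, "reviewText": "great"}',
          '{"overall": 4.0, "reviewText": "nice"}',
          '{"overall": 5.0}']
print(len(load_first_reviews(sample, limit=2)))
```

Because `islice` stops after `limit` lines, the rest of the file is never parsed.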

In [3]:
#load json data
all_reviews = []
###<your code>###
file_name = 'All_Beauty.json'
#
#{"overall": 1.0, "verified": true, "reviewTime": "02 19, 2015", "reviewerID": "A1V6B6TNIC10QE", "asin": "0143026860", "reviewerName": "theodore j bigham", "reviewText": "great", "summary": "One Star", "unixReviewTime": 1424304000}
#{"overall": 4.0, "verified": true, "reviewTime": "12 18, 2014", "reviewerID": "A2F5GHSXFQ0W6J", "asin": "0143026860", "reviewerName": "Mary K. Byke", "reviewText": "My  husband wanted to reading about the Negro Baseball and this a great addition to his library\n Our library doesn't haveinformation so this book is his start. Tthank you", "summary": "... to reading about the Negro Baseball and this a great addition to his library Our library doesn't haveinformation so ...", "unixReviewTime": 1418860800}
#
with open(file_name, encoding='utf-8') as fp:
    i=0
    for line in fp:
        all_reviews.append(json.loads(line))  # parse each JSON line into a Python dict
        i+=1
        
i, type(all_reviews[0]), all_reviews[0]

(371345,
 dict,
 {'overall': 1.0,
  'verified': True,
  'reviewTime': '02 19, 2015',
  'reviewerID': 'A1V6B6TNIC10QE',
  'asin': '0143026860',
  'reviewerName': 'theodore j bigham',
  'reviewText': 'great',
  'summary': 'One Star',
  'unixReviewTime': 1424304000})

In [4]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

###<your code>###
for review in all_reviews:
    # keep only reviews that contain both 'reviewText' and 'overall'
    if 'reviewText' in review and 'overall' in review:
        # collect the review text (empty strings are kept as-is)
        corpus.append(review['reviewText'])
        # transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
        rating = review['overall']
        if 1.0 <= rating <= 2.0:
            labels.append(1)
        elif rating == 3.0:
            labels.append(2)
        elif 4.0 <= rating <= 5.0:
            labels.append(3)
        else:
            # unexpected rating: keep a placeholder so corpus and labels stay aligned
            labels.append('')

###<your code>###
print(len(corpus),len(labels))

370946 370946


In [5]:
corpus[:5], labels[:5]

(['great',
  "My  husband wanted to reading about the Negro Baseball and this a great addition to his library\n Our library doesn't haveinformation so this book is his start. Tthank you",
  'This book was very informative, covering all aspects of game.',
  'I am already a baseball fan and knew a bit about the Negro leagues, but I learned a lot more reading this book.',
  "This was a good story of the Black leagues. I bought the book to teach in my high school reading class. I found it very informative and exciting. I would recommend to anyone interested in the history of the black leagues. It is well written, unlike a book of facts. The McKissack's continue to write good books for young audiences that can also be enjoyed by adults!"],
 [1, 3, 3, 3, 3])

In [6]:
#preprocessing data
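#remove email addresses, punctuation, and newline characters (\n)
###<your code>###
pattern = r'\S*@\S*|\.|\\n'
for i, text in enumerate(corpus):
    # Replace e-mail addresses, periods, and literal backslash-n sequences with a space,
    # then trim leading and trailing spaces.
    # Note: r'\\n' matches the two characters "\" and "n", not a real newline,
    # and punctuation other than periods is left in place.
    corpus[i] = re.sub(pattern, ' ', text).strip(' ')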
    #if i >=10:
    #    break  


In [7]:
corpus[:5]

['great',
 "My  husband wanted to reading about the Negro Baseball and this a great addition to his library\n Our library doesn't haveinformation so this book is his start  Tthank you",
 'This book was very informative, covering all aspects of game',
 'I am already a baseball fan and knew a bit about the Negro leagues, but I learned a lot more reading this book',
 "This was a good story of the Black leagues  I bought the book to teach in my high school reading class  I found it very informative and exciting  I would recommend to anyone interested in the history of the black leagues  It is well written, unlike a book of facts  The McKissack's continue to write good books for young audiences that can also be enjoyed by adults!"]
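
The output above still contains commas, an exclamation mark, and a newline, because the pattern only targets e-mail addresses, periods, and the literal two-character sequence `\n`. A stricter cleaner might look like the sketch below. The name `clean_text` and the decision to keep apostrophes are assumptions, not part of the original assignment.

```python
import re

EMAIL = re.compile(r'\S+@\S+')    # whitespace-delimited tokens containing an @ sign
PUNCT = re.compile(r"[^\w\s']")   # anything except word chars, whitespace, apostrophes
SPACES = re.compile(r'\s+')       # runs of whitespace, including real newlines


def clean_text(text):
    """Drop e-mail addresses and punctuation, then collapse whitespace."""
    text = EMAIL.sub(' ', text)
    text = PUNCT.sub(' ', text)
    return SPACES.sub(' ', text).strip()


print(clean_text("Great book!\n Mail me at a.b@example.com, it doesn't disappoint."))
```

Removing e-mail addresses before punctuation matters: stripping the `@` and `.` first would leave the address fragments behind as ordinary words.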

In [8]:
print(len(corpus),len(labels))

370946 370946


In [9]:
#split corpus and label into train and test
###<your code>###
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2, random_state=0)
len(x_train), len(x_test), len(y_train), len(y_test)

(296756, 74190, 296756, 74190)

In [10]:
#change corpus into vector
#you can use tfidf or BoW here
###<your code>###
vectorizer = TfidfVectorizer()
vectorizer.fit(x_train)

#transform training and testing corpus into vector form
x_train = vectorizer.transform(x_train) ###<your code>###
x_test = vectorizer.transform(x_test) ###<your code>###

In [11]:
x_train.shape, len(y_train), x_test.shape, len(y_test)

((296756, 65893), 296756, (74190, 65893), 74190)
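
Each row of the matrices above is one review, and each of the 65,893 columns is one vocabulary term. To show what the weights in such a row mean, here is a small pure-Python sketch of TF-IDF using the smoothed IDF and L2 normalisation that `TfidfVectorizer` applies by default. It is meant as an illustration of the weighting, not a replacement for the library, and the helper name `tfidf` is hypothetical.

```python
import math
import re
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights for `docs`."""
    # Lowercase and keep tokens of two or more word characters.
    tokenised = [re.findall(r'\b\w\w+\b', d.lower()) for d in docs]
    vocab = sorted({t for toks in tokenised for t in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for toks in tokenised for t in set(toks))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit length; an all-zero row is left as zeros.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


vocab, rows = tfidf(["great book", "great great product", "bad product"])
print(vocab)
print([round(w, 3) for w in rows[0]])
```

In the first document, "book" gets a larger weight than "great" because it appears in fewer documents.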

### Training and prediction

In [28]:
#build classification model (decision tree, random forest, or adaboost)
#start training

###<your code>###
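# Note: scikit-learn 1.2+ renames `base_estimator` to `estimator` (the old name was removed in 1.4).
model = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',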
                                  max_depth=4,
                                  min_samples_split=20,
                                  min_samples_leaf=5),
                                  n_estimators=50,
                                  learning_rate=0.1)
model.fit(x_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',
                                                         max_depth=4,
                                                         min_samples_leaf=5,
                                                         min_samples_split=20),
                   learning_rate=0.1)
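
To give a feel for what boosting does between trees, the sketch below performs one reweighting round of discrete SAMME, a multi-class AdaBoost variant. It is an illustration only: the scikit-learn version used in this run may have applied a different variant by default, the sketch omits the early-stopping checks a full implementation needs, and `samme_round` is a hypothetical helper.

```python
import math


def samme_round(weights, y_true, y_pred, n_classes, learning_rate=0.1):
    """One discrete SAMME boosting round: return (alpha, new_weights)."""
    total = sum(weights)
    # Weighted error rate of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    if err <= 0:
        # A perfect learner leaves the weights unchanged in this sketch.
        return 1.0, list(weights)
    # The learner's vote; log(n_classes - 1) is the multi-class correction.
    alpha = learning_rate * (math.log((1 - err) / err) + math.log(n_classes - 1))
    # Up-weight misclassified samples, then renormalise so the weights sum to 1.
    boosted = [w * math.exp(alpha) if t != p else w
               for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(boosted)
    return alpha, [w / norm for w in boosted]


# A weak learner that always predicts class 3 on four equally weighted samples.
alpha, new_w = samme_round([0.25] * 4, [1, 2, 3, 3], [3, 3, 3, 3], n_classes=3)
print(round(alpha, 4), [round(w, 4) for w in new_w])
```

After the round, the two misclassified samples carry more weight than the two correct ones, so the next tree is pushed toward the minority classes. A small `learning_rate` such as 0.1 makes each such shift gentle.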

In [29]:
#start inference
###<your code>###
y_pred = model.predict(x_test)

In [30]:
#calculate accuracy
###<your code>###
accuracy = model.score(x_test,y_test)
print(f"Accuracy: {accuracy}")
# check how many weak classifiers the AdaBoost ensemble contains
print(f"Number of trees: {len(model.estimators_)}")

Accuracy: 0.8021701037875725
Number of trees: 50


In [31]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.76      0.29      0.42     11877
           2       0.54      0.01      0.03      5788
           3       0.81      0.99      0.89     56525

    accuracy                           0.80     74190
   macro avg       0.70      0.43      0.45     74190
weighted avg       0.78      0.80      0.75     74190

[[ 3464    35  8378]
 [  574    85  5129]
 [  523    38 55964]]
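
The per-class figures in the report can be recomputed by hand from the confusion matrix, where rows are true classes and columns are predicted classes. A short sketch using the numbers printed above:

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
cm = [[3464, 35, 8378],
      [574, 85, 5129],
      [523, 38, 55964]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total
print(f"accuracy = {accuracy:.4f}")

for i, label in enumerate((1, 2, 3)):
    predicted_as_i = sum(row[i] for row in cm)   # column sum
    actually_i = sum(cm[i])                      # row sum (the "support" column)
    precision = cm[i][i] / predicted_as_i
    recall = cm[i][i] / actually_i
    print(f"class {label}: precision = {precision:.2f}, recall = {recall:.2f}")
```

Reading along row 2 shows where neutral reviews end up: 574 are predicted as class 1, 85 as class 2, and 5,129 as class 3.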


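The results above show that the model handles positive reviews well (precision 0.81, recall 0.99) but does poorly on negative reviews (recall 0.29). Neutral reviews are almost never recovered (recall 0.01): 5,129 of the 5,788 neutral reviews are predicted as positive, and most misclassified negative reviews (8,378) also land in the positive class.
Try the methods you have learned so far to improve the model's performance.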
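
The support column shows that positive reviews make up about 76% of the test set. One common response to such imbalance is to weight classes inversely to their frequency. The sketch below computes weights with the `n_samples / (n_classes * count)` heuristic, the formula scikit-learn documents for `class_weight='balanced'`. Whether it helps on this data is untested, the labels are toy data, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}


# Toy labels with roughly the imbalance seen in the test-set support column.
toy_labels = [1] * 16 + [2] * 8 + [3] * 76
weights = balanced_class_weights(toy_labels)
print(weights)
```

Weights like these could be passed, for example, through the `class_weight` parameter of `DecisionTreeClassifier`, so that mistakes on the rare classes cost more during training.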