### Assignment goal: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify review ratings (text classification).

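Reviews in the dataset are rated 1, 2, 3, 4, or 5. In this assignment we regroup them into three classes: negative, neutral, and positive (1,2 --> 1 negative; 3 --> 2 neutral; 4,5 --> 3 positive).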

### Load packages

In [1]:
import json
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

### Data preprocessing
The text data is fairly large, so we use only the first 10,000 records for this exercise.

In [5]:
# load the JSON Lines data (one review per line)
all_reviews = []
with open('All_Beauty.json', 'r', encoding='utf-8') as f:
    for line in f:
        all_reviews.append(json.loads(line))
all_reviews[0]

{'overall': 1.0,
 'verified': True,
 'reviewTime': '02 19, 2015',
 'reviewerID': 'A1V6B6TNIC10QE',
 'asin': '0143026860',
 'reviewerName': 'theodore j bigham',
 'reviewText': 'great',
 'summary': 'One Star',
 'unixReviewTime': 1424304000}
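
Because only the first 10,000 records are used later, the file does not have to be loaded in full. The sketch below is one way to stop reading early. `load_first_n` is a hypothetical helper name, not part of the original notebook, and the demo writes a tiny temporary JSON Lines file with made-up records instead of touching `All_Beauty.json`.

```python
import json
import os
import tempfile
from itertools import islice


def load_first_n(path, n):
    """Parse at most n JSON Lines records from path (hypothetical helper)."""
    with open(path, 'r', encoding='utf-8') as f:
        # Iterating over the file object reads one line at a time,
        # and islice stops after n lines.
        return [json.loads(line) for line in islice(f, n)]


# Demo on a small temporary file with three made-up records.
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False,
                                 encoding='utf-8') as tmp:
    for score in (1.0, 3.0, 5.0):
        tmp.write(json.dumps({'overall': score, 'reviewText': 'sample'}) + '\n')
    demo_path = tmp.name

first_two = load_first_n(demo_path, 2)
os.remove(demo_path)
print(first_two)
```

Under the same assumption, `load_first_n('All_Beauty.json', 10000)` would yield the records that the slice `all_reviews[:10000]` selects.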

In [6]:
# parse label(overall) and corpus(reviewText)
corpus, labels = [], []
for review in all_reviews[:10000]:
    if 'reviewText' not in review or 'overall' not in review:
        continue
    corpus.append(review['reviewText'])
    labels.append(review['overall'])
        
# transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
label_map = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}
labels = [label_map[int(label)] for label in labels]
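
Before training, it can help to check how many reviews fall into each class after the mapping. This is a minimal sketch using `collections.Counter`; the `raw_scores` list is made-up illustration data, not taken from the dataset.

```python
from collections import Counter

label_map = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

# Made-up star ratings, stored as floats like the 'overall' field.
raw_scores = [5.0, 5.0, 4.0, 1.0, 3.0, 5.0, 2.0, 4.0]
mapped = [label_map[int(score)] for score in raw_scores]

# Count the reviews per class and print each class's share.
counts = Counter(mapped)
for label in sorted(counts):
    share = counts[label] / len(mapped)
    print(f"class {label}: {counts[label]} reviews ({share:.1%})")
```

Running the same loop on the real `labels` list would show how uneven the three classes are.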

In [14]:
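# Approach 1
import nltk
from nltk.corpus import stopwords

# word_tokenize, pos_tag, and the lemmatizer used below also need the tokenizer,
# POS-tagger, and WordNet resources (resource names vary by NLTK version)
nltk.download('stopwords')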

# Lemmatize with POS Tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 

# Create the lemmatizer
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map the pos_tag result to the POS format expected by the lemmatizer."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)


def clean_content(X):
    stop_words = set(stopwords.words('english'))
    content_clean = []
    # replace email addresses, literal "\n" sequences, and non-alphanumeric characters with spaces
    pattern = r"\S*@\S*|\\n|[^a-zA-Z0-9 ]"
    X_clean = re.sub(pattern,' ', X).lower()
    # tokenize
    X_word_tokenize = nltk.word_tokenize(X_clean)

    # drop stopwords, then lemmatize the remaining words
    for word in X_word_tokenize:
        if word not in stop_words:
            word = lemmatizer.lemmatize(word, get_wordnet_pos(word))
            content_clean.append(word)
                
    return ' '.join(content_clean)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
# Approach 2
# preprocess the data:
# remove email addresses, punctuation, and line breaks (literal "\n" sequences and real newlines)
pattern = r'\S*@\S*|\\n|\W'

def preprocess_text(text):
    # str.split() never yields empty strings, so no extra filtering is needed
    return ' '.join(re.sub(pattern, ' ', text).split())

corpus = [preprocess_text(text) for text in corpus]
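
To see what this cleaning step does, the sketch below applies the same pattern to a single made-up review. The sample string is an assumption for illustration only. `preprocess_text` is written here as a plain function so the block runs on its own.

```python
import re

pattern = r'\S*@\S*|\\n|\W'


def preprocess_text(text):
    # Replace every match with a space, then collapse repeated whitespace.
    return ' '.join(re.sub(pattern, ' ', text).split())


# Made-up review; '\\n' below is a literal backslash followed by 'n'.
sample = 'Great product!\\nContact me: foo@bar.com'
cleaned = preprocess_text(sample)
print(cleaned)  # Great product Contact me
```

Letter case is left unchanged by this step. `TfidfVectorizer` lowercases text by default, so that is handled later.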

In [8]:
# split corpus and labels into train and test
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2, random_state=0)
len(x_train), len(x_test), len(y_train), len(y_test)

(7996, 1999, 7996, 1999)
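
When classes are uneven, a plain random split can give the test set a slightly different class mix than the training set. One option to try is the `stratify` argument of `train_test_split`, which keeps class proportions the same in both parts. The sketch uses made-up toy data. It is not what this notebook ran, and applying it here would change the counts and scores reported in the later cells.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 80 positive (3), 10 neutral (2), and 10 negative (1) labels.
toy_labels = [3] * 80 + [2] * 10 + [1] * 10
toy_texts = [f"review {i}" for i in range(len(toy_labels))]

# stratify=toy_labels keeps the 80/10/10 class ratio in both splits.
tr_x, te_x, tr_y, te_y = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels
)
test_counts = Counter(te_y)
print(test_counts)
```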

In [9]:
# convert the corpus into vectors
# either tf-idf or bag-of-words (BOW) works here
vectorizer = TfidfVectorizer()
vectorizer.fit(x_train)

# transform training and testing corpus into vector form
x_train = vectorizer.transform(x_train)
x_test = vectorizer.transform(x_test)
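
The difference between the two vectorizers is easier to see on a tiny corpus. The sketch below uses made-up sentences (an assumption, not the review data). It fits `CountVectorizer` (BOW) and `TfidfVectorizer` on the training text only, then transforms a test sentence that contains unseen words.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_train = ["great product", "bad product", "great great value"]
toy_test = ["great unseen word"]

bow = CountVectorizer()
bow_train = bow.fit_transform(toy_train)  # learn the vocabulary from training text only
bow_test = bow.transform(toy_test)        # words not seen in training are ignored

tfidf = TfidfVectorizer()
tfidf_train = tfidf.fit_transform(toy_train)

print(sorted(bow.vocabulary_))  # vocabulary in column order
print(bow_train.toarray())      # raw word counts per document
print(bow_test.toarray())
print(tfidf_train.shape)        # same columns, tf-idf weights instead of counts
```

Fitting on the training split only, as the cell above does, keeps test vocabulary from leaking into the features.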

### Training and prediction

In [10]:
# build classification model (decision tree, random forest, or adaboost)
# start training
decision_tree_cls = DecisionTreeClassifier(criterion='entropy', max_depth=6)
decision_tree_cls.fit(x_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=6)
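
The comment in the cell above also mentions random forest and AdaBoost. As a rough sketch of how those two already-imported classifiers could be swapped in, the block below trains both on synthetic data from `make_classification`. The synthetic data and the hyperparameter values are assumptions chosen for illustration, not tuned settings or results for the review data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in for the tf-idf features (not the review data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Every scikit-learn classifier shares the same fit / score interface.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

In the notebook itself, the same `fit` and `predict` calls would take the vectorized `x_train` and `x_test` in place of the synthetic arrays.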

In [11]:
# start inference
y_pred = decision_tree_cls.predict(x_test)

In [12]:
# calculate accuracy
print(f"Accuracy: {decision_tree_cls.score(x_test, y_test)}")

Accuracy: 0.9004502251125562


In [13]:
# calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.61      0.14      0.23       134
           2       0.33      0.01      0.03        73
           3       0.91      0.99      0.95      1792

    accuracy                           0.90      1999
   macro avg       0.62      0.38      0.40      1999
weighted avg       0.87      0.90      0.87      1999

[[  19    1  114]
 [   1    1   71]
 [  11    1 1780]]
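
The numbers in the report can be reproduced directly from the confusion matrix, which also makes the class imbalance visible. The sketch below copies the matrix from the output above and uses plain Python. The comparison with a "majority baseline" (always predicting class 3) is an added illustration, not part of the original notebook.

```python
# Confusion matrix copied from the output above
# (rows = true class 1, 2, 3; columns = predicted class 1, 2, 3).
cm = [
    [19, 1, 114],
    [1, 1, 71],
    [11, 1, 1780],
]

support = [sum(row) for row in cm]        # true count per class
total = sum(support)
accuracy = sum(cm[i][i] for i in range(3)) / total
recall = [cm[i][i] / support[i] for i in range(3)]
majority_baseline = max(support) / total  # always predicting class 3

print(f"accuracy          = {accuracy:.3f}")
print(f"majority baseline = {majority_baseline:.3f}")
for label, r in zip((1, 2, 3), recall):
    print(f"recall class {label}    = {r:.2f}")
```

The accuracy (about 0.900) is only slightly above the majority baseline (about 0.896), which is why the per-class recall values are more informative here.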


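The results above show that the model does well on positive reviews (both precision and recall are high) but poorly on negative and neutral ones. According to the confusion matrix, most negative and neutral reviews are misclassified as positive (114 of 134 and 71 of 73, respectively), so the 0.90 accuracy largely reflects the dominant positive class (1,792 of the 1,999 test reviews).
Try the methods you have learned to improve the model's performance.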
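
One method that may be worth trying for imbalanced classes is the `class_weight="balanced"` option of `DecisionTreeClassifier`, compared using macro-averaged F1 rather than accuracy. The sketch below runs on synthetic imbalanced data from `make_classification`. The data and class proportions are assumptions for illustration, and no improvement on the review data is implied.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced 3-class data standing in for the review vectors.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, n_classes=3,
    weights=[0.07, 0.04, 0.89], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Train the same tree with and without class weighting and compare macro F1.
macro_f1 = {}
for class_weight in (None, "balanced"):
    tree = DecisionTreeClassifier(
        criterion="entropy", max_depth=6,
        class_weight=class_weight, random_state=0,
    )
    tree.fit(X_tr, y_tr)
    macro_f1[class_weight] = f1_score(y_te, tree.predict(X_te), average="macro")
    print(f"class_weight={class_weight}: macro F1 = {macro_f1[class_weight]:.3f}")
```

Macro F1 weights the three classes equally, so it reacts to changes in the small classes that accuracy would mostly hide.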