### Assignment goal: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) for review rating classification (a text classification task).

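The data rates each review from 1 to 5. In this assignment we regroup the ratings into three classes: negative, neutral, and positive (1,2 --> 1 negative; 3 --> 2 neutral; 4,5 --> 3 positive).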
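The regrouping described above can be sketched as a small lookup table. The names `RATING_TO_CLASS` and `to_sentiment_class` are hypothetical and used only for illustration; the notebook itself relabels the list in place with an `if`/`elif` chain.

```python
# Hypothetical lookup table for the regrouping: 1,2 -> 1, 3 -> 2, 4,5 -> 3
RATING_TO_CLASS = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

def to_sentiment_class(overall):
    """Map a 1-5 star rating (possibly a float such as 5.0) to class 1, 2, or 3."""
    return RATING_TO_CLASS[int(overall)]

print([to_sentiment_class(r) for r in [1.0, 2.0, 3.0, 4.0, 5.0]])  # [1, 1, 2, 3, 3]
```

A table like this raises a `KeyError` on an unexpected rating instead of silently assigning it to the positive class, which may or may not be the behavior you want.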

### Load packages

In [11]:
import json
import re
import nltk
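nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
# nltk.word_tokenize and WordNetLemmatizer (used below) also need the 'punkt'
# tokenizer models and the 'wordnet' corpus; download them too if they are missing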

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from nltk.corpus import stopwords

# Lemmatize with POS Tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


### Data preprocessing
The text data is fairly large, so we take only the first 10,000 records for this exercise.
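If memory is a concern, one option is to stop reading after the first 10,000 lines instead of parsing the whole file and slicing afterwards. The sketch below assumes the file has one JSON object per line, as the loading cell in this notebook does; `load_first_reviews` is a hypothetical helper name, not part of the assignment template.

```python
import json
from itertools import islice

def load_first_reviews(path, limit=10000):
    """Parse at most `limit` JSON-lines records from `path`."""
    with open(path, "r", encoding="utf-8") as f:
        # islice stops iterating after `limit` lines, so later lines are never parsed
        return [json.loads(line) for line in islice(f, limit)]
```

It could be called as `all_reviews = load_first_reviews("./data/All_Beauty.json")`, after which the `[:10000]` slice in the parsing cell becomes a no-op.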

In [12]:
#load json data
all_reviews = []
###<your code>###
with open("./data/All_Beauty.json", "r",encoding="utf-8") as f:
    for review in f:
        all_reviews.append(json.loads(review))
        
all_reviews[0]

{'overall': 1.0,
 'verified': True,
 'reviewTime': '02 19, 2015',
 'reviewerID': 'A1V6B6TNIC10QE',
 'asin': '0143026860',
 'reviewerName': 'theodore j bigham',
 'reviewText': 'great',
 'summary': 'One Star',
 'unixReviewTime': 1424304000}

In [13]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

###<your code>###
for review in all_reviews[:10000]:
    if review.get("reviewText",False) and review.get("overall",False):
        corpus.append(review["reviewText"])
        labels.append(review["overall"])
        

#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
###<your code>###
for i, label in enumerate(labels):
    if label == 1 or label == 2:
        labels[i] = 1
    elif label == 3:
        labels[i] = 2
    else:
        labels[i] = 3


In [14]:
#preprocess the data
#remove email addresses, punctuation, and newline characters (\n)

###<your code>###

lemmatizer = WordNetLemmatizer() 
def get_wordnet_pos(word):
    """Map the pos_tag result to the POS format expected by the lemmatizer."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def clean_content(corpus):
    pattern = r'\S*@\S*|\\n|[^a-zA-Z0-9 ]'
    # keep exactly one output per input so the corpus stays aligned with labels
    X_clean = [re.sub(pattern," ",x).lower() for x in corpus]
    # tokenize
    X_word_tokenize = [nltk.word_tokenize(x) for x in X_clean]
    # stopwords_lemmatizer
    X_stopwords_lemmatizer = []
    stop_words = set(stopwords.words('english'))
    for content in X_word_tokenize:
        content_clean = []
        for word in content:
            if word not in stop_words:
                word = lemmatizer.lemmatize(word, get_wordnet_pos(word))
                content_clean.append(word)
        X_stopwords_lemmatizer.append(content_clean)
    
    X_output = [' '.join(x) for x in X_stopwords_lemmatizer]
    
    return X_output
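To see what the regular expression in `clean_content` does on its own, here is a minimal sketch that applies only the substitution and lowercasing steps to a made-up review string (the sample text is an assumption, not taken from the dataset). Tokenization, stopword removal, and lemmatization are left out so the snippet runs without any NLTK data.

```python
import re

# Same pattern as in clean_content: e-mail-like tokens, a literal backslash
# followed by "n", and any character that is not an ASCII letter, digit, or space
pattern = r'\S*@\S*|\\n|[^a-zA-Z0-9 ]'

sample = "Contact me at foo@bar.com!\nGreat product."  # made-up review text
cleaned = re.sub(pattern, " ", sample).lower()
print(cleaned.split())  # ['contact', 'me', 'at', 'great', 'product']
```

Note that the real newline character in `sample` is removed by the last alternative (`[^a-zA-Z0-9 ]`); the `\\n` alternative only matches the two-character sequence backslash + `n`.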
    

In [15]:
corpus = clean_content(corpus)

In [18]:
#split corpus and label into train and test
###<your code>###
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.1, random_state=0)

len(x_train), len(x_test), len(y_train), len(y_test)

(8995, 1000, 8995, 1000)
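Because the three classes are far from balanced, it can be worth checking the label distribution and keeping it the same in both splits. The sketch below uses a toy label list (an assumption standing in for the notebook's `labels`) and the `stratify` argument of `train_test_split`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for the notebook's corpus/labels, with a similar kind of imbalance
toy_labels = [3] * 80 + [1] * 12 + [2] * 8
toy_corpus = [f"review {i}" for i in range(len(toy_labels))]

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_corpus, toy_labels,
    test_size=0.25, random_state=0,
    stratify=toy_labels,  # preserve class proportions in train and test
)

print("all  :", Counter(toy_labels))
print("train:", Counter(y_tr))
print("test :", Counter(y_te))
```

In the notebook itself this would amount to adding `stratify=labels` to the existing call; the cell above does not do this, so its split is purely random.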

In [19]:
#change corpus into vector
#you can use tfidf or BoW here
###<your code>###
tfidf_vec = TfidfVectorizer()
tfidf_vec.fit(x_train)

#transform training and testing corpus into vector form
x_train = tfidf_vec.transform(x_train).toarray()
x_test = tfidf_vec.transform(x_test).toarray()
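`TfidfVectorizer` returns a SciPy sparse matrix, and scikit-learn's tree-based estimators accept sparse input directly, so the `.toarray()` calls are optional. A minimal sketch on made-up documents (the texts and labels below are assumptions for illustration only):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Made-up, already-cleaned documents with toy labels (3 = positive, 1 = negative)
docs = [
    "great product love it",
    "terrible waste of money",
    "love the smell great",
    "broke after one day terrible",
]
toy_y = [3, 1, 3, 1]

vec = TfidfVectorizer()
X_sparse = vec.fit_transform(docs)  # sparse matrix; no .toarray() needed
print(issparse(X_sparse), X_sparse.shape)

# The tree is fitted on the sparse matrix as-is
clf = DecisionTreeClassifier(random_state=0).fit(X_sparse, toy_y)
print(clf.predict(X_sparse))
```

With a vocabulary built from thousands of reviews, the dense array is mostly zeros, so staying sparse can save a considerable amount of memory.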

### Training and prediction

In [20]:
#build classification model (decision tree, random forest, or adaboost)
#start training

###<your code>###
# note: scikit-learn 1.2 renamed base_estimator to estimator; the old name was removed in 1.4
tree = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',max_depth=3,min_samples_split=10,min_samples_leaf=5),
                          n_estimators=10,
                          learning_rate=0.5)
tree.fit(x_train,y_train)


AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',
                                                         max_depth=3,
                                                         min_samples_leaf=5,
                                                         min_samples_split=10),
                   learning_rate=0.5, n_estimators=10)
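The `base_estimator` keyword used above was renamed to `estimator` in scikit-learn 1.2 and the old name was later removed. If the notebook has to run on both older and newer versions, one possible workaround is to pick the keyword at runtime. `make_adaboost` below is a hypothetical helper, not a scikit-learn function.

```python
import inspect
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def make_adaboost(weak_learner, **kwargs):
    """Build an AdaBoostClassifier using whichever keyword this version supports."""
    params = inspect.signature(AdaBoostClassifier.__init__).parameters
    # Prefer the new name; fall back to the old one on older releases
    key = "estimator" if "estimator" in params else "base_estimator"
    return AdaBoostClassifier(**{key: weak_learner}, **kwargs)

# Same hyperparameters as the cell above
weak = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              min_samples_split=10, min_samples_leaf=5)
model = make_adaboost(weak, n_estimators=10, learning_rate=0.5)
print(type(model).__name__, model.n_estimators, model.learning_rate)
```

If only one scikit-learn version matters, simply writing the matching keyword is clearer than this indirection.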

In [21]:
#start inference
y_pred = tree.predict(x_test)

In [22]:
#calculate accuracy
###<your code>###
print(f'Accuracy: {tree.score(x_test,y_test)}')

Accuracy: 0.89


In [23]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.64      0.10      0.18        69
           2       0.00      0.00      0.00        44
           3       0.89      1.00      0.94       887

    accuracy                           0.89      1000
   macro avg       0.51      0.37      0.37      1000
weighted avg       0.84      0.89      0.85      1000

[[  7   0  62]
 [  0   0  44]
 [  4   0 883]]


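The numbers in the classification report can be recomputed by hand from the confusion matrix, which makes it easier to see where the model goes wrong. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes) and uses plain Python only.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[7, 0, 62],
      [0, 0, 44],
      [4, 0, 883]]
classes = [1, 2, 3]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(classes))) / total
print(f"accuracy = {accuracy:.2f}")

metrics = {}
for i, c in enumerate(classes):
    predicted = sum(row[i] for row in cm)  # column sum: how often class c was predicted
    actual = sum(cm[i])                    # row sum: how many true samples of class c
    # When a class is never predicted, precision is undefined; report 0.0 in that case
    precision = cm[i][i] / predicted if predicted else 0.0
    recall = cm[i][i] / actual if actual else 0.0
    metrics[c] = (precision, recall, actual)
    print(f"class {c}: precision = {precision:.2f}, recall = {recall:.2f}, support = {actual}")
```

Class 2 has an all-zero column, meaning the model never predicts it, which is why its precision and recall both come out as 0.00.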


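From the results above, the model does well on positive reviews (both precision and recall are high) but poorly on the other two classes. For negative reviews, recall is only 0.10: 62 of the 69 are predicted as positive. The neutral class is never predicted at all: all 44 neutral reviews are classified as positive, not as negative. The test set is heavily imbalanced (887 of 1000 reviews are positive), so the 0.89 accuracy is about what always predicting "positive" would achieve (0.887).
Students can try the methods they have learned to improve the model's performance.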
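One common thing to try with imbalanced classes is class weighting. The sketch below only shows how "balanced" weights are computed and how they could be passed to a model; the class counts are taken from the test-set support in the report above, and whether weighting actually improves this model was not measured in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the support column of the report above
y = np.array([1] * 69 + [2] * 44 + [3] * 887)
classes = np.array([1, 2, 3])

# "balanced" weight for class c = n_samples / (n_classes * count(c))
weights = compute_class_weight("balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.round(2).tolist())))

# The same weighting can be requested directly on a tree-based model
forest = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
```

Rare classes get larger weights, so misclassifying a neutral or negative review costs more during training. Other options include resampling the training set or tuning the decision threshold; treat all of these as experiments to evaluate, not guaranteed fixes.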