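### Assignment goal: classify review text with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify reviews by rating (a text-classification task).

The data rates each review from 1 to 5. In this assignment we regroup the ratings into three classes: negative, neutral, and positive (1,2-->1 negative, 3-->2 neutral, 4,5-->3 positive).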

### Load packages

In [76]:
import json
import re
import nltk
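nltk.download('stopwords')
# The preprocessing cell also relies on word_tokenize, pos_tag, and WordNet.
# If those resources are missing, download them too (resource names vary by NLTK version):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')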

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from nltk.corpus import stopwords

# Lemmatize with POS Tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\03950\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data preprocessing
The full text dataset is fairly large, so we use only the first 10,000 records for this exercise.

In [77]:
#load json data
all_reviews = []
###<your code>###
with open("./data/All_Beauty.json", "r",encoding="utf-8") as f:
    for review in f:
        all_reviews.append(json.loads(review))
        
all_reviews[0]

{'overall': 1.0,
 'verified': True,
 'reviewTime': '02 19, 2015',
 'reviewerID': 'A1V6B6TNIC10QE',
 'asin': '0143026860',
 'reviewerName': 'theodore j bigham',
 'reviewText': 'great',
 'summary': 'One Star',
 'unixReviewTime': 1424304000}

In [110]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

###<your code>###
for review in all_reviews[:10000]:
    if review.get("reviewText",False) and review.get("overall",False):
        corpus.append(review["reviewText"])
        labels.append(review["overall"])
        

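#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
###<your code>###
# NOTE: the mapping below is commented out, so every later output
# (the split, the model, and the classification report) uses the raw 1-5 ratings.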
# for i, label in enumerate(labels):
#     if label == 1 or label == 2:
#         labels[i] = 1
#     elif label == 3:
#         labels[i] = 2
#     else:
#         labels[i] = 3
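
The regrouping described at the top of the notebook can also be written as a small function. This is a minimal sketch, not the notebook's own code: `to_sentiment_class` and `sample_ratings` are illustrative names, and the ratings are assumed to be floats as in the dataset.

```python
def to_sentiment_class(rating):
    """Map a 1-5 star rating to 1 (negative), 2 (neutral), or 3 (positive)."""
    if rating <= 2:
        return 1
    if rating == 3:
        return 2
    return 3

# Hypothetical sample of raw ratings, in the float form used by the dataset
sample_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_classes = [to_sentiment_class(r) for r in sample_ratings]
print(sample_classes)  # [1, 1, 2, 3, 3]
```

Applying it as `labels = [to_sentiment_class(l) for l in labels]` before the split would change every downstream result, so the outputs shown in this notebook would need to be regenerated.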


In [111]:
corpus

['great',
 "My  husband wanted to reading about the Negro Baseball and this a great addition to his library\n Our library doesn't haveinformation so this book is his start. Tthank you",
 'This book was very informative, covering all aspects of game.',
 'I am already a baseball fan and knew a bit about the Negro leagues, but I learned a lot more reading this book.',
 "This was a good story of the Black leagues. I bought the book to teach in my high school reading class. I found it very informative and exciting. I would recommend to anyone interested in the history of the black leagues. It is well written, unlike a book of facts. The McKissack's continue to write good books for young audiences that can also be enjoyed by adults!",
 'Today I gave a book about the Negro Leagues of Baseball to a traveling friend. Its a book I\'ve read more than once and felt that my friend would truly enjoy. It felt like giving a gift that you wanted to keep for yourself. I parted with the book knowing that

In [112]:
#preprocessing data
#remove email addresses, punctuation, and newline characters (\n)

###<your code>###

lemmatizer = WordNetLemmatizer() 
def get_wordnet_pos(word):
    """Map the pos_tag result to the POS format expected by the lemmatizer."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def clean_content(corpus):
    pattern = r'\S*@\S*|\\n|[^a-zA-Z0-9 ]'
    # Keep one cleaned string per input so the corpus stays aligned with labels
    X_clean = [re.sub(pattern, " ", x).lower() for x in corpus]
    # tokenize
    X_word_tokenize = [nltk.word_tokenize(x) for x in X_clean]
    # stopwords_lemmatizer
    X_stopwords_lemmatizer = []
    stop_words = set(stopwords.words('english'))
    for content in X_word_tokenize:
        content_clean = []
        for word in content:
            if word not in stop_words:
                word = lemmatizer.lemmatize(word, get_wordnet_pos(word))
                content_clean.append(word)
        X_stopwords_lemmatizer.append(content_clean)
    
    X_output = [' '.join(x) for x in X_stopwords_lemmatizer]
    
    return X_output
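
To see what the cleaning pattern in `clean_content` actually removes, it helps to run it on a couple of strings. The sketch below uses only the standard library; the sample strings are made up for illustration. It shows that real newline characters are caught by the `[^a-zA-Z0-9 ]` class, while the `\\n` alternative targets a literal backslash followed by `n`, as found in text where newlines are stored escaped.

```python
import re

# Same pattern as in clean_content
pattern = r'\S*@\S*|\\n|[^a-zA-Z0-9 ]'

# Made-up sample containing an email address, punctuation, and a real newline
sample = "Contact me at jane.doe@example.com!\nGreat product, 5/5."
tokens = re.sub(pattern, " ", sample).lower().split()
print(tokens)

# Made-up sample containing a literal backslash + "n" (raw string, no real newline)
escaped = r"line one\nline two"
escaped_tokens = re.sub(pattern, " ", escaped).lower().split()
print(escaped_tokens)
```

Without the `\\n` alternative, only the backslash in the second sample would be replaced, leaving a stray `n` attached to the next word.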
    

In [113]:
corpus = clean_content(corpus)

In [114]:
corpus

['great',
 'husband want reading negro baseball great addition library library haveinformation book start tthank',
 'book informative cover aspect game',
 'already baseball fan knew bit negro league learn lot reading book',
 'good story black league bought book teach high school reading class found informative excite would recommend anyone interested history black league well write unlike book fact mckissack continue write good book young audience also enjoy adult',
 'today give book negro league baseball travel friend book read felt friend would truly enjoy felt like give gift want keep part book know friend would enjoy reading journey back east give book spent thirty minute flip page say goodbye story know come across book part book like wish friend well journey friend mine journeying great send visit friend friendly gift well leaf book page come across paragraph want retain memory friend book book line show word negro baseball player face every day life color barrier prevent gain na

In [115]:
#split corpus and label into train and test
###<your code>###
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.1, random_state=0)

len(x_train), len(x_test), len(y_train), len(y_test)

(8995, 1000, 8995, 1000)

In [116]:
#change corpus into vector
#you can use tfidf or BoW here
###<your code>###
tfidf_vec = TfidfVectorizer()
tfidf_vec.fit(x_train)

#transform training and testing corpus into vector form
x_train = tfidf_vec.transform(x_train).toarray()
x_test = tfidf_vec.transform(x_test).toarray()
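
The fit-on-train, transform-on-test pattern used here can be checked on a toy example. This is a minimal sketch with made-up documents standing in for the cleaned reviews. It fits the vocabulary on the training text only, so a word that appears only in the test text is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for the cleaned reviews (made up for illustration)
train_docs = ["great product love", "bad product broke", "love great smell"]
test_docs = ["great unseenword product"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)  # learn vocabulary and IDF weights from training text only
X_test = vec.transform(test_docs)        # reuse them; words unseen in training are dropped

print(X_train.shape, X_test.shape)
print("unseenword" in vec.vocabulary_)
print(X_test.nnz)  # number of non-zero entries in the test row
```

The matrices stay in SciPy sparse form here. scikit-learn's tree-based estimators generally accept sparse input, so the `.toarray()` calls may be unnecessary and can cost a lot of memory with a large vocabulary. Verify this on your own version before relying on it.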

### Training and prediction

In [117]:
#build classification model (decision tree, random forest, or adaboost)
#start training

###<your code>###
# Note: scikit-learn 1.2 renamed base_estimator to estimator (the old name was removed in 1.4)
tree = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',max_depth=3,min_samples_split=10,min_samples_leaf=5),
                          n_estimators=10,
                          learning_rate=0.5)
tree.fit(x_train,y_train)


AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='entropy',
                                                         max_depth=3,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=5,
                                                         min_samples_split=10,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort='deprecated',
                      

In [118]:
#start inference
y_pred = tree.predict(x_test)

In [119]:
#calculate accuracy
###<your code>###
print(f'Accuracy: {tree.score(x_test,y_test)}')

Accuracy: 0.736


In [120]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

         1.0       0.42      0.12      0.19        42
         2.0       0.00      0.00      0.00        27
         3.0       0.50      0.02      0.04        44
         4.0       0.00      0.00      0.00       155
         5.0       0.74      1.00      0.85       732

    accuracy                           0.74      1000
   macro avg       0.33      0.23      0.22      1000
weighted avg       0.58      0.74      0.63      1000

[[  5   2   0   0  35]
 [  4   0   1   0  22]
 [  0   0   1   0  43]
 [  1   0   0   0 154]
 [  2   0   0   0 730]]




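The report above still covers five classes because the label-mapping code is commented out, so the model was trained on the raw 1-5 ratings. It predicts 5 stars for almost every review: the 5-star class reaches a recall of 1.00 with a precision of 0.74, while recall for the other ratings ranges from 0.00 to 0.12. The 0.736 accuracy is therefore barely above the 0.732 share of 5-star reviews in the test set. Low and middle ratings are mostly misclassified as 5 stars (for example, 43 of the 44 three-star reviews) rather than being confused with each other.
Try the methods you have learned to improve the model's performance.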
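
The figures in the classification report can be re-derived from the printed confusion matrix. The sketch below uses plain Python with illustrative variable names. It also regroups the matrix into the three target classes. That regrouping only re-buckets the predictions of the five-class model, so it is not a substitute for retraining on mapped labels.

```python
# Confusion matrix copied from the output above: rows are true ratings 1-5, columns are predictions
cm = [
    [5, 2, 0, 0, 35],
    [4, 0, 1, 0, 22],
    [0, 0, 1, 0, 43],
    [1, 0, 0, 0, 154],
    [2, 0, 0, 0, 730],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
support = [sum(row) for row in cm]  # true count per rating
predicted = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]

accuracy = sum(cm[i][i] for i in range(n_classes)) / total
majority_baseline = max(support) / total  # accuracy of always predicting the most common rating

recall = [cm[i][i] / support[i] for i in range(n_classes)]
# Precision is undefined when a class is never predicted; use 0.0,
# which is what scikit-learn reports by default (with a warning)
precision = [cm[j][j] / predicted[j] if predicted[j] else 0.0 for j in range(n_classes)]

print(f"accuracy={accuracy:.3f}, majority baseline={majority_baseline:.3f}")
for rating, (p, r) in enumerate(zip(precision, recall), start=1):
    print(f"rating {rating}: precision={p:.2f} recall={r:.2f}")

# Regroup into the three target classes: {1,2} -> negative, {3} -> neutral, {4,5} -> positive
groups = [[0, 1], [2], [3, 4]]
cm3 = [[sum(cm[i][j] for i in gi for j in gj) for gj in groups] for gi in groups]
print(cm3)  # [[11, 1, 57], [0, 1, 43], [3, 0, 884]]
```

Even after regrouping, 57 of the 69 negative reviews and 43 of the 44 neutral reviews land in the positive bucket. This suggests class imbalance is the main issue to address.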