## **0. Download the dataset**
**Note:** If gdown fails because the file's download quota has been exceeded, download the file manually, upload it to your own Drive, and then copy it from Drive into Colab.
```python
from google.colab import drive

drive.mount('/content/drive')
!cp /path/to/dataset/on/your/drive .
```

In [1]:
# https://drive.google.com/file/d/1f7WAwkuTFgLzCCTs2HZv3AmtBkET4l1M/view?usp=share_link
!gdown --id 1f7WAwkuTFgLzCCTs2HZv3AmtBkET4l1M

Downloading...
From: https://drive.google.com/uc?id=1f7WAwkuTFgLzCCTs2HZv3AmtBkET4l1M
To: /content/sem_eval_2018.zip
100% 662k/662k [00:00<00:00, 158MB/s]


In [2]:
!unzip './sem_eval_2018.zip'

Archive:  ./sem_eval_2018.zip
   creating: sem_eval_2018/
  inflating: sem_eval_2018/val.csv   
  inflating: sem_eval_2018/test.csv  
  inflating: sem_eval_2018/train.csv  


## **1. Import the required libraries**

In [3]:
!pip install unidecode

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting unidecode
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
     |████████████████████████████████| 235 kB 14.2 MB/s 
Installing collected packages: unidecode
Successfully installed unidecode-1.3.6


In [25]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import re
import nltk
import unidecode

nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

RANDOM_SEED = 1
tf.random.set_seed(RANDOM_SEED)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## **2. Prepare the data**

In [5]:
english_stop_words = stopwords.words('english') # Stop-word list from the nltk library
stemmer = PorterStemmer() # Stemmer object, used for stemming inside text_normalize

# Text normalization function
def text_normalize(text):
    text = text.lower() # Convert to lowercase
    text = unidecode.unidecode(text) # Transliterate to ASCII
    text = text.strip() # Remove leading and trailing whitespace
    text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
    text = ' '.join([word for word in text.split() if word not in english_stop_words]) # Remove stop words
    text = ' '.join([stemmer.stem(word) for word in text.split()]) # Stemming

    return text

In [18]:
BATCH_SIZE = 128
LR = 1e-1
MAX_SEQ_LEN = 128
MAX_FEATURES = 5000 
EMBEDDING_DIMS = 64
ROOT_FOLDER_PATH = './sem_eval_2018'

train_filepath = os.path.join(ROOT_FOLDER_PATH, 'train.csv')
val_filepath = os.path.join(ROOT_FOLDER_PATH, 'val.csv')
test_filepath = os.path.join(ROOT_FOLDER_PATH, 'test.csv')

train_df = pd.read_csv(train_filepath, index_col=0)
val_df = pd.read_csv(val_filepath, index_col=0)
test_df = pd.read_csv(test_filepath, index_col=0)

train_df['Tweet'] = train_df['Tweet'].apply(text_normalize).astype(str)
val_df['Tweet'] = val_df['Tweet'].apply(text_normalize).astype(str)
test_df['Tweet'] = test_df['Tweet'].apply(text_normalize).astype(str)

class_lst = np.array(train_df.columns[2:])
n_classes = len(class_lst)

X_train, y_train = train_df['Tweet'].to_numpy(), train_df[class_lst].astype('int').to_numpy()
X_val, y_val = val_df['Tweet'].to_numpy(), val_df[class_lst].astype('int').to_numpy()
X_test, y_test = test_df['Tweet'].to_numpy(), test_df[class_lst].astype('int').to_numpy()

In [19]:
corpus = X_train.tolist()
vectorizer = TfidfVectorizer().fit(corpus)

In [20]:
X_train = vectorizer.transform(X_train)
X_val = vectorizer.transform(X_val)
X_test = vectorizer.transform(X_test)
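
The vectorizer is fitted on the training tweets only, and that fitted vocabulary is then reused for the validation and test splits. The sketch below, on a made-up three-document corpus (the texts are assumptions for illustration, not taken from the dataset), shows one consequence: a word that was not seen during fitting is silently dropped at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only
train_corpus = ['happy joy', 'sad angry', 'happy sad']
new_texts = ['happy unknown']  # 'unknown' never appears in train_corpus

# Fit on the training texts only, as in the notebook
vectorizer = TfidfVectorizer().fit(train_corpus)
vocab = sorted(vectorizer.vocabulary_)

# Transform unseen text with the already-fitted vocabulary
X_new = vectorizer.transform(new_texts)

print(vocab)        # learned vocabulary
print(X_new.shape)  # (number of texts, vocabulary size)
print(X_new.nnz)    # number of non-zero features in the new text
```

Only `happy` contributes a non-zero feature here, so tweets in the validation and test sets that consist mostly of unseen words will have very sparse representations.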

In [21]:
def inverse_label(class_lst, onehot_label):
    # Return the class names whose entries in the multi-hot label vector are positive
    return class_lst[onehot_label > 0]
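
As a quick check of how `inverse_label` decodes a label row, here is a small sketch. The class names and the label vector are hypothetical and are not read from the dataset. Note that `class_lst` has to be a NumPy array, as it is in the notebook, because the function relies on boolean-mask indexing.

```python
import numpy as np

def inverse_label(class_lst, onehot_label):
    # Keep the class names whose label entry is positive
    return class_lst[onehot_label > 0]

# Hypothetical class names and label row, for illustration only
class_lst = np.array(['anger', 'joy', 'sadness'])
onehot_label = np.array([1, 0, 1])

decoded = inverse_label(class_lst, onehot_label)
print(decoded)
```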

## **3. Build the model**

In [30]:
xgboost_model = XGBClassifier(objective='binary:logistic',
                              learning_rate=LR,
                              random_state=RANDOM_SEED,
                              verbosity=2)
xgboost_multilabel_model = MultiOutputClassifier(xgboost_model)
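
`MultiOutputClassifier` handles the multi-label target by fitting one independent copy of the base estimator per label column. The sketch below illustrates this on toy data. It swaps in `LogisticRegression` for `XGBClassifier` purely so that it runs without xgboost installed; the data and the base estimator are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Toy data: 20 samples, 2 features, 2 binary label columns
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.column_stack([
    (X[:, 0] > 20).astype(int),     # label 0 depends on the first feature
    (X[:, 1] % 4 < 2).astype(int),  # label 1 alternates between samples
])

# LogisticRegression is a stand-in for XGBClassifier in this sketch
model = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

n_fitted = len(model.estimators_)  # one fitted estimator per label column
pred_shape = model.predict(X).shape

print(n_fitted, pred_shape)
```

Because each label gets its own classifier, correlations between labels are not modeled, and training time grows with the number of label columns.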

## **4. Train the model**

In [31]:
xgboost_multilabel_model.fit(X_train, y_train)

[08:46:51] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[08:46:51] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3
[08:46:51] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3
[08:46:51] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3
[08:46:51] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3
[08:46:51] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3
[08:46:51] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3
[08:46:51] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 8 extra node

MultiOutputClassifier(estimator=XGBClassifier(random_state=1, verbosity=2))

## **5. Evaluate and visualize**

In [29]:
xgboost_multilabel_model.score(X_test, y_test)

0.13685179502915004
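
For multi-label targets, `score` reports subset accuracy: a sample counts as correct only if every one of its labels is predicted correctly. This is a strict metric, which helps explain why the value above is low. The sketch below uses made-up label matrices (not the notebook's predictions) to contrast subset accuracy with the fraction of individual label decisions that are correct.

```python
import numpy as np

# Toy multi-hot labels: 4 samples, 3 labels (values are illustrative only)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],   # all labels correct
                   [0, 1, 1],   # one label wrong
                   [1, 0, 0],   # one label wrong
                   [0, 0, 1]])  # all labels correct

# Subset accuracy: a row counts only if every label matches
subset_accuracy = np.mean(np.all(y_true == y_pred, axis=1))

# Per-label accuracy: fraction of individual label decisions that match
per_label_accuracy = np.mean(y_true == y_pred)

print(subset_accuracy, per_label_accuracy)
```

In this toy case half of the rows match exactly, while 10 of the 12 individual label decisions are correct, so the two metrics can diverge substantially on the same predictions.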