# Introduction

---
**Random Forest**

- *Random Forest* is an ensemble machine learning algorithm.
- The model builds many decision trees and combines their outputs to produce the final prediction.
- Classification: the final result is the majority vote across the trees.
- Regression: the final result is the average of the tree outputs.
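
The two aggregation rules can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how TF-DF implements it internally; the function names and the sample votes are made up for the example.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Return the average of the per-tree numeric predictions."""
    return mean(tree_outputs)


# Hypothetical predictions from five trees (classification)
# and from three trees (regression).
votes = [1, 0, 1, 1, 3]
outputs = [2.0, 3.0, 4.0]

print(aggregate_classification(votes))   # class 1 wins with 3 of 5 votes
print(aggregate_regression(outputs))     # mean of the three outputs
```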

---

---
**Emotion Dataset**

- The dataset contains tweets labeled by emotion.
- Each tweet is assigned one of 6 emotions.

---

---
**Model**

- Goal: classify the emotion expressed in a tweet.
- Data: no normalization required.
- Technology: TensorFlow, specifically TensorFlow Decision Forests (TF-DF).
- The notebook was written and run on Google Colab; to run it in another environment, install the libraries listed below.

---

In [5]:
!python --version

Python 3.10.12


In [6]:
!pip --version

pip 24.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)


In [7]:
!pip list

Package                            Version
---------------------------------- -------------------
absl-py                            1.4.0
accelerate                         1.1.1
aiohappyeyeballs                   2.4.4
aiohttp                            3.11.9
aiosignal                          1.3.1
alabaster                          1.0.0
albucore                           0.0.19
albumentations                     1.4.20
altair                             4.2.2
annotated-types                    0.7.0
anyio                              3.7.1
argon2-cffi                        23.1.0
argon2-cffi-bindings               21.2.0
array_record                       0.5.1
arviz                              0.20.0
astropy                            6.1.7
astropy-iers-data                  0.2024.12.2.0.35.34
astunparse                         1.6.3
async-timeout                      4.0.3
atpublic                           4.1.0
attrs                              24.2.0
audioread           

# Download and Import Library

In [1]:
!pip install tensorflow_decision_forests
!pip install colabtools



In [2]:
import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math

# Use wurlitzer to show training logs if available; fall back to colabtools otherwise.
try:
  from wurlitzer import sys_pipes
except ImportError:
  from colabtools.googlelog import CaptureLog as sys_pipes

from IPython.core.magic import register_line_magic
from IPython.display import Javascript
# Check the version of TensorFlow Decision Forests
print("Found TensorFlow Decision Forests v" + tfdf.__version__)

Found TensorFlow Decision Forests v1.11.0


# Data Preprocessing

---
- The data has only two fields, `text` and `label`, so the text must be preprocessed before it is passed to the model for training. Otherwise the model builds a forest whose trees have a single leaf each, the predictions collapse onto one label, and the model does not work as intended.
- Apply standard NLP preprocessing to the data:
    - Build a vocabulary.
    - Tokenize the data.
    - Tune hyperparameters (optional).
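
As a rough illustration of the vocabulary and tokenization steps, the sketch below builds a vocabulary from three made-up sentences and turns each one into a TF-IDF vector using only the standard library. The notebook itself relies on scikit-learn's `TfidfVectorizer`; the smoothed IDF formula and L2 normalisation here are assumed to mirror its defaults, and the whitespace tokenizer is a simplification.

```python
import math
from collections import Counter

# Made-up example sentences, not taken from the dataset.
docs = ["i feel awful", "i feel happy today", "happy happy day"]


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


tokenized = [tokenize(d) for d in docs]

# Vocabulary: every distinct token, sorted, mapped to a column index.
vocab = sorted({tok for doc in tokenized for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}
n_docs = len(tokenized)


def tfidf_vector(tokens):
    """Return an L2-normalised TF-IDF vector over the vocabulary."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokens).items():
        doc_freq = sum(1 for doc in tokenized if tok in doc)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF
        vec[index[tok]] = count * idf
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


vectors = [tfidf_vector(doc) for doc in tokenized]
print(vocab)
print([round(v, 3) for v in vectors[0]])
```

Each row of `vectors` plays the role of one row of the TF-IDF matrix: one numeric column per vocabulary term, which is the kind of input a decision forest can split on.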
---

**Load Data**
---

In [3]:

#                        Load Data from google Drive

from google.colab import drive
import pandas as pd
import os

drive.mount('/content/drive', force_remount=True)

for dirname, _, filenames in os.walk('/content/drive/My Drive/Colab Notebooks'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

file_drive_path = ('/content/drive/My Drive/Colab Notebooks/train-00000-of-00001.parquet')

df = pd.read_parquet(file_drive_path)
df.head()

Mounted at /content/drive
/content/drive/My Drive/Colab Notebooks/Untitled0.ipynb
/content/drive/My Drive/Colab Notebooks/RandomForestDEMO.ipynb
/content/drive/My Drive/Colab Notebooks/NaiveBayesDEMO.ipynb
/content/drive/My Drive/Colab Notebooks/LinearRegressionDEMO.ipynb
/content/drive/My Drive/Colab Notebooks/LogisticRegressionDEMO.ipynb
/content/drive/My Drive/Colab Notebooks/kaggle.json
/content/drive/My Drive/Colab Notebooks/SVM.ipynb
/content/drive/My Drive/Colab Notebooks/SVM_LinearDEMO.ipynb
/content/drive/My Drive/Colab Notebooks/train-00000-of-00001.parquet
/content/drive/My Drive/Colab Notebooks/Visuallize_Dataset.ipynb
/content/drive/My Drive/Colab Notebooks/RandomForest_Lab01.ipynb
/content/drive/My Drive/Colab Notebooks/nlp-getting-started/sample_submission.csv
/content/drive/My Drive/Colab Notebooks/nlp-getting-started/test.csv
/content/drive/My Drive/Colab Notebooks/nlp-getting-started/train.csv
/content/drive/My Drive/Colab Notebooks/SVM_dataset/submission_instructions

Unnamed: 0,text,label
0,i feel awful about it too because it s my job ...,0
1,im alone i feel awful,0
2,ive probably mentioned this before but i reall...,1
3,i was feeling a little low few days back,0
4,i beleive that i am much more sensitive to oth...,2


In [None]:

#                        Load Data from local path

import os
import pandas as pd

your_local_path = ''  # Change your local path here

# os.path.join handles a missing trailing separator in your_local_path
local_file_path = os.path.join(your_local_path, 'train-00000-of-00001.parquet')

df = pd.read_parquet(local_file_path)
df.head()

**Preprocess Data**
---

In [12]:
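# Print class labels

label = "label"

# Sort the classes so each integer label keeps its original meaning
# (unique() returns values in order of first appearance, not sorted).
classes = sorted(df[label].unique().tolist())
print(f"Label classes: {classes}")

# Re-index labels to 0..n-1; this is a no-op when labels are already 0..5
df[label] = df[label].map(classes.index)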

Label classes: [0, 1, 2, 3, 4, 5]


In [10]:
# Emotion names for the six labels

label_string = ['Sadness', 'Joy', 'Love', 'Anger', 'Fear', 'Surprise']
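
A minimal sketch of how this list could be used to decode predictions, assuming the name at position `i` corresponds to integer label `i` (the notebook does not state this explicitly). The helper `decode_labels` is hypothetical; the sample ids are the five labels shown in the `df.head()` output above.

```python
label_string = ['Sadness', 'Joy', 'Love', 'Anger', 'Fear', 'Surprise']

# Assumption: label id i maps to label_string[i].
id_to_name = dict(enumerate(label_string))


def decode_labels(label_ids):
    """Translate integer label ids into emotion names."""
    return [id_to_name[i] for i in label_ids]


print(decode_labels([0, 0, 1, 0, 2]))
```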

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

vectorizer = TfidfVectorizer(stop_words='english', max_features=1000, min_df=5, max_df=0.9)
X = vectorizer.fit_transform(df['text'])  # Convert the text into a TF-IDF matrix
y = df['label']  # Label of each text

In [5]:
for data in X:
    print(data)
    break

  (0, 293)	0.09348418155703282
  (0, 51)	0.43010706204134014
  (0, 465)	0.4059685966598052
  (0, 649)	0.4966032845837647
  (0, 470)	0.22943719661394182
  (0, 199)	0.3883533371092306
  (0, 375)	0.43747406507531744


In [6]:
# Convert the sparse TF-IDF matrix into a DataFrame with one column per term

data = pd.DataFrame.sparse.from_spmatrix(X, columns=vectorizer.get_feature_names_out())
data['label'] = y.values  # use raw values to avoid index misalignment with df

**Split Data**
---

In [8]:
from sklearn.model_selection import train_test_split

train_ds_pd, test_ds_pd = train_test_split(data, test_size=0.2, random_state=42)
print("{} examples in training, {} examples for testing.".format(
    train_ds_pd.shape[0], test_ds_pd.shape[0]))

train_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, task=tfdf.keras.Task.CLASSIFICATION, label="label")
test_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, task=tfdf.keras.Task.CLASSIFICATION, label="label")


333447 examples in training, 83362 examples for testing.
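
The two counts above follow from `test_size=0.2`. The sketch below reproduces the arithmetic, assuming the test set size is rounded up and the training set takes the remainder, which is how `train_test_split` is understood to behave when only `test_size` is given. The helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# Total number of examples = training + testing counts reported above.
total = 333447 + 83362
n_train, n_test = split_sizes(total, 0.2)
print(n_train, n_test)
```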


# Build and Train Random Forest Model

In [9]:
# Specify the model.
# `num_trees` sets the number of trees in the forest.
model_1 = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.CLASSIFICATION, num_trees=100)

# Optionally, add evaluation metrics.
model_1.compile(
    metrics=["accuracy"])

# Train the model.
# "sys_pipes" is optional. It enables the display of the training logs.
with sys_pipes():
  model_1.fit(x=train_dataset)

Use /tmp/tmpdwmsihg9 as temporary training directory
Reading training dataset...
Training dataset read in 0:05:52.686357. Found 333447 examples.
Training model...


I0000 00:00:1733650918.189772    1741 kernel.cc:782] Start Yggdrasil model training
I0000 00:00:1733650918.195112    1741 kernel.cc:783] Collect training examples
I0000 00:00:1733650918.196811    1741 kernel.cc:795] Dataspec guide:
column_guides {
  column_name_pattern: "^__LABEL$"
  type: CATEGORICAL
  categorial {
    min_vocab_frequency: 0
    max_vocab_count: -1
  }
}
default_column_guide {
  categorial {
    max_vocab_count: 2000
  }
  discretized_numerical {
    maximum_num_bins: 255
  }
}
ignore_columns_without_guides: false
detect_numerical_as_discretized_numerical: false

I0000 00:00:1733650918.206568    1741 kernel.cc:401] Number of batches: 334
I0000 00:00:1733650918.210886    1741 kernel.cc:402] Number of examples: 333447
I0000 00:00:1733650922.567895    1741 kernel.cc:802] Training dataset:
Number of records: 333447
Number of columns: 1001

Number of columns by type:
	NUMERICAL: 1000 (99.9001%)
	CATEGORICAL: 1 (0.0999001%)

Columns:

NUMERICAL: 1000 (99.9001%)
	1: "ability

Model trained in 0:12:06.771858
Compiling model...
Model compiled.


# Evaluate Model

In [10]:
evaluation = model_1.evaluate(test_dataset, return_dict=True)
print()

for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")


loss: 0.0000
accuracy: 0.3393
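
An accuracy figure on six classes is easier to interpret next to a simple baseline. One common check is the accuracy obtained by always predicting the most frequent class; if the model's accuracy is close to that value, it may not have learned much beyond the class distribution. The sketch below shows the computation on made-up labels, not on the dataset; in the notebook it could be applied to `test_ds_pd['label']`.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels for illustration only.
toy_labels = [0, 0, 1, 0, 2, 1, 1, 1]
print(f"baseline accuracy: {majority_baseline_accuracy(toy_labels):.4f}")
```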
