**Text Sentiment Analysis and Prediction with Deep Learning on IndoNLU Data**

Dataset: https://github.com/indobenchmark/indonlu

# **Preparation**

Load the IndoNLU data

In [1]:
!git clone https://github.com/indobenchmark/indonlu

Cloning into 'indonlu'...
remote: Enumerating objects: 500, done.
remote: Counting objects: 100% (184/184), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 500 (delta 115), reused 139 (delta 110), pack-reused 316
Receiving objects: 100% (500/500), 9.45 MiB | 16.46 MiB/s, done.
Resolving deltas: 100% (235/235), done.


Import libraries

In [2]:
import random
import numpy as np
import pandas as pd
import torch
from torch import optim
import torch.nn.functional as F
from tqdm import tqdm
from transformers import BertForSequenceClassification, BertConfig, BertTokenizer
from nltk.tokenize import TweetTokenizer
from indonlu.utils.forward_fn import forward_sequence_classification
from indonlu.utils.metrics import document_sentiment_metrics_fn
from indonlu.utils.data_utils import DocumentSentimentDataset, DocumentSentimentDataLoader

Define the common helper functions used throughout this project: setting the random seed, counting the parameters of a module, getting the learning rate, and converting metrics to a string.

In [3]:
# Set the random seed for Python, NumPy, and PyTorch (CPU and current GPU)
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

# Count the number of parameters in a model
def count_param(module, trainable=False):
    if trainable:
        return sum(p.numel() for p in module.parameters() if p.requires_grad)
    else:
        return sum(p.numel() for p in module.parameters())

# Get the learning rate of the optimizer's first parameter group
def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

# Convert a metrics dictionary to a string
def metrics_to_string(metric_dict):
    string_list = []
    for key, value in metric_dict.items():
        string_list.append('{}:{:.2f}'.format(key, value))
    return ' '.join(string_list)
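
As a quick illustration of two of these helpers, the sketch below repeats `get_lr` and `metrics_to_string` so that it runs on its own. The `fake_optimizer` object and the metric values are hypothetical stand-ins, not results from this notebook. It shows that `get_lr` reports only the first parameter group and that metrics are formatted to two decimals.

```python
from types import SimpleNamespace

def get_lr(optimizer):
    # Returns on the first iteration, so only the first parameter group is read
    for param_group in optimizer.param_groups:
        return param_group['lr']

def metrics_to_string(metric_dict):
    # Format each metric as KEY:value with two decimals, separated by spaces
    return ' '.join('{}:{:.2f}'.format(key, value) for key, value in metric_dict.items())

# Hypothetical stand-in for a torch optimizer with two parameter groups
fake_optimizer = SimpleNamespace(param_groups=[{'lr': 3e-6}, {'lr': 1e-4}])

first_lr = get_lr(fake_optimizer)
formatted = metrics_to_string({'ACC': 0.934, 'F1': 0.9, 'REC': 0.889})

print(first_lr)
print(formatted)  # ACC:0.93 F1:0.90 REC:0.89
```

If a model were trained with different learning rates per parameter group, the value shown in the progress bar would therefore describe only the first group.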

Set the random seed so that the experiment is reproducible and gives consistent results across runs.

In [4]:
# Set the random seed
set_seed(19072021)
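
The effect of seeding can be sketched with Python's standard `random` module alone. The helper name `sample_after_seed` is an illustration, not part of the notebook. Reseeding with the same value replays the same sequence, while a different seed gives a different one. Note that GPU operations may still introduce small run-to-run differences even with fixed seeds.

```python
import random

def sample_after_seed(seed, n=5):
    # Reset the generator, then draw n pseudo-random numbers
    random.seed(seed)
    return [random.random() for _ in range(n)]

first_run = sample_after_seed(19072021)
second_run = sample_after_seed(19072021)
other_seed = sample_after_seed(42)

print(first_run == second_run)  # True: same seed, same sequence
print(first_run == other_seed)
```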

# **Configuration and Loading the Pre-trained Model**

Load the tokenizer and the BERT configuration from a pre-trained Indonesian language model.

In [5]:
# Load the tokenizer and configuration
tokenizer = BertTokenizer.from_pretrained('indobenchmark/indobert-base-p1')
config = BertConfig.from_pretrained('indobenchmark/indobert-base-p1')
config.num_labels = DocumentSentimentDataset.NUM_LABELS

# Instantiate the model
model = BertForSequenceClassification.from_pretrained('indobenchmark/indobert-base-p1', config=config)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at indobenchmark/indobert-base-p1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# Display the instantiated model
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(50000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [7]:
# Check the number of parameters
count_param(model)

124443651
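
The parameter count can be cross-checked by hand from the layer shapes printed above. The following is a sketch under stated assumptions: the feed-forward width of 3072 and the pooler layer do not appear in the truncated printout and are taken from the standard BERT-base layout. The three output labels come from `DocumentSentimentDataset.NUM_LABELS`.

```python
# Shapes visible in the printed model
vocab_size, hidden, max_positions, token_types = 50000, 768, 512, 2
num_layers = 12

# Assumptions (not visible in the truncated printout)
intermediate = 3072   # standard BERT-base feed-forward width
num_labels = 3        # positive, neutral, negative

def linear(n_in, n_out):
    # Weight matrix plus bias vector of a fully connected layer
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab_size + max_positions + token_types) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm   # query, key, value, output
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = num_layers * (attention + feed_forward)
pooler = linear(hidden, hidden)                       # assumed standard BERT pooler
classifier = linear(hidden, num_labels)

total = embeddings + encoder + pooler + classifier
print(total)  # 124443651
```

Under these assumptions the sum matches the value returned by `count_param(model)`. Only the 2,307 classifier parameters are newly initialized.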

# **Preparing the Sentiment Analysis Dataset**

In [8]:
train_dataset_path = '/content/indonlu/dataset/smsa_doc-sentiment-prosa/train_preprocess.tsv'
valid_dataset_path = '/content/indonlu/dataset/smsa_doc-sentiment-prosa/valid_preprocess.tsv'
test_dataset_path = '/content/indonlu/dataset/smsa_doc-sentiment-prosa/test_preprocess_masked_label.tsv'

In [9]:
# Build the datasets and data loaders using the two classes imported earlier
train_dataset = DocumentSentimentDataset(train_dataset_path, tokenizer, lowercase=True)
valid_dataset = DocumentSentimentDataset(valid_dataset_path, tokenizer, lowercase=True)
test_dataset = DocumentSentimentDataset(test_dataset_path, tokenizer, lowercase=True)

train_loader = DocumentSentimentDataLoader(dataset=train_dataset, max_seq_len=512, batch_size=32, num_workers=2, shuffle=True)
valid_loader = DocumentSentimentDataLoader(dataset=valid_dataset, max_seq_len=512, batch_size=32, num_workers=2, shuffle=False)
test_loader = DocumentSentimentDataLoader(dataset=test_dataset, max_seq_len=512, batch_size=32, num_workers=2, shuffle=False)
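
Sequences in a batch have different lengths, so a loader must bring them to a common length before they can be stacked into one tensor. The sketch below shows what such a padding step typically looks like. `pad_batch` is a hypothetical helper written for illustration and is not the code of `DocumentSentimentDataLoader`. The pad id of 0 is assumed from `padding_idx=0` in the printed embedding layer.

```python
def pad_batch(token_id_lists, max_seq_len=512, pad_id=0):
    # Truncate each sequence to max_seq_len, then pad to the longest one in the batch
    truncated = [ids[:max_seq_len] for ids in token_id_lists]
    batch_len = max(len(ids) for ids in truncated)

    padded, mask = [], []
    for ids in truncated:
        n_pad = batch_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        mask.append([1] * len(ids) + [0] * n_pad)
    return padded, mask

batch = [[2, 6540, 92, 3], [2, 899, 3]]
padded, mask = pad_batch(batch)
print(padded)  # [[2, 6540, 92, 3], [2, 899, 3, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding to the longest sequence in the batch rather than always to 512 keeps short batches cheap to process.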

In [10]:
# Print one sample from the training data
print(train_dataset[0])

(array([    2,  6540,    92,  2970,   213,  4259,  3553,   899,    34,
         259,  5590,   262,  2558,   386,   899,  1687,    26,  1574,
       30470,   899,  3310, 30468, 22130, 30360,  6123,  6368, 30468,
       22130, 30360,  2652,  1746, 30468,  8869,  6540,    34,  6315,
        1622,  1256,  8949,   899, 30468,  4222,  1622,   752,   245,
         295,  2083, 30470,  2346,  7107,   300, 30470,   405,   724,
        5189, 30470,   843, 17464,   899,   540, 10989,  3331,  1107,
       30468,   119,  3221,    79,    34,  2170,    98,  9167, 30457,
           3]), array(0), 'warung ini dimiliki oleh pengusaha pabrik tahu yang sudah puluhan tahun terkenal membuat tahu putih di bandung . tahu berkualitas , dipadu keahlian memasak , dipadu kretivitas , jadilah warung yang menyajikan menu utama berbahan tahu , ditambah menu umum lain seperti ayam . semuanya selera indonesia . harga cukup terjangkau . jangan lewatkan tahu bletoka nya , tidak kalah dengan yang asli dari tegal !')


In [11]:
# Store DocumentSentimentDataset.LABEL2INDEX and DocumentSentimentDataset.INDEX2LABEL in variables
w2i, i2w = DocumentSentimentDataset.LABEL2INDEX, DocumentSentimentDataset.INDEX2LABEL
print(w2i)
print(i2w)

{'positive': 0, 'neutral': 1, 'negative': 2}
{0: 'positive', 1: 'neutral', 2: 'negative'}


# **Testing the Model on a Sample Sentence**

In [12]:
# Check the model on a sample sentence
text = 'Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita'
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]
label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()

print(f'Text: {text} | Label : {i2w[label]} ({F.softmax(logits, dim=-1).squeeze()[label] * 100:.3f}%)')

Text: Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita | Label : positive (39.380%)
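
A confidence of 39.380% is close to the 33.3% chance level for three classes, which is consistent with a classifier head that has not been trained yet. The sketch below uses a plain-Python `softmax` and hypothetical logit values (not taken from the model) to show how nearly equal logits give near-uniform probabilities, while a large margin gives a confidence close to 100%.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (positive, neutral, negative)
uniform = softmax([0.0, 0.0, 0.0])    # no preference between classes
confident = softmax([8.0, 0.0, 0.0])  # strong preference for the first class

print([round(p, 3) for p in uniform])  # [0.333, 0.333, 0.333]
print(round(max(confident), 4))
```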


# **Fine-Tuning and Evaluation**


In [13]:
# Move the model to the GPU first, then build the optimizer over its parameters
model = model.cuda()
optimizer = optim.Adam(model.parameters(), lr=3e-6)

Train the model

In [14]:
# Training
n_epochs = 10
for epoch in range(n_epochs):
    model.train()
    torch.set_grad_enabled(True)

    total_train_loss = 0
    list_hyp, list_label = [], []

    train_pbar = tqdm(train_loader, leave=True, total=len(train_loader))
    for i, batch_data in enumerate(train_pbar):
        # Forward model
        loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data[:-1], i2w=i2w, device='cuda')

        # Update model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        tr_loss = loss.item()
        total_train_loss = total_train_loss + tr_loss

        # Collect predictions and labels for the metrics
        list_hyp += batch_hyp
        list_label += batch_label

        train_pbar.set_description("(Epoch {}) TRAIN LOSS:{:.4f} LR:{:.8f}".format((epoch+1),
            total_train_loss/(i+1), get_lr(optimizer)))

    # Compute the training metrics
    metrics = document_sentiment_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) TRAIN LOSS:{:.4f} {} LR:{:.8f}".format((epoch+1),
        total_train_loss/(i+1), metrics_to_string(metrics), get_lr(optimizer)))

    # Evaluate on the validation data
    model.eval()
    torch.set_grad_enabled(False)

    total_loss, total_correct, total_labels = 0, 0, 0
    list_hyp, list_label = [], []

    pbar = tqdm(valid_loader, leave=True, total=len(valid_loader))
    for i, batch_data in enumerate(pbar):
        batch_seq = batch_data[-1]
        loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data[:-1], i2w=i2w, device='cuda')

        # Accumulate the total loss
        valid_loss = loss.item()
        total_loss = total_loss + valid_loss

        # Compute the validation metrics
        list_hyp += batch_hyp
        list_label += batch_label
        metrics = document_sentiment_metrics_fn(list_hyp, list_label)

        pbar.set_description("VALID LOSS:{:.4f} {}".format(total_loss/(i+1), metrics_to_string(metrics)))

    metrics = document_sentiment_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) VALID LOSS:{:.4f} {}".format((epoch+1),
        total_loss/(i+1), metrics_to_string(metrics)))

(Epoch 1) TRAIN LOSS:0.3480 LR:0.00000300: 100%|██████████| 344/344 [02:41<00:00,  2.13it/s]


(Epoch 1) TRAIN LOSS:0.3480 ACC:0.87 F1:0.82 REC:0.79 PRE:0.86 LR:0.00000300


VALID LOSS:0.1944 ACC:0.93 F1:0.90 REC:0.89 PRE:0.90: 100%|██████████| 40/40 [00:07<00:00,  5.60it/s]


(Epoch 1) VALID LOSS:0.1944 ACC:0.93 F1:0.90 REC:0.89 PRE:0.90


(Epoch 2) TRAIN LOSS:0.1549 LR:0.00000300: 100%|██████████| 344/344 [02:44<00:00,  2.09it/s]


(Epoch 2) TRAIN LOSS:0.1549 ACC:0.95 F1:0.93 REC:0.92 PRE:0.93 LR:0.00000300


VALID LOSS:0.1753 ACC:0.94 F1:0.91 REC:0.91 PRE:0.91: 100%|██████████| 40/40 [00:07<00:00,  5.59it/s]


(Epoch 2) VALID LOSS:0.1753 ACC:0.94 F1:0.91 REC:0.91 PRE:0.91


(Epoch 3) TRAIN LOSS:0.1187 LR:0.00000300: 100%|██████████| 344/344 [02:44<00:00,  2.09it/s]


(Epoch 3) TRAIN LOSS:0.1187 ACC:0.96 F1:0.95 REC:0.95 PRE:0.96 LR:0.00000300


VALID LOSS:0.1675 ACC:0.94 F1:0.91 REC:0.90 PRE:0.92: 100%|██████████| 40/40 [00:07<00:00,  5.61it/s]


(Epoch 3) VALID LOSS:0.1675 ACC:0.94 F1:0.91 REC:0.90 PRE:0.92


(Epoch 4) TRAIN LOSS:0.0892 LR:0.00000300: 100%|██████████| 344/344 [02:45<00:00,  2.08it/s]


(Epoch 4) TRAIN LOSS:0.0892 ACC:0.97 F1:0.96 REC:0.96 PRE:0.97 LR:0.00000300


VALID LOSS:0.1866 ACC:0.93 F1:0.90 REC:0.89 PRE:0.92: 100%|██████████| 40/40 [00:07<00:00,  5.63it/s]


(Epoch 4) VALID LOSS:0.1866 ACC:0.93 F1:0.90 REC:0.89 PRE:0.92


(Epoch 5) TRAIN LOSS:0.0662 LR:0.00000300: 100%|██████████| 344/344 [02:44<00:00,  2.09it/s]


(Epoch 5) TRAIN LOSS:0.0662 ACC:0.98 F1:0.97 REC:0.97 PRE:0.98 LR:0.00000300


VALID LOSS:0.1982 ACC:0.93 F1:0.91 REC:0.90 PRE:0.92: 100%|██████████| 40/40 [00:07<00:00,  5.56it/s]


(Epoch 5) VALID LOSS:0.1982 ACC:0.93 F1:0.91 REC:0.90 PRE:0.92


(Epoch 6) TRAIN LOSS:0.0465 LR:0.00000300: 100%|██████████| 344/344 [02:44<00:00,  2.09it/s]


(Epoch 6) TRAIN LOSS:0.0465 ACC:0.99 F1:0.98 REC:0.98 PRE:0.98 LR:0.00000300


VALID LOSS:0.2132 ACC:0.93 F1:0.90 REC:0.90 PRE:0.91: 100%|██████████| 40/40 [00:07<00:00,  5.61it/s]


(Epoch 6) VALID LOSS:0.2132 ACC:0.93 F1:0.90 REC:0.90 PRE:0.91


(Epoch 7) TRAIN LOSS:0.0317 LR:0.00000300: 100%|██████████| 344/344 [02:44<00:00,  2.09it/s]


(Epoch 7) TRAIN LOSS:0.0317 ACC:0.99 F1:0.99 REC:0.99 PRE:0.99 LR:0.00000300


VALID LOSS:0.2260 ACC:0.93 F1:0.91 REC:0.89 PRE:0.92: 100%|██████████| 40/40 [00:07<00:00,  5.60it/s]


(Epoch 7) VALID LOSS:0.2260 ACC:0.93 F1:0.91 REC:0.89 PRE:0.92


(Epoch 8) TRAIN LOSS:0.0265 LR:0.00000300: 100%|██████████| 344/344 [02:45<00:00,  2.08it/s]


(Epoch 8) TRAIN LOSS:0.0265 ACC:0.99 F1:0.99 REC:0.99 PRE:0.99 LR:0.00000300


VALID LOSS:0.2359 ACC:0.94 F1:0.91 REC:0.89 PRE:0.93: 100%|██████████| 40/40 [00:07<00:00,  5.56it/s]


(Epoch 8) VALID LOSS:0.2359 ACC:0.94 F1:0.91 REC:0.89 PRE:0.93


(Epoch 9) TRAIN LOSS:0.0164 LR:0.00000300: 100%|██████████| 344/344 [02:45<00:00,  2.08it/s]


(Epoch 9) TRAIN LOSS:0.0164 ACC:1.00 F1:0.99 REC:0.99 PRE:1.00 LR:0.00000300


VALID LOSS:0.2450 ACC:0.94 F1:0.91 REC:0.91 PRE:0.92: 100%|██████████| 40/40 [00:07<00:00,  5.58it/s]


(Epoch 9) VALID LOSS:0.2450 ACC:0.94 F1:0.91 REC:0.91 PRE:0.92


(Epoch 10) TRAIN LOSS:0.0126 LR:0.00000300: 100%|██████████| 344/344 [02:44<00:00,  2.09it/s]


(Epoch 10) TRAIN LOSS:0.0126 ACC:1.00 F1:1.00 REC:0.99 PRE:1.00 LR:0.00000300


VALID LOSS:0.2733 ACC:0.94 F1:0.91 REC:0.91 PRE:0.92: 100%|██████████| 40/40 [00:07<00:00,  5.56it/s]

(Epoch 10) VALID LOSS:0.2733 ACC:0.94 F1:0.91 REC:0.91 PRE:0.92
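
In the log above, the training loss falls every epoch while the validation loss reaches its minimum at epoch 3 and rises afterwards, a pattern often associated with overfitting. Validation accuracy and F1 stay roughly flat. The sketch below copies the logged losses and shows one common way to act on them: pick the epoch with the lowest validation loss, or stop once it has not improved for a number of epochs. The `early_stop_epoch` helper and its `patience` value are illustrative assumptions and are not part of the notebook's training loop.

```python
from math import inf

# Loss values copied from the training log above (epochs 1 to 10)
train_loss = [0.3480, 0.1549, 0.1187, 0.0892, 0.0662, 0.0465, 0.0317, 0.0265, 0.0164, 0.0126]
valid_loss = [0.1944, 0.1753, 0.1675, 0.1866, 0.1982, 0.2132, 0.2260, 0.2359, 0.2450, 0.2733]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(valid_loss)), key=valid_loss.__getitem__) + 1

def early_stop_epoch(losses, patience=2):
    # Return the 1-based epoch at which training would stop:
    # the first epoch that is `patience` epochs past the best loss so far
    best, best_idx = inf, -1
    for idx, loss in enumerate(losses):
        if loss < best:
            best, best_idx = loss, idx
        elif idx - best_idx >= patience:
            return idx + 1
    return len(losses)

print(best_epoch)                    # 3
print(early_stop_epoch(valid_loss))  # 5
```

With a patience of 2, training would have stopped after epoch 5, and the weights from epoch 3 would be the ones to keep if checkpoints were saved per epoch.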





Run the fine-tuned model on the test data and save its predictions

In [15]:
# Predict on the test data
model.eval()
torch.set_grad_enabled(False)

total_loss, total_correct, total_labels = 0, 0, 0
list_hyp, list_label = [], []

pbar = tqdm(test_loader, leave=True, total=len(test_loader))
for i, batch_data in enumerate(pbar):
    _, batch_hyp, _ = forward_sequence_classification(model, batch_data[:-1], i2w=i2w, device='cuda')
    list_hyp += batch_hyp

# Save the predictions
df = pd.DataFrame({'label':list_hyp}).reset_index()
df.to_csv('pred.txt', index=False)

print(df)

100%|██████████| 16/16 [00:01<00:00,  8.54it/s]

     index     label
0        0  negative
1        1  negative
2        2  negative
3        3  negative
4        4  negative
..     ...       ...
495    495   neutral
496    496   neutral
497    497   neutral
498    498  positive
499    499  positive

[500 rows x 2 columns]





# **Sentiment Prediction**

Test sentiment prediction on sample texts

In [16]:
text = 'Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita'

# Tokenize the text into subwords with the tokenizer
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]

# Get the label with the highest probability
label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()

print(f'Text       : {text}\nPrediction : {i2w[label]} ({F.softmax(logits, dim=-1).squeeze()[label] * 100:.3f}%)')

Text       : Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita
Prediction : positive (99.947%)


In [17]:
text = 'Ronaldo pergi ke Mall Grand Indonesia membeli cilok'

# Tokenize the text into subwords with the tokenizer
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]

# Get the label with the highest probability
label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()

print(f'Text       : {text}\nPrediction : {i2w[label]} ({F.softmax(logits, dim=-1).squeeze()[label] * 100:.3f}%)')

Text       : Ronaldo pergi ke Mall Grand Indonesia membeli cilok
Prediction : neutral (99.847%)


In [18]:
text = 'Sayang, aku marah'
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]
label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()

print(f'Text: {text} | Label : {i2w[label]} ({F.softmax(logits, dim=-1).squeeze()[label] * 100:.3f}%)')

Text: Sayang, aku marah | Label : negative (99.953%)


In [19]:
text = 'Merasa kagum dengan toko ini tapi berubah menjadi kecewa setelah melakukan transaksi'

# Tokenize the text into subwords with the tokenizer
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]

# Get the label with the highest probability
label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()

print(f'Text       : {text}\nPrediction : {i2w[label]} ({F.softmax(logits, dim=-1).squeeze()[label] * 100:.3f}%)')

Text       : Merasa kagum dengan toko ini tapi berubah menjadi kecewa setelah melakukan transaksi
Prediction : negative (99.961%)


In [20]:
text = 'Awalnya aku merasa selalu tidak percaya diri, namun setelah mencobanya ternyata tidak seperti yang aku bayangkan, sekarang aku merasa lebih percaya diri'

# Tokenize the text into subwords with the tokenizer
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]

# Get the label with the highest probability
label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()

print(f'Text       : {text}\nPrediction : {i2w[label]} ({F.softmax(logits, dim=-1).squeeze()[label] * 100:.3f}%)')

Text       : Awalnya aku merasa selalu tidak percaya diri, namun setelah mencobanya ternyata tidak seperti yang aku bayangkan, sekarang aku merasa lebih percaya diri
Prediction : positive (99.822%)
