# Finetuning CASA with FastText

## Cached FastText weight vector
If you are running the 12 tasks from IndoNLU, you can download the cached weight vectors from the following URLs:
- [FastText ID-4B](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/models/fasttext/fasttext-4B-id-uncased.zip)
- [FastText CC-ID](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/models/fasttext/fasttext-cc-id.zip)

## How to create the FastText weight vectors for a new dataset
1. Install `fasttext` dependency
```bash
     pip install fasttext
```
2. Download our [FastText vector file](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/models/fasttext/fasttext.4B.id.300.epoch5.uncased.vec.zip) and unzip it
3. Create a vocabulary file consisting of all unique tokens in the dataset. For examples, see [CASA uncased vocab](../dataset/casa_absa-prosa/vocab_uncased.txt) or [CASA cased vocab](../dataset/casa_absa-prosa/vocab.txt) in the dataset folder
4. Run the `print-word-vectors` command from `fasttext`
```bash
     ./fasttext print-word-vectors fasttext.4B.id.300.epoch5.uncased.bin < INPUT_VOCAB_PATH > OUTPUT_VECTOR_PATH
```
5. In this tutorial, we use [CASA uncased vocab](../dataset/casa_absa-prosa/vocab_uncased.txt) as our `INPUT_VOCAB_PATH` and `./embeddings/fasttext_casa.txt` as the `OUTPUT_VECTOR_PATH`
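
The vocabulary step above needs a file with one unique token per line. A minimal sketch of how such a file could be produced follows. It assumes whitespace tokenization and lowercasing, and `build_vocab` and `write_vocab` are hypothetical helpers that are not part of this repository; the CASA vocab files may have been built differently (for example with another tokenizer or with extra special tokens).

```python
from collections import Counter


def build_vocab(sentences, lower=True):
    """Return unique tokens, most frequent first (ties broken alphabetically)."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower() if lower else sentence
        counts.update(text.split())  # assumption: whitespace tokenization
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ordered]


def write_vocab(tokens, path):
    """Write one token per line, the layout expected as INPUT_VOCAB_PATH."""
    with open(path, 'w', encoding='utf-8') as f:
        for token in tokens:
            f.write(token + '\n')


# Tiny made-up corpus, for illustration only
sample = ['Jazz enak tetapi mahal', 'harga mahal']
vocab = build_vocab(sample)
print(vocab)
```

Calling `write_vocab(vocab, 'my_vocab.txt')` would then produce a file that can be piped into `print-word-vectors`.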

In [1]:
import os, sys
sys.path.append('../')
os.chdir('../')

import random
import numpy as np
import pandas as pd
import torch
from torch import optim
import torch.nn.functional as F
from tqdm import tqdm

from transformers import BertTokenizer, BertConfig, BertForPreTraining
from nltk.tokenize import TweetTokenizer, word_tokenize

from modules.multi_label_classification import BertForMultiLabelClassification
from utils.functions import SimpleTokenizer, load_vocab, gen_embeddings
from utils.forward_fn import forward_sequence_multi_classification
from utils.metrics import absa_metrics_fn
from utils.data_utils import AspectBasedSentimentAnalysisProsaDataset, AspectBasedSentimentAnalysisDataLoader

In [2]:
vocab_path = './dataset/casa_absa-prosa/vocab_uncased.txt' # We use `./` instead of `../` because the first cell already called `os.chdir('../')`
vector_path = './embeddings/fasttext_casa.txt'

In [3]:
###
# common functions
###
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    
def count_param(module, trainable=False):
    if trainable:
        return sum(p.numel() for p in module.parameters() if p.requires_grad)
    else:
        return sum(p.numel() for p in module.parameters())
    
def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

def metrics_to_string(metric_dict):
    string_list = []
    for key, value in metric_dict.items():
        string_list.append('{}:{:.2f}'.format(key, value))
    return ' '.join(string_list)

In [4]:
# Set random seed
set_seed(26092020)

# Load Model

In [5]:
# Load Tokenizer & Embedding
_, vocab_map = load_vocab(vocab_path)
tokenizer = SimpleTokenizer(vocab_map, TweetTokenizer(), lower=True)
vocab_list = list(tokenizer.vocab.keys())
embeddings = gen_embeddings(vocab_list, vector_path, emb_dim=300)

# Load Config
config = BertConfig.from_pretrained('indobenchmark/indobert-base-p1')
config.num_labels = max(AspectBasedSentimentAnalysisProsaDataset.NUM_LABELS)
config.num_labels_list = AspectBasedSentimentAnalysisProsaDataset.NUM_LABELS
config.hidden_size = 300 # Use the same vector size as the fasttext embedding
config.num_attention_heads = 10 # Choose a head count that satisfies hidden_size % num_attention_heads == 0
config.vocab_size = len(embeddings)

# Instantiate model
model = BertForMultiLabelClassification(config=config)
model.bert.embeddings.word_embeddings.weight.data.copy_(torch.FloatTensor(embeddings))

Loading embedding file: ./embeddings/fasttext_casa.txt
Embeddings: 2276 x 300
Pre-trained: 2265 (99.52%)


tensor([[ 0.1833,  0.4513,  0.6671,  ...,  0.9816,  0.1172,  0.6214],
        [ 0.1449,  0.0882, -0.5603,  ..., -0.1639,  0.2462,  0.0972],
        [-0.2511, -0.1280, -0.2047,  ...,  0.4223,  0.6745, -0.1174],
        ...,
        [ 0.7961,  0.5256,  0.5663,  ...,  0.1916,  0.7011,  0.4579],
        [ 0.5069,  0.4808,  0.5925,  ...,  0.7520,  0.1409,  0.6555],
        [ 0.4596,  0.7537,  0.5713,  ...,  0.9568,  0.5453,  0.6907]])
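
The log above reports how many vocabulary tokens were found in the vector file. The sketch below illustrates one way such an embedding matrix and coverage figure could be computed. It is not the repository's `gen_embeddings`; `load_vectors` and `build_matrix` are hypothetical helpers, and the sketch assumes each line of the vector file is a token followed by space-separated floats, with missing tokens initialised uniformly at random.

```python
import random


def load_vectors(lines, emb_dim):
    """Parse 'token v1 v2 ... vD' lines into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) != emb_dim + 1:
            continue  # skip headers or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_matrix(vocab_list, vectors, emb_dim, seed=0):
    """Return (matrix, n_found); tokens without a vector get random values."""
    rng = random.Random(seed)
    matrix, found = [], 0
    for token in vocab_list:
        if token in vectors:
            matrix.append(vectors[token])
            found += 1
        else:
            matrix.append([rng.uniform(0, 1) for _ in range(emb_dim)])
    return matrix, found


# Made-up 3-dimensional vectors, for illustration only
fake_file = ['mesin 0.1 0.2 0.3', 'hemat 0.4 0.5 0.6']
vocab_list = ['mesin', 'hemat', '<unk>']
vectors = load_vectors(fake_file, emb_dim=3)
matrix, found = build_matrix(vocab_list, vectors, emb_dim=3)
print(f'Embeddings: {len(matrix)} x 3')
print(f'Pre-trained: {found} ({found / len(vocab_list) * 100:.2f}%)')
```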

In [6]:
model

BertForMultiLabelClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(2276, 300, padding_idx=0)
      (position_embeddings): Embedding(512, 300)
      (token_type_embeddings): Embedding(2, 300)
      (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=300, out_features=300, bias=True)
              (key): Linear(in_features=300, out_features=300, bias=True)
              (value): Linear(in_features=300, out_features=300, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=300, out_features=300, bias=True)
              (LayerNorm): LayerNorm((300,), eps=1e-12, elemen
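
Every layer above is 300-dimensional because `hidden_size` was set to the FastText vector size, and the number of attention heads has to divide that size evenly. The short sketch below lists the head counts that satisfy this constraint; `valid_head_counts` is a hypothetical helper and the upper bound of 16 heads is an arbitrary choice for illustration.

```python
def valid_head_counts(hidden_size, max_heads=16):
    """Head counts h (1..max_heads) with hidden_size % h == 0."""
    return [h for h in range(1, max_heads + 1) if hidden_size % h == 0]


hidden_size = 300
heads = valid_head_counts(hidden_size)
print(heads)               # candidate values for config.num_attention_heads
print(hidden_size // 10)   # per-head dimension when using 10 heads
```

With 10 heads, each head works on a 30-dimensional slice of the hidden state.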

In [7]:
count_param(model)

27440982

# Prepare Dataset

In [8]:
train_dataset_path = './dataset/casa_absa-prosa/train_preprocess.csv'
valid_dataset_path = './dataset/casa_absa-prosa/valid_preprocess.csv'
test_dataset_path = './dataset/casa_absa-prosa/test_preprocess_masked_label.csv'

In [9]:
train_dataset = AspectBasedSentimentAnalysisProsaDataset(train_dataset_path, tokenizer, lowercase=True)
valid_dataset = AspectBasedSentimentAnalysisProsaDataset(valid_dataset_path, tokenizer, lowercase=True)
test_dataset = AspectBasedSentimentAnalysisProsaDataset(test_dataset_path, tokenizer, lowercase=True)

train_loader = AspectBasedSentimentAnalysisDataLoader(dataset=train_dataset, max_seq_len=512, batch_size=16, num_workers=16, shuffle=True)  
valid_loader = AspectBasedSentimentAnalysisDataLoader(dataset=valid_dataset, max_seq_len=512, batch_size=16, num_workers=16, shuffle=False)  
test_loader = AspectBasedSentimentAnalysisDataLoader(dataset=test_dataset, max_seq_len=512, batch_size=16, num_workers=16, shuffle=False)

In [10]:
w2i, i2w = AspectBasedSentimentAnalysisProsaDataset.LABEL2INDEX, AspectBasedSentimentAnalysisProsaDataset.INDEX2LABEL
print(w2i)
print(i2w)

{'negative': 0, 'neutral': 1, 'positive': 2}
{0: 'negative', 1: 'neutral', 2: 'positive'}


# Test model on sample sentences

In [11]:
text = 'mesin 3SZ - VE 1500 cc ini memang lebih pas di badan Avanza, bertenaga namun hemat bahan bakar'
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]
labels = [torch.topk(logit, k=1, dim=-1)[1].squeeze().item() for logit in logits]

print(f'Text: {text}')
for i, label in enumerate(labels):
    print(f'Label `{AspectBasedSentimentAnalysisProsaDataset.ASPECT_DOMAIN[i]}` : {i2w[label]} ({F.softmax(logits[i], dim=-1).squeeze()[label] * 100:.3f}%)')

Text: mesin 3SZ - VE 1500 cc ini memang lebih pas di badan Avanza, bertenaga namun hemat bahan bakar
Label `fuel` : neutral (35.591%)
Label `machine` : neutral (39.669%)
Label `others` : negative (37.829%)
Label `part` : negative (39.417%)
Label `price` : positive (38.619%)
Label `service` : negative (39.931%)
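
The confidences above sit close to one third because the model has not been fine-tuned yet, so the three-way softmax for each aspect is nearly uniform. The sketch below reproduces the label selection and confidence computation in plain Python; the logits are made up for illustration and `predict` is a hypothetical helper, not a function from this repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, index2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return index2label[best], probs[best]


i2w = {0: 'negative', 1: 'neutral', 2: 'positive'}
# Made-up logits that are close together, as for an untrained head
label, prob = predict([0.05, 0.20, -0.10], i2w)
print(f'{label} ({prob * 100:.3f}%)')
```

After fine-tuning, the logits for the correct class should separate from the others, which pushes the reported confidence well above one third.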


In [12]:
text = 'Jazz enak. tetapi mahal harga dan perawatan gila .'
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]
labels = [torch.topk(logit, k=1, dim=-1)[1].squeeze().item() for logit in logits]

print(f'Text: {text}')
for i, label in enumerate(labels):
    print(f'Label `{AspectBasedSentimentAnalysisProsaDataset.ASPECT_DOMAIN[i]}` : {i2w[label]} ({F.softmax(logits[i], dim=-1).squeeze()[label] * 100:.3f}%)')

Text: Jazz enak. tetapi mahal harga dan perawatan gila .
Label `fuel` : negative (35.298%)
Label `machine` : negative (35.593%)
Label `others` : positive (35.325%)
Label `part` : neutral (35.073%)
Label `price` : neutral (36.711%)
Label `service` : positive (38.238%)


In [13]:
text = 'Toyota sekarang harga kemahalan, kualitas menurun, service parah, dan boros bensin. '
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]
labels = [torch.topk(logit, k=1, dim=-1)[1].squeeze().item() for logit in logits]

print(f'Text: {text}')
for i, label in enumerate(labels):
    print(f'Label `{AspectBasedSentimentAnalysisProsaDataset.ASPECT_DOMAIN[i]}` : {i2w[label]} ({F.softmax(logits[i], dim=-1).squeeze()[label] * 100:.3f}%)')

Text: Toyota sekarang harga kemahalan, kualitas menurun, service parah, dan boros bensin. 
Label `fuel` : neutral (36.924%)
Label `machine` : negative (34.913%)
Label `others` : negative (34.523%)
Label `part` : positive (39.489%)
Label `price` : neutral (36.348%)
Label `service` : positive (37.397%)


# Fine Tuning & Evaluation

In [14]:
model = model.cuda()  # Move the model to the GPU before constructing the optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-4)

In [15]:
# Train
n_epochs = 15
for epoch in range(n_epochs):
    model.train()
    torch.set_grad_enabled(True)
 
    total_train_loss = 0
    list_hyp, list_label = [], []

    train_pbar = tqdm(train_loader, leave=True, total=len(train_loader))
    for i, batch_data in enumerate(train_pbar):
        # Forward model
        loss, batch_hyp, batch_label = forward_sequence_multi_classification(model, batch_data[:-1], i2w=i2w, device='cuda')

        # Update model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        tr_loss = loss.item()
        total_train_loss = total_train_loss + tr_loss

        # Calculate metrics
        list_hyp += batch_hyp
        list_label += batch_label

        train_pbar.set_description("(Epoch {}) TRAIN LOSS:{:.4f} LR:{:.8f}".format((epoch+1),
            total_train_loss/(i+1), get_lr(optimizer)))

    # Calculate train metric
    metrics = absa_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) TRAIN LOSS:{:.4f} {} LR:{:.8f}".format((epoch+1),
        total_train_loss/(i+1), metrics_to_string(metrics), get_lr(optimizer)))

    # Evaluate on validation
    model.eval()
    torch.set_grad_enabled(False)
    
    total_loss, total_correct, total_labels = 0, 0, 0
    list_hyp, list_label = [], []

    pbar = tqdm(valid_loader, leave=True, total=len(valid_loader))
    for i, batch_data in enumerate(pbar):
        batch_seq = batch_data[-1]        
        loss, batch_hyp, batch_label = forward_sequence_multi_classification(model, batch_data[:-1], i2w=i2w, device='cuda')
        
        # Calculate total loss
        valid_loss = loss.item()
        total_loss = total_loss + valid_loss

        # Calculate evaluation metrics
        list_hyp += batch_hyp
        list_label += batch_label
        metrics = absa_metrics_fn(list_hyp, list_label)

        pbar.set_description("VALID LOSS:{:.4f} {}".format(total_loss/(i+1), metrics_to_string(metrics)))
        
    metrics = absa_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) VALID LOSS:{:.4f} {}".format((epoch+1),
        total_loss/(i+1), metrics_to_string(metrics)))

(Epoch 1) TRAIN LOSS:4.1539 ACC:0.77 F1:0.30 REC:0.33 PRE:0.34 LR:0.00010000
(Epoch 1) VALID LOSS:3.7344 ACC:0.79 F1:0.29 REC:0.33 PRE:0.26
(Epoch 2) TRAIN LOSS:3.7166 ACC:0.78 F1:0.30 REC:0.34 PRE:0.43 LR:0.00010000
(Epoch 2) VALID LOSS:3.2399 ACC:0.79 F1:0.31 REC:0.34 PRE:0.40
(Epoch 3) TRAIN LOSS:2.9910 ACC:0.81 F1:0.44 REC:0.42 PRE:0.68 LR:0.00010000
(Epoch 3) VALID LOSS:2.5191 ACC:0.88 F1:0.60 REC:0.56 PRE:0.72
(Epoch 4) TRAIN LOSS:2.3263 ACC:0.87 F1:0.57 REC:0.56 PRE:0.72 LR:0.00010000
(Epoch 4) VALID LOSS:2.1287 ACC:0.87 F1:0.59 REC:0.58 PRE:0.66
(Epoch 5) TRAIN LOSS:1.9501 ACC:0.89 F1:0.62 REC:0.61 PRE:0.71 LR:0.00010000
(Epoch 5) VALID LOSS:2.0207 ACC:0.88 F1:0.60 REC:0.59 PRE:0.65
(Epoch 6) TRAIN LOSS:1.7829 ACC:0.90 F1:0.64 REC:0.63 PRE:0.73 LR:0.00010000
(Epoch 6) VALID LOSS:1.8857 ACC:0.88 F1:0.61 REC:0.60 PRE:0.65
(Epoch 7) TRAIN LOSS:1.6555 ACC:0.90 F1:0.64 REC:0.63 PRE:0.73 LR:0.00010000
(Epoch 7) VALID LOSS:1.8391 ACC:0.88 F1:0.61 REC:0.61 PRE:0.67
(Epoch 8) TRAIN LOSS:1.5603 ACC:0.90 F1:0.66 REC:0.65 PRE:0.74 LR:0.00010000
(Epoch 8) VALID LOSS:1.7766 ACC:0.89 F1:0.66 REC:0.64 PRE:0.70
(Epoch 9) TRAIN LOSS:1.4270 ACC:0.91 F1:0.69 REC:0.67 PRE:0.77 LR:0.00010000
(Epoch 9) VALID LOSS:1.6964 ACC:0.89 F1:0.64 REC:0.62 PRE:0.69
(Epoch 10) TRAIN LOSS:1.3518 ACC:0.92 F1:0.73 REC:0.71 PRE:0.79 LR:0.00010000
(Epoch 10) VALID LOSS:1.6517 ACC:0.91 F1:0.73 REC:0.70 PRE:0.80
(Epoch 11) TRAIN LOSS:1.3000 ACC:0.92 F1:0.75 REC:0.72 PRE:0.81 LR:0.00010000
(Epoch 11) VALID LOSS:1.7199 ACC:0.90 F1:0.74 REC:0.74 PRE:0.75
(Epoch 12) TRAIN LOSS:1.1829 ACC:0.93 F1:0.79 REC:0.77 PRE:0.83 LR:0.00010000
(Epoch 12) VALID LOSS:1.6484 ACC:0.90 F1:0.73 REC:0.72 PRE:0.76
(Epoch 13) TRAIN LOSS:1.1448 ACC:0.94 F1:0.82 REC:0.80 PRE:0.85 LR:0.00010000
(Epoch 13) VALID LOSS:1.6318 ACC:0.91 F1:0.76 REC:0.74 PRE:0.78
(Epoch 14) TRAIN LOSS:1.0587 ACC:0.94 F1:0.84 REC:0.82 PRE:0.87 LR:0.00010000
(Epoch 14) VALID LOSS:1.5689 ACC:0.91 F1:0.77 REC:0.77 PRE:0.79
(Epoch 15) TRAIN LOSS:1.0276 ACC:0.94 F1:0.83 REC:0.81 PRE:0.86 LR:0.00010000
(Epoch 15) VALID LOSS:1.5291 ACC:0.92 F1:0.80 REC:0.80 PRE:0.82

In [16]:
# Evaluate on test
model.eval()
torch.set_grad_enabled(False)

total_loss, total_correct, total_labels = 0, 0, 0
list_hyp, list_label = [], []

pbar = tqdm(test_loader, leave=True, total=len(test_loader))
for i, batch_data in enumerate(pbar):
    _, batch_hyp, _ = forward_sequence_multi_classification(model, batch_data[:-1], i2w=i2w, device='cuda')
    list_hyp += batch_hyp

# Save prediction
df = pd.DataFrame({'label':list_hyp}).reset_index()
df.to_csv('pred.txt', index=False)

print(df)

100%|██████████| 12/12 [00:00<00:00, 20.54it/s]


     index                                              label
0        0  [neutral, neutral, neutral, negative, neutral,...
1        1  [neutral, neutral, positive, neutral, neutral,...
2        2  [neutral, neutral, neutral, positive, neutral,...
3        3  [neutral, neutral, negative, neutral, neutral,...
4        4  [neutral, neutral, positive, neutral, neutral,...
..     ...                                                ...
175    175  [neutral, neutral, neutral, neutral, neutral, ...
176    176  [neutral, neutral, neutral, negative, neutral,...
177    177  [neutral, neutral, neutral, positive, neutral,...
178    178  [neutral, neutral, neutral, negative, neutral,...
179    179  [positive, neutral, neutral, neutral, neutral,...

[180 rows x 2 columns]
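
Each row of the saved prediction file holds a list of per-aspect labels in a single `label` column. The sketch below shows one way to read such a file back into Python lists. It assumes the lists were written as their Python string representation inside a quoted CSV field, which is how a list column is typically serialised by `to_csv`; `read_predictions` is a hypothetical helper and the two sample rows are made up.

```python
import ast
import csv
import io


def read_predictions(file_obj):
    """Parse the `label` column of a prediction CSV into lists of strings."""
    reader = csv.DictReader(file_obj)
    return [ast.literal_eval(row['label']) for row in reader]


# In-memory stand-in for a prediction file, for illustration only
sample = io.StringIO(
    'index,label\n'
    '0,"[\'neutral\', \'positive\']"\n'
    '1,"[\'negative\', \'neutral\']"\n'
)
preds = read_predictions(sample)
print(preds[0])
```

For a real file, pass an open handle instead, for example `open('pred.txt', newline='', encoding='utf-8')`.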


# Test fine-tuned model on sample sentences

In [17]:
text = 'mesin 3SZ - VE 1500 cc ini memang lebih pas di badan Avanza, bertenaga namun hemat bahan bakar'
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]
labels = [torch.topk(logit, k=1, dim=-1)[1].squeeze().item() for logit in logits]

print(f'Text: {text}')
for i, label in enumerate(labels):
    print(f'Label `{AspectBasedSentimentAnalysisProsaDataset.ASPECT_DOMAIN[i]}` : {i2w[label]} ({F.softmax(logits[i], dim=-1).squeeze()[label] * 100:.3f}%)')

Text: mesin 3SZ - VE 1500 cc ini memang lebih pas di badan Avanza, bertenaga namun hemat bahan bakar
Label `fuel` : positive (82.744%)
Label `machine` : positive (95.344%)
Label `others` : neutral (85.283%)
Label `part` : neutral (96.303%)
Label `price` : neutral (97.331%)
Label `service` : neutral (99.295%)


In [18]:
text = 'Jazz enak. tetapi mahal harga dan perawatan gila .'
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]
labels = [torch.topk(logit, k=1, dim=-1)[1].squeeze().item() for logit in logits]

print(f'Text: {text}')
for i, label in enumerate(labels):
    print(f'Label `{AspectBasedSentimentAnalysisProsaDataset.ASPECT_DOMAIN[i]}` : {i2w[label]} ({F.softmax(logits[i], dim=-1).squeeze()[label] * 100:.3f}%)')

Text: Jazz enak. tetapi mahal harga dan perawatan gila .
Label `fuel` : neutral (99.786%)
Label `machine` : neutral (99.413%)
Label `others` : neutral (81.806%)
Label `part` : neutral (98.523%)
Label `price` : negative (48.632%)
Label `service` : positive (45.997%)


In [19]:
text = 'Toyota sekarang harga kemahalan, kualitas menurun, service parah'
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]
labels = [torch.topk(logit, k=1, dim=-1)[1].squeeze().item() for logit in logits]

print(f'Text: {text}')
for i, label in enumerate(labels):
    print(f'Label `{AspectBasedSentimentAnalysisProsaDataset.ASPECT_DOMAIN[i]}` : {i2w[label]} ({F.softmax(logits[i], dim=-1).squeeze()[label] * 100:.3f}%)')

Text: Toyota sekarang harga kemahalan, kualitas menurun, service parah
Label `fuel` : neutral (99.345%)
Label `machine` : neutral (95.504%)
Label `others` : neutral (90.489%)
Label `part` : neutral (86.372%)
Label `price` : negative (68.773%)
Label `service` : negative (48.384%)
