# <center> Sentence Classification using IndoBERT</h1> </center>


In this notebook, we will learn how to fine tuning a BERT model for sentiment classification using transformers library from 🤗. Specifically, given a sentence (in bahasa Indonesia), using the model we will predict whether the sentiment is positive or negative.

___Note:___ make sure you use GPU when training the model, if you run this notebook in Google Colab, set the hardware accelerator in the notebook setting onto GPU, if not, the training will take a very long time.

## 0. Preparation 
### 0.1 Install sentencepiece, transformers and datasets library

Since by default, transformers is not installed in google colab, thus  we need to install it manually.

In [1]:
!pip install sentencepiece==0.1.95
!pip install transformers==4.2.2
!pip install datasets==1.2.0

Collecting sentencepiece==0.1.95
  Downloading sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2 MB)
[?25l[K     |▎                               | 10 kB 34.0 MB/s eta 0:00:01[K     |▌                               | 20 kB 28.7 MB/s eta 0:00:01[K     |▉                               | 30 kB 19.9 MB/s eta 0:00:01[K     |█                               | 40 kB 16.8 MB/s eta 0:00:01[K     |█▍                              | 51 kB 7.8 MB/s eta 0:00:01[K     |█▋                              | 61 kB 9.1 MB/s eta 0:00:01[K     |██                              | 71 kB 8.6 MB/s eta 0:00:01[K     |██▏                             | 81 kB 9.6 MB/s eta 0:00:01[K     |██▌                             | 92 kB 10.3 MB/s eta 0:00:01[K     |██▊                             | 102 kB 7.7 MB/s eta 0:00:01[K     |███                             | 112 kB 7.7 MB/s eta 0:00:01[K     |███▎                            | 122 kB 7.7 MB/s eta 0:00:01[K     |███▌                   

### 0.2 Import Library

In [2]:
import copy
from datasets import Dataset
import pickle
import transformers
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, BertTokenizerFast
from tqdm.auto import tqdm
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report

### 0.2 Set Config

In [3]:
# Config that will be used in the rest of this notebook

# the number of data for every batch in the training process
batch_size = 4

# maximum vector length for every input 
encoder_max_len = 100

## 1. Data

### 1.1 Download the dataset
For this tutorial, we use sentence level sentiment dataset (SmSA) provide by [IndoNLU](https://github.com/indobenchmark/indonlu). The data itself already cleaned.

In [4]:
# Lets load the data 

# Train data
url_sentence_data_train= "https://raw.githubusercontent.com/indobenchmark/indonlu/master/dataset/smsa_doc-sentiment-prosa/train_preprocess.tsv"
sentence_data_train_df = pd.read_csv(url_sentence_data_train, sep="\t", error_bad_lines = False, header = None, names=['Sentence', 'Sentiment'])
print("Train data")
display(sentence_data_train_df.head())

print("\n")

# Test data
# In this test data we actually use the validation dataset from the repository, since the real test dataset is masked
url_sentence_data_test= "https://raw.githubusercontent.com/indobenchmark/indonlu/master/dataset/smsa_doc-sentiment-prosa/valid_preprocess.tsv"
sentence_data_test_df = pd.read_csv(url_sentence_data_test, sep="\t", error_bad_lines = False, header = None, names=['Sentence', 'Sentiment'])
print("Test data")
display(sentence_data_test_df.head())

Train data


Unnamed: 0,Sentence,Sentiment
0,warung ini dimiliki oleh pengusaha pabrik tahu...,positive
1,mohon ulama lurus dan k212 mmbri hujjah partai...,neutral
2,lokasi strategis di jalan sumatera bandung . t...,positive
3,betapa bahagia nya diri ini saat unboxing pake...,positive
4,duh . jadi mahasiswa jangan sombong dong . kas...,negative




Test data


Unnamed: 0,Sentence,Sentiment
0,"meski masa kampanye sudah selesai , bukan bera...",neutral
1,tidak enak,negative
2,restoran ini menawarkan makanan sunda . kami m...,positive
3,lokasi di alun alun masakan padang ini cukup t...,positive
4,betapa bejad kader gerindra yang anggota dprd ...,negative


### 1.2 Simple EDA and train test split

In [5]:
# the number of label that we will predict
num_labels = len(sentence_data_train_df['Sentiment'].unique())
print("Number of label: {}".format(num_labels))
print("\n")
print("The labels:")
for label in sentence_data_train_df['Sentiment'].unique():
  print(f"- {label}")

Number of label: 3


The labels:
- positive
- neutral
- negative


In [6]:
# Check the number of data for each label in train data
sentence_data_train_df.Sentiment.value_counts()

positive    6416
negative    3436
neutral     1148
Name: Sentiment, dtype: int64

The data seems unbalance, but to keep the simplicity for this notebook, we will treat it as it is without any sampling technique. Now, lets split the train data to get development set, which will be used to evaluate the model in the training phase.

In [7]:
# split train set to get the dev set 
sentence_data_train_df, sentence_data_dev_df= train_test_split(sentence_data_train_df, test_size=0.3, stratify = sentence_data_train_df.Sentiment, random_state = 42)

In [8]:
# Once again, check the number of data for each label in train data
sentence_data_train_df.Sentiment.value_counts()

positive    4491
negative    2405
neutral      804
Name: Sentiment, dtype: int64

In [9]:
# Check the number of data for each label in train data
sentence_data_dev_df.Sentiment.value_counts()

positive    1925
negative    1031
neutral      344
Name: Sentiment, dtype: int64

In [29]:
# Check the number of data for each label in test data
sentence_data_test_df.Sentiment.value_counts()

positive    735
negative    394
neutral     131
Name: Sentiment, dtype: int64

### 1.3 Preprocess data

In [10]:
# load the data into dataset object, this class from huggingface dataset library is useful to make it easier for preprocessing data
# and transform it into the desireable input for the model

dataset_train = Dataset.from_pandas(sentence_data_train_df[['Sentence', 'Sentiment']])
dataset_dev = Dataset.from_pandas(sentence_data_dev_df[['Sentence', 'Sentiment']])
dataset_test = Dataset.from_pandas(sentence_data_test_df[['Sentence', 'Sentiment']])

#### Load tokenizer

In [11]:
# First, lets load the tokenizer used to transform the sentence into the desireable input for the bert model

# we use this tokenizer/model checkpoint, because we will classify bahasa Indonesia sentences and these mode is very light, so it doesn't take a long time to train, 
# there are many choices for the checkpointl, see huggingface model hub (https://huggingface.co/models)

tokenizer_checkpoint = "indobenchmark/indobert-lite-base-p1" 
# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(tokenizer_checkpoint, do_lower_case=True)

# setting tokenizer padding in the right side
pad_on_right = tokenizer.padding_side == "right"

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=224974.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=112.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=2.0, style=ProgressStyle(description_wi…




In [12]:
# label map, to mapping the labels into numerical values, useful in the preprocessing data and getting the output label
unique_label = set(sentence_data_train_df['Sentiment'].unique())
label2id = {tag: id for id, tag in enumerate(unique_label)}
id2label = {id: tag for tag, id in label2id.items()}

print(label2id)
print(id2label)

{'positive': 0, 'neutral': 1, 'negative': 2}
{0: 'positive', 1: 'neutral', 2: 'negative'}


Next, let's set the encode function to preprocess/map the data in dataset object into the desirable form of data, so it can be processed later by the model. Please check [this page](https://huggingface.co/docs/datasets/processing.html) and [this page](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) for more information about preprocess data in dataset object.

In [13]:
# set encode function
def encode(example, encoder_max_len=encoder_max_len):
    
    text = copy.copy(example['Sentence'])
    label = copy.copy(example['Sentiment'])

    for i in range(len(label)):
        label[i] = label2id[label[i]]
        

    encoder_inputs = tokenizer(text, is_split_into_words=False, truncation=True, max_length=encoder_max_len, 
                               padding='max_length', return_overflowing_tokens=False, return_offsets_mapping=True)

    input_ids = encoder_inputs['input_ids']
    input_attention = encoder_inputs['attention_mask']
    offset_mapping = encoder_inputs.pop("offset_mapping") 

    outputs = {'input_ids':input_ids, 'attention_mask': input_attention, 
               "offset_mapping": offset_mapping, "labels": label}
    
    return outputs


In [14]:
tokenized_dataset_train = dataset_train.map(encode, batched=True, remove_columns=dataset_train.column_names)
tokenized_dataset_train.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels', 'offset_mapping'], output_all_columns=True)
tokenized_dataset_dev = dataset_dev.map(encode, batched=True, remove_columns=dataset_dev.column_names)
tokenized_dataset_dev.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels', 'offset_mapping'], output_all_columns=True)
tokenized_dataset_test = dataset_test.map(encode, batched=True, remove_columns=dataset_test.column_names)
tokenized_dataset_test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels', 'offset_mapping'], output_all_columns=True)

HBox(children=(FloatProgress(value=0.0, max=8.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=4.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




In [15]:
# load dataset into dataloader, for more details about DataLoader, please check the pytorch documentation about DataLoader
train_dl = DataLoader(tokenized_dataset_train, batch_size, shuffle=True)
valid_dl = DataLoader(tokenized_dataset_dev, batch_size, shuffle=False)
test_dl = DataLoader(tokenized_dataset_test, batch_size, shuffle=False)

## 2. Model

### 2.1 Load model

In [16]:
# load model that will be fine tuned
model_checkpoint = tokenizer_checkpoint

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1542.0, style=ProgressStyle(description…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=46739879.0, style=ProgressStyle(descrip…




Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at indobenchmark/indobert-lite-base-p1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Set some training configuration

In [17]:
# set get device function
def get_default_device():
    """Pick GPU if available, else CPU"""
    if torch.cuda.is_available():
        return torch.device('cuda')
    else:
        return torch.device('cpu')

In [18]:
# get device
device = get_default_device()

In [19]:
# set optimizer and put model into device
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model = model.to(device)

In [20]:
# folder to put trained models
model_dir = "Model"

# create the folder
!mkdir Model

In [21]:
# !!! don't forget to save the label map, so that we can use it to translate the model output, we will put these into the "Model" folder
with open(f"{model_dir}/label2id_pkl", 'wb') as f1:
    pickle.dump(label2id, f1)
    
with open(f'{model_dir}/id2label_pkl', 'wb') as f2:
    pickle.dump(id2label, f2)  

## 3. Training

In [22]:
# function to get the learning rate value
def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

In [23]:
# function to write the information about saved(best) model, feel free to modify this function based on your need
def write_best_model_info(epoch, monitor_val):
    f = open(f"{model_dir}/model_info.txt", "w")
    f.write(f"epoch:{epoch}\n")
    f.write(f"monitor_val:{monitor_val}\n")
    f.close()

Define custom training function

In [24]:
# fit function for training, it takes number of epoch, the model we want to fine tuned, train set loader, valid/dev set loader, optimizer 
# and the metric we want to monitor as the parameters, this function monitor 'accuracy' by default 

def fit(num_epochs, model, train_loader, valid_loader, opt, monitor='acc'):
    
    max_eval_score = 0
    monitor_val = 0
    monitor_val_max = 0

    for epoch in range(num_epochs):        
        model.train()
        train_loss = 0
        list_train_true_labels = []
        list_train_pred_label = []
        train_pbar = tqdm(train_loader, leave=True, total=len(train_loader))        
        for i, batch_data in enumerate(train_pbar):
            input_ids, attention_mask, labels = batch_data["input_ids"], batch_data["attention_mask"], batch_data["labels"]
            list_train_true_labels += labels.tolist()
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            labels = labels.to(device)
            opt.zero_grad()
            output = model(input_ids = input_ids, attention_mask = attention_mask, labels = labels)
            loss = output.loss
            train_loss += loss.item()
            logits = output.logits
            logits_labels = torch.argmax(logits, dim=1).tolist()
            list_train_pred_label += logits_labels
            loss.backward()
            opt.step()
            train_pbar.set_description("(Epoch {}) TRAIN LOSS:{:.4f} LR:{:.8f}".format((epoch+1), train_loss/(i+1), get_lr(opt)))

        acc = accuracy_score(list_train_true_labels, list_train_pred_label)
        precision, recall, f1, _ = precision_recall_fscore_support(list_train_true_labels, list_train_pred_label, average='macro')
        print("(Epoch {}) TRAIN LOSS:{:.4f} ACC:{:.4f} PREC:{:.4f} REC:{:.4f} F1:{:.4f} LR:{:.8f}".format((epoch+1), train_loss/(i+1), acc, precision, recall, 
                                                                                                          f1, get_lr(optimizer)))

        model.eval()
        pbar = tqdm(valid_loader, leave=True, total=len(valid_loader))
        with torch.no_grad():
            val_loss = 0
            list_val_true_labels = []
            list_val_pred_label = []
            for idx, data in enumerate(pbar):
                input_ids, attention_mask, labels = data["input_ids"], data["attention_mask"], data["labels"]
                input_ids = input_ids.to(device)
                attention_mask = attention_mask.to(device)
                list_val_true_labels += labels.tolist()
                labels = labels.to(device)
                opt.zero_grad()
                output = model(input_ids = input_ids, attention_mask = attention_mask, labels = labels)
                loss = output.loss
                logits = output.logits
                logits_labels = torch.argmax(logits, dim=1).tolist()
                list_val_pred_label += logits_labels
                val_loss += loss.item()
                pbar.set_description("(Epoch {}) VALID LOSS:{:.4f}".format((epoch+1), val_loss/(i+1)))

        acc = accuracy_score(list_val_true_labels, list_val_pred_label)
        precision, recall, f1, _ = precision_recall_fscore_support(list_val_true_labels, list_val_pred_label, average='macro')
        print("(Epoch {}) VALIDLOSS:{:.4f} ACC:{:.4f} PREC:{:.4f} REC:{:.4f} F1:{:.4f} LR:{:.8f}".format((epoch+1), val_loss/(i+1), acc, precision, recall, 
                                                                                                          f1, get_lr(optimizer)))        
        if monitor == "acc":
            monitor_val = acc
        elif monitor == "precision":
            monitor_val = precision
        elif monitor == "recall":
            monitor_val = recall
        elif monitor == "f1":
            monitor_val = f1
        else:
            monitor_val = acc

        if monitor_val > monitor_val_max: 
            monitor_val_max = monitor_val 
            model.save_pretrained(f"{model_dir}/best_model_rel/")
            write_best_model_info(epoch+1, monitor_val)
        else: 
            pass

Begin training

In [25]:
# Training
# feel free to change the number of epoch as you need, but its better not to exceed 20
num_epochs = 5

# lets training and saved the best model based on f1 score
fit(num_epochs, model, train_dl, valid_dl, optimizer, monitor='f1')

HBox(children=(FloatProgress(value=0.0, max=1925.0), HTML(value='')))


(Epoch 1) TRAIN LOSS:0.2860 ACC:0.8947 PREC:0.8646 REC:0.8477 F1:0.8555 LR:0.00001000


HBox(children=(FloatProgress(value=0.0, max=825.0), HTML(value='')))


(Epoch 1) VALIDLOSS:0.0909 ACC:0.9306 PREC:0.9321 REC:0.8941 F1:0.9089 LR:0.00001000


HBox(children=(FloatProgress(value=0.0, max=1925.0), HTML(value='')))


(Epoch 2) TRAIN LOSS:0.1397 ACC:0.9544 PREC:0.9421 REC:0.9379 F1:0.9400 LR:0.00001000


HBox(children=(FloatProgress(value=0.0, max=825.0), HTML(value='')))


(Epoch 2) VALIDLOSS:0.0794 ACC:0.9348 PREC:0.9254 REC:0.9112 F1:0.9171 LR:0.00001000


HBox(children=(FloatProgress(value=0.0, max=1925.0), HTML(value='')))


(Epoch 3) TRAIN LOSS:0.0725 ACC:0.9756 PREC:0.9739 REC:0.9671 F1:0.9704 LR:0.00001000


HBox(children=(FloatProgress(value=0.0, max=825.0), HTML(value='')))


(Epoch 3) VALIDLOSS:0.0915 ACC:0.9291 PREC:0.8998 REC:0.9241 F1:0.9112 LR:0.00001000


HBox(children=(FloatProgress(value=0.0, max=1925.0), HTML(value='')))


(Epoch 4) TRAIN LOSS:0.0385 ACC:0.9874 PREC:0.9871 REC:0.9845 F1:0.9858 LR:0.00001000


HBox(children=(FloatProgress(value=0.0, max=825.0), HTML(value='')))


(Epoch 4) VALIDLOSS:0.0933 ACC:0.9345 PREC:0.9106 REC:0.9259 F1:0.9180 LR:0.00001000


HBox(children=(FloatProgress(value=0.0, max=1925.0), HTML(value='')))


(Epoch 5) TRAIN LOSS:0.0277 ACC:0.9914 PREC:0.9925 REC:0.9890 F1:0.9907 LR:0.00001000


HBox(children=(FloatProgress(value=0.0, max=825.0), HTML(value='')))


(Epoch 5) VALIDLOSS:0.1080 ACC:0.9303 PREC:0.9268 REC:0.9059 F1:0.9137 LR:0.00001000


## 4. Evaluation

In [26]:
# Load the best model
best_model_checkpoint = f"{model_dir}/best_model_rel/"
best_model = AutoModelForSequenceClassification.from_pretrained(best_model_checkpoint)
best_model = best_model.to(device)

In [27]:
# evaluation function
def eval(model, data_loader, opt):

    target_names = []
    for id, label in id2label.items():
        target_names.append(label)

    pbar = tqdm(data_loader, leave=True, total=len(data_loader))

    with torch.no_grad():
        list_true_labels = []
        list_pred_label = []
        for idx, data in enumerate(pbar):
            input_ids, attention_mask, labels = data["input_ids"], data["attention_mask"], data["labels"]
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            list_true_labels += labels.tolist()
            opt.zero_grad()
            output = model(input_ids = input_ids, attention_mask = attention_mask)
            logits = output.logits
            logits_labels = torch.argmax(logits, dim=1).tolist()
            list_pred_label += logits_labels

    print(classification_report(list_true_labels, list_pred_label, target_names=target_names))

In [28]:
# evaluate the best model using data test
eval(best_model, test_dl, optimizer)

HBox(children=(FloatProgress(value=0.0, max=315.0), HTML(value='')))


              precision    recall  f1-score   support

    positive       0.98      0.94      0.96       735
     neutral       0.86      0.85      0.86       131
    negative       0.89      0.96      0.93       394

    accuracy                           0.94      1260
   macro avg       0.91      0.92      0.92      1260
weighted avg       0.94      0.94      0.94      1260



## 5. Using the model

In [39]:
# function to predict sentiment of a sentence
def predict(input_sentence, model):

  with torch.no_grad():
    encoder_inputs = tokenizer(input_sentence, is_split_into_words=False, truncation=True, max_length=encoder_max_len, 
                            padding='max_length', return_overflowing_tokens=False, return_offsets_mapping=True, return_tensors="pt")

    input_ids, attention_mask = encoder_inputs["input_ids"], encoder_inputs["attention_mask"]
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)
    output = model(input_ids = input_ids, attention_mask = attention_mask)
    logits = output.logits
    logits_labels = torch.argmax(logits, dim=1).tolist()
    label = id2label[logits_labels[0]]

    return label

In [40]:
input_sentence = "jakarta rasa cikampek ! mana kata nya menta lita bobodoh ? sedikit-sedikit ngadu ! ah payah"

# predict input_sentence with the best model
predict(input_sentence, best_model)

'negative'

### __*author: Hadi Muhshi*__