In [1]:
model_name = "ProsusAI/finbert"

from_pretrained() is a method in the Hugging Face transformers library (not in PyTorch itself) that initializes a model or tokenizer from pretrained weights, typically learned on a large dataset. It saves the time and compute that training from scratch would require, provided the new task uses data that is the same as or similar to the data the model was originally trained on. from_pretrained() is commonly used for transfer learning or for fine-tuning a model on a specific task.

In [4]:
from transformers import BertForSequenceClassification, BertTokenizer

# initialize the tokenizer for BERT models
tokenizer = BertTokenizer.from_pretrained(model_name)

# initialize the model for sequence classification
model = BertForSequenceClassification.from_pretrained(model_name)

In [8]:
import torch.nn.functional as F
import torch

1. Tokenize

2. Token IDs -> model

3. Model activations -> probabilities (using softmax)

4. Argmax of those probabilities

In [5]:
# this is our example text

txt = ("Given the recent downturn in stocks especially in tech which is likely to persist as yields keep going up, "
       "I thought it would be prudent to share the risks of investing in ARK ETFs, written up very nicely by "
       "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). The risks comes "
       "primarily from ARK's illiquid and very large holdings in small cap companies. ARK is forced to sell its "
       "holdings whenever its liquid ETF gets hit with outflows as is especially the case in market downturns. "
       "This could force very painful liquidations at unfavorable prices and the ensuing crash goes into a "
       "positive feedback loop leading into a death spiral enticing even more outflows and predatory shorts.")

tokenizer.encode_plus(txt, max_length=512, truncation=True, padding='max_length', add_special_tokens=True, return_tensors='pt')

* encode_plus() converts the input text into its token representation.
* txt is the input text to tokenize.
* max_length sets the maximum length of the input after tokenization.
* truncation=True truncates the input if it exceeds max_length.
* padding='max_length' pads inputs shorter than max_length up to max_length.
* add_special_tokens=True adds the special tokens [CLS] and [SEP] at the start and end of the input ([PAD] tokens are added by the padding option, not by this flag).
* return_tensors='pt' returns the output as PyTorch tensors.

In the code above, the output is a dictionary containing the padded token IDs (along with the attention mask) as PyTorch tensors. This is the input format that the model expects.

* [PAD]  = 0   = Fills the empty space when the input sequence is shorter than the sequence length the model requires
* [UNK]  = 100 = Stands in for any word or character that is not found in BERT's vocabulary
* [CLS]  = 101 = Marks the start of a sequence
* [SEP]  = 102 = Separator token that marks the end of a sequence and separates multiple sequences from each other
* [MASK] = 103 = Replaces tokens that are hidden from the model, used for masked language modeling
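To make the interaction between truncation, special tokens, and padding concrete, here is a minimal plain-Python sketch. It is not the tokenizer's actual implementation: pad_and_truncate is a hypothetical helper, and the word-piece IDs passed to it are only illustrative. The special-token IDs are the ones listed above.

```python
# Special-token IDs taken from the list above.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102


def pad_and_truncate(token_ids, max_length):
    """Hypothetical helper: mimic truncation, special tokens and max_length padding."""
    # Reserve two positions for [CLS] and [SEP], then truncate the content.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get attention 1; padding positions get 0.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# A short sequence is padded up to max_length.
short_seq = pad_and_truncate([2445, 1996, 3522], max_length=8)
print(short_seq["input_ids"])       # [101, 2445, 1996, 3522, 102, 0, 0, 0]
print(short_seq["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]

# A long sequence is cut so that [SEP] still fits at the end.
long_seq = pad_and_truncate(list(range(1000, 1020)), max_length=8)
print(long_seq["input_ids"])        # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```

The attention mask is what lets the model ignore the [PAD] positions, which is why the tokenizer returns it alongside the token IDs.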

In [6]:
token = tokenizer.encode_plus(txt, max_length=512, truncation=True, padding='max_length',
                               add_special_tokens=True, return_tensors='pt')  # use 'tf' instead of 'pt' for TensorFlow tensors

token

{'input_ids': tensor([[  101,  2445,  1996,  3522,  2091, 22299,  1999, 15768,  2926,  1999,
          6627,  2029,  2003,  3497,  2000, 29486,  2004, 16189,  2562,  2183,
          2039,  1010,  1045,  2245,  2009,  2052,  2022, 10975, 12672,  3372,
          2000,  3745,  1996, 10831,  1997, 19920,  1999, 15745,  3802, 10343,
          1010,  2517,  2039,  2200, 19957,  2011,  1031,  1996,  4562,  5430,
          1033,  1006, 16770,  1024,  1013,  1013,  1996,  4783,  2906, 27454,
          1012,  4942,  9153,  3600,  1012,  4012,  1013,  1052,  1013,  2569,
          1011,  3179,  1011,  2097,  1011, 15745,  1011, 15697,  1011,  6271,
          1007,  1012,  1996, 10831,  3310,  3952,  2013, 15745,  1005,  1055,
          5665, 18515, 21272,  1998,  2200,  2312,  9583,  1999,  2235,  6178,
          3316,  1012, 15745,  2003,  3140,  2000,  5271,  2049,  9583,  7188,
          2049,  6381,  3802,  2546,  4152,  2718,  2007,  2041, 12314,  2015,
          2004,  2003,  2926,  1996,  

Without **kwargs (keyword arguments written out explicitly):

random_func(var1='hello', var2='world')

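With **kwargs (a dictionary is unpacked into keyword arguments, which is how the token dictionary is passed to the model):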

input_dict = {'var1': 'hello', 'var2': 'world'}
random_func(**input_dict)
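The two calls above are equivalent. The following runnable sketch shows this, assuming a hypothetical definition of random_func (the notebook does not define one):

```python
def random_func(var1, var2):
    """Hypothetical function used only to illustrate keyword arguments."""
    return f"{var1} {var2}"


# Keyword arguments written out explicitly.
explicit = random_func(var1="hello", var2="world")

# The same call, with a dictionary unpacked into keyword arguments.
input_dict = {"var1": "hello", "var2": "world"}
unpacked = random_func(**input_dict)

print(explicit == unpacked)  # True
```

The dictionary keys must match the function's parameter names, just as the keys returned by the tokenizer match the argument names of the model's forward call.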

In [12]:
output = model(**token)

output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.8200,  2.4484,  0.0216]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [13]:
# extract the logits tensor (equivalent to output.logits here)

output[0]

tensor([[-1.8200,  2.4484,  0.0216]], grad_fn=<AddmmBackward0>)

In [14]:
# apply softmax to the logits output tensor of our model (in index 0) across dimension -1

probs = F.softmax(output[0], dim=-1)

probs

tensor([[0.0127, 0.9072, 0.0801]], grad_fn=<SoftmaxBackward0>)

(We use dim=-1 because -1 refers to the tensor's final dimension. For a 3D tensor with dims [0, 1, 2], writing dim=-1 is equivalent to writing dim=2, and dim=-2 is equivalent to dim=1. For a 2D tensor with dims [0, 1], as we have here, dim=-1 is equivalent to dim=1.)
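As a sanity check on what softmax computes, the probabilities can be reproduced by hand from the logits. This is a plain-Python sketch rather than what F.softmax does internally; the softmax function and the probs_check name are introduced here for illustration only.

```python
import math


def softmax(values):
    """Plain-Python softmax over a list of floats."""
    # Subtracting the maximum keeps exp() numerically stable.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.8200, 2.4484, 0.0216]  # logits printed by the model above
probs_check = softmax(logits)

# Should agree with the tensor printed by F.softmax up to rounding.
print([round(p, 4) for p in probs_check])
print(sum(probs_check))  # the probabilities sum to 1 (up to float error)
```

Each logit is exponentiated and divided by the sum of all exponentials, so the outputs are positive and sum to 1.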

We now have a tensor with one value per class, each between 0 and 1 and together summing to 1: these are our probabilities. Class index 1 has the highest probability, 0.9072. To extract that index we can use PyTorch's argmax function, which is available because we imported torch earlier.

In [15]:
pred = torch.argmax(probs)

pred

tensor(1)

torch.argmax(probs) returns the index of the largest value in the probs tensor. Called without a dim argument, it works on the flattened tensor, which is fine here because probs holds the probabilities for a single input; for a batch of inputs, pass dim=-1 to get one index per row. In natural language processing, probs usually holds the probability of each class in the model's output, so torch.argmax(probs) returns the index of the class with the highest probability.

Argmax outputs our winning class as 1 as expected. To convert this value from a PyTorch tensor to a Python integer we can use the item method.

In [16]:
pred.item()

1
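The class index on its own is not very readable. The sketch below shows, in plain Python, a row-wise argmax (the equivalent of argmax over dim=-1) followed by a lookup of the label name. The ID2LABEL mapping is an assumption about the label order of ProsusAI/finbert; in the notebook the mapping should be read from model.config.id2label instead. The second row of batch_probs is made up for illustration.

```python
# Assumed label order for ProsusAI/finbert; confirm with model.config.id2label.
ID2LABEL = {0: "positive", 1: "negative", 2: "neutral"}


def argmax_per_row(batch_probs):
    """Return the index of the largest value in each row (like argmax over dim=-1)."""
    return [max(range(len(row)), key=row.__getitem__) for row in batch_probs]


batch_probs = [
    [0.0127, 0.9072, 0.0801],  # probabilities from the example above
    [0.7000, 0.1000, 0.2000],  # made-up second row for illustration
]
pred_ids = argmax_per_row(batch_probs)
pred_labels = [ID2LABEL[i] for i in pred_ids]

print(pred_ids)     # [1, 0]
print(pred_labels)  # ['negative', 'positive']
```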