# Zero-shot Sentiment Classification using GPT-2
This tutorial uses the KR3 dataset. We classify the reviews in KR3 **without any training**, an approach known as *zero-shot* classification. We use GPT-2, a generative language model. For each review, we compare the probabilities the model assigns to a positive token and a negative token as the next token after the review.

Concepts
- Zero-shot text classification
- GPT-2

Libraries
- Datasets 
- Transformers
- PyTorch

## Basic setup
We load the dataset from the Hugging Face Hub.

In [1]:
# load dataset
from datasets import load_dataset

kr3 = load_dataset("Wittgensteinian/KR3", split='train')

kr3 = kr3.remove_columns(['__index_level_0__'])

In [2]:
kr3

Dataset({
    features: ['Rating', 'Review'],
    num_rows: 641762
})

In [3]:
kr3.features

{'Rating': Value(dtype='int32', id=None),
 'Review': Value(dtype='string', id=None)}

We are not going to use ambiguous reviews, i.e. reviews whose rating is 2. These reviews weren't intended to be classified in the first place.

In [4]:
kr3_binary = kr3.filter(lambda x: x['Rating'] != 2)
kr3_binary

  0%|          | 0/642 [00:00<?, ?ba/s]

Dataset({
    features: ['Rating', 'Review'],
    num_rows: 459021
})

## How we use GPT-2 for classification
Here we show how we classify a review, using a single example.

We load the model and the tokenizer. Here we use [GPT-2 trained on a Korean corpus](https://github.com/SKT-AI/KoGPT2).

In [5]:
# load model and tokenizer (GPT-2 trained on Korean corpus)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2", pad_token='<pad>')

model = AutoModelForCausalLM.from_pretrained("skt/kogpt2-base-v2")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The cell below shows how the input text is tokenized. The entries in `input_ids` are token IDs. Note that `tokenized_review['input_ids']` is a 2D tensor rather than a 1D tensor, because the tokenizer returns a batch (here, a batch of size one). Batches are used later, when we run inference on the entire dataset.

> It's fine to ignore `attention_mask` for now.

In [10]:
idx = 0
review = kr3_binary['Review'][idx]
label = kr3_binary['Rating'][idx]
tokenized_review = tokenizer(review, return_tensors='pt')
print(review)
print(tokenized_review)

숙성 돼지고기 전문점입니다. 건물 모양 때문에 매장 모양도 좀 특이하지만 쾌적한 편이고 살짝 레트로 감성으로 분위기 잡아놨습니다. 모든 직원분들께서 전부 가능하다고 멘트 쳐주시며, 고기는 초반 커팅까지는 구워주십니다. 가격 저렴한 편 아니지만 맛은 준수합니다. 등심덧살이 인상 깊었는데 구이로 별로일 줄 알았는데 육향 짙고 얇게 저며 뻑뻑하지 않았습니다. 하이라이트는 된장찌개. 진짜 굿입니다. 버터 간장밥, 골뱅이 국수 등 나중에 더 맛봐야 할 것들은 남겨뒀습니다.
{'input_ids': tensor([[44381, 26367,  6958, 10161,  8191, 21154, 10637,  9777,  9355, 13669,
          9777,  7235, 11732, 15846, 11686, 43752,  9266,  9466, 20387, 10286,
         11714,  9244, 12041, 33684, 13364,  7130, 16691,  9548, 18401,  7671,
          7285, 23916, 17483,  9826, 12524,   739, 18221, 13673,  8236,  7888,
          9061,  9065,  9446, 18622, 10114,  8614, 12109, 26089,  8236,  7895,
         12521, 11562, 29932,  9266, 22804, 32837, 22033, 37194,  9030,  7894,
          7216, 16912, 15464,  9958, 16693,  9073, 11434, 15126,  8149,  9566,
          9181, 31231,  9719,  8721, 14591,  6889, 25446,  9265,  7530,   739,
          7723,  7723,  9328, 10171, 16691,  9078,  9131, 51000,  9498,  8168,
          832

Now we feed the tokenized review to the model, and the model returns an output. Remember that we want the model's prediction for the token that follows the input text. Therefore, we take the output at the last position (hence the index -1).

In [11]:
y = model(**tokenized_review)
prediction = y.logits[0][-1]

This tensor holds the logits, i.e. the unnormalized scores (before softmax) of every token in the vocabulary as the next token after the input text.

In [12]:
prediction

tensor([-6.5961, -6.9506, -5.8478,  ..., -1.8356, -5.1506, -3.0455],
       grad_fn=<SelectBackward0>)

The token with the highest score is GPT-2's own prediction for the next token. It is not what we are interested in, though.

In [13]:
tokenizer.decode(prediction.argmax())

'그리고'

Here is a pair of typical sentences expressing positive and negative sentiment. From each sentence, we pick one token to represent *positive* and *negative* respectively.

In [14]:
print(tokenizer.tokenize('최고입니다')) 
print(tokenizer.tokenize('별로였습니다')) 

['▁최고', '입', '니', '다']
['▁별로', '였', '습니', '다']


In [34]:
print(tokenizer.encode('최고')) # it means 'Best'
print(tokenizer.encode('별로')) # it means 'Not good'

[10281]
[15126]


Is the probability assigned to the token 'Best' higher than the one assigned to the token 'Not good'? (Comparing logits gives the same answer, since softmax preserves order.)  
If it is, we predict that the input is *positive*. Otherwise, we predict that it is *negative*.
Remember that label==1 is for positive reviews, and 0 is for negative reviews.  
The code below therefore checks whether our prediction is right. And it is!

In [16]:
((prediction[10281] > prediction[15126]) == label).item()

True
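
The check above compares raw logits rather than probabilities. The sketch below illustrates why that is enough: softmax preserves the ordering of its inputs, so the larger logit always gets the larger probability. It uses plain Python and a made-up 4-token vocabulary. `toy_logits`, `POS_ID` and `NEG_ID` are hypothetical stand-ins, not values from the model.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary (made-up values).
toy_logits = [-6.6, -1.8, -5.2, -3.0]
POS_ID, NEG_ID = 1, 3  # stand-ins for the ids of the two label tokens

probs = softmax(toy_logits)

# The decision is the same whether we compare logits or probabilities.
decision_from_logits = toy_logits[POS_ID] > toy_logits[NEG_ID]
decision_from_probs = probs[POS_ID] > probs[NEG_ID]
print(decision_from_logits, decision_from_probs)
```

Skipping the softmax also avoids normalizing over the whole vocabulary for every review.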

## Prediction on entire dataset using PyTorch
Now we make predictions on the entire dataset. A plain for loop over single reviews would take forever.  
We use **PyTorch** to run inference on the GPU, which is much faster. Unlike the single-example case, we now process the reviews in **batches**.

### Tokenization
We tokenize every review in the dataset. Two steps commonly performed during tokenization are:
- *truncation*: cutting the input text when it is too long (i.e. it exceeds `max_length`).
- *padding*: adding extra tokens at the end of the input text so that all sequences in a batch have the same length. We do not pad here. Instead, we use dynamic padding when we create the PyTorch DataLoader.

In [17]:
# tokenize
def tokenize_func(x):
    return tokenizer(x['Review'], max_length=256, truncation=True)

kr3_tokenized = kr3_binary.map(tokenize_func, batched=True)

  0%|          | 0/460 [00:00<?, ?ba/s]

Check the new features: `attention_mask` and `input_ids`. These are the inputs to the model (GPT-2).

In [18]:
kr3_tokenized

Dataset({
    features: ['Rating', 'Review', 'input_ids', 'attention_mask'],
    num_rows: 459021
})

We no longer need the `Review` feature, so we remove it. We also set the format of this dataset to 'torch', as we're going to use PyTorch.

In [19]:
kr3_tokenized = kr3_tokenized.remove_columns(['Review'])
kr3_tokenized.set_format('torch')
kr3_tokenized

Dataset({
    features: ['Rating', 'input_ids', 'attention_mask'],
    num_rows: 459021
})

### GPT-2 Inference using PyTorch

We create a PyTorch DataLoader. See the example batch.  
- `Rating[i]` represents the label (0 or 1) for the (*i+1*)th review in the batch.
- `attention_mask[i]` and `input_ids[i]` are the tokenized (*i+1*)th review in the batch.

Dynamic Padding
> We set dynamic padding using `DataCollatorWithPadding` from `transformers`. Dynamic padding pads to the longest sequence in the batch, instead of padding to a fixed length. In the example below, we can deduce that the longest sequence in the batch had a length of 117.

In [20]:
# pytorch dataloader
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
batch_size = 8
data_loader = DataLoader(kr3_tokenized, batch_size=batch_size, collate_fn=data_collator)

# example batch
batch = next(iter(data_loader))
print({k:v.size() for k,v in batch.items()})

{'Rating': torch.Size([8]), 'input_ids': torch.Size([8, 117]), 'attention_mask': torch.Size([8, 117])}
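
To make the idea of dynamic padding concrete, here is a minimal plain-Python sketch of what a padding collator does conceptually. It is not the implementation of `DataCollatorWithPadding`. The helper `pad_batch`, the value of `PAD_ID` and the toy token ids are all hypothetical, and the sketch assumes right padding and sequences that were already truncated.

```python
PAD_ID = 3  # hypothetical id of the padding token

def pad_batch(sequences, pad_id=PAD_ID, pad_to=None):
    """Right-pad token-id lists and build the matching attention masks.

    With pad_to=None the target length is the longest sequence in the
    batch (dynamic padding); otherwise it is the given fixed length.
    Sequences are assumed to be no longer than the target length.
    """
    target = max(len(s) for s in sequences) if pad_to is None else pad_to
    input_ids = [s + [pad_id] * (target - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (target - len(s)) for s in sequences]
    return input_ids, attention_mask

# Three made-up tokenized reviews of different lengths.
toy_batch = [[11, 12, 13, 14, 15], [21, 22], [31, 32, 33]]

dyn_ids, dyn_mask = pad_batch(toy_batch)                   # pads to length 5
fixed_ids, fixed_mask = pad_batch(toy_batch, pad_to=256)   # pads to length 256

print(len(dyn_ids[0]), len(fixed_ids[0]))
```

With dynamic padding the model processes 5 positions per review in this toy batch instead of 256, which is where the speed-up comes from.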


We set up the GPU. Without a GPU, inference over the entire dataset would be impractically slow.

In [21]:
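import torch

# use the GPU if available; otherwise fall back to the CPU so that `device` is always defined
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
    print('GPU not ready, falling back to CPU (inference will be very slow)')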

In [22]:
# move model to the GPU
model = model.to(device)

We predict the sentiment of each review by running GPT-2 inference. This loop will take some time.

> `input_lens` holds the number of tokens in each input text. We use it to pick the model's output at the last real (non-padding) token, i.e. the prediction for the token right after the input text.

> If you run out of GPU memory, try reducing `max_length` (in tokenization) or `batch_size`.

In [None]:
from tqdm.notebook import tqdm

confusion_matrix = [[0,0],[0,0]]

for batch in tqdm(data_loader):
    batch = {k:v.to(device) for k,v in batch.items()} # move the data to the GPU
    with torch.no_grad(): # gradients are not needed for inference; this saves memory
        y = model(input_ids = batch['input_ids'], attention_mask = batch['attention_mask']) # forward
    input_lens = batch['attention_mask'].sum(axis=1) # length of inputs in the batch

    for i in range(len(batch['Rating'])):
        next_token_prediction = y.logits[i, input_lens[i]-1] # logits at the last non-padding token of a single review

        # prediction result (True counts as index 1, False as index 0)
        predicted_label = (next_token_prediction[10281] > next_token_prediction[15126]).item() 
        true_label = batch['Rating'][i].item()
        confusion_matrix[true_label][predicted_label] += 1 # indexed as [true_label][predicted_label]
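
The indexing `y.logits[i, input_lens[i]-1]` in the loop above relies on the reviews being padded on the right, so that the real tokens occupy the first positions of each row. The plain-Python sketch below shows the arithmetic on a made-up attention mask. `last_token_positions` and `toy_mask` are hypothetical names used only for illustration.

```python
def last_token_positions(attention_mask):
    """Index of the last non-padding token in each right-padded row."""
    return [sum(row) - 1 for row in attention_mask]

# 1 marks a real token, 0 marks padding (right-padded batch of 3 reviews).
toy_mask = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
]

positions = last_token_positions(toy_mask)
print(positions)  # [4, 1, 2]
```

Using index -1 for every row would read the output at a padding position for the second and third reviews. If the tokenizer padded on the left instead, index -1 would be the right choice for every row.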

### Result

In [39]:
import numpy as np
confusion_np_matrix = np.array(confusion_matrix)
confusion_np_matrix

array([[ 33148,  37762],
       [ 76503, 311608]])

An accuracy of 75% is better than uniform random guessing (50%), but worse than always predicting the majority class (about 85%, the share of positive reviews in the dataset). Note that the best accuracy with pretrain-finetuning was [96%](https://wandb.ai/wittgensteinian/Parameter-Efficient-Tuning?workspace=user-wittgensteinian).

The model struggles especially with the negative label. This trend also appears in the standard pretrain-finetuning approach.

In [47]:
print('Accuracy:', confusion_np_matrix.diagonal().sum() / confusion_np_matrix.sum())
print('Precision for positive:', confusion_np_matrix[1,1] / confusion_np_matrix[:,1].sum())
print('Precision for negative:', confusion_np_matrix[0,0] / confusion_np_matrix[:,0].sum())
print('Recall for positive:', confusion_np_matrix[1,1] / confusion_np_matrix[1,:].sum())
print('Recall for negative:', confusion_np_matrix[0,0] / confusion_np_matrix[0,:].sum())

Accuracy: 0.7510680339243738
Precision for positive: 0.8919140166585569
Precision for negative: 0.30230458454551257
Recall for positive: 0.8028837111032668
Recall for negative: 0.46746580172049074


Let's try a pair of random tokens as the sentiment indicators and see what happens. The result looks terrible: accuracy drops to about 24%.

In [68]:
n1 = np.random.randint(100, tokenizer.vocab_size)
n2 = np.random.randint(100, tokenizer.vocab_size)
print(n1, tokenizer.decode([n1]))
print(n2, tokenizer.decode([n2]))

14003 시에
18372 조금씩


In [69]:
random_confusion_matrix = [[0,0],[0,0]]

for batch in tqdm(data_loader):
    batch = {k:v.to(device) for k,v in batch.items()} # move the data to the GPU
    with torch.no_grad(): # gradients are not needed for inference; this saves memory
        y = model(input_ids = batch['input_ids'], attention_mask = batch['attention_mask']) # forward
    input_lens = batch['attention_mask'].sum(axis=1) # length of inputs in the batch

    for i in range(len(batch['Rating'])):
        next_token_prediction = y.logits[i, input_lens[i]-1] # logits at the last non-padding token of a single review

        # prediction result (True counts as index 1, False as index 0)
        predicted_label = (next_token_prediction[n1] > next_token_prediction[n2]).item() 
        true_label = batch['Rating'][i].item()
        random_confusion_matrix[true_label][predicted_label] += 1 # indexed as [true_label][predicted_label]

  0%|          | 0/57378 [00:00<?, ?it/s]

In [71]:
import numpy as np
random_confusion_np_matrix = np.array(random_confusion_matrix)
random_confusion_np_matrix

array([[ 58097,  12813],
       [334177,  53934]])

In [73]:
print('Accuracy:', random_confusion_np_matrix.diagonal().sum() / random_confusion_np_matrix.sum())

Accuracy: 0.244065086346812
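
One way to read this number is to look at how often the random pair predicts each class. The arithmetic below, in plain Python on the matrix reported above, is only a sketch of that reading. The variable names are illustrative, and the interpretation is a tentative one rather than a result from the original experiment.

```python
# Confusion matrix of the random token pair, indexed as [true][predicted].
random_cm = [[58097, 12813], [334177, 53934]]

total = sum(sum(row) for row in random_cm)
accuracy = (random_cm[0][0] + random_cm[1][1]) / total

# Share of reviews for which the pair predicted "negative" (column 0).
predicted_negative_share = (random_cm[0][0] + random_cm[1][0]) / total

# Accuracy we would get if the roles of the two tokens were swapped
# (ignoring exact ties between the two logits).
flipped_accuracy = (random_cm[0][1] + random_cm[1][0]) / total

print(f"accuracy: {accuracy:.4f}")
print(f"predicted negative: {predicted_negative_share:.4f}")
print(f"accuracy with roles swapped: {flipped_accuracy:.4f}")
```

The pair predicts "negative" for about 85% of reviews, while about 85% of the reviews are actually positive. Swapping `n1` and `n2` would therefore give roughly 76% accuracy, close to the 75% of the hand-picked pair. This suggests that a large part of either score may come from a fixed preference for one of the two tokens combined with the class imbalance, rather than from the content of the reviews.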
