# 143. Movie Review Sentiment Classification

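- Internally, the model actually consists of two models: DistilBERT + Logistic Regression

**DistilBERT** processes the sentence and passes the extracted information on to the next model. DistilBERT is a smaller version of BERT developed and open-sourced by the HuggingFace team. It is a lighter, faster version of BERT that retains about 95% of its performance.

- The logistic regression model takes DistilBERT's output and classifies the sentence as positive or negative (1 or 0, respectively).

- The data passed between the two models is a vector of size 768. This vector can be thought of as an embedding of the sentence that is used for classification.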

<img src="https://jalammar.github.io/images/distilBERT/distilbert-bert-sentiment-classifier.png" width="700"/>
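The data flow between the two models can be sketched at the level of array shapes. This is only an illustration: `encode_stub` is a hypothetical stand-in that returns random vectors instead of real DistilBERT outputs, and the classifier weights are untrained, so the predicted labels carry no meaning.

```python
import numpy as np

EMBED_DIM = 768  # size of the vector passed between the two models

def encode_stub(sentences, rng):
    """Hypothetical stand-in for DistilBERT: one 768-dim vector per sentence."""
    return rng.normal(size=(len(sentences), EMBED_DIM))

def logistic_predict(embeddings, weights, bias):
    """Logistic regression: sigmoid(x @ w + b), thresholded at 0.5."""
    probs = 1.0 / (1.0 + np.exp(-(embeddings @ weights + bias)))
    return (probs > 0.5).astype(int)

rng = np.random.default_rng(0)
sentences = ["a visually stunning film", "dull and lifeless"]  # made-up inputs
embeddings = encode_stub(sentences, rng)          # one row per sentence
weights, bias = rng.normal(size=EMBED_DIM), 0.0   # untrained, illustrative weights
labels = logistic_predict(embeddings, weights, bias)

print(embeddings.shape, labels.shape)  # (2, 768) (2,)
```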

## Dataset
- The dataset used in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html)
- It contains sentences from movie reviews, each labeled as positive (value 1) or negative (value 0)

!pip install transformers

In [3]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import tensorflow as tf

from transformers import BertModel, BertTokenizer, DistilBertModel, DistilBertTokenizer

if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU:", torch.cuda.get_device_name())
else:
    device = torch.device("cpu")
    print("No GPU available, using CPU")

No GPU available, using CPU


## Importing the dataset

In [15]:
df = pd.read_csv(
    'https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv',
    delimiter='\t', header=None, names=['text', 'label'])

In [16]:
print(df.shape)
df.head()

(6920, 2)


| | text | label |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |


Count how many sentences are "positive" (value 1) and how many are "negative" (value 0).

In [17]:
df['label'].value_counts()

1    3610
0    3310
Name: label, dtype: int64
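These counts also give a reference point for judging later accuracy figures. The small sketch below computes the accuracy of a trivial classifier that always predicts the more frequent label; the counts are copied from the output above, and the "majority baseline" framing is an addition for illustration.

```python
counts = {1: 3610, 0: 3310}          # label counts from value_counts() above
total = sum(counts.values())

majority_label = max(counts, key=counts.get)      # the more frequent label
baseline_accuracy = counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 4))  # 6920 1 0.5217
```

Any classifier trained on this data should score clearly above roughly 52% to be considered useful.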

## Loading the Pre-trained BERT model

- The DistilBERT model is smaller than BERT, much faster, and needs far less memory

In [18]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights \
        = DistilBertModel, DistilBertTokenizer, 'distilbert-base-uncased'

# To use the BERT model instead of DistilBERT:
# model_class, tokenizer_class, pretrained_weights \
#         = BertModel, BertTokenizer, 'bert-base-uncased'

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

model     = model_class.from_pretrained(pretrained_weights)
model.to(device)
None

## Preparing the Dataset

### 1) Tokenization
- Tokenize the sentences into the format BERT expects

In [19]:
tokenized = df['text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

### 2) Padding

In [20]:
tokenized.values[:2]

array([list([101, 1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128, 16603, 1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469, 3152, 102]),
       list([101, 4593, 2128, 27241, 23931, 2013, 1996, 6276, 2282, 2723, 1997, 2151, 2445, 12217, 7815, 102])],
      dtype=object)

In [21]:
tokenizer.decode(tokenized.values[0])

'[CLS] a stirring, funny and finally transporting re imagining of beauty and the beast and 1930s horror films [SEP]'

In [22]:
max_len = max([len(i) for i in tokenized.values])
max_len

67

In [23]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(tokenized.values, padding='post')
padded

array([[  101,  1037, 18385, ...,     0,     0,     0],
       [  101,  4593,  2128, ...,     0,     0,     0],
       [  101,  2027,  3653, ...,     0,     0,     0],
       ...,
       [  101,  1996,  5896, ...,     0,     0,     0],
       [  101,  1037,  5667, ...,     0,     0,     0],
       [  101,  1037, 12090, ...,     0,     0,     0]], dtype=int32)

In [24]:
padded.shape

(6920, 67)

### 3) Masking
Create an `attention_mask` so that the padding we added is ignored (masked) when the input is processed.

In [25]:
attention_mask = np.where(padded !=0, 1, 0 )
print(attention_mask.shape)
attention_mask[0]

(6920, 67)


array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0])
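The padding and masking steps can also be written with NumPy alone, which makes the relationship between the two arrays explicit. This is a minimal sketch rather than the notebook's actual method (which uses Keras `pad_sequences`); `pad_and_mask` and the toy token lists are hypothetical, and it assumes id 0 is reserved for padding, as in the arrays above.

```python
import numpy as np

PAD_ID = 0  # assumed padding id; matches the zeros used above

def pad_and_mask(token_lists, pad_id=PAD_ID):
    """Right-pad every token list to the longest length and build a 0/1 mask."""
    max_len = max(len(ids) for ids in token_lists)
    padded = np.full((len(token_lists), max_len), pad_id, dtype=np.int64)
    for row, ids in enumerate(token_lists):
        padded[row, :len(ids)] = ids           # real tokens go first ('post' padding)
    mask = (padded != pad_id).astype(np.int64)  # 1 = real token, 0 = padding
    return padded, mask

toy_ids = [[101, 1037, 18385, 102], [101, 4593, 102]]  # illustrative token ids
padded_toy, mask_toy = pad_and_mask(toy_ids)
print(padded_toy)
print(mask_toy)
```

Deriving the mask from `padded != pad_id` only works while no real token shares the padding id.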

## Creating the DataLoader

In [26]:
input_ids      = torch.tensor(padded).to(torch.int64).to(device)
attention_mask = torch.tensor(attention_mask).to(device)

In [27]:
BATCH_SIZE = 32

# Create the Dataset
dataset    = TensorDataset(input_ids, attention_mask)
# Create the iterator
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)

In [28]:
len(dataloader)

217
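The batch count follows directly from the dataset size and the batch size. The sketch below reproduces it, assuming the `DataLoader` default of keeping the final, smaller batch (`drop_last=False`); `batch_layout` is a hypothetical helper.

```python
import math

def batch_layout(n_samples, batch_size):
    """Return (number of batches, size of the last batch) when no batch is dropped."""
    n_batches = math.ceil(n_samples / batch_size)
    last_batch = n_samples - (n_batches - 1) * batch_size
    return n_batches, last_batch

print(batch_layout(6920, 32))  # (217, 8)
```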

### Feature Extraction

- Calling `model()` runs the sentences through BERT
- The results are collected in `last_hidden_states`

In [29]:
import time

with torch.no_grad():  # inference only: BERT's parameters are not updated
    hidden_batches = []
    s = time.time()

    for i, batch in enumerate(dataloader, start=1):
        input_batch, mask_batch = (t.to(device) for t in batch)
        hidden = model(input_batch, attention_mask=mask_batch)
        hidden_batches.append(hidden[0])  # hidden[0]: last-layer hidden states

        if i % 20 == 0:
            print(f"batch {i} : elapse time - {time.time()-s:.2f} sec")
            s = time.time()

    # concatenate all batches once, row-wise
    last_hidden_states = torch.cat(hidden_batches, dim=0)

batch 20 : elapse time - 39.65 sec
batch 40 : elapse time - 38.12 sec
batch 60 : elapse time - 37.65 sec
batch 80 : elapse time - 38.89 sec
batch 100 : elapse time - 38.98 sec
batch 120 : elapse time - 39.45 sec
batch 140 : elapse time - 40.04 sec
batch 160 : elapse time - 41.38 sec
batch 180 : elapse time - 41.21 sec
batch 200 : elapse time - 42.18 sec


In [30]:
last_hidden_states.shape

torch.Size([6920, 67, 768])

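- Only the output for the first token of each sentence is needed for classification, so we slice out just that part.
- BERT handles sentence classification by adding a `[CLS]` (classification) token at the start of every sentence, so for a classification problem only the embedding of the `[CLS]` token is needed.
- This embedding can be regarded as an embedding of the whole sentence.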

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

These vectors serve as the features for the logistic regression model, so we store them in the `features` variable.

In [77]:
features = last_hidden_states[:,0,:]
features.shape

torch.Size([6920, 768])
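The slicing step is easier to follow on a small array. The sketch below uses a toy NumPy array with made-up dimensions (2 sentences, 4 tokens, hidden size 3); the same `[:, 0, :]` indexing applies to a PyTorch tensor.

```python
import numpy as np

# Toy stand-in for last_hidden_states: (sentences, tokens, hidden_size)
toy_hidden = np.arange(2 * 4 * 3).reshape(2, 4, 3)

# Keep every sentence, only token position 0 ([CLS]), and every hidden unit
cls_vectors = toy_hidden[:, 0, :]

print(cls_vectors.shape)  # (2, 3)
print(cls_vectors)        # rows [0 1 2] and [12 13 14]
```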

Store the label indicating whether each sentence is positive or negative in the `labels` variable.


In [78]:
labels = df['label']
labels.shape

(6920,)

## Train/Test Split
Split the dataset into a training set and a test set.

In [121]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
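Since no `test_size` is passed, scikit-learn falls back to its default of holding out 25% of the samples. The arithmetic below shows the resulting set sizes; it is a sketch of the size calculation only, not of the shuffling.

```python
import math

n_samples = 6920
test_fraction = 0.25  # scikit-learn default when neither test_size nor train_size is given

n_test = math.ceil(test_fraction * n_samples)   # test set size is rounded up
n_train = n_samples - n_test

print(n_train, n_test)  # 5190 1730
```

The split is random on every run unless a `random_state` is supplied, so the accuracy figures reported later will vary slightly between runs.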

## (Method 1) Training sklearn's LogisticRegression model

In [122]:
lr_clf = LogisticRegression(max_iter=1000)

lr_clf.fit(train_features.to("cpu"), train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Evaluating the Model

In [123]:
lr_clf.score(test_features.to("cpu"), test_labels)

0.8283236994219653

## Proper SST2 scores
For reference, the [highest accuracy score](http://nlpprogress.com/english/sentiment_analysis.html) on this dataset is **96.8**. The score can be improved by **fine-tuning** DistilBERT: a fine-tuned DistilBERT can reach an accuracy of **90.7**, and the full-size BERT model can reach **94.9**.

## (Method 2) Building a Logistic Regression Neural Network

In [124]:
import torch.nn as nn

class LRNN(nn.Module):
    def __init__(self, n_inputs):
        super(LRNN, self).__init__()
        self.linear1 = nn.Linear(n_inputs, 128)
        self.linear2 = nn.Linear(128, 64)
        self.linear3 = nn.Linear(64, 1)

    def forward(self, x):
        x = torch.relu(self.linear1(x))
        x = torch.relu(self.linear2(x))
        yhat = torch.sigmoid(self.linear3(x))
        return yhat

In [125]:
classifier = LRNN(train_features.shape[1])
classifier.to(device)

LRNN(
  (linear1): Linear(in_features=768, out_features=128, bias=True)
  (linear2): Linear(in_features=128, out_features=64, bias=True)
  (linear3): Linear(in_features=64, out_features=1, bias=True)
)
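It helps to see how large this network is compared with a plain logistic regression on the same 768 features. Because `LRNN` has two hidden layers with ReLU, it is strictly a small multilayer perceptron with a sigmoid output rather than a single-layer logistic regression. The sketch below counts the parameters from the layer sizes printed above; `linear_params` is a hypothetical helper.

```python
def linear_params(n_in, n_out):
    """Parameters of a fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layers = [(768, 128), (128, 64), (64, 1)]   # layer sizes of LRNN shown above
per_layer = [linear_params(n_in, n_out) for n_in, n_out in layers]

print(per_layer, sum(per_layer))  # [98432, 8256, 65] 106753
print(linear_params(768, 1))      # 769, a single-layer logistic regression
```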

In [126]:
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)
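`nn.BCELoss` computes binary cross-entropy between a predicted probability and a 0/1 target and, by default, averages it over the batch. The plain-Python sketch below spells out that formula; the probabilities and targets are made-up values.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and a target y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mean_bce(probs, targets):
    """Average the per-sample losses, mirroring the default 'mean' reduction."""
    return sum(bce(p, y) for p, y in zip(probs, targets)) / len(probs)

print(bce(0.5, 1))                   # ln 2, about 0.6931
print(mean_bce([0.9, 0.2], [1, 0]))  # confident, correct predictions give a small loss
```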

In [127]:
BATCH_SIZE = 64

train_labels = torch.tensor(train_labels.values, dtype=torch.float).unsqueeze(1).to(device)
test_labels = torch.tensor(test_labels.values, dtype=torch.float).unsqueeze(1).to(device)

dataset_train = TensorDataset(train_features, train_labels)
loader_train = DataLoader(dataset_train, batch_size=BATCH_SIZE)

dataset_test = TensorDataset(test_features, test_labels)
loader_test = DataLoader(dataset_test, batch_size=BATCH_SIZE)

In [128]:
train_labels.size()

torch.Size([5190, 1])

In [129]:
LOSS = []

epochs = 500

for epoch in range(epochs):
    for i, (x, y) in enumerate(loader_train):
        x, y = x.to(device), y.to(device)   # keep inputs and targets on the same device
        yhat = classifier(x)
        loss = criterion(yhat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        LOSS.append(loss.item())            # store a float, not the graph-holding tensor
    if epoch % 20 == 0:
        print("epoch---", epoch)

epoch--- 0
epoch--- 20
epoch--- 40
epoch--- 60
epoch--- 80
epoch--- 100
epoch--- 120
epoch--- 140
epoch--- 160
epoch--- 180
epoch--- 200
epoch--- 220
epoch--- 240
epoch--- 260
epoch--- 280
epoch--- 300
epoch--- 320
epoch--- 340
epoch--- 360
epoch--- 380
epoch--- 400
epoch--- 420
epoch--- 440
epoch--- 460
epoch--- 480


In [130]:
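accuracy = []
with torch.no_grad():  # evaluation only, no gradients needed
    for x, y in loader_test:
        x, y = x.to(device), y.to(device)
        z = classifier(x)
        # per-batch accuracy; np.mean below weights every batch equally,
        # including a smaller final batch
        acc = sum((z > 0.5) == y)/z.size(0)
        accuracy.append(acc.item())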

In [131]:
np.mean(accuracy)

0.8013392857142857
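Averaging per-batch accuracies is not quite the same as overall accuracy when the last batch is smaller. With 1730 test sentences (25% of 6920) and a batch size of 64, the final batch holds only 2 sentences yet counts as much as a full batch. The sketch below illustrates the gap; the per-batch correct counts are hypothetical.

```python
batch_sizes = [64] * 27 + [2]   # 1730 test sentences in batches of 64
correct = [51] * 27 + [2]       # hypothetical number of correct predictions per batch

# Mean of per-batch accuracies: every batch has equal weight
mean_of_batches = sum(c / n for c, n in zip(correct, batch_sizes)) / len(batch_sizes)

# Overall accuracy: every sentence has equal weight
overall = sum(correct) / sum(batch_sizes)

print(round(mean_of_batches, 4), round(overall, 4))  # 0.8041 0.7971
```

Summing correct predictions and dividing by the total number of samples avoids this distortion.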

In [132]:
# x = 'Really terrible and boring movie...'
x = 'Really funny, good and lovely movie...'
tokenized = tokenizer.encode(x, add_special_tokens=True)

In [133]:
tokenizer.decode(tokenized)

'[CLS] really funny, good and lovely movie... [SEP]'

In [134]:
padded = pad_sequences([tokenized], maxlen=max_len, padding='post')
padded

array([[ 101, 2428, 6057, 1010, 2204, 1998, 8403, 3185, 1012, 1012, 1012,
         102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0]], dtype=int32)

In [135]:
attention_mask = np.where(padded !=0, 1, 0 )
attention_mask

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0]])

In [136]:
input_ids = torch.tensor(padded).to(torch.int64).to(device)
attention_mask = torch.tensor(attention_mask).to(device)

hidden = model(input_ids, attention_mask)
hidden[0].shape

torch.Size([1, 67, 768])

In [137]:
cls_embedding = hidden[0][0, 0, :].to("cpu").detach()  # [CLS] vector; avoids shadowing the built-in `input`
classifier(cls_embedding.to(device)).item()

0.9988042116165161

In [138]:
lr_clf.predict(cls_embedding.unsqueeze(0).to("cpu"))

array([1])