## Package Dependencies

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

import torch
import transformers as ppb

import warnings
import os.path as path

In [2]:
warnings.filterwarnings('ignore')

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

# Import Dataset

The dataset we will use in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html), which contains sentences from movie reviews, each labeled as either positive (label 1) or negative (label 0):

In [4]:
DATASET_DIR = "./data/sst2"

In [5]:
train_df = pd.read_csv(f"{DATASET_DIR}/train.tsv", delimiter = '\t', header = None)

In [6]:
train_df[1].value_counts()

1    3610
0    3310
Name: 1, dtype: int64

# Loading the Pre-trained BERT model

In [7]:
# load pretrained model/tokenizer

# For DistilBERT:
tokenizer = ppb.DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = ppb.DistilBertModel.from_pretrained('distilbert-base-uncased').to(device)

# Preparing the Dataset

Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

## Tokenization

Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with.

In [9]:
tokenized = train_df[0].apply(lambda x: tokenizer.encode(x, add_special_tokens = True))

In [10]:
tokenized

0       [101, 1037, 18385, 1010, 6057, 1998, 2633, 182...
1       [101, 4593, 2128, 27241, 23931, 2013, 1996, 62...
2       [101, 2027, 3653, 23545, 2037, 4378, 24185, 10...
3       [101, 2023, 2003, 1037, 17453, 14726, 19379, 1...
4       [101, 5655, 6262, 1005, 1055, 12075, 2571, 376...
                              ...                        
6915    [101, 9145, 1010, 7570, 18752, 14116, 1998, 28...
6916    [101, 2202, 2729, 2003, 19957, 2864, 2011, 103...
6917    [101, 1996, 5896, 4472, 4121, 1010, 3082, 7832...
6918    [101, 1037, 5667, 2919, 2143, 2007, 5667, 2561...
6919    [101, 1037, 12090, 2135, 2512, 5054, 19570, 23...
Name: 0, Length: 6920, dtype: object
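
To make the word/subword split more concrete, here is a minimal sketch of greedy longest-match-first subword splitting, which is the general idea behind BERT-style tokenizers. It is only an illustration: `TOY_VOCAB`, `wordpiece_split` and `toy_tokenize` are hypothetical names, the vocabulary is made up, and the real tokenizer also handles punctuation, casing rules and the mapping from tokens to IDs (for example, the `101` that opens every row above is the ID of `[CLS]`).

```python
# Toy illustration of greedy longest-match-first subword splitting.
# TOY_VOCAB is a made-up vocabulary, not DistilBERT's real one.
TOY_VOCAB = {"[CLS]", "[SEP]", "[UNK]", "a", "fun", "##ny", "film",
             "un", "##watch", "##able"}


def wordpiece_split(word, vocab):
    """Split one lowercase word into subword pieces, or [UNK] if impossible."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(sentence, vocab=TOY_VOCAB):
    """Whitespace-split a sentence and wrap its pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece_split(word, vocab))
    tokens.append("[SEP]")
    return tokens


toy_tokens = toy_tokenize("A funny unwatchable film")
print(toy_tokens)
```

Words that are in the vocabulary stay whole, while rarer words are broken into a leading piece plus `##`-prefixed continuation pieces.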

## Padding

In [11]:
# find the max length among all tokenized sentences
max_len = max(len(ids) for ids in tokenized.values)

print(f"Max sentence length: {max_len}")

Max sentence length: 67


In [12]:
padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
padded

array([[  101,  1037, 18385, ...,     0,     0,     0],
       [  101,  4593,  2128, ...,     0,     0,     0],
       [  101,  2027,  3653, ...,     0,     0,     0],
       ...,
       [  101,  1996,  5896, ...,     0,     0,     0],
       [  101,  1037,  5667, ...,     0,     0,     0],
       [  101,  1037, 12090, ...,     0,     0,     0]])

In [13]:
padded.shape

(6920, 67)

## Masking

If we sent `padded` to BERT directly, the model would attend to the padding tokens as if they were real input. We need a second array that tells it to ignore (mask) the padding we've added when it's processing its input.  
That's what *attention_mask* is:

In [14]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(6920, 67)
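
As a small sanity check of the padding and masking steps, the sketch below applies the same two operations to a tiny hand-written batch. The token IDs and the names `toy_batch`, `toy_padded` and `toy_mask` are made up for illustration. It also relies on the same assumption as the cells above: the padding ID is 0 and no real token uses ID 0, which is why `!= 0` is enough to build the mask.

```python
import numpy as np

PAD_ID = 0  # assumption: padding uses ID 0, as in the cells above

# Three made-up "tokenized sentences" of different lengths.
toy_batch = [[101, 2023, 2003, 102], [101, 2204, 102], [101, 102]]

# Pad every row on the right up to the longest row.
toy_max_len = max(len(ids) for ids in toy_batch)
toy_padded = np.array(
    [ids + [PAD_ID] * (toy_max_len - len(ids)) for ids in toy_batch]
)

# 1 marks a real token, 0 marks padding that the model should ignore.
toy_mask = (toy_padded != PAD_ID).astype(int)

print(toy_padded)
print(toy_mask)
print(toy_mask.sum(axis=1))  # number of real tokens per row
```

Summing the mask along each row recovers the original sentence lengths, which is a quick way to confirm that the mask lines up with the padding.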

## Modeling: Using BERT to Embed All Input Sentences

In [20]:
input_tensors = torch.tensor(padded).to(device)
attention_mask = torch.tensor(attention_mask).to(device)

# we only use the pretrained model to compute embeddings, not to train it, so turn autograd off
with torch.no_grad():
    last_hidden_states = model(input_tensors[:2000], attention_mask = attention_mask[:2000])

In [21]:
last_hidden_states[0].shape

torch.Size([2000, 67, 768])
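
The cell above embeds only the first 2000 sentences in a single forward pass. One possible way to cover more rows without holding every activation in memory at once is to process the input in fixed-size batches and concatenate the results. The sketch below shows only that looping pattern, under stated assumptions: `embed_in_batches` and `fake_embed` are hypothetical helpers, and `fake_embed` is a NumPy stand-in for the model rather than a call to DistilBERT.

```python
import numpy as np


def embed_in_batches(ids, mask, embed_fn, batch_size=256):
    """Apply embed_fn to consecutive row slices and stack the results."""
    chunks = []
    for start in range(0, len(ids), batch_size):
        stop = start + batch_size
        chunks.append(embed_fn(ids[start:stop], mask[start:stop]))
    return np.concatenate(chunks, axis=0)


def fake_embed(ids_chunk, mask_chunk):
    """Stand-in for the model: one 4-dimensional vector per input row."""
    lengths = mask_chunk.sum(axis=1, keepdims=True)
    return np.repeat(lengths, 4, axis=1).astype(float)


# Ten made-up rows of three token IDs each; the last three rows end in padding.
demo_ids = np.arange(30).reshape(10, 3)
demo_mask = np.ones_like(demo_ids)
demo_mask[7:, 2] = 0

demo_out = embed_in_batches(demo_ids, demo_mask, fake_embed, batch_size=4)
print(demo_out.shape)
```

In the notebook's setting, `embed_fn` would presumably wrap the `torch.no_grad()` call shown above and return the needed slice of the output as a NumPy array for each batch.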

## The embedded `CLS` token can be thought of as an embedding for the entire sentence!!

So, for a downstream task (classification, for now), we only need to retrieve the embedding vector of the `[CLS]` token, which sits at position 0 of every sequence.

In [34]:
features = last_hidden_states[0][:, 0, :].cpu().numpy() # all sentences, only the first position: [CLS], all hidden unit outputs
features.shape

(2000, 768)

In [33]:
labels = np.array(train_df[1].values[:2000])
labels.shape

(2000,)
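
The slice `[:, 0, :]` used above can be checked on a small random array. This is a minimal sketch with made-up shapes (3 sentences, 5 token positions, 8 hidden units) and hypothetical names `toy_hidden` and `toy_cls`; it only demonstrates the indexing, not the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shape convention: (sentences, token positions, hidden units).
toy_hidden = rng.normal(size=(3, 5, 8))

# Keep every sentence, only position 0 (where [CLS] sits), and every hidden unit.
toy_cls = toy_hidden[:, 0, :]

print(toy_cls.shape)

# Row i of the result is exactly the position-0 vector of sentence i.
print(np.array_equal(toy_cls[1], toy_hidden[1, 0]))
```

The token-position axis disappears, leaving one fixed-size vector per sentence, which is the 2-D feature matrix shape that scikit-learn classifiers expect.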

## Train/Test Split

Let's now split our dataset into a training set and a testing set (note that both come from the 2000 sentences we took from the SST2 training file):

In [35]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels) # default test size = 0.25

print(f"train_features.shape: {train_features.shape}")
print(f"test_features.shape: {test_features.shape}")

train_features.shape: (1500, 768)
test_features.shape: (500, 768)


In [36]:
# logistic regression classification
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression()

## Evaluation

In [38]:
lr_clf.score(test_features, test_labels)

0.818

How good is this score? What can we compare it against? Let's first look at a dummy classifier:

In [45]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier()

scores = cross_val_score(dummy_clf, train_features, train_labels)

print(f"Dummy classifier scores: {scores.mean():.3f}")

Dummy classifier scores: 0.465
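
A dummy score close to 0.5 is what we would expect from guessing on a roughly balanced two-class problem. Note that `DummyClassifier`'s default `strategy` has differed between scikit-learn versions, so the exact number may change with the installed version. As a complementary reference point, the sketch below computes a majority-class baseline by hand: always predict the most frequent training label. The helper `majority_baseline_accuracy` and the toy label arrays are hypothetical and are not taken from the notebook's data.

```python
import numpy as np


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent label in y_train."""
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    accuracy = float(np.mean(np.asarray(y_test) == majority))
    return accuracy, majority


# Made-up labels for illustration only.
toy_train = np.array([1, 1, 1, 0, 0])
toy_test = np.array([1, 0, 1, 1])

acc, majority = majority_baseline_accuracy(toy_train, toy_test)
print(f"majority label: {majority}, baseline accuracy: {acc:.2f}")
```

Any classifier worth keeping should clearly beat both this baseline and the dummy classifier.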
