<a href="https://colab.research.google.com/github/RenYuanXue/LearningBERT/blob/main/Fine_Tuning_with_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Load CoLA Dataset

Install the wget package, which is used to download the dataset.

In [2]:
!pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp37-none-any.whl size=9681 sha256=c1be62eda88565df776ac856e8ed9db8a2041d43ead811f35b49dea97af13f9c
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


Download the zip file using wget

In [3]:
import wget
import os

print('Downloading dataset...')

# The url for the dataset zip file.
url = 'https://nyu-mll.github.io/CoLA/cola_public_1.1.zip'

# Download the file (if we haven't already)
if not os.path.exists('./cola_public_1.1.zip'):
  wget.download(url, './cola_public_1.1.zip')

Downloading dataset...


Unzip the dataset to the file system.

In [4]:
# Unzip the dataset (if we haven't done so).
if not os.path.exists('./cola_public/'):
  !unzip cola_public_1.1.zip

Archive:  cola_public_1.1.zip
   creating: cola_public/
  inflating: cola_public/README      
   creating: cola_public/tokenized/
  inflating: cola_public/tokenized/in_domain_dev.tsv  
  inflating: cola_public/tokenized/in_domain_train.tsv  
  inflating: cola_public/tokenized/out_of_domain_dev.tsv  
   creating: cola_public/raw/
  inflating: cola_public/raw/in_domain_dev.tsv  
  inflating: cola_public/raw/in_domain_train.tsv  
  inflating: cola_public/raw/out_of_domain_dev.tsv  


Load the dataset into a pandas DataFrame.

In [5]:
import pandas as pd

# Load data to pandas dataframe.
df = pd.read_csv('./cola_public/raw/in_domain_train.tsv',
          delimiter = '\t', header = None,
          names = ['sentence_source', 'label', 'label_notes', 'sentence'])

# Report number of sentences in the dataframe.
print('Number of training sentences: {0}'.format(df.shape[0]))

# Display first few rows from the data.
df.head()

Number of training sentences: 8551


|   | sentence_source | label | label_notes | sentence |
|---|---|---|---|---|
| 0 | gj04 | 1 |  | Our friends won't buy this analysis, let alone... |
| 1 | gj04 | 1 |  | One more pseudo generalization and I'm giving up. |
| 2 | gj04 | 1 |  | One more pseudo generalization or I'm giving up. |
| 3 | gj04 | 1 |  | The more we study verbs, the crazier they get. |
| 4 | gj04 | 1 |  | Day by day the facts are getting murkier. |


If the sentence is grammatically correct, it is labelled as 1, otherwise 0.

Extract the sentences and labels used for BERT.

In [6]:
sentences = df.sentence.values
labels = df.label.values

## Tokenization & input formatting

Install the transformers library if it is not already available.

In [7]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=18934922cb

First, we need to load the BERT tokenizer.

In [8]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


Apply the tokenizer to our input sentences.

In [9]:
# Print original sentence.
print('Original: ', sentences[0])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentences[0]))

# Print the sentence mapped to token IDs.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0])))

Original:  Our friends won't buy this analysis, let alone the next one we propose.
Tokenized:  ['our', 'friends', 'won', "'", 't', 'buy', 'this', 'analysis', ',', 'let', 'alone', 'the', 'next', 'one', 'we', 'propose', '.']
Token IDs:  [2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012]


Use the tokenizer's encode function to encode the sentences. As called below it adds the special tokens but does not pad, so padding is handled separately afterwards.

In [10]:
# Save mapped sentences.
input_ids = []

# Loop through all sentences.
for sentence in sentences:
  # The encode function will:
  # 1. Split the sentence into tokens.
  # 2. Prepend [CLS] to the start of the sentence.
  # 3. Append [SEP] to the end of the sentence.
  # 4. Map the tokens to their IDs.
  encoded_sentence = tokenizer.encode(
              sentence,
              add_special_tokens = True # Add [SEP] and [CLS]
              # max_length = 128, # Truncate all sentences.
              # return_tensors = 'pt' # Return pytorch tensors.
            )
  input_ids.append(encoded_sentence)
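
To make the effect of add_special_tokens = True concrete, here is a minimal sketch in plain Python that wraps the token IDs printed earlier with the special-token IDs. It assumes the bert-base-uncased vocabulary, where [CLS] has ID 101 and [SEP] has ID 102. The helper add_special_ids is hypothetical and only mimics what the tokenizer does for a single sentence.

```python
# Assumed special-token IDs for the bert-base-uncased vocabulary.
CLS_ID = 101  # [CLS]
SEP_ID = 102  # [SEP]

def add_special_ids(token_ids):
    """Hypothetical helper: wrap one sentence's token IDs in [CLS] ... [SEP]."""
    return [CLS_ID] + list(token_ids) + [SEP_ID]

# Token IDs of the first training sentence, copied from the output above.
first_sentence_ids = [2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010,
                      2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012]

encoded = add_special_ids(first_sentence_ids)

# The encoded sequence is two tokens longer than the plain token list.
print(len(first_sentence_ids), len(encoded))
```

This is why the sequence lengths measured in the next step count the two special tokens as well as the word pieces.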

Next, we pad and truncate the sequences to a common length. First, find the maximum sentence length.

In [11]:
print('Max sentence length: {0}'.format(max([len(sentence) for sentence in input_ids])))

Max sentence length: 47


The longest encoded sentence has 47 tokens, so we choose MAX_LEN = 64, the smallest power of two that fits it (32 < 47 < 64).

In [12]:
# Use Keras to do the padding.
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 64

input_ids = pad_sequences(input_ids, maxlen = MAX_LEN, dtype = "long",
              value = 0, truncating = 'post', padding = 'post')
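
What padding = 'post' and truncating = 'post' do to a single sequence can be sketched in plain Python. The helper pad_or_truncate below is hypothetical and is not the Keras implementation; it only reproduces the behaviour for one list of IDs, with 0 as the padding value as in the call above.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Hypothetical helper: cut `ids` to `max_len`, then right-pad with `pad_id`."""
    trimmed = list(ids)[:max_len]  # 'post' truncation drops tokens from the end
    return trimmed + [pad_id] * (max_len - len(trimmed))  # 'post' padding appends

# A short sequence gets zeros appended on the right.
short = pad_or_truncate([101, 2256, 2814, 102], max_len=8)

# A sequence that is too long loses its last tokens.
long = pad_or_truncate(list(range(1, 11)), max_len=8)

print(short)
print(long)
```

With MAX_LEN = 64 and a longest sentence of 47 tokens, no sentence in this dataset is actually truncated; only the padding branch is exercised.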

Now, create attention masks.

In [13]:
attention_masks = []

for sentence in input_ids:
  curr_mask = [int(token_id > 0) for token_id in sentence]

  attention_masks.append(curr_mask)
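
As a small illustration of the mask for one padded row, the sketch below marks real tokens with 1 and padding with 0. The helper make_attention_mask is hypothetical; it compares against the padding ID instead of testing token_id > 0, which gives the same result here because the padding value is 0 and all other token IDs are positive.

```python
def make_attention_mask(padded_ids, pad_id=0):
    """Hypothetical helper: 1 for real tokens, 0 for padding positions."""
    return [int(token_id != pad_id) for token_id in padded_ids]

# Toy padded sequence: [CLS], two word pieces, [SEP], then four padding slots.
padded = [101, 2256, 2814, 102, 0, 0, 0, 0]
mask = make_attention_mask(padded)

print(mask)
print(sum(mask))  # number of real (non-padding) tokens
```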

Make train and validation splits using sklearn.

In [14]:
from sklearn.model_selection import train_test_split

# 90% Train, 10% Validation.
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels,
                                            random_state = 2018, test_size = 0.1)

# Split the attention masks the same way; the same random_state keeps rows aligned with the inputs.
train_masks, validation_masks, _, _ = train_test_split(attention_masks, labels,
                              random_state = 2018, test_size = 0.1)
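
The two calls above stay aligned only because they share the same random_state and the same number of rows. An alternative that makes the alignment explicit is to split the row indices once and reuse them for every array. The sketch below shows the idea with the standard library only; split_indices is a hypothetical helper and is not how scikit-learn shuffles, so it will not reproduce the same split.

```python
import random

def split_indices(n_rows, test_fraction, seed):
    """Hypothetical helper: shuffle row indices once and split them in two."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]  # (train, validation)

# Toy stand-ins for input_ids, attention_masks and labels.
toy_inputs = [[101, 7, 102, 0], [101, 8, 9, 102], [101, 102, 0, 0], [101, 5, 102, 0]]
toy_masks = [[int(t > 0) for t in row] for row in toy_inputs]
toy_labels = [1, 0, 1, 1]

train_idx, val_idx = split_indices(len(toy_inputs), test_fraction=0.25, seed=2018)

# Every array is indexed with the same index lists, so rows cannot drift apart.
toy_train_inputs = [toy_inputs[i] for i in train_idx]
toy_train_masks = [toy_masks[i] for i in train_idx]
toy_train_labels = [toy_labels[i] for i in train_idx]
```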

Now, convert all inputs, labels, and masks to torch tensors.

In [15]:
import torch

train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)

train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)

train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

Lastly, we use torch's DataLoader to make batches of inputs.

In [16]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 32

# Create DataLoader for training set.
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler = train_sampler, batch_size = batch_size)

# Create DataLoader for validation set. Order does not matter for evaluation,
# so read the batches sequentially instead of shuffling them.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler = validation_sampler, batch_size = batch_size)
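
As a rough check on how many batches each DataLoader yields, the arithmetic can be worked through in plain Python. The sketch assumes the 8551 sentences reported earlier and assumes the 10% validation share is rounded up to a whole sentence, with the remainder used for training; the last batch of each loader may be smaller than batch_size.

```python
import math

n_sentences = 8551  # reported when the training file was loaded
test_size = 0.1
batch_size = 32

# Assumption: the validation share is rounded up, the rest is training data.
n_validation = math.ceil(n_sentences * test_size)
n_train = n_sentences - n_validation

# A DataLoader that keeps the final partial batch yields ceil(n / batch_size) batches.
train_batches = math.ceil(n_train / batch_size)
validation_batches = math.ceil(n_validation / batch_size)

print(n_train, n_validation, train_batches, validation_batches)
```

Under these assumptions the split works out to 7695 training and 856 validation sentences, which gives 241 training batches and 27 validation batches per epoch.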

## Train Classification Model

In [17]:
from transformers import BertForSequenceClassification, AdamW, BertConfig

# Load the pretrained BERT model with a single linear classification layer on top.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # 12-layer BERT model with an uncased vocabulary.
    # num_labels = 2, # Number of output labels; 2 for binary classification.
    output_attentions = False, # Whether the model returns attention weights.
    output_hidden_states = False # Whether the model returns all hidden states.
)

# Run the model on the GPU if one is available, otherwise on the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('The model will run on', 'GPU' if device.type == 'cuda' else 'CPU')
model.to(device)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

The model will run on GPU


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element