
# Using the BERT model to find if two questions are similar - Quora Question Pairs Dataset

This notebook gives a step-by-step implementation of fine-tuning the BERT model developed by Google on the well-known Quora Question Pairs (QQP) dataset from Kaggle. Google reports performance on this dataset in the paper; this notebook is instead a guide to implementing the fine-tuning of BERT for a sentence-pair classification problem. The same steps can be followed for similar downstream tasks, provided you have a task-specific labelled dataset.

This notebook uses the Hugging Face PyTorch implementation of BERT, namely PyTorch-Transformers. The code below imports from pytorch_pretrained_bert, the earlier name of the same library.

### Mount Google Drive to load the train data.

In [1]:
from google.colab import drive
drive.mount("/content/MyDrive")

Mounted at /content/MyDrive


### Installing the pytorch-transformers library and importing the necessary requirements.

In [2]:
!pip install pytorch-transformers
!pip install pytorch-pretrained-bert pytorch-nlp

import logging
logging.basicConfig(level=logging.INFO)
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from pytorch_pretrained_bert import BertTokenizer, BertConfig
from pytorch_pretrained_bert import BertAdam, BertForSequenceClassification
from tqdm import tqdm, trange
import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Collecting pytorch-transformers
  Downloading pytorch_transformers-1.2.0-py3-none-any.whl (176 kB)

INFO:pytorch_pretrained_bert.modeling:Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [36]:
#data = pd.read_csv(io.BytesIO(uploaded['questions.csv']))
train_data = pd.read_csv('/content/MyDrive/MyDrive/QQPdata/train.csv')
train_data = train_data.dropna()

valid_data = pd.read_csv('/content/MyDrive/MyDrive/QQPdata/valid.csv')
valid_data = valid_data.dropna()

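### Optionally sampling a subset of data points to speed up training and reduce memory usage.

The original dataset consists of 404,348 samples. The sampling lines in the next cell are commented out, so this run uses every row of train.csv and valid.csv.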

In [37]:
# Optionally sample a random subset of the data (uncomment to enable)
#train_data = train_data.sample(n=3500)
#valid_data = valid_data.sample(n=1000)

# store the labels 
train_labels = train_data.is_duplicate.values
valid_labels = valid_data.is_duplicate.values

If you enable the sampling above, check that the class percentages stay close to those of the full dataset.
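One way to check the class percentages is to compare the share of each label before and after sampling. The sketch below uses only the standard library. `class_fractions` is a hypothetical helper name, and the toy label lists are assumed values standing in for the is_duplicate column, not real data.

```python
from collections import Counter

def class_fractions(labels):
    """Return {label: fraction of examples carrying that label}."""
    counts = Counter(int(x) for x in labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Toy labels standing in for the is_duplicate column (assumed, not real data)
full_labels = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
sampled_labels = [0, 1, 0, 0, 1]

full_fractions = class_fractions(full_labels)
sampled_fractions = class_fractions(sampled_labels)
print(full_fractions)
print(sampled_fractions)
```

With the DataFrames in this notebook, `train_data.is_duplicate.value_counts(normalize=True)` reports the same fractions.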

### Check the meta information about the dataset

In [39]:
train_data.info()

valid_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 124441 entries, 0 to 124441
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   question1     124441 non-null  object
 1   question2     124441 non-null  object
 2   is_duplicate  124441 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 3.8+ MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 67942 entries, 0 to 67942
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   question1     67942 non-null  object
 1   question2     67942 non-null  object
 2   is_duplicate  67942 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 2.1+ MB


### Prepare input data for BERT

BERT needs its input prepared in a specific way. The steps are as follows.

1) Tokenize the sentences using the library's tokenizer.

2) Add the string "[CLS]" to the beginning of the first sentence and "[SEP]" to its end.

3) For sentence pairs, also append "[SEP]" to the end of the second sentence.

4) Obtain the input ids of the tokens from the tokenizer. The model uses these ids to identify each token uniquely.

5) BERT accepts input sequences of up to 512 tokens, and all sequences in a batch must have the same length. So we pick a fixed length (128 here), truncate longer sequences, and pad shorter ones with 0.

6) A segment mask identifies whether a token belongs to the first or the second sentence: 0 for the first sentence and 1 for the second.

7) An attention mask tells the model which positions hold real tokens and which hold the padding introduced in step 5: 1 indicates a token and 0 indicates padding.

This is the format in which the original BERT model was trained by Google, so users are expected to follow it for the best results.
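The steps above can be sketched for a single question pair as follows. This is an illustration of the layout only: `build_pair_inputs` is a hypothetical helper, the questions are assumed to be already split into tokens, and the real pipeline would go on to map tokens to ids with the tokenizer.

```python
def build_pair_inputs(tokens_a, tokens_b, max_len):
    """Assemble tokens, segment ids and attention mask for one sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP];
    # segment 1 covers sentence B and the final [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    # Truncate from the end, then mark every remaining position as a real token
    tokens = tokens[:max_len]
    segment_ids = segment_ids[:max_len]
    attention_mask = [1] * len(tokens)

    # Pad all three lists up to max_len
    pad = max_len - len(tokens)
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, segment_ids, attention_mask

tokens, segment_ids, attention_mask = build_pair_inputs(
    ["how", "old", "are", "you"], ["what", "is", "your", "age"], max_len=12)
print(tokens)
print(segment_ids)
print(attention_mask)
```

Note that truncating from the end can cut off the final "[SEP]" when a pair is longer than max_len; this sketch keeps that behaviour for simplicity.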

In [40]:
# Load pre-trained model tokenizer (vocabulary)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# function to tokenize and generate input ids for the tokens
# returns a list of input ids

def prep_data(ques1, ques2):
  all_input_ids = []
  
  for (q1,q2) in zip(ques1, ques2):
    
    # first sentence is appended with [CLS] and [SEP] in the beginning and end
    q1 = '[CLS] ' + q1 + ' [SEP] '
    tokens = tokenizer.tokenize(q1)
    
    # 0 denotes first sentence
    seg_ids = [0] * len(tokens)
    
    # second sentence is appended with [SEP] in the end
    q2 = q2 + ' [SEP] '
    tok_q2 = tokenizer.tokenize(q2)
    
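    # seg ids is appended with 1 to denote second sentence
    # (note: seg_ids is built here but not returned; the model is later called with token_type_ids=None)
    seg_ids += [1] * len(tok_q2)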
    
    # first and second sentence tokens are appended together
    tokens += tok_q2
    
    # input ids are generated for the tokens (one question pair)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # input ids are stored in a separate list
    all_input_ids.append(input_ids)
    
  return all_input_ids


train_input_ids = prep_data(train_data['question1'].values, train_data['question2'].values)

valid_input_ids = prep_data(valid_data['question1'].values, valid_data['question2'].values)

INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


In [41]:
# MAX_LEN can be any length up to 512 (BERT's maximum); 128, 256, 320, 384 and 512 are common choices
MAX_LEN = 128

# Pad our input tokens
pad_input_ids_train = pad_sequences(train_input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
pad_input_ids_valid = pad_sequences(valid_input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

In [42]:
# Create attention masks
attention_masks_train = []
attention_masks_valid = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in pad_input_ids_train:
  seq_mask = [float(i>0) for i in seq]
  attention_masks_train.append(seq_mask)

for seq in pad_input_ids_valid:
  seq_mask = [float(i>0) for i in seq]
  attention_masks_valid.append(seq_mask)
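The two loops above can also be written as a single NumPy comparison, which is usually faster on arrays of this size. A minimal sketch, assuming padding uses id 0 as in the cells above; the small `padded_ids` array is illustrative and stands in for pad_input_ids_train.

```python
import numpy as np

# Toy padded ids standing in for pad_input_ids_train (values are illustrative)
padded_ids = np.array([[101, 2129, 2214, 102, 0, 0],
                       [101, 2054, 102, 0, 0, 0]])

# Non-zero ids are real tokens (mask 1.0); id 0 is padding (mask 0.0)
attention_masks = (padded_ids > 0).astype(np.float32)
print(attention_masks)
```

The result has the same shape as the padded ids and can be passed to torch.tensor directly.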

### Check if GPU is available

In [43]:
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


### Obtain the BERT model

In [44]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.cuda()

INFO:pytorch_pretrained_bert.modeling:loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
INFO:pytorch_pretrained_bert.modeling:extracting archive file /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba to temp dir /tmp/tmp7237lx20
INFO:pytorch_pretrained_bert.modeling:Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

INFO:pytorch_pretrained_bert.modeling:Weights of BertForSequenceClassifi

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
   

In [45]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

'Tesla K80'

### Run the model

The train and validation sets come from separate files (train.csv and valid.csv), so no further split is needed here. Convert all the inputs to tensors, which is the format PyTorch expects.

In [46]:
# The train and validation sets were loaded from separate files, so no split is needed here

train_inputs = pad_input_ids_train
validation_inputs = pad_input_ids_valid

# train_labels was already defined when the data was loaded
validation_labels = valid_labels

train_masks = attention_masks_train
validation_masks = attention_masks_valid

In [47]:
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

The DataLoader hands the training loop one batch at a time, in the order chosen by the sampler. This keeps GPU memory usage low, because only the current batch is moved to the GPU at each step.

In [48]:
# Select a batch size for training. For fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32
batch_size = 32

# Create an iterator of our data with torch DataLoader. It yields one batch at a time,
# so only the current batch has to be moved to the GPU during training

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)


### Get the same hyperparameters as the model and define the Adam Optimizer

In [49]:
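param_optimizer = list(model.named_parameters())
# Biases and LayerNorm parameters are excluded from weight decay.
# 'gamma'/'beta' cover older parameter names; 'LayerNorm.weight' covers newer ones.
no_decay = ['bias', 'gamma', 'beta', 'LayerNorm.weight']
# BertAdam reads the per-group key 'weight_decay'
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]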

In [50]:
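# This variable contains all of the hyperparameter information our training loop needs.
# t_total is left at its default (-1), which means a constant learning rate;
# pass t_total (the total number of training steps) for the warmup to take effect.
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=2e-5,
                     warmup=.1)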
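The warmup argument is the fraction of training spent ramping the learning rate up from zero. The sketch below shows one common form of such a schedule, a linear ramp followed by a linear decay. It is an illustration only, not the library's code: `linear_warmup_decay` and `total_steps` are hypothetical names, and the exact curve used by BertAdam may differ between library versions.

```python
def linear_warmup_decay(step, total_steps, warmup=0.1):
    """Learning-rate multiplier: linear ramp during warmup, then linear decay to 0."""
    progress = step / total_steps
    if progress < warmup:
        return progress / warmup
    return max((1.0 - progress) / (1.0 - warmup), 0.0)

base_lr = 2e-5
total_steps = 1000  # hypothetical; in practice len(train_dataloader) * epochs

# Print the scheduled learning rate at a few points in training
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * linear_warmup_decay(step, total_steps))
```

The multiplier starts at 0, reaches 1 once 10% of the steps have run, and returns to 0 at the final step.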



### Define a method to compute accuracy

In [51]:
# Function to calculate the accuracy of our predictions vs labels
def accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
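A small worked example may help here. The training loop below averages this per-batch accuracy over the number of batches, which matches the overall accuracy only when every batch has the same size. The sketch uses made-up logits and labels, and `batch_accuracy` is a hypothetical name for the same argmax-and-compare logic.

```python
import numpy as np

def batch_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = np.argmax(logits, axis=1)
    return float(np.mean(preds == labels))

# Made-up logits and labels: a batch of 4 examples and a final batch of 1
logits_a = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 1.6], [0.0, 1.0]])
labels_a = np.array([0, 1, 0, 1])
logits_b = np.array([[0.2, 0.9]])
labels_b = np.array([0])

acc_a = batch_accuracy(logits_a, labels_a)
acc_b = batch_accuracy(logits_b, labels_b)

# Averaging per-batch accuracies gives the small last batch the same weight as a full one
mean_of_batches = (acc_a + acc_b) / 2

# Counting correct predictions over all examples weights every example equally
all_preds = np.argmax(np.concatenate([logits_a, logits_b]), axis=1)
all_labels = np.concatenate([labels_a, labels_b])
overall = float(np.mean(all_preds == all_labels))

print(acc_a, acc_b, mean_of_batches, overall)
```

With a batch size of 32 and tens of thousands of validation rows the difference is tiny, but counting over all examples is the safer habit.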

### Run the neural network model

1) Obtain the batch of data to compute gradient from the data loader that we created above.

2) Clear the gradients left over from the previous step, since PyTorch accumulates them by default.

3) Make a forward pass in the network followed by backward pass (backpropagation).

4) Update network parameters.

5) Track the loss function.

6) Run predictions on the validation set and record accuracy.

7) Run steps 1 to 6 for the number of epochs specified.

In [52]:
train_loss_set = []

# Number of training epochs (authors recommend between 2 and 4)
epochs = 2

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):
  
  
  # Training
  
  # Set our model to training mode (as opposed to evaluation mode)
  model.train()
  
  # Tracking variables
  tr_loss = 0
  nb_tr_examples, nb_tr_steps = 0, 0
  
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):
    
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    
    # Clear out the gradients (by default they accumulate)
    optimizer.zero_grad()
    
    # Forward pass
    loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    train_loss_set.append(loss.item())    
    
    # Backward pass
    loss.backward()
    
    # Update parameters and take a step using the computed gradient
    optimizer.step()
    
    
    # Update tracking variables
    tr_loss += loss.item()
    nb_tr_examples += b_input_ids.size(0)
    nb_tr_steps += 1

  print("Train loss: {}".format(tr_loss/nb_tr_steps))
    
    
  # Validation

  # Put model in evaluation mode to evaluate loss on the validation set
  model.eval()

  # Tracking variables 
  eval_loss, eval_accuracy = 0, 0
  nb_eval_steps, nb_eval_examples = 0, 0

  # Evaluate data for one epoch
  for batch in validation_dataloader:
    
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    
    # Telling the model not to compute or store gradients, saving memory and speeding up validation
    with torch.no_grad():
      
      # Forward pass, calculate logit predictions
      logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    
    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    tmp_eval_accuracy = accuracy(logits, label_ids)
    
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

  print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Train loss: 0.34584430083881523


Epoch:  50%|█████     | 1/2 [1:58:27<1:58:27, 7107.74s/it]

Validation Accuracy: 0.8706989563716259
Train loss: 0.22087137998836334


Epoch: 100%|██████████| 2/2 [3:56:55<00:00, 7107.52s/it]

Validation Accuracy: 0.8766135043942247





### Test on new question pairs to see the performance

Upload your own question-pair test data, apply the same pre-processing steps, and see how the model performs. The questions are chosen to challenge the model. Please refer to the questions I uploaded in the file "test1.csv" in this repository.

In [53]:
test = pd.read_csv('/content/MyDrive/MyDrive/QQPdata/test.csv')
#test=test.sample(n=500)
all_input_ids = prep_data(test['question1'].values, test['question2'].values)

In [54]:
MAX_LEN = 128
# Pad our input tokens
pad_input_ids = pad_sequences(all_input_ids,
                          maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

In [55]:
# Create attention masks
attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in pad_input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)

In [56]:
prediction_inputs = torch.tensor(pad_input_ids)
prediction_masks = torch.tensor(attention_masks)
  
batch_size = 32

prediction_data = TensorDataset(prediction_inputs, prediction_masks)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

In [57]:
# Prediction on test set

# Put model in evaluation mode
model.eval()

# Tracking variables 
predictions = []

input_sim = test["is_duplicate"].values
i=0
test_accuracy=0
# Predict 
for batch in prediction_dataloader:

  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask= batch
  # Telling the model not to compute or store gradients, saving memory and speeding up prediction
  with torch.no_grad():
    # Forward pass, calculate logit predictions
    logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

  # Move logits to CPU
  logits = logits.detach().cpu().numpy()
  
  # Store predictions and count how many match the true labels
  predictions.append(logits)
  pred_flat = np.argmax(logits, axis=1).flatten()
  # Slice the true labels for this batch; the last batch may be smaller than batch_size
  start = i * batch_size
  true_flat = input_sim[start:start + len(pred_flat)]
  test_accuracy += int(np.sum(pred_flat == true_flat))
  print(pred_flat)
  i=i+1
print("Test Accuracy:",test_accuracy/len(test))

[0 1 1 0 1 0 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0]
[1 1 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1]
[0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 1]
[1 0 0 0 0 1 1 1 1 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0]
[0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1]
[0 0 0 0 1 1 1 1 1 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 1 0 0 0 1 1 1 1]
[0 1 0 0 0 1 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 0 0 0 0 1 1 0 0]
[1 0 0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 1 1 0 0 1 1 1 0]
[0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 1 0 0 1 0 1 0 1 0 1 0 1 1 0]
[1 1 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0]
[1 1 0 0 1 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 0 1 0 0 1 1]
[0 0 1 0 0 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 0 1 0 0]
[1 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0]
[1 1 1 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 1]
[0 0 0 0 0

The model performs well: validation accuracy reaches about 87.7% after two epochs. The results are uploaded in the results.csv file. It is interesting to see that the model gets a few complex classifications right. Note that the sampling step is commented out in this run, so these numbers come from fine-tuning on the full train.csv rather than on a small sample.

In [59]:
import os

def save_model(model):
    output_dir = "/content/MyDrive/MyDrive/QQPdata/"

    # Save only the weights (state dict); the model class must be rebuilt before loading them
    torch.save(model.state_dict(), os.path.join(output_dir, 'pytorch_model.bin'))

save_model(model)
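Because only the state dict is saved, reloading means rebuilding the same architecture first and then loading the weights into it. The sketch below shows that round trip with a tiny stand-in model and a temporary directory, both assumptions made for illustration; in this notebook the model would be BertForSequenceClassification and the path would be the one used in save_model.

```python
import os
import tempfile

import torch
from torch import nn

# A tiny stand-in model; in the notebook this would be BertForSequenceClassification
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as output_dir:
    path = os.path.join(output_dir, "pytorch_model.bin")
    torch.save(model.state_dict(), path)

    # Rebuild the same architecture, then load the saved weights into it
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, map_location="cpu"))
    restored.eval()  # switch to evaluation mode before running predictions

# Check that every restored tensor matches the original
same = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same)
```

Passing map_location="cpu" lets weights saved from a GPU session load on a machine without one.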