<a href="https://colab.research.google.com/github/Hemant7499/BERT-GLUE/blob/main/BERT_RTE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch

if torch.cuda.is_available():

    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


In [2]:
!pip install transformers



In [3]:
!pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9656 sha256=062938fef397ef2867e482126733da3e20ae2da399c1c59c776a70afa27ed7d2
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [4]:
import wget
import os

print('Downloading dataset...')

url = 'https://dl.fbaipublicfiles.com/glue/data/RTE.zip'

if not os.path.exists('./RTE.zip'):
    wget.download(url, './RTE.zip')

Downloading dataset...


In [5]:
if not os.path.exists('./RTE/'):
    !unzip RTE.zip

Archive:  RTE.zip
   creating: RTE/
  inflating: RTE/dev.tsv             
  inflating: RTE/test.tsv            
  inflating: RTE/train.tsv           


In [6]:
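import pandas as pd

# NOTE: header=None reads the file's header line as a data row, and since the
# file has one more column than `names`, the first column becomes the index.
# load_data() skips that extra header row later via [1:].
df = pd.read_csv("./RTE/train.tsv", delimiter='\t', header=None, names=['sentence1', 'sentence2', 'label'])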

print('Number of training sentences: {:,}\n'.format(df.shape[0]))

df.sample(10)

Number of training sentences: 2,491



index,sentence1,sentence2,label
2488,Brooklyn Borough Hall featured a Who's Who in ...,The Brooklyn Book Festival is held in Brooklyn...,entailment
678,"The ship, owned by Mitsui O.S.K. Lines and fla...",Mitsui O.S.K is the owner of the Cayman Islands.,not_entailment
1383,US Steel could even have a technical advantage...,US Steel may invest in strip casting.,entailment
1225,"Eleven years after 852 people, mostly Swedes, ...",100 or more people lost their lives in a ferry...,entailment
1817,Philip Morris the US food and tobacco group th...,Philip Morris owns the Marlboro brand.,entailment
1920,The 26-member International Energy Agency said...,The international community agreed to release ...,entailment
462,Between 143 and 152 people have now been hospi...,A railway disaster caused a fire.,not_entailment
1538,"Toshiba, the world's third-largest notebook co...",Toshiba produces notebook computers.,entailment
1652,September 26 - Trial against former Italian Pr...,Andreotti collaborates with the Mafia.,not_entailment
2135,"On February 1, 1990, during a spacewalk, Alexa...",US shuttle Atlantis docks with the Mir space s...,not_entailment
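
The count of 2,491 printed above appears to include the file's header line, because `header=None` reads it as data. A minimal sketch of an alternative, assuming the TSV has the four columns `index`, `sentence1`, `sentence2`, `label`. The inline sample is made up for illustration, and `quoting=csv.QUOTE_NONE` is an added assumption so that quote characters inside sentences are kept literally:

```python
import csv
import io

import pandas as pd

# Made-up two-row sample in the same tab-separated layout as RTE/train.tsv.
sample_tsv = (
    "index\tsentence1\tsentence2\tlabel\n"
    "0\tA dog is running in the park.\tAn animal is moving.\tentailment\n"
    "1\tShe said \"no\" twice.\tShe agreed at once.\tnot_entailment\n"
)

# header=0 consumes the header line and index_col=0 keeps the index column
# out of the data, so only real examples are counted.
sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=0,
    index_col=0,
    quoting=csv.QUOTE_NONE,
)

print(list(sample_df.columns))
print(len(sample_df))
```

With the header parsed this way, the `[1:]` slices and the `drop(df.index[0])` call used elsewhere in this notebook would no longer be needed.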


In [7]:
from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Loading BERT tokenizer...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





In [8]:
label_dict = {"entailment": 0, "not_entailment": 1}

In [9]:
df = df.dropna()

In [10]:
def load_data(df):
    token_ids = []
    mask_ids = []
    seg_ids = []
    y = []
    # Skip the first row: it is the TSV header line, read in as data because of header=None.
    premise_list = df['sentence1'].to_list()[1:]
    hypothesis_list = df['sentence2'].to_list()[1:]
    label_list = df['label'].to_list()[1:]

    for (premise, hypothesis, label) in zip(premise_list, hypothesis_list, label_list):
        encoded = tokenizer.encode_plus(
            text=premise,
            text_pair=hypothesis,
            max_length=512,
            padding='max_length',  # Ensures uniform length
            truncation=True,
            return_tensors='pt',
            return_token_type_ids=True,  # Adds segment IDs
            return_attention_mask=True
        )

        token_ids.append(encoded['input_ids'].squeeze(0))  # Removes extra batch dimension
        mask_ids.append(encoded['attention_mask'].squeeze(0))
        seg_ids.append(encoded['token_type_ids'].squeeze(0))
        y.append(label_dict[str(label)])

    # Convert everything to tensors
    token_ids = torch.stack(token_ids)
    mask_ids = torch.stack(mask_ids)
    seg_ids = torch.stack(seg_ids)
    y = torch.tensor(y)

    dataset = TensorDataset(token_ids, mask_ids, seg_ids, y)
    return dataset
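
For a sentence pair, the tokenizer lays the sequence out as `[CLS] premise [SEP] hypothesis [SEP]` followed by padding. Segment IDs are 0 for the first segment and 1 for the second, and the attention mask is 1 on real tokens and 0 on padding. The sketch below mimics that layout in plain Python. `encode_pair_layout` is a hypothetical helper, the token IDs 7, 8, 9, 5, 6 are made up, and truncation is deliberately left out; 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` IDs in the `bert-base-uncased` vocabulary.

```python
def encode_pair_layout(premise_ids, hypothesis_ids, max_length,
                       cls_id=101, sep_id=102, pad_id=0):
    """Hypothetical helper: mimic the layout of a padded BERT sentence pair."""
    first = [cls_id] + premise_ids + [sep_id]   # segment 0
    second = hypothesis_ids + [sep_id]          # segment 1
    n_real = len(first) + len(second)
    if n_real > max_length:
        raise ValueError("pair does not fit; truncation is not sketched here")
    n_pad = max_length - n_real
    input_ids = first + second + [pad_id] * n_pad
    token_type_ids = [0] * len(first) + [1] * len(second) + [0] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad
    return input_ids, token_type_ids, attention_mask


ids, segs, mask = encode_pair_layout([7, 8, 9], [5, 6], max_length=10)
print(ids)   # [101, 7, 8, 9, 102, 5, 6, 102, 0, 0]
print(segs)  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```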


In [11]:
from torch.utils.data import TensorDataset, random_split

# Combine the training inputs into a TensorDataset.
dataset = load_data(df)

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

2,240 training samples
  249 validation samples


In [12]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# The DataLoader needs to know our batch size for training, so we specify it
# here. For fine-tuning BERT on a specific task, the authors recommend a batch
# size of 16 or 32.
batch_size = 16

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

In [13]:
# The optimizer comes from torch.optim further down, so only the model class is imported here.
from transformers import BertForSequenceClassification

# Load BertForSequenceClassification, the pretrained BERT model with a single
# linear classification layer on top.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = 2,
    output_attentions = False,
    output_hidden_states = False,)


# Move the model to the device selected in the first cell (GPU if available, else CPU).
model.to(device)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [14]:
optimizer = torch.optim.AdamW(model.parameters(),
  lr = 2e-5,
  eps = 1e-8
)

In [15]:
from transformers import get_linear_schedule_with_warmup

epochs = 3

# Total number of training steps is [number of batches] x [number of epochs].
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
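
With zero warmup steps, a linear schedule simply decays the learning rate from its base value to zero over `total_steps`. The sketch below reproduces that arithmetic in plain Python under the settings used in this notebook (2,240 training samples, batch size 16, 3 epochs, base rate 2e-5). `linear_lr` is a hypothetical helper written for illustration, not the library function itself.

```python
import math


def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Hypothetical helper: linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


steps_per_epoch = math.ceil(2240 / 16)  # batches per epoch
total = steps_per_epoch * 3             # batches per epoch x epochs

for step in (0, total // 2, total):
    print(step, linear_lr(step, 2e-5, total))
```

The rate starts at the full 2e-5, is halved at the midpoint, and reaches zero on the final step.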

In [16]:
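def accuracy(y_pred, y_test):
  # argmax over the logits gives the predicted class; log_softmax is monotonic,
  # so applying it first would not change which index is largest.
  acc = (y_pred.argmax(dim=1) == y_test).sum().float() / float(y_test.size(0))
  return acc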
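
When per-batch accuracies are averaged over the number of batches, as the training loop further down does, a smaller final batch carries the same weight as a full one. With 249 validation samples and a batch size of 16, the last batch holds only 9 samples. The sketch below compares the two ways of averaging; the per-batch correct counts are made up for illustration, and both function names are hypothetical.

```python
def batch_mean_accuracy(correct_per_batch, batch_sizes):
    """Mean of per-batch accuracies (every batch weighted equally)."""
    accs = [c / n for c, n in zip(correct_per_batch, batch_sizes)]
    return sum(accs) / len(accs)


def sample_weighted_accuracy(correct_per_batch, batch_sizes):
    """Total correct predictions divided by total samples."""
    return sum(correct_per_batch) / sum(batch_sizes)


# 15 full batches of 16 plus one final batch of 9 (249 samples in total).
batch_sizes = [16] * 15 + [9]
# Made-up counts of correct predictions per batch.
correct = [10] * 15 + [3]

print(round(batch_mean_accuracy(correct, batch_sizes), 4))
print(round(sample_weighted_accuracy(correct, batch_sizes), 4))
```

The two numbers differ slightly, so the reported `val_acc` is an approximation of the true fraction of correct validation predictions.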

In [17]:
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string formatted as h:mm:ss.
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round(elapsed))

    # Format as h:mm:ss (timedelta does not zero-pad the hours).
    return str(datetime.timedelta(seconds=elapsed_rounded))
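
`format_time` is defined here, although the training loop below formats its durations with `divmod` instead. A short usage sketch, repeating the function so the block runs on its own (the input values are arbitrary examples):

```python
import datetime


def format_time(elapsed):
    # Round to the nearest second, then let timedelta do the formatting.
    elapsed_rounded = int(round(elapsed))
    return str(datetime.timedelta(seconds=elapsed_rounded))


print(format_time(203.51))  # 0:03:24
print(format_time(3725.4))  # 1:02:05
```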

In [18]:
def train(model, train_loader, val_loader, optimizer, scheduler):
  # Note: the number of epochs comes from the global `epochs` set above.

  for epoch in range(epochs):
    # Time the whole epoch (training plus validation).
    start = time.time()
    model.train()

    # Reset the total loss and accuracy for this epoch.
    total_train_loss = 0
    total_train_acc  = 0
    for batch_idx, (pair_token_ids, mask_ids, seg_ids, y) in enumerate(train_loader):

      # Unpack this training batch from our dataloader.
      pair_token_ids = pair_token_ids.to(device)
      mask_ids = mask_ids.to(device)
      seg_ids = seg_ids.to(device)
      labels = y.to(device)

      # Clear any previously calculated gradients before the backward pass.
      optimizer.zero_grad()

      # Forward pass. Because labels are passed, the output holds the loss and the logits.
      outputs = model(pair_token_ids,
                      token_type_ids=seg_ids,
                      attention_mask=mask_ids,
                      labels=labels)
      loss, prediction = outputs.loss, outputs.logits

      acc = accuracy(prediction.cpu(), labels.cpu())

      # Accumulate the training loss and accuracy over all of the batches so that we can
      # calculate the average loss at the end
      total_train_loss += loss.item()
      total_train_acc  += acc.item()

      # Perform a backward pass to calculate the gradients.
      loss.backward()

      # Clip the norm of the gradients to 1.0.
      # This is to help prevent the "exploding gradients" problem.
      torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

      # Update parameters and take a step using the computed gradient.
      optimizer.step()

      # Update the learning rate.
      scheduler.step()

    # Calculate the average accuracy and loss over all of the batches.
    train_acc  = total_train_acc/len(train_loader)
    train_loss = total_train_loss/len(train_loader)

    # Put the model in evaluation mode
    model.eval()

    total_val_acc  = 0
    total_val_loss = 0
    with torch.no_grad():
      for batch_idx, (pair_token_ids, mask_ids, seg_ids, y) in enumerate(val_loader):

        # Unpack this validation batch from our dataloader.
        pair_token_ids = pair_token_ids.to(device)
        mask_ids = mask_ids.to(device)
        seg_ids = seg_ids.to(device)
        labels = y.to(device)

        # Forward pass only: no gradients are tracked inside torch.no_grad(),
        # so there is nothing to zero out here.
        outputs = model(pair_token_ids,
                        token_type_ids=seg_ids,
                        attention_mask=mask_ids,
                        labels=labels)
        loss, prediction = outputs.loss, outputs.logits

        # Calculate the accuracy for this batch
        acc = accuracy(prediction.cpu(), labels.cpu())

        # Accumulate the validation loss and Accuracy
        total_val_loss += loss.item()
        total_val_acc  += acc.item()

    # Calculate the average accuracy and loss over all of the batches.
    val_acc  = total_val_acc/len(val_loader)
    val_loss = total_val_loss/len(val_loader)

    end = time.time()
    hours, rem = divmod(end-start, 3600)
    minutes, seconds = divmod(rem, 60)

    print(f'Epoch {epoch+1}: train_loss: {train_loss:.4f} train_acc: {train_acc:.4f} | val_loss: {val_loss:.4f} val_acc: {val_acc:.4f}')
    print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours),int(minutes),seconds))
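
The clipping step inside `train` rescales all gradients together whenever their combined L2 norm exceeds the limit and leaves them untouched otherwise. The plain-Python sketch below illustrates that idea on a flat list of numbers. `clip_by_global_norm` is a hypothetical helper, and the small `eps` term is an assumption modelled on how such clipping is commonly implemented.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Hypothetical helper: scale grads so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.0.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before)           # 5.0
print(round(norm_after, 4))  # 1.0

# Gradients already within the limit pass through unchanged.
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)                 # [0.3, 0.4]
```

The direction of the gradient vector is preserved; only its length is reduced.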

In [19]:
train(model, train_dataloader, validation_dataloader, optimizer,scheduler)

Epoch 1: train_loss: 0.6828 train_acc: 0.5813 | val_loss: 0.6995 val_acc: 0.5716
00:03:23.51
Epoch 2: train_loss: 0.5308 train_acc: 0.7482 | val_loss: 0.7709 val_acc: 0.5959
00:03:25.70
Epoch 3: train_loss: 0.3721 train_acc: 0.8536 | val_loss: 0.8781 val_acc: 0.5881
00:03:25.62
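
In this log the training loss falls every epoch while the validation loss rises, a pattern that is consistent with overfitting. One simple response is to keep the checkpoint from the epoch that did best on validation. The sketch below only does the selection step, using the numbers printed above; saving and restoring model weights is not shown.

```python
# Metrics copied from the training log: (epoch, train_loss, val_loss, val_acc).
history = [
    (1, 0.6828, 0.6995, 0.5716),
    (2, 0.5308, 0.7709, 0.5959),
    (3, 0.3721, 0.8781, 0.5881),
]

# Lowest validation loss and highest validation accuracy can disagree.
best_by_loss = min(history, key=lambda row: row[2])
best_by_acc = max(history, key=lambda row: row[3])

print(best_by_loss[0])  # 1
print(best_by_acc[0])   # 2
```

Which criterion to prefer is a judgment call; with a validation set of 249 samples, differences of a couple of accuracy points may be within noise.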


In [38]:
df = pd.read_csv("./RTE/test.tsv", delimiter='\t', header=None, names=['sentence1', 'sentence2'])

print('Number of test sentences: {:,}\n'.format(df.shape[0]))
# Drop the first row, which is the TSV header line read in as data (header=None).
df = df.drop(df.index[0])
df.sample(10)

Number of test sentences: 2,986



index,sentence1,sentence2
1709,Abu Musa al-Hindi is one of the 13 men picked ...,Bilal was arrested along with other 12 men.
818,During Reinsdorf's 24 seasons as chairman of t...,Reinsdorf was the chairman of the White Sox fo...
1218,Klassen will also be racing in the 2010 winter...,The 2010 Winter Olympics will be held in Vanco...
2228,After an unparalleled succession of tragic ope...,Verdi wrote mainly comedies.
2965,"Mr. P V Narasimha Rao, former prime minister o...",Mr. Rao had cancer.
784,Emergency-control managers began asking reside...,Emergency managers began asking residents alon...
663,Hurricane Katrina slammed this legendary Gulf ...,Hurricane Katrina slammed ashore Monday.
1712,"I love it, it's just such fun, said Harry Pott...",The Potter books netted an estimated £435 mill...
853,In 1929 he founded the Institute of Mathematic...,The University of Milan was founded by Gian An...
2304,Tests on animals showed they could get through...,The thin membrane surrounding the stomach is c...


In [41]:
def predict(premise, hypothesis):
  # NOTE: only input_ids are passed, so token_type_ids fall back to all zeros,
  # unlike in training where the segment IDs were supplied.
  sequence = tokenizer.encode_plus(premise, hypothesis, return_tensors="pt")['input_ids'].to(device)
  logits = model(sequence)[0]
  probabilities = torch.softmax(logits, dim=1).detach().cpu().tolist()[0]
  # label_dict maps entailment -> 0 and not_entailment -> 1.
  proba_entailment = round(probabilities[label_dict["entailment"]], 3)
  proba_not_entailment = round(probabilities[label_dict["not_entailment"]], 3)
  print(f"premise: {premise},   hypothesis: {hypothesis} entailment: {proba_entailment}, not_entailment: {proba_not_entailment}")

In [42]:
for i in range(1, 10):
  predict(df['sentence1'][i], df['sentence2'][i])

premise: Authorities in Brazil say that more than 200 people are being held hostage in a prison in the country's remote, Amazonian-jungle state of Rondonia.,   hypothesis: Authorities in Brazil hold 200 people as hostage. entailment: 0.125, not_entailment: 0.875
premise: A mercenary group faithful to the warmongering policy of former Somozist colonel Enrique Bermudez attacked an IFA truck belonging to the interior ministry at 0900 on 26 March in El Jicote, wounded and killed an interior ministry worker and wounded five others.,   hypothesis: An interior ministry worker was killed by a mercenary group. entailment: 0.112, not_entailment: 0.888
premise: The British ambassador to Egypt, Derek Plumbly, told Reuters on Monday that authorities had compiled the list of 10 based on lists from tour companies and from families whose relatives have not been in contact since the bombings.,   hypothesis: Derek Plumbly resides in Egypt. entailment: 0.133, not_entailment: 0.867
premise: Tibone estimated 
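
The class index of each probability follows `label_dict`, where index 0 is entailment and index 1 is not_entailment. A minimal plain-Python sketch of turning a pair of logits into named probabilities, with made-up logit values and a hand-written `softmax` standing in for `torch.softmax`:

```python
import math

label_dict = {"entailment": 0, "not_entailment": 1}
# Invert the mapping so a class index can be turned back into its name.
id_to_label = {idx: name for name, idx in label_dict.items()}


def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.0]  # made-up model output for one sentence pair
probs = softmax(logits)
named = {id_to_label[i]: round(p, 3) for i, p in enumerate(probs)}
print(named)  # {'entailment': 0.881, 'not_entailment': 0.119}
```

Looking names up through the inverted dictionary avoids hard-coding which index means which label.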

