In this demo, we explore BERT for question answering using the Hugging Face `transformers` library.

In Part 1 of the demo, we will use a BERT model fine-tuned on the **SQuAD** dataset and test it on the **CoQA** dataset.  In Part 2 of the demo, you will learn how to fine-tune BERT for question answering on the **SQuAD** dataset yourselves.

# PART 1

### Initialization & Setup


In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.0-py3-none-any.whl (311 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Co

In [None]:
import pandas as pd
import numpy as np
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer

### Loading the CoQA dataset

In [None]:
coqa = pd.read_json('http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json')
coqa

Unnamed: 0,version,data
0,1,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,1,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,1,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,1,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."
...,...,...
7194,1,"{'source': 'gutenberg', 'id': '34j10vatjfyw0ao..."
7195,1,"{'source': 'cnn', 'id': '3vj40nv2qinjocrcy7k4z..."
7196,1,"{'source': 'race', 'id': '3rjsc4xj10uw0to3vq0v..."
7197,1,"{'source': 'wikipedia', 'id': '3gs6s824sqxty8v..."


### Inspecting the data

In [None]:
coqa["data"][0]

{'source': 'wikipedia',
 'id': '3zotghdk5ibi9cex97fepx7jetpso7',
 'filename': 'Vatican_Library.txt',
 'story': 'The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \n\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \n\nIn March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to 

The **CoQA** dataset contains ~7200 rows, and each row contains one paragraph and multiple question and answer pairs related to that paragraph.

Printing the first row shows 20 questions and answers for the first paragraph, with each answer also given as a start index and an end index within the paragraph.  This span-based format is typical of extractive question answering datasets.

In [None]:
# deleting an unnecessary column
del coqa["version"]
coqa

Unnamed: 0,data
0,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."
...,...
7194,"{'source': 'gutenberg', 'id': '34j10vatjfyw0ao..."
7195,"{'source': 'cnn', 'id': '3vj40nv2qinjocrcy7k4z..."
7196,"{'source': 'race', 'id': '3rjsc4xj10uw0to3vq0v..."
7197,"{'source': 'wikipedia', 'id': '3gs6s824sqxty8v..."


### Converting the CoQA dataset to a more convenient format
We convert the CoQA dataset to a more convenient format by creating one question and answer pair per row.  As a result, the paragraph is repeated in the "text" column once for each of its question and answer pairs.

In [None]:
cols = ["text","question","answer"]
comp_list = []
for index, row in coqa.iterrows():
    for i in range(len(row["data"]["questions"])):
        temp_list = []
        temp_list.append(row["data"]["story"])
        temp_list.append(row["data"]["questions"][i]["input_text"])
        temp_list.append(row["data"]["answers"][i]["input_text"])
        comp_list.append(temp_list)
new_df = pd.DataFrame(comp_list, columns = cols)
new_df

Unnamed: 0,text,question,answer
0,"The Vatican Apostolic Library (), more commonl...",When was the Vat formally opened?,It was formally established in 1475
1,"The Vatican Apostolic Library (), more commonl...",what is the library for?,research
2,"The Vatican Apostolic Library (), more commonl...",for what subjects?,"history, and law"
3,"The Vatican Apostolic Library (), more commonl...",and?,"philosophy, science and theology"
4,"The Vatican Apostolic Library (), more commonl...",what was started in 2014?,a project
...,...,...,...
108642,(CNN) -- Cristiano Ronaldo provided the perfec...,Who was a sub?,Xabi Alonso
108643,(CNN) -- Cristiano Ronaldo provided the perfec...,Was it his first game this year?,Yes
108644,(CNN) -- Cristiano Ronaldo provided the perfec...,What position did the team reach?,third
108645,(CNN) -- Cristiano Ronaldo provided the perfec...,Who was ahead of them?,Barca.
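
As a small illustration of the flattening step, the sketch below applies the same idea to a hand-made record. The `toy_record` dictionary and the `flatten_record` helper are invented for this example; the record only mimics the `story` / `questions` / `answers` layout shown earlier and is not taken from CoQA.

```python
# A made-up record that mimics the CoQA layout (illustrative data only).
toy_record = {
    "story": "Ann has a red kite. She flies it in the park.",
    "questions": [
        {"input_text": "What does Ann have?", "turn_id": 1},
        {"input_text": "Where does she fly it?", "turn_id": 2},
    ],
    "answers": [
        {"input_text": "a red kite", "turn_id": 1},
        {"input_text": "in the park", "turn_id": 2},
    ],
}

def flatten_record(record):
    """Return one (text, question, answer) tuple per conversation turn."""
    return [
        (record["story"], q["input_text"], a["input_text"])
        for q, a in zip(record["questions"], record["answers"])
    ]

rows = flatten_record(toy_record)
for text, question, answer in rows:
    # the same story text appears in every row that belongs to it
    print(f"{question!r} -> {answer!r} (story of {len(text)} characters)")
```

Every row carries the full story, which is why the "text" column of the flattened table repeats the same paragraph many times.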


### Loading BERT fine-tuned on SQuAD
We load BERT for question answering, already fine-tuned on SQuAD, together with the corresponding BERT tokenizer (each pre-trained BERT model has a corresponding tokenizer).


In [None]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Downloading (…)lve/main/config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

### Experimenting with BERT


In [None]:
# picking out a random question and answer pair from the dataset
random_num = np.random.randint(0,len(new_df))
question = new_df["question"][random_num]
text = new_df["text"][random_num]

In [None]:
# tokenizing the question and text pair
input_ids = tokenizer.encode(question, text)
print("The input has a total of {} tokens.".format(len(input_ids)))

The input has a total of 328 tokens.


We inspect the resulting tokens and observe that each token is mapped to a unique ID and that some rare words are split into multiple tokens. The `[CLS]` token (ID 101) always comes first and marks the start of the input, and the `[SEP]` separator token (ID 102) comes between the question and the text and also at the end.

In [None]:
# inspecting the resulting tokens
tokens = tokenizer.convert_ids_to_tokens(input_ids)
for token, token_id in zip(tokens, input_ids):
    print('{:8}{:8,}'.format(token, token_id))

[CLS]        101
what       2,054
did        2,106
pam       14,089
want       2,215
to         2,000
make       2,191
?          1,029
[SEP]        102
on         2,006
the        1,996
third      2,353
day        2,154
of         1,997
november   2,281
,          1,010
ron        6,902
and        1,998
pam       14,089
went       2,253
to         2,000
the        1,996
store      3,573
.          1,012
they       2,027
wanted     2,359
to         2,000
get        2,131
some       2,070
food       2,833
for        2,005
a          1,037
new        2,047
recipe    17,974
.          1,012
it         2,009
was        2,001
late       2,397
in         1,999
the        1,996
afternoon   5,027
,          1,010
but        2,021
they       2,027
wanted     2,359
to         2,000
eat        4,521
the        1,996
food       2,833
soon       2,574
at         2,012
dinner     4,596
.          1,012
to         2,000
save       3,828
time       2,051
they       2,027
split      3,975
the        1,

In [None]:
# Counting the number of tokens in the question and in the text
sep_idx = input_ids.index(tokenizer.sep_token_id)
print("SEP token index: ", sep_idx)
num_seg_a = sep_idx + 1
print("Number of tokens in segment A (question): ", num_seg_a)
num_seg_b = len(input_ids) - num_seg_a
print("Number of tokens in segment B (answer): ", num_seg_b)

SEP token index:  8
Number of tokens in segment A (question):  9
Number of tokens in segment B (answer):  319


In [None]:
#creating the segment ids and making sure every input token has a segment id
segment_ids = [0]*num_seg_a + [1]*num_seg_b
assert len(segment_ids) == len(input_ids)
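
The segment-id construction can be tried in isolation on a shortened input. The sketch below is a minimal stand-alone version; `build_segment_ids` is a hypothetical helper, the separator ID 102 is taken from the token listing printed earlier, and `toy_ids` is a hand-shortened sequence rather than real tokenizer output.

```python
def build_segment_ids(input_ids, sep_token_id=102):
    """Assign 0 to every token up to and including the first [SEP], 1 to the rest."""
    first_sep = input_ids.index(sep_token_id)
    num_seg_a = first_sep + 1                  # question tokens, including [CLS] and [SEP]
    num_seg_b = len(input_ids) - num_seg_a     # text tokens, including the final [SEP]
    return [0] * num_seg_a + [1] * num_seg_b

# [CLS] what did pam want to make ? [SEP] on the third day [SEP]
toy_ids = [101, 2054, 2106, 14089, 2215, 2000, 2191, 1029, 102,
           2006, 1996, 2353, 2154, 102]
toy_segments = build_segment_ids(toy_ids)
print(toy_segments)
```

The first nine positions (the question plus its `[SEP]`) receive segment 0 and the remaining five receive segment 1.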

Now the tokens and the segment ids will be passed to the model

In [None]:
# token input_ids to represent the input and token segment_ids to differentiate
# our segments - question and text
output = model(torch.tensor([input_ids]),  token_type_ids = torch.tensor([segment_ids]))

Getting the start and end tokens from the output

In [None]:
#tokens with highest start and end scores
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits)
if answer_end >= answer_start:
    answer = " ".join(tokens[answer_start:answer_end+1])
else:
    print("I am unable to find the answer to this question. Can you please ask another question?")
    answer = ""  # keep `answer` defined so the prints below do not fail with a NameError

print("Text:\n{}".format(new_df["text"][random_num]))
print("\nQuestion:\n{}".format(question.capitalize()))
print("\nAnswer:\n{}.".format(answer.capitalize()))

Text:
On the third day of November, Ron and Pam went to the store. They wanted to get some food for a new recipe. It was late in the afternoon, but they wanted to eat the food soon at dinner. To save time they split the list in half. Ron was to get the pasta and tomato sauce, and Pam was to get the vegetables and juice. They went their separate ways in the store, and made plans to meet in the checkout line in half an hour. 

On her way to the fruit and vegetable section, Pam ran into her friend Tom. Tom had bought a pet bunny for his friend and wanted to buy it some food. He asked Pam what he needs to feed the bunny. Pam told him lettuce and carrots, so he put 5 heads of lettuce in his basket along with one bag of carrots. Tom said goodbye to Pam and went to the front of the store to buy his vegetables. Now it was time for Pam to pick out the vegetables she would buy for dinner. She wanted to make a salad, so she bought spinach, 2 big red tomatoes, a box of mushrooms, and 3 cucumbers. 
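
Taking the two argmaxes independently can produce an end position that lies before the start position. One common alternative, which is not what this notebook does, is to score every pair with start <= end and keep the best one. The sketch below shows the idea on plain Python lists standing in for the logits; `best_span`, `max_answer_len` and the score values are all invented for the illustration.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score,
    restricted to start <= end < start + max_answer_len."""
    best, best_score = None, float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Made-up scores: the independent argmaxes are start=3 and end=1, an invalid span.
toy_start = [0.1, 0.2, 1.5, 2.0, 0.3]
toy_end = [0.0, 2.5, 0.4, 1.0, 2.2]
span = best_span(toy_start, toy_end)
print(span)
```

The double loop is quadratic in the sequence length, which is acceptable for a single example but would likely need vectorising for batch evaluation.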

The answer needs cleaning up when a word has been split into multiple tokens. A leading double hash (`##`) marks a token that continues the previous token of the same word, so such tokens are joined to it without a space.

In [None]:
# cleaning up the answer
answer = tokens[answer_start]
for i in range(answer_start+1, answer_end+1):
    if tokens[i][0:2] == "##":
        answer += tokens[i][2:]
    else:
        answer += " " + tokens[i]
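
The same merging logic can be checked without running the model. In the sketch below, `merge_wordpieces` is a hypothetical helper and `toy_tokens` is a hand-written list in WordPiece style, not actual tokenizer output.

```python
def merge_wordpieces(tokens):
    """Join tokens into a string, gluing '##' continuations onto the previous piece."""
    if not tokens:
        return ""
    text = tokens[0]
    for token in tokens[1:]:
        if token.startswith("##"):
            text += token[2:]        # continuation: append without a space
        else:
            text += " " + token      # new word: separate with a space
    return text

toy_tokens = ["the", "play", "##ground", "was", "un", "##believ", "##able"]
merged = merge_wordpieces(toy_tokens)
print(merged)
```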

In [None]:
print("Answer:\n{}.".format(answer.capitalize()))

Answer:
A salad.


In [None]:
# retrieve and print the answer to this question that we had in the training set
answer = new_df["answer"][random_num]
answer

'a salad'

# PART 2

## Initialization & Setup

In [None]:
# importing required libraries
import requests
import json
import torch
import os
from tqdm import tqdm
from transformers import BertTokenizerFast
from torch.utils.data import DataLoader
from transformers import BertForQuestionAnswering
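# transformers.AdamW is deprecated in favour of the PyTorch implementation;
# note that torch's AdamW defaults to weight_decay=0.01 rather than 0.0
from torch.optim import AdamW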

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# creating a directory in Google drive (including any missing parent folders)
os.makedirs('/content/drive/MyDrive/CMPE259/Assignment 6/BERT-SQuAD', exist_ok=True)

## Loading the SQuAD dataset

In [None]:
# getting the SQuAD dataset
!wget -nc https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget -nc https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2023-11-13 02:48:00--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2023-11-13 02:48:03 (275 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]

--2023-11-13 02:48:03--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2023-11-13 02:48:04 (154 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



In [None]:
# Loading the training dataset so that we can inspect it
with open('train-v2.0.json', 'rb') as f:
  squad = json.load(f)

In [None]:
# Each 'data' dict has two keys (title and paragraphs)
squad['data'][0].keys()

dict_keys(['title', 'paragraphs'])

In [None]:
squad['data'][0]

{'title': 'Beyoncé',
 'paragraphs': [{'qas': [{'question': 'When did Beyonce start becoming popular?',
     'id': '56be85543aeaaa14008c9063',
     'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
     'is_impossible': False},
    {'question': 'What areas did Beyonce compete in when she was growing up?',
     'id': '56be85543aeaaa14008c9065',
     'answers': [{'text': 'singing and dancing', 'answer_start': 207}],
     'is_impossible': False},
    {'question': "When did Beyonce leave Destiny's Child and become a solo singer?",
     'id': '56be85543aeaaa14008c9066',
     'answers': [{'text': '2003', 'answer_start': 526}],
     'is_impossible': False},
    {'question': 'In what city and state did Beyonce  grow up? ',
     'id': '56bf6b0f3aeaaa14008c9601',
     'answers': [{'text': 'Houston, Texas', 'answer_start': 166}],
     'is_impossible': False},
    {'question': 'In which decade did Beyonce become famous?',
     'id': '56bf6b0f3aeaaa14008c9602',
     'answers': [{'text

Here we see that each topic has multiple paragraphs, and each paragraph has multiple question and answer pairs.

In [None]:
# checking the number of topics
len(squad['data'])

442

In [None]:
# loading the data in triplets of context, questions and answers
def read_data(path):

  with open(path, 'rb') as f:
    squad = json.load(f)

  contexts = []
  questions = []
  answers = []

  for group in squad['data']:
    for passage in group['paragraphs']:
      context = passage['context']
      for qa in passage['qas']:
        question = qa['question']
        for answer in qa['answers']:
          contexts.append(context)
          questions.append(question)
          answers.append(answer)

  return contexts, questions, answers

In [None]:
train_contexts, train_questions, train_answers = read_data('train-v2.0.json')
valid_contexts, valid_questions, valid_answers = read_data('dev-v2.0.json')

In [None]:
print(f'There are {len(train_questions)} training set questions')
print(f'There are {len(valid_questions)} dev set questions')

There are 86821 training set questions
There are 20302 dev set questions
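
Note that `read_data` appends one triplet per entry in a question's `answers` list, so the counts above are answer entries rather than distinct questions. The sketch below repeats the traversal on a tiny in-memory dictionary to make this visible; `collect_triplets` and `toy_squad` are made up for the illustration and only imitate the SQuAD layout.

```python
def collect_triplets(squad_dict):
    """Same traversal as read_data, but on an in-memory dict instead of a file path."""
    contexts, questions, answers = [], [], []
    for group in squad_dict["data"]:
        for passage in group["paragraphs"]:
            for qa in passage["qas"]:
                for answer in qa["answers"]:
                    contexts.append(passage["context"])
                    questions.append(qa["question"])
                    answers.append(answer)
    return contexts, questions, answers

# Illustrative data: one question with two reference answers, one with none.
toy_squad = {"data": [{"title": "Toy", "paragraphs": [{
    "context": "The sky is blue.",
    "qas": [
        {"question": "What colour is the sky?",
         "answers": [{"text": "blue", "answer_start": 11},
                     {"text": "blue", "answer_start": 11}],
         "is_impossible": False},
        {"question": "What colour is the grass?",
         "answers": [],
         "is_impossible": True},
    ]}]}]}

toy_contexts, toy_questions, toy_answers = collect_triplets(toy_squad)
print(len(toy_questions), toy_questions)
```

The question with two reference answers is counted twice, and the question with an empty `answers` list contributes nothing.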


## Dataset pre-processing

In [None]:
# fixing some data quality issues
def add_end_idx(answers, contexts):
  for answer, context in zip(answers, contexts):
    gold_text = answer['text']
    start_idx = answer['answer_start']
    end_idx = start_idx + len(gold_text)

    # sometimes squad answers are off by a character or two so we fix this
    if context[start_idx:end_idx] == gold_text:
      answer['answer_end'] = end_idx
    elif context[start_idx-1:end_idx-1] == gold_text:
      answer['answer_start'] = start_idx - 1
      answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
    elif context[start_idx-2:end_idx-2] == gold_text:
      answer['answer_start'] = start_idx - 2
      answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters

add_end_idx(train_answers[:1000], train_contexts[:1000])
add_end_idx(valid_answers[:100], valid_contexts[:100])
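
To see the off-by-one correction at work, the sketch below applies the same check to a single hand-made answer whose `answer_start` is deliberately one character too large. `add_end_idx_single` is a hypothetical single-answer variant of the function above, and the context and answer are invented.

```python
def add_end_idx_single(answer, context):
    """Apply the same shift-by-0/1/2 check as add_end_idx to one answer dict."""
    gold_text = answer["text"]
    start_idx = answer["answer_start"]
    end_idx = start_idx + len(gold_text)
    for shift in (0, 1, 2):
        if context[start_idx - shift:end_idx - shift] == gold_text:
            answer["answer_start"] = start_idx - shift
            answer["answer_end"] = end_idx - shift
            break
    # if no shift matches, 'answer_end' is left unset, as in the original function
    return answer

toy_context = "BERT was released in 2018 by Google."
toy_answer = {"text": "2018", "answer_start": 22}   # the true start is 21
fixed = add_end_idx_single(toy_answer, toy_context)
print(fixed)
```

When none of the three shifts matches, no `answer_end` key is added, so a later lookup of `answer['answer_end']` would raise a `KeyError` for that answer.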

## Fine-tuning BERT on SQuAD

In [None]:
# getting the tokenizer (the model itself is loaded further below); only 1000 rows are encoded because training is very time consuming

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

train_encodings = tokenizer(train_contexts[:1000], train_questions[:1000], truncation=True, padding=True)
valid_encodings = tokenizer(valid_contexts[:100], valid_questions[:100], truncation=True, padding=True)

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [None]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

Inspecting the tokenizer output: `input_ids` are the token indices, padded with 0s; `token_type_ids` use a different integer for each input sequence (here 0 for the context and 1 for the question, with padding positions also set to 0); and `attention_mask` marks which positions the model should attend to (1) and which are padding (0).
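
The relationship between padding and the attention mask can be reproduced by hand. The sketch below is a simplified stand-in for what the tokenizer does with `padding=True`, not its actual implementation; `pad_batch` is a hypothetical helper and `toy_batch` contains shortened, hand-picked ID sequences.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)            # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return input_ids, attention_mask

toy_batch = [[101, 2054, 102], [101, 2054, 2106, 14089, 102]]
padded_ids, masks = pad_batch(toy_batch)
print(padded_ids)
print(masks)
```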

In [None]:
train_encodings["input_ids"][0]

[101,
 20773,
 21025,
 19358,
 22815,
 1011,
 5708,
 1006,
 1013,
 12170,
 23432,
 29715,
 3501,
 29678,
 12325,
 29685,
 1013,
 10506,
 1011,
 10930,
 2078,
 1011,
 2360,
 1007,
 1006,
 2141,
 2244,
 1018,
 1010,
 3261,
 1007,
 2003,
 2019,
 2137,
 3220,
 1010,
 6009,
 1010,
 2501,
 3135,
 1998,
 3883,
 1012,
 2141,
 1998,
 2992,
 1999,
 5395,
 1010,
 3146,
 1010,
 2016,
 2864,
 1999,
 2536,
 4823,
 1998,
 5613,
 6479,
 2004,
 1037,
 2775,
 1010,
 1998,
 3123,
 2000,
 4476,
 1999,
 1996,
 2397,
 4134,
 2004,
 2599,
 3220,
 1997,
 1054,
 1004,
 1038,
 2611,
 1011,
 2177,
 10461,
 1005,
 1055,
 2775,
 1012,
 3266,
 2011,
 2014,
 2269,
 1010,
 25436,
 22815,
 1010,
 1996,
 2177,
 2150,
 2028,
 1997,
 1996,
 2088,
 1005,
 1055,
 2190,
 1011,
 4855,
 2611,
 2967,
 1997,
 2035,
 2051,
 1012,
 2037,
 14221,
 2387,
 1996,
 2713,
 1997,
 20773,
 1005,
 1055,
 2834,
 2201,
 1010,
 20754,
 1999,
 2293,
 1006,
 2494,
 1007,
 1010,
 2029,
 2511,
 2014,
 2004,
 1037,
 3948,
 3063,
 4969,
 1010,
 36

In [None]:
train_encodings["token_type_ids"][0]

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,


In [None]:
train_encodings["attention_mask"][0]

[1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,


In [None]:
# printing the number of training data samples
no_of_encodings = len(train_encodings['input_ids'])
print(f'We have {no_of_encodings} context-question pairs')

We have 1000 context-question pairs


In [None]:
# adding the answers in the training set for fine tuning
def add_token_positions(encodings, answers):
  start_positions = []
  end_positions = []
  for i in range(len(answers)):
    start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
    end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

    # if start position is None, the answer passage has been truncated
    if start_positions[-1] is None:
      start_positions[-1] = tokenizer.model_max_length
    if end_positions[-1] is None:
      end_positions[-1] = tokenizer.model_max_length

  encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers[:1000])
add_token_positions(valid_encodings, valid_answers[:100])
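
The character-to-token mapping is easier to follow on a toy tokenizer. The sketch below uses whitespace splitting as a simplified stand-in for BERT's WordPiece tokenizer; `whitespace_offsets` and `char_to_token_index` are hypothetical helpers that only imitate the idea behind `char_to_token`, and the context is invented.

```python
def whitespace_offsets(text):
    """Return (start, end) character offsets of each whitespace-separated token."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets

def char_to_token_index(offsets, char_idx):
    """Return the index of the token covering char_idx, or None if no token covers it."""
    for i, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return i
    return None

toy_context = "BERT was released in 2018 by Google."
toy_offsets = whitespace_offsets(toy_context)
toy_answer_start, toy_answer_end = 21, 25     # character span of "2018", end exclusive
start_token = char_to_token_index(toy_offsets, toy_answer_start)
end_token = char_to_token_index(toy_offsets, toy_answer_end - 1)
print(start_token, end_token)
```

Because the end index is exclusive, it points at the character after the answer (here a space that belongs to no token), which is why the lookup uses `answer_end - 1`.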

In [None]:
# creating the dataset in the format it is required for fine tuning BERT
class SQuAD_Dataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings
  def __getitem__(self, idx):
    return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
  def __len__(self):
    return len(self.encodings.input_ids)

In [None]:
train_dataset = SQuAD_Dataset(train_encodings)
valid_dataset = SQuAD_Dataset(valid_encodings)

In [None]:
# Define the dataloaders
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=8)

In [None]:
# loading the BERT model which we will fine tune
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# checking the device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f'Working on {device}')

Working on cuda


In [None]:
# Fine tuning it per batch
N_EPOCHS = 5
optim = AdamW(model.parameters(), lr=5e-5)

model.to(device)
model.train()

for epoch in range(N_EPOCHS):
  loop = tqdm(train_loader, leave=True)
  for batch in loop:
    optim.zero_grad()
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_positions = batch['start_positions'].to(device)
    end_positions = batch['end_positions'].to(device)
    outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
    loss = outputs[0]
    loss.backward()
    optim.step()

    loop.set_description(f'Epoch {epoch+1}')
    loop.set_postfix(loss=loss.item())

Epoch 1: 100%|██████████| 125/125 [01:41<00:00,  1.23it/s, loss=2.96]
Epoch 2: 100%|██████████| 125/125 [01:41<00:00,  1.23it/s, loss=1.89]
Epoch 3: 100%|██████████| 125/125 [01:41<00:00,  1.23it/s, loss=0.482]
Epoch 4: 100%|██████████| 125/125 [01:41<00:00,  1.23it/s, loss=1.63]
Epoch 5: 100%|██████████| 125/125 [01:41<00:00,  1.23it/s, loss=0.548]


In [None]:
# checking the performance
model.eval()

acc = []

for batch in tqdm(valid_loader):
  with torch.no_grad():
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_true = batch['start_positions'].to(device)
    end_true = batch['end_positions'].to(device)

    outputs = model(input_ids, attention_mask=attention_mask)

    start_pred = torch.argmax(outputs['start_logits'], dim=1)
    end_pred = torch.argmax(outputs['end_logits'], dim=1)

    acc.append(((start_pred == start_true).sum()/len(start_pred)).item())
    acc.append(((end_pred == end_true).sum()/len(end_pred)).item())

acc = sum(acc)/len(acc)

100%|██████████| 13/13 [00:03<00:00,  4.32it/s]


In [None]:
acc

0.5721153846153846
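
The evaluation loop averages per-batch accuracies for the start and end positions. With 100 validation examples and a batch size of 8, the last of the 13 batches holds only 4 examples, so its examples weigh more than the others in this average. The sketch below contrasts the two ways of averaging on made-up predictions; both helper functions and the data are invented for the illustration.

```python
def batch_mean_accuracy(batches):
    """Average the per-batch accuracies, as the evaluation loop does."""
    per_batch = [
        sum(p == t for p, t in zip(pred, true)) / len(pred)
        for pred, true in batches
    ]
    return sum(per_batch) / len(per_batch)

def example_mean_accuracy(batches):
    """Average over individual predictions, so every example has equal weight."""
    correct = sum(p == t for pred, true in batches for p, t in zip(pred, true))
    total = sum(len(pred) for pred, _ in batches)
    return correct / total

# Made-up (predicted, true) token positions: one batch of 4 and one batch of 2.
toy_batches = [([5, 9, 3, 7], [5, 9, 3, 0]), ([1, 2], [0, 0])]
print(batch_mean_accuracy(toy_batches))
print(example_mean_accuracy(toy_batches))
```

The two numbers differ whenever batches have unequal sizes, so it helps to state which one is being reported.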

# Homework assignment

## Exercise 1: Fine-tune BERT for question answering on the CoQA dataset using the same process as shown in Part 2 for the SQuAD dataset.

What does the SQuAD dataset look like?

In [None]:
def print_general_structure(data, indent=0):
    for key, value in data.items():
        print('  ' * indent + str(key))
        if isinstance(value, dict):
            print_general_structure(value, indent + 1)

# Print the structure of the loaded data
print_general_structure(squad)

version
data


What does each data point look like?

In [None]:
squad['data'][0].keys()

dict_keys(['title', 'paragraphs'])

In [None]:
# Function to print the structure recursively
def print_structure(data, indent=0):
    for key, value in data.items():
        print('  ' * indent + f"{key}: {type(value).__name__}")
        if isinstance(value, dict):
            print_structure(value, indent + 1)
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            print_structure(value[0], indent + 1)


# Print the structure of the loaded data
print_structure(squad['data'][1])

title: str
paragraphs: list
  qas: list
    question: str
    id: str
    answers: list
      text: str
      answer_start: int
    is_impossible: bool
  context: str


In [None]:
len(squad['data'])

442

What does the data look like in CoQA?

In [None]:
print_structure(coqa['data'][1])

source: str
id: str
filename: str
story: str
questions: list
  input_text: str
  turn_id: int
answers: list
  span_start: int
  span_end: int
  span_text: str
  input_text: str
  turn_id: int
name: str


In [None]:
len(coqa)

7199

Split the CoQA dataset into train and validation sets

In [None]:
from sklearn.model_selection import train_test_split


# Split the data into training and validation sets (80% train, 20% valid)
train_data, valid_data = train_test_split(coqa, test_size=0.2, random_state=42)

# Print the number of samples in the training and validation sets
print(f"Number of samples in the training set: {len(train_data)}")
print(f"Number of samples in the validation set: {len(valid_data)}")


Number of samples in the training set: 5759
Number of samples in the validation set: 1440


Using span_text of the answer as the answer

In [None]:
def convertCoQAToDF(data):
  cols = ["context", "question", "answer", "answer_start", "answer_end"]
  comp_list = []
  for index, row in data.iterrows():
    story = row["data"]["story"]
    for question, answer in zip(row["data"]["questions"], row["data"]["answers"]):
      # keep only the turns whose answer span indices are non-negative
      if answer["span_start"] >= 0 and answer["span_end"] >= 0:
        comp_list.append([story, question["input_text"], answer["span_text"],
                          answer["span_start"], answer["span_end"]])
  return pd.DataFrame(comp_list, columns=cols)
train_df = convertCoQAToDF(train_data)
valid_df = convertCoQAToDF(valid_data)

In [None]:
train_df.head()

Unnamed: 0,context,question,answer,answer_start,answer_end
0,"TUNIS, Tunisia (CNN) -- Polls closed late Sund...",Where is this taking place?,"Polls closed late Sunday in Tunisia, the torch...",24,192
1,"TUNIS, Tunisia (CNN) -- Polls closed late Sund...",What is being voted on?,"""It's a wonderful day. It's the first time we ...",435,538
2,"TUNIS, Tunisia (CNN) -- Polls closed late Sund...",What day of the week did they vote?,"Polls closed late Sunday in Tunisia, t",23,62
3,"TUNIS, Tunisia (CNN) -- Polls closed late Sund...",When was the last one held?,some waiting for hours to cast a vote in the n...,312,433
4,"TUNIS, Tunisia (CNN) -- Polls closed late Sund...",What else happened then?,in the nation's first national elections since...,350,432


In [None]:
valid_df.head()

Unnamed: 0,context,question,answer,answer_start,answer_end
0,"(CNN) -- Andy Carroll scored twice, his first ...",Who was playing in the game?,"Liverpool, to help his club",55,84
1,"(CNN) -- Andy Carroll scored twice, his first ...",who was playing against them?,Manchester City 3-0 i,103,124
2,"(CNN) -- Andy Carroll scored twice, his first ...",What was the score?,defeat Manchester City 3-0 i,96,124
3,"(CNN) -- Andy Carroll scored twice, his first ...",When was the game?,Monday,125,132
4,"(CNN) -- Andy Carroll scored twice, his first ...",What league are they in?,Premier League,135,149


In [None]:
len(train_df)

85807

In [None]:
len(valid_df)

21479

### Convert CoQA to the SQuAD list format

In [None]:
train_contexts = train_df['context'].tolist()
train_questions = train_df['question'].tolist()
train_answers = [{'text': answer, 'answer_start': start, 'answer_end': end} for answer, start, end in zip(train_df['answer'], train_df['answer_start'], train_df['answer_end'])]
valid_contexts = valid_df['context'].tolist()
valid_questions = valid_df['question'].tolist()
valid_answers = [{'text': answer, 'answer_start': start, 'answer_end': end} for answer, start, end in zip(valid_df['answer'], valid_df['answer_start'] , valid_df['answer_end'])]

In [None]:
print(f'There are {len(train_questions)} training set questions')
print(f'There are {len(valid_questions)} dev set questions')

There are 85807 training set questions
There are 21479 dev set questions
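
Before training on the converted spans, it may be worth checking that each character span actually slices out the stored answer text. In the `valid_df.head()` output above, for example, the answer "Monday" is listed with a span of 7 characters (125 to 132), which may hint at surrounding whitespace in some spans. The sketch below shows such a check on invented data; `span_matches` is a hypothetical helper.

```python
def span_matches(context, answer):
    """Check that the span stored in a SQuAD-style answer dict slices out its text."""
    return context[answer["answer_start"]:answer["answer_end"]] == answer["text"]

toy_context = "Monday's match ended 3-0."
toy_answers = [
    {"text": "3-0", "answer_start": 21, "answer_end": 24},   # consistent span
    {"text": "3-0", "answer_start": 20, "answer_end": 24},   # span includes a leading space
]
checks = [span_matches(toy_context, a) for a in toy_answers]
print(checks)
```

Running a check like this over `zip(train_contexts, train_answers)` would show how many spans need trimming, under the assumption that `answer_end` is exclusive.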


### Fine-tuning BERT on the CoQA dataset

In [None]:
# getting the tokenizer (only a subset of the rows is encoded below, as training is very time consuming)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

In [None]:
train_encodings = tokenizer(train_contexts[:5000], train_questions[:5000], truncation=True, padding=True)
valid_encodings = tokenizer(valid_contexts[:100], valid_questions[:100], truncation=True, padding=True)

In [None]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [None]:
# printing the number of training data samples
no_of_encodings = len(train_encodings['input_ids'])
print(f'We have {no_of_encodings} context-question pairs')

We have 5000 context-question pairs


In [None]:
train_answers[1]

{'text': '"It\'s a wonderful day. It\'s the first time we can choose our own representatives," said Walid Marrakchi',
 'answer_start': 435,
 'answer_end': 538}
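The offsets above are character positions, whereas `BertForQuestionAnswering` is trained on token indices, so each offset has to be mapped to the token that covers it. The sketch below illustrates the idea with a whitespace tokenizer and a hypothetical `char_to_token_index` helper. It is a simplified stand-in for the tokenizer's `char_to_token` method, not its actual implementation, and BERT's WordPiece tokens will generally differ.

```python
import re

def token_offsets(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r'\S+', text)]

def char_to_token_index(offsets, char_idx, max_tokens=None):
    """Map a character position to the index of the token covering it.

    Returns None when no kept token covers the position, for example because
    the token was truncated away or the position falls on whitespace.
    """
    kept = offsets if max_tokens is None else offsets[:max_tokens]
    for i, (start, end) in enumerate(kept):
        if start <= char_idx < end:
            return i
    return None

# Toy context (invented for illustration)
toy_context = "Polls closed late Sunday in the first national elections"
offsets = token_offsets(toy_context)

start_char = toy_context.index("Sunday")
end_char = start_char + len("Sunday")  # exclusive end, like answer_end

print(char_to_token_index(offsets, start_char))                 # token covering the span start
print(char_to_token_index(offsets, end_char - 1))               # token covering the last character
print(char_to_token_index(offsets, end_char))                   # whitespace: None
print(char_to_token_index(offsets, start_char, max_tokens=2))   # truncated: None
```

In this sketch a position that lands on whitespace yields `None`, which may matter when a span carries trailing whitespace and the end position is taken at `answer_end - 1`.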

In [None]:
# adding the answers' token positions to the encodings for fine tuning
def add_token_positions(encodings, answers):
  start_positions = []
  end_positions = []
  for i in range(len(answers)):
    # char_to_token maps a character offset in the i-th context to a token index;
    # answer_end is exclusive, so the last answer character is at answer_end - 1
    start_idx = encodings.char_to_token(i, answers[i]['answer_start'])
    end_idx = encodings.char_to_token(i, answers[i]['answer_end'] - 1)

    # char_to_token returns None when the character was truncated away or is not
    # part of any token; fall back to the maximum sequence length in that case
    if start_idx is None:
      start_idx = tokenizer.model_max_length
    if end_idx is None:
      end_idx = tokenizer.model_max_length

    start_positions.append(start_idx)
    end_positions.append(end_idx)

  encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers[:5000])
add_token_positions(valid_encodings, valid_answers[:100])

In [None]:
# creating the dataset in the format it is required for fine tuning BERT
class SQuAD_Dataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings
  def __getitem__(self, idx):
    return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
  def __len__(self):
    return len(self.encodings.input_ids)

In [None]:
train_dataset = SQuAD_Dataset(train_encodings)
valid_dataset = SQuAD_Dataset(valid_encodings)

In [None]:
# Define the dataloaders
from torch.utils.data import DataLoader
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=8)

In [None]:
# loading the BERT model which we will fine tune
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# checking the device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f'Working on {device}')

Working on cuda


In [None]:
# Fine tuning it per batch
from torch.optim import AdamW
from tqdm import tqdm

N_EPOCHS = 5
optim = AdamW(model.parameters(), lr=5e-5)

model.to(device)
model.train()

for epoch in range(N_EPOCHS):
  loop = tqdm(train_loader, leave=True)
  for batch in loop:
    optim.zero_grad()
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_positions = batch['start_positions'].to(device)
    end_positions = batch['end_positions'].to(device)
    outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
    loss = outputs[0]
    loss.backward()
    optim.step()

    loop.set_description(f'Epoch {epoch+1}')
    loop.set_postfix(loss=loss.item())

Epoch 1: 100%|██████████| 625/625 [09:23<00:00,  1.11it/s, loss=3.23]
Epoch 2: 100%|██████████| 625/625 [09:21<00:00,  1.11it/s, loss=2.05]
Epoch 3: 100%|██████████| 625/625 [09:20<00:00,  1.11it/s, loss=1.95]
Epoch 4: 100%|██████████| 625/625 [09:21<00:00,  1.11it/s, loss=1.14]
Epoch 5: 100%|██████████| 625/625 [09:22<00:00,  1.11it/s, loss=1.22]
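The `625/625` in the progress bars is the number of batches per epoch: 5000 training pairs split into batches of 8. A small arithmetic sketch follows (the helper name `steps_per_epoch` is illustrative, and it assumes the last partial batch is kept, which is the `DataLoader` default):

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(n_examples / batch_size)

train_steps = steps_per_epoch(5000, 8)   # training subset used above
valid_steps = steps_per_epoch(100, 8)    # validation subset used above
total_updates = train_steps * 5          # N_EPOCHS = 5

print(train_steps, valid_steps, total_updates)  # 625 13 3125
```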


Accuracy is computed separately for the predicted start and end token positions and then averaged.

A prediction counts as correct if it falls within a percentage error threshold of the true position (here ±25% of the true token index), which tolerates small deviations in the predicted span boundaries.
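As a rough illustration of the threshold rule, the sketch below checks whether a predicted token index lies within a given percentage of the true index. The helper `within_tolerance` is hypothetical and operates on plain integers rather than tensors.

```python
def within_tolerance(pred, true, pct):
    """True if `pred` lies within +/- pct percent of `true` (inclusive)."""
    margin = true * pct / 100
    return (true - margin) <= pred <= (true + margin)

# With a 25% threshold, a true index of 100 accepts predictions from 75 to 125,
# while a true index of 8 only accepts predictions from 6 to 10.
print(within_tolerance(120, 100, 25))  # True
print(within_tolerance(130, 100, 25))  # False
print(within_tolerance(10, 8, 25))     # True
print(within_tolerance(11, 8, 25))     # False
```

Because the margin is proportional to the true index, answers that appear later in the passage are scored more leniently than answers near the beginning.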

In [None]:
percentage_error_threshold = 25

In [None]:
# checking the performance
model.eval()

acc = []

for batch in tqdm(valid_loader):
  with torch.no_grad():
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_true = batch['start_positions'].to(device)
    end_true = batch['end_positions'].to(device)

    outputs = model(input_ids, attention_mask=attention_mask)

    start_pred = torch.argmax(outputs['start_logits'], dim=1)
    end_pred = torch.argmax(outputs['end_logits'], dim=1)

    #acc.append(((start_pred == start_true).sum()/len(start_pred)).item())
    #acc.append(((end_pred == end_true).sum()/len(end_pred)).item())
    # Calculate accuracy with percentage error threshold
    start_correct = ((start_pred >= (start_true - start_true * (percentage_error_threshold / 100))) &
                         (start_pred <= (start_true + start_true * (percentage_error_threshold / 100)))).sum().item()

    end_correct = ((end_pred >= (end_true - end_true * (percentage_error_threshold / 100))) &
                       (end_pred <= (end_true + end_true * (percentage_error_threshold / 100)))).sum().item()

    start_accuracy = start_correct / len(start_pred)
    end_accuracy = end_correct / len(end_pred)

    acc.append((start_accuracy + end_accuracy) / 2)

acc = sum(acc)/len(acc)

100%|██████████| 13/13 [00:03<00:00,  3.66it/s]


In [None]:
acc

0.46634615384615385
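The score above is the mean of per-batch accuracies. With 100 validation pairs and a batch size of 8, the last batch holds only 4 examples, so each of its examples carries twice the weight of the others. The sketch below uses made-up correctness counts to compare that mean with a per-example mean; the helper names are illustrative.

```python
def mean_of_batch_means(correct_per_batch, batch_sizes):
    """Average the per-batch accuracies, giving every batch equal weight."""
    per_batch = [c / n for c, n in zip(correct_per_batch, batch_sizes)]
    return sum(per_batch) / len(per_batch)

def per_example_mean(correct_per_batch, batch_sizes):
    """Total correct predictions divided by total examples."""
    return sum(correct_per_batch) / sum(batch_sizes)

# 12 full batches of 8 plus a final batch of 4 (100 examples in total)
batch_sizes = [8] * 12 + [4]
# Invented counts: 4 correct in each full batch, all 4 correct in the last one
correct = [4] * 12 + [4]

print(round(mean_of_batch_means(correct, batch_sizes), 4))  # 0.5385
print(round(per_example_mean(correct, batch_sizes), 4))     # 0.52
```

Accumulating the correct counts and dividing once by the number of examples avoids this small bias.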

## Exercise 2: Import the BERT model fine-tuned for classification and test its performance on any text classification dataset such as the Twitter dataset.

The Twitter Sentiment140 dataset is used, which contains tweets labeled with sentiment scores.
The dataset is preprocessed, and binary labels (positive/negative) are assigned.

### Get the Twitter Dataset

In [None]:
!pip install datasets
!pip install torch transformers




Dataset used: https://huggingface.co/datasets/sentiment140


Sentiment140 consists of Twitter messages with emoticons, which are used as noisy labels for sentiment classification.


text: a string feature.

date: a string feature.

user: a string feature.

sentiment: an int32 feature.

query: a string feature.
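The `sentiment` field uses 0 for negative and 4 for positive tweets, and the dataset description also lists 2 for neutral. The binarization rule used later in this exercise (`1 if x > 2 else 0`) can be sketched as follows, with a hypothetical `to_binary` helper and an invented list of scores:

```python
from collections import Counter

def to_binary(sentiment):
    """Map a Sentiment140 polarity score to a binary label (1 = positive)."""
    return 1 if sentiment > 2 else 0

# Invented sample of polarity scores: 0 = negative, 2 = neutral, 4 = positive
scores = [0, 4, 2, 4, 0, 2]
labels = [to_binary(s) for s in scores]

print(labels)            # [0, 1, 0, 1, 0, 0]
print(Counter(labels))   # class counts after binarization
```

Under this rule any neutral tweets end up in the negative class; filtering them out before evaluation would be an alternative design.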

In [None]:
from datasets import load_dataset

In [None]:
# Load the Twitter sentiment dataset
dataset = load_dataset("sentiment140")

### Test with BERT fine tuned for classification

In [None]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import pandas as pd
model_name = "bert-base-uncased"

In [None]:
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
test_dataset = dataset["test"]

In [None]:
num_items = test_dataset.num_rows
print(f"Number of items in test_dataset: {num_items}")

Number of items in test_dataset: 498


In [None]:
first_5_rows = test_dataset.select(range(5))
print("First 5 rows of the test_dataset:")
for row in first_5_rows:
    print(row)

First 5 rows of the test_dataset:
{'text': '@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.', 'date': 'Mon May 11 03:17:40 UTC 2009', 'user': 'tpryan', 'sentiment': 4, 'query': 'kindle2'}
{'text': 'Reading my kindle2...  Love it... Lee childs is good read.', 'date': 'Mon May 11 03:18:03 UTC 2009', 'user': 'vcu451', 'sentiment': 4, 'query': 'kindle2'}
{'text': 'Ok, first assesment of the #kindle2 ...it fucking rocks!!!', 'date': 'Mon May 11 03:18:54 UTC 2009', 'user': 'chadfu', 'sentiment': 4, 'query': 'kindle2'}
{'text': "@kenburbary You'll love your Kindle2. I've had mine for a few months and never looked back. The new big one is huge! No need for remorse! :)", 'date': 'Mon May 11 03:19:04 UTC 2009', 'user': 'SIX15', 'sentiment': 4, 'query': 'kindle2'}
{'text': "@mikefish  Fair enough. But i have the Kindle2 and I think it's perfect  :)", 'date': 'Mon May 11 03:21:41 UTC 2009', 'user': 'yamarama', 'sentiment': 4, 'query':

In [None]:
# Convert the Hugging Face dataset to a Pandas DataFrame
df = pd.DataFrame(test_dataset)


In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# Map sentiment labels to binary classes: 4 (positive) -> 1; 0 (negative) and 2 (neutral) -> 0
df['binary_sentiment'] = df['sentiment'].apply(lambda x: 1 if x > 2 else 0)

# Function to predict sentiment for a single example
def predict_sentiment(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True)

    # Forward pass
    with torch.no_grad():
        outputs = model(**inputs)

    # Access logits
    logits = outputs.logits

    # Apply softmax to obtain probabilities
    probabilities = torch.nn.functional.softmax(logits, dim=1)

    # Determine the predicted sentiment class
    predicted_class = torch.argmax(probabilities, dim=1).item()

    return predicted_class

# Predict binary sentiment for all examples
df['predicted_sentiment'] = df['text'].apply(predict_sentiment)

# Calculate overall accuracy
accuracy = accuracy_score(df['binary_sentiment'], df['predicted_sentiment'])

print(f"Overall Accuracy: {accuracy * 100:.2f}%")

Overall Accuracy: 36.75%
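`predict_sentiment` applies a softmax before taking the argmax. Since softmax preserves the ordering of the logits, the predicted class is the same with or without it; the probabilities matter only if a confidence value is needed. A pure-Python sketch with invented logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [-0.3, 1.2]   # invented logits for (negative, positive)
probs = softmax(logits)

print(argmax(logits), argmax(probs))   # same class either way
print(round(sum(probs), 6))            # probabilities sum to 1
```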


## Exercise 3: Fine-tune the BERT model from Exercise 2 on the text classification dataset you used for testing (in Exercise 2) and evaluate its performance (on a test set from the dataset that you set aside prior to fine tuning the model)

In [None]:
!pip install accelerate
!pip install transformers[torch]



### Setup the train dataset

Select 5,000 training instances.

In [None]:
# Select a subset of the dataset (5000 examples for training)
train_dataset = dataset["train"].shuffle(seed=42).select(range(5000))
train_df = pd.DataFrame(train_dataset)
train_df['label'] = train_df['sentiment'].apply(lambda x: 1 if x > 2 else 0)

In [None]:
# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)  # binary classification

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
train_df.head(5)

Unnamed: 0,text,date,user,sentiment,query,label
0,why am i awake so early? damn projects. super...,Sun Jun 07 07:43:33 PDT 2009,_stacey_rae,0,NO_QUERY,0
1,watching church online because I'd be half an ...,Sun May 31 06:16:45 PDT 2009,Trollyjd,0,NO_QUERY,0
2,Hillsong!,Fri May 29 17:35:07 PDT 2009,ffaithyy,4,NO_QUERY,1
3,is at Stafford Train Station and just watched ...,Fri Jun 19 23:28:43 PDT 2009,VCasambros,0,NO_QUERY,0
4,thanks everyone for the follow fridays!,Fri Jun 05 17:59:44 PDT 2009,angela_woo,4,NO_QUERY,1


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

### Load the train and test datasets

In [None]:
class CustomDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return {'text': self.texts[idx], 'label': self.labels[idx]}

In [None]:
texts = train_df['text'].tolist()
labels = train_df['label'].tolist()

In [None]:
# use a separate name so the Hugging Face `dataset` loaded earlier is not overwritten
train_data = CustomDataset(texts, labels)
dataloader = DataLoader(train_data, batch_size=4, shuffle=True)

In [None]:
valid_texts = df['text'].tolist()
valid_labels = df['binary_sentiment'].tolist()

In [None]:
test_dataset = CustomDataset(valid_texts, valid_labels)
# shuffling is not needed for evaluation
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False)


### Fine tune the BERT Sequence Classifier with train data

In [None]:
from torch.utils.data import DataLoader
from torch.optim import AdamW  # the AdamW class in transformers is deprecated

In [None]:
optimizer = AdamW(model.parameters(), lr=5e-5)



In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [None]:
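num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for batch in dataloader:
        # tokenize the raw tweets of this batch and move the tensors to the device
        inputs = tokenizer(batch['text'], return_tensors='pt', padding=True, truncation=True, max_length=128)
        inputs = {key: val.to(device) for key, val in inputs.items()}
        labels = batch['label'].to(device)

        # forward pass; passing labels makes the model return the classification loss
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        # backward pass, parameter update, then reset the gradients for the next batch
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()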

### Evaluate again on the test data

In [None]:
model.eval()


In [None]:
all_preds = []
all_labels = []

In [None]:

with torch.no_grad():
    for batch in test_dataloader:
        inputs = tokenizer(batch['text'], return_tensors='pt', padding=True, truncation=True, max_length=128)
        inputs = {key: val.to(device) for key, val in inputs.items()}
        labels = batch['label'].to(device)

        outputs = model(**inputs)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# Calculate accuracy
accuracy = accuracy_score(all_labels, all_preds)
print(f'Test Accuracy: {accuracy * 100:.2f}%')

Test Accuracy: 64.86%
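`accuracy_score` reports the fraction of predictions that equal their labels. An equivalent pure-Python sketch, with invented predictions and a hypothetical `simple_accuracy` helper:

```python
def simple_accuracy(labels, preds):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    matches = sum(1 for y, p in zip(labels, preds) if y == p)
    return matches / len(labels)

# Invented labels and predictions for illustration
toy_labels = [1, 0, 1, 1, 0]
toy_preds = [1, 0, 0, 1, 1]

print(f"Accuracy: {simple_accuracy(toy_labels, toy_preds) * 100:.2f}%")  # Accuracy: 60.00%
```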


The fine-tuned model achieves higher accuracy on the test set (64.86%) than the model before fine-tuning (36.75%).