#PREPARE GPU FOR the training model

#REFERENCES:

## Reference_1: Fine tune Bert https://www.youtube.com/watch?v=x66kkDnbzi4 by ChrisMcCormickAI


## Reference_2: applying SQUAD 1.0 dataset to BertForAnsweringQuestion already trained with SQUAD: https://www.youtube.com/watch?v=l8ZYCvgGu0o&list=WL&index=118&t=878s by ChrisMcCormickAI

## Reference_3: 'Question Answering with SQuAD 2.0' section from: https://huggingface.co/transformers/custom_datasets

## Reference_4: Basic knowledge about fine tuning, input formate and output format of BERT models : https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/?fbclid=IwAR3uWlc8mUlrJ3QnYoYyQOfze3yDYkacgVyKSk24YjYE04Gs-7XiM3b9gTA 

#SET UP GPU FOR TRAINING MODEL:

**FIRSTLY, Setup GPU for training**
edit -> Notebook setting -> Hardware accelerator -> GPU

In [1]:
import tensorflow as tf

# Get GPU device name:
device_name= tf.test.gpu_device_name()

# GPU device should have the following name:
if device_name == "/device:GPU:0":
  print("Found GPU at: " + device_name)
else:
  raise SystemError("GPU not found") # "GPU not found" a parameter to pass to the SystemError for printing out  

Found GPU at: /device:GPU:0


In [2]:
import torch

# if there is a GPU device available..
if torch.cuda.is_available():

  # Tell TORCH to use this GPU:
  device= torch.device("cuda")

  print("There are %d GPU(s) available" % torch.cuda.device_count())

  print('we will use the GPU: ', torch.cuda.get_device_name(0))

# if not:
else:
  print('NO GPU available, using CPU instead')
  device = torch.device("cpu")

There are 1 GPU(s) available
we will use the GPU:  Tesla T4


#IMPORT DATASET

In [3]:


# libraries for project:
import pandas as pd
import tensorflow as tf
import torch
import torch.nn as nn
import torch.optim as optim
from collections import defaultdict
# necessary libraries for project:

dataset = pd.read_table('./data/final.tsv')  
dataset

Unnamed: 0,question,is_impossible,text,answer_start,context
0,When did Beyonce start becoming popular?,False,in the late 1990s,269,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...
1,What areas did Beyonce compete in when she was...,False,singing and dancing,207,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...
2,When did Beyonce leave Destiny's Child and bec...,False,2003,526,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...
3,In what city and state did Beyonce grow up?,False,"Houston, Texas",166,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...
4,In which decade did Beyonce become famous?,False,late 1990s,276,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...
...,...,...,...,...,...
9996,What famous World War II battle was the Canadi...,False,the Normandy Landings,166,Battles which are particularly notable to the ...
9997,What effort was the Canadian Military known fo...,False,the strategic bombing of German cities,288,Battles which are particularly notable to the ...
9998,What Battle in France was the Canadian Militar...,False,the Battle of Vimy Ridge,72,Battles which are particularly notable to the ...
9999,What country was the latest Canadian Military ...,False,Croatia,377,Battles which are particularly notable to the ...


##TRIM (Or SAMPLE) DOWN THE DATASET FOR TRAINING:

In [4]:
# shuffling the dataset first beforing triming down:
# reference: https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows
dataset = dataset.sample(frac=1).reset_index(drop=True)
dataset

Unnamed: 0,question,is_impossible,text,answer_start,context
0,What are the largest photovoltaic solar power ...,False,"The 250 MW Agua Caliente Solar Project, in the...",325,Commercial CSP plants were first developed in ...
1,When is the Taoiseach reelected?,True,after every general election,171,"Some states, however, do have a term of office..."
2,"Because of a dog's resourcefulness to people, ...",False,man's best friend,356,The dogs' value to early human hunter-gatherer...
3,What do you get when dividing tandem repeats b...,True,The proportion of repetitive DNA,0,The proportion of repetitive DNA is calculated...
4,Quantum is a division of what other organizati...,False,Spectre,33,"Despite being an original story, Spectre draws..."
...,...,...,...,...,...
9996,What does craving carry with it?,False,defilements,126,"In Theravāda Buddhism, the cause of human exis..."
9997,What was the most searched term on Google for ...,False,Beyonce pregnant,719,"In August, the couple attended the 2011 MTV Vi..."
9998,Queens is located on what part of Long Island?,False,the west end,221,New York City is located on one of the world's...
9999,How many scientists believe that symbiosis sho...,True,130,276,The definition of symbiosis has varied among s...


In [5]:
# SAMPLE DOWN NUMBER OF DATASET FOR TRAINING AND EVALUATION:
dataset=dataset.iloc[:1000,:]
dataset

Unnamed: 0,question,is_impossible,text,answer_start,context
0,What are the largest photovoltaic solar power ...,False,"The 250 MW Agua Caliente Solar Project, in the...",325,Commercial CSP plants were first developed in ...
1,When is the Taoiseach reelected?,True,after every general election,171,"Some states, however, do have a term of office..."
2,"Because of a dog's resourcefulness to people, ...",False,man's best friend,356,The dogs' value to early human hunter-gatherer...
3,What do you get when dividing tandem repeats b...,True,The proportion of repetitive DNA,0,The proportion of repetitive DNA is calculated...
4,Quantum is a division of what other organizati...,False,Spectre,33,"Despite being an original story, Spectre draws..."
...,...,...,...,...,...
995,Which friend took on the role of several jobs ...,False,Julian Fontana,133,Two Polish friends in Paris were also to play ...
996,Who decided to place Beyonce's group in Star S...,False,Arne Frager,303,"At age eight, Beyoncé and childhood friend Kel..."
997,What was federally-governed Maastricht?,True,several border territories,31,"After the Peace of Westphalia, several border ..."
998,What kind of system is a solar chimney?,False,passive solar ventilation,59,"A solar chimney (or thermal chimney, in this c..."


#EXTRACTING THE START AND END TOKENS OF AN ANSWER IN A CONTEXT

In [6]:
# Small run:
# drop the question mark in the 'question' column:
def Drop_question_mark (quest):
  l= len(quest)
  return quest[:l-1]

dataset['question'] = dataset['question'].apply(Drop_question_mark)
dataset

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


Unnamed: 0,question,is_impossible,text,answer_start,context
0,What are the largest photovoltaic solar power ...,False,"The 250 MW Agua Caliente Solar Project, in the...",325,Commercial CSP plants were first developed in ...
1,When is the Taoiseach reelected,True,after every general election,171,"Some states, however, do have a term of office..."
2,"Because of a dog's resourcefulness to people, ...",False,man's best friend,356,The dogs' value to early human hunter-gatherer...
3,What do you get when dividing tandem repeats b...,True,The proportion of repetitive DNA,0,The proportion of repetitive DNA is calculated...
4,Quantum is a division of what other organization?,False,Spectre,33,"Despite being an original story, Spectre draws..."
...,...,...,...,...,...
995,Which friend took on the role of several jobs ...,False,Julian Fontana,133,Two Polish friends in Paris were also to play ...
996,Who decided to place Beyonce's group in Star S...,False,Arne Frager,303,"At age eight, Beyoncé and childhood friend Kel..."
997,What was federally-governed Maastricht,True,several border territories,31,"After the Peace of Westphalia, several border ..."
998,What kind of system is a solar chimney,False,passive solar ventilation,59,"A solar chimney (or thermal chimney, in this c..."


##Import BertTokenizer for extracting the end and start tokens of an answer (in this case, the 'text') in a 'context'. [BertTokenizer also later used for tokenized and add special tokens [CLS], [SEP] for the input data of the Bert model]

In [7]:
# INSTALL TRANSFORMER TO IMPORT BERT:
! pip install transformers



In [8]:
from transformers import DistilBertTokenizerFast

In [9]:
#Load the BERT tokenizer:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

##Extracting the necessary data from the 'dataset', such as 'context', 'text', 'question' as lists of data elements for easily hanlding this task. [Later these data lists also used as input for step TRANSFORMING the dataset into appropriate format input of Bert model]

In [10]:
# list of all the 'context' from the dataset
context= dataset.context.values

# list of all the 'question' from the dataset
question = dataset.question.values

# list of all the 'answer' from the dataset
answers = dataset.text.values


##THE MAIN STEP : for extracting the end and start tokens of an 'answer' in a 'context'

In [11]:
def extract_answer_start_end_tokens(answer, context):
  for i in range (0, len(context)-(len(answer)-2)+1):
    if context[i] == answer[1]: # find the first token of the answer in the context
      
      for j in range (1, len(answer)-1):
        if context[i+j-1] != answer[j]: # if the next tokens in the context are not in the answer
          break; # stop
      
      if j == len(answer)-2: # reach the end of the answer, in other words, the 'for-loop' of j reaches the end:
        # we have found the answer start and end indices in the context:
        start_token = i
        end_token= i+j-1
        return start_token, end_token
      # else: we move on to the next value in the context to keep searching for the answer start and end indices in the context.
  return 0, 0 # can not find the answer in the context

In [12]:
# TEST:
# tokenize a 'context'
tokenized_context= tokenizer(context[1])

# extract only the input_ids from the tokenized result:
input_ids_tokenized_context = tokenized_context['input_ids'] 

# tokenize an 'answer'
tokenized_answer= tokenizer(answers[1])

# extract only the input_ids from the tokenized result:
input_ids_tokenized_answer = tokenized_answer['input_ids']

start, end = extract_answer_start_end_tokens(input_ids_tokenized_answer, input_ids_tokenized_context)
print(start)
print(end)
test_list=[]
for i in range (start, end+1):
  test_list.append(input_ids_tokenized_context[i])

tokens= tokenizer.convert_ids_to_tokens(test_list)

for token, id in zip(tokens, test_list):
  if id == tokenizer.sep_token_id:
    print(" ")
  print('{:12} {:>6}'.format(token,id))
  if id == tokenizer.sep_token_id:
    print(" ")

38
41
after          2044
every          2296
general        2236
election       2602


**EXTRACTING START AND END TOKENS OF AN 'ANSWER' IN EVERY 'CONTEXT' IN THE DATASET**

In [13]:
start_labels = []
end_labels = []

In [14]:
assert len(context) == len(answers) # must be TRUE

In [15]:
import random

def fit_max_length (input, max_len):

  # find the range of the 'context' tokens:
  for i in range (0, len(input)):

    if input[i] == 102: # encouter the first [SEP], which is the end of 'context
      range_context= i
      break
  
  # randomly drop some tokens in the 'context' to make the input len = 512:

  len_difference= len(input) - max_len
  
  # generate a list of random indices from 0 to range_context (not include (0 and value of range_context) ):
  # reference:
  
  for i in range (0, len_difference):
    d_index=random.randint( 1, range_context-1)
    del input[d_index]

    # the range_conext should be decreased by 1 due to the deleted tokens:
    range_context= range_context-1
  
  return input

In [16]:
for i in range (0, len(context)):
  # tokenize a 'context'
  tokenized_context= tokenizer(context[i])

  # extract only the input_ids from the tokenized result:
  input_ids_tokenized_context = tokenized_context['input_ids'] 

  #resize the 'context' ids to maximum length of 512:
  input_ids_tokenized_context =fit_max_length(input_ids_tokenized_context,512) 

  # tokenize an 'answer'
  tokenized_answer= tokenizer(answers[i])

  # extract only the input_ids from the tokenized result:
  input_ids_tokenized_answer = tokenized_answer['input_ids']

  start, end = extract_answer_start_end_tokens(input_ids_tokenized_answer, input_ids_tokenized_context)

  start_labels.append(start)
  end_labels.append(end)

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors


In [17]:
# TEST:
assert len(start_labels) == len(end_labels)
assert len(start_labels) == len(context)

In [None]:
# TEST:
# TEST:
# tokenize a 'context'
index= 4 # choose an examplary 'context', by choose a random value for the index in range (0, len(context))

tokenized_context= tokenizer(context[index])

# extract only the input_ids from the tokenized result:
input_ids_tokenized_context = tokenized_context['input_ids'] 

# tokenize an 'answer'
tokenized_answer= tokenizer(answers[index])

# extract only the input_ids from the tokenized result:
input_ids_tokenized_answer = tokenized_answer['input_ids']

tst_start = start_labels[index]
tst_end = end_labels[index]
print(f'{tst_start}\t{tst_end}')

test_list=[]
for i in range (tst_start, tst_end+1):
  test_list.append(input_ids_tokenized_context[i])

tokens= tokenizer.convert_ids_to_tokens(test_list)

for token, id in zip(tokens, test_list):
  if id == tokenizer.sep_token_id:
    print(" ")
  print('{:12} {:>6}'.format(token,id))
  if id == tokenizer.sep_token_id:
    print(" ")

69	70
late           2397
1990s          4134


#TRANSFORMING the dataset into appropriate format input of Bert model

##STEP 1: Tokenize the dataset and add special tokens [CLS], [SEP]. Then convert the tokenized the dataset into appropriate ids which are the indices of the lookup vocab table of the Bert model [Because the BertTokenizer is used for this task]

In [18]:
# the maximum length of input sequence for bert-base-uncase is 512

# check the length of the input sequence:
# because input sequence = a 'context' + a'question' 
# the tokenized input is [CLS] + a 'context' + [SEP] + a'question'  +[SEP]
# => length of the tokenized input <= 512

# => Check the len(tokenized input), if it > 512, drop some tokens in the 'context' to make len = 512.

import random

def fit_max_length (input, max_len):

  # find the range of the 'context' tokens:
  for i in range (0, len(input)):

    if input[i] == 102: # encouter the first [SEP], which is the end of 'context
      range_context= i
      break
  
  # randomly drop some tokens in the 'context' to make the input len = 512:

  len_difference= len(input) - max_len
  
  # generate a list of random indices from 0 to range_context (not include (0 and value of range_context) ):
  # reference:
  
  for i in range (0, len_difference):
    d_index=random.randint( 1, range_context-1)
    del input[d_index]

    # the range_conext should be decreased by 1 due to the deleted tokens:
    range_context= range_context-1
  
  return input

In [19]:
# Test: fit_max_length()
tokenized_sentences = tokenizer(context[10],question[10], add_special_tokens= True)
# BECAUSE input for the Bert model will be [CLS] + 'context' + [SEP] + 'question' + [SEP] (requirement_1),
# => tokenizer(context[10],question[10], add_special_tokens= True) takes care of the requirement_1.
# context[c_index], question[1_index] : c_index must be the same as q_question ('conext' must correspond to its own 'question')
# these indices can be a integer number in range (0, len (context))

ids= tokenized_sentences['input_ids']


max_len=200 # test with a random length

new_input_ids= fit_max_length(ids,max_len)

#len(new_input_ids), new_input_ids

tokens= tokenizer.convert_ids_to_tokens(new_input_ids)

for token, id in zip(tokens, new_input_ids):
  if id == tokenizer.sep_token_id:
    print(" ")
  print('{:12} {:>6}'.format(token,id))
  if id == tokenizer.sep_token_id:
    print(" ")

len(new_input_ids)

[CLS]           101
in             1999
higher         3020
education      2495
,              1010
polite        13205
##c            2278
##nic          8713
##o            2080
refers         5218
to             2000
a              1037
technical      4087
university     2118
awarding      21467
degrees        5445
in             1999
engineering    3330
.              1012
historically   7145
there          2045
were           2020
two            2048
polite        13205
##c            2278
##nic          8713
##i            2072
,              1010
one            2028
in             1999
each           2169
of             1997
the            1996
two            2048
largest        2922
industrial     3919
cities         3655
of             1997
the            1996
north          2167
:              1024
 
[SEP]           102
 
what           2054
term           2744
in             1999
higher         3020
education      2495
refers         5218
to             2000
technical      4

56

In [20]:
# Each entry at ith index of the input_ids is 'ids' like in the 'TEST' cell right above.
# Maximum sequence length for this model (512)
# => Each entry of the input_ids must be 512.

input_ids=[]
max_len= 512
for cntx, quest in zip(context, question):

  tokenized_sentences = tokenizer(cntx,quest, add_special_tokens= True)
  ids= tokenized_sentences['input_ids']

  if len(ids) > 512: # if the sequence length > 512
    ids= fit_max_length(ids, max_len) # decrease it to 512

  input_ids.append(ids)

In [21]:
# TEST:
test_1=input_ids[4]

tokens= tokenizer.convert_ids_to_tokens(test_1)

for token, id in zip(tokens, test_1):
  if id == tokenizer.sep_token_id:
    print(" ")
  print('{:12} {:>6}'.format(token,id))
  if id == tokenizer.sep_token_id:
    print(" ")
print(len(test_1))

[CLS]           101
despite        2750
being          2108
an             2019
original       2434
story          2466
,              1010
spec          28699
##tre          7913
draws          9891
on             2006
ian            4775
fleming       13779
'              1005
s              1055
source         3120
material       3430
,              1010
most           2087
notably        5546
in             1999
the            1996
character      2839
of             1997
franz          8965
obe           15578
##rh          25032
##aus         20559
##er           2121
,              1010
played         2209
by             2011
christoph     21428
waltz         17569
.              1012
obe           15578
##rh          25032
##aus         20559
##er           2121
shares         6661
his            2010
name           2171
with           2007
han            7658
##nes          5267
obe           15578
##rh          25032
##aus         20559
##er           2121
,              1010


In [None]:
# CHECK:
input_ids[10]
for i in range (0,len(input_ids)):
  if len(input_ids[i]) > 512:
    print("NOT OK")
    break

##PADDING: to make all the input sequences (in this case, an entry of the 'input_ids' list) the same length [because Bert model requires such a thing]

In [22]:
# CHECK The maximum length of each sequence in input_ids:
max_len_input_ids=0
for i in range (0, len(input_ids)):
  if len(input_ids[i])> max_len_input_ids:
    max_len_input_ids= len(input_ids[i])

max_len_input_ids 

512

In [23]:
# PADDING: FOR the input_ids
# [PAD] in Bert has value of 0

from keras.preprocessing.sequence import pad_sequences

# set the Max_len:
Max_len= 512 # set as the same value with the 'max_len_input_ids'

# because '[PAD]' in Bert vocab look-up take has id (or index) =0 
# => we can padd 0 values at the end of each entries of 'input_ids'
pad_input_ids= pad_sequences(input_ids, maxlen=Max_len, dtype='long', value=0, truncating='post', padding='post')

# check whether the padding and truncating after padding work as we expect:
assert len(pad_input_ids[10]) == Max_len
print(len(input_ids[0])) # original length of the input_ids[0]


186


In [24]:
# TEST:
test_1=pad_input_ids[5]

tokens= tokenizer.convert_ids_to_tokens(test_1)

for token, id in zip(tokens, test_1):
  if id == tokenizer.sep_token_id:
    print(" ")
  print('{:12} {:>6}'.format(token,id))
  if id == tokenizer.sep_token_id:
    print(" ")
print(len(test_1))

[CLS]           101
beyonce       20773
attended       3230
st             2358
.              1012
mary           2984
'              1005
s              1055
elementary     4732
school         2082
in             1999
frederick      5406
##sburg        9695
,              1010
texas          3146
,              1010
where          2073
she            2016
enrolled       8302
in             1999
dance          3153
classes        4280
.              1012
her            2014
singing        4823
talent         5848
was            2001
discovered     3603
when           2043
dance          3153
instructor     9450
dar           18243
##lette       27901
johnson        3779
began          2211
humming       20364
a              1037
song           2299
and            1998
she            2016
finished       2736
it             2009
,              1010
able           2583
to             2000
hit            2718
the            1996
high           2152
-              1011
pitched        8219


#BESIDES, the input sequences, the SEGMENT MASK is also required as input for the Bert model

as a sequence input in the 'input_ids' has format:
[CLS] + 'context' + [SEP] + 'question' + [SEP] + [PAD]s.

Then we would assign a sequence of 1s (1, 1, 1, ...) for the first part [CLS] + 'context' + [SEP]; and assign a sequence of 0s (0, 0, 0, ...) for the second part 'question' + [SEP] + [PAD]s.

In [25]:
segment_masks=[] # consider [PAD]s belonging to the second sequence
for i in range (0, len(pad_input_ids)):

  convert_to_list= pad_input_ids[i].tolist()
  sep_index= convert_to_list.index(tokenizer.sep_token_id)

  # number of the 'context' (='answer') tokens includes the [SEP] also
  num_seg_a= sep_index+1

  # the remainder is the 'question':
  num_seg_b= len(convert_to_list) - num_seg_a

  # construct list of 0s and 1s:
  segment_ids= [1]*num_seg_a + [0]*num_seg_b # a segment mask for [CLS]+ a 'context' +[SEP] + 'text' + [SEP]

  # add the segment_ids to the list of segment masks:
  segment_masks.append(segment_ids)

In [143]:
# TEST:
number_of_ones=0
position= 99 # take a segment mask in the list segment_maks for testing
for i in range (0, len(segment_masks[position])):
  if segment_masks[position][i] == 1:
    number_of_ones= number_of_ones+1

number_of_zeros = len(segment_masks[position]) - number_of_ones

# the real number of ones in the segment masks:
convert_to_list= pad_input_ids[position].tolist()
sep_index= convert_to_list.index(tokenizer.sep_token_id)
real_number_of_ones= sep_index+1
real_number_of_zeros= len(convert_to_list) - real_number_of_ones
#
assert number_of_ones== real_number_of_ones
assert number_of_zeros== real_number_of_zeros

In [144]:
# TEST:

for i in range (0, 15): # all segment_masks must have the same length
  print(len(segment_masks[i]))

512
512
512
512
512
512
512
512
512
512
512
512
512
512
512


#ATTTENTION MASK is also required together with the segment mask

as a sequence input in the 'input_ids' has format:
[CLS] + 'context' + [SEP] + 'question' + [SEP] + [PAD]s.

Then we would assign a sequence of 1s (1, 1, 1, ...) for the first part [CLS] + 'context' + [SEP] + 'question' + [SEP]; and assign a sequence of 0s (0, 0, 0, ...) for the second part [PAD]s.

In [26]:
# Create attention mask:
attention_masks= []

for pad_sequence in pad_input_ids:

  # because [PAD] has id = 0 => we could use this condition to apply the attension mask:
  attention_mask=[int(token_id >0) for token_id in pad_sequence]

  # aggregate each mask of each padded sequence into a list attention_masks
  # with this method, we could preserve the corresponding order between a padded sequence and its mask:
  attention_masks.append(attention_mask)


In [146]:
# CHECK attention_masks:
# check the mask for first sentence:

# the first encoded sentence:
#print(pad_input_ids[0])

# the mask of the first encoded sentence:
#print(attention_masks[0]) # there should be 19 values of '1' at the beginning.

# count '1' values in the first attention mask, the result should be 174
c=0
for i in range (0, len(attention_masks[2])):
  if attention_masks[2][i]==1:
    c=c+1
print(c)

218


In [27]:
assert len(attention_masks[10]) == 512

#SPLIT the dataset, the segment masks, the attention maks, the start labels, the end labels into the train set and evaluation set

In [28]:
from sklearn.model_selection import train_test_split

# split for the conformed input dataset of BERT (input_ids) and the labels list
train_inputs, evl_inputs, train_start_labels, evl_start_labels= train_test_split(pad_input_ids, start_labels, random_state=2018, test_size=0.1)

# do the same for segment masks of conformed input dataset:
train_segment_masks, evl_segment_masks, train_end_labels, evl_end_labels= train_test_split(segment_masks, end_labels, random_state=2018, test_size=0.1)
# _,_ is used because masks need no labels.

# do the same for segment masks of conformed input dataset:
train_attention_masks, evl_attention_masks,_,_= train_test_split(attention_masks, start_labels, random_state=2018, test_size=0.1)
# _,_ is used because masks need no labels.

In [29]:
#train_inputs.shape, train_labels.shape
print(train_inputs.shape)

(900, 512)


#CONVERT the train set and the evaluation set into 'torch.tensor' type becuase Bert model requires 'torch.tensor' type as its valid input type.

In [30]:
# for the input data:
train_inputs= torch.tensor([train_inputs])
evl_inputs= torch.tensor([evl_inputs])

# for the start_labels:
train_start_labels= torch.tensor(train_start_labels)
evl_start_labels= torch.tensor(evl_start_labels)

# for the end_labels:
train_end_labels= torch.tensor(train_end_labels)
evl_end_labels= torch.tensor(evl_end_labels)

# for segment masks:
train_segment_masks= torch.tensor([train_segment_masks])
evl_segment_masks= torch.tensor([evl_segment_masks])

# for attention masks:
train_attention_masks=torch.tensor([train_attention_masks])
evl_attention_masks=torch.tensor([evl_attention_masks])

In [31]:
# check the shape:
train_inputs.shape, train_start_labels.shape, train_end_labels.shape, train_segment_masks.shape, train_attention_masks.shape

(torch.Size([1, 900, 512]),
 torch.Size([900]),
 torch.Size([900]),
 torch.Size([1, 900, 512]),
 torch.Size([1, 900, 512]))

#Generate the train set and evaluation set in batches:

**FUNCTION FOR GENERATING BATCHES with batch size chosen by user**

In [32]:
# LEARN FROM PREVIOUS LECTURES AND LABS in the class:
class BatchedIterator:
    def __init__(self, *tensors, batch_size,**kwarg):
        # all tensors must have the same first dimension
        assert len(set(len(tensor) for tensor in tensors)) == 1
        #print(type(tensors))
        #print(tensors[1])
        self.tensors = tensors
        self.batch_size = batch_size

        for keyword, value in kwarg.items():
            if keyword == "shuffle":
                self.shuffle=value
    
    def iterate_once(self):
        num_data = len(self.tensors[0][0]) # the length of the data

        if self.shuffle== False:
          for start in range(0, num_data, self.batch_size):
              end = start + self.batch_size
              yield tuple(tensor[0][start:end] for tensor in self.tensors) 
              #must be tensor[0], to access to real data
              # cause tensor size [1,..]; 1: is unecessary dimension
              # => must exclude it by using tensor[0]  
        else:
          all_batches=[] # to gather all the batches formed form the dataset
          for start in range(0, num_data, self.batch_size):
              end = start + self.batch_size
              all_batches.append(tuple(tensor[0][start:end] for tensor in self.tensors))
          
          # shuffle the batches: 
          # reference: https://note.nkmk.me/en/python-random-shuffle/#:~:text=To%20randomly%20shuffle%20elements%20of,Python%2C%20use%20the%20random%20module.&text=random%20provides%20shuffle()%20that,used%20for%20strings%20and%20tuples.
          shuf_batches = random.sample(all_batches, len(all_batches))

          # yield a batch in the list at a time:
          for i in range (0,len(shuf_batches)):
            yield shuf_batches[i]

In [33]:
# extract for test sample:
test_train_inputs= train_inputs[:,:10,:]

test_train_start_labels= train_start_labels[:10]
test_train_end_labels= train_end_labels[:10]

test_train_segment_masks = train_segment_masks[:,:10,:]
test_train_attention_masks = train_attention_masks[:,:10,:]

test_train_inputs.shape, test_train_start_labels.shape, test_train_end_labels.shape, test_train_segment_masks.shape, test_train_attention_masks.shape,

(torch.Size([1, 10, 512]),
 torch.Size([10]),
 torch.Size([10]),
 torch.Size([1, 10, 512]),
 torch.Size([1, 10, 512]))

In [34]:
# the torch.size must be torch.size([1,...])
# but the test_train_labels has the torch.size([10])
# must make it into torch.size([1,10])
good_test_train_start_labels=torch.unsqueeze(test_train_start_labels, 0)
good_test_train_end_labels=torch.unsqueeze(test_train_end_labels, 0)

#test:
good_test_train_start_labels.shape, good_test_train_end_labels.shape

(torch.Size([1, 10]), torch.Size([1, 10]))

In [35]:
# Test out the BatchedIterator:

batch_size = 5
train_iter = BatchedIterator(test_train_inputs, good_test_train_start_labels,good_test_train_end_labels, test_train_segment_masks,test_train_attention_masks, batch_size=batch_size,shuffle=True)

for train_batch, start_batch, end_batch, segment_mask_batch, attention_batch in train_iter.iterate_once():
  print(f'train_batch.type= {train_batch.shape}\tstart_batch.type={start_batch.shape}\tend_batch={end_batch.shape}\tsegment_mask_batch={segment_mask_batch.shape}\tattention_mask_batch={attention_batch.shape}')
  print(train_batch)

train_batch.type= torch.Size([5, 512])	start_batch.type=torch.Size([5])	end_batch=torch.Size([5])	segment_mask_batch=torch.Size([5, 512])	attention_mask_batch=torch.Size([5, 512])
tensor([[  101,  1999,  2249,  ...,     0,     0,     0],
        [  101, 20773, 21025,  ...,     0,     0,     0],
        [  101,  2087,  1997,  ...,     0,     0,     0],
        [  101,  1999,  2238,  ...,     0,     0,     0],
        [  101,  1037,  2193,  ...,     0,     0,     0]])
train_batch.type= torch.Size([5, 512])	start_batch.type=torch.Size([5])	end_batch=torch.Size([5])	segment_mask_batch=torch.Size([5, 512])	attention_mask_batch=torch.Size([5, 512])
tensor([[  101,  1996,  4145,  ...,     0,     0,     0],
        [  101,  1996,  2048,  ...,     0,     0,     0],
        [  101, 20773,  1005,  ...,     0,     0,     0],
        [  101,  2006,  2410,  ...,     0,     0,     0],
        [  101,  5148,  1024,  ...,     0,     0,     0]])


#DEFINE BERT MODEL FOR TRAINING

**Bert model is too large, i would use DistlledBert model which is 40 percent smaller than Bert model but still preserves 97 percent performace of the Bert model**

In [36]:
# https://huggingface.co/transformers/model_doc/bert.html#bertforquestionanswering
from transformers import DistilBertForQuestionAnswering

# reference for config a pretrained model : https://towardsdatascience.com/hugging-face-transformers-fine-tuning-distilbert-for-binary-classification-tasks-490f1d192379
from transformers import DistilBertConfig

In [37]:
class ForQuestionAnsweringClassifier(nn.Module):
    def __init__(self,model_out_sequence_length, output_size, freeze_bert = True):
        super(ForQuestionAnsweringClassifier, self).__init__()
        # Configure DistilBERT's initialization
        self.bert_config = DistilBertConfig(n_layers=1, n_heads=2,qa_dropout=0.2, dim = 100)
                          
        self.Distil_Bert_model= DistilBertForQuestionAnswering(config=self.bert_config)

        # Make DistilBERT layers untrainable
        for p in self.Distil_Bert_model.parameters():
          p.requires_grad = False
          p.trainable = False
                
        #Classification layer
        self.start_cls_layer = nn.Linear(model_out_sequence_length, output_size)
        self.end_cls_layer = nn.Linear(model_out_sequence_length, output_size)

    def forward(self, data, attn_masks, start_label, end_label):
        bertOut= self.Distil_Bert_model(data, # the tokens representing our input
                attention_mask=attn_masks, start_positions= start_label, end_positions= end_label ) # the segment ids to differentiate the question and the answer
        start_labels= self.start_cls_layer(bertOut.start_logits)
        end_labels=self.start_cls_layer(bertOut.end_logits)
        return start_labels, end_labels

In [38]:
model = ForQuestionAnsweringClassifier(512,512) #model_out_sequence_length = output_size

In [39]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

In [40]:
# Tell Pytorch to run this model on the GPU:
#for p in model.parameters():
#      p.requires_grad = False
model.to(device)

ForQuestionAnsweringClassifier(
  (Distil_Bert_model): DistilBertForQuestionAnswering(
    (distilbert): DistilBertModel(
      (embeddings): Embeddings(
        (word_embeddings): Embedding(30522, 100, padding_idx=0)
        (position_embeddings): Embedding(512, 100)
        (LayerNorm): LayerNorm((100,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (transformer): Transformer(
        (layer): ModuleList(
          (0): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=100, out_features=100, bias=True)
              (k_lin): Linear(in_features=100, out_features=100, bias=True)
              (v_lin): Linear(in_features=100, out_features=100, bias=True)
              (out_lin): Linear(in_features=100, out_features=100, bias=True)
            )
            (sa_layer_norm): LayerNorm((100,), eps=1e-12, elementwise_affine=Tr

#TRAINING AND EVALUATION:

In [41]:
# Convert the data into tensor so that the model can run on the data
# extract for test sample:
train_inputs
evl_inputs

train_inputs= train_inputs.to(device)
evl_inputs = evl_inputs.to(device)

train_start_labels
train_end_labels

train_start_labels= train_start_labels.to(device)
train_end_labels= train_end_labels.to(device) 


evl_start_labels
evl_end_labels

evl_start_labels= evl_start_labels.to(device)
evl_end_labels= evl_end_labels.to(device)

train_segment_masks
evl_segment_masks


train_segment_masks= train_segment_masks.to(device) 
evl_segment_masks= evl_segment_masks.to(device) 



train_attention_masks
evl_attention_masks


train_attention_masks = train_attention_masks.to(device) 
evl_attention_masks= evl_attention_masks.to(device) 

train_inputs.shape, train_start_labels.shape, train_end_labels.shape, train_segment_masks.shape, train_attention_masks.shape

(torch.Size([1, 900, 512]),
 torch.Size([900]),
 torch.Size([900]),
 torch.Size([1, 900, 512]),
 torch.Size([1, 900, 512]))

In [42]:
# the torch.size must be torch.size([1,...])
# but the test_train_labels has the torch.size([10])
# must make it into torch.size([1,10])
good_train_start_labels=torch.unsqueeze(train_start_labels, 0)
good_train_end_labels=torch.unsqueeze(train_end_labels, 0)

good_evl_start_labels=torch.unsqueeze(evl_start_labels, 0)
good_evl_end_labels=torch.unsqueeze(evl_end_labels, 0)

#test:
good_train_start_labels.shape, good_train_end_labels.shape

(torch.Size([1, 900]), torch.Size([1, 900]))

In [43]:
batch_size = 10
train_iter = BatchedIterator(train_inputs, good_train_start_labels, good_train_end_labels, train_segment_masks, train_attention_masks, batch_size=batch_size, shuffle=True)

In [44]:
criterion = nn.CrossEntropyLoss()
criterion = criterion.cuda()
optimizer = optim.Adam(model.parameters())

In [45]:
num_epochs = 4

In [46]:
for epoch in range(num_epochs):
    model.train()
    # Training on train data
  
    for train_batch, start_batch, end_batch, segment_mask_batch, attention_batch in train_iter.iterate_once():
        
        #train_batch = train_batch.to(device)
        #attention_batch = attention_batch.to(device)
        #start_batch = start_batch.to(device)
        #end_batch = end_batch.to(device)
        # run the model on the inputs:
        #print(f'train_batch.shape= {train_batch.shape}\nattention_batch.shape= {attention_batch.shape}\nstart_batch.shape= {start_batch.shape}\nend_batch.shape= {end_batch.shape}')
        start_predicts, end_predicts = model(train_batch, # the tokens representing our input
                attention_batch, start_batch, end_batch ) # the segment ids to differentiate the question and the answer
        

        # FOR TASK 1:
        
        #y_out = model(X_batch)
        #print(y_out)
        # FOR TASK 1:
        
        # To understand the ouput of the model:
        # y_out.shape = [batch_size, output_size] # output_size = number of labels
        #print(f'y_out.shape= {y_out.shape} \ny_batch.shape= {y_batch.shape} ')

        #print(f'y_out.shape= {y_out.shape} \ny_batch.shape= {y_batch.shape} ')
        #train_loss= outputs.loss
        #print(f'start_predict.shape={start_predicts.shape}\tstart_batch={start_batch.shape}')
        start_loss= criterion(start_predicts, start_batch)
        end_loss= criterion(end_predicts, end_batch)
        #print("loss:")
        #print(loss)
        optimizer.zero_grad()
        start_loss.backward()
        end_loss.backward()
        optimizer.step()
        
    model.eval()  # or model.train(False)
    
    #Move the training data for evaluation
    #train_inputs[0] = train_inputs[0].to(device)
    #train_attention_masks[0] = train_attention_masks[0].to(device)
    #train_start_labels = train_start_labels.to(device)
    #train_end_labels = train_end_labels.to(device)
        
    # evaluation on train data:
    start_predicts, end_predicts = model(train_inputs[0], # the tokens representing our input
                train_attention_masks[0], train_start_labels , train_end_labels) # the segment ids to differentiate the question and the answer

    start_labels= start_predicts.argmax(axis=1)
    end_labels=end_predicts.argmax(axis=1)

    #train_loss= outputs.loss
    start_train_loss = criterion(start_predicts, train_start_labels).item()
    end_train_loss = criterion(end_predicts, end_labels).item()
    train_loss= (start_train_loss + end_train_loss)/2

    train_start_accuracy = (torch.eq(start_labels, train_start_labels).sum() / float(len(train_start_labels))).item()
    train_end_accuracy = (torch.eq(end_labels, train_end_labels).sum() / float(len(train_end_labels))).item()
    train_accuracy = (train_start_accuracy + train_end_accuracy) /2
    

    
    # evaluation on eval data:
    #Move the training data for evaluation
    #evl_inputs[0] = evl_inputs[0].to(device)
    #evl_attention_masks[0] = evl_attention_masks[0].to(device)
    #evl_start_labels = evl_start_labels.to(device)
    #evl_end_labels = evl_end_labels.to(device)
    

    evl_start_predicts, evl_end_predicts = model(evl_inputs[0], # the tokens representing our input
                evl_attention_masks[0], evl_start_labels , evl_end_labels) # the segment ids to differentiate the question and the answer

    start_labels= evl_start_predicts.argmax(axis=1)
    end_labels=evl_end_predicts.argmax(axis=1)

    start_dev_loss = criterion(evl_start_predicts, evl_start_labels).item()
    end_dev_loss = criterion(evl_end_predicts, evl_end_labels).item()
    dev_loss= (start_train_loss + end_train_loss)/2
    
    evl_start_accuracy = (torch.eq(start_labels, evl_start_labels).sum() / float(len(evl_start_labels))).item()
    evl_end_accuracy = (torch.eq(end_labels, evl_end_labels).sum() / float(len(evl_end_labels))).item()
    evl_accuracy = (evl_start_accuracy + evl_end_accuracy) /2
    

    print(f"Epoch: {epoch} -- train loss: {train_loss} - train acc: {train_accuracy*100} - "
          f"dev loss: {dev_loss} - dev acc: {evl_accuracy*100}")

Epoch: 0 -- train loss: 4.329398274421692 - train acc: 2.7222222648561 - dev loss: 4.329398274421692 - dev acc: 1.4999999664723873


RuntimeError: ignored

#SAVE & LOAD trained model:

##TO SAVE

In [None]:
torch.save(model.state_dict(), './model/Trained_model.tsv')

##TO LOAD:

In [None]:
device = torch.device("cuda")
model = ForQuestionAnsweringClassifier(449,449)
model.load_state_dict(torch.load('./model/Trained_model.tsv'))
model.to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'distilbert.transformer.layer.1.attention.q_lin.weight', 'distilbert.transformer.layer.1.attention.q_lin.bias', 'distilbert.transformer.layer.1.attention.k_lin.weight', 'distilbert.transformer.layer.1.attention.k_lin.bias', 'distilbert.transformer.layer.1.attention.v_lin.weight', 'distilbert.transformer.layer.1.attention.v_lin.bias', 'distilbert.transformer.layer.1.attention.out_lin.weight', 'distilbert.transformer.layer.1.attention.out_lin.bias', 'distilbert.transformer.layer.1.sa_layer_norm.weight', 'distilbert.transformer.layer.1.sa_layer_norm.bias', 'distilbert.transformer.layer.1.ffn.lin1.weight', 'distilbert.transformer.layer.1.ffn.lin1.bias', 'distilbert.transformer.layer.1.ffn.lin2.weight', 'distilbert

ForQuestionAnsweringClassifier(
  (Distil_Bert_model): DistilBertForQuestionAnswering(
    (distilbert): DistilBertModel(
      (embeddings): Embeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (transformer): Transformer(
        (layer): ModuleList(
          (0): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, out_features=768, bias=True)
            )
            (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

**To Run on the loaded model, just go back to 'TRAINING AND EVALUATION' section.**

In [None]:
import gc
del model
gc.collect()

NameError: ignored