In [29]:
!pip install transformers datasets


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [30]:
from datasets import load_dataset

raw_datasets = load_dataset("squad")

In [31]:
print("Context: ", raw_datasets["train"][0]["context"])
print("Question: ", raw_datasets["train"][0]["question"])
print("Answer: ", raw_datasets["train"][0]["answers"])

Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer:  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}
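
In the record printed above, `answer_start` is a character offset into `context`, and `text` is the answer string found at that offset. The sketch below illustrates that relationship on a made-up sentence (the sentence and the helper name `extract_answer` are assumptions for illustration, not part of the dataset or its API).

```python
# Toy SQuAD-style record; the sentence is hypothetical, not a dataset row.
context = "The Grotto is a replica of the grotto at Lourdes, France."
answer_text = "Lourdes, France"
answers = {
    "text": [answer_text],
    # Character offset of the answer inside the context.
    "answer_start": [context.index(answer_text)],
}

def extract_answer(context, answers, i=0):
    """Slice the i-th answer back out of the context using its character offset."""
    start = answers["answer_start"][i]
    end = start + len(answers["text"][i])  # exclusive end offset
    return context[start:end]

recovered = extract_answer(context, answers)
print(recovered)
```

The same slice, `context[start:start + len(text)]`, is what the later alignment step relies on when it derives an end offset.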


In [32]:
print(raw_datasets['train'])

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})


In [33]:
print(raw_datasets['validation'])

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10570
})


In [34]:
print(raw_datasets['train']['question'][0:10])

['To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'What is in front of the Notre Dame Main Building?', 'The Basilica of the Sacred heart at Notre Dame is beside to which structure?', 'What is the Grotto at Notre Dame?', 'What sits on top of the Main Building at Notre Dame?', 'When did the Scholastic Magazine of Notre dame begin publishing?', "How often is Notre Dame's the Juggler published?", 'What is the daily student paper at Notre Dame called?', 'How many student news papers are found at Notre Dame?', 'In what year did the student paper Common Sense begin publication at Notre Dame?']


In [35]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

train_encodings = tokenizer(raw_datasets['train']['context'], raw_datasets['train']['question'], truncation=True, padding=True)
valid_encodings = tokenizer(raw_datasets['validation']['context'], raw_datasets['validation']['question'], truncation=True, padding=True)

In [36]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [37]:
print(len(train_encodings['input_ids']))

87599


In [38]:
print(train_encodings['input_ids'][0])

[101, 22182, 1193, 117, 1103, 1278, 1144, 170, 2336, 1959, 119, 1335, 4184, 1103, 4304, 4334, 112, 188, 2284, 10945, 1110, 170, 5404, 5921, 1104, 1103, 6567, 2090, 119, 13301, 1107, 1524, 1104, 1103, 4304, 4334, 1105, 4749, 1122, 117, 1110, 170, 7335, 5921, 1104, 4028, 1114, 1739, 1146, 14089, 5591, 1114, 1103, 7051, 107, 159, 21462, 1566, 24930, 2508, 152, 1306, 3965, 107, 119, 5893, 1106, 1103, 4304, 4334, 1110, 1103, 19349, 1104, 1103, 11373, 4641, 119, 13301, 1481, 1103, 171, 17506, 9538, 1110, 1103, 144, 10595, 2430, 117, 170, 14789, 1282, 1104, 8070, 1105, 9284, 119, 1135, 1110, 170, 16498, 1104, 1103, 176, 10595, 2430, 1120, 10111, 20500, 117, 1699, 1187, 1103, 6567, 2090, 25153, 1193, 1691, 1106, 2216, 17666, 6397, 3786, 1573, 25422, 13149, 1107, 8109, 119, 1335, 1103, 1322, 1104, 1103, 1514, 2797, 113, 1105, 1107, 170, 2904, 1413, 1115, 8200, 1194, 124, 11739, 1105, 1103, 3487, 17917, 114, 117, 1110, 170, 3014, 117, 2030, 2576, 5921, 1104, 2090, 119, 102, 1706, 2292, 1225, 110

In [39]:
text = tokenizer.decode(train_encodings['input_ids'][0])
print(text)
print(text[523:])
print(tokenizer.decode(101))

[CLS] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 
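
The decoded sequence above has the layout `[CLS] context [SEP] question [SEP]` followed by `[PAD]` tokens. As a rough illustration of how the three keys returned by the tokenizer relate to that layout, here is a toy pair encoder. It works on pre-split words rather than WordPiece ids, and `encode_pair` is a hypothetical helper, not a `transformers` function.

```python
def encode_pair(tokens_a, tokens_b, max_len):
    """Toy layout of a BERT-style pair: [CLS] A [SEP] B [SEP] + padding."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + A + first [SEP]; segment 1 covers B + final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)  # 1 = real token
    pad = max_len - len(tokens)
    if pad < 0:
        raise ValueError("pair is longer than max_len; a real tokenizer would truncate")
    return {
        "tokens": tokens + ["[PAD]"] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,  # 0 = padding, ignored by attention
    }

enc = encode_pair(["Mary", "appeared"], ["Who", "?"], max_len=8)
for key, value in enc.items():
    print(key, value)
```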

In [40]:
print(len(raw_datasets['train']))
filtered_dataset = raw_datasets["train"].filter(lambda x: len(x["answers"]["text"]) != 1)
print(filtered_dataset)
print(raw_datasets['train']['answers'][6])
print(raw_datasets['train']['context'][0][515:])

87599
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})


KeyboardInterrupt: 

In [None]:
filtered_dataset_valid = raw_datasets["validation"].filter(lambda x: len(x["answers"]["text"]) != 1)
print(filtered_dataset_valid)
print(raw_datasets['validation']['answers'][1004])
print(raw_datasets['validation']['context'][0])

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10567
})
{'text': ['location of Warsaw', 'location of Warsaw', 'location'], 'answer_start': [104, 104, 104]}
Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
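
Unlike the training split, almost every validation example carries several reference answers, as the `'location of Warsaw'` record shows. A common way to score against multiple references is to take the best score over all of them. The sketch below assumes that convention; `exact` and `best_over_references` are illustrative names, and the normalization here is deliberately simpler than a full SQuAD-style one.

```python
def exact(prediction, reference):
    """Case-insensitive exact match (simplified normalization for illustration)."""
    return prediction.strip().lower() == reference.strip().lower()

def best_over_references(prediction, references, metric):
    """Score a prediction against every reference and keep the best result."""
    return max(metric(prediction, ref) for ref in references)

# Reference answers copied from the validation record printed above.
references = ["location of Warsaw", "location of Warsaw", "location"]

score_short = best_over_references("location", references, exact)
score_miss = best_over_references("Warsaw", references, exact)
print(score_short, score_miss)
```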


In [None]:
raw_datasets['train'][3276]
raw_datasets['train'][3276]['context'][2982]

# raw_datasets['train'][1]
# raw_datasets['train'][1]['context'][212]

'S'

In [None]:
def add_end_idx(answers, contexts):
  for answer, context in zip(answers, contexts):
    gold_text = answer['text'][0]
    start_idx = answer['answer_start'][0]
    end_idx = start_idx + len(gold_text)

    # sometimes squad answers are off by a character or two so we fix this
    if context[start_idx:end_idx] == gold_text:
      answer['answer_end'] = [end_idx]
    elif context[start_idx-1:end_idx-1] == gold_text:
      answer['answer_start'] = [start_idx - 1]
      answer['answer_end'] = [end_idx - 1]     # When the gold label is off by one character
    elif context[start_idx-2:end_idx-2] == gold_text:
      answer['answer_start'] = [start_idx - 2]
      answer['answer_end'] = [end_idx - 2]     # When the gold label is off by two characters

train_context = raw_datasets['train']['context']
train_answers = raw_datasets['train']['answers']

add_end_idx(train_answers, train_context)
print(train_answers[3276])
print(train_answers[1])
# add_end_idx(raw_datasets['validation']['answers'], raw_datasets['validation']['context'])

{'text': ['Skyfall'], 'answer_start': [2982], 'answer_end': [2989]}
{'text': ['a copper statue of Christ'], 'answer_start': [188], 'answer_end': [213]}
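
The alignment logic in `add_end_idx` can be checked in isolation. The following is a minimal sketch of the same idea (try the labeled offset, then shift left by one or two characters) written as a pure function. The name `align_answer`, the `max_shift` parameter, and the sample sentence are assumptions for illustration. Unlike `add_end_idx`, it returns `None` explicitly when no shift matches.

```python
def align_answer(context, text, start, max_shift=2):
    """Return (start, end) character offsets where `text` really occurs,
    trying the labeled start and then up to `max_shift` characters earlier."""
    for shift in range(max_shift + 1):
        s = start - shift
        if s >= 0 and context[s:s + len(text)] == text:
            return s, s + len(text)
    return None  # the label cannot be repaired by a small left shift

context = "Immediately in front is a copper statue of Christ."
text = "a copper statue"
true_start = context.index(text)

good = align_answer(context, text, true_start)         # label already correct
shifted = align_answer(context, text, true_start + 1)  # label off by one
missing = align_answer(context, "a bronze statue", 0)  # text not in context
print(good, shifted, missing)
```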


In [None]:
print(train_encodings.char_to_token(3276, 2982)) # (batch index, character index in the context)
print(tokenizer.decode(train_encodings['input_ids'][1][41]))
print(train_encodings.char_to_token(1, 212))
print(tokenizer.decode(train_encodings['input_ids'][1][45]))
print(tokenizer.model_max_length)

None
a
45
Christ
512
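
`char_to_token` returns `None` for example 3276 because the answer starts at character 2982, which lies beyond what survives truncation to 512 tokens. The sketch below mimics that behaviour with a plain list of character spans. It is a simplified stand-in that splits on whitespace and ignores special tokens, so the indices do not match the real tokenizer's.

```python
def char_to_token(offsets, char_idx):
    """Return the index of the token whose [start, end) span covers char_idx, else None."""
    for tok_idx, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return tok_idx
    return None

text = "a copper statue of Christ"

# Build whitespace-token character spans (hypothetical; BERT uses WordPiece).
offsets = []
pos = 0
for word in text.split(" "):
    offsets.append((pos, pos + len(word)))
    pos += len(word) + 1  # skip the separating space

truncated = offsets[:3]  # keep only the first three tokens, as truncation would

inside = char_to_token(offsets, text.index("Christ"))
cut_off = char_to_token(truncated, text.index("Christ"))
on_space = char_to_token(offsets, 1)  # a space belongs to no token
print(inside, cut_off, on_space)
```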


In [None]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i, answer in enumerate(answers):
        # char_to_token returns None when the character was truncated away
        start = encodings.char_to_token(i, answer['answer_start'][0])
        end = encodings.char_to_token(i, answer['answer_end'][0] - 1)
        if start is None or end is None:
            # answer is not fully inside the truncated input: label it with [CLS] (index 0)
            start_positions.append(0)
            end_positions.append(0)
        else:
            start_positions.append(start)
            end_positions.append(end)

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers)
        

In [None]:
train_encodings['end_positions']

[126,
 45,
 69,
 96,
 27,
 53,
 89,
 117,
 27,
 173,
 23,
 32,
 50,
 76,
 150,
 93,
 8,
 26,
 59,
 29,
 97,
 16,
 37,
 126,
 75,
 122,
 222,
 22,
 155,
 12,
 110,
 59,
 81,
 76,
 26,
 35,
 39,
 73,
 84,
 24,
 36,
 57,
 109,
 138,
 10,
 80,
 121,
 39,
 186,
 10,
 72,
 85,
 127,
 25,
 2,
 40,
 36,
 59,
 21,
 5,
 195,
 209,
 220,
 22,
 4,
 18,
 44,
 4,
 18,
 25,
 83,
 138,
 165,
 19,
 57,
 302,
 334,
 12,
 3,
 29,
 70,
 9,
 75,
 25,
 52,
 78,
 268,
 62,
 134,
 18,
 51,
 76,
 171,
 48,
 121,
 5,
 15,
 87,
 28,
 46,
 77,
 58,
 63,
 58,
 111,
 251,
 169,
 172,
 8,
 8,
 69,
 73,
 129,
 93,
 115,
 156,
 16,
 122,
 19,
 48,
 73,
 30,
 120,
 21,
 135,
 209,
 260,
 36,
 23,
 72,
 92,
 130,
 50,
 2,
 23,
 36,
 100,
 120,
 35,
 43,
 73,
 95,
 153,
 16,
 20,
 42,
 81,
 133,
 22,
 64,
 74,
 120,
 345,
 28,
 95,
 127,
 162,
 35,
 62,
 106,
 38,
 6,
 17,
 24,
 45,
 91,
 16,
 11,
 25,
 48,
 54,
 4,
 30,
 36,
 58,
 117,
 78,
 87,
 113,
 134,
 8,
 32,
 48,
 101,
 116,
 6,
 32,
 59,
 73,
 81,
 16,
 30,
 44

In [None]:
print(train_encodings['start_positions'][3276])
print(train_encodings['end_positions'][3276])

0
0
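
Both positions are 0 for example 3276 because its answer was truncated away, so the labels point at `[CLS]`. To get a sense of how many examples end up with this fallback label, one could count the `(0, 0)` pairs. The helper below is a small sketch on toy lists; `truncation_stats` is a hypothetical name, and in the notebook it would be called on `train_encodings['start_positions']` and `train_encodings['end_positions']`.

```python
def truncation_stats(start_positions, end_positions):
    """Count examples whose labels were set to the (0, 0) fallback."""
    total = len(start_positions)
    fallback = sum(
        1 for s, e in zip(start_positions, end_positions) if s == 0 and e == 0
    )
    fraction = fallback / total if total else 0.0
    return fallback, fraction

# Toy label lists (made up); the third example uses the fallback.
toy_starts = [120, 41, 0, 7]
toy_ends = [126, 45, 0, 9]

count, fraction = truncation_stats(toy_starts, toy_ends)
print(count, fraction)
```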


In [None]:
def show_answer(idx):
    start = train_encodings['start_positions'][idx]
    end = train_encodings['end_positions'][idx]
    # decode the labeled token span back to text
    decoded = tokenizer.decode(train_encodings['input_ids'][idx][start: end + 1])
    print(decoded)
    print(train_answers[idx]['text'][0])
    print(decoded == train_answers[idx]['text'][0])

In [None]:
show_answer(3275)

[CLS]
75
False


In [None]:
tokenizer.decode(train_encodings['input_ids'][1][train_encodings['start_positions'][1]: train_encodings['end_positions'][1] + 1])
tokenizer.decode(train_encodings['input_ids'][1][train_encodings['end_positions'][1]])
# tokenizer.model_max_length

'Christ'

In [None]:
import torch

class SQuAD_Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    def __len__(self):
        return len(self.encodings.input_ids)

In [None]:
train_dataset = SQuAD_Dataset(train_encodings)

In [None]:
train_dataset[3275]

{'input_ids': tensor([  101, 15247, 12647, 14089, 11794,  1104,  1103,  1273,  1108,  3216,
          1107,  1103,  1244,  1311,   119,  1130,   170,   181, 16140, 18900,
          3189,  1111,  4271,  2036,  7488,   119,  3254,   117,  3895,   163,
         12666,  1200, 22087,  8976,  1522,  1103,  1273,   123,   119,   126,
          2940,  1149,  1104,   125,   117,  7645,   156, 26426,  1874,  1112,
         22410,  1105,  3372,  1106,  2364,  4862,  1113,  1157,  3209,   119,
          8488, 17037,  4047,   117, 19730,  1103,  1273,  1111,  2238,  2460,
          2706,   117,  4803,  1115,   156, 26426,  1874,   107,  2502,  1228,
          1112,  8984,  1105,  8362,  4935, 23709,   107,   119,  2268, 10559,
          1742, 23612, 25019,  1104,  1109,  1203,  1365,  2706, 13316,  3540,
          1103,  1273,  1112,  1515,   107,  1720, 11567,   107,  1105, 21718,
          1665,  2047, 21361,  1158,  1157,  1560,  1785,  1111,  1103,  8590,
          1104,  2884,  1701,  5166,   

In [None]:
from torch.utils.data import DataLoader

# Define the dataloaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
# valid_loader = DataLoader(valid_dataset, batch_size=16)

In [None]:
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-base-cased')

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
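
The `qa_outputs` layer named in the warning is the span head: it produces one start score and one end score per token. Picking the two argmaxes independently can yield an end that comes before the start, which decodes to an empty string. The sketch below, on made-up scores and tokens, contrasts that with a joint search restricted to `start <= end`. The function `best_span` and its `max_answer_len` parameter are assumptions for illustration, not part of `transformers`.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score, with start <= end."""
    best_start, best_end, best_score = None, None, float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_start, best_end, best_score = s, e, score
    return best_start, best_end

# Hypothetical tokens and scores, chosen so the independent argmaxes disagree.
tokens = ["[CLS]", "who", "[SEP]", "Mathew", "Knowles"]
start_logits = [0.1, 0.2, 0.3, 4.0, 0.5]
end_logits = [0.1, 5.0, 0.2, 0.3, 3.0]

naive_start = max(range(len(start_logits)), key=start_logits.__getitem__)
naive_end = max(range(len(end_logits)), key=end_logits.__getitem__)
naive_answer = tokens[naive_start:naive_end + 1]  # empty when end < start

start, end = best_span(start_logits, end_logits)
joint_answer = tokens[start:end + 1]
print(naive_answer, joint_answer)
```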


In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f'Working on {device}')

Working on cuda


In [None]:
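from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer
from tqdm import tqdm  # progress bar

N_EPOCHS = 5
# weight_decay=0.0 matches the default of the deprecated transformers.AdamW
optim = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)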

model.to(device)
model.train()

for epoch in range(N_EPOCHS):
  loop = tqdm(train_loader, leave=True)
  for batch in loop:
    optim.zero_grad()
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_positions = batch['start_positions'].to(device)
    end_positions = batch['end_positions'].to(device)
    outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
    loss = outputs[0]
    loss.backward()
    optim.step()

    loop.set_description(f'Epoch {epoch+1}')
    loop.set_postfix(loss=loss.item())

Epoch 1: 100%|██████████| 5475/5475 [33:20<00:00,  2.74it/s, loss=1.22] 
Epoch 2: 100%|██████████| 5475/5475 [33:22<00:00,  2.73it/s, loss=1.03] 
Epoch 3: 100%|██████████| 5475/5475 [33:17<00:00,  2.74it/s, loss=1.36] 
Epoch 4: 100%|██████████| 5475/5475 [33:23<00:00,  2.73it/s, loss=0.389] 
Epoch 5: 100%|██████████| 5475/5475 [33:21<00:00,  2.74it/s, loss=0.272] 
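
The 5475 iterations per epoch in the progress bars follow from the dataset size and batch size: 87599 examples in batches of 16. A quick check of that arithmetic:

```python
import math

num_examples = 87599  # size of the SQuAD training split printed earlier
batch_size = 16       # DataLoader batch size used above
n_epochs = 5

# The DataLoader keeps the final partial batch, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (steps_per_epoch - 1) * batch_size
total_steps = steps_per_epoch * n_epochs

print(steps_per_epoch, last_batch_size, total_steps)
```

Note that the `loss` shown at the end of each bar is the loss of the last batch only (here a batch of 15 examples), so it is a noisy indicator; the jump from 1.03 to 1.36 between epochs 2 and 3 does not by itself mean training got worse.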


In [None]:
import os
model_path = os.getcwd()
print(model_path)

/


In [None]:
model_path = '/workspace/BERT-SQUAD/'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('/workspace/BERT-SQUAD/tokenizer_config.json',
 '/workspace/BERT-SQUAD/special_tokens_map.json',
 '/workspace/BERT-SQUAD/vocab.txt',
 '/workspace/BERT-SQUAD/added_tokens.json',
 '/workspace/BERT-SQUAD/tokenizer.json')

In [None]:
from transformers import BertForQuestionAnswering, BertTokenizerFast

model_path = '/workspace/BERT-SQUAD'
model = BertForQuestionAnswering.from_pretrained(model_path)
tokenizer = BertTokenizerFast.from_pretrained(model_path)

model = model.to(device)
print(model)

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elem

In [None]:
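def get_prediction(context, question):
  # NOTE: the pair is encoded as (question, context) here, whereas training encoded (context, question)
  inputs = tokenizer.encode_plus(question, context, return_tensors='pt').to(device)
  with torch.no_grad():  # no gradients are needed at inference time
    outputs = model(**inputs)
  
  answer_start = torch.argmax(outputs[0])    # index of the highest start logit
  answer_end = torch.argmax(outputs[1]) + 1  # index after the highest end logit (exclusive bound)
  
  # an empty string is returned when answer_end <= answer_start
  answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))
  
  return answer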

def normalize_text(s):
  """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
  import string, re
  def remove_articles(text):
    regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
    return re.sub(regex, " ", text)
  def white_space_fix(text):
    return " ".join(text.split())
  def remove_punc(text):
    exclude = set(string.punctuation)
    return "".join(ch for ch in text if ch not in exclude)
  def lower(text):
    return text.lower()

  return white_space_fix(remove_articles(remove_punc(lower(s))))

def exact_match(prediction, truth):
    return bool(normalize_text(prediction) == normalize_text(truth))

def compute_f1(prediction, truth):
  from collections import Counter

  pred_tokens = normalize_text(prediction).split()
  truth_tokens = normalize_text(truth).split()
  
  # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
  if len(pred_tokens) == 0 or len(truth_tokens) == 0:
    return int(pred_tokens == truth_tokens)
  
  # count shared tokens as a multiset so that repeated words are not collapsed into one
  num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
  
  # if there are no common tokens then f1 = 0
  if num_common == 0:
    return 0
  
  prec = num_common / len(pred_tokens)
  rec = num_common / len(truth_tokens)
  
  return round(2 * (prec * rec) / (prec + rec), 2)
  
def question_answer(context, question, answer):
  prediction = get_prediction(context, question)
  em_score = exact_match(prediction, answer)
  f1_score = compute_f1(prediction, answer)

  print(f'Question: {question}')
  print(f'Prediction: {prediction}')
  print(f'True Answer: {answer}')
  print(f'Exact match: {em_score}')
  print(f'F1 score: {f1_score}\n')
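
As a worked example of the token-level F1 used here: with precision $P$ and recall $R$ over tokens, $F_1 = 2PR/(P+R)$. The standalone sketch below counts overlap as a multiset with `collections.Counter`, which is how the official SQuAD evaluation script counts it, so repeated tokens are not under-counted. `token_f1` is an illustrative name, and the inputs are assumed to be already normalized and split.

```python
from collections import Counter

def token_f1(pred_tokens, truth_tokens):
    """Token-level F1 with multiset overlap between prediction and reference."""
    common = Counter(pred_tokens) & Counter(truth_tokens)  # per-token minimum counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

# One of four tokens differs ("beyoncé" vs "beyonce"): P = R = 3/4.
pred = "beyoncé giselle knowles carter".split()
truth = "beyonce giselle knowles carter".split()
f1_accent = token_f1(pred, truth)

# Identical answers with a repeated token should still score 1.0.
f1_repeat = token_f1(["bye", "bye"], ["bye", "bye"])

print(f1_accent, f1_repeat)
```

The first case explains why a prediction that differs from the reference only by an accent fails exact match and loses a quarter of its F1.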

In [None]:
context = """Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, 
          songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing 
          and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. 
          Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. 
          Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, 
          earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy"."""


questions = ["For whom the passage is talking about?",
             "When did Beyonce born?",
             "Where did Beyonce born?",
             "What is Beyonce's nationality?",
             "Who was the Destiny's group manager?",
             "What name has the Beyoncé's debut album?",
             "How many Grammy Awards did Beyonce earn?",
             "When did the Beyoncé's debut album release?",
             "Who was the lead singer of R&B girl-group Destiny's Child?"]

answers = ["Beyonce Giselle Knowles - Carter", "September 4, 1981", "Houston, Texas", 
           "American", "Mathew Knowles", "Dangerously in Love", "five", "2003", 
           "Beyonce Giselle Knowles - Carter"]

for question, answer in zip(questions, answers):
  question_answer(context, question, answer)

Question: For whom the passage is talking about?
Prediction: Beyoncé Giselle Knowles - Carter
True Answer: Beyonce Giselle Knowles - Carter
Exact match: False
F1 score: 0.75

Question: When did Beyonce born?
Prediction: September 4, 1981
True Answer: September 4, 1981
Exact match: True
F1 score: 1.0

Question: Where did Beyonce born?
Prediction: Houston, Texas
True Answer: Houston, Texas
Exact match: True
F1 score: 1.0

Question: What is Beyonce's nationality?
Prediction: American
True Answer: American
Exact match: True
F1 score: 1.0

Question: Who was the Destiny's group manager?
Prediction: 
True Answer: Mathew Knowles
Exact match: False
F1 score: 0

Question: What name has the Beyoncé's debut album?
Prediction: Dangerously in Love
True Answer: Dangerously in Love
Exact match: True
F1 score: 1.0

Question: How many Grammy Awards did Beyonce earn?
Prediction: five
True Answer: five
Exact match: True
F1 score: 1.0

Question: When did the Beyoncé's debut album release?
Prediction: 2003
