<a href="https://colab.research.google.com/github/Mahdi-Golizadeh/Natural-Language-Processing/blob/main/transformers/question_answering/Q%26A_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q datasets
!pip install -q transformers
!pip install -q evaluate

In [2]:
import datasets
import transformers
import torch
import collections
import numpy as np
import evaluate
from tqdm.auto import tqdm

In [3]:
raw_datasets = datasets.load_dataset("squad")



  0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [5]:
raw_datasets["train"][9]

{'id': '5733bf84d058e614000b61c1',
 'title': 'University_of_Notre_Dame',
 'context': "As at most other universities, Notre Dame's students run a number of news media outlets. The nine student-run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one-page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States. The other magazine, The Juggler, is released twice a year and focuses on student literature and artwork. The Dome yearbook is published annually. The newspapers have varying publication interests, with The Observer published daily and mainly reporting university and other news, and staffed by students from both Notre Dame and Saint Mary's College. Unlike Scholastic and The Dome, The Observer is an independent publication and does not have a faculty advisor or any editorial oversight from the University. In 1987, 

to check if a sample has more than one answer

In [6]:
raw_datasets["train"].filter(lambda x: len(x["answers"]["text"]) != 1)



Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

the result shows that none of the dataset sample has more than one answer

In [7]:
raw_datasets["validation"].filter(lambda x:len(x["answers"]["text"]) != 1)



Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10567
})

In [8]:
checkpoint = "bert-base-cased"

In [9]:
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)

In [10]:
tokenizer

PreTrainedTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [11]:
tokenizer.is_fast

True

example of tokenizing the input

our model input will be a combination of context and question

In [12]:
sample = raw_datasets["train"][10]
context = sample["context"]
question = sample["question"]

In [13]:
inputs = tokenizer(question, context)

In [14]:
tokenizer.decode(inputs["input_ids"])

'[CLS] Where is the headquarters of the Congregation of the Holy Cross? [SEP] The university is the major seat of the Congregation of Holy Cross ( albeit not its official headquarters, which are in Rome ). Its main seminary, Moreau Seminary, is located on the campus across St. Joseph lake from the Main Building. Old College, the oldest building on campus and located near the shore of St. Mary lake, houses undergraduate seminarians. Retired priests and brothers reside in Fatima House ( a former retreat center ), Holy Cross House, as well as Columba Hall near the Grotto. The university through the Moreau Seminary has ties to theologian Frederick Buechner. While not Catholic, Buechner has praised writers from Notre Dame and Moreau Seminary created a Buechner Prize for Preaching. [SEP]'

In [15]:
max_length = 384
stride = 64

In [16]:
max_length = 384
stride = 128
def preprocess_training_examples(examples):
    ###################################################################
    #tokenizing and chunking samples
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(questions, examples["context"],
                       max_length= max_length,
                       truncation= "only_second",
                       stride= stride,
                       return_overflowing_tokens= True,
                       return_offsets_mapping= True,
                       padding= "max_length",
                       )
    ####################################################################
    #assigning every chunk its answer location
    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []
    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)
        idx = 0
        while sequence_ids[idx] == 1:
            idx += 1
        context_start = idx 
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [17]:
train_dataset = raw_datasets["train"].map(
    preprocess_training_examples,
    batched= True,
    remove_columns= raw_datasets["train"].column_names,
)



In [18]:
train_dataset = train_dataset.select(range(10000))
train_dataset

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 10000
})

In [19]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(questions, examples["context"],
                       max_length= max_length,
                       truncation= "only_second",
                       stride= stride,
                       return_overflowing_tokens= True,
                       return_offsets_mapping= True,
                       padding= "max_length",
                       )
    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []
    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])
        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)]
    inputs["example_ids"] = example_ids
    return inputs

In [20]:
validation_dataset = raw_datasets["validation"].shuffle().select(range(500)).map(preprocess_validation_examples, 
                                                    batched= True,
                                                    remove_columns= raw_datasets["validation"].column_names)

  0%|          | 0/1 [00:00<?, ?ba/s]

stablishing a baseline

In [21]:
small_eval_set = raw_datasets["validation"].select(range(100))

In [22]:
trained_checkpoint = "distilbert-base-cased-distilled-squad"
train_tokenizer = transformers.AutoTokenizer.from_pretrained(trained_checkpoint)

In [23]:
eval_set = small_eval_set.map(preprocess_validation_examples, batched= True,
                              remove_columns= raw_datasets["validation"].column_names,)
eval_set_for_model = eval_set.remove_columns(["example_ids", "offset_mapping", "token_type_ids"])
eval_set_for_model.set_format("torch")



In [24]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)

cuda


In [25]:
batch = {k: eval_set_for_model[k].to(device) for k in eval_set_for_model.column_names}

In [26]:
trained_model = transformers.AutoModelForQuestionAnswering.from_pretrained(trained_checkpoint).to(device)
with torch.no_grad():
    outputs = trained_model(**batch)

start_logits = outputs.start_logits.cpu().numpy()
end_logits = outputs.end_logits.cpu().numpy()
example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(eval_set):
    example_to_features[feature["example_ids"]].append(idx)

In [27]:
n_best = 20
max_answer_length = 30
predicted_answers = []
for example in small_eval_set:
    example_id = example["id"]
    context = example["context"]
    answers = []
    for feature_index in example_to_features[example_id]:
        start_logit = start_logits[feature_index]
        end_logit = end_logits[feature_index]
        offsets = eval_set["offset_mapping"][feature_index]
        start_indexes = np.argsort(start_logit)[-1: -n_best - 1:-1].tolist()
        end_indexes = np.argsort(end_logit)[-1: -n_best - 1:-1].tolist()
        for start_index in start_indexes:
            for end_index in end_indexes:
                if offsets[start_index] is None or offsets[end_index] is None:
                    continue
                if (end_index < start_index or end_index - start_index + 1 > max_answer_length):
                    continue
                answers.append({
                    "text": context[offsets[start_index][0]: offsets[end_index][1]],
                    "logit_score": start_logit[start_index] + end_logit[end_index],
                })
    best_answer = max(answers, key= lambda x: x["logit_score"])
    predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})

In [28]:
metric = evaluate.load('squad')

In [29]:
theoritical_answers = [{"id": ex["id"],"answers": ex["answers"]} for ex in small_eval_set]
print(predicted_answers[0], theoritical_answers[0], sep= "\n")

{'id': '56be4db0acb8001400a502ec', 'prediction_text': 'Denver Broncos'}
{'id': '56be4db0acb8001400a502ec', 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}}


In [30]:
metric.compute(predictions= predicted_answers,
               references= theoritical_answers)

{'exact_match': 83.0, 'f1': 88.25000000000004}

In [31]:
def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_ids"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)

In [32]:
compute_metrics(start_logits, end_logits, eval_set, small_eval_set)

  0%|          | 0/100 [00:00<?, ?it/s]

{'exact_match': 83.0, 'f1': 88.25000000000004}

In [33]:
model = transformers.AutoModelForQuestionAnswering.from_pretrained(checkpoint)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and a

In [34]:
args = transformers.TrainingArguments(
    "bert-finetuned-squad",
    evaluation_strategy= "no",
    save_strategy= "epoch",
    learning_rate= 2e-5,
    num_train_epochs= 2,
    weight_decay= .01,
    fp16= True,
)

In [35]:
trainer = transformers.Trainer(
    model= model,
    args= args,
    train_dataset= train_dataset,
    eval_dataset= validation_dataset,
    tokenizer= tokenizer,
)

Using cuda_amp half precision backend


In [36]:
trainer.train()

***** Running training *****
  Num examples = 10000
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2500
  Number of trainable parameters = 107721218
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.1774
1000,0.0001
1500,0.0001
2000,0.0
2500,0.0


Saving model checkpoint to bert-finetuned-squad/checkpoint-1250
Configuration saved in bert-finetuned-squad/checkpoint-1250/config.json
Model weights saved in bert-finetuned-squad/checkpoint-1250/pytorch_model.bin
tokenizer config file saved in bert-finetuned-squad/checkpoint-1250/tokenizer_config.json
Special tokens file saved in bert-finetuned-squad/checkpoint-1250/special_tokens_map.json
Saving model checkpoint to bert-finetuned-squad/checkpoint-2500
Configuration saved in bert-finetuned-squad/checkpoint-2500/config.json
Model weights saved in bert-finetuned-squad/checkpoint-2500/pytorch_model.bin
tokenizer config file saved in bert-finetuned-squad/checkpoint-2500/tokenizer_config.json
Special tokens file saved in bert-finetuned-squad/checkpoint-2500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=2500, training_loss=0.035521170222759246, metrics={'train_runtime': 606.9233, 'train_samples_per_second': 32.953, 'train_steps_per_second': 4.119, 'total_flos': 3919451351040000.0, 'train_loss': 0.035521170222759246, 'epoch': 2.0})