<a href="https://colab.research.google.com/github/Armadillobambi/NLP_squad/blob/main/QA_SQuAD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
!pip install datasets
!pip install pytorch_transformers
!pip install accelerate -U
!pip install transformers[torch]



In [6]:
import torch
import pandas as pd
from transformers import pipeline
# pytorch_transformers is the legacy name of the transformers library; import from transformers instead
from transformers import RobertaTokenizer
from transformers import RobertaModel

# Preprocessing

https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/main_classes#datasets.DatasetDict

In [10]:
from datasets import load_dataset
train_data = load_dataset("rajpurkar/squad", split="train")
validation_data = load_dataset("rajpurkar/squad", split="validation")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [7]:
# Create a configuration object that holds the model's hyperparameters

# Adapted from: https://medium.com/analytics-vidhya/6-steps-to-build-roberta-a-robustly-optimised-bert-pretraining-approach-e508ebe78b96
# Change the paths and tune the values as needed!

class Config(dict):
  def __init__(self, **kwargs):
    super().__init__(**kwargs)
    for k, v in kwargs.items():
      setattr(self, k, v)

  def set(self, key, val):
    self[key] = val
    setattr(self, key, val)

In [8]:
config = Config(
    seed = 18,
    roberta_model_name = 'roberta-base',
    max_lr = 1e-5,
    max_epochs = 1,
    max_steps = -1,                        # takes precedence over max_epochs
    precision = 16,
    batch_size = 4,
    max_seq_len = 320,
    hidden_dropout_prob = 0.05,
    hidden_size = 768,                     # 768 for roberta-base, 1024 for roberta-large
    valid_pct = 0.20,
    start_tok = "<s>",
    end_tok = "</s>",
    model_path = 'Model_Roberta.pkl',
    pred_path = 'Prediction_Roberta.csv',
    train_file_path = 'train.csv',
    test_file_path = 'test.csv',
    text_column_name = 'question',
    target_column_name = 'answer'
)

    seed: random seed, so the split into training and validation sets is reproducible
    max_lr: maximum learning rate
    max_epochs: number of training epochs
    batch_size: set to 4 because of the GPU memory limit
    max_seq_len: maximum number of tokens per input sequence
    hidden_dropout_prob: dropout probability for the hidden layers
    hidden_size: 1024 for roberta-large and 768 for roberta-base
    valid_pct: fraction of the data held out for validation
    start_tok: start-of-sequence token
    end_tok: end-of-sequence token
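
As a quick illustration of how the `Config` helper above behaves, the sketch below repeats the class so that it runs on its own and shows that every entry is reachable both as a dictionary key and as an attribute. The `demo` name and the values passed to it are only for illustration.

```python
class Config(dict):
    """Dictionary whose entries are also exposed as attributes."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for k, v in kwargs.items():
            setattr(self, k, v)

    def set(self, key, val):
        self[key] = val
        setattr(self, key, val)


# `demo` is a throwaway instance used only for this illustration.
demo = Config(max_lr=1e-5, batch_size=4)

# Both access styles return the same value.
print(demo.max_lr, demo["max_lr"])

# set() keeps the dictionary key and the attribute in sync.
demo.set("batch_size", 8)
print(demo.batch_size, demo["batch_size"])
```

Assigning `demo.batch_size = 8` directly would update only the attribute and leave `demo["batch_size"]` unchanged, which is why the class provides `set()`.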

In [11]:
train_data[0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
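
The `answer_start` field is a character offset into `context`, so slicing the context at that offset should recover the answer text. A minimal check, with the values copied by hand from the record printed above:

```python
# Context, answer text, and offset copied from train_data[0] as printed above.
context = (
    "Architecturally, the school has a Catholic character. "
    "Atop the Main Building's gold dome is a golden statue of the Virgin Mary. "
    "Immediately in front of the Main Building and facing it, is a copper statue "
    'of Christ with arms upraised with the legend "Venite Ad Me Omnes". '
    "Next to the Main Building is the Basilica of the Sacred Heart. "
    "Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. "
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly "
    "appeared to Saint Bernadette Soubirous in 1858. "
    "At the end of the main drive (and in a direct line that connects through 3 statues "
    "and the Gold Dome), is a simple, modern stone statue of Mary."
)
answer_text = "Saint Bernadette Soubirous"
answer_start = 515

# The answer occupies the half-open character range [answer_start, end_char).
end_char = answer_start + len(answer_text)
span = context[answer_start:end_char]
print(span)
```

This is the same `start_char` / `end_char` pair that the preprocessing step later converts into token positions.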

Do pre-processing and set up tokenizer

In [12]:
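from transformers import RobertaTokenizer
# The slow (pure-Python) RobertaTokenizer does not support return_offsets_mapping,
# which the preprocessing below relies on, so it is replaced by a fast tokenizer in the next cell.
tokenizer = RobertaTokenizer.from_pretrained("FacebookAI/roberta-base")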

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")

Import and configure model

To preprocess the data, we need to truncate contexts that are too long for the model and map each answer's character span to the positions of its first and last tokens.
<!-- We need to create a function to truncate and map the start and end tokens of the data. -->

In [14]:
# from https://huggingface.co/docs/transformers/tasks/question_answering, still needs some adjusting
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
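
To make the labelling logic easier to follow, here is a toy version that runs without a tokenizer. It is a sketch under stated assumptions: the whitespace "tokenizer", the helper name `char_span_to_token_span`, and the example sentence are all hypothetical, and the helper uses a strict containment check (both ends of the answer must lie inside the context window).

```python
import re


def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to (start_token, end_token); (0, 0) if it is not fully inside the window."""
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token that starts at or before start_char.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after end_char.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Hypothetical whitespace tokenization; (0, 0) stands in for the special tokens.
context = "The cat sat on the mat"
word_offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)]
offsets = [(0, 0)] + word_offsets + [(0, 0)]

answer = "sat on"
start_char = context.find(answer)
end_char = start_char + len(answer)

# Full window: context tokens are at indices 1..6.
full = char_span_to_token_span(offsets, 1, 6, start_char, end_char)
# Truncated window ending at "sat" (index 3): the answer no longer fits.
truncated = char_span_to_token_span(offsets, 1, 3, start_char, end_char)
print(full, truncated)
```

In the full window the answer maps to the tokens "sat" and "on"; in the truncated window it falls back to `(0, 0)`, the same convention `preprocess_function` uses for answers cut off by truncation.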

In [15]:
tokenized_training_data = train_data.map(preprocess_function, batched=True, remove_columns=train_data.column_names)
tokenized_validation_data = validation_data.map(preprocess_function, batched=True, remove_columns=validation_data.column_names)

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [19]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

# Training the model

In [22]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
model = AutoModelForQuestionAnswering.from_pretrained('roberta-base')

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
training_args = TrainingArguments(
    output_dir="qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_training_data,
    eval_dataset=tokenized_validation_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 
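
For a rough sense of how long the run above is, the number of optimizer steps can be worked out from the split size and the `TrainingArguments` values. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and one tokenized feature per example (the preprocessing above does not split long contexts into several features).

```python
import math

train_examples = 87_599   # size of the SQuAD train split loaded above
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs

# The last, smaller batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 5475 16425
```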