# Lab 1 - Question Answering

## Load Dataset (Squad)

#### Retrieval Augmented Generation (RAG)

In [None]:
from datasets import load_dataset
rag_dataset = load_dataset("neural-bridge/rag-dataset-12000")


#### Stanford Question Answering Dataset (SQuAD)

In [85]:
squad = load_dataset("squad", split="train[:5000]")
squad = squad.train_test_split(test_size=0.2)

In [3]:
rag_dataset['train'][0]

{'context': 'Caption: Tasmanian berry grower Nic Hansen showing Macau chef Antimo Merone around his property as part of export engagement activities.\nTHE RISE and rise of the Australian strawberry, raspberry and blackberry industries has seen the sectors redouble their international trade focus, with the release of a dedicated export plan to grow their global presence over the next 10 years.\nDriven by significant grower input, the Berry Export Summary 2028 maps the sectors’ current position, where they want to be, high-opportunity markets and next steps.\nHort Innovation trade manager Jenny Van de Meeberg said the value and volume of raspberry and blackberry exports rose by 100 per cent between 2016 and 2017. She said the Australian strawberry industry experienced similar success with an almost 30 per cent rise in export volume and a 26 per cent rise in value to $32.6M over the same period.\n“Australian berry sectors are in a firm position at the moment,” she said. “Production, adopt

## Preprocessing

In [62]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")



There are a few preprocessing steps particular to question answering tasks you should be aware of:

    1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the context by setting truncation="only_second".
    2. Next, map the start and end positions of the answer to the original context by setting return_offsets_mapping=True.
    3. With the mapping in hand, now you can find the start and end tokens of the answer. Use the sequence_ids method to find which part of the offset corresponds to the question and which corresponds to the context.
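The token-lookup step can be sketched without a tokenizer. The example below is a minimal illustration, not the notebook's own code: the helper name `char_span_to_token_span` is hypothetical, and the offsets and sequence ids are written by hand to resemble what a fast tokenizer might return for the question "Who?" and the context "Ann won." (special tokens get offset `(0, 0)` and sequence id `None`).

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map an answer's character span to token indices.

    Returns (0, 0) when the answer is not fully inside the context tokens.
    Hypothetical helper for illustration only.
    """
    # First and last tokens that belong to the context (sequence id 1)
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # Answer cut off by truncation: label it (0, 0)
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token that starts at or before start_char
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after end_char
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Hand-written stand-ins for tokenizer output (assumed shapes, not real output):
# tokens:      [CLS]   who     ?       [SEP]   ann     won     .       [SEP]
offsets =      [(0, 0), (0, 3), (3, 4), (0, 0), (0, 3), (4, 7), (7, 8), (0, 0)]
sequence_ids = [None,   0,      0,      None,   1,      1,      1,      None]

context = "Ann won."
start_char = context.index("won")
end_char = start_char + len("won")
won_span = char_span_to_token_span(offsets, sequence_ids, start_char, end_char)

# Simulate truncation that keeps only the first context token
truncated_span = char_span_to_token_span(
    offsets[:5] + [(0, 0)], sequence_ids[:5] + [None], start_char, end_char
)
```

Here `won_span` points at the single token covering "won", while `truncated_span` falls back to `(0, 0)` because the answer no longer fits in the kept context.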

In [81]:
squad['answers'][0]['answer_start'][0]

515

#### Preprocess function

In [45]:
def preprocess_function(examples):
    questions = [str(q) for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
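    # NOTE: the loop below expects SQuAD-style answer dicts with "text" and
    # "answer_start" (SQuAD stores them under "answers"). In rag_dataset the
    # "answer" column holds plain strings, so answer["answer_start"] raises
    # a TypeError there.
    answers = examples["answer"]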
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. Setting batched=True speeds up map by passing multiple elements of the dataset to the function at once.
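As a rough sketch of what batching means for the function being mapped: with batched=True, the function receives a dict that maps each column name to a list of values, and it should return a dict of equally long lists. The function name `add_answer_span` and the two-row batch below are made up for illustration; the batch is built by hand in the shape of SQuAD rows rather than produced by map.

```python
def add_answer_span(batch):
    """Batched function: each value in `batch` is a list, one entry per row."""
    starts = [ans["answer_start"][0] for ans in batch["answers"]]
    ends = [
        start + len(ans["text"][0])
        for start, ans in zip(starts, batch["answers"])
    ]
    # Return new columns as lists with one entry per input row
    return {"answer_start": starts, "answer_end": ends}


# Hand-built batch in SQuAD's column layout (illustrative data only)
batch = {
    "context": ["Ann won the race.", "Paris is in France."],
    "answers": [
        {"text": ["Ann"], "answer_start": [0]},
        {"text": ["France"], "answer_start": [12]},
    ],
}

out = add_answer_span(batch)

# Sanity check: slicing each context with the computed span recovers the answer
recovered = [
    ctx[s:e]
    for ctx, s, e in zip(batch["context"], out["answer_start"], out["answer_end"])
]
```

A function written this way could be passed to map with batched=True; without batching, the same function would instead receive one row at a time, with scalar values rather than lists.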

In [46]:
tokenized_squad = rag_dataset.map(preprocess_function, batched=True)  # remove_columns=rag_dataset["train"].column_names)

Map:   0%|          | 0/9600 [00:05<?, ? examples/s]


TypeError: string indices must be integers, not 'str'

In [54]:
print(rag_dataset['train']['context'][3])

How unequal is India? The question is simple, the answer is not.
For some 60 years, the only reliable information about India’s inequality was coming from the annual National Sample Survey conducted from 1951. NSS is one of the most venerable surveys in the world of poverty and income distribution statistics. India started fielding it soon after its independence: the survey was supposed to track how the new government fought poverty, to provide information on caste differences, rural-urban gap, caloric intake, especially of the poor and many other statistics. Since its main concern was with poverty, the decision was made to survey consumption, that is, how much people actually consume (do they have sufficient number of calories) rather than income (how many rupees they earn).
For all these decades since 1951 NSS was the key instrument that allowed researchers from India and the rest of the world as well as Indian policymakers to know what is happening with India’s population. India cou

In [83]:
def prep(exp):
    ans = exp['answers']['text'][0]
    exp['answer_start'] = exp['answers']['answer_start'][0]
    exp['answer_end'] = exp['answer_start'] + len(ans)
    return exp

In [84]:
squad.map(prep, remove_columns=['answers'])

Map: 100%|██████████| 5000/5000 [00:02<00:00, 1701.60 examples/s]


Dataset({
    features: ['id', 'title', 'context', 'question', 'answer_start', 'answer_end'],
    num_rows: 5000
})