--------------------------
**Author**: Gunnvant

**Description**: QA dataset and training pipeline

--------------------------

In [2]:
from datasets import load_dataset


In [3]:
raw_data = load_dataset("csv",data_files="../qa/SQuAD_csv.csv")

Downloading data files: 100%|█████████████████████████████████████████| 1/1 [00:00<00:00, 9510.89it/s]
Extracting data files: 100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 279.42it/s]
Generating train split: 86821 examples [00:00, 174678.58 examples/s]


In [5]:
raw_data = raw_data.remove_columns("Unnamed: 0")

In [19]:
raw_data['train'][0:2]

{'context': ['Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
  'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 19

In [8]:
from transformers import AutoTokenizer
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading (…)okenizer_config.json: 100%|█████████████████████████| 29.0/29.0 [00:00<00:00, 48.1kB/s]
Downloading (…)lve/main/config.json: 100%|███████████████████████████| 570/570 [00:00<00:00, 1.23MB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████████████████████| 213k/213k [00:00<00:00, 524kB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████████████████████| 436k/436k [00:00<00:00, 707kB/s]


In [9]:
context = raw_data["train"][0]["context"]
question = raw_data["train"][0]["question"]

In [10]:
inputs = tokenizer(question, context)
tokenizer.decode(inputs["input_ids"])

'[CLS] When did Beyonce start becoming popular? [SEP] Beyoncé Giselle Knowles - Carter ( / [UNK] / bee - YON - say ) ( born September 4, 1981 ) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R & B girl - group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best - selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love ( 2003 ), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number - one singles " Crazy in Love " and " Baby Boy ". [SEP]'

The tokenized data follows the general format `[CLS] question [SEP] context [SEP]`. The separator tokens mark where the question ends and the context begins, which is how the model tells the two apart.
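
As a rough illustration of that layout, here is a minimal pure-Python sketch. It does not call the real tokenizer: `build_pair` is a hypothetical helper and the word lists are made up, so treat it as a picture of the format rather than of BERT's actual subword tokenization.

```python
def build_pair(question_tokens, context_tokens):
    """Lay out a question/context pair in the [CLS] q [SEP] c [SEP] format."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, token_type_ids


# Hypothetical toy input, already split into "tokens"
tokens, token_type_ids = build_pair(["When", "?"], ["In", "the", "1990s"])
print(tokens)
print(token_type_ids)
```

The second list plays the role of `token_type_ids` in the tokenizer output: it is one more signal, besides `[SEP]`, that separates question positions from context positions.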

#### Issues with large context sizes:

When the context is long, the tokenized input can exceed the model's maximum input length or make training expensive. The usual approach is to split the context into chunks and train on one `question -> chunk` pair per chunk.

Consecutive chunks usually share some tokens, so that an answer falling on a chunk boundary still appears whole in at least one chunk. The tokenizer can produce these overlapping chunks for us when called with the right options.
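
To make the idea concrete before using the tokenizer, here is a minimal sliding-window sketch in plain Python. The helper `chunk_with_stride` and its parameter names are assumptions made for illustration; it mimics the overlap behaviour described above and is not the tokenizer's actual implementation.

```python
def chunk_with_stride(token_ids, window, stride):
    """Split token_ids into chunks of at most `window` tokens.

    Consecutive chunks share `stride` tokens, so each new chunk
    starts `window - stride` tokens after the previous one.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this chunk already reaches the end of the sequence
        start += step
    return chunks


# Toy "context" of 10 token ids, window of 4, overlap of 2
chunks = chunk_with_stride(list(range(10)), window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

In the tokenizer call that follows, `max_length` also counts the question and the special tokens, so the window left for the context is smaller than `max_length`.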

In [11]:
inputs = tokenizer(
    question,
    context,
    max_length=100,  # max tokens per chunk, including the question and special tokens
    truncation="only_second",
    stride=50,  # number of tokens shared by consecutive chunks
    return_overflowing_tokens=True,
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] When did Beyonce start becoming popular? [SEP] Beyoncé Giselle Knowles - Carter ( / [UNK] / bee - YON - say ) ( born September 4, 1981 ) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R & B girl - group Destiny's Child. Managed by her father, Mathew Knowles [SEP]
[CLS] When did Beyonce start becoming popular? [SEP] raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R & B girl - group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best - selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love ( 2003 ) [SEP]
[CLS] When did Beyonce start becoming popular? [SEP] s Child. Managed by her father, Mathew Knowles, the group b

In [12]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'])

The first chunk contains the answer. Because `max_length` is only 100 tokens, the original context has been split into 3 chunks, and the overlap between consecutive chunks is visible in the decoded output.

#### Offset mapping
We need the offset mapping to locate the answer: for every token it gives the `(start, end)` character span of that token in the original text.
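
A small sketch of what an offset mapping is, assuming a naive regex tokenizer instead of the real one. `tokenize_with_offsets` is a hypothetical helper; the real tokenizer splits text into subwords, but the `(start, end)` convention is the same.

```python
import re


def tokenize_with_offsets(text):
    """Return (token, (start, end)) pairs, where text[start:end] == token."""
    pattern = r"\w+|[^\w\s]"  # words, or single punctuation characters
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(pattern, text)]


text = "Born in Houston, Texas."
pairs = tokenize_with_offsets(text)
for token, (start, end) in pairs:
    # Slicing the original text with an offset recovers the token
    print(token, (start, end), repr(text[start:end]))
```

Because every offset points back into the original string, a character-level answer position can be translated into a token position.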

In [13]:
inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [14]:
inputs

{'input_ids': [[101, 1332, 1225, 24896, 1320, 2093, 1838, 2479, 1927, 136, 102, 24041, 144, 22080, 25384, 118, 5007, 113, 120, 100, 120, 17775, 118, 162, 11414, 118, 1474, 114, 113, 1255, 1347, 125, 117, 2358, 114, 1110, 1126, 1237, 2483, 117, 5523, 117, 1647, 2451, 1105, 3647, 119, 3526, 1105, 2120, 1107, 4666, 117, 2245, 117, 1131, 1982, 1107, 1672, 4241, 1105, 5923, 6025, 1112, 170, 2027, 117, 1105, 3152, 1106, 8408, 1107, 1103, 1523, 3281, 1112, 1730, 2483, 1104, 155, 111, 139, 1873, 118, 1372, 16784, 112, 188, 6405, 119, 2268, 15841, 1118, 1123, 1401, 117, 15112, 5773, 25384, 102], [101, 1332, 1225, 24896, 1320, 2093, 1838, 2479, 1927, 136, 102, 2120, 1107, 4666, 117, 2245, 117, 1131, 1982, 1107, 1672, 4241, 1105, 5923, 6025, 1112, 170, 2027, 117, 1105, 3152, 1106, 8408, 1107, 1103, 1523, 3281, 1112, 1730, 2483, 1104, 155, 111, 139, 1873, 118, 1372, 16784, 112, 188, 6405, 119, 2268, 15841, 1118, 1123, 1401, 117, 15112, 5773, 25384, 117, 1103, 1372, 1245, 1141, 1104, 1103, 1362, 11

`overflow_to_sample_mapping` maps each chunk to the example (question) it came from. Since we tokenized only one question here, every entry in the array is `0`: all of these chunks belong to the first example.
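
As a sketch of how this mapping can be read, the snippet below groups chunk indices by the example they belong to. The helper `chunks_per_sample` and the sample list are hypothetical; a real list would come from `inputs["overflow_to_sample_mapping"]`.

```python
from collections import defaultdict


def chunks_per_sample(sample_map):
    """Group chunk indices by the index of the example they were cut from."""
    groups = defaultdict(list)
    for chunk_idx, sample_idx in enumerate(sample_map):
        groups[sample_idx].append(chunk_idx)
    return dict(groups)


# Hypothetical mapping: example 0 produced 3 chunks, example 1 produced 2
groups = chunks_per_sample([0, 0, 0, 1, 1])
print(groups)
```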

In [16]:
len(inputs['offset_mapping'])

3

In [17]:
inputs['offset_mapping'][0]

[(0, 0),
 (0, 4),
 (5, 8),
 (9, 12),
 (12, 14),
 (14, 16),
 (17, 22),
 (23, 31),
 (32, 39),
 (39, 40),
 (0, 0),
 (0, 7),
 (8, 9),
 (9, 15),
 (16, 23),
 (23, 24),
 (24, 30),
 (31, 32),
 (32, 33),
 (33, 43),
 (43, 44),
 (45, 48),
 (48, 49),
 (49, 50),
 (50, 52),
 (52, 53),
 (53, 56),
 (56, 57),
 (58, 59),
 (59, 63),
 (64, 73),
 (74, 75),
 (75, 76),
 (77, 81),
 (81, 82),
 (83, 85),
 (86, 88),
 (89, 97),
 (98, 104),
 (104, 105),
 (106, 116),
 (116, 117),
 (118, 124),
 (125, 133),
 (134, 137),
 (138, 145),
 (145, 146),
 (147, 151),
 (152, 155),
 (156, 162),
 (163, 165),
 (166, 173),
 (173, 174),
 (175, 180),
 (180, 181),
 (182, 185),
 (186, 195),
 (196, 198),
 (199, 206),
 (207, 214),
 (215, 218),
 (219, 226),
 (227, 239),
 (240, 242),
 (243, 244),
 (245, 250),
 (250, 251),
 (252, 255),
 (256, 260),
 (261, 263),
 (264, 268),
 (269, 271),
 (272, 275),
 (276, 280),
 (281, 286),
 (287, 289),
 (290, 294),
 (295, 301),
 (302, 304),
 (305, 306),
 (306, 307),
 (307, 308),
 (309, 313),
 (313, 314),

In [18]:
inputs['offset_mapping'][1]

[(0, 0),
 (0, 4),
 (5, 8),
 (9, 12),
 (12, 14),
 (14, 16),
 (17, 22),
 (23, 31),
 (32, 39),
 (39, 40),
 (0, 0),
 (156, 162),
 (163, 165),
 (166, 173),
 (173, 174),
 (175, 180),
 (180, 181),
 (182, 185),
 (186, 195),
 (196, 198),
 (199, 206),
 (207, 214),
 (215, 218),
 (219, 226),
 (227, 239),
 (240, 242),
 (243, 244),
 (245, 250),
 (250, 251),
 (252, 255),
 (256, 260),
 (261, 263),
 (264, 268),
 (269, 271),
 (272, 275),
 (276, 280),
 (281, 286),
 (287, 289),
 (290, 294),
 (295, 301),
 (302, 304),
 (305, 306),
 (306, 307),
 (307, 308),
 (309, 313),
 (313, 314),
 (314, 319),
 (320, 327),
 (327, 328),
 (328, 329),
 (330, 335),
 (335, 336),
 (337, 340),
 (340, 344),
 (345, 347),
 (348, 351),
 (352, 358),
 (358, 359),
 (360, 364),
 (364, 366),
 (367, 374),
 (374, 375),
 (376, 379),
 (380, 385),
 (386, 392),
 (393, 396),
 (397, 399),
 (400, 403),
 (404, 409),
 (409, 410),
 (410, 411),
 (412, 416),
 (416, 417),
 (417, 424),
 (425, 429),
 (430, 436),
 (437, 439),
 (440, 443),
 (444, 448),
 (44

In the offset mapping, the `(0, 0)` tuples belong to the special tokens, so they mark the boundaries between the question and the context. The overlap is also visible: the context of the second chunk starts at character 156, a position that already appears in the first chunk.
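
A minimal sketch of that observation, assuming special tokens are the only positions with a `(0, 0)` offset (this holds for the unpadded outputs shown here). `split_on_special` is a hypothetical helper and the offsets are a shortened, hand-written version of the second chunk.

```python
def split_on_special(offsets):
    """Split a chunk's offsets into segments separated by (0, 0) markers."""
    segments, current = [], []
    for span in offsets:
        if span == (0, 0):
            if current:
                segments.append(current)
                current = []
        else:
            current.append(span)
    if current:
        segments.append(current)
    return segments


# Shortened, hand-written offsets: [CLS] question [SEP] context [SEP]
chunk_offsets = [(0, 0), (0, 4), (5, 8), (0, 0), (156, 162), (163, 165), (0, 0)]
question_spans, context_spans = split_on_special(chunk_offsets)
print("context starts at character", context_spans[0][0])
```

This is only meant to show what the `(0, 0)` markers convey; `sequence_ids()` is a more direct way to find the same boundaries.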

In [27]:
inputs.sequence_ids()  # can be used instead of `token_type_ids`, which many models don't support

[None,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 None,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 None]

In [31]:
### Batch output
context = raw_data["train"][2:6]["context"]
question = raw_data["train"][2:6]["question"]
inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

In [38]:
inputs['offset_mapping']

[[(0, 0),
  (0, 4),
  (5, 8),
  (9, 12),
  (12, 14),
  (14, 16),
  (17, 22),
  (23, 30),
  (30, 31),
  (31, 32),
  (33, 38),
  (39, 42),
  (43, 49),
  (50, 51),
  (52, 56),
  (57, 63),
  (63, 64),
  (0, 0),
  (0, 7),
  (8, 9),
  (9, 15),
  (16, 23),
  (23, 24),
  (24, 30),
  (31, 32),
  (32, 33),
  (33, 43),
  (43, 44),
  (45, 48),
  (48, 49),
  (49, 50),
  (50, 52),
  (52, 53),
  (53, 56),
  (56, 57),
  (58, 59),
  (59, 63),
  (64, 73),
  (74, 75),
  (75, 76),
  (77, 81),
  (81, 82),
  (83, 85),
  (86, 88),
  (89, 97),
  (98, 104),
  (104, 105),
  (106, 116),
  (116, 117),
  (118, 124),
  (125, 133),
  (134, 137),
  (138, 145),
  (145, 146),
  (147, 151),
  (152, 155),
  (156, 162),
  (163, 165),
  (166, 173),
  (173, 174),
  (175, 180),
  (180, 181),
  (182, 185),
  (186, 195),
  (196, 198),
  (199, 206),
  (207, 214),
  (215, 218),
  (219, 226),
  (227, 239),
  (240, 242),
  (243, 244),
  (245, 250),
  (250, 251),
  (252, 255),
  (256, 260),
  (261, 263),
  (264, 268),
  (269, 271),

In [37]:
inputs['overflow_to_sample_mapping']

[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

In [39]:
print(len(inputs['offset_mapping']))
print(len(inputs['overflow_to_sample_mapping']))

15
15


In [43]:
inputs.sequence_ids(0)

[None,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 None,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 None]

### How to find which chunks have the answer?

- Tokenize a batch with the options shown above (`return_overflowing_tokens=True`, `return_offsets_mapping=True`); let's call the result `inputs`.
- Separately keep the raw answer metadata for each example: the answer's start character index and its length.
- Loop over `offset_mapping`; for chunk `i`, look up the value at the same index in `overflow_to_sample_mapping`.
- That value is the index of the original example, from which the answer's start index and length can be read.
- Get `inputs.sequence_ids(i)`. The first position whose value is `1` is `context_start`. Keep incrementing a counter while the value is still `1`; the last such position is `context_end`.
- With `context_start` and `context_end` known, check whether the answer's character span lies fully inside the chunk. If it does, record the token position where the answer starts and, in the same way, the token position where it ends. Otherwise the chunk is labelled `(0, 0)`.

## TLDR:

When a context is broken into chunks, the answer's start and end indices from the original data can't be used directly: they are character positions in the full context, and chunking introduces overlap and truncation. To find the positions in `input_ids` that now hold the answer, we need the algorithm above.
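
Before the full preprocessing function, the steps can be sketched on a single hand-built chunk. Everything in this snippet is an assumption made for illustration: `locate_answer` is a hypothetical helper, and the offsets and sequence ids are written by hand for the toy context `"Born in Houston, Texas."` rather than produced by the tokenizer.

```python
def locate_answer(offsets, sequence_ids, start_char, end_char):
    """Return (start_token, end_token) of the answer, or (0, 0) if the
    answer is not fully inside this chunk's context."""
    # Context boundaries: first and last positions with sequence id 1
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Last token that starts at or before the answer's first character
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # First token that ends at or after the answer's last character
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# [CLS] Where ? [SEP] Born in Houston , Texas . [SEP]
offsets = [(0, 0), (0, 5), (5, 6), (0, 0),
           (0, 4), (5, 7), (8, 15), (15, 16), (17, 22), (22, 23), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, 1, None]

# The answer "Houston" spans characters 8..15 of the context
span = locate_answer(offsets, sequence_ids, start_char=8, end_char=15)
print(span)

# A later chunk that only contains "Texas." cannot hold the answer
late_offsets = [(0, 0), (0, 5), (5, 6), (0, 0), (17, 22), (22, 23), (0, 0)]
late_ids = [None, 0, 0, None, 1, 1, None]
missing = locate_answer(late_offsets, late_ids, start_char=8, end_char=15)
print(missing)
```

Token 6 is `Houston` in the first chunk, so both the start and end positions point at it; the second chunk gets the `(0, 0)` label because the answer lies before its context.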


In [50]:
### Test the preprocess function
max_length = 100
stride = 50


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        # Index of the original example this chunk was cut from
        sample_idx = sample_map[i]
        # Character span of the answer in the original context
        start_char = examples["answer_start"][sample_idx]
        end_char = start_char + len(examples["text"][sample_idx])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [51]:
samples=raw_data['train'][0:10]

In [52]:
preprocess_training_examples(samples) ### seems to work

{'input_ids': [[101, 1332, 1225, 24896, 1320, 2093, 1838, 2479, 1927, 136, 102, 24041, 144, 22080, 25384, 118, 5007, 113, 120, 100, 120, 17775, 118, 162, 11414, 118, 1474, 114, 113, 1255, 1347, 125, 117, 2358, 114, 1110, 1126, 1237, 2483, 117, 5523, 117, 1647, 2451, 1105, 3647, 119, 3526, 1105, 2120, 1107, 4666, 117, 2245, 117, 1131, 1982, 1107, 1672, 4241, 1105, 5923, 6025, 1112, 170, 2027, 117, 1105, 3152, 1106, 8408, 1107, 1103, 1523, 3281, 1112, 1730, 2483, 1104, 155, 111, 139, 1873, 118, 1372, 16784, 112, 188, 6405, 119, 2268, 15841, 1118, 1123, 1401, 117, 15112, 5773, 25384, 102], [101, 1332, 1225, 24896, 1320, 2093, 1838, 2479, 1927, 136, 102, 2120, 1107, 4666, 117, 2245, 117, 1131, 1982, 1107, 1672, 4241, 1105, 5923, 6025, 1112, 170, 2027, 117, 1105, 3152, 1106, 8408, 1107, 1103, 1523, 3281, 1112, 1730, 2483, 1104, 155, 111, 139, 1873, 118, 1372, 16784, 112, 188, 6405, 119, 2268, 15841, 1118, 1123, 1401, 117, 15112, 5773, 25384, 117, 1103, 1372, 1245, 1141, 1104, 1103, 1362, 11