## Fine-tuning a model on a question-answering task

In [None]:
! pip install datasets transformers accelerate

In [7]:
# Flag to use squad_v2
squad_v2 = True
model_checkpoint = "microsoft/phi-2"
batch_size = 1

## Loading the dataset

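**load_metric:** <br>Loads the metric used for evaluation, so that our model's scores can be compared with the benchmark. Note that newer releases of `datasets` deprecate `load_metric` in favour of the separate `evaluate` library.<br>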
[https://huggingface.co/docs/datasets/v1.0.1/loading_metrics.html](https://huggingface.co/docs/datasets/v1.0.1/loading_metrics.html)

In [8]:
from datasets import load_dataset, load_metric

In [9]:
# loading the dataset
datasets = load_dataset("squad_v2" if squad_v2 else "squad")

The `datasets` object is a `DatasetDict` with one key per split: `train` and `validation`.

In [None]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

In [None]:
# access first element by giving split and index
datasets["train"][0]

{'id': '56be85543aeaaa14008c9063',
 'title': 'Beyoncé',
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'question': 'When did Beyonce start becoming popular?',
 'answers': {'text': ['in the late 1990s'], 'answer_start': [269]}}

Answers are given by their start position (a character index into the context) and by their full text, which is a substring of the context.
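The relationship between `answer_start` and the answer text can be checked directly. The snippet below is a minimal sketch on a hand-written record in the same shape as a SQuAD example; the record and the helper name `answer_matches_context` are illustrative assumptions, not part of the dataset API.

```python
# Hypothetical record in the same shape as a SQuAD example.
context = "The Normans gave their name to Normandy, a region in France."
answer_text = "France"
example = {
    "context": context,
    "answers": {
        "text": [answer_text],
        "answer_start": [context.index(answer_text)],
    },
}

def answer_matches_context(example):
    """Check that every answer text is the substring found at its recorded start."""
    ctx = example["context"]
    answers = example["answers"]
    return all(
        ctx[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )

print(answer_matches_context(example))
```

For unanswerable SQuAD v2 questions both lists are empty, so the check passes trivially.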

In [None]:
# function to pick some examples randomly in the dataset and convert numerical representations to their meaningful names.
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        # Applies the transformation to each value i in the DataFrame column, replacing the integer with its corresponding class name.
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        # Applies the transformation to each list x in the DataFrame column, replacing each integer in the list with its corresponding class name.
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))
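The retry loop that draws distinct indices can also be written with `random.sample`, which samples without replacement. This is only a sketch of the index-picking step; `pick_indices` and its `seed` parameter are hypothetical names introduced here.

```python
import random

def pick_indices(dataset_size, num_examples=5, seed=None):
    """Return `num_examples` distinct indices in [0, dataset_size)."""
    if num_examples > dataset_size:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(range(dataset_size), num_examples)

picks = pick_indices(130319, num_examples=5, seed=0)
print(picks)
```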

In [None]:
show_random_elements(datasets["train"])

Unnamed: 0,id,title,context,question,answers
0,57346018879d6814001ca583,Hunting,"Hunting is claimed to give resource managers an important tool in managing populations that might exceed the carrying capacity of their habitat and threaten the well-being of other species, or, in some instances, damage human health or safety.[citation needed] However, in most circumstances carrying capacity is determined by a combination habitat and food availability, and hunting for 'population control' has no effect on the annual population of species.[citation needed] In some cases, it can increase the population of predators such as coyotes by removing territorial bounds that would otherwise be established, resulting in excess neighbouring migrations into an area, thus artificially increasing the population. Hunting advocates[who?] assert that hunting reduces intraspecific competition for food and shelter, reducing mortality among the remaining animals. Some environmentalists assert[who?] that (re)introducing predators would achieve the same end with greater efficiency and less negative effect, such as introducing significant amounts of free lead into the environment and food chain.",What does hunting give resource managers an important tool?,"{'text': ['managing populations'], 'answer_start': [66]}"
1,572747fddd62a815002e9a7c,Affirmative_action_in_the_United_States,"In the year 2000, according to a study by American Association of University Professors (AAUP), affirmative action promoted diversity within colleges and universities. This has been shown to have positive effects on the educational outcomes and experiences of college students as well as the teaching of faculty members. According to a study by Geoffrey Maruyama and José F. Moreno, the results showed that faculty members believed diversity helps students to reach the essential goals of a college education, Caucasian students suffer no detrimental effects from classroom diversity, and that attention to multicultural learning improves the ability of colleges and universities to accomplish their missions. Furthermore, a diverse population of students offers unique perspectives in order to challenge preconceived notions through exposure to the experiences and ideas of others. According to Professor Gurin of the University of Michigan, skills such as ""perspective-taking, acceptance of differences, a willingness and capacity to find commonalities among differences, acceptance of conflict as normal, conflict resolution, participation in democracy, and interest in the wider social world"" can potentially be developed in college while being exposed to heterogeneous group of students. In addition, broadening perspectives helps students confront personal and substantive stereotypes and fosters discussion about racial and ethnic issues in a classroom setting. Furthermore, the 2000 AAUP study states that having a diversity of views leads to a better discussion and greater understanding among the students on issues of race, tolerance, fairness, etc.",In which year did the AAUP release their study?,"{'text': ['2000'], 'answer_start': [12]}"
2,56ddec6666d3e219004dae0f,Cardinal_(Catholicism),"In 1630, Pope Urban VIII decreed their title to be Eminence (previously, it had been ""illustrissimo"" and ""reverendissimo"") and decreed that their secular rank would equate to Prince, making them secondary only to the Pope and crowned monarchs.",In was year was the title decreed Eminence?,"{'text': ['1630'], 'answer_start': [3]}"
3,5acfd0c777cf76001a68616b,Presbyterianism,"In England, Presbyterianism was established in secret in 1592. Thomas Cartwright is thought to be the first Presbyterian in England. Cartwright's controversial lectures at Cambridge University condemning the episcopal hierarchy of the Elizabethan Church led to his deprivation of his post by Archbishop John Whitgift and his emigration abroad. Between 1645 and 1648, a series of ordinances of the Long Parliament established Presbyterianism as the polity of the Church of England. Presbyterian government was established in London and Lancashire and in a few other places in England, although Presbyterian hostility to the execution of Charles I and the establishment of the republican Commonwealth of England meant that Parliament never enforced the Presbyterian system in England. The re-establishment of the monarchy in 1660 brought the return of Episcopal church government in England (and in Scotland for a short time); but the Presbyterian church in England continued in Non-Conformity, outside of the established church. In 1719 a major split, the Salter's Hall controversy, occurred; with the majority siding with nontrinitarian views. Thomas Bradbury published several sermons bearing on the controversy, and in 1719, ""An answer to the reproaches cast on the dissenting ministers who subscribed their belief of the Eternal Trinity."". By the 18th century many English Presbyterian congregations had become Unitarian in doctrine.",Who gave controversial lectures at Parliament University?,"{'text': [], 'answer_start': []}"
4,570c488cb3d812140066d077,Melbourne,"A brash boosterism that had typified Melbourne during this time ended in the early 1890s with a severe depression of the city's economy, sending the local finance and property industries into a period of chaos during which 16 small ""land banks"" and building societies collapsed, and 133 limited companies went into liquidation. The Melbourne financial crisis was a contributing factor in the Australian economic depression of the 1890s and the Australian banking crisis of 1893. The effects of the depression on the city were profound, with virtually no new construction until the late 1890s.",When did severe depression hit Melbourne's city?,"{'text': ['1890s'], 'answer_start': [83]}"


In [21]:
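def preprocess(dataset):
    # Mark the question with [CLS] and wrap the context in [SEP] markers.
    # Caveat: the " [SEP] " prefix shifts every context character by 7 positions, while
    # answers["answer_start"] is left unchanged and still indexes the original context.
    dataset["question"] = '[CLS]' + " " + dataset["question"]
    dataset["context"] = " " + '[SEP]' + " " + dataset["context"] + " " + '[SEP]'
    return dataset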

In [22]:
# datasets["train"] = datasets["train"].map(preprocess)
datasets = datasets.map(preprocess)

In [None]:
datasets["validation"][0]

{'id': '56ddde6b9a695914005b9628',
 'title': 'Normans',
 'context': ' [SEP] The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries. [SEP]',
 'question': '[CLS] In what country is Normandy located?',
 'answers': {'text': ['France', 'France', 'France', 'France'],
  'answer_start': [159, 159, 159, 159]}}
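In the output above, `answer_start` is still 159 although the context now begins with `' [SEP] '`, so the recorded offset no longer points at "France" in the modified context. One possible way to keep the two aligned is to shift the offsets by the length of the prefix. The function below is a hypothetical variant of `preprocess` (the name `preprocess_with_shift` and the sample record are assumptions made for illustration); it returns plain dictionaries, so it should also be usable with `datasets.map`.

```python
SEP_PREFIX = " [SEP] "  # the same prefix that preprocess puts before the context

def preprocess_with_shift(example):
    """Hypothetical variant of preprocess that keeps answer_start aligned."""
    answers = example["answers"]
    shifted = [start + len(SEP_PREFIX) for start in answers["answer_start"]]
    return {
        **example,
        "question": "[CLS] " + example["question"],
        "context": SEP_PREFIX + example["context"] + " [SEP]",
        "answers": {"text": list(answers["text"]), "answer_start": shifted},
    }

# Hand-written record used only for this illustration.
raw_context = "Normandy is a region in France."
original = {
    "question": "In what country is Normandy located?",
    "context": raw_context,
    "answers": {"text": ["France"], "answer_start": [raw_context.index("France")]},
}

processed = preprocess_with_shift(original)
start = processed["answers"]["answer_start"][0]
print(processed["context"][start:start + len("France")])
```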

In [None]:
datasets["train"][15]

{'id': '56be86cf3aeaaa14008c9076',
 'title': 'Beyoncé',
 'context': ' [SEP] Following the disbandment of Destiny\'s Child in June 2005, she released her second solo album, B\'Day (2006), which contained hits "Déjà Vu", "Irreplaceable", and "Beautiful Liar". Beyoncé also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for "Single Ladies (Put a Ring on It)". Beyoncé took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyoncé (2013), was disting

## Preprocessing the training data

In [None]:
# Instantiate the tokenizer with the AutoTokenizer.from_pretrained method. This
# returns the tokenizer that corresponds to the model architecture we want to use
# and downloads the vocabulary used when pretraining this specific checkpoint.

[https://huggingface.co/docs/transformers/preprocessing](https://huggingface.co/docs/transformers/preprocessing)

In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [11]:
# Define special tokens
cls_token = "[CLS]"
sep_token = "[SEP]"

# Add special tokens to the tokenizer's vocabulary
tokenizer.add_special_tokens({"cls_token": cls_token, "sep_token": sep_token})

2

In [None]:
# ensure that the tokenizer variable is of the expected type
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [None]:
# 2 sentences: one for the question, one for the context
tokenizer("[CLS] What is your name?", "[SEP] My name is Sylvain. [SEP]")

{'input_ids': [50295, 1867, 318, 534, 1438, 30, 50296, 2011, 1438, 318, 24286, 391, 13, 220, 50296], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Dealing with very long documents.
Instead of simply truncating, we let one (long) example in our dataset produce several input features, each no longer than the maximum length of the model (or the value we set as a hyper-parameter). In case the answer lies at the point where a long context is split, consecutive features overlap, and the size of that overlap is controlled by the hyper-parameter `doc_stride`.

In [12]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The allowed overlap between two parts of the context when it needs to be split.

In [None]:
# Find one example whose question and context together exceed max_length tokens
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        break
example = datasets["train"][i]

In [None]:
# Without any truncation, we get the following length for the input IDs:
len(tokenizer(example["question"], example["context"])["input_ids"])
# If we simply truncate, we lose information (and possibly the answer to our question):

443

In [None]:
len(tokenizer(example["question"], example["context"], max_length=max_length, truncation="only_second")["input_ids"])

384

In [None]:
# only truncate context, hence "only_second"
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    # if the context is too long and needs to be split, the tokenizer will return the overflowing tokens as separate sequences.
    # This allows the model to handle large contexts by processing them in chunks.
    return_overflowing_tokens=True,
    stride=doc_stride
)

Striding (illustrative; in the real tokenizer output the question is repeated at the start of every chunk):<br>
Chunk 1: [CLS] What is the capital of France? [SEP] France is a country in Europe. It has many famous cities, including Paris, [SEP]

Chunk 2:                             many famous cities, including Paris, which is the capital of France. Paris is known for its art, gastronomy, [SEP]

Chunk 3:                                                 which is the capital of France. Paris is known for its art, gastronomy, and culture. The Eiffel Tower is located in [SEP]


In [None]:
# We now get several lists of input_ids, one per feature
[len(x) for x in tokenized_example["input_ids"]]

[384, 197]
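The two lengths above follow from a sliding window over the context tokens. The sketch below is a plain-Python illustration of that idea, not the tokenizer's actual implementation. `split_with_stride` is a hypothetical helper, and the 10-token question and 433-token context are read off this notebook's outputs (443 tokens in total, of which the first ten belong to the question).

```python
def split_with_stride(tokens, window, stride):
    """Split `tokens` into windows of at most `window` items.

    Consecutive windows share `stride` items, mirroring the role of doc_stride.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the context
        start += window - stride
    return chunks

max_length = 384
doc_stride = 128
question_len = 10                 # assumed from the notebook output
context_tokens = list(range(433)) # stand-in for 433 context token ids

chunks = split_with_stride(context_tokens, max_length - question_len, doc_stride)
print([question_len + len(chunk) for chunk in chunks])  # → [384, 197]
```

The second window starts 128 tokens before the end of the first one, which is why an answer that straddles the split point still appears whole in at least one feature.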

In [None]:
# decode this
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

[CLS] Beyonce got married in 2008 to whom? [SEP] On April 4, 2008, Beyoncé married Jay Z. She publicly revealed their marriage in a video montage at the listening party for her third studio album, I Am... Sasha Fierce, in Manhattan's Sony Club on October 22, 2008. I Am... Sasha Fierce was released on November 18, 2008 in the United States. The album formally introduces Beyoncé's alter ego Sasha Fierce, conceived during the making of her 2003 single "Crazy in Love", selling 482,000 copies in its first week, debuting atop the Billboard 200, and giving Beyoncé her third consecutive number-one album in the US. The album featured the number-one song "Single Ladies (Put a Ring on It)" and the top-five songs "If I Were a Boy" and "Halo". Achieving the accomplishment of becoming her longest-running Hot 100 single in her career, "Halo"'s success in the US helped Beyoncé attain more top-ten singles on the list than any other woman during the 2000s. It also included the successful "Sweet Dreams",

Special tokens: <br> `[CLS]` (classification) and `[SEP]` (separator) mark the structure of the input sequence.<br>
Like every other token, `[CLS]` and `[SEP]` are embedded into dense vectors.
The model processes these embeddings through multiple transformer layers, capturing contextual information.<br>
In BERT-style encoders, the final hidden state of the `[CLS]` token can be used for classification or as a summary representation, and the `[SEP]` token helps keep the segments distinct throughout the model's layers.
Here both tokens were added to the Phi-2 tokenizer by hand, so their embeddings are new and still have to be trained, as the tokenizer warning above points out.

In [None]:
# offset_mapping - provides information about the position(start and end character positions) of each token within the original input text.
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])
# Prints the offset_mapping of the first 100 tokens of the first feature

[(0, 5), (5, 11), (11, 13), (13, 17), (17, 25), (25, 28), (28, 33), (33, 36), (36, 41), (41, 42), (0, 1), (1, 6), (6, 9), (9, 15), (15, 17), (17, 18), (18, 23), (23, 24), (24, 30), (30, 32), (32, 40), (40, 44), (44, 46), (46, 47), (47, 51), (51, 60), (60, 69), (69, 75), (75, 84), (84, 87), (87, 89), (89, 95), (95, 100), (100, 103), (103, 106), (106, 110), (110, 120), (120, 126), (126, 130), (130, 134), (134, 140), (140, 147), (147, 153), (153, 154), (154, 156), (156, 159), (159, 162), (162, 168), (168, 170), (170, 175), (175, 176), (176, 179), (179, 189), (189, 191), (191, 196), (196, 201), (201, 204), (204, 212), (212, 215), (215, 216), (216, 221), (221, 222), (222, 224), (224, 227), (227, 230), (230, 236), (236, 238), (238, 243), (243, 247), (247, 256), (256, 259), (259, 268), (268, 271), (271, 272), (272, 277), (277, 280), (280, 284), (284, 291), (291, 298), (298, 299), (299, 303), (303, 309), (309, 318), (318, 329), (329, 335), (335, 337), (337, 339), (339, 345), (345, 349), (349, 

In [None]:
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])

ĠBeyon  Beyon
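The idea behind `offset_mapping` can be reproduced without a trained tokenizer. The sketch below uses a simple regular-expression tokenizer as a stand-in (an assumption for illustration; Phi-2 uses byte-level BPE, which is why the real token above carries a leading `Ġ` for the space). `tokenize_with_offsets` is a hypothetical helper.

```python
import re

def tokenize_with_offsets(text):
    """Return (token, (start_char, end_char)) pairs for words and punctuation."""
    return [
        (match.group(), (match.start(), match.end()))
        for match in re.finditer(r"\w+|[^\w\s]", text)
    ]

text = "Beyonce got married in 2008 to whom?"
pairs = tokenize_with_offsets(text)

# Slicing the original text with a token's offsets gives back that token.
for token, (start, end) in pairs:
    assert text[start:end] == token

print(pairs[:3])
```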


In [None]:
# sequence_ids() tells us which tokens belong to the question and which belong to the context
# None - special tokens
# 0 - question
# 1 - context
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
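Given such a list, the first and last context tokens can be located directly. This is a small sketch of that lookup; `context_token_range` is a hypothetical helper, and the sample lists are made up to resemble the output above.

```python
def context_token_range(sequence_ids, context_id=1):
    """Return (first, last) token indices whose sequence id equals `context_id`."""
    first = sequence_ids.index(context_id)
    # Search from the end by reversing the list.
    last = len(sequence_ids) - 1 - sequence_ids[::-1].index(context_id)
    return first, last

# 10 question tokens followed by 374 context tokens, as in the first feature.
print(context_token_range([0] * 10 + [1] * 374))

# Entries that are None (special or padding positions) are skipped automatically.
print(context_token_range([0, 0, 1, 1, 1, None, None]))
```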

In [None]:
# Compute and print the start_position and end_position of the answer in the tokenized sequence.
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
offsets = tokenized_example["offset_mapping"][0]
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    # Move the token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go after the last offset if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")

20 20
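The search in the cell above can be isolated into a small function that maps a character span to an inclusive token span. This is a sketch of the same logic on a toy tokenization, assuming the offsets and the answer positions refer to the same text; `char_span_to_token_span` and the regular-expression tokenizer are illustrative assumptions.

```python
import re

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (start_token, end_token), both inclusive.

    Returns None when the span is not fully covered by `offsets`, which is the
    case that gets labeled with the CLS index during training.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    start_token = 0
    while start_token < len(offsets) and offsets[start_token][0] <= start_char:
        start_token += 1
    end_token = len(offsets) - 1
    while end_token >= 0 and offsets[end_token][1] >= end_char:
        end_token -= 1
    return start_token - 1, end_token + 1

text = "Paris is the capital of France."
matches = list(re.finditer(r"\w+|[^\w\s]", text))
tokens = [m.group() for m in matches]
offsets = [(m.start(), m.end()) for m in matches]

answer = "capital of France"
start_char = text.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span, tokens[span[0]:span[1] + 1])
```

If a feature only covers part of the context (for example `offsets[:4]` here), the function returns `None` because the answer is cut off.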


In [19]:
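# True when the tokenizer pads on the right: the question then comes first and the context second
pad_on_right = tokenizer.padding_side == "right"
# padding="max_length" needs a pad token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token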

In [13]:
# combining all together
def prepare_train_features(examples):

    # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # The pad token and the [CLS]/[SEP] special tokens are configured once on the
    # tokenizer in the cells above, before this function is mapped over the data.
    # Repeating that setup here, after tokenization, would have no effect.

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []


    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

In [14]:
print("CLS token ID:", tokenizer.cls_token_id)
print("SEP token ID:", tokenizer.sep_token_id)

CLS token ID: 50295
SEP token ID: 50296


In [None]:
features = prepare_train_features(datasets['train'][:5])

In [None]:
# # apply the defined function to full dataset
# tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)

In [23]:
# Select 5000 samples from the "train" split
train_subset = datasets["train"].select(range(5000))

# Select 1000 samples from the "validation" split
validation_subset = datasets["validation"].select(range(1000))

In [24]:
tokenized_train_subset = train_subset.map(prepare_train_features, batched=True, remove_columns=train_subset.column_names)
tokenized_validation_subset = validation_subset.map(prepare_train_features, batched=True, remove_columns=validation_subset.column_names)

In [None]:
tokenized_train_subset[0]

{'input_ids': [50295,
  1649,
  750,
  37361,
  344,
  923,
  5033,
  2968,
  30,
  220,
  50296,
  37361,
  32682,
  402,
  271,
  13485,
  9365,
  829,
  12,
  49958,
  50247,
  8482,
  135,
  238,
  45990,
  73,
  133,
  240,
  77,
  325,
  133,
  103,
  14,
  20697,
  12,
  56,
  1340,
  12,
  16706,
  8,
  357,
  6286,
  2693,
  604,
  11,
  14745,
  8,
  318,
  281,
  1605,
  14015,
  11,
  3496,
  16002,
  11,
  1700,
  9920,
  290,
  14549,
  13,
  18889,
  290,
  4376,
  287,
  6995,
  11,
  3936,
  11,
  673,
  6157,
  287,
  2972,
  13777,
  290,
  15360,
  24174,
  355,
  257,
  1200,
  11,
  290,
  8278,
  284,
  16117,
  287,
  262,
  2739,
  6303,
  82,
  355,
  1085,
  14015,
  286,
  371,
  5,
  33,
  2576,
  12,
  8094,
  17886,
  338,
  5932,
  13,
  1869,
  1886,
  416,
  607,
  2988,
  11,
  6550,
  6391,
  9365,
  829,
  11,
  262,
  1448,
  2627,
  530,
  286,
  262,
  995,
  338,
  1266,
  12,
  16473,
  2576,
  2628,
  286,
  477,
  640,
  13,
  5334,
  37009,


In [None]:
for key in tokenized_validation_subset[15]:
    print(key)

## Fine-tuning the model

In [None]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)



In [None]:
# defining attributes to customize the training
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-squadv2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
    hub_model_id=""
)

In [17]:
# data collator to batch our processed examples together
from transformers import default_data_collator

data_collator = default_data_collator

In [32]:
# Select a small batch (here a single sample)
small_batch = [tokenized_train_subset[i] for i in range(1)]
# Use the default data collator
collated_batch = default_data_collator(small_batch)
# Print the resulting batch to inspect its contents
for key, value in collated_batch.items():
    print(f"{key}: {value}")

# Print a summary for better readability
for key, value in collated_batch.items():
    print(f"{key}:")
    print(f"  Type: {type(value)}")
    print(f"  Shape: {value.shape if hasattr(value, 'shape') else 'N/A'}")
    print(f"  Example element: {value[0]}")
    print()  # Blank line for readability

input_ids: tensor([[50295,  1649,   750, 37361,   344,   923,  5033,  2968,    30,   220,
         50296, 37361, 32682,   402,   271, 13485,  9365,   829,    12, 49958,
         50247,  8482,   135,   238, 45990,    73,   133,   240,    77,   325,
           133,   103,    14, 20697,    12,    56,  1340,    12, 16706,     8,
           357,  6286,  2693,   604,    11, 14745,     8,   318,   281,  1605,
         14015,    11,  3496, 16002,    11,  1700,  9920,   290, 14549,    13,
         18889,   290,  4376,   287,  6995,    11,  3936,    11,   673,  6157,
           287,  2972, 13777,   290, 15360, 24174,   355,   257,  1200,    11,
           290,  8278,   284, 16117,   287,   262,  2739,  6303,    82,   355,
          1085, 14015,   286,   371,     5,    33,  2576,    12,  8094, 17886,
           338,  5932,    13,  1869,  1886,   416,   607,  2988,    11,  6550,
          6391,  9365,   829,    11,   262,  1448,  2627,   530,   286,   262,
           995,   338,  1266,    12, 1647

In [None]:
print("model")
print(model)
print("arguments",args)
print("train_datasets",tokenized_train_subset[0])
print("valid_datasets",tokenized_validation_subset[0])

model
PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiSdpaAttention(
          (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (final_layernorm): LayerNor

In DistilBERT for question answering, the output layer is a linear transformation that produces start and end logits for every token.<br>
In Phi-2 for causal LM, the output layer is a linear transformation that predicts the next token in the sequence.

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_train_subset,
    eval_dataset=tokenized_validation_subset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
# finetune the model
trainer.train()

ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values. For reference, the inputs it received are input_ids,attention_mask.
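The error above is raised because `AutoModelForCausalLM` returns only next-token `logits`: its forward pass has no `start_positions`/`end_positions` arguments, so `Trainer` drops those columns by default and no loss is computed. An extractive question-answering head instead produces one start logit and one end logit per token, and the question-answering models in `transformers` train it with the average of two cross-entropy losses. The following is a plain-Python sketch of that loss for a single example; `cross_entropy` and `span_loss` are hypothetical helpers, not library functions.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]

def span_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start and end cross-entropies, as used for extractive QA."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return 0.5 * (start_loss + end_loss)

# Toy logits over a 4-token feature; the labeled answer is the span (1, 2).
confident = span_loss([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], 1, 2)
uniform = span_loss([0.0] * 4, [0.0] * 4, 1, 2)
print(confident, uniform)
```

With uniform logits the loss equals `log(4)`, the entropy of a uniform guess over four positions, and it falls as the logits concentrate on the labeled span.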

In [None]:
# Save the model so that we do not have to retrain it every time
trainer.save_model("test-squad-trained")
# This saves the model weights, the configuration (config.json), the tokenizer files and the training arguments

## Evaluation

In [None]:
# The task is to map the predictions of our model back to spans of the context.
# The model itself predicts logits for the start and end positions of our answers.

In [None]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

In [None]:
# Only the start and end logits are needed for predictions
# (these fields exist on question-answering outputs, not on causal LM outputs)
output.start_logits.shape, output.end_logits.shape
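Assuming a model that does return `start_logits` and `end_logits`, one common way to turn them into an answer is to score candidate spans by the sum of their start and end logits and keep the best valid one. The sketch below works on plain lists for a single feature; `best_span`, `n_best` and `max_answer_length` are names chosen here for illustration.

```python
def best_span(start_logits, end_logits, n_best=20, max_answer_length=30):
    """Return (score, start_token, end_token) of the best valid span, or None."""
    def top_indices(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best]

    best = None
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # Skip spans that end before they start or that are too long.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best

start_logits = [0.1, 2.0, 0.3, 0.0]
end_logits = [0.0, 0.5, 3.0, 0.2]
print(best_span(start_logits, end_logits))
```

The token indices can then be converted back to characters with the feature's `offset_mapping`, and the context sliced at those positions gives the predicted answer text.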