<a href="https://colab.research.google.com/github/Omar-Aliii/AI-AGENT/blob/main/Question%20Answer%20Models/BERT/BERT_Question_Answering_on_entire_SQUAD_v3(early_stopping%2CBERTScore).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to install them.

Transformers: This library by Hugging Face provides a collection of pre-trained models for Natural Language Processing (NLP) tasks, including BERT, GPT, RoBERTa, and more. It also includes tools for training your own models and fine-tuning existing ones.

In [None]:
! pip install datasets transformers

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.16.1 dill-0.3.7 multiprocess-0.70.15


If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To share your model with the community and try it through the Inference API, there are a few more steps to follow.

First, store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!). The login itself happens in the `notebook_login()` cell further down, where you paste your access token when prompted. The next cell imports the libraries used in this notebook:

In [None]:
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import (
    AdamW,
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    DistilBertForQuestionAnswering,
    DistilBertTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
    get_linear_schedule_with_warmup,
    pipeline,
)


In [None]:
!pip install git+https://github.com/deepset-ai/haystack.git


Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-mkolhl0f
  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-mkolhl0f
  Resolved https://github.com/deepset-ai/haystack.git to commit acf4cd502fd67c50c3bbbd3eb580738ff041c28f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting boilerpy3 (from haystack-ai==2.0.0b5)
  Downloading boilerpy3-1.0.7-py3-none-any.whl (22 kB)
Collecting haystack-bm25 (from haystack-ai==2.0.0b5)
  Downloading haystack_bm25-1.0.2-py2.py3-none-any.whl (8.8 kB)
Collecting lazy-imports (from haystack-ai==2.0.0b5)
  Downloading lazy_imports-0.3.1-py3-none-any.whl (12 kB)
Collecting openai>=1.1.0 (from haystack-ai==2.0.0b5)
  Downloading openai-1.10.0-py3-none-any.whl (225 kB)

In [None]:
import logging

logging.basicConfig(level=logging.DEBUG)


In [None]:
from huggingface_hub import notebook_login

notebook_login()


Then you need to install Git LFS by running the following cell.

Git LFS is an extension to Git for managing large files efficiently. It replaces large files in your repository with small text pointers inside Git, while the file contents are stored separately.

In [None]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 31 not upgraded.


Make sure your version of Transformers is at least 4.11.0, since the Hub-sharing functionality used here was introduced in that version:

In [None]:
import transformers

print(transformers.__version__)

4.35.2
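Reading the printed version by eye is enough here, but the same check can be automated. The following is a minimal sketch; `version_tuple` is a hypothetical helper written for this illustration, not part of Transformers. It compares versions as tuples of integers because comparing the raw strings gives the wrong answer once a component reaches two digits.

```python
def version_tuple(version):
    """Turn '4.35.2' into (4, 35, 2); stop at the first non-numeric part (e.g. 'dev0')."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


installed = "4.35.2"  # the value printed by the cell above
required = "4.11.0"

meets_requirement = version_tuple(installed) >= version_tuple(required)
print("Transformers is recent enough:", meets_requirement)

# Pitfall: plain string comparison is lexicographic, so "4.9.0" looks newer than "4.11.0".
naive_check = "4.9.0" >= "4.11.0"
correct_check = version_tuple("4.9.0") >= version_tuple("4.11.0")
print(naive_check, correct_check)
```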


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/question-answering).

In [None]:
from transformers.utils import send_example_telemetry
# Send usage telemetry for this example notebook to Hugging Face
send_example_telemetry("question_answering_notebook", framework="pytorch")

The `send_example_telemetry` function is part of the Hugging Face Transformers library, and **it sends usage telemetry for this example notebook to Hugging Face**. This helps the maintainers see how their examples are being used in the community.

# Fine-tuning a model on a question-answering task

This notebook is built to run on any question answering task with the same format as SQuAD (version 1 or 2), with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a question answering head and a fast tokenizer (check on [this table](https://huggingface.co/transformers/index.html#bigtable) if this is the case). It might need some small adjustments if you decide to use a different dataset than the one used here. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [None]:
# This flag is the difference between SQUAD v1 or 2 (if you're using another dataset, it indicates if impossible
# answers are allowed or not).

squad_v2 = False
model_checkpoint = "distilbert-base-uncased"
batch_size = 32

**squad_v2:** This flag indicates whether you're working with SQuAD version 2 (True) or version 1 (False). SQuAD v2 introduced questions whose answer is not present in the provided passage, making it a more challenging dataset. If squad_v2 is False, you're working with SQuAD v1, where every question has a corresponding answer in the passage.

**model_checkpoint:** This variable specifies the pre-trained model checkpoint you want to fine-tune. In this case, it's set to "distilbert-base-uncased", a smaller and faster version of BERT (Bidirectional Encoder Representations from Transformers) pre-trained on uncased text. You can replace it with other model checkpoints available in the Hugging Face model hub.

**batch_size**: It determines how many examples are processed in each iteration during training. With `batch_size = 32`, the model computes gradients and updates its parameters based on the average loss over 32 examples at a time.

The right batch size depends on several factors, including the **available memory on the GPU or CPU and the size of the dataset**.
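To make the effect of `batch_size` concrete, here is a small arithmetic sketch of how many optimizer steps one epoch takes. It assumes the 87,599 training examples of SQuAD v1 loaded later in this notebook; the real count will be somewhat higher, because long contexts are split into several features during preprocessing.

```python
import math

num_train_examples = 87599  # SQuAD v1 training split (before long contexts are split)
batch_size = 32

# Each step consumes one batch; the last batch may be smaller than the others.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
last_batch_size = num_train_examples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, last_batch_size)
```

Halving the batch size roughly doubles the number of steps per epoch and lowers the peak memory needed per step, which is the usual first adjustment after an out-of-memory error.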

## Loading the dataset

In [None]:
from datasets import load_dataset, load_metric

**`load_dataset`** function:

This function loads datasets from the Hugging Face Datasets library.

**`load_metric`** function:

This function loads evaluation metrics from the Datasets library. It is deprecated in recent releases of Datasets in favor of the separate 🤗 Evaluate library.

These metrics quantify how well a model performs by comparing its predictions with the ground-truth answers.
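For SQuAD, the standard metrics are exact match and token-level F1. The sketch below shows the idea on a single prediction; it is a simplified illustration, not the official SQuAD scoring script (which also strips punctuation and articles before comparing). The function names and the example prediction are hypothetical.

```python
from collections import Counter


def exact_match(prediction, truth):
    """1 if the two strings are identical after lowercasing and trimming, else 0."""
    return int(prediction.strip().lower() == truth.strip().lower())


def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall between prediction and truth."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    # Counter intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


truth = "Saint Bernadette Soubirous"
prediction = "Bernadette Soubirous"  # hypothetical model output that misses one word

print(exact_match(prediction, truth), round(f1_score(prediction, truth), 3))
```

Exact match is all-or-nothing, while F1 gives partial credit when the predicted span overlaps the reference answer.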

For our example here, we'll use the [SQUAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). The notebook should work with any question answering dataset provided by the 🤗 Datasets library. If you're using your own dataset defined from a JSON or csv file (see the [Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) on how to load them), it might need some adjustments in the names of the columns used.

In [None]:
datasets = load_dataset("squad_v2" if squad_v2 else "squad")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/7.83k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [None]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

We can see that the training and validation sets both have columns for the context, the question, and the answers to those questions (this `DatasetDict` has no test split).

datasets["train"][0] is used to access the first example in the training split of your SQuAD dataset.

In [None]:
datasets["train"][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

**`'answer_start': [515]`**: the answer starts at character position 515 of the context (counting from 0).
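The relationship between `answer_start` and the answer text can be checked with plain string slicing. The sketch below uses a shorter record from the same training split (the *Spectre* example, whose answer is `'22 April'` with `answer_start` 247) so that the whole context fits in a few lines.

```python
context = (
    "Following filming in Mexico, and during a scheduled break, Craig was flown "
    "to New York to undergo minor surgery to fix his knee injury. It was reported "
    "that filming was not affected and he had returned to filming at Pinewood "
    "Studios as planned on 22 April."
)
answers = {"text": ["22 April"], "answer_start": [247]}

# The answer spans the characters [start_char, end_char) of the context.
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

recovered = context[start_char:end_char]
print(recovered)
```

The dataset stores only the start position; the end position is derived from the length of the answer text, which is what the preprocessing later in the notebook relies on.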

-------------------------------------------------------------------------------------------------------------------

`show_random_elements` takes a dataset and displays a given number of randomly picked examples. It decodes labels that are stored as integer indices, which makes the content of the dataset easier to read.

**How the function works:**

It randomly selects indices into the dataset, ensuring that the same example is not picked more than once.

It creates a Pandas DataFrame from the selected examples.

It decodes class labels if they are represented as integer indices.

It displays the DataFrame as an HTML table.

In [None]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    """Display `num_examples` distinct, randomly chosen rows of `dataset` as an HTML table.

    dataset: the dataset from which random elements will be displayed.
    num_examples: the number of random examples to display (default is 10).
    """
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # random.sample draws without replacement, so no row is picked twice.
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    # Replace integer class indices with their label names, if the dataset has any.
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(datasets["train"])

Unnamed: 0,id,title,context,question,answers
0,573402a3d058e614000b6797,Genocide,"All signatories to the CPPCG are required to prevent and punish acts of genocide, both in peace and wartime, though some barriers make this enforcement difficult. In particular, some of the signatories—namely, Bahrain, Bangladesh, India, Malaysia, the Philippines, Singapore, the United States, Vietnam, Yemen, and former Yugoslavia—signed with the proviso that no claim of genocide could be brought against them at the International Court of Justice without their consent. Despite official protests from other signatories (notably Cyprus and Norway) on the ethics and legal standing of these reservations, the immunity from prosecution they grant has been invoked from time to time, as when the United States refused to allow a charge of genocide brought against it by former Yugoslavia following the 1999 Kosovo War.",Signatories to the CPPC are required to prevent and punish what?,"{'text': ['acts of genocide'], 'answer_start': [64]}"
1,56cdf52c62d2951400fa69cb,Spectre_(2015_film),"Following filming in Mexico, and during a scheduled break, Craig was flown to New York to undergo minor surgery to fix his knee injury. It was reported that filming was not affected and he had returned to filming at Pinewood Studios as planned on 22 April.",When did Craig go back to work?,"{'text': ['22 April'], 'answer_start': [247]}"
2,57313428e6313a140071cd15,Indigenous_peoples_of_the_Americas,"The first indigenous group encountered by Columbus were the 250,000 Taínos of Hispaniola who represented the dominant culture in the Greater Antilles and the Bahamas. Within thirty years about 70% of the Taínos had died. They had no immunity to European diseases, so outbreaks of measles and smallpox ravaged their population. Increasing punishment of the Taínos for revolting against forced labour, despite measures put in place by the encomienda, which included religious education and protection from warring tribes, eventually led to the last great Taíno rebellion.",What did the Taínos represent in the Greater Antilles and Bahamas?,"{'text': ['dominant culture'], 'answer_start': [109]}"
3,572e90ee03f98919007567b2,Elevator,"Elevator doors protect riders from falling into the shaft. The most common configuration is to have two panels that meet in the middle, and slide open laterally. In a cascading telescopic configuration (potentially allowing wider entryways within limited space), the doors roll on independent tracks so that while open, they are tucked behind one another, and while closed, they form cascading layers on one side. This can be configured so that two sets of such cascading doors operate like the center opening doors described above, allowing for a very wide elevator cab. In less expensive installations the elevator can also use one large ""slab"" door: a single panel door the width of the doorway that opens to the left or right laterally. Some buildings have elevators with the single door on the shaft way, and double cascading doors on the cab.",What design allows wider entryways within limited space?,"{'text': ['a cascading telescopic configuration'], 'answer_start': [165]}"
4,56e085d7231d4119001ac25c,Saint_Helena,"The Governor's Cup is a yacht race between Cape Town and Saint Helena island, held every two years in December/January; the most recent event was in December 2010. In Jamestown a timed run takes place up Jacob's Ladder every year, with people coming from all over the world to take part.",What months does the Governor's cup take place?,"{'text': ['December/January'], 'answer_start': [102]}"
5,572f112fdfa6aa1500f8d5af,Elevator,"The Twilight Zone Tower of Terror is the common name for a series of elevator attractions at the Disney's Hollywood Studios park in Orlando, the Disney California Adventure Park park in Anaheim, the Walt Disney Studios Park in Paris and the Tokyo DisneySea park in Tokyo. The central element of this attraction is a simulated free-fall achieved through the use of a high-speed elevator system. For safety reasons, passengers are seated and secured in their seats rather than standing. Unlike most traction elevators, the elevator car and counterweight are joined using a rail system in a continuous loop running through both the top and the bottom of the drop shaft. This allows the drive motor to pull down on the elevator car from underneath, resulting in downward acceleration greater than that of normal gravity. The high-speed drive motor is used to rapidly lift the elevator as well.",Do ride goers stand or sit?,"{'text': ['passengers are seated and secured in their seats rather than standing'], 'answer_start': [414]}"
6,57337cc94776f41900660ba8,Alfred_North_Whitehead,"Overall, however, Whitehead's influence is very difficult to characterize. In English-speaking countries, his primary works are little-studied outside of Claremont and a select number of liberal graduate-level theology and philosophy programs. Outside of these circles his influence is relatively small and diffuse, and has tended to come chiefly through the work of his students and admirers rather than Whitehead himself. For instance, Whitehead was a teacher and long-time friend and collaborator of Bertrand Russell, and he also taught and supervised the dissertation of Willard Van Orman Quine, both of whom are important figures in analytic philosophy – the dominant strain of philosophy in English-speaking countries in the 20th century. Whitehead has also had high-profile admirers in the continental tradition, such as French post-structuralist philosopher Gilles Deleuze, who once dryly remarked of Whitehead that ""he stands provisionally as the last great Anglo-American philosopher before Wittgenstein's disciples spread their misty confusion, sufficiency, and terror."" French sociologist and anthropologist Bruno Latour even went so far as to call Whitehead ""the greatest philosopher of the 20th century.""",Where has interest outside of those areas mainly come from?,"{'text': ['through the work of his students and admirers rather'], 'answer_start': [347]}"
7,572831a53acd2414000df6bd,PlayStation_3,"The system displays the What's New screen by default instead of the [Games] menu (or [Video] menu, if a movie was inserted) when starting up. What's New has four sections: ""Our Pick"", ""Recently Played"", latest information and new content available in PlayStation Store. There are four kinds of content the What's New screen displays and links to, on the sections. ""Recently Played"" displays the user's recently played games and online services only, whereas, the other sections can contain website links, links to play videos and access to selected sections of the PlayStation Store.","Other than the Video default screen for movies, what menu would the PS3 default to before What's New?","{'text': ['Games'], 'answer_start': [69]}"
8,5725c0c789a1e219009abdec,Buckingham_Palace,"The original early 19th-century interior designs, many of which survive, include widespread use of brightly coloured scagliola and blue and pink lapis, on the advice of Sir Charles Long. King Edward VII oversaw a partial redecoration in a Belle Époque cream and gold colour scheme. Many smaller reception rooms are furnished in the Chinese regency style with furniture and fittings brought from the Royal Pavilion at Brighton and from Carlton House. The palace has 775 rooms, and the garden is the largest private garden in London. The state rooms, used for official and state entertaining, are open to the public each year for most of August and September, and on selected days in winter and spring.",What house were many of the furniture and fittings brought from?,"{'text': ['Carlton House'], 'answer_start': [435]}"
9,56e72ec800c9c71400d76eeb,Daylight_saving_time,"In the case of the United States where a one-hour shift occurs at 02:00 local time, in spring the clock jumps forward from the last moment of 01:59 standard time to 03:00 DST and that day has 23 hours, whereas in autumn the clock jumps backward from the last moment of 01:59 DST to 01:00 standard time, repeating that hour, and that day has 25 hours. A digital display of local time does not read 02:00 exactly at the shift to summer time, but instead jumps from 01:59:59.9 forward to 03:00:00.0.","Daylight Saving Time is sometimes called summer time, but the clocks are actually moved forward in which season?","{'text': ['spring'], 'answer_start': [87]}"


In [None]:
from datasets import load_dataset

# Load the SQuAD dataset

squad_dataset = load_dataset("squad")

# Access the original dataset file

train_dataset_path = squad_dataset['train'].cache_files[0] if 'train' in squad_dataset else None
print("Path to the original SQuAD training dataset file:", train_dataset_path)



Path to the original SQuAD training dataset file: {'filename': '/root/.cache/huggingface/datasets/squad/plain_text/0.0.0/0a3a8b7b57e8578ec40b2d2bb4c75aca1a6d6ce1/squad-train.arrow'}


## Preprocessing the training data

Tokenization breaks a piece of text into individual tokens (words, subwords, or characters) and maps each token to an integer ID that the model can process. The component that does this is the tokenizer, a crucial part of any natural language processing (NLP) pipeline.

Tokenizer's Role:

Input: Raw text.

Output: List of tokens (as integer IDs).

Purpose: To convert raw text of any length into a sequence of token IDs that can be fed into a model. The length of that sequence varies with the text; truncation and padding bring the sequences in a batch to a common length.
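BERT-style tokenizers such as DistilBERT's split words into subwords with the WordPiece scheme: each word is matched greedily against the vocabulary, longest piece first, and continuation pieces are marked with `##`. The sketch below illustrates that matching on a tiny, hypothetical vocabulary; `wordpiece` and `toy_vocab` are made up for this example and stand in for the real tokenizer and its roughly 30,000-entry vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"my", "name", "is", "sy", "##lva", "##in"}

tokens = [piece for word in "my name is sylvain".split() for piece in wordpiece(word, toy_vocab)]
print(tokens)
```

A rare word is broken into several known pieces instead of being mapped to a single unknown token, which is why one word can account for several entries in `input_ids`.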

**`AutoTokenizer`**

`AutoTokenizer.from_pretrained` loads the tokenizer that matches a given pre-trained checkpoint.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

The following assertion ensures that our tokenizer is a fast tokenizer (backed by Rust) from the 🤗 Tokenizers library. Fast tokenizers are available for almost all models, and we will need some of their special features for our preprocessing.

In [None]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

You can check which type of models have a fast tokenizer available and which don't on the [big table of models](https://huggingface.co/transformers/index.html#bigtable).

In [None]:
tokenizer("What is your name?", "My name is Sylvain.")

{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102, 2026, 2171, 2003, 25353, 22144, 2378, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

**`input_ids`**: a list of integer indices, one per token. Each integer identifies a specific token in the vocabulary.

**`attention_mask`**: a binary mask marking which positions hold actual tokens (1) and which hold padding (0), so that the model can ignore the padding. Nothing is padded in the example above, so every value is 1.

**Input Tokens:** the actual tokens representing the content of the text, together with special tokens such as `[CLS]` (ID 101) and `[SEP]` (ID 102).

**Padding Tokens:** additional tokens (ID 0 for this tokenizer) added so that all input sequences in a batch have the same length.
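The single example above needs no padding, so its mask is all ones. The following sketch shows what happens when two sequences of different lengths are batched together. `pad_batch` is a hypothetical helper written for this illustration; in practice the tokenizer or a data collator does this for you.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two short sequences of token IDs (101 = [CLS], 102 = [SEP]).
padded = pad_batch([[101, 2054, 2003, 102], [101, 2171, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```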




In [None]:
max_length = 384  # The maximum length of a feature (question and context)
doc_stride = 128  # The authorized overlap between two parts of the context when it has to be split.

These two parameters control how long question/context pairs are handled during tokenization for transformer-based models.

**`max_length`**

**Definition**: the maximum length, in tokens, of one feature (question, context, and special tokens together).

**Purpose**: in question answering, the input is a question followed by a context. `max_length` caps the total number of tokens in that combined input.

**`doc_stride`**

**Definition**: the number of tokens shared by two consecutive chunks of the same context.

**Purpose**: a context that does not fit within `max_length` is split into several chunks. The overlap makes it less likely that an answer is cut in half at a chunk boundary.

**EX** (counting context tokens only, with room for 384 of them per chunk):
The first chunk holds tokens 1 to 384.
The second chunk holds tokens 257 to 640, so it shares 128 tokens (257 to 384) with the first.
In practice the question and the special tokens are repeated in every chunk and use up part of the 384-token budget.

A question/context pair longer than `max_length` (384 tokens here) has to be truncated or split into several features. Let's find one such long example in our dataset:

In [None]:
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > 384:
        break
example = datasets["train"][i]

This code **iterates through the training examples (`datasets["train"]`) and stops at the first one whose tokenized question and context together exceed 384 tokens**.

Without any truncation, the tokenized question and context of this example give the following number of input IDs:

In [None]:
len(tokenizer(example["question"], example["context"])["input_ids"])

396

Now, if we just truncate, we will lose information (and possibly the answer to our question). The call below caps the combined length at `max_length` and, with `truncation="only_second"`, drops tokens from the context only:

In [None]:
len(tokenizer(example["question"], example["context"], max_length=max_length, truncation="only_second")["input_ids"])

384

Note that we never want to truncate the question, only the context, hence the `only_second` truncation strategy. Our tokenizer can also return a list of features, each capped at the maximum length and overlapping as described above; we just have to pass `return_overflowing_tokens=True` and the stride:

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

**`truncation`**

Truncation shortens a sequence by removing part of it. It is needed when the original text is too long to fit within the maximum length constraint.

Example:

```
Question: "What is the capital of France?"

Context: "Paris, the beautiful capital of France, is known for its iconic landmarks..."

Full input:
[CLS], "What", "is", "the", "capital", "of", "France", "?", [SEP], "Paris", ",", "the", "beautiful", "capital", "of", "France", ",", "is", "known", "for", "its", "iconic", "landmarks", "...", [SEP]

After truncation="only_second" (the question is kept whole, the end of the context is dropped):
[CLS], "What", "is", "the", "capital", "of", "France", "?", [SEP], "Paris", ",", "the", "beautiful", [SEP]
```
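The same idea can be written as a few lines of code. This is a minimal sketch on word-level tokens, assuming the BERT-style layout `[CLS] question [SEP] context [SEP]`; `truncate_only_second` is a hypothetical helper, not the tokenizer's implementation.

```python
def truncate_only_second(question_tokens, context_tokens, max_length):
    """Build [CLS] question [SEP] context [SEP], cutting only the context to fit max_length."""
    n_special = 3  # [CLS] and two [SEP]
    room = max_length - n_special - len(question_tokens)
    if room < 0:
        raise ValueError("max_length is too small to hold the question itself")
    return ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens[:room] + ["[SEP]"]


question_tokens = ["What", "is", "the", "capital", "of", "France", "?"]
context_tokens = ["Paris", ",", "the", "beautiful", "capital", "of", "France", ",",
                  "is", "known", "for", "its", "iconic", "landmarks"]

feature = truncate_only_second(question_tokens, context_tokens, max_length=14)
print(feature)
```

With `max_length=14`, the question and special tokens use 10 positions, so only the first four context tokens survive; everything after them is lost, which is the problem the overflow mechanism addresses.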





Now we don't have one list of `input_ids`, but several. The following cell lists the length of each one:

In [None]:
[len(x) for x in tokenized_example["input_ids"]]

[384, 157]
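These two lengths can be reproduced by hand. The sketch below assumes values read off this notebook's outputs: the untruncated input has 396 tokens, of which 14 belong to the question and 3 are special tokens (`[CLS]` and two `[SEP]`), leaving 379 context tokens. `window_spans` is a hypothetical helper that mimics the sliding window; it is not the tokenizer's own code.

```python
def window_spans(n_context, n_overhead, max_length, stride):
    """Return (start, end) context-token spans for overlapping windows."""
    capacity = max_length - n_overhead  # context tokens that fit in one feature
    step = capacity - stride            # how far each new window advances
    if step <= 0:
        raise ValueError("stride must be smaller than the per-feature context capacity")
    spans, start = [], 0
    while True:
        end = min(start + capacity, n_context)
        spans.append((start, end))
        if end == n_context:
            break
        start += step
    return spans


n_overhead = 14 + 3            # question tokens + special tokens, repeated in every feature
n_context = 396 - n_overhead   # context tokens in the untruncated input

spans = window_spans(n_context, n_overhead, max_length=384, stride=128)
feature_lengths = [end - start + n_overhead for start, end in spans]
print(spans, feature_lengths)
```

The second window starts 128 tokens before the end of the first one, so the two features share exactly `doc_stride` context tokens.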

And if we decode them, we can see the overlap. The loop below takes the first two features in `tokenized_example["input_ids"]` and prints each one as human-readable text using the tokenizer's `decode` method:

In [None]:
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season, his fifteenth at notr

Now this will give us some work to properly treat the answers: we need to find in which of those features the answer actually is, and where exactly in that feature. The models we will use require the start and end positions of these answers in the tokens, so we will also need to map parts of the original context to some tokens. Thankfully, the tokenizer we're using can help us with that by returning an `offset_mapping`:

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])

[(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33), (34, 37), (37, 38), (38, 39), (40, 50), (51, 55), (56, 60), (60, 61), (0, 0), (0, 3), (4, 7), (7, 8), (8, 9), (10, 20), (21, 25), (26, 29), (30, 34), (35, 36), (36, 37), (37, 40), (41, 45), (45, 46), (47, 50), (51, 53), (54, 58), (59, 61), (62, 69), (70, 73), (74, 78), (79, 86), (87, 91), (92, 96), (96, 97), (98, 101), (102, 106), (107, 115), (116, 118), (119, 121), (122, 126), (127, 138), (138, 139), (140, 146), (147, 153), (154, 160), (161, 165), (166, 171), (172, 175), (176, 182), (183, 186), (187, 191), (192, 198), (199, 205), (206, 208), (209, 210), (211, 217), (218, 222), (223, 225), (226, 229), (230, 240), (241, 245), (246, 248), (248, 249), (250, 258), (259, 262), (263, 267), (268, 271), (272, 277), (278, 281), (282, 285), (286, 290), (291, 301), (301, 302), (303, 307), (308, 312), (313, 318), (319, 321), (322, 325), (326, 330), (330, 331), (332, 340), (341, 351), (352, 354), (355, 363), (364, 373), (374,

**`offset mapping`**

**`provides the start and end character positions of the token in the original text.`**

 provides a mapping between the tokens in the tokenized sequence and their corresponding character positions in the original text.

**stride=doc_stride**: This parameter determines the overlap between consecutive chunks when splitting a long document


tokenizes the question and context of a specific example using the tokenizer with additional parameters for handling long documents or passages.

**max_length=max_length**

If the total number of tokens exceeds this maximum length, the sequence is truncated or split accordingly.

**truncation** is the process of shortening a sequence of tokens to fit within the maximum length.

**"only_second"** means that truncation is applied only to the second input sequence (here, the context), so the question is never cut.

**`return_overflowing_tokens=True`**

With this option the tokenizer does not discard the tokens that do not fit into `max_length`. It returns them as additional chunks, so inputs that exceed the model's maximum token limit can still be processed in full.
With the fast tokenizer used in this notebook, each extra chunk appears as an additional entry in `input_ids`, and `overflow_to_sample_mapping` records which example each chunk came from.
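
To make the interaction between the maximum length and the stride more concrete, here is a minimal sketch of a sliding window over a list of token IDs. The function `split_with_stride` is a hypothetical stand-in: it ignores the question and the special tokens, and only shows how consecutive chunks overlap by `stride` tokens.

```python
def split_with_stride(token_ids, window, stride):
    """Split `token_ids` into chunks of at most `window` tokens.

    Consecutive chunks share `stride` tokens. Simplified illustration,
    not the tokenizer's actual implementation.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than window")
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += window - stride  # step forward, keeping `stride` tokens of overlap
    return chunks


context_ids = list(range(10))  # pretend token IDs 0..9
chunks = split_with_stride(context_ids, window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

The overlap means that an answer cut by one chunk boundary has a chance to appear whole in the neighbouring chunk.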




In [None]:
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])

how How


`first_token_id = tokenized_example["input_ids"][0][1]`

Retrieves the ID of the second token (index 1) in the first tokenized sequence. Index 0 usually corresponds to the `[CLS]` token.

`offsets = tokenized_example["offset_mapping"][0][1]`

Retrieves the offset mapping for the same token, that is, its start and end character positions in the original text.

`tokenizer.convert_ids_to_tokens([first_token_id])[0]` converts the token ID back to its textual representation. The output shows the token `how` next to the original substring `How`: the offsets point into the unmodified text, while the token itself has been lowercased by the tokenizer.

So we can use this mapping to find the position of the start and end tokens of our answer in a given feature. We just have to distinguish which parts of the offsets correspond to the question and which parts correspond to the context. This is where the `sequence_ids` method of our `tokenized_example` is useful:

In [None]:
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

The **`sequence_ids`** method returns one sequence ID for each token in the tokenized example:

- `None` for the special tokens,
- **`0`** when the token comes from the first sequence (here, the question),
- **`1`** when the token comes from the second sequence (here, the context).
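
The sketch below imitates that output for a tiny input and locates the context span in it. The helper `toy_sequence_ids` and the token counts are made up, and the `[CLS] question [SEP] context [SEP]` layout is an assumption about a BERT-style tokenizer.

```python
def toy_sequence_ids(num_question_tokens, num_context_tokens):
    """Mimic the layout [CLS] question [SEP] context [SEP] (assumed, BERT-style)."""
    return (
        [None]
        + [0] * num_question_tokens
        + [None]
        + [1] * num_context_tokens
        + [None]
    )


sequence_ids = toy_sequence_ids(num_question_tokens=3, num_context_tokens=5)

# First and last positions whose sequence ID is 1, i.e. the context span.
context_start = sequence_ids.index(1)
context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
print(sequence_ids)
print(context_start, context_end)
```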

Now with all of this, we can find the first and last token of the answer in one of our input features (or determine that the answer is not in this feature):

In [None]:
# Retrieve the answer information: `answers` holds the answer text and its starting
# character position, from which we derive start_char and end_char.
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the context in this feature (first token with sequence ID 1).
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1



# Check whether the answer lies fully inside this feature by comparing its character positions with the offsets.
offsets = tokenized_example["offset_mapping"][0]
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    # Move the token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go after the last offset if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")

23 26


This cell aligns the answer with the tokenized text by finding the corresponding token indices. It also handles the case where the answer extends beyond the boundaries of the feature.
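
The same idea can be written as a small standalone function, shown here on hand-made offsets. This is a sketch in the spirit of the loops above rather than a copy of them: `find_answer_tokens` and the toy offsets are hypothetical, and the function assumes the offsets of the context tokens are in increasing order.

```python
def find_answer_tokens(offsets, context_start, context_end, start_char, end_char):
    """Return (start_token, end_token) for the answer, or None if it is not fully inside.

    `offsets` holds one (start, end) character pair per token;
    `context_start` and `context_end` are the first and last context token indices.
    """
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return None  # the answer is not fully contained in this feature

    start_token = context_start
    # Advance while the next token still starts at or before the answer start.
    while start_token < context_end and offsets[start_token + 1][0] <= start_char:
        start_token += 1

    end_token = context_end
    # Move back while the previous token still ends at or after the answer end.
    while end_token > context_start and offsets[end_token - 1][1] >= end_char:
        end_token -= 1

    return start_token, end_token


# Toy feature for the context "the cat sat down", with a special token on each side.
offsets = [(0, 0), (0, 3), (4, 7), (8, 11), (12, 16), (0, 0)]
span = find_answer_tokens(offsets, context_start=1, context_end=4, start_char=4, end_char=11)
print(span)  # token indices covering "cat sat"
```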

We can now print the decoded answer span from the tokenized sequence and compare it with the original answer text:

In [None]:
print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
print(answers["text"][0])

over 1, 600
over 1,600


For this notebook to work with any kind of models, we need to account for the special case where the model expects padding on the left (in which case we switch the order of the question and the context):

In natural language processing tasks, padding is often applied so that all sequences in a batch have the same length. The side on which the padding is applied can affect how the model processes the input. The variable `pad_on_right` is `True` when the tokenizer pads on the right; it is used later in the code to decide the order of the question and the context.

In [None]:
pad_on_right = tokenizer.padding_side == "right"
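
For readers who have not met the two padding conventions, the following sketch pads the same short sequence on each side. The helper `pad` and the token IDs are hypothetical; the real tokenizer performs this step internally.

```python
def pad(ids, length, pad_id=0, side="right"):
    """Pad `ids` with `pad_id` up to `length` on the given side (hypothetical helper)."""
    padding = [pad_id] * max(0, length - len(ids))
    return ids + padding if side == "right" else padding + ids


right_padded = pad([101, 7, 8, 102], 6, side="right")
left_padded = pad([101, 7, 8, 102], 6, side="left")
print(right_padded)
print(left_padded)
```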

Now let's put everything together in one function we will apply to our training set. In the case of impossible answers (the answer is in another feature given by an example with a long context), we set the CLS index for both the start and end position. We could also simply discard those examples from the training set if the flag `allow_impossible_answers` is `False`. Since the preprocessing is already complex enough as it is, we've kept it simple for this part.

In [None]:
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

In [None]:
#  tokenized_subset = squad_subset.map(prepare_train_features, batched=True, remove_columns=squad_subset.column_names)


This function prepares data for training a question-answering model. It handles the tokenization, truncation, padding, and labeling of examples, taking into account the specifics of the tokenization process and the requirements of the downstream task.

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
features = prepare_train_features(datasets['train'][:5])

Here `prepare_train_features` is applied to the first 5 examples of the training set: it tokenizes and labels them in the same way the full dataset will be processed for training the question-answering model.

To apply this function to all the examples in our dataset, we just use the `map` method of the `datasets` object we created earlier. This will apply the function to all the elements of all the splits in `datasets`, so our training and validation data will be preprocessed in one single command. Since our preprocessing changes the number of samples, we need to remove the old columns when applying it.

In [None]:
# Assuming you want to transform the "train" subset
# tokenized_datasets = datasets["train"].map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)


In [None]:
tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

KeyboardInterrupt: 

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data must not be used). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
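
The reason the old columns have to be removed can be seen on a toy batched function. In the sketch below, `split_long_rows` and its column names are made up; the point is only that a batched function receives a dict of lists and may return more rows than it was given, so the original columns no longer line up with the output.

```python
def split_long_rows(batch, max_words=3):
    """Toy batched function: one input row may yield several output rows.

    `batch` is a dict of lists, the shape a batched `map` function receives.
    The column names here are made up for the example.
    """
    out = {"chunk": [], "source_row": []}
    for row_index, text in enumerate(batch["text"]):
        words = text.split()
        for i in range(0, len(words), max_words):
            out["chunk"].append(" ".join(words[i:i + max_words]))
            out["source_row"].append(row_index)
    return out


batch = {"text": ["a b c d e", "f g"]}
result = split_long_rows(batch)
print(result)  # 3 output rows from 2 input rows
```

The `source_row` column plays the same role as `overflow_to_sample_mapping` in the preprocessing function: it remembers which input row each output row came from.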

In [None]:
# Assuming datasets is already loaded
print(len(datasets["train"]), len(datasets["validation"]))


In this SQuAD dataset, the split is **`89.23% for training`** and **`10.77% for validation`**.


https://www.tensorflow.org/datasets/catalog/squad#:~:text=Stanford%20Question%20Answering%20Dataset%20(SQuAD,the%20question%20might%20be%20unanswerable.&text=Versions%3A,3.0.
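
The percentages can be checked with a short calculation. The training size matches the progress bar shown earlier; the validation size of 10,570 is an assumption based on the published size of the SQuAD v1.1 validation split, since the output of the cell above is not shown.

```python
train_size = 87599       # matches the progress bar shown above
validation_size = 10570  # assumed size of the SQuAD v1.1 validation split

total = train_size + validation_size
train_pct = 100 * train_size / total
validation_pct = 100 * validation_size / total
print(f"train: {train_pct:.2f}%  validation: {validation_pct:.2f}%")
```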

## Fine-tuning the model

**`AutoModelForQuestionAnswering`**

Automatically loads a pre-trained question-answering (QA) model based on a model identifier or path.

**`from_pretrained`**

This method loads a model from either a model identifier or a local path. Model identifiers are names like "distilbert-base-uncased-distilled-squad"; a path points to a locally stored model.

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

The warning is telling us we are throwing away some weights (the `vocab_transform` and `vocab_layer_norm` layers) and randomly initializing some others (the `pre_classifier` and `classifier` layers). The exact layer names depend on the checkpoint and on the task head; for a question-answering model the newly initialized head is typically named `qa_outputs`. This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

---------------------------------------------------------------------------------------------------

To instantiate a `Trainer`, we will need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

**`accelerate`** is a library developed by Hugging Face to simplify and optimize the training of deep learning models. With it, you can train your models on multiple GPUs or even multiple machines.

In [None]:
!pip install accelerate -U

-------------------------------------------------------------------------------------------

`!pip install transformers[torch]` ensures that both the transformers library and its PyTorch dependencies are installed.

**`[torch]`**: This is an extra specifier that indicates you want to install additional dependencies for PyTorch.

In [None]:
!pip install transformers[torch]

In [None]:
! pip install -U accelerate
! pip install -U transformers

In [None]:
pip install transformers[pytorch]

In [None]:
import accelerate

accelerate.__version__

In [None]:
import accelerate
import transformers

transformers.__version__, accelerate.__version__

**`TrainingArguments`**: This class is part of the transformers library and is used to **store all the hyperparameters** and **settings for training a model.**

**`evaluation_strategy="epoch"`**: evaluation is **performed at the end of each epoch** during training.

**`learning_rate`**: a hyperparameter that determines the step size at each iteration while moving toward a minimum of the loss function.

**`per_device_train_batch_size`** and **`per_device_eval_batch_size`**: the batch size per device during training and evaluation, respectively.
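
To give a feel for what "step size" means, here is a toy example that minimizes f(x) = x² with plain gradient descent at three learning rates. This is only an illustration: the `Trainer` uses a more elaborate optimizer (AdamW by default) together with a learning-rate schedule, and the function and values below are made up.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(x) = x**2 with plain gradient descent; returns the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x            # derivative of x**2
        x -= learning_rate * gradient
    return x


for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

A very small rate moves slowly toward the minimum at 0, a moderate rate gets close within the 20 steps, and a rate that is too large overshoots further on every step and diverges.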

In [None]:
from transformers import DefaultDataCollator


In [None]:
from sklearn.metrics import f1_score, accuracy_score
import numpy as np


In [None]:
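def compute_metrics(p):
    # For question answering, p.predictions is expected to hold (start_logits, end_logits)
    # and p.label_ids to hold (start_positions, end_positions).
    predictions, labels = p.predictions, p.label_ids
    # Stacking the two logit arrays gives shape (2, num_features, seq_len);
    # the argmax over the last axis is the predicted token position.
    predictions = np.argmax(predictions, axis=2)

    # Flatten predictions and labels
    prediction_list = [pred for pred_seq in predictions for pred in pred_seq]
    label_list = [label for label_seq in labels for label in label_seq]

    # Make sure the lengths match
    assert len(prediction_list) == len(label_list), "Number of predictions and labels must match"

    # Note: these are position-level scores (is the predicted start/end index correct?),
    # not the SQuAD F1 / exact match computed on answer text.
    # With micro averaging on single-label data, F1 equals accuracy.
    f1 = f1_score(label_list, prediction_list, average='micro')
    em = accuracy_score(label_list, prediction_list)

    return {"f1": f1, "exact_match": em}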
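
For comparison, SQuAD-style exact match and F1 are computed on the answer text rather than on token positions. The sketch below is a simplified version under stated assumptions: it only lowercases and splits on whitespace, whereas the official evaluation also strips punctuation and articles. The function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the two strings match after lowercasing and whitespace cleanup, else 0.0."""
    return float(prediction.lower().split() == reference.lower().split())


def token_f1(prediction, reference):
    """Word-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("over 1,600", "over 1,600"))
print(token_f1("over 1,600 students", "over 1,600"))
```

A prediction with one extra word is not an exact match, but it still earns partial credit under F1.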




# > early_stopping



In [None]:
from transformers.trainer_callback import EarlyStoppingCallback


In [None]:
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=1,
    early_stopping_threshold=0.0125,
)
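
The roles of the patience and the threshold can be sketched in plain Python. This is a simplified model of the idea, not the library's implementation: the function `first_stopping_epoch` and the loss values are made up, and an evaluation is counted as an improvement only when the loss drops by more than `threshold` below the best loss seen so far.

```python
def first_stopping_epoch(eval_losses, patience, threshold):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified sketch of early stopping on a loss that should decrease.
    """
    best = None
    bad_evals = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if best is None or best - loss > threshold:
            bad_evals = 0          # clear improvement: reset the counter
        else:
            bad_evals += 1         # no sufficient improvement this time
        if best is None or loss < best:
            best = loss
        if bad_evals >= patience:
            return epoch
    return None


losses = [1.20, 1.10, 1.15, 1.05]  # made-up evaluation losses
print(first_stopping_epoch(losses, patience=1, threshold=0.0125))
```

With a patience of 1, a single evaluation without sufficient improvement ends training; a larger patience tolerates temporary setbacks.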


In [None]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    # Output directory where the fine-tuned model checkpoints and related outputs are saved.
    f"{model_name}-finetuned-squad",
    evaluation_strategy = "epoch",
    # Learning rate: the step size of each optimizer update.
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    # Weight decay: a regularization technique used to reduce overfitting.
    weight_decay=0.01,
    load_best_model_at_end=True,
    save_strategy="epoch",
    push_to_hub=False,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay.

The last argument, `push_to_hub`, controls whether the model is pushed to the [Hub](https://huggingface.co/models) regularly during training. It is set to `False` here; set it to `True` only if you followed the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/bert-finetuned-squad"` or `"huggingface/bert-finetuned-squad"`).

Then we will need a data collator that will batch our processed examples together, here the default one will work:

A **`data collator`** is a function or object responsible for combining individual training examples into batches. Its purpose is to take a list of samples and organize them into a batch that can be efficiently processed by the model during training.


# train



In [None]:
from transformers import default_data_collator

data_collator = default_data_collator
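
Conceptually, collation turns a list of per-example dictionaries into one dictionary of batched values. The sketch below shows this with plain lists; the function `collate` and the feature values are hypothetical, and the real collator additionally converts the batched values to tensors.

```python
def collate(features):
    """Turn a list of per-example dicts into one dict of lists (one batch).

    Simplified sketch: assumes every example has the same keys and
    that the values are already padded to the same length.
    """
    return {key: [feature[key] for feature in features] for key in features[0]}


features = [
    {"input_ids": [101, 7, 102], "start_positions": 1},
    {"input_ids": [101, 9, 102], "start_positions": 2},
]
batch = collate(features)
print(batch)
```

Because the preprocessing above already pads every feature to `max_length`, no further padding is needed at this stage.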

We will evaluate our model on answer text in the next section (this is a very long operation, so during training we only compute the evaluation loss and the position-level scores returned by `compute_metrics`).

Then we just need to pass all of this along with our datasets to the `Trainer`:

The **`Trainer`** class in the transformers library provides a high-level interface for training and evaluating transformer models. It simplifies the process of fine-tuning, making it easier to experiment with different architectures and hyperparameters.

**`Training Arguments (args)`**: An object containing various hyperparameters and settings for training. This includes parameters such as learning rate, batch size, number of epochs, and more.

**`Data Collator (data_collator)`**: A data collator is responsible for batching and organizing individual examples into input batches that can be processed by the model.

**`Tokenizer (tokenizer)`**: The tokenizer used to convert raw text into tokenized input suitable for the model. It is responsible for encoding text into input IDs and attention masks.

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[early_stopping],
)







We can now finetune our model by just calling the `train` method:

In [None]:
trainer.train()

Since this training is particularly long, we reload the best checkpoint found during training, plot the losses, and then save the model in case we need to restart.

In [None]:
# After trainer.train()
best_checkpoint = trainer.state.best_model_checkpoint
model = AutoModelForQuestionAnswering.from_pretrained(best_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(best_checkpoint)


In [None]:
import matplotlib.pyplot as plt

# Data
epochs = [1, 2, 3]
training_loss = [1.223600, 0.957300, 0.751500]
validation_loss = [1.178746, 1.130409, 1.166829]

# Plotting
plt.plot(epochs, training_loss, label='Training Loss', marker='o')
plt.plot(epochs, validation_loss, label='Validation Loss', marker='o')

# Add labels and title
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss Over Epochs')

# Add legend
plt.legend()

# Show the plot
plt.show()


In [None]:
trainer.save_model("test-squad-trained")

## Evaluation

Evaluating our model will require a bit more work, as we will need to map the predictions of our model back to parts of the context. The model itself predicts logits for the start and end positions of our answers: if we take a batch from our validation dataloader, here is the output our model gives us:

**`Performing evaluation on a batch of data`** means using the trained model **to make predictions on a set of input examples** for the purpose of assessing the model's performance.

In [None]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

**`logits`**

Logits are the outputs of a neural network before an activation function such as softmax is applied. They can be seen as the scores assigned to each class in a classification problem or, as here, the unnormalized scores for the different positions in a sequence.

The cell below checks the shape of the logits produced by the model for the start and end positions of the answer in the context.

**`Start logits`** indicate, for each token, how likely it is to be the beginning of the answer.

**`End logits`** indicate, for each token, how likely it is to be the end of the answer.
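
As a small worked example, the sketch below turns a handful of made-up start logits into probabilities with softmax and picks the most likely position. The values are invented for illustration; note that softmax preserves the ordering, so the highest logit and the highest probability fall on the same position.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


start_logits = [-1.0, 2.5, 0.3, 4.0]  # made-up scores, one per token position
probs = softmax(start_logits)
best_position = max(range(len(start_logits)), key=start_logits.__getitem__)
print(probs, best_position)
```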



In [None]:
output.start_logits.shape, output.end_logits.shape

`argmax` is used to **`find the positions (indices) in the sequence with the highest logit`** values. Specifically, `argmax(dim=-1)` is applied along the last dimension of the logits tensor, which corresponds to the different positions in the sequence.

In [None]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

This will work great in a lot of cases, but what if this prediction gives us something impossible? The start position could be greater than the end position, or point to a span of text in the question instead of the context. In that case, we might want to look at the second best prediction to see if it gives a possible answer and select that instead.

However, picking the second best answer is not as easy as picking the best one: is it the second best index in the start logits with the best index in the end logits? Or the best index in the start logits with the second best index in the end logits? And if that second best answer is not possible either, it gets even trickier for the third best answer.
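
A tiny numeric example shows the problem. One way to rank candidates, sketched here, is to score each (start, end) pair by the sum of its two logits and keep only pairs with start <= end. The logits are made up, and this toy version enumerates every pair, which is only practical for very short sequences.

```python
from itertools import product

# Made-up logits for a five-token sequence.
start_logits = [0.1, 0.2, 3.0, 0.5, 2.0]
end_logits = [0.3, 2.5, 0.1, 2.0, 0.2]

best_start = max(range(5), key=start_logits.__getitem__)
best_end = max(range(5), key=end_logits.__getitem__)
print(best_start, best_end)  # the independent argmaxes give start > end here

# Score every ordered pair with start <= end by the sum of its two logits.
candidates = [
    (start_logits[s] + end_logits[e], s, e)
    for s, e in product(range(5), repeat=2)
    if s <= e
]
best_score, best_s, best_e = max(candidates)
print(best_s, best_e, best_score)
```

Here the best valid span keeps the best start index but pairs it with the second best end index, which neither argmax alone would have produced.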


To classify our answers, we will use the score obtained by adding the start and end logits. We won't try to order all the possible answers and will limit ourselves to a number of candidates set by a hyper-parameter we call `n_best_size`. We'll pick the best indices in the start and end logits and gather all the answers this predicts. After checking if each one is valid, we will sort them by their score and keep the best one. Here is how we would do this on the first feature in the batch:

If `n_best_size` is set to 20, the post-processing step considers the 20 best start indices and the 20 best end indices, and therefore at most 400 **`candidate answer spans`**, which are then ranked by their scores.

**`Candidate Answer Spans`**

These are possible answers that the model identifies based on the logits (scores) assigned to different positions in the input context.

In [None]:
n_best_size = 20

**`1-`** Convert `start_logits` and `end_logits` to NumPy arrays for easier manipulation.

**`2-`** Get the indices of the top `n_best_size` candidates for the start and for the end.

**`3-`** Generate valid answers: iterate through the combinations of start and end indices and keep those where the start index is less than or equal to the end index. This check **`ensures that the start index of a candidate answer comes before or at the same position as the end index`**.



In [None]:
import numpy as np
# Convert start_logits and end_logits to NumPy arrays for easier manipulation.
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()

# Gather the indices of the n_best_size best start/end logits, in descending order of score.
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()

# Generate candidate answers: iterate through the combinations of start and end indices
# and keep those where the start index is less than or equal to the end index.
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # We need to find a way to get back the original substring corresponding to the answer in the context
                }
            )

And then we can sort the `valid_answers` according to their `score` and only keep the best one. The only point left is how to check a given span is inside the context (and not the question) and how to get back the text inside. To do this, we need to add two things to our validation features:
- the ID of the example that generated the feature (since each example can generate several features, as seen before);
- the offset mapping that will give us a map from token indices to character positions in the context.

That's why we will re-process the validation set with the following function, slightly different from `prepare_train_features`:

This function prepares validation features by tokenizing the questions and contexts, handling overflow, and maintaining mappings between features and examples. It ensures that the tokenized validation data is appropriately formatted for evaluation during the fine-tuning process.

In [None]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace, which also keeps the formatting consistent for tokenization.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
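
The masking step at the end of the function can be tried on a hand-made feature. In this sketch the helper `mask_non_context`, the offsets and the sequence IDs are all made up; it only demonstrates how replacing non-context offsets with `None` makes it trivial to test whether a token position belongs to the context.

```python
def mask_non_context(offsets, sequence_ids, context_index=1):
    """Keep offsets for context tokens and replace all others with None."""
    return [
        offset if seq_id == context_index else None
        for offset, seq_id in zip(offsets, sequence_ids)
    ]


# Made-up feature: [CLS] + 2 question tokens + [SEP] + 3 context tokens + [SEP]
offsets = [(0, 0), (0, 3), (4, 8), (0, 0), (0, 3), (4, 7), (8, 11), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]

masked = mask_non_context(offsets, sequence_ids)
print(masked)
```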

And like before, we can apply that function to our validation set easily:

**`prepare_validation_features`**: the function used to process each batch of examples in the validation dataset.

**`batched=True`**: the function receives a batch of examples at a time, which **improves efficiency during processing**.

**`remove_columns=datasets["validation"].column_names`**: **removes the original columns** from the output. This is required because one example can produce several features, so the old columns would no longer line up with the new rows.

**`validation_features`**: stores the processed validation features.

In [None]:
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)

**`validation_features`** **is a tokenized and formatted version of the validation dataset, ready to be used for model evaluation**. It contains information such as input IDs, attention masks, offset mappings, and example IDs. During evaluation, this preprocessed dataset is fed into the fine-tuned model to obtain predictions, which can then be compared with the ground truth to assess the model's performance on the validation set.

Now we can grab the predictions for all features by using the `Trainer.predict` method:

In [None]:
raw_predictions = trainer.predict(validation_features)

The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [None]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

We can now refine the test we had before: since we set `None` in the offset mappings when it corresponds to a part of the question, it's easy to check if an answer is fully inside the context. We also eliminate very long answers from our considerations (with a hyper-parameter we can tune):

In [None]:
max_answer_length = 30

**`Offset Mappings`**

For example, if the token "running" covers characters 10 to 16 of the original text, its offset mapping is `(10, 17)`: the start is inclusive and the end is exclusive.

In [None]:
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()

# Offset mapping of the first validation feature.
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example, so we fetch the original context (the passage of text)
# from the first validation example. For the more general case, we will need to match the example_id to
# an example index.
context = datasets["validation"][0]["context"]

# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()

#Filtering and Forming Valid Answers:
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char: end_char]
                }
            )
#Selecting Top Answers:
valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

We can compare to the actual ground-truth answer, which includes the text of the answer and the character position where the answer starts in the original context:

In [None]:
datasets["validation"][0]["answers"]

Our model picked the right answer as the most likely one!

As we mentioned in the code above, this was easy on the first feature because we knew it comes from the first example. For the other features, we will need a map between examples and their corresponding features. Also, since one example can give several features, we will need to gather together all the answers in all the features generated by a given example, then pick the best one. The following code builds a map from example index to its corresponding features indices:

**`example_id_to_index:`**

It creates a dictionary that maps example IDs to their corresponding index in the validation dataset.

**`features_per_example:`**

It is a defaultdict that collects lists of feature indices for each example.

The purpose of organizing the features in this way is to **facilitate the grouping of features based on the example** to which they belong.
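
A small sketch with made-up data may help picture the two structures. The `examples` and `features` lists here are toy stand-ins for the validation dataset and its tokenized features, where the first example was long enough to be split into two features.

```python
import collections

# Toy stand-ins: two examples, the first of which produced two features.
examples = [{"id": "a"}, {"id": "b"}]
features = [{"example_id": "a"}, {"example_id": "a"}, {"example_id": "b"}]

# Map each example ID to its position in the examples list.
example_id_to_index = {ex["id"]: i for i, ex in enumerate(examples)}

# Collect, for each example index, the indices of the features it generated.
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

print(dict(features_per_example))  # {0: [0, 1], 1: [2]}
```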

In [None]:
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

We're almost ready for our post-processing function. The last bit to deal with is the impossible answer (when `squad_v2 = True`). The code above only keeps answers that are inside the context, so we also need to grab the score for the impossible answer (which has start and end indices corresponding to the index of the CLS token). When one example gives several features, we should predict the impossible answer only when all the features give a high score to it (since one feature could predict the impossible answer just because the answer isn't in the part of the context it has access to). This is why the score of the impossible answer for one example is the *minimum* of the scores for the impossible answer in each feature generated by the example.

We then predict the impossible answer when that score is greater than the score of the best non-impossible answer. All combined together, this gives us this post-processing function:
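
The decision rule can be sketched in isolation as follows. The helper `pick_answer` and the scores are hypothetical, chosen only to show how the minimum over features behaves.

```python
def pick_answer(feature_null_scores, best_span_text, best_span_score):
    """Return the best span unless the example-level null score beats it."""
    # The example-level null score is the minimum over all of its features:
    # every feature has to agree that there is no answer.
    min_null_score = min(feature_null_scores)
    return best_span_text if best_span_score > min_null_score else ""


# One feature is confident there is no answer (6.0), but another is not (1.0):
# the minimum is 1.0, so the span with score 4.5 wins.
answer_a = pick_answer([6.0, 1.0], "Saint Bernadette Soubirous", 4.5)

# Every feature gives a high null score: the minimum (5.0) beats the span.
answer_b = pick_answer([6.0, 5.0], "Saint Bernadette Soubirous", 4.5)

print(repr(answer_a), repr(answer_b))
```

Taking the maximum instead would let a single feature that simply cannot see the answer force an empty prediction for the whole example.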

In [None]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []

        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction: keep the smallest null score seen across this example's features.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score > feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )

        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}

        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions

And we can apply our post-processing function to our raw predictions:

In [None]:
!pip install sentence-transformers


Apply the post-processing function to the validation set:

In [None]:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

Then we can load the metric from the datasets library.

In [None]:
metric = load_metric("squad_v2" if squad_v2 else "squad")

Then we can call compute on it. We just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary. In the case of squad_v2, we also have to set a `no_answer_probability` argument (which we set to 0.0 here as we have already set the answer to empty if we picked it).

In [None]:
if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

The **`exact match`** score (73.12% here) is **`the percentage of predictions that exactly match the ground truth answers`**.
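
For intuition, here is a minimal sketch of how exact match and F1 can be computed for a single prediction. It follows the normalization commonly used for SQuAD-style evaluation (lowercasing, stripping punctuation and the articles a/an/the); the function names are illustrative, and the metric loaded above may differ in details such as taking the best score over several reference answers.

```python
import collections
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return int(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        # If either side is empty, F1 is 1 only when both are empty.
        return float(pred_tokens == truth_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = f1_score("the Saint Bernadette", "Saint Bernadette Soubirous")
print(em, round(f1, 2))
```

In the second call, two of the three reference tokens are recovered, so precision is 1 and recall is 2/3, which is why F1 rewards the partial answer while exact match would not.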

In [None]:
import matplotlib.pyplot as plt

metrics = {'exact_match': 76.81173131504258, 'f1': 85.18108051522785}

# Plotting the metrics
plt.bar(metrics.keys(), metrics.values(), color=['blue', 'green'])
plt.ylabel('Percentage')
plt.title('Exact Match and F1 Score')
plt.show()


-----------------------------------------------------------------------------------------------------



# TESTING






In [None]:
# Your training code here

# Save the fine-tuned model
save_path = "C:\\Users\\Lenovo\\OneDrive\\Desktop\\Graduation project\\BERT\\BERT fine tuning Q&A SQUAD"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)


In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load the fine-tuned DistilBERT model and tokenizer
fine_tuned_model = AutoModelForQuestionAnswering.from_pretrained("C:\\Users\\Lenovo\\OneDrive\\Desktop\\Graduation project\\BERT\\BERT fine tuning Q&A SQUAD")
tokenizer = AutoTokenizer.from_pretrained("C:\\Users\\Lenovo\\OneDrive\\Desktop\\Graduation project\\BERT\\BERT fine tuning Q&A SQUAD")


In [None]:
!pip install transformers --upgrade


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load the fine-tuned DistilBERT model and tokenizer
model_name = "C:\\Users\\Lenovo\\OneDrive\\Desktop\\Graduation project\\BERT\\BERT fine tuning Q&A SQUAD"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

def generate_answer_distilbert_squad(row):
    question = row["question"]
    context = row["context"]

    encoding = tokenizer(
        question,
        context,
        max_length=384,  # Adjust the max length as per your fine-tuning
        padding="max_length",
        truncation="only_second",
        return_tensors="pt"
    )

    # Run the model without tracking gradients.
    with torch.no_grad():
        output = model(**encoding)

    # Take the most likely start and end token positions.
    start_index = output.start_logits.argmax(dim=-1).item()
    end_index = output.end_logits.argmax(dim=-1).item()

    answer = tokenizer.decode(encoding["input_ids"][0][start_index:end_index+1], skip_special_tokens=True)

    return {
        "context": context,
        "question": question,
        "answer": answer
    }

# Example usage with a row from your SQuAD dataset
sample_row = {
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend 'Venite Ad Me Omnes'. Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",
    "id": "5733be284776f41900661182",
}

result = generate_answer_distilbert_squad(sample_row)
print(result)


In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Define the model checkpoint and load the tokenizer and model
model_checkpoint = "C:\\Users\\Lenovo\\OneDrive\\Desktop\\Graduation project\\BERT\\BERT fine tuning Q&A SQUAD"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)


In [None]:
def get_answer(context, question):
    inputs = tokenizer(question, context, return_tensors="pt")
    outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    start_idx = start_logits.argmax(-1).item()
    end_idx = end_logits.argmax(-1).item() + 1
    answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx])
    return answer


In [None]:
context_example = "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\"..."

question_example = "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"

answer_example = get_answer(context_example, question_example)

print(f"Question: {question_example}")
print(f"Answer: {answer_example}")


----------------------------------------------------------------------------------------------

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Define the model checkpoint and load the tokenizer and model
model_checkpoint = "C:\\Users\\Lenovo\\OneDrive\\Desktop\\Graduation project\\BERT\\BERT fine tuning Q&A SQUAD"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Create a function for question answering
def get_answer(context, question):
    inputs = tokenizer(question, context, return_tensors="pt")
    outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    start_idx = start_logits.argmax(-1).item()
    end_idx = end_logits.argmax(-1).item() + 1
    answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx])
    return answer

# Test the function with a context and question
context_example = "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary."
question_example = "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"
answer_example = get_answer(context_example, question_example)
print(f"Question: {question_example}")
print(f"Answer: {answer_example}")


In [None]:
print(f"Question: {question_example}")
print(f"Answer: {answer_example}")

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Define the model checkpoint and load the tokenizer and model
model_checkpoint = "C:\\Users\\Lenovo\\OneDrive\\Desktop\\Graduation project\\BERT\\BERT fine tuning Q&A SQUAD"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Create a function for question answering
def get_answer(context, question):
    inputs = tokenizer(question, context, return_tensors="pt")
    outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    start_idx = start_logits.argmax(-1).item()
    end_idx = end_logits.argmax(-1).item() + 1
    answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx])
    return answer

# New context and question
new_context = "Paris, the capital of France, is renowned for its architecture, art, and rich history. The Eiffel Tower, an iconic landmark, stands tall on the Champ de Mars, offering breathtaking views of the city. The Louvre Museum, home to thousands of works of art, including the Mona Lisa, attracts millions of visitors each year. Paris is known for its charming streets, sidewalk cafes, and vibrant cultural scene."

new_question = "What is the most famous landmark in Paris?"
new_answer = get_answer(new_context, new_question)

# Print the results
print(f"Question: {new_question}")
print(f"Answer: {new_answer}")


In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Define the model checkpoint and load the tokenizer and model
model_checkpoint = "C:\\Users\\Lenovo\\OneDrive\\Desktop\\Graduation project\\BERT\\BERT fine tuning Q&A SQUAD"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Create a function for question answering
def get_answer(context, question):
    inputs = tokenizer(question, context, return_tensors="pt")
    outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    start_idx = start_logits.argmax(-1).item()
    end_idx = end_logits.argmax(-1).item() + 1
    answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx])
    return answer

# New context and question
#new_context = "Paris, the capital of France, is renowned for its architecture, art, and rich history. The Eiffel Tower, an iconic landmark, stands tall on the Champ de Mars, offering breathtaking views of the city. The Louvre Museum, home to thousands of works of art, including the Mona Lisa, attracts millions of visitors each year. Paris is known for its charming streets, sidewalk cafes, and vibrant cultural scene."

new_question = "During whose rule was the use of Old Akkadian at its peak?"
new_answer = get_answer(new_context, new_question)

# Print the results
print(f"Question: {new_question}")
print(f"Answer: {new_answer}")


In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from datasets import load_dataset

# Define the model checkpoint and load the tokenizer and model
model_checkpoint = "C:\\Users\\Lenovo\\OneDrive\\Desktop\\Graduation project\\BERT\\BERT fine tuning Q&A SQUAD"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Load the SQuAD dataset
squad_dataset = load_dataset("squad")

# Example question
question_example = "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"

# Iterate through SQuAD examples to find the context for the example question
for example in squad_dataset["train"]:
    if question_example.lower() in example["question"].lower():
        context = example["context"]
        inputs = tokenizer(question_example, context, return_tensors="pt")
        outputs = model(**inputs)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits
        start_idx = start_logits.argmax(-1).item()
        end_idx = end_logits.argmax(-1).item() + 1
        answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx])

        # Print the results
        print(f"Question: {question_example}")
        print(f"Context: {context}")
        print(f"Answer: {answer}")
        break


In [None]:
from datasets import load_dataset

# Load SQuAD 2.0 dataset
squad_dataset = load_dataset("squad_v2")

# Print the first example in the training set
print(squad_dataset["train"][0])


In [None]:
for i in range(10):
    print(squad_dataset["train"][i])
    print("\n" + "="*80 + "\n")  # Add a separator for better readability

In [None]:
import random

# Set a seed for reproducibility
random.seed(42)

# Print 10 random examples from SQuAD v2 dataset
for _ in range(10):
    random_index = random.randint(0, len(squad_dataset["train"]) - 1)
    example = squad_dataset["train"][random_index]

    print(example)
    print("\n" + "="*80 + "\n")  # Add a separator for better readability


In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from datasets import load_dataset

# Define the model checkpoint and load the tokenizer and model
model_checkpoint = "C:\\Users\\Lenovo\\OneDrive\\Desktop\\Graduation project\\BERT\\BERT fine tuning Q&A SQUAD"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Load SQuAD v2 dataset
squad_dataset = load_dataset("squad_v2")

# Example question
question_example = "Who managed the Destiny's Child group?"

# Iterate through SQuAD v2 examples to find the context for the example question
for example in squad_dataset["train"]:
    if question_example.lower() in example["question"].lower():
        context = example["context"]
        inputs = tokenizer(question_example, context, return_tensors="pt")
        outputs = model(**inputs)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits
        start_idx = start_logits.argmax(-1).item()
        end_idx = end_logits.argmax(-1).item() + 1
        answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx])

        # Print the results
        print(f"Question: {question_example}")
        print(f"Context: {context}")
        print(f"Answer: {answer}")
        break


In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from datasets import load_dataset

# Define the model checkpoint and load the tokenizer and model
model_checkpoint = "C:\\Users\\Lenovo\\OneDrive\\Desktop\\Graduation project\\BERT\\BERT fine tuning Q&A SQUAD"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Load SQuAD v2 dataset
squad_dataset = load_dataset("squad_v2")

# Example question
question_example = "What would happen without a proper five-year plan?"

# Iterate through SQuAD v2 examples to find the context for the example question
for example in squad_dataset["train"]:
    if question_example.lower() in example["question"].lower():
        context = example["context"]
        inputs = tokenizer(question_example, context, return_tensors="pt")
        outputs = model(**inputs)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits
        start_idx = start_logits.argmax(-1).item()
        end_idx = end_logits.argmax(-1).item() + 1
        answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx])

        # Print the results
        print(f"Question: {question_example}")
        print(f"Context: {context}")
        print(f"Answer: {answer}")
        break


In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from datasets import load_dataset

# Define the model checkpoint and load the tokenizer and model
model_checkpoint = "/content/distilbert-base-uncased-finetuned-squad/checkpoint-2767"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Load SQuAD v2 dataset
squad_dataset = load_dataset("squad_v2")

# Example question
question_example = "Who's concept of duration was left  behind for a for more concrete frame's of references? "

# Iterate through SQuAD v2 examples to find the context for the example question
for example in squad_dataset["train"]:
    if question_example.lower() in example["question"].lower():
        context = example["context"]
        inputs = tokenizer(question_example, context, return_tensors="pt")
        outputs = model(**inputs)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits
        start_idx = start_logits.argmax(-1).item()
        end_idx = end_logits.argmax(-1).item() + 1
        answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx])

        # Print the results
        print(f"Question: {question_example}")
        print(f"Context: {context}")
        print(f"Answer: {answer}")
        break




# **EVALUATION**



In [None]:
!pip install sentence_transformers scikit-learn


In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from itertools import islice

# Define the model checkpoint and load the tokenizer and model
model_checkpoint = "/content/distilbert-base-uncased-finetuned-squad/checkpoint-2767"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Load SQuAD v2 dataset
squad_dataset = load_dataset("squad_v2")

# Initialize SentenceTransformer for Semantic Answer Similarity (SAS)
embedder = SentenceTransformer("paraphrase-distilroberta-base-v1")

# List to store SAS scores
sas_scores_list = []

# Iterate through the first 100 examples in the dataset and compare the generated answer with the ground truth answer
for example in islice(squad_dataset["train"], 100):
    # Use the model to generate an answer
    inputs = tokenizer(example["question"], example["context"], return_tensors="pt")
    outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    start_idx = start_logits.argmax(-1).item()
    end_idx = end_logits.argmax(-1).item() + 1
    generated_answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx])

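    # Retrieve the first ground truth answer. SQuAD v2 stores the answer texts as a list,
    # which is empty for unanswerable questions.
    answer_texts = example["answers"]["text"]
    if not answer_texts:
        continue  # Skip unanswerable questions
    ground_truth_answer = answer_texts[0]

    # Calculate Semantic Answer Similarity (SAS) score. Entry [0][1] of the pairwise
    # similarity matrix compares the generated answer with the ground truth
    # ([0][0] compares the generated answer with itself and is always 1.0).
    texts = [generated_answer, ground_truth_answer]
    sas_score = cosine_similarity(embedder.encode(texts))[0][1]
    sas_scores_list.append(sas_score)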

    # Print the results
    print(f"Question: {example['question']}")
    print(f"Context: {example['context']}")
    print(f"Generated Answer: {generated_answer}")
    print(f"Ground Truth Answer: {ground_truth_answer}")
    print(f"SAS Score: {sas_score}")
    print("=" * 50)

# After the loop, you can analyze sas_scores_list as needed
average_sas_score = sum(sas_scores_list) / len(sas_scores_list)
print(f"Average SAS Score: {average_sas_score}")


In [None]:
!pip install bert_score


In [None]:
!pip install nltk


In [None]:
from nltk.tokenize import word_tokenize


In [None]:
import nltk
nltk.download('punkt')
from collections import Counter
from nltk.tokenize import word_tokenize
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from bert_score import score
from itertools import islice
from sklearn.metrics.pairwise import cosine_similarity

# Define the model checkpoint and load the tokenizer and model
model_checkpoint = "/content/distilbert-base-uncased-finetuned-squad/checkpoint-2767"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Load SQuAD v2 dataset
squad_dataset = load_dataset("squad_v2")

# Initialize SentenceTransformer for Semantic Answer Similarity (SAS)
embedder = SentenceTransformer("paraphrase-distilroberta-base-v1")

# List to store SAS scores
sas_scores_list = []
em_scores_list = []
f1_scores_list = []
bert_scores_list = []

# Variables for precision and recall
total_true_positives = 0
total_false_positives = 0
total_false_negatives = 0

# Function to compute EM (Exact Match), precision, and recall for a single prediction
def compute_scores(prediction, reference):
    if isinstance(prediction, list):
        prediction = ' '.join(prediction)
    if isinstance(reference, list):
        reference = ' '.join(reference)

    prediction_tokens = word_tokenize(prediction.lower())
    reference_tokens = word_tokenize(reference.lower())

    common = Counter(prediction_tokens) & Counter(reference_tokens)
    num_same = sum(common.values())

    precision = 0.0
    recall = 0.0

    if num_same > 0:
        precision = 1.0 * num_same / len(prediction_tokens)
        recall = 1.0 * num_same / len(reference_tokens)

    em = int(prediction_tokens == reference_tokens)

    return em, precision, recall

# Iterate through the first 100 examples in the dataset and compare the generated answer with the ground truth answer
for example in islice(squad_dataset["train"], 100):
    # Use the model to generate an answer
    inputs = tokenizer(example["question"], example["context"], return_tensors="pt")
    outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    start_idx = start_logits.argmax(-1).item()
    end_idx = end_logits.argmax(-1).item() + 1
    generated_answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx])

    # Retrieve the first ground truth answer. SQuAD v2 stores the answer texts as a list,
    # which is empty for unanswerable questions.
    answer_texts = example["answers"]["text"]
    if not answer_texts:
        continue  # Skip examples with no answers
    ground_truth_answer = answer_texts[0]

    # Calculate Semantic Answer Similarity (SAS) score. Entry [0][1] of the pairwise
    # similarity matrix compares the generated answer with the ground truth
    # ([0][0] compares the generated answer with itself and is always 1.0).
    texts = [generated_answer, ground_truth_answer]
    sas_score = cosine_similarity(embedder.encode(texts))[0][1]
    sas_scores_list.append(sas_score)

    # Compute EM, precision, and recall
    em_score, precision, recall = compute_scores(generated_answer, ground_truth_answer)
    em_scores_list.append(em_score)

    # Update true positives, false positives, and false negatives
    total_true_positives += em_score
    total_false_positives += int(em_score == 0)
    total_false_negatives += int(em_score == 0)

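    # Guard against division by zero when the prediction shares no tokens with the reference.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    f1_scores_list.append(f1)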

    # Calculate BERTScore
    bert_scores = score([generated_answer], [ground_truth_answer], lang="en")
    bert_score = bert_scores[2].mean().item()
    bert_scores_list.append(bert_score)

    # Print the results
    print(f"Question: {example['question']}")
    print(f"Context: {example['context']}")
    print(f"Generated Answer: {generated_answer}")
    print(f"Ground Truth Answer: {ground_truth_answer}")
    print(f"SAS Score: {sas_score}")
    print(f"EM Score: {em_score}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1 Score: {f1_scores_list[-1]}")
    print(f"BERTScore: {bert_score}")
    print("=" * 50)

# After the loop, you can analyze sas_scores_list, em_scores_list, f1_scores_list, and bert_scores_list as needed
average_sas_score = sum(sas_scores_list) / len(sas_scores_list)
average_em_score = sum(em_scores_list) / len(em_scores_list)
average_f1_score = sum(f1_scores_list) / len(f1_scores_list)
average_bert_score = sum(bert_scores_list) / len(bert_scores_list)

# Calculate overall precision, recall, and F1
overall_precision = total_true_positives / (total_true_positives + total_false_positives)
overall_recall = total_true_positives / (total_true_positives + total_false_negatives)
overall_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall)

print(f"Average SAS Score: {average_sas_score}")
print(f"Average EM Score: {average_em_score}")
print(f"Average F1 Score: {average_f1_score}")
print(f"Average BERTScore: {average_bert_score}")
print(f"Overall Precision: {overall_precision}")
print(f"Overall Recall: {overall_recall}")
print(f"Overall F1 Score: {overall_f1}")
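
The per-example F1 and the averages in the cell above all involve a division, so they need a guard for the cases where precision and recall are both zero or where no example was scored. A minimal sketch of such guards; the helper names `safe_f1` and `mean_or_zero` are hypothetical and not part of the notebook:

```python
def safe_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, defined as 0.0 when both are 0."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom > 0 else 0.0


def mean_or_zero(values) -> float:
    """Arithmetic mean of a list, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Illustrative values only, not results from this notebook
print(safe_f1(0.0, 0.0))
print(safe_f1(0.5, 1.0))
print(mean_or_zero([]))
```

Returning 0.0 in the degenerate cases is a convention chosen here; skipping such examples instead is an equally valid choice as long as it is applied consistently.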


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Recompute overall precision, recall, and F1 from the accumulated counts
overall_precision = total_true_positives / (total_true_positives + total_false_positives)
overall_recall = total_true_positives / (total_true_positives + total_false_negatives)
overall_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall)

# Update metrics and average_scores
metrics = ['SAS', 'EM', 'F1', 'BERTScore', 'Precision', 'Recall']
average_scores = [average_sas_score, average_em_score, average_f1_score, average_bert_score, overall_precision, overall_recall]

# Plotting results
plt.figure(figsize=(12, 8))
sns.barplot(x=metrics, y=average_scores)
plt.title('Average Evaluation Metrics')
plt.ylabel('Score')
plt.show()
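
With the counting scheme used in the evaluation loop (each exact match is a true positive, each miss counts as both a false positive and a false negative), overall precision and recall reduce to the average EM score, assuming EM is always 0 or 1. The Precision and Recall bars should therefore match the EM bar. A small worked example with made-up values:

```python
# Illustrative EM outcomes, not results from this notebook
em_scores = [1, 0, 1, 1, 0]

# Same counting scheme as the evaluation loop
true_positives = sum(em_scores)
false_positives = sum(1 for s in em_scores if s == 0)
false_negatives = sum(1 for s in em_scores if s == 0)

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
average_em = sum(em_scores) / len(em_scores)

# All three values coincide because TP + FP = TP + FN = number of examples
print(precision, recall, average_em)
```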

In [None]:
def plot_scores(scores, metric_name, average_score):
    plt.figure(figsize=(15, 6))

    # Scatter plot
    plt.scatter(range(1, len(scores) + 1), scores, label=metric_name, marker='o', color='blue', alpha=0.7)

    # Highlight exceptional cases (e.g., low scores)
    exceptional_indices = [i for i, score in enumerate(scores) if score < 0.5]  # Customize the threshold as needed
    # Shift by 1 so the markers line up with the 1-based x positions used above
    plt.scatter([i + 1 for i in exceptional_indices], [scores[i] for i in exceptional_indices], color='red', label='Exceptional', marker='x')

    # Average line
    plt.axhline(y=average_score, color='green', linestyle='--', label=f'Average {metric_name}')

    plt.title(f'{metric_name} Scores for Each Example')
    plt.xlabel('Example Index')
    plt.ylabel(f'{metric_name} score')
    plt.legend()
    plt.show()

# Plotting results for each metric
average_sas_score = sum(sas_scores_list) / len(sas_scores_list)
average_em_score = sum(em_scores_list) / len(em_scores_list)
average_f1_score = sum(f1_scores_list) / len(f1_scores_list)
average_bert_score = sum(bert_scores_list) / len(bert_scores_list)

plot_scores(sas_scores_list, 'SAS', average_sas_score)
plot_scores(em_scores_list, 'EM', average_em_score)
plot_scores(f1_scores_list, 'F1', average_f1_score)
plot_scores(bert_scores_list, 'BERTScore', average_bert_score)
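
`plot_scores` places the first example at x = 1, so any list of low-scoring examples meant to be read alongside these plots should use 1-based positions as well. A minimal sketch, assuming the same 0.5 threshold; `low_score_positions` is a hypothetical helper, not part of the notebook:

```python
def low_score_positions(scores, threshold=0.5):
    """Return the 1-based positions of all scores strictly below the threshold."""
    return [pos for pos, score in enumerate(scores, start=1) if score < threshold]


# Illustrative values only, not results from this notebook
example_scores = [0.9, 0.2, 0.5, 0.1]
print(low_score_positions(example_scores))
```

The returned positions can be used to look up the matching questions and answers for manual inspection.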

# **You can now upload the result of the training to the Hub. The cells below first check the model and its config, then push it:**

In [None]:
from transformers import pipeline

# Check if the model exists
model_name = "distilbert-base-uncased-finetuned-squad"
try:
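    # Pass the task explicitly so the check also works for a local model directory
    pipeline("question-answering", model=model_name, tokenizer=model_name)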
    print(f"The model {model_name} exists and is accessible.")
except Exception as e:
    print(f"Error: {e}")


In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-squad"

try:
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    print("Config file (config.json) exists.")
except Exception as e:
    print(f"Config file (config.json) is missing. Error: {e}")


In [None]:
import os
import json

config_content = {
    "architectures": ["DistilBertForQuestionAnswering"],
    "attention_probs_dropout_prob": 0.1,
    "hidden_dropout_prob": 0.1,
    "num_labels": 2,
    "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
    "label2id": {"LABEL_0": 0, "LABEL_1": 1},
}

# Use the current working directory as the output directory
output_directory = os.getcwd()
config_path = os.path.join(output_directory, "config.json")

# Write the content to the config.json file
with open(config_path, "w") as config_file:
    json.dump(config_content, config_file, indent=4)

print(f"Config file (config.json) has been created at: {config_path}")


In [None]:
trainer.push_to_hub()

You can now share this model with all your friends, family, and favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"`, for instance:

```python
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("sgugger/my-awesome-model")
```