# **Tâche #2 - Questions-réponses avec un modèle QA extractif**

Cette tâche consiste à utiliser un modèle de question-réponse extractif de type transformer afin de repérer des informations dans un texte. Vous utilisez la librairie HuggingFace pour accomplir cette tâche. On demande plus spécifiquement d’utiliser le modèle *bert-base-uncased-whole-word-masking-finetuned-squad*.

La tâche a pour but précis de repérer 3 informations dans les descriptions textuelles : le lieu et la date de l’incident ainsi qu’un court passage de texte indiquant ce qui s’est passé.  Une partie importante de votre travail consiste à trouver de bonnes formulations de questions pour repérer ces informations. Le fichier *t2_qa_examples*.json, qui contient 25 exemples annotés par un humain, est disponible pour mener vos expérimentations.

Les consignes pour cette tâche sont:
-	Nom du notebook : *t2_qa.ipynb* (ce notebook)
-	Tokenisation et plongements de mots : Ceux du modèle utilisé.
-	Normalisation : Aucune normalisation à faire (le tokeniseur convertit les lettres en minuscule).
-	Construction du modèle : vous utilisez la version préentraînée du modèle sans modification. Aucun affinement (fine-tuning) du modèle n’est requis pour cette tâche.
-	Évaluation : Du code est disponible dans le notebook pour évaluer la performance du modèle avec les métriques *exact match* et *F1*.
-	Analyse : Présentez et discutez des résultats que vous obtenez pour les 3 types d’informations à repérer. Discutez également de vos choix de questions pour accomplir cette tâche et les erreurs commises par le modèle QA.

Vous pouvez ajouter au notebook toutes les cellules dont vous avez besoin pour votre code, vos explications ou la présentation de vos résultats. Vous pouvez également ajouter des sous-sections (par ex. des sous-sections 1.1, 1.2 etc.) si cela améliore la lisibilité.

Notes :
- Évitez les bouts de code trop longs ou trop complexes. Par exemple, il est difficile de comprendre 4-5 boucles ou conditions imbriquées. Si c'est le cas, définissez des sous-fonctions pour refactoriser et simplifier votre code.
- Expliquez sommairement votre démarche.
- Expliquez les choix que vous faites au niveau de la programmation et des modèles (si non trivial).
- Analysez vos résultats. Indiquez ce que vous observez, si c'est bon ou non, si c'est surprenant, etc.
- Une analyse quantitative et qualitative d'erreurs est intéressante et permet de mieux comprendre le comportement d'un modèle.

## 1. Le chargement des données

Utilisez le fichier ***/data/t2_qa_examples.json*** pour mener vos expérimentations.

In [2]:
import json

def load_json_data(filename):
    with open(filename, 'r') as fp:
        data = json.load(fp)
    return data

In [3]:
data = load_json_data('/data/t2_qa_examples.json')

In [4]:
data

  'WHEN': 'November 10  2013',
  'WHERE': 'railroad bridge overpass',
  'EVENT': 'Employee #1  was struck and thrown'},
 {'text': " On August 27  2012  Employee #1  a 19 year-old male laborer with Stomper  Company Inc.  arrived at 2:00 .am. at a site in Menlo Park California to  demolish the interiors of the building. They scraped the interiors of the  building and collected debris as they finished up the job. On August 28  2012   at approximately 10:00 a.m  the job assignment was done and every employee was  to put away all the rubble and gather all equipment in order to pack up and  leave the site. When the job assignment was finished  it is typical for all  employees to gather everything and put it away into the garbage bin or in  their trailers and bins. At the time  four coworkers were outside in the  parking lot working near the Number 5 700 Panther. Two coworkers were going to  load the number 5 700 Panther and Employee #1 stated that he was going to load  the number 5 700 Panth

## 2. Vos questions

Vous pouvez mettre plusieurs options de questions dans le notebook. Il est important de présenter, au minimum, les résultats pour le meilleur jeu de questions. Vous pourrez également mettre des informations à ce propos dans la section d'analyse.

*WHEN formulations:*

In [5]:
version1_WHEN = "When was the employee doing staff"
version2_WHEN = "What was the date of the event?"
version3_WHEN = "What was the day? "
meuilleur_WHEN = "What was the exact day?"

*WHEN formulations:*

In [6]:
version1_WHERE = "where was it mostly"
version2_WHERE = "where was it done?"
meuilleur_WHERE = "Where was the actual place it was occuring?"

*EVENT formulations:*

In [7]:
version1_EVENT = "What was happening at the end?"
version2_EVENT = "What is the most controversial thing that happened?"
meuilleur_EVENT = "What is the  most astonishing thing happening?"

## 3. Le modèle de question-réponse extractif

In [8]:
from transformers import pipeline

In [9]:
pipe = pipeline("question-answering", model="google-bert/bert-large-uncased-whole-word-masking-finetuned-squad")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of the model checkpoint at google-bert/bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [10]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("google-bert/bert-large-uncased-whole-word-masking-finetuned-squad")

Some weights of the model checkpoint at google-bert/bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
import torch

## 4. Des fonctions utilitaires pour l'évaluation

In [12]:
import string
import re
from collections import Counter

def remove_articles(text):
    return re.sub(r'\b(a|an|the)\b', ' ', text)

def white_space_fix(text):
    return ' '.join(text.split())

def remove_punc(text):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in text if ch not in exclude)

def lower(text):
    return text.lower()

def normalize_answer(s):
    """Mettre en minuscule et retirer la ponctuation, des déterminants and les espaces."""
    return white_space_fix(remove_articles(remove_punc(lower(s))))

In [13]:
def evaluate_f1(ground_truth, prediction):
    """Normalise les 2 textes, trouve ce qu'il y a en commun et estime précision, rappel et F1."""
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if len(ground_truth_tokens) == 0 or len(prediction_tokens) == 0:
        return int(ground_truth_tokens == prediction_tokens)
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

def evaluate_exact_match(ground_truth, prediction):
    """Vérifie si les 2 textes sont quasi-identiques."""
    return (normalize_answer(prediction) == normalize_answer(ground_truth))

## 5. Évaluation du modèle et analyse

**EVENT (juste pour connaitre effet sliding window sur EVENT):**

In [None]:

# Parameters
window_size = 512  # Max tokens per window
stride = 32  # Overlap size
# windSz 512 , stride 64 , f1 score 0.49 , 20%
# Split text into overlapping windows
def sliding_windows(text, window_size, stride):
    input_ids = tokenizer.encode(text, add_special_tokens=False)
    windows = []
    for i in range(0, len(input_ids), stride):
        window = input_ids[i:i + window_size]
        windows.append(window)
    return windows

# Initialize scores
s = 0
f = 0
exM = 0

for i in range(len(data)):
    text_data = data[i]['text']
    ground_truth = data[i]["EVENT"]
    question, text = "What is the  most astonishing thing happening?", text_data

    windows = sliding_windows(text_data, window_size, stride)
    answers = []

    # Get answers from each window
    for window in windows:
        tokens = tokenizer.convert_ids_to_tokens(window, skip_special_tokens=True)
        window_text = tokenizer.convert_tokens_to_string(tokens)

        inputs = tokenizer.encode_plus(question, window_text, return_tensors="pt", truncation=True, max_length=window_size)
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)

        start_scores = outputs.start_logits
        end_scores = outputs.end_logits
        all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

        # Get the best answer within this window
        answer_start = torch.argmax(start_scores)
        answer_end = torch.argmax(end_scores) + 1

        if answer_start < answer_end and answer_end <= len(all_tokens):
            answer = tokenizer.convert_tokens_to_string(all_tokens[answer_start:answer_end])
            score = start_scores[0][answer_start] + end_scores[0][answer_end - 1]
            answers.append((answer, score.item()))

    # Select the best answer based on score
    prediction = max(answers, key=lambda x: x[1])[0] if answers else ""

    # Evaluate performance
    f1_score = evaluate_f1(ground_truth, prediction)
    f += f1_score
    exact_match = evaluate_exact_match(ground_truth, prediction)
    exM += int(exact_match)

    print(f"##### Example {i}: #####")
    print("Prediction:", prediction)
    print("F1 score:", f1_score)
    print("Exact Match:", int(exact_match), "\n")
    print("#########\n")

    if prediction.lower() == ground_truth.lower():
        s += 1

# Final Metrics
percentage = (s / len(data)) * 100
print(f"Number correct EVENT: {s}")
print(f"{percentage:.2f}% correct answers")
print(f"{f / len(data):.2f} Average F1 Score")
print(f"{(exM / len(data)) * 100:.2f}% Average Exact Match Score")


##### Example 0: #####
Prediction: employee # 1 was pronounced dead
F1 score: 0.5454545454545454
Exact Match: 0 

#########

##### Example 1: #####
Prediction: employee # 1 suffered a serious fracture injury
F1 score: 0
Exact Match: 0 

#########

##### Example 2: #####
Prediction: pronounced dead
F1 score: 0
Exact Match: 0 

#########

##### Example 3: #####
Prediction: he was struck by the cement mixer that tipped over in the process of mixing cement / concrete
F1 score: 0.5714285714285715
Exact Match: 0 

#########

##### Example 4: #####
Prediction: the semi - truck swerved to miss a car at employee # 1 ' s sign area and struck employee # 1
F1 score: 0.3157894736842105
Exact Match: 0 

#########

##### Example 5: #####
Prediction: employee # 1 was killed
F1 score: 0.3333333333333333
Exact Match: 0 

#########

##### Example 6: #####
Prediction: the employee moved and his left foot was struck by and run over by a loader
F1 score: 0.88
Exact Match: 0 

#########

##### Example 7: ###

**WHERE**

In [29]:

f=0
exM=0

for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["WHERE"]
  question,text = "where was it mostly?",text_data

  chunk_size = 512
  text_debut=text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  text_final = text[-chunk_size:]
  head_1 = text[:chunk_size//3]
  head_2 = text[chunk_size//6:chunk_size*2//3]
  head_3 = text[chunk_size*5//6:chunk_size*3//3]

  #combined head and tail at the begining and in the middle
  head_debut = text[:chunk_size//2]
  tail_debut = text[chunk_size//4:chunk_size//2]
  head_debut_2 = text[-chunk_size//2:]
  tail_debut_2 = text[chunk_size*3//4:chunk_size]
  inputs = tokenizer(question, head , return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["WHERE"]
  f1_score = evaluate_f1(ground_truth, prediction)
  exact_match = evaluate_exact_match(ground_truth, prediction)

  f+=f1_score
  exM+=int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")

print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
a railroad bridge over
f1 score:  0.6666666666666666
exact_match:  0 

#########

##### example 1: #####
menlo park california
f1 score:  1.0
exact_match:  1 

#########

##### example 2: #####
bathroom facility
f1 score:  1.0
exact_match:  1 

#########

##### example 3: #####
cement mixer. he was struck by the cement mixer that tipped over in the process of mixing cement / concrete
f1 score:  0.10526315789473684
exact_match:  0 

#########

##### example 4: #####
iowa
f1 score:  0
exact_match:  0 

#########

##### example 5: #####
the emergency shoulder of an interstate highway
f1 score:  1.0
exact_match:  1 

#########

##### example 6: #####
inside a tunnel pipeline
f1 score:  1.0
exact_match:  1 

#########

##### example 7: #####
birds eye foods to service the dock lock on bay number 3
f1 score:  0.42857142857142855
exact_match:  0 

#########

##### example 8: #####
interstate 15 ( i - 15 ) and approximately 14100 south
f1 score:  0.7999999999999999
exact

In [33]:

f=0
exM=0
#"where was it done?" 67% 52% text_debut   (en bas cellule)

#"where was it?" head + tail 63.66% 36%
#"where was it mostly?" 59% 44%   head (512// 2)
# "where was it mostly?" 53.91% 48% head(first only)
# "where was it occuring?" 58.94 44% head + tail
# "where was it occuring?" 65.34 24% head_debut + tail_debut
#"Where was the actual place it was occuring?" 68.26%  52%   (head_debut + tail_debut)  | 67.59% 52% (head) |  73.1%  64%(first tokens 512  (text_debut))  | 50% (text_final) | 74.31%  64% (text_debut + head_3)
#"Where was the actual place it was taken place in?"  70% (512 first token)
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["WHERE"]
  question,text = "where was it done?",text_data

  chunk_size = 512
  text_debut=text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  text_final = text[-chunk_size:]
  head_1 = text[:chunk_size//3]
  head_2 = text[chunk_size//6:chunk_size*2//3]
  head_3 = text[chunk_size*5//6:chunk_size*3//3]

  #combined head and tail at the begining and in the middle
  head_debut = text[:chunk_size//2]
  tail_debut = text[chunk_size//4:chunk_size//2]
  head_debut_2 = text[-chunk_size//2:]
  tail_debut_2 = text[chunk_size*3//4:chunk_size]
  inputs = tokenizer(question, head + tail , return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["WHERE"]
  f1_score = evaluate_f1(ground_truth, prediction)
  exact_match = evaluate_exact_match(ground_truth, prediction)

  f+=f1_score
  exM+=int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")

print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
railroad bridge
f1 score:  0.8
exact_match:  0 

#########

##### example 1: #####
menlo park california
f1 score:  1.0
exact_match:  1 

#########

##### example 2: #####
st. james hospital in chicago heights
f1 score:  0
exact_match:  0 

#########

##### example 3: #####
mixing cement / concrete
f1 score:  0.4
exact_match:  0 

#########

##### example 4: #####
iowa erosion control inc.
f1 score:  0
exact_match:  0 

#########

##### example 5: #####
in the emergency shoulder of an interstate highway
f1 score:  0.9090909090909091
exact_match:  0 

#########

##### example 6: #####
inside a tunnel pipeline
f1 score:  1.0
exact_match:  1 

#########

##### example 7: #####
birds eye foods
f1 score:  1.0
exact_match:  1 

#########

##### example 8: #####
on interstate 15 ( i - 15 ) and approximately 14100 south
f1 score:  0.75
exact_match:  0 

#########

##### example 9: #####
alimack hoist tower
f1 score:  1.0
exact_match:  1 

#########

##### example 10: ###

In [32]:

f=0
exM=0
#"where was it done?" 67% 52% text_debut   (en bas cellule)

#"where was it?" head + tail 63.66% 36%
#"where was it mostly?" 59% 44%   head (512// 2)
# "where was it mostly?" 53.91% 48% head(first only)
# "where was it occuring?" 58.94 44% head + tail
# "where was it occuring?" 65.34 24% head_debut + tail_debut
#"Where was the actual place it was occuring?" 68.26%  52%   (head_debut + tail_debut)  | 67.59% 52% (head) |  73.1%  64%(first tokens 512  (text_debut))  | 50% (text_final) | 74.31%  64% (text_debut + head_3)
#"Where was the actual place it was taken place in?"  70% (512 first token)
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["WHERE"]
  question,text = "where was it done?",text_data

  chunk_size = 512
  text_debut=text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  text_final = text[-chunk_size:]
  head_1 = text[:chunk_size//3]
  head_2 = text[chunk_size//6:chunk_size*2//3]
  head_3 = text[chunk_size*5//6:chunk_size*3//3]

  #combined head and tail at the begining and in the middle
  head_debut = text[:chunk_size//2]
  tail_debut = text[chunk_size//4:chunk_size//2]
  head_debut_2 = text[-chunk_size//2:]
  tail_debut_2 = text[chunk_size*3//4:chunk_size]
  inputs = tokenizer(question, text_debut , return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["WHERE"]
  f1_score = evaluate_f1(ground_truth, prediction)
  exact_match = evaluate_exact_match(ground_truth, prediction)

  f+=f1_score
  exM+=int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")

print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
railroad bridge overpass
f1 score:  1.0
exact_match:  1 

#########

##### example 1: #####
menlo park california
f1 score:  1.0
exact_match:  1 

#########

##### example 2: #####
st. james hospital
f1 score:  0
exact_match:  0 

#########

##### example 3: #####
mixing cement / concrete
f1 score:  0.4
exact_match:  0 

#########

##### example 4: #####
iowa erosion control inc. was turning a vertical sign from stop to slow based upon the location of a pilot car running between the worker ' s location and a second coworker " flagger " located approximately one mile away
f1 score:  0.17142857142857143
exact_match:  0 

#########

##### example 5: #####
the emergency shoulder of an interstate highway
f1 score:  1.0
exact_match:  1 

#########

##### example 6: #####
inside a tunnel pipeline
f1 score:  1.0
exact_match:  1 

#########

##### example 7: #####
birds eye foods
f1 score:  1.0
exact_match:  1 

#########

##### example 8: #####
interstate 15 ( i - 15 ) a

In [34]:

f=0
exM=0
#"where was it done?" 67% 52% text_debut   (en bas cellule)

#"where was it?" head + tail 63.66% 36%
#"where was it mostly?" 59% 44%   head (512// 2)
# "where was it mostly?" 53.91% 48% head(first only)
# "where was it occuring?" 58.94 44% head + tail
# "where was it occuring?" 65.34 24% head_debut + tail_debut
#"Where was the actual place it was occuring?" 68.26%  52%   (head_debut + tail_debut)  | 67.59% 52% (head) |  73.1%  64%(first tokens 512  (text_debut))  | 50% (text_final) | 74.31%  64% (text_debut + head_3)
#"Where was the actual place it was taken place in?"  70% (512 first token)
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["WHERE"]
  question,text = "Where was the actual place it was occuring?",text_data

  chunk_size = 512
  text_debut=text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  text_final = text[-chunk_size:]
  head_1 = text[:chunk_size//3]
  head_2 = text[chunk_size//6:chunk_size*2//3]
  head_3 = text[chunk_size*5//6:chunk_size*3//3]

  #combined head and tail at the begining and in the middle
  head_debut = text[:chunk_size//2]
  tail_debut = text[chunk_size//4:chunk_size//2]
  head_debut_2 = text[-chunk_size//2:]
  tail_debut_2 = text[chunk_size*3//4:chunk_size]
  inputs = tokenizer(question, text_debut + head_3 , return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["WHERE"]
  f1_score = evaluate_f1(ground_truth, prediction)
  exact_match = evaluate_exact_match(ground_truth, prediction)

  f+=f1_score
  exM+=int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")

print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
railroad bridge overpass
f1 score:  1.0
exact_match:  1 

#########

##### example 1: #####
menlo park california
f1 score:  1.0
exact_match:  1 

#########

##### example 2: #####
a bathroom facility
f1 score:  1.0
exact_match:  1 

#########

##### example 3: #####
mixing cement / concrete
f1 score:  0.4
exact_match:  0 

#########

##### example 4: #####
one mile away
f1 score:  1.0
exact_match:  1 

#########

##### example 5: #####
the emergency shoulder of an interstate highway
f1 score:  1.0
exact_match:  1 

#########

##### example 6: #####
inside a tunnel pipeline
f1 score:  1.0
exact_match:  1 

#########

##### example 7: #####
birds eye foods
f1 score:  1.0
exact_match:  1 

#########

##### example 8: #####
utah county line to 12300 south in salt lake county
f1 score:  0.11764705882352941
exact_match:  0 

#########

##### example 9: #####
alimack hoist tower
f1 score:  1.0
exact_match:  1 

#########

##### example 10: #####
home depot store number

In [35]:

f=0
exM=0
#"where was it done?" 67% 52% text_debut   (en bas cellule)

#"where was it?" head + tail 63.66% 36%
#"where was it mostly?" 59% 44%   head (512// 2)
# "where was it mostly?" 53.91% 48% head(first only)
# "where was it occuring?" 58.94 44% head + tail
# "where was it occuring?" 65.34 24% head_debut + tail_debut
#"Where was the actual place it was occuring?" 68.26%  52%   (head_debut + tail_debut)  | 67.59% 52% (head) |  73.1%  64%(first tokens 512  (text_debut))  | 50% (text_final) | 74.31%  64% (text_debut + head_3)
#"Where was the actual place it was taken place in?"  70% (512 first token)
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["WHERE"]
  question,text = "Where was the actual place it was occuring?",text_data

  chunk_size = 512
  text_debut=text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  text_final = text[-chunk_size:]
  head_1 = text[:chunk_size//3]
  head_2 = text[chunk_size//6:chunk_size*2//3]
  head_3 = text[chunk_size*5//6:chunk_size*3//3]

  #combined head and tail at the begining and in the middle
  head_debut = text[:chunk_size//2]
  tail_debut = text[chunk_size//4:chunk_size//2]
  head_debut_2 = text[-chunk_size//2:]
  tail_debut_2 = text[chunk_size*3//4:chunk_size]
  inputs = tokenizer(question, text_debut , return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["WHERE"]
  f1_score = evaluate_f1(ground_truth, prediction)
  exact_match = evaluate_exact_match(ground_truth, prediction)

  f+=f1_score
  exM+=int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")

print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
railroad bridge overpass
f1 score:  1.0
exact_match:  1 

#########

##### example 1: #####
menlo park california
f1 score:  1.0
exact_match:  1 

#########

##### example 2: #####
a bathroom facility
f1 score:  1.0
exact_match:  1 

#########

##### example 3: #####
mixing cement / concrete
f1 score:  0.4
exact_match:  0 

#########

##### example 4: #####
one mile away
f1 score:  1.0
exact_match:  1 

#########

##### example 5: #####
the emergency shoulder of an interstate highway
f1 score:  1.0
exact_match:  1 

#########

##### example 6: #####
inside a tunnel pipeline
f1 score:  1.0
exact_match:  1 

#########

##### example 7: #####
birds eye foods
f1 score:  1.0
exact_match:  1 

#########

##### example 8: #####
utah county line to 12300 south in salt lake county
f1 score:  0.11764705882352941
exact_match:  0 

#########

##### example 9: #####
alimack hoist tower
f1 score:  1.0
exact_match:  1 

#########

##### example 10: #####
home depot store number

**EVENT**

2ème forrmulation (head + tail + head-4 + tail_4)

In [38]:
s = 0
f=0
exM=0
#"What was happening after what occured in the place where was it done?"
#"What was happening at the end?" 38%
#"What the most thing that was happening and marking the story  at the end?"  31%
# "What the most thing that was happening and marking the story?" 31%
#"What the most thing that was happening  in the story?" 31%
#"What the most thing that was happening?" 43.83 (the most > 43.27)
# "What is the most controversial thing that happened?" (46%)
# "What is the worst and risky thing happening?" 47.88%
#"What is the  most astonishing thing happening?"  head_only  (41%  16%) |  (head + tail) 50.61%   24%  | 54.61%  28% (head + tail + head_4 + tail_4)
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["EVENT"]
  question,text = "What is the most controversial thing that happened?",text_data

  chunk_size = 512
  head_only = text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  head_2 = text[chunk_size:chunk_size+ 513]
  tail_2 = text[chunk_size + 512: chunk_size + 1024]

  head_3 = text[chunk_size + 1024: chunk_size + 1536]
  tail_3 = text[chunk_size + 1536: chunk_size + 2048]

  head_4 = text[chunk_size + 2048: chunk_size + 256]
  tail_4 = text[chunk_size + 2560: chunk_size + 3071]

  tail_only = text[-chunk_size:]
  inputs = tokenizer(question, head + tail + head_4 + tail_4 , return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["EVENT"]
  f1_score = evaluate_f1(ground_truth, prediction)
  f+=f1_score
  exact_match = evaluate_exact_match(ground_truth, prediction)
  exM += int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")

print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
employee # 1 was struck and thrown some 100 ft from where he was originally standing. the vehicle was moving approximately 45 mph per hour. as he was transferred to a hospital by emergency personnel employee # 1 was treated for severe trauma lacerations fractures and contusions
f1 score:  0.24489795918367346
exact_match:  0 

#########

##### example 1: #####
employee # 1 suffered a serious fracture injury to his left leg
f1 score:  0.16
exact_match:  0 

#########

##### example 2: #####
employee # 1 had a seizure
f1 score:  1.0
exact_match:  1 

#########

##### example 3: #####
he was struck by the cement mixer that tipped over in the process of mixing cement / concrete
f1 score:  0.5714285714285715
exact_match:  0 

#########

##### example 4: #####
employee # 1 was killed
f1 score:  0.5714285714285715
exact_match:  0 

#########

##### example 5: #####
employee # 1 was killed
f1 score:  0.3333333333333333
exact_match:  0 

#########

##### example 6: #####
t

meuilleur_formulation (head_only)

In [None]:
s = 0
f=0
exM=0
#"What was happening after what occured in the place where was it done?"
#"What was happening at the end?" 38%
#"What the most thing that was happening and marking the story  at the end?"  31%
# "What the most thing that was happening and marking the story?" 31%
#"What the most thing that was happening  in the story?" 31%
#"What the most thing that was happening?" 43.83 (the most > 43.27)
# "What is the most controversial thing that happened?" (46%)
# "What is the worst and risky thing happening?" 47.88%
#"What is the  most astonishing thing happening?"  head_only  (41%  16%) |  (head + tail) 50.61%   24%  | 54.61%  28% (head + tail + head_4 + tail_4)
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["EVENT"]
  question,text = "What is the  most astonishing thing happening?",text_data

  chunk_size = 512
  head_only = text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  head_2 = text[chunk_size:chunk_size+ 513]
  tail_2 = text[chunk_size + 512: chunk_size + 1024]

  head_3 = text[chunk_size + 1024: chunk_size + 1536]
  tail_3 = text[chunk_size + 1536: chunk_size + 2048]

  head_4 = text[chunk_size + 2048: chunk_size + 256]
  tail_4 = text[chunk_size + 2560: chunk_size + 3071]

  tail_only = text[-chunk_size:]
  inputs = tokenizer(question, head_only , return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["EVENT"]
  f1_score = evaluate_f1(ground_truth, prediction)
  f+=f1_score
  exact_match = evaluate_exact_match(ground_truth, prediction)
  exM += int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")
  if prediction == When_part.lower():
    s+=1
percenantage = (s/25)*100
print("number correct EVENT :{}".format(s))
print("{} % correct answers".format(percenantage))
print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
standing on the ground checking the depth of the cut into the asphalt
f1 score:  0
exact_match:  0 

#########

##### example 1: #####

f1 score:  0
exact_match:  0 

#########

##### example 2: #####
seizure
f1 score:  0.4
exact_match:  0 

#########

##### example 3: #####
he was struck by the cement mixer that tipped over in the process of mixing cement / concrete
f1 score:  0.5714285714285715
exact_match:  0 

#########

##### example 4: #####
the semi - truck swerved to miss a car at employee # 1 ' s sign area and struck employee # 1
f1 score:  0.3157894736842105
exact_match:  0 

#########

##### example 5: #####
employee # 1 was killed
f1 score:  0.3333333333333333
exact_match:  0 

#########

##### example 6: #####
the employee moved and his left foot was struck by and run over by a loader
f1 score:  0.88
exact_match:  0 

#########

##### example 7: #####
a tractor - trailer unit was parked in the adjacent bay and there was a gap of approximately 2 to 3 

meuilleur_formulation (head+tail)

In [None]:
s = 0
f=0
exM=0
#"What was happening after what occured in the place where was it done?"
#"What was happening at the end?" 38%
#"What the most thing that was happening and marking the story  at the end?"  31%
# "What the most thing that was happening and marking the story?" 31%
#"What the most thing that was happening  in the story?" 31%
#"What the most thing that was happening?" 43.83 (the most > 43.27)
# "What is the most controversial thing that happened?" (46%)
# "What is the worst and risky thing happening?" 47.88%
#"What is the  most astonishing thing happening?"  head_only  (41%  16%) |  (head + tail) 50.61%   24%  | 54.61%  28% (head + tail + head_4 + tail_4)
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["EVENT"]
  question,text = "What is the  most astonishing thing happening?",text_data

  chunk_size = 512
  head_only = text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  head_2 = text[chunk_size:chunk_size+ 513]
  tail_2 = text[chunk_size + 512: chunk_size + 1024]

  head_3 = text[chunk_size + 1024: chunk_size + 1536]
  tail_3 = text[chunk_size + 1536: chunk_size + 2048]

  head_4 = text[chunk_size + 2048: chunk_size + 256]
  tail_4 = text[chunk_size + 2560: chunk_size + 3071]

  tail_only = text[-chunk_size:]
  inputs = tokenizer(question, head + tail , return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["EVENT"]
  f1_score = evaluate_f1(ground_truth, prediction)
  f+=f1_score
  exact_match = evaluate_exact_match(ground_truth, prediction)
  exM += int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")
  if prediction == When_part.lower():
    s+=1
percenantage = (s/25)*100
print("number correct EVENT :{}".format(s))
print("{} % correct answers".format(percenantage))
print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####

f1 score:  0
exact_match:  0 

#########

##### example 1: #####
employee # 1 suffered a serious fracture injury to his left leg
f1 score:  0.16
exact_match:  0 

#########

##### example 2: #####
employee # 1 had a seizure
f1 score:  1.0
exact_match:  1 

#########

##### example 3: #####
he was struck by the cement mixer that tipped over in the process of mixing cement / concrete
f1 score:  0.5714285714285715
exact_match:  0 

#########

##### example 4: #####
employee # 1 was killed
f1 score:  0.5714285714285715
exact_match:  0 

#########

##### example 5: #####
employee # 1 was killed
f1 score:  0.3333333333333333
exact_match:  0 

#########

##### example 6: #####
the employee moved and his left foot was struck by and run over by a loader
f1 score:  0.88
exact_match:  0 

#########

##### example 7: #####
he suffered a fractured pelvis a ruptured bladder and leg vein damage
f1 score:  0.08333333333333333
exact_match:  0 

#########

##### example 8: #####


meuilleur formulation avec (head + tail +head_4 + tail_4)

In [None]:
s = 0
f=0
exM=0
#"What was happening after what occured in the place where was it done?"
#"What was happening at the end?" 38%
#"What the most thing that was happening and marking the story  at the end?"  31%
# "What the most thing that was happening and marking the story?" 31%
#"What the most thing that was happening  in the story?" 31%
#"What the most thing that was happening?" 43.83 (the most > 43.27)
# "What is the most controversial thing that happened?" (46%)
# "What is the worst and risky thing happening?" 47.88%
#"What is the  most astonishing thing happening?"  head_only  (41%  16%) |  (head + tail) 50.61%   24%  | 54.61%  28% (head + tail + head_4 + tail_4)
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["EVENT"]
  question,text = "What is the most controversial thing that happened?",text_data

  chunk_size = 512
  head_only = text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  head_2 = text[chunk_size:chunk_size+ 513]
  tail_2 = text[chunk_size + 512: chunk_size + 1024]

  head_3 = text[chunk_size + 1024: chunk_size + 1536]
  tail_3 = text[chunk_size + 1536: chunk_size + 2048]

  head_4 = text[chunk_size + 2048: chunk_size + 256]
  tail_4 = text[chunk_size + 2560: chunk_size + 3071]

  tail_only = text[-chunk_size:]
  inputs = tokenizer(question, head + tail + head_4 + tail_4, return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["EVENT"]
  f1_score = evaluate_f1(ground_truth, prediction)
  f+=f1_score
  exact_match = evaluate_exact_match(ground_truth, prediction)
  exM += int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")
  if prediction == When_part.lower():
    s+=1
percenantage = (s/25)*100
print("number correct EVENT :{}".format(s))
print("{} % correct answers".format(percenantage))
print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
employee # 1 was struck and thrown
f1 score:  1.0
exact_match:  1 

#########

##### example 1: #####
employee # 1 suffered a serious fracture injury to his left leg
f1 score:  0.16
exact_match:  0 

#########

##### example 2: #####
employee # 1 had a seizure
f1 score:  1.0
exact_match:  1 

#########

##### example 3: #####
he was struck by the cement mixer that tipped over in the process of mixing cement / concrete
f1 score:  0.5714285714285715
exact_match:  0 

#########

##### example 4: #####
employee # 1 was killed
f1 score:  0.5714285714285715
exact_match:  0 

#########

##### example 5: #####
employee # 1 was killed
f1 score:  0.3333333333333333
exact_match:  0 

#########

##### example 6: #####
the employee moved and his left foot was struck by and run over by a loader
f1 score:  0.88
exact_match:  0 

#########

##### example 7: #####
he suffered a fractured pelvis a ruptured bladder and leg vein damage
f1 score:  0.08333333333333333
exact_match:  0 

2eme formulation (head + tail +head_4 + tail_4)

In [None]:
s = 0
f=0
exM=0
#"What was happening after what occured in the place where was it done?"
#"What was happening at the end?" 38%
#"What the most thing that was happening and marking the story  at the end?"  31%
# "What the most thing that was happening and marking the story?" 31%
#"What the most thing that was happening  in the story?" 31%
#"What the most thing that was happening?" 43.83 (the most > 43.27)
# "What is the most controversial thing that happened?" (46%)
# "What is the worst and risky thing happening?" 47.88%
#"What is the  most astonishing thing happening?"  head_only  (41%  16%) |  (head + tail) 50.61%   24%  | 54.61%  28% (head + tail + head_4 + tail_4)
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["EVENT"]
  question,text = "What was happening at the end?",text_data


  text_essay = text[256:256+512]
  chunk_size = 512
  head_only = text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  head_2 = text[chunk_size:chunk_size+ 513]
  tail_2 = text[chunk_size + 512: chunk_size + 1024]

  head_3 = text[chunk_size + 1024: chunk_size + 1536]
  tail_3 = text[chunk_size + 1536: chunk_size + 2048]

  head_4 = text[chunk_size + 2048: chunk_size + 256]
  tail_4 = text[chunk_size + 2560: chunk_size + 3071]

  tail_only = text[-chunk_size:]
  inputs = tokenizer(question, head + tail + head_4 + tail_4, return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["EVENT"]
  f1_score = evaluate_f1(ground_truth, prediction)
  f+=f1_score
  exact_match = evaluate_exact_match(ground_truth, prediction)
  exM += int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")
  if prediction == When_part.lower():
    s+=1
percenantage = (s/25)*100
print("number correct EVENT :{}".format(s))
print("{} % correct answers".format(percenantage))
print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
employee # 1 was struck and thrown
f1 score:  1.0
exact_match:  1 

#########

##### example 1: #####
demolish the interiors of the building. they scraped the interiors of the building and collected debris asemployee # 1 ' s left foot. employee # 1 suffered a serious fracture injury to his left leg and was hospitalized
f1 score:  0.17777777777777776
exact_match:  0 

#########

##### example 2: #####
paramedics arrived
f1 score:  0
exact_match:  0 

#########

##### example 3: #####
tipped over
f1 score:  0
exact_match:  0 

#########

##### example 4: #####

f1 score:  0
exact_match:  0 

#########

##### example 5: #####
the passenger vehicle struck the trailered equipment and employee # 1. employee # 1 was killed.
f1 score:  0.8
exact_match:  0 

#########

##### example 6: #####
the employee underwent surgery for a crushed left foot
f1 score:  0.2222222222222222
exact_match:  0 

#########

##### example 7: #####

f1 score:  0
exact_match:  0 

#########

###

1ère formulation (head + tail + head_4 + tail_4)

In [None]:
s = 0
f=0
exM=0
#"What was happening after what occured in the place where was it done?"
#"What was happening at the end?" 38%
#"What the most thing that was happening and marking the story  at the end?"  31%
# "What the most thing that was happening and marking the story?" 31%
#"What the most thing that was happening  in the story?" 31%
#"What the most thing that was happening?" 43.83 (the most > 43.27)
# "What is the most controversial thing that happened?" (46%)
# "What is the worst and risky thing happening?" 47.88%
#"What is the  most astonishing thing happening?"  head_only  (41%  16%) |  (head + tail) 50.61%   24%  | 54.61%  28% (head + tail + head_4 + tail_4)
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["EVENT"]
  question,text = "What was happening at the end?",text_data


  text_essay = text[256:256+512]
  chunk_size = 512
  head_only = text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  head_2 = text[chunk_size:chunk_size+ 513]
  tail_2 = text[chunk_size + 512: chunk_size + 1024]

  head_3 = text[chunk_size + 1024: chunk_size + 1536]
  tail_3 = text[chunk_size + 1536: chunk_size + 2048]

  head_4 = text[chunk_size + 2048: chunk_size + 256]
  tail_4 = text[chunk_size + 2560: chunk_size + 3071]

  tail_only = text[-chunk_size:]
  inputs = tokenizer(question, head + tail + head_4 + tail_4, return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["EVENT"]
  f1_score = evaluate_f1(ground_truth, prediction)
  f+=f1_score
  exact_match = evaluate_exact_match(ground_truth, prediction)
  exM += int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")
  if prediction == When_part.lower():
    s+=1
percenantage = (s/25)*100
print("number correct EVENT :{}".format(s))
print("{} % correct answers".format(percenantage))
print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
employee # 1 was struck and thrown
f1 score:  1.0
exact_match:  1 

#########

##### example 1: #####
demolish the interiors of the building. they scraped the interiors of the building and collected debris asemployee # 1 ' s left foot. employee # 1 suffered a serious fracture injury to his left leg and was hospitalized
f1 score:  0.17777777777777776
exact_match:  0 

#########

##### example 2: #####
paramedics arrived
f1 score:  0
exact_match:  0 

#########

##### example 3: #####
tipped over
f1 score:  0
exact_match:  0 

#########

##### example 4: #####

f1 score:  0
exact_match:  0 

#########

##### example 5: #####
the passenger vehicle struck the trailered equipment and employee # 1. employee # 1 was killed.
f1 score:  0.8
exact_match:  0 

#########

##### example 6: #####
the employee underwent surgery for a crushed left foot
f1 score:  0.2222222222222222
exact_match:  0 

#########

##### example 7: #####

f1 score:  0
exact_match:  0 

#########

###

In [36]:
s = 0
f=0
exM=0
#"What was happening after what occured in the place where was it done?"
#"What was happening at the end?" 38%
#"What the most thing that was happening and marking the story  at the end?"  31%
# "What the most thing that was happening and marking the story?" 31%
#"What the most thing that was happening  in the story?" 31%
#"What the most thing that was happening?" 43.83 (the most > 43.27)
# "What is the most controversial thing that happened?" (46%)
# "What is the worst and risky thing happening?" 47.88%
#"What is the  most astonishing thing happening?"  head_only  (41%  16%) |  (head + tail) 50.61%   24%  | 54.61%  28% (head + tail + head_4 + tail_4)
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["EVENT"]
  question,text = "What was happening at the end?",text_data


  text_essay = text[256:256+512]
  chunk_size = 512
  head_only = text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  head_2 = text[chunk_size:chunk_size+ 513]
  tail_2 = text[chunk_size + 512: chunk_size + 1024]

  head_3 = text[chunk_size + 1024: chunk_size + 1536]
  tail_3 = text[chunk_size + 1536: chunk_size + 2048]

  head_4 = text[chunk_size + 2048: chunk_size + 256]
  tail_4 = text[chunk_size + 2560: chunk_size + 3071]

  tail_only = text[-chunk_size:]
  inputs = tokenizer(question, head_only, return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["EVENT"]
  f1_score = evaluate_f1(ground_truth, prediction)
  f+=f1_score
  exact_match = evaluate_exact_match(ground_truth, prediction)
  exM += int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")

print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
standing on the ground checking the depth of the cut into the asphalt
f1 score:  0
exact_match:  0 

#########

##### example 1: #####
when the job assignment was finish
f1 score:  0
exact_match:  0 

#########

##### example 2: #####
paramedics arrived
f1 score:  0
exact_match:  0 

#########

##### example 3: #####
tipped over
f1 score:  0
exact_match:  0 

#########

##### example 4: #####
its brakes locked up leaving residue on the roadway. then the semi - truck swerved to miss a car at employee # 1 ' s sign area and struck employee # 1
f1 score:  0.21428571428571425
exact_match:  0 

#########

##### example 5: #####
a driver of a passenger vehicle fell asleep while driving and crossed three lanes of traffic
f1 score:  0.2727272727272727
exact_match:  0 

#########

##### example 6: #####
the employee moved and his left foot was struck by and run over by a loader
f1 score:  0.88
exact_match:  0 

#########

##### example 7: #####
a tractor - trailer unit was

In [37]:
s = 0
f=0
exM=0
#"What was happening after what occured in the place where was it done?"
#"What was happening at the end?" 38%
#"What the most thing that was happening and marking the story  at the end?"  31%
# "What the most thing that was happening and marking the story?" 31%
#"What the most thing that was happening  in the story?" 31%
#"What the most thing that was happening?" 43.83 (the most > 43.27)
# "What is the most controversial thing that happened?" (46%)
# "What is the worst and risky thing happening?" 47.88%
#"What is the  most astonishing thing happening?"  head_only  (41%  16%) |  (head + tail) 50.61%   24%  | 54.61%  28% (head + tail + head_4 + tail_4)
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["EVENT"]
  question,text = "What was happening at the end?",text_data


  text_essay = text[256:256+512]
  chunk_size = 512
  head_only = text[:chunk_size]
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]

  head_2 = text[chunk_size:chunk_size+ 513]
  tail_2 = text[chunk_size + 512: chunk_size + 1024]

  head_3 = text[chunk_size + 1024: chunk_size + 1536]
  tail_3 = text[chunk_size + 1536: chunk_size + 2048]

  head_4 = text[chunk_size + 2048: chunk_size + 256]
  tail_4 = text[chunk_size + 2560: chunk_size + 3071]

  tail_only = text[-chunk_size:]
  inputs = tokenizer(question, head + tail, return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["EVENT"]
  f1_score = evaluate_f1(ground_truth, prediction)
  f+=f1_score
  exact_match = evaluate_exact_match(ground_truth, prediction)
  exM += int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")

print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
grind out existing asphalt from an interstate at a railroad bridge
f1 score:  0
exact_match:  0 

#########

##### example 1: #####
demolish the interiors of the building. they scraped the interiors of the building and collected debris asemployee # 1 ' s left foot. employee # 1 suffered a serious fracture injury to his left leg and was hospitalized
f1 score:  0.17777777777777776
exact_match:  0 

#########

##### example 2: #####
paramedics arrived
f1 score:  0
exact_match:  0 

#########

##### example 3: #####
tipped over
f1 score:  0
exact_match:  0 

#########

##### example 4: #####

f1 score:  0
exact_match:  0 

#########

##### example 5: #####
the passenger vehicle struck the trailered equipment and employee # 1. employee # 1 was killed.
f1 score:  0.8
exact_match:  0 

#########

##### example 6: #####
the employee underwent surgery for a crushed left foot
f1 score:  0.2222222222222222
exact_match:  0 

#########

##### example 7: #####

f1 score:  0
ex

Pour connaitre le nombre d'exmples dépassant les 512 et inversement:

In [None]:
compte_inferieur_512 = 0
compte_superieur_512 = 0
for i in range(len(data)):
  print("exemple {} de longueur  {}".format(i,len(data[i]['text'])))
  if len(data[i]['text']) < 512:
    compte_inferieur_512+=1
compte_superieur_512 = len(data) - compte_inferieur_512
print(compte_inferieur_512)
print(compte_superieur_512)


exemple 0 de longueur  4002
exemple 1 de longueur  2234
exemple 2 de longueur  387
exemple 3 de longueur  469
exemple 4 de longueur  1008
exemple 5 de longueur  465
exemple 6 de longueur  319
exemple 7 de longueur  1088
exemple 8 de longueur  1234
exemple 9 de longueur  545
exemple 10 de longueur  1291
exemple 11 de longueur  1020
exemple 12 de longueur  786
exemple 13 de longueur  853
exemple 14 de longueur  464
exemple 15 de longueur  764
exemple 16 de longueur  537
exemple 17 de longueur  390
exemple 18 de longueur  536
exemple 19 de longueur  1312
exemple 20 de longueur  1461
exemple 21 de longueur  625
exemple 22 de longueur  232
exemple 23 de longueur  238
exemple 24 de longueur  81
9
16


**when**

In [None]:
s = 0
f=0
exM=0
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["WHEN"]
  question,text = "What was the date of the event?",text_data

  chunk_size = 512
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]
  inputs = tokenizer(question, head + tail, return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["WHEN"]
  f1_score = evaluate_f1(ground_truth, prediction)
  f+=f1_score
  exM+=int(exact_match)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exact_match),"\n")
  print("#########\n")
  if prediction == When_part.lower():
    s+=1
percenantage = (s/25)*100
print("number correct WHEN :{}".format(s))
print("{} % correct answers".format(percenantage))
print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
november 10 2013
f1 score:  1.0
exact_match:  1 

#########

##### example 1: #####
august 27 2012
f1 score:  1.0
exact_match:  1 

#########

##### example 2: #####
september 19 2012
f1 score:  1.0
exact_match:  1 

#########

##### example 3: #####
november17 2010
f1 score:  1.0
exact_match:  1 

#########

##### example 4: #####
september 26 2013
f1 score:  1.0
exact_match:  1 

#########

##### example 5: #####
june 14 2011
f1 score:  1.0
exact_match:  1 

#########

##### example 6: #####
february 3 2011
f1 score:  1.0
exact_match:  1 

#########

##### example 7: #####
february 6 2009
f1 score:  1.0
exact_match:  1 

#########

##### example 8: #####
september 29 2011
f1 score:  1.0
exact_match:  1 

#########

##### example 9: #####
september 30 2008
f1 score:  1.0
exact_match:  1 

#########

##### example 10: #####
august 24 2003
f1 score:  1.0
exact_match:  1 

#########

##### example 11: #####
october 24 2008
f1 score:  1.0
exact_match:  1 

#########

In [None]:
# Parameters
window_size = 512  # Max tokens per window
stride = 128  # Overlap size

# Split text into overlapping windows
def sliding_windows(text, window_size, stride):
    input_ids = tokenizer.encode(text, add_special_tokens=False)
    windows = []
    for i in range(0, len(input_ids), stride):
        window = input_ids[i:i + window_size]
        windows.append(window)
    return windows

# Initialize scores
s = 0
f = 0
exM = 0

for i in range(len(data)):
    text_data = data[i]['text']
    ground_truth = data[i]["WHEN"]
    question, text = "When was the employee doing staff?", text_data

    windows = sliding_windows(text_data, window_size, stride)
    answers = []

    # Get answers from each window
    for window in windows:
        tokens = tokenizer.convert_ids_to_tokens(window, skip_special_tokens=True)
        window_text = tokenizer.convert_tokens_to_string(tokens)

        inputs = tokenizer.encode_plus(question, window_text, return_tensors="pt", truncation=True, max_length=window_size)
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)

        start_scores = outputs.start_logits
        end_scores = outputs.end_logits
        all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

        # Get the best answer within this window
        answer_start = torch.argmax(start_scores)
        answer_end = torch.argmax(end_scores) + 1

        if answer_start < answer_end and answer_end <= len(all_tokens):
            answer = tokenizer.convert_tokens_to_string(all_tokens[answer_start:answer_end])
            score = start_scores[0][answer_start] + end_scores[0][answer_end - 1]
            answers.append((answer, score.item()))

    # Select the best answer based on score
    prediction = max(answers, key=lambda x: x[1])[0] if answers else "No answer found"

    # Evaluate performance
    f1_score = evaluate_f1(ground_truth, prediction)
    f += f1_score
    exact_match = evaluate_exact_match(ground_truth, prediction)
    exM += int(exact_match)

    print(f"##### Example {i}: #####")
    print("Prediction:", prediction)
    print("F1 score:", f1_score)
    print("Exact Match:", int(exact_match), "\n")
    print("#########\n")

    if prediction.lower() == ground_truth.lower():
        s += 1

# Final Metrics
percentage = (s / len(data)) * 100
print(f"Number correct EVENT: {s}")
print(f"{percentage:.2f}% correct answers")
print(f"{f / len(data):.2f} Average F1 Score")
print(f"{(exM / len(data)) * 100:.2f}% Average Exact Match Score")


##### Example 0: #####
Prediction: november 10 2013
F1 score: 1.0
Exact Match: 1 

#########

##### Example 1: #####
Prediction: august 27 2012
F1 score: 1.0
Exact Match: 1 

#########

##### Example 2: #####
Prediction: september 19 2012
F1 score: 1.0
Exact Match: 1 

#########

##### Example 3: #####
Prediction: 9 : 30 a. m. on november17 2010
F1 score: 0.5
Exact Match: 0 

#########

##### Example 4: #####
Prediction: september 26 2013
F1 score: 1.0
Exact Match: 1 

#########

##### Example 5: #####
Prediction: june 14 2011
F1 score: 1.0
Exact Match: 1 

#########

##### Example 6: #####
Prediction: february 3 2011
F1 score: 1.0
Exact Match: 1 

#########

##### Example 7: #####
Prediction: february 6 2009
F1 score: 1.0
Exact Match: 1 

#########

##### Example 8: #####
Prediction: september 29 2011
F1 score: 1.0
Exact Match: 1 

#########

##### Example 9: #####
Prediction: september 30 2008
F1 score: 1.0
Exact Match: 1 

#########

##### Example 10: #####
Prediction: august 24 200

In [39]:

f=0
exM=0
for i in range(len(data)):
  text_data = data[i]['text']
  ground_truth = data[i]["WHEN"]
  question,text = "When was the employee doing staff?",text_data

  chunk_size = 512
  head = text[:chunk_size//2]
  tail = text[-chunk_size//2:]
  inputs = tokenizer(question, head + tail, return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  answer_start_index = outputs.start_logits.argmax()
  answer_end_index = outputs.end_logits.argmax()

  predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
  prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

  When_part = data[i]["WHEN"]
  f1_score = evaluate_f1(ground_truth, prediction)
  f+=f1_score
  exM+=int(exM)
  print("##### example {}: #####".format(i))
  print(prediction)
  print("f1 score: ",f1_score)
  print("exact_match: ",int(exM),"\n")
  print("#########\n")

print("{}  moyenne de F1 - Score".format(f/25))
print("{} % moyenne de Exact Match - Score".format((exM/25)*100))

##### example 0: #####
november 10 2013
f1 score:  1.0
exact_match:  0 

#########

##### example 1: #####
august 27 2012
f1 score:  1.0
exact_match:  0 

#########

##### example 2: #####
september 19 2012
f1 score:  1.0
exact_match:  0 

#########

##### example 3: #####
november17 2010
f1 score:  1.0
exact_match:  0 

#########

##### example 4: #####
september 26 2013
f1 score:  1.0
exact_match:  0 

#########

##### example 5: #####
june 14 2011
f1 score:  1.0
exact_match:  0 

#########

##### example 6: #####
february 3 2011
f1 score:  1.0
exact_match:  0 

#########

##### example 7: #####
february 6 2009
f1 score:  1.0
exact_match:  0 

#########

##### example 8: #####
september 29 2011
f1 score:  1.0
exact_match:  0 

#########

##### example 9: #####
september 30 2008
f1 score:  1.0
exact_match:  0 

#########

##### example 10: #####
august 24 2003
f1 score:  1.0
exact_match:  0 

#########

##### example 11: #####
october 24 2008
f1 score:  1.0
exact_match:  0 

#########

In [None]:
# Parameters
window_size = 512  # Max tokens per window
stride = 128  # Overlap size

# Split text into overlapping windows
def sliding_windows(text, window_size, stride):
    input_ids = tokenizer.encode(text, add_special_tokens=False)
    windows = []
    for i in range(0, len(input_ids), stride):
        window = input_ids[i:i + window_size]
        windows.append(window)
    return windows

# Initialize scores
s = 0
f = 0
exM = 0

for i in range(len(data)):
    text_data = data[i]['text']
    ground_truth = data[i]["WHEN"]
    question, text = "What was the date of the event?", text_data

    windows = sliding_windows(text_data, window_size, stride)
    answers = []

    # Get answers from each window
    for window in windows:
        tokens = tokenizer.convert_ids_to_tokens(window, skip_special_tokens=True)
        window_text = tokenizer.convert_tokens_to_string(tokens)

        inputs = tokenizer.encode_plus(question, window_text, return_tensors="pt", truncation=True, max_length=window_size)
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)

        start_scores = outputs.start_logits
        end_scores = outputs.end_logits
        all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

        # Get the best answer within this window
        answer_start = torch.argmax(start_scores)
        answer_end = torch.argmax(end_scores) + 1

        if answer_start < answer_end and answer_end <= len(all_tokens):
            answer = tokenizer.convert_tokens_to_string(all_tokens[answer_start:answer_end])
            score = start_scores[0][answer_start] + end_scores[0][answer_end - 1]
            answers.append((answer, score.item()))

    # Select the best answer based on score
    prediction = max(answers, key=lambda x: x[1])[0] if answers else ""

    # Evaluate performance
    f1_score = evaluate_f1(ground_truth, prediction)
    f += f1_score
    exact_match = evaluate_exact_match(ground_truth, prediction)
    exM += int(exact_match)

    print(f"##### Example {i}: #####")
    print("Prediction:", prediction)
    print("F1 score:", f1_score)
    print("Exact Match:", int(exact_match), "\n")
    print("#########\n")

    if prediction.lower() == ground_truth.lower():
        s += 1

# Final Metrics
percentage = (s / len(data)) * 100
print(f"Number correct EVENT: {s}")
print(f"{percentage:.2f}% correct answers")
print(f"{f / len(data):.2f} Average F1 Score")
print(f"{(exM / len(data)) * 100:.2f}% Average Exact Match Score")


##### Example 0: #####
Prediction: november 10 2013
F1 score: 1.0
Exact Match: 1 

#########

##### Example 1: #####
Prediction: august 27 2012 employee # 1 a 19 year - old male laborer with stomper company inc. arrived at 2 : 00. am. at a site in menlo park california to demolish the interiors of the building. they scraped the interiors of the building and collected debris as they finished up the job. on august 28 2012
F1 score: 0.12
Exact Match: 0 

#########

##### Example 2: #####
Prediction: september 19 2012
F1 score: 1.0
Exact Match: 1 

#########

##### Example 3: #####
Prediction: november17 2010
F1 score: 1.0
Exact Match: 1 

#########

##### Example 4: #####
Prediction: september 26 2013
F1 score: 1.0
Exact Match: 1 

#########

##### Example 5: #####
Prediction: june 14 2011
F1 score: 1.0
Exact Match: 1 

#########

##### Example 6: #####
Prediction: february 3 2011
F1 score: 1.0
Exact Match: 1 

#########

##### Example 7: #####
Prediction: february 6 2009
F1 score: 1.0
Ex

In [None]:
# Parameters
window_size = 512  # Max tokens per window
stride = 128  # Overlap size

# Split text into overlapping windows
def sliding_windows(text, window_size, stride):
    input_ids = tokenizer.encode(text, add_special_tokens=False)
    windows = []
    for i in range(0, len(input_ids), stride):
        window = input_ids[i:i + window_size]
        windows.append(window)
    return windows

# Initialize scores
s = 0
f = 0
exM = 0

for i in range(len(data)):
    text_data = data[i]['text']
    ground_truth = data[i]["WHEN"]
    question, text = "What was the exact day?", text_data

    windows = sliding_windows(text_data, window_size, stride)
    answers = []

    # Get answers from each window
    for window in windows:
        tokens = tokenizer.convert_ids_to_tokens(window, skip_special_tokens=True)
        window_text = tokenizer.convert_tokens_to_string(tokens)

        inputs = tokenizer.encode_plus(question, window_text, return_tensors="pt", truncation=True, max_length=window_size)
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)

        start_scores = outputs.start_logits
        end_scores = outputs.end_logits
        all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

        # Get the best answer within this window
        answer_start = torch.argmax(start_scores)
        answer_end = torch.argmax(end_scores) + 1

        if answer_start < answer_end and answer_end <= len(all_tokens):
            answer = tokenizer.convert_tokens_to_string(all_tokens[answer_start:answer_end])
            score = start_scores[0][answer_start] + end_scores[0][answer_end - 1]
            answers.append((answer, score.item()))

    # Select the best answer based on score
    prediction = max(answers, key=lambda x: x[1])[0] if answers else ""

    # Evaluate performance
    f1_score = evaluate_f1(ground_truth, prediction)
    f += f1_score
    exact_match = evaluate_exact_match(ground_truth, prediction)
    exM += int(exact_match)

    print(f"##### Example {i}: #####")
    print("Prediction:", prediction)
    print("F1 score:", f1_score)
    print("Exact Match:", int(exact_match), "\n")
    print("#########\n")

    if prediction.lower() == ground_truth.lower():
        s += 1

# Final Metrics
percentage = (s / len(data)) * 100
print(f"Number correct EVENT: {s}")
print(f"{percentage:.2f}% correct answers")
print(f"{f / len(data):.2f} Average F1 Score")
print(f"{(exM / len(data)) * 100:.2f}% Average Exact Match Score")


##### Example 0: #####
Prediction: november 10 2013
F1 score: 1.0
Exact Match: 1 

#########

##### Example 1: #####
Prediction: august 27 2012
F1 score: 1.0
Exact Match: 1 

#########

##### Example 2: #####
Prediction: september 19 2012
F1 score: 1.0
Exact Match: 1 

#########

##### Example 3: #####
Prediction: november17 2010
F1 score: 1.0
Exact Match: 1 

#########

##### Example 4: #####
Prediction: september 26 2013
F1 score: 1.0
Exact Match: 1 

#########

##### Example 5: #####
Prediction: june 14 2011
F1 score: 1.0
Exact Match: 1 

#########

##### Example 6: #####
Prediction: february 3 2011
F1 score: 1.0
Exact Match: 1 

#########

##### Example 7: #####
Prediction: february 6 2009
F1 score: 1.0
Exact Match: 1 

#########

##### Example 8: #####
Prediction: september 29 2011
F1 score: 1.0
Exact Match: 1 

#########

##### Example 9: #####
Prediction: september 30 2008
F1 score: 1.0
Exact Match: 1 

#########

##### Example 10: #####
Prediction: august 24 2003
F1 score: 1.0


In [None]:
# Parameters
window_size = 512  # Max tokens per window
stride = 128  # Overlap size

# Split text into overlapping windows
def sliding_windows(text, window_size, stride):
    input_ids = tokenizer.encode(text, add_special_tokens=False)
    windows = []
    for i in range(0, len(input_ids), stride):
        window = input_ids[i:i + window_size]
        windows.append(window)
    return windows

# Initialize scores
s = 0
f = 0
exM = 0

for i in range(len(data)):
    text_data = data[i]['text']
    ground_truth = data[i]["WHEN"]
    question, text = "What was the date of the event?", text_data

    windows = sliding_windows(text_data, window_size, stride)
    answers = []

    # Get answers from each window
    for window in windows:
        tokens = tokenizer.convert_ids_to_tokens(window, skip_special_tokens=True)
        window_text = tokenizer.convert_tokens_to_string(tokens)

        inputs = tokenizer.encode_plus(question, window_text, return_tensors="pt", truncation=True, max_length=window_size)
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)

        start_scores = outputs.start_logits
        end_scores = outputs.end_logits
        all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

        # Get the best answer within this window
        answer_start = torch.argmax(start_scores)
        answer_end = torch.argmax(end_scores) + 1

        if answer_start < answer_end and answer_end <= len(all_tokens):
            answer = tokenizer.convert_tokens_to_string(all_tokens[answer_start:answer_end])
            score = start_scores[0][answer_start] + end_scores[0][answer_end - 1]
            answers.append((answer, score.item()))

    # Select the best answer based on score
    prediction = max(answers, key=lambda x: x[1])[0] if answers else ""

    # Evaluate performance
    f1_score = evaluate_f1(ground_truth, prediction)
    f += f1_score
    exact_match = evaluate_exact_match(ground_truth, prediction)
    exM += int(exact_match)

    print(f"##### Example {i}: #####")
    print("Prediction:", prediction)
    print("F1 score:", f1_score)
    print("Exact Match:", int(exact_match), "\n")
    print("#########\n")

    if prediction.lower() == ground_truth.lower():
        s += 1

# Final Metrics
percentage = (s / len(data)) * 100
print(f"Number correct EVENT: {s}")
print(f"{percentage:.2f}% correct answers")
print(f"{f / len(data):.2f} Average F1 Score")
print(f"{(exM / len(data)) * 100:.2f}% Average Exact Match Score")


##### Example 0: #####
Prediction: november 10 2013
F1 score: 1.0
Exact Match: 1 

#########

##### Example 1: #####
Prediction: august 27 2012 employee # 1 a 19 year - old male laborer with stomper company inc. arrived at 2 : 00. am. at a site in menlo park california to demolish the interiors of the building. they scraped the interiors of the building and collected debris as they finished up the job. on august 28 2012
F1 score: 0.12
Exact Match: 0 

#########

##### Example 2: #####
Prediction: september 19 2012
F1 score: 1.0
Exact Match: 1 

#########

##### Example 3: #####
Prediction: november17 2010
F1 score: 1.0
Exact Match: 1 

#########

##### Example 4: #####
Prediction: september 26 2013
F1 score: 1.0
Exact Match: 1 

#########

##### Example 5: #####
Prediction: june 14 2011
F1 score: 1.0
Exact Match: 1 

#########

##### Example 6: #####
Prediction: february 3 2011
F1 score: 1.0
Exact Match: 1 

#########

##### Example 7: #####
Prediction: february 6 2009
F1 score: 1.0
Ex

On remarque qu’il y a 6 exemples ou la taille correspond à celle par défaut pour la fenetre de Bert (<512) et les autres 16 exemples dépassent jusqu’à atteindre taille de 4002.


Donc , on a abordé les stratégies de sliding window et la concaténation de différents parties afin de résoudre ce problème et on a fait les comparaisons principalement entre la méthode par défaut et celle de la concaténation.

# 1 .Les réultats pour les 3 informations :

On s’intéresse dans notre analyse à la mesure de F1 Score plus que Exact match  (puisque elle permet de vérifier la capacité de notre modèle à trouver le contexte proche de la réponse en partie ou en tout ) car si on trouve une partie importante de la réponse pour chaque exemple le F1 Score va etre élevée et donc une bonne performance alors que Exact match est plutôt stricte puisqu’il faut que la prédiction est exacte à la réponse.

......

WHEN :      F1 Score : 96%    Exact match : 96%  
        
> => La valeur de F1 Score est très bon ( 90%< 96%<= 100%) . On a pu repérer 24/25  mots exactement correct.





WHERE :      F1 Score : 74.33%    Exact match : 64%

> =>	La valeur de F1 Score est moyen (puisque  50% < 74.33% < 80%)



EVENT :      F1 Score :  54.62%    Exact match : 28%

> =>	La valeur de F1 Score est moyen (puisque  50% < 54.62%< 80%)



**Conclusion : Le modèle a pu bien repérer les réponses pour le WHEN  , moyennement le WHERE et moins bon l’EVENT .**

# 2. Analyse de choix de partie :

WHEN :  situé au début

WHERE : situé au début et au milieu.

EVENT : situé au début et à la fin surtout.  (il est aussi éparpillé dans toute la phrase et plusieurs autres choses (événements , situations ) qui se passent proches dans le sens et position dans les phrases (cela peut causer des erreurs).

# 3.Discussion du choix de question : (Performance et explication globale de don choix)

**WHEN :**

1er partie :

On a utilisé la technique de fenetre glissante pour taille fenetre = taille maximale que le model Bert peut prendre comme fenetre et le stride (pas pour la fenetre) est 128

« When was the employee doing staff? »  F1 Score 90% Exact Match  88%    (sliding window)  | (les premiers 256 et les derniers 256 tokens )   F1 Score 92% Exact Match  0%

> =>	On a choisi  When pour repérer la date et on a remarqué que le contexte du temps est liée dans la plupart des cas à l’employé (Employee) .



« What was the date of the event? » F1 Score 92% Exact Match  92%  (sliding window)

> =>	Cette question s’intéresse a la date étant liée au contexte d’un événement qui se passe .



« What was the day? » F1 Score 90% Exact Match  84% (sliding window)

> =>	Cette question met le focus plus sur la date exacte ce qui permet de mieux repérer les dates .




« What was the exact day? » F1 Score 96% Exact Match  96% (sliding window)   | F1 Score 100% Exact Match  0%  (les premiers 256 et les derniers 256 tokens )
> => Cette question met le focus plus sur la date exacte ce qui permet de mieux repérer les dates .





Conclusion à propos des 3 formulations :

Les formulations sont proches en termes  de performance et celle qui donne le plus de contexte meme en cas d’absence de date est la dernière.

Conclusion à propos de la  comparaison des stratégies:

On conclut que la  cominaison des tokens des  parties de debut et de fin est légérement meuilleur en termes de F1 score (2% à 4 %) mais ayant un exact match nulle donc on opte pour le sliding window pour la formulation de la question WHEN  « What was the exact day? » F1 Score 96% Exact Match  96%.

La fenetre glissante dans ce type de question est plus efficace que la methode de choisir une combinaison de parties de la phrase au debut et a la fin pour prendre la partie la plus importante dans le contexte puisque sa performance est meuilleur.

**WHERE :**

"where was it mostly?"  F1 Score 59%   Exact Match 44% (test_debut : cad les premiers 512 jetons)
> =>	On retient que cette formulation donne le lieu qui peut etre comme emplacement qui n’est pas le point d’interet des événements dans plusieurs cas.



"where was it done?"   F1 Score 67.56%   Exact Match 44% (head + tail)  | les premiers 512 :  F1 Score 70%   Exact Match 56%  

> => On a choisi ce type de question   pour avoir un lieu comme dans la 1ère  formulation qui n’est pas nécessairement dans ce contexte de « mostly » cad on ne l’a pas restrient à certains lieux ce qui a permis de correspondre mieux aux exemples.



"Where was the actual place it was occuring?"  F1 Score 74.33%   Exact Match  (text_debut + head_3) 64%  | les premiers 512 :  F1 Score 73.1%   Exact Match 64%  

> Le choix de cette question est liée au fait qu’un lieu est liée à un événement .



   Conclusion à propos des 3 formulations :

On a choisi  WHERE pour trouver le lieu .De plus , on a remarqué que le contexte est le lieu qui marque  quelque chose qui s’est passée (un événement  présent) et non juste un emplacement qui n’a pas subit une situation (comme 2ème formulation) puisque il existe plusieurs possibilités de lieu (d’où le manque de performance du 2ème formulation par rapport à la dernière et la 1ère est proche du 2ème donc la meme chose)  .

Conclusion à propos de la comparaison des stratégies:

La combinaison de parties de la phrase au milieu dans ce type de question est plus efficace que la methode de choisir prendre juste la partie par défaut parle transformer de Bert ce qui a permis pour la meuilleur formulation de gagner presque 2% en F1score.


**EVENT :**

"What was happening at the end?" (les premiers 512) F1 Score 30% |(les premiers et les derniers 256 tokens)  F1 Score 38.48%   Exact match16% |42.48%  20% (head + tail + head_4 + tail_4)

> =>	Cette question se concentre sur les événements précis et factuels qui se produisent à la fin d’un scenario



“What is the most controversial thing that happened?” F1 score 47.21% (head + tail + head_4 + tail_4) Exact match 24%

> =>	Cette question cible des événements marquants qui suscitent des divergences d'opinion  ou des conflits.



"What is the  most astonishing thing happening?"  head_only  (39.59%  16%) |  (head + tail) 50.61%   24%  | 54.56%  28% (head + tail + head_4 + tail_4)

> =>	l’objectif est d’identifier quelque chose de surprenant ou d’extraordinaire dans le context.



comparaison des stratégies:

Pour la 3ème et la 1ère  formulation en adoptant la technique de concaténation de différents parties (les 1er 256 et les derniers 256 tokens) on a pu ganger par rapport à la startégie par défaut (les premiers 512 tokens)  presque (9 à 10%) en F1 score . Par conséquent la partie de réponse est non pas toujours situés au début et pour mieux s’assurer de cela on a essayé la concaténation du précédente + concaténation des tokens d’un autre emplacement à la fin et on a gagné presque ( 12 à 14%) en F1 score . Cela affirme que une grande partie des exemples contient l’information à la fin.Donc la technique de concaténation est plus efficace que prendre celle par défaut.

# 4.Les erreurs dans les types de questions :

**WHEN**

*1ère formulation:*

Les 2 fautes de la 1ere  (90%) sont dans les 2 exemples (17 et 24) :

 " Employee #1  a diver  became caught in a coffer dam and drowned.                " (le dernier exemple)

" Employee #1  an independent contractor at a construction site  was trying to  stand on end a wood-framed wall. The wall was too heavy for one person and  when it bumped up against a ceiling pipe while being raised  he lost control  of it. The wall fell on Employee #1  who sustained a compressed disc in his  back.                                                                           " (ligne 105 du fichier json)

*2ème formulation:*

la faute dans l’exemple 2 qu’il a ajouté en plus de la date , l’horaire de la journée.

*3ème formulation* :

les 2 fautes dans les exemples 17 et 24 puisque il a retourné une chaine non vide et aussi elle n’est pas une date .

*4ème formulation:*

On a bien résolu un des 2 exemples  faux dans la 1ère , ça peut s’expliquer par le fait d’avoir en contexte la date exacte permet de repérer l’exemple ou il  ne figure pas de dates .



> =>	la plupart des fautes dans les différents est liée à la difficulté de bien repérer l’absence de la réponse (cad chaine vide).



**WHERE**

Les erreurs commis par :

*1ère formulation:*

les exemples 3 , 4 , 10 , 12, 14 , 15 , 17 ,18 , 20   totalement faux (f1 score = 0 ou presque = 0),  et les exemples partiellement faux (exact_match = 0 et f1_score < 1 ) sont 0 , 7, 8 , 16 .

*2ème formulation:*

les exemples 2 , 3 , 4 , 7, 10 , 15 , 17  , 22  totalement(ou presque totalement)  faux ,  et les exemples partiellement faux (exact_match = 0 et f1_score < 1 ) sont 0 , 5 , 8, 12  , 13 , 14 , 19

*3ème formulation*

les exemples totalement(ou presque totalement)  faux 3 , 8 , 10 ,12 , 15 et les exemples partiellement faux  sont 16 , 19  et les exemples partiellement faux (exact_match = 0 et f1_score < 1 ) sont 0 , 7, 16 .



> =>	On peut conclure qu’il y a des exemples communs qui sont faux dans les différentes formulations , on peut assumer que ces exemples ( 3 , 10 , 15) sont les plus problématiques.



**EVENT**

Les erreurs commis par :

*1ère formulation:*

les exemples (presque totalement) faussess sont 1 , 2 ,3 , 4 , 6 , 7 , 8 , 9 , 14 , 17 , 19  

*2ème formulation:*

les exemples (presque totalement) faussess sont 0 , 1 , 5 , 7 , 8 , 9 , 10 , 11 , 15 , 17 , 19 , 21 , 24  

*3ème formulation*

les exemples (presque totalement) faussess sont 1 , 5 , 7 , 8 , 9 , 19 , 24



> =>	On constate  qu’il y a des exemples communs qui sont faux dans les différentes formulations , on peut assumer que ces exemples ( 1 , 5 , 7 , 8 , 9 , 19) sont les plus problématiques.



