# Imports

In [1]:
import nltk
nltk.download('wordnet')
import pandas as pd
import sys
sys.path.append('../')

from nltk.translate.meteor_score import meteor_score
from nltk.tokenize import word_tokenize
from qa_evaluation import QA_Evaluator
from question_gen_en import QuestionGenerator

[nltk_data] Downloading package wordnet to C:\Users\Will
[nltk_data]     Blanton\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
  from .autonotebook import tqdm as notebook_tqdm


# Helper Functions

In [2]:
def readable_print(text):
    # Replace each period with a period followed by a newline character
    modified_text = text.replace('. ', '.\n')
    print(modified_text)

def meteor_comparison(generated_questions: list[str], dataset_questions: list[str]):
    """
    Compare the generated questions with the dataset questions using METEOR score
    """

    scores = []
    for gq in generated_questions:
        generated_question = word_tokenize(gq.lower())

        # tokenize the questions from the dataset
        ref_questions = [word_tokenize(ref_q.lower()) for ref_q in dataset_questions]

        score = meteor_score(ref_questions, generated_question)

        score = score if score >= .00001 else 0

        scores.append(score)


    average_score = sum(scores) / len(scores)
    return average_score


# Get Questions

In [3]:
articles = pd.read_json("../data/xquad.en.json")

articles = [a for a in articles["data"]]

In [4]:
# crete a list of tuples, each tuple contains the title, context and question

cq_pairs = {}
for a in articles:
    title = a["title"]
    for p in a["paragraphs"]:
        context = p["context"]
        cq_pairs[context] = [(qas["question"], qas["answers"][0]["text"])for qas in p["qas"]]   
            
cq_pairs[list(cq_pairs.keys())[0]]

[('How many points did the Panthers defense surrender?', '308'),
 ('How many career sacks did Jared Allen have?', '136'),
 ('How many tackles did Luke Kuechly register?', '118'),
 ('How many balls did Josh Norman intercept?', 'four'),
 ('Who registered the most sacks on the team this season?', 'Kawann Short'),
 ('How many interceptions are the Panthers defense credited with in 2015?',
  '24'),
 ('Who led the Panthers in sacks?', 'Kawann Short'),
 ('How many Panthers defense players were selected for the Pro Bowl?', 'four'),
 ('How many forced fumbles did Thomas Davis have?', 'four'),
 ('Which player had the most interceptions for the season?', 'Kurt Coleman'),
 ("How many 2015 season interceptions did the Panthers' defense get?", '24'),
 ('Who had five sacks in nine games as a Carolina Panthers starter?',
  'Kony Ealy'),
 ("Who was the Panthers' tackle leader for 2015?", 'Luke Kuechly.'),
 ('How many interceptions did Josh Norman score touchdowns with in 2015?',
  'two.')]

In [5]:
len(cq_pairs)

240

# Test Models for Question Generation

In [6]:
context = list(cq_pairs.keys())[0]
readable_print(context)

The Panthers defense gave up just 308 points, ranking sixth in the league, while also leading the NFL in interceptions with 24 and boasting four Pro Bowl selections.
Pro Bowl defensive tackle Kawann Short led the team in sacks with 11, while also forcing three fumbles and recovering two.
Fellow lineman Mario Addison added 6½ sacks.
The Panthers line also featured veteran defensive end Jared Allen, a 5-time pro bowler who was the NFL's active career sack leader with 136, along with defensive end Kony Ealy, who had 5 sacks in just 9 starts.
Behind them, two of the Panthers three starting linebackers were also selected to play in the Pro Bowl: Thomas Davis and Luke Kuechly.
Davis compiled 5½ sacks, four forced fumbles, and four interceptions, while Kuechly led the team in tackles (118) forced two fumbles, and intercepted four passes of his own.
Carolina's secondary featured Pro Bowl safety Kurt Coleman, who led the team with a career high seven interceptions, while also racking up 88 tack

In [7]:
# generate questions
q_gen = QuestionGenerator()

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can s

In [8]:
# dataset questions
cq_pairs[context]

[('How many points did the Panthers defense surrender?', '308'),
 ('How many career sacks did Jared Allen have?', '136'),
 ('How many tackles did Luke Kuechly register?', '118'),
 ('How many balls did Josh Norman intercept?', 'four'),
 ('Who registered the most sacks on the team this season?', 'Kawann Short'),
 ('How many interceptions are the Panthers defense credited with in 2015?',
  '24'),
 ('Who led the Panthers in sacks?', 'Kawann Short'),
 ('How many Panthers defense players were selected for the Pro Bowl?', 'four'),
 ('How many forced fumbles did Thomas Davis have?', 'four'),
 ('Which player had the most interceptions for the season?', 'Kurt Coleman'),
 ("How many 2015 season interceptions did the Panthers' defense get?", '24'),
 ('Who had five sacks in nine games as a Carolina Panthers starter?',
  'Kony Ealy'),
 ("Who was the Panthers' tackle leader for 2015?", 'Luke Kuechly.'),
 ('How many interceptions did Josh Norman score touchdowns with in 2015?',
  'two.')]

In [9]:
example_qa = cq_pairs[context][0]
example_qa

('How many points did the Panthers defense surrender?', '308')

In [10]:
gen_q = q_gen.generate_question(
    answer=example_qa[1],
    context=context)
gen_q

'How many points did the Panthers defense give up?'

In [11]:
evaluator = QA_Evaluator()

In [12]:
evaluator.answer_question(context, gen_q)

'308'

In [13]:
import random
score_dfs = []

eval_pairs = list(cq_pairs.items())
#eval_pairs = random.sample(list(cq_pairs.items()), 4)

# for each context and answer, generate a question and evaluate it
for (context, qas) in eval_pairs:
    for question, answer in qas:

        # evaluation metrics to save
        generated_q = q_gen.generate_question(answer, context)
        meteor_q = meteor_comparison([generated_q], [question])
        similarity = evaluator.calculate_similarity(generated_q, question)

        # answer the generated question as part of question evaluation
        generated_a, answer_similarity = evaluator.evaluate_qa_pair(context, generated_q, answer)
        meteor_a = meteor_comparison([generated_a], [answer])

        df = pd.DataFrame({
            "context": [context],
            "answer": [answer],
            "target_q": [question],
            "generated_q": [generated_q],
            "generated_q_answer": [generated_a],
            "exact_match": [int(generated_a.lower() == answer.lower())],
            "q_METEOR_score": [meteor_q],
            "a_METEOR_score": [meteor_a],
            "question_similarity": [similarity],
            "answer_similarity": [answer_similarity]
        })

    score_dfs.append(df)

score_df = pd.concat(score_dfs, axis=0)

In [14]:
score_df.reset_index(drop=True, inplace=True)
score_df.head(10)

Unnamed: 0,context,answer,target_q,generated_q,generated_q_answer,exact_match,q_METEOR_score,a_METEOR_score,question_similarity,answer_similarity
0,The University is organized into eleven separa...,Harvard Yard,What is the name of the area that the main cam...,Where is the main campus of Harvard located?,Cambridge,0,0.33406,0.0,0.646834,0.507282
1,As northwest Europe slowly began to warm up fr...,9000 BP,When was Europe fully forested and recovered f...,By what year was Europe fully forested?,9000 BP,1,0.3872,0.9375,0.831209,1.0
2,The historian Frederick W. Mote wrote that the...,lived in poverty and were ill treated,There were many Mongols with what unexpected s...,What was the status of the Mongol and Semu?,Poor and ill treated,0,0.21978,0.381426,0.73311,0.785776
3,This projection was not included in the final ...,"""Variations of Snow and Ice in the past and at...",What report had the correct date?,What is the name of the ICSI report?,Variations of Snow and Ice in the past and at ...,0,0.277778,0.85096,0.494314,0.818869


In [15]:
score_df

Unnamed: 0,context,answer,target_q,generated_q,generated_q_answer,exact_match,q_METEOR_score,a_METEOR_score,question_similarity,answer_similarity
0,The University is organized into eleven separa...,Harvard Yard,What is the name of the area that the main cam...,Where is the main campus of Harvard located?,Cambridge,0,0.33406,0.0,0.646834,0.507282
1,As northwest Europe slowly began to warm up fr...,9000 BP,When was Europe fully forested and recovered f...,By what year was Europe fully forested?,9000 BP,1,0.3872,0.9375,0.831209,1.0
2,The historian Frederick W. Mote wrote that the...,lived in poverty and were ill treated,There were many Mongols with what unexpected s...,What was the status of the Mongol and Semu?,Poor and ill treated,0,0.21978,0.381426,0.73311,0.785776
3,This projection was not included in the final ...,"""Variations of Snow and Ice in the past and at...",What report had the correct date?,What is the name of the ICSI report?,Variations of Snow and Ice in the past and at ...,0,0.277778,0.85096,0.494314,0.818869


In [16]:
score_df.describe()

Unnamed: 0,exact_match,q_METEOR_score,a_METEOR_score,question_similarity,answer_similarity
count,4.0,4.0,4.0,4.0,4.0
mean,0.25,0.304704,0.542472,0.676367,0.777982
std,0.5,0.072121,0.436434,0.142842,0.203554
min,0.0,0.21978,0.0,0.494314,0.507282
25%,0.0,0.263278,0.28607,0.608704,0.716152
50%,0.0,0.305919,0.616193,0.689972,0.802323
75%,0.25,0.347345,0.872595,0.757635,0.864152
max,1.0,0.3872,0.9375,0.831209,1.0


In [17]:
worst_semantic = score_df.loc[score_df['question_similarity'].idxmin()]
worst_semantic['question_similarity']

0.49431413

In [18]:
readable_print(worst_semantic['context'])

This projection was not included in the final summary for policymakers.
The IPCC has since acknowledged that the date is incorrect, while reaffirming that the conclusion in the final summary was robust.
They expressed regret for "the poor application of well-established IPCC procedures in this instance".
The date of 2035 has been correctly quoted by the IPCC from the WWF report, which has misquoted its own source, an ICSI report "Variations of Snow and Ice in the past and at present on a Global and Regional Scale".


In [19]:
worst_semantic['target_q']

'What report had the correct date?'

In [20]:
worst_semantic['generated_q']

'What is the name of the ICSI report?'

In [21]:
worst_semantic['answer']

'"Variations of Snow and Ice in the past and at present on a Global and Regional Scale"'

In [22]:
worst_semantic['generated_q_answer']

'Variations of Snow and Ice in the past and at present on a Global and Regional Scal'

In [23]:
best_score = score_df.loc[score_df['q_METEOR_score'].idxmin()]
readable_print(best_score['context'])

The historian Frederick W.
Mote wrote that the usage of the term "social classes" for this system was misleading and that the position of people within the four-class system was not an indication of their actual social power and wealth, but just entailed "degrees of privilege" to which they were entitled institutionally and legally, so a person's standing within the classes was not a guarantee of their standing, since there were rich and well socially standing Chinese while there were less rich Mongol and Semu than there were Mongol and Semu who lived in poverty and were ill treated.


In [24]:
best_score['generated_q']

'What was the status of the Mongol and Semu?'

In [25]:
best_score['target_q']

'There were many Mongols with what unexpected status?'

In [26]:
best_score['answer']

'lived in poverty and were ill treated'

In [27]:
best_score['generated_q_answer']

'Poor and ill treated'