In [21]:
# !wget https://msmarco.blob.core.windows.net/msmarco/train_v2.1.json.gz && mv train_v2.1.json.gz MSMARCO/
# !wget https://msmarco.blob.core.windows.net/msmarco/dev_v2.1.json.gz && mv dev_v2.1.json.gz MSMARCO/
# !wget https://msmarco.blob.core.windows.net/msmarco/eval_v2.1_public.json.gz && mv eval_v2.1_public.json.gz MSMARCO/

## Data Format (excerpt from the original README: https://raw.githubusercontent.com/microsoft/MSMARCO-Question-Answering/master/README.md)
Each entry contains: query_id, query_type, query, passages, answers, and wellFormedAnswers.

For the Q&A task the output is in the 'answers' key.
For the NLG task the output is in the 'wellFormedAnswers' key.
Of the 1,010,916 queries in the Q&A dataset, 182,669 have an answer in the 'wellFormedAnswers' key.

1. query_id: a unique identifier for each query.
2. query: a question issued to the Bing search engine.
3. passages: a set of 10 passages, their URLs, and a flag marking whether each was used to write the answer (is_selected: 1). Two passages may come from the same URL, since they were taken from the Bing results in order of relevance. A passage marked is_selected: 1 was used by the judge to compose the answer; one marked is_selected: 0 was not. Questions whose answer is 'No Answer Present.' have is_selected: 0 on all of their passages.
4. query_type: a very high-level classification of the kind of question asked. The label is produced by a model that predicts one of the classes {LOCATION, NUMERIC, PERSON, DESCRIPTION, ENTITY}. It can be used during debugging to inspect how performance is distributed across question types, or to train more specialized models.
5. answers: an array of answers written by human judges. Most entries contain a single answer, but ~1% contain more (on average ~2 answers when the list is longer than 1). These answers were written by real people in their own words rather than by selecting a span of text. Their wording may be similar or identical to the wording of any of the passages.
6. wellFormedAnswers: an array of rewritten answers. As above, it usually contains a single answer, but ~1% of entries contain more than one (on average ~5 answers when there are several). These answers are written by new judges who read the question and the earlier answer and check (i) the grammar, so that the answer is a complete sentence, (ii) whether it makes sense without the context or without the question, and (iii) whether it overlaps excessively with text copied directly from the passage. This ensures that the well-formed answer is really written in natural language and is not pure extraction. Well-formed answers are harder for question answering because they contain words that may not appear in any of the passages or in the question text.

example
~~~
{
	"answers":["A corporation is a company or group of people authorized to act as a single entity and recognized as such in law."],
	"passages":[
		{
			"is_selected":0,
			"url":"http:\/\/www.wisegeek.com\/what-is-a-corporation.htm",
			"passage_text":"A company is incorporated in a specific nation, often within the bounds of a smaller subset of that nation, such as a state or province. The corporation is then governed by the laws of incorporation in that state. A corporation may issue stock, either private or public, or may be classified as a non-stock corporation. If stock is issued, the corporation will usually be governed by its shareholders, either directly or indirectly."},
		...
		}],
	"query":". what is a corporation?",
	"query_id":1102432,
	"query_type":"DESCRIPTION",
	"wellFormedAnswers":"[]"
}
~~~
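
As a minimal sketch of how the fields described above fit together, the snippet below collects the text of the passages marked `is_selected: 1` from a single entry. The `entry` dict is a shortened, hand-made stand-in shaped like the README example (not a real record), and `selected_passages` is a hypothetical helper name.

```python
from typing import Any, Dict, List

# Hand-made entry shaped like the README example (illustrative, not real data).
entry: Dict[str, Any] = {
    "query_id": 0,
    "query": "what is a corporation?",
    "query_type": "DESCRIPTION",
    "answers": ["A corporation is a company authorized to act as a single entity."],
    "wellFormedAnswers": "[]",
    "passages": [
        {"is_selected": 0, "url": "http://example.com/a", "passage_text": "First passage."},
        {"is_selected": 1, "url": "http://example.com/b", "passage_text": "Second passage."},
    ],
}


def selected_passages(entry: Dict[str, Any]) -> List[str]:
    """Return the text of every passage the judge used for the answer."""
    return [p["passage_text"] for p in entry["passages"] if p["is_selected"] == 1]


print(selected_passages(entry))
```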

In [1]:
import json
from tqdm import tqdm
import pprint

In [2]:
# Use a context manager so the file handle is closed after parsing
with open("./MSMARCO/train_v2.1.json") as f:
    tr_marco = json.load(f)

In [3]:
# Use a context manager so the file handle is closed after parsing
with open("./MSMARCO/dev_v2.1.json") as f:
    dev_marco = json.load(f)
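
The download commands at the top fetch `.json.gz` archives, while the cells above read plain `.json` files, so the archives were presumably unpacked by hand. As an alternative sketch, the standard-library `gzip` module can parse the compressed file directly. The helper name `load_json_gz` is hypothetical and not part of the original notebook.

```python
import gzip
import json
from typing import Any


def load_json_gz(path: str) -> Any:
    """Parse a gzip-compressed JSON file without unpacking it on disk."""
    # "rt" opens the archive in text mode so json.load receives str, not bytes.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

Under that assumption, a call such as `load_json_gz("./MSMARCO/dev_v2.1.json.gz")` would stand in for the `open` plus `json.load` pair.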

In [4]:
tr_marco_collection = []
for k in tqdm(tr_marco["query"]):
    question = tr_marco["query"][k]
    answers = tr_marco["answers"][k]
    wellFormedAnswers = tr_marco["wellFormedAnswers"][k]
    context = ""
    for passage in tr_marco["passages"][k]:
        if passage["is_selected"] == 1:
            context += passage["passage_text"]+"\n"
   
    tr_marco_collection.append({
        "question": question,
        "answers": answers,
        "wellFormedAnswers": wellFormedAnswers,
        "context": context
    })

100%|██████████| 808731/808731 [00:03<00:00, 268239.46it/s]


In [5]:
dev_marco_collection = []
for k in tqdm(dev_marco["query"]):
    question = dev_marco["query"][k]
    answers = dev_marco["answers"][k]
    wellFormedAnswers = dev_marco["wellFormedAnswers"][k]
    context = ""
    for passage in dev_marco["passages"][k]:
        if passage["is_selected"] == 1:
            context += passage["passage_text"]+"\n"
            
    dev_marco_collection.append({
        "question": question,
        "answers": answers,
        "wellFormedAnswers": wellFormedAnswers,
        "context": context
    })

100%|██████████| 101093/101093 [00:00<00:00, 417932.19it/s]
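
The train and dev loops above are identical apart from their input, so they could be folded into one function. The sketch below assumes the same column-oriented layout the cells rely on (`marco["query"][k]`, `marco["passages"][k]`, and so on); `build_collection` and the `toy` input are hypothetical names introduced only for illustration.

```python
from typing import Any, Dict, List


def build_collection(marco: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten a column-oriented MS MARCO dict into one record per query."""
    collection = []
    for k in marco["query"]:
        # Concatenate only the passages the judge marked as used.
        context = "".join(
            p["passage_text"] + "\n"
            for p in marco["passages"][k]
            if p["is_selected"] == 1
        )
        collection.append({
            "question": marco["query"][k],
            "answers": marco["answers"][k],
            "wellFormedAnswers": marco["wellFormedAnswers"][k],
            "context": context,
        })
    return collection


# Tiny hand-made input in the same layout (not real data).
toy = {
    "query": {"0": "what is x?"},
    "answers": {"0": ["x is y."]},
    "wellFormedAnswers": {"0": "[]"},
    "passages": {"0": [
        {"is_selected": 1, "passage_text": "x is y"},
        {"is_selected": 0, "passage_text": "unrelated"},
    ]},
}
print(build_collection(toy))
```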


In [6]:
print("***** Training Example from Collection *****")
pprint.pprint(tr_marco_collection[0])
print("\n\n")
print("***** Dev Example from Collection *****")
pprint.pprint(dev_marco_collection[0])

***** Training Example from Collection *****
{'answers': ['The immediate impact of the success of the manhattan project was '
             'the only cloud hanging over the impressive achievement of the '
             'atomic researchers and engineers is what their success truly '
             'meant; hundreds of thousands of innocent lives obliterated.'],
 'context': 'The presence of communication amid scientific minds was equally '
            'important to the success of the Manhattan Project as scientific '
            'intellect was. The only cloud hanging over the impressive '
            'achievement of the atomic researchers and engineers is what their '
            'success truly meant; hundreds of thousands of innocent lives '
            'obliterated.\n',
 'question': ')what was the immediate impact of the success of the manhattan '
             'project?',
 'wellFormedAnswers': '[]'}



***** Dev Example from Collection *****
{'answers': ['A corporation is a company or group

## Natural Language Generation

How can we use the information in MS MARCO to build a generative Q&A system?

### 1. Approach - HuggingFace pretrained

We look on HuggingFace for a model that can generate conditioned text and test it on the dev set.

In [8]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from rouge import Rouge 
import numpy as np

rouge = Rouge()
# https://www.aclweb.org/anthology/W04-1013/
# https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213
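
To make the metric concrete, here is a simplified sketch of ROUGE-L: the F-score built from the longest common subsequence (LCS) between prediction and reference. It splits on whitespace and uses a plain F1, which are assumptions of this sketch; the `rouge` package used in the notebook may tokenize and weight differently, so its numbers need not match.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(hypothesis: str, reference: str) -> float:
    """Simplified ROUGE-L F1 on whitespace tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)  # share of predicted tokens in the LCS
    recall = lcs / len(ref)     # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f("the cat sat", "the cat sat on the mat"))
```

A short extractive prediction gets high precision but low recall against a long reference, which is worth keeping in mind when reading the scores further down.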

In [9]:
tokenizer = AutoTokenizer.from_pretrained("valhalla/t5-base-qa-qg-hl")

model = AutoModelForSeq2SeqLM.from_pretrained("valhalla/t5-base-qa-qg-hl")

nlp = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

In [10]:
# to generate a question, simply pass the text
answer = nlp("42 is the answer to life, the universe and everything.")
print(answer)

# for QA, pass a single string in the form "question: ... context: ..."
answer = nlp("question: What is 42 ? context: 42 is the answer to life, the universe and everything.")
print(answer)

[{'generated_text': 'What is the answer to life, the universe and everything?'}]
[{'generated_text': 'the answer to life, the universe and everything'}]
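
The QA input is a single string with `question:` and `context:` prefixes. A small helper keeps that format in one place. `build_qa_prompt` is a hypothetical name, and stripping surrounding whitespace is an assumption of this sketch, not something the model is known to require.

```python
def build_qa_prompt(question: str, context: str) -> str:
    """Format a question/context pair in the 'question: ... context: ...' form."""
    return f"question: {question.strip()} context: {context.strip()}"


prompt = build_qa_prompt(
    "What is 42 ?",
    "42 is the answer to life, the universe and everything.\n",
)
print(prompt)
```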


In [20]:
answers_rouge = []
wellformedanswers_rouge = []
exceptions_occurred = []
# Select 500 answers without a well-formed version and 500 with a well-formed version
np.random.shuffle(dev_marco_collection)

dev_marco_collection_subset = []
n_answers = 0
n_wellformed = 0
for dev_sample in tqdm(dev_marco_collection):
    if dev_sample["wellFormedAnswers"] != '[]' and n_wellformed < 500:
        n_wellformed += 1
        dev_marco_collection_subset.append(dev_sample)
    elif n_answers < 500:
        n_answers += 1
        dev_marco_collection_subset.append(dev_sample)
    elif n_answers >= 500 and n_wellformed >= 500:
        break

np.random.shuffle(dev_marco_collection_subset)

for i,dev_sample in enumerate(tqdm(dev_marco_collection_subset)):
    question = dev_sample["question"]
    context = dev_sample["context"]
    predicted_answer = nlp("question: "+question+" context: "+context)[0]["generated_text"]

    answers_int_rouge = []
    for real_answer in dev_sample["answers"]:
        try:
            answers_int_rouge.append(rouge.get_scores(predicted_answer, real_answer)[0]["rouge-l"]["f"])
        except Exception as e:
            # Record the error and count the pair as a zero score
            exceptions_occurred.append(str(e))
            answers_int_rouge.append(0)
    
    if not answers_int_rouge:
        answers_int_rouge = [0]
    
    answers_rouge.append( np.mean(answers_int_rouge) )
    
    if dev_sample["wellFormedAnswers"] != '[]':
        wellformedanswers_int_rouge = []
        for real_wellformedanswer in dev_sample["wellFormedAnswers"]:
            try:
                wellformedanswers_int_rouge.append(rouge.get_scores(predicted_answer, real_wellformedanswer)[0]["rouge-l"]["f"])
            except Exception as e:
                exceptions_occurred.append(str(e))
                wellformedanswers_int_rouge.append(0)
                
        if not wellformedanswers_int_rouge:
            wellformedanswers_int_rouge = [0]

        wellformedanswers_rouge.append( np.mean(wellformedanswers_int_rouge) )
        
        dev_marco_collection_subset[i]["wellformed_score"] = np.mean(wellformedanswers_int_rouge) 
    else:
        dev_marco_collection_subset[i]["wellformed_score"] = 0.0
        
    dev_marco_collection_subset[i]["predicted_answer"] = predicted_answer
    dev_marco_collection_subset[i]["score"] = np.mean(answers_int_rouge)

print(f"DEV Set Answer Rouge-L F-score: {np.mean(answers_rouge)}")
print(f"DEV Set Well Formed Answer Rouge-L F-score: {np.mean(wellformedanswers_rouge)}")

  4%|▍         | 4087/101093 [00:00<00:00, 1151792.01it/s]
100%|██████████| 1000/1000 [07:11<00:00,  2.32it/s]

DEV Set Answer Rouge-L F-score: 0.3131871215774918
DEV Set Well Formed Answer Rouge-L F-score: 0.34997590167395515
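
In the selection loop above, the `elif n_answers < 500` branch also accepts samples that do have well-formed answers once the first counter is full, so the two groups are not guaranteed to stay separate. The sketch below is one possible stricter variant, not the notebook's method: `balanced_subset`, its `seed` parameter, and the `toy` data are hypothetical.

```python
import random
from typing import Any, Dict, List


def balanced_subset(
    samples: List[Dict[str, Any]], n_per_group: int, seed: int = 0
) -> List[Dict[str, Any]]:
    """Pick up to n_per_group samples with and without well-formed answers."""
    rng = random.Random(seed)
    shuffled = samples[:]  # copy, so the caller's list order is untouched
    rng.shuffle(shuffled)
    # The dataset marks a missing well-formed answer with the string '[]'.
    with_wf = [s for s in shuffled if s["wellFormedAnswers"] != "[]"][:n_per_group]
    without_wf = [s for s in shuffled if s["wellFormedAnswers"] == "[]"][:n_per_group]
    subset = with_wf + without_wf
    rng.shuffle(subset)
    return subset


# Hand-made records (not real data): five with and five without well-formed answers.
toy = [{"wellFormedAnswers": ["a"]}] * 5 + [{"wellFormedAnswers": "[]"}] * 5
subset = balanced_subset(toy, n_per_group=2)
print(len(subset))
```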





In [19]:
print(f"Errors during tests: {np.unique(exceptions_occurred)}")

Errors during tests: ['Hypothesis is empty.']
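
The only error recorded is an empty hypothesis, which likely arises when the model returns nothing scorable (for instance a lone `.`). One way to avoid relying on exceptions is to check the texts before scoring. This is a sketch under that assumption: `is_scorable`, `safe_score`, and the stand-in `exact_match` metric are hypothetical names, and any two-argument scoring function could be passed in.

```python
from typing import Callable


def is_scorable(text: str) -> bool:
    """True when the text has at least one letter or digit to compare."""
    return any(ch.isalnum() for ch in text)


def safe_score(
    score_fn: Callable[[str, str], float], hypothesis: str, reference: str
) -> float:
    """Score a prediction, returning 0.0 instead of raising on empty input."""
    if not is_scorable(hypothesis) or not is_scorable(reference):
        return 0.0
    return score_fn(hypothesis, reference)


def exact_match(hyp: str, ref: str) -> float:
    """Stand-in metric for the demo: 1.0 on exact match, else 0.0."""
    return float(hyp == ref)


print(safe_score(exact_match, ".", "No Answer Present."))
print(safe_score(exact_match, "April 10, 2015", "April 10, 2015"))
```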


In [21]:
for i,dev_sample in enumerate(dev_marco_collection_subset):
    if i >= 20:
        break
    print(" ****** ")
    print(f"QUESTION: {dev_sample['question']}")
    print(f"ANSWER: {dev_sample['answers']}")
    print(f"WELL FORMED ANSWER: {dev_sample['wellFormedAnswers']}")
    print(f"PREDICTED ANSWER: {dev_sample['predicted_answer']}")
    print(f"WELLFORMED SCORE: {dev_sample['wellformed_score']}")
    print(f"NORMAL ANSWER SCORE: {dev_sample['score']}")
    

 ****** 
QUESTION: who is advanced management group
ANSWER: ['No Answer Present.']
WELL FORMED ANSWER: []
PREDICTED ANSWER: .
WELLFORMED SCORE: 0.0
NORMAL ANSWER SCORE: 0.0
 ****** 
QUESTION: eating more of what will prevent heart disease
ANSWER: ['Eating avocado, along with avocado oil or even peel, salmon, trout, or herring, or from flaxseed, kale, spinach, or walnuts will prevent heart disease.']
WELL FORMED ANSWER: []
PREDICTED ANSWER: healthy fats
WELLFORMED SCORE: 0.0
NORMAL ANSWER SCORE: 0.0
 ****** 
QUESTION: Steven Sanders poem, you are men who in your lives fought for
ANSWER: ['No Answer Present.']
WELL FORMED ANSWER: []
PREDICTED ANSWER: fought for context
WELLFORMED SCORE: 0.0
NORMAL ANSWER SCORE: 0.0
 ****** 
QUESTION: when is the baltimore orioles home opener
ANSWER: ['The Baltimore Orioles home opener at Oriole Park at Camden Yards was on April 10, 2015 in Baltimore, Maryland.']
WELL FORMED ANSWER: []
PREDICTED ANSWER: April 10, 2015
WELLFORMED SCORE: 0.0
NORMAL ANSWER S

### Fine-tuning the model

Several techniques can be used to improve performance:
1. HuggingFace training: https://huggingface.co/transformers/training.html
2. The repository for this model, following its README: https://github.com/patil-suraj/question_generation
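
Whichever route is taken, fine-tuning needs input/target text pairs. The sketch below shows one possible way to derive them from the collection records built earlier; it is an assumption, not the procedure of either linked resource. `to_seq2seq_example` is a hypothetical helper, and preferring the well-formed answer over the plain one is a design choice of this sketch.

```python
from typing import Any, Dict, Optional


def to_seq2seq_example(sample: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Turn one collection record into a source/target text pair, or None."""
    well_formed = sample["wellFormedAnswers"]
    # The dataset marks a missing well-formed answer with the string '[]'.
    if well_formed != "[]" and well_formed:
        target = well_formed[0]
    elif sample["answers"] and sample["answers"][0] != "No Answer Present.":
        target = sample["answers"][0]
    else:
        return None  # no usable target for this record
    source = f"question: {sample['question']} context: {sample['context']}"
    return {"source": source, "target": target}


# Hand-made record in the collection format (not real data).
sample = {
    "question": "what is x?",
    "answers": ["y"],
    "wellFormedAnswers": ["X is y."],
    "context": "x is y\n",
}
print(to_seq2seq_example(sample))
```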

### 2. Approach - Paraphrasing

The idea is simple: we paraphrase the input and the answer we found, so that whatever is extracted is not worded exactly as in the source text.
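
To check whether a paraphrase really moves away from the source wording, one rough measure is the share of the answer's word n-grams that also appear in the context. This is a sketch with assumed choices (lowercased whitespace tokens, trigrams by default); `ngrams` and `copied_fraction` are hypothetical helper names.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercased word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copied_fraction(answer: str, context: str, n: int = 3) -> float:
    """Fraction of the answer's n-grams that also occur in the context."""
    answer_ngrams = ngrams(answer, n)
    if not answer_ngrams:
        return 0.0  # answer shorter than n tokens
    return len(answer_ngrams & ngrams(context, n)) / len(answer_ngrams)


context = "The corporation is governed by the laws of incorporation in that state."
print(copied_fraction("The corporation is governed by the laws", context))
print(copied_fraction("State incorporation law governs a corporation", context))
```

A value near 1 suggests the answer was lifted from the passage, while a value near 0 suggests it was reworded.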

In [26]:
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
model_name = 'tuner007/pegasus_paraphrase'
torch_device = 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

def get_response(input_text,num_return_sequences,num_beams):
    batch = tokenizer([input_text],truncation=True,padding='longest',max_length=60, return_tensors="pt").to(torch_device)
    translated = model.generate(**batch,max_length=60,num_beams=num_beams, num_return_sequences=num_return_sequences, temperature=1.5)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text

In [27]:
num_beams = 10
num_return_sequences = 10
#context = "The ultimate test of your knowledge is your capacity to convey it to another."
context = "Which course should I take to get started in data science?"
get_response(context,num_return_sequences,num_beams)



['Which data science course should I take?',
 'Which data science course should I take first?',
 'Should I take a data science course?',
 'Which data science class should I take?',
 'Which data science course should I attend?',
 'I want to get started in data science.',
 'Which data science course should I enroll in?',
 'Which data science course is right for me?',
 'Which data science course is best for me?',
 'Which course should I take to get started?']

In [24]:
import torch
from transformers import T5ForConditionalGeneration,T5Tokenizer


def set_seed(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)

model = T5ForConditionalGeneration.from_pretrained('ramsrigouthamg/t5_paraphraser')
tokenizer = T5Tokenizer.from_pretrained('ramsrigouthamg/t5_paraphraser')

device = torch.device("cpu")
print ("device ",device)
model = model.to(device)

#sentence = "Which course should I take to get started in data science?"
sentence = "The ultimate test of your knowledge is your capacity to convey it to another."
# sentence = "What are the ingredients required to bake a perfect cake?"
# sentence = "What is the best possible approach to learn aeronautical engineering?"
# sentence = "Do apples taste better than oranges in general?"


text =  "paraphrase: " + sentence + " </s>"


max_len = 256

# pad_to_max_length is deprecated; pad and truncate explicitly to max_len
encoding = tokenizer(text, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)


# sample with top_k = 120 and top_p = 0.98, returning 10 sequences
beam_outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    do_sample=True,
    max_length=256,
    top_k=120,
    top_p=0.98,
    early_stopping=True,
    num_return_sequences=10
)


print ("\nOriginal Question ::")
print (sentence)
print ("\n")
print ("Paraphrased Questions :: ")
final_outputs =[]
for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    if sent.lower() != sentence.lower() and sent not in final_outputs:
        final_outputs.append(sent)

for i, final_output in enumerate(final_outputs):
    print("{}: {}".format(i, final_output))

device  cpu

Original Question ::
The ultimate test of your knowledge is your capacity to convey it to another.


Paraphrased Questions :: 
0: The ultimate test of knowledge is your ability to convey it to another person.
1: What is the ultimate test of one's knowledge? How can one communicate this knowledge to another?
2: The ultimate test of knowledge is your capacity to convey it to another.
3: What's the test of knowledge is your capacity to communicate to others with accuracy?
4: COURSE: The ultimate test of knowledge is your capacity to convey it to someone.
5: What is the test of knowledge?
6: How well you can convey information about yourself?
7: The ultimate test for your knowledge is your ability to share it with another.
8: If we are learning a new language, the test for what we know is our capacity to communicate it to others..everyone can gain it.
9: What is the test of your ability to convey information that is unknown to you?
