## Project Question Answering SQuAD Model

In Artificial Intelligence, the term Question Answering (QA) refers to a machine's ability to respond to questions posed in natural language, that is, human language. The main goal of this technology is to extract relevant information from a large body of data and present it as a concise answer. AI researchers and engineers have developed specialized models for this purpose.
Such a model receives a question and then processes text data to determine the most accurate answer. For example, given the question "What is the highest mountain peak in the world?", a QA model scans its data and returns the answer "Mount Everest".
The ultimate goal of a Question Answering model is to truly understand the meaning behind a question and give an answer that is relevant and appropriate to its context. The success of a QA model is measured by its ability to give accurate and meaningful answers to a wide range of questions.

## 1. Data Preview

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import transformers

In [2]:
# local folder that holds the SQuAD v1.1 files; on Kaggle they are at
# /kaggle/input/train-v1.1.json
# /kaggle/input/dev-v1.1.json
path = r'C:\\Users\\PPL2\\Documents\\Private\\Model\\Pacman Data Science\\Project\\Natural Languange Processing'
train_data = pd.read_json(path + '/train-v1.1.json')
validation_data = pd.read_json(path + '/dev-v1.1.json')

In [3]:
train_data.head()

| | data | version |
|---|---|---|
| 0 | {'title': 'University_of_Notre_Dame', 'paragra... | 1.1 |
| 1 | {'title': 'Beyoncé', 'paragraphs': [{'context'... | 1.1 |
| 2 | {'title': 'Montana', 'paragraphs': [{'context'... | 1.1 |
| 3 | {'title': 'Genocide', 'paragraphs': [{'context... | 1.1 |
| 4 | {'title': 'Antibiotics', 'paragraphs': [{'cont... | 1.1 |


In [4]:
def prepare_data_for_tokenizer(data):
    """
    Reorganize a DataFrame of raw SQuAD (Stanford Question Answering Dataset)
    data so that it aligns with the BERT tokenizer requirements.

    The returned DataFrame has one row per question, with these columns:
        'id': question id,
        'title': title of the source document,
        'context': paragraph the question is asked about,
        'question': question text,
        'answers': {  # only the first answer of each question is kept
            'answer_start': [character index where the answer starts in the context],
            'text': [answer text]
        }

    Parameters:
    - data (DataFrame): Input DataFrame with raw SQuAD data.

    Returns:
    - DataFrame: Cleaned and reorganized DataFrame following the structure above.
    """
    
    # initialize empty lists, one per output column
    ids = []
    titles = []
    contexts = []
    questions = []
    answers = []
    
    # iterate over the documents of the DataFrame
    for idx, qa_doc in data.iterrows():
        # grab the first column ('data'), which contains the actual document;
        # .iloc avoids the deprecated positional lookup qa_doc[0]
        document = qa_doc.iloc[0]
        # grab the document title
        qa_title = document['title']
        # grab the list of paragraphs associated with that title
        paragraphs = document['paragraphs']
        
        # for each paragraph extract context, questions and answers
        for parg in paragraphs:
            parg_context = parg['context']
            parg_questions = parg['qas']
            
            for p_question in parg_questions:
                qa_id = p_question['id']
                question = p_question['question']
                q_answers = p_question['answers'][0]
                answer = {
                    'answer_start': [q_answers['answer_start']],
                    'text': [q_answers['text']]
                }
                
                # add the extracted data to corresponding list
                ids.append(qa_id)
                titles.append(qa_title)
                contexts.append(parg_context)
                questions.append(question)
                answers.append(answer)
    
    # final structure of re-organized data
    cleaned_data = {
        'id': ids,
        'title': titles,
        'context': contexts,
        'question': questions,
        'answers': answers
    }
    
    return pd.DataFrame(cleaned_data)

In [5]:
# restructure our data to align with BERT tokenizer
train_data = prepare_data_for_tokenizer(train_data)
validation_data = prepare_data_for_tokenizer(validation_data)
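
To make the restructuring concrete, here is a minimal, self-contained sketch on a made-up record. The function name flatten_squad_records and the toy article are hypothetical. The sketch mirrors the logic of prepare_data_for_tokenizer but uses plain Python lists instead of a DataFrame.

```python
def flatten_squad_records(documents):
    """Flatten nested SQuAD-style documents into one row per question."""
    rows = []
    for document in documents:
        for paragraph in document["paragraphs"]:
            for qa in paragraph["qas"]:
                first_answer = qa["answers"][0]  # keep only the first answer
                rows.append({
                    "id": qa["id"],
                    "title": document["title"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answers": {
                        "answer_start": [first_answer["answer_start"]],
                        "text": [first_answer["text"]],
                    },
                })
    return rows


# toy record (hypothetical, not taken from SQuAD)
toy_documents = [{
    "title": "Toy_Article",
    "paragraphs": [{
        "context": "Mount Everest is the highest mountain on Earth.",
        "qas": [{
            "id": "toy-0",
            "question": "What is the highest mountain on Earth?",
            "answers": [{"answer_start": 0, "text": "Mount Everest"}],
        }],
    }],
}]

toy_rows = flatten_squad_records(toy_documents)
print(toy_rows[0]["answers"])
```

Each row repeats the paragraph context, so a paragraph with several questions yields several rows that share the same context.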



In [6]:
from datasets import load_dataset, Dataset, DatasetDict

# put our train and validation in dataset object
dataset = DatasetDict({
    'train': Dataset.from_pandas(train_data),
    'validation': Dataset.from_pandas(validation_data)
})

In [7]:
# inspect new data shape as intended
dataset['train'][:2]

{'id': ['5733be284776f41900661182', '5733be284776f4190066117f'],
 'title': ['University_of_Notre_Dame', 'University_of_Notre_Dame'],
 'context': ['Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
  'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front o

## 2. Data Preprocessing and Tokenization

Before we can feed text to the model, it has to be preprocessed. This is done by a Transformers tokenizer, which tokenizes the input and then:

- converts the tokens to their corresponding IDs in the pretrained vocabulary
- inserts the special tokens properly to form the input sequence

The tokenizer is instantiated with the AutoTokenizer.from_pretrained method, which ensures that the vocabulary used by the checkpoint is downloaded and cached, so it is not downloaded again the next time the cell is run.

In [8]:
# model_checkpoint = "distilbert-base-cased-distilled-squad" # this is the pretrained model that was fine tuned

# This checkpoint is the model after fine-tuning on the QA task
model_checkpoint = "Ahmed-Zakaria/distilbert-base-cased-finetuned-squad"
batch_size = 16

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [10]:
# test our tokenizer
tokenizer("What is your BERT?", "BERT is Bidirectional Encoder Representations from Transformers")

{'input_ids': [101, 1327, 1110, 1240, 139, 9637, 1942, 136, 102, 139, 9637, 1942, 1110, 139, 2386, 5817, 17264, 13832, 13775, 1197, 20777, 4894, 20936, 1116, 1121, 25267, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The tokenizer takes the question and the context as a pair and joins them with the special [SEP] token, producing the layout [CLS] question [SEP] context [SEP]. The max_length parameter is used to handle long inputs.
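
The layout can be checked against the input_ids printed above. The following sketch assumes that 101 and 102 are the ids of [CLS] and [SEP] in this cased BERT vocabulary (they sit at the start, middle and end of the output). The helper split_pair is hypothetical and only slices the list.

```python
CLS_ID, SEP_ID = 101, 102  # assumed special token ids, as seen in the output above

# input_ids copied from the tokenizer output above
input_ids = [101, 1327, 1110, 1240, 139, 9637, 1942, 136, 102,
             139, 9637, 1942, 1110, 139, 2386, 5817, 17264, 13832, 13775,
             1197, 20777, 4894, 20936, 1116, 1121, 25267, 102]


def split_pair(ids):
    """Split [CLS] question [SEP] context [SEP] into question ids and context ids."""
    if ids[0] != CLS_ID or ids[-1] != SEP_ID:
        raise ValueError("unexpected layout")
    first_sep = ids.index(SEP_ID)  # the [SEP] that closes the question
    return ids[1:first_sep], ids[first_sep + 1:-1]


question_ids, context_ids = split_pair(input_ids)
print(len(question_ids), len(context_ids))
```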

In [11]:
max_length = 384  # The maximum length of a feature (question + context)
doc_stride = 128  # The allowed overlap between two part of the context when splitting is performed.

In [12]:
# let's see that in practice
example = dataset["train"][0]

In [13]:
example

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}}

In [14]:
# before truncating the input
print("length of tokenized input before truncation:", len(tokenizer(example["question"], example["context"])["input_ids"]))

length of tokenized input before truncation: 181


If we simply truncate the input to 100 tokens, as in the next cell, we lose 81 of the original 181 tokens, which is not ideal.

Note that we never want to truncate the question, only the context, and so we use the only_second truncation method.


In [15]:
print("length of tokenized input after truncation:",
      len(
        tokenizer(
        example["question"],
        example["context"],
        max_length=100,
        truncation="only_second",
    )["input_ids"]
))

length of tokenized input after truncation: 100


In [16]:
# our example is 
example

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}}

In [17]:
# use a max_length lower than the input length, with return_overflowing_tokens set to True
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=100,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=50,
)

In [18]:
sample_mapping = tokenized_example["overflow_to_sample_mapping"]
sample_mapping

[0, 0, 0, 0]
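
The four entries in this mapping can be reproduced by hand. The sketch below is a simplified, hypothetical re-implementation of the sliding window (split_with_overlap is not part of the Transformers API). It assumes each new window starts window_size minus overlap tokens after the previous one. Of the 181 tokens, 18 are taken by the question and the special tokens, which leaves 163 context tokens, and each 100-token feature has room for 82 of them.

```python
def split_with_overlap(tokens, window_size, overlap):
    """Split tokens into consecutive windows that share `overlap` tokens."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    step = window_size - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # this window reached the end of the context
        start += step
    return windows


context_tokens = list(range(163))  # stand-ins for the 163 context tokens
windows = split_with_overlap(context_tokens, window_size=82, overlap=50)
print(len(windows))                 # 4
print([w[0] for w in windows])      # [0, 32, 64, 96]
```

The last window is shorter than the others because it only holds what is left of the context.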

In [19]:
# tokenized input of features
for i, feature in enumerate(tokenized_example["input_ids"]):
    original_example_index = sample_mapping[i]
    print(f"Feature {i + 1} (Original Example {original_example_index}): {feature}")
    print()

Feature 1 (Original Example 0): [101, 1706, 2292, 1225, 1103, 6567, 2090, 9273, 2845, 1107, 8109, 1107, 10111, 20500, 1699, 136, 102, 22182, 1193, 117, 1103, 1278, 1144, 170, 2336, 1959, 119, 1335, 4184, 1103, 4304, 4334, 112, 188, 2284, 10945, 1110, 170, 5404, 5921, 1104, 1103, 6567, 2090, 119, 13301, 1107, 1524, 1104, 1103, 4304, 4334, 1105, 4749, 1122, 117, 1110, 170, 7335, 5921, 1104, 4028, 1114, 1739, 1146, 14089, 5591, 1114, 1103, 7051, 107, 159, 21462, 1566, 24930, 2508, 152, 1306, 3965, 107, 119, 5893, 1106, 1103, 4304, 4334, 1110, 1103, 19349, 1104, 1103, 11373, 4641, 119, 13301, 1481, 1103, 171, 17506, 102]

Feature 2 (Original Example 0): [101, 1706, 2292, 1225, 1103, 6567, 2090, 9273, 2845, 1107, 8109, 1107, 10111, 20500, 1699, 136, 102, 1103, 4304, 4334, 1105, 4749, 1122, 117, 1110, 170, 7335, 5921, 1104, 4028, 1114, 1739, 1146, 14089, 5591, 1114, 1103, 7051, 107, 159, 21462, 1566, 24930, 2508, 152, 1306, 3965, 107, 119, 5893, 1106, 1103, 4304, 4334, 1110, 1103, 19349, 110

In [20]:
for i, feature in enumerate(tokenized_example["input_ids"]):
    original_example_index = sample_mapping[i]
    print(f"Feature {i + 1} (Original Example {original_example_index}): {tokenizer.decode(feature)}")
    print()

Feature 1 (Original Example 0): [CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basi [SEP]

Feature 2 (Original Example 0): [CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin [SEP]

Feature 3 (Original Example 0): [CLS] To whom did th

In [21]:
# inspecting sample offset mapping
for i, offset_map in enumerate(tokenized_example["offset_mapping"]):
    original_example_index = sample_mapping[i]
    print(f"Feature {i + 1} offset_mapping (Original Example {original_example_index}): {offset_map}")
    print()

Feature 1 offset_mapping (Original Example 0): [(0, 0), (0, 2), (3, 7), (8, 11), (12, 15), (16, 22), (23, 27), (28, 37), (38, 44), (45, 47), (48, 52), (53, 55), (56, 59), (59, 63), (64, 70), (70, 71), (0, 0), (0, 13), (13, 15), (15, 16), (17, 20), (21, 27), (28, 31), (32, 33), (34, 42), (43, 52), (52, 53), (54, 56), (56, 58), (59, 62), (63, 67), (68, 76), (76, 77), (77, 78), (79, 83), (84, 88), (89, 91), (92, 93), (94, 100), (101, 107), (108, 110), (111, 114), (115, 121), (122, 126), (126, 127), (128, 139), (140, 142), (143, 148), (149, 151), (152, 155), (156, 160), (161, 169), (170, 173), (174, 180), (181, 183), (183, 184), (185, 187), (188, 189), (190, 196), (197, 203), (204, 206), (207, 213), (214, 218), (219, 223), (224, 226), (226, 229), (229, 232), (233, 237), (238, 241), (242, 248), (249, 250), (250, 251), (251, 254), (254, 256), (257, 259), (260, 262), (263, 264), (264, 265), (265, 268), (268, 269), (269, 270), (271, 275), (276, 278), (279, 282), (283, 287), (288, 296), (297, 2

In [22]:
# check if our token index is set correctly
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(
    tokenizer.convert_ids_to_tokens([first_token_id])[0],
    example["question"][offsets[0] : offsets[1]],
)

To To
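
The idea behind the offset mapping can be illustrated with a much simpler tokenizer. The sketch below uses a whitespace tokenizer as a stand-in (an assumption for illustration only; the real tokenizer splits words into word pieces), and whitespace_offsets is a hypothetical helper. Each token is paired with the character span it covers, so slicing the original text with that span returns the token.

```python
import re


def whitespace_offsets(text):
    """Return (token, (start_char, end_char)) pairs for whitespace-separated tokens."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


question = "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"
pairs = whitespace_offsets(question)

first_token, (start, end) = pairs[0]
print(first_token, question[start:end])
```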


To pair the question with the context and build features, we need to distinguish which part of a feature corresponds to the question and which part corresponds to the context. This is where the sequence_ids of tokenized_example come in:

- 0 marks question tokens
- 1 marks context tokens
- None marks special tokens such as [CLS] and [SEP]
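
This layout can be sketched without the tokenizer. In the snippet below, toy_sequence_ids is a hypothetical helper that builds the same pattern by hand. The 15 question tokens and 82 context tokens are chosen to match a full 100-token feature of the running example.

```python
def toy_sequence_ids(num_question_tokens, num_context_tokens):
    """Build sequence ids for the layout [CLS] question [SEP] context [SEP]."""
    return (
        [None]                          # [CLS]
        + [0] * num_question_tokens     # question tokens
        + [None]                        # [SEP]
        + [1] * num_context_tokens      # context tokens
        + [None]                        # [SEP]
    )


toy_ids = toy_sequence_ids(15, 82)

# index of the first and last context token
context_start = toy_ids.index(1)
context_end = len(toy_ids) - 1 - toy_ids[::-1].index(1)
print(len(toy_ids), context_start, context_end)
```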



In [23]:
# sequence_ids() tells us, for every token of a feature, whether it belongs to the question or to the context
# called without an argument, sequence_ids() defaults to the feature at index 0
# our single example was split into 4 features, so the valid feature indexes are 0, 1, 2 and 3

# feature 3 (index 2, since we count from 0) is used here because it contains the answer
sequence_ids = tokenized_example.sequence_ids(2)
print(sequence_ids)

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]


Because the model outputs, for every token, the probability that it is the start token or the end token of the answer, we need a way to map the character index where the answer starts to a token in the tokenized data.

In [24]:
# grab the answer for our example and its boundaries
answers = example["answers"]
answer_start_char = answers["answer_start"][0]
answer_end_char = answer_start_char + len(answers["text"][0])

# first we define the boundaries of the context in the tokenized data (question + context)
context_token_start_index = 0

# find the **first** token with sequence id 1, marking the **start** of the context
while sequence_ids[context_token_start_index] != 1:
    context_token_start_index += 1

# find the **last** token with sequence id 1, marking the **end** of the context
context_token_end_index = len(tokenized_example["input_ids"][2]) - 1
while sequence_ids[context_token_end_index] != 1:
    context_token_end_index -= 1

# Detect if the answer is in the current feature

# first grab the offset mapping for the feature (index [2], the same feature as sequence_ids above)
# offsets map each token to its start and end character positions in the context
offsets = tokenized_example["offset_mapping"][2]

# check if answer start char and end char from our data is in current feature span or not
# if it is in the context grab the index in current context span
if (
    # first token at the context start char index
    offsets[context_token_start_index][0] <= answer_start_char
    # last token at the context last char index
    and offsets[context_token_end_index][1] >= answer_end_char
):
    # Move context_token_start_index and context_token_end_index to the two ends of the answer.
    # We go through each token in the feature, check its offset mapping (where it starts and ends),
    # and compare it with the answer's start character index to locate the token
    # that contains the start of the answer.
    while (
        context_token_start_index < len(offsets)
        and offsets[context_token_start_index][0] <= answer_start_char
    ):
        context_token_start_index += 1
    start_position = context_token_start_index - 1
    
    while offsets[context_token_end_index][1] >= answer_end_char:
        context_token_end_index -= 1
    end_position = context_token_end_index + 1
    
    print("Answer start position: ", start_position, "Answer end position: ", end_position)
else:
    print("The answer is not in this feature.")

Answer start position:  72 Answer end position:  78


To inspect a different feature, change the index in the following lines of code so that sequence_ids, offsets and the input ids all refer to the same feature:

    sequence_ids = tokenized_example.sequence_ids(2)
    offsets = tokenized_example["offset_mapping"][2]
    context_token_end_index = len(tokenized_example["input_ids"][2]) - 1

In [25]:
# check the mapping is done correctly
print("Mapped answer: ", 
    tokenizer.decode(
        # if you tried different indexes make sure to change [2] to match your index
        tokenized_example["input_ids"][2][start_position : end_position + 1]
    )
)
print("Actual Answer: ", answers["text"])

Mapped answer:  Saint Bernadette Soubirous
Actual Answer:  ['Saint Bernadette Soubirous']
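
The same mapping can be packaged as a small function and tried on toy data. This is a sketch under simplifying assumptions: offsets holds only the context tokens (no question or special tokens), the tokens come from a whitespace split rather than the real tokenizer, and char_span_to_token_span is a hypothetical name.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive (start_token, end_token) indexes.

    Returns None when the span is not fully covered by the offsets.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # walk forward past every token that starts at or before the answer start
    start_token = 0
    while start_token < len(offsets) and offsets[start_token][0] <= start_char:
        start_token += 1
    start_token -= 1
    # walk backward past every token that ends at or after the answer end
    end_token = len(offsets) - 1
    while end_token >= 0 and offsets[end_token][1] >= end_char:
        end_token -= 1
    end_token += 1
    return start_token, end_token


toy_context = "It appeared to Saint Bernadette Soubirous in 1858."
toy_answer = "Saint Bernadette Soubirous"
toy_tokens = toy_context.split()

# character offsets of the whitespace-separated tokens
toy_offsets = []
cursor = 0
for token in toy_tokens:
    begin = toy_context.index(token, cursor)
    toy_offsets.append((begin, begin + len(token)))
    cursor = begin + len(token)

answer_start = toy_context.index(toy_answer)
span = char_span_to_token_span(toy_offsets, answer_start, answer_start + len(toy_answer))
print(span, toy_tokens[span[0]:span[1] + 1])
```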


In [26]:
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context,
    # we need a map from a feature to its corresponding example.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    
    # The offset mappings will give us a map from token to character position in the original context.
    # This will help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that feature.
        sequence_ids = tokenized_examples.sequence_ids(i)
        # One example can give several features, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        
        # grab the answer associated with that example
        answers = examples["answers"][sample_index]
        
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            answer_start_char = answers["answer_start"][0]
            answer_end_char = answer_start_char + len(answers["text"][0])

            # Start token index of the current feature in the text.
            context_token_start_index = 0
            while sequence_ids[context_token_start_index] != 1:
                context_token_start_index += 1

            # End token index of the current feature in the text.
            context_token_end_index = len(input_ids) - 1
            while sequence_ids[context_token_end_index] != 1:
                context_token_end_index -= 1
                
            # Detect if the answer is in feature span.
            if (
                # first char in the current feature <= answer char start
                offsets[context_token_start_index][0] <= answer_start_char
                # last char in the current context >= answer char end
                and offsets[context_token_end_index][1] >= answer_end_char
            ):
                # find the token associated with answer start char
                while (
                    context_token_start_index < len(offsets)
                    and offsets[context_token_start_index][0] <= answer_start_char
                ):
                    context_token_start_index += 1
                tokenized_examples["start_positions"].append(context_token_start_index - 1)
                
                # find the token associated with answer end char
                while offsets[context_token_end_index][1] >= answer_end_char:
                    context_token_end_index -= 1
                tokenized_examples["end_positions"].append(context_token_end_index + 1)

            else:
                # Otherwise move the token_start_index and token_end_index to the cls token index.
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
                

    return tokenized_examples

In [27]:
features = prepare_train_features(dataset["train"][:5])

In [28]:
features

{'input_ids': [[101, 1706, 2292, 1225, 1103, 6567, 2090, 9273, 2845, 1107, 8109, 1107, 10111, 20500, 1699, 136, 102, 22182, 1193, 117, 1103, 1278, 1144, 170, 2336, 1959, 119, 1335, 4184, 1103, 4304, 4334, 112, 188, 2284, 10945, 1110, 170, 5404, 5921, 1104, 1103, 6567, 2090, 119, 13301, 1107, 1524, 1104, 1103, 4304, 4334, 1105, 4749, 1122, 117, 1110, 170, 7335, 5921, 1104, 4028, 1114, 1739, 1146, 14089, 5591, 1114, 1103, 7051, 107, 159, 21462, 1566, 24930, 2508, 152, 1306, 3965, 107, 119, 5893, 1106, 1103, 4304, 4334, 1110, 1103, 19349, 1104, 1103, 11373, 4641, 119, 13301, 1481, 1103, 171, 17506, 9538, 1110, 1103, 144, 10595, 2430, 117, 170, 14789, 1282, 1104, 8070, 1105, 9284, 119, 1135, 1110, 170, 16498, 1104, 1103, 176, 10595, 2430, 1120, 10111, 20500, 117, 1699, 1187, 1103, 6567, 2090, 25153, 1193, 1691, 1106, 2216, 17666, 6397, 3786, 1573, 25422, 13149, 1107, 8109, 119, 1335, 1103, 1322, 1104, 1103, 1514, 2797, 113, 1105, 1107, 170, 2904, 1413, 1115, 8200, 1194, 124, 11739, 1105, 1

In [29]:
train_val_set = dataset.map(
    prepare_train_features, batched=True, remove_columns=dataset["train"].column_names
)

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

## 3. Model

Once the data is ready, we can download the pretrained model and fine-tune it. For question answering, we use the TFAutoModelForQuestionAnswering class.

In [30]:
# you will need to run this line if you are running this notebook on Kaggle
# it works around the error "numpy has no attribute called object", which only happens on Kaggle
np.object = object

In [32]:
from huggingface_hub import notebook_login

notebook_login()


In [33]:
from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
model.summary()





All model checkpoint layers were used when initializing TFDistilBertForQuestionAnswering.

All the layers of TFDistilBertForQuestionAnswering were initialized from the model checkpoint at Ahmed-Zakaria/distilbert-base-cased-finetuned-squad.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForQuestionAnswering for predictions without further training.


Model: "tf_distil_bert_for_question_answering"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  65190912  
 inLayer)                                                        
                                                                 
 qa_outputs (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0 (unused)
                                                                 
Total params: 65192450 (248.69 MB)
Trainable params: 65192450 (248.69 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


The default training parameters for the model are:


- learning_rate = 2e-5
- num_train_epochs = 3
- weight_decay = 0.01

In [36]:
# convert the data to TensorFlow datasets to feed into the model
train_set = model.prepare_tf_dataset(
    train_val_set["train"],
    shuffle=True,
    batch_size=batch_size,
)

validation_set = model.prepare_tf_dataset(
    train_val_set["validation"],
    shuffle=False,
    batch_size=batch_size,
)

## 4. Model Evaluation

In the validation features, the offset mapping of every token outside the context is set to None, so that start and end logits pointing at those tokens can be identified and discarded in post-processing.

In [38]:
def prepare_validation_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1

        # One example can give several features, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        # we set each feature generated from example with the example id it was generated from
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (offset_map if sequence_ids[k] == context_index else None)
            for k, offset_map in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
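
The masking step at the end of the function can be seen in isolation on toy values. In this sketch the offsets and sequence ids are made up, and mask_non_context_offsets is a hypothetical helper that applies the same rule as the list comprehension in prepare_validation_features.

```python
def mask_non_context_offsets(offsets, sequence_ids, context_index=1):
    """Keep the offsets of context tokens and replace all others with None."""
    return [
        offset if seq_id == context_index else None
        for offset, seq_id in zip(offsets, sequence_ids)
    ]


# toy feature: [CLS] question-token [SEP] context-token context-token [SEP]
toy_feature_offsets = [(0, 0), (0, 4), (0, 0), (0, 5), (6, 11), (0, 0)]
toy_feature_sequence_ids = [None, 0, None, 1, 1, None]

masked_offsets = mask_non_context_offsets(toy_feature_offsets, toy_feature_sequence_ids)
print(masked_offsets)  # [None, None, None, (0, 5), (6, 11), None]
```

After masking, checking whether an offset is None is enough to tell whether a predicted position falls inside the context.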

In [39]:
# prepare our validation data
validation_features = dataset["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=dataset["validation"].column_names,
)

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [40]:
# convert the validation data to a TensorFlow dataset
val_set = model.prepare_tf_dataset(
     validation_features,
     shuffle=False,
     batch_size=batch_size,
)

We look for the best start and end logit scores, excluding positions that would produce:

- an answer that is not inside the context
- an answer with negative length
- an answer that is too long (capped at a maximum length of 30)
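
These three filters can be sketched on toy values in plain Python. All logits, offsets and the context below are made up for illustration, and best_spans is a hypothetical helper. It scores each candidate as the sum of its start and end logits, the same scoring used in the cells that follow.

```python
def best_spans(start_logits, end_logits, offsets, context, n_best=3, max_answer_length=30):
    """Return candidate answers sorted by start logit + end logit, best first."""
    by_start = sorted(range(len(start_logits)), key=lambda i: start_logits[i], reverse=True)
    by_end = sorted(range(len(end_logits)), key=lambda i: end_logits[i], reverse=True)
    candidates = []
    for start in by_start[:n_best]:
        for end in by_end[:n_best]:
            # filter 1: both positions must be inside the context
            if offsets[start] is None or offsets[end] is None:
                continue
            # filters 2 and 3: no negative length, no overly long answers
            if end < start or end - start + 1 > max_answer_length:
                continue
            candidates.append({
                "score": start_logits[start] + end_logits[end],
                "text": context[offsets[start][0]:offsets[end][1]],
            })
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


toy_context = "Super Bowl 50 was"
# None marks tokens outside the context (special tokens and the question)
toy_offsets = [None, None, (0, 5), (6, 10), (11, 13), (14, 17), None]
toy_start_logits = [5.0, 0.1, 4.0, 0.2, 1.0, 0.0, 0.3]
toy_end_logits = [5.0, 0.0, 0.5, 3.0, 2.0, 0.1, 0.2]

ranked = best_spans(toy_start_logits, toy_end_logits, toy_offsets, toy_context)
print(ranked[0])
```

Position 0 has the highest start and end logits in this toy input, yet it never appears in the result because its offset is None.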



In [41]:
batch = next(iter(val_set))
output = model.predict_on_batch(batch)
output.keys()




odict_keys(['start_logits', 'end_logits'])

In [42]:
output.start_logits.shape, output.end_logits.shape

((16, 384), (16, 384))

In [43]:
np.argmax(output.start_logits, -1), np.argmax(output.end_logits, -1)

(array([ 46,  57,  78,  54, 118, 108,  72,   9, 108,  45,  73,  41,  80,
         91, 157,  35], dtype=int64),
 array([ 47,  58,  81,  55, 118, 109,  75,  48, 109,  47,  76,  42,  83,
         92, 159,  35], dtype=int64))

In [44]:
# set a limit for the answer length
max_answer_length = 30
n_best_size = 20

# grab logits for start and end tokens of the answer
start_logits = output.start_logits[15]
end_logits = output.end_logits[15]

offset_mapping = validation_features[15]["offset_mapping"]
# This assumes that feature 15 comes from example 15, which only holds if no earlier example was split
# into several features. In the general case, we need to match the feature's example_id to an example index.
context = dataset["validation"][15]["context"]

# Gather the indices the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()

valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if (
            start_index <= end_index
        ):  # We need to refine that test to check the answer is inside the context
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char:end_char],
                }
            )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[
    :n_best_size
]
valid_answers

[{'score': 17.9795, 'text': '2015'},
 {'score': 13.820364, 'text': 'the 2015'},
 {'score': 12.658412, 'text': '2015 season'},
 {'score': 11.241871, 'text': '2015 season.'},
 {'score': 11.125507, 'text': 'for the 2015'},
 {'score': 10.312961,
  'text': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015'},
 {'score': 9.25402,
  'text': '50 was an American football game to determine the champion of the National Football League (NFL) for the 2015'},
 {'score': 8.499275, 'text': 'the 2015 season'},
 {'score': 7.5638137, 'text': '2016'},
 {'score': 7.351606, 'text': 'NFL) for the 2015'},
 {'score': 7.1056266,
  'text': 'champion of the National Football League (NFL) for the 2015'},
 {'score': 7.082734, 'text': 'the 2015 season.'},
 {'score': 6.9751983,
  'text': 'Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015'},
 {'score': 6.7369404, 'text': 'National Foo
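
The slice `[-1 : -n_best_size - 1 : -1]` in the cell above walks an ascending sort order backwards, so it returns the indices of the `n_best_size` largest logits, best first. A small sketch with made-up logits shows the effect; `sorted(range(...))` is used here as a pure-Python stand-in for `np.argsort`, and the values are illustrative only.

```python
toy_logits = [0.1, 2.5, -1.0, 4.0, 3.2]
n_best = 3

# Ascending order of indices by logit value (what np.argsort would return).
ascending = sorted(range(len(toy_logits)), key=lambda i: toy_logits[i])

# Read the order from the end to get the n_best highest-scoring indices.
top_indexes = ascending[-1 : -n_best - 1 : -1]
print(top_indexes)  # [3, 4, 1]
```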

In [45]:
import collections

examples = dataset["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)
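
A long context can be split into several overlapping features, so one example may own more than one feature. The sketch below repeats the grouping logic of the cell above on hypothetical toy data (`toy_example_ids` and `toy_features` are made up) to show the shape of the result. Note that the keys are integer example positions, not the example id strings.

```python
import collections

# Hypothetical toy data: example "b" was split into two overlapping features.
toy_example_ids = ["a", "b"]
toy_features = [{"example_id": "a"}, {"example_id": "b"}, {"example_id": "b"}]

example_id_to_index = {k: i for i, k in enumerate(toy_example_ids)}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(toy_features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

print(dict(features_per_example))  # {0: [0], 1: [1, 2]}
```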

The last piece to handle is impossible answers (when squad_v2 = True). The code above only keeps answers that lie inside the context, so we also need the score of the impossible answer, whose start and end indices both point at the CLS token. When one example yields several features, we should predict the impossible answer only when all of its features give the impossible answer a high score, because a single feature may predict the impossible answer simply because the answer is not in the part of the context that feature sees.

We then predict the impossible answer when that score is greater than the score of the best non-impossible answer. Putting everything together gives the following post-processing function:
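
The decision rule for impossible answers can be sketched on its own. This is an assumption-laden illustration rather than part of the function below: `pick_answer` is a hypothetical helper, each feature is assumed to be summarised as a tuple `(null_score, best_span_score, best_span_text)`, and taking the minimum null score over features is one way to encode "all features must score the impossible answer highly".

```python
def pick_answer(feature_results):
    """Choose between the best span and the impossible (empty) answer.

    feature_results: list of (null_score, best_span_score, best_span_text),
    one tuple per feature of the same example (hypothetical format).
    """
    # The example-level null score is the lowest null score among its features,
    # so a single feature that lacks the answer cannot force an empty answer.
    min_null_score = min(r[0] for r in feature_results)
    # Best real span across all features of the example.
    _, best_score, best_text = max(feature_results, key=lambda r: r[1])
    return "" if min_null_score > best_score else best_text


print(pick_answer([(5.0, 3.0, "x"), (1.0, 2.0, "y")]))        # x
print(repr(pick_answer([(5.0, 3.0, "x"), (4.0, 2.0, "y")])))  # ''
```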

In [46]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(
    dataset,
    features,
    all_start_logits,
    all_end_logits,
    n_best_size=20,
    max_answer_length=30,
):
    """
        Takes raw predictions and validates answers by removing spans whose start and end positions
        fall outside the context, then returns the top valid answer for each example.
        
        parameters:
            dataset: original dataset to extract answer text from
            features: tokenized dataset
            all_start_logits: start-position logits for each token in each feature
            all_end_logits: end-position logits for each token in each feature
            n_best_size: number of top start/end logits to consider per feature
            max_answer_length: max length of answer text, in tokens
            
        return: dictionary mapping each example id to its best answer text
            {
                'doc_id': best_answer
            }
        
    """
    
    
    # Build a map from each example id to the indices of the features generated from it.
    example_to_features = collections.defaultdict(list)
    for i, feature in enumerate(features):
        # feature["example_id"] is the id of the example this feature was generated from;
        # append the feature index under that id.
        example_to_features[feature["example_id"]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(
        f"Post-processing {len(dataset)} example predictions split into {len(features)} features."
    )

    # Loop over all the examples of the dataset passed in (not the global `examples`).
    for example in tqdm(dataset):
        example_id = example['id']
        
        valid_answers = []

        context = example["context"]
        # Looping through all the features associated to the current document.
        for feature_index in example_to_features[example_id]:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what allows us to map positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or not offset_mapping[start_index]
                        or not offset_mapping[end_index]
                    ):
                        continue
                        
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue
                    
                    # valid answers
                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char:end_char],
                        }
                    )
                    
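        # Fall back to an empty answer when no candidate span survived the filters,
        # since max() raises ValueError on an empty list.
        if valid_answers:
            best_answer = max(valid_answers, key=lambda x: x["score"])
        else:
            best_answer = {"text": "", "score": 0.0}
        predictions[example_id] = best_answer["text"]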

    return predictions

In [47]:
raw_predictions = model.predict(val_set)



In [67]:
eval_loss = model.evaluate(validation_set)



In [48]:
final_predictions = postprocess_qa_predictions(
    dataset["validation"],
    validation_features,
    raw_predictions["start_logits"],
    raw_predictions["end_logits"],
)

Post-processing 10570 example predictions split into 10822 features.



In [49]:
from datasets import load_metric
metric = load_metric("squad")

  metric = load_metric("squad")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [51]:
# test model answers
test_samples = 10
for i in range(test_samples):
    # Sample an example index; the bound is the number of examples, not the number of features.
    random_example = np.random.randint(0, len(dataset["validation"]))
    example = dataset["validation"][random_example]
    predicted_answer = final_predictions[example['id']]
    print(f"example ID: {example['id']}")
    print(f"Context: {example['context']}")
    print(f"Question: {example['question']}")
    print(f"True Answer: {example['answers']['text']} | Predicted Answer: {predicted_answer}")
    print()

example ID: 57293e983f37b3190047818d
Context: In 2001, 16 national science academies issued a joint statement on climate change. The joint statement was made by the Australian Academy of Science, the Royal Flemish Academy of Belgium for Science and the Arts, the Brazilian Academy of Sciences, the Royal Society of Canada, the Caribbean Academy of Sciences, the Chinese Academy of Sciences, the French Academy of Sciences, the German Academy of Natural Scientists Leopoldina, the Indian National Science Academy, the Indonesian Academy of Sciences, the Royal Irish Academy, Accademia Nazionale dei Lincei (Italy), the Academy of Sciences Malaysia, the Academy Council of the Royal Society of New Zealand, the Royal Swedish Academy of Sciences, and the Royal Society (UK). The statement, also published as an editorial in the journal Science, stated "we support the [TAR's] conclusion that it is at least 90% certain that temperatures will continue to rise, with average global surface temperature pro

Next, hyperparameter tuning is performed by changing the parameters as follows:

In [55]:
learning_rate = 2e-5
num_train_epochs = 2
weight_decay = 0.01

In [56]:
# convert data to a TensorFlow dataset to feed into the model
train_set = model.prepare_tf_dataset(
    train_val_set["train"],
    shuffle=True,
    batch_size=batch_size,
)

validation_set = model.prepare_tf_dataset(
    train_val_set["validation"],
    shuffle=False,
    batch_size=batch_size,
)

In [57]:
from transformers import create_optimizer

total_train_steps = len(train_set) * num_train_epochs

optimizer, schedule = create_optimizer(
     init_lr=learning_rate,
     num_warmup_steps=0,
     num_train_steps=total_train_steps,
     weight_decay_rate= weight_decay
 )

model.compile(optimizer=optimizer, metrics=["accuracy"])

model.fit(
     train_set,
     validation_data=validation_set,
     epochs=num_train_epochs
    #     callbacks = callbacks
 )

Epoch 1/2
Epoch 2/2


<tf_keras.src.callbacks.History at 0x1b7004f1510>
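
With `num_warmup_steps=0`, the schedule built above is expected to lower the learning rate steadily over training. The sketch below assumes a plain linear decay from `init_lr` to zero over `total_steps`; `linear_decay_lr` is a hypothetical helper written for illustration, and the step count of 1000 is made up rather than taken from the run above.

```python
def linear_decay_lr(step, init_lr=2e-5, total_steps=1000):
    """Learning rate after `step` updates under a linear decay to zero, no warmup."""
    # Clamp so the rate stays at zero once training has finished.
    progress = min(step, total_steps) / total_steps
    return init_lr * (1.0 - progress)


for step in (0, 500, 1000):
    print(step, linear_decay_lr(step))
```

Under this assumption the rate is `2e-5` at the first step, half of that at the midpoint, and zero at the last step, so later updates move the weights less than early ones.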

In [60]:
eval_loss = model.evaluate(validation_set)



In [61]:
formatted_predictions = [
    {"id": k, "prediction_text": v} for k, v in final_predictions.items()
]
references = [
    {"id": ex["id"], "answers": ex["answers"]} for ex in dataset["validation"]
]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 64.22894985808892, 'f1': 79.11440805987684}
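
To make the two reported numbers concrete, here is a minimal per-answer sketch of exact match and token-level F1. It follows the usual SQuAD v1.1 convention of lowercasing and stripping punctuation and English articles before comparing, but it is a simplified illustration: the function names are hypothetical, and taking the best score over several reference answers and averaging over the dataset are left out.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the 2015", "2015"))    # 1.0 (the article is ignored)
print(exact_match("2015 season", "2015"))  # 0.0
print(f1_score("2015 season", "2015"))     # partial credit, about 0.67
```

This is why F1 is higher than exact match in the output above: a prediction such as "2015 season" gets no exact-match credit against "2015" but still earns partial F1.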

In [70]:
final_predictions = postprocess_qa_predictions(
    dataset["validation"],
    validation_features,
    raw_predictions["start_logits"],
    raw_predictions["end_logits"],
)

Post-processing 10570 example predictions split into 10822 features.



In [71]:
test_samples = 10
for i in range(test_samples):
    # Sample an example index; the bound is the number of examples, not the number of features.
    random_example = np.random.randint(0, len(dataset["validation"]))
    example = dataset["validation"][random_example]
    predicted_answer = final_predictions[example['id']]
    print(f"example ID: {example['id']}")
    print(f"Context: {example['context']}")
    print(f"Question: {example['question']}")
    print(f"True Answer: {example['answers']['text']} | Predicted Answer: {predicted_answer}")
    print()

example ID: 57113639a58dae1900cd6d1a
Context: It is a logical extension of the compound engine (described above) to split the expansion into yet more stages to increase efficiency. The result is the multiple expansion engine. Such engines use either three or four expansion stages and are known as triple and quadruple expansion engines respectively. These engines use a series of cylinders of progressively increasing diameter. These cylinders are designed to divide the work into equal shares for each expansion stage. As with the double expansion engine, if space is at a premium, then two smaller cylinders may be used for the low-pressure stage. Multiple expansion engines typically had the cylinders arranged inline, but various other formations were used. In the late 19th century, the Yarrow-Schlick-Tweedy balancing 'system' was used on some marine triple expansion engines. Y-S-T engines divided the low-pressure expansion stages between two cylinders, one at each end of the engine. This a

The model performs well at predicting answers that fit the given context. Next, we try a different tokenizer, T5Tokenizer, together with the T5 model it belongs to.

In [73]:
import json
from transformers import T5Tokenizer, T5Model, T5ForConditionalGeneration, T5TokenizerFast
from torch.optim import Adam
import torch
import evaluate
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader, RandomSampler
import pandas as pd
import numpy as np
import transformers

In [74]:
TOKENIZER = T5TokenizerFast.from_pretrained("t5-base")
MODEL = T5ForConditionalGeneration.from_pretrained("t5-base", return_dict=True)
OPTIMIZER = Adam(MODEL.parameters(), lr=0.00001)
Q_LEN = 256   # Question Length
T_LEN = 32    # Target Length
# BATCH_SIZE = 4
BATCH_SIZE = 10
DEVICE = "cuda:0"

In [75]:
# Loading the data

with open(r'C:\\Users\\PPL2\\Documents\\Private\\Model\\Pacman Data Science\\Project\\Natural Languange Processing/train-v1.1.json') as f:
    data = json.load(f)

In [76]:
def prepare_data(data):
    articles = []
    
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                question = qa["question"]
                answer = qa["answers"][0]["text"]
                inputs = {"context": paragraph["context"], "question": question, "answer": answer}
                articles.append(inputs)

    return articles

In [77]:
data = prepare_data(data)

# Create a Dataframe
data = pd.DataFrame(data)
data = data.loc[:3000]

In [78]:
data.shape

(3001, 3)

In [79]:
class QA_Dataset(Dataset):
    def __init__(self, tokenizer, dataframe, q_len, t_len):
        self.tokenizer = tokenizer
        self.q_len = q_len
        self.t_len = t_len
        self.data = dataframe
        self.questions = self.data["question"]
        self.context = self.data["context"]
        self.answer = self.data['answer']
        
    def __len__(self):
        return len(self.questions)
    
    def __getitem__(self, idx):
        question = self.questions[idx]
        context = self.context[idx]
        answer = self.answer[idx]
        
        # padding="max_length" already pads to max_length, so the deprecated
        # pad_to_max_length flag is not needed.
        question_tokenized = self.tokenizer(question, context, max_length=self.q_len, padding="max_length",
                                                    truncation=True, add_special_tokens=True)
        answer_tokenized = self.tokenizer(answer, max_length=self.t_len, padding="max_length", 
                                          truncation=True, add_special_tokens=True)
        
        labels = torch.tensor(answer_tokenized["input_ids"], dtype=torch.long)
        labels[labels == 0] = -100
        
        return {
            "input_ids": torch.tensor(question_tokenized["input_ids"], dtype=torch.long),
            "attention_mask": torch.tensor(question_tokenized["attention_mask"], dtype=torch.long),
            "labels": labels,
            "decoder_attention_mask": torch.tensor(answer_tokenized["attention_mask"], dtype=torch.long)
        }
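
The line `labels[labels == 0] = -100` in the class above replaces padding token ids (0 for T5) with -100, the label value that PyTorch's cross-entropy loss skips by default. The pure-Python sketch below illustrates the effect on the averaged loss; `masked_mean_nll` and the numbers are hypothetical and stand in for the per-token log-probabilities a model would produce.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def masked_mean_nll(token_log_probs, labels):
    """Average negative log-likelihood over positions whose label is not ignored."""
    kept = [-lp for lp, y in zip(token_log_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Two real answer tokens followed by two padding positions.
labels = [2018, 1, IGNORE_INDEX, IGNORE_INDEX]
log_probs = [math.log(0.5), math.log(0.25), math.log(0.01), math.log(0.01)]

# Only the first two positions count, so poor predictions on padding do not inflate the loss.
print(masked_mean_nll(log_probs, labels))
```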

In [80]:
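# Dataloader

train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

# SubsetRandomSampler draws from the given index labels, so the two loaders stay disjoint.
# (RandomSampler(x) yields positions 0..len(x)-1, which would make validation overlap training.)
train_sampler = torch.utils.data.SubsetRandomSampler(train_data.index.tolist())
val_sampler = torch.utils.data.SubsetRandomSampler(val_data.index.tolist())

qa_dataset = QA_Dataset(TOKENIZER, data, Q_LEN, T_LEN)

train_loader = DataLoader(qa_dataset, batch_size=BATCH_SIZE, sampler=train_sampler)
val_loader = DataLoader(qa_dataset, batch_size=BATCH_SIZE, sampler=val_sampler)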

In [81]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MODEL.to(DEVICE)
OPTIMIZER = torch.optim.Adam(MODEL.parameters(), lr=0.00001)

In [82]:
train_loss = 0
val_loss = 0
train_batch_count = 0
val_batch_count = 0

for epoch in range(2):
    MODEL.train()
    for batch in tqdm(train_loader, desc="Training batches"):
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        labels = batch["labels"].to(DEVICE)
        decoder_attention_mask = batch["decoder_attention_mask"].to(DEVICE)

        outputs = MODEL(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
            decoder_attention_mask=decoder_attention_mask
        )

        OPTIMIZER.zero_grad()
        outputs.loss.backward()
        OPTIMIZER.step()
        train_loss += outputs.loss.item()
        train_batch_count += 1
    
    # Evaluation (no gradients are needed, so autograd is disabled to save memory)
    MODEL.eval()
    with torch.no_grad():
        for batch in tqdm(val_loader, desc="Validation batches"):
            input_ids = batch["input_ids"].to(DEVICE)
            attention_mask = batch["attention_mask"].to(DEVICE)
            labels = batch["labels"].to(DEVICE)
            decoder_attention_mask = batch["decoder_attention_mask"].to(DEVICE)

            outputs = MODEL(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels,
                decoder_attention_mask=decoder_attention_mask
            )

            val_loss += outputs.loss.item()
            val_batch_count += 1

    print(f"{epoch+1}/{2} -> Train loss: {train_loss / train_batch_count}\tValidation loss: {val_loss/val_batch_count}")

Training batches:   0%|          | 0/240 [00:00<?, ?it/s]

Validation batches:   0%|          | 0/61 [00:00<?, ?it/s]

1/2 -> Train loss: 2.493030342956384	Validation loss: 1.435157461244552


Training batches:   0%|          | 0/240 [00:00<?, ?it/s]

Validation batches:   0%|          | 0/61 [00:00<?, ?it/s]

2/2 -> Train loss: 1.8402289805933834	Validation loss: 1.0448074530749047


In [86]:
MODEL.save_pretrained("qa_model")
TOKENIZER.save_pretrained("qa_tokenizer")

# Saved files
"""('qa_tokenizer/tokenizer_config.json',
 'qa_tokenizer/special_tokens_map.json',
 'qa_tokenizer/spiece.model',
'qa_tokenizer/added_tokens.json',
'qa_tokenizer/tokenizer.json')"""

"('qa_tokenizer/tokenizer_config.json',\n 'qa_tokenizer/special_tokens_map.json',\n 'qa_tokenizer/spiece.model',\n'qa_tokenizer/added_tokens.json',\n'qa_tokenizer/tokenizer.json')"

In [87]:
def predict_answer(context, question, ref_answer=None):
    inputs = TOKENIZER(question, context, max_length=Q_LEN, padding="max_length", truncation=True, add_special_tokens=True)
    
    input_ids = torch.tensor(inputs["input_ids"], dtype=torch.long).to(DEVICE).unsqueeze(0)
    attention_mask = torch.tensor(inputs["attention_mask"], dtype=torch.long).to(DEVICE).unsqueeze(0)

    outputs = MODEL.generate(input_ids=input_ids, attention_mask=attention_mask)
  
    predicted_answer = TOKENIZER.decode(outputs.flatten(), skip_special_tokens=True)
    
    if ref_answer:
        bleu = evaluate.load("google_bleu")
        print(predicted_answer, ref_answer)
        score = bleu.compute(predictions=[predicted_answer], references=[ref_answer])
    
        print("Context: \n", context)
        print("\n")
        print("Question: \n", question)
        return {
            "Reference Answer: ": ref_answer, 
            "Predicted Answer: ": predicted_answer, 
            "BLEU Score: ": score
        }
    else:
        return predicted_answer

In [88]:
context = data.iloc[0]["context"]
question = data.iloc[0]["question"]
answer = data.iloc[0]["answer"]

In [89]:
predict_answer(context, question, answer)



Saint Bernadette Soubirous Saint Bernadette Soubirous
Context: 
 Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


Question: 
 To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


{'Reference Answer: ': 'Saint Bernadette Soubirous',
 'Predicted Answer: ': 'Saint Bernadette Soubirous',
 'BLEU Score: ': {'google_bleu': 1.0}}

In [90]:
context = """Virat Kohli on Tuesday slammed his 73rd international hundred by reaching the three-figure mark off 80 deliveries against Sri Lanka in first ODI in Guwahati. 
The 34-year-old became the fastest batter to slam 20 ODI hundreds on Indian soil, taking 99 innings. 
He also overtook Sachin Tendulkar to become the fastest batter to smash 45 ODI hundreds, taking 257 innings."""

In [91]:
q_1="How many innings does Virat Kohli took to reach 45 ODI hundreds?" 
predict_answer(context, q_1)

'257'