## Table of Contents

### Python Imports
   * [Link to package import](#import_packages)
   * [Link to training data import](#import_training_data)
   
### Data Formatting
* [Article List Formating](#generate_python_list)
* [Article Retrieval Code](#retrieve_articles)

### Article Retrieval (TF-IDF)
* [Data Preprocessing](#preprocess_data)
* [Document Retrieval Benchmarking](#document_retrieval_benchmarking)
* [Pretrained Model](#import_pretrained_model)

### Answer Retrieval (BERT)
* [BERT Implementation](#BERT_training)
* [BERT Looping Helper Functions](#looping_helpers)
* [BERT Standard Benchmarking](#standard_benchmarking)
* [BERT Benchmarking with Looping](#benchmarking_BERT_with_loop)
* [BERT Benchmarking Troubleshooting](#BERT_analysis)

In [None]:
!pip install transformers
!pip install pandas
!pip install nltk
!pip install torch
!pip install spacy
!pip install sklearn
!pip install scipy

## Import Python Packages <a id='import_packages'></a>

In [16]:
import pandas as pd
import numpy as np
import os
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
import transformers
transformers.logging.set_verbosity_error()
# from transformers import BertTokenizer, AutoTokenizer, BertForQuestionAnswering, BertTokenizerFast, BertConfig, DistilBertForQuestionAnswering, DistilBertTokenizerFast
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import BertTokenizer,AutoTokenizer,BertForQuestionAnswering
from torch.optim import AdamW
import random
import re
import spacy

## Import Training Sets <a id='import_training_data'></a>
### PLEASE CHANGE THE FILE PATH OF WHERE THE TRAINING AND TEST DATA FILES ARE LOCATED!!!

In [40]:
#flag for displaying the dataframes
display_dataframe = True

#flag for using the test data
useTestData = False

# Import the csv file containing the BERT test and training data
training_BERT_filename = "/content/squad_training_dataset_BERT.csv"
test_BERT_filename = "/content/squad_test_training_dataset_BERT.csv"
training_BERT_dataset_df = pd.read_csv(training_BERT_filename)
test_BERT_dataset_df = pd.read_csv(test_BERT_filename)

# drop non-answers from the dataset
training_BERT_dataset_df = training_BERT_dataset_df.dropna()
test_BERT_dataset_df = test_BERT_dataset_df.dropna()

# display the training set dataframe
if display_dataframe:
  if(useTestData):
      display(test_BERT_dataset_df)
  else:
      display(training_BERT_dataset_df)

# create training lists
if(useTestData):
    questions_BERT = test_BERT_dataset_df['question'].to_list()
    answers_BERT = test_BERT_dataset_df['answer'].to_list()
    topics_BERT = test_BERT_dataset_df['topic'].to_list()
    answers_start = test_BERT_dataset_df['answer_start'].to_list()
else:
    questions_BERT = training_BERT_dataset_df['question'].to_list()
    answers_BERT = training_BERT_dataset_df['answer'].to_list()
    topics_BERT = training_BERT_dataset_df['topic'].to_list()
    answers_start = training_BERT_dataset_df['answer_start'].to_list()

# Import the csv files containing test and training data for article to questions
topic_to_article_BERT_filename = "/content/topics_to_articles_BERT.csv"
test_topic_to_article_BERT_filename = "/content/topics_to_articles_BERT.csv"
topics_articles_BERT_df = pd.read_csv(topic_to_article_BERT_filename)
topics_articles_test_BERT_df = pd.read_csv(test_topic_to_article_BERT_filename)

if(useTestData):
    #get the unique topic names from the dataframe
    topic_strings_BERT = topics_articles_test_BERT_df.columns.to_list()
    #get the correct label for the questions
    question_topic_BERT = test_BERT_dataset_df['topic'].to_list()
    #get the correct answers for questions
    question_answers_BERT = test_BERT_dataset_df['answer'].to_list()
else:
    #get the unique topic names from the dataframe
    topic_strings_BERT = topics_articles_BERT_df.columns.to_list()
    #get the correct label for the questions
    question_topic_BERT = training_BERT_dataset_df['topic'].to_list()
    #get the correct answers for questions
    question_answers_BERT = training_BERT_dataset_df['answer'].to_list()

Unnamed: 0.1,Unnamed: 0,topic,question,answer,answer_start,question_id,answer_context
0,0,Beyonce,When did Beyonce start becoming popular?,in the late 1990s,269.0,56be85543aeaaa14008c9063,Beyonce Giselle Knowles-Carter (/bi:'janseI/ b...
1,1,Beyonce,What areas did Beyonce compete in when she was...,singing and dancing,207.0,56be85543aeaaa14008c9065,Beyonce Giselle Knowles-Carter (/bi:'janseI/ b...
2,2,Beyonce,When did Beyonce leave Destiny's Child and bec...,2003,526.0,56be85543aeaaa14008c9066,Beyonce Giselle Knowles-Carter (/bi:'janseI/ b...
3,3,Beyonce,In what city and state did Beyonce grow up?,"Houston, Texas",166.0,56bf6b0f3aeaaa14008c9601,Beyonce Giselle Knowles-Carter (/bi:'janseI/ b...
4,4,Beyonce,In which decade did Beyonce become famous?,late 1990s,276.0,56bf6b0f3aeaaa14008c9602,Beyonce Giselle Knowles-Carter (/bi:'janseI/ b...
...,...,...,...,...,...,...,...
130046,130046,Kathmandu,In what US state did Kathmandu first establish...,Oregon,229.0,5735d259012e2f140011a09d,"Kathmandu Metropolitan City (KMC), in order to..."
130047,130047,Kathmandu,What was Yangon previously known as?,Rangoon,414.0,5735d259012e2f140011a09e,"Kathmandu Metropolitan City (KMC), in order to..."
130048,130048,Kathmandu,With what Belorussian city does Kathmandu have...,Minsk,476.0,5735d259012e2f140011a09f,"Kathmandu Metropolitan City (KMC), in order to..."
130049,130049,Kathmandu,In what year did Kathmandu create its initial ...,1975,199.0,5735d259012e2f140011a0a0,"Kathmandu Metropolitan City (KMC), in order to..."


## Generate List of Articles <a id='generate_python_list'></a>

In [5]:
def Get_Article(df,topic):
    article = df[topic].to_list()[0]
    
    return article

articles_BERT = []
for topic in topic_strings_BERT:
    articles_BERT.append(Get_Article(topics_articles_BERT_df,topic))

In [6]:
def segment_documents(docs, max_doc_length=500):
    # List containing full and segmented docs
    segmented_docs = []

    for doc in docs:
        # Split document by spaces to obtain a word count that roughly approximates the token count
        split_to_words = doc.split(" ")

        # If the document is longer than our maximum length, split it up into smaller segments and add them to the list 
        if len(split_to_words) > max_doc_length:
            for doc_segment in range(0, len(split_to_words), max_doc_length):
                segmented_docs.append(" ".join(split_to_words[doc_segment:doc_segment + max_doc_length]))

        # If the document is shorter than our maximum length, add it to the list
        else:
            segmented_docs.append(doc)

    return segmented_docs

## Article Retrieval Code <a id='retrieve_articles'></a>

In [14]:
#make pre-processing functions
def convert_lower_case(data):
    return str(np.char.lower(data))

def remove_punctuation(data):
    new_data = ""
    symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
    for i in symbols:
        new_data = np.char.replace(data, i, ' ')
        
    return(str(new_data))
def remove_apostrophe(data):
    return str(np.char.replace(data, "'", ""))

def remove_single_characters(data):
    new_text = ""
    
    word_list = nltk.word_tokenize(data)
    
    for w in word_list:
        if len(w) > 1:
            new_text = new_text + " " + w
    
    return new_text

def Lemmatize(data):
    lemmatizer = WordNetLemmatizer()
    
    word_list = nltk.word_tokenize(data)
    
    lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
    
    return lemmatized_output
def Stemming(data):
    ps = PorterStemmer()
    
    word_list = nltk.word_tokenize(data)
    
    stem_output = ' '.join([ps.stem(w) for w in word_list])
    
    return stem_output

#create a function to preprocess the data
def Pre_Process_Data(data):
    new_data = remove_punctuation(data)
    return new_data

def Retrieve_Article(query, docs, k=5):
    #pre-process the query
    query = Pre_Process_Data(query)
    
    query_words = re.split('\s+', query)
    num_cols = len(query_words)
    
    # Initialize a vectorizer that removes English stop words
    vectorizer = TfidfVectorizer(analyzer="word", stop_words='english',sublinear_tf=True,use_idf=True)
    
    # Create a corpus of query and documents and convert to TFIDF vectors
    query_and_docs = [query] + docs
    matrix = vectorizer.fit_transform(query_and_docs)
    
    #apply SVD to the TF-IDF vectorized matrix
    svd = TruncatedSVD(n_components=num_cols+250,n_iter=1,random_state=42)
    
    #fit and transform the SVD model
    matrix_new = svd.fit_transform(matrix)
    matrix_new = csr_matrix(matrix_new)

    # Holds our cosine similarity scores
    scores = []

    # The first vector is our query text, so compute the similarity of our query against all document vectors
    for i in range(1, len(query_and_docs)):
        scores.append(cosine_similarity(matrix_new[0], matrix_new[i])[0][0])

    # Sort list of scores and return the top k highest scoring documents
    sorted_list = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    top_doc_indices = [x[0] for x in sorted_list[:k]]
    top_docs = [docs[x] for x in top_doc_indices]

    return top_docs, top_doc_indices

## Benchmark Article/Document Retrieval <a id='document_retrieval_benchmarking'></a>

In [54]:
#flag for running the document/article retrival benchmarking test
RunDocRetrBenchmarking = False

#conduct document retrieval for each question in the dataset
def Benchmark_DocRetrieval(num_articles,num_samples,questions,question_topic,articles,articles_true,topic_strings,RandQs=False):
    import random
    
    if(RandQs):
        random_indices = list(random.sample(range(0, len(questions)), num_samples))

        sample_questions = []
        for idx in random_indices:
            sample_questions.append(questions[idx])
    else:
        #create a sample of questions
        sample_questions = questions[0:10]
    
    total = len(sample_questions)
    correct = 0
    tracker = 1
    for question in sample_questions:
        #create a tracker
#         print(question)
        print("Question #%d/%d" %(tracker,total))

        #get the true label for the question
        true_question_label = question_topic[questions.index(question)]

        #run the doc retriever on the current question 
        top_articles, predicted_articles_indices = Retrieve_Article(question,articles,k=num_articles)

#         predicted_articles = []
#         for idx in predicted_articles_indices:
#             predicted_articles.append(articles_true[idx])
        
        #iterate over all predicted articles and check if the prediction is correct
        for prediction in top_articles:    
            #get the true topic of the predicted article
            true_article_label = topic_strings[articles_true.index(prediction)]

            #this will handle if the correct article is even chosen
            if(true_question_label==true_article_label):
                correct += 1

        tracker += 1

    return correct/total

if(RunDocRetrBenchmarking):
    num_top_articles = [1,3,5,10]
    doc_accuracy_results = []
    num_rand_samples = 10
    for i in range(len(num_top_articles)):
        print("Retrieving the top %d articles for each question..." % num_top_articles[i])
        retrieval_accuracy = Benchmark_DocRetrieval(num_top_articles[i],num_rand_samples,questions_BERT,question_topic_BERT,articles_BERT,articles_BERT,topic_strings_BERT,RandQs=True)
        doc_accuracy_results.append(retrieval_accuracy)
        print("k=%d Article Retrieval Accuracy: %.3f" % (num_top_articles[i],retrieval_accuracy))

    #plot the accuracy
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()

    sns.set_theme(font_scale=1)
    accPlot = sns.lineplot(x=num_top_articles, y=doc_accuracy_results)
    accPlot.set_title("Custom Article Retrieval Model Performance\n 1000 Random Question Samples per iteration")
    accPlot.set(xlabel = "Number of Articles Retrieved", ylabel = "Accuracy")
    plt.show()

## Set-up for GPU use

In [9]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

cuda:0


## Import the PreTrained or the Custom-Trained Model <a id='import_pretrained_model'></a>

In [64]:
import transformers
transformers.logging.set_verbosity_error()
from transformers import BertTokenizer, AutoTokenizer, BertForQuestionAnswering, BertTokenizerFast, BertConfig, DistilBertForQuestionAnswering, DistilBertTokenizerFast

#if this flag is true, you can run the code using the test data
useCustomModel = False

if(useCustomModel):
    #this will only work if you have loaded the custom model into your Google Colaboratoty environment!
    model_path = '/content/Custom_BERT_model_5'
    model = BertForQuestionAnswering.from_pretrained(model_path).to(device)
    tokenizer = BertTokenizer.from_pretrained(model_path)
else:
    modelname = 'deepset/bert-base-cased-squad2'
    # modelname = 'deepset/bert-large-uncased-whole-word-masking-squad2'
    model = BertForQuestionAnswering.from_pretrained(modelname).to(device)
    tokenizer = BertTokenizer.from_pretrained(modelname)

## BERT Function <a id='BERT_training'></a>

In [11]:
#create function that runs the BERT model
def Run_BERT(question, text_batch):
    
    #encode the question and the paragraph(text)
    input_ids = tokenizer.encode(question,text_batch,max_length=512)
    
    #search the input_ids for the first instance of the SEP token
    sep_index = input_ids.index(tokenizer.sep_token_id)
            
    #Segment A occurs from the first char to the end of the SEP token instance
    num_seg_a = sep_index+1
    
    #The rest of the tokens will belong to segment B
    num_seg_b = len(input_ids)-num_seg_a
    
    #construct a list of 0's and 1's
    segment_ids = [0]*num_seg_a + [1]*num_seg_b
    
    #there should be a segment id for every input token
    #if this doesnt return an error we are good
    assert len(segment_ids) == len(input_ids)
    
    #run the model using the current data
    outputs = model(torch.as_tensor([input_ids]).to(device), #the tokens representing the input text 
                   token_type_ids=torch.as_tensor([segment_ids]).to(device), #the segment ids to differentiate Q from A
                   return_dict=True)
    
    #get the start and end vectors
    start_scores = outputs.start_logits
    end_scores = outputs.end_logits
    
    #reconstruct the answer from the scores
    answer_start = torch.argmax(start_scores).to(device)
    answer_end = torch.argmax(end_scores).to(device)
    
    #get the string versions of the input tokens
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    
    #create an answer variable and append the start of the first word
    answer = tokens[answer_start]
    
    #fill out the remainder of the answer
    for i in range(answer_start + 1, answer_end + 1):
        #if we have a subword token, recombine it with the previous token
        if(tokens[i][0:2]=='##'):
            answer += tokens[i][2:]
        elif(tokens[i][0]==','):
            answer += tokens[i][0]
        elif(tokens[i][0]=='\''):
            answer += tokens[i][0]
        elif(tokens[i][0]=='-'):
            answer += tokens[i][0]
        elif(tokens[i][0]=='s'):
            answer += tokens[i]
        elif(tokens[i][0] == '.'):
            answer += tokens[i][0]
        elif(tokens[i][0].isnumeric() and i > 1):
            if tokens[i-1][0]=='.':
                answer += tokens[i][0]
        else:
            answer += ' ' + tokens[i]

    return answer

## Helper Functions: Must run this block <a id='looping_helpers'></a>

In [12]:
def find_answer_context(paragraph, answer, buffer_sentences=1):
    pgraph_chars = ''.join(paragraph.split(' '))
    answer_chars = ''.join(answer.split(' '))
    
    # find the index in the character string where answer starts
    answer_loc = pgraph_chars.find(answer_chars)
    
    # find all the indices of the periods
    period_indices = [x for x in findall('.', pgraph_chars)]
    
    # find the index value where the answer would be inserted
    stop_idx = np.searchsorted(period_indices, answer_loc)
    
    # find the periods marking to the left and right of the start point
    if stop_idx > 0:
        context_left = period_indices[stop_idx-1::-1]
    else:
        context_left = []
    context_right = period_indices[stop_idx:-1]
    
    # loop through the periods until we find the appropraite number of buffer sentences worth
    p_idx = 0
    left_count = 0
    while left_count <= buffer_sentences and p_idx < len(context_left):
        if not pgraph_chars[context_left[p_idx]+1].isnumeric():
            left_count +=1
        p_idx += 1
    
    left_period_num = len(context_left) - p_idx
    
    p_idx = 0
    right_count = 0
    while right_count <= buffer_sentences and p_idx < len(context_right):
        if not pgraph_chars[context_right[p_idx]+1].isnumeric():
            right_count +=1
        p_idx += 1
    
    right_period_num = len(context_left) + p_idx + 1
  
    # find the indices in the paragraph
    left_p_idx = find_nth(paragraph, '.', left_period_num)+1
    right_p_idx = find_nth(paragraph, '.', right_period_num)+1

    return paragraph[left_p_idx:right_p_idx]
    
    
def findall(p, s):
    '''Yields all the positions of
    the pattern p in the string s.'''
    i = s.find(p)
    while i != -1:
        yield i
        i = s.find(p, i+1)
        
def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

def sentences_with_answers(paragraphs, answers):
    # storage data structures
    answer_sentences = []
    total_count = 0
    for idx, ans in enumerate(answers):
        text = find_answer_context(paragraphs[idx], ans)
        if text != '':
            answer_sentences.append(text)
            total_count += len(find_answer_context(paragraphs[idx], ans).split(' '))
    return total_count, answer_sentences

def compress_corpus(documents, max_doc_size = 450):
    size = 0
    compressed_corp = []
    current_doc = ''
    for d in documents:
        d_list = d
        if size + len(d_list.split(' ')) < max_doc_size:
            current_doc += ' ' + d
            size += len(d_list.split(' '))
        else:
            compressed_corp.append(current_doc)
            current_doc = d
            size = len(d_list.split(' '))
    compressed_corp.append(current_doc)
    return compressed_corp

def narrow_down_answers(question, documents, answers):
    
    while len(answers) > 1:
        # back out sentences from the answers
        counts, sentences = sentences_with_answers(documents, answers)
        # compress the sentences down to a smaller number of documents
        compressed_corpus = compress_corpus(sentences)

        answers = []
        documents = compressed_corpus.copy()
        for paragraph in compressed_corpus:
            # run BERT
            BERT_answer = Run_BERT(question, paragraph)
            
            # check that BERT answer is acceptable before adding to answer list
            if '[CLS]' not in BERT_answer:
                answers.append(BERT_answer)

    if len(answers) == 0:
        return None
    else:
        return answers[0]

### BERT Standard Benchmarking <a id='standard_benchmarking'></a>


In [None]:
num_samples = 500

random_indices = list(random.sample(range(0, len(questions_BERT)), num_samples))

sample_questions_BERT = []
for idx in random_indices:
    sample_questions_BERT.append(questions_BERT[idx])

# sample_questions_BERT = questions_BERT[0:num_samples]

test_rounds = num_samples
correct_answers = 0

#define the stopwords
sp = spacy.load('en_core_web_sm')
noncontext_words = sp.Defaults.stop_words
# noncontext_words = ['the','i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

#loop over the sample questions
count = 1
index = 0
for question in sample_questions_BERT:
    BERT_answers = []
    print("########################################################################################################")
    print("Correct Answers: " + str(correct_answers))
    print("Question #" + str(count))
    print(question)
    
    count += 1
        
    #get the correct answer to this question
    correct_answer = question_answers_BERT[questions_BERT.index(question)]
    print("Correct Answer: %s" % (correct_answer))
    if(type(correct_answer!=str)):
        correct_answer = str(correct_answer)
        
    #get the top k paragraphs from the non punctuated list
    #use the indices to retrieve the punctuated article that BERT wants
    candidate_articles_BERT, doc_ret_idx = Retrieve_Article(question,articles_BERT,k=5)
    
    #segment the chosen candidate article in "paragraphs"
    candidate_seg_articles = segment_documents(candidate_articles_BERT, max_doc_length=500)
    
    #retrieve which paragraph contains the correct answer
#     top_paragraphs, doc_ret_idx = Retrieve_Article(question,candidate_seg_articles,useParagraphs=True,k=10)
                
    #return the answers from each of the top k paragraphs in descending order by relevancy
    for segment in candidate_seg_articles:
        BERT_prediction = Run_BERT(question, segment)   
        if(BERT_prediction=="[CLS]"):
            continue
        if("[CLS]" in BERT_prediction):
            continue
#         print(BERT_prediction)
                        
        BERT_answers.append(BERT_prediction)
        
        #check to see if the return type is a string
        if((type(BERT_prediction)==str)):
            #create lists of words for the predicted and the correct answers
            BERT_pred_list = re.split('\s+', BERT_prediction)
            BERT_true_list = re.split('\s+', correct_answer)
            
            BERT_pred_list_fix = []
            BERT_true_list_fix = []
            #remove the stop words in the lists
            for word in BERT_pred_list:
                if(word not in noncontext_words):
                    BERT_pred_list_fix.append(word)
                    
            #remove the stop words in the lists
            for word in BERT_true_list:
                if(word not in noncontext_words):
                    BERT_true_list_fix.append(word)
                                            
            #check to see if any words in the prediction are in the answer
            true_ans_len = len(BERT_true_list_fix)
            num_matches = 0
            for word in BERT_pred_list_fix:
                if(word in BERT_true_list_fix):
                    num_matches += 1
            
            if(true_ans_len==1):
                if(num_matches==true_ans_len):
                    correct_answers += 1
                    print("BERT Predicted Answer: " + BERT_prediction)
                    break
            else:
                if(num_matches>=round(0.5*true_ans_len)):
                    correct_answers += 1
                    print("BERT Predicted Answer: " + BERT_prediction)
                    break

                    
BERT_accuracy = round(correct_answers/test_rounds,2)
print(BERT_accuracy)

########################################################################################################
Correct Answers: 0
Question #1
What group famously enjoyed themselves on Union Street?
Correct Answer: sailors from the Royal Navy
########################################################################################################
Correct Answers: 0
Question #2
In what century did the Ottoman's start to desire foreign manuscripts?
Correct Answer: 15th Century
BERT Predicted Answer: 15th Century
########################################################################################################
Correct Answers: 1
Question #3
What had to be evacuated due to potential flooding?
Correct Answer: Entire villages
BERT Predicted Answer: villages
########################################################################################################
Correct Answers: 2
Question #4
A mammal's brain is how many times larger than a birds relative to body size?
Correct Answer: twice as l

## BERT Benchmarking with Looping <a id='benchmarking_BERT_with_loop'></a>

In [62]:
import random
import re
import spacy
import nltk

for _ in range(100):
    record_questions = []
    record_correct_answer = []
    record_answers = []
    record_segments = []
    record_articles = []

    random_indices = list(random.sample(range(0, len(questions_BERT)), 1))

    sample_questions = []
    for idx in random_indices:
        sample_questions.append(questions_BERT[idx])

    # sample_questions = questions_BERT[0:2]

    test_rounds = len(sample_questions)
    correct_answers = 0

    #define the stopwords
    sp = spacy.load('en_core_web_sm')
    noncontext_words = sp.Defaults.stop_words
    # noncontext_words = ['the','i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

    #loop over the sample questions
    for question in sample_questions:
        BERT_answers = []
        print("####################################################################################")
        print(question)

        # Dan added dictionary to look up segment
        lookup_segment = {}
        segments = []
        answers_to_consider = []
        current = 0

        #get the correct answer to this question
        correct_answer = question_answers_BERT[questions_BERT.index(question)]
        print("Correct Answer: %s" % (correct_answer))
        record_correct_answer.append(correct_answer)

        #get the top k paragraphs
        candidate_articles, article_indices = Retrieve_Article(question,articles_BERT,10)
        record_articles.append(candidate_articles)

        #segment the chosen candidate article in "paragraphs"
        candidate_seg_articles = segment_documents(candidate_articles, max_doc_length=450)

        #return the answers from each of the top k paragraphs in descending order by relevancy
        for seg_idx, segment in enumerate(candidate_seg_articles):
            #BERT_prediction, start_idx, end_idx = Run_BERT(question, segment)
            BERT_prediction = Run_BERT(question, segment)

            BERT_answers.append(BERT_prediction)

            ## Dan added this for troubleshooting next part
            if '[CLS]' not in BERT_prediction:
                # store segment information
                lookup_segment[seg_idx] = current
                segments.append(segment)

                # store answer information extracted from text
                answers_to_consider.append(BERT_prediction)


                # increment index for dictionary
                current +=1
            #check to see if the return type is a string
            if(type(BERT_prediction)==str):
                #create lists of words for the predicted and the correct answers
                BERT_pred_list = re.split('\s+', BERT_prediction)
                BERT_true_list = re.split('\s+', correct_answer)

                BERT_pred_list_fix = []
                BERT_true_list_fix = []
                #remove the stop words in the lists
                for word in BERT_pred_list:
                    if(word not in noncontext_words):
                        BERT_pred_list_fix.append(word)

                #remove the stop words in the lists
                for word in BERT_true_list:
                    if(word not in noncontext_words):
                        BERT_true_list_fix.append(word)

                #check to see if any words in the prediction are in the answer
                true_ans_len = len(BERT_true_list_fix)
                num_matches = 0
                for word in BERT_pred_list_fix:
                    if(word in BERT_true_list_fix):
                        num_matches += 1

                if(true_ans_len==1):
                    if(num_matches==true_ans_len):
                        correct_answers += 1
                else:
                    if(num_matches>=round(0.5*true_ans_len)):
                        correct_answers += 1
        
        # BERT_accuracy = correct_answers/test_rounds
        # print(BERT_accuracy)
        
        record_questions.append(question)
        record_answers.append(answers_to_consider)
        record_segments.append(segments)

        final_answer = narrow_down_answers(question, segments, answers_to_consider)   
        print('BERT Answer:', final_answer)

####################################################################################
What reaction describes this process?
Correct Answer: Schikorr reaction
BERT Answer: process theologians view God as " the fellowsufferer who understands ", and as the being who issupremely affected by temporal events. Hartshorne points out that people would not praise a human ruler who was unaffected by either the joys orsorrows of his followers-so why would this be a praise- worthy quality in God ? Instead, as the being who is most affected by the world, God is the being who can most appropriately respond to the world. Montini and Angelo Roncalli were considered to be friends, but when Roncalli, as Pope John XXIII announced a new Ecumenical Council, Cardinal Montini reacted with disbelief
####################################################################################
In what year did Planck receive the Nobel Prize in Physics for his discovery of energy quanta?
Correct Answer: 1918
BERT Answer: 1