# Question Answering Engine

## 01: Dataset Creation

Among the text files in the [SimpleQuestions](https://github.com/askplatypus/wikidata-simplequestions) repository, only those whose names end in "answerable" are restricted to questions whose triples exist in Wikidata, so I used these files to create my dataset.

### Libraries

Importing the necessary libraries.

In [52]:
# Formats for datasets
import json
import csv

# Wikidata api
from wikidata.client import Client
from wikidata.entity import EntityId

# Preprocessing
from unidecode import unidecode
import difflib
import spacy
nlp = spacy.load('en_core_web_lg')

### Span Entity Recognition

Next I define the functions that detect the entity span, based on the entity label retrieved from the Wikidata identifier. I preprocess the data to remove accents, 's endings and question marks so that entities are easier to identify. I then use difflib to match the first and last words of the entity label against the words of the question, since I keep only those two indices for the span.

For most questions this process is straightforward, since the entity appears identically or with a small divergence; for example, the entity **jonathan rozen** appears as **yonatan rozen** in the question. For some questions, however, the entity and its mention are similar semantically rather than lexically; for example, in the question **Name a model that died due to a car accident** the entity mapped to it is **traffic collision**.
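
The lexical case can be sketched with the standard library alone. The helper names `normalize`, `closest_token` and `fuzzy_span` are hypothetical, and `unicodedata` stands in for `unidecode` here as an assumption, so characters without an ASCII decomposition are handled differently than in the notebook.

```python
import difflib
import unicodedata

def normalize(text):
    """Strip accents, question marks and 's endings (stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ascii_text.replace("?", "").replace("'s", "")

def closest_token(word, tokens):
    """Return the question token closest to `word`, or None if nothing is close."""
    matches = difflib.get_close_matches(word, tokens, n=1)
    return matches[0] if matches else None

def fuzzy_span(question, entity):
    """Locate the first and last entity words in the question by fuzzy matching."""
    tokens = normalize(question).split()
    words = normalize(entity).split()
    first = closest_token(words[0], tokens)
    last = closest_token(words[-1], tokens)
    if first is None or last is None:
        return None
    return tokens.index(first), tokens.index(last)

print(fuzzy_span("where did yonatan rozen die?", "jonathan rozen"))
print(fuzzy_span("name a model that died due to a car accident", "traffic collision"))
```

The first call finds the span even though the spelling differs, while the second returns `None` because no token is lexically close to "traffic", which is the case that needs a semantic fallback.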

Initially, I approached this problem by using Python libraries with synonyms to compare the words, but this wasn't successful. As the example above shows, car accident and traffic collision have similar meanings, but they are not synonyms. After experimenting with various APIs that capture semantic content, I decided to use the spaCy library, which performed best at this task. With it I rank every n-gram of the question (where n is the number of words in the entity label) by its similarity to the entity and keep the one with the highest score.
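
To make the ranking concrete without loading a language model, here is a minimal sketch that uses hand-made toy vectors. The vectors, the helper names and the 3-dimensional space are all hypothetical. In the notebook the score comes from spaCy's `Doc.similarity`, which by default is a cosine similarity over averaged word vectors, and that is what the sketch imitates.

```python
import math

# Hypothetical toy vectors; real spaCy vectors have hundreds of dimensions.
TOY_VECTORS = {
    "traffic": (1.0, 0.0, 0.0),
    "collision": (0.0, 1.0, 0.0),
    "car": (0.9, 0.1, 0.0),
    "accident": (0.1, 0.9, 0.0),
    "died": (0.0, 0.6, 0.4),
}
ZERO = (0.0, 0.0, 0.0)

def phrase_vector(words):
    """Average the word vectors of a phrase; unknown words count as zeros."""
    vectors = [TOY_VECTORS.get(word, ZERO) for word in words]
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def best_ngram(entity, tokens):
    """Return the question n-gram most similar to the entity, as a word list."""
    words = entity.split()
    n = len(words)
    if n == 0 or n > len(tokens):
        return None
    target = phrase_vector(words)
    ngrams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    return max(ngrams, key=lambda ngram: cosine(phrase_vector(ngram), target))

tokens = "name a model that died due to a car accident".split()
print(best_ngram("traffic collision", tokens))
```

With these toy vectors the bigram "car accident" ranks highest for "traffic collision", even though the two phrases share no words.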

In [76]:
# Finds the question n-gram that is most similar to the entity
def find_best_match(entity, question):

    # Parse the entity once instead of on every iteration
    entity_doc = nlp(entity)

    # Find similarity scores between entity and each ngram in question
    n = len(entity.split())
    ngram_scores = {}
    for i in range(len(question)-n+1):
        ngram = ' '.join(question[i:i+n])
        ngram_scores[ngram] = nlp(ngram).similarity(entity_doc)

    # Return the highest similarity score ngram as a list (first one on ties)
    if not ngram_scores:
        return None
    return max(ngram_scores, key=ngram_scores.get).split()

# Finds the entity span of an entity
def entity_span(question, entity):

    if entity == '' or question == '':
        return None

    # To locate the position in the question
    question_list = question.split()

    # Normalize both names in case there are accents
    norm_name_1 = unidecode(entity.split()[0])
    norm_name_2 = unidecode(entity.split()[-1])

    # Find the word with the closest match (eg yonatan for jonathan)
    norm_name_1 = difflib.get_close_matches(norm_name_1, question_list)
    norm_name_2 = difflib.get_close_matches(norm_name_2, question_list)

    # If no close matches perform similarity search (e.g. car accident for traffic collision)
    if not norm_name_1 or not norm_name_2:
        new_entity = find_best_match(entity, question_list)
        if new_entity is None:
            return None
        norm_name_1 = new_entity[0]
        norm_name_2 = new_entity[-1]
    else:
        norm_name_1 = norm_name_1[0]
        norm_name_2 = norm_name_2[0]

    # Beginning and end of span
    entity_start = question_list.index(norm_name_1)
    entity_end = question_list.index(norm_name_2)

    return entity_start, entity_end

### Dataset Ingestion

To perform the dataset ingestion I use the Wikidata client to retrieve the entity and relation labels from their identifiers. The datasets I create have the following format:

| entity_id | entity_label | relation_id | relation_label | inverse | question | answer_id | entity_start | entity_end |
|---|---|---|---|---|---|---|---|---|
| Q7358590 | roger marquis | P20 | place of death | False | Where did roger marquis die | Q1637790 | 2 | 3 |

The inverse column marks rows whose relation uses an Rxxx identifier, which encodes the inverse of the Wikidata property Pxxx. For example, R19 encodes the property "born here", i.e. the inverse of P19 ("birth place").
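
As a small illustration of this convention, the sketch below maps a relation identifier to the Wikidata property to look up, plus the inverse flag. The function name `split_relation` is hypothetical and not part of the notebook.

```python
def split_relation(relation_id):
    """Return (property_id, inverse) for a Pxxx or Rxxx relation identifier."""
    if relation_id.startswith("R"):
        # Rxxx is the inverse of Pxxx, so the label comes from the P identifier
        return "P" + relation_id[1:], True
    return relation_id, False

print(split_relation("R19"))
print(split_relation("P20"))
```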

I chose to represent the entity span by its start and end indices rather than as a binary label, since this is how I'll be doing the classification in the model. Finally, after extracting the data from the three sets, I save each of them in CSV and JSON formats, along with the dictionary that maps entity labels to identifiers and the relation vocabulary.
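
The two output formats can be compared on the example row from the table. This is a minimal in-memory sketch, assuming the field names used by the ingestion code; the helpers `to_json` and `to_csv` are hypothetical, and `csv.DictWriter` is used here for brevity where the notebook writes rows with `csv.writer`.

```python
import csv
import io
import json

FIELDS = ['entity_id', 'entity_label', 'relation_id', 'relation_label',
          'inverse', 'question', 'answer_id', 'entity_start', 'entity_end']

# The example row from the table above
records = [{
    'entity_id': 'Q7358590', 'entity_label': 'roger marquis',
    'relation_id': 'P20', 'relation_label': 'place of death',
    'inverse': False, 'question': 'Where did roger marquis die',
    'answer_id': 'Q1637790', 'entity_start': 2, 'entity_end': 3,
}]

def to_json(rows):
    """Serialize the rows as a JSON array, preserving booleans and integers."""
    return json.dumps(rows)

def to_csv(rows):
    """Serialize the rows as CSV text with a header row."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

print(to_json(records))
print(to_csv(records))
```

One practical difference: JSON keeps `inverse` as a boolean and the span indices as integers, whereas CSV stores every value as text, so they need converting back when the CSV is read.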

In [None]:
# The files ending with "_answerable" contain only triples that are also in Wikidata.
paths = [[ 'dataset/test_dataset.json','wikidata/annotated_wd_data_test_answerable.txt'],
         [ 'dataset/val_dataset.json','wikidata/annotated_wd_data_valid_answerable.txt'],
         [ 'dataset/train_dataset.json','wikidata/annotated_wd_data_train_answerable.txt']]

# Wikidata client session
client = Client()

relation_vocab = []
entity_dict = {}
counter = 0

# Loop over the input text files
for path in paths:
    
    # Dataset to keep everything
    dataset = []
    mydataset = path[0]
    wikidata = path[1]
    
    with open(wikidata, 'r', encoding='utf-8') as f:
        for line in f:
            counter+=1

            # Split the line to its parts
            parts = line.strip().split('\t')
            
            # Extract the answer and the preprocessed question
            answer = parts[2]
            question = unidecode(parts[3].replace("?", "").replace("'s", ""))

            # Extract the relation and replace the identifiers that encode the inverse property
            relation_id = parts[1]
            inverse = relation_id.startswith("R")
            if not (inverse or relation_id.startswith("P")):
                # Skip rows whose relation is neither a Pxxx nor an Rxxx identifier,
                # otherwise relation_label would be undefined or left over from a previous row
                continue
            try:
                # An Rxxx identifier is looked up through its forward property Pxxx
                rel = client.get(EntityId("P" + relation_id[1:]))
                relation_label = rel.label
            except Exception as e:
                print("Error: ", e, " at question: ", counter)
                # handle exceptions here by simply ignoring this row
                continue

            # Extract the entity
            entity_id = parts[0]
            try:
                ent = client.get(EntityId(entity_id))
                entity_label = str(ent.label).lower()
            except Exception as e:
                print("Error: ", e, " at question: ", counter)
                continue

            # Extract the entity span
            try:
                output = entity_span(question, entity_label)
                if output is not None:
                    entity_start, entity_end = output
                else:
                    continue
            except Exception as e:
                print("Error: ", e, " at question: ", counter)
                continue
            
            # Save the entity in the dictionary and relation in the vocab
            entity_dict[entity_label] = entity_id
            if relation_id not in relation_vocab:
                relation_vocab.append(relation_id)

            # Append the preprocessed data to the list
            dataset.append({
                'entity_id': entity_id,
                'entity_label': entity_label,
                'relation_id': relation_id,
                'relation_label': str(relation_label),
                'inverse': inverse,
                'question': question,
                'answer_id': answer,
                'entity_start': entity_start,
                'entity_end': entity_end
            })

    # Write the data to the corresponding JSON file
    with open(mydataset, 'w', encoding='utf-8') as f:
        json.dump(dataset, f)

    # Define the header row for the CSV file
    header = ['entity_id', 'entity_label', 'relation_id', 'relation_label', 'inverse', 'question', 'answer_id', 'entity_start', 'entity_end']

    # Create a new file object in write mode and specify the filename and encoding
    with open(mydataset.replace(".json",".csv"), mode='w', newline='', encoding='utf-8') as f:
        
        # Create a csv.writer object and write the header row
        writer = csv.writer(f)
        writer.writerow(header)
        
        # Loop through the dataset and write each dictionary as a row to the CSV file
        for data in dataset:
            row = [data['entity_id'], data['entity_label'], data['relation_id'], data['relation_label'], data['inverse'], data['question'], data['answer_id'], data['entity_start'], data['entity_end']]
            writer.writerow(row)

# Finally save the vocabulary to the corresponding JSON file
with open('dataset/relation_vocab.json', 'w', encoding='utf-8') as f:
    json.dump(relation_vocab, f)

# Create a new file object in write mode and specify the filename and encoding
with open('dataset/relation_vocab.csv', mode='w', newline='', encoding='utf-8') as f:

    # Write the list to a CSV file
    writer = csv.writer(f)
    writer.writerow(['Relation'])
    
    # Loop through the relation vocab
    for relation in relation_vocab:
        writer.writerow([relation])

# Write the dictionary to the corresponding JSON file
with open('dataset/entity_dict.json', 'w', encoding='utf-8') as f:
    json.dump(entity_dict, f)

# Create a new file object in write mode and specify the filename and encoding
with open('dataset/entity_dict.csv', mode='w', newline='', encoding='utf-8') as f:
    
    writer = csv.writer(f)
    writer.writerow(['Entity', 'Id'])
    for key, value in entity_dict.items():
        writer.writerow([key, value])