# Conventional Feature Based QA System

### Structure
- import libraries
- initialize data file paths
- functions for reading datasets
    - read MCTest dataset
    - read DREAM dataset
    - read RACE dataset
- read dataset
- helper functions for predicting answers
    - get highest-similarity choice
    - word tokenization
    - get synonyms of words
- predict MC answer for each question
- predict MC answers for the whole dataset
- main function (starting point)
- analysis

### Getting Started
- Install the required Python packages.
- Run the notebook in Jupyter.
- Set the data file paths. The defaults, relative to the script directory, are:
    - /datasets/MCTest/MCTest/ (MCTest test answers are read from /datasets/MCTest/MCTestAnswers/)
    - /datasets/DREAM/
    - /datasets/RACE/RACE/
- Results for each dataset are exported to:
    - /Stage 1 result/{dataset name}.csv

### Package used
python 3.8.6, os, sys, json, itertools
pandas 1.2.2, numpy 1.19.5, gensim 3.8.3, nltk 3.5


In [1]:
import pandas as pd
import numpy as np
import os
import sys          # sys.path[0] is used to build the dataset paths
import json
import gensim       # for similarity
from nltk import word_tokenize, WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet     # get synonyms
from itertools import chain

## Initialize data file path

In [2]:
# file path, dataset stored in "/datasets" in the script directory 
path = {
    "MC160":{
        "Train":{
            "Question": os.path.join(sys.path[0], "datasets", "MCTest/MCTest/mc160.train.tsv"), 
            "Answer": os.path.join(sys.path[0], "datasets", "MCTest/MCTest/mc160.train.ans")
        },
        "Dev":{
            "Question": os.path.join(sys.path[0], "datasets", "MCTest/MCTest/mc160.dev.tsv"), 
            "Answer": os.path.join(sys.path[0], "datasets", "MCTest/MCTest/mc160.dev.ans")
        },
        "Test":{
            "Question": os.path.join(sys.path[0], "datasets", "MCTest/MCTest/mc160.test.tsv"), 
            "Answer": os.path.join(sys.path[0], "datasets", "MCTest/MCTestAnswers/mc160.test.ans")
        }
    },
    "MC500":{
        "Train":{
            "Question": os.path.join(sys.path[0], "datasets", "MCTest/MCTest/mc500.train.tsv"), 
            "Answer": os.path.join(sys.path[0], "datasets", "MCTest/MCTest/mc500.train.ans")
        },
        "Dev":{
            "Question": os.path.join(sys.path[0], "datasets", "MCTest/MCTest/mc500.dev.tsv"), 
            "Answer": os.path.join(sys.path[0], "datasets", "MCTest/MCTest/mc500.dev.ans")
        },
        "Test":{
            "Question": os.path.join(sys.path[0], "datasets", "MCTest/MCTest/mc500.test.tsv"), 
            "Answer": os.path.join(sys.path[0], "datasets", "MCTest/MCTestAnswers/mc500.test.ans")
        }
    },
    "DREAM":{
        "Train": os.path.join(sys.path[0], "datasets", "DREAM/train.json"),
        "Dev": os.path.join(sys.path[0], "datasets", "DREAM/dev.json"),
        "Test": os.path.join(sys.path[0], "datasets", "DREAM/test.json")
    },
    "RACE":{
        "high": {
            "Train": os.path.join(sys.path[0], "datasets", "RACE/RACE/train/high"),
            "Dev": os.path.join(sys.path[0], "datasets", "RACE/RACE/dev/high"),
            "Test": os.path.join(sys.path[0], "datasets", "RACE/RACE/test/high")
        },
        "middle": {
            "Train": os.path.join(sys.path[0], "datasets", "RACE/RACE/train/middle"),
            "Dev": os.path.join(sys.path[0], "datasets", "RACE/RACE/dev/middle"),
            "Test": os.path.join(sys.path[0], "datasets", "RACE/RACE/test/middle")
        }
    }
}
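
Before reading any data, it can help to confirm that every configured location exists. The sketch below walks a nested dictionary shaped like `path` and lists the entries that are missing on disk. The helper names `flattenPaths` and `missingPaths` and the toy configuration are hypothetical and not part of the original notebook.

```python
import os

def flattenPaths(tree, prefix=()):
    """Yield (key tuple, path string) pairs from a nested dict of paths."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from flattenPaths(value, prefix + (key,))
        else:
            yield prefix + (key,), value

def missingPaths(tree):
    """Return the key tuples whose file or directory does not exist."""
    return [keys for keys, p in flattenPaths(tree) if not os.path.exists(p)]

# toy configuration (hypothetical locations), shaped like the `path` dict
examplePaths = {
    "DREAM": {"Dev": os.path.join("no_such_dir", "dev.json")},
    "HERE": {"Dir": os.curdir},
}
print(missingPaths(examplePaths))
```

Calling `missingPaths(path)` on the real configuration would report which dataset files still need to be downloaded or moved.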

## Read dataset file

In [3]:
def readMCTest(questionPath, answerPath):
    question = pd.read_csv(questionPath,
        sep='\t',
        header=None,
        names=["id", "properties", "article",
               "q0", "q0_c0", "q0_c1", "q0_c2", "q0_c3",
               "q1", "q1_c0", "q1_c1", "q1_c2", "q1_c3",
               "q2", "q2_c0", "q2_c1", "q2_c2", "q2_c3",
               "q3", "q3_c0", "q3_c1", "q3_c2", "q3_c3",])
    answer = pd.read_csv(answerPath,
            sep='\t',
            header=None,
            names=['q0_ans', 'q1_ans', 'q2_ans', 'q3_ans', ])

    # pre-processing 
    dataset = []
    for index, row in question.iterrows():
        # for each story
        for i in range(4):
            # for each question
            temp = {}
            temp["article"] = row["article"].replace("\\newline", " ")  # replace literal "\newline" markers with spaces
            # question field looks like "<one|multiple>: <question text>"; split on the first colon only
            sentenceType, questionText = row[f"q{i}"].split(":", 1)
            temp["question"] = questionText.strip()
            temp["answer sentence type"] = sentenceType.strip()
            for j in range(4):
                temp[f"choice {j}"] = row[f"q{i}_c{j}"]

            # answer choice = A/B/C/D, answer index = 0/1/2/3, answer = answer in string format
            temp["answer choice"] = answer.iloc[index][f"q{i}_ans"]
            temp["answer index"] = ord(temp["answer choice"]) - ord("A")      # from "A" to 0
            temp["answer"] = temp[f"choice {temp['answer index']}"]

            dataset.append(temp)

    return pd.DataFrame(dataset)
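
As a small illustration of the two conversions `readMCTest` performs, the sketch below splits a raw MCTest question field into its sentence type and question text, and maps an answer letter to a zero-based index. The function names `parseQuestion` and `choiceToIndex` are hypothetical; the sample string follows the format seen in the quick-view output of this notebook.

```python
def parseQuestion(rawQuestion):
    """Split '<type>: <question text>' into (type, question text)."""
    # partition splits on the first colon only, so a colon inside the
    # question text is preserved
    sentenceType, _, text = rawQuestion.partition(":")
    return sentenceType.strip(), text.strip()

def choiceToIndex(letter):
    """Map 'A'/'B'/'C'/'D' to 0/1/2/3."""
    return ord(letter) - ord("A")

print(parseQuestion("multiple: Who didn't come to the party?"))
print(choiceToIndex("C"))
```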

In [4]:
def readDREAM(docPath):
    with open(docPath) as f:
        data = json.load(f)

    dataset = []
    for story in data:
        temp = {}

        # pre-processing of article
        # for utterances spoken by "M:", add the prefix "Men:" to every sentence
        # for utterances spoken by "W:", add the prefix "Woman:" to every sentence
        temp["article"] = ""
        for sentence in story[0]:
            if "M:" in sentence:
                sentence = sentence.replace("M: ", "")
                for sent in sent_tokenize(sentence):
                    temp["article"] += "Men: " + sent + " "
            elif "W:" in sentence:
                sentence = sentence.replace("W: ", "")
                for sent in sent_tokenize(sentence):
                    temp["article"] += "Woman: " + sent + " "
            else:
                temp["article"] += sentence

        # note: only the first question of each dialogue is used
        temp["question"] = story[1][0]["question"]
        for i in range(len(story[1][0]["choice"])):
            temp[f"choice {i}"] = story[1][0]["choice"][i]

        # answer choice = A/B/C/D, answer index = 0/1/2/3, answer = answer in string format
        temp["answer"] = story[1][0]["answer"]
        for i in range(len(story[1][0]["choice"])):
            if story[1][0]["choice"][i] == story[1][0]["answer"]:
                temp["answer choice"] = chr(i + 65)      # from 0 to "A"
                temp["answer index"] = i
                break
        
        dataset.append(temp)

    return pd.DataFrame(dataset)

In [5]:
def readRACE(docPath):
    dataset = []

    for filename in os.listdir(docPath):
        with open(os.path.join(docPath, filename), 'r') as f: # open in readonly mode
            story = json.load(f)
        
        article = story["article"]

        for i in range(len(story["questions"])):
            # one row per question, each carrying the full article
            temp = {"article": article}

            temp["question"] = story["questions"][i]
            for j in range(len(story["options"][i])):
                temp[f"choice {j}"] = story["options"][i][j]
                
            # answer choice = A/B/C/D, answer index = 0/1/2/3, answer = answer in string format
            temp["answer index"] = ord(story["answers"][i]) - 65      # from "A" to 0
            temp["answer"] = temp[f"choice {temp['answer index']}"]
            temp["answer choice"] = story["answers"][i]

            dataset.append(temp)

    return pd.DataFrame(dataset)

In [6]:
def getDataset(dSet, purpose):
    """
    Inputs:
        dSet[String] = MC160 / MC500 / DREAM / RACE-middle / RACE-high
        purpose[String] = Train / Test / Dev
    Return:
        dataset[pd.DataFrame]: dataset from selected data file
    """
    print("{} {}".format(dSet, purpose))
    if dSet == "MC160" or dSet == "MC500":
        pathQuestion = path[dSet][purpose]["Question"]
        pathAnswer = path[dSet][purpose]["Answer"]

        dataset = readMCTest(questionPath=pathQuestion, answerPath=pathAnswer)
    elif dSet == "DREAM":
        dataset = readDREAM(path[dSet][purpose])
    elif dSet == "RACE-high":
        dataset = readRACE(path["RACE"]["high"][purpose])
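    elif dSet == "RACE-middle":
        dataset = readRACE(path["RACE"]["middle"][purpose])
    else:
        # fail early instead of returning an unbound variable
        raise ValueError(f"Unknown dataset name: {dSet}")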

    return dataset

# for quick view only
dataset = getDataset("MC160", "Dev")
print(dataset.shape)
dataset.head(2)

MC160 Dev
(120, 10)


,article,question,answer sentence type,choice 0,choice 1,choice 2,choice 3,answer choice,answer index,answer
0,It was Jessie Bear's birthday. She was having ...,Who was having a birthday?,one,Jessie Bear,no one,Lion,Tiger,A,0,Jessie Bear
1,It was Jessie Bear's birthday. She was having ...,Who didn't come to the party?,multiple,Lion,Tiger,Snake,Jessie Bear,C,2,Snake


## Helper functions for prediction

In [7]:
def getClosestSentence(choices, answerString):
    """
    Find the choice most similar to a given query (TF-IDF cosine similarity)
    Inputs:
        choices[List]: list of sentences, used as the documents
        answerString[String]: sentence, used as the query
    Return:
        index[Int]: index of the choice with the highest similarity
    """
    # for choices
    gen_docs = []   # 1 item = 1 choice
    
    # word lemmatize, to lower case, and filter stopword of each choice
    for choice in choices:
        gen_docs.append([w.lower() for w in wordTokenize(choice)])
    
    # dictionary of choices, convert to BOW for each choice
    dictionary = gensim.corpora.Dictionary(gen_docs)
    corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
    
    # build TFIDF model
    tfidf = gensim.models.TfidfModel(corpus)

    # build similarity model, using TFIDF of choices
    sims = gensim.similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))


    # for answerString (the query)
    # lemmatize, lower-case, and filter stopwords of the query
    tokenizedAnswer = wordTokenize(answerString.lower())

    # convert to BOW
    bowAnswer = dictionary.doc2bow(tokenizedAnswer)

    # convert to TFIDF
    tfidfAnswer = tfidf[bowAnswer]

    # get similarity, select argmax, return closest sentence index
    return np.argmax(sims[tfidfAnswer])
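
To make the ranking step easier to follow without gensim, here is a minimal pure-Python sketch of the same idea: weight tokens by TF-IDF, L2-normalize, and pick the document with the highest cosine similarity to the query. It assumes an inverse document frequency of `log2(N / df)`, which is meant to approximate gensim's default weighting and is not guaranteed to reproduce its scores exactly. All function names here are hypothetical.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """L2-normalized TF-IDF weights for one token list; unknown tokens are dropped."""
    tf = Counter(tok for tok in tokens if tok in idf)
    vec = {tok: freq * idf[tok] for tok, freq in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}

def closestIndex(docs, query):
    """Return (index of best document, list of cosine scores)."""
    n = len(docs)
    # document frequency: in how many documents each token occurs
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log2(n / count) for tok, count in df.items()}
    vectors = [weigh(doc, idf) for doc in docs]
    q = weigh(query, idf)
    scores = [sum(w * q.get(tok, 0.0) for tok, w in vec.items()) for vec in vectors]
    # on ties the first index wins, as with np.argmax
    return max(range(len(scores)), key=scores.__getitem__), scores

choices = [["lion"], ["tiger"], ["snake"], ["jessie", "bear"]]
index, scores = closestIndex(choices, ["snake", "come", "party"])
print(index, scores)
```

One consequence worth noting: when the query shares no token with any document, every score is zero and index 0 is returned, so choice A acts as the fallback prediction.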

In [8]:
wnl = WordNetLemmatizer()
stopWords = set(stopwords.words("english")) 

def wordTokenize(sentence):
    """
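    Tokenize a sentence, drop stopwords, then lower-case and lemmatize each remaining token
    Input:
        sentence[String]: a sentence
    Return:
        tokenized word[List]: list of lemmatized tokens
    """
    # note: the stopword test runs on the original casing, so capitalized stopwords such as "The" are kept
    return [wnl.lemmatize(w.lower()) for w in word_tokenize(sentence) if w not in stopWords]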
        
def getSynonyms(lemmatized):
    """
    Get synonyms of words from NLTK WordNet
    Input:
        lemmatized[List]: list of tokenized word
    Return:
        tempString[String]: all synonyms of all input words, joined into one space-separated string
    """
    tempString = ""
    for word in lemmatized:
        # collect the lemma names of every synset the word belongs to
        synsets = wordnet.synsets(word)
        lemmas = set(chain.from_iterable(synset.lemma_names() for synset in synsets))
        if lemmas:
            # word has synonyms
            tempString += " ".join(lemmas) + " "
        else:
            # word has no WordNet entry (e.g. wh-words, stopwords): keep the word itself
            tempString += word + " "

    return tempString
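
The point of synonym expansion is that a question and an article sentence can overlap even when they use different words. The sketch below shows this with a tiny hand-written synonym table instead of WordNet. The table `toySynonyms` and the function `expand` are hypothetical stand-ins for illustration only.

```python
# hypothetical synonym table standing in for WordNet lookups
toySynonyms = {
    "begin": {"begin", "start", "commence"},
    "start": {"start", "begin", "commence"},
}

def expand(tokens, synonyms):
    """Replace each token by its synonym set, or keep the token if it has none."""
    parts = []
    for tok in tokens:
        parts.extend(sorted(synonyms.get(tok, {tok})))
    return " ".join(parts)

questionTokens = ["time", "party", "start"]
sentenceTokens = ["party", "begin", "one"]

plainShared = set(questionTokens) & set(sentenceTokens)
expandedShared = set(expand(questionTokens, toySynonyms).split()) & set(expand(sentenceTokens, toySynonyms).split())

print(sorted(plainShared))
print(sorted(expandedShared))
```

Without expansion the two token lists share only `party`; after expansion they also share the synonyms of `start` and `begin`, which raises their similarity score.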

## Main functions for prediction

In [9]:
def predictMC(article, question, options, answerSentenceType=""):
    """
    Predict answer of each question
    Inputs:
        article[String]
        question[String]
        options[List]: MC choices
        answerSentenceType[String]: "one"/"multiple", only usable for MCTest
    Return:
        closestOption[Int]: predicted answer in index
    """
    # article
    sentences = sent_tokenize(article)      # from article to list of sentences
    synonymsSentences = []
    for sent in sentences:
        lemmatizedSent = wordTokenize(sent)
        synonymsSent = getSynonyms(lemmatizedSent)
        if synonymsSent:
            # if sentence has synonyms
            synonymsSentences.append(synonymsSent)
    
    # question
    lemmatizedQuestion = wordTokenize(question)
    synonymsQuestion = getSynonyms(lemmatizedQuestion)

    # get closest sentence(s) in article given the question
    closestSentenceIndex = getClosestSentence(synonymsSentences, synonymsQuestion)
    if (answerSentenceType == "one"):
        # for MCTest, only get 1 sentence if answer sentence type = "one"
        closestSentence = synonymsSentences[closestSentenceIndex]
    else:
        # otherwise use the closest sentence plus 2 sentences on each side (indices clamped to the article bounds)
        closestSentence = synonymsSentences[max(closestSentenceIndex - 2, 0)]
        closestSentence += synonymsSentences[max(closestSentenceIndex - 1, 0)]
        closestSentence += synonymsSentences[closestSentenceIndex]
        closestSentence += synonymsSentences[min(closestSentenceIndex + 1, len(synonymsSentences)-1)]
        closestSentence += synonymsSentences[min(closestSentenceIndex + 2, len(synonymsSentences)-1)]
    
    # options
    synonymsOptions = []
    for option in options:
        # for each option
        tempLemmatizedOption = wordTokenize(option)
        tempSynonymsOption = getSynonyms(tempLemmatizedOption)
        synonymsOptions.append(tempSynonymsOption)

    # get closest answer in choices given the closest sentence(s) in article
    closestOption = getClosestSentence(synonymsOptions, closestSentence)
    
    return closestOption
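
The context window used in `predictMC` clamps out-of-range indices to the first or last sentence, so near the start or end of an article the boundary sentence is included more than once. The sketch below reproduces that index selection and, for comparison, a variant without repeats. Both function names are hypothetical, and `uniqueWindow` is an alternative shown for contrast, not what the notebook does.

```python
def contextWindow(center, length, radius=2):
    """Indices center-radius .. center+radius, each clamped to [0, length - 1]."""
    return [min(max(center + offset, 0), length - 1)
            for offset in range(-radius, radius + 1)]

def uniqueWindow(center, length, radius=2):
    """Same span, but truncated at the bounds instead of repeating them."""
    return list(range(max(center - radius, 0), min(center + radius, length - 1) + 1))

print(contextWindow(0, 6))   # window around the first of 6 sentences
print(contextWindow(3, 6))   # window in the middle of the article
print(uniqueWindow(0, 6))    # truncated variant for comparison
```

Repeating a boundary sentence raises the term frequency of its words in the query passed to `getClosestSentence`, which may bias the option ranking toward that sentence.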

In [10]:
def predict(dataset):
    """
    Predict answer of whole dataset
    Input:
        dataset[pd.DataFrame]
    Return:
        accuracy info.[Dict]
        dataset[pd.DataFrame]: with the predicted answer as new column 
    """
    # for accuracy calculation purpose
    count = 0
    correct = 0
    wrong = 0

    for index, row in dataset.iterrows():
        # for each question

        # get MC options of question
        options = row[dataset.columns[dataset.columns.str.startswith('choice')]].tolist()

        # predict answer by article, question, choices
        if "answer sentence type" in row:
            # for MCTest
            predictedAnswer = predictMC(row["article"], row["question"], options, row["answer sentence type"])
        else:
            # for non MCTest
            predictedAnswer = predictMC(row["article"], row["question"], options)

        # store the predicted answer in the dataset
        dataset.loc[index,'predicted answer'] = predictedAnswer

        # for accuracy calculation purpose
        if(row["answer index"] == predictedAnswer):
            correct += 1
        else:
            wrong += 1
        count += 1
        
        # progress report every 100 questions
        if count % 100 == 0:
            print(f"correct= {correct}  wrong= {wrong}  count= {count}  accuracy= {correct/count}")
    # final summary
    print(f"correct= {correct}  wrong= {wrong}  count= {count}  accuracy= {correct/count}")

    return {"correct":correct, "wrong":wrong, "count":count, "accuracy": correct/count}, dataset


## Main function (starting point)

In [11]:
# choose the split to evaluate: "Train", "Dev" or "Test"
purpose = "Test"

dSetList = ["MC160", "MC500", "DREAM", "RACE-middle", "RACE-high"]

accuracy = []
dataset = {}

for dSet in dSetList:
    dataset[dSet] = getDataset(dSet, purpose)
    temp, dataset[dSet] = predict(dataset[dSet])    # temp = accuracy info of this dataset
    temp["dSet"] = dSet
    temp["purpose"] = purpose
    accuracy.append(temp)

MC160 Test
correct= 48  wrong= 52  count= 100  accuracy= 0.48
correct= 111  wrong= 89  count= 200  accuracy= 0.555
correct= 131  wrong= 109  count= 240  accuracy= 0.5458333333333333
MC500 Test
correct= 53  wrong= 47  count= 100  accuracy= 0.53
correct= 102  wrong= 98  count= 200  accuracy= 0.51
correct= 153  wrong= 147  count= 300  accuracy= 0.51
correct= 203  wrong= 197  count= 400  accuracy= 0.5075
correct= 249  wrong= 251  count= 500  accuracy= 0.498
correct= 307  wrong= 293  count= 600  accuracy= 0.5116666666666667
correct= 307  wrong= 293  count= 600  accuracy= 0.5116666666666667
DREAM Test
correct= 30  wrong= 70  count= 100  accuracy= 0.3
correct= 61  wrong= 139  count= 200  accuracy= 0.305
correct= 95  wrong= 205  count= 300  accuracy= 0.31666666666666665
correct= 142  wrong= 258  count= 400  accuracy= 0.355
correct= 178  wrong= 322  count= 500  accuracy= 0.356
correct= 225  wrong= 375  count= 600  accuracy= 0.375
correct= 274  wrong= 426  count= 700  accuracy= 0.391428571428571

## Analysis

In [12]:
# performance
pd.DataFrame(accuracy).set_index(["dSet", "purpose"])

dSet,purpose,correct,wrong,count,accuracy
MC160,Test,131,109,240,0.545833
MC500,Test,307,293,600,0.511667
DREAM,Test,506,781,1287,0.393162
RACE-middle,Test,520,916,1436,0.362117
RACE-high,Test,1078,2420,3498,0.308176
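
To put these accuracies in context, they can be compared with random guessing. The sketch below recomputes accuracy from the `correct` and `count` columns of the table and subtracts the chance level. The number of answer choices per question (4 for MCTest and RACE, 3 for DREAM) is an assumption based on each dataset's format, and the function name `summarize` is hypothetical.

```python
# (correct, count) are copied from the performance table; the number of
# answer choices per question is an assumption based on each dataset's format
results = {
    "MC160": (131, 240, 4),
    "MC500": (307, 600, 4),
    "DREAM": (506, 1287, 3),
    "RACE-middle": (520, 1436, 4),
    "RACE-high": (1078, 3498, 4),
}

def summarize(table):
    """Compute accuracy, chance level, and their difference for each dataset."""
    rows = {}
    for name, (correct, count, numChoices) in table.items():
        accuracy = correct / count
        chance = 1 / numChoices
        rows[name] = {"accuracy": accuracy, "chance": chance, "lift": accuracy - chance}
    return rows

rows = summarize(results)
for name, row in rows.items():
    print(f"{name:12s} accuracy={row['accuracy']:.3f} chance={row['chance']:.3f} lift={row['lift']:+.3f}")
```

Under these assumptions every dataset is above chance, with the margin largest on MCTest and smallest on DREAM and RACE-high.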


In [13]:
# export result to csv
os.makedirs("Stage 1 result", exist_ok=True)    # create the output directory if it is missing
for ds in dataset:
    dataset[ds].to_csv(f"Stage 1 result/{ds}.csv", index=False)


In [14]:
# for MCTest only: count wrong and correct predictions per answer sentence type
# (the notebook displays only the last expression, i.e. the correct predictions)
mc500 = dataset["MC500"]
isCorrect = mc500["answer index"] == mc500["predicted answer"]

mc500[~isCorrect].groupby(["answer sentence type"]).count()

mc500[isCorrect].groupby(["answer sentence type"]).count()

answer sentence type,article,question,choice 0,choice 1,choice 2,choice 3,answer choice,answer index,answer,predicted answer
multiple,152,152,152,152,152,152,152,152,152,152
one,155,155,155,155,155,155,155,155,155,155


In [15]:
dataset["MC500"]

,article,question,answer sentence type,choice 0,choice 1,choice 2,choice 3,answer choice,answer index,answer,predicted answer
0,It was Sally's birthday. She was very excited....,What time did the party start?,one,10,2,11,1,D,3,1,0.0
1,It was Sally's birthday. She was very excited....,Who got hurt at the party?,multiple,Erin and Jennifer,Cathy and Erin,Jennifer and Sally,Erin and Sally,C,2,Jennifer and Sally,3.0
2,It was Sally's birthday. She was very excited....,Whose birthday is it?,one,Cathy,Jessica,Sally,Jennifer,C,2,Sally,2.0
3,It was Sally's birthday. She was very excited....,What time did Jennifer arrive to the party?,multiple,1,2,8,10,B,1,2,0.0
4,On the farm there was a little piggy named And...,What did the piggies do when Andy got back fr...,multiple,play games and eat dinner,play in the mud and go for a walk,swim in the river and play games,go for a walk and look at flowers,A,0,play games and eat dinner,3.0
...,...,...,...,...,...,...,...,...,...,...,...
595,Greg and his mother were building a racing car...,Where was the race happening?,multiple,At the park.,On the track near his school.,In a river.,In their backyard.,B,1,On the track near his school.,1.0
596,Joey went to a baseball game during the winter...,Who went to the baseball game and with how ma...,multiple,"Joey, nobody.","Mark, nobody","Sam, two others","Joey, three others.",A,0,"Joey, nobody.",1.0
597,Joey went to a baseball game during the winter...,what kind of store did Joey turn into?,one,Garden store,Grocery store,Car store,Coffee store,D,3,Coffee store,3.0
598,Joey went to a baseball game during the winter...,Which team won the game Joey went to and by h...,multiple,"Home team, by two runs.","Away team, by one run","Away team, by two runs.","Home team, by one run.",D,3,"Home team, by one run.",3.0
