# IR Dataset (Quasar-t and Quasar-s) Preparation and Preprocessing

## Credit for Dataset

@article{dhingra2017quasar,
  title={Quasar: Datasets for Question Answering by Search and Reading},
  author={Dhingra, Bhuwan and Mazaitis, Kathryn and Cohen, William W},
  journal={arXiv preprint arXiv:1707.03904},
  year={2017}
}

## Importing Pandas and Numpy for basic preprocessing, cleaning, and pickling

In [1]:
import pandas as pd
import numpy as np
import json
import csv

## Data Preparation

### The Quasar-t dataset consists of question-answer pairs from popular trivia websites, with context passages retrieved from the ClueWeb09 web corpus

In [2]:
quasar_t_train = ['train_questions_qt.json', 'train_contexts_qt_short.json', 'train_contexts_qt_long.json']
quasar_t_dev = ['dev_questions_qt.json', 'dev_contexts_qt_short.json', 'dev_contexts_qt_long.json']
quasar_t_test = ['test_questions_qt.json', 'test_contexts_qt_short.json', 'test_contexts_qt_long.json']

### Creating Dataframes from JSON files for Train, Dev, and Test sets accordingly

We use a pandas JSON reader object to read each file in chunks, append every chunk to a list, and then concatenate the chunks into a single unified dataframe.

In [3]:
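def json_to_df_chunker(file):
    """Read a JSON Lines file in chunks and return one combined DataFrame."""
    chunk_value = 6000  # rows per chunk
    # With lines=True and a chunksize, read_json returns an iterator of DataFrames
    reader = pd.read_json(file, typ='frame', dtype='dict', lines=True, chunksize=chunk_value)
    # Collect the chunks, then stitch them together with a fresh 0..n-1 index
    temp_dfs = [chunk for chunk in reader]
    return pd.concat(temp_dfs, ignore_index=True)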
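
To see the chunked-reading pattern in isolation, here is a small sketch that writes a few made-up JSON Lines records to a temporary file and reads them back two rows at a time. The records, the `demo_path` name, and the chunk size are assumptions for illustration only, not part of the Quasar data.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up records, one JSON object per line (the layout the question files are read with)
records = [
    {"uid": f"q{i}", "question": f"question {i}", "answer": f"answer {i}"}
    for i in range(5)
]

# Write them to a temporary JSON Lines file (hypothetical path, removed at the end)
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
    demo_path = tmp.name

# lines=True plus chunksize makes read_json return an iterator of DataFrames
reader = pd.read_json(demo_path, lines=True, chunksize=2)

chunks = []
chunk_sizes = []
for chunk in reader:
    chunks.append(chunk)
    chunk_sizes.append(len(chunk))

# ignore_index=True gives the combined frame a clean 0..n-1 index
demo_df = pd.concat(chunks, ignore_index=True)
os.remove(demo_path)

print(chunk_sizes)
print(demo_df)
```

Reading in chunks keeps only one slice of the file in the parser at a time, which is the reason for the pattern on the larger files.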

Shruthi, our TA, helped us with this code, as we were stuck and confused by the "squashed" columns in the contexts JSON files: each record holds a `uid` and a nested list of (score, context) pairs. The workaround was to dump the JSON data into a CSV with the csv module, writing one row per pair, and then use the cleaned CSV as the input to a pandas dataframe.
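
To make the flattening step concrete, here is a minimal sketch on a single made-up record. The `uid` value and passages are invented; the only thing taken from the real files is the layout, a `uid` plus a `contexts` list of [score, context] pairs.

```python
import csv
import io
import json

# One made-up line in the assumed layout of a contexts file
line = json.dumps({
    "uid": "demo-uid",
    "contexts": [[2.5, "first passage"], [1.0, "second, with a comma"]],
})

record = json.loads(line)

# One flat (uid, score, context) row per nested pair
rows = [(record["uid"], score, context) for score, context in record["contexts"]]

# Write to an in-memory buffer instead of a file so the result is easy to inspect
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["uid", "score", "context"])
writer.writerows(rows)

print(buffer.getvalue())
```

With `QUOTE_MINIMAL`, only fields that contain the delimiter or a quote character get wrapped in quotes, so passages with commas still come back as a single column when the CSV is read.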

In [4]:
def json_to_df_contexts(file, save_file):
    """Flatten a contexts JSON Lines file into '<save_file>.csv' (returns None)."""
    # newline='' stops the csv module from writing blank rows on Windows
    with open(file, "r") as read_file, open(f'{save_file}.csv', mode='w', newline='') as write_file:
        write_csv = csv.writer(write_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        write_csv.writerow(['uid', 'score', 'context'])
        for line in read_file:
            data = json.loads(line)
            uid = data['uid']
            # Each entry in 'contexts' is a (score, context) pair; write one row per pair
            for c in data['contexts']:
                list_c = list(c)
                write_csv.writerow([uid, list_c[0], list_c[1]])

In [5]:
save_file = "qt_context_long"
json_to_df_contexts(quasar_t_train[2], save_file)  # writes qt_context_long.csv; the function returns nothing

In [6]:
qt_train_context_long = pd.read_csv('qt_context_long.csv')

In [7]:
qt_train_context_long.head(25)

Unnamed: 0,uid,score,context
0,s3q8053,5.811907,Backgammon FAQ : Different Ways of Playing Bac...
1,s3q8053,5.488704,Backgammon Rules - How to Play Backgammon Navi...
2,s3q8053,5.472953,Backgammon Rules Backgammon Backgammon Home Ba...
3,s3q8053,5.097101,Backgammon FAQ Home Backgammon Articles Backga...
4,s3q8053,4.011511,The Rules of How to Play Backgammon Back to 1o...
5,s3q8053,3.888346,Best Backgammon Opening Moves Backgammon Backg...
6,s3q8053,3.673027,Backgammon rules < a href = `` http://www.back...
7,s3q8053,3.448444,Backgammon Rules . How to play Backgammon Onli...
8,s3q8053,3.36505,Special Backgammon Top Backgammon Sites Site O...
9,s3q8053,3.346089,Buy backgammon sets and backgammon boards at t...


In [8]:
qt_train_context_long.iloc[0, 2]

'Backgammon FAQ : Different Ways of Playing Backgammon FAQ Different Ways of Playing Tables and Backgammon What is tables ? How is backgammon different from the other games of tables ? Does backgammon have official rules ? Backgammon Variants What is Nackgammon ? What is hyper-backgammon ? What is long-gammon ? What is roll-over ? What is backgammon-to-lose ? Acey-Deucey What is acey-deucey ? How do you play American acey-deucey ? How do you play European acey-deucey ? Greek Backgammon What is tavli ? How do you play portes ? How do you play plakoto ? How do you play fevga ? Other Games What is trictrac ? What is Russian backgammon ? What is French backgammon ? What is Dutch backgammon ? What is snake ? Forms of Competition What is money play ? What is match play ? What is a freeze-out match ? What is duplicate backgammon ? Table Stakes What is table stakes betting ? Why is table stakes used ? How does strategy in table stakes differ from unlimited money play ? Chouette What is a choue

### Sweet! Love to see that. Now that it's all cleaned up and we have the context corpus isolated for Quasar-t, we can go ahead and pickle the dataframes to distribute to teammates for our own experimentation with NLP preprocessing

In [9]:
qt_train_questions = json_to_df_chunker(quasar_t_train[0])

In [10]:
qt_train_questions.tail()

Unnamed: 0,answer,question,uid,tags
37007,ohio,Hang On Sloopy ' was the official rock song of...,s3q30425,"[1tok, yes-answer-long, yes-answer-short]"
37008,a commodore pet,Name the first self contained home computer -,s3q30420,[]
37009,ku klux klan,This racist organisation was formed in Tenness...,s3q30421,"[yes-answer-long, yes-answer-short]"
37010,wyoming usa,Where is the Devil 's Tower,s3q30422,[yes-answer-long]
37011,iran,In What Country Did The Rather Prestigious Spo...,s3q30423,"[1tok, yes-answer-long, yes-answer-short]"


In [11]:
qt_train_questions.iloc[2, 1]

'Which Scottish Golfer Was Captain Of Europes 2002 Ryder Cup Team'

In [12]:
qt_train_questions.head()

Unnamed: 0,answer,question,uid,tags
0,24,How many points does a backgammon board have,s3q8053,"[1tok, yes-answer-long, yes-answer-short]"
1,sherlock holmes,Whose cases were Empty House Copper Beeches Bl...,s3q33199,"[yes-answer-long, yes-answer-short]"
2,sam torrance,Which Scottish Golfer Was Captain Of Europes 2...,s3q33198,[]
3,first quarter,What is a two-bit moon,s3q33194,"[yes-answer-long, yes-answer-short]"
4,nissan,The `` Maxima '' was a model of which car,s3q33197,"[1tok, yes-answer-long, yes-answer-short]"


In [13]:
qt_train_questions.to_pickle("qt_train_questions.pickle")

In [None]:
qt_train_q_2 = pd.read_pickle("qt_train_questions.pickle")

In [15]:
qt_train_q_2.tail()

Unnamed: 0,answer,question,uid,tags
37007,ohio,Hang On Sloopy ' was the official rock song of...,s3q30425,"[1tok, yes-answer-long, yes-answer-short]"
37008,a commodore pet,Name the first self contained home computer -,s3q30420,[]
37009,ku klux klan,This racist organisation was formed in Tenness...,s3q30421,"[yes-answer-long, yes-answer-short]"
37010,wyoming usa,Where is the Devil 's Tower,s3q30422,[yes-answer-long]
37011,iran,In What Country Did The Rather Prestigious Spo...,s3q30423,"[1tok, yes-answer-long, yes-answer-short]"


In [9]:
qt_train_context_long.to_pickle('qt_train_context_long.pickle')

In [36]:
qt_dev_questions = json_to_df_chunker(quasar_t_dev[0])
qt_test_questions = json_to_df_chunker(quasar_t_test[0])
qt_dev_questions.to_pickle("qt_dev_questions.pickle")
qt_test_questions.to_pickle("qt_test_questions.pickle")

In [11]:
#save_file = 'qt_t_context_short'
#qt_t_context_short = json_to_df_contexts(quasar_t_train[1], save_file)
qt_train_context_short = pd.read_csv('qt_t_context_short.csv')
qt_train_context_short.to_pickle('qt_train_context_short.pickle')

In [13]:
#save_file = "qt_d_context_short"
#json_to_df_contexts(quasar_t_dev[1], save_file)
qt_dev_context_short = pd.read_csv('qt_d_context_short.csv')
qt_dev_context_short.to_pickle('qt_dev_context_short.pickle')

In [14]:
save_file = "qt_dev_context_long"
json_to_df_contexts(quasar_t_dev[2], save_file)
qt_dev_context_long = pd.read_csv('qt_dev_context_long.csv')
qt_dev_context_long.to_pickle('qt_dev_context_long.pickle')

In [15]:
save_file = "qt_test_context_short"
json_to_df_contexts(quasar_t_test[1], save_file)
qt_test_context_short = pd.read_csv('qt_test_context_short.csv')
qt_test_context_short.to_pickle('qt_test_context_short.pickle')

In [16]:
save_file = "qt_test_context_long"
json_to_df_contexts(quasar_t_test[2], save_file)
qt_test_context_long = pd.read_csv('qt_test_context_long.csv')
qt_test_context_long.to_pickle('qt_test_context_long.pickle')

In [18]:
df = pd.read_pickle("qt_dev_context_long.pickle")
df.head()

Unnamed: 0,uid,score,context
0,s3q1674,5.322954,Tripedia Information - Drugs and Treatments - ...
1,s3q1674,5.322954,Acel-Imune Information - Drugs and Treatments ...
2,s3q1674,4.139742,DHPE Home Tetanus Tetanus -LSB- TET-nus -RSB- ...
3,s3q1674,4.139742,Tetanus Tetanus Tetanus -LSB- TET-nus -RSB- is...
4,s3q1674,3.880914,Tetanus KidsHealth > Parents > Infections > Ba...


In [19]:
df_1 = pd.read_pickle("qt_dev_questions.pickle")
df_1.head()

Unnamed: 0,answer,question,uid,tags
0,tetanus,Lockjaw is another name for which disease,s3q1674,"[1tok, yes-answer-long, yes-answer-short]"
1,leek,Which vegetable is a Welsh emblem ?,s3q18157,[1tok]
2,the guns of naverone,Which film won the best special effects Oscar ...,s3q6589,[]
3,sitting on the dock of a bay,What Was Otis Redding 's Biggest Hit Coming Af...,s3q22477,[]
4,king herod,Who ordered John the Baptists execution,s3q17645,"[yes-answer-long, yes-answer-short]"
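
Both frames carry a `uid` column, so each question can be lined up with its retrieved passages through a join. Here is a minimal sketch on made-up rows; the frame names and values are hypothetical, and on the real data the same call would presumably apply to the unpickled question and context frames.

```python
import pandas as pd

# Made-up stand-ins for a questions frame and a contexts frame
questions = pd.DataFrame({
    "uid": ["u1", "u2"],
    "question": ["made-up question one", "made-up question two"],
    "answer": ["alpha", "beta"],
})
contexts = pd.DataFrame({
    "uid": ["u1", "u1", "u2"],
    "score": [3.0, 2.0, 1.5],
    "context": ["first passage for u1", "second passage for u1", "only passage for u2"],
})

# One row per (question, context) pair; validate raises if a question uid is duplicated
joined = questions.merge(contexts, on="uid", how="inner", validate="one_to_many")

# Number of retrieved passages per question
counts = joined.groupby("uid").size()

print(joined)
print(counts)
```

The `validate="one_to_many"` argument is a cheap guard: it fails loudly if the questions side ever contains the same `uid` twice, which would silently multiply rows otherwise.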


## Sweet, everything is looking good. Going to send these out to the team and begin experimenting with different NLP preprocessing techniques