# Simple QA Preprocessing

1. Using `BuboQA` for every MID include a english token.
2. Preprocess Simple QA by stripping and lowercase the dataset.
3. Remove questions marks from the questions.
4. 

With this notebook, we combine in one TSV file with the columns:

[Object MID, Freebase Property, Subject MID, Question EN, WikiData Property, Object EN, Subject EN, French EN]

The goal of this notebook is to compile different pieces of data used to train Simple QA models.

The resulting file can be found:
http://hdpb.prn.parsec.apple.com:50070/explorer.html#/user/mpetrochuk/simple_qa

In [1]:
import pandas as pd
import yaml

DEST = '/Users/petrochuk/data/simple_qa/train.tsv'

# Original Simple QA Downloaded from
# https://research.fb.com/publications/large-scale-simple-question-answering-with-memory-networks/
FREEBASE = '/Users/petrochuk/data/simple_qa/source/annotated_fb_data_train.txt'
COLUMNS_FREEBASE = ['Object MID', 'Freebase Property', 'Subject MID', 'Question']
# Generated by this script notebooks/simple_qa/Build map from MID to name using Freebase Dump.ipynb
MID_TO_NAME = '/Users/petrochuk/data/simple_qa/source/freebase_dump/mid_to_name.tsv'
# Manually annotated map from Freebase to Wikidata
FREEBASE_TO_WIKIDATA = 'pytorch-seq2seq/data/freebase_to_wikidata_property.yml'

# Set of translation files for Simple QA French
HUMAN_TRANSLATION = '/Users/petrochuk/data/simple_qa/source/human_translation.tsv'
MACHINE_TRANSLATION = '/Users/petrochuk/data/simple_qa/source/deepl_translation.tsv'

data = pd.read_table(FREEBASE, names=COLUMNS_FREEBASE, header=None)
print('Number of rows:', len(data))
data.head()

Number of rows: 75910


Unnamed: 0,Object MID,Freebase Property,Subject MID,Question
0,www.freebase.com/m/04whkz5,www.freebase.com/book/written_work/subjects,www.freebase.com/m/01cj3p,what is the book e about
1,www.freebase.com/m/0tp2p24,www.freebase.com/music/release_track/release,www.freebase.com/m/0sjc7c1,to what release does the release track cardiac...
2,www.freebase.com/m/04j0t75,www.freebase.com/film/film/country,www.freebase.com/m/07ssc,what country was the film the debt from
3,www.freebase.com/m/0ftqr,www.freebase.com/music/producer/tracks_produced,www.freebase.com/m/0p600l,what songs have nobuo uematsu produced?
4,www.freebase.com/m/036p007,www.freebase.com/music/release/producers,www.freebase.com/m/0677ng,Who produced eve-olution?


Add Wikidata properities using the freebase_to_wikidata_property.

In [2]:
freebase_to_wikidata = yaml.load(open(FREEBASE_TO_WIKIDATA))

def wikidata_property(row):
    pid = freebase_to_wikidata[row['Freebase Property']]
    pid = None if pid == 'None' else pid
    if pid and 'K' in pid:
        return pid.replace('K', 'inverse:')
    elif pid and 'Q' in pid:
        return None
    return pid
    
data['WikiData Property'] = data.apply(wikidata_property, axis=1)
data.head()

46033


Unnamed: 0,Object MID,Freebase Property,Subject MID,Question,WikiData Property
0,www.freebase.com/m/04whkz5,www.freebase.com/book/written_work/subjects,www.freebase.com/m/01cj3p,what is the book e about,
1,www.freebase.com/m/0tp2p24,www.freebase.com/music/release_track/release,www.freebase.com/m/0sjc7c1,to what release does the release track cardiac...,
2,www.freebase.com/m/04j0t75,www.freebase.com/film/film/country,www.freebase.com/m/07ssc,what country was the film the debt from,P495
3,www.freebase.com/m/0ftqr,www.freebase.com/music/producer/tracks_produced,www.freebase.com/m/0p600l,what songs have nobuo uematsu produced?,
4,www.freebase.com/m/036p007,www.freebase.com/music/release/producers,www.freebase.com/m/0677ng,Who produced eve-olution?,


Generate a set MIDs we need names for.

In [3]:
def url_to_mid(url):
    if 'www.freebase.com/m/' in url:
        mid = url.replace('www.freebase.com/m/', '').strip()
        return mid
    print('FAILED', url)
    return None

mids = [url_to_mid(u) for u in data['Subject MID']] + [url_to_mid(u) for u in data['Object MID']]
mids = set(mids)
print('Unique MIDS:', len(mids))

Unique MIDS: 30476


Create a dictionary of `mid_to_name` for every MID in the set `mids`.

In [4]:
mid_to_name = {}
for i, line in enumerate(open(MID_TO_NAME)):
    split = line.strip().split('\t')
    assert len(split) < 3, split
    if len(split) == 2:
        mid, name = split
        if mid in mids: 
            mid_to_name[mid] = name
print('Got %d mappings' % len(mid_to_name))

Got 30285 mappings


Apply the dictionary to our Simple QA.

In [5]:
def mid_to_name_apply(row):
    subject_url = row['Subject MID']
    object_url = row['Object MID']
    
    subject_mid = url_to_mid(subject_url)
    object_mid = url_to_mid(object_url)
    
    row['Subject EN'] = mid_to_name[subject_mid] if subject_mid in mid_to_name else None
    row['Object EN'] = mid_to_name[object_mid] if object_mid in mid_to_name else None
    return row
    
data = data.apply(mid_to_name_apply, axis=1)
data.head()

Unnamed: 0,Object MID,Freebase Property,Subject MID,Question,WikiData Property,Subject EN,Object EN
0,www.freebase.com/m/01jp8ww,www.freebase.com/music/album/genre,www.freebase.com/m/01qzt1,Which genre of album is harder.....faster?,P136,Classic rock,Harder.....Faster
1,www.freebase.com/m/0np6z99,www.freebase.com/music/album/release_type,www.freebase.com/m/02lx2r,what format is fearless,,Album,Fearless
2,www.freebase.com/m/0wzc58l,www.freebase.com/people/person/place_of_birth,www.freebase.com/m/0n2z,what city was alex golfis born in,P19,Athens,Alex Golfis
3,www.freebase.com/m/0jtw9c,www.freebase.com/film/writer/film,www.freebase.com/m/05szq8z,what film is by the writer phil hay?,inverse:P58,Clash of the Titans,Phil Hay
4,www.freebase.com/m/0gys2sn,www.freebase.com/people/deceased_person/place_...,www.freebase.com/m/0tzls,Where did roger marquis die,P20,Holyoke,Roger Marquis


In [6]:
import csv 
from numpy import nan

ENGLISH_TO_FRENCH_QUESTION = [HUMAN_TRANSLATION]
# DeepL is the a state-of-the-art translator https://www.deepl.com/translator
# Scoring 44 BLEU in En-FR on 2014 newstest set
# ENGLISH_TO_FRENCH_QUESTION.append(MACHINE_TRANSLATION)
COLUMNS = ['Object', 'Predicate', 'Subject', 'Question EN', 'Question FR']
english_to_french_map = {}
for en_to_fr_question in ENGLISH_TO_FRENCH_QUESTION:
    simple_qa_translate = pd.read_table(en_to_fr_question, names=COLUMNS, header=None)
    for i, (_, _, _, question_en, question_fr) in simple_qa_translate.iterrows():
        english_to_french_map[question_en.strip().lower()] = question_fr
print('Translation Questions:', len(english_to_french_map))

def translate(row):
    question = row['Question']
    if question.strip().lower() in english_to_french_map:
        return english_to_french_map[question.strip().lower()]
    return None

data['Question FR'] = data.apply(translate, axis=1)
data['Question EN'] = data['Question']
data.drop('Question', axis=1)
print('Translated Rows:', len(data[data['Question FR'].notnull()]))
data.fillna(value=nan, inplace=True)
print('Saved in:', DEST)
data.to_csv(DEST, sep='\t', quoting=csv.QUOTE_NONE, index=False)
data.head()

Translation Questions: 126
Translated Rows: 104
Saved in: /Users/petrochuk/data/simple_qa/test.tsv


Unnamed: 0,Object MID,Freebase Property,Subject MID,Question,WikiData Property,Subject EN,Object EN,Question FR,Question EN
0,www.freebase.com/m/01jp8ww,www.freebase.com/music/album/genre,www.freebase.com/m/01qzt1,Which genre of album is harder.....faster?,P136,Classic rock,Harder.....Faster,,Which genre of album is harder.....faster?
1,www.freebase.com/m/0np6z99,www.freebase.com/music/album/release_type,www.freebase.com/m/02lx2r,what format is fearless,,Album,Fearless,,what format is fearless
2,www.freebase.com/m/0wzc58l,www.freebase.com/people/person/place_of_birth,www.freebase.com/m/0n2z,what city was alex golfis born in,P19,Athens,Alex Golfis,,what city was alex golfis born in
3,www.freebase.com/m/0jtw9c,www.freebase.com/film/writer/film,www.freebase.com/m/05szq8z,what film is by the writer phil hay?,inverse:P58,Clash of the Titans,Phil Hay,,what film is by the writer phil hay?
4,www.freebase.com/m/0gys2sn,www.freebase.com/people/deceased_person/place_...,www.freebase.com/m/0tzls,Where did roger marquis die,P20,Holyoke,Roger Marquis,,Where did roger marquis die
