# Relation Classifier Data

The goal of this notebook is to create training examples for the relation classifier.

Simple Questions already provides such data, but here we add one preprocessing step: the subject entity mention in each question is replaced with the placeholder `<e>`. This matters because, given the entity name, the model would learn that some entities map to some relations more often than others.

That correlation is real, but it can be measured more accurately from the bias in FB2M / FB5M: we can simply count the number of relations a particular subject maps to.
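
The counting idea can be sketched as follows. This is a minimal illustration, not part of the pipeline: the `triples` list is made up, and real rows would have to be loaded from FB2M / FB5M.

```python
from collections import Counter, defaultdict

# Hypothetical (subject, relation, object) triples, for illustration only.
triples = [
    ('m1', 'film/film/country', 'o1'),
    ('m1', 'film/film/country', 'o2'),
    ('m1', 'film/film/genre', 'o3'),
    ('m2', 'music/release/region', 'o1'),
]

relations_by_subject = defaultdict(set)
pair_counts = Counter()
for subject, relation, _object in triples:
    relations_by_subject[subject].add(relation)
    pair_counts[(subject, relation)] += 1

# Number of distinct relations each subject maps to.
num_relations = {s: len(r) for s, r in relations_by_subject.items()}
print(num_relations)

# How often each (subject, relation) pair occurs, i.e. the per-subject bias.
print(pair_counts.most_common())
```

Keeping both the distinct count and the pair frequencies makes it possible to tell a subject with a single relation apart from one that is merely dominated by a single relation.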

In [1]:
import sys
sys.path.insert(0, '../../')
import pandas as pd
from tqdm import tqdm_notebook
import lib.import_notebook
from lib.utils import get_connection 

tqdm_notebook().pandas()

connection = get_connection()
cursor = connection.cursor()




In [2]:
from lib.simple_qa import load_simple_qa 

# Destination Filename
DEST_TRAIN = './../../data/relation_classifier/train.txt'
DEST_DEV = './../../data/relation_classifier/dev.txt'

df_dev, = load_simple_qa(dev=True)
print('Dev:')
display(df_dev[:5])
df_train, = load_simple_qa(train=True)
print('Train:')
display(df_train[:5])

Dev:


Unnamed: 0,subject,relation,object,question
0,0f3xg_,symbols/namesake/named_after,0cqt90,Who was the trump ocean club international hot...
1,07f3jg,people/person/place_of_birth,0565d,where was sasha vujačić born
2,031j8nn,music/release/region,07ssc,What is a region that dead combo was released in
3,0c1cyhd,film/director/film,0wxsz5y,What is a film directed by wiebke von carolsfeld?
4,0fvhc0g,music/release/region,0345h,what country was music for stock exchange rel...


Train:


Unnamed: 0,subject,relation,object,question
0,04whkz5,book/written_work/subjects,01cj3p,what is the book e about
1,0tp2p24,music/release_track/release,0sjc7c1,to what release does the release track cardiac...
2,04j0t75,film/film/country,07ssc,what country was the film the debt from
3,0ftqr,music/producer/tracks_produced,0p600l,what songs have nobuo uematsu produced?
4,036p007,music/release/producers,0677ng,Who produced eve-olution?


## Step 1 - Link Question to Subject Name

In [3]:
import importlib
from functools import partial
edit_distance_link_alias = importlib.import_module(
                "notebooks.Simple QA Numbers.HYPOTHESIS - Question Refers to Multiple Subjects").edit_distance_link_alias
normalize = importlib.import_module(
                "notebooks.Simple QA Numbers.HYPOTHESIS - Subject Name not in Question").normalize

# Create a column with the subject_name linked per example
df_dev['subject_name'] = df_dev.progress_apply(partial(edit_distance_link_alias, cursor, normalize), axis=1)
print('Dev Linked', sum(df_dev.subject_name.notnull()), 'examples')
display(df_dev[:5])
df_train['subject_name'] = df_train.progress_apply(partial(edit_distance_link_alias, cursor, normalize), axis=1)
print('Train Linked', sum(df_train.subject_name.notnull()), 'examples')
display(df_train[:5])

importing Jupyter notebook from ../../notebooks/Simple QA Numbers/HYPOTHESIS - Question Refers to Multiple Subjects.ipynb
importing Jupyter notebook from ../../notebooks/Simple QA Numbers/HYPOTHESIS - Subject Name not in Question.ipynb



Dev Linked 10648 examples


Unnamed: 0,subject,relation,object,question,subject_name
0,0f3xg_,symbols/namesake/named_after,0cqt90,Who was the trump ocean club international hot...,trump ocean club international hotel and tower
1,07f3jg,people/person/place_of_birth,0565d,where was sasha vujačić born,sasha vujacic
2,031j8nn,music/release/region,07ssc,What is a region that dead combo was released in,dead combo
3,0c1cyhd,film/director/film,0wxsz5y,What is a film directed by wiebke von carolsfeld?,wiebke von carolsfeld
4,0fvhc0g,music/release/region,0345h,what country was music for stock exchange rel...,music for stock exchange



Train Linked 74520 examples


Unnamed: 0,subject,relation,object,question,subject_name
0,04whkz5,book/written_work/subjects,01cj3p,what is the book e about,e
1,0tp2p24,music/release_track/release,0sjc7c1,to what release does the release track cardiac...,cardiac arrest
2,04j0t75,film/film/country,07ssc,what country was the film the debt from,the debt
3,0ftqr,music/producer/tracks_produced,0p600l,what songs have nobuo uematsu produced?,nobuo uematsu
4,036p007,music/release/producers,0677ng,Who produced eve-olution?,eve-olution
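
`edit_distance_link_alias` is defined in another notebook and its implementation is not shown here. As a rough sketch of the general idea only, assuming the candidate aliases of a subject are already known, linking by edit distance could look like this. The names `edit_distance` and `link_alias` and the alias list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def link_alias(question, aliases):
    """Return the alias closest to some word n-gram of the question."""
    tokens = question.lower().split()
    best_alias, best_distance = None, None
    for alias in aliases:
        n = len(alias.split())
        # Compare the alias with every question n-gram of the same word length.
        for start in range(len(tokens) - n + 1):
            candidate = ' '.join(tokens[start:start + n])
            distance = edit_distance(alias, candidate)
            if best_distance is None or distance < best_distance:
                best_alias, best_distance = alias, distance
    return best_alias


# Hypothetical alias list; the notebook looks aliases up through `cursor`.
print(link_alias('where was sasha vujačić born', ['sasha vujacic', 'sasa vujacic']))
```

Unlike this sketch, the notebook's helper also receives a `normalize` function, which would presumably account for the accented `vujačić` matching `vujacic` in the output above.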


## Step 2 - Determine the Span of the Subject Name

In [4]:
import importlib
find_subject_name_span = importlib.import_module(
                "notebooks.Simple QA Models.Subject Recognition Data").find_subject_name_span

# Add start_index / end_index columns marking where the subject name occurs in the question tokens
df_dev = df_dev.progress_apply(find_subject_name_span, axis=1)
print('Dev:')
display(df_dev[:5])
df_train = df_train.progress_apply(find_subject_name_span, axis=1)
print('Train:')
display(df_train[:5])

importing Jupyter notebook from ../../notebooks/Simple QA Models/Subject Recognition Data.ipynb



Dev:


Unnamed: 0,end_index,object,question,question_tokens,relation,start_index,subject,subject_name,subject_name_tokens
0,10.0,0cqt90,Who was the trump ocean club international hot...,"[who, was, the, trump, ocean, club, internatio...",symbols/namesake/named_after,3.0,0f3xg_,trump ocean club international hotel and tower,"(trump, ocean, club, international, hotel, and..."
1,4.0,0565d,where was sasha vujačić born,"[where, was, sasha, vujacic, born]",people/person/place_of_birth,2.0,07f3jg,sasha vujacic,"(sasha, vujacic)"
2,7.0,07ssc,What is a region that dead combo was released in,"[what, is, a, region, that, dead, combo, was, ...",music/release/region,5.0,031j8nn,dead combo,"(dead, combo)"
3,9.0,0wxsz5y,What is a film directed by wiebke von carolsfeld?,"[what, is, a, film, directed, by, wiebke, von,...",film/director/film,6.0,0c1cyhd,wiebke von carolsfeld,"(wiebke, von, carolsfeld)"
4,7.0,0345h,what country was music for stock exchange rel...,"[what, country, was, music, for, stock, exchan...",music/release/region,3.0,0fvhc0g,music for stock exchange,"(music, for, stock, exchange)"



Train:


Unnamed: 0,end_index,object,question,question_tokens,relation,start_index,subject,subject_name,subject_name_tokens
0,5.0,01cj3p,what is the book e about,"[what, is, the, book, e, about]",book/written_work/subjects,4.0,04whkz5,e,"(e,)"
1,9.0,0sjc7c1,to what release does the release track cardiac...,"[to, what, release, does, the, release, track,...",music/release_track/release,7.0,0tp2p24,cardiac arrest,"(cardiac, arrest)"
2,7.0,07ssc,what country was the film the debt from,"[what, country, was, the, film, the, debt, from]",film/film/country,5.0,04j0t75,the debt,"(the, debt)"
3,5.0,0p600l,what songs have nobuo uematsu produced?,"[what, songs, have, nobuo, uematsu, produced, ?]",music/producer/tracks_produced,3.0,0ftqr,nobuo uematsu,"(nobuo, uematsu)"
4,5.0,0677ng,Who produced eve-olution?,"[who, produced, eve, -, olution, ?]",music/release/producers,2.0,036p007,eve-olution,"(eve, -, olution)"
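
`find_subject_name_span` is imported from another notebook, so its code is not shown here. Judging from the tables above, `start_index` is inclusive and `end_index` is exclusive. A minimal sketch of an exact-match span search under that convention follows; `find_span` is a hypothetical name, not the notebook's function.

```python
def find_span(question_tokens, subject_tokens):
    """Return (start, end) of the first exact match, end exclusive, or None."""
    subject_tokens = list(subject_tokens)
    n = len(subject_tokens)
    if n == 0:
        return None
    for start in range(len(question_tokens) - n + 1):
        if question_tokens[start:start + n] == subject_tokens:
            return start, start + n
    return None


print(find_span(['where', 'was', 'sasha', 'vujacic', 'born'], ('sasha', 'vujacic')))
```

For the second dev row this yields the same indices as the table (2 and 4).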


## Step 3 - Format Examples

In [8]:
from tqdm import tqdm_notebook

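def get_formatted_examples(df):
    """Replace the subject span in each question with '<e>'; skip unlinked rows."""
    examples = []
    for index, row in tqdm_notebook(df.iterrows(), total=df.shape[0]):
        # Rows without a linked subject name have no span to replace
        if not isinstance(row['subject_name'], str):
            continue

        ret = ''
        for i, token in enumerate(row['question_tokens']):
            if i == row['start_index']:
                # The first token of the span becomes the placeholder
                ret += '<e>'
            elif row['start_index'] < i < row['end_index']:
                # The remaining span tokens are dropped (end_index is exclusive)
                continue
            else:
                ret += token.lower().strip()
            ret += ' '

        examples.append(ret.strip())
    return examples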

# TODO: Create a notebook summarizing this
train_examples = get_formatted_examples(df_train)
print(len(set(train_examples)))
print(len(train_examples))
print('Train:')
print(train_examples[:5])
dev_examples = get_formatted_examples(df_dev)
print('Dev:')
print(dev_examples[:5])


33388
74520
Train:
['what is the book <e> about', 'to what release does the release track <e> come from', 'what country was the film <e> from', 'what songs have <e> produced ?', 'who produced <e> ?']



Dev:
['who was the <e> named after', 'where was <e> born', 'what is a region that <e> was released in', 'what is a film directed by <e> ?', 'what country was <e>  released in']
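
For a single example, the same replacement can be written with list slicing. This is a simplified sketch, not a drop-in replacement: `replace_span` is a hypothetical helper, it assumes integer indices (the DataFrame stores them as floats, so they would need `int(...)` first), and it does not lowercase or strip tokens.

```python
def replace_span(tokens, start, end, placeholder='<e>'):
    """Replace tokens[start:end] with a single placeholder and join with spaces."""
    return ' '.join(tokens[:start] + [placeholder] + tokens[end:])


print(replace_span(['who', 'produced', 'eve', '-', 'olution', '?'], 2, 5))
```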


## Step 4 - Write

In [6]:
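# Write one example per line; the context managers make sure both files are flushed and closed
with open(DEST_TRAIN, 'w') as file_:
    file_.write('\n'.join(train_examples))

with open(DEST_DEV, 'w') as file_:
    chars_written = file_.write('\n'.join(dev_examples))

# Number of characters written to the dev file
chars_written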

639701