# HYPOTHESIS - Unknown relations

Here we check how many relations in the development set are never seen at training time. This should give us a good handle on how many relations our model has no opportunity to learn during training.

# Discussion

Only about 0.65% (71 / 10845) of examples in the development set have an unseen relation; therefore, we do not believe this is a large issue.

In [2]:
import sys
sys.path.insert(0, '../../')

In [3]:
from scripts.utils.simple_qa import load_simple_qa 

# Load the development set because it is roughly an order of magnitude smaller than the training set.
df_dev, = load_simple_qa(dev=True)
df_dev[:5]

Unnamed: 0,subject,relation,object,question
0,0f3xg_,symbols/namesake/named_after,0cqt90,Who was the trump ocean club international hot...
1,07f3jg,people/person/place_of_birth,0565d,where was sasha vujačić born
2,031j8nn,music/release/region,07ssc,What is a region that dead combo was released in
3,0c1cyhd,film/director/film,0wxsz5y,What is a film directed by wiebke von carolsfeld?
4,0fvhc0g,music/release/region,0345h,what country was music for stock exchange rel...


In [4]:
df_train, = load_simple_qa(train=True)
df_train[:5]

Unnamed: 0,subject,relation,object,question
0,04whkz5,book/written_work/subjects,01cj3p,what is the book e about
1,0tp2p24,music/release_track/release,0sjc7c1,to what release does the release track cardiac...
2,04j0t75,film/film/country,07ssc,what country was the film the debt from
3,0ftqr,music/producer/tracks_produced,0p600l,what songs have nobuo uematsu produced?
4,036p007,music/release/producers,0677ng,Who produced eve-olution?


In [5]:
train_relations = set(df_train.relation.unique())
print('%d unique relations in train' % (len(train_relations)))
print('Sample:', list(train_relations)[:5])

1629 unique relations in train
Sample: ['people/ethnicity/geographic_distribution', 'book/journal/discipline', 'cvg/cvg_developer/game_versions_developed', 'user/sue_anne/default_domain/olympic_medal_event/gold_medalist', 'royalty/monarch/kingdom']
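The hypothesis is phrased in terms of relations, so one complementary check is which distinct dev relations never occur in train. The following is a minimal sketch on toy data, not the notebook's own code: `toy_train`, `toy_dev`, and `unseen_relations` are hypothetical names, and the real frames would come from `load_simple_qa`.

```python
import pandas as pd

# Toy stand-ins for df_train and df_dev (assumption: only the 'relation' column matters here).
toy_train = pd.DataFrame({
    'relation': ['film/film/country', 'music/release/region', 'film/film/country'],
})
toy_dev = pd.DataFrame({
    'relation': ['music/release/region', 'people/person/place_of_birth'],
})

toy_train_relations = set(toy_train['relation'].unique())
toy_dev_relations = set(toy_dev['relation'].unique())

# Set difference: relations that occur in dev but never in train.
unseen_relations = sorted(toy_dev_relations - toy_train_relations)
print('%d of %d dev relations unseen in train' % (len(unseen_relations), len(toy_dev_relations)))
print(unseen_relations)
```

Counting distinct relations and counting rows answer slightly different questions: a single unseen relation may account for many dev rows, or for only one.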


In [7]:
from tqdm import tqdm_notebook

unseen = 0
for index, row in tqdm_notebook(df_dev.iterrows(), total=df_dev.shape[0]):
    if row['relation'] not in train_relations:
        unseen += 1
# Multiply by 100 so the printed value is a percentage rather than a fraction.
print('%f%% rows with unseen relations in dev [%d of %d]' % (100 * unseen / df_dev.shape[0], unseen, df_dev.shape[0]))


0.654680% rows with unseen relations in dev [71 of 10845]
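The row-by-row loop can also be expressed as a vectorized membership test. This is a sketch under assumptions, shown on toy data: `toy_dev`, `toy_train_relations`, `unseen_mask`, and `unseen_percent` are hypothetical names, and `Series.isin` is swapped in for the explicit `iterrows` loop.

```python
import pandas as pd

# Toy stand-ins (assumption): a set of train relations and a small dev frame.
toy_train_relations = {'film/film/country', 'music/release/region'}
toy_dev = pd.DataFrame({
    'relation': [
        'film/film/country',
        'music/release/region',
        'people/person/place_of_birth',  # not present in the toy train set
        'film/film/country',
    ],
})

# True for every dev row whose relation is absent from the train relations.
unseen_mask = ~toy_dev['relation'].isin(toy_train_relations)
unseen_count = int(unseen_mask.sum())

# Scale the fraction by 100 so it can be printed next to a percent sign.
unseen_percent = 100.0 * unseen_count / len(toy_dev)
print('%.2f%% rows with unseen relations in dev [%d of %d]'
      % (unseen_percent, unseen_count, len(toy_dev)))
```

The boolean mask can also be used to inspect the affected rows directly, for example `toy_dev[unseen_mask]`.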
