# Simple QA Subject Recognition

This notebook preprocesses the Simple QA training file for subject recognition. Every token in a question is tagged `I` if it lies inside the subject mention or `O` if it lies outside.
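To make the tagging scheme concrete, here is a minimal sketch on a single question. The helper `io_tag`, the hand-written tokenization, and the span indices are illustrative assumptions; the notebook itself relies on `tokenize` and the indices produced by `add_subject_name`.

```python
def io_tag(tokens, start, stop):
    """Tag tokens in the half-open span [start, stop) with I, all others with O."""
    return ['I' if start <= i < stop else 'O' for i in range(len(tokens))]


# Hand-tokenized question (an assumption; the real tokenizer may split differently).
tokens = ['who', 'produced', 'eve-olution', '?']
tags = io_tag(tokens, start=2, stop=3)

# Same "token/TAG" layout that this notebook writes to the destination file.
tagged_line = ' '.join('{}/{}'.format(tok, tag) for tok, tag in zip(tokens, tags))
print(tagged_line)
```

The span is half-open, so `stop` points one past the last subject token.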

In [34]:
import sys
sys.path.insert(0, '../../')

Here we load the source file and declare the name of the destination file. 

In [35]:
from scripts.utils.simple_qa import load_simple_qa 

# Destination Filename
DEST = './../../../allennlp/experiments/train.txt'

df, = load_simple_qa(train=True)
df[:5]

| | subject | relation | object | question |
|---|---|---|---|---|
| 0 | 04whkz5 | book/written_work/subjects | 01cj3p | what is the book e about |
| 1 | 0tp2p24 | music/release_track/release | 0sjc7c1 | to what release does the release track cardiac... |
| 2 | 04j0t75 | film/film/country | 07ssc | what country was the film the debt from |
| 3 | 0ftqr | music/producer/tracks_produced | 0p600l | what songs have nobuo uematsu produced? |
| 4 | 036p007 | music/release/producers | 0677ng | Who produced eve-olution? |


We use a psycopg2 cursor to query the PostgreSQL database that stores our knowledge graph.

In [36]:
from scripts.utils.connect import get_connection 

connection = get_connection()
cursor = connection.cursor()

Not every example has a clear subject name, but we can do a fair job of linking questions to subject names, missing only 2.7% of examples.

Our dataset excludes the examples for which we cannot link a subject name. The analysis in `Explore Simple QA` found that subject names are typically missing because the question misspells the name or because the subject no longer exists in Freebase.
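The sketch below shows one way such linking could work, and why a misspelling makes it fail. It is not the implementation of `add_subject_name`: the helpers `find_span` and `link_subject`, the whitespace tokenization, and the exact-match rule are all assumptions made for illustration. The alias `battle of camp wildcat` is likewise assumed.

```python
def find_span(question_tokens, alias_tokens):
    """Return (start, stop) of the first exact occurrence of alias_tokens, else None."""
    n = len(alias_tokens)
    if n == 0:
        return None
    for start in range(len(question_tokens) - n + 1):
        if question_tokens[start:start + n] == alias_tokens:
            return start, start + n
    return None


def link_subject(question, aliases):
    """Try aliases longest-first; whitespace tokenization is a simplifying assumption."""
    tokens = question.lower().split()
    for alias in sorted(aliases, key=lambda a: len(a.split()), reverse=True):
        span = find_span(tokens, alias.lower().split())
        if span is not None:
            return alias, span
    return None


# The alias is spelled "all the rage!!" but the question says "all the rage!",
# so an exact match fails and the example would be dropped.
missed = link_subject('what is a track from all the rage!', ['all the rage!!'])

# A question that contains the (assumed) alias verbatim links cleanly.
linked = link_subject('where did the battle of camp wildcat take place',
                      ['battle of camp wildcat'])
print(missed, linked)
```

A fuzzier matcher would recover some of the misspelled cases, at the cost of occasionally tagging the wrong span.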

In [49]:
from scripts.utils.link_subject_name import add_subject_name

add_subject_name(df, cursor, print_=True)

Subject MID (0rpphvw) does not have aliases.
Subject MID (06zpt0n) does not have aliases.
Subject MID (0pdsfmz) does not have aliases.
Subject MID (0dz8wx) does not have aliases.
Subject MID (0b5d_l) does not have aliases.
Subject MID (0j3zvz_) does not have aliases.
Subject MID (0bvqgyw) does not have aliases.
Subject MID (0l8mm0j) does not have aliases.
Subject MID (05xwcnm) does not have aliases.
Subject MID (06zzpc) does not have aliases.
Subject MID (0dyvxvf) does not have aliases.
Subject MID (0rygyy8) does not have aliases.
Subject MID (07jg3g) does not have aliases.
Subject MID (0lmk5l) does not have aliases.
Subject MID (0n5_b_c) does not have aliases.
Subject MID (0w_mkc2) does not have aliases.
Subject MID (0djb8lg) does not have aliases.
Subject MID (0c0l3hf) does not have aliases.
Subject MID (04htqk) does not have aliases.
Subject MID (0j5gy9s) does not have aliases.
Subject MID (0zj79nx) does not have aliases.
Subject MID (09k116f) does not have aliases.
Subject MID (063

Subject MID (01hksbh) does not have aliases.
Subject MID (0c68ns9) does not have aliases.
Subject MID (0crsbks) does not have aliases.
Subject MID (0g7pcl5) does not have aliases.
Subject MID (03m82zk) does not have aliases.
Subject MID (01myj0t) does not have aliases.
Subject MID (0c1r708) does not have aliases.
Subject MID (0s8yzqf) does not have aliases.
Subject MID (0j_zvxz) does not have aliases.
Subject MID (04sw7fd) does not have aliases.
Subject MID (0x03kwr) does not have aliases.
Subject MID (0z8nbpd) does not have aliases.
Subject MID (0n1s3sr) does not have aliases.
Subject MID (03whpdf) does not have aliases.
Subject MID (0yq8f0w) does not have aliases.
Subject MID (0285c83) does not have aliases.
Subject MID (0gyk8tl) does not have aliases.
Subject MID (03g2qrt) does not have aliases.
Subject MID (0sh4pjl) does not have aliases.
Subject MID (0xncrxf) does not have aliases.
Subject MID (0c8d292) does not have aliases.
Subject MID (0g41plx) does not have aliases.
Subject MI

| | Aliases | Question | Subject |
|---|---|---|---|
| 0 | [harry blackstone, jr.] | which country does harry blackstone, jr. come ... | 0428h5 |
| 1 | [slinkee minx, slinky minx] | what tracks are by slinkeeminx? | 01qtxd4 |
| 2 | [is it love? ultra naté best remixes, vol. 1] | what types of album is the best remixes, vol. 1 | 02w851w |
| 3 | [i'm breathless] | what platform was im breathless released on? | 038xf82 |
| 4 | [human genome, the human genome] | what is the name of a gene in the hunan genome. | 0bsdc |
| 5 | [all the rage!!] | what is a track from all the rage! | 0q9lhqm |
| 6 | [blood brothers] | who is the male composer of the song? | 0_78wqs |
| 7 | [<bold>constance georgine, countess markiewicz... | what is constance markiewicz's religion? | 023395 |
| 8 | [the piper at the gates of dawn] | Who released the piper at the gates of down tr... | 0f1ncqg |
| 9 | [new hampshire chicken] | what is the primary use of the chicken breed n... | 027843j |


### Numbers
2.501647% [1899 of 75910] questions do not reference subject

0.439995% [334 of 75910] subject mids do not have aliases
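As a quick check, the percentages above can be reproduced from the raw counts. The combined figure at the end assumes the two groups are disjoint, which the output does not state explicitly.

```python
total = 75910
not_referenced = 1899  # questions that do not reference the subject
no_aliases = 334       # subject MIDs without aliases

pct_not_referenced = '{:.6%}'.format(not_referenced / total)
pct_no_aliases = '{:.6%}'.format(no_aliases / total)
print(pct_not_referenced, pct_no_aliases)

# Assumption: no example is counted in both groups.
excluded = not_referenced + no_aliases
print(excluded, total - excluded)
```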

In [51]:
import random

from tqdm import tqdm_notebook

from scripts.utils.link_subject_name import tokenize

tagged = []
for index, row in tqdm_notebook(df.iterrows(), total=df.shape[0]):
    # Skip examples that could not be linked to a subject name.
    if not isinstance(row['subject_name'], str):
        continue
    start = row['subject_name_start_index']
    stop = row['subject_name_stop_index']
    # IO (Inside/Outside) tagging schema: tokens in [start, stop) get I, the rest O.
    tagged.append(' '.join(
        '{}/{}'.format(token.lower(), 'I' if start <= i < stop else 'O')
        for i, token in enumerate(tokenize(row['question']))))

# Inspect a random sample of the tagged questions.
random.sample(tagged, 50)




['what/O is/O the/O hide/I and/I seek/I :/I the/I search/I for/I truth/I in/I iraq/I book/O about/O',
 'what/O company/O is/O the/O manufacturer/O of/O vivarin/I 200/I tablet/I ?/O',
 'what/O is/O a/O famous/O fiction/I book/O',
 'which/O artist/O made/O the/I snow/I files/I',
 'what/O instrument/O does/O claudia/I gonson/I play/O ?/O',
 'nitrogen/I 0.99/I liquid/I is/O a/O formulation/O of/O what/O',
 "what/O is/O christian/I gaul/I 's/O nationality/O ?/O",
 'what/O cellular/O system/O does/O the/O htc/I touch/I hd/I use/O',
 'what/O is/O a/O recording/O of/O miles/I called/O ?/O',
 'who/O produced/O the/O recording/O pinocchio/I',
 'whats/O an/O example/O of/O a/O comedy-drama/I film/O',
 'who/O wrote/O lyrics/O to/O another/I heart/I breaks/I ?/O',
 'what/O kind/O of/O music/O does/O eli/I degibri/I play/O',
 'where/O did/O the/O battle/I of/I camp/I wildcat/I take/O place/O',
 "what/O 's/O the/O title/O track/O off/O rachel/I",
 'where/O did/O george/I w./I p./I hunt/I die/O',
 'wh

In [52]:
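# Write one tagged question per line; the context manager closes the file.
with open(DEST, 'w') as file_:
    n_chars = file_.write('\n'.join(tagged))
n_chars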

4366969
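A downstream reader has to split each line back into tokens and tags. The following is a minimal sketch, not the reader used in the `allennlp` experiments; `parse_tagged_line` is a hypothetical helper, and it assumes tokens never contain spaces.

```python
def parse_tagged_line(line):
    """Split a 'token/TAG token/TAG ...' line into parallel token and tag lists."""
    tokens, tags = [], []
    for pair in line.split(' '):
        # Split on the LAST slash so a token that itself contains '/' stays intact.
        token, tag = pair.rsplit('/', 1)
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


line = 'what/O instrument/O does/O claudia/I gonson/I play/O ?/O'
tokens, tags = parse_tagged_line(line)

# Recover the subject mention from the I-tagged tokens.
subject = ' '.join(tok for tok, tag in zip(tokens, tags) if tag == 'I')
print(subject)
```

Splitting on the last slash matters because the tag is always the final field, while the token text is unconstrained.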