# Simple QA Subject Recognition

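The goal of this notebook is to preprocess the Simple QA dev set so that every question token is tagged as inside (`I`) or outside (`O`) the subject name, for example: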

**Subject Name:** angels vengeance

**Result:** what\O language\O is\O angels\I vengeance\I in\O ?\O
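
As a rough illustration of the target format above, the sketch below tags the first exact occurrence of the subject tokens in a question. The helper name `tag_question` is hypothetical, and it assumes whitespace tokenization and exact token equality, which is simpler than the stemmed matching used later in this notebook.

```python
def tag_question(question_tokens, subject_tokens):
    """Return 'token\\TAG' pairs, tagging the first occurrence of the subject span."""
    n = len(subject_tokens)
    tags = ['O'] * len(question_tokens)
    for start in range(len(question_tokens) - n + 1):
        # Compare a window of the question against the subject tokens
        if question_tokens[start:start + n] == subject_tokens:
            tags[start:start + n] = ['I'] * n
            break
    return ' '.join(f'{token}\\{tag}' for token, tag in zip(question_tokens, tags))


question = 'what language is angels vengeance in ?'.split()
subject = 'angels vengeance'.split()
result = tag_question(question, subject)
print(result)
```

Exact equality is enough for this example, but it would miss plural and possessive variants of the subject name.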

In [1]:
import sys
sys.path.insert(0, '../../')

%load_ext autoreload
%autoreload 2

In [2]:
from scripts.utils.simple_qa import load_simple_qa 

# Destination Filename
DEST = './../../data/simple_qa_subject_recognition_dev.txt'

df, = load_simple_qa(dev=True)
df[:5]

| | subject | relation | object | question |
|---|---|---|---|---|
| 0 | 0f3xg_ | symbols/namesake/named_after | 0cqt90 | Who was the trump ocean club international hot... |
| 1 | 07f3jg | people/person/place_of_birth | 0565d | where was sasha vujačić born |
| 2 | 031j8nn | music/release/region | 07ssc | What is a region that dead combo was released in |
| 3 | 0c1cyhd | film/director/film | 0wxsz5y | What is a film directed by wiebke von carolsfeld? |
| 4 | 0fvhc0g | music/release/region | 0345h | what country was music for stock exchange rel... |


In [3]:
from scripts.utils.connect import get_connection 

connection = get_connection()
cursor = connection.cursor()

In [4]:
from scripts.utils.add_subject_name import add_subject_name

add_subject_name(df, cursor, print_=True)

Subject MID (0yzv6q0) does not have aliases.
Subject MID (0k571hg) does not have aliases.
Subject MID (02z6y3p) does not have aliases.
Subject MID (08vqg8n) does not have aliases.
Subject MID (045p9_r) does not have aliases.
Subject MID (03czw5g) does not have aliases.
Subject MID (04j2m25) does not have aliases.
Subject MID (0bvrjh4) does not have aliases.
Subject MID (08c_49) does not have aliases.
Subject MID (0h4867) does not have aliases.
Subject MID (02qplxv) does not have aliases.
Subject MID (0kz5hyk) does not have aliases.
Subject MID (07wk3m) does not have aliases.
Subject MID (01jyqwk) does not have aliases.
Subject MID (0_s2kgd) does not have aliases.
Subject MID (02vk_bv) does not have aliases.
Subject MID (03f86d) does not have aliases.
Subject MID (0kyvhwk) does not have aliases.
Subject MID (0l8d74) does not have aliases.
Subject MID (04jnfd) does not have aliases.
Subject MID (04zfl8) does not have aliases.
Subject MID (0z5y43n) does not have aliases.
Subject MID (040b

| | Aliases | Question | Subject |
|---|---|---|---|
| 0 | [bill hosket, jr.] | What was bill hosket, jr.'s position | 02vvz00 |
| 1 | [megan leigh romero, megan romero] | What track is featured on the salisbury release | 098j5p7 |
| 2 | [datskat] | What composer created? | 0zhwd1v |
| 3 | [ron warner] | Who's a linebacker | 03m9zc3 |
| 4 | [mr m, mr. m., mr.m.] | what label does the artist mr. m. belong to | 01x20gj |
| 5 | [sabrina fredrica washington, sabrina washingt... | Which label is sabrinawmusic signed to? | 0bnsl7 |
| 6 | [t-town, tulsa, tulsa, oklahoma, wagoner count... | What newspaper circulates in the town of kearny | 013kcv |
| 7 | [nguyễn văn toàn] | Where is the place of birth of nguyen van toan | 02vnwfd |
| 8 | [aftereight, capital lights] | what types of music is played by capitallights | 02pvx1q |
| 9 | [35 mm film, 35mm film] | what is an example of a film on 35mm | 0cj16 |


### Numbers
2.351314% [255 of 10845] questions do not reference subject

0.461042% [50 of 10845] subject mids do not have aliases
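
The percentages above follow directly from the counts. A minimal check, using only the numbers reported in this section:

```python
total = 10845          # questions in the dev set
no_reference = 255     # questions that do not reference the subject
no_aliases = 50        # subject MIDs without aliases

pct_no_reference = 100 * no_reference / total
pct_no_aliases = 100 * no_aliases / total

print('%f%% [%d of %d] questions do not reference subject' % (pct_no_reference, no_reference, total))
print('%f%% [%d of %d] subject mids do not have aliases' % (pct_no_aliases, no_aliases, total))
```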

In [5]:
from tqdm import tqdm_notebook
from numpy import nan
from nltk.stem.snowball import SnowballStemmer
from nltk import word_tokenize
from IPython.display import display
import random
import json
import pandas as pd

stem = SnowballStemmer('english', ignore_stopwords=True).stem

tagged = []
exact_match = 0
not_exact_match = []
for index, row in tqdm_notebook(df.iterrows(), total=df.shape[0]):
    if isinstance(row['subject_name'], str):
        # Line up the question with the subject name
        possessives = ["'", "'s", "`s", "`"]
        token_question = word_tokenize(row['question'].lower())
        token_subject_name = word_tokenize(row['subject_name'].lower())
        # Save the original index but get rid of the possessives
        token_question_no_poss = [(i, token) for i, token in enumerate(token_question)
                                  if token not in possessives]
        token_subject_name_no_poss = [token for token in token_subject_name
                                      if token not in possessives]
        start_index, stop_index = None, None
        for i, (original_i, token) in enumerate(token_question_no_poss):
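            # Check whether the stemmed subject name matches the question from token i onward
            for j, other_token in enumerate(token_subject_name_no_poss):
                if i + j >= len(token_question_no_poss):
                    break  # The subject name would run past the end of the question
                offset_original_i, offset_token = token_question_no_poss[i + j]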
                if stem(offset_token) != stem(other_token):
                    break
                if j == len(token_subject_name_no_poss) - 1:  # Last iteration
                    stop_index = offset_original_i + 1
                    start_index = original_i

            if start_index is not None:
                break
    
        assert start_index is not None and stop_index is not None
    
        ret = ''
        for i, token in enumerate(token_question):
            ret += token
            if i >= start_index and i < stop_index:
                ret += ' XX XX I-SUB\n'
            else:
                ret += ' XX XX 0\n'
        
        if token_subject_name == token_question[start_index:stop_index]:
            exact_match += 1
        else:
            not_exact_match.append([' '.join(token_subject_name),
                                    ' '.join(token_question[start_index:stop_index])])

        tagged.append(ret)

print('Exact Match: %f [%d of %d]' % (exact_match / df.shape[0], exact_match, df.shape[0]))
display(random.sample(tagged, 50))
display(random.sample(not_exact_match, 50))


Exact Match: 0.955371 [10361 of 10845]
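
The matching step in the cell above can be read as a sliding-window comparison over stemmed tokens with possessive markers dropped. The sketch below isolates that idea using only the standard library. The names `find_subject_span` and `naive_stem` are hypothetical, and `naive_stem` is a crude stand-in for the Snowball stemmer, so its results will not always agree with the notebook's.

```python
POSSESSIVES = {"'", "'s", "`s", "`"}


def naive_stem(token):
    """Hypothetical stand-in for a real stemmer: drop a trailing 's'."""
    return token[:-1] if len(token) > 3 and token.endswith('s') else token


def find_subject_span(question_tokens, subject_tokens, stem=naive_stem):
    """Return (start, stop) indices into question_tokens, or None if no match."""
    # Keep the original indices while dropping possessive tokens
    question = [(i, t) for i, t in enumerate(question_tokens) if t not in POSSESSIVES]
    subject = [t for t in subject_tokens if t not in POSSESSIVES]
    if not subject:
        return None
    # Slide a window of len(subject) over the filtered question tokens
    for i in range(len(question) - len(subject) + 1):
        window = question[i:i + len(subject)]
        if all(stem(q) == stem(s) for (_, q), s in zip(window, subject)):
            return window[0][0], window[-1][0] + 1
    return None


question = 'what is parkinsons disease'.split()
subject = ['parkinson', "'s", 'disease']
span = find_subject_span(question, subject)
print(span, question[span[0]:span[1]])
```

Returning `None` instead of asserting makes it possible to count unmatched questions rather than stop at the first one.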


['where XX XX 0\nis XX XX 0\nmaplewood XX XX I-SUB\nlocated XX XX 0\n? XX XX 0\n',
 'what XX XX 0\ntype XX XX 0\nof XX XX 0\nalbum XX XX 0\nis XX XX 0\noranges XX XX I-SUB\n& XX XX I-SUB\nlemons XX XX I-SUB\n',
 'what XX XX 0\nis XX XX 0\nblack XX XX I-SUB\nand XX XX I-SUB\nwhite XX XX I-SUB\nlive XX XX I-SUB\nbundle XX XX I-SUB\n# XX XX I-SUB\n3 XX XX I-SUB\n',
 'what XX XX 0\ntype XX XX 0\nof XX XX 0\nmusic XX XX 0\nis XX XX 0\nmade XX XX 0\nby XX XX 0\ndavid XX XX I-SUB\nguetta XX XX I-SUB\n',
 'what XX XX 0\nkind XX XX 0\nof XX XX 0\nmusic XX XX 0\ndoes XX XX 0\nlobo XX XX I-SUB\ncreate XX XX 0\n',
 'what XX XX 0\nconflict XX XX 0\noccurred XX XX 0\nin XX XX 0\nmorocco XX XX I-SUB\n',
 'name XX XX 0\na XX XX 0\nsymptom XX XX 0\nof XX XX 0\nthe XX XX 0\ndisease XX XX 0\nventricular XX XX I-SUB\ntachycardia XX XX I-SUB\n',
 'what XX XX 0\ntype XX XX 0\nof XX XX 0\nmusic XX XX 0\nis XX XX 0\nof XX XX I-SUB\nwars XX XX I-SUB\nin XX XX I-SUB\nosyrhia XX XX I-SUB\n? XX XX 0\n',
 'which X

[["clemson tigers men 's basketball", 'clemson tigers mens basketball'],
 ['skeeter davies', 'skeeter davis'],
 ['death eaters', 'death eater'],
 ['regan gascoigne', 'regan gascoignes'],
 ["acorna 's world", 'acornas world'],
 ['bretons', 'breton'],
 ['joules', 'joule'],
 ["peter 's point plantation", 'peters point plantation'],
 ['short films', 'short film'],
 ['drugs', 'drug'],
 ["parkinson 's disease", 'parkinsons disease'],
 ['personal computers', 'personal computer'],
 ["disney 's kim possible 2 : drakken 's demise",
  'disneys kim possible 2 : drakkens demise'],
 ["heaton 's crossroads", 'heatons crossroads'],
 ['horror film', 'horror films'],
 ['spike milligans', 'spike milligan'],
 ['drugs', 'drug'],
 ["fido 's summer fun , volume 2", 'fidos summer fun , volume 2'],
 ["love 's boomerang", 'loves boomerang'],
 ['billie ocean', 'billy ocean'],
 ["chet 's speech , part ii", 'chets speech , part ii'],
 ['mithun chakraborty', 'mithun chakrabortys'],
 ["raffi 's christmas album", 'ra

In [6]:
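# Use a context manager so the file is flushed and closed after writing
with open(DEST, 'w') as file_:
    n_written = file_.write('-DOCSTART- -X- O O\n\n' + '\n'.join(tagged))
n_written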

1248783
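
To sanity-check the written file, it can be parsed back into sentences. The sketch below assumes the layout produced above: a `-DOCSTART-` header, one `token XX XX tag` line per token, and a blank line between questions. The function name `parse_tagged` is hypothetical, and the sample text is built from two of the tagged examples displayed earlier rather than read from `DEST`.

```python
def parse_tagged(text):
    """Split CoNLL-style text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.startswith('-DOCSTART-'):
            continue
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = ('-DOCSTART- -X- O O\n\n'
          'where XX XX 0\nis XX XX 0\nmaplewood XX XX I-SUB\nlocated XX XX 0\n? XX XX 0\n'
          '\n'
          'what XX XX 0\nconflict XX XX 0\noccurred XX XX 0\nin XX XX 0\nmorocco XX XX I-SUB\n')
sentences = parse_tagged(sample)
subjects = [' '.join(token for token, tag in s if tag == 'I-SUB') for s in sentences]
print(subjects)
```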