# HYPOTHESIS - Subject Name not in Question

For a question in the SimpleQuestions dataset to be answerable, the true `subject_mid` must have an alias that is referenced in the question.

Given that aliases can be misspelled, it's not straightforward to detect whether an alias is referenced in the question.

In [2]:
import sys
sys.path.insert(0, '../../')
import pandas as pd
import random
from tqdm import tqdm_notebook
tqdm_notebook().pandas()
from scripts.utils.data import FB5M_NAME_TABLE
from scripts.utils.data import FB2M_NAME_TABLE
from scripts.utils.connect import get_connection 

connection = get_connection()
cursor = connection.cursor()




In [3]:
from scripts.utils.simple_qa import load_simple_qa 
from sklearn.utils import shuffle

# Load the development set because it's an order of magnitude smaller than the training set.
df, = load_simple_qa(dev=True)
df = shuffle(df, random_state=123)
df[:5]

| Unnamed: 0 | subject | relation | object | question |
| --- | --- | --- | --- | --- |
| 6219 | 03k3r | biology/organism_classification/organisms_of_t... | 0bs56bp | Name an American Thoroughbread racehorse |
| 3364 | 02qlppc | cvg/computer_videogame/cvg_genre | 01sjng | what kind of game is vision racing driving sim... |
| 9374 | 02l7c8 | tv/tv_genre/programs | 0dlmm88 | what tv program is romance film |
| 10142 | 049_zj3 | location/location/containedby | 04rrx | what state is polaski located in |
| 97 | 02w9ycr | people/deceased_person/cause_of_death | 0qcr0 | what disease claimed the life of fern emmett |


## Baseline Reference Resolution

Our baseline reference resolution looks for an exact alias match in the question after lowercasing both.

In [20]:
import pandas as pd
from scripts.utils.table import format_pipe_table

no_reference = []

def naive_link_alias(row, name_table=FB2M_NAME_TABLE):
    cursor.execute('SELECT alias FROM ' + name_table + ' WHERE mid=%s', (row['subject'],))
    aliases = [a[0] for a in cursor.fetchall() if len(a[0]) > 0]
    if len(aliases) > 0:
        if_alias_in_question = [a.lower() in row['question'].lower() for a in aliases]
        if not any(if_alias_in_question):
            no_reference.append({'Aliases': aliases,
                                 'Question': row['question']})
    else:
        no_reference.append({'Aliases': [],
                             'Question': row['question']})

df.progress_apply(naive_link_alias, axis=1)

no_aliases = [r for r in no_reference if len(r['Aliases']) == 0]
print('Subjects with no alias: %f [%d of %d]' % (len(no_aliases) / df.shape[0], 
                                                 len(no_aliases), df.shape[0]))
print('Questions that do not reference a subject alias: %f [%d of %d]' % (len(no_reference) / df.shape[0],
                                                                          len(no_reference), df.shape[0]))
print()
print(format_pipe_table(no_reference[:50]))

Subjects with no alias: 0.004795 [52 of 10845]
Questions that do not reference a subject alias: 0.033564 [364 of 10845]

| Index | Aliases | Question |
| --- | --- | --- |
| 0 | ["red cloud's war"] | what was involved in the red clouds war? |
| 1 | ['seychelles airport', 'seychelles international airport'] | which Canadian city is served by e> |
| 2 | ["peter's point plantation"] | What is peters point plantation's architectural style |
| 3 | ['pg (usa)'] | what film is rated pg |
| 4 | ['pillows & prayers: cherry red 1982–1983'] | What is the name of the track list for the release pillows & prayers: cherry red 1982-1983? |
| 5 | [] | what type of music does arsen shomakhov play? |
| 6 | ['duke aiona'] | What is the birth place of james aiona |
| 7 | ['fragments of unbecoming'] | what kind of music is fragmentsofunbecoming known for? |
| 8 | ['kristy thirsk', 'thirsk, kristy'] | Where was kristythirsk born |
| 9 | ["battle of hudson's bay"] | where did the battle of hudsons bay take place |


### Discussion

#### Numbers:
- Subjects with no alias: 0.004795 [52 of 10845]
- Questions that do not reference a subject alias: 0.033564 [364 of 10845]

We find that most questions reference an alias exactly. In the next experiments, we seek to handle some of the remaining 364 cases.

#### Error Bucket:

**Discussion:**

We choose to ignore the partial reference and extra word buckets (9 / 50). We expect such paraphrases of the aliases to be covered by Freebase aliases.

Our next step is to handle the largest class of errors: punctuation. In addition, as part of sentence normalization, we handle accents.
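To illustrate why punctuation and accents defeat the baseline, here is a minimal sketch using alias/question pairs taken from this notebook's samples. The helper name `exact_match` is hypothetical; it only mirrors the lowercase substring test used by the baseline.

```python
def exact_match(alias, question):
    # Baseline test: lowercase both strings and look for the alias verbatim.
    return alias.lower() in question.lower()

pairs = [
    ("red cloud's war", "what was involved in the red clouds war?"),  # apostrophe dropped
    ("ikkyū", "What is ikkyu's gender?"),                             # accent dropped
    ("tim vine", "What is tim vine's native language"),               # exact reference
]

results = [exact_match(alias, question) for alias, question in pairs]
print(results)  # [False, False, True]
```

Only the third pair matches, because a single missing apostrophe or accent breaks the substring test.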

**Buckets:**
- (15 / 50) Punctuation: An alias is referenced but with a difference in punctuation
- (11 / 50) Not Referenced: An alias is not referenced in the question
- (7 / 50) Partial Reference: Part of an alias is referenced
- (5 / 50) No alias: There are no aliases for the subject_mid associated with the question
- (5 / 50) Spaces: An alias is referenced but with a difference in spaces
- (2 / 50) Suffix: An alias is referenced but with a difference in the suffix
- (2 / 50) Article: An alias is referenced but with a difference in an article
- (2 / 50) Extra word: An alias is referenced but has an extra descriptor in the question
- (1 / 50) Accent: An alias is referenced but with a difference in accents
- (1 / 50) Parentheses: An alias is referenced but with a difference in a parenthetical word or phrase

| Index | Aliases | Question | Bucket |
| --- | --- | --- | --- |
| 0 | ["red cloud's war"] | what was involved in the red clouds war? | Punctuation |
| 1 | ['seychelles airport', 'seychelles international airport'] | which Canadian city is served by e> | Not Referenced |
| 2 | ["peter's point plantation"] | What is peters point plantation's architectural style | Punctuation |
| 3 | ['pg (usa)'] | what film is rated pg | Parentheses |
| 4 | ['pillows & prayers: cherry red 1982–1983'] | What is the name of the track list for the release pillows & prayers: cherry red 1982-1983? | Punctuation |
| 5 | [] | what type of music does arsen shomakhov play? | No alias |
| 6 | ['duke aiona'] | What is the birth place of james aiona | Not referenced |
| 7 | ['fragments of unbecoming'] | what kind of music is fragmentsofunbecoming known for? | Spaces |
| 8 | ['kristy thirsk', 'thirsk, kristy'] | Where was kristythirsk born | Spaces  |
| 9 | ["battle of hudson's bay"] | where did the battle of hudsons bay take place | Punctuation |
| 10 | ['ikkyū'] | What is ikkyu's gender? | Accent |
| 11 | ['topical medication'] | Name a topical medicine | Suffix |
| 12 | ['mike stefanski'] | What baseball position does  play | Not Referenced |
| 13 | ['t-town', 'tulsa', 'tulsa, oklahoma', 'wagoner county / tulsa city'] | What newspaper circulates in the town of kearny | Not referenced |
| 14 | ['japanese movies'] | what is a japanese netflix movie | Extra word |
| 15 | ['lucius quinctius cincinnatus'] | which military conflict was cincinnatus involved in? | Partial Reference |
| 16 | ['drummer', 'drums', 'drums & cymbals', 'drumset', 'drum set', 'rumpusetti'] | which musician plays the drum kit | Not referenced |
| 17 | [] | who created the character lex luthor | No alias |
| 18 | [] | what profession does ronald reagan have | No alias |
| 19 | ["dimillo's floating restaurant"] | what state is dimillos floating restaurant in? | Punctuation |
| 20 | ['ellis, tinsley', 'tinsley ellis', 'tnsley ellis'] | what is tinsleyellis's profession? | Spaces |
| 21 | ['prince rashid bin hassan'] | what religion is rashid bin el hassan | Extra word, Partial Reference |
| 22 | ["npr: wait wait... don't tell me!", "wait wait... don't tell me!"] | who produces npr: wait wait... dont tell me! podcast | Punctuation |
| 23 | ['ralph santolla', 'santolla, ralph'] | what instrument does ralphsantolla play  | Spaces |
| 24 | ['the regatta mystery'] | what theme is in the piece regatta mystery | Article |
| 25 | ['los que la montan', 'plan b'] | What genre of music does houseofpleasure make | Not Referenced |
| 26 | ['racing video game'] | Name a racing game | Partial Reference |
| 27 | ["álvaro d'ataide da gama"] | what is Álvaro dataide da gama | Punctuation |
| 28 | ['navajo county / pinon cdp', 'piñon', 'pinon, arizona', 'piñon, arizona'] | name a census-designated place  | Not Referenced |
| 29 | ['african american', 'africian-american', 'afro-american', 'black', 'black american'] | What ethnicity is Jermaine Jackson? | Not referenced |
| 30 | ["this pud's for you"] | what is the series where the episode this puds for you comes from | Punctuation |
| 31 | ['pop', 'pop music'] | What genre is Kazz Kumar's music?  | Not referenced |
| 32 | ["now that's what i call music! 28"] | where was now thats what i call music! 28 released? | Punctuation |
| 33 | ["chet's speech, part 2", "chet's speech, part ii"] | who sings chets speech, part ii | Punctuation |
| 34 | [] | which language was red satin recorded in? | No alias |
| 35 | ['popcaan', 'poppy'] | what genre music does andre sutherland do? | Not referenced |
| 36 | [] | What country filmed vla lcinéma ou le roman de charles pathé | No alias |
| 37 | ["men's pommel horse"] | What olympic games featured mens pommel horse | Punctuation |
| 38 | ['marwari people'] | which person is marwaris | Partial Reference, Suffix |
| 39 | ['multiplayer video game'] | what is a multiplayer computer video game | Partial Reference |
| 40 | ['josef hassid'] | where did josefhassid die | Spaces  |
| 41 | ['multiplayer video game'] | What's a text based multiplayer game | Partial Reference |
| 42 | ['the crystal city'] | what genre is crystal city | Article |
| 43 | ['symphony no. 1 including \\"blumine\\'] | who released the album symphony no. 1 including \\"blumine\\" | Punctuation |
| 44 | ['(7693) hoshitakuhai'] | Which star system is (7693) 1982 we apart of? | Not referenced |
| 45 | ["badminton - men's singles", "men's badminton, singles", "men's singles badminton"] | what olympic games was mens badminton, singles apart of | Punctuation |
| 46 | ["diff'rent strokes", "it takes diff'rent strokes"] | What musician recorded the track it takes diffrent strokes | Punctuation |
| 47 | ["brian o'shea"] | brian oshea performs what type of martial art | Punctuation |
| 48 | ["stompin' at the savoy: live", "stompin' at the savoy: live (disc 2)"] | Where was stompin at the savoy: live released? | Punctuation |
| 49 | ['penske racing south', 'team penske'] | Who is a driver for penske racing? | Partial Reference |

## Normalized Reference Resolution

Our second attempt at reference resolution removes punctuation, as it was the largest error bucket for the baseline algorithm.

In [5]:
import pandas as pd
import re
import unicodedata
from scripts.utils.table import format_pipe_table

no_reference = []

def strip_accents(s):
    nfkd_form = unicodedata.normalize('NFKD', s)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

def normalize(s):
    # Strip accents by dropping Unicode combining marks
    s = strip_accents(s)
    s = s.lower()
    # Remove punctuation
    s = re.sub(r'[^\w\s]', '', s)
    # Removing characters can create gaps of multiple spaces
    # Substitute multiple spaces with one
    s = re.sub(r'\s+', ' ', s)
    s = s.strip()
    return s

def normalized_link_alias(row, name_table=FB2M_NAME_TABLE):
    cursor.execute('SELECT alias FROM ' + name_table + ' WHERE mid=%s', (row['subject'],))
    aliases = [a[0] for a in cursor.fetchall() if len(a[0]) > 0]
    normalized_aliases = [normalize(a) for a in aliases]
    normalized_aliases = [a for a in normalized_aliases if len(a) > 0]
    if len(normalized_aliases) > 0:
        if_alias_in_question = [n_a in normalize(row['question']) for n_a in normalized_aliases]
        if not any(if_alias_in_question):
            no_reference.append({'Aliases': aliases,
                                 'Question': row['question']})
    else:
        no_reference.append({'Aliases': [],
                             'Question': row['question']})

df.progress_apply(normalized_link_alias, axis=1)

no_aliases = [r for r in no_reference if len(r['Aliases']) == 0]
print('Subjects with no alias: %f [%d of %d]' % (len(no_aliases) / df.shape[0], 
                                                 len(no_aliases), df.shape[0]))
print('Questions that do not reference a subject alias: %f [%d of %d]' % (len(no_reference) / df.shape[0],
                                                                          len(no_reference), df.shape[0]))
print()
# Get a different set of errors
# Do not overfit to the first 50 errors
print(format_pipe_table(no_reference[50:100]))

Subjects with no alias: 0.004795 [52 of 10845]
Questions that do not reference a subject alias: 0.021485 [233 of 10845]

| Index | Aliases | Question |
| --- | --- | --- |
| 0 | ['atta muhammad nur'] | which religion does ustad atta mohammed noor practice |
| 1 | ['racing video game'] | Name a popular racing game for the Xbox. |
| 2 | ['3-methylbutanoic acid', 'delphinate', 'delphinic acid', 'isopentanoate', 'isopentanoic acid', 'isopropylacetate', 'isopropylacetic acid', 'isovalerate', 'isovalerianate', 'isovalerianic acid', 'isovaleric acid'] | which biofluid can butanoic acid, 3-methyl- be found in |
| 3 | ['made-for-tv movie', 'television film', 'tv movie'] | what is a television movie film adaptation  |
| 4 | ['single-player video game'] | Name a game that you can play in single-player |
| 5 | ['effingham, surrey'] | Name someone who was born in effingham |
| 6 | ['mickey drexler', 'millard \\"mickey\\" s. drexler', 'millard s drexler'] | What organization did millard drexler found |

### Discussion

#### Numbers:

- Subjects with no alias: 0.004795 [52 of 10845]
- Questions that do not reference a subject alias: 0.021485 [233 of 10845]

After handling punctuation and accents, we resolved 131 (35.99%) of the 364 baseline errors, close to the 16/50 (32%) errors we were expecting to handle.
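The arithmetic behind these figures can be checked with a few lines. This is only a sketch; the variable names are illustrative and the counts are copied from the two runs reported in this notebook.

```python
total_questions = 10845
baseline_unlinked = 364      # unlinked after the baseline
normalized_unlinked = 233    # unlinked after normalization

# Errors resolved by normalization and their share of the baseline errors.
newly_linked = baseline_unlinked - normalized_unlinked
share_of_errors = newly_linked / baseline_unlinked

# Fraction of all questions that remain unlinked.
unlinked_rate = normalized_unlinked / total_questions

print(newly_linked)  # 131
print('%.2f%%' % (100 * share_of_errors))
print('%f' % unlinked_rate)
```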

#### Future:

Check whether we linked any questions incorrectly by sampling the examples on which the baseline and normalization algorithms differ.
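One way this check could be set up is sketched below. It assumes the unlinked questions of each run were kept in separate lists (the notebook reuses a single `no_reference` list, so this is an assumption); `sample_newly_linked` and the toy inputs are hypothetical.

```python
import random

def sample_newly_linked(baseline_unlinked, normalized_unlinked, k=50, seed=123):
    # Questions the baseline failed to link but normalization did link.
    still_unlinked = set(normalized_unlinked)
    newly_linked = [q for q in baseline_unlinked if q not in still_unlinked]
    # Fixed seed so the manual review is reproducible.
    rng = random.Random(seed)
    return rng.sample(newly_linked, min(k, len(newly_linked)))

# Toy inputs for illustration only.
baseline_unlinked = [
    "what was involved in the red clouds war?",
    "What is ikkyu's gender?",
    "Name a racing game",
]
normalized_unlinked = ["Name a racing game"]

sample = sample_newly_linked(baseline_unlinked, normalized_unlinked, k=2)
print(sample)
```

Each sampled question would then be inspected by hand to confirm that the alias matched after normalization really is the intended subject.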

#### Error Bucket:

**Discussion:**

Partial is, together with Not Referenced, the largest category of errors (13 / 50 each). The problem is to find the subject alias that is referenced in the question. In our opinion, a partial reference to a subject alias is not a reference but rather a new alias; therefore, we choose not to handle it. We also consider parentheses a special case of a partial reference.

Suffix, spelling and middle name (9 / 50) are cases of small edits between the alias and the question reference. We can handle these with edit distance.
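To make "small edits" concrete, here is a minimal sketch of a character-level Levenshtein distance applied to normalized examples from the table below. It is a generic textbook implementation, not the notebook's `edit_token_distance`.

```python
def levenshtein(a, b):
    # Classic dynamic program over two rows: insert, delete, substitute cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

examples = [
    ("death eaters", "death eater"),       # Suffix
    ("bikash malla", "bikash malal"),      # Spelling
    ("raymond a meier", "raymond meier"),  # Middle Name
]
distances = [levenshtein(alias, reference) for alias, reference in examples]
print(distances)  # [1, 2, 2]
```

All three cases are within two character edits, which is why a small distance threshold is a plausible way to cover them.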

Spaces are difficult to handle because the alias tokens are merged into a single token in the question; therefore, the edit distance between tokens is not helpful. We would need an edit distance between characters that does not respect word boundaries; therefore, we do not handle this case.
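The following sketch illustrates the problem with one example from the baseline sample. The `squash` helper is hypothetical and is not something this notebook applies; it only shows what ignoring word boundaries would look like.

```python
import re

def squash(s):
    # Drop every whitespace character so token boundaries no longer matter.
    return re.sub(r"\s+", "", s.lower())

alias = "kristy thirsk"
question = "Where was kristythirsk born"

# Token view: the merged question token shares nothing with the alias tokens.
token_overlap = set(alias.split()) & set(question.lower().split())

# Character view without word boundaries: the alias is found.
squashed_match = squash(alias) in squash(question)

print(token_overlap)   # set()
print(squashed_match)  # True
```

The trade-off is that a boundary-free match can also straddle unrelated words, which is likely to raise the false positive rate.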

**Buckets:**
- (13 / 50) Partial: Part of an alias is referenced
- (13 / 50) Not Referenced: An alias is not referenced in the question
- (9 / 50) No alias: There are no aliases for the subject_mid associated with the question
- (4 / 50) Suffix: An alias is referenced but with a difference in the suffix
- (3 / 50) Spelling: An alias is referenced but with a different spelling
- (3 / 50) Spaces: An alias is referenced but with a difference in spaces
- (2 / 50) Middle Name: The middle name is different between the alias and question reference.
- (2 / 50) Other: It does not fit any of the above. The alias is referenced but in an odd way.
- (1 / 50) Parentheses: An alias is referenced but with a difference in a parenthetical word or phrase

| Index | Aliases | Question | Bucket |
| --- | --- | --- | --- |
| 0 | ['atta muhammad nur'] | which religion does ustad atta mohammed noor practice | Spelling |
| 1 | ['racing video game'] | Name a popular racing game for the Xbox. | Partial |
| 2 | ['3-methylbutanoic acid', 'delphinate', 'delphinic acid', 'isopentanoate', 'isopentanoic acid', 'isopropylacetate', 'isopropylacetic acid', 'isovalerate', 'isovalerianate', 'isovalerianic acid', 'isovaleric acid'] | which biofluid can butanoic acid, 3-methyl- be found in | Other |
| 3 | ['made-for-tv movie', 'television film', 'tv movie'] | what is a television movie film adaptation  | Partial |
| 4 | ['single-player video game'] | Name a game that you can play in single-player | Partial |
| 5 | ['effingham, surrey'] | Name someone who was born in effingham | Partial |
| 6 | ['mickey drexler', 'millard \\"mickey\\" s. drexler', 'millard s drexler'] | What organization did millard drexler found | Middle Name |
| 7 | ['me and the first lady'] | what is the release type of the album? | Not Referenced |
| 8 | ['william hay, baron hay of ballyore'] | What profession is william hay? | Partial |
| 9 | ['donora-monessen bridge'] | What is the location of the stan musial bridge | Not Referenced |
| 10 | ['godbout v longueuil (city of)'] | what court handled the godbout v. longueuil case? | Parentheses |
| 11 | ['ray meier', 'raymond a. meier'] | Which city was raymond meier born in | Middle Name  |
| 12 | ['drummer', 'drums'] | who played the drum  in the Los Angeles rock quintet Rooney | Suffix |
| 13 | ['death eaters'] | which is the name of a death eater in harry potter? | Suffix  |
| 14 | ['pasja'] | What language is spoken in the film? | Not Referenced |
| 15 | ['single-player video game'] | what is a single-player mode game?  | Partial |
| 16 | ['psygnosis'] | What is the name of a game that was published by sce studio liverpool? | Not Referenced |
| 17 | ['elizabeth stuart, daughter of charles i'] | what was the location of elizabeth stuart's death | Partial |
| 18 | ['dynamike enjoy incubus', 'einziger, mike', 'fabio fungus amongus', 'jawa s.c.i.e.n.c.e.', 'michael aaron einziger', 'mike \\"danger\\" einziger', 'mike einziger', 'mikey'] | What type of rock does michaelaaroneinziger play | Spaces |
| 19 | [] | whats the netflix genre of the film  the other one | No Alias |
| 20 | ["the bishop's bedroom"] | what country produced the film la stanza del vescovo | Not Referenced |
| 21 | ['defender'] | what Romanian player plays as a center in basketball? | Not Referenced |
| 22 | ['battle of ft. lamar', 'battle of secessionville', 'first battle of james island'] | Name a soldier involved in the battle of james island. | Partial |
| 23 | [] | What type of content is the album friends? | No Alias |
| 24 | [] | what kind of music is southern associated with | No Alias |
| 25 | [] | what is can you tell me how to get to sesame street? the theme song for  | No Alias |
| 26 | [] | Name a histamine used to treat a Duodenal Ulcer. | No Alias |
| 27 | [] | which region contains asti docg | No Alias |
| 28 | ['horror', 'horror fiction', 'horror film', 'horror movie', 'horror movies', 'horror novel'] | what story is in the literary genre | Not Referenced |
| 29 | ['single-player video game'] | what game requires single-player | Partial |
| 30 | ['charles h. caffin', 'charles henry caffin'] | What release includes the recording red | Not Referenced |
| 31 | ["don't tell me (what love can do)"] | Who composed "Don't Tell Me 9What Love Can Do)"? | Spelling |
| 32 | [] | what's a movie in the children & family movies section on netflix | No Alias |
| 33 | ['christoph schlingensief trifft: richard wagner'] | what was the release type of  the basic wagner (disc 2) | Partial |
| 34 | ['funky meters', 'meters', 'meters, the', 'the meters', 'the original meters'] | what is a song by | Partial |
| 35 | ['alvis td 21'] | Which breed of dog has the colors of liver & white | Not Referenced |
| 36 | ['ses plus belles chansons, volume 1 : sauf le respect que je vous dois...'] | What is the release type of the album sauf le respect que je vous dois... | Partial |
| 37 | ['biologic'] | what is a drug | Not referenced |
| 38 | ['shuttle tydirium approaches endor'] | who produced ? | Not Referenced |
| 39 | ['jeff hatrix', 'jeffrey hatrix', 'jeffrey nothing', 'nothing, jeffrey'] | what style of music does mrhjeffreynothing perform | Spaces |
| 40 | ['brodnica county'] | what is one of the 19 voivodeships located in mid-northern Poland | Not Referenced |
| 41 | ['alexis reich'] | what is john mark karr's gender  | Not Referenced |
| 42 | ['bikash malla'] | What is bikash malal's nationality? | Spelling |
| 43 | ['ann rabson', 'rabson, ann'] | What is the nationality of annrabson? | Spaces |
| 44 | ['lake cowichan', 'lake cowichan, british columbia', 'lake cowichan, canada'] | which country is cowichan lake in | Other |
| 45 | [] | What type of computer game is last chaos | No alias |
| 46 | [] | What type of release did ohio have? | No alias |
| 47 | ['asteroid', 'minor planets'] | What is a main-belt minor planet discovered in 1999? | Suffix |
| 48 | ['somalis'] | Who's somebody that identifies with the somali people | Suffix |
| 49 | ['atlantic time zone'] | Name a location in the atlantic standard time zone | Partial |

## Edit Distance Reference Resolution

Our third attempt is to define a threshold edit distance for reference resolution.
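The cell below relies on `edit_token_distance`, whose implementation is not shown here. As a rough sketch of the scoring idea only, assume a plain character-level Levenshtein distance between a normalized alias and one candidate span of the question; the names `levenshtein` and `similarity` are hypothetical.

```python
def levenshtein(a, b):
    # Character-level edit distance with unit costs.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similarity(alias, span):
    # Share of the alias left intact: 1.0 is an exact match.
    return (len(alias) - levenshtein(alias, span)) / len(alias)

THRESHOLD = 0.75  # same cut-off as in the cell below

# "ikkyu's" normalizes to "ikkyus", one edit away from the alias "ikkyu".
score = similarity("ikkyu", "ikkyus")
print(score, score >= THRESHOLD)  # 0.8 True
```

Dividing by the alias length means short aliases tolerate fewer edits than long ones, so a single threshold adapts to alias length.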

In [1]:
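import sys
sys.path.insert(0, '../../')  # make the project's `scripts` package importable
import pandas as pd
from scripts.utils.edit_distance import edit_token_distance
from scripts.utils.table import format_pipe_table

# NOTE: relies on `cursor`, `FB2M_NAME_TABLE`, `df` and `normalize` defined in earlier cells.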

no_reference = []
reference = []

def edit_distance_link_alias(row, name_table=FB2M_NAME_TABLE):
    cursor.execute('SELECT alias FROM ' + name_table + ' WHERE mid=%s', (row['subject'],))
    aliases = [a[0] for a in cursor.fetchall()]
    normalized_aliases = [normalize(a) for a in aliases]
    normalized_aliases = [tuple(a.split()) for a in normalized_aliases if len(a) > 0]
    if len(normalized_aliases) > 0:
        # NOTE: normalize() removes punctuation, so splitting on whitespace is enough to tokenize
        normalized_question = tuple(normalize(row['question']).split())
        distances = [edit_token_distance(n_a, normalized_question) for n_a in normalized_aliases]
        normalized_aliases_lengths = [sum(len(t) for t in n_a) for n_a in normalized_aliases]
        normalized_distances = [(normalized_aliases_lengths[i] - d[0]) / normalized_aliases_lengths[i]
                                for i, d in enumerate(distances)]
        
        if max(normalized_distances) < 0.75:
            no_reference.append({'Distance': max(normalized_distances),
                                 'Aliases': aliases,
                                 'Question': row['question']})
        elif max(normalized_distances) < 1.0: # Sample edge cases
            reference.append({'Distance': max(normalized_distances),
                             'Aliases': aliases,
                             'Question': row['question']})
    else:
        no_reference.append({'Distance': 0,
                             'Aliases': [],
                             'Question': row['question']})

df.progress_apply(edit_distance_link_alias, axis=1)

no_aliases = [r for r in no_reference if len(r['Aliases']) == 0]
print('Subjects with no alias: %f [%d of %d]' % (len(no_aliases) / df.shape[0], 
                                                 len(no_aliases), df.shape[0]))
print('Questions that do not reference a subject alias: %f [%d of %d]' % (len(no_reference) / df.shape[0],
                                                                          len(no_reference), df.shape[0]))
print()
# Get a different set of errors
# Do not overfit to the first 50 errors
no_reference = sorted(no_reference, key=lambda i: i['Distance'])
print(format_pipe_table(no_reference[-50:]))
reference = sorted(reference, key=lambda i: i['Distance'])
print(format_pipe_table(reference[0:50]))


### Discussion

#### Numbers:

- Subjects with no alias: 0.004795 [52 of 10845]
- Questions that do not reference a subject alias: 0.018903 [205 of 10845]

On top of handling punctuation and accents, the edit distance threshold handled a further 28 (12.02%) of the 233 remaining errors, fewer than the 9/50 (18%) that we were expecting.

#### Error Bucket:

**Discussion:**

Among the 50 sampled matches, this technique produced only one false positive. The token edit distance missed concatenated tokens (for example, `conefive` for the alias `cone five`). One solution is to use a substring edit distance with a penalty for starting or ending halfway through a token. This would allow us to handle 7/50 errors while maintaining a low false positive rate.
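A minimal sketch of the substring edit distance idea follows, using the standard approximate string matching recurrence in which a match may start and end anywhere in the question. It deliberately leaves out the proposed penalty for starting or ending halfway through a token; the function name is hypothetical.

```python
def substring_edit_distance(pattern, text):
    # Minimum edit distance between `pattern` and any substring of `text`.
    # Row 0 is all zeros, so a match may start at any position for free.
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            cost = 0 if p == t else 1
            curr[j] = min(prev[j] + 1,          # skip a pattern character
                          curr[j - 1] + 1,      # absorb an extra text character
                          prev[j - 1] + cost)   # substitute or match
        prev = curr
    # Taking the minimum over the last row lets the match end anywhere.
    return min(prev)

print(substring_edit_distance("cone five", "where is conefive from"))  # 1
print(substring_edit_distance("kyle eastwood",
                              "what does kyleeastwood do for a living"))  # 1
```

Both concatenated references are one edit away (the missing space). Without the boundary penalty, matches that begin or end inside an unrelated word are scored the same as matches aligned to token boundaries.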

We find that 36/50 (72%) of the unlinked questions have no alias for their subject_mid or do not reference the subject_mid alias.

**Buckets:**
- (20 / 50) Not Referenced: An alias is not referenced in the question
- (16 / 50) No alias: There are no aliases for the subject_mid associated with the question
- (7 / 50) Spelling: An alias is referenced but with a different spelling
- (5 / 50) Partial: Part of an alias is referenced
- (1 / 50) Parentheses: An alias is referenced but with a difference in a parenthetical word or phrase
- (1 / 50) Other: It does not fit any of the above. The alias is referenced but in an odd way.

| Index | Aliases | Distance | Question | Bucket |
| --- | --- | --- | --- | --- |
| 0 | [] | 0.0 | what type of film is to kill a cop? | No Alias |
| 1 | [] | 0.0 | What dialect is spoken in republic of kosovo | No Alias |
| 2 | [] | 0.0 | What genre of music is the album no limitations? | No Alias |
| 3 | [] | 0.0 | What game is a version of minotaur: the labyrinths of crete | No Alias |
| 4 | [] | 0.0 | what type of video game is quattro power | No Alias |
| 5 | [] | 0.0 | what war did joseph guillemot participate in | No Alias |
| 6 | [] | 0.0 | what type of novel is elemental: the power of illuminated love? | No Alias |
| 7 | [] | 0.0 | who developed the game vay | No Alias |
| 8 | [] | 0.0 | What's a novel that wilhelmina grubbly-plank appears in | No Alias |
| 9 | [] | 0.0 | What artist recorded europa? | No Alias |
| 10 | [] | 0.0 | which video game company published sid meiers alien crossfire | No Alias |
| 11 | [] | 0.0 | which track contains vienna | No Alias |
| 12 | [] | 0.0 | Where is yu~ki from? | No Alias |
| 13 | [] | 0.0 | What format was e che mai sarà - le mie più belle canzoni released in | No Alias |
| 14 | [] | 0.0 | what is the album the artist dissecting table made | No Alias |
| 15 | [] | 0.0 | What industry is sega technical institute apart of? | No Alias |
| 16 | ['monster'] | 0.14285714285714285 | Which release track is also a recording? | Not Referenced |
| 17 | ['memnon'] | 0.16666666666666666 | Where was anna pavlova buried after death | Not Referenced |
| 18 | ['podensac', 'podensac, france'] | 0.1875 | what commune is within aquitaine? | Not Referenced |
| 19 | ['soundtrack'] | 0.2 | Name a music album from lara croft | Not Referenced |
| 20 | ['pulse'] | 0.2 | what country is opopomoz from | Not Referenced |
| 21 | ['rob rock', 'rock, rob'] | 0.2222222222222222 | robrock1 plays what instrument | Spelling |
| 22 | ['cone five'] | 0.2222222222222222 | Where is conefive from | Spelling |
| 23 | ['capitol punishment'] | 0.2222222222222222 | This self-titled released was originally released in what region? | Not Referenced |
| 24 | ['ca', 'calif', 'california', 'california, usa', 'golden state'] | 0.26666666666666666 | Which book is written about? | Not Referenced |
| 25 | ['project mersh', 'project: mersh'] | 0.2857142857142857 | Which artist released the album project:mersh? |  Spelling |
| 26 | ['david williams'] | 0.2857142857142857 | what guitar does thescamp play | Not Referenced |
| 27 | ['vesania'] | 0.2857142857142857 | what type of music is godthelux | Not Referenced |
| 28 | ['03 9.31'] | 0.2857142857142857 | What artist is behind the recording? | Not Referenced |
| 29 | ['aftereight', 'capital lights'] | 0.3 | what types of music is played by capitallights | Spelling |
| 30 | ['donmar', 'donmar warehouse', 'donmar warehouse, london borough of camden'] | 0.3125 | What was theatrical production of MacBeth staged in? | Not Referenced |
| 31 | ['marc and the mambas', 'raoul and the ruined'] | 0.35 | what record label worked with tormentandtoreros | Not Referenced |
| 32 | ['batman: original motion picture score'] | 0.35135135135135137 | what country was the music album batman released in | Partial |
| 33 | ['equistone partners europe'] | 0.36 | Which industry does barclays private equity operate in? | Not Referenced |
| 34 | ['sabrina fredrica washington', 'sabrina washington', 'washington, sabrina'] | 0.3684210526315789 | Which label is sabrinawmusic signed to? | Spelling |
| 35 | ['kyle clinton eastwood', 'kyle eastwood'] | 0.38461538461538464 | what does kyleeastwood do for a living | Spelling |
| 36 | ['jacobus franciscus thorpe', 'james francis \\"jim\\" thorpe', 'james francis thorpe', 'james thorpe', 'jim thorpe', 'wathahuck-brightpath'] | 0.4 | What artist recorded the track madagascar |  Not Referenced |
| 37 | ['studio album'] | 0.5 | What is the name of the award winning album by Linda Ronstant? | Not Referenced |
| 38 | ['the 7pm project', 'the 7pm project ratings', 'the project'] | 0.5333333333333333 | what country is the tv program originally from  | Not Referenced |
| 39 | ['evolution (album version)'] | 0.6 | What release is evolution a part of? | Parentheses |
| 40 | ['vocal jazz'] | 0.6 | What is the name of Nat King Cole's jazz album? | Not Referenced |
| 41 | ['in london', "marlene dietrich at queen's theatre", 'marlene dietrich in london'] | 0.6 | What format was at queen's theatre released as | Not Referenced |
| 42 | ['35 mm film', '35mm film'] | 0.6 | what is an example of a film on 35mm | Other |
| 43 | ['edgar award for best fact crime'] | 0.6129032258064516 | what is best fact crime | Not Referenced |
| 44 | ['canaro, italy', 'canaro, rovigo'] | 0.6153846153846154 | Where is canaro located? | Partial |
| 45 | ['cabbit', 'kayt manninagh', 'manx cat', 'stubbin'] | 0.625 | What type of animal is a manx | Partial |
| 46 | ['patty mills'] | 0.6363636363636364 | patrick mills was born in what Australian city | Spelling |
| 47 | ["women's studies"] | 0.6666666666666666 | what is a book about feminist studies? | Not Referenced |
| 48 | ['latin music', 'music of latin america'] | 0.6818181818181818 | which song is latin american music in netflix | Partial |
| 49 | ['multiplayer video game'] | 0.7727272727272727 | what is a multiplayer game? | Partial |

We look at the matched questions and subject aliases to check whether we tagged any false positives. We found one false positive, between "t-town" and "town", for the question "What newspaper circulates in the town of kearny". The technique caught 5/50 partial references.

| Index | Aliases | Distance | Question | Bucket |
| --- | --- | --- | --- | --- |
| 0 | ['ikkyū'] | 0.8 | What is ikkyu's gender? | Referenced |
| 1 | ['drummer', 'drums', 'drums & cymbals', 'drumset', 'drum set', 'rumpusetti'] | 0.8 | which musician plays the drum kit | Referenced |
| 2 | ['mœnia'] | 0.8 | what's the title of one of mœnia's albums | Referenced |
| 3 | ['christa paffgen', 'christa päffgen', 'krista nico', 'krista päffgen', 'nico', 'nicol', 'nico otzak'] | 0.8 | what is the name of one of nico's songs | Referenced |
| 4 | ['drummer', 'drums'] | 0.8 | who played the drum  in the Los Angeles rock quintet Rooney | Referenced |
| 5 | ['album'] | 0.8 | which albums were released by the century media label? | Referenced |
| 6 | ['queen'] | 0.8 | What is queen's compilation album called | Referenced |
| 7 | ['aasai'] | 0.8 | What was aasai's country of origin? | Referenced |
| 8 | ['album'] | 0.8 | which albums were released by deep purple? | Referenced |
| 9 | ['bink', 'bink!', 'roosevelt harrell iii'] | 0.8 | what is bink!'s profession? | Referenced |
| 10 | ['mecca'] | 0.8 | what is the name for meccas canonical version | Referenced |
| 11 | ['album'] | 0.8 | what albums did Willie Nelson release? | Referenced |
| 12 | ['album'] | 0.8 | what are albums? | Referenced |
| 13 | ['kylun'] | 0.8 | what is kylun's fictional gender | Referenced |
| 14 | ['album'] | 0.8 | what albums were released in 2000? | Referenced |
| 15 | ['drummer', 'drums', 'drums & cymbals', 'drumset', 'drum set', 'rumpusetti'] | 0.8 | Which instrumentalists play the drum kit? | Referenced |
| 16 | ['drummer', 'drums', 'drums & cymbals', 'drumset', 'drum set', 'rumpusetti'] | 0.8 | Who uses a drum kit? | Referenced |
| 17 | ['the crystal city'] | 0.8125 | what genre is crystal city | Partial Reference |
| 18 | ['battle of ft. lamar', 'battle of secessionville', 'first battle of james island'] | 0.8214285714285714 | Name a soldier involved in the battle of james island. | Partial Reference |
| 19 | ['godbout v longueuil (city of)'] | 0.8275862068965517 | what court handled the godbout v. longueuil case? |  Partial Reference |
| 20 | ['master'] | 0.8333333333333334 | what is one of the master's powers  | Referenced |
| 21 | ['t-town', 'tulsa', 'tulsa, oklahoma', 'wagoner county / tulsa city'] | 0.8333333333333334 | What newspaper circulates in the town of kearny | Not Referenced |
| 22 | ['merari'] | 0.8333333333333334 | who was merari's father  | Referenced |
| 23 | ['bikash malla'] | 0.8333333333333334 | What is bikash malal's nationality? | Referenced |
| 24 | ['peanut', 'peanuts, all types, raw'] | 0.8333333333333334 | What Indonesian food contains chilis and peanuts? | Referenced |
| 25 | ['ahmed bin mubarak bin obaid al mahaijri', 'ahmed mubarak al-mahaijri', 'ahmed mubarak obaid al mahajiri'] | 0.84 | What is the birth location of ahmed mubarak al mahaijri? | Referenced |
| 26 | ['the regatta mystery'] | 0.8421052631578947 | what theme is in the piece regatta mystery | Partial Reference |
| 27 | ['new american soldier'] | 0.85 | what is the language spoken in the filmnew american soldier? | Referenced |
| 28 | ['bretons'] | 0.8571428571428571 | Name one of the breton people | Referenced |
| 29 | ['heng wu'] | 0.8571428571428571 | What is Heng Wu's profession | Referenced |
| 30 | ['somalis'] | 0.8571428571428571 | Who's somebody that identifies with the somali people | Referenced |
| 31 | ['mahemood ali khan', 'mahmood', 'mahmood ali', 'mahmud', 'mehmood', 'mehmood ali', 'mehmood ali khan', 'mehmood bhaijaan', 'mehmood/malabari mahmood', 'mehmoood', 'mehmud', 'محمود', 'محمود علی)'] | 0.8571428571428571 | where did mehmood's life end | Referenced |
| 32 | ['stretch'] | 0.8571428571428571 | what is stretch's gender | Referenced |
| 33 | ['german people', 'germans'] | 0.8571428571428571 | who is german? | Referenced |
| 34 | ['miss my', 'myriam boucher'] | 0.8571428571428571 | what is miss my's gender | Referenced |
| 35 | ['pablo longueira'] | 0.8666666666666667 | Name the profession of Pablo Longueria. | Referenced |
| 36 | ['tristeza'] | 0.875 | what is tristeza's album that starts with an s | Referenced |
| 37 | ['jonwayne'] | 0.875 | what is jonwayne's birth location? | Referenced |
| 38 | ['bomb, the', 'the bomb'] | 0.875 | where is the bomb's origins | Referenced |
| 39 | ['timothy mark vine', 'tim vine'] | 0.875 | What is tim vine's native language | Referenced |
| 40 | ['don n. page', 'don page'] | 0.875 | what is don page's gender? | Referenced |
| 41 | ['single-player video game'] | 0.875 | what is a single-player mode game? | Partial Reference |
| 42 | ['styve c.'] | 0.875 | What is Styve C.'s gender? | Referenced |
| 43 | ['ben gage', 'benjamin austin gage'] | 0.875 | what caused ben gage's death | Referenced |
| 44 | ['jay bell'] | 0.875 | What is Jay Bell's profession. | Referenced |
| 45 | ['ken owen', 'owen, ken'] | 0.875 | what is ken owen's nationality  | Referenced |
| 46 | ['abu al-qasim', 'abū al-qāsim muḥammad ibn ʿabd allāh ibn ʿabd al-muṭṭalib ibn hāshim', 'ahmed', 'messenger muhammad pbuh', 'mohammed, mahomet, muhammad', 'muhammad', 'muḥammad', 'muḥammad ibn `abd allāh', 'muhammad in islam'] | 0.875 | What is one of muhammad's children's names?  | Referenced |
| 47 | ['malaysia'] | 0.875 | Name one of Malaysia's thirteen states that lies on the western coast of the Peninsula and south of Kuala Lumpur  | Referenced |
| 48 | ['greco–punic wars'] | 0.875 | where did greek–punic wars take place? | Referenced |
| 49 | ['eli m noam', 'eli noam'] | 0.875 | what is eli noam's job  | Referenced |