# Phase Two: Find Similar Entities via Adapted Wikipedia2vec most_similar()

In Phase Two, we try to generate candidate pools for each full mention. In this method, we use an adaptation of Wikipedia2vec's most_similar() function that translates a full mention into a vector representation, returns a large sample of similar words and entities and filters to just entities as our candidates.

To improve speed of processing, we used Google Colab to download the pre-trained embeddings and run this notebook.

#### Import Packages

In [1]:
import os
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Progress bar
from tqdm import tqdm

In [2]:
!pip install wikipedia2vec

Collecting wikipedia2vec
[?25l  Downloading https://files.pythonhosted.org/packages/d8/88/751037c70ca86581d444824e66bb799ef9060339a1d5d1fc1804c422d7cc/wikipedia2vec-1.0.4.tar.gz (1.2MB)
[K     |████████████████████████████████| 1.2MB 11.8MB/s 
Collecting marisa-trie
[?25l  Downloading https://files.pythonhosted.org/packages/20/95/d23071d0992dabcb61c948fb118a90683193befc88c23e745b050a29e7db/marisa-trie-0.7.5.tar.gz (270kB)
[K     |████████████████████████████████| 276kB 32.0MB/s 
[?25hCollecting mwparserfromhell
[?25l  Downloading https://files.pythonhosted.org/packages/23/03/4fb04da533c7e237c0104151c028d8bff856293d34e51d208c529696fb79/mwparserfromhell-0.5.4.tar.gz (135kB)
[K     |████████████████████████████████| 143kB 36.2MB/s 
Building wheels for collected packages: wikipedia2vec, marisa-trie, mwparserfromhell
  Building wheel for wikipedia2vec (setup.py) ... [?25l[?25hdone
  Created wheel for wikipedia2vec: filename=wikipedia2vec-1.0.4-cp36-cp36m-linux_x86_64.whl size=45819

In [3]:
# Package
from wikipedia2vec import Wikipedia2Vec

# Class to compare type
from wikipedia2vec.dictionary import Entity

In [4]:
# Download dimensional file from Wikipedia2vec website
!curl -O http://wikipedia2vec.s3.amazonaws.com/models/en/2018-04-20/enwiki_20180420_100d.pkl.bz2

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3352M  100 3352M    0     0  13.0M      0  0:04:17  0:04:17 --:--:-- 13.3M


In [5]:
!bzip2 -d enwiki_20180420_100d.pkl.bz2

In [6]:
!ls

enwiki_20180420_100d.pkl  sample_data


In [7]:
%%time
# Load unzipped pkl file with word embeddings
w2v = Wikipedia2Vec.load("enwiki_20180420_100d.pkl")

CPU times: user 53.4 ms, sys: 218 ms, total: 272 ms
Wall time: 350 ms


## Load ACY Input Data

In [8]:
from google.colab import files
uploaded = files.upload()

Saving Aida-Conll-Yago-Input.csv to Aida-Conll-Yago-Input.csv


In [9]:
!ls

Aida-Conll-Yago-Input.csv  enwiki_20180420_100d.pkl  sample_data


In [10]:
# Load data
acy_input = pd.read_csv("Aida-Conll-Yago-Input.csv", delimiter=",")
acy_input.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions
0,B,EU,,,,0,0,"['EU', 'German', 'British']"
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']"
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']"


In [11]:
# Re-name 
candidate_pools = acy_input.copy()

## Find most similar entity using Wikipedia2vec

We now turn to using a variation on Wikipedia2vec's `most_similar()` function to find, for entered words, the most similar entity.

In [12]:
# Normalize full_mentions to lower case for entry into most_similar() function
full_mention_norm = np.array([x.lower() for x in candidate_pools['full_mention']])
candidate_pools['full_mention_norm'] = full_mention_norm
candidate_pools.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,full_mention_norm
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british


In [13]:
### Test single full mention query time
start_time = time.time()

# Print word
search_word = candidate_pools['full_mention_norm'][2]
print("Search Word: ", search_word)

# Translate word into vector
# Handles multi-word mentions
search_word_list = search_word.split(" ")
search_word_vector = None
for word in search_word_list:
    try:
        vector = w2v.get_word_vector(str(word))
    except KeyError:
        print("This word is Out of Vocabulary (OOV) for Wikipedia2vec.")
        vector = None
        
    if search_word_vector is None:
        search_word_vector = vector
    else:
        search_word_vector += vector

if search_word_vector is not None:
    # Get most similar word
    count_similar = 500
    similar = w2v.most_similar_by_vector(search_word_vector, count_similar)

    # Retrieve only entities from word
    entities = []
    return_similar = 10
    for i in similar:
    #     print(type(i[0]))
        if isinstance(i[0], Entity):
            entities.append(i)
    #     if len(entities) == return_similar:
    #         break
    display(entities)
end_time = time.time()
print(f"Single Word Query Time: {round(end_time - start_time, 2)}s")

Search Word:  british


[(<Entity Russians in the United Kingdom>, 0.61556417),
 (<Entity Henry Wood (naval officer)>, 0.60587215),
 (<Entity D.N. Penfold>, 0.60256517),
 (<Entity Christopher J. Burgess>, 0.5973172),
 (<Entity Commonwealth of Nations>, 0.5968252),
 (<Entity British Empire>, 0.59353995),
 (<Entity File:Flag of The Commonwealth.svg>, 0.59272784),
 (<Entity Dial 999 (1938 film)>, 0.59214795),
 (<Entity Black British>, 0.5881806),
 (<Entity Numa François Henri Sadoul>, 0.5852481),
 (<Entity F.W. Hunt>, 0.58309895),
 (<Entity George Paice (bowls)>, 0.5830955),
 (<Entity File:Ribbon - Volunteer Long Service Medal.png>, 0.58157176),
 (<Entity British Hong Kong>, 0.58109015),
 (<Entity Walter Wilkinson Wallace>, 0.5809743),
 (<Entity Canadians in the United Kingdom>, 0.58094454),
 (<Entity Arthur Walker (trade unionist)>, 0.5803683),
 (<Entity Derek George Cudmore>, 0.5799971),
 (<Entity Colonial Auxiliary Forces Long Service Medal>, 0.5795136),
 (<Entity File:Peter O'Toole in Lawrence of Arabia.png>

Single Word Query Time: 4.3s


In [14]:
### Test single full mention query time on a mention with multiple words
start_time = time.time()

# Print word
search_word = candidate_pools['full_mention_norm'][51]
print("Search Word: ", search_word)

# Translate word into vector
# Handles multi-word mentions
search_word_list = search_word.split(" ")
search_word_vector = None
for word in search_word_list:
    try:
        vector = w2v.get_word_vector(str(word))
        
        if search_word_vector is None:
            search_word_vector = vector
        else:
            search_word_vector += vector

    except KeyError:
        print(f"\"{word}\" is Out of Vocabulary (OOV) for Wikipedia2vec.")

if search_word_vector is not None:
    # Get most similar word
    count_similar = 500
    similar = w2v.most_similar_by_vector(search_word_vector, count_similar)

    # Retrieve only entities from word
    entities = []
    return_similar = 10
    for i in similar:
    #     print(type(i[0]))
        if isinstance(i[0], Entity):
            entities.append(i)
    #     if len(entities) == return_similar:
    #         break
    display(entities)
end_time = time.time()
print(f"Single Word Query Time: {round(end_time - start_time, 2)}s")

Search Word:  welsh national farmers ' union
"'" is Out of Vocabulary (OOV) for Wikipedia2vec.


[(<Entity :Category:Wikipedians interested in the European Union>, 0.74048346),
 (<Entity Whiteheads RFC>, 0.7279615),
 (<Entity Ian Mackay (rugby league)>, 0.7100251),
 (<Entity E. Gwyndaf Evans>, 0.70472246),
 (<Entity Ystrad Rhondda RFC>, 0.7012322),
 (<Entity Abercynon RFC>, 0.70090365),
 (<Entity Caernarfon RFC>, 0.69320804),
 (<Entity Tredegar Ironsides RFC>, 0.69225544),
 (<Entity Fleur De Lys RFC>, 0.6915729),
 (<Entity Cwmgwrach RFC>, 0.6905903),
 (<Entity File:Harden NSW.PNG>, 0.6901128),
 (<Entity Betws RFC>, 0.6887361),
 (<Entity Dai Francis (trade union leader)>, 0.6886294),
 (<Entity Llandybie RFC>, 0.6878249),
 (<Entity Berwyn Rangers F.C.>, 0.6861579),
 (<Entity :Category:All Blacks>, 0.68473554),
 (<Entity :Eastern Conference (NHL)>, 0.6845522),
 (<Entity Markham RFC>, 0.6823692),
 (<Entity Bethesda RFC>, 0.6821802),
 (<Entity Fall Bay RFC>, 0.6820504),
 (<Entity Cwmtwrch RFC>, 0.6806142),
 (<Entity Neil Lashkari>, 0.6795225),
 (<Entity Joost Adriaan van Hamel>, 0.6789

Single Word Query Time: 2.17s


### Run Over Large Subset of Data

In [15]:
# Prepare output array
most_similar_entities = []
most_similar_scores = []
get_similar_candidate_pool = []
get_similar_candidate_scores = []

# Track metrics
success_word_query = 0
oov_errors = 0
start_time = time.time()

In [16]:
# Provide filter ability
end_size = 5000

for mention in tqdm(candidate_pools['full_mention_norm'][:end_size]):
    
    # Translate word into vector
    # Handles multi-word mentions
    search_word_list = mention.split(" ")
    search_word_vector = None
    for word in search_word_list:
        try:
            vector = w2v.get_word_vector(str(word))
            
            if search_word_vector is None:
                search_word_vector = vector
            else:
                search_word_vector += vector
                
        except KeyError:
            oov_errors += 1

    # Save candidate pool
    candidate_pool = []
    candidate_scores = []
    
    if search_word_vector is not None:
        success_word_query += 1
        
        # Search most similar words/entities from found word
        # Retrieve 500 most similar to test large coverage
        similars = w2v.most_similar_by_vector(search_word_vector, 500)

        # Retrieve most similar entity
        most_similar = None
        for s in similars:
            if isinstance(s[0], Entity):
                candidate_pool.append(s[0].title)
                candidate_scores.append(s[1])
                if most_similar is None:
                    most_similar = s
                
    # Save lists
    get_similar_candidate_pool.append(candidate_pool)
    get_similar_candidate_scores.append(candidate_scores)
    
    if most_similar is not None:
        most_similar_entities.append(most_similar[0].title)
        most_similar_scores.append(most_similar[1])
    else:
        most_similar_entities.append(None)
        most_similar_scores.append(None)

100%|██████████| 5000/5000 [2:52:31<00:00,  2.07s/it]


In [17]:
print("Successfully Found Words: ", round(success_word_query/end_size*100,3),"%")
print("Out-of-Vocabulary Issues: ", round(oov_errors/end_size*100,3),"%")
execution_time = time.time() - start_time
print("Execution time: ", round(execution_time, 3),"s")

Successfully Found Words:  97.02 %
Out-of-Vocabulary Issues:  10.0 %
Execution time:  10353.608 s


In [18]:
# Append to dataframe
mini_df = candidate_pools[:end_size].copy()
mini_df['preds_w2v_mostsimilar'] = most_similar_entities
mini_df['score_w2v_mostsimilar'] = most_similar_scores
mini_df['candidate_pool_mostsimilar'] = get_similar_candidate_pool
mini_df['candidate_scores_mostsimilar'] = get_similar_candidate_scores
mini_df.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,full_mention_norm,preds_w2v_mostsimilar,score_w2v_mostsimilar,candidate_pool_mostsimilar,candidate_scores_mostsimilar
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu,European Union,0.787421,"[European Union, European Free Trade Associati...","[0.7874206, 0.7662648, 0.76082164, 0.7605217, ..."
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german,Culture of Germany,0.686803,"[Culture of Germany, 1860 in Germany, 1866 in ...","[0.68680257, 0.6840672, 0.6836184, 0.68068546,..."
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british,Russians in the United Kingdom,0.615564,"[Russians in the United Kingdom, Henry Wood (n...","[0.61556417, 0.60587215, 0.60256517, 0.5973172..."


In [19]:
# Estimate length of time to run over full dataset
print("Estimated Duration for Full Dataset: ",\
     round((len(candidate_pools)/end_size)*execution_time/60/60,2), " hours")

Estimated Duration for Full Dataset:  12.55  hours


## Calculate Accuracy of Most Similar Entity Predictions

In [20]:
# Calculate accuracy
accurate_predictions = (mini_df['preds_w2v_mostsimilar'] == mini_df['wikipedia_title']).sum()
print("****************************")
print(f"Predictive Accuracy: {round(accurate_predictions / len(mini_df) * 100, 3)}%")
print("****************************")

****************************
Predictive Accuracy: 18.82%
****************************


In [21]:
# Calculate percentage of candidate pools with the correct answer present
# Use Wikipedia Title
# Necessary to determine if shuffling pool could even get the right answer
response_present = [mini_df['wikipedia_title'][i] in mini_df['candidate_pool_mostsimilar'][i] for i in range(len(mini_df))]
print(f"Correct answer is present in {round(sum(response_present) / len(mini_df) * 100, 3)}% of generated candidate pools via adapted Wikipedia2vec's most_similar() method.")

Correct answer is present in 30.06% of generated candidate pools via adapted Wikipedia2vec's most_similar() method.



## Save predictive dataframe for input to next step

In [22]:
#Final DF
mini_df.head(10)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,full_mention_norm,preds_w2v_mostsimilar,score_w2v_mostsimilar,candidate_pool_mostsimilar,candidate_scores_mostsimilar
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu,European Union,0.787421,"[European Union, European Free Trade Associati...","[0.7874206, 0.7662648, 0.76082164, 0.7605217, ..."
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german,Culture of Germany,0.686803,"[Culture of Germany, 1860 in Germany, 1866 in ...","[0.68680257, 0.6840672, 0.6836184, 0.68068546,..."
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british,Russians in the United Kingdom,0.615564,"[Russians in the United Kingdom, Henry Wood (n...","[0.61556417, 0.60587215, 0.60256517, 0.5973172..."
3,B,Peter Blackburn,,,,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",peter blackburn,James Watson (Rangers footballer),0.752623,"[James Watson (Rangers footballer), Wally Wils...","[0.75262326, 0.75080603, 0.74791527, 0.7458616..."
4,I,Peter Blackburn,,,,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",peter blackburn,James Watson (Rangers footballer),0.757774,"[James Watson (Rangers footballer), Bryan Will...","[0.75777376, 0.74518, 0.74305403, 0.73202455, ..."
5,B,BRUSSELS,http://en.wikipedia.org/wiki/Brussels,3708.0,Brussels,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",brussels,Brussels,0.88707,"[Brussels, Ghent, Timeline of Brussels, Brusse...","[0.88706994, 0.7689268, 0.7686756, 0.76811683,..."
6,B,European Commission,http://en.wikipedia.org/wiki/European_Commission,9974.0,European Commission,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",european commission,:Category:Wikipedians interested in the Europe...,0.689626,[:Category:Wikipedians interested in the Europ...,"[0.68962616, 0.65377027, 0.6495704, 0.6477843]"
7,I,European Commission,http://en.wikipedia.org/wiki/European_Commission,9974.0,European Commission,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",european commission,Safety League,0.641829,[Safety League],[0.641829]
8,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",german,Culture of Germany,0.686803,"[Culture of Germany, 1860 in Germany, 1866 in ...","[0.68680257, 0.6840672, 0.6836184, 0.68068546,..."
9,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",british,Russians in the United Kingdom,0.615564,"[Russians in the United Kingdom, Henry Wood (n...","[0.61556417, 0.60587215, 0.60256517, 0.5973172..."


In [23]:
# Save dataframe
mini_df.to_csv("w2v_mostsimilar_100d.csv", index=False)

In [24]:
# Download from Colab
files.download("w2v_mostsimilar_100d.csv")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>