# Phase Two: Find Similar Entities via Adapted Wikipedia2vec most_similar()

In Phase Two, we try to generate candidate pools for each full mention. In this method, we use an adaptation of Wikipedia2vec's most_similar() function that translates a full mention into a vector representation, returns a large sample of similar words and entities and filters to just entities as our candidates.

To improve speed of processing, we used Google Colab to download the pre-trained embeddings and run this notebook.

#### Import Packages

In [1]:
import os
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Progress bar
from tqdm import tqdm

In [2]:
!pip install wikipedia2vec



In [3]:
# Package
from wikipedia2vec import Wikipedia2Vec

# Class to compare type
from wikipedia2vec.dictionary import Entity

In [4]:
# Download dimensional file from Wikipedia2vec website
!curl -O http://wikipedia2vec.s3.amazonaws.com/models/en/2018-04-20/enwiki_20180420_300d.pkl.bz2

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9921M  100 9921M    0     0  61.6M      0  0:02:40  0:02:40 --:--:-- 74.1M


In [5]:
!bzip2 -d enwiki_20180420_300d.pkl.bz2

bzip2: Output file enwiki_20180420_300d.pkl already exists.


In [6]:
!ls

Aida-Conll-Yago-Input.csv     enwiki_20180420_300d.pkl	    sample_data
enwiki_20180420_100d.pkl.bz2  enwiki_20180420_300d.pkl.bz2


In [7]:
%%time
# Load unzipped pkl file with word embeddings
w2v = Wikipedia2Vec.load("enwiki_20180420_300d.pkl")

CPU times: user 63 ms, sys: 217 ms, total: 280 ms
Wall time: 1.64 s


## Load ACY Input Data

In [8]:
from google.colab import files
uploaded = files.upload()

Saving Aida-Conll-Yago-Input.csv to Aida-Conll-Yago-Input (1).csv


In [9]:
!ls

'Aida-Conll-Yago-Input (1).csv'   enwiki_20180420_300d.pkl
 Aida-Conll-Yago-Input.csv	  enwiki_20180420_300d.pkl.bz2
 enwiki_20180420_100d.pkl.bz2	  sample_data


In [10]:
# Load data
acy_input = pd.read_csv("Aida-Conll-Yago-Input.csv", delimiter=",")
acy_input.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions
0,B,EU,,,,0,0,"['EU', 'German', 'British']"
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']"
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']"


In [11]:
# Re-name 
candidate_pools = acy_input.copy()

## Find most similar entity using Wikipedia2vec

We now turn to using a variation on Wikipedia2vec's `most_similar()` function to find, for entered words, the most similar entity.

In [12]:
# Normalize full_mentions to lower case for entry into most_similar() function
full_mention_norm = np.array([x.lower() for x in candidate_pools['full_mention']])
candidate_pools['full_mention_norm'] = full_mention_norm
candidate_pools.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,full_mention_norm
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british


In [13]:
### Test single full mention query time
start_time = time.time()

# Print word
search_word = candidate_pools['full_mention_norm'][2]
print("Search Word: ", search_word)

# Translate word into vector
# Handles multi-word mentions
search_word_list = search_word.split(" ")
search_word_vector = None
for word in search_word_list:
    try:
        vector = w2v.get_word_vector(str(word))
    except KeyError:
        print("This word is Out of Vocabulary (OOV) for Wikipedia2vec.")
        vector = None
        
    if search_word_vector is None:
        search_word_vector = vector
    else:
        search_word_vector += vector

if search_word_vector is not None:
    # Get most similar word
    count_similar = 500
    similar = w2v.most_similar_by_vector(search_word_vector, count_similar)

    # Retrieve only entities from word
    entities = []
    return_similar = 10
    for i in similar:
    #     print(type(i[0]))
        if isinstance(i[0], Entity):
            entities.append(i)
    #     if len(entities) == return_similar:
    #         break
    display(entities)
end_time = time.time()
print(f"Single Word Query Time: {round(end_time - start_time, 2)}s")

Search Word:  british


[(<Entity Derek George Cudmore>, 0.49873415),
 (<Entity HMS Birkenhead>, 0.48730087),
 (<Entity File:Turner, The Battle of Trafalgar (1822).jpg>, 0.48630852),
 (<Entity John Osbaldiston Field>, 0.48565567),
 (<Entity Henry Maynard Ball>, 0.48533762),
 (<Entity Edward John Cameron>, 0.48491052),
 (<Entity David Robert Barwick>, 0.48152602),
 (<Entity List of Commissioners of the Turks and Caicos Islands>, 0.47786582),
 (<Entity Starlight (TV series)>, 0.47318417),
 (<Entity List of colonial governors of the British Virgin Islands>,
  0.47302568),
 (<Entity Gordon Wesley Jewkes>, 0.4724612),
 (<Entity Rollo Pain>, 0.4719429),
 (<Entity Percivale Liesching>, 0.47097537),
 (<Entity Walter Wilkinson Wallace>, 0.46967942),
 (<Entity Millions (1937 film)>, 0.4696355),
 (<Entity United Kingdom>, 0.469226),
 (<Entity File:El Alamein 1942 - British infantry.jpg>, 0.469008),
 (<Entity Ernest Gordon Lewis>, 0.46749967),
 (<Entity The Windmill (1937 film)>, 0.4670006),
 (<Entity Paul Adam (English 

Single Word Query Time: 30.21s


In [14]:
### Test single full mention query time on a mention with multiple words
start_time = time.time()

# Print word
search_word = candidate_pools['full_mention_norm'][51]
print("Search Word: ", search_word)

# Translate word into vector
# Handles multi-word mentions
search_word_list = search_word.split(" ")
search_word_vector = None
for word in search_word_list:
    try:
        vector = w2v.get_word_vector(str(word))
        
        if search_word_vector is None:
            search_word_vector = vector
        else:
            search_word_vector += vector

    except KeyError:
        print(f"\"{word}\" is Out of Vocabulary (OOV) for Wikipedia2vec.")

if search_word_vector is not None:
    # Get most similar word
    count_similar = 500
    similar = w2v.most_similar_by_vector(search_word_vector, count_similar)

    # Retrieve only entities from word
    entities = []
    return_similar = 10
    for i in similar:
    #     print(type(i[0]))
        if isinstance(i[0], Entity):
            entities.append(i)
    #     if len(entities) == return_similar:
    #         break
    display(entities)
end_time = time.time()
print(f"Single Word Query Time: {round(end_time - start_time, 2)}s")

Search Word:  welsh national farmers ' union
"'" is Out of Vocabulary (OOV) for Wikipedia2vec.


[(<Entity :Category:Wikipedians interested in the European Union>, 0.6399574),
 (<Entity Whiteheads RFC>, 0.6378816),
 (<Entity Caernarfon RFC>, 0.61923885),
 (<Entity Abercynon RFC>, 0.6164113),
 (<Entity Geoff Evans (rugby union born 1942)>, 0.61393803),
 (<Entity Andy Allen (rugby union)>, 0.6118196),
 (<Entity Rhodri Gomer-Davies>, 0.6114087),
 (<Entity Bethesda RFC>, 0.61053175),
 (<Entity Dan Thomas (rugby player)>, 0.61023164),
 (<Entity E. Gwyndaf Evans>, 0.6101505),
 (<Entity Josh Lewis (rugby union)>, 0.60955167),
 (<Entity Members of the 2nd National Assembly for Wales>, 0.6083609),
 (<Entity Betws RFC>, 0.60643876),
 (<Entity Llandybie RFC>, 0.60576683),
 (<Entity Phil Thomas (rugby)>, 0.6049313),
 (<Entity Gareth Roberts (rugby player)>, 0.6012342),
 (<Entity Alltwen RFC>, 0.59988105),
 (<Entity Cwmtwrch RFC>, 0.5990562),
 (<Entity Glais RFC>, 0.5988778),
 (<Entity Nant Conwy RFC>, 0.59881973),
 (<Entity Mark Wyatt (Welsh rugby player)>, 0.59853023),
 (<Entity Llangadog RF

Single Word Query Time: 5.84s


### Run Over Large Subset of Data

In [15]:
# Prepare output array
most_similar_entities = []
most_similar_scores = []
get_similar_candidate_pool = []
get_similar_candidate_scores = []

# Track metrics
success_word_query = 0
oov_errors = 0
start_time = time.time()

In [16]:
# Provide filter ability
end_size = 500

for mention in tqdm(candidate_pools['full_mention_norm'][:end_size]):
    
    # Translate word into vector
    # Handles multi-word mentions
    search_word_list = mention.split(" ")
    search_word_vector = None
    for word in search_word_list:
        try:
            vector = w2v.get_word_vector(str(word))
            
            if search_word_vector is None:
                search_word_vector = vector
            else:
                search_word_vector += vector
                
        except KeyError:
            oov_errors += 1

    # Save candidate pool
    candidate_pool = []
    candidate_scores = []
    
    if search_word_vector is not None:
        success_word_query += 1
        
        # Search most similar words/entities from found word
        # Retrieve 500 most similar to test large coverage
        similars = w2v.most_similar_by_vector(search_word_vector, 500)

        # Retrieve most similar entity
        most_similar = None
        for s in similars:
            if isinstance(s[0], Entity):
                candidate_pool.append(s[0].title)
                candidate_scores.append(s[1])
                if most_similar is None:
                    most_similar = s
                
    # Save lists
    get_similar_candidate_pool.append(candidate_pool)
    get_similar_candidate_scores.append(candidate_scores)
    
    if most_similar is not None:
        most_similar_entities.append(most_similar[0].title)
        most_similar_scores.append(most_similar[1])
    else:
        most_similar_entities.append(None)
        most_similar_scores.append(None)

100%|██████████| 500/500 [45:33<00:00,  5.47s/it]


In [17]:
print("Successfully Found Words: ", round(success_word_query/end_size*100,3),"%")
print("Out-of-Vocabulary Issues: ", round(oov_errors/end_size*100,3),"%")
execution_time = time.time() - start_time
print("Execution time: ", round(execution_time, 3),"s")

Successfully Found Words:  96.8 %
Out-of-Vocabulary Issues:  7.4 %
Execution time:  2734.274 s


In [18]:
# Append to dataframe
mini_df = candidate_pools[:end_size].copy()
mini_df['preds_w2v_mostsimilar'] = most_similar_entities
mini_df['score_w2v_mostsimilar'] = most_similar_scores
mini_df['candidate_pool_mostsimilar'] = get_similar_candidate_pool
mini_df['candidate_scores_mostsimilar'] = get_similar_candidate_scores
mini_df.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,full_mention_norm,preds_w2v_mostsimilar,score_w2v_mostsimilar,candidate_pool_mostsimilar,candidate_scores_mostsimilar
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu,European Union,0.668622,"[European Union, Directorate-General for Trade...","[0.668622, 0.61401534, 0.6123832, 0.60006154, ..."
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german,1856 in Germany,0.59712,"[1856 in Germany, 1860 in Germany, 1862 in Ger...","[0.5971198, 0.58756816, 0.57571477, 0.5744798,..."
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british,Derek George Cudmore,0.498734,"[Derek George Cudmore, HMS Birkenhead, File:Tu...","[0.49873415, 0.48730087, 0.48630852, 0.4856556..."


In [19]:
# Estimate length of time to run over full dataset
print("Estimated Duration for Full Dataset: ",\
     round((len(candidate_pools)/end_size)*execution_time/60/60,2), " hours")

Estimated Duration for Full Dataset:  33.13  hours


## Calculate Accuracy of Most Similar Entity Predictions

In [20]:
# Calculate accuracy
accurate_predictions = (mini_df['preds_w2v_mostsimilar'] == mini_df['wikipedia_title']).sum()
print("****************************")
print(f"Predictive Accuracy: {round(accurate_predictions / len(mini_df) * 100, 3)}%")
print("****************************")

****************************
Predictive Accuracy: 32.0%
****************************


In [21]:
# Calculate percentage of candidate pools with the correct answer present
# Use Wikipedia Title
# Necessary to determine if shuffling pool could even get the right answer
response_present = [mini_df['wikipedia_title'][i] in mini_df['candidate_pool_mostsimilar'][i] for i in range(len(mini_df))]
print(f"Correct answer is present in {round(sum(response_present) / len(mini_df) * 100, 3)}% of generated candidate pools via adapted Wikipedia2vec's most_similar() method.")

Correct answer is present in 43.6% of generated candidate pools via adapted Wikipedia2vec's most_similar() method.



## Save predictive dataframe for input to next step

In [22]:
#Final DF
mini_df.head(10)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,full_mention_norm,preds_w2v_mostsimilar,score_w2v_mostsimilar,candidate_pool_mostsimilar,candidate_scores_mostsimilar
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu,European Union,0.668622,"[European Union, Directorate-General for Trade...","[0.668622, 0.61401534, 0.6123832, 0.60006154, ..."
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german,1856 in Germany,0.59712,"[1856 in Germany, 1860 in Germany, 1862 in Ger...","[0.5971198, 0.58756816, 0.57571477, 0.5744798,..."
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british,Derek George Cudmore,0.498734,"[Derek George Cudmore, HMS Birkenhead, File:Tu...","[0.49873415, 0.48730087, 0.48630852, 0.4856556..."
3,B,Peter Blackburn,,,,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",peter blackburn,William Dunlop (footballer),0.62522,"[William Dunlop (footballer), I'll Be Back Bef...","[0.62522006, 0.61881196, 0.60742164, 0.6032656..."
4,I,Peter Blackburn,,,,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",peter blackburn,Heywood United F.C.,0.600855,"[Heywood United F.C., William Dunlop (football...","[0.60085547, 0.5962453, 0.59373885, 0.59146744..."
5,B,BRUSSELS,http://en.wikipedia.org/wiki/Brussels,3708.0,Brussels,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",brussels,Brussels,0.86907,"[Brussels, Bordet railway station, Timeline of...","[0.8690704, 0.6816344, 0.67974824, 0.6619611, ..."
6,B,European Commission,http://en.wikipedia.org/wiki/European_Commission,9974.0,European Commission,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",european commission,European Commission,0.594913,"[European Commission, Office of Infrastructure...","[0.5949126, 0.5785086, 0.57685864, 0.562716, 0..."
7,I,European Commission,http://en.wikipedia.org/wiki/European_Commission,9974.0,European Commission,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",european commission,European Commission,0.55876,"[European Commission, Montreal Catholic School...","[0.55876046, 0.5487425, 0.54632604, 0.51790774..."
8,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",german,1856 in Germany,0.59712,"[1856 in Germany, 1860 in Germany, 1862 in Ger...","[0.5971198, 0.58756816, 0.57571477, 0.5744798,..."
9,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",british,Derek George Cudmore,0.498734,"[Derek George Cudmore, HMS Birkenhead, File:Tu...","[0.49873415, 0.48730087, 0.48630852, 0.4856556..."


In [23]:
# Save dataframe
mini_df.to_csv("w2v_mostsimilar_300d.csv", index=False)

In [24]:
# Download from Colab
files.download("w2v_mostsimilar_300d.csv")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>