# Step Three (B): Find Similar Entities via Adapted Wikipedia2vec most_similar()

We have now returned all of the entities we'll get from direct querying of the package. We must now use alternate measures to identify candidate entities and select from that pool.

# Note: This is the colab version used for trying to improve speed and use higher-dimensional vectors.

# 500D IS INTRACTABLE. IT TAKES 8 MINUTES FOR A SINGLE QUERY, VERSUS 2S FOR 100D

#### Import Packages

In [1]:
import os
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Progress bar
from tqdm import tqdm

In [3]:
!pip install wikipedia2vec

Collecting wikipedia2vec
[?25l  Downloading https://files.pythonhosted.org/packages/d8/88/751037c70ca86581d444824e66bb799ef9060339a1d5d1fc1804c422d7cc/wikipedia2vec-1.0.4.tar.gz (1.2MB)
[K     |████████████████████████████████| 1.2MB 13.2MB/s 
Collecting marisa-trie
[?25l  Downloading https://files.pythonhosted.org/packages/20/95/d23071d0992dabcb61c948fb118a90683193befc88c23e745b050a29e7db/marisa-trie-0.7.5.tar.gz (270kB)
[K     |████████████████████████████████| 276kB 54.0MB/s 
[?25hCollecting mwparserfromhell
[?25l  Downloading https://files.pythonhosted.org/packages/23/03/4fb04da533c7e237c0104151c028d8bff856293d34e51d208c529696fb79/mwparserfromhell-0.5.4.tar.gz (135kB)
[K     |████████████████████████████████| 143kB 56.9MB/s 
Building wheels for collected packages: wikipedia2vec, marisa-trie, mwparserfromhell
  Building wheel for wikipedia2vec (setup.py) ... [?25l[?25hdone
  Created wheel for wikipedia2vec: filename=wikipedia2vec-1.0.4-cp36-cp36m-linux_x86_64.whl size=45817

In [4]:
# Package
from wikipedia2vec import Wikipedia2Vec

# Class to compare type
from wikipedia2vec.dictionary import Entity

In [5]:
# Download dimensional file from Wikipedia2vec website
!curl -O http://wikipedia2vec.s3.amazonaws.com/models/en/2018-04-20/enwiki_20180420_500d.pkl.bz2

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.1G  100 16.1G    0     0  19.8M      0  0:13:51  0:13:51 --:--:-- 19.9M


In [6]:
!bzip2 -d enwiki_20180420_500d.pkl.bz2

In [7]:
!ls

enwiki_20180420_500d.pkl  sample_data


In [8]:
%%time
# Load unzipped pkl file with word embeddings
w2v = Wikipedia2Vec.load("enwiki_20180420_500d.pkl")

CPU times: user 46.6 ms, sys: 164 ms, total: 210 ms
Wall time: 2.66 s


## Load ACY Input Data

In [9]:
from google.colab import files
uploaded = files.upload()

Saving Aida-Conll-Yago-Input.csv to Aida-Conll-Yago-Input.csv


In [10]:
!ls

Aida-Conll-Yago-Input.csv  enwiki_20180420_500d.pkl  sample_data


In [11]:
# Load data
acy_input = pd.read_csv("Aida-Conll-Yago-Input.csv", delimiter=",")
acy_input.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions
0,B,EU,,,,0,0,"['EU', 'German', 'British']"
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']"
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']"


In [12]:
# Re-name 
candidate_pools = acy_input.copy()

## Find most similar entity using Wikipedia2vec

We now turn to using a variation on Wikipedia2vec's `most_similar()` function to find, for entered words, the most similar entity. We do this as an added-layer, meaning only for those without an estimate, and for all full mentions, to compare performance.

In [13]:
# Normalize full_mentions to lower case for entry into most_similar() function
full_mention_norm = np.array([x.lower() for x in candidate_pools['full_mention']])
candidate_pools['full_mention_norm'] = full_mention_norm
candidate_pools.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,full_mention_norm
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british


In [14]:
### Test single full mention query time
start_time = time.time()

# Print word
search_word = candidate_pools['full_mention_norm'][2]
print("Search Word: ", search_word)

# Translate word into vector
# Handles multi-word mentions
search_word_list = search_word.split(" ")
search_word_vector = None
for word in search_word_list:
    try:
        vector = w2v.get_word_vector(str(word))
    except KeyError:
        print("This word is Out of Vocabulary (OOV) for Wikipedia2vec.")
        vector = None
        
    if search_word_vector is None:
        search_word_vector = vector
    else:
        search_word_vector += vector

if search_word_vector is not None:
    # Get most similar word
    count_similar = 500
    similar = w2v.most_similar_by_vector(search_word_vector, count_similar)

    # Retrieve only entities from word
    entities = []
    return_similar = 10
    for i in similar:
    #     print(type(i[0]))
        if isinstance(i[0], Entity):
            entities.append(i)
    #     if len(entities) == return_similar:
    #         break
    display(entities)
end_time = time.time()
print(f"Single Word Query Time: {round(end_time - start_time, 2)}s")

Search Word:  british


[(<Entity United Kingdom>, 0.4746339),
 (<Entity John Osbaldiston Field>, 0.45445728),
 (<Entity Valentine Boucher>, 0.45051035),
 (<Entity British Army>, 0.4449676),
 (<Entity List of colonial governors of the British Virgin Islands>,
  0.44208843),
 (<Entity List of British middleweight boxing champions>, 0.4411608),
 (<Entity Derek George Cudmore>, 0.4407883),
 (<Entity Stephen Chapman (British Army officer)>, 0.4302066),
 (<Entity Walter Wilkinson Wallace>, 0.43019333),
 (<Entity The Bigamist (1921 film)>, 0.42996034),
 (<Entity Henry Maynard Ball>, 0.4297536),
 (<Entity Deadlock (1943 film)>, 0.42930114),
 (<Entity David Robert Barwick>, 0.42919323),
 (<Entity John Henry Cates>, 0.4291481),
 (<Entity Recorder of Manchester>, 0.42779085),
 (<Entity Frank McKelvey Bell>, 0.42712817),
 (<Entity Colin Shortis>, 0.42692956),
 (<Entity Stepping Stones (film)>, 0.42692292),
 (<Entity Norman Williams (artist)>, 0.42577127),
 (<Entity Transatlantic (1960 film)>, 0.42566535),
 (<Entity The 

Single Word Query Time: 505.09s


In [None]:
### Test single full mention query time on a mention with multiple words
start_time = time.time()

# Print word
search_word = candidate_pools['full_mention_norm'][51]
print("Search Word: ", search_word)

# Translate word into vector
# Handles multi-word mentions
search_word_list = search_word.split(" ")
search_word_vector = None
for word in search_word_list:
    try:
        vector = w2v.get_word_vector(str(word))
        
        if search_word_vector is None:
            search_word_vector = vector
        else:
            search_word_vector += vector

    except KeyError:
        print(f"\"{word}\" is Out of Vocabulary (OOV) for Wikipedia2vec.")

if search_word_vector is not None:
    # Get most similar word
    count_similar = 500
    similar = w2v.most_similar_by_vector(search_word_vector, count_similar)

    # Retrieve only entities from word
    entities = []
    return_similar = 10
    for i in similar:
    #     print(type(i[0]))
        if isinstance(i[0], Entity):
            entities.append(i)
    #     if len(entities) == return_similar:
    #         break
    display(entities)
end_time = time.time()
print(f"Single Word Query Time: {round(end_time - start_time, 2)}s")

Search Word:  welsh national farmers ' union
"'" is Out of Vocabulary (OOV) for Wikipedia2vec.


[(<Entity :Category:Wikipedians interested in the European Union>, 0.74048346),
 (<Entity Whiteheads RFC>, 0.7279615),
 (<Entity Ian Mackay (rugby league)>, 0.7100251),
 (<Entity E. Gwyndaf Evans>, 0.70472246),
 (<Entity Ystrad Rhondda RFC>, 0.7012322),
 (<Entity Abercynon RFC>, 0.70090365),
 (<Entity Caernarfon RFC>, 0.69320804),
 (<Entity Tredegar Ironsides RFC>, 0.69225544),
 (<Entity Fleur De Lys RFC>, 0.6915729),
 (<Entity Cwmgwrach RFC>, 0.6905903),
 (<Entity File:Harden NSW.PNG>, 0.6901128),
 (<Entity Betws RFC>, 0.6887361),
 (<Entity Dai Francis (trade union leader)>, 0.6886294),
 (<Entity Llandybie RFC>, 0.6878249),
 (<Entity Berwyn Rangers F.C.>, 0.6861579),
 (<Entity :Category:All Blacks>, 0.68473554),
 (<Entity :Eastern Conference (NHL)>, 0.6845522),
 (<Entity Markham RFC>, 0.6823692),
 (<Entity Bethesda RFC>, 0.6821802),
 (<Entity Fall Bay RFC>, 0.6820504),
 (<Entity Cwmtwrch RFC>, 0.6806142),
 (<Entity Neil Lashkari>, 0.6795225),
 (<Entity Joost Adriaan van Hamel>, 0.6789

Single Word Query Time: 1.9s


### Run Over Large Subset of Data

In [15]:
# Prepare output array
most_similar_entities = []
most_similar_scores = []
get_similar_candidate_pool = []
get_similar_candidate_scores = []

# Track metrics
success_word_query = 0
oov_errors = 0
start_time = time.time()

# Provide filter ability
size = 10

for mention in tqdm(candidate_pools['full_mention_norm'][:size]):
    
    # Translate word into vector
    # Handles multi-word mentions
    search_word_list = mention.split(" ")
    search_word_vector = None
    for word in search_word_list:
        try:
            vector = w2v.get_word_vector(str(word))
            
            if search_word_vector is None:
                search_word_vector = vector
            else:
                search_word_vector += vector
                
        except KeyError:
            oov_errors += 1

    # Save candidate pool
    candidate_pool = []
    candidate_scores = []
    
    if search_word_vector is not None:
        success_word_query += 1
        
        # Search most similar words/entities from found word
        # Retrieve 500 most similar to test large coverage
        similars = w2v.most_similar_by_vector(search_word_vector, 500)

        # Retrieve most similar entity
        most_similar = None
        for s in similars:
            if isinstance(s[0], Entity):
                candidate_pool.append(s[0].title)
                candidate_scores.append(s[1])
                if most_similar is None:
                    most_similar = s
                
    # Save lists
    get_similar_candidate_pool.append(candidate_pool)
    get_similar_candidate_scores.append(candidate_scores)
    
    if most_similar is not None:
        most_similar_entities.append(most_similar[0].title)
        most_similar_scores.append(most_similar[1])
    else:
        most_similar_entities.append(None)
        most_similar_scores.append(None)

    
print("Successfully Found Words: ", round(success_word_query/size*100,3),"%")
print("Out-of-Vocabulary Issues: ", round(oov_errors/size*100,3),"%")
execution_time = time.time() - start_time
print("Execution time: ", round(execution_time, 3),"s")

100%|██████████| 10/10 [1:24:11<00:00, 505.13s/it]

Successfully Found Words:  100.0 %
Out-of-Vocabulary Issues:  0.0 %
Execution time:  5051.299 s





In [16]:
# Append to dataframe
mini_df = candidate_pools[:size].copy()
mini_df['preds_w2v_mostsimilar'] = most_similar_entities
mini_df['score_w2v_mostsimilar'] = most_similar_scores
mini_df['candidate_pool_mostsimilar'] = get_similar_candidate_pool
mini_df['candidate_scores_mostsimilar'] = get_similar_candidate_scores
mini_df.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,full_mention_norm,preds_w2v_mostsimilar,score_w2v_mostsimilar,candidate_pool_mostsimilar,candidate_scores_mostsimilar
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu,European Union,0.673878,"[European Union, Directorate-General for Trade...","[0.67387825, 0.54621464, 0.53486407, 0.5341276..."
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german,1856 in Germany,0.5327,"[1856 in Germany, 1860 in Germany, 1868 in Ger...","[0.53269994, 0.5228385, 0.51995784, 0.51975304..."
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british,United Kingdom,0.474634,"[United Kingdom, John Osbaldiston Field, Valen...","[0.4746339, 0.45445728, 0.45051035, 0.4449676,..."


In [17]:
# Estimate length of time to run over full dataset
print("Estimated Duration for Full Dataset: ",\
     round((len(candidate_pools)/size)*execution_time/60/60,2), " hours")

Estimated Duration for Full Dataset:  4112.88  hours


## Calculate Accuracy of Most Similar Entity Predictions

In [18]:
# Calculate accuracy
accurate_predictions = (mini_df['preds_w2v_mostsimilar'] == mini_df['wikipedia_title']).sum()
print("****************************")
print(f"Predictive Accuracy: {round(accurate_predictions / len(mini_df) * 100, 3)}%")
print("****************************")

****************************
Predictive Accuracy: 40.0%
****************************


In [19]:
# Calculate percentage of candidate pools with the correct answer present
# Use Wikipedia Title
# Necessary to determine if shuffling pool could even get the right answer
response_present = [mini_df['wikipedia_title'][i] in mini_df['candidate_pool_mostsimilar'][i] for i in range(len(mini_df))]
print(f"Correct answer is present in {round(sum(response_present) / len(mini_df) * 100, 3)}% of generated candidate pools via adapted Wikipedia2vec's most_similar() method.")

Correct answer is present in 50.0% of generated candidate pools via adapted Wikipedia2vec's most_similar() method.



## Save predictive dataframe for input to next step

In [20]:
#Final DF
mini_df.head(10)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,full_mention_norm,preds_w2v_mostsimilar,score_w2v_mostsimilar,candidate_pool_mostsimilar,candidate_scores_mostsimilar
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu,European Union,0.673878,"[European Union, Directorate-General for Trade...","[0.67387825, 0.54621464, 0.53486407, 0.5341276..."
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german,1856 in Germany,0.5327,"[1856 in Germany, 1860 in Germany, 1868 in Ger...","[0.53269994, 0.5228385, 0.51995784, 0.51975304..."
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british,United Kingdom,0.474634,"[United Kingdom, John Osbaldiston Field, Valen...","[0.4746339, 0.45445728, 0.45051035, 0.4449676,..."
3,B,Peter Blackburn,,,,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",peter blackburn,I'll Be Back Before Midnight,0.533969,"[I'll Be Back Before Midnight, William Dunlop ...","[0.53396887, 0.53210217, 0.52428687, 0.5232596..."
4,I,Peter Blackburn,,,,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",peter blackburn,Blackburn Rovers F.C.,0.541318,"[Blackburn Rovers F.C., Derek Leaver, Heywood ...","[0.54131776, 0.5234059, 0.52236223, 0.50598407..."
5,B,BRUSSELS,http://en.wikipedia.org/wiki/Brussels,3708.0,Brussels,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",brussels,Brussels,0.865907,"[Brussels, Timeline of Brussels, Meiser railwa...","[0.8659067, 0.6254457, 0.62286663, 0.615217, 0..."
6,B,European Commission,http://en.wikipedia.org/wiki/European_Commission,9974.0,European Commission,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",european commission,European Commission,0.542032,"[European Commission, Europe for Citizens, Off...","[0.54203236, 0.53617793, 0.52699053, 0.5178218..."
7,I,European Commission,http://en.wikipedia.org/wiki/European_Commission,9974.0,European Commission,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",european commission,Montreal Catholic School Commission,0.522123,"[Montreal Catholic School Commission , Europea...","[0.5221227, 0.50598323, 0.48425615, 0.4838054,..."
8,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",german,1856 in Germany,0.5327,"[1856 in Germany, 1860 in Germany, 1868 in Ger...","[0.53269994, 0.5228385, 0.51995784, 0.51975304..."
9,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",british,United Kingdom,0.474634,"[United Kingdom, John Osbaldiston Field, Valen...","[0.4746339, 0.45445728, 0.45051035, 0.4449676,..."


In [22]:
# Save dataframe
mini_df.to_csv("wikipedia2vec_most_similar_10n_500d.csv", index=False)

In [23]:
# Download from Colab
files.download("wikipedia2vec_most_similar_10n_500d.csv")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>