# Part Two: Look Up the Full Mention with Wikipedia2Vec `get_entity()`

Our most direct step is to pass each `full_mention` straight to Wikipedia2Vec's `get_entity()` method. We run this as the first step of the process, on the assumption that whenever the lookup returns a result, that result is almost certainly the correct page. We test this hypothesis at the end of this notebook.

## Import Packages

In [1]:
import os
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Progress bar
from tqdm import tqdm

## Load Processed ACY Input

In [69]:
# Base path to input
acy_path = '../../data/aida-conll-yago-dataset/'

# Load data
acy_input = pd.read_csv(os.path.join(acy_path, "Aida-Conll-Yago-Input.csv"), delimiter=",")
acy_input.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_ID,sentence_id,doc_id,congruent_entities
0,B,German,http://en.wikipedia.org/wiki/Germany,11867,0,0,"['EU', 'German', 'British']"
1,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717,0,0,"['EU', 'German', 'British']"
2,B,BRUSSELS,http://en.wikipedia.org/wiki/Brussels,3708,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm..."


In [70]:
# Alias the input table for this predictive step (same object, not a copy)
entity_disambiguation = acy_input

## Import Wikipedia2Vec Model

In [5]:
# Package
from wikipedia2vec import Wikipedia2Vec

In [4]:
%%time
# Load unzipped pkl file with word embeddings
w2v = Wikipedia2Vec.load("../../embeddings/enwiki_20180420_100d.pkl")

CPU times: user 92.9 ms, sys: 142 ms, total: 235 ms
Wall time: 316 ms


### Query Using `full_mention`

In [71]:
# Track success rate for returned values
successes = 0
queries = 0
failed_searches = []
preds_w2v_getentity = []

# Run through each full_mention
for full_mention in tqdm(acy_input['full_mention']):
    
    # Look up the mention as a Wikipedia page title in the loaded model
    entity = w2v.get_entity(full_mention)
    
    # Increment count
    queries += 1
    if entity is not None:
        successes += 1
    else:
        # Keep a random ~10% sample of failed mentions for later inspection
        if np.random.uniform() <= 0.1:
            failed_searches.append(full_mention)
    
    # Keep only the page title; leave None when the lookup failed
    if entity is not None:
        entity = entity.title

    # Save prediction
    preds_w2v_getentity.append(entity)
print("Query Success Rate: ", round(successes/queries*100, 3),"%")

100%|██████████| 22257/22257 [00:00<00:00, 46354.66it/s]

Query Success Rate:  72.638 %
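
The lookup loop above can be condensed into a small helper that returns a page title or `None`. The sketch below is illustrative only: `lookup_title`, `Entity`, and `StubModel` are hypothetical names, and the stub's exact-match behavior is an assumption made so the snippet runs without the embeddings file. With the real model you would pass `w2v` in place of the stub.

```python
from collections import namedtuple

# Hypothetical stand-in for an entity object exposing a .title attribute
Entity = namedtuple("Entity", ["title"])


class StubModel:
    """Hypothetical stand-in exposing only get_entity(), for illustration."""

    def __init__(self, titles):
        self._titles = set(titles)

    def get_entity(self, title):
        # Exact-match lookup; returns None when the title is unknown
        return Entity(title) if title in self._titles else None


def lookup_title(model, mention):
    """Return the matched page title for a mention, or None if nothing is found."""
    entity = model.get_entity(mention)
    return entity.title if entity is not None else None


model = StubModel(["European Commission", "Germany"])
mentions = ["German", "European Commission", "BRUSSELS", "Germany"]
preds = [lookup_title(model, m) for m in mentions]
success_rate = sum(p is not None for p in preds) / len(preds)
print(preds)
print(f"Query success rate: {success_rate:.1%}")
```

Returning `None` explicitly, rather than relying on a bare `except`, keeps genuine errors visible while still recording failed lookups.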





In [72]:
# Append predictions to table
entity_disambiguation['preds_w2v_getentity'] = preds_w2v_getentity
entity_disambiguation.head(10)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_ID,sentence_id,doc_id,congruent_entities,preds_w2v_getentity
0,B,German,http://en.wikipedia.org/wiki/Germany,11867,0,0,"['EU', 'German', 'British']",
1,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717,0,0,"['EU', 'German', 'British']",
2,B,BRUSSELS,http://en.wikipedia.org/wiki/Brussels,3708,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",
3,B,European Commission,http://en.wikipedia.org/wiki/European_Commission,9974,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",European Commission
4,I,European Commission,http://en.wikipedia.org/wiki/European_Commission,9974,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",European Commission
5,B,German,http://en.wikipedia.org/wiki/Germany,11867,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",
6,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717,1,0,"['Peter Blackburn', 'BRUSSELS', 'European Comm...",
7,B,Germany,http://en.wikipedia.org/wiki/Germany,11867,2,0,"['Germany', 'European Union', 'Werner Zwingman...",Germany
8,B,European Union,http://en.wikipedia.org/wiki/European_Union,9317,2,0,"['Germany', 'European Union', 'Werner Zwingman...",European Union
9,I,European Union,http://en.wikipedia.org/wiki/European_Union,9317,2,0,"['Germany', 'European Union', 'Werner Zwingman...",European Union


## Assess Accuracy of Predictions

In [74]:
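# Build the response variable (ground-truth titles) from the Wikipedia URLs
def replace_lines(text):
    """Replace underscores with spaces so URL slugs match page titles."""
    return str(text).replace("_", " ")

# Missing URLs load as NaN (a float), so map those to None
response = [replace_lines(i.split("/")[-1]) if not isinstance(i, float) else None for i in entity_disambiguation['wikipedia_URL']]
response[:5]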

['Germany',
 'United Kingdom',
 'Brussels',
 'European Commission',
 'European Commission']
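
One possible refinement, not part of the original process: Wikipedia URLs can contain percent-encoded characters, which would never match a plain-text page title. Whether this affects the ACY data is untested here. A minimal sketch, assuming a hypothetical helper `url_to_title` and using only the standard library:

```python
from urllib.parse import unquote


def url_to_title(url):
    """Convert a Wikipedia URL to a page title, or None for missing values."""
    # Missing URLs load from CSV as NaN (a float), so treat non-strings as missing
    if not isinstance(url, str):
        return None
    slug = url.split("/")[-1]
    # Decode percent-escapes, then swap underscores for spaces
    return unquote(slug).replace("_", " ")


print(url_to_title("http://en.wikipedia.org/wiki/United_Kingdom"))
print(url_to_title("http://en.wikipedia.org/wiki/S%C3%A3o_Paulo"))
print(url_to_title(float("nan")))
```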

In [75]:
# Calculate accuracy over all mentions, including those with no returned entity
accurate_predictions = (entity_disambiguation['preds_w2v_getentity'] == response).sum()
print("****************************")
print(f"Predictive Accuracy: {round(accurate_predictions / len(entity_disambiguation) * 100, 3)}%")
print("****************************")

****************************
Predictive Accuracy: 55.6%
****************************
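
The accuracy above is computed over every mention, so it mixes two effects: how often a lookup returns anything, and how often a returned title is correct. The opening hypothesis concerns only the second, so a sketch like the following could separate them. The function name `coverage_and_precision` and the toy data are hypothetical, and no result for the ACY data is implied.

```python
def coverage_and_precision(preds, truth):
    """Return (coverage, precision among returned predictions, overall accuracy)."""
    total = len(preds)
    # Keep only the pairs where the lookup returned a title
    returned = [(p, t) for p, t in zip(preds, truth) if p is not None]
    correct = sum(p == t for p, t in returned)
    coverage = len(returned) / total if total else 0.0
    precision = correct / len(returned) if returned else 0.0
    accuracy = correct / total if total else 0.0
    return coverage, precision, accuracy


# Toy data for illustration only
preds = [None, "European Commission", None, "Germany", "Japan"]
truth = ["Germany", "European Commission", "Brussels", "Germany",
         "Japan national football team"]
coverage, precision, accuracy = coverage_and_precision(preds, truth)
print(f"Coverage: {coverage:.1%}, precision: {precision:.1%}, accuracy: {accuracy:.1%}")
```

With the notebook's variables, this could be called as `coverage_and_precision(preds_w2v_getentity, response)`, since both are plain lists of titles and `None` values.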


## Save Predictions for the Next Step

In [77]:
# Save dataframe
preds_path = '../../predictions/'
entity_disambiguation.to_csv(os.path.join(preds_path, "wikipedia2vec_getentities.csv"), index=False)