# Phase Two: Generate Candidate Pool via Anchor Link Statistics

In Phase Two, we generate a candidate pool for each full mention. This notebook uses Wikipedia anchor links (hyperlinks from a text string to a Wikipedia page) to propose a pool of possible entities/pages for each full mention. Kensho provides a version of this dataset structured for NLP tasks: one table lists every anchor link string, the Wikipedia page it links to, and the number of such links, and a second table lists per-page statistics such as page views. We propose two methods of using anchor links: one ranks candidates by page popularity (page views) and the other by link frequency (how often the anchor string links to the page). We output both dataframes for later evaluation with congruence.

#### Import Packages

In [1]:
import os
import time
import json

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Progress bar
from tqdm import tqdm

### Load Processed ACY Input

In [2]:
# Base path to input
acy_path = '../../data/aida-conll-yago-dataset/'

# Load data
acy_input = pd.read_csv(os.path.join(acy_path, "Aida-Conll-Yago-Input.csv"), delimiter=",")
acy_input.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions
0,B,EU,,,,0,0,"['EU', 'German', 'British']"
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']"
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']"


In [3]:
# Count of full mentions
len(acy_input)

21810

### Load Kensho Target Dataset

This dataset, provided by Kensho Technologies, contains anchor link statistics for Wikipedia pages. We process it into a structure better suited to our pipeline.

In [4]:
# Base path to KWNLP
kwnlp_path = '../../data/kwnlp'

In [5]:
# Load article data
article_df = pd.read_csv(os.path.join(kwnlp_path, 'kwnlp-enwiki-20200920-article.csv'))
article_df.head(3)

Unnamed: 0,page_id,item_id,page_title,views,len_article_chars,len_intro_chars,in_link_count,out_link_count,tmpl_good_article,tmpl_featured_article,tmpl_pseudoscience,tmpl_conspiracy_theories,isa_Q17442446,isa_Q14795564,isa_Q18340514
0,12,6199,Anarchism,35558,40449,409,3826,371,1,0,0,0,0,0,0
1,25,38404,Autism,40081,47659,419,2313,309,0,1,0,0,0,0,0
2,39,101038,Albedo,10770,18766,293,3090,115,0,0,0,0,0,0,0


In [6]:
# Load anchor target counts data
anchor_df = pd.read_csv(os.path.join(kwnlp_path, 'kwnlp-enwiki-20200920-anchor-target-counts.csv'))
anchor_df.head(3)

Unnamed: 0,anchor_text,target_page_id,count
0,United States,3434750,152451
1,World War II,32927,133668
2,India,14533,112069


### Process Target Data

We normalize the anchor text to simplify matching against full mentions.

In [7]:
# Copy to new dataframe for processing
anchor_text = anchor_df.copy()

In [8]:
# Define text normalization function
def normalize_text(text):
    """
    We define "normalized" in this notebook as:
    - lowercased
    - stripped of leading and trailing whitespace
    - underscores replaced with spaces
    """
    return str(text).strip().lower().replace("_", " ")
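
As a quick check of what the normalization does, the sketch below applies the same function to a few strings. The example inputs are hypothetical and chosen only for illustration.

```python
def normalize_text(text):
    """Lowercase, strip outer whitespace, and replace underscores with spaces."""
    return str(text).strip().lower().replace("_", " ")

# Hypothetical inputs, not taken from the dataset
examples = ["United_States", "  World War II ", "EU"]
normalized = [normalize_text(s) for s in examples]
print(normalized)

# Whitespace is stripped before underscores are replaced,
# so a lone underscore becomes a single space rather than an empty string
lone_underscore = normalize_text("_")
print(repr(lone_underscore))
```

The last case may explain blank-looking anchor texts in later previews: an anchor made only of underscores normalizes to spaces.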

In [9]:
# Apply normalization to anchor text
anchor_text['norm_anchor_text'] = anchor_text['anchor_text'].apply(normalize_text)

In [10]:
# Assess presence of Null values in anchor_text
print(f"There are {anchor_text['anchor_text'].isnull().sum():,} 'None' values in anchor_text.")

There are 3,581 'None' values in anchor_text.


In [11]:
# Filter out None values
print("Before: {:,}".format(len(anchor_text)))
anchor_text = anchor_text[anchor_text['anchor_text'].notnull()]
print("After: {:,}".format(len(anchor_text)))

Before: 15,269,229
After: 15,265,648


In [12]:
# Preview dataframe
anchor_text.head(5)

Unnamed: 0,anchor_text,target_page_id,count,norm_anchor_text
0,United States,3434750,152451,united states
1,World War II,32927,133668,world war ii
2,India,14533,112069,india
3,France,5843419,109669,france
4,footballer,10568,101027,footballer


In [13]:
%%time
# Aggregate duplicates that now exist as a result of normalization
# Example: 'United States' and 'united states' may both link to page 3434750 in separate rows
# because they differed pre-normalization, but now we want to treat them the same
# Result: one row for every normalized anchor text -> page pair
print("Before: {:,}".format(len(anchor_text)))
anchor_text = anchor_text.groupby(['norm_anchor_text', 'target_page_id'], as_index=False).agg({'count': sum})
display(anchor_text.head(5))
print("After: {:,}".format(len(anchor_text)))
print("-------------------")

Before: 15,265,648


Unnamed: 0,norm_anchor_text,target_page_id,count
0,,477081,2
1,,51968092,6
2,(album),51968092,1
3,(underscore),477081,1
4,builtin popcount,1127884,1


After: 14,431,024
-------------------
CPU times: user 1min 9s, sys: 11.4 s, total: 1min 20s
Wall time: 1min 27s
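
The aggregation step can be illustrated on a toy dataframe. This is a minimal sketch with made-up counts, assuming the same column names as above; it is not a sample of the real data.

```python
import pandas as pd

# Toy rows with made-up counts: two rows collapse to one normalized pair
toy = pd.DataFrame({
    "norm_anchor_text": ["united states", "united states", "india"],
    "target_page_id": [3434750, 3434750, 14533],
    "count": [10, 5, 7],
})

# One row per (normalized anchor text, target page) pair, with counts summed
merged = (
    toy.groupby(["norm_anchor_text", "target_page_id"], as_index=False)
       .agg({"count": "sum"})
)
print(merged)
```

Grouping on both columns matters: the same anchor text linking to a different page stays a separate row.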


#### Join Page Statistics to Anchor Text Data

This provides us with information on page views and link counts for every anchor text string in our desired format.

In [14]:
%%time
# Merge anchor_text and article stats dataframes
anchor_text = pd.merge(
    anchor_text,
    article_df,
    how="inner",
    left_on="target_page_id",
    right_on="page_id")

# Rename columns for clarity
anchor_text = anchor_text.rename(columns={
    'item_id': 'target_item_id',
    'views': 'target_page_views',
    'count': 'anchor_target_count',
    'page_title': 'target_page_title'})

# Specify column ordering
anchor_text = anchor_text[[
    "norm_anchor_text",
    "target_page_id",
    "target_item_id",
    "target_page_title",
    "target_page_views",
    "anchor_target_count"]]

# Display preview
anchor_text.head(3)

CPU times: user 21.6 s, sys: 22.7 s, total: 44.2 s
Wall time: 53.2 s


Unnamed: 0,norm_anchor_text,target_page_id,target_item_id,target_page_title,target_page_views,anchor_target_count
0,,477081,11199,Underscore,8857,2
1,(underscore),477081,11199,Underscore,8857,1
2,underline mark,477081,11199,Underscore,8857,1
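
Because the merge above is an inner join, any anchor row whose target page is missing from the article table is dropped silently. The sketch below shows one way to count such rows on toy data using the `indicator` option of `DataFrame.merge`. The toy frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: page 99 has no row in the article table
anchors = pd.DataFrame({
    "norm_anchor_text": ["a", "b", "c"],
    "target_page_id": [1, 2, 99],
    "count": [3, 4, 5],
})
articles = pd.DataFrame({"page_id": [1, 2], "views": [100, 200]})

# A left join with indicator=True labels each row by where it was found
checked = anchors.merge(
    articles,
    how="left",
    left_on="target_page_id",
    right_on="page_id",
    indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} anchor row(s) have no matching article")
```

Running a check like this before the inner join would show how many anchor rows the join is about to discard.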


In [15]:
# Display filtered example
anchor_text[anchor_text['norm_anchor_text'] == 'united states']

Unnamed: 0,norm_anchor_text,target_page_id,target_item_id,target_page_title,target_page_views,anchor_target_count
6995,united states,496626,2416331,Forbes_400,3790,1
15041,united states,320247,482421,United_States_Army_Special_Forces,26292,1
28490,united states,2175,40578,American_cuisine,7398,13
29141,united states,3434750,30,United_States,460156,152456
29254,united states,19792942,846570,Americans,14146,15
...,...,...,...,...,...,...
14020894,united states,61098676,65049959,United_States_at_the_2020_Summer_Paralympics,33,1
14020896,united states,61115649,65050638,United_States_men's_national_3x3_team,145,2
14020899,united states,62050037,85814603,Water_polo_in_the_United_States,101,2
14020901,united states,62548724,85812126,United_States_at_the_2019_Winter_Deaflympics,5,1


## Calculate Prior Confidence

To create a confidence score that can serve as a prior, we calculate, for every anchor text, each target page's share of that anchor text's totals: its share of all links made with that anchor text (frequency) and its share of the summed page views across all pages the anchor text links to (popularity). A high share means the page is the dominant link target for the anchor text or by far the most viewed of its targets, and we treat both as indicators that the link is likely correct.
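
As a worked example of the two priors, the sketch below computes them on a toy dataframe with made-up numbers. It uses `groupby(...).transform("sum")` to get the per-anchor totals, which is an alternative way to express the calculation and not necessarily how the cells below do it.

```python
import pandas as pd

# Toy data with made-up counts and views
toy = pd.DataFrame({
    "norm_anchor_text": ["eu", "eu", "german"],
    "anchor_target_count": [90, 10, 4],
    "target_page_views": [300, 100, 50],
})

# Per-anchor totals, broadcast back to every row of the group
totals = toy.groupby("norm_anchor_text")[
    ["anchor_target_count", "target_page_views"]
].transform("sum")

# Each row's share of its anchor text's total
toy["prior_target_count"] = toy["anchor_target_count"] / totals["anchor_target_count"]
toy["prior_page_views"] = toy["target_page_views"] / totals["target_page_views"]
print(toy)
```

Within each anchor text, both priors sum to 1. An anchor text with a single target page gets a prior of 1.0 under both measures, regardless of how rarely it is used.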

In [16]:
# %%time
# # Load the saved priors so you don't need to re-run the multi-hour calculation each time

# # Base path to predictions
# preds_path = '../../predictions/'

# # Load anchor text dataframe with priors
# anchor_text = pd.read_csv(os.path.join(preds_path, "anchortext_with_priors.csv"))

In [17]:
%%time
# Calculate total link count and total page views per anchor text (the denominators for the priors)
anchor_text_totals = anchor_text.groupby('norm_anchor_text', as_index=False).agg({
    'anchor_target_count': sum,
    'target_page_views': sum
})
anchor_text_totals[anchor_text_totals['norm_anchor_text'] == 'united states']

CPU times: user 57.4 s, sys: 3.26 s, total: 1min
Wall time: 1min 2s


Unnamed: 0,norm_anchor_text,anchor_target_count,target_page_views
10552160,united states,160780,8572980


In [18]:
%%time
# Save values as dictionary to speed up downstream processing
totals_dict = {}
for i in tqdm(range(len(anchor_text_totals))):
    row = anchor_text_totals.loc[i]
    totals_dict[row['norm_anchor_text']] = (row['anchor_target_count'], row['target_page_views'])

100%|██████████| 11327029/11327029 [29:15<00:00, 6451.74it/s] 

CPU times: user 26min 32s, sys: 1min 23s, total: 27min 55s
Wall time: 29min 15s





In [19]:
%%time
# Calculate individual link prior as "confidence"
list_prior_freq = []
list_prior_pop = []

for i in tqdm(range(len(anchor_text))):
    row = anchor_text.loc[i]
    prior_freq = round(row['anchor_target_count'] / totals_dict[row['norm_anchor_text']][0], 7)
    prior_pop = round(row['target_page_views'] / totals_dict[row['norm_anchor_text']][1], 7)
    
    list_prior_freq.append(prior_freq)
    list_prior_pop.append(prior_pop)

  
100%|██████████| 14431024/14431024 [1:56:09<00:00, 2070.47it/s]  


CPU times: user 56min 49s, sys: 17min 19s, total: 1h 14min 9s
Wall time: 1h 56min 9s


In [20]:
# Save prior to dataframe
anchor_text['prior_target_count'] = list_prior_freq
anchor_text['prior_page_views'] = list_prior_pop

In [21]:
# Preview dataframe
anchor_text.head(10)

Unnamed: 0,norm_anchor_text,target_page_id,target_item_id,target_page_title,target_page_views,anchor_target_count,prior_target_count,prior_page_views
0,,477081,11199,Underscore,8857,2,0.25,0.946665
1,(underscore),477081,11199,Underscore,8857,1,1.0,1.0
2,underline mark,477081,11199,Underscore,8857,1,1.0,1.0
3,underscore,477081,11199,Underscore,8857,68,0.819277,0.466674
4,underscore ( ),477081,11199,Underscore,8857,1,1.0,1.0
5,underscores,477081,11199,Underscore,8857,3,0.75,0.968507
6,,51968092,27536732,(album),499,6,0.75,0.053335
7,(album),51968092,27536732,(album),499,1,1.0,1.0
8,builtin popcount,1127884,5645805,Hamming_weight,3363,1,1.0,1.0
9,available in only some high-level languages,1127884,5645805,Hamming_weight,3363,1,1.0,1.0


In [22]:
# Preview single anchor text in dataframe
anchor_text[anchor_text['norm_anchor_text'] == 'united states'][:10]

Unnamed: 0,norm_anchor_text,target_page_id,target_item_id,target_page_title,target_page_views,anchor_target_count,prior_target_count,prior_page_views
6995,united states,496626,2416331,Forbes_400,3790,1,6e-06,0.000442
15041,united states,320247,482421,United_States_Army_Special_Forces,26292,1,6e-06,0.003067
28490,united states,2175,40578,American_cuisine,7398,13,8.1e-05,0.000863
29141,united states,3434750,30,United_States,460156,152456,0.948227,0.053675
29254,united states,19792942,846570,Americans,14146,15,9.3e-05,0.00165
30113,united states,3058522,3021105,Race_and_ethnicity_in_the_United_States,91352,2,1.2e-05,0.010656
38855,united states,4567589,7892316,United_States_and_the_International_Criminal_C...,2184,1,6e-06,0.000255
57494,united states,1974940,2630384,Beauty_and_the_Geek,2355,1,6e-06,0.000275
62363,united states,27007275,1102213,Eugenics_in_the_United_States,7599,4,2.5e-05,0.000886
68476,united states,423161,180072,Billboard_Hot_100,49494,31,0.000193,0.005773


In [23]:
# Save with priors so you don't need to re-run the multi-hour calculation each time

# Base path to predictions
preds_path = '../../predictions/'

# Save anchor text dataframe with priors
anchor_text.to_csv(os.path.join(preds_path, "anchortext_with_priors.csv"), index=False)

# Generate Candidate Pools

## Anchor Link Frequency

This model generates a candidate pool of Wikipedia pages for each full mention by selecting the pages that the matching anchor string links to most often.

In [24]:
%%time
# Sort dataframe by anchor text and then most frequently linked page
anchor_text = anchor_text.sort_values(['norm_anchor_text', 'anchor_target_count'], ascending=False)

CPU times: user 1min 15s, sys: 22.9 s, total: 1min 38s
Wall time: 1min 58s


In [25]:
%%time
# Return just the top N most linked entities to create our candidate pool for each anchor link
# Testing shows that 5-10 candidates is the most effective range for downstream congruence impact
top_N = 5
anchor_text_link_frequency = anchor_text.groupby('norm_anchor_text').head(top_N).reset_index(drop=True)

CPU times: user 1min 3s, sys: 30.2 s, total: 1min 33s
Wall time: 1min 54s
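
The sort-then-`head` pattern used above can be checked on a toy dataframe. This is a minimal sketch with made-up counts and a hypothetical `top_n` of 2.

```python
import pandas as pd

# Toy data with made-up link counts
toy = pd.DataFrame({
    "norm_anchor_text": ["eu", "eu", "eu", "german"],
    "target_page_id": [1, 2, 3, 4],
    "anchor_target_count": [5, 50, 20, 7],
})

top_n = 2  # hypothetical pool size for this example

# Sort so the most-linked pages come first within each anchor text
ranked = toy.sort_values(["norm_anchor_text", "anchor_target_count"], ascending=False)

# head(n) on a groupby keeps the first n rows of each group in their current order
pool = ranked.groupby("norm_anchor_text").head(top_n).reset_index(drop=True)
print(pool)
```

Because `groupby(...).head()` preserves row order, the sort must happen first. Otherwise the pool would hold an arbitrary N rows, not the top N.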


In [26]:
# Manually preview 'united states'
anchor_text_link_frequency[anchor_text_link_frequency['norm_anchor_text'] == 'united states']

Unnamed: 0,norm_anchor_text,target_page_id,target_item_id,target_page_title,target_page_views,anchor_target_count,prior_target_count,prior_page_views
882966,united states,3434750,30,United_States,460156,152456,0.948227,0.053675
882967,united states,582488,164134,United_States_men's_national_soccer_team,25804,1466,0.009118,0.00301
882968,united states,647757,334526,United_States_women's_national_soccer_team,12292,594,0.003694,0.001434
882969,united states,1145226,1143805,United_States_national_rugby_union_team,1165,352,0.002189,0.000136
882970,united states,945923,913651,United_States_men's_national_ice_hockey_team,2537,257,0.001599,0.000296


In [27]:
# Assess remaining rows
print("Unique anchor links numbered {:,}".format(len(anchor_text)))
print("Remaining dataframe contains {:,} rows".format(len(anchor_text_link_frequency)))

Unique anchor links numbered 14,431,024
Remaining dataframe contains 12,906,013 rows


Keeping the top N removed only about 1.5 million of 14.4 million rows (roughly 11%), suggesting that relatively few anchor texts link to more than N distinct pages. To append candidate pools to our ACY input data, we produce a dictionary mapping each anchor text to its candidate pool.

In [28]:
# # If the dictionary was saved by a prior run, load the JSON file instead of re-running the whole thing
# # Load dictionary
# with open('../../predictions/dict_anchor_pool_frequency.json', 'r') as filepath:
#     dict_anchor_pool_frequency = json.load(filepath)

In [29]:
%%time
# Group by anchor text to produce list of item IDs, page IDs and page titles (our candidate pools)
anchor_text_candidate_pools = anchor_text_link_frequency.groupby('norm_anchor_text')\
                                    [['target_page_id',
                                      'target_page_title',
                                      'target_item_id',
                                      'prior_target_count']]\
                                    .agg(lambda x: list(x)).reset_index()

CPU times: user 17min 6s, sys: 4min 46s, total: 21min 53s
Wall time: 27min 3s


In [30]:
%%time
## Add to dictionary for faster searching later in the pipeline

# Create dictionary
dict_anchor_pool_frequency = {}

# Add lists to dictionary with anchor text as search term
# This should match the full mention search when measuring accuracy later
for i in tqdm(range(len(anchor_text_candidate_pools))):
    row = anchor_text_candidate_pools.loc[i]
    dict_anchor_pool_frequency[row['norm_anchor_text']] = [row['target_page_id'],
                                                           row['target_page_title'],
                                                           row['target_item_id'],
                                                           row['prior_target_count']]

100%|██████████| 11327029/11327029 [51:14<00:00, 3684.50it/s]  


CPU times: user 25min 47s, sys: 7min 50s, total: 33min 37s
Wall time: 51min 14s
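
Row-by-row `.loc` access is a large part of the runtime above. As a hedged alternative, the sketch below builds the same kind of dictionary by iterating over plain tuples with `itertuples`. The toy dataframe and its values are illustrative only, and the speedup on the full data has not been measured here.

```python
import pandas as pd

# Toy stand-in for the grouped candidate pools (illustrative values)
pools = pd.DataFrame({
    "norm_anchor_text": ["eu", "german"],
    "target_page_id": [[9317, 9239], [11867]],
    "target_page_title": [["European_Union", "Europe"], ["Germany"]],
    "target_item_id": [[458, 46], [183]],
    "prior_target_count": [[0.92, 0.02], [0.42]],
})

value_cols = ["target_page_id", "target_page_title", "target_item_id", "prior_target_count"]

# Each tuple is (anchor text, page ids, titles, item ids, priors);
# the starred target collects the four lists into the dictionary value
lookup = {
    key: list(values)
    for key, *values in pools[["norm_anchor_text"] + value_cols].itertuples(index=False, name=None)
}
print(lookup["eu"])
```

The resulting values have the same layout as in the loop above: a list of four lists in the order page IDs, titles, item IDs, priors.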


#### Demonstrate search performance boost of dictionary

In [31]:
%%time
# Demonstrate search benefit of storing as dictionary
o = dict_anchor_pool_frequency['united states']

CPU times: user 10 µs, sys: 145 µs, total: 155 µs
Wall time: 479 µs


In [32]:
%%time
# Compare Pandas dataframe search to dictionary search
o = anchor_text_candidate_pools[anchor_text_candidate_pools['norm_anchor_text'] == 'united states']

CPU times: user 10.5 s, sys: 1min 21s, total: 1min 32s
Wall time: 3min 43s


In [33]:
# Save dictionary
with open('../../predictions/dict_anchor_pool_frequency.json', 'w') as filepath:
    json.dump(dict_anchor_pool_frequency, filepath)

## Anchor Link Popularity

This model generates a candidate pool of Wikipedia pages for each full mention by taking the pages that the matching anchor string has linked to and ranking them by page views.

In [34]:
%%time
# Sort dataframe by anchor text and then most viewed page
anchor_text = anchor_text.sort_values(['norm_anchor_text', 'target_page_views'], ascending=False)

CPU times: user 1min 24s, sys: 1min 22s, total: 2min 47s
Wall time: 4min 13s


In [35]:
%%time
# Return just the top N most viewed entities to create our candidate pool for each anchor link
top_N = 5
anchor_text_link_popularity = anchor_text.groupby('norm_anchor_text').head(top_N).reset_index(drop=True)

CPU times: user 1min 6s, sys: 27.6 s, total: 1min 33s
Wall time: 1min 50s


In [36]:
# Manually test United States to assess resulting dataframe
anchor_text_link_popularity[anchor_text_link_popularity['norm_anchor_text'] == 'united states']

Unnamed: 0,norm_anchor_text,target_page_id,target_item_id,target_page_title,target_page_views,anchor_target_count,prior_target_count,prior_page_views
882966,united states,3434750,30,United_States,460156,152456,0.948227,0.053675
882967,united states,63136490,83873577,COVID-19_pandemic_in_the_United_States,428030,31,0.000193,0.049928
882968,united states,58993617,41174436,2020_Formula_One_World_Championship,343066,1,6e-06,0.040017
882969,united states,44751865,19600530,Black_Lives_Matter,250974,1,6e-06,0.029275
882970,united states,12610470,1682357,List_of_states_and_territories_of_the_United_S...,185044,1,6e-06,0.021585


In [37]:
# Assess remaining rows
print("Unique anchor links numbered {:,}".format(len(anchor_text)))
print("Remaining dataframe contains {:,} rows".format(len(anchor_text_link_popularity)))

Unique anchor links numbered 14,431,024
Remaining dataframe contains 12,906,013 rows


In [38]:
# # If the dictionary was saved by a prior run, load the JSON file instead of re-running the whole thing
# # Load dictionary
# with open('../../predictions/dict_anchor_pool_popularity.json', 'r') as filepath:
#     dict_anchor_pool_popularity = json.load(filepath)

In [39]:
%%time
# Group by anchor text to produce list of item IDs, page IDs and page titles (our candidate pools)
anchor_text_candidate_pools = anchor_text_link_popularity.groupby('norm_anchor_text')\
                                    [['target_page_id',
                                      'target_page_title',
                                      'target_item_id',
                                      'prior_page_views']]\
                                    .agg(lambda x: list(x)).reset_index()

CPU times: user 16min 9s, sys: 11min 24s, total: 27min 33s
Wall time: 51min 33s


In [40]:
%%time
## Add to dictionary for faster searching later in the pipeline

# Create dictionary
dict_anchor_pool_popularity = {}

# Add lists to dictionary with anchor text as search term
# This should match the full mention search when measuring accuracy later
for i in tqdm(range(len(anchor_text_candidate_pools))):
    row = anchor_text_candidate_pools.loc[i]
    dict_anchor_pool_popularity[row['norm_anchor_text']] = [row['target_page_id'],
                                                           row['target_page_title'],
                                                           row['target_item_id'],
                                                           row['prior_page_views']]

100%|██████████| 11327029/11327029 [59:40<00:00, 3163.31it/s]  


CPU times: user 24min 40s, sys: 10min 24s, total: 35min 4s
Wall time: 59min 40s


In [41]:
# Save dictionary
with open('../../predictions/dict_anchor_pool_popularity.json', 'w') as filepath:
    json.dump(dict_anchor_pool_popularity, filepath)

# Assess Accuracy of Anchor Link Models without Congruence

For each full mention in our ACY input dataset, we now append the generated candidate pool as a column and save our predictions.

In [42]:
# Normalize full mentions for direct comparison with normalized anchor texts
acy_input['norm_full_mention'] = acy_input['full_mention'].apply(normalize_text)

## Anchor Link Frequency

In [43]:
# Copy input dataframe
preds_anchor_frequency = acy_input.copy()

In [44]:
# For each full mention, retrieve the candidate pool generated by the model
candidate_pools_page_ids = []
candidate_pools_item_ids = []
candidate_pools_titles = []
candidate_pools_priors = []

# Track metrics
oov_error = 0

for i in tqdm(range(len(acy_input))):
    
    # Retrieve normalized full mention
    full_mention = acy_input['norm_full_mention'][i]
    
    # Retrieve candidate pools for full mention
    try:
        dicts = dict_anchor_pool_frequency[full_mention]
    except KeyError:
        oov_error += 1
        dicts = (None, None, None, None)
        
    candidate_pool_page_ids = dicts[0]
    candidate_pool_titles = dicts[1]
    candidate_pool_item_ids = dicts[2]
    candidate_pool_priors = dicts[3]
    
    # Save candidate pools
    candidate_pools_page_ids.append(candidate_pool_page_ids)
    candidate_pools_item_ids.append(candidate_pool_item_ids)
    candidate_pools_titles.append(candidate_pool_titles)
    candidate_pools_priors.append(candidate_pool_priors)
    
preds_anchor_frequency['candidate_pool_page_ids'] = candidate_pools_page_ids
preds_anchor_frequency['candidate_pool_item_ids'] = candidate_pools_item_ids
preds_anchor_frequency['candidate_pool_titles'] = candidate_pools_titles
preds_anchor_frequency['candidate_pool_priors'] = candidate_pools_priors

100%|██████████| 21810/21810 [00:14<00:00, 1512.17it/s]


In [45]:
print(f"We received {oov_error:,} Out-of-Vocabulary Errors.")

We received 3,782 Out-of-Vocabulary Errors.


An Out-of-Vocabulary (OOV) error means that a full mention does not exist as an anchor link string in our anchor link statistics dataframe. This can happen because the ACY dataset and the anchor link statistics were created at different times, or because language naturally develops new terms and phrases.
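
The lookup-with-fallback logic can also be written with `dict.get` instead of `try`/`except`. The sketch below is one way to do that. The helper name `lookup_pool` and the toy dictionary are hypothetical and do not appear elsewhere in this notebook.

```python
def lookup_pool(mention, pool_dict):
    """Return (pool, found). The pool is four Nones when the mention is out of vocabulary."""
    entry = pool_dict.get(mention)
    if entry is None:
        return [None, None, None, None], False
    return entry, True

# Hypothetical toy dictionary in the same layout: page ids, titles, item ids, priors
toy_dict = {"eu": [[9317], ["European_Union"], [458], [0.92]]}

mentions = ["eu", "some unseen phrase"]
results = [lookup_pool(m, toy_dict) for m in mentions]

# Count mentions that were not found in the dictionary
oov_count = sum(1 for _, found in results if not found)
print(f"We received {oov_count:,} Out-of-Vocabulary Errors.")
```

Returning the `found` flag alongside the pool keeps the OOV count separate from the pool contents, so both can be inspected later.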

In [46]:
# Preview dataframe
preds_anchor_frequency.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,norm_full_mention,candidate_pool_page_ids,candidate_pool_item_ids,candidate_pool_titles,candidate_pool_priors
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu,"[9317, 9239, 21347120, 9477, 1882861]","[458, 46, 211593, 1396, 363404]","[European_Union, Europe, Eu,_Seine-Maritime, E...","[0.9227799, 0.024651, 0.020196, 0.005346, 0.00..."
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german,"[11867, 11884, 152735, 21212, 12674]","[183, 188, 42884, 7318, 43287]","[Germany, German_language, Germans, Nazi_Germa...","[0.4192066, 0.2893363, 0.1470461, 0.03832, 0.0..."
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british,"[31717, 19097669, 13530298, 4721, 158019]","[145, 842438, 23666, 8680, 161885]","[United_Kingdom, British_people, Great_Britain...","[0.6101256, 0.1146913, 0.0681775, 0.0366451, 0..."


In [47]:
# Calculate overall accuracy with no intelligent filtering applied yet
accurate_predictions = 0
for i in range(len(preds_anchor_frequency)):
    try:
        if preds_anchor_frequency['wikipedia_page_ID'][i] == preds_anchor_frequency['candidate_pool_page_ids'][i][0]:
            accurate_predictions += 1
    except TypeError:
        pass
print("****************************")
print(f"Overall Predictive Accuracy: {round(accurate_predictions / len(preds_anchor_frequency) * 100, 3)}%")
print("****************************")

****************************
Overall Predictive Accuracy: 52.572%
****************************


In [48]:
# Calculate percentage of candidate pools with the correct answer present
# Necessary to determine if shuffling pool with congruence could even get the right answer
response_present = 0
for i in range(len(preds_anchor_frequency)):
    try:
        if preds_anchor_frequency['wikipedia_page_ID'][i] in preds_anchor_frequency['candidate_pool_page_ids'][i]:
            response_present += 1
    except TypeError:
        pass
print(f"Correct answer is present in {round(response_present / len(preds_anchor_frequency) * 100, 3)}% of generated candidate pools via Anchor Links Frequency method.")

Correct answer is present in 65.878% of generated candidate pools via Anchor Links Frequency method.


#### Confirm Accuracy Only for Full Mentions with True Positives

We have standardized on comparing accuracy across model types using only full mentions that have a known true entity, that is, a non-null `wikipedia_page_ID`.

In [49]:
# Calculate accuracy only for known true values
accurate_predictions = 0
preds_frequency_nonulls = preds_anchor_frequency[preds_anchor_frequency['wikipedia_page_ID'].notnull()].reset_index()
for i in range(len(preds_frequency_nonulls)):
    try:
        if preds_frequency_nonulls['wikipedia_page_ID'][i] == preds_frequency_nonulls['candidate_pool_page_ids'][i][0]:
            accurate_predictions += 1
    except TypeError:
        pass
print("****************************")
print(f"Predictive Accuracy: {round(accurate_predictions / len(preds_frequency_nonulls) * 100, 3)}%")
print("****************************")

****************************
Predictive Accuracy: 72.032%
****************************


In [50]:
# Calculate percentage of candidate pools with the correct answer present for known true values
# Necessary to determine if shuffling pool could even get the right answer
response_present = 0
for i in range(len(preds_frequency_nonulls)):
    try:
        if preds_frequency_nonulls['wikipedia_page_ID'][i] in preds_frequency_nonulls['candidate_pool_page_ids'][i]:
            response_present += 1
    except TypeError:
        pass
print(f"Correct answer is present in {round(response_present / len(preds_frequency_nonulls) * 100, 3)}% of generated candidate pools via Anchor Links Frequency method.")

Correct answer is present in 90.263% of generated candidate pools via Anchor Links Frequency method.
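
The two metrics reported above, top-1 accuracy and the share of pools that contain the correct page, can be sketched on toy data as follows. The helper names `top1_hit` and `in_pool` and the toy rows are hypothetical; the column names follow the dataframe above.

```python
import pandas as pd

# Toy predictions: the last row has no known true page, the third has no pool
toy = pd.DataFrame({
    "wikipedia_page_ID": [11867.0, 31717.0, 9317.0, None],
    "candidate_pool_page_ids": [[11867, 11884], [3434750, 31717], None, [9317]],
})

# Restrict to mentions with a known true page
known = toy[toy["wikipedia_page_ID"].notnull()]

def top1_hit(row):
    """True when the first candidate equals the true page ID."""
    pool = row["candidate_pool_page_ids"]
    return isinstance(pool, list) and len(pool) > 0 and pool[0] == row["wikipedia_page_ID"]

def in_pool(row):
    """True when the true page ID appears anywhere in the candidate pool."""
    pool = row["candidate_pool_page_ids"]
    return isinstance(pool, list) and row["wikipedia_page_ID"] in pool

top1 = known.apply(top1_hit, axis=1).mean()
coverage = known.apply(in_pool, axis=1).mean()
print(f"Top-1 accuracy: {top1:.3f}, pool coverage: {coverage:.3f}")
```

Pool coverage is an upper bound on what any reordering of the pool can achieve, which is why it is tracked alongside top-1 accuracy.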


In [51]:
# Base path to predictions
preds_path = '../../predictions/'

# Save candidate pools dataframe
preds_anchor_frequency.to_csv(os.path.join(preds_path, "anchortext_frequency.csv"), index=False)

## Anchor Link Popularity

In [52]:
# Copy input dataframe
preds_anchor_popularity = acy_input.copy()

In [53]:
# For each full mention, retrieve the candidate pool generated by the model
candidate_pools_page_ids = []
candidate_pools_item_ids = []
candidate_pools_titles = []
candidate_pools_priors = []

# Track metrics
oov_error = 0

for i in tqdm(range(len(acy_input))):
    
    # Retrieve normalized full mention
    full_mention = acy_input['norm_full_mention'][i]
    
    # Retrieve candidate pools for full mention
    try:
        dicts = dict_anchor_pool_popularity[full_mention]
    except KeyError:
        oov_error += 1
        dicts = (None, None, None, None)
        
    candidate_pool_page_ids = dicts[0]
    candidate_pool_titles = dicts[1]
    candidate_pool_item_ids = dicts[2]
    candidate_pool_priors = dicts[3]
    
    # Save candidate pools
    candidate_pools_page_ids.append(candidate_pool_page_ids)
    candidate_pools_item_ids.append(candidate_pool_item_ids)
    candidate_pools_titles.append(candidate_pool_titles)
    candidate_pools_priors.append(candidate_pool_priors)
    
preds_anchor_popularity['candidate_pool_page_ids'] = candidate_pools_page_ids
preds_anchor_popularity['candidate_pool_item_ids'] = candidate_pools_item_ids
preds_anchor_popularity['candidate_pool_titles'] = candidate_pools_titles
preds_anchor_popularity['candidate_pool_priors'] = candidate_pools_priors

100%|██████████| 21810/21810 [00:08<00:00, 2680.66it/s]


In [54]:
print(f"We received {oov_error:,} Out-of-Vocabulary Errors.")

We received 3,782 Out-of-Vocabulary Errors.


In [55]:
# Preview dataframe
preds_anchor_popularity.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,norm_full_mention,candidate_pool_page_ids,candidate_pool_item_ids,candidate_pool_titles,candidate_pool_priors
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu,"[9317, 9239, 9891, 9472, 10890716]","[458, 46, 45003, 4916, 185441]","[European_Union, Europe, Entropy, Euro, Member...","[0.2245141, 0.2015168, 0.0965482, 0.0768826, 0..."
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german,"[11867, 27318, 21148, 21212, 26964606]","[183, 334, 55, 7318, 40]","[Germany, Singapore, Netherlands, Nazi_Germany...","[0.0558084, 0.0548498, 0.044104, 0.0268639, 0...."
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british,"[3434750, 31717, 19344654, 26061, 8569916]","[30, 145, 9531, 172771, 1860]","[United_States, United_Kingdom, BBC, Royal_Nav...","[0.1243038, 0.0660245, 0.0317426, 0.0291264, 0..."


In [56]:
# Calculate accuracy
accurate_predictions = 0
for i in range(len(preds_anchor_popularity)):
    try:
        if preds_anchor_popularity['wikipedia_page_ID'][i] == preds_anchor_popularity['candidate_pool_page_ids'][i][0]:
            accurate_predictions += 1
    except TypeError:
        pass
print("****************************")
print(f"Predictive Accuracy: {round(accurate_predictions / len(preds_anchor_popularity) * 100, 3)}%")
print("****************************")

****************************
Predictive Accuracy: 45.158%
****************************


In [57]:
# Calculate percentage of candidate pools with the correct answer present
# Necessary to determine if shuffling pool could even get the right answer
response_present = 0
for i in range(len(preds_anchor_popularity)):
    try:
        if preds_anchor_popularity['wikipedia_page_ID'][i] in preds_anchor_popularity['candidate_pool_page_ids'][i]:
            response_present += 1
    except TypeError:
        pass
print(f"Correct answer is present in {round(response_present / len(preds_anchor_popularity) * 100, 3)}% of generated candidate pools via Anchor Links popularity method.")

Correct answer is present in 62.765% of generated candidate pools via Anchor Links popularity method.


#### Confirm Accuracy Only for Full Mentions with True Positives

In [58]:
# Calculate accuracy only for known true values
accurate_predictions = 0
preds_popularity_nonulls = preds_anchor_popularity[preds_anchor_popularity['wikipedia_page_ID'].notnull()].reset_index()
for i in range(len(preds_popularity_nonulls)):
    try:
        if preds_popularity_nonulls['wikipedia_page_ID'][i] == preds_popularity_nonulls['candidate_pool_page_ids'][i][0]:
            accurate_predictions += 1
    except TypeError:
        pass
print("****************************")
print(f"Predictive Accuracy: {round(accurate_predictions / len(preds_popularity_nonulls) * 100, 3)}%")
print("****************************")

****************************
Predictive Accuracy: 61.873%
****************************


In [59]:
# Calculate percentage of candidate pools with the correct answer present for known true values
# Necessary to determine if shuffling pool could even get the right answer
response_present = 0
for i in range(len(preds_popularity_nonulls)):
    try:
        if preds_popularity_nonulls['wikipedia_page_ID'][i] in preds_popularity_nonulls['candidate_pool_page_ids'][i]:
            response_present += 1
    except TypeError:
        pass
print(f"Correct answer is present in {round(response_present / len(preds_popularity_nonulls) * 100, 3)}% of generated candidate pools via Anchor Links popularity method.")

Correct answer is present in 85.997% of generated candidate pools via Anchor Links popularity method.


In [60]:
# Save candidate pools dataframe
preds_anchor_popularity.to_csv(os.path.join(preds_path, "anchortext_popularity.csv"), index=False)