# Generate Candidate Pool via Anchor Links

This notebook uses Wikipedia anchor links, that is, hyperlinks whose display text (the anchor text) points to a Wikipedia page, to propose a candidate pool of possible entities/pages for each full mention. We propose two ways of ranking the pages an anchor text links to: by popularity (the most viewed pages) and by link frequency (the pages the anchor text links to most often). We output both dataframes for evaluation.

#### Import Packages

In [1]:
import os
import time
import json

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Progress bar
from tqdm import tqdm

### Load Processed ACY Input

In [6]:
# Base path to input
acy_path = '../../data/aida-conll-yago-dataset/'

# Load data
acy_input = pd.read_csv(os.path.join(acy_path, "Aida-Conll-Yago-Input.csv"), delimiter=",")
acy_input.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions
0,B,EU,,,,0,0,"['EU', 'German', 'British']"
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']"
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']"


### Load Kensho Target Dataset

This dataset provides anchor linkage statistics for Wikipedia pages and is provided by Kensho Technologies.

In [3]:
# Base path to KWNLP
kwnlp_path = '../../data/kwnlp'

In [4]:
# Load article data
article_df = pd.read_csv(os.path.join(kwnlp_path, 'kwnlp-enwiki-20200920-article.csv'))
article_df.head(3)

Unnamed: 0,page_id,item_id,page_title,views,len_article_chars,len_intro_chars,in_link_count,out_link_count,tmpl_good_article,tmpl_featured_article,tmpl_pseudoscience,tmpl_conspiracy_theories,isa_Q17442446,isa_Q14795564,isa_Q18340514
0,12,6199,Anarchism,35558,40449,409,3826,371,1,0,0,0,0,0,0
1,25,38404,Autism,40081,47659,419,2313,309,0,1,0,0,0,0,0
2,39,101038,Albedo,10770,18766,293,3090,115,0,0,0,0,0,0,0


In [5]:
# Load anchor target counts data
anchor_df = pd.read_csv(os.path.join(kwnlp_path, 'kwnlp-enwiki-20200920-anchor-target-counts.csv'))
anchor_df.head(3)

Unnamed: 0,anchor_text,target_page_id,count
0,United States,3434750,152451
1,World War II,32927,133668
2,India,14533,112069


### Process Target Data

We normalize the anchor text so that it can later be matched directly against normalized full mentions.

In [6]:
# Copy to new dataframe for processing
anchor_texts = anchor_df.copy()

In [6]:
# Define text normalization function
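def normalize_text(text):
    """
    Normalize a string for matching. Normalized text is:
    - lowercase
    - stripped of leading and trailing whitespace
    - written with spaces instead of underscores
    Punctuation is left untouched for now (TODO: decide whether to remove it).
    """
    return str(text).strip().lower().replace("_", " ")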

In [8]:
# Apply normalization to anchor text
anchor_texts['norm_anchor_text'] = anchor_texts['anchor_text'].apply(normalize_text)
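
As a quick illustration of the normalization rules, the sketch below applies the same function to a few hand-picked strings. The function body is repeated so the snippet runs on its own; the example strings are illustrative and not taken from the Kensho data.

```python
def normalize_text(text):
    """Lowercase, trim surrounding whitespace, and turn underscores into spaces."""
    return str(text).strip().lower().replace("_", " ")

# Illustrative inputs; float("nan") stands in for a missing value in a pandas column
examples = ["United_States", "  World War II ", "U.S.", float("nan")]
normalized = [normalize_text(e) for e in examples]
print(normalized)
# ['united states', 'world war ii', 'u.s.', 'nan']
```

Two things are worth noticing: punctuation survives (`u.s.`), and a missing value turns into the literal string `nan`, which is why rows with a null `anchor_text` need to be filtered out separately.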

In [9]:
# Assess presence of Null values in anchor_text
print(f"There are {anchor_texts['anchor_text'].isnull().sum():,} 'None' values in anchor_text.")

There are 3,581 'None' values in anchor_text.
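
One possible explanation for these missing values, which is an assumption and has not been verified against the Kensho file, is that some anchor texts are literally strings such as `NA` or `null`. By default `pandas.read_csv` converts such strings to missing values. The sketch below reproduces the effect on a tiny in-memory CSV (the first two rows are made up) and shows that `keep_default_na=False` keeps the strings as they are.

```python
import io
import pandas as pd

# Toy CSV: the first two anchor texts and page ids are made up for illustration
csv_text = (
    "anchor_text,target_page_id,count\n"
    "NA,1,10\n"
    "null,2,5\n"
    "India,14533,112069\n"
)

# Default parsing treats "NA" and "null" as missing values
default_df = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False no string is converted to NaN
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["anchor_text"].isnull().sum())   # 2
print(literal_df["anchor_text"].tolist())         # ['NA', 'null', 'India']
```

If this is indeed the cause, reading the file with `keep_default_na=False` would preserve those anchor texts instead of dropping them.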


In [10]:
# Filter out None values
print("Before: {:,}".format(len(anchor_texts)))
anchor_texts = anchor_texts[anchor_texts['anchor_text'].notnull()]
print("After: {:,}".format(len(anchor_texts)))

Before: 15,269,229
After: 15,265,648


#### Join Page Data to Anchor Text Data

This provides us with information on page views and links.

In [11]:
%%time
# Merge anchor-target counts with article statistics
anchor_texts = pd.merge(
    anchor_texts,
    article_df,
    how="inner",
    left_on="target_page_id",
    right_on="page_id")

# Rename columns for clarity
# (article_df has no 'title' column, so only 'page_title' is mapped to the target title)
anchor_texts = anchor_texts.rename(columns={
    'item_id': 'target_item_id',
    'views': 'target_page_views',
    'count': 'anchor_target_count',
    'page_title': 'target_page_title'})

# Specify column ordering
anchor_texts = anchor_texts[[
    "norm_anchor_text",
    "target_page_id",
    "target_item_id",
    "target_page_title",
    "target_page_views",
    "anchor_target_count"]]

# Display preview
anchor_texts.head(3)

CPU times: user 35.4 s, sys: 28.2 s, total: 1min 3s
Wall time: 1min 13s


Unnamed: 0,norm_anchor_text,target_page_id,target_item_id,target_page_title,target_page_views,anchor_target_count
0,united states,3434750,30,United_States,460156,152451
1,american,3434750,30,United_States,460156,65722
2,usa,3434750,30,United_States,460156,8559
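
An inner join silently drops anchor rows whose `target_page_id` has no match in the article table. The row counts reported in this notebook (15,265,648 both after the null filter and later on) suggest nothing was dropped here, but a check like the following sketch makes that explicit. It uses toy frames; the unmatched page id and India's view count are placeholders.

```python
import pandas as pd

anchors = pd.DataFrame({
    "anchor_text": ["United States", "India", "orphan anchor"],
    "target_page_id": [3434750, 14533, 999999999],  # last id is made up
    "count": [152451, 112069, 1],
})
articles = pd.DataFrame({
    "page_id": [3434750, 14533],
    "page_title": ["United_States", "India"],
    "views": [460156, 1000],  # India's view count is a placeholder
})

# A left join with indicator=True marks anchor rows that have no article match
checked = anchors.merge(
    articles,
    how="left",
    left_on="target_page_id",
    right_on="page_id",
    indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched))  # 1
```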


## Calculate Weights

To create a likelihood score that can serve as a prior, we compute, for each anchor text, the share that each linked page holds of that anchor text's totals: its share of all link occurrences (frequency) and its share of all page views across the linked pages (popularity). A page with a high share is either linked very often from that anchor text or very widely viewed, and we treat both as good indicators that the anchor-to-page link is likely correct.
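
The two scores can be sketched in a vectorized form with `groupby(...).transform("sum")`, which broadcasts each anchor text's total back onto its rows. This is a minimal sketch on toy numbers, not the cells used in this notebook (those loop over rows); column names follow the notebook, the data is invented.

```python
import pandas as pd

# Toy data: counts and views are invented for illustration
toy = pd.DataFrame({
    "norm_anchor_text": ["apple", "apple", "apple", "jaguar", "jaguar"],
    "target_page_id": [1, 2, 3, 4, 5],
    "anchor_target_count": [80, 15, 5, 30, 10],
    "target_page_views": [1000, 3000, 1000, 200, 600],
})

# Per-anchor totals, aligned row by row with the original frame
count_totals = toy.groupby("norm_anchor_text")["anchor_target_count"].transform("sum")
view_totals = toy.groupby("norm_anchor_text")["target_page_views"].transform("sum")

# Share of each page within its anchor text
toy["likelihood_target_count"] = (toy["anchor_target_count"] / count_totals).round(7)
toy["likelihood_page_views"] = (toy["target_page_views"] / view_totals).round(7)

print(toy[["norm_anchor_text", "likelihood_target_count", "likelihood_page_views"]])
```

For `apple`, the frequency shares are 0.8, 0.15 and 0.05, while the popularity shares are 0.2, 0.6 and 0.2, so the two criteria can rank the same pages differently.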

In [81]:
%%time
# Sum link counts and page views per anchor text (the denominators of the likelihoods)
anchor_text_totals = anchor_texts.groupby('norm_anchor_text', as_index=False).agg({
    'anchor_target_count': 'sum',
    'target_page_views': 'sum'
})
anchor_text_totals[10000:10010]

CPU times: user 59.1 s, sys: 2.41 s, total: 1min 1s
Wall time: 1min 2s


Unnamed: 0,norm_anchor_text,anchor_target_count,target_page_views
10000,"""changing"" (sigma song)",1,264
10001,"""changing"" (supergirl)",1,2985
10002,"""changsha"" (poem)",1,99
10003,"""channel 42"" (instrumental)",1,42
10004,"""channel 5""",1,123
10005,"""channel 9 show""",1,2146
10006,"""channel dash""",1,2176
10007,"""channel x.""",1,4137
10008,"""channel z""",1,287
10009,"""channel z"" (song)",1,287


In [93]:
%%time
# Save values as dictionary to speed up downstream processing
totals_dict = {}
for i in tqdm(range(len(anchor_text_totals))):
    row = anchor_text_totals.loc[i]
    totals_dict[row['norm_anchor_text']] = (row['anchor_target_count'], row['target_page_views'])

100%|██████████| 11327029/11327029 [27:51<00:00, 6776.74it/s] 


CPU times: user 26min 27s, sys: 49.7 s, total: 27min 17s
Wall time: 27min 51s
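
Most of the time in the loop above is likely spent in the per-row `.loc` lookups. If the lookup dictionary is still wanted, a sketch like the following builds the same structure (anchor text mapped to a `(count total, view total)` tuple) by zipping the columns directly. The frame here is a toy stand-in for `anchor_text_totals`.

```python
import pandas as pd

# Toy stand-in for anchor_text_totals
anchor_text_totals = pd.DataFrame({
    "norm_anchor_text": ["apple", "jaguar"],
    "anchor_target_count": [100, 40],
    "target_page_views": [5000, 800],
})

# Zip the columns instead of indexing row by row
totals_dict = dict(zip(
    anchor_text_totals["norm_anchor_text"],
    zip(
        anchor_text_totals["anchor_target_count"],
        anchor_text_totals["target_page_views"],
    ),
))

print(totals_dict["apple"])  # (100, 5000)
```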


In [105]:
%%time
# Calculate likelihood
list_lklhd_freq = []
list_lklhd_pop = []

for i in tqdm(range(len(anchor_texts))):
    row = anchor_texts.loc[i]
    likelihood_freq = round(row['anchor_target_count'] / totals_dict[row['norm_anchor_text']][0], 7)
    likelihood_pop = round(row['target_page_views'] / totals_dict[row['norm_anchor_text']][1], 7)
    
    list_lklhd_freq.append(likelihood_freq)
    list_lklhd_pop.append(likelihood_pop)

  
100%|██████████| 15265648/15265648 [1:07:57<00:00, 3744.31it/s]


CPU times: user 50min 39s, sys: 6min 51s, total: 57min 30s
Wall time: 1h 7min 57s


In [106]:
# Save likelihood to dataframe
anchor_texts['likelihood_target_count'] = list_lklhd_freq
anchor_texts['likelihood_page_views'] = list_lklhd_pop

In [107]:
# Preview dataframe
anchor_texts.head(10)

Unnamed: 0,norm_anchor_text,target_page_id,target_item_id,target_page_title,target_page_views,anchor_target_count,likelihood_target_count,likelihood_page_views
0,united states,3434750,30,United_States,460156,152451,0.948196,0.05082
1,american,3434750,30,United_States,460156,65722,0.784375,0.082544
2,usa,3434750,30,United_States,460156,8559,0.850035,0.170588
3,u.s.,3434750,30,United_States,460156,7633,0.942346,0.166977
4,us,3434750,30,United_States,460156,5288,0.781093,0.132912
5,united states of america,3434750,30,United_States,460156,4873,0.973043,0.226055
6,america,3434750,30,United_States,460156,3257,0.619201,0.19215
7,americans,3434750,30,United_States,460156,457,0.400526,0.391219
8,the united states,3434750,30,United_States,460156,360,0.593081,0.212039
9,union,3434750,30,United_States,460156,295,0.035993,0.147523


In [109]:
# Preview single anchor text in dataframe
anchor_texts[anchor_texts['norm_anchor_text'] == 'united states'][:10]

Unnamed: 0,norm_anchor_text,target_page_id,target_item_id,target_page_title,target_page_views,anchor_target_count,likelihood_target_count,likelihood_page_views
0,united states,3434750,30,United_States,460156,152451,0.948196,0.05082
63,united states,3434750,30,United_States,460156,5,3.1e-05,0.05082
5304,united states,18951490,41323,American_football,23598,1,6e-06,0.002606
5981,united states,29810,1439,Texas,84077,1,6e-06,0.009286
6162,united states,18309966,485240,Billboard_(magazine),17442,1,6e-06,0.001926
7354,united states,18618239,35657,U.S._state,155646,3,1.9e-05,0.01719
8475,united states,32070,29468,Republican_Party_(United_States),139051,1,6e-06,0.015357
12472,united states,20518076,11220,United_States_Navy,50649,40,0.000249,0.005594
13622,united states,21139,49,North_America,68253,2,1.2e-05,0.007538
14150,united states,423161,180072,Billboard_Hot_100,49494,31,0.000193,0.005466
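
Rows 0 and 63 in the preview above both map `united states` to `United_States`, with counts of 152,451 and 5. Presumably (this is an assumption) several raw anchor texts collapse to the same normalized string. Each duplicate adds the page's views to the per-anchor total again and can later occupy more than one slot in a candidate pool. One way to consolidate such rows is sketched below on three rows copied from the preview; it is a suggestion, not a step this notebook performs.

```python
import pandas as pd

# Three rows taken from the 'united states' preview
rows = pd.DataFrame({
    "norm_anchor_text": ["united states", "united states", "united states"],
    "target_page_id": [3434750, 3434750, 29810],
    "target_page_title": ["United_States", "United_States", "Texas"],
    "target_page_views": [460156, 460156, 84077],
    "anchor_target_count": [152451, 5, 1],
})

# One row per (anchor text, page): add up link counts, keep page-level stats once
deduped = rows.groupby(["norm_anchor_text", "target_page_id"], as_index=False).agg(
    target_page_title=("target_page_title", "first"),
    target_page_views=("target_page_views", "first"),
    anchor_target_count=("anchor_target_count", "sum"),
)

print(deduped)
```

After consolidation, `United_States` appears once with a combined count of 152,456 and its views are counted a single time.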


# Develop Anchor Link Candidate Generation Models

## Anchor Link Frequency

This model generates a candidate pool of Wikipedia pages for each full mention by selecting the pages that the mention's string, used as anchor text, links to most often.

In [110]:
%%time
# Sort dataframe by anchor text and then most frequently linked page
anchor_texts = anchor_texts.sort_values(['norm_anchor_text', 'anchor_target_count'], ascending=False)

CPU times: user 1min 17s, sys: 22.3 s, total: 1min 40s
Wall time: 2min 8s


In [111]:
%%time
# Return just the top N most linked entities to create our candidate pool for each anchor link
top_N = 10
anchor_text_link_frequency = anchor_texts.groupby('norm_anchor_text').head(top_N).reset_index(drop=True)

CPU times: user 56.4 s, sys: 6.37 s, total: 1min 2s
Wall time: 1min 4s


In [112]:
# Manually test United States to assess resulting dataframe
anchor_text_link_frequency[anchor_text_link_frequency['norm_anchor_text'] == 'united states']

Unnamed: 0,norm_anchor_text,target_page_id,target_item_id,target_page_title,target_page_views,anchor_target_count,likelihood_target_count,likelihood_page_views
950752,united states,3434750,30,United_States,460156,152451,0.948196,0.05082
950753,united states,582488,164134,United_States_men's_national_soccer_team,25804,1466,0.009118,0.00285
950754,united states,647757,334526,United_States_women's_national_soccer_team,12292,594,0.003694,0.001357
950755,united states,1145226,1143805,United_States_national_rugby_union_team,1165,352,0.002189,0.000129
950756,united states,945923,913651,United_States_men's_national_ice_hockey_team,2537,257,0.001599,0.00028
950757,united states,924170,279283,Elections_in_the_United_States,14936,243,0.001511,0.001649
950758,united states,378405,3054793,Secondary_education_in_the_United_States,6907,225,0.001399,0.000763
950759,united states,980450,2738955,United_States_national_cricket_team,2043,223,0.001387,0.000226
950760,united states,6311052,1389353,United_States_Davis_Cup_team,335,218,0.001356,3.7e-05
950761,united states,89611,244847,United_States_men's_national_basketball_team,10624,177,0.001101,0.001173


In [113]:
# Assess remaining rows
print("Unique anchor links numbered {:,}".format(len(anchor_texts)))
print("Remaining dataframe contains {:,} rows".format(len(anchor_text_link_frequency)))

Unique anchor links numbered 15,265,648
Remaining dataframe contains 14,009,323 rows


Keeping only the top N reduced the dataframe modestly (from 15,265,648 to 14,009,323 rows), suggesting that only a few anchor texts have more than our selected N = 10 links. To append candidate pools to our ACY input data, we build a dictionary that maps each anchor text to its candidate pool.

In [3]:
# # In case of a prior run, load the saved JSON file instead of re-running everything above
# # Load dictionary
# with open('../../predictions/dict_anchor_pool_frequency.json', 'r') as filepath:
#     dict_anchor_pool_frequency = json.load(filepath)

In [115]:
%%time
# Group by anchor text to produce list of item IDs, page IDs and page titles (our candidate pools)
anchor_text_candidate_pools = anchor_text_link_frequency.groupby('norm_anchor_text')\
                                    [['target_page_id',
                                      'target_page_title',
                                      'target_item_id',
                                      'likelihood_target_count']]\
                                    .agg(lambda x: list(x)).reset_index()

CPU times: user 14min 57s, sys: 4min 1s, total: 18min 59s
Wall time: 21min 34s


In [116]:
%%time
## Add to dictionary for faster searching later in the pipeline

# Create dictionary
dict_anchor_pool_frequency = {}

# Add lists to dictionary with anchor text as search term
# This should match the full mention search when measuring accuracy later
for i in tqdm(range(len(anchor_text_candidate_pools))):
    row = anchor_text_candidate_pools.loc[i]
    dict_anchor_pool_frequency[row['norm_anchor_text']] = [row['target_page_id'],
                                                           row['target_page_title'],
                                                           row['target_item_id'],
                                                           row['likelihood_target_count']]

100%|██████████| 11327029/11327029 [20:20<00:00, 9279.88it/s]

CPU times: user 19min 31s, sys: 33.8 s, total: 20min 5s
Wall time: 20min 20s
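
The grouping and dictionary-building cells above can probably be shortened by passing `list` directly to `agg` and iterating with `itertuples`, which avoids the per-row `.loc` lookups. The sketch below runs on a toy frame (all values invented) and produces the same layout as `dict_anchor_pool_frequency`: `[page ids, titles, item ids, likelihoods]` per anchor text.

```python
import pandas as pd

# Toy stand-in for anchor_text_link_frequency; all values are invented
top_links = pd.DataFrame({
    "norm_anchor_text": ["apple", "apple", "jaguar"],
    "target_page_id": [1, 2, 4],
    "target_page_title": ["Apple_Inc.", "Apple", "Jaguar"],
    "target_item_id": [11, 12, 14],
    "likelihood_target_count": [0.8, 0.2, 1.0],
})

columns = ["target_page_id", "target_page_title", "target_item_id", "likelihood_target_count"]

# One row per anchor text, each cell holding the list of values for that column
pools = top_links.groupby("norm_anchor_text")[columns].agg(list)

# itertuples(name=None) yields plain tuples: (anchor text, ids, titles, item ids, likelihoods)
dict_pool = {row[0]: list(row[1:]) for row in pools.itertuples(name=None)}

print(dict_pool["apple"])
```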





#### Demonstrate search performance boost of dictionary

In [117]:
%%time
# Demonstrate search benefit of storing as dictionary
o = dict_anchor_pool_frequency['united states']

CPU times: user 6 µs, sys: 9 µs, total: 15 µs
Wall time: 16.2 µs


In [118]:
%%time
# Compare Pandas dataframe search to dictionary search
o = anchor_text_candidate_pools[anchor_text_candidate_pools['norm_anchor_text'] == 'united states']

CPU times: user 6.96 s, sys: 26.6 s, total: 33.5 s
Wall time: 54.1 s


In [119]:
# Save dictionary
with open('../../predictions/dict_anchor_pool_frequency.json', 'w') as filepath:
    json.dump(dict_anchor_pool_frequency, filepath)

## Anchor Link Popularity

This model generates a candidate pool of Wikipedia pages for each full mention by taking the pages that the mention's string has linked to as anchor text and ranking them by page views, most viewed first.
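
To make the difference between the frequency and popularity criteria concrete, the following sketch ranks the same toy links both ways. The helper `top_candidates` and all numbers are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy links for a single anchor text; counts and views are invented
links = pd.DataFrame({
    "norm_anchor_text": ["mercury"] * 3,
    "target_page_title": ["Mercury_(planet)", "Mercury_(element)", "Freddie_Mercury"],
    "anchor_target_count": [500, 300, 2],
    "target_page_views": [9000, 4000, 20000],
})

def top_candidates(df, by, top_n=2):
    """Return the top_n target titles per anchor text, ranked by column `by`."""
    ranked = df.sort_values(["norm_anchor_text", by], ascending=False)
    return ranked.groupby("norm_anchor_text").head(top_n)["target_page_title"].tolist()

by_frequency = top_candidates(links, "anchor_target_count")
by_popularity = top_candidates(links, "target_page_views")

print(by_frequency)   # ['Mercury_(planet)', 'Mercury_(element)']
print(by_popularity)  # ['Freddie_Mercury', 'Mercury_(planet)']
```

A rarely linked but heavily viewed page can top the popularity ranking even when the anchor text almost never points to it.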

In [120]:
%%time
# Sort dataframe by anchor text and then most viewed page
anchor_texts = anchor_texts.sort_values(['norm_anchor_text', 'target_page_views'], ascending=False)

CPU times: user 1min 11s, sys: 13.6 s, total: 1min 24s
Wall time: 1min 32s


In [121]:
%%time
# Return just the top N most viewed entities to create our candidate pool for each anchor link
top_N = 10
anchor_text_link_popularity = anchor_texts.groupby('norm_anchor_text').head(top_N).reset_index(drop=True)

CPU times: user 57.9 s, sys: 6.03 s, total: 1min 3s
Wall time: 1min 7s


In [122]:
# Manually test United States to assess resulting dataframe
anchor_text_link_popularity[anchor_text_link_popularity['norm_anchor_text'] == 'united states']

Unnamed: 0,norm_anchor_text,target_page_id,target_item_id,target_page_title,target_page_views,anchor_target_count,likelihood_target_count,likelihood_page_views
950752,united states,3434750,30,United_States,460156,152451,0.948196,0.05082
950753,united states,3434750,30,United_States,460156,5,3.1e-05,0.05082
950754,united states,63136490,83873577,COVID-19_pandemic_in_the_United_States,428030,31,0.000193,0.047272
950755,united states,58993617,41174436,2020_Formula_One_World_Championship,343066,1,6e-06,0.037888
950756,united states,44751865,19600530,Black_Lives_Matter,250974,1,6e-06,0.027718
950757,united states,12610470,1682357,List_of_states_and_territories_of_the_United_S...,185044,1,6e-06,0.020436
950758,united states,54803678,37093861,Antifa_(United_States),183516,1,6e-06,0.020268
950759,united states,1649321,131079,List_of_United_States_cities_by_population,163698,1,6e-06,0.018079
950760,united states,18618239,35657,U.S._state,155646,3,1.9e-05,0.01719
950761,united states,3356,1124,Bill_Clinton,155162,1,6e-06,0.017136


In [123]:
# Assess remaining rows
print("Unique anchor links numbered {:,}".format(len(anchor_texts)))
print("Remaining dataframe contains {:,} rows".format(len(anchor_text_link_popularity)))

Unique anchor links numbered 15,265,648
Remaining dataframe contains 14,009,323 rows


In [4]:
# # In case of a prior run, load the saved JSON file instead of re-running everything above
# # Load dictionary
# with open('../../predictions/dict_anchor_pool_popularity.json', 'r') as filepath:
#     dict_anchor_pool_popularity = json.load(filepath)

In [125]:
%%time
# Group by anchor text to produce list of item IDs, page IDs and page titles (our candidate pools)
anchor_text_candidate_pools = anchor_text_link_popularity.groupby('norm_anchor_text')\
                                    [['target_page_id',
                                      'target_page_title',
                                      'target_item_id',
                                      'likelihood_page_views']]\
                                    .agg(lambda x: list(x)).reset_index()

CPU times: user 15min 17s, sys: 6min 11s, total: 21min 28s
Wall time: 27min 3s


In [126]:
%%time
## Add to dictionary for faster searching later in the pipeline

# Create dictionary
dict_anchor_pool_popularity = {}

# Add lists to dictionary with anchor text as search term
# This should match the full mention search when measuring accuracy later
for i in tqdm(range(len(anchor_text_candidate_pools))):
    row = anchor_text_candidate_pools.loc[i]
    dict_anchor_pool_popularity[row['norm_anchor_text']] = [row['target_page_id'],
                                                            row['target_page_title'],
                                                            row['target_item_id'],
                                                            row['likelihood_page_views']]

100%|██████████| 11327029/11327029 [20:00<00:00, 9432.75it/s] 


CPU times: user 19min 16s, sys: 32.4 s, total: 19min 48s
Wall time: 20min


In [127]:
# Save dictionary
with open('../../predictions/dict_anchor_pool_popularity.json', 'w') as filepath:
    json.dump(dict_anchor_pool_popularity, filepath)

# Assess Accuracy of Anchor Link Models without Congruence

For each full mention in our ACY input dataset, we now append the generated candidate pool as a column and save our predictions.

In [7]:
# Normalize full mentions for direct comparison with normalized anchor texts
acy_input['norm_full_mention'] = acy_input['full_mention'].apply(normalize_text)

## Anchor Link Frequency

In [8]:
# Copy input dataframe
preds_anchor_frequency = acy_input.copy()

In [9]:
# For each full mention, retrieve the candidate pool generated by the model
mention_candidate_pools_page_ids = []
mention_candidate_pools_item_ids = []
mention_candidate_pools_titles = []
mention_candidate_pools_likelihood = []

# Track metrics
oov_error = 0

for i in tqdm(range(len(acy_input))):
    
    # Retrieve normalized full mention
    full_mention = acy_input['norm_full_mention'][i]
    
    # Retrieve candidate pools for full mention
    try:
        dicts = dict_anchor_pool_frequency[full_mention]
    except KeyError:
        oov_error += 1
        dicts = (None, None, None, None)
        
    candidate_pool_page_ids = dicts[0]
    candidate_pool_titles = dicts[1]
    candidate_pool_item_ids = dicts[2]
    candidate_pool_likelihoods = dicts[3]
    
    # Save candidate pools
    mention_candidate_pools_page_ids.append(candidate_pool_page_ids)
    mention_candidate_pools_item_ids.append(candidate_pool_item_ids)
    mention_candidate_pools_titles.append(candidate_pool_titles)
    mention_candidate_pools_likelihood.append(candidate_pool_likelihoods)
    
preds_anchor_frequency['candidate_pool_page_ids'] = mention_candidate_pools_page_ids
preds_anchor_frequency['candidate_pool_item_ids'] = mention_candidate_pools_item_ids
preds_anchor_frequency['candidate_pool_titles'] = mention_candidate_pools_titles
preds_anchor_frequency['candidate_pool_likelihoods'] = mention_candidate_pools_likelihood

100%|██████████| 29312/29312 [00:05<00:00, 4886.76it/s]


In [10]:
print(f"We received {oov_error:,} Out-of-Vocabulary Errors.")

We received 4,625 Out-of-Vocabulary Errors.
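
The lookup loop above can also be written more compactly with `dict.get`, which returns a default for out-of-vocabulary mentions instead of raising `KeyError`. This is a minimal sketch on a two-row toy frame; `pool_dict` is a cut-down stand-in for `dict_anchor_pool_frequency`, using the first two `german` candidates from this notebook's preview.

```python
import pandas as pd

# Cut-down stand-in for dict_anchor_pool_frequency: [page ids, titles, item ids, likelihoods]
pool_dict = {
    "german": [
        [11867, 11884],
        ["Germany", "German_language"],
        [183, 188],
        [0.4191423, 0.2892399],
    ],
}
mentions = pd.DataFrame({"norm_full_mention": ["german", "some unseen mention"]})

# Default used when a mention has no anchor-text entry
missing = [None, None, None, None]
pools = [pool_dict.get(m, missing) for m in mentions["norm_full_mention"]]

mentions["candidate_pool_page_ids"] = [p[0] for p in pools]
mentions["candidate_pool_titles"] = [p[1] for p in pools]
mentions["candidate_pool_item_ids"] = [p[2] for p in pools]
mentions["candidate_pool_likelihoods"] = [p[3] for p in pools]

# Count mentions that never occur as anchor text
oov_count = sum(m not in pool_dict for m in mentions["norm_full_mention"])
print(oov_count)  # 1
```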


In [11]:
# Preview dataframe
preds_anchor_frequency.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,norm_full_mention,candidate_pool_page_ids,candidate_pool_item_ids,candidate_pool_titles,candidate_pool_likelihoods
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu,"[9317, 9239, 21347120, 9477, 1882861, 3261189,...","[458, 46, 211593, 1396, 363404, 3327447, 40537...","[European_Union, Europe, Eu,_Seine-Maritime, E...","[0.9227799, 0.024651, 0.020196, 0.005346, 0.00..."
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german,"[11867, 11884, 152735, 21212, 12674, 290327, 1...","[183, 188, 42884, 7318, 43287, 141817, 181287,...","[Germany, German_language, Germans, Nazi_Germa...","[0.4191423, 0.2892399, 0.14703, 0.0382718, 0.0..."
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british,"[31717, 19097669, 13530298, 4721, 158019, 1522...","[145, 842438, 23666, 8680, 161885, 174193, 354...","[United_Kingdom, British_people, Great_Britain...","[0.610078, 0.1146438, 0.0681775, 0.0366451, 0...."


In [12]:
# Calculate accuracy
accurate_predictions = 0
for i in range(len(preds_anchor_frequency)):
    try:
        if preds_anchor_frequency['wikipedia_page_ID'][i] == preds_anchor_frequency['candidate_pool_page_ids'][i][0]:
            accurate_predictions += 1
    except TypeError:
        pass
print("****************************")
print(f"Predictive Accuracy: {round(accurate_predictions / len(preds_anchor_frequency) * 100, 3)}%")
print("****************************")

****************************
Predictive Accuracy: 54.609%
****************************


In [13]:
# Calculate percentage of candidate pools with the correct answer present
# Necessary to determine if shuffling pool could even get the right answer
response_present = 0
for i in range(len(preds_anchor_frequency)):
    try:
        if preds_anchor_frequency['wikipedia_page_ID'][i] in preds_anchor_frequency['candidate_pool_page_ids'][i]:
            response_present += 1
    except TypeError:
        pass
print(f"Correct answer is present in {round(response_present / len(preds_anchor_frequency) * 100, 3)}% of generated candidate pools via Anchor Links Frequency method.")

Correct answer is present in 68.661% of generated candidate pools via Anchor Links Frequency method.
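
The two figures above (top-1 accuracy and the share of pools containing the correct page) can be computed in one pass. The sketch below uses a hypothetical helper, `pool_metrics`, on four toy mentions; like the cells above, it counts mentions with a missing label or an empty pool as misses.

```python
import math

def pool_metrics(true_ids, pools):
    """Return (top-1 accuracy, pool coverage) over all mentions.

    Mentions with a missing label or no candidate pool count as misses.
    """
    top1 = 0
    present = 0
    for true_id, pool in zip(true_ids, pools):
        label_missing = true_id is None or (isinstance(true_id, float) and math.isnan(true_id))
        if pool is None or label_missing:
            continue
        top1 += pool[0] == true_id      # correct page ranked first
        present += true_id in pool      # correct page anywhere in the pool
    n = len(true_ids)
    return top1 / n, present / n

# Toy data: one unlabeled mention, one top-1 hit, one hit further down, one OOV mention
true_ids = [float("nan"), 11867.0, 31717.0, 42.0]
pools = [[9317, 9239], [11867, 11884], [3434750, 31717], None]

accuracy, coverage = pool_metrics(true_ids, pools)
print(accuracy, coverage)  # 0.25 0.5
```

Page IDs stored as floats (for example `11867.0`) compare equal to the integer IDs in the pools, so no type conversion is needed for the comparison.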


#### Confirm Accuracy Only for Full Mentions with a Known True Page

In [31]:
# Calculate accuracy only for known true values
accurate_predictions = 0
preds_frequency_nonulls = preds_anchor_frequency[preds_anchor_frequency['wikipedia_page_ID'].notnull()].reset_index()
for i in range(len(preds_frequency_nonulls)):
    try:
        if preds_frequency_nonulls['wikipedia_page_ID'][i] == preds_frequency_nonulls['candidate_pool_page_ids'][i][0]:
            accurate_predictions += 1
    except TypeError:
        pass
print("****************************")
print(f"Predictive Accuracy: {round(accurate_predictions / len(preds_frequency_nonulls) * 100, 3)}%")
print("****************************")

****************************
Predictive Accuracy: 71.919%
****************************


In [32]:
# Calculate percentage of candidate pools with the correct answer present for known true values
# Necessary to determine if shuffling pool could even get the right answer
response_present = 0
for i in range(len(preds_frequency_nonulls)):
    try:
        if preds_frequency_nonulls['wikipedia_page_ID'][i] in preds_frequency_nonulls['candidate_pool_page_ids'][i]:
            response_present += 1
    except TypeError:
        pass
print(f"Correct answer is present in {round(response_present / len(preds_frequency_nonulls) * 100, 3)}% of generated candidate pools via Anchor Links Frequency method.")

Correct answer is present in 90.425% of generated candidate pools via Anchor Links Frequency method.


In [164]:
# Base path for saved predictions
preds_path = '../../predictions/'

# Save candidate pools dataframe
preds_anchor_frequency.to_csv(os.path.join(preds_path, "anchortext_frequency.csv"), index=False)

## Anchor Link Popularity

In [14]:
# Copy input dataframe
preds_anchor_popularity = acy_input.copy()

In [15]:
# For each full mention, retrieve the candidate pool generated by the model
mention_candidate_pools_page_ids = []
mention_candidate_pools_item_ids = []
mention_candidate_pools_titles = []
mention_candidate_pools_likelihood = []

# Track metrics
oov_error = 0

for i in tqdm(range(len(acy_input))):
    
    # Retrieve normalized full mention
    full_mention = acy_input['norm_full_mention'][i]
    
    # Retrieve candidate pools for full mention
    try:
        dicts = dict_anchor_pool_popularity[full_mention]
    except KeyError:
        oov_error += 1
        dicts = (None, None, None, None)
        
    candidate_pool_page_ids = dicts[0]
    candidate_pool_titles = dicts[1]
    candidate_pool_item_ids = dicts[2]
    candidate_pool_likelihoods = dicts[3]
    
    # Save candidate pools
    mention_candidate_pools_page_ids.append(candidate_pool_page_ids)
    mention_candidate_pools_item_ids.append(candidate_pool_item_ids)
    mention_candidate_pools_titles.append(candidate_pool_titles)
    mention_candidate_pools_likelihood.append(candidate_pool_likelihoods)
    
preds_anchor_popularity['candidate_pool_page_ids'] = mention_candidate_pools_page_ids
preds_anchor_popularity['candidate_pool_item_ids'] = mention_candidate_pools_item_ids
preds_anchor_popularity['candidate_pool_titles'] = mention_candidate_pools_titles
preds_anchor_popularity['candidate_pool_likelihoods'] = mention_candidate_pools_likelihood

100%|██████████| 29312/29312 [00:06<00:00, 4723.50it/s]


In [16]:
print(f"We received {oov_error:,} Out-of-Vocabulary Errors.")

We received 4,625 Out-of-Vocabulary Errors.


In [17]:
# Preview dataframe
preds_anchor_popularity.head(3)

Unnamed: 0,mention,full_mention,wikipedia_URL,wikipedia_page_ID,wikipedia_title,sentence_id,doc_id,congruent_mentions,norm_full_mention,candidate_pool_page_ids,candidate_pool_item_ids,candidate_pool_titles,candidate_pool_likelihoods
0,B,EU,,,,0,0,"['EU', 'German', 'British']",eu,"[9317, 9239, 9891, 9472, 10890716, 2780146, 18...","[458, 46, 45003, 4916, 185441, 932442, 8268, 8...","[European_Union, Europe, Entropy, Euro, Member...","[0.2237612, 0.2008411, 0.0962244, 0.0766248, 0..."
1,B,German,http://en.wikipedia.org/wiki/Germany,11867.0,Germany,0,0,"['EU', 'German', 'British']",german,"[11867, 11867, 27318, 21148, 21212, 21212, 269...","[183, 183, 334, 55, 7318, 7318, 40, 12548, 825...","[Germany, Germany, Singapore, Netherlands, Naz...","[0.0494115, 0.0494115, 0.0485629, 0.0390487, 0..."
2,B,British,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,United Kingdom,0,0,"['EU', 'German', 'British']",british,"[3434750, 31717, 31717, 19344654, 26061, 85699...","[30, 145, 145, 9531, 172771, 1860, 21, 22, 868...","[United_States, United_Kingdom, United_Kingdom...","[0.115186, 0.0611816, 0.0611816, 0.0294143, 0...."


In [18]:
# Calculate accuracy
accurate_predictions = 0
for i in range(len(preds_anchor_popularity)):
    try:
        if preds_anchor_popularity['wikipedia_page_ID'][i] == preds_anchor_popularity['candidate_pool_page_ids'][i][0]:
            accurate_predictions += 1
    except TypeError:
        pass
print("****************************")
print(f"Predictive Accuracy: {round(accurate_predictions / len(preds_anchor_popularity) * 100, 3)}%")
print("****************************")

****************************
Predictive Accuracy: 46.571%
****************************


In [19]:
# Calculate percentage of candidate pools with the correct answer present
# Necessary to determine if shuffling pool could even get the right answer
response_present = 0
for i in range(len(preds_anchor_popularity)):
    try:
        if preds_anchor_popularity['wikipedia_page_ID'][i] in preds_anchor_popularity['candidate_pool_page_ids'][i]:
            response_present += 1
    except TypeError:
        pass
print(f"Correct answer is present in {round(response_present / len(preds_anchor_popularity) * 100, 3)}% of generated candidate pools via Anchor Links popularity method.")

Correct answer is present in 67.535% of generated candidate pools via Anchor Links popularity method.


#### Confirm Accuracy Only for Full Mentions with a Known True Page

In [33]:
# Calculate accuracy only for known true values
accurate_predictions = 0
preds_popularity_nonulls = preds_anchor_popularity[preds_anchor_popularity['wikipedia_page_ID'].notnull()].reset_index()
for i in range(len(preds_popularity_nonulls)):
    try:
        if preds_popularity_nonulls['wikipedia_page_ID'][i] == preds_popularity_nonulls['candidate_pool_page_ids'][i][0]:
            accurate_predictions += 1
    except TypeError:
        pass
print("****************************")
print(f"Predictive Accuracy: {round(accurate_predictions / len(preds_popularity_nonulls) * 100, 3)}%")
print("****************************")

****************************
Predictive Accuracy: 61.334%
****************************


In [34]:
# Calculate percentage of candidate pools with the correct answer present for known true values
# Necessary to determine if shuffling pool could even get the right answer
response_present = 0
for i in range(len(preds_popularity_nonulls)):
    try:
        if preds_popularity_nonulls['wikipedia_page_ID'][i] in preds_popularity_nonulls['candidate_pool_page_ids'][i]:
            response_present += 1
    except TypeError:
        pass
print(f"Correct answer is present in {round(response_present / len(preds_popularity_nonulls) * 100, 3)}% of generated candidate pools via Anchor Links popularity method.")

Correct answer is present in 88.943% of generated candidate pools via Anchor Links popularity method.


In [171]:
# Save candidate pools dataframe
preds_anchor_popularity.to_csv(os.path.join(preds_path, "anchortext_popularity.csv"), index=False)