# Process ACY for Congruence

This notebook takes our input dataset, Aida-Conll-Yago (ACY), and processes it into the form needed for the congruence step. Notably, it adds a sentence ID to each `full_mention`, which lets us determine which `full_mention` values occur in the same context.

## Import Packages

In [7]:
import os
import time
import csv

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Add directory above current directory to path so we can import our pre-built package
import sys; sys.path.insert(0, '../..')
from entity_disambiguation.preprocessing import process_input, normalize_text

from nltk.corpus import stopwords

from tqdm import tqdm

In [4]:
# Import stopwords
# todo how are these used?
try:
    stop = stopwords.words('english')
except LookupError:
    # Download stopwords if it's your first time, then load them
    import nltk
    nltk.download('stopwords')
    stop = stopwords.words('english')

## Load Aida-Conll-Yago

In [5]:
# Relative path to ACY dataset
acy_path = "../../data/aida-conll-yago-dataset/"

# process_input() takes path location where you've stored AIDA-YAGO2-DATASET.tsv file
# Transforms tsv file into train, test split
# x are words, y are indices
# todo for now I'm putting the processing code into this notebook instead of using the Python package
# todo doing this for code clarity
pass

In [10]:
# Read the raw TSV; the context manager closes the file once all rows are read
with open(os.path.join(acy_path, "AIDA-YAGO2-DATASET.tsv"), newline="") as acy_file:
    rows = list(csv.reader(acy_file, delimiter="\t"))

# Skip the first row of the file and name the seven columns
acy_df = pd.DataFrame(data=rows[1:])
new_columns = ["token", "mention", "full_mention", "YAGO2", "wikipedia_URL", "wikipedia_ID", "freebase"]
acy_df = acy_df.rename(columns=dict(zip(range(7), new_columns)))
acy_df.head(3)

Unnamed: 0,token,mention,full_mention,YAGO2,wikipedia_URL,wikipedia_ID,freebase
0,EU,B,EU,--NME--,,,
1,rejects,,,,,,
2,German,B,German,Germany,http://en.wikipedia.org/wiki/Germany,11867.0,/m/0345h


#### Add Sentence ID

To determine which `full_mention` values occur in the same sentence, we add a `sentence_id` that increases by one whenever a period token is followed by a blank row (a row whose token is `None`).
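
As a minimal sketch of the boundary rule used in the loop further down (a `.` token followed by a blank row, i.e. a `None` token), here is the same logic on a hand-made token list. The function name `assign_sentence_ids` and the toy tokens are illustrative assumptions, not part of the notebook.

```python
def assign_sentence_ids(tokens):
    """Return one sentence ID per token.

    The ID is recorded before the boundary check, so the blank row that
    closes a sentence still carries that sentence's ID.
    """
    sentence_id = 0
    prior_token = None
    ids = []
    for token in tokens:
        ids.append(sentence_id)
        # A period followed by a blank row (None) ends the sentence
        if prior_token == "." and token is None:
            sentence_id += 1
        prior_token = token
    return ids


# Toy tokens (assumed for illustration); None stands for a blank row
toy_tokens = ["EU", "rejects", "call", ".", None, "Peter", "Blackburn"]
toy_ids = assign_sentence_ids(toy_tokens)
print(list(zip(toy_tokens, toy_ids)))
```

Note that a period in the middle of a sentence (for example inside an abbreviation) does not trigger a new ID, because the next row is not blank.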


#### Add Doc ID
Articles appear to be separated by a `-DOCSTART-` token, so we also add a `doc_id` that increases by one at each such token.
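
For reference, the document counter can also be computed without an explicit loop. The sketch below is an alternative under stated assumptions, not the approach this notebook uses: it flags rows whose token contains `-DOCSTART-` and takes a cumulative sum. The toy tokens and the name `toy_docs` are made up for illustration.

```python
import pandas as pd

# Toy frame (assumed for illustration); None stands for a blank row
toy_docs = pd.DataFrame(
    {"token": ["-DOCSTART- (1)", "EU", ".", None, "-DOCSTART- (2)", "Rare"]}
)

# True on rows that start a document; na=False treats blank rows as non-matches
is_doc_start = toy_docs["token"].str.contains("-DOCSTART-", regex=False, na=False)

# Each True bumps the running count, so the marker row gets the new doc_id
toy_docs["doc_id"] = is_doc_start.cumsum()
print(toy_docs)
```

As in the loop version, the `-DOCSTART-` row itself is assigned the new document's ID.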

In [20]:
# Confirm length of iteration
print("Tokens in ACY: ", len(acy_df))

Tokens in ACY:  176615


In [63]:
# Start with sentence_id and doc_id of 0
sentence_id = 0
doc_id = 0
prior_token = None

# Prepare lists to fill, one entry per token
sentence_list = []
doc_list = []

# Iterate through every token to assign sentence and document IDs
for row in tqdm(range(len(acy_df))):

    # Find token and add the current sentence ID to the list
    token = acy_df.iloc[row]['token']
    sentence_list.append(sentence_id)

    # A period followed by a blank row (token is None) ends the sentence
    if prior_token == "." and token is None:
        sentence_id += 1

    # A -DOCSTART- token begins a new document; blank rows have no token to check
    if token is not None and "-DOCSTART-" in token:
        doc_id += 1
    doc_list.append(doc_id)

    # Update token to prior token
    prior_token = token

# Add lists to DataFrame as columns
acy_df['sentence_id'] = sentence_list
acy_df['doc_id'] = doc_list

100%|██████████| 176615/176615 [00:20<00:00, 8591.61it/s]


In [64]:
# Print out example showing increment sentence and doc
acy_df[620:645]

Unnamed: 0,token,mention,full_mention,YAGO2,wikipedia_URL,wikipedia_ID,freebase,sentence_id,doc_id
620,(,,,,,,,22,1
621,$,,,,,,,22,1
622,7845,,,,,,,22,1
623,),,,,,,,22,1
624,.,,,,,,,22,1
625,,,,,,,,22,1
626,The,,,,,,,23,1
627,guitarist,,,,,,,23,1
628,died,,,,,,,23,1
629,of,,,,,,,23,1


In [65]:
# Create mini copy of dataset
mini_acy_df = acy_df[620:645].copy()

In [77]:
# Test function on mini version
# todo eliminate None's
mini_acy_df.groupby('sentence_id')['full_mention'].agg(['unique'])

Unnamed: 0_level_0,unique
sentence_id,Unnamed: 1_level_1
22,[None]
23,[None]
24,"[None, China, Taiwan]"
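
One possible way to address the `todo` above is to drop missing values inside each group before collecting the unique mentions. This is a sketch on a toy frame that mirrors the three sentences shown; `toy_mentions` and `mentions_per_sentence` are illustrative names, and the notebook itself keeps the `None` entries at this stage and filters them later.

```python
import pandas as pd

# Toy frame (assumed for illustration) mirroring sentences 22-24
toy_mentions = pd.DataFrame(
    {
        "sentence_id": [22, 22, 23, 24, 24, 24],
        "full_mention": [None, None, None, None, "China", "Taiwan"],
    }
)

# Drop None within each sentence, then keep the unique mentions as a list;
# sentences without any mention end up with an empty list
mentions_per_sentence = toy_mentions.groupby("sentence_id")["full_mention"].apply(
    lambda s: list(s.dropna().unique())
)
print(mentions_per_sentence)
```

Dropping inside the group, rather than before `groupby`, keeps sentences with no mentions in the index.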


In [94]:
%%time
# Run over full dataset
same_sentence_entities = acy_df.groupby('sentence_id')['full_mention'].agg(['unique'])
same_sentence_entities[:5]

CPU times: user 358 ms, sys: 95.5 ms, total: 453 ms
Wall time: 517 ms


Unnamed: 0_level_0,unique
sentence_id,Unnamed: 1_level_1
0,"[EU, None, German, British]"
1,"[Peter Blackburn, None, BRUSSELS, European Com..."
2,"[Germany, None, European Union, Werner Zwingma..."
3,"[None, Commission, Nikolaus van der Pas]"
4,"[None, European Union]"


## Create ACY Dataset where row = full_mention

Now we create a version of the dataset that keeps only the rows where `full_mention` is not `None`. Note that a multi-token mention such as `Peter Blackburn` still occupies one row per token.

In [120]:
acy_input = acy_df[acy_df['full_mention'].notnull()].reset_index(drop=True)
acy_input.head(5)

Unnamed: 0,token,mention,full_mention,YAGO2,wikipedia_URL,wikipedia_ID,freebase,sentence_id,doc_id
0,EU,B,EU,--NME--,,,,0,0
1,German,B,German,Germany,http://en.wikipedia.org/wiki/Germany,11867.0,/m/0345h,0,0
2,British,B,British,United_Kingdom,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,/m/07ssc,0,0
3,Peter,B,Peter Blackburn,--NME--,,,,1,0
4,Blackburn,I,Peter Blackburn,--NME--,,,,1,0


In [135]:
# Add congruent entities (all non-None full_mentions in the same sentence) to dataframe
congruent_entities_list = []
for mention in tqdm(range(len(acy_input))):
    sentence_id = acy_input.iloc[mention]['sentence_id']
    # Look up by sentence_id label and column name rather than by position
    congruent_entities = list(filter(None, same_sentence_entities.loc[sentence_id, 'unique']))
    congruent_entities_list.append(congruent_entities)

acy_input['congruent_entities'] = congruent_entities_list

100%|██████████| 29312/29312 [00:06<00:00, 4704.32it/s]
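
The row-by-row lookup above could also be written with `Series.map`, which aligns each row's `sentence_id` with the index of a per-sentence Series. The sketch below uses a toy frame; `toy_input` and `per_sentence` are illustrative names, not objects from this notebook.

```python
import pandas as pd

# Toy frame (assumed for illustration): one row per mention token
toy_input = pd.DataFrame(
    {
        "full_mention": ["EU", "German", "Peter Blackburn"],
        "sentence_id": [0, 0, 1],
    }
)

# One list of unique mentions per sentence, indexed by sentence_id
per_sentence = toy_input.groupby("sentence_id")["full_mention"].apply(
    lambda s: list(s.unique())
)

# map() looks up each row's sentence_id in per_sentence's index
toy_input["congruent_entities"] = toy_input["sentence_id"].map(per_sentence)
print(toy_input)
```

Rows in the same sentence end up referencing the same list object, so copy the lists before mutating any of them in place.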


## Final Input Dataset Form

In [136]:
acy_input.head(5)

Unnamed: 0,token,mention,full_mention,YAGO2,wikipedia_URL,wikipedia_ID,freebase,sentence_id,doc_id,congruent_entities
0,EU,B,EU,--NME--,,,,0,0,"[EU, German, British]"
1,German,B,German,Germany,http://en.wikipedia.org/wiki/Germany,11867.0,/m/0345h,0,0,"[EU, German, British]"
2,British,B,British,United_Kingdom,http://en.wikipedia.org/wiki/United_Kingdom,31717.0,/m/07ssc,0,0,"[EU, German, British]"
3,Peter,B,Peter Blackburn,--NME--,,,,1,0,"[Peter Blackburn, BRUSSELS, European Commissio..."
4,Blackburn,I,Peter Blackburn,--NME--,,,,1,0,"[Peter Blackburn, BRUSSELS, European Commissio..."
