
=== Purpose ===

The goal of this notebook is to disambiguate entities in a text. For example, given a Wikipedia article:

    <Paris_17>
    Paris is a figure in the Greek mythology.

the goal is to determine that <Paris_17> = <Paris_(mythology)>.
Here, <Paris_17> is an artificial title of the Wikipedia article, and <Paris_(mythology)> is the unambiguous entity in the YAGO knowledge base.
(https://yago-knowledge.org/graph/%22Paris%22@en?relation=all&inverse=1)
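
As a rough illustration of the title format, the sketch below splits an artificial title such as <Paris_17> into the ambiguous name and its running index. The helper name split_artificial_title is hypothetical and is not part of the provided code; it only assumes that titles follow the <Name_N> pattern shown above, with underscores standing in for spaces.

```python
import re

def split_artificial_title(title):
    """Split '<Paris_17>' into ('Paris', 17). Hypothetical helper for illustration."""
    match = re.fullmatch(r"<(.+)_(\d+)>", title)
    if match is None:
        raise ValueError(f"not an artificial title: {title!r}")
    name, index = match.groups()
    # Underscores stand in for spaces in multi-word names
    return name.replace("_", " "), int(index)

print(split_artificial_title("<Paris_17>"))
print(split_artificial_title("<Ashok_Kumar_3>"))
```

Only the name part ("Paris") is ambiguous; the index merely makes the article title unique within the file.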

=== Provided Data ===

We provide:
1) a preprocessed version of Simple Wikipedia, wikipedia-ambiguous.txt, which contains ambiguous article titles with their content, as above;
2) a simplified version of the YAGO knowledge base;
3) a template for your code, disambiguator.py;
4) a sample of gold-standard annotations.

=== Task ===

Your task is to complete the function disambiguate() in this file.
It receives as input (1) the ambiguous Wikipedia title ("Paris" in the example), (2) the article content, and (3) the knowledge base.
The function should return the unambiguous entity from YAGO.
To ensure a fair evaluation, do not use any non-standard Python libraries except NLTK.
The lab is graded with the F-beta score with beta=0.5, a variant of the F1 score that weights precision more heavily than recall.

Input:
<Babilonia_0>
Babilonia is a 1987 Argentine drama film directed and written by Jorge Salvador based on a play by Armando Discépolo.

Output:
<Babilonia_0>   <Babilonia>
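
For reference, the F-beta score used for grading combines precision P and recall R as (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below is a minimal, self-contained version of that formula; the function name f_beta is an assumption, and the way the provided evaluate() function counts correct, wrong and missing answers is not shown in this notebook.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Same pair of values, swapped: high precision is rewarded more than high recall
print(f_beta(0.9, 0.5))
print(f_beta(0.5, 0.9))
```

With beta=0.5, a system with precision 0.9 and recall 0.5 scores higher than one with precision 0.5 and recall 0.9, so a wrong answer costs more than a missing one.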

=== Development and Testing ===

In YAGO, the entities have readable ids, as in <Ashok_Kumar_(British_politician)>. This is, however, not the case in all knowledge bases. Therefore, the algorithm should not rely on the suffix "British Politician"!

To enforce this, the notebook and its dataset come in two versions:
1) Development: with readable entity ids.
The corresponding YAGO knowledge base is dev_yago.tsv, and the gold standard is dev_gold_samples.tsv.
2) Testing: without readable entity ids.
The corresponding YAGO knowledge base is test_yago.tsv, and the gold standard is test_gold_samples.tsv. Here, the British politician has the id <Ashok_Kumar_1081507>.
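
To see why the readable suffix must not be used as a feature, the sketch below compares how many words of an article overlap with the tokens of a development id and of a test id. The helper id_tokens is hypothetical and the tokenization is deliberately naive; it only illustrates the information leak.

```python
def id_tokens(entity_id):
    """Lower-cased tokens of an entity id such as '<Ashok_Kumar_(British_politician)>'."""
    cleaned = entity_id.strip("<>")
    for char in "()_":
        cleaned = cleaned.replace(char, " ")
    return set(cleaned.lower().split())

context = "Ashok Kumar was a Labour Party politician in the United Kingdom"
context_tokens = set(context.lower().split())

dev_overlap = id_tokens("<Ashok_Kumar_(British_politician)>") & context_tokens
test_overlap = id_tokens("<Ashok_Kumar_1081507>") & context_tokens

# The extra overlap exists only because the development id is readable
print(dev_overlap - test_overlap)
```

The word "politician" matches only in the development setting, so a method that scores ids directly would look better on the development data than it really is.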

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/yago-samples-and-wikipedia-ambiguous-articles/dev_yago.tsv
/kaggle/input/yago-samples-and-wikipedia-ambiguous-articles/wikipedia-ambiguous.txt
/kaggle/input/yago-samples-and-wikipedia-ambiguous-articles/test_gold_samples.tsv
/kaggle/input/yago-samples-and-wikipedia-ambiguous-articles/dev_gold_samples.tsv
/kaggle/input/yago-samples-and-wikipedia-ambiguous-articles/test_yago.tsv


In [2]:
# import custom packages
from utility_script_entity_disambiguation import Parsy, KnowledgeBase, evaluate

# a preprocessed version of the Simple Wikipedia wikipedia-ambiguous.txt,
# which contains ambiguous article titles with their content.
wikipedia_file = "/kaggle/input/yago-samples-and-wikipedia-ambiguous-articles/wikipedia-ambiguous.txt"

# development dataset (entity id suffixes are readable)
# [ dev_kb_file ] a simplified YAGO knowledge base
# [ dev_result_file ] where your predictions are written
# [ dev_gold_file ] a sample of gold-standard annotations
dev_kb_file = "/kaggle/input/yago-samples-and-wikipedia-ambiguous-articles/dev_yago.tsv"
dev_result_file = "/kaggle/working/dev_results.tsv"
dev_gold_file = "/kaggle/input/yago-samples-and-wikipedia-ambiguous-articles/dev_gold_samples.tsv"

# test dataset (entity id suffixes are not readable)
# [ test_kb_file ] a simplified YAGO knowledge base
# [ test_result_file ] where your predictions are written. You should submit this file.
# [ test_gold_file ] a sample of gold-standard annotations

test_kb_file = "/kaggle/input/yago-samples-and-wikipedia-ambiguous-articles/test_yago.tsv"
test_result_file = "/kaggle/working/test_results.tsv"
test_gold_file = "/kaggle/input/yago-samples-and-wikipedia-ambiguous-articles/test_gold_samples.tsv"

In [3]:
#### IMPORTS

from gensim.test.utils import common_texts
from gensim.models import Word2Vec # Import Word2Vec
import numpy as np

# We use nltk to download stopwords and wordnet
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [4]:
#### DOWNLOADS
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
!unzip /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/

Archive:  /usr/share/nltk_data/corpora/wordnet.zip
   creating: /usr/share/nltk_data/corpora/wordnet/
  inflating: /usr/share/nltk_data/corpora/wordnet/lexnames  
  inflating: /usr/share/nltk_data/corpora/wordnet/data.verb  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.adv  
  inflating: /usr/share/nltk_data/corpora/wordnet/adv.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.verb  
  inflating: /usr/share/nltk_data/corpora/wordnet/cntlist.rev  
  inflating: /usr/share/nltk_data/corpora/wordnet/data.adj  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.adj  
  inflating: /usr/share/nltk_data/corpora/wordnet/LICENSE  
  inflating: /usr/share/nltk_data/corpora/wordnet/citation.bib  
  inflating: /usr/share/nltk_data/corpora/wordnet/noun.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/verb.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/README  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.sense  
  inflating: /usr

In [6]:
#### INITIALIZATIONS

# We set stop words
stop_words = set(stopwords.words('english'))

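# We train a Word2Vec model on common_texts, gensim's small built-in toy corpus, then save and reload it.
# Its vocabulary is tiny, so many words from the articles may have no embedding.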
model_temp = Word2Vec(sentences=common_texts, window=5, min_count=1, workers=4)
model_temp.save("word2vec.model")
Word2Vec_model = Word2Vec.load('word2vec.model')

# We consider the lemmatizer
lemmatizer = WordNetLemmatizer()
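
The model initialised above is used to average word vectors over a text. The sketch below shows that averaging step on its own, with a hypothetical three-word embedding table (toy_embeddings) standing in for the trained model so that it runs without gensim. Unlike the disambiguate() function in this notebook, which divides by the total number of words, this sketch divides by the number of words that actually have a vector.

```python
def mean_vector(words, embeddings):
    """Average the vectors of the words that have an embedding; None if none do."""
    found = [embeddings[w] for w in words if w in embeddings]
    if not found:
        return None  # no known word, so there is nothing to average
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]

# Hypothetical 2-dimensional embeddings, for illustration only
toy_embeddings = {"film": [1.0, 0.0], "drama": [0.0, 1.0], "skater": [2.0, 2.0]}

print(mean_vector(["film", "drama", "unknown"], toy_embeddings))  # [0.5, 0.5]
print(mean_vector(["unknown"], toy_embeddings))                   # None
```

Returning None for a text without known words makes the "no embedding available" case explicit, instead of silently comparing against a zero vector.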

In [7]:
def disambiguate(entityName, text, kb):
    '''
    :param entityName: a string, the name appearing in wikipedia-ambiguous.txt
    :param text: the corresponding context (the article content)
    :param kb: the knowledge base
    :return: the best-matching entity from this kb, or "<NIL>" if the name is unknown
    '''
    print(entityName, text)
    list_of_candidates = [] # We initialize the list of candidates 
    entityName = str('"' + entityName + '"') # We format entityName
    ############################################ We check the knowledge base and look for the entity so we can get candidates
    if entityName in kb.inverseFacts: # We look for the entity in the keys of the knowledge base dictionary; if it is not there, we return '<NIL>'
        relations = kb.inverseFacts[entityName]
        # We take the <label> candidates if there are any; otherwise we fall back to <iataCode>, or to no candidates at all
        list_of_candidates = relations.get("<label>", relations.get("<iataCode>", []))

        # We extract the raw form of the text, without stop words, upper-case letters or punctuation
        text_lower = [word.lower() for word in text.split() if word.lower() not in stop_words] # We lower-case each word before the stop-word check, so that capitalized stop words ("The", "Is") are removed too
        punctuations_string = r'''!()-[]{};:'"\,<>./?@#$%^&*_~''' # Raw string of all punctuation characters to strip from words; the backslash is a literal character
        text_extract = [] # We initialise the extracted version of the text
        text_embedding = 0  # We initialise text embedding
        ######################################### We compute the set of words from the text
        ######################################### We strip all punctuation and lemmatize each word, i.e. reduce inflected forms (such as plurals) to a base form.
        for word in text_lower :     
          word_new = ""
          for char in word :
            if char not in punctuations_string :
              word_new += char
          word_new = lemmatizer.lemmatize(word_new) # We rewrite and lemmatize the word without punctuation. Lemmatization is made using nltk
          ######################################## We check if there is an existing embedding in Word2Vec and in this case we add it.
          if word_new in Word2Vec_model.wv :
            text_embedding += Word2Vec_model.wv[word_new]
          
          text_extract.append(word_new) # We append the new_word to the list of words of the text.
        ########################################## The text embedding is the sum of the word embeddings found in the Word2Vec model, divided by the total number of words (with a guard against an empty text).
        text_embedding = text_embedding/max(len(text_extract), 1)
        ########################################## We finally obtain the set of words from the text.
        set_from_text = set(text_extract)
        ########################################## Now, we are scoring the candidates we have found.
        set_of_scores = {}
        for candidate in list_of_candidates:

            ###################################### If we have found facts for the current candidate, we consider facts embedding for calculating score, otherwise not.
            if candidate in kb.facts.keys():
              facts_current = kb.facts[candidate]
          
              #################################### We construct the set of facts associated to the current candidate, which is used for computing Jaccard similarity
              set_of_facts = set()
              for fact in facts_current :
                for element in facts_current[fact] :
                  element = (element[1:-1]).lower().split('_') # We remove first and last char which are '<' and '>' and lower case to clean the element.
                  for subelement in element :
                    set_of_facts.add(lemmatizer.lemmatize(subelement)) # We add the lemmatized subelements to the set of facts.
              
              #################################### We first compute the Jaccard similarity between the set of facts for the current candidate and the text set of the entity and store it as the score of the current candidate.
              set_of_scores[candidate] = float(len(set_from_text & set_of_facts))/float(len(set_from_text | set_of_facts))

              #################################### As we did before for all words in the text, we compute the facts embedding as the average embeddings of the facts present in the Word2Vec model.
              fact_embedding = 0
              for fact in set_of_facts :
                if fact in Word2Vec_model.wv :
                      fact_embedding += Word2Vec_model.wv[fact] 
              fact_embedding = fact_embedding/len(set_of_facts)

              #################################### We also compute the Euclidean distance between the text embedding and the fact embedding. A larger distance means a worse match, so we subtract it from the score of the candidate.
              set_of_scores[candidate] -= np.linalg.norm(text_embedding - fact_embedding)
        
        ######################################## The best candidate is the one that maximizes the score. If no candidate could be scored, we return '<NIL>'.
        if not set_of_scores:
            return "<NIL>"
        return max(set_of_scores, key=(lambda k: set_of_scores[k]))
    else :
      return "<NIL>"
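
The Jaccard part of the score can be checked in isolation. The sketch below is a minimal stand-alone version with a guard for two empty sets; the word sets are made up for illustration and are not taken from the YAGO files.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical word sets: one article and the facts of two candidate entities
text_words = {"babilonia", "1987", "argentine", "drama", "film"}
film_facts = {"film", "argentine", "drama", "1987"}
skater_facts = {"figure", "skater", "american"}

print(jaccard(text_words, film_facts))    # 0.8
print(jaccard(text_words, skater_facts))  # 0.0
```

The candidate whose facts share the most words with the article, relative to the size of both sets, gets the higher score.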

In [8]:
def evaluate_on_dev():
    '''
    evaluate your model on the development dataset.
    In the development dataset, each entity name (suffix) is readable.
    :return:
    '''

    # load YAGO knowledge base
    # example: kb.facts["<Babilonia>"]
    kb = KnowledgeBase(dev_kb_file)

    # predict each record and generate results.tsv file
    with open(dev_result_file, 'w', encoding="utf-8") as output:
        for page in Parsy(wikipedia_file):
            result = disambiguate(page.label(), page.content, kb)
            if result is not None:
                output.write(page.title+"\t"+result+"\n")

    # evaluate
    evaluate(dev_result_file, dev_gold_file)


def evaluate_on_test():
    '''
    evaluate your model on the test dataset.
    In the test dataset, each entity name (suffix) is un-readable.
    We hide all suffixes.
    :return:
    '''

    # load YAGO knowledge base
    # example: kb.facts["<Babilonia_1049451>"]
    kb = KnowledgeBase(test_kb_file)

    # predict each record and generate results.tsv file
    with open(test_result_file, 'w', encoding="utf-8") as output:
        for page in Parsy(wikipedia_file):
            result = disambiguate(page.label(), page.content, kb)
            if result is not None:
                output.write(page.title + "\t" + result + "\n")

    # evaluate
    evaluate(test_result_file, test_gold_file)


# evaluate
evaluate_on_dev()
evaluate_on_test()

Loading /kaggle/input/yago-samples-and-wikipedia-ambiguous-articles/dev_yago.tsv...done
Babilonia Babilonia is a 1987 Argentine drama film directed and written by Jorge Salvador based on a play by Armando Discépolo.
Babilonia Tai Reina Babilonia is an American former pair skater.
Willmar Willmar Township is a township in Kandiyohi County, Minnesota, United States.
Willmar Willmar is a city in, and the county seat of, Kandiyohi County, Minnesota, United States.
Willmar Willmar Air Force Station is a closed United States Air Force General Surveillance Radar station.
Ashok Kumar Ashok Kumar was a Labour Party politician in the United Kingdom who was the Member of Parliament for Middlesbrough South and East Cleveland from 1997 until his death shortly before the 2010 general election.
Ashok Kumar Ashok Kumar is a professional golfer from India, currently playing on the Professional Golf Tour of India, where he was the 2003/04 and the 2006/07 Order of Merit winner.
Ashok Kumar Ashok Kumar , 