# Entity linking incorporated retrieval (ELR)

In this exercise you will implement the entity matches feature function:
$$f_{E}(e_i; e) = \log \sum_{f \in \tilde{F}} w_{f}^{E} \left( (1- \lambda )\, \mathbb{1}(e_i , f_{\tilde{e}}) + \lambda\, \frac{\sum_{e' \in E} \mathbb{1}(e_i,f_{\tilde{e}'})}{|\{e' \in E : f_{\tilde{e}'} \neq \emptyset\}|} \right)
$$

In [None]:
import ipytest
import math
import pytest
from typing import Dict, List, Tuple

ipytest.autoconfig()

Term-based representations. These are provided only as context, to make the entity-based representations below easier to understand; the feature function does not use them.


In [None]:
TERM_BASED_REPS = [{
    "label": "Ann Dunham",
     "abstract": """Stanley Ann Dunham the mother Barack Obama, was an American
        anthropologist who ...""",
     "birthPlace": "Honolulu Hawaii ...",
     "child": "Barack Obama",
     "wikiPageWikiLink": "United States Family Barack Obama",
     },
     {
    "label": "Michael Jackson",
     "abstract": """Michael Joseph Jackson (August 29, 1958 – June 25, 2009)
        was an American singer, songwriter, and dancer. Dubbed the "King of
        Pop", he is regarded as one of the most significant cultural figures
        of the 20th century. Over a four-decade career, his contributions to
        music, dance, and fashion...""",
     "birthPlace": "Gary Indiana",
     "wikiPageWikiLink": "35th_Annual_Grammy_Awards, A._R._Rahman, ...",
}]

Entity-based representations


In [None]:
ENTITY_BASED_REPS = [{
    "birthPlace": ["<Honolulu>", "<Hawaii>"],
    "child": ["<Barack_Obama>"],
    "wikiPageWikiLink": ["<United_States>", "<Family_of_Barack_Obama>"],
    },
    {
    "birthPlace": ["<Gary_Indiana>"],
    "wikiPageWikiLink": ["<35th_Annual_Grammy_Awards>", "<A._R._Rahman>"],
}]

Field weights

In [None]:
FIELD_WEIGHTS = {
    "birthPlace": 0.4,
    "child": 0.4,
    "wikiPageWikiLink": 0.2,
}

Query

In [None]:
QUERY = ("barack obama parents", ["<Barack_Obama>"])

## Entity matches feature function

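$$f_{E}(e_i; e) = \log \sum_{f \in \tilde{F}} w_{f}^{E} \left( (1- \lambda )\, \mathbb{1}(e_i , f_{\tilde{e}}) + \lambda\, \frac{\sum_{e' \in E} \mathbb{1}(e_i,f_{\tilde{e}'})}{|\{e' \in E : f_{\tilde{e}'} \neq \emptyset\}|} \right)
$$

Here $e_i$ is an entity linked in the query, $f_{\tilde{e}}$ is the list of entity URIs in field $f$ of entity $e$, $\tilde{F}$ is the set of fields, $E$ is the set of all entities in the collection, $w_{f}^{E}$ is the weight of field $f$, and $\lambda$ is the smoothing parameter.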

First, we implement the binary indicator function:
$$\mathbb{1}(e_i , f_{\tilde{e}})$$

In [None]:
def binary_indicator_function(entity: str, field_uris: List[str]) -> int:
  """Indicates whether or not the entity is present in the field.

  Args:
    entity: URI string.
    field_uris: List of URI strings in the field.

  Returns:
    1 if entity is in the field, 0 otherwise.
  """
  return 1 if entity in field_uris else 0

Then, we implement a function to get the two document frequencies that form the numerator and the denominator of the smoothing term.

$$df_{e,f} = \sum_{e' \in E} \mathbb{1}(e_i,f_{\tilde{e}'})$$

$$df_f = |\{e' \in E : f_{\tilde{e}'} \neq \emptyset\}|$$

In [None]:
def get_document_frequencies(f: str, entity: str, entity_based_reps: List[Dict]) -> Tuple[int, int]:
  """Computes document frequencies for entity matches feature score.

  df_e_f is the total number of documents that contain the entity e in field f.
  df_f is the number of documents with a non-empty field f.

  Args:
    f: Field.
    entity: URI string.
    entity_based_reps: All entity-based representations.

  Returns:
    Tuple with df_e_f and df_f.
  """
  df_e_f, df_f = 0, 0
  for e in entity_based_reps:
    # A field that is missing or holds an empty list counts as empty.
    field_uris = e.get(f)
    if field_uris:
      df_f += 1
      if entity in field_uris:
        df_e_f += 1

  return df_e_f, df_f
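
The function above rescans every representation each time it is called. For a larger collection, one option is to count all fields in a single pass and look the values up afterwards. The following is a minimal sketch under that assumption; `build_df_index` and the two-entity sample `reps` are hypothetical names introduced here for illustration and are not part of the exercise.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_df_index(
    entity_based_reps: List[Dict[str, List[str]]],
) -> Tuple[Dict[str, Counter], Counter]:
  """Counts df_e_f and df_f for all fields in one pass (illustrative helper)."""
  df_e_f: Dict[str, Counter] = defaultdict(Counter)  # field -> entity -> count
  df_f: Counter = Counter()                          # field -> non-empty count
  for rep in entity_based_reps:
    for field, uris in rep.items():
      if not uris:
        continue  # Empty fields do not contribute to df_f.
      df_f[field] += 1
      # Use a set so an entity repeated in one field is counted once.
      df_e_f[field].update(set(uris))
  return df_e_f, df_f


# Hypothetical sample data in the same shape as ENTITY_BASED_REPS.
reps = [
    {"birthPlace": ["<Honolulu>", "<Hawaii>"], "child": ["<Barack_Obama>"]},
    {"birthPlace": ["<Gary_Indiana>"]},
]
df_e_f, df_f = build_df_index(reps)
print(df_e_f["child"]["<Barack_Obama>"], df_f["child"])            # 1 1
print(df_e_f["birthPlace"]["<Barack_Obama>"], df_f["birthPlace"])  # 0 2
```

Missing entities and fields return 0 because `Counter` defaults to zero, so no extra key checks are needed at lookup time.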

Based on the two previous functions, we implement the entity matches feature score.

$$		      	f_{E}(e_i; e) = \log \sum_{f \in \tilde{F}} w_{f}^{E} \left( (1- \lambda )\, \mathbb{1}(e_i , f_{\tilde{e}}) + \lambda\, \frac{\sum_{e' \in E} \mathbb{1}(e_i,f_{\tilde{e}'})}{|\{e' \in E : f_{\tilde{e}'} \neq \emptyset\}|} \right)
$$
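
As a worked check, consider the data defined above with the query entity $e_i =$ `<Barack_Obama>` and $\lambda = 0.1$, using the natural logarithm. For the first representation (Ann Dunham), only the `child` field contains the entity, and it is the only representation with a non-empty `child` field:

$$
\begin{aligned}
\text{birthPlace:} \quad & 0.4 \left( 0.9 \cdot 0 + 0.1 \cdot \tfrac{0}{2} \right) = 0 \\
\text{child:} \quad & 0.4 \left( 0.9 \cdot 1 + 0.1 \cdot \tfrac{1}{1} \right) = 0.4 \\
\text{wikiPageWikiLink:} \quad & 0.2 \left( 0.9 \cdot 0 + 0.1 \cdot \tfrac{0}{2} \right) = 0 \\
f_{E} &= \log(0.4) \approx -0.916
\end{aligned}
$$

For the second representation (Michael Jackson), no field contains the entity, so only the smoothing term of `child` contributes:

$$
f_{E} = \log\left( 0.4 \cdot 0.1 \cdot \tfrac{1}{1} \right) = \log(0.04) \approx -3.219
$$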

In [None]:
def compute_entity_matches_feature(
    entity: str,
    entity_based_rep: Dict,
    entity_based_reps: List[Dict],
    field_weights: Dict[str, float],
    smoothing_param: float = 0.1,
) -> float:
  """Computes entity matches feature score for an entity.

  Args:
    entity: URI string.
    entity_based_rep: Entity-based representation.
    entity_based_reps: All entity-based representations.
    field_weights: Field weights may be set manually or via dynamic mapping
      using PRMS.
    smoothing_param: Smoothing parameter. Defaults to 0.1.

  Returns:
    Entity matches feature score (natural log), or -inf if the weighted sum
    is zero.
  """
  total = 0.0  # Named `total` to avoid shadowing the built-in sum().
  for f, w_f_e in field_weights.items():
    e_presence = binary_indicator_function(entity, entity_based_rep.get(f, []))
    df_e_f, df_f = get_document_frequencies(f, entity, entity_based_reps)
    # Guard against division by zero when no representation has field f.
    collection_prob = df_e_f / df_f if df_f > 0 else 0.0
    total += w_f_e * ((1 - smoothing_param) * e_presence + smoothing_param * collection_prob)
  # math.log(0) raises ValueError, so return -inf when nothing matched.
  return math.log(total) if total > 0 else float("-inf")
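
To illustrate how the feature might be used, the sketch below scores each candidate entity for the entities linked in the query and sorts the candidates by score. It restates the feature compactly so the cell runs on its own. The names `entity_match_score`, `candidates`, and the candidate URIs `<Ann_Dunham>` and `<Michael_Jackson>` are hypothetical, and summing the feature over linked entities is a simplifying assumption rather than the full retrieval model.

```python
import math
from typing import Dict, List

Rep = Dict[str, List[str]]


def entity_match_score(entity: str, rep: Rep, reps: List[Rep],
                       weights: Dict[str, float], lam: float = 0.1) -> float:
  """Compact restatement of the entity matches feature (illustrative)."""
  total = 0.0
  for f, w in weights.items():
    df_f = sum(1 for r in reps if r.get(f))  # Non-empty field f.
    if df_f == 0:
      continue  # No representation has this field; nothing to add.
    df_e_f = sum(1 for r in reps if entity in r.get(f, []))
    present = 1 if entity in rep.get(f, []) else 0
    total += w * ((1 - lam) * present + lam * df_e_f / df_f)
  return math.log(total) if total > 0 else float("-inf")


# Hypothetical candidate URIs mapped to the representations used above.
candidates: Dict[str, Rep] = {
    "<Michael_Jackson>": {
        "birthPlace": ["<Gary_Indiana>"],
        "wikiPageWikiLink": ["<35th_Annual_Grammy_Awards>", "<A._R._Rahman>"],
    },
    "<Ann_Dunham>": {
        "birthPlace": ["<Honolulu>", "<Hawaii>"],
        "child": ["<Barack_Obama>"],
        "wikiPageWikiLink": ["<United_States>", "<Family_of_Barack_Obama>"],
    },
}
weights = {"birthPlace": 0.4, "child": 0.4, "wikiPageWikiLink": 0.2}
linked_entities = ["<Barack_Obama>"]  # Entities linked in the query.

all_reps = list(candidates.values())
scores = {
    uri: sum(entity_match_score(e, rep, all_reps, weights) for e in linked_entities)
    for uri, rep in candidates.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['<Ann_Dunham>', '<Michael_Jackson>']
```

The candidate whose `child` field contains the linked entity ranks first, because it receives the full $(1 - \lambda)$ match term in addition to the smoothing term.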

Tests

In [None]:
%%ipytest

def test_binary_indicator_function():
  assert 1 == binary_indicator_function("<Honolulu>", ["<Honolulu>", "<Hawaii>"])
  assert 0 == binary_indicator_function("<Honolulu>", ["<Gary_Indiana>"])

def test_get_document_frequencies():
  assert (1, 1) == get_document_frequencies("child", QUERY[1][0], ENTITY_BASED_REPS)
  assert (0, 2) == get_document_frequencies("birthPlace", QUERY[1][0], ENTITY_BASED_REPS)

def test_compute_entity_matches_feature():
  assert pytest.approx(math.log(0.4), rel=1e-2) == compute_entity_matches_feature(QUERY[1][0], ENTITY_BASED_REPS[0], ENTITY_BASED_REPS, FIELD_WEIGHTS)
  assert pytest.approx(math.log(0.04), rel=1e-2) == compute_entity_matches_feature(QUERY[1][0], ENTITY_BASED_REPS[1], ENTITY_BASED_REPS, FIELD_WEIGHTS)


...                                                                                          [100%]
3 passed in 0.02s
