## NERD Entity Fishing - A Guide to Identifying Entities in Text

## Learning Objectives

By the end of this tutorial, you will be able to:

- Understand the basics of Named Entity Recognition (NER) and its applications.
- Use the Entity-Fishing tool to extract entities from text.
- Analyze and interpret the results in a Jupyter Notebook environment.

## Target Audience

This tutorial is meant for users who want to try state-of-the-art entity linking and disambiguation tools, with all the information in one place.

- Basic [Python programming](https://www.python.org/) and natural language processing skills, plus familiarity with [named entities](https://en.wikipedia.org/wiki/Named_entity) and [knowledge graphs](https://www.ontotext.com/knowledgehub/fundamentals/what-is-a-knowledge-base/), are required.
- It suits users who are curious about off-the-shelf named entity recognition and disambiguation tools.
- The detailed introduction to the concepts targets beginners.

## Duration 

2 hours

## Use Cases

- Research on socioeconomic disparities in urban areas using a dataset of news articles, research papers, and community forum posts, which requires extracting people's names, locations, social organizations, government agencies, and similar entities. The tutorial helps with the recognition and disambiguation of such named entities. For example, "Washington" may refer to George Washington, the first President of the United States, or to Washington, DC, the capital of the United States.
- Investigating the impact of celebrity endorsements on social attitudes by extracting celebrity name mentions, using NERD tools to recognize people and their associated entities. For example, "Ryan Reynolds" may refer to a Canadian actor or to a New Zealand cricketer.


## Environment Setup

Run the following cell to install the dependencies listed in the requirements.txt file (uncomment the first line if working locally). It works with Python version >= 3.7.

In [None]:
#!pip install -r requirements.txt
!python -m spacy download en_core_web_sm


## Overview

In the vast landscape of natural language processing (NLP), Named Entity Recognition (NER) and Named Entity Disambiguation play pivotal roles in understanding and extracting valuable information from text.

Named entities are specific, named elements in text, such as names of people, organizations, locations, dates, and more. Named Entity Recognition is the process of automatically identifying and classifying these named entities within a given text. It forms the foundation for a wide range of applications, from information retrieval and question answering to sentiment analysis and knowledge graph construction.

However, the journey doesn't stop at recognizing named entities. In real-world text, the same name can refer to multiple entities depending on the context. This brings us to the challenge of Named Entity Disambiguation, which is the process of determining the correct entity a name refers to, particularly in cases of ambiguity. For instance, does "Michael Jordan" refer to the actor or the sportsperson? This is where disambiguation comes into play, making NER not only about identification but also about understanding context.
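
To make the idea of context-based disambiguation concrete, here is a minimal toy sketch. It is not how Entity-Fishing or DBpedia Spotlight work internally: the candidate senses and their keywords are hand-written assumptions, and the scoring is a simple word overlap between the sentence and each candidate.

```python
# Toy candidate inventory (hypothetical, hand-written for illustration only).
# Each candidate sense of a name is described by a few context keywords.
CANDIDATES = {
    "Washington": {
        "George Washington (person)": {"president", "general", "army", "born"},
        "Washington, D.C. (city)": {"capital", "city", "congress", "visited"},
    }
}


def disambiguate(mention, sentence, candidates=CANDIDATES):
    """Return the candidate sense whose keywords overlap most with the sentence."""
    # Lowercase the sentence words and strip simple punctuation.
    context = {word.strip(".,").lower() for word in sentence.split()}
    senses = candidates.get(mention, {})
    if not senses:
        return None
    # Score each sense by the number of keywords it shares with the context.
    return max(senses, key=lambda sense: len(senses[sense] & context))


print(disambiguate("Washington", "Washington was the first president and an army general."))
print(disambiguate("Washington", "She visited Washington, the capital city."))
```

Real tools replace the hand-written keywords with statistics or embeddings derived from a knowledge base, but the principle of scoring candidates against the surrounding context is the same.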

**Sample Data**

Let us consider a sentence (or a list of sentences) in which we want to perform named entity recognition and disambiguation.

Example = *Tesla CEO Elon Musk and Jeffrey Epstein associate Ghislaine Maxwell were once photographed together*.

In [7]:
with open('data/input_text.txt', 'r') as file:
    texts = file.read().split('\n')

texts

['Microsoft founder Bill Gates met with Indian Prime Minister Narendra Modi to discuss technology partnerships.',
 'Tesla CEO Elon Musk and Jeffrey Epstein associate Ghislaine Maxwell were once photographed together.',
 'Amazon CEO Andy Jassy spoke alongside climate activist Greta Thunberg at the sustainability summit in Berlin.']

## 1. Entity Fishing Tool


Entity fishing (https://nerd.readthedocs.io/en/latest/index.html) performs general entity recognition and disambiguation against the Wikidata knowledge base. The tool currently supports 15 languages: English, French, German, Spanish, Italian, Arabic, Japanese, Chinese (Mandarin), Russian, Portuguese, Farsi, Ukrainian, Swedish, Bengali, and Hindi. For English and French, *grobid-ner* is used for named entity recognition and disambiguation.

**GROBID NER**: GROBID (GeneRation Of BIbliographic Data) is an open-source machine learning library designed to extract and structure bibliographic metadata from scholarly documents. While GROBID's primary focus is on bibliographic data extraction, it also includes a Named Entity Recognition (NER) component that can be used to extract entities like person names, dates, and locations from scholarly texts. GROBID NER is trained to recognize specific types of entities commonly found in scholarly documents, such as author names, publication dates, journal titles, and more. It's particularly useful for processing academic literature and extracting structured information from research papers and articles.

**Training data for GROBID NER**: GROBID NER is trained on Wikipedia articles and the CoNLL 2003 dataset, and it recognises 27 named entity classes (https://grobid-ner.readthedocs.io/en/latest/).

**Knowledge Base**: Entity fishing disambiguates against the Wikidata knowledge base.



**Installation for Entity Fishing**

Install the entity fishing client: https://pypi.org/project/entity-fishing-client/

In [8]:
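#!pip install entity-fishing-client
import json
from nerd import nerd_client

# Create the client used to query the Entity-Fishing service
client = nerd_client.NerdClient()

# Disambiguate each sentence; [0] selects the response body from the client's return value
ls_output_disambiguate = []
for text in texts:
    ls_output_disambiguate.append(client.disambiguate_text(text)[0])

# Save all responses for later inspection
with open("data/output_disambiguate.json", "w") as f:
    json.dump(ls_output_disambiguate, f, indent=4)

ls_output_disambiguate[0]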



 2025-08-24 19:52:56,532 - nerd.nerd_client - DEBUG - About to submit the following query {'text': 'Microsoft founder Bill Gates met with Indian Prime Minister Narendra Modi to discuss technology partnerships.', 'entities': [], 'customisation': 'generic', 'sentence': 'true'}


 2025-08-24 19:52:56,687 - nerd.nerd_client - DEBUG - About to submit the following query {'text': 'Tesla CEO Elon Musk and Jeffrey Epstein associate Ghislaine Maxwell were once photographed together.', 'entities': [], 'customisation': 'generic', 'sentence': 'true'}


 2025-08-24 19:52:56,800 - nerd.nerd_client - DEBUG - About to submit the following query {'text': 'Amazon CEO Andy Jassy spoke alongside climate activist Greta Thunberg at the sustainability summit in Berlin.', 'entities': [], 'customisation': 'generic', 'sentence': 'true'}


{'text': 'Microsoft founder Bill Gates met with Indian Prime Minister Narendra Modi to discuss technology partnerships.',
 'entities': [{'rawName': 'Microsoft',
   'offsetStart': 0,
   'offsetEnd': 9,
   'confidence_score': 0.7067,
   'wikipediaExternalRef': 19001,
   'wikidataId': 'Q2283',
   'domains': ['Electronics', 'Commerce', 'Enterprise', 'Computer_Science']},
  {'rawName': 'Bill Gates',
   'offsetStart': 18,
   'offsetEnd': 28,
   'confidence_score': 0.7072,
   'wikipediaExternalRef': 3747,
   'wikidataId': 'Q5284',
   'domains': ['Home', 'Computer_Science', 'Electronics']},
  {'rawName': 'Indian Prime Minister Narendra Modi',
   'offsetStart': 38,
   'offsetEnd': 73,
   'confidence_score': 0.4541,
   'wikipediaExternalRef': 444222,
   'wikidataId': 'Q1058',
   'domains': ['Sociology', 'Biology']},
  {'rawName': 'partnerships',
   'offsetStart': 96,
   'offsetEnd': 108,
   'confidence_score': 0.3742,
   'wikipediaExternalRef': 22666280,
   'wikidataId': 'Q7888184',
   'domains'
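
The response above can be turned into a compact summary. The following is a minimal sketch, assuming the response has the shape shown in the output (a `text` string plus an `entities` list with `offsetStart`, `offsetEnd`, `confidence_score`, and `wikidataId`). The helper name `summarise_entities` and the `min_confidence` threshold of 0.4 are illustrative choices, not part of the Entity-Fishing client.

```python
# Shortened sample that mirrors the fields visible in the output above.
sample_response = {
    "text": "Microsoft founder Bill Gates met with Indian Prime Minister "
            "Narendra Modi to discuss technology partnerships.",
    "entities": [
        {"rawName": "Microsoft", "offsetStart": 0, "offsetEnd": 9,
         "confidence_score": 0.7067, "wikidataId": "Q2283"},
        {"rawName": "Bill Gates", "offsetStart": 18, "offsetEnd": 28,
         "confidence_score": 0.7072, "wikidataId": "Q5284"},
        {"rawName": "Indian Prime Minister Narendra Modi", "offsetStart": 38,
         "offsetEnd": 73, "confidence_score": 0.4541, "wikidataId": "Q1058"},
        {"rawName": "partnerships", "offsetStart": 96, "offsetEnd": 108,
         "confidence_score": 0.3742, "wikidataId": "Q7888184"},
    ],
}


def summarise_entities(response, min_confidence=0.4):
    """Keep entities at or above min_confidence and attach a Wikidata URL."""
    rows = []
    for entity in response.get("entities", []):
        score = entity.get("confidence_score", 0.0)
        if score < min_confidence:
            continue  # drop low-confidence links
        wikidata_id = entity.get("wikidataId")
        rows.append({
            # Recover the mention from the character offsets.
            "mention": response["text"][entity["offsetStart"]:entity["offsetEnd"]],
            "wikidata_id": wikidata_id,
            "wikidata_url": f"https://www.wikidata.org/wiki/{wikidata_id}" if wikidata_id else None,
            "confidence": score,
        })
    return rows


for row in summarise_entities(sample_response):
    print(row["mention"], row["wikidata_url"], row["confidence"])
```

Filtering on the confidence score is a simple way to drop weak links such as the common noun "partnerships" in this example; the right threshold depends on your data.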

## 2. DBpedia Spotlight

DBpedia Spotlight - https://github.com/dbpedia-spotlight/dbpedia-spotlight-model

DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text. This allows linking unstructured information sources to the Linked Open Data cloud through DBpedia. It performs both NER and entity linking by linking recognized entities to their corresponding entries in the DBpedia knowledge base.
 
**Knowledge Base**: DBpedia Spotlight disambiguates against DBpedia.


**Installation of DBpedia Spotlight**

Call the API directly:

In [9]:
import json
import requests

ls_output_annotate_mentions = []
for text in texts:
    # Pass the text as form data so that requests URL-encodes special characters
    result = requests.post(
        "https://api.dbpedia-spotlight.org/en/annotate",
        data={"text": text},
        headers={"Accept": "application/json"},
    )
    # Keep the raw JSON string returned by the service
    ls_output_annotate_mentions.append(result.text)


# Save the raw responses for later inspection
with open("data/output_annotated_mentions.json", "w") as f:
    json.dump(ls_output_annotate_mentions, f, indent=4)

ls_output_annotate_mentions[0]

'{"@text":"Microsoft founder Bill Gates met with Indian Prime Minister Narendra Modi to discuss technology partnerships.","@confidence":"0.5","@support":"0","@types":"","@sparql":"","@policy":"whitelist","Resources":[{"@URI":"http://dbpedia.org/resource/Microsoft","@support":"37800","@types":"Wikidata:Q4830453,Wikidata:Q43229,Wikidata:Q24229398,DUL:SocialPerson,DUL:Agent,Schema:Organization,DBpedia:Organisation,DBpedia:Agent,DBpedia:Company","@surfaceForm":"Microsoft","@offset":"0","@similarityScore":"0.9999578373542678","@percentageOfSecondRank":"3.5435327250394117E-5"},{"@URI":"http://dbpedia.org/resource/Bill_Gates","@support":"2491","@types":"Http://xmlns.com/foaf/0.1/Person,Wikidata:Q729,Wikidata:Q5,Wikidata:Q215627,Wikidata:Q19088,DUL:NaturalPerson,Schema:Person,DBpedia:Species,DBpedia:Eukaryote,DBpedia:Animal,DBpedia:Person","@surfaceForm":"Bill Gates","@offset":"18","@similarityScore":"0.9999998577599557","@percentageOfSecondRank":"1.422313110181471E-7"},{"@URI":"http://dbpedia
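
Because the service returns a JSON string, it needs to be parsed before the annotations can be used. The sketch below assumes the response has the keys visible in the output above (`Resources`, `@URI`, `@surfaceForm`, `@offset`, `@similarityScore`) and that numeric values arrive as strings. The helper name `parse_spotlight` and the shortened sample are illustrative.

```python
import json

# Shortened sample with the same keys as the response shown above.
raw = json.dumps({
    "@text": "Microsoft founder Bill Gates met with Indian Prime Minister "
             "Narendra Modi to discuss technology partnerships.",
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/Microsoft", "@support": "37800",
         "@surfaceForm": "Microsoft", "@offset": "0",
         "@similarityScore": "0.9999578373542678"},
        {"@URI": "http://dbpedia.org/resource/Bill_Gates", "@support": "2491",
         "@surfaceForm": "Bill Gates", "@offset": "18",
         "@similarityScore": "0.9999998577599557"},
    ],
})


def parse_spotlight(raw_json):
    """Convert a raw Spotlight JSON string into a list of mention dicts."""
    payload = json.loads(raw_json)
    mentions = []
    # Defensive default: treat a missing "Resources" key as "no annotations" (assumption).
    for resource in payload.get("Resources", []):
        start = int(resource["@offset"])      # offsets are strings in the response
        surface = resource["@surfaceForm"]
        mentions.append({
            "mention": surface,
            "start": start,
            "end": start + len(surface),      # derived end offset (exclusive)
            "uri": resource["@URI"],
            "similarity": float(resource["@similarityScore"]),
        })
    return mentions


for mention in parse_spotlight(raw):
    print(mention["mention"], mention["start"], mention["end"], mention["uri"])
```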

## Conclusion

In conclusion, this tutorial equips you with the skills to use some of the existing open-source, off-the-shelf NERD tools. It also gives you the confidence to explore NERD tools beyond those covered here.

For more in-depth exploration, consider checking out the following resource:

https://nlpprogress.com/english/entity_linking.html


## FAQs

**Why use off-the-shelf tools for entity linking and disambiguation?** Off-the-shelf tools offer *pre-built solutions* that save time and resources. They are often trained on *large datasets*, providing a good starting point for various applications.

**What types of entities can be linked using these tools?** Most tools support common entities like *persons, organizations, and locations*. Some may also handle *specific domains* or custom entities based on the tool's training data.


**How accurate are off-the-shelf tools?** Accuracy varies among tools. It depends on factors such as the *quality of training data*, the *diversity of entities*, and the *specific use case*. Evaluation metrics like *precision, recall, and F1 score* help assess accuracy.
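
As a rough illustration of these metrics, the sketch below scores predicted entity links against a gold standard. The gold annotations are hypothetical, and the exact-match criterion (same character span and same Wikidata ID) is one possible choice among several.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 over sets of (start, end, entity_id) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # exact span and ID matches
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations for the first sample sentence.
gold = {(0, 9, "Q2283"), (18, 28, "Q5284"), (60, 73, "Q1058")}
# Predictions in the same format, e.g. built from a tool's output.
predicted = {(0, 9, "Q2283"), (18, 28, "Q5284"), (38, 73, "Q1058"), (96, 108, "Q7888184")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With exact matching, a prediction whose span is longer than the gold span counts as an error even when the linked entity is right; a more lenient evaluation could accept overlapping spans instead.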

**Do these tools work for multiple languages?** Many off-the-shelf tools support *multiple languages*, but the level of accuracy can vary. It's essential to check the *documentation* for language support.

**Can these tools be fine-tuned for domain-specific applications?** Some tools offer the possibility of *fine-tuning* on domain-specific data. However, it depends on the tool's *architecture and capabilities*.

**How do these tools handle ambiguous references?** Ambiguity resolution depends on *context* and available information. Some tools use *machine learning models* that consider surrounding words, phrases, or contextual information to disambiguate references.

**Are there privacy concerns when using entity linking tools?** Yes, privacy concerns may arise, especially if the text contains *sensitive information*. It's crucial to review the tool's *privacy policy* and consider using it with proper *data anonymization practices*.


**What knowledge bases do these tools typically use?** Tools may use popular knowledge bases like *Wikidata, DBpedia, or Freebase*. Some tools allow users to specify *custom knowledge bases* or integrate with proprietary databases.

**Can these tools handle real-time processing?** Real-time processing capabilities vary. Some tools are optimized for *speed*, while others may be more suitable for *batch processing*. Consider the specific requirements of your application.

**How do these tools handle typos or misspellings?** Some tools include mechanisms to handle *typos or misspellings* through *fuzzy matching* or *probabilistic models*. However, their effectiveness may vary.
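
To illustrate fuzzy matching in isolation, here is a small sketch that uses Python's standard `difflib` module. It is not taken from any of the tools in this tutorial; the list of known names and the `cutoff` value of 0.8 are assumptions chosen for the example.

```python
from difflib import get_close_matches

# Hypothetical list of known entity names (a real tool would use its knowledge base).
KNOWN_NAMES = ["Microsoft", "Tesla", "Amazon", "Bill Gates", "Elon Musk"]


def correct_mention(mention, known_names=KNOWN_NAMES, cutoff=0.8):
    """Return the closest known name for a possibly misspelled mention, or None."""
    # Compare in lowercase, then map back to the original spelling.
    lowered = {name.lower(): name for name in known_names}
    matches = get_close_matches(mention.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


print(correct_mention("Micorsoft"))
print(correct_mention("Elon Msk"))
print(correct_mention("Berlin"))
```

A higher `cutoff` reduces false corrections at the cost of missing heavier misspellings.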

**Are there limitations to off-the-shelf tools?** Yes, limitations can include handling *rare entities*, dealing with *noisy or informal text*, and adapting to *highly specialized domains*. It's essential to understand the tool's *strengths and weaknesses*.

**Do these tools require internet access?** Some tools may require *internet access* to query external knowledge bases. Check the tool's *documentation* for offline or custom knowledge base options.

**How scalable are these tools for large datasets?** Scalability depends on the tool's *architecture*. Some tools are designed for *large-scale processing*, while others may be more suitable for *smaller datasets*.

**Can I combine multiple tools for better performance?** Yes, combining multiple tools (*ensemble methods*) can improve performance and mitigate the limitations of individual tools. However, *integration complexity* should be considered.
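
One simple way to start combining tools is to align their mentions by character offsets. The sketch below is an illustration under stated assumptions: the Entity-Fishing entries use the `offsetStart`, `offsetEnd`, and `wikidataId` fields from this tutorial's output, while the Spotlight entries use `start`, `end`, and `uri` fields that are assumed to be derived from `@offset`, the length of `@surfaceForm`, and `@URI`. The function names are hypothetical.

```python
def spans_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open character spans [start, end) share at least one character."""
    return a_start < b_end and b_start < a_end


def align_mentions(fishing_entities, spotlight_mentions):
    """Pair mentions that both tools found at overlapping positions."""
    aligned = []
    for ef in fishing_entities:
        for sp in spotlight_mentions:
            if spans_overlap(ef["offsetStart"], ef["offsetEnd"], sp["start"], sp["end"]):
                aligned.append((ef["rawName"], ef["wikidataId"], sp["uri"]))
    return aligned


# Shortened sample data for the first sentence of this tutorial.
fishing_entities = [
    {"rawName": "Microsoft", "offsetStart": 0, "offsetEnd": 9, "wikidataId": "Q2283"},
    {"rawName": "Bill Gates", "offsetStart": 18, "offsetEnd": 28, "wikidataId": "Q5284"},
    {"rawName": "partnerships", "offsetStart": 96, "offsetEnd": 108, "wikidataId": "Q7888184"},
]
spotlight_mentions = [
    {"start": 0, "end": 9, "uri": "http://dbpedia.org/resource/Microsoft"},
    {"start": 18, "end": 28, "uri": "http://dbpedia.org/resource/Bill_Gates"},
]

for name, wikidata_id, dbpedia_uri in align_mentions(fishing_entities, spotlight_mentions):
    print(name, wikidata_id, dbpedia_uri)
```

Mentions found by both tools can be treated as higher-confidence candidates. Checking that the two tools agree on the entity itself would additionally require a mapping between Wikidata IDs and DBpedia URIs, which is not shown here.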


## Contact details
For more information, please contact <Susmita.Gangopadhyay@gesis.org>