## NERD Entity Fishing - A Guide to Identifying Entities in Text

## Introduction ##







This tutorial is designed for users who want to compare and implement various off-the-shelf Named Entity Recognition and Disambiguation (NERD) tools.
It assumes a basic knowledge of Python and Natural Language Processing.
Whether you are a beginner or an experienced computer scientist or social scientist, this tutorial brings together some of the most popular
existing NERD tools and explains their installation and usage.


**Learning Goal**
By the end of this tutorial, you will be able to use some state-of-the-art off-the-shelf NERD tools and gain an understanding of them through their public APIs.

**Learning Objectives**
- Explore state-of-the-art NERD tools such as Entity Fishing, SpacyFishing, and DBpedia Spotlight
- Install them on your local system
- Try out some prepared examples

**Description**

In natural language processing (NLP), Named Entity Recognition (NER) and Named Entity Disambiguation (NED) play pivotal roles in understanding text and extracting valuable information from it.

Named entities are specific, named elements in text, such as names of people, organizations, locations, dates, and more. Named Entity Recognition is the process of automatically identifying and classifying these named entities within a given text. It forms the foundation for a wide range of applications, from information retrieval and question answering to sentiment analysis and knowledge graph construction.

However, the journey doesn't stop at recognizing named entities. In real-world scenarios, the same name can refer to multiple entities depending on the context. This brings us to the challenge of Named Entity Disambiguation, which is the process of determining the correct entity a name refers to, particularly in cases of ambiguity. For instance, does "Michael Jordan" refer to the actor or the sportsperson? This is where disambiguation comes into play, making the task not only about identification but also about understanding context.
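To make the idea of context-based disambiguation concrete, here is a deliberately tiny sketch. It is not how the tools in this tutorial work internally (they rely on trained models and a full knowledge base); the candidate list, the keyword sets, and the function name `disambiguate` are all hypothetical and exist only for illustration.

```python
# Toy illustration only: the candidates and their keyword sets are hypothetical.
CANDIDATES = {
    "Michael Jordan": [
        {"label": "Michael Jordan (basketball player)",
         "keywords": {"basketball", "nba", "bulls", "championship"}},
        {"label": "Michael B. Jordan (actor)",
         "keywords": {"actor", "film", "movie", "starred"}},
    ]
}


def disambiguate(mention, sentence):
    """Return the candidate whose keywords overlap most with the sentence."""
    tokens = {tok.strip(".,!?").lower() for tok in sentence.split()}
    candidates = CANDIDATES.get(mention, [])
    if not candidates:
        return None
    best = max(candidates, key=lambda c: len(c["keywords"] & tokens))
    return best["label"]


print(disambiguate("Michael Jordan",
                   "Michael Jordan won another NBA championship with the Bulls."))
print(disambiguate("Michael Jordan",
                   "Michael Jordan starred in a new film last year."))
```

The same surface form resolves to different entities purely because the surrounding words differ, which is the core intuition behind disambiguation.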




**Target Audience** 

This tutorial is meant for users who want to use various state-of-the-art entity linking and disambiguation tools. It aims to collect all related information in a single place.

**Prerequisites**

1. Basic knowledge of Python (https://www.python.org/)
2. Basic knowledge of Named Entities (https://en.wikipedia.org/wiki/Named_entity)
3. Basic knowledge of Knowledge Graphs and Knowledge Bases (https://www.ontotext.com/knowledgehub/fundamentals/what-is-a-knowledge-base/)

**Difficulty Level**
- Medium

**Duration**
- 2 hours

**Social Science Use Case**

John is a researcher studying socioeconomic disparities in urban areas. He gathers a diverse dataset of news articles, research papers, and community forum posts related to urban development and socioeconomic issues. He wants to extract the names of people, locations, cities, neighborhoods, social organizations, and government agencies from this data. He can use NERD tools to extract named entities and to disambiguate similar ones (for example, "Washington" referring to George Washington, the first President of the United States, or to Washington, D.C., the capital of the United States), which speeds up the analysis for his research considerably.


Rose is a researcher investigating the impact of celebrity endorsements on social attitudes. She has collected data from various social media blogs and forums and wants to extract mentions of celebrity names (for example, Ryan Reynolds the Canadian actor or the New Zealand cricketer). She uses NERD tools to extract and categorize the names of people and associated entities. This lets her perform a quantitative analysis of which popular personalities are mentioned on social media and how often.


**Sample Data**

Let us consider a sentence (or a list of sentences) in which we want to perform named entity recognition and disambiguation.

Example = *Tesla CEO Elon Musk and Jeffrey Epstein associate Ghislaine Maxwell were once photographed together*.

In [1]:
text="Tesla CEO Elon Musk and Jeffrey Epstein associate Ghislaine Maxwell were once photographed together."

## **1. Entity Fishing Tool**


Entity Fishing (https://nerd.readthedocs.io/en/latest/index.html) performs general entity recognition and disambiguation against the Wikidata knowledge base. The tool currently supports 15 languages: English, French, German, Spanish, Italian, Arabic, Japanese, Chinese (Mandarin), Russian, Portuguese, Farsi, Ukrainian, Swedish, Bengali, and Hindi. For English and French, *grobid-ner* is used to recognise named entities.

**GROBID NER**: GROBID (GeneRation Of BIbliographic Data) is an open-source machine learning library designed to extract and structure bibliographic metadata from scholarly documents. While GROBID's primary focus is on bibliographic data extraction, it also includes a Named Entity Recognition (NER) component that can be used to extract entities like person names, dates, and locations from scholarly texts. GROBID NER is trained to recognize specific types of entities commonly found in scholarly documents, such as author names, publication dates, journal titles, and more. It is particularly useful for processing academic literature and extracting structured information from research papers and articles.

**Training data for GROBID NER**: GROBID NER is trained on Wikipedia articles and the CoNLL 2003 dataset, and it recognises 27 named entity classes (https://grobid-ner.readthedocs.io/en/latest/).

**Knowledge Base**: Entity Fishing disambiguates against the Wikidata knowledge base.



**Installation for Entity Fishing**
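1. Install Python on your system. Python itself cannot be installed with pip; download it from https://www.python.org/downloads/ or use a distribution such as Anaconda.
2. Install the Entity Fishing client: pip install entity-fishing-client (https://pypi.org/project/entity-fishing-client/)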

In [3]:
# Install the client under its PyPI name, entity-fishing-client.
# Note: `pip install nerd` installs an unrelated package that has no `nerd_client`
# module and leads to "ImportError: cannot import name 'nerd_client' from 'nerd'".
!pip install entity-fishing-client

from nerd import nerd_client

# Create the client and send the sample sentence for disambiguation.
client = nerd_client.NerdClient()
client.disambiguate_text(text)

**Output of Entity Fishing**

({'text': 'Tesla CEO Elon Musk and Jeffrey Epstein associate Ghislaine Maxwell were once photographed together.',
  'entities': [{'rawName': 'Tesla CEO Elon Musk',
    'offsetStart': 0,
    'offsetEnd': 19,
    'confidence_score': 0.6093,
    'wikipediaExternalRef': 909036,
    'wikidataId': 'Q317521',
    'domains': ['Aviation', 'Transport', 'Vehicles']},
   {'rawName': 'Jeffrey Epstein',
    'offsetStart': 24,
    'offsetEnd': 39,
    'confidence_score': 0.9152,
    'wikipediaExternalRef': 6253522,
    'wikidataId': 'Q2904131',
    'domains': ['Commerce', 'Enterprise', 'Administration', 'Sociology']},
   {'rawName': 'Ghislaine Maxwell',
    'offsetStart': 50,
    'offsetEnd': 67,
    'confidence_score': 0.9134,
    'wikipediaExternalRef': 32018562,
    'wikidataId': 'Q5556756',
    'domains': ['Aviation', 'Transport', 'Biology', 'Sociology']},
   {'rawName': 'photographed',
    'offsetStart': 78,
    'offsetEnd': 90,
    'confidence_score': 0.3687,
    'wikipediaExternalRef': 23604,
    'wikidataId': 'Q11633',
    'domains': ['Artisanship', 'Optics']}],
  'customisation': 'generic',
  'sentence': 'true',
  'language': {'lang': 'en', 'conf': 0.999995502488815}},
 200)
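The response shown above is a `(body, status_code)` tuple, and the low-scoring match for "photographed" suggests that some filtering is useful. The following is a minimal sketch of one way to do that. The helper name `confident_entities` and the 0.5 threshold are assumptions made for illustration, and the `response` literal is a trimmed copy of the output above so the snippet runs without a network call.

```python
# Trimmed copy of the response printed above, so the example runs offline.
response = (
    {
        "entities": [
            {"rawName": "Tesla CEO Elon Musk", "confidence_score": 0.6093, "wikidataId": "Q317521"},
            {"rawName": "Jeffrey Epstein", "confidence_score": 0.9152, "wikidataId": "Q2904131"},
            {"rawName": "Ghislaine Maxwell", "confidence_score": 0.9134, "wikidataId": "Q5556756"},
            {"rawName": "photographed", "confidence_score": 0.3687, "wikidataId": "Q11633"},
        ]
    },
    200,
)


def confident_entities(response, min_confidence=0.5):
    """Return (mention, Wikidata ID, score) for entities at or above the threshold.

    `min_confidence` is an illustrative default, not a value recommended by the tool.
    """
    body, status = response
    if status != 200:
        return []
    return [
        (ent["rawName"], ent["wikidataId"], ent["confidence_score"])
        for ent in body.get("entities", [])
        if "wikidataId" in ent and ent.get("confidence_score", 0.0) >= min_confidence
    ]


for mention, qid, score in confident_entities(response):
    print(f"{mention} -> https://www.wikidata.org/wiki/{qid} ({score})")
```

With this threshold, the three person-related mentions are kept and "photographed" is dropped. A suitable threshold depends on your data and should be chosen by inspecting real results.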

## **2. Spacyfishing Tool**

Spacyfishing (https://spacy.io/universe/project/spacyfishing) is a spaCy wrapper of Entity Fishing for named entity disambiguation and linking.

**spaCy**: spaCy is a library for advanced Natural Language Processing in Python and Cython. It features NER, POS tagging, dependency parsing, word vectors, and more.

**Training data**: spaCy's English pipelines are trained on the OntoNotes dataset and recognise 18 named entity classes (https://sparknlp.org/2020/12/05/onto_bert_base_cased_en.html).

**Knowledge Base**: Spacyfishing disambiguates against the Wikidata knowledge base.

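**Installation of SpacyFishing**
1. Install Python on your system. Python itself cannot be installed with pip; download it from https://www.python.org/downloads/ or use a distribution such as Anaconda.

2. Install spaCy fishing (https://spacy.io/universe/project/spacyfishing) and download the English spaCy model used in the example

In [None]:
!pip install spacyfishing
!python -m spacy download en_core_web_sm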

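In [None]:
import spacy

# Load the small English pipeline, which provides the NER step
nlp = spacy.load('en_core_web_sm')
# Append the Entity Fishing component, which links each entity to Wikidata
nlp.add_pipe('entityfishing')
doc = nlp(text)
# Print the mention, its spaCy label, Wikidata ID, Wikidata URL, and linking score
for ent in doc.ents:
    print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))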

**Output of SpacyFishing**


('Tesla', 'ORG', 'Q478214', 'https://www.wikidata.org/wiki/Q478214', 0.7998)
('Elon Musk', 'PERSON', 'Q317521', 'https://www.wikidata.org/wiki/Q317521', 0.9165)
('Jeffrey Epstein', 'PERSON', 'Q2904131', 'https://www.wikidata.org/wiki/Q2904131', 0.9152)
('Ghislaine Maxwell', 'PERSON', 'Q5556756', 'https://www.wikidata.org/wiki/Q5556756', 0.9134)
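Once entities are printed as tuples like those above, a common next step is to aggregate them, for instance to see which kinds of entities occur and which people are mentioned. The sketch below does this with the rows copied from the output above; the helper name `summarise` is hypothetical, and in practice you would build `rows` from `doc.ents` rather than typing it in.

```python
from collections import Counter

# Rows copied from the output above: (text, label, kb_qid, url_wikidata, nerd_score)
rows = [
    ("Tesla", "ORG", "Q478214", "https://www.wikidata.org/wiki/Q478214", 0.7998),
    ("Elon Musk", "PERSON", "Q317521", "https://www.wikidata.org/wiki/Q317521", 0.9165),
    ("Jeffrey Epstein", "PERSON", "Q2904131", "https://www.wikidata.org/wiki/Q2904131", 0.9152),
    ("Ghislaine Maxwell", "PERSON", "Q5556756", "https://www.wikidata.org/wiki/Q5556756", 0.9134),
]


def summarise(rows):
    """Count entities per spaCy label and map each PERSON's Wikidata ID to its name."""
    label_counts = Counter(label for _, label, _, _, _ in rows)
    people = {qid: text for text, label, qid, _, _ in rows if label == "PERSON" and qid}
    return label_counts, people


label_counts, people = summarise(rows)
print(label_counts)
print(people)
```

Keying people by Wikidata ID rather than by surface text means that different spellings of the same person can be counted together once several documents are processed.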

## **3. DBpedia Spotlight**

DBpedia Spotlight (https://github.com/dbpedia-spotlight/dbpedia-spotlight-model) is a tool for annotating mentions of DBpedia resources in text. This allows linking unstructured information sources to the Linked Open Data cloud through DBpedia. It performs both NER and entity linking by linking recognized entities to their corresponding entries in the DBpedia knowledge base.

**Knowledge Base**: DBpedia Spotlight disambiguates against DBpedia.


**Installation of DBpedia Spotlight**
1. Install Python on your system (https://www.python.org/downloads/) and the requests library: pip install requests
2. Call the public API directly

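In [10]:
import requests

# Public DBpedia Spotlight endpoint for English text
url = "https://api.dbpedia-spotlight.org/en/annotate"

# Passing a dict lets requests URL-encode the text as form data
result = requests.post(url, data={"text": text}, headers={"Accept": "application/json"})
result.raise_for_status()  # stop early on HTTP errors
print(result.json())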
  

**Output of DBpedia Spotlight**

{'@text': 'Tesla CEO Elon Musk and Jeffrey Epstein associate Ghislaine Maxwell were once photographed together.', '@confidence': '0.5', '@support': '0', '@types': '', '@sparql': '', '@policy': 'whitelist', 'Resources': [{'@URI': 'http://dbpedia.org/resource/Tesla_Model_3', '@support': '453', '@types': 'Schema:Product,DBpedia:MeanOfTransportation,DBpedia:Automobile', '@surfaceForm': 'Tesla', '@offset': '0', '@similarityScore': '0.8611072046202428', '@percentageOfSecondRank': '0.16123062516034467'}, {'@URI': 'http://dbpedia.org/resource/Chief_executive_officer', '@support': '21733', '@types': '', '@surfaceForm': 'CEO', '@offset': '6', '@similarityScore': '0.999998802534271', '@percentageOfSecondRank': '1.1322503808686818E-6'}, {'@URI': 'http://dbpedia.org/resource/Elon_Musk', '@support': '1028', '@types': '', '@surfaceForm': 'Elon', '@offset': '10', '@similarityScore': '0.9999999380270898', '@percentageOfSecondRank': '5.208287192689886E-8'}, {'@URI': 'http://dbpedia.org/resource/Jeffrey_Epstein', '@support': '330', '@types': 'Http://xmlns.com/foaf/0.1/Person,Wikidata:Q5,Wikidata:Q24229398,Wikidata:Q215627,DUL:NaturalPerson,DUL:Agent,Schema:Person,DBpedia:Agent,DBpedia:Person', '@surfaceForm': 'Jeffrey Epstein', '@offset': '24', '@similarityScore': '1.0', '@percentageOfSecondRank': '0.0'}, {'@URI': 'http://dbpedia.org/resource/Ghislaine_Maxwell', '@support': '49', '@types': 'Http://xmlns.com/foaf/0.1/Person,Wikidata:Q5,Wikidata:Q24229398,Wikidata:Q215627,DUL:NaturalPerson,DUL:Agent,Schema:Person,DBpedia:Agent,DBpedia:Person', '@surfaceForm': 'Ghislaine Maxwell', '@offset': '50', '@similarityScore': '1.0', '@percentageOfSecondRank': '0.0'}]}
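The raw response above is hard to read, and all numeric fields arrive as strings. The following sketch flattens the `Resources` list into simple rows and converts offsets and scores to numbers. The helper name `parse_resources` and the `min_similarity` parameter are assumptions for illustration, and the literal below is a trimmed copy of the output above so the snippet runs offline.

```python
# Trimmed copy of the response printed above (only the fields used below).
spotlight_response = {
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/Tesla_Model_3", "@surfaceForm": "Tesla",
         "@offset": "0", "@similarityScore": "0.8611072046202428"},
        {"@URI": "http://dbpedia.org/resource/Chief_executive_officer", "@surfaceForm": "CEO",
         "@offset": "6", "@similarityScore": "0.999998802534271"},
        {"@URI": "http://dbpedia.org/resource/Elon_Musk", "@surfaceForm": "Elon",
         "@offset": "10", "@similarityScore": "0.9999999380270898"},
        {"@URI": "http://dbpedia.org/resource/Jeffrey_Epstein", "@surfaceForm": "Jeffrey Epstein",
         "@offset": "24", "@similarityScore": "1.0"},
        {"@URI": "http://dbpedia.org/resource/Ghislaine_Maxwell", "@surfaceForm": "Ghislaine Maxwell",
         "@offset": "50", "@similarityScore": "1.0"},
    ]
}


def parse_resources(response, min_similarity=0.0):
    """Flatten Spotlight annotations into dicts with numeric offsets and scores."""
    rows = []
    # .get() guards against a response that carries no "Resources" key.
    for res in response.get("Resources", []):
        score = float(res["@similarityScore"])  # scores are returned as strings
        if score < min_similarity:
            continue
        start = int(res["@offset"])  # offsets are returned as strings too
        rows.append({
            "mention": res["@surfaceForm"],
            "start": start,
            "end": start + len(res["@surfaceForm"]),
            "uri": res["@URI"],
            "similarity": score,
        })
    return rows


for row in parse_resources(spotlight_response):
    print(row["mention"], row["start"], row["end"], row["uri"])
```

Reading the flattened rows makes two quirks of this result easy to spot: "Tesla" was linked to the Tesla Model 3 rather than the company, and only "Elon" (not "Elon Musk") was marked as the mention for Elon Musk.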

**Conclusion**

In conclusion, this tutorial has equipped you with the skills to use some of the existing open-source, off-the-shelf NERD tools.

It should also give you the confidence to explore NERD tools beyond those covered in this tutorial.



**Version History**
- 1.0

**Contact details**
Susmita.Gangopadhyay@gesis.org



**References**

For more in-depth exploration, consider checking out the following resources: 

https://nlpprogress.com/english/entity_linking.html






### **FAQs**

#### **Why use off-the-shelf tools for entity linking and disambiguation?**
Off-the-shelf tools offer **pre-built solutions** that save time and resources. They are often trained on **large datasets**, providing a good starting point for various applications.

#### **What types of entities can be linked using these tools?**
Most tools support common entities like **persons, organizations, and locations**. Some may also handle **specific domains** or custom entities based on the tool's training data.

#### **How accurate are off-the-shelf tools?**
Accuracy varies among tools. It depends on factors such as the **quality of training data**, the **diversity of entities**, and the **specific use case**. Evaluation metrics like **precision, recall, and F1 score** help assess accuracy.
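As a worked example of these metrics, the sketch below scores a tool's predictions against a gold standard, treating each annotation as a `(start, end, Wikidata ID)` triple that must match exactly. The gold annotations are hypothetical (written by hand for the sample sentence), the predictions are taken from the Entity Fishing output earlier in this tutorial, and exact-match scoring is only one of several possible conventions.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 over exactly matching annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations for the sample sentence: (start, end, Wikidata ID)
gold = [
    (0, 5, "Q478214"),     # Tesla
    (10, 19, "Q317521"),   # Elon Musk
    (24, 39, "Q2904131"),  # Jeffrey Epstein
    (50, 67, "Q5556756"),  # Ghislaine Maxwell
]

# Predictions taken from the Entity Fishing output shown earlier
predicted = [
    (0, 19, "Q317521"),    # "Tesla CEO Elon Musk": span differs from gold
    (24, 39, "Q2904131"),
    (50, 67, "Q5556756"),
    (78, 90, "Q11633"),    # "photographed": not in gold
]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Two of the four predictions match the gold annotations exactly, so precision, recall, and F1 all come out at 0.50 under this strict convention. A more lenient convention that accepts overlapping spans would score the "Tesla CEO Elon Musk" prediction differently.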

#### **Do these tools work for multiple languages?**
Many off-the-shelf tools support **multiple languages**, but the level of accuracy can vary. It's essential to check the **documentation** for language support.

#### **Can these tools be fine-tuned for domain-specific applications?**
Some tools offer the possibility of **fine-tuning** on domain-specific data. However, it depends on the tool's **architecture and capabilities**.

#### **How do these tools handle ambiguous references?**
Ambiguity resolution depends on **context** and available information. Some tools use **machine learning models** that consider surrounding words, phrases, or contextual information to disambiguate references.

#### **Are there privacy concerns when using entity linking tools?**
Yes, privacy concerns may arise, especially if the text contains **sensitive information**. It's crucial to review the tool's **privacy policy** and consider using it with proper **data anonymization practices**.

#### **What knowledge bases do these tools typically use?**
Tools may use popular knowledge bases like **Wikidata, DBpedia, or Freebase**. Some tools allow users to specify **custom knowledge bases** or integrate with proprietary databases.

#### **Can these tools handle real-time processing?**
Real-time processing capabilities vary. Some tools are optimized for **speed**, while others may be more suitable for **batch processing**. Consider the specific requirements of your application.

#### **How do these tools handle typos or misspellings?**
Some tools include mechanisms to handle **typos or misspellings** through **fuzzy matching** or **probabilistic models**. However, their effectiveness may vary.
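To illustrate the general idea of fuzzy matching (not the mechanism of any particular tool), the sketch below uses Python's standard `difflib` module to map a misspelled mention to the closest name in a small list. The list `KNOWN_NAMES`, the helper `closest_name`, and the 0.8 cutoff are hypothetical choices for this example.

```python
import difflib

# Hypothetical list of known entity names; a real system would query a knowledge base.
KNOWN_NAMES = ["Michael Jordan", "Michael Jackson", "Jordan Peele"]


def closest_name(mention, names=KNOWN_NAMES, cutoff=0.8):
    """Return the most similar known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(mention, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_name("Michale Jordan"))  # misspelled mention
print(closest_name("Elon Musk"))       # nothing similar in the list
```

String similarity alone cannot tell apart different entities with the same name, so fuzzy matching is best seen as a candidate-generation step that happens before disambiguation.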

#### **Are there limitations to off-the-shelf tools?**
Yes, limitations can include handling **rare entities**, dealing with **noisy or informal text**, and adapting to **highly specialized domains**. It's essential to understand the tool's **strengths and weaknesses**.

#### **Do these tools require internet access?**
Some tools may require **internet access** to query external knowledge bases. Check the tool's **documentation** for offline or custom knowledge base options.

#### **How scalable are these tools for large datasets?**
Scalability depends on the tool's **architecture**. Some tools are designed for **large-scale processing**, while others may be more suitable for **smaller datasets**.

#### **Can I combine multiple tools for better performance?**
Yes, combining multiple tools (**ensemble methods**) can improve performance and mitigate the limitations of individual tools. However, **integration complexity** should be considered.
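One simple way to combine tools is to keep only the mentions on which two tools agree. Because Entity Fishing returns Wikidata IDs and DBpedia Spotlight returns DBpedia URIs, the sketch below compares character spans instead of identifiers. The helper names are hypothetical, the spans are taken from the two outputs shown earlier in this tutorial, and overlap-based agreement is just one possible combination strategy.

```python
def spans_overlap(a, b):
    """True if two (start, end) character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def agreed_mentions(tool_a, tool_b):
    """Keep mentions from tool_a whose span overlaps some mention from tool_b."""
    return [
        mention for mention in tool_a
        if any(spans_overlap(mention["span"], other["span"]) for other in tool_b)
    ]


# Spans from the Entity Fishing output (offsetStart, offsetEnd)
entity_fishing = [
    {"text": "Tesla CEO Elon Musk", "span": (0, 19)},
    {"text": "Jeffrey Epstein", "span": (24, 39)},
    {"text": "Ghislaine Maxwell", "span": (50, 67)},
    {"text": "photographed", "span": (78, 90)},
]

# Spans from the DBpedia Spotlight output (@offset, @offset + length of @surfaceForm)
dbpedia_spotlight = [
    {"text": "Tesla", "span": (0, 5)},
    {"text": "CEO", "span": (6, 9)},
    {"text": "Elon", "span": (10, 14)},
    {"text": "Jeffrey Epstein", "span": (24, 39)},
    {"text": "Ghislaine Maxwell", "span": (50, 67)},
]

for mention in agreed_mentions(entity_fishing, dbpedia_spotlight):
    print(mention["text"], mention["span"])
```

Here the low-confidence "photographed" mention is dropped because DBpedia Spotlight did not annotate that span. Agreement voting like this tends to trade recall for precision, so it suits cases where false positives are more costly than missed entities.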
