![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **NerDisambiguator**

This notebook covers the parameters and usage of `NerDisambiguator`. This annotator links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB).

**📖 Learning Objectives:**

1. Background: Understand what `NerDisambiguator` does.

2. Colab setup

3. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Python Docs : [NerDisambiguator](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/disambiguation/ner_disambiguator/index.html#sparknlp_jsl.annotator.disambiguation.ner_disambiguator.NerDisambiguator)

- Scala Docs : [NerDisambiguator](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/disambiguation/NerDisambiguator.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings).

## **📜 Background**


Named entity disambiguation is a technique that links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms. `NerDisambiguator` is the annotator that performs this linking.

## **🎬 Colab Setup**

This module is licensed, so you need a valid license JSON file.

Install `johnsnowlabs`:

In [1]:
!pip install -q johnsnowlabs


In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving 5.3.3.spark_nlp_for_healthcare.json to 5.3.3.spark_nlp_for_healthcare.json


In [3]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=5.3.3 but should be Version=5.3.2
🚨 Outdated OCR Secrets in license file. Version=5.1.2 but should be Version=5.3.2
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs

Start the Spark session:

In [4]:
# Automatically load license data and start a session with all JARs the user has access to
spark = nlp.start()
spark

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `CHUNK` , `SENTENCE_EMBEDDINGS`

- Output: `DISAMBIGUATION`

## **🔎 Parameters**


- `embeddingTypeParam`: (String) Either ‘bow’ for word embeddings or ‘sentence’ for sentence embeddings.

- `numFirstChars`: (Int) How many characters are considered for the initial prefix search in the knowledge base.

- `tokenSearch`: (BooleanParam) Whether to search the knowledge base by token or by chunk (token is recommended; default: True).

- `narrowWithApproximateMatching`: (BooleanParam) Whether to narrow the prefix search results with Levenshtein-distance-based matching (True is recommended).

- `levenshteinDistanceThresholdParam`: (Float) Levenshtein distance threshold used to narrow the results of the prefix search (default: 0.1).

- `nearMatchingGapParam`: (Int) Limit on candidate string length during Levenshtein-distance-based narrowing: candidate chunks are trimmed when `len(candidate) - len(entity chunk) > nearMatchingGap` (default: 4).

- `predictionsLimit`: (Int) Limit N on the number of predictions returned (top N predictions).

- `s3KnowledgeBaseName`: (String) Name of the knowledge base in S3.



### `setS3KnowledgeBaseName()`

First, define the pipeline stages that extract entities and embeddings. The `NerConverter` whitelist keeps only `PER` (person) entities.





In [7]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["sentence","embeddings"]) \
    .setOutputCol("sentence_embeddings")

ner_model = nlp.NerDLModel.pretrained() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk") \
    .setWhiteList(["PER"])

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]


Then the extracted entities can be disambiguated:

In [9]:
disambiguator = medical.NerDisambiguator() \
    .setS3KnowledgeBaseName("i-per") \
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("disambiguation") \
    .setTokenSearch(False)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    ner_model,
    ner_converter,
    disambiguator])
data = spark.createDataFrame([["The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."]]).toDF("text")

model = nlpPipeline.fit(data)
result = model.transform(data)
result.selectExpr("explode(disambiguation)") \
    .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, False)

+------------------+------------------------------------------------------------------------------------------------------------------------+
|chunk             |result                                                                                                                  |
+------------------+------------------------------------------------------------------------------------------------------------------------+
|Donald Trump      |http://en.wikipedia.org/?curid=55907961, http://en.wikipedia.org/?curid=31698421, http://en.wikipedia.org/?curid=4848272|
|Christina Aguilera|http://en.wikipedia.org/?curid=6636454, http://en.wikipedia.org/?curid=144171                                           |
+------------------+------------------------------------------------------------------------------------------------------------------------+
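Each `result` value in the table above is a single comma-separated string of candidate links. For downstream use it can help to split it into a list. The following is a minimal pure-Python sketch: the helper name `split_candidates` is hypothetical, and the strings are copied from the table above rather than read from the Spark DataFrame.

```python
from typing import Dict, List

# Result strings copied from the output table above (not fetched from Spark).
rows: Dict[str, str] = {
    "Donald Trump": (
        "http://en.wikipedia.org/?curid=55907961, "
        "http://en.wikipedia.org/?curid=31698421, "
        "http://en.wikipedia.org/?curid=4848272"
    ),
    "Christina Aguilera": (
        "http://en.wikipedia.org/?curid=6636454, "
        "http://en.wikipedia.org/?curid=144171"
    ),
}


def split_candidates(result: str) -> List[str]:
    """Split one comma-separated result string into individual links."""
    return [link.strip() for link in result.split(",") if link.strip()]


# Map each chunk to its list of candidate KB links.
candidates = {chunk: split_candidates(result) for chunk, result in rows.items()}
for chunk, links in candidates.items():
    print(f"{chunk}: {len(links)} candidate(s)")
```

Here `Donald Trump` maps to three candidate links and `Christina Aguilera` to two, matching the table.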



### `setNumFirstChars()`

This parameter sets how many characters are considered for the initial prefix search in the knowledge base. Let's use it in the example below, setting it to a value (10) for which no matching prefix exists in the KB:

In [10]:
disambiguator = medical.NerDisambiguator() \
    .setS3KnowledgeBaseName("i-per") \
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("disambiguation") \
    .setTokenSearch(False)\
    .setNumFirstChars(10)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    ner_model,
    ner_converter,
    disambiguator])
data = spark.createDataFrame([["The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."]]).toDF("text")

model = nlpPipeline.fit(data)
result = model.transform(data)
result.selectExpr("explode(disambiguation)") \
    .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, False)

+-----+------+
|chunk|result|
+-----+------+
+-----+------+



As you can see, there are no results from the KB.
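One way to picture why a prefix length can return nothing is a lookup table keyed by fixed-length prefixes. The sketch below is only an illustration under that assumption: the entry names in `toy_kb`, the constant `INDEXED_PREFIX_LENGTH`, and both helper functions are hypothetical, and the real knowledge base may be organized differently.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical knowledge-base entry names, invented for illustration.
toy_kb = ["Donald Trump", "Donald Trump Jr.", "Donald Duck", "Christina Aguilera"]

# Assumption for this toy index only: entries are keyed by their first 3 characters.
INDEXED_PREFIX_LENGTH = 3


def build_prefix_index(names: List[str], prefix_len: int) -> Dict[str, List[str]]:
    """Group names by their lower-cased prefix of length `prefix_len`."""
    index = defaultdict(list)
    for name in names:
        index[name[:prefix_len].lower()].append(name)
    return dict(index)


def lookup(index: Dict[str, List[str]], mention: str, num_first_chars: int) -> List[str]:
    """Look up the first `num_first_chars` characters of the mention in the index."""
    return index.get(mention[:num_first_chars].lower(), [])


index = build_prefix_index(toy_kb, INDEXED_PREFIX_LENGTH)
matches_3 = lookup(index, "Donald Trump", 3)    # key "don" exists in the index
matches_10 = lookup(index, "Donald Trump", 10)  # key "donald tru" does not exist
print(matches_3)
print(matches_10)
```

In this toy index, a 3-character query finds the three "Don..." entries, while a 10-character query finds no key at all and returns an empty list.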

### `setTokenSearch()`


This parameter sets whether to search the knowledge base by token or by chunk. The default value is `True`, which searches by token.
Let's set it to `False` to search by chunk:



In [11]:
disambiguator = medical.NerDisambiguator() \
    .setS3KnowledgeBaseName("i-per") \
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("disambiguation") \
    .setNumFirstChars(3)\
    .setTokenSearch(False)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    ner_model,
    ner_converter,
    disambiguator])

data = spark.createDataFrame([["The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."]]).toDF("text")

model = nlpPipeline.fit(data)
result = model.transform(data)
result.selectExpr("explode(disambiguation)") \
    .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, False)

+------------------+------------------------------------------------------------------------------------------------------------------------+
|chunk             |result                                                                                                                  |
+------------------+------------------------------------------------------------------------------------------------------------------------+
|Donald Trump      |http://en.wikipedia.org/?curid=55907961, http://en.wikipedia.org/?curid=31698421, http://en.wikipedia.org/?curid=4848272|
|Christina Aguilera|http://en.wikipedia.org/?curid=6636454, http://en.wikipedia.org/?curid=144171                                           |
+------------------+------------------------------------------------------------------------------------------------------------------------+
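To make the token-versus-chunk distinction concrete, here is a small pure-Python sketch. It assumes that chunk search compares the prefix of the whole chunk against whole entry names, while token search compares the prefix of each token against each token of an entry. That reading, the `toy_kb` names, and the `search` helper are all assumptions made for illustration, not the annotator's actual implementation.

```python
from typing import List

# Hypothetical knowledge-base entry names, invented for illustration.
toy_kb = ["Donald Trump", "Donald Glover", "Ivanka Trump"]


def search(kb: List[str], chunk: str, token_search: bool, num_first_chars: int = 3) -> List[str]:
    """Prefix-match either the whole chunk or each of its tokens against the KB."""
    # Query units: individual tokens, or the chunk as a single unit.
    queries = chunk.split() if token_search else [chunk]
    prefixes = {query[:num_first_chars].lower() for query in queries}
    matches = []
    for name in kb:
        # KB units are split the same way as the query.
        keys = name.split() if token_search else [name]
        if any(key.lower().startswith(prefix) for key in keys for prefix in prefixes):
            matches.append(name)
    return matches


by_chunk = search(toy_kb, "Donald Trump", token_search=False)
by_token = search(toy_kb, "Donald Trump", token_search=True)
print(by_chunk)
print(by_token)
```

Under these assumptions, chunk search only reaches entries that begin like the chunk, while token search also reaches `Ivanka Trump` through the shared token `Trump`.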



### `setNarrowWithApproximateMatching()`


This parameter sets whether to narrow the prefix search results with Levenshtein-distance-based matching. The default value is `True`. Let's disable the narrowing:

In [13]:
disambiguator = medical.NerDisambiguator() \
    .setS3KnowledgeBaseName("i-per") \
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("disambiguation") \
    .setNumFirstChars(3)\
    .setTokenSearch(False)\
    .setNarrowWithApproximateMatching(False)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    ner_model,
    ner_converter,
    disambiguator])
data = spark.createDataFrame([["The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."]]).toDF("text")

model = nlpPipeline.fit(data)
result = model.transform(data)
result.selectExpr("explode(disambiguation)") \
    .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, False)

[Output truncated in the source: a much wider table listing more candidate links per chunk]

As we can see in the results, without narrowing the prefix search results by Levenshtein distance we get more results (but possibly less accurate ones).

### `setLevenshteinDistanceThresholdParam()`


This parameter sets the Levenshtein distance threshold used to narrow the results of the prefix search. The default value is 0.1. Lowering it makes the matching stricter, so fewer results are expected:


In [14]:
disambiguator = medical.NerDisambiguator() \
    .setS3KnowledgeBaseName("i-per") \
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("disambiguation") \
    .setNumFirstChars(3)\
    .setTokenSearch(False)\
    .setLevenshteinDistanceThresholdParam(0.05)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    ner_model,
    ner_converter,
    disambiguator])
data = spark.createDataFrame([["The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."]]).toDF("text")

model = nlpPipeline.fit(data)
result = model.transform(data)
result.selectExpr("explode(disambiguation)") \
    .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, False)

+------------------+--------------------------------------+
|chunk             |result                                |
+------------------+--------------------------------------+
|Donald Trump      |http://en.wikipedia.org/?curid=4848272|
|Christina Aguilera|http://en.wikipedia.org/?curid=144171 |
+------------------+--------------------------------------+



As we can see, we get just one result for each chunk, but with higher precision.
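The effect of a lower threshold can be sketched with a plain edit-distance filter. The example below assumes the threshold is applied to a distance normalized by the longer string's length, which is a guess based on the fractional default of 0.1. The candidate strings and helper names are hypothetical, and the annotator's own normalization may differ.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def narrow(mention: str, candidates: List[str], threshold: float) -> List[str]:
    """Keep candidates whose normalized distance to the mention is within the threshold."""
    return [c for c in candidates if normalized_distance(mention, c) <= threshold]


# Hypothetical candidate names, invented for illustration.
toy_candidates = ["Donald Trump", "Donald Trumpp", "Donald Trump Jr."]
loose = narrow("Donald Trump", toy_candidates, 0.1)
strict = narrow("Donald Trump", toy_candidates, 0.05)
print(loose)
print(strict)
```

With the threshold at 0.1 the one-character variant survives (1/13 is about 0.077), while at 0.05 only the exact match remains.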

### `setNearMatchingGapParam()`


This parameter sets a limit on string length, by trimming the candidate chunks, during Levenshtein-distance-based narrowing. Let's test it below with a gap of 2 instead of the default 4:

In [15]:
disambiguator = medical.NerDisambiguator() \
    .setS3KnowledgeBaseName("i-per") \
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("disambiguation") \
    .setNumFirstChars(3)\
    .setNearMatchingGapParam(2)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    ner_model,
    ner_converter,
    disambiguator])
data = spark.createDataFrame([["The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."]]).toDF("text")

model = nlpPipeline.fit(data)
result = model.transform(data)
result.selectExpr("explode(disambiguation)") \
    .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, False)

+------------------+------------------------------------------------------------------------------------------------------------------------+
|chunk             |result                                                                                                                  |
+------------------+------------------------------------------------------------------------------------------------------------------------+
|Donald Trump      |http://en.wikipedia.org/?curid=55907961, http://en.wikipedia.org/?curid=31698421, http://en.wikipedia.org/?curid=4848272|
|Christina Aguilera|http://en.wikipedia.org/?curid=144171                                                                                   |
+------------------+------------------------------------------------------------------------------------------------------------------------+
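The parameter description gives the condition `len(candidate) - len(entity chunk) > nearMatchingGap` for trimming a candidate. The sketch below applies that condition literally and assumes trimming keeps the first `len(chunk) + gap` characters. The helper `trim_candidate` and the candidate string are hypothetical, and how the annotator trims internally is not shown in this notebook.

```python
def trim_candidate(candidate: str, chunk: str, near_matching_gap: int) -> str:
    """Trim a candidate that exceeds the chunk length by more than the allowed gap."""
    if len(candidate) - len(chunk) > near_matching_gap:
        # Assumption: keep the leading characters up to chunk length plus gap.
        return candidate[: len(chunk) + near_matching_gap]
    return candidate


chunk = "Donald Trump"          # 12 characters
candidate = "Donald Trump Jr."  # 16 characters, hypothetical KB entry

trimmed_default = trim_candidate(candidate, chunk, 4)  # 16 - 12 = 4, not > 4: unchanged
trimmed_small = trim_candidate(candidate, chunk, 2)    # 4 > 2: cut to 14 characters
print(repr(trimmed_default))
print(repr(trimmed_small))
```

A smaller gap therefore shortens more candidates before they are compared with the chunk, which can change which ones pass the distance threshold.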



### `setPredictionLimit()`


This parameter sets a limit N on the number of predictions returned (top N predictions).
Let's limit the number of predictions to two in the example below:




In [16]:
disambiguator = medical.NerDisambiguator() \
    .setS3KnowledgeBaseName("i-per") \
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("disambiguation") \
    .setNumFirstChars(3)\
    .setPredictionLimit(2)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    ner_model,
    ner_converter,
    disambiguator])
data = spark.createDataFrame([["The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."]]).toDF("text")

model = nlpPipeline.fit(data)
result = model.transform(data)
result.selectExpr("explode(disambiguation)") \
    .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, False)

+------------------+--------------------------------------------------------------------------------+
|chunk             |result                                                                          |
+------------------+--------------------------------------------------------------------------------+
|Donald Trump      |http://en.wikipedia.org/?curid=55907961, http://en.wikipedia.org/?curid=31698421|
|Christina Aguilera|http://en.wikipedia.org/?curid=6636454, http://en.wikipedia.org/?curid=144171   |
+------------------+--------------------------------------------------------------------------------+



As we can see, the output contains two predictions for each chunk.
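The limit behaves like keeping the first N entries of the candidate list. As a minimal sketch, the list below copies the three `Donald Trump` candidates from the first example's output, and slicing it to two reproduces the row shown above. The helper `limit_predictions` is hypothetical, and how the annotator orders candidates before truncating is not shown in this notebook.

```python
from typing import List

# The three candidates returned for "Donald Trump" in the first example, in output order.
trump_candidates = [
    "http://en.wikipedia.org/?curid=55907961",
    "http://en.wikipedia.org/?curid=31698421",
    "http://en.wikipedia.org/?curid=4848272",
]


def limit_predictions(candidates: List[str], limit: int) -> List[str]:
    """Keep at most `limit` candidates, preserving their existing order."""
    return candidates[:limit]


top_two = limit_predictions(trump_candidates, 2)
print(top_two)
```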