![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **NerOverwriter**

This notebook covers the parameters and usage of `NerOverwriter`. This annotator overwrites the NER tags of specified words, or replaces existing tags with new ones.

**📖 Learning Objectives:**

1. Understand how `NerOverwriter` changes the NER tags produced by an upstream NER model.

2. Understand the difference between overwriting the tags of specific words (`setNerWords`) and replacing tags by label (`setReplaceEntities`).

3. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [NerOverwriter](https://nlp.johnsnowlabs.com/docs/en/annotators#neroverwriter)

- Python Docs : [NerOverwriter](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/ner/ner_overwriter/index.html#sparknlp.annotator.ner.ner_overwriter.NerOverwriter)

- Scala Docs : [NerOverwriter](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/ner/NerOverwriter.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/).

## **📜 Background**

## **🎬 Colab Setup**

In [4]:
!pip install -q pyspark==3.1.2  spark-nlp==4.2.4

In [5]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

<table class="table">
<thead>
<tr class="row-odd"><th class="head"><p>Input Annotation types</p></th>
<th class="head"><p>Output Annotation type</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">NAMED_ENTITY</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">NAMED_ENTITY</span></code></p></td>
</tr>
</tbody>
</table>

## **🔎 Parameters**

- `setNerWords` : (List[str]) Sets the words whose NER tags will be overwritten.

- `setNewNerEntity` : (str) Sets the new NER tag to apply to the words given in `setNerWords`, by default `I-OVERWRITE`.

- `setReplaceEntities` : (Dict[str, str]) Sets a dictionary that maps each existing tag you want to replace to its new tag.
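
To make the effect of these parameters concrete, here is a plain-Python sketch that mimics the behaviour shown in this notebook's outputs on a list of `(word, tag)` pairs. It is an illustration only, not Spark NLP's implementation: the function name `overwrite_tags`, the exact-match comparison of words, and the precedence of word overwrites over tag replacement are all assumptions.

```python
from typing import Dict, List, Optional, Tuple


def overwrite_tags(
    tagged: List[Tuple[str, str]],
    ner_words: Optional[List[str]] = None,
    new_ner_entity: str = "I-OVERWRITE",
    replace_entities: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, str]]:
    """Hypothetical helper: apply word-based and tag-based overwrites."""
    words = set(ner_words or [])
    mapping = replace_entities or {}
    out = []
    for word, tag in tagged:
        if word in words:
            # Word-based rule: the listed word gets the new tag (assumed exact match).
            tag = new_ner_entity
        else:
            # Tag-based rule: swap the tag if it appears in the mapping.
            tag = mapping.get(tag, tag)
        out.append((word, tag))
    return out


tagged = [("Five", "O"), ("Million", "O"), ("John", "B-ORG"), ("Snow", "I-ORG")]

# Word-based overwrite, as with setNerWords + setNewNerEntity.
print(overwrite_tags(tagged, ner_words=["Million"], new_ner_entity="B-CARDINAL"))

# Tag-based replacement, as with setReplaceEntities.
print(overwrite_tags(tagged, replace_entities={"B-ORG": "B-PLACE", "I-ORG": "I-PLACE"}))
```

The two calls correspond to the two usage patterns covered in the sections that follow; tokens that match neither rule keep their original tag.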


##### First, extract the prerequisite entities



In [3]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
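# Default pretrained word embeddings (glove_100d, per the download log below)
embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")
# Default pretrained NER model (ner_dl), which consumes the embeddings column
nerTagger = NerDLModel.pretrained() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")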
pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])
data = spark.createDataFrame([["Spark NLP Crosses Five Million Downloads, John Snow Labs Announces."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(ner)").show(truncate=False)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]
+---------------------------------------------------------------------+
|col                                                                  |
+---------------------------------------------------------------------+
|{named_entity, 0, 4, B-ORG, {word -> Spark, sentence -> 0}, []}      |
|{named_entity, 6, 8, I-ORG, {word -> NLP, sentence -> 0}, []}        |
|{named_entity, 10, 16, O, {word -> Crosses, sentence -> 0}, []}      |
|{named_entity, 18, 21, O, {word -> Five, sentence -> 0}, []}         |
|{named_entity, 23, 29, O, {word -> Million, sentence -> 0}, []}      |
|{named_entity, 31, 39, O, {word -> Downloads, sentence -> 0}, []}    |
|{named_entity, 40, 40, O, {word -> ,, sentence -> 0}, []}            |
|{named_entity, 42, 45, B-ORG, {word -> John, sentence -> 0}, []}     |
|{named_entity, 47,

### `setNerWords()` , `setNewNerEntity()`


##### The recognized entities can then be overwritten

In [None]:
nerOverwriter = NerOverwriter() \
    .setInputCols(["ner"]) \
    .setOutputCol("ner_overwritten") \
    .setNerWords(["Million"]) \
    .setNewNerEntity("B-CARDINAL")
nerOverwriter.transform(result).selectExpr("explode(ner_overwritten)").show(truncate=False)

+------------------------------------------------------------------------+
|col                                                                     |
+------------------------------------------------------------------------+
|{named_entity, 0, 4, B-ORG, {word -> Spark, sentence -> 0}, []}         |
|{named_entity, 6, 8, I-ORG, {word -> NLP, sentence -> 0}, []}           |
|{named_entity, 10, 16, O, {word -> Crosses, sentence -> 0}, []}         |
|{named_entity, 18, 21, O, {word -> Five, sentence -> 0}, []}            |
|{named_entity, 23, 29, B-CARDINAL, {word -> Million, sentence -> 0}, []}|
|{named_entity, 31, 39, O, {word -> Downloads, sentence -> 0}, []}       |
|{named_entity, 40, 40, O, {word -> ,, sentence -> 0}, []}               |
|{named_entity, 42, 45, B-ORG, {word -> John, sentence -> 0}, []}        |
|{named_entity, 47, 50, I-ORG, {word -> Snow, sentence -> 0}, []}        |
|{named_entity, 52, 55, I-ORG, {word -> Labs, sentence -> 0}, []}        |
|{named_entity, 57, 65, I
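
When the output is long, it can help to list only the tokens whose tag changed. The sketch below does this in plain Python, using the complete rows from the two outputs above as input (the truncated final rows are left out). The helper name `changed_tokens` is hypothetical and not part of Spark NLP.

```python
from typing import List, Tuple


def changed_tokens(
    before: List[Tuple[str, str]], after: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (word, old_tag, new_tag) for every position where the tag differs."""
    if len(before) != len(after):
        raise ValueError("both sequences must cover the same tokens")
    return [
        (word, old, new)
        for (word, old), (_, new) in zip(before, after)
        if old != new
    ]


# (word, tag) pairs copied from the `ner` output shown earlier.
before = [
    ("Spark", "B-ORG"), ("NLP", "I-ORG"), ("Crosses", "O"), ("Five", "O"),
    ("Million", "O"), ("Downloads", "O"), (",", "O"), ("John", "B-ORG"),
]
# The same tokens as they appear in the `ner_overwritten` output.
after = [
    ("Spark", "B-ORG"), ("NLP", "I-ORG"), ("Crosses", "O"), ("Five", "O"),
    ("Million", "B-CARDINAL"), ("Downloads", "O"), (",", "O"), ("John", "B-ORG"),
]

print(changed_tokens(before, after))
```

On a real DataFrame, pairs like these could be obtained from the `result` field and the `word` key of the `metadata` map of each annotation, for example with `selectExpr("a.metadata['word'] as word", "a.result as tag")` after exploding the column under the alias `a`.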

### `setReplaceEntities()`


In [None]:
new_dict = {"B-ORG":"B-PLACE", "I-ORG":"I-PLACE"}

In [None]:
nerOverwriter = NerOverwriter() \
    .setInputCols(["ner"]) \
    .setOutputCol("ner_overwritten") \
    .setReplaceEntities(new_dict)
nerOverwriter.transform(result).selectExpr("explode(ner_overwritten)").show(truncate=False)

+-----------------------------------------------------------------------+
|col                                                                    |
+-----------------------------------------------------------------------+
|{named_entity, 0, 4, B-PLACE, {word -> Spark, sentence -> 0}, []}      |
|{named_entity, 6, 8, I-PLACE, {word -> NLP, sentence -> 0}, []}        |
|{named_entity, 10, 16, O, {word -> Crosses, sentence -> 0}, []}        |
|{named_entity, 18, 21, O, {word -> Five, sentence -> 0}, []}           |
|{named_entity, 23, 29, O, {word -> Million, sentence -> 0}, []}        |
|{named_entity, 31, 39, O, {word -> Downloads, sentence -> 0}, []}      |
|{named_entity, 40, 40, O, {word -> ,, sentence -> 0}, []}              |
|{named_entity, 42, 45, B-PLACE, {word -> John, sentence -> 0}, []}     |
|{named_entity, 47, 50, I-PLACE, {word -> Snow, sentence -> 0}, []}     |
|{named_entity, 52, 55, I-PLACE, {word -> Labs, sentence -> 0}, []}     |
|{named_entity, 57, 65, I-PLACE, {word
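
To see what the replaced tags mean at the entity level, the sketch below groups B-/I- tagged tokens into chunks in plain Python. It uses only the complete rows from the output above and omits the truncated final row. The helper name `bio_chunks` is hypothetical; inside a Spark NLP pipeline, the `NerConverter` annotator is the usual way to turn tags into chunks.

```python
from typing import List, Optional, Tuple


def bio_chunks(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group consecutive B-/I- tagged words into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None

    def flush() -> None:
        # Close the chunk currently being built, if there is one.
        if words:
            chunks.append((" ".join(words), label))

    for word, tag in tagged:
        prefix, _, tag_label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and tag_label != label):
            # A B- tag, or an I- tag with a different label, starts a new chunk.
            flush()
            words, label = [word], tag_label
        elif prefix == "I":
            # An I- tag with the same label continues the current chunk.
            words.append(word)
        else:
            # "O" (or anything else) ends the current chunk.
            flush()
            words, label = [], None
    flush()
    return chunks


# (word, tag) pairs copied from the complete rows of the output above.
replaced = [
    ("Spark", "B-PLACE"), ("NLP", "I-PLACE"), ("Crosses", "O"), ("Five", "O"),
    ("Million", "O"), ("Downloads", "O"), (",", "O"),
    ("John", "B-PLACE"), ("Snow", "I-PLACE"), ("Labs", "I-PLACE"),
]

print(bio_chunks(replaced))
```

Because the dictionary maps `B-ORG` to `B-PLACE` and `I-ORG` to `I-PLACE`, the chunk boundaries stay the same and only the label changes. Mapping both prefixes to a single tag would alter how downstream chunking groups the tokens.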