![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/02.05.NorvigSweetingSpellchecker.ipynb.ipynb)

# **NorvigSweetingApproach** and **NorvigSweetingModel**

This notebook will cover the different parameters and usages of `NorvigSweetingApproach` and `NorvigSweetingModel`. These annotators are used to make corrections to tokens automatically if they are not found in an English dictionary.

**📖 Learning Objectives:**

1. Understand how to check spelling using NorvigSweeting annotators.

2. Understand the difference between `NorvigSweetingApproach` and `NorvigSweetingModel`.

3. Customize the use of these annotators by setting their parameters.


**🔗 Helpful Links:**

- Documentation : [NorvigSweeting Spellchecker](https://nlp.johnsnowlabs.com/docs/en/annotators#norvigsweeting-spellchecker)

- Python Docs : [NorvigSweetingApproach](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/python/sparknlp/annotator/spell_check/norvig_sweeting/index.html#sparknlp.annotator.spell_check.norvig_sweeting.NorvigSweetingApproach), [NorvigSweetingModel](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/python/sparknlp/annotator/spell_check/norvig_sweeting/index.html#sparknlp.annotator.spell_check.norvig_sweeting.NorvigSweetingModel)

- Scala Docs : [NorvigSweetingApproach](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/spell/norvig/NorvigSweetingApproach), [NorvigSweetingModel](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/spell/norvig/NorvigSweetingModel)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/vivekn-sentiment/VivekNarayanSentimentApproach.ipynb).

## **📜 Background**


These annotators retrieve tokens and correct them automatically when they are not found in an external dictionary. They are inspired by the Norvig model and [SymSpell](https://github.com/wolfgarbe/SymSpell).

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. According to the SymSpell project, it is six orders of magnitude faster than the standard approach (deletes + transposes + replaces + inserts) and is language independent.
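To make the contrast concrete, here is a minimal pure-Python sketch of the two candidate-generation strategies for edit distance 1. It is an illustration only, not the Spark NLP or SymSpell implementation; the function names `norvig_edits1`, `deletes1`, and `symmetric_delete_lookup` are hypothetical, and the alphabet is assumed to be lowercase ASCII.

```python
import string


def norvig_edits1(word, alphabet=string.ascii_lowercase):
    """All strings one edit away from `word` (deletes, transposes, replaces, inserts)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def deletes1(word):
    """Delete-only variants: one string per removed character."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def symmetric_delete_lookup(token, dictionary):
    """Match a token to dictionary words that share a delete variant with it."""
    # Index every dictionary word under itself and under each of its delete variants.
    index = {}
    for entry in dictionary:
        for variant in deletes1(entry) | {entry}:
            index.setdefault(variant, set()).add(entry)
    # Apply the same deletes to the token and collect the words they point to.
    matches = set()
    for variant in deletes1(token) | {token}:
        matches |= index.get(variant, set())
    return matches


print(len(norvig_edits1("dogh")), "Norvig-style candidates")
print(len(deletes1("dogh")), "delete-only candidates")
print(symmetric_delete_lookup("dogh", ["dog", "fish", "horse"]))
```

The delete-only set grows with the word length alone, while the Norvig-style set also grows with the alphabet size. A full implementation would additionally verify each match with a real edit-distance computation, which this sketch omits.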

- `NorvigSweetingApproach` is used to train your own spellchecker model. A dictionary of correct spellings must be provided as a text file, and each word is then parsed by a regex pattern.
- `NorvigSweetingModel` is the instantiated model of the `NorvigSweetingApproach`. Pretrained models can be loaded using this annotator. If no pretrained model name is provided, `spellcheck_norvig` is used by default. For available pretrained models, please see the [Models Hub](https://nlp.johnsnowlabs.com/models?task=Spell+Check).
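As a rough picture of what "parsed by a regex pattern" can mean, the sketch below splits raw dictionary text with the `\S+` pattern and counts how often each word occurs. This is plain Python, not Spark NLP code: `parse_dictionary` is a hypothetical helper, the sample text is made up, and treating repeated words as frequencies is an assumption made for illustration.

```python
import re
from collections import Counter


def parse_dictionary(text, token_pattern=r"\S+"):
    """Extract words from raw dictionary text with a regex and count them."""
    return Counter(re.findall(token_pattern, text))


# Made-up sample: one word per line, with "gummy" listed twice.
sample = "gummy\ngummic\ngummier\ngummiest\ngummiferous\ngummy\n"
word_counts = parse_dictionary(sample)

print(len(word_counts), "distinct words")
print("gummy occurs", word_counts["gummy"], "times")
```

With `\S+`, every run of non-whitespace characters becomes one dictionary entry, so punctuation attached to a word would be kept as part of it.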

For alternative approaches to spellchecking, refer to [SymmetricDelete annotator](https://nlp.johnsnowlabs.com/docs/en/annotators#symmetricdelete) or [ContextSpellChecker annotator](https://nlp.johnsnowlabs.com/docs/en/annotators#contextspellchecker).

## **🎬 Colab Setup**

In [3]:
!pip install -q pyspark==3.1.2  spark-nlp==4.2.4


In [4]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `TOKEN` (misspelled words are replaced by their correct form and the metadata includes the confidence score of the spelling correction)

## **🔎 Parameters**


- `caseSensitive`: (Boolean) Whether spell checking is case sensitive (Default: true). Might affect accuracy.

- `dictionary`: (String) Path to a .txt file containing the external dictionary. The file is parsed with a `tokenPattern` regex (Default: `\S+`). This parameter only applies to `NorvigSweetingApproach`.

  Example:


```
...
gummy
gummic
gummier
gummiest
gummiferous
...
```
- `doubleVariants`: (Boolean) Enables an extra check for word combinations, which widens the search and improves accuracy at the cost of performance (Default: false).

- `dupsLimit`: (Int) Maximum number of duplicated characters in a word to consider (Default: 2).

- `frequencyPriority`: (Boolean) Prioritizes word frequency over Hamming distance in intersections (Default: true). When false, Hamming distance takes priority.

- `intersections`: (Int) Number of Hamming intersections to attempt (Default: 10).

- `reductLimit`: (Int) Word reduction limit (Default: 3).

- `shortCircuit`: (Boolean) Faster but less accurate mode that increases performance at the cost of accuracy (Default: false).

- `vowelSwapLimit`: (Int) Vowel swap attempts (Default: 6).

- `wordSizeIgnore`: (Int) Minimum size of word before ignoring (Default: 3).
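The trade-off behind `frequencyPriority` can be pictured with a small ranking sketch. This is not how Spark NLP computes intersections internally: `hamming` and `pick_candidate` are hypothetical helpers, the frequencies are invented, and the length-difference penalty in `hamming` is an assumption added so that words of unequal length can still be compared.

```python
def hamming(a, b):
    """Differing positions, plus the length difference (an added assumption)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def pick_candidate(token, candidates, frequencies, frequency_priority=True):
    """Choose one candidate, ranking by frequency first or by Hamming distance first."""

    def by_frequency(word):
        return (-frequencies.get(word, 0), hamming(token, word))

    def by_hamming(word):
        return (hamming(token, word), -frequencies.get(word, 0))

    return min(candidates, key=by_frequency if frequency_priority else by_hamming)


# Invented frequencies: "along" is more common, "wrong" is closer to the token.
frequencies = {"along": 50, "wrong": 10}
candidates = ["wrong", "along"]

print(pick_candidate("erong", candidates, frequencies, frequency_priority=True))
print(pick_candidate("erong", candidates, frequencies, frequency_priority=False))
```

With frequency first, the more common word wins even though it differs in more positions; with Hamming distance first, the closest word wins.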

## **Examples**

### Using a pretrained spellchecker with `NorvigSweetingModel`

In [5]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# "spellcheck_norvig" can be omitted, as it is the default value
spellChecker = NorvigSweetingModel.pretrained("spellcheck_norvig") \
    .setInputCols(["token"]) \
    .setOutputCol("spell")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker
])

data = spark.createDataFrame([["somtimes i wrrite wordz erong."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select(col('token.result').alias("before_spellchecker"), col('spell.result').alias("after_spellchecker")).show(truncate = False)

spellcheck_norvig download started this may take some time.
Approximate size to download 4.2 MB
[OK!]
+--------------------------------------+--------------------------------------+
|before_spellchecker                   |after_spellchecker                    |
+--------------------------------------+--------------------------------------+
|[somtimes, i, wrrite, wordz, erong, .]|[sometimes, i, write, words, wrong, .]|
+--------------------------------------+--------------------------------------+



### Training a spellchecker using `NorvigSweetingApproach`

In [6]:
# Dictionary creation

external_dict = '''
dog
fish
horse
'''

with open('external_dict.txt', 'w') as f:
  f.write(external_dict)

In [7]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell") \
    .setDictionary("external_dict.txt")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker
])

empty_df = spark.createDataFrame([[""]]).toDF("text")

spellcheck_model = pipeline.fit(empty_df)

text_df = spark.createDataFrame([["The dogh is eating."]]).toDF("text")

corrected_text = spellcheck_model.transform(text_df)

corrected_text.select(col('token.result').alias("before_spellchecker"), col('spell.result').alias("after_spellchecker")).show(truncate = False)

+--------------------------+-------------------------+
|before_spellchecker       |after_spellchecker       |
+--------------------------+-------------------------+
|[The, dogh, is, eating, .]|[The, dog, is, eating, .]|
+--------------------------+-------------------------+



Based on the external dictionary, the spellchecker identified the misspelled word "dogh" and replaced it with "dog".

### setCaseSensitive

In [8]:
capital_external_dict = '''
Dog
Fish
Horse
'''

with open('capital_external_dict.txt', 'w') as f:
  f.write(capital_external_dict)

In [9]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker_1 = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_1") \
    .setDictionary("capital_external_dict.txt") \
    .setCaseSensitive(False)

spellChecker_2 = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_2") \
    .setDictionary("capital_external_dict.txt") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker_1,
    spellChecker_2
])

empty_df = spark.createDataFrame([[""]]).toDF("text")

spellcheck_model = pipeline.fit(empty_df)

text_df = spark.createDataFrame([["The name of the dogh is Dogh."]]).toDF("text")

corrected_text = spellcheck_model.transform(text_df)

corrected_text.select(col('token.result').alias("before_spellchecker"), col('spell_1.result').alias("case_sensitive_false"), col('spell_2.result').alias("case_sensitive_true")).show(truncate = False)

+---------------------------------------+--------------------------------------+--------------------------------------+
|before_spellchecker                    |case_sensitive_false                  |case_sensitive_true                   |
+---------------------------------------+--------------------------------------+--------------------------------------+
|[The, name, of, the, dogh, is, Dogh, .]|[The, name, of, the, dog, is, Dogh, .]|[The, name, of, the, dogh, is, Dog, .]|
+---------------------------------------+--------------------------------------+--------------------------------------+



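When `caseSensitive` is set to False, the spellchecker ignores the capitalization of the words in the external dictionary, so only the lowercase "dogh" is corrected (to "dog"). When it is set to True (the default value), the capitalization in the dictionary is respected, so only the capitalized "Dogh" is corrected (to "Dog").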
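One way to reason about this result is the toy corrector below, which reproduces both output columns from the table above. It is only a guess that happens to match the observed output, not the actual Spark NLP logic: `correct` is a hypothetical function that tries single-character deletions, and it assumes that case-insensitive mode lowercases the dictionary while leaving the input token as typed.

```python
def correct(token, dictionary, case_sensitive=True):
    """Return a dictionary word reachable by deleting one character, else the token."""
    # Assumption: case-insensitive mode lowercases the dictionary, not the token.
    known = set(dictionary) if case_sensitive else {word.lower() for word in dictionary}
    if token in known:
        return token
    for i in range(len(token)):
        candidate = token[:i] + token[i + 1:]
        if candidate in known:
            return candidate
    return token


tokens = ["The", "name", "of", "the", "dogh", "is", "Dogh", "."]
dictionary = ["Dog", "Fish", "Horse"]

case_insensitive = [correct(t, dictionary, case_sensitive=False) for t in tokens]
case_sensitive = [correct(t, dictionary, case_sensitive=True) for t in tokens]

print(case_insensitive)
print(case_sensitive)
```

In the case-insensitive run the dictionary holds "dog", so "dogh" matches after one deletion while "Dogh" only produces "Dog", which is not in the lowercased set. The case-sensitive run is the mirror image.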

### setDupsLimit

In [10]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker_1 = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_1") \
    .setDictionary("external_dict.txt") \
    .setDupsLimit(1)

spellChecker_2 = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_2") \
    .setDictionary("external_dict.txt") \
    .setDupsLimit(0)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker_1,
    spellChecker_2
])

empty_df = spark.createDataFrame([[""]]).toDF("text")

spellcheck_model = pipeline.fit(empty_df)

text_df = spark.createDataFrame([["It was a goood dogh."]]).toDF("text")

corrected_text = spellcheck_model.transform(text_df)

corrected_text.select(col('token.result').alias("before_spellchecker"), col('spell_1.result').alias("dups_limit_1"), col('spell_2.result').alias("dups_limit_0")).show(truncate = False)

+----------------------------+--------------------------+-------------------------+
|before_spellchecker         |dups_limit_1              |dups_limit_0             |
+----------------------------+--------------------------+-------------------------+
|[It, was, a, goood, dogh, .]|[It, was, a, good, dog, .]|[It, was, a, god, dog, .]|
+----------------------------+--------------------------+-------------------------+



When `dupsLimit` is 1, a letter can be duplicated at most once, so "goood" is reduced to "good". When `dupsLimit` is set to 0, no letter can be duplicated (this is why "goood" was turned into "god").
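The effect of the duplicate limit on "goood" can be sketched with a regular expression. This is an illustrative approximation that is consistent with the output above, not the Spark NLP implementation; `reduce_duplicates` is a hypothetical helper that caps each run of identical characters at `dups_limit + 1` occurrences.

```python
import re


def reduce_duplicates(word, dups_limit):
    """Cap every run of a repeated character at `dups_limit` extra copies."""
    max_run = dups_limit + 1

    def shorten(match):
        # match.group(0) is the whole run, match.group(1) is the repeated character.
        return match.group(1) * min(len(match.group(0)), max_run)

    return re.sub(r"(.)\1+", shorten, word)


for limit in (2, 1, 0):
    print(limit, reduce_duplicates("goood", limit))
```

With a limit of 2 the word is left as "goood", with 1 it becomes "good", and with 0 it collapses to "god".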

### setWordSizeIgnore

In [11]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker_1 = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_1") \
    .setDictionary("external_dict.txt") \
    .setWordSizeIgnore(3)

spellChecker_2 = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_2") \
    .setDictionary("external_dict.txt") \
    .setWordSizeIgnore(4)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker_1,
    spellChecker_2
])

empty_df = spark.createDataFrame([[""]]).toDF("text")

spellcheck_model = pipeline.fit(empty_df)

text_df = spark.createDataFrame([["It was a good dogh."]]).toDF("text")

corrected_text = spellcheck_model.transform(text_df)

corrected_text.select(col('token.result').alias("before_spellchecker"), col('spell_1.result').alias("word_size_ignore_3"), col('spell_2.result').alias("word_size_ignore_4")).show(truncate = False)

+---------------------------+--------------------------+---------------------------+
|before_spellchecker        |word_size_ignore_3        |word_size_ignore_4         |
+---------------------------+--------------------------+---------------------------+
|[It, was, a, good, dogh, .]|[It, was, a, good, dog, .]|[It, was, a, good, dogh, .]|
+---------------------------+--------------------------+---------------------------+



In this example, the misspelled word has 4 characters. The spellchecker with `wordSizeIgnore` set to 4 ignored this token and did not correct its spelling, while the one set to 3 (the default value) corrected it.
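The length check can be written as a one-line predicate. The comparison used here (`<=`) is inferred from the single example above, where a 4-character token is skipped at a threshold of 4, so treat it as an assumption; `is_ignored` is a hypothetical helper, not part of the Spark NLP API.

```python
def is_ignored(token, word_size_ignore=3):
    """True when the token is short enough to be left uncorrected (assumed `<=`)."""
    return len(token) <= word_size_ignore


for threshold in (3, 4):
    print(threshold, is_ignored("dogh", threshold))
```

Raising the threshold protects short tokens such as abbreviations from being rewritten, at the price of leaving short misspellings untouched.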