![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **Attention!!** `.setDupsLimit()` currently has a bug, and an issue has been opened for it on GitHub. Please do not include that parameter setter in the recording until the bug is fixed.

# **SymmetricDeleteApproach** and **SymmetricDeleteModel**

This notebook will cover the different parameters and usages of `SymmetricDeleteApproach` and `SymmetricDeleteModel`.

**📖 Learning Objectives:**

1. Understand how to check spelling using SymmetricDelete annotators.

2. Understand the difference between `SymmetricDeleteApproach` and `SymmetricDeleteModel`.

3. Customize the use of these annotators by setting their parameters.


**🔗 Helpful Links:**

- Documentation : [SymmetricDelete Spellchecker](https://nlp.johnsnowlabs.com/docs/en/annotators#symmetricdelete-spellchecker)

- Python Docs : [SymmetricDeleteApproach](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/python/sparknlp/annotator/spell_check/symmetric_delete/index.html#sparknlp.annotator.spell_check.symmetric_delete.SymmetricDeleteApproach), [SymmetricDeleteModel](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/python/sparknlp/annotator/spell_check/symmetric_delete/index.html#sparknlp.annotator.spell_check.symmetric_delete.SymmetricDeleteModel)

- Scala Docs : [SymmetricDeleteApproach](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/spell/symmetric/SymmetricDeleteApproach), [SymmetricDeleteModel](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/spell/symmetric/SymmetricDeleteModel)

## **📜 Background**


Symmetric Delete spelling correction annotators retrieve tokens and utilize distance metrics to compute possible derived words. They are inspired by [SymSpell](https://github.com/wolfgarbe/SymSpell).

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.
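The idea behind the speed-up can be sketched in plain Python. This is a minimal illustration of the symmetric delete principle, not the Spark NLP implementation: the helper names (`deletes_within`, `build_delete_index`, `candidates`) are hypothetical, and the toy vocabulary is an assumption chosen for the example. Only deletions are generated, both for the dictionary words (once, ahead of time) and for the input token (at lookup time).

```python
from itertools import combinations


def deletes_within(word, max_distance):
    """Return the word plus every string obtained by deleting up to max_distance characters."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        # Choose which character positions to keep after deleting k characters
        for kept in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in kept))
    return variants


def build_delete_index(vocabulary, max_distance):
    """Map each delete variant to the set of vocabulary words that produce it."""
    index = {}
    for word in vocabulary:
        for variant in deletes_within(word, max_distance):
            index.setdefault(variant, set()).add(word)
    return index


def candidates(token, index, max_distance):
    """Collect vocabulary words sharing at least one delete variant with the token."""
    found = set()
    for variant in deletes_within(token, max_distance):
        found |= index.get(variant, set())
    return found


# Hypothetical toy vocabulary
vocabulary = ["the", "dog", "and", "cat", "play", "together"]
index = build_delete_index(vocabulary, max_distance=1)

print(candidates("dogh", index, max_distance=1))  # {'dog'}
print(candidates("teh", index, max_distance=1))   # {'the'}
```

The lookup only returns candidates. A complete spellchecker would still compute the actual edit distance between the token and each candidate and discard those beyond the allowed maximum.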

- `SymmetricDeleteApproach` is used to train your own spellchecker model based on training data.
- `SymmetricDeleteModel` is the instantiated model of the `SymmetricDeleteApproach`. Pretrained models can be loaded using this annotator. The default model is "spellcheck_sd", if no name is provided. For available pretrained models please see the [Models Hub](https://nlp.johnsnowlabs.com/models?task=Spell+Check).

For alternative approaches to spellchecking, refer to [NorvigSweeting annotator](https://nlp.johnsnowlabs.com/docs/en/annotators#norvigsweeting-spellchecker) or [ContextSpellChecker annotator](https://nlp.johnsnowlabs.com/docs/en/annotators#contextspellchecker).

## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.1.2  spark-nlp==4.2.4



In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `TOKEN` (misspelled words are replaced by their correct form and the metadata includes the confidence score of the spelling correction)

## **🔎 Parameters**


- `deletesThreshold`: (Int) Minimum frequency of corrections a word needs to have to be considered from training. Increase if training set is LARGE (Default: 0).

- `dictionary`: (String) Path to a .txt file containing an external dictionary. The file is parsed using a "tokenPattern" (Default: \S+). If provided, the dictionary significantly boosts spell checking performance. This parameter only applies to `SymmetricDeleteApproach`.

  Example:


```
...
gummy
gummic
gummier
gummiest
gummiferous
...
```

- `dupsLimit`: (Int) Maximum number of duplicate characters in a word to consider (Default: 2).

- `frequencyThreshold`: (Int) Minimum frequency of words to be considered from training. Increase if training set is LARGE (Default: 0).

- `longestWordLength`: (Int) Length of longest word in corpus.

- `maxEditDistance`: (Int) Maximum edit distance, in characters, used to derive strings from a word (Default: 3).

- `maxFrequency`: (Int) Maximum frequency of a word in the corpus.

- `minFrequency`: (Int) Minimum frequency of a word in the corpus.
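To make the `dictionary` parameter's default "tokenPattern" more concrete, here is a small sketch of how a `\S+` pattern splits a dictionary file into entries. It uses Python's `re` module on the sample entries listed above and only illustrates the pattern itself. How Spark NLP reads the resource internally is not shown here.

```python
import re

# Assumption: the resource is split with the default tokenPattern, \S+
TOKEN_PATTERN = re.compile(r"\S+")

# Sample dictionary content, one entry per line
dictionary_text = """
gummy
gummic
gummier
gummiest
gummiferous
"""

# Every run of non-whitespace characters becomes one dictionary entry
dictionary_words = TOKEN_PATTERN.findall(dictionary_text)
print(dictionary_words)
# ['gummy', 'gummic', 'gummier', 'gummiest', 'gummiferous']
```

Because `\S+` matches any run of non-whitespace characters, entries can be separated by newlines, spaces, or tabs.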

## **Examples**

### Using a pretrained spellchecker with `SymmetricDeleteModel`

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# "spellcheck_sd" can be omitted, as it is the default value
spellChecker = SymmetricDeleteModel.pretrained("spellcheck_sd") \
    .setInputCols(["token"]) \
    .setOutputCol("spell")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker
])

data = spark.createDataFrame([["somtimes i wrrite wordz erong."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select(col('token.result').alias("before_spellchecker"), col('spell.result').alias("after_spellchecker")).show(truncate = False)

spellcheck_sd download started this may take some time.
Approximate size to download 198.1 MB
[OK!]
+--------------------------------------+--------------------------------------+
|before_spellchecker                   |after_spellchecker                    |
+--------------------------------------+--------------------------------------+
|[somtimes, i, wrrite, wordz, erong, .]|[sometimes, i, write, words, wrong, .]|
+--------------------------------------+--------------------------------------+



### Training a spellchecker using `SymmetricDeleteApproach`

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker = SymmetricDeleteApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker
])

# Corpus of texts used to train the spellchecker. In this example, the corpus consists of only one sentence.
training_df = spark.createDataFrame([["The dog and the cat play together."]]).toDF("text")

spellcheck_model = pipeline.fit(training_df)

text_df = spark.createDataFrame([["Teh dogh is eating."]]).toDF("text")

corrected_text = spellcheck_model.transform(text_df)

corrected_text.select(col('token.result').alias("before_spellchecker"), col('spell.result').alias("after_spellchecker")).show(truncate = False)

+--------------------------+-------------------------+
|before_spellchecker       |after_spellchecker       |
+--------------------------+-------------------------+
|[Teh, dogh, is, eating, .]|[The, dog, is, eating, .]|
+--------------------------+-------------------------+



Based on the training corpus, the spellchecker identified and corrected the misspelled words.

### setDictionary

In [None]:
external_dict = '''
dogs
are
'''
with open('external_dict.txt', 'w') as f:
  f.write(external_dict)

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker_1 = SymmetricDeleteApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_1")

spellChecker_2 = SymmetricDeleteApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_2") \
    .setDictionary("external_dict.txt")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker_1,
    spellChecker_2
])


training_df = spark.createDataFrame([["The dog and the cat play together."]]).toDF("text")

spellcheck_model = pipeline.fit(training_df)

text_df = spark.createDataFrame([["teh dogs aree eating."]]).toDF("text")

corrected_text = spellcheck_model.transform(text_df)

corrected_text.select(col('token.result').alias("before_spellchecker"), col('spell_1.result').alias("spellchecker_without_dict"), col('spell_2.result').alias("spellchecker_with_dict")).show(truncate = False)

+----------------------------+--------------------------+---------------------------+
|before_spellchecker         |spellchecker_without_dict |spellchecker_with_dict     |
+----------------------------+--------------------------+---------------------------+
|[teh, dogs, aree, eating, .]|[the, dog, and, eating, .]|[the, dogs, are, eating, .]|
+----------------------------+--------------------------+---------------------------+



Supplying an external dictionary in addition to the training data gives better results. In this case, the spellchecker without a dictionary made two mistakes on words that were not included in the training data ('dogs' and 'are'). Including an external dictionary with those words solved the problem.

### setDupsLimit

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker_1 = SymmetricDeleteApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_1") \
    .setDupsLimit(1)

spellChecker_2 = SymmetricDeleteApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_2") \
    .setDupsLimit(0)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker_1,
    spellChecker_2
])

training_df = spark.createDataFrame([["it was a good day, and the dog played alone."]]).toDF("text")

spellcheck_model = pipeline.fit(training_df)

text_df = spark.createDataFrame([["it was a goood dogg."]]).toDF("text")

corrected_text = spellcheck_model.transform(text_df)

corrected_text.select(col('token.result').alias("before_spellchecker"), col('spell_1.result').alias("dups_limit_1"), col('spell_2.result').alias("dups_limit_0")).show(truncate = False)

+----------------------------+--------------------------+--------------------------+
|before_spellchecker         |dups_limit_1              |dups_limit_0              |
+----------------------------+--------------------------+--------------------------+
|[it, was, a, goood, dogg, .]|[it, was, a, good, dog, .]|[it, was, a, good, dog, .]|
+----------------------------+--------------------------+--------------------------+



### setFrequencyThreshold

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker_1 = SymmetricDeleteApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_1") \
    .setFrequencyThreshold(0)

spellChecker_2 = SymmetricDeleteApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_2") \
    .setFrequencyThreshold(2)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker_1,
    spellChecker_2
])

training_df = spark.createDataFrame([["the dog and the cat play together."]]).toDF("text")

spellcheck_model = pipeline.fit(training_df)

text_df = spark.createDataFrame([["teh dogh is eating."]]).toDF("text")

corrected_text = spellcheck_model.transform(text_df)

corrected_text.select(col('token.result').alias("before_spellchecker"), col('spell_1.result').alias("frequency_threshold_0"), col('spell_2.result').alias("frequency_threshold_2")).show(truncate = False)

+--------------------------+-------------------------+--------------------------+
|before_spellchecker       |frequency_threshold_0    |frequency_threshold_2     |
+--------------------------+-------------------------+--------------------------+
|[teh, dogh, is, eating, .]|[the, dog, is, eating, .]|[the, dogh, is, eating, .]|
+--------------------------+-------------------------+--------------------------+



In this example, the spellchecker with frequencyThreshold = 2 did not correct the misspelled word "dogh", because the correct spelling ("dog") appears only once in the training data. In contrast, the word "teh" was corrected, because "the" appears at least twice in the training data.
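The effect of the threshold can be reproduced with a simple word count over the training sentence. This is a sketch under stated assumptions: the tokenization is a simplified regex, the helper `vocabulary_for` is hypothetical, and the "at least" comparison follows the explanation above rather than the library's source code.

```python
import re
from collections import Counter

training_text = "the dog and the cat play together."

# Simplified tokenization (assumption): keep lowercase alphabetic runs only
tokens = re.findall(r"[a-z]+", training_text.lower())
counts = Counter(tokens)


def vocabulary_for(counts, frequency_threshold):
    """Keep the words that appear at least frequency_threshold times."""
    return {word for word, count in counts.items() if count >= frequency_threshold}


print(counts["the"], counts["dog"])       # 2 1
print(sorted(vocabulary_for(counts, 0)))  # ['and', 'cat', 'dog', 'play', 'the', 'together']
print(sorted(vocabulary_for(counts, 2)))  # ['the']
```

With a threshold of 2, "dog" drops out of the vocabulary, so "dogh" has no correction target, while "the" remains available for "teh".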

### setMaxEditDistance

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker_1 = SymmetricDeleteApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_1") \
    .setMaxEditDistance(1)

spellChecker_2 = SymmetricDeleteApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_2") \
    .setMaxEditDistance(2)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker_1,
    spellChecker_2
])

training_df = spark.createDataFrame([["the dog and the cat play together."]]).toDF("text")

spellcheck_model = pipeline.fit(training_df)

text_df = spark.createDataFrame([["teh dogh is eating."]]).toDF("text")

corrected_text = spellcheck_model.transform(text_df)

corrected_text.select(col('token.result').alias("before_spellchecker"), col('spell_1.result').alias("max_edit_distance_1"), col('spell_2.result').alias("max_edit_distance_2")).show(truncate = False)

+--------------------------+-------------------------+-------------------------+
|before_spellchecker       |max_edit_distance_1      |max_edit_distance_2      |
+--------------------------+-------------------------+-------------------------+
|[teh, dogh, is, eating, .]|[teh, dog, is, eating, .]|[the, dog, is, eating, .]|
+--------------------------+-------------------------+-------------------------+



When maxEditDistance is 1, "teh" is not corrected to "the" because the number of edits needed (2 letters) exceeds the maximum number of edits allowed. In contrast, "dogh" needs only one edit (deleting the final "h") to become "dog", so it is corrected in both cases.
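The edit counts can be checked with a short distance function. The sketch below uses a plain Levenshtein distance (insertions, deletions, and substitutions, with no transposition operation). That choice is an assumption, made because it matches the "2 letters" count for "teh" and the output shown above. The function name `levenshtein` is illustrative and not part of Spark NLP.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance: insertions, deletions, and substitutions only."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("teh", "the"))   # 2
print(levenshtein("dogh", "dog"))  # 1
```

A distance that counts a swap of adjacent letters as a single edit would rate "teh" and "the" as one edit apart, which is why the choice of metric matters when tuning `maxEditDistance`.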