![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/02.01.ContextSpellChecker.ipynb)

# ContextSpellChecker

This notebook will cover the different parameters and usages of `ContextSpellChecker`.

**📖 Learning Objectives:**

1. Be able to detect and fix spelling errors in text.

2. Understand how to use the `ContextSpellChecker` annotator.

3. Become comfortable using the different parameters of the annotator.

4. Be able to train a new spell checker model.


**🔗 Helpful Links:**

- Documentation : [ContextSpellChecker](https://nlp.johnsnowlabs.com/docs/en/annotators#contextspellchecker)

- Python Docs : [ContextSpellChecker](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/spell_check/context_spell_checker/index.html#sparknlp.annotator.spell_check.context_spell_checker.ContextSpellCheckerModel)

- Scala Docs : [ContextSpellChecker](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/spell/context/ContextSpellCheckerModel.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public/).

## **📜 Background**

Text data originating from social media or extracted from images using Optical Character Recognition (OCR) usually contains typos, misspellings, spurious symbols, or other errors that can impact machine learning models trained on this data. For example, if the word `John` is present in the data both with the correct spelling and as `J0hn` (a zero replaced the letter o), a model would treat them as two separate words, which could cause unexpected outcomes in its predictions.

**Spell Checking** is a very important task in NLP pipelines that can help fix these kinds of errors in the data. Being able to rely on correct data, without spelling problems, reduces vocabulary sizes at different stages in the pipeline and can improve the performance of the downstream models.

### 🤓 More details

Spell Checkers can recommend corrections on three levels: 

- _Subword_: relative cost of different correction candidates according to the character-level edit operations they require.
- _Word_: different correction candidates for each word.
- _Sentence_: surrounding text of each word, i.e., its context.

In the Spark NLP ecosystem, we implemented three types of spell checkers:

- [NorvigSweeting](https://nlp.johnsnowlabs.com/docs/en/annotators#norvigsweeting-spellchecker): Retrieves tokens and makes corrections automatically if not found in an English dictionary.
- [SymmetricDelete](https://nlp.johnsnowlabs.com/docs/en/annotators#symmetricdelete-spellchecker): Uses the Symmetric Delete spelling correction algorithm.
- [ContextSpellChecker](https://nlp.johnsnowlabs.com/docs/en/annotators#contextspellchecker): Uses contextual information to both detect errors and produce the best corrections.

The first two annotators don't take into account the context around the words, while the last one is a deep-learning model that considers a window of a few words around them. Taking the context into account can help determine the correct word even when more than one candidate is feasible (that is, present in the dictionary). Let's illustrate this with the word `siter`. This word is not part of the English dictionary, so we can check which word could have been intended by making only one letter change:

- **sister**, by adding one `s` to the word
- **site**, by removing the `r`
- **sites**, by replacing `r` by `s`

All three of these words exist in the English dictionary, and which one to choose depends on the context. For example, which one should we use in the sentence "I will call my siter."? With the context, the answer is clear: **sister**.
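The candidate search described above can be sketched in plain Python. This is only an illustration of the idea, not how Spark NLP generates candidates internally; the helper `edits1` and the tiny `dictionary` are hypothetical names introduced for this example.

```python
import string


def edits1(word):
    """Return every string one insert, delete, replace, or adjacent swap away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


# Hypothetical mini-dictionary, for illustration only
dictionary = {"i", "will", "call", "my", "sister", "site", "sites"}

# Keep only the one-edit variants that are real dictionary words
candidates = sorted(edits1("siter") & dictionary)
print(candidates)
```

Without context, all returned candidates are equally plausible, which is the gap a context-aware model is meant to close.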

In this notebook, we will focus on the `ContextSpellChecker` annotator: how to train a custom model with `ContextSpellCheckerApproach` and how to use pretrained models with `ContextSpellCheckerModel`. Both annotators take a `TOKEN` type column as input and produce a `TOKEN` type column as output.

The model uses a deep learning approach to obtain context information, and a [Viterbi decoder](https://en.wikipedia.org/wiki/Viterbi_decoder) applied to a [trellis](https://en.wikipedia.org/wiki/Trellis_modulation) of the candidate words to decide which word to suggest (the one with the lowest cost, given by the [Damerau–Levenshtein distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)). This distance considers four valid operations, each with a cost equal to one:
  - Add one letter
  - Delete one letter
  - Replace one letter with another
  - Swap two adjacent letters
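To make the distance concrete, here is a minimal sketch of the restricted Damerau–Levenshtein distance (also called optimal string alignment), where each insert, delete, replace, or adjacent swap costs one. The function name `osa_distance` is an assumption for this example, and the implementation inside Spark NLP may differ (for instance, it can use learned weights).

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance with unit costs."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or match)
            )
            # Swap of two adjacent letters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


for wrong, right in [("siter", "sister"), ("Plaese", "Please"), ("waelht", "wealth")]:
    print(wrong, right, osa_distance(wrong, right))
```

Counting a swap as a single operation matters for typing errors: `Plaese` is one swap away from `Please`, while `waelht` needs two swaps to become `wealth`.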

To determine the correction candidates, the model uses two sources:

- vocabulary built from the training corpus during model training that remains immutable
- special classes for dealing with special types of words like numbers or dates. These are configurable, and you can modify them so they can adjust better to your data. Usual classes are:
  - `_AGE_`: age tokens like ‘21-year-old’.
  - `_LOC_`: tokens representing locations like a city, state, country, etc.
  - `_DATE_`: tokens representing dates like ‘Jan-03’.
  - `_NAME_`: tokens representing names and surnames.
  - `_NUM_`: tokens representing numbers, like 22 or twenty-two.

These classes can be extended in two ways on the `ContextSpellCheckerModel` annotator:
  - Using custom vocabularies, set with the method `.updateVocabClass(label, vocabulary, append=True)`, which takes the label, a list of words, and a boolean indicating whether the vocabulary should be appended to the existing one or replace it. For example:
  ```python
  spellModel.updateVocabClass('_NAME_', ['Monika', 'Agnieszka', 'Inga', 'Jowita', 'Melania'], True)
  ```
  - Using regex patterns and setting with the method `.updateRegexClass(label, regex_pattern)`, by passing the label and the regex pattern to use. This always substitutes the pattern if it already existed. For example:
  ```python
  spellModel.updateRegexClass('_DATE_', '(january|february|march)-(0?[1-9]|[12][0-9]|3[01])')
  ```
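Before registering a pattern, it can help to try it against a few sample tokens. The sketch below uses Python's `re` module as a stand-in; Spark NLP evaluates the pattern on the JVM side, so the supported syntax may differ, and the pattern and sample tokens here are assumptions chosen for illustration. Note that a bracket expression such as `[0-31]` is a character class that matches a single character (`0` to `3`, or `1`), not the numbers 0 to 31, so a day range needs explicit alternatives.

```python
import re

# Hypothetical pattern: a month name, a hyphen, then a day from 1 to 31
date_pattern = re.compile(r"(january|february|march)-(0?[1-9]|[12][0-9]|3[01])")

samples = ["january-15", "march-31", "march-32", "april-02", "february-7"]

# fullmatch requires the whole token to match, not just a prefix
matches = {token: bool(date_pattern.fullmatch(token)) for token in samples}
for token, ok in matches.items():
    print(f"{token}: {ok}")
```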

## **🎬 Colab Setup**

Before going through the annotators, let's set up the environment and start a `spark` session.

In [None]:
!pip install -qU pyspark  spark-nlp



In [None]:
import pandas as pd

from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.sql.functions import array_contains

import sparknlp
from sparknlp.annotator import (
    Tokenizer,
    ContextSpellCheckerModel,
    ContextSpellCheckerApproach,
    SentenceDetector,
    NorvigSweetingModel,
    SymmetricDeleteModel,
)
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, LightPipeline


Starting the spark session:

In [None]:
spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.3.2
Apache Spark version: 3.3.2


## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `TOKEN`

## **🔎 Parameters**

- **wordMaxDistance**: Integer number representing the maximum edit distance for the generated candidates for every word. Higher values increase the number of candidates and can make the algorithm slower, but small values can cause the algorithm to miss the correction. Default value: `3`, minimum of `1`.
- **maxCandidates**: An integer number representing the maximum candidates for every word. This limits the returned list of candidates if they are too many. Default value: `6`
- **caseStrategy**: What case combinations to try when generating candidates. Possible choices are:
  - 0: use only upper case letters
  - 1: First letter is upper, the others are lower case
  - 2 (default): uses all letters
- **errorThreshold**: Threshold perplexity for a word to be considered as an error. No default value.
- **tradeoff**: Tradeoff between the cost of a word error and a transition in the language model.  Default value: `18.0`
- **maxWindowLen**: Maximum size for the window used to remember history prior to every correction. Default value: `5`
- **configProtoBytes**: ConfigProto from tensorflow, serialized into byte array (see [Tensorflow JVM documentation](https://www.tensorflow.org/jvm/api_docs/java/org/tensorflow/proto/framework/ConfigProto) for details). No default value.
- **compareLowcase**: If `True`, will compare tokens with vocabulary tokens all in lowercase (match even if the case is different).  Default value: `False`
- **useNewLines**: When set to `True`, new line char `\n` will be treated as any other character. When set to `False` correction is applied on paragraphs as defined by newline characters. Default value: `False`
- **gamma**: Controls the influence of individual word frequency in the decision.  Default value: `120.0`
- **vocabFreq**: Frequency of the words in the vocabulary. Constructed during training, can be overwritten. No default value.
- **idsVocab**: Mapping of ids to vocabulary. Constructed during training, can be overwritten. No default value.
- **vocabIds**: Mapping of vocabulary to ids. Constructed during training, can be overwritten. No default value.
- **classes**: Classes the spell checker recognizes. Constructed during training, can be overwritten. No default value. There are two different types: **vocabulary** based and **regex** based:
  * Vocabulary based classes can propose correction candidates from the provided vocabulary, e.g. a dictionary of names.
  * Regex classes are defined by a regular expression, and they can be used to generate correction candidates for items like numbers, dates, etc.
- **weights**: Levenshtein weights. Constructed during training, can be overwritten. No default value
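The `errorThreshold` parameter is expressed in terms of perplexity. As a reminder of the general definition (not of the exact per-word quantity Spark NLP compares against the threshold, which is not specified here), perplexity is the exponential of the average negative log-probability a language model assigns to the tokens. The probabilities below are made up for illustration.

```python
import math


def perplexity(probabilities):
    """Perplexity of a sequence, given the probability the model assigned to each token."""
    avg_neg_log = -sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(avg_neg_log)


# Made-up token probabilities, for illustration only
likely_sequence = [0.25, 0.25, 0.25, 0.25]
surprising_sequence = [0.25, 0.25, 0.001, 0.25]  # one very unexpected token

print(perplexity(likely_sequence))
print(perplexity(surprising_sequence))
```

A single unexpected token raises the perplexity noticeably, which is the kind of signal a threshold can use to decide whether a token is worth correcting.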

### ✌ Using pretrained Spellchecking Model

We use the `ContextSpellCheckerModel` annotator to load a pretrained model and analyze this sentence:

> "**Plaese** **alliow** me **tao** **introdduce** **myhelf**, I am a man of **waelht** **und** **tiaste**"

The pretrained model is [spellcheck_dl](https://nlp.johnsnowlabs.com/2022/04/02/spellcheck_dl_en_2_4.html), a generic context-aware spell checker model for the English language.

To check the list of available pretrained models, visit [NLP Models Hub](https://nlp.johnsnowlabs.com/models?task=Spell+Check).

In [None]:
example_sentence = "Plaese alliow me tao introdduce myhelf, I am a man of waelht und tiaste"

In [None]:
def get_light_pipeline(spellModel):
  documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

  tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
  pipeline = Pipeline(stages=[documentAssembler, tokenizer, spellModel])

  empty_ds = spark.createDataFrame([[""]]).toDF("text")
  lp = LightPipeline(pipeline.fit(empty_ds))
  return lp

In [None]:
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

spellModel = (
    ContextSpellCheckerModel.pretrained("spellcheck_dl")
    .setInputCols("token")
    .setOutputCol("checked")
)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, spellModel])

empty_ds = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_ds))

spellcheck_dl download started this may take some time.
Approximate size to download 95.1 MB
[OK!]


This pretrained model has four special classes, two based on vocabularies (`VocabParser`) and two based on regex rules (`RegexParser`):

In [None]:
spellModel.getWordClasses()

['(_NAME_,VocabParser)',
 '(_DATE_,RegexParser)',
 '(_NUM_,RegexParser)',
 '(_LOC_,VocabParser)']

Let's check the results obtained by the model:

In [None]:
result = lp.annotate(example_sentence)

for token, checked in zip(result["token"], result["checked"]):
  print(f"{token} => {checked}")

Plaese => Please
alliow => allow
me => me
tao => to
introdduce => introduce
myhelf => myself
, => ,
I => I
am => am
a => a
man => man
of => of
waelht => wealth
und => and
tiaste => taste


This pretrained model fixed all the mistakes with the default parameter values. What happens when we change some of the parameters? Below, we lower `wordMaxDistance` to `1`:

In [None]:
spellModel_modified = (
    ContextSpellCheckerModel.pretrained("spellcheck_dl")
    .setInputCols("token")
    .setOutputCol("checked")
    .setWordMaxDistance(1)
)

lp = get_light_pipeline(spellModel_modified)
result = lp.annotate("Plaese alliow me tao introdduce myhelf, I am a man of waelht und tiaste")

for token, checked in zip(result["token"], result["checked"]):
  print(f"{token} => {checked}")

spellcheck_dl download started this may take some time.
Approximate size to download 95.1 MB
[OK!]
Plaese => Please
alliow => allow
me => me
tao => tao
introdduce => introduce
myhelf => myself
, => ,
I => I
am => am
a => a
man => man
of => of
waelht => waelht
und => und
tiaste => taste


We can see that some words were not fixed: with the changed parameter, the model could not find a suitable candidate. We encourage you to experiment with the parameters to obtain the desired result.

Final observations:

1. The parameter `tradeoff` penalizes the relevance a word has when it is present in the dictionary. In other words, with smaller values of `tradeoff`, the model is more willing to replace a word that exists in the dictionary with a candidate (more weight to the context, less to the vocabulary).
2. The parameter `errorThreshold` acts as a filter to determine whether a word should be corrected. Lower thresholds can increase the accuracy of the model but decrease the prediction speed. For reference, see the benchmark table below:

<div align="center">

| threshold | total_time | fscore |
|-----------|------------|--------|
| 8f        | 405s       | 52.69  |
| 10f       | 357s       | 52.43  |
| 12f       | 279s       | 52.25  |
| 14f       | 234s       | 52.14  |

**Sometimes, the prediction speed can get a relevant boost with a small impact on accuracy.**
</div>
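To build intuition for the `tradeoff` observation, here is a toy Viterbi search over a trellis of candidates. Everything in it is an assumption made for illustration: the candidate word costs, the `bigram_cost` table standing in for the language model, and the way `tradeoff` multiplies the word cost. The actual cost formula used by Spark NLP may differ.

```python
def viterbi(trellis, transition_cost, tradeoff):
    """Return (cost, path) of the cheapest path through a list of {candidate: word_cost} columns."""
    # best[c] = (total cost of the cheapest path ending in candidate c, that path)
    best = {cand: (tradeoff * cost, [cand]) for cand, cost in trellis[0].items()}
    for column in trellis[1:]:
        new_best = {}
        for cand, word_cost in column.items():
            # Cheapest way to reach `cand` from any candidate in the previous column
            prev_cost, prev_path = min(
                ((best[p][0] + transition_cost(p, cand), best[p][1]) for p in best),
                key=lambda item: item[0],
            )
            new_best[cand] = (prev_cost + tradeoff * word_cost, prev_path + [cand])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


# Toy trellis for "call my site": word cost 0 keeps the typed word, 1 is one edit away
trellis = [{"call": 0}, {"my": 0}, {"site": 0, "sister": 1, "sites": 1}]

# Made-up context costs standing in for the language model (lower = more likely)
bigram_cost = {("call", "my"): 1.0, ("my", "sister"): 1.0, ("my", "site"): 3.0, ("my", "sites"): 4.0}


def transition_cost(prev, cur):
    return bigram_cost.get((prev, cur), 6.0)


for tradeoff in (1.0, 5.0):
    cost, path = viterbi(trellis, transition_cost, tradeoff)
    print(tradeoff, path, cost)
```

In this toy setup, the small weight lets the context win and the valid word `site` is replaced by `sister`, while the large weight makes edits expensive and the typed word is kept.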


## ⚡ **Training a context-aware spell checker**

To train a new model, we need to use the `ContextSpellCheckerApproach` annotator.

The parameters of the class are:

- **wordMaxDistance**: Integer number representing the maximum edit distance for the generated candidates for every word. Higher values increase the number of candidates and can make the algorithm slower, but small values can cause the algorithm to miss the correction. Default value: `3`, minimum of `1`.
- **maxCandidates**: An integer number representing the maximum candidates for every word. This limits the returned list of candidates if they are too many. Default value: `6`
- **caseStrategy**: What case combinations to try when generating candidates. Possible choices are:
  - 0: use only upper case letters
  - 1: First letter is upper, the others are lower case
  - 2 (default): uses all letters
- **errorThreshold**: Threshold perplexity for a word to be considered as an error. No default value.
- **tradeoff**: Tradeoff between the cost of a word error and a transition in the language model.  Default value: `18.0`
- **maxWindowLen**: Maximum size for the window used to remember history prior to every correction. Default value: `5`
- **configProtoBytes**: ConfigProto from tensorflow, serialized into byte array (see [Tensorflow JVM documentation](https://www.tensorflow.org/jvm/api_docs/java/org/tensorflow/proto/framework/ConfigProto) for details). No default value.
- **languageModelClasses**: Number of classes to use during factorization of the softmax output in the language model. No default value, depends on the vocabulary (learned during training). Can be overwritten.
- **epochs**: Number of epochs to train the language model. Default value: `2`
- **batchSize**: Batch size for the training phase. Default value: `24`
- **initialRate**: Initial learning rate for the training phase. Default value: `0.7`
- **finalRate**: Final learning rate. Default value: `0.0005`
- **validationFraction**: Percentage of datapoints to use for validation. Default value: `0.1`
- **minCount**: Minimum number of times a token should appear to be included in vocabulary. Default value: `3.0`
- **compoundCount**: Minimum number of times a compound word should appear to be included in vocabulary. Default value: `5`
- **classCount**: Minimum number of times a word needs to appear in the corpus to not be considered part of a special class. Default value: `15.0`
- **maxSentLen**: Maximum length for a sentence - internal use during training. Default value: `250`
- **graphFolder**: Folder path that contain external graph files. No default value

The training data for the `ContextSpellCheckerApproach` annotator is just a collection of texts. We don't need labeled data to train this model, as it uses unsupervised training to generate a language model. As usual, a bigger corpus yields a better model, provided the data is a good sample of the real-world scenario you plan to use the model for.

We will use Arthur Conan Doyle's *The Adventures of Sherlock Holmes* as sample data.

In [None]:
# Download the book
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/holmes.txt

In [None]:
path = "holmes.txt"

corpus = spark.read.text(path).toDF("text")
corpus.show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|THE ADVENTURES OF SHERLOCK HOLMESArthur Conan Doyle Table of contents A Scandal in Bohemia The Re...|
+----------------------------------------------------------------------------------------------------+



We will transform the text into our Spark NLP basic structure: `document`, and then we will split it into tokens with the `Tokenizer` annotator.

In [None]:
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Splits sentences into tokens
tokenizer = Tokenizer().setInputCols("document").setOutputCol("token")

spellChecker = (
    ContextSpellCheckerApproach()
    .setInputCols("token")
    .setOutputCol("checked")
    .setBatchSize(1) # Batch size 1 to run in Colab
    .setEpochs(1)
    .setWordMaxDistance(3) # Maximum edit distance to consider
    .setMaxWindowLen(3) # important to find context
    .setMinCount(3.0) # Removes words that appear less than that from the vocabulary
    .setCompoundCount(5) # Removes compound words that appear less than that from the vocabulary
    .setClassCount(10.0) # Minimum occurrences of a class
)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, spellChecker])

Let's try to train the model directly:

In [None]:
try:
  model = pipeline.fit(corpus)
except Exception as e:
  print(e)

requirement failed: We couldn't find any suitable graph for 2000 classes, vocabSize: 3080


The error message says that no TensorFlow graph is available for this specific corpus, which yields 2000 classes and a vocabulary size of 3080. The vocabulary size depends on several parameters of the annotator, such as `minCount`, `compoundCount`, and `classCount`, and, of course, on the corpus itself. As described earlier, the classes can also be extended with manually chosen ones using vocabularies (for example, names, products, etc.) or regex patterns (for dates, numbers, etc.).
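The effect of a minimum count on vocabulary size can be sketched with a few lines of plain Python. This is a simplified illustration with a whitespace tokenizer and a made-up corpus, not the vocabulary-building logic of `ContextSpellCheckerApproach`; `build_vocabulary` and `toy_corpus` are hypothetical names.

```python
from collections import Counter


def build_vocabulary(sentences, min_count):
    """Keep the tokens that appear at least `min_count` times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence.lower().split())
    return {token for token, n in counts.items() if n >= min_count}


# Made-up corpus, for illustration only
toy_corpus = [
    "holmes looked at the letter",
    "the letter was from watson",
    "watson looked at holmes",
    "the case was closed",
]

for min_count in (1, 2, 3):
    print(min_count, len(build_vocabulary(toy_corpus, min_count)))
```

Raising the minimum count shrinks the vocabulary, which in turn changes the graph dimensions the training step needs.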

To create the TensorFlow graph for a specific corpus, you can follow the steps in [this notebook](https://github.com/JohnSnowLabs/spark-nlp/blob/master/python/tensorflow/spellchecker/create_spell_model.ipynb). In the meantime, we will set the parameter `languageModelClasses` to `1650` so that we can use an existing graph.

> Please note that we limit the number of sentences so that the deep learning model can be trained in Colab without crashing. This is for educational purposes only.

### Preparing the corpus for training

We will use the `SentenceDetector` annotator to split the book into sentences. Then we will sample a number of sentences that Colab is able to process. As a deep learning model, it demands heavy computation during training. For big datasets, it is recommended to train on a Spark cluster.

In [None]:
# Limit the size of the data so that we can run it on Colab

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = SentenceDetector().setInputCols("document").setOutputCol("sentence")

sentences = sentenceDetector.transform(documentAssembler.transform(corpus))

# Get 10% of the sentences only
sample = sentences.select(F.explode("sentence.result").alias("sentence")).sample(fraction=0.1, seed=42)
sample.count()

561

### Training the model

Create a new pipeline to process this sample from the beginning (DocumentAssembler -> ContextSpellChecker)

In [None]:
# Note that we use `sentence` as the input column name
documentAssembler = DocumentAssembler().setInputCol("sentence").setOutputCol("document")

tokenizer = Tokenizer().setInputCols("document").setOutputCol("token")

spellChecker = (
    ContextSpellCheckerApproach()
    .setInputCols("token")
    .setOutputCol("checked")
    .setBatchSize(1) # Batch size 1 to run in Colab
    .setEpochs(1)
    .setWordMaxDistance(3) # Maximum edit distance to consider
    .setMaxWindowLen(3) # important to find context
    .setMinCount(3.0) # Removes words that appear less than that from the vocabulary
    .setCompoundCount(5) # Removes compound words that appear less than that from the vocabulary
    .setClassCount(10.0) # Minimum occurrences of a class
    .setLanguageModelClasses(1650) # Value for which a TF graph is available
)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, spellChecker])

In [None]:
%%time
try:
  model = pipeline.fit(sample)
except Exception as e:
  print(e)

CPU times: user 449 ms, sys: 55.6 ms, total: 505 ms
Wall time: 1min 29s


Try the trained model on an example sentence:

In [None]:
lp = LightPipeline(model)

test = lp.annotate("Sherlok Hlmes founds the solution to the mistrey")
for token, checked in zip(test["token"], test["checked"]):
  print(f"{token} => {checked}")

Sherlok => Sherlock
Hlmes => Holmes
founds => found
the => the
solution => solution
to => to
the => the
mistrey => mystery


> The model was trained with 500+ sentences only, but got some understanding of Conan Doyle's universe.

## Additional resources

For further details, check the following resources:

- [Applying Context Aware Spell Checking in Spark NLP](https://medium.com/spark-nlp/applying-context-aware-spell-checking-in-spark-nlp-3c29c46963bc)
- [Training a Contextual Spell Checker for Italian Language](https://towardsdatascience.com/training-a-contextual-spell-checker-for-italian-language-66dda528e4bf)