![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/07.0.Context_Spell_Checker.ipynb)

# Context Spell Checker

In [None]:
! pip install -q pyspark==3.4.1 spark-nlp==5.3.2

In [2]:
import sparknlp

spark = sparknlp.start() # for GPU training >> sparknlp.start(gpu = True)

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 5.3.2
Apache Spark version: 3.4.1


<H1> Noisy Channel Model Spell Checker - Introduction </H1>

Blog post: https://medium.com/spark-nlp/applying-context-aware-spell-checking-in-spark-nlp-3c29c46963bc

<div>
<p><br/>
The idea for this annotator is to have a flexible, configurable and "re-usable by parts" model.<br/>
Flexibility is the ability to accommodate different use cases for spell checking like OCR text, keyboard-input text, ASR text, and general spelling problems due to orthographic errors.<br/>
We say this is a configurable annotator, as you can adapt it yourself to different use cases avoiding re-training as much as possible.<br/>
</p>
</div>


<b> Spell Checking at three levels: </b>
The final ranking of a correction sequence is affected by three things,


1. Different correction candidates for each word - __word level__.
2. The surrounding text of each word, i.e. its context - __sentence level__.
3. The relative cost of different correction candidates according to the edit operations they require at the character level - __subword level__.




### Initial Setup
As usual in Spark NLP, let's start by building a pipeline: a _spell correction pipeline_. We will use a pretrained model from our library.

In [3]:
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

from IPython.utils.text import columnize

In [4]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = RecursiveTokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")\
    .setPrefixes(["\"", "(", "[", "\n"])\
    .setSuffixes([".", ",", "?", ")","!", "'s"])

spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_dl')\
    .setInputCols("token")\
    .setOutputCol("checked")\
    .setErrorThreshold(4.0)\
    .setTradeoff(6.0)

finisher = Finisher()\
    .setInputCols("checked")

pipeline = Pipeline(stages = [
     documentAssembler,
     tokenizer,
     spellModel,
     finisher
  ])

empty_ds = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_ds))

spellcheck_dl download started this may take some time.
Approximate size to download 95.1 MB
[OK!]


At this point we have our spell checking pipeline ready. Let's see what we can do with it:

In [5]:
lp.annotate("Plaese alliow me tao introdduce myhelf, I am a man of waelth und tiaste")

{'checked': ['Phase',
  'allow',
  'me',
  'to',
  'introduce',
  'myself',
  ',',
  'I',
  'am',
  'a',
  'man',
  'of',
  'wealth',
  'and',
  'taste']}

### Word Level Corrections
Continuing with our pretrained model, let's see how corrections work at the word level. Each Context Spell Checker model that you can find in the Spark NLP library comes with two sources for word candidates:
+ a general vocabulary that is built during training (and remains immutable during the life of the model), and
+ special classes for dealing with special types of words like numbers or dates. These are dynamic, and you can modify them so they adjust better to your data.

The general vocabulary is learned during training and cannot be modified; the special classes, however, can be updated on a pretrained model after training.
This means you can modify how existing classes produce corrections, but not the number or type of the classes.
Let's see how we can accomplish this.

In [6]:
# First let's start with a loaded model, and check which classes it has been trained with
spellModel.getWordClasses()

['(_NAME_,VocabParser)',
 '(_LOC_,VocabParser)',
 '(_DATE_,RegexParser)',
 '(_NUM_,RegexParser)']

We have four classes, of two different types: some are vocabulary based and others are regex based,
+ __Vocabulary based classes__ can propose correction candidates from the provided vocabulary, for example a dictionary of names.
+ __Regex classes__ are defined by a regular expression, and they can be used to generate correction candidates for things like numbers. Internally, the Spell Checker will enumerate your regular expression and build a fast automaton, not only for recognizing the word (a number in this example) as valid and preserving it, but also for generating correction candidates.
Thus the regex must be finite (it must define a finite regular language).

Now suppose that you have a new friend from Poland whose name is 'Jowita', let's see how the pretrained Spell Checker does with this name.

In [7]:
beautify = lambda annotations: [columnize(sent['checked']) for sent in annotations]

In [8]:
# Foreign name without errors
sample = 'We are going to meet Jowita in the city hall.'
beautify([lp.annotate(sample)])

['We  are  going  to  meet  With  in  the  city  hall  .\n']

The result is not very good: the name was replaced by 'With'. That's because the Spell Checker has been trained mainly on American English text. We can do better; let's see how.

## Updating a predefined word class

### Vocabulary Classes

In order for the Spell Checker to be able to preserve words, like a foreign name, we have the option to update existing classes so they can cover new words.

In [9]:
# add some more, in case we need them
spellModel.updateVocabClass('_NAME_', ['Monika', 'Agnieszka', 'Inga', 'Jowita', 'Melania'], True)

# Let's see what we get now
sample = 'We are going to meet Jowita at the city hall.'
beautify([lp.annotate(sample)])

['We  are  going  to  meet  Jowita  at  the  city  hall  .\n']

Much better, right? Now suppose that we want to be able to not only preserve the word, but also to propose meaningful corrections to the name of our foreign friend.

In [10]:
# Foreign name with an error
sample = 'We are going to meet Jovita in the city hall.'
beautify([lp.annotate(sample)])

['We  are  going  to  meet  Jowita  in  the  city  hall  .\n']

Here we were able to add the new word to the class and propose corrections for it. The new word has also been treated as a name, meaning that the model used information about the typical context for names in order to produce the best correction.
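
To build some intuition for how a vocabulary class can turn 'Jovita' into 'Jowita', here is a minimal pure-Python sketch. It is not how Spark NLP generates candidates internally (the library builds automata for that); `levenshtein`, `vocab_candidates` and `max_distance` are hypothetical names used only for illustration.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def vocab_candidates(word, vocabulary, max_distance=1):
    """Return vocabulary entries within max_distance edits of word."""
    return [v for v in vocabulary if levenshtein(word, v) <= max_distance]


# Same names that were added to the _NAME_ class above
names = ['Monika', 'Agnieszka', 'Inga', 'Jowita', 'Melania']
print(vocab_candidates('Jovita', names))
```

Only 'Jowita' is a single edit away from 'Jovita', so it is the only candidate this toy class would propose; in the real model the context then decides whether the candidate is actually used.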

### Regex Classes
We can do something similar for classes defined by a regex. For example, we can add a regex for a special date format, so that the model not only preserves dates in that format but is also able to correct them.

In [11]:
# Date with custom format
sample = 'We are going to meet her in the city hall on february-3.'
beautify([lp.annotate(sample)])

['We  are  going  to  meet  her  in  the  city  hall  on  February  .\n']

In [12]:
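# this is a sample regex; for simplicity it does not cover all months
# note: [0-31] is a character class matching a single digit (0, 1, 2 or 3), not the range 0 to 31
spellModel.updateRegexClass('_DATE_', '(january|february|march)-[0-31]')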
beautify([lp.annotate(sample)])

['We  are  going  to  meet  her  in  the  city  hall  on  february-3  .\n']

Now our date wasn't destroyed!
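
It can help to check which strings a regex class would accept before handing it to the model. The sketch below uses Python's `re` module as a stand-in; whether Spark NLP's regex dialect behaves identically, and whether it accepts the alternative `day_range_pattern`, are assumptions that were not verified here. Both patterns are finite, since neither uses `*` or `+`.

```python
import re

# The regex used in the cell above: [0-31] is a character class (digits 0-3)
pattern_in_notebook = r'(january|february|march)-[0-31]'

# Hypothetical alternative that spells out days 1 to 31 explicitly
day_range_pattern = r'(january|february|march)-([1-9]|[12][0-9]|3[01])'


def accepts(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


for text in ['february-3', 'february-12', 'february-31', 'february-32']:
    print(text, accepts(pattern_in_notebook, text), accepts(day_range_pattern, text))
```

With the original pattern only single-digit days from 0 to 3 are accepted, which is enough for the 'february-3' sample but would not preserve 'february-12'.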

In [13]:
# now check that it produces good corrections to the date
sample = 'We are going to meet her in the city hall on febbruary-3.'
beautify([lp.annotate(sample)])

['We  are  going  to  meet  her  in  the  city  hall  on  febbruary-3  .\n']

The intent is for the model to also produce good corrections for the special regex class, although in the run shown above the misspelled month was left unchanged. Remember that each regex that you enter into the model must be finite. In all these examples, the new definitions for our classes didn't prevent the model from continuing to use the context to produce corrections. Let's see why being able to use the context is important.

### Sentence Level Corrections
The Spell Checker can leverage the context of words for ranking different correction sequences. Let's take a look at some examples,

In [14]:
# check for the different occurrences of the word "siter"
example1 = [
    "Due to bad weather, we had to move to a different siter.",\
    "We travelled to three siter in the summer."]
beautify(lp.annotate(example1))

['Due  to  bad  weather  ,  we  had  to  move  to  a  different  site  .\n',
 'We  travelled  to  three  sites  in  the  summer  .\n']

In [15]:
# check for the different occurrences of the word "ueather"
example2 = ["During the summer we have the best ueather.",\
    "I have a black ueather jacket, so nice."]
beautify(lp.annotate(example2))

['During  the  summer  we  have  the  best  weather  .\n',
 'I  have  a  black  leather  jacket  ,  so  nice  .\n']

Notice that in the first example, 'siter' is indeed a valid English word, <br/> https://www.merriam-webster.com/dictionary/siter <br/>
The only way to customize how the use of context is performed is to train the language model by training a Spell Checker from scratch. If you want to be able to train your custom language model, please refer to the Training notebook.
Now we've learned how the context can help to pick the best possible correction, and why it is important to be able to leverage the context even when the other parts of the Spell Checker were updated.
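
As a rough illustration of why context matters, the toy below picks between 'weather' and 'leather' using made-up bigram counts. This is only a sketch: the pretrained model relies on a neural language model rather than a count table, and `BIGRAM_COUNTS` and `best_candidate` are hypothetical names with invented numbers.

```python
# Invented counts of (previous word, candidate) pairs, for illustration only
BIGRAM_COUNTS = {
    ('best', 'weather'): 8,
    ('best', 'leather'): 1,
    ('black', 'leather'): 6,
    ('black', 'weather'): 0,
}


def best_candidate(previous_word, candidates):
    """Pick the candidate seen most often after previous_word."""
    return max(candidates, key=lambda w: BIGRAM_COUNTS.get((previous_word, w), 0))


candidates = ['weather', 'leather']
print(best_candidate('best', candidates))
print(best_candidate('black', candidates))
```

Both candidates are one edit away from 'ueather', so only the surrounding words can tell them apart.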

### Subword level corrections
Another kind of fine-tuning that our Spell Checker accepts is assigning different costs to the different edit operations that are necessary to transform a word into a correction candidate.
So, why is this important? Errors can come from different sources,
+ Homophones are words that sound similar, but are written differently and have different meanings. Some examples: {there, their, they're}, {see, sea}, {to, too, two}. You will typically see these errors in text obtained by Automatic Speech Recognition (ASR).
+ Characters can also be confused because they look similar. So a 0 (zero) can be confused with an O (capital o), or a 1 (number one) with an l (lowercase l). These errors typically come from OCR.
+ Input device related: sometimes keyboards cause certain patterns to be more likely than others due to letter locations, for example in a QWERTY keyboard.
+ Last but not least, orthographic errors, related to the writer making mistakes: forgetting a double consonant or using it in the wrong place, interchanging letters (e.g., 'becuase' for 'because'), and many others.

The goal is to continue using all the other features of the model and still be able to adapt the model to handle each of these cases in the best possible way. Let's see how to accomplish this.

In [16]:
# sending or lending ?
sample = 'I will be 1ending him my car'
lp.annotate(sample)

{'checked': ['I', 'will', 'be', 'tending', 'him', 'my', 'car']}

In [17]:
# let's make replacing a '1' with an 'l' cheaper
weights = {'1': {'l': .01}}
spellModel.setWeights(weights)
lp.annotate(sample)

{'checked': ['I', 'will', 'be', 'lending', 'him', 'my', 'car']}

Assembling this matrix by hand could be a daunting challenge. There is a Python script that can do this for you.
This is planned to be included soon as an option during training for the Context Spell Checker. Stay tuned for new releases!
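
To see why a cheaper '1' to 'l' substitution favours 'lending', here is a small weighted edit-distance sketch. It reuses the same nested-dictionary shape as the `weights` passed to `setWeights` above, but the function itself (`weighted_edit_cost`) is a hypothetical illustration and not the library's internal cost computation.

```python
def weighted_edit_cost(source, target, weights=None):
    """Edit distance where substitutions listed in weights cost less than 1.

    weights maps a source character to {target character: cost};
    every other insertion, deletion or substitution costs 1.0.
    """
    weights = weights or {}
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, 1):
        curr = [float(i)]
        for j, t in enumerate(target, 1):
            sub = 0.0 if s == t else weights.get(s, {}).get(t, 1.0)
            curr.append(min(
                prev[j] + 1.0,      # delete s
                curr[j - 1] + 1.0,  # insert t
                prev[j - 1] + sub,  # substitute s with t
            ))
        prev = curr
    return prev[-1]


weights = {'1': {'l': .01}}
for candidate in ['tending', 'lending', 'sending']:
    print(candidate,
          weighted_edit_cost('1ending', candidate),
          weighted_edit_cost('1ending', candidate, weights))
```

Without custom weights all three candidates are tied at one edit, so the model has to rely on context alone; with the weight in place, 'lending' becomes by far the cheapest correction.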

## Advanced - the mysterious tradeoff parameter
There's a clear tension between two forces here,
+ The context information: by which the model wants to change words based on the surrounding words.
+ The word information: by which the model wants to preserve as much of the input word as possible to avoid destroying the input.

Changing words that are in the vocabulary for others that seem more suitable according to the context is one of the most challenging tasks in spell correction. This is because you run into the risk of destroying existing 'good' words.
The models that you will find in the Spark-NLP library have already been configured in a way that balances these two forces and produces good results in most of the situations. But your dataset can be different from the one used to train the model.
So we encourage you to play a bit with the hyperparameters. To give you an idea of how they can be modified, let's look at the following example,

In [18]:
sample = 'have you been two the falls?'
beautify([lp.annotate(sample)])

['have  you  been  to  the  falls  ?\n']

Here 'two' is clearly wrong, probably a typo, and the model should be able to choose the right correction candidate according to the context. <br/>
Every path is scored with a cost, and the higher the cost, the lower the chance of the path being chosen as the final answer.<br/>
In order for the model to rely more on the context and less on word information, we have the setTradeoff() method. You can think of the tradeoff as how much a single edit operation (insert, delete, etc.) affects the influence of a word when competing inside a path in the graph.<br/>
So the lower the tradeoff, the less we care about the edit operations in the word, and the more we care about the word fitting properly into its context. The tradeoff parameter typically ranges between 5 and 25. <br/>
Let's see what happens when we relax how much the model cares about individual words in our example,

In [19]:
spellModel.getTradeoff()

6.0

In [20]:
# let's decrease the influence of word-level errors
# TODO a nicer way of doing this other than re-creating the pipeline?
spellModel.setTradeoff(2.0)

pipeline = Pipeline(
    stages = [
    documentAssembler,
    tokenizer,
    spellModel,
    finisher
  ])

empty_ds = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_ds))

beautify([lp.annotate(sample)])

['have  you  been  to  the  falls  ?\n']
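
The tension between context and word information described above can be pictured with a toy scoring rule: total cost = context cost + tradeoff × number of edit operations. This formula and the numbers below are assumptions made purely for illustration; the actual scoring inside the Context Spell Checker is not documented in this notebook, and `pick_candidate` is a hypothetical helper.

```python
def pick_candidate(candidates, tradeoff):
    """Return the word with the lowest combined cost.

    candidates: list of (word, edit_operations, context_cost) tuples,
    where context_cost is an invented 'how badly it fits' score.
    """
    return min(candidates, key=lambda c: c[2] + tradeoff * c[1])[0]


# 'two' needs no edits but fits the context poorly; 'to' needs one edit
candidates = [('two', 0, 9.0), ('to', 1, 2.0)]

print(pick_candidate(candidates, tradeoff=10.0))  # edits are expensive
print(pick_candidate(candidates, tradeoff=2.0))   # context dominates
```

With a high tradeoff the single edit outweighs the poor fit and the input word is preserved; with a low tradeoff the better-fitting 'to' wins.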

## Advanced - performance

The discussion about performance revolves around _error detection_. The more errors the model detects, the more populated the candidate graph becomes, and the more alternative paths need to be evaluated. <br/>
The error detection stage of the model decides whether a word needs a correction or not, and there are two reasons for a word to be considered incorrect,
+ The word is OOV: the word is out of the vocabulary.
+ The context: the word doesn't fit well within its neighbouring words.

The only criterion that we can control at this point is the second one, and we do so with the setErrorThreshold() method, which sets a maximum perplexity above which the word will be considered suspicious and a good candidate for being corrected.<br/>
The value that comes with the pretrained model has been set so you can get both decent performance and accuracy. For reference, this is how the F-score and the running time vary on a sample dataset for different values of the errorThreshold,


|F-score|Total time|errorThreshold|
|-------|----------|--------------|
|52.69  |405s      |8.0           |
|52.43  |357s      |10.0          |
|52.25  |279s      |12.0          |
|52.14  |234s      |14.0          |

You can trade some minor points in accuracy for a nice speedup.
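
To put numbers on that trade, the snippet below computes the speedup and the F-score loss of each row relative to the first one. The figures are copied from the table above; `summarize` is a hypothetical helper name.

```python
# (errorThreshold, F-score, total time in seconds), taken from the table above
runs = [(8.0, 52.69, 405), (10.0, 52.43, 357), (12.0, 52.25, 279), (14.0, 52.14, 234)]


def summarize(runs):
    """Compare every run against the first (lowest threshold) run."""
    _, base_fscore, base_seconds = runs[0]
    return [
        (threshold, round(base_seconds / seconds, 2), round(base_fscore - fscore, 2))
        for threshold, fscore, seconds in runs[1:]
    ]


for threshold, speedup, fscore_drop in summarize(runs):
    print(f"threshold {threshold}: {speedup}x faster, F-score drop {fscore_drop}")
```

Raising the threshold from 8 to 14 makes the run about 1.73 times faster while losing 0.55 F-score points.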


In [21]:
def sparknlp_spell_check(text):

  return beautify([lp.annotate(text)])[0].rstrip()


In [22]:
sparknlp_spell_check('I will go to Philadelhia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

In [23]:
sparknlp_spell_check('I will go to Philadhelpia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

In [24]:
sparknlp_spell_check('I will go to Piladelphia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

In [25]:
sparknlp_spell_check('I will go to Philadedlphia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

In [26]:
sparknlp_spell_check('I will go to Phieladelphia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

## ContextSpellCheckerApproach

Trains a deep-learning based Noisy Channel Model Spell Algorithm.

Correction candidates are extracted combining context information and word information.

1.   Different correction candidates for each word: **word level**.
2.   The surrounding text of each word, i.e. its context: **sentence level**.
3.   The relative cost of different correction candidates according to the edit operations they require at the character level: **subword level**.


In [27]:
# For this example, we will use the first Sherlock Holmes book as the training dataset.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols("document") \
    .setOutputCol("token")

spellChecker = ContextSpellCheckerApproach() \
    .setInputCols("token") \
    .setOutputCol("corrected") \
    .setWordMaxDistance(3) \
    .setBatchSize(24) \
    .setEpochs(8) \
    .setLanguageModelClasses(1650)  # dependent on vocabulary size
    # .addVocabClass("_NAME_", names) # Extra classes for correction could be added like this

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker
])

In [28]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/holmes.txt

In [29]:
path = "holmes.txt"

dataset = spark.read.text(path).toDF("text")

dataset.show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|THE ADVENTURES OF SHERLOCK HOLMESArthur Conan Doyle Table of contents A Scandal in Bohemia The Re...|
+----------------------------------------------------------------------------------------------------+



In [30]:
%%time
pipelineModel = pipeline.fit(dataset)

CPU times: user 280 ms, sys: 49.4 ms, total: 330 ms
Wall time: 34.2 s


In [31]:
lp = LightPipeline(pipelineModel)
result = lp.annotate("Plaese alliow me tao introdduce myhelf, I am a man of waelth und tiaste")

In [32]:
result["corrected"]

['Please',
 'allow',
 'me',
 'to',
 'introduce',
 'myself',
 ',',
 'I',
 'am',
 'a',
 'man',
 'of',
 'wealth',
 'and',
 'taste']

In [33]:
pd.DataFrame(zip(result["token"],result["corrected"]),columns=["original","corrected"])

|   |original  |corrected|
|---|----------|---------|
|0  |Plaese    |Please   |
|1  |alliow    |allow    |
|2  |me        |me       |
|3  |tao       |to       |
|4  |introdduce|introduce|
|5  |myhelf    |myself   |
|6  |,         |,        |
|7  |I         |I        |
|8  |am        |am       |
|9  |a         |a        |
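
When the input is long, it is often more convenient to list only the tokens that were actually changed. A minimal sketch, using the first ten rows shown above as literal data (in the notebook you would pass `result["token"]` and `result["corrected"]` instead); `changed_tokens` is a hypothetical helper name.

```python
def changed_tokens(original, corrected):
    """Return (original, corrected) pairs where the spell checker changed the token."""
    return [(o, c) for o, c in zip(original, corrected) if o != c]


# First ten rows of the comparison shown above
tokens = ['Plaese', 'alliow', 'me', 'tao', 'introdduce', 'myhelf', ',', 'I', 'am', 'a']
corrected = ['Please', 'allow', 'me', 'to', 'introduce', 'myself', ',', 'I', 'am', 'a']

for before, after in changed_tokens(tokens, corrected):
    print(f"{before} -> {after}")
```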
