![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# 6 Context Spell Checker

## Start Spark Session

In [2]:
import json
import os


import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

import warnings
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

import pandas as pd

print("Spark NLP Version :", sparknlp.version())

spark = sparknlp.start()
# optional params, e.g. gpu=False
spark.sparkContext.setLogLevel("ERROR")

spark

Spark NLP Version : 4.2.4
Spark Session already created, some configs may not take.


<H1> Noisy Channel Model Spell Checker - Introduction </H1>

Blog post: https://medium.com/spark-nlp/applying-context-aware-spell-checking-in-spark-nlp-3c29c46963bc

<div>
<p><br/>
The idea for this annotator is to have a flexible, configurable and "re-usable by parts" model.<br/>
Flexibility is the ability to accommodate different use cases for spell checking like OCR text, keyboard-input text, ASR text, and general spelling problems due to orthographic errors.<br/>
We say this is a configurable annotator, as you can adapt it yourself to different use cases avoiding re-training as much as possible.<br/>
</p>
</div>


<b> Spell Checking at three levels: </b>
The final ranking of a correction sequence is affected by three things:


1. Different correction candidates for each word - __word level__.
2. The surrounding text of each word, i.e. its context - __sentence level__.
3. The relative cost of different correction candidates according to the character-level edit operations they require - __subword level__.
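
To make the three levels concrete, here is a minimal toy sketch of how a correction sequence could be ranked. It is not the annotator's implementation: the candidate lists, the edit costs, and the bigram costs are all hypothetical hand-written numbers standing in for what the real model learns, and `lattice`, `bigram_cost`, and `path_cost` are names invented for this illustration.

```python
import itertools

# Word level: hypothetical candidates for each input token.
# Subword level: each candidate carries a hand-written edit cost
# (number of character edits away from the input token).
lattice = [
    {"a": 0},
    {"man": 0},
    {"of": 0},
    {"wealth": 2, "health": 3, "welt": 2},  # input token: "waelth"
]

# Sentence level: hypothetical cost of seeing a word right after the
# previous one (lower means the pair fits together better).
bigram_cost = {
    ("a", "man"): 0.5,
    ("man", "of"): 0.5,
    ("of", "wealth"): 1.0,
    ("of", "health"): 1.5,
    ("of", "welt"): 6.0,
}

def path_cost(path):
    """Total cost of one correction sequence: edit costs plus context costs."""
    edit = sum(lattice[i][word] for i, word in enumerate(path))
    context = sum(bigram_cost[pair] for pair in zip(path, path[1:]))
    return edit + context

# Enumerate every path through the lattice and rank by total cost.
paths = sorted(itertools.product(*lattice), key=path_cost)
best = paths[0]
print(best, path_cost(best))
```

With these made-up numbers the cheapest path ends in "wealth": it ties with "welt" on edit cost but fits the context far better.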
 



### Initial Setup
As usual in Spark-NLP, let's start by building a pipeline: a _spell correction pipeline_. We will use a pretrained model from our library.

In [2]:
from sparknlp.common import *
from IPython.utils.text import columnize

In [3]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = RecursiveTokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")\
    .setPrefixes(["\"", "(", "[", "\n"])\
    .setSuffixes([".", ",", "?", ")","!", "'s"])

spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_dl')\
    .setInputCols("token")\
    .setOutputCol("checked")\
    .setErrorThreshold(4.0)\
    .setTradeoff(6.0)

finisher = Finisher()\
    .setInputCols("checked")

pipeline = Pipeline(stages = [
     documentAssembler,
     tokenizer,
     spellModel,
     finisher
  ])

empty_ds = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_ds))

spellcheck_dl download started this may take some time.
Approximate size to download 95.1 MB
Download done! Loading the resource.
[OK!]


OK, at this point we have our spell checking pipeline as expected. Let's see what we can do with it.

In [4]:
lp.annotate("Plaese alliow me tao introdduce myhelf, I am a man of waelth und tiaste")

{'checked': ['Phase',
  'allow',
  'me',
  'to',
  'introduce',
  'myself',
  ',',
  'I',
  'am',
  'a',
  'man',
  'of',
  'wealth',
  'and',
  'taste']}

### Word Level Corrections
Continuing with our pretrained model, let's see how corrections work at the word level. Each Context Spell Checker model in the Spark-NLP library comes with two sources of word candidates:
+ a general vocabulary that is built during training (and remains immutable during the life of the model), and
+ special classes for dealing with special types of words like numbers or dates. These are dynamic, and you can modify them so they adjust better to your data.

The general vocabulary is learned during training and cannot be modified. The special classes, however, can be updated on a pretrained model after training.
This means you can modify how existing classes produce corrections, but not the number or type of the classes.
Let's see how we can accomplish this.

In [5]:
# First let's start with a loaded model, and check which classes it has been trained with
spellModel.getWordClasses()

['(_DATE_,RegexParser)',
 '(_LOC_,VocabParser)',
 '(_NUM_,RegexParser)',
 '(_NAME_,VocabParser)']

We have four classes, of two different types: some are vocabulary based and others are regex based.
+ __Vocabulary based classes__ can propose correction candidates from the provided vocabulary, for example a dictionary of names.
+ __Regex classes__ are defined by a regular expression, and they can be used to generate correction candidates for things like numbers. Internally, the Spell Checker enumerates your regular expression and builds a fast automaton, not only for recognizing the word (a number in this example) as valid and preserving it, but also for generating correction candidates.
Thus the regex must be finite, i.e. it must define a finite regular language.

Now suppose that you have a new friend from Poland whose name is 'Jowita'. Let's see how the pretrained Spell Checker handles this name.
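
The following toy sketch illustrates why a regex class has to describe a finite language: every accepted string can be listed up front, so the same list serves both to recognize valid tokens and to propose candidates. The pattern, the brute-force enumeration, and the "one substitution away" rule are assumptions made for this illustration; the real annotator builds an automaton rather than a Python list.

```python
import itertools
import re

# Hypothetical finite pattern: numbers with one or two digits.
NUM_PATTERN = re.compile(r"[0-9]{1,2}")
DIGITS = "0123456789"

# Because the language is finite, we can enumerate all 110 accepted strings.
LANGUAGE = [
    "".join(combo)
    for length in (1, 2)
    for combo in itertools.product(DIGITS, repeat=length)
]

def one_substitution_away(a, b):
    """True when a and b have equal length and differ in exactly one character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def candidates(token):
    """Preserve tokens already in the class, otherwise propose near matches."""
    if NUM_PATTERN.fullmatch(token):
        return [token]
    return [word for word in LANGUAGE if one_substitution_away(token, word)]

print(candidates("42"))  # valid number, kept as-is
print(candidates("4O"))  # letter O typed instead of a digit
```

An unbounded pattern such as `[0-9]+` could not be enumerated this way, which is the intuition behind the finiteness requirement.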

In [6]:
beautify = lambda annotations: [columnize(sent['checked']) for sent in annotations]

In [7]:
# Foreign name without errors
sample = 'We are going to meet Jowita in the city hall.'
beautify([lp.annotate(sample)])

['We  are  going  to  meet  With  in  the  city  hall  .\n']
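
In the output above the correctly spelled name 'Jowita' was replaced by 'With'. One way to address this is to append the name to the `_NAME_` vocabulary class. The sketch below assumes the model exposes `updateVocabClass(label, vocab, append)`, as `ContextSpellCheckerModel` does in the Spark NLP Python API; `add_names` and `_StubModel` are hypothetical helpers, and the stub only exists so the snippet runs without a Spark session.

```python
def add_names(model, names, label="_NAME_"):
    """Append names to a vocabulary class of a spell checker model.

    `model` is expected to behave like a ContextSpellCheckerModel.
    """
    # The third argument asks to append to the class instead of replacing it.
    model.updateVocabClass(label, list(names), True)
    return model

class _StubModel:
    """Stand-in that records calls, so the sketch is runnable on its own."""

    def __init__(self):
        self.calls = []

    def updateVocabClass(self, label, vocab, append=True):
        self.calls.append((label, vocab, append))
        return self

stub = add_names(_StubModel(), ["Jowita"])
print(stub.calls)
```

In this notebook the equivalent call would be `add_names(spellModel, ["Jowita"])`. Whether the `LightPipeline` then needs to be rebuilt to pick up the change is something to verify against your Spark NLP version.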

## Advanced - the mysterious tradeoff parameter 
There's a clear tension between two forces here:
+ The context information: by which the model wants to change words based on the surrounding words.
+ The word information: by which the model wants to preserve as much of the input word as possible to avoid destroying the input.

Changing words that are in the vocabulary for others that seem more suitable according to the context is one of the most challenging tasks in spell correction, because you run the risk of destroying existing 'good' words.
The models that you will find in the Spark-NLP library have already been configured to balance these two forces and produce good results in most situations. But your dataset can be different from the one used to train the model.
So we encourage you to experiment with the hyperparameters. To give you an idea of how they can be adjusted, consider the following example:

In [8]:
sample = 'have you been two the falls?'
beautify([lp.annotate(sample)])

['have  you  been  to  the  falls  ?\n']

Here 'two' is clearly wrong, probably a typo, and the model should be able to choose the right correction candidate according to the context. <br/>
Every path is scored with a cost, and the higher the cost, the lower the chance of the path being chosen as the final answer.<br/>
To make the model rely more on the context and less on word information, we have the setTradeoff() method. You can think of the tradeoff as how much a single edit operation (insert, delete, etc.) affects the influence of a word when competing inside a path in the graph.<br/>
So the lower the tradeoff, the less we care about the edit operations in the word, and the more we care about the word fitting properly into its context. The tradeoff parameter typically ranges between 5 and 25. <br/>
Let's see what happens when we relax how much the model cares about individual words in our example:

In [9]:
spellModel.getTradeoff()

6.0

In [10]:
# let's decrease the influence of word-level errors
# TODO a nicer way of doing this other than re-creating the pipeline?
spellModel.setTradeoff(2.0)

pipeline = Pipeline(
    stages = [
    documentAssembler,
    tokenizer,
    spellModel,
    finisher
  ])

empty_ds = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_ds))

beautify([lp.annotate(sample)])

['have  you  been  to  the  falls  ?\n']
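
As a rough mental model of the tradeoff, the toy sketch below scores each candidate as `context_cost + tradeoff * edit_operations` and picks the cheapest. This formula and all the numbers are assumptions chosen for illustration; they are not the costs the pretrained model actually computes.

```python
# Hypothetical candidates for the token "two" in
# "have you been two the falls?": (context cost, edit operations).
candidates = {
    "two": (9.0, 0),  # unchanged word, but a poor fit for the context
    "to": (2.0, 1),   # one deletion away, fits the context well
    "too": (4.0, 1),  # one substitution away, fits moderately
}

def best_candidate(candidates, tradeoff):
    """Pick the word with the lowest combined cost for a given tradeoff."""
    def cost(word):
        context_cost, edits = candidates[word]
        return context_cost + tradeoff * edits
    return min(candidates, key=cost)

for tradeoff in (10.0, 6.0, 2.0):
    print(tradeoff, best_candidate(candidates, tradeoff))
```

With these made-up costs, a high tradeoff keeps the input word because each edit is expensive, while lower values let the context win and select "to".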

## Advanced - performance

The discussion about performance revolves around _error detection_. The more errors the model detects, the more populated the candidate graph becomes, and the more alternative paths need to be evaluated. <br/>
The error detection stage of the model decides whether a word needs a correction or not. There are two reasons for a word to be considered incorrect:
+ The word is OOV: the word is out of the vocabulary.
+ The context: the word doesn't fit well within its neighbouring words.

The only one of these that we can control at this point is the second, and we do so with the setErrorThreshold() method, which sets a maximum perplexity above which a word is considered suspicious and a good candidate for correction.<br/>
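
The two detection criteria can be sketched in a few lines of plain Python. This is only a toy: the vocabulary and the per-word perplexity values are hypothetical, and `suspicious` is a name invented here, not a Spark NLP function.

```python
def suspicious(tokens, vocabulary, perplexity, error_threshold):
    """Return the tokens that would be sent to the correction stage."""
    flagged = []
    for token in tokens:
        out_of_vocabulary = token not in vocabulary
        poor_fit = perplexity[token] > error_threshold
        if out_of_vocabulary or poor_fit:
            flagged.append(token)
    return flagged

tokens = ["have", "you", "been", "two", "the", "fallz"]
vocabulary = {"have", "you", "been", "two", "to", "the", "falls"}
# Hypothetical perplexity of each token given its neighbours.
perplexity = {"have": 1.2, "you": 1.5, "been": 2.0,
              "two": 9.5, "the": 1.1, "fallz": 3.0}

print(suspicious(tokens, vocabulary, perplexity, error_threshold=4.0))
print(suspicious(tokens, vocabulary, perplexity, error_threshold=12.0))
```

Raising the threshold flags fewer in-vocabulary words, so fewer alternative paths have to be evaluated, at the risk of missing context errors such as "two".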
The value that comes with the pretrained model has been set to give both decent speed and accuracy. For reference, this is how the F-score and total time vary on a sample dataset for different values of the errorThreshold:


|F-score|Total time|errorThreshold|
|-------|----------|--------------|
|52.69  |405s      |8.0           |
|52.43  |357s      |10.0          |
|52.25  |279s      |12.0          |
|52.14  |234s      |14.0          |

You can trade some minor points in accuracy for a nice speedup.
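
To quantify that trade, the short calculation below takes the figures from the table at face value and compares each row against the threshold of 8. The variable names are illustrative only.

```python
# (errorThreshold, F-score, total time in seconds), copied from the table.
rows = [(8, 52.69, 405), (10, 52.43, 357), (12, 52.25, 279), (14, 52.14, 234)]

_, base_fscore, base_seconds = rows[0]
summary = []
for threshold, fscore, seconds in rows[1:]:
    speedup = base_seconds / seconds   # how many times faster than threshold 8
    f_drop = base_fscore - fscore      # F-score points lost
    summary.append((threshold, round(speedup, 2), round(f_drop, 2)))
    print(f"threshold {threshold}: {speedup:.2f}x faster, F-score down {f_drop:.2f}")
```

On this sample, moving from 8 to 14 makes the run roughly 1.7 times faster for a loss of about half an F-score point.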


In [11]:
def sparknlp_spell_check(text):

  return beautify([lp.annotate(text)])[0].rstrip()


In [12]:
sparknlp_spell_check('I will go to Philadelhia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

In [13]:
sparknlp_spell_check('I will go to Philadhelpia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

In [14]:
sparknlp_spell_check('I will go to Piladelphia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

In [15]:
sparknlp_spell_check('I will go to Philadedlphia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

In [16]:
sparknlp_spell_check('I will go to Phieladelphia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

## ContextSpellCheckerApproach

Trains a deep-learning based Noisy Channel Model Spell Algorithm.

Correction candidates are extracted combining context information and word information.

1.   Different correction candidates for each word: **word level**.
2.   The surrounding text of each word, i.e. its context: **sentence level**.
3.   The relative cost of different correction candidates according to the character-level edit operations they require: **subword level**.


In [17]:
# For this example, we will use the first Sherlock Holmes book as the training dataset.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols("document") \
    .setOutputCol("token")

spellChecker = ContextSpellCheckerApproach() \
    .setInputCols("token") \
    .setOutputCol("corrected") \
    .setWordMaxDistance(3) \
    .setBatchSize(24) \
    .setEpochs(8) \
    .setLanguageModelClasses(1650)  # dependent on vocabulary size
    # .addVocabClass("_NAME_", names) # Extra classes for correction could be added like this

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker
])
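
To give an intuition for `setWordMaxDistance(3)`, the sketch below keeps only vocabulary words within a maximum edit distance of the input token. It assumes plain Levenshtein distance over a tiny hypothetical vocabulary; the distance measure and candidate search used inside the annotator may differ.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute or match
            ))
        prev = curr
    return prev[-1]

def candidates_within(token, vocabulary, max_distance):
    """Return (distance, word) pairs no further than max_distance, closest first."""
    scored = ((levenshtein(token, word), word) for word in vocabulary)
    return sorted(pair for pair in scored if pair[0] <= max_distance)

vocabulary = ["introduce", "introduced", "produce", "reduce"]
print(candidates_within("introdduce", vocabulary, max_distance=3))
```

A larger maximum distance admits more candidates per word, which widens the search and the amount of work during correction.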

In [18]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/holmes.txt

In [19]:
path = "holmes.txt"

dataset = spark.read.text(path).toDF("text")

dataset.show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|THE ADVENTURES OF SHERLOCK HOLMESArthur Conan Doyle Table of contents A Scandal in Bohemia The Re...|
+----------------------------------------------------------------------------------------------------+



In [20]:
%%time
pipelineModel = pipeline.fit(dataset)

                                                                                

CPU times: user 59 ms, sys: 11.6 ms, total: 70.5 ms
Wall time: 41.7 s


In [21]:
lp = LightPipeline(pipelineModel)
result = lp.annotate("Plaese alliow me tao introdduce myhelf, I am a man of waelth und tiaste")

In [22]:
result["corrected"]

['Please',
 'allow',
 'me',
 'to',
 'introduce',
 'myself',
 ',',
 'I',
 'am',
 'a',
 'man',
 'of',
 'wealth',
 'and',
 'taste']

In [23]:
pd.DataFrame(zip(result["token"],result["corrected"]),columns=["original","corrected"])

Unnamed: 0,original,corrected
0,Plaese,Please
1,alliow,allow
2,me,me
3,tao,to
4,introdduce,introduce
5,myhelf,myself
6,",",","
7,I,I
8,am,am
9,a,a
