# Context Spell Checker

## Example for ContextSpellChecker


Let's imagine we have this sentence with a couple of spelling errors (in red):

<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
    I <span style="color: red">habe</span> four in my family: Dad, Mum and <span style="color: red">sisster</span>.<br>
</div>

The Spark NLP Enterprise version provides a pretrained SpellChecker model that can fix such errors by using contextual information. This notebook provides an example of how to use this annotator in a pipeline.

### Step 1. Prepare the environment


In [1]:
import json

with open('keys.json') as f:
    license_keys = json.load(f)

license_keys.keys()

dict_keys(['version', 'jsl_version', 'secret', 'SPARK_NLP_LICENSE', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'JSL_OCR_LICENSE', 'JSL_OCR_SECRET'])
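Before moving on, it can help to confirm that `keys.json` contains every key the next cell reads with `license_keys[...]`, since a missing one raises a `KeyError` halfway through the setup. The sketch below is one possible check; `REQUIRED_KEYS` and `missing_keys` are hypothetical names introduced for illustration, and the sample dictionary holds placeholder values, not real credentials.

```python
# Keys that the setup cell reads with license_keys[...]; a missing one raises KeyError.
REQUIRED_KEYS = [
    "SPARK_NLP_LICENSE",
    "JSL_OCR_LICENSE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]


def missing_keys(keys, required=REQUIRED_KEYS):
    """Return the required names that are absent or empty in `keys`."""
    return [name for name in required if not keys.get(name)]


# Placeholder values only; in the notebook you would pass the loaded license_keys.
sample_keys = {
    "SPARK_NLP_LICENSE": "xxx",
    "AWS_ACCESS_KEY_ID": "xxx",
    "AWS_SECRET_ACCESS_KEY": "xxx",
}
print(missing_keys(sample_keys))
```

With the placeholder dictionary above, the only name reported is `JSL_OCR_LICENSE`. An empty list means the setup cell should find everything it looks up directly.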

In [2]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

secret = license_keys.get("secret",license_keys.get('SPARK_NLP_SECRET', ""))
spark_version = os.environ.get("SPARK_VERSION", license_keys.get("SPARK_VERSION","2.4"))
version = license_keys.get("version",license_keys.get('SPARK_NLP_PUBLIC_VERSION', ""))
jsl_version = license_keys.get("jsl_version",license_keys.get('SPARK_NLP_VERSION', ""))

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

os.environ['SPARK_NLP_LICENSE'] = license_keys['SPARK_NLP_LICENSE']
os.environ['JSL_OCR_LICENSE'] = license_keys['JSL_OCR_LICENSE']
os.environ['AWS_ACCESS_KEY_ID']= license_keys['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = license_keys['AWS_SECRET_ACCESS_KEY']

print(spark_version, version, jsl_version)

! python -m pip install "pyspark==$spark_version".*
! python -m pip install --upgrade spark-nlp-jsl==$jsl_version  --extra-index-url https://pypi.johnsnowlabs.com/$secret

import sparknlp
import sparknlp_jsl
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession

print (sparknlp.version())
print (sparknlp_jsl.version())

spark = sparknlp_jsl.start(secret, gpu=False, spark23=(spark_version[:3]=="2.3"))

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
Looking in indexes: https://pypi.org/simple, https://pypi.johnsnowlabs.com/scwgF2mD1U
Collecting spark-nlp-jsl==2.5.4rc2
Collecting pyspark==2.4.4
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
Collecting spark-nlp==2.5.4
  Downloading https://files.pythonhosted.org/packages/77/99/7a306dd04623ae25d2bd53a190c0b695fc72043773d5ae0870b7aa53d8e2/spark_nlp-2.5.4-py2.py3-none-any.whl (123kB)
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)

### Step 2: Context SpellChecker pipeline generation

In Spark NLP, text is annotated through pipelines, which are made up of a sequence of annotator stages. In our case, the Context SpellChecker pipeline consists of:

* DocumentAssembler (text -> document)
* SentenceDetector (document -> sentence)
* Tokenizer (sentence -> token)
* ContextSpellCheckerModel (token -> fixed)

From the original text we generate a document and identify the different sentences. For each sentence the pipeline will extract the list of tokens and feed those to the context spellchecker.

Finally, we will use a pretrained model (ContextSpellCheckerModel) that is trained to fix misspellings based on contextual information.

So, starting from a text, we will end up with a list of spell-checked tokens in the "fixed" column.
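To make the column flow concrete, here is a toy, pure-Python sketch of the same four stages. It is not Spark NLP code: the function names, the tiny vocabulary and the bigram counts are all made up for illustration, and the real ContextSpellCheckerModel relies on a trained model rather than this lookup. The sketch only shows how each stage consumes the previous stage's output, and how the preceding word can break a tie between candidate corrections.

```python
import re

# Hypothetical toy vocabulary and bigram counts (illustrative values only).
VOCAB = {"i", "have", "hate", "cancer", "in", "my", "left", "lung"}
BIGRAMS = {("i", "have"): 5, ("i", "hate"): 1, ("left", "lung"): 4}


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def assemble(text):  # text -> document
    return text.strip()


def split_sentences(document):  # document -> sentence
    return [s for s in re.split(r"(?<=[.!?])\s+", document) if s]


def tokenize(sentence):  # sentence -> token
    return re.findall(r"\w+|[^\w\s]", sentence)


def fix_tokens(tokens):  # token -> fixed
    fixed = []
    for tok in tokens:
        word = tok.lower()
        # Known words and punctuation pass through unchanged.
        if word in VOCAB or not word.isalpha():
            fixed.append(tok)
            continue
        # Candidates: vocabulary words exactly one edit away from the token.
        candidates = sorted(w for w in VOCAB if edit_distance(word, w) == 1)
        if not candidates:
            fixed.append(tok)
            continue
        prev_word = fixed[-1].lower() if fixed else ""
        # Context: prefer the candidate seen most often after the previous word.
        fixed.append(max(candidates, key=lambda w: BIGRAMS.get((prev_word, w), 0)))
    return fixed


document = assemble("I habe kancer in my left lunb")
for sentence in split_sentences(document):
    tokens = tokenize(sentence)
    print(tokens)
    print(fix_tokens(tokens))
```

In this toy setup "habe" is one edit away from both "have" and "hate"; the bigram count for ("i", "have") is what selects "have", which is the kind of decision that contextual information makes possible.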



#### Step 2.1: Initialize all the components of the pipeline
The first three components are pretty straightforward Transformers/Annotators: DocumentAssembler, SentenceDetector and Tokenizer.

In [3]:
# Annotator that transforms a text column from a DataFrame into an Annotation ready for NLP
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Rule-based sentence detector; handles lines that contain several sentences
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits each sentence into tokens
token = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

In [5]:
spellchecker_dl = ContextSpellCheckerModel.pretrained('spellcheck_clinical', 'en', 'clinical/models')

spellcheck_clinical download started this may take some time.
Approximate size to download 145 MB
[OK!]


We will also set up the name of the input column ("token", which is the output of the previous annotator) and the output column ("fixed").

In [6]:
spellchecker_dl = spellchecker_dl.setInputCols(["token"])\
    .setOutputCol("fixed")

#### Step 2.2 Defining the stages of the pipeline
Now that we have created all the components of our pipeline, let's put them all together.

In [7]:
pipeline = Pipeline().setStages([
    document,
    sentenceDetector,
    token,
    spellchecker_dl
])

### Step 3. Get your fitted model
Now it is time to fit our new pipeline. First, we will create a Spark DataFrame containing the sentence we want our SpellChecker to fix:

<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
    I <span style="color: red">habe</span> <span style="color: red">kancer</span> in my left <span style="color: red">lunb</span><br>
</div>



In [28]:
data = spark.createDataFrame([
    ["I habe kancer in my left lunb"]
]).toDF('text')

data.show(truncate=False)

+-----------------------------+
|text                         |
+-----------------------------+
|I habe kancer in my left lunb|
+-----------------------------+



Now we will create a model by fitting our pipeline to our content:

In [29]:
model = pipeline.fit(data)

### Step 4. Transform your data with the model to fix spelling errors.
We will now apply the model to transform our data:

In [30]:
output = model.transform(data)

As a result we will have a Spark DataFrame with a column containing the original tokens ("token") and another column with the fixed tokens ("fixed"):

In [31]:
output.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|               fixed|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|I habe kancer in ...|[[document, 0, 28...|[[document, 0, 28...|[[token, 0, 0, I,...|[[token, 0, 0, I,...|
+--------------------+--------------------+--------------------+--------------------+--------------------+



Let's compare both and see how our ContextSpellChecker has fixed the misspellings:

In [32]:
print(" ".join(output.select('token.result').take(1)[0]['result']))
print(" ".join(output.select('fixed.result').take(1)[0]['result']))

I habe kancer in my left lunb
I have cancer in my left lung
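To list only the tokens that changed, the two result lists can be paired position by position. The sketch below uses plain Python lists copied from the printed output above, so it runs without a Spark session; in the notebook the same lists would come from the `token.result` and `fixed.result` columns. The variable names are illustrative, and the pairing assumes both lists have the same length, as they do here.

```python
# Token lists copied from the printed output above.
original = "I habe kancer in my left lunb".split()
corrected = "I have cancer in my left lung".split()

# Pair tokens by position and keep only those the spell checker changed.
changes = [
    (before, after)
    for before, after in zip(original, corrected)
    if before != after
]

for before, after in changes:
    print(f"{before} -> {after}")
```

For this sentence the loop reports three corrections: habe, kancer and lunb.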
