# Context Spell Checker - version 2.4.0

## Example for ContextSpellChecker


Let's imagine we have this sentence with a couple of spelling errors (in red):

<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
    I <span style="color: red">habe</span> four in my family: Dad, Mum and <span style="color: red">sisster</span>.<br>
</div>

The Spark-NLP Enterprise version provides a pretrained SpellChecker model that can fix those errors by using contextual information. This notebook provides an example of how to use this Annotator in a pipeline.

### Step 1. Prepare the environment

#### Install OpenSource spark-nlp and pyspark pip packages
As a first step, we import the required Python dependencies, including some Spark-NLP components.

Make sure you have the required Python libraries (pyspark 2.4.4, spark-nlp 2.4.0) by running <code>pip list</code> and checking that the versions are correct.

If any of them are missing, you can run:

<code>pip install --ignore-installed pyspark==2.4.4</code><br>
<code>pip install --ignore-installed spark-nlp==2.4.0</code><br>

The <code>--ignore-installed</code> parameter tells pip to overwrite any version of the package that is already installed.
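The same version check can be done from Python instead of reading the <code>pip list</code> output by eye. The following is a minimal sketch, assuming Python 3.8+ for `importlib.metadata` (on older interpreters it falls back to the `importlib_metadata` backport, which would have to be installed separately). `check_versions` and `REQUIRED` are hypothetical names used for illustration; they are not part of Spark-NLP.

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    # Assumption: the importlib_metadata backport is installed on older Pythons
    from importlib_metadata import PackageNotFoundError, version

# Versions this notebook was written against
REQUIRED = {"pyspark": "2.4.4", "spark-nlp": "2.4.0"}


def check_versions(required):
    """Return {package: (installed_or_None, expected)} for every mismatch."""
    problems = {}
    for package, expected in required.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            problems[package] = (installed, expected)
    return problems


for package, (installed, expected) in check_versions(REQUIRED).items():
    print(f"{package}: found {installed}, expected {expected}")
```

An empty result means both packages are present at the expected versions; anything printed points at a package to reinstall with the commands above.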

<i>*If the next cell fails, it means you have not properly set up the required environment. Please check the prerequisites guideline at http://www.johnsnowlabs.com</i>

In [1]:
from sparknlp.base import *
from sparknlp.annotator import *

#### Install Licensed Spark-NLP package

We will also use some Spark-NLP Enterprise functionality contained in the spark-nlp-jsl package.

You can check that spark-nlp-jsl is installed by running <code>pip list</code>. Check that the installed version is 2.4.0.

If it is not, install it with:

<code>pip install spark-nlp-jsl==2.4.0 --extra-index-url https://pypi.johnsnowlabs.com/##### --ignore-installed</code>

The ##### is a secret code. If you have not received it, please contact us at info@johnsnowlabs.com.

<i>*If the next cell fails, it means your licensed Enterprise version is not properly installed, so please check the prerequisites guideline at http://www.johnsnowlabs.com/</i>

In [2]:
from sparknlp_jsl.annotator import *
import sparknlp_jsl

#### Set up credentials for the private JohnSnowLabs models repository with AWS CLI

Now it is time to configure Spark-NLP to access the private JohnSnowLabs models repository. This access goes through the Amazon AWS command line interface (AWS CLI).

Instructions about how to install awscli are available at:

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html

Make sure you configure your credentials with <code>aws configure</code>, following the instructions at:

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

Please substitute ACCESS_KEY and SECRET_KEY with the credentials you have received. If you need your credentials, contact us at info@johnsnowlabs.com
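Before moving on, it can help to confirm that the credentials were actually written. The sketch below only inspects the INI-style credentials file that <code>aws configure</code> creates (by default `~/.aws/credentials`); it does not contact AWS or validate the keys. `has_aws_credentials` is a hypothetical helper name, and the sketch assumes the default file location and the `default` profile.

```python
import configparser
from pathlib import Path


def has_aws_credentials(path=None, profile="default"):
    """Return True if `profile` in the credentials file defines both keys."""
    if path is None:
        # Assumption: default location used by the AWS CLI
        path = Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(str(path))  # a missing file is silently skipped
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return bool(section.get("aws_access_key_id")) and bool(
        section.get("aws_secret_access_key")
    )


print("AWS credentials found:", has_aws_credentials())
```

If this prints `False`, rerun <code>aws configure</code> before trying to download the pretrained model.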


#### Set up your license number to get access to the private JohnSnowLabs models repository

You need to set the SPARK_NLP_LICENSE environment variable to the license number provided to you. If you need your license credentials, contact us at info@johnsnowlabs.com

In [16]:
import os
os.environ['SPARK_NLP_LICENSE']='######'

#### Start Spark session

The following cell initializes the Spark session if you launched the Jupyter notebook directly. If you started the notebook through pyspark, this cell is simply ignored.

Initializing the Spark session takes a few seconds (usually less than 1 minute) because the jar needs to be loaded from the server.

The ###### is a secret code required to run the licensed version 2.4.0. If you have not received it, please contact us at info@johnsnowlabs.com.

In [17]:
spark = sparknlp_jsl.start("######") # Secret code provided as part of the license

### Step 2: Context SpellChecker pipeline generation

In Spark-NLP, annotation happens through pipelines, which are made up of a sequence of Annotator steps. In our case, the architecture of the Context SpellChecker pipeline will be:

* DocumentAssembler (text -> document)
* SentenceDetector (document -> sentence)
* Tokenizer (sentence -> token)
* ContextSpellCheckerModel (token -> fixed)

From the original text we generate a document and identify its sentences. For each sentence, the pipeline extracts the list of tokens and feeds them to the context spell checker.

Finally, we use a pretrained model (ContextSpellCheckerModel) that is trained to fix misspellings based on contextual information.

So, starting from a text, we end up with a list of spell-checked tokens in the "fixed" column.
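To make the column flow concrete, here is a toy, pure-Python imitation of the four steps. It is only a sketch of how each stage reads one column and writes the next: the function names are hypothetical, the splitting rules are naive regular expressions, and `TOY_CORRECTIONS` is a plain lookup table that uses no context at all, unlike the pretrained ContextSpellCheckerModel.

```python
import re

# Hypothetical lookup table standing in for the pretrained model (no context used)
TOY_CORRECTIONS = {"habe": "have", "sisster": "sister"}


def assemble_document(row):
    # text -> document
    row["document"] = row["text"]
    return row


def detect_sentences(row):
    # document -> sentence: naive split after ., ! or ? followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", row["document"])
    row["sentence"] = [s for s in parts if s]
    return row


def tokenize(row):
    # sentence -> token: words and single punctuation marks
    row["token"] = [
        t for s in row["sentence"] for t in re.findall(r"\w+|[^\w\s]", s)
    ]
    return row


def spellcheck(row):
    # token -> fixed: replace known misspellings, keep everything else
    row["fixed"] = [TOY_CORRECTIONS.get(t, t) for t in row["token"]]
    return row


def run_pipeline(text):
    row = {"text": text}
    for stage in (assemble_document, detect_sentences, tokenize, spellcheck):
        row = stage(row)
    return row


row = run_pipeline("I habe four in my family: Dad Mum and sisster.")
print(" ".join(row["token"]))
print(" ".join(row["fixed"]))
```

Each stage only adds a column and never overwrites an earlier one, which is why both the original and the corrected tokens remain available at the end.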



#### Step 2.1: Initialize all the components of the pipeline
The first three components are pretty straightforward Transformers/Annotators: DocumentAssembler, SentenceDetector and Tokenizer.

In [18]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

document = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Rule-based sentence detector annotator; handles several sentences per line

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP

token = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

The fourth annotator in the pipeline will be the "ContextSpellCheckerModel". We will download a pretrained model only available in the Enterprise version named "spellcheck_dl".

Be patient when running this cell.

The first time you call this pretrained model, it has to be downloaded to your local machine, which takes a while (depending on your internet connection).

The size is about 69 MB, and the model is typically saved in your home folder as

    ~HOMEFOLDER/cached_models/spellcheck_dl_en_2.0.2_2.4_1556479898829

On subsequent calls the model is loaded from your cached copy, but even then it needs to be indexed each time, so expect to wait up to 1 minute (depending on your machine).
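To find out in advance whether the download will be needed, you can look for a cached copy. This is a small sketch that assumes the cache lives in a `cached_models` directory under your home folder, as the path above suggests; the location may differ on your setup. `find_cached_models` is a hypothetical helper name, not a Spark-NLP function.

```python
from pathlib import Path


def find_cached_models(name, cache_dir=None):
    """Return cached entries whose name starts with `name` + '_', sorted by name."""
    if cache_dir is None:
        # Assumption: default cache location, taken from the path shown above
        cache_dir = Path.home() / "cached_models"
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(p for p in cache_dir.iterdir() if p.name.startswith(name + "_"))


cached = find_cached_models("spellcheck_dl")
print("cached copies:", [p.name for p in cached])
```

An empty list means the next cell will download the model; otherwise only the indexing wait applies.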

In [6]:
spellchecker_dl = ContextSpellCheckerModel.pretrained('spellcheck_dl', 'en', 'clinical/models')

spellcheck_dl download started this may take some time.
Approximate size to download 68.8 MB
[OK!]


We will also set the name of the input column ("token", the output of the previous Annotator) and of the output column ("fixed").

In [8]:
spellchecker_dl = spellchecker_dl.setInputCols(["token"])\
  .setOutputCol("fixed")

#### Step 2.2 Defining the stages of the pipeline
Now that we have created all the components, let's put them together into a pipeline.

In [9]:
pipeline = Pipeline().setStages([
    document,
    sentenceDetector,
    token,
    spellchecker_dl
])
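The order of the stages matters because each annotator reads a column written by an earlier one. The sketch below checks that wiring in plain Python, using the column names from this notebook. `check_wiring` and the `STAGES` description are hypothetical and purely illustrative; the check is not something Spark-NLP exposes under this name.

```python
def check_wiring(stages, initial_columns=("text",)):
    """stages: list of (name, input_columns, output_column).

    Returns a list of error messages, empty if every input column is
    available by the time its stage runs.
    """
    available = set(initial_columns)
    errors = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                errors.append(
                    f"{name}: input column '{column}' is not produced by an earlier stage"
                )
        available.add(output)
    return errors


# Column wiring of the pipeline defined in this notebook
STAGES = [
    ("DocumentAssembler", ["text"], "document"),
    ("SentenceDetector", ["document"], "sentence"),
    ("Tokenizer", ["sentence"], "token"),
    ("ContextSpellCheckerModel", ["token"], "fixed"),
]

print(check_wiring(STAGES))
```

Swapping two entries in `STAGES`, for example placing the Tokenizer before the SentenceDetector, makes the function report the missing "sentence" column.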

### Step 3. Get your fitted model
Now it is time to fit our new pipeline. First we create a Spark DataFrame containing the sentence we want our SpellChecker to fix:

<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
    I <span style="color: red">habe</span> four in my family: Dad, Mum and <span style="color: red">sisster</span>.<br>
</div>



In [10]:
data = spark.createDataFrame([
    ["I habe four in my family: Dad Mum and sisster."]
]).toDF('text')

data.show(truncate=False)

+----------------------------------------------+
|text                                          |
+----------------------------------------------+
|I habe four in my family: Dad Mum and sisster.|
+----------------------------------------------+



Now we will create a model by fitting our pipeline to our content:

In [11]:
model = pipeline.fit(data)

### Step 4. Transform your data with the model to fix spelling errors.
We will now apply the model to transform our data:

In [12]:
output = model.transform(data)

As a result we will have a Spark DataFrame with a column containing the original tokens ("token") and another column with the fixed tokens ("fixed"):

In [13]:
output.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|               fixed|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|I habe four in my...|[[document, 0, 45...|[[document, 0, 45...|[[token, 0, 0, I,...|[[token, 0, 0, I,...|
+--------------------+--------------------+--------------------+--------------------+--------------------+



Let's compare both and see how our ContextSpellChecker has fixed the misspellings:

In [15]:
print(" ".join(output.select('token.result').take(1)[0]['result']))
print(" ".join(output.select('fixed.result').take(1)[0]['result']))

I habe four in my family : Dad Mum and sisster .
I have four in my family : Dad Mum and sister .
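Rather than comparing the two lines by eye, the changed tokens can be listed programmatically. This is a minimal sketch that copies the two printed lines above as literals and assumes the token lists are aligned one-to-one, as they are in this output.

```python
# The two lines printed above, split back into tokens
original = "I habe four in my family : Dad Mum and sisster .".split()
fixed = "I have four in my family : Dad Mum and sister .".split()

# Assumes both lists have the same length and are aligned position by position
corrections = [(o, f) for o, f in zip(original, fixed) if o != f]
print(corrections)  # [('habe', 'have'), ('sisster', 'sister')]
```

In the notebook itself, the same comparison could be run on the `token.result` and `fixed.result` lists retrieved in the previous cell.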
