

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb)




# **Spell checking for clinical documents**

To run this notebook yourself, you need to provide your license keys. Run the cell below to upload them. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.



## 1. Colab Setup

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)
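Later cells rely on three entries from the license file: `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION`. A minimal sketch of a sanity check that could be run right after loading the file is shown below. The helper name `missing_license_keys` is hypothetical, and the values are placeholders rather than real credentials.

```python
import json

# Key names taken from how this notebook uses the license file.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def missing_license_keys(keys):
    """Return the required key names that are absent or empty."""
    return [name for name in REQUIRED_KEYS if not keys.get(name)]


# Placeholder content for illustration only; not a real license file.
example = json.loads('{"SECRET": "xxx", "PUBLIC_VERSION": "3.3.4"}')
print(missing_license_keys(example))  # ['JSL_VERSION']
```

Checking up front gives a clearer error than a failed `pip install` with an empty version string.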

Install PySpark and Spark NLP, and set up the environment

In [2]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

Import dependencies into Python and start the Spark session

In [3]:
import json
import os
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

import warnings
warnings.filterwarnings('ignore')

# Spark configuration passed to sparknlp_jsl.start()
params = {"spark.driver.memory": "16G",
          "spark.kryoserializer.buffer.max": "2000M",
          "spark.driver.maxResultSize": "2000M"}

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 3.3.4
Spark NLP_JSL Version : 3.3.4


In [4]:
# Alternative: build the session manually if you need custom parameters
# beyond those passed to sparknlp_jsl.start() above
from pyspark.sql import SparkSession

def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")
      
    return builder.getOrCreate()

#spark = start(SECRET)

## 2. Select the spell checker model and construct the pipeline

In [5]:
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = RecursiveTokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token') \
    .setPrefixes(["\"", "(", "[", "\n"]) \
    .setSuffixes([".", ",", "?", ")","!", "‘s"])

spell_model = ContextSpellCheckerModel.pretrained('spellcheck_clinical', 'en', 'clinical/models') \
    .setInputCols('token') \
    .setOutputCol('corrected')

finisher = Finisher().setInputCols('corrected')

light_pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    spell_model,
    finisher
])

full_pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    spell_model
])

empty_df = spark.createDataFrame([[""]]).toDF('text')
pipeline_model = full_pipeline.fit(empty_df)
light_pipeline_model = LightPipeline(light_pipeline.fit(empty_df))

spellcheck_clinical download started this may take some time.
Approximate size to download 142.2 MB
[OK!]
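The tokenizer above is configured with lists of prefixes and suffixes that should become separate tokens. As a rough mental model only, the pure-Python sketch below peels those characters off whitespace-delimited chunks. It is not how `RecursiveTokenizer` is implemented, and the function names `split_affixes` and `sketch_tokenize` are hypothetical.

```python
# Same prefix and suffix lists as passed to the tokenizer above.
PREFIXES = ['"', "(", "[", "\n"]
SUFFIXES = [".", ",", "?", ")", "!", "‘s"]


def split_affixes(chunk):
    """Peel configured prefixes and suffixes off one whitespace-delimited chunk."""
    leading, trailing = [], []
    changed = True
    while changed:
        changed = False
        for p in PREFIXES:
            # Only split when something remains after removing the prefix.
            if chunk.startswith(p) and len(chunk) > len(p):
                leading.append(p)
                chunk = chunk[len(p):]
                changed = True
        for s in SUFFIXES:
            # Only split when something remains after removing the suffix.
            if chunk.endswith(s) and len(chunk) > len(s):
                trailing.insert(0, s)
                chunk = chunk[:-len(s)]
                changed = True
    return leading + [chunk] + trailing


def sketch_tokenize(text):
    """Split on whitespace, then separate affixes (simplified approximation)."""
    tokens = []
    # str.split() also discards newlines, so the "\n" prefix never fires here.
    for chunk in text.split():
        tokens.extend(split_affixes(chunk))
    return tokens


print(sketch_tokenize("with a cold, cugh, and runny nse"))
```

On this fragment the sketch separates the commas from `cold` and `cugh`, which matches the token rows shown in the results table of this notebook. Other inputs may be handled differently by the real annotator.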


## 3. Create example inputs

In [6]:
# Enter examples as strings in this list
input_list = [
    "The pateint is a 5-mont-old infnt who presented initially on Monday with a cold, cugh, and runny nse for 2 days. Mom states she had no fevr. Her appetite was good but she was spitting up a lot. She had no difficulty breathin and her cough was described as dry and hacky. At that time, pysicl exam showed a right TM, which was red. Left TM was okay. She was fairly congsted but looked happy and playful. She was started on Amxil and Aldx and we told to recheck in 2 weaks to recheck her ear. Mom returned to clinic again today because she got much worse ovrnght. She was having dificlty breathing. She was much more congested and her apetit had decrsed significantly today. She also spked a tempratre yesterday of 102.6 and always hvng trouble sleping scondry to congestion."
]

## 4. Use the pipeline to create outputs

Full Pipeline

In [8]:
import pandas as pd

df = spark.createDataFrame(pd.DataFrame({'text': input_list}))
result = pipeline_model.transform(df)

Light Pipeline

In [9]:
# Light pipelines use plain string inputs instead of data frame inputs
light_result = light_pipeline_model.annotate(input_list[0])
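The light pipeline returns corrected tokens rather than a single string. If readable text is needed, the tokens can be rejoined with a simple spacing heuristic. The sketch below is one possible approach, assuming a list of token strings as input. The helper `join_tokens` and its punctuation sets are hypothetical and not part of Spark NLP. The sample tokens are taken from this notebook's corrected output.

```python
# Punctuation that attaches to the preceding token (heuristic, not exhaustive).
NO_SPACE_BEFORE = {".", ",", "?", ")", "!"}
# Opening brackets that attach to the following token.
NO_SPACE_AFTER = {"(", "["}


def join_tokens(tokens):
    """Rebuild a readable string from a list of tokens."""
    text = ""
    for tok in tokens:
        needs_space = (
            text
            and tok not in NO_SPACE_BEFORE
            and text[-1] not in NO_SPACE_AFTER
        )
        if needs_space:
            text += " "
        text += tok
    return text


sample = ["a", "cold", ",", "cough", ",", "and", "runny", "nose",
          "for", "2", "days", "."]
print(join_tokens(sample))  # a cold, cough, and runny nose for 2 days.
```

The original spacing is not recoverable from the tokens alone, so the result is an approximation of the source layout.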

## 5. Visualize results

Visualize comparison as dataframe

In [10]:
exploded = F.explode(F.arrays_zip('token.result', 'corrected.result'))
select_expression_0 = F.expr("cols['0']").alias("original")
select_expression_1 = F.expr("cols['1']").alias("corrected")
result.select(exploded.alias("cols")) \
    .select(select_expression_0, select_expression_1).show(truncate=False)

+----------+-----------+
|original  |corrected  |
+----------+-----------+
|The       |The        |
|pateint   |patient    |
|is        |is         |
|a         |a          |
|5-mont-old|5-month-old|
|infnt     |infant     |
|who       |who        |
|presented |presented  |
|initially |initially  |
|on        |on         |
|Monday    |Monday     |
|with      |with       |
|a         |a          |
|cold      |cold       |
|,         |,          |
|cugh      |cough      |
|,         |,          |
|and       |and        |
|runny     |runny      |
|nse       |nose       |
+----------+-----------+
only showing top 20 rows
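Most rows in the comparison are unchanged, so it can help to list only the tokens the model altered. The sketch below does this on plain Python lists, using token pairs copied from the first rows of the table above. The helper name `changed_tokens` is hypothetical, and it assumes the two lists are aligned one-to-one, as they are in the table.

```python
# Token pairs copied from the first rows of the comparison table.
original = ["The", "pateint", "is", "a", "5-mont-old", "infnt", "who"]
corrected = ["The", "patient", "is", "a", "5-month-old", "infant", "who"]


def changed_tokens(original, corrected):
    """Return (index, original, corrected) for every token that differs."""
    return [
        (i, o, c)
        for i, (o, c) in enumerate(zip(original, corrected))
        if o != c
    ]


for i, o, c in changed_tokens(original, corrected):
    print(f"{i}: {o} -> {c}")
# 1: pateint -> patient
# 4: 5-mont-old -> 5-month-old
# 5: infnt -> infant
```

Reviewing only the changed pairs makes questionable corrections easier to spot in clinical text.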



Visualize the light pipeline's finished result

In [11]:
# This finished result does not need parsing and can directly be used in any
# other task
light_result['corrected']

['The',
 'patient',
 'is',
 'a',
 '5-month-old',
 'infant',
 'who',
 'presented',
 'initially',
 'on',
 'Monday',
 'with',
 'a',
 'cold',
 ',',
 'cough',
 ',',
 'and',
 'runny',
 'nose',
 'for',
 '2',
 'days',
 '.',
 'Mom',
 'states',
 'she',
 'had',
 'no',
 'fer',
 '.',
 'Her',
 'appetite',
 'was',
 'good',
 'but',
 'she',
 'was',
 'spitting',
 'up',
 'a',
 'lot',
 '.',
 'She',
 'had',
 'no',
 'difficulty',
 'breathing',
 'and',
 'her',
 'cough',
 'was',
 'described',
 'as',
 'dry',
 'and',
 'back',
 '.',
 'At',
 'that',
 'time',
 ',',
 'physical',
 'exam',
 'showed',
 'a',
 'right',
 'TM',
 ',',
 'which',
 'was',
 'red',
 '.',
 'Left',
 'TM',
 'was',
 'okay',
 '.',
 'She',
 'was',
 'fairly',
 'congested',
 'but',
 'looked',
 'happy',
 'and',
 'play',
 '.',
 'She',
 'was',
 'started',
 'on',
 'Amoxil',
 'and',
 'Aldex',
 'and',
 'we',
 'told',
 'to',
 'recheck',
 'in',
 '2',
 'weeks',
 'to',
 'recheck',
 'her',
 'ear',
 '.',
 'Mom',
 'returned',
 'to',
 'clinic',
 'again',
 'today',
 