<H1> Noisy Channel Model Spell Checker - Medical </H1>


### Initial Setup
As usual in Spark-NLP, let's start by building a pipeline, in this case a _spell correction pipeline_. We will use a pretrained model from our library.

In [1]:
# "!export" runs in a throwaway subshell and never reaches the notebook kernel; %env sets the variable in-process.
%env SPARK_NLP_LICENSE=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJleHAiOjE1OTExODc5NDIsImlhdCI6MTU4MDU2MDc0MiwidW5pcXVlX2lkIjoib2ZmbGluZUtleSJ9.QD0cWRd5OG1MUt5lPEIhlL9LnntZbFZN5hPZoOBHq0j4SlYSVXJvUvTSObkM2RkOby4NZVGs4VQ9ycRzp5Zw8m_O30Nkh1jS9c092abaSCo4jY5UOfM2WPankshg1vNMYvWFFXr4B7cTEKJuQFz1rfyr_jza4cQgSaxj4nZWOT0tdvVq3EhvCJMPkktx_119oB0WsQiWJj-GeTjDu8nxK2D9P9PoSJLCzS03RSeuAXX9uQfeV7ANEX-GalPAJP_YDELLGI4RZU-SjGTQutrxJNTIwjiS2P1Zg6Zpok3jHIjd3zkFigGfvSAhnHvbZnaOtyTh6FOYnTTkJyO9_PeO-z

In [2]:
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
import sparknlp_jsl

from IPython.utils.text import columnize
from sparknlp_jsl.annotator import *
import credentials

beautify = lambda annotations: [columnize(sent['checked']) for sent in annotations]

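# Two options for the Spark session
# Option 1: start a licensed session directly
# spark = sparknlp_jsl.start("XXXXXXXXX") # with license key

# Option 2: build the session by hand from local jars (adjust the paths to your machine)
jars = "/home/jose/spark-nlp/python/lib/sparknlp.jar,/home/jose/spark-nlp-internal/python/lib/sparknlp-jsl.jar"
from pyspark.ml import Pipeline  # explicit import for the Pipeline used below
from pyspark.sql import SparkSession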
spark = SparkSession.builder \
    .appName("Spell checking") \
    .master("local[*]") \
    .config("spark.driver.memory", "8G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
    .config("spark.jars", jars)\
    .config("spark.driver.extraClassPath", jars)\
    .config("spark.executor.extraClassPath", jars)\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()


In [3]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

tokenizer = RecursiveTokenizer()\
  .setInputCols(["document"])\
  .setOutputCol("token")\
  .setPrefixes(["\"", "(", "[", "\n"])\
  .setSuffixes([".", ",", "?", ")","!", "'s"])

spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_clinical', 'en', 'clinical/models')\
    .setInputCols("token")\
    .setOutputCol("checked")

spellcheck_clinical download started this may take some time.
Approximate size to download 145 MB
[OK!]
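
To make the prefix and suffix settings above more concrete, here is a rough pure-Python sketch of the general idea: split on spaces, then repeatedly peel configured prefixes and suffixes off each chunk. The helper names `peel` and `toy_tokenize` are hypothetical, and this is only an illustration under those assumptions; it is not how `RecursiveTokenizer` is implemented and may differ from it on edge cases.

```python
# Same prefix/suffix lists as passed to the tokenizer above.
PREFIXES = ["\"", "(", "[", "\n"]
SUFFIXES = [".", ",", "?", ")", "!", "'s"]


def peel(chunk, prefixes=PREFIXES, suffixes=SUFFIXES):
    """Recursively split leading prefixes and trailing suffixes off one chunk."""
    if not chunk:
        return []
    for p in prefixes:
        # Only peel when something remains after removing the prefix.
        if chunk.startswith(p) and len(chunk) > len(p):
            return [p] + peel(chunk[len(p):], prefixes, suffixes)
    for s in suffixes:
        # Only peel when something remains after removing the suffix.
        if chunk.endswith(s) and len(chunk) > len(s):
            return peel(chunk[:-len(s)], prefixes, suffixes) + [s]
    return [chunk]


def toy_tokenize(text):
    """Hypothetical helper: split on spaces, then peel every chunk."""
    tokens = []
    for chunk in text.split(" "):
        tokens.extend(peel(chunk))
    return tokens


print(toy_tokenize("Abdomen is sort, nontender, and nonintended."))
```

Peeling punctuation into separate tokens matters here because the spell checker works token by token: `sort,` would otherwise be treated as a single unknown word.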


In [4]:
finisher = Finisher()\
    .setInputCols("checked")

pipeline = Pipeline(
    stages = [
    documentAssembler,
    tokenizer,
    spellModel,
    finisher
  ])

empty_ds = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_ds))



OK, at this point we have our spell checking pipeline ready. Let's see what it can do with the following errors:

___Witth__ the __hell__ of __phisical__ __terapy__ the patient was __imbulated__ and on posoperative, the __impatient__ tolerating a post __curgical__ soft diet._

_With __paint__ __wel__ controlled on __orall__ pain medications, she was discharged __too__ __reihabilitation__ __facilitay__._

_She is to also call the __ofice__ if she has any __ever__ greater than 101, or __leeding__ __form__ the surgical wounds._

_Abdomen is __sort__, nontender, and __nonintended__._

_Patient not showing pain or any __wealth__ problems._
            
_No __cute__ distress_

Note that some of the errors are valid English words (for example, __hell__, __paint__, and __cute__), so the right correction can only be chosen by considering the context.
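
The role of context can be illustrated with a toy noisy-channel scorer: each candidate is scored by how cheaply it turns into the observed word (the channel) plus how well it fits its neighbours (the language model). Everything in this sketch is an assumption made for illustration: the vocabulary, the bigram counts, the `EDIT_PENALTY` value, and the function names are made up, and the pretrained `spellcheck_clinical` model relies on its own learned components rather than on this code.

```python
import math

# Hypothetical toy data: a tiny vocabulary and invented bigram counts.
VOCAB = {"no", "acute", "cute", "distress", "any", "health", "wealth", "problems"}
BIGRAMS = {
    ("no", "acute"): 50,
    ("acute", "distress"): 60,
    ("any", "health"): 40,
    ("health", "problems"): 45,
    ("wealth", "problems"): 1,
}
EDIT_PENALTY = 2.0  # assumed log-cost per edit in the channel model


def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def bigram_logp(first, second):
    """Add-one smoothed log P(second | first) from the toy counts."""
    count = BIGRAMS.get((first, second), 0)
    total = sum(c for (a, _), c in BIGRAMS.items() if a == first)
    return math.log((count + 1) / (total + len(VOCAB)))


def correct(left, word, right, max_dist=1):
    """Pick the vocabulary word within max_dist edits that best fits the context."""
    best, best_score = word, -math.inf
    for cand in sorted(VOCAB):  # sorted for deterministic tie-breaking
        dist = edit_distance(word, cand)
        if dist > max_dist:
            continue
        score = (-EDIT_PENALTY * dist
                 + bigram_logp(left, cand)
                 + bigram_logp(cand, right))
        if score > best_score:
            best, best_score = cand, score
    return best


print(correct("no", "cute", "distress"))
print(correct("any", "wealth", "problems"))
```

With these toy counts, `cute` and `wealth` are both in the vocabulary, yet the context terms outweigh the one-edit penalty, so the neighbouring words decide the correction.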

In [5]:
# run the pipeline on the sentences containing the misspellings highlighted above
example1 = ["Witth the hell of phisical terapy the patient was imbulated and on posoperative, the impatient tolerating a post curgical soft diet.",
            "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.",
            "She is to also call the ofice if she has any ever greater than 101, or leeding form the surgical wounds.",
            "Abdomen is sort, nontender, and nonintended.",
            "Patient not showing pain or any wealth problems.",
            "No cute distress"
            
]
lp.annotate(example1)

[{'checked': ['With',
   'the',
   'help',
   'of',
   'physical',
   'therapy',
   'the',
   'patient',
   'was',
   'ambulated',
   'and',
   'on',
   'postoperative',
   ',',
   'the',
   'patient',
   'tolerating',
   'a',
   'post',
   'surgical',
   'soft',
   'diet',
   '.']},
 {'checked': ['With',
   'pain',
   'well',
   'controlled',
   'on',
   'oral',
   'pain',
   'medications',
   ',',
   'she',
   'was',
   'discharged',
   'to',
   'rehabilitation',
   'facility',
   '.']},
 {'checked': ['She',
   'is',
   'to',
   'also',
   'call',
   'the',
   'office',
   'if',
   'she',
   'has',
   'any',
   'fever',
   'greater',
   'than',
   '101',
   ',',
   'or',
   'bleeding',
   'from',
   'the',
   'surgical',
   'wounds',
   '.']},
 {'checked': ['Abdomen',
   'is',
   'soft',
   ',',
   'nontender',
   ',',
   'and',
   'nondistended',
   '.']},
 {'checked': ['Patient',
   'not',
   'showing',
   'pain',
   'or',
   'any',
   'health',
   'problems',
   '.']},
 {'checked'