<H1> Noisy Channel Model Spell Checker - Training </H1>
In this notebook we're going to learn how to train the Noisy Channel Model Spell Checker, a.k.a. ContextSpellChecker, so called because it uses the surrounding words as context when it picks a correction for each word.
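
To make the noisy channel idea concrete, here is a minimal pure-Python sketch. It is not how ContextSpellChecker is implemented: the function names, the bigram table, the penalty and every number are made up for illustration. Each candidate gets a context score (a log-probability given the previous word) plus a channel score (a penalty per edit), and the best total wins.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def correct(word, previous_word, vocabulary, bigram_logprob,
            max_distance=2, error_penalty=2.0, unseen_logprob=-10.0):
    """Pick the candidate maximising context score + channel score (toy model)."""
    def score(candidate):
        # context: how likely the candidate is after the previous word
        context = bigram_logprob.get((previous_word, candidate), unseen_logprob)
        # channel: the further the candidate is from what was typed, the worse
        channel = -error_penalty * edit_distance(word, candidate)
        return context + channel

    candidates = [c for c in vocabulary if edit_distance(word, c) <= max_distance]
    return max(candidates, key=score) if candidates else word


# Hypothetical toy data, invented for this example
vocabulary = ["la", "patente", "parente"]
bigram_logprob = {("la", "patente"): -1.0, ("la", "parente"): -4.0}

print(correct("patenre", "la", vocabulary, bigram_logprob))  # → patente
```

Both "patente" and "parente" are within two edits of "patenre", but "patente" is closer and more likely after "la", so it wins.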

## Italian Language Spell Checking
This is a toy Italian spell checking model, used here to show how to train a spell checker. It would need more work to become a real-world model.

In [1]:
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
import sparknlp
from pyspark.sql import SparkSession

spark = sparknlp.start(gpu=True)
sparknlp.version()

'2.7.5'

In [7]:
# Let's use the Paisa corpus
# https://clarin.eurac.edu/repository/xmlui/bitstream/handle/20.500.12124/3/paisa.raw.utf8.gz
from pyspark.sql.functions import *
paisaCorpusPath = "/home/jose/Downloads/paisa/paisa.raw.utf8"


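# do some brief data exploration and preparation to get clean text
df = spark.read.text(paisaCorpusPath)
# drop the <text ...> and </text> markup lines, and the lines starting with '#'
df = df.filter(~col('value').contains('</text')).\
          filter(~col('value').contains('<text')).\
          filter(~col('value').startswith('#'))
# optionally append .limit(10000) to the chain above for a quicker experiment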
df.limit(2).show()

+--------------------+
|               value|
+--------------------+
|Davide Guglielmin...|
|Avete partecipato...|
+--------------------+
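
If you want to check the cleaning rule without Spark, the three filters above can be written as one plain Python predicate. This is only a sketch of the same logic; the sample lines are invented and are not taken from the Paisa corpus.

```python
def keep_line(line):
    """Mirror of the three Spark filters: True for lines that hold corpus text."""
    return (
        "</text" not in line
        and "<text" not in line
        and not line.startswith("#")
    )


# Made-up sample lines, for illustration only
sample = [
    "# made-up header line",
    '<text id="0">',
    "Una riga di testo.",
    "</text>",
    "Un'altra riga.",
]

clean = [line for line in sample if keep_line(line)]
print(clean)  # → ['Una riga di testo.', "Un'altra riga."]
```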



In [3]:
# Get a list of common Italian names
import pandas as pd
import io
import requests
url="https://gist.githubusercontent.com/pdesterlich/2562329/raw/7c09ac44d769539c61df15d5b3c441eaebb77660/nomi_italiani.txt"
s=requests.get(url).content
# skip the first 7 lines (comments), drop empty lines & capitalize first letter
names = [name[0].upper() + name[1:] for name in s.decode('utf-8').split('\n')[7:] if name]
# visualize
names

['Abaco',
 'Abbondanza',
 'Abbondanzia',
 'Abbondanzio',
 'Abbondazio',
 'Abbondia',
 'Abbondina',
 'Abbondio',
 'Abdelkrim',
 'Abdellah',
 'Abdenago',
 'Abdon',
 'Abdone',
 'Abela',
 'Abelarda',
 'Abelardo',
 'Abele',
 'Abelina',
 'Abelino',
 'Aberardo',
 'Abilio',
 'Abondio',
 'Abrama',
 'Abramina',
 'Abramino',
 'Abramo',
 'Accorso',
 'Accursa',
 'Accursia',
 'Accursio',
 'Accurso',
 'Acheropita',
 'Achilla',
 'Achille',
 'Achillea',
 'Achilleo',
 'Achillina',
 'Achiropita',
 'Acilia',
 'Acilio',
 'Acquisto',
 'Acrisio',
 'Ada',
 'Adalberta',
 'Adalberto',
 'Adalciso',
 'Adalgerio',
 'Adalgisa',
 'Adalgisio',
 'Adalgiso',
 'Adalia',
 'Adalinda',
 'Adalindo',
 'Adalio',
 'Adalisa',
 'Adama',
 'Adamaria',
 'Adamello',
 'Adamina',
 'Adamino',
 'Adamo',
 'Adastro',
 'Addamiano',
 'Addario',
 'Addiego',
 'Addolorata',
 'Addolorato',
 'Addonizio',
 'Adea',
 'Adela',
 'Adelaida',
 'Adelaide',
 'Adelaido',
 'Adelasia',
 'Adelasio',
 'Adelca',
 'Adelchi',
 'Adelchio',
 'Adelcisa',
 'Adelco',
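
The one-liner used to build `names` assumes that every line after the header is non-empty. A slightly more defensive version could look like the sketch below; `clean_names` and its `header_lines` parameter are hypothetical helpers, not part of the notebook or of any library.

```python
def clean_names(raw_text, header_lines=7):
    """Skip the header, ignore blank lines and capitalize the first letter."""
    names = []
    for line in raw_text.split("\n")[header_lines:]:
        name = line.strip()  # also removes a trailing '\r' if present
        if name:
            names.append(name[0].upper() + name[1:])
    return names


# Tiny invented input with a 2-line header, a blank line and a Windows line ending
raw = "# comment 1\n# comment 2\nabaco\n\nabbondanza\r\n"
print(clean_names(raw, header_lines=2))  # → ['Abaco', 'Abbondanza']
```

With the downloaded text you would call it as `clean_names(s.decode('utf-8'))`.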

In [None]:
assembler = DocumentAssembler()\
    .setInputCol("value")\
    .setOutputCol("document")

tokenizer = RecursiveTokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")\
    .setInfixes(["’"])\
    .setWhitelist(["L’unica"])\
    .setPrefixes(["\"", "“", "(", "[", "\n", ".", "l’", "dell’", "nell’", "sull’", "all’", "d’", "un’"])\
    .setSuffixes(["\"", "”", ".", ",", "?", ")", "]", "!", ";", ":"])

# we're going to add a special class for names, and use another two
# that come predefined with the model: numbers and dates
spellChecker = ContextSpellCheckerApproach().\
    setInputCols("token").\
    setOutputCol("corrected").\
    addVocabClass('_NAME_', names).\
    setLanguageModelClasses(1650).\
    setWordMaxDistance(3).\
    setBatchSize(24).\
    setEpochs(8)
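
One way to picture what a word class such as `_NAME_` buys us: every member of the class can be treated as the same symbol, so the context model does not need to have seen each individual name. The sketch below only illustrates that idea in plain Python. It is not Spark NLP code, and the `_NUM_` label and the number pattern are assumptions made for the example.

```python
import re


def to_class_symbols(tokens, names):
    """Replace known names and plain numbers with a class symbol (toy version)."""
    name_set = set(names)
    number = re.compile(r"^\d+([.,]\d+)?$")  # assumed, simplistic number pattern
    symbols = []
    for token in tokens:
        if token in name_set:
            symbols.append("_NAME_")
        elif number.match(token):
            symbols.append("_NUM_")  # illustrative label
        else:
            symbols.append(token)
    return symbols


print(to_class_symbols(["Gloria", "ha", "2", "patenti"], ["Gloria", "Abele"]))
# → ['_NAME_', 'ha', '_NUM_', 'patenti']
```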

## Let's try the tokenization here!
Before we get into training, we first take a look at how the tokenization works, just to be sure we're happy with it.
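
As a rough mental model of what the prefix, suffix, infix and whitelist settings do, here is a simplified pure-Python tokenizer. It is a sketch written for this explanation and does not claim to be RecursiveTokenizer's actual algorithm; `toy_tokenize` and `split_on_infixes` are hypothetical names. The lists are copied from the tokenizer configuration above.

```python
import re

PREFIXES = ["\"", "“", "(", "[", "\n", ".", "l’", "dell’", "nell’", "sull’", "all’", "d’", "un’"]
SUFFIXES = ["\"", "”", ".", ",", "?", ")", "]", "!", ";", ":"]
INFIXES = ["’"]
WHITELIST = ["L’unica"]


def split_on_infixes(piece, infixes):
    """Split a piece on every infix, keeping the infix itself as a token."""
    pattern = "(" + "|".join(re.escape(i) for i in infixes) + ")"
    return [part for part in re.split(pattern, piece) if part]


def toy_tokenize(text, prefixes=PREFIXES, suffixes=SUFFIXES,
                 infixes=INFIXES, whitelist=WHITELIST):
    tokens = []
    for chunk in text.split():
        if chunk in whitelist:  # whitelisted chunks are never split
            tokens.append(chunk)
            continue
        head, tail = [], []
        changed = True
        while changed:  # peel prefixes and suffixes until nothing matches
            changed = False
            for p in prefixes:
                if chunk.startswith(p) and len(chunk) > len(p):
                    head.append(p)
                    chunk = chunk[len(p):]
                    changed = True
                    break
            for s in suffixes:
                if chunk.endswith(s) and len(chunk) > len(s):
                    tail.insert(0, s)
                    chunk = chunk[:-len(s)]
                    changed = True
                    break
        for piece in head + [chunk] + tail:
            tokens.extend(split_on_infixes(piece, infixes))
    return tokens


print(toy_tokenize("La dona è nata dala costola dell’uoma."))
# → ['La', 'dona', 'è', 'nata', 'dala', 'costola', 'dell', '’', 'uoma', '.']
```

In this sketch the whitelist is checked first, which is why "L’unica" would stay in one piece while "dell’uoma." is split into four tokens.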

In [7]:
tokenization = Pipeline(
    stages = [
    assembler,
    tokenizer
  ])
lp = LightPipeline(tokenization.fit(df.limit(10000)))

In [8]:
sentences = ["La dona è nata dala costola dell’uoma.",
             "L’unica maneira per realizare i proprio sogni è svegliarsi.",
             "Smettila di penssare a cosa potrebbe andare male e inizia a pensare a cosa potrebbe andare bene.",
             "L'è questione che ognunno c'ha il suo ramo di studi: il contadino studia la vaca e lo studente studia la fica."]

In [9]:
lp.annotate(sentences)

[{'document': ['La dona è nata dala costola dell’uoma.'],
  'token': ['La',
   'dona',
   'è',
   'nata',
   'dala',
   'costola',
   'dell',
   '’',
   'uoma',
   '.']},
 {'document': ['L’unica maneira per realizare i proprio sogni è svegliarsi.'],
  'token': ['L’unica',
   'maneira',
   'per',
   'realizare',
   'i',
   'proprio',
   'sogni',
   'è',
   'svegliarsi',
   '.']},
 {'document': ['Smettila di penssare a cosa potrebbe andare male e inizia a pensare a cosa potrebbe andare bene.'],
  'token': ['Smettila',
   'di',
   'penssare',
   'a',
   'cosa',
   'potrebbe',
   'andare',
   'male',
   'e',
   'inizia',
   'a',
   'pensare',
   'a',
   'cosa',
   'potrebbe',
   'andare',
   'bene',
   '.']},
 {'document': ["L'è questione che ognunno c'ha il suo ramo di studi: il contadino studia la vaca e lo studente studia la fica."],
  'token': ["L'è",
   'questione',
   'che',
   'ognunno',
   "c'ha",
   'il',
   'suo',
   'ramo',
   'di',
   'studi',
   ':',
   'il',
   'contadino',
 

## Train the model
Here we use the complete pipeline and limit the input dataset to 200,000 rows, which should be enough to get a better-than-trivial model.

In [None]:
pipeline = Pipeline(
    stages = [
    assembler,
    tokenizer,
    spellChecker
  ])
fitPipeline = pipeline.fit(df.limit(200000))

## Persist the model
Here we will save the model for later use.

In [None]:
fitPipeline.stages[-1].write().overwrite().save('italian_spell')

## Try the model
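In subsequent uses you can either load it from disk, or access it through pretrained(). The next two cells show both options; each one assigns `loaded`, so the last cell you run decides which model is used.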

In [2]:
loaded = ContextSpellCheckerModel.pretrained('spellcheck_dl', "it")

spellcheck_dl download started this may take some time.
Approximate size to download 124.3 MB
[OK!]


In [77]:
loaded = ContextSpellCheckerModel.load('./italian_spell')

In [30]:
def get_lightpipeline():
    pipeline = Pipeline(
        stages = [
        assembler,
        tokenizer,
        loaded
      ])
    fitPipeline = pipeline.fit(df)
    return LightPipeline(fitPipeline)

In [31]:
lp = get_lightpipeline()

In [10]:
lp.annotate("sonno Glorea ho lasciatth la paterte sul tavolo acanto allu fruttu")['corrected']

['sonno',
 'Gloria',
 'ho',
 'lasciato',
 'la',
 'patente',
 'sul',
 'tavolo',
 'accanto',
 'alla',
 'frutta']

Let's do something a bit more challenging: pay attention to the article 'lo', which should be 'la' in front of 'patente'.

In [25]:
lp.annotate("sono Glorea ho lasciatto lo patente sul tavolo acanto alla frutta")

{'document': ['sono Glorea ho lasciatto lo patente sul tavolo acanto alla frutta'],
 'token': ['sono',
  'Glorea',
  'ho',
  'lasciatto',
  'lo',
  'patente',
  'sul',
  'tavolo',
  'acanto',
  'alla',
  'frutta'],
 'corrected': ['sono',
  'Gloria',
  'ha',
  'lasciato',
  'la',
  'patente',
  'sul',
  'lavoro',
  'accanto',
  'alla',
  'frutta']}

Let's try to adjust the model to a) pay more attention to the context, by decreasing the tradeoff, and b) generate more corrections, by decreasing the error threshold.
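
To see why lowering these two values has that effect, here is a toy sketch. It assumes that the total cost of a candidate is its context (language model) cost plus the tradeoff times its word-level edit cost, and that a word is treated as a possible error when its cost is above the error threshold. Both formulas and every number are invented for illustration; this is not Spark NLP's actual implementation.

```python
def pick_correction(candidates, tradeoff):
    """candidates: list of (word, edit_cost, context_cost); lowest total wins."""
    return min(candidates, key=lambda c: c[2] + tradeoff * c[1])[0]


def flagged(words_with_cost, error_threshold):
    """Words whose (made-up) cost exceeds the threshold are correction targets."""
    return [word for word, cost in words_with_cost if cost > error_threshold]


# Invented numbers: 'lo' is what was typed (edit cost 0) but fits the context badly
candidates = [("lo", 0, 9.0), ("la", 1, 2.0)]
print(pick_correction(candidates, tradeoff=18))  # → lo  (edits are expensive)
print(pick_correction(candidates, tradeoff=5))   # → la  (context weighs more)

# Invented per-word costs
costs = [("sono", 3.0), ("Glorea", 12.0), ("tavolo", 7.0)]
print(flagged(costs, error_threshold=10))  # → ['Glorea']
print(flagged(costs, error_threshold=5))   # → ['Glorea', 'tavolo']
```

Note the side effect in the last line: with a lower threshold a correct word such as 'tavolo' can also become a correction target.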

In [12]:
loaded.getTradeoff(), loaded.getErrorThreshold()

(18.0, 10.0)

In [60]:
loaded.setTradeoff(5)
loaded.setErrorThreshold(5)
# Refresh the pipeline
lp = get_lightpipeline()

In [61]:
lp.annotate("sono Glorea ho lasciatto lo patenre sul tavolo acanto alla frutta")

{'document': ['sono Glorea ho lasciatto lo patenre sul tavolo acanto alla frutta'],
 'token': ['sono',
  'Glorea',
  'ho',
  'lasciatto',
  'lo',
  'patenre',
  'sul',
  'tavolo',
  'acanto',
  'alla',
  'frutta'],
 'corrected': ['sono',
  'Gloria',
  'ho',
  'lasciato',
  'la',
  'patente',
  'sul',
  'tavoli',
  'accanto',
  'alla',
  'frutta']}