![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/training/italian/Training_Context_Spell_Checker_Italian.ipynb)


# Noisy Channel Model Spell Checker - Training
In this notebook we'll learn how to train the Noisy Channel Model Spell Checker, known in Spark NLP as ContextSpellChecker because it uses the surrounding words as context when producing a correction for each word.
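
As a rough orientation, the textbook form of a noisy channel decision rule can be written as below. The symbols are assumptions introduced here for illustration: $o$ is the observed (possibly misspelled) token, $C(o)$ is the set of candidate corrections, $P(w \mid \text{context})$ is a language model score, and $P(o \mid w)$ is an error model. This is the general idea only; the exact scoring used inside Spark NLP may differ.

```latex
\hat{w} = \arg\max_{w \in C(o)} \; P(w \mid \text{context}) \cdot P(o \mid w)
```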

## Italian Language Spell Checking
This is a toy Italian spell checking model, used here to show how to train a spell checker. It is trained on only 10,000 lines of text for 2 epochs, so it would need more work to become a real-world model.

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
import sparknlp

spark = sparknlp.start()

In [None]:
!wget https://clarin.eurac.edu/repository/xmlui/bitstream/handle/20.500.12124/3/paisa.raw.utf8.gz

--2023-02-20 18:16:37--  https://clarin.eurac.edu/repository/xmlui/bitstream/handle/20.500.12124/3/paisa.raw.utf8.gz
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving clarin.eurac.edu (clarin.eurac.edu)... 193.106.181.65
Connecting to clarin.eurac.edu (clarin.eurac.edu)|193.106.181.65|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 546911754 (522M) [application/gzip]
Saving to: ‘paisa.raw.utf8.gz’


2023-02-20 18:19:12 (3,37 MB/s) - ‘paisa.raw.utf8.gz’ saved [546911754/546911754]



In [None]:
# Let's use the Paisa corpus
from pyspark.sql.functions import col

paisaCorpusPath = "paisa.raw.utf8.gz"

# Briefly explore and prepare the data to get clean text:
# drop the <text ...> and </text> markup lines and the lines starting with '#',
# then keep only the first 10,000 remaining lines
df = spark.read.text(paisaCorpusPath)
df = (
    df.filter(~col('value').contains('</text'))
      .filter(~col('value').contains('<text'))
      .filter(~col('value').startswith('#'))
      .limit(10000)
)
df.show(truncate=False)

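
The filtering logic above can be checked without Spark. The following is a minimal sketch in plain Python that applies the same three conditions to a handful of lines; the sample lines and the helper name `keep_line` are made up for illustration and are not taken from the corpus.

```python
def keep_line(line: str) -> bool:
    """Mirror the three Spark filters: drop lines containing '</text' or '<text',
    and lines that start with '#'."""
    return (
        "</text" not in line
        and "<text" not in line
        and not line.startswith("#")
    )


# Hypothetical sample lines, invented for this example
sample = [
    "# hypothetical header comment",
    '<text id="1" url="http://example.org">',
    "Questa è una frase di esempio.",
    "</text>",
    "Un'altra riga di testo.",
]

clean = [line for line in sample if keep_line(line)]
print(clean)
```

Only the two plain sentences survive, which is the kind of text we want to feed to the spell checker.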

In [None]:
# Get a list of common Italian names
import requests

url = "https://gist.githubusercontent.com/pdesterlich/2562329/raw/7c09ac44d769539c61df15d5b3c441eaebb77660/nomi_italiani.txt"
s = requests.get(url).content
# skip the first 7 lines (which are comments), ignore empty lines,
# and capitalize the first letter of each name
names = [name[0].upper() + name[1:] for name in s.decode('utf-8').split('\n')[7:] if name]
# visualize
names

['Abaco',
 'Abbondanza',
 'Abbondanzia',
 'Abbondanzio',
 'Abbondazio',
 'Abbondia',
 'Abbondina',
 'Abbondio',
 'Abdelkrim',
 'Abdellah',
 'Abdenago',
 'Abdon',
 'Abdone',
 'Abela',
 'Abelarda',
 'Abelardo',
 'Abele',
 'Abelina',
 'Abelino',
 'Aberardo',
 'Abilio',
 'Abondio',
 'Abrama',
 'Abramina',
 'Abramino',
 'Abramo',
 'Accorso',
 'Accursa',
 'Accursia',
 'Accursio',
 'Accurso',
 'Acheropita',
 'Achilla',
 'Achille',
 'Achillea',
 'Achilleo',
 'Achillina',
 'Achiropita',
 'Acilia',
 'Acilio',
 'Acquisto',
 'Acrisio',
 'Ada',
 'Adalberta',
 'Adalberto',
 'Adalciso',
 'Adalgerio',
 'Adalgisa',
 'Adalgisio',
 'Adalgiso',
 'Adalia',
 'Adalinda',
 'Adalindo',
 'Adalio',
 'Adalisa',
 'Adama',
 'Adamaria',
 'Adamello',
 'Adamina',
 'Adamino',
 'Adamo',
 'Adastro',
 'Addamiano',
 'Addario',
 'Addiego',
 'Addolorata',
 'Addolorato',
 'Addonizio',
 'Adea',
 'Adela',
 'Adelaida',
 'Adelaide',
 'Adelaido',
 'Adelasia',
 'Adelasio',
 'Adelca',
 'Adelchi',
 'Adelchio',
 'Adelcisa',
 'Adelco',

In [None]:
assembler = DocumentAssembler()\
    .setInputCol("value")\
    .setOutputCol("document")

tokenizer = RecursiveTokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")\
    .setPrefixes(["\"", "“", "(", "[", "\n", ".", "l’", "dell’", "nell’", "sull’", "all’", "d’", "un’"])\
    .setSuffixes(["\"", "”", ".", ",", "?", ")", "]", "!", ";", ":"])

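# We're going to add a special class for names, and use another two
# that come predefined with the model: numbers and dates
spellChecker = (
    ContextSpellCheckerApproach()
    .setInputCols("token")
    .setOutputCol("corrected")
    .addVocabClass('_NAME_', names)   # custom word class built from the names list
    .setLanguageModelClasses(1650)    # number of classes used to factorize the language model output
    .setWordMaxDistance(3)            # maximum distance allowed for generated candidates
    .setEpochs(2)                     # number of training epochs
)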
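
To get a feel for what `setWordMaxDistance(3)` allows, here is a small sketch that measures how far a misspelling is from its intended word. It assumes the distance is a Levenshtein-style edit distance (insertions, deletions, substitutions); the helpers `edit_distance` and `within_max_distance` are written for this example and do not reproduce Spark NLP's internal candidate search. The word pairs are the misspellings used in the test sentence at the end of this notebook.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def within_max_distance(observed: str, candidate: str, max_distance: int = 3) -> bool:
    """True if the candidate is reachable within max_distance edits."""
    return edit_distance(observed, candidate) <= max_distance


# (misspelled, intended) pairs from the test sentence used later in the notebook
PAIRS = [
    ("Ciiao", "Ciao"), ("Glorea", "Gloria"), ("laciato", "lasciato"),
    ("patentte", "patente"), ("tabolo", "tavolo"), ("acanto", "accanto"),
    ("fruta", "frutta"),
]

for typo, word in PAIRS:
    print(f"{typo} -> {word}: distance {edit_distance(typo, word)}")
```

Every pair is a single edit apart, so all of the intended words fall comfortably inside a maximum distance of 3.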

In [None]:
from pyspark.ml import Pipeline

pipeline = Pipeline(
    stages=[
        assembler,
        tokenizer,
        spellChecker
    ])
model = pipeline.fit(df)

In [None]:
lp = LightPipeline(model)
lp.annotate("Ciiao! sono Glorea, ho laciato la patentte sul tabolo acanto alla fruta!")

{'document': ['Ciiao! sono Glorea, ho laciato la patentte sul tabolo acanto alla fruta!'],
 'token': ['Ciiao',
  '!',
  'sono',
  'Glorea',
  ',',
  'ho',
  'laciato',
  'la',
  'patentte',
  'sul',
  'tabolo',
  'acanto',
  'alla',
  'fruta',
  '!'],
 'corrected': ['Ciao',
  '!',
  'sono',
  'Gloria',
  ',',
  'ho',
  'lasciato',
  'la',
  'patente',
  'sul',
  'tavolo',
  'accanto',
  'alla',
  'frutta',
  '!']}
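
To see at a glance which tokens were changed, the `token` and `corrected` lists can be compared pairwise. This is a minimal sketch that assumes the two lists are aligned one-to-one, as they are in the output above; the helper name `list_corrections` is made up for this example, and the dictionary is hard-coded from that output so the snippet runs without Spark.

```python
def list_corrections(annotations: dict) -> list:
    """Return (original, corrected) pairs for the tokens that were changed."""
    pairs = zip(annotations["token"], annotations["corrected"])
    return [(tok, fix) for tok, fix in pairs if tok != fix]


# Copied from the LightPipeline output shown above
result = {
    "token": ["Ciiao", "!", "sono", "Glorea", ",", "ho", "laciato", "la",
              "patentte", "sul", "tabolo", "acanto", "alla", "fruta", "!"],
    "corrected": ["Ciao", "!", "sono", "Gloria", ",", "ho", "lasciato", "la",
                  "patente", "sul", "tavolo", "accanto", "alla", "frutta", "!"],
}

changes = list_corrections(result)
for tok, fix in changes:
    print(f"{tok} -> {fix}")
```

In a live session the same helper could be applied directly to the dictionary returned by `lp.annotate(...)`.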