# Spelling Correction using probability LM

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/spelling-correction-probability-lm](https://github.com/huseinzol05/Malaya/tree/master/example/spelling-correction-probability-lm).
    
</div>

This spelling corrector extends Peter Norvig's spell corrector (http://norvig.com/spell-correct.html) with a KenLM language model.

It also incorporates some algorithms from "Normalization of noisy texts in Malaysian online reviews",
https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews

On top of that, it adds custom vowel augmentation.
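To make the starting point concrete, here is a minimal sketch of the edit-distance-1 candidate generation from Norvig's essay. It is not Malaya's implementation: the vocabulary `TOY_COUNTS` and its counts are made up for illustration, and `known_candidates` is a hypothetical helper name.

```python
import string

# Toy vocabulary with made-up counts (assumption); a real corrector uses a large word list.
TOY_COUNTS = {'kerajaan': 50, 'kepada': 120, 'sikit': 30, 'sakit': 25}


def edits1(word, letters=string.ascii_lowercase):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known_candidates(word, counts=TOY_COUNTS):
    """Vocabulary words within one edit of `word`, most frequent first."""
    candidates = {w for w in edits1(word) if w in counts}
    return sorted(candidates, key=counts.get, reverse=True)
```

With this toy vocabulary, `known_candidates('krajaan')` finds `'kerajaan'` (one missing letter), while `known_candidates('skt')` finds nothing, because `'sikit'` is two insertions away. Heavily abbreviated words like that are where a single edit is not enough.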

In [2]:
import logging

logging.basicConfig(level=logging.INFO)

In [3]:
import malaya

INFO:numexpr.utils:NumExpr defaulting to 8 threads.


In [4]:
# some text examples copied from Twitter

string1 = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'
string2 = 'Husein ska mkn aym dkat kampng Jawa'
string3 = 'Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.'
string4 = 'Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.'
string5 = 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'
string6 = 'blh bntg dlm kls nlp sy, nnti intch'
string7 = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik'

### Load probability model

```python
def load(language_model=None, sentence_piece: bool = False, **kwargs):
    """
    Load a Probability Spell Corrector.

    Parameters
    ----------
    language_model: Callable, optional (default=None)
        If not None, must be an instance of kenlm.Model.
    sentence_piece: bool, optional (default=False)
        if True, reduce possible augmentation states using sentence piece.

    Returns
    -------
    result: model
        List of model classes:

        * if passed `language_model` will return `malaya.spelling_correction.probability.ProbabilityLM`.
        * else will return `malaya.spelling_correction.probability.Probability`.
    """
```

In [5]:
lm = malaya.kenlm.load()
lm

<Model from b'79a00bc9ce79bd29dea1bfa15b66524db7b6731f040a71c653f02c2132d955a3.a63730b02158debc1b0102385e131830c93721224b1f932896486da18e3ec1bb'>

In [6]:
model = malaya.spelling_correction.probability.load(language_model = lm)

INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json


#### List the generated pool of candidate words

```python
def edit_candidates(self, word):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: List[str]
    """
```

In [7]:
model.edit_candidates('mhthir')

['mahathir']

In [8]:
model.edit_candidates('smbng')

['sambang',
 'sombong',
 'smbg',
 'sembong',
 'sambong',
 'simbang',
 'sumbing',
 'sembang',
 'sumbang',
 'sambung',
 'sembung']
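The `'smbng'` output above is consistent with vowels being inserted into a consonant skeleton and the results being filtered against a vocabulary. The following is a rough sketch of that idea only, under the assumption that at most one vowel goes into each gap between letters. `vowel_augment`, `known_augmentations` and `TOY_VOCAB` are hypothetical names, and Malaya's actual augmentation rules may differ.

```python
from itertools import product

VOWELS = 'aeiou'

# Toy vocabulary for illustration only (assumption, not Malaya's word list).
TOY_VOCAB = {'sombong', 'sambung', 'sembang', 'sumbang', 'makan', 'ayam'}


def vowel_augment(word, vowels=VOWELS):
    """Yield every string made by inserting at most one vowel into each gap between letters."""
    if len(word) < 2:
        yield word
        return
    options = ('',) + tuple(vowels)  # '' means "insert nothing in this gap"
    for fillers in product(options, repeat=len(word) - 1):
        pieces = [word[0]]
        for filler, char in zip(fillers, word[1:]):
            pieces.append(filler + char)
        yield ''.join(pieces)


def known_augmentations(word, vocab=TOY_VOCAB):
    """Augmented forms of `word` that exist in the vocabulary, sorted alphabetically."""
    return sorted(set(vowel_augment(word)) & vocab)
```

Here `known_augmentations('smbng')` returns `['sambung', 'sembang', 'sombong', 'sumbang']`. Note that a word of length n produces 6^(n-1) raw strings in this sketch, so the candidate pool grows quickly with word length; the `sentence_piece` flag documented earlier is described as a way to reduce augmentation states.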

#### To correct a word

```python
def correct(
    self,
    word: str,
    string: List[str],
    index: int = -1,
    lookback: int = 3,
    lookforward: int = 3,
):
    """
    Correct a word within a text, returning the corrected word.

    Parameters
    ----------
    word: str
    string: List[str]
        Entire text as a list of words, `word` must be a word inside `string`.
    index: int, optional (default=-1)
        index of word in the string, if -1, will try to use `string.index(word)`.
    lookback: int, optional (default=3)
        N left hand side words.
    lookforward: int, optional (default=3)
        N right hand side words.

    Returns
    -------
    result: str
    """
```

In [10]:
splitted = string1.split()
model.correct('kpd', splitted)

'kepada'

In [11]:
model.correct('krajaan', splitted)

'kerajaan'

In [12]:
%%time

model.correct('skt', splitted)

CPU times: user 10.2 ms, sys: 767 µs, total: 10.9 ms
Wall time: 10.4 ms


'sikit'

In [13]:
%%time

model.correct('skt', splitted, lookback = -1)

CPU times: user 10.9 ms, sys: 699 µs, total: 11.6 ms
Wall time: 11.2 ms


'sikit'
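As a rough illustration of what `lookback` and `lookforward` control, the sketch below slices a context window around the target word and picks the candidate whose window scores highest. This is an assumption about the general approach, not Malaya's code: `context_window`, `pick_candidate` and `toy_score` are hypothetical names, and the bigram counts stand in for a real language model score.

```python
def context_window(tokens, index, lookback=3, lookforward=3):
    """Return (left, right) context around tokens[index]; -1 means take everything."""
    if lookback == -1:
        left = tokens[:index]
    else:
        left = tokens[max(0, index - lookback):index]
    if lookforward == -1:
        right = tokens[index + 1:]
    else:
        right = tokens[index + 1:index + 1 + lookforward]
    return left, right


def pick_candidate(candidates, tokens, index, score, lookback=3, lookforward=3):
    """Choose the candidate whose surrounding window gets the highest score."""
    left, right = context_window(tokens, index, lookback, lookforward)
    return max(candidates, key=lambda c: score(' '.join(left + [c] + right)))


# Stand-in scorer with made-up bigram counts (assumption); swap in a language model score.
TOY_BIGRAMS = {('awal', 'sikit'): 5, ('sikit', 'kepada'): 3, ('awal', 'sakit'): 1}


def toy_score(sentence):
    """Sum the toy bigram counts over adjacent word pairs."""
    words = sentence.split()
    return sum(TOY_BIGRAMS.get(pair, 0) for pair in zip(words, words[1:]))
```

For the first example sentence, with `tokens = string1.split()` the word `'skt'` sits at index 5 and `pick_candidate(['sakit', 'sikit'], tokens, 5, toy_score)` returns `'sikit'`, because the toy counts favour "awal sikit". A wider window gives the scorer more evidence at the cost of scoring longer strings.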

#### To correct a sentence

```python
def correct_text(
    self,
    text: str,
    lookback: int = 3,
    lookforward: int = 3,
):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str
    lookback: int, optional (default=3)
        N words on the left hand side.
        if put -1, will take all words on the left hand side.
        longer left hand side will take longer to compute.
    lookforward: int, optional (default=3)
        N words on the right hand side.
        if put -1, will take all words on the right hand side.
        longer right hand side will take longer to compute.

    Returns
    -------
    result: str
    """
```

In [14]:
model.correct_text(string1)

'kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi'

In [15]:
tokenizer = malaya.tokenizer.Tokenizer()

In [16]:
string2

'Husein ska mkn aym dkat kampng Jawa'

In [17]:
tokenized = tokenizer.tokenize(string2)
model.correct_text(' '.join(tokenized))

'Husin suka makan ayam dekat kampung Jawa'

In [18]:
tokenized = tokenizer.tokenize(string3)
model.correct_text(' '.join(tokenized))

'Melayu malas ini narration dia sama sahaja macam men are trash . True to some , false to some .'

In [19]:
tokenized = tokenizer.tokenize(string5)
model.correct_text(' '.join(tokenized))

'DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now as i am edging towards retirement in 4 - 5 years time after a career of being an Engineer , Project Manager , General Manager'

In [20]:
tokenized = tokenizer.tokenize(string6)
model.correct_text(' '.join(tokenized))

'boleh bintang dalam kelas nlp saya , nanti intch'

In [21]:
tokenized = tokenizer.tokenize(string7)
model.correct_text(' '.join(tokenized))

'mulakan salah orang boleh , bila geng itu kena salahkan juga boleh terima . . pelik'

In [22]:
s = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik , dia slhkn org bole hri2 crta sakau then bila kna bls balik xdpt jwb ,kata mcm biasa slh (parti sampah) 🤣🤣🤣 jgn mulakn dlu slhkn org kalau xboleh trima bila kna bls balik 🤣🤣🤣'

In [23]:
tokenized = tokenizer.tokenize(s)
model.correct_text(' '.join(tokenized))

'mulakan salah orang boleh , bila geng itu kena salahkan juga boleh terima . . pelik , dia salahkan orang bole hari2 cerita sakau then bila kena balas balik xdpt jawab , kata macam biasa salah ( parti sampah ) 🤣 🤣 🤣 jangan mulakan dahulu salahkan orang kalau boleh terima bila kena balas balik 🤣 🤣 🤣'
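The cells above repeat the same tokenize, join, and correct pattern, so it can be convenient to wrap it once. The helper below is a small sketch; `correct_tokenized` is a hypothetical name, and it only relies on the `tokenize` and `correct_text` methods already used in this notebook.

```python
def correct_tokenized(text, tokenizer, model, **kwargs):
    """Tokenize `text`, rejoin the tokens with single spaces, then spell-correct the result.

    Extra keyword arguments (for example lookback or lookforward) are passed
    straight through to `model.correct_text`.
    """
    tokens = tokenizer.tokenize(text)
    return model.correct_text(' '.join(tokens), **kwargs)
```

With the objects created earlier, this would be called as `correct_tokenized(string7, tokenizer, model)`.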