# Spell Check Indonesia

## Install Gensim

In [1]:
!pip install --upgrade gensim





## Download the Pre-Trained Indonesian fastText Model
The pre-trained fastText model downloaded here was trained with CBOW and *position-weights*, in dimension 300, with character n-grams of length 5, a *window size* of 5, and 10 *negatives*. The full collection of pre-trained fastText models is available [here](https://fasttext.cc/docs/en/crawl-vectors.html).
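
To make "character n-grams of length 5" concrete, here is a minimal sketch of how subword units can be extracted from a word. It assumes the fastText convention of wrapping each word in `<` and `>` boundary markers before slicing; the function name `char_ngrams` is hypothetical and is not part of fastText or Gensim.

```python
def char_ngrams(word, n=5):
    """Return the character n-grams of `word` after adding boundary markers."""
    # Assumption: the word is wrapped in "<" and ">" before n-grams are taken.
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("makan"))  # ['<maka', 'makan', 'akan>']
```

Because a word vector is built from such pieces, the model can still produce a vector for a word that is missing from its vocabulary.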

In [2]:
import sys

import wget

def bar_progress(current, total, width=80):
    "Progress callback for wget.download: rewrites a single status line in place."
    progress_message = "Downloading: %d%% [%d / %d] bytes" % (current / total * 100, current, total)
    # Don't use print(), as it would start a new line on every update.
    sys.stdout.write("\r" + progress_message)
    sys.stdout.flush()

url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.id.300.bin.gz'

filename = wget.download(url, bar=bar_progress)

Downloading: 100% [4507049071 / 4507049071] bytes

## Extract the Model with `gzip`

In [3]:
%%time

import gzip
import shutil

with gzip.open('cc.id.300.bin.gz', 'rb') as f_in:
    with open('cc.id.300.bin', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

Wall time: 1min 3s


## Import Libraries

In [4]:
from gensim.models import fasttext
from gensim.models.fasttext import load_facebook_model

## Load Fasttext Model using Gensim

In [5]:
%%time

model = fasttext.load_facebook_model('cc.id.300.bin')

Wall time: 2min 6s


### The pre-trained fastText model has a *vocabulary* of 2,000,000 (two million) entries

In [6]:
vocab = model.wv.key_to_index
len(vocab)

2000000

In [7]:
list(model.wv.key_to_index)[:10]

[',', '.', '</s>', 'yang', 'dan', '"', 'di', ')', '(', 'dengan']

## Create Index for each word in Vocabulary
Build a *word-rank dictionary* that maps every word/character in the *vocabulary* to its position (rank) in the vocabulary list.

In [8]:
%%time

words = list(model.wv.key_to_index)

w_rank = {}
for i,word in enumerate(words):
    w_rank[word] = i
    
WORDS = w_rank

Wall time: 707 ms


In [9]:
import itertools

dict(itertools.islice(WORDS.items(), 10))

{',': 0,
 '.': 1,
 '</s>': 2,
 'yang': 3,
 'dan': 4,
 '"': 5,
 'di': 6,
 ')': 7,
 '(': 8,
 'dengan': 9}

## Peter Norvig Spelling Corrector
One of the simplest approaches to spelling correction is the method described by [Peter Norvig](https://norvig.com/spell-correct.html).

### Differences
Peter Norvig's code and CPMP's code implement spell checking differently: **Peter Norvig's code counts how often each word occurs in a dictionary corpus**, whereas **CPMP's code uses the word ranking from a Word2Vec model**.
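
The two scoring schemes can be compared on a toy corpus. This is only a sketch: the corpus, `p_frequency`, and `p_rank` are hypothetical names for illustration, and it assumes the vocabulary is ordered from most to least frequent (the first entries listed earlier, such as punctuation, `yang`, and `dan`, suggest this).

```python
from collections import Counter

# Hypothetical toy corpus; a real corpus would be far larger.
corpus = "kucing makan ikan dan kucing tidur dan kucing lari".split()
counts = Counter(corpus)
total = sum(counts.values())

def p_frequency(word):
    """Norvig-style score: relative frequency of the word in the corpus."""
    return counts[word] / total

# Rank-style score: only the position in a frequency-sorted vocabulary is known.
vocabulary = [w for w, _ in counts.most_common()]
rank = {w: i for i, w in enumerate(vocabulary)}

def p_rank(word):
    """Rank-based proxy: negated rank, so the most frequent word scores highest."""
    return -rank.get(word, len(rank))

candidates = ["dan", "kucing", "lari"]
print(max(candidates, key=p_frequency))  # kucing
print(max(candidates, key=p_rank))       # kucing
```

Both scores pick the same winner here because `max` only needs the ordering of the candidates, not their actual probabilities.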

### How It Works
This spell checker uses Bayes' Theorem to find the *correction c* by choosing, among all *candidate corrections*, the one with the highest *probability*. The method breaks down into four parts:
1. **Selection Mechanism**: `argmax`, which picks the *candidate* with the highest *probability*.
2. **Candidate Model**: `c ∈ candidates`, which supplies the *candidate corrections c* to consider.
3. **Language Model**: `P(c)`, the *probability* that the *candidate correction c* appears in the dictionary. Here the dictionary is the vocabulary of the word-embedding model (Word2Vec in CPMP's code, the fastText model in this notebook).
4. **Error Model**: `P(w|c)`, the *probability* that *w* is typed when *c* was intended. For example, `P(mkan|makan)` is higher than `P(mkanxxxyz|makan)`.
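
In symbols, following Norvig's essay, the four parts combine as below. The denominator `P(w)` is the same for every candidate, so it drops out of the maximization.

```latex
\hat{c}
  = \operatorname*{arg\,max}_{c \,\in\, \mathrm{candidates}(w)} P(c \mid w)
  = \operatorname*{arg\,max}_{c \,\in\, \mathrm{candidates}(w)} \frac{P(w \mid c)\, P(c)}{P(w)}
  = \operatorname*{arg\,max}_{c \,\in\, \mathrm{candidates}(w)} P(w \mid c)\, P(c)
```

The code in this section does not estimate `P(w|c)` explicitly. It approximates the error model by priority: a known word beats any word one edit away, which in turn beats any word two edits away, and `P(c)` only breaks ties within the same edit distance.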

In [10]:
import re

def words(text): return re.findall(r'\w+', text.lower())

def P(word): 
    "Probability proxy for `word`: a higher value means more probable."
    # Use the negated rank as a proxy: rank 0 (most frequent) scores highest.
    # Words missing from the dictionary score below every known word.
    return -WORDS.get(word, len(WORDS))

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
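
The corrector can be tried without loading the large model. The sketch below repeats the same logic against a hypothetical five-word vocabulary (`TOY_WORDS`, with rank 0 as the most frequent word); every `toy_` name is illustrative only.

```python
# Hypothetical toy vocabulary, most frequent word first (rank 0).
TOY_WORDS = {w: i for i, w in enumerate(["yang", "dan", "kucing", "kuning", "makan"])}
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def toy_edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def toy_known(words):
    """Keep only the words that exist in the toy vocabulary."""
    return {w for w in words if w in TOY_WORDS}

def toy_correction(word):
    """Prefer the word itself, then one edit, then two edits, else give up."""
    candidates = (
        toy_known([word])
        or toy_known(toy_edits1(word))
        or toy_known(e2 for e1 in toy_edits1(word) for e2 in toy_edits1(e1))
        or [word]
    )
    # Lower rank means more frequent, so negate the rank before taking the max.
    return max(candidates, key=lambda w: -TOY_WORDS.get(w, len(TOY_WORDS)))

print(toy_correction("makan"))   # makan  (already known)
print(toy_correction("kcing"))   # kucing (one insertion away)
print(toy_correction("kuing"))   # kucing (ties with "kuning"; the better rank wins)
print(toy_correction("xyzxyz"))  # xyzxyz (nothing within two edits)
```

The third call shows where the rank matters: both `kucing` and `kuning` are one edit from `kuing`, and the rank decides between them.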

## Spelling Test

In [11]:
correction("kcing")

'kucing'

In [12]:
correction('J4karta')

'Jakarta'

In [13]:
correction('mnyedihknn')

'menyedihkan'

In [14]:
correction('yg ')

'yg'

# Slang + Spell Check

In [15]:
import json

with open("source/slang.txt") as f:
    slangS = json.loads(f.read())
    
type(slangS)

dict

In [16]:
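import re

def join_punctuation(seq, characters='.,;?!'):
    "Attach each punctuation token to the token that precedes it."
    # NOTE: this helper was never defined in the original notebook, which is
    # why the cell raised a NameError. This version is an assumed reconstruction.
    characters = set(characters)
    seq = iter(seq)
    current = next(seq, None)
    if current is None:
        return
    for nxt in seq:
        if nxt in characters:
            current += nxt
        else:
            yield current
            current = nxt
    yield current

def slang(T):
    # Split into words (keeping apostrophes) and standalone punctuation marks.
    Texts = re.findall(r"[\w']+|[.,!?;]", T)

    # Spell-correct every token first.
    _spelling = [correction(text) for text in Texts]

    # Then replace any token that appears in the slang dictionary.
    for index, text in enumerate(_spelling):
        if text in slangS:
            _spelling[index] = slangS[text]

    return ' '.join(join_punctuation(_spelling))

slang('jangan ragu gan, langsung saja di order pajangannya.')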
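
The tokenize, replace, and re-join flow can also be sketched on its own, without the model or `source/slang.txt`. This is a minimal sketch under stated assumptions: `TOY_SLANG` and `normalize_slang` are hypothetical, the spell-correction step is left out, and punctuation is re-attached with a regular expression rather than a helper generator.

```python
import re

# Hypothetical slang entries for illustration; the real mapping lives in source/slang.txt.
TOY_SLANG = {"yg": "yang", "dgn": "dengan"}

def normalize_slang(text, slang_map=TOY_SLANG):
    """Replace slang tokens and glue punctuation back onto the preceding word."""
    # Same tokenizer as the notebook: words (with apostrophes) or single punctuation marks.
    tokens = re.findall(r"[\w']+|[.,!?;]", text)
    replaced = [slang_map.get(tok, tok) for tok in tokens]
    joined = " ".join(replaced)
    # Remove the space that the join inserted before each punctuation mark.
    return re.sub(r"\s+([.,!?;])", r"\1", joined)

print(normalize_slang("kucing yg makan dgn lahap, tidur."))
# kucing yang makan dengan lahap, tidur.
```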

# Synonyms

In [None]:
import json

with open('source/dict.json') as f:
    mydict = json.load(f)

In [None]:
def getSinonim(word):
    "Return the synonyms of `word`, or an empty list if none are recorded."
    return mydict.get(word, {}).get('sinonim', [])


def getAntonim(word):
    "Return the antonyms of `word`, or an empty list if none are recorded."
    return mydict.get(word, {}).get('antonim', [])

In [None]:
print(getSinonim('senang'))

['aman', 'bahagia', 'bangga', 'berbungabunga ', 'berkenan', 'bungah', 'camar', 'ceria', 'doyan', 'enak', 'gemar', 'gembira', 'girang', 'lapang dada', 'lega', 'makmur', 'meriah', 'nikmat', 'nyaman', 'puas', 'ria', 'riang', 'sejahtera', 'semarak', 'selesa', 'suka', 'sukacita', 'sukaria', 'tenang', 'tenteram']


In [None]:
print(getSinonim(getAntonim('senang')[0]))

['duka', 'getir ', 'gundah', 'lara', 'masygul', 'menyesak', 'merana', 'pedih', 'pilu', 'prihatin', 'sedu', 'susah hati', 'terharu', 'trenyuh']
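
The lookups above imply a dictionary shape in which each word maps to a `sinonim` list and, optionally, an `antonim` list. The sketch below illustrates that assumed shape with hypothetical entries; the real contents of `source/dict.json` are not shown in this notebook, and the function names are illustrative.

```python
# Hypothetical entries mirroring the structure the lookup functions expect.
toy_dict = {
    "senang": {"sinonim": ["bahagia", "gembira"], "antonim": ["sedih"]},
    "sedih": {"sinonim": ["duka", "pilu"]},  # no "antonim" key on purpose
}

def get_synonyms(word, dictionary=toy_dict):
    """Synonyms of `word`, or an empty list when the word or key is missing."""
    return dictionary.get(word, {}).get("sinonim", [])

def get_antonyms(word, dictionary=toy_dict):
    """Antonyms of `word`, or an empty list when the word or key is missing."""
    return dictionary.get(word, {}).get("antonim", [])

def synonyms_of_first_antonym(word, dictionary=toy_dict):
    """Chain the two lookups without failing when a word has no antonyms."""
    antonyms = get_antonyms(word, dictionary)
    return get_synonyms(antonyms[0], dictionary) if antonyms else []

print(synonyms_of_first_antonym("senang"))  # ['duka', 'pilu']
print(synonyms_of_first_antonym("sedih"))   # []
```

Guarding the chained lookup avoids the `IndexError` that indexing `[0]` on an empty antonym list would raise.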


# Sources
- https://github.com/louisowen6/NLP_bahasa_resources#pos-tagging
- https://medium.com/@yasirabd/spell-check-indonesia-menggunakan-pre-trained-fasttext-model-14e90a3f1ac0
- https://norvig.com/spell-correct.html