# Word-level language detection using rules

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/language-detection-words](https://github.com/huseinzol05/Malaya/tree/master/example/language-detection-words).
    
</div>

<div class="alert alert-warning">

This module relies on dictionaries, so expect out-of-vocabulary (OOV) words.
    
</div>

In [1]:
%%time
import malaya

CPU times: user 5.41 s, sys: 869 ms, total: 6.28 s
Wall time: 7.18 s


## Install pyenchant

Full installation steps are available at https://pyenchant.github.io/pyenchant/install.html
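
Before loading the model, it can help to confirm that `pyenchant` is importable. The snippet below is a minimal sketch and not part of Malaya; `has_module` is a hypothetical helper name. It assumes the package was installed with `pip install pyenchant`, which exposes the import name `enchant`. On some platforms the underlying Enchant C library must be installed separately, as the installation guide describes.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found.

    This only checks that the package is discoverable; it does not
    verify that the Enchant C library itself loads correctly.
    """
    return importlib.util.find_spec(name) is not None


if has_module("enchant"):
    print("pyenchant is available")
else:
    print("pyenchant not found; try `pip install pyenchant`")
```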

## Load model

```python
def substring_rules(**kwargs):
    """
    Detect EN, MS and OTHER languages in a string.

    EN word detection uses `pyenchant` from https://pyenchant.github.io/pyenchant/ and
    fastText https://fasttext.cc/docs/en/language-identification.html.

    MS word detection uses `malaya.text.function.is_malay` and
    fastText https://fasttext.cc/docs/en/language-identification.html.

    OTHER word detection uses fastText https://fasttext.cc/docs/en/language-identification.html.

    Returns
    -------
    result : malaya.model.rules.LanguageDict class
    """
```

In [2]:
model = malaya.language_detection.substring_rules()



## Predict

```python
def predict(self, words: List[str]):
    """
    Predict [EN, MS, NOT_LANG] on word level.
    This method assumes the string is already tokenized.

    Parameters
    ----------
    words: List[str]

    Returns
    -------
    result: List[str]
    """
```

In [3]:
string = 'saya suka chicken and fish pda hari isnin'
splitted = string.split()
list(zip(splitted, model.predict(splitted)))

[('saya', 'MS'),
 ('suka', 'MS'),
 ('chicken', 'EN'),
 ('and', 'EN'),
 ('fish', 'EN'),
 ('pda', 'EN'),
 ('hari', 'MS'),
 ('isnin', 'MS')]
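
Word-level labels are often easier to read once consecutive words with the same label are merged into spans. The following is a minimal sketch using only the standard library; `group_by_label` is a hypothetical helper, not a Malaya function, and the words and labels are copied from the output above rather than produced by the model.

```python
from itertools import groupby
from typing import List, Tuple


def group_by_label(words: List[str], labels: List[str]) -> List[Tuple[str, List[str]]]:
    """Merge consecutive words that share the same label into one span."""
    pairs = zip(words, labels)
    return [
        (label, [word for word, _ in run])
        for label, run in groupby(pairs, key=lambda pair: pair[1])
    ]


# Words and labels copied from the prediction output above.
words = ['saya', 'suka', 'chicken', 'and', 'fish', 'pda', 'hari', 'isnin']
labels = ['MS', 'MS', 'EN', 'EN', 'EN', 'EN', 'MS', 'MS']

spans = group_by_label(words, labels)
print(spans)
```

With a loaded model, `labels` would instead come from `model.predict(words)`.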

In [4]:
string = 'saya suka chicken and fish pda hari isnin, tarikh 22 mei'
splitted = string.split()
list(zip(splitted, model.predict(splitted)))

[('saya', 'MS'),
 ('suka', 'MS'),
 ('chicken', 'EN'),
 ('and', 'EN'),
 ('fish', 'EN'),
 ('pda', 'EN'),
 ('hari', 'MS'),
 ('isnin,', 'OTHERS'),
 ('tarikh', 'MS'),
 ('22', 'NOT_LANG'),
 ('mei', 'MS')]

In [5]:
string = 'saya suka chicken 🐔 and fish pda hari isnin, tarikh 22 mei'
splitted = string.split()
list(zip(splitted, model.predict(splitted)))

[('saya', 'MS'),
 ('suka', 'MS'),
 ('chicken', 'EN'),
 ('🐔', 'NOT_LANG'),
 ('and', 'EN'),
 ('fish', 'EN'),
 ('pda', 'EN'),
 ('hari', 'MS'),
 ('isnin,', 'OTHERS'),
 ('tarikh', 'MS'),
 ('22', 'NOT_LANG'),
 ('mei', 'MS')]
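
To summarize a prediction, the labels can be counted and non-language tokens such as numbers and emoji filtered out. This is a sketch under stated assumptions: `label_counts` and `drop_non_language` are hypothetical helpers, and the pairs are copied from the output above instead of calling the model.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def label_counts(pairs: Iterable[Pair]) -> Counter:
    """Count how many words received each label."""
    return Counter(label for _, label in pairs)


def drop_non_language(pairs: Iterable[Pair], ignored=('NOT_LANG',)) -> List[Pair]:
    """Keep only pairs whose label is not in `ignored`."""
    return [(word, label) for word, label in pairs if label not in ignored]


# (word, label) pairs copied from the prediction output above.
pairs = [
    ('saya', 'MS'), ('suka', 'MS'), ('chicken', 'EN'), ('🐔', 'NOT_LANG'),
    ('and', 'EN'), ('fish', 'EN'), ('pda', 'EN'), ('hari', 'MS'),
    ('isnin,', 'OTHERS'), ('tarikh', 'MS'), ('22', 'NOT_LANG'), ('mei', 'MS'),
]

counts = label_counts(pairs)
language_pairs = drop_non_language(pairs)
print(counts.most_common())
print(language_pairs)
```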

### Use malaya.preprocessing.Tokenizer

A proper tokenizer yields better word tokens because it separates punctuation from words.

In [10]:
string = 'Terminal 1 KKIA dilengkapi kemudahan 64 kaunter daftar masuk, 12 aero bridge selain mampu menampung 3,200 penumpang dalam satu masa.'

In [13]:
tokenizer = malaya.preprocessing.Tokenizer()
tokenized = tokenizer.tokenize(string)
tokenized

['Terminal',
 '1',
 'KKIA',
 'dilengkapi',
 'kemudahan',
 '64',
 'kaunter',
 'daftar',
 'masuk',
 ',',
 '12',
 'aero',
 'bridge',
 'selain',
 'mampu',
 'menampung',
 '3,200',
 'penumpang',
 'dalam',
 'satu',
 'masa',
 '.']

In [14]:
list(zip(tokenized, model.predict(tokenized)))

[('Terminal', 'MS'),
 ('1', 'NOT_LANG'),
 ('KKIA', 'CAPITAL'),
 ('dilengkapi', 'MS'),
 ('kemudahan', 'MS'),
 ('64', 'NOT_LANG'),
 ('kaunter', 'MS'),
 ('daftar', 'MS'),
 ('masuk', 'MS'),
 (',', 'NOT_LANG'),
 ('12', 'NOT_LANG'),
 ('aero', 'OTHERS'),
 ('bridge', 'EN'),
 ('selain', 'MS'),
 ('mampu', 'MS'),
 ('menampung', 'MS'),
 ('3,200', 'NOT_LANG'),
 ('penumpang', 'MS'),
 ('dalam', 'MS'),
 ('satu', 'MS'),
 ('masa', 'MS'),
 ('.', 'NOT_LANG')]

If the string is not properly tokenized, words with trailing punctuation such as `masuk,` and `masa.` are labeled `OTHERS`:

In [15]:
splitted = string.split()
list(zip(splitted, model.predict(splitted)))

[('Terminal', 'MS'),
 ('1', 'NOT_LANG'),
 ('KKIA', 'CAPITAL'),
 ('dilengkapi', 'MS'),
 ('kemudahan', 'MS'),
 ('64', 'NOT_LANG'),
 ('kaunter', 'MS'),
 ('daftar', 'MS'),
 ('masuk,', 'OTHERS'),
 ('12', 'NOT_LANG'),
 ('aero', 'OTHERS'),
 ('bridge', 'EN'),
 ('selain', 'MS'),
 ('mampu', 'MS'),
 ('menampung', 'MS'),
 ('3,200', 'NOT_LANG'),
 ('penumpang', 'MS'),
 ('dalam', 'MS'),
 ('satu', 'MS'),
 ('masa.', 'OTHERS')]
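
The difference comes from punctuation staying attached to words. As an illustration only, the sketch below splits punctuation with a regular expression; `simple_tokenize` is a hypothetical helper that handles simple cases like the sentence above and is not a replacement for `malaya.preprocessing.Tokenizer`.

```python
import re
from typing import List

# Order matters: numbers with thousands separators (e.g. 3,200) first,
# then runs of word characters, then any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\d+(?:,\d+)*|\w+|[^\w\s]")


def simple_tokenize(text: str) -> List[str]:
    """Split text into words, numbers and standalone punctuation marks."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize('daftar masuk, 12 aero bridge 3,200 penumpang dalam satu masa.')
print(tokens)
```

Tokens produced this way can be passed to `model.predict` in place of the output of `string.split()`.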