# Word-level language detection using enchant

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/language-detection-words](https://github.com/huseinzol05/Malaya/tree/master/example/language-detection-words).
    
</div>

<div class="alert alert-warning">

This module relies on a dictionary, so expect out-of-vocabulary (OOV) words.
    
</div>

In [1]:
%%time
import malaya

CPU times: user 6.56 s, sys: 1.34 s, total: 7.89 s
Wall time: 9.26 s


## Install pyenchant

See the full installation steps at https://pyenchant.github.io/pyenchant/install.html
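
After installing, a quick check like the one below can confirm that pyenchant imports and that an English dictionary is present. This is a minimal sketch: the helper name `enchant_status` is hypothetical, and the `en_US` tag is an assumption about which dictionary is needed.

```python
def enchant_status(tag: str = "en_US") -> str:
    """Report whether pyenchant and a dictionary for `tag` are usable.

    `tag` defaults to en_US, which is an assumption, not a documented requirement.
    """
    try:
        # pyenchant raises ImportError when the package or the
        # underlying enchant C library cannot be found.
        import enchant
    except ImportError:
        return "pyenchant not importable"
    if not enchant.dict_exists(tag):
        return f"no dictionary found for {tag}"
    return f"ok: {tag} dictionary available"


print(enchant_status())
```

If the check reports a missing dictionary, the installation page linked above describes how to add one for your platform.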

## Load model

```python
def substring_rules(**kwargs):
    """
    Detect EN and MS languages in a string. EN words are detected using `pyenchant` from https://pyenchant.github.io/pyenchant/
    The rule is simple: if a word is not detected as EN, it is assumed to be MS.

    Returns
    -------
    result : malaya.model.rules.LanguageDict class
    """
```

In [2]:
model = malaya.language_detection.substring_rules()

## Predict

```python
def predict(self, words: List[str]):
    """
    Predict [EN, MS, NOT_LANG] at the word level.
    This method assumes the string is already tokenized.

    Parameters
    ----------
    words: List[str]

    Returns
    -------
    result: List[str]
    """
```
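
To make the labelling rule concrete, here is a rough sketch of how such a word-level rule could work. It is not Malaya's implementation: `label_words`, `toy_is_english` and the tiny `TOY_ENGLISH` set are hypothetical stand-ins, and treating any token without letters as `NOT_LANG` is an assumption.

```python
from typing import Callable, List

# Tiny stand-in for an English dictionary; purely illustrative.
TOY_ENGLISH = {"chicken", "and", "fish"}


def toy_is_english(word: str) -> bool:
    """Hypothetical EN check backed by the toy set above."""
    return word.lower() in TOY_ENGLISH


def label_words(
    words: List[str],
    is_english: Callable[[str], bool] = toy_is_english,
) -> List[str]:
    """Label each pre-tokenized word as EN, MS or NOT_LANG."""
    labels = []
    for word in words:
        if not any(ch.isalpha() for ch in word):
            # Assumption: tokens with no letters (digits, emoji) are not language.
            labels.append("NOT_LANG")
        elif is_english(word):
            labels.append("EN")
        else:
            # Anything not recognised as EN falls back to MS.
            labels.append("MS")
    return labels


tokens = "saya suka chicken 🐔 and fish pda hari isnin, tarikh 22 mei".split()
print(list(zip(tokens, label_words(tokens))))
```

With pyenchant installed, passing `enchant.Dict("en_US").check` as `is_english` would swap the toy set for a real dictionary lookup. The fallback to MS is why OOV tokens such as abbreviations end up labelled MS.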

In [9]:
string = 'saya suka chicken and fish pda hari isnin'
splitted = string.split()
list(zip(splitted, model.predict(splitted)))

[('saya', 'MS'),
 ('suka', 'MS'),
 ('chicken', 'EN'),
 ('and', 'EN'),
 ('fish', 'EN'),
 ('pda', 'MS'),
 ('hari', 'MS'),
 ('isnin', 'MS')]

In [10]:
string = 'saya suka chicken and fish pda hari isnin, tarikh 22 mei'
splitted = string.split()
list(zip(splitted, model.predict(splitted)))

[('saya', 'MS'),
 ('suka', 'MS'),
 ('chicken', 'EN'),
 ('and', 'EN'),
 ('fish', 'EN'),
 ('pda', 'MS'),
 ('hari', 'MS'),
 ('isnin,', 'MS'),
 ('tarikh', 'MS'),
 ('22', 'NOT_LANG'),
 ('mei', 'MS')]

In [11]:
string = 'saya suka chicken 🐔 and fish pda hari isnin, tarikh 22 mei'
splitted = string.split()
list(zip(splitted, model.predict(splitted)))

[('saya', 'MS'),
 ('suka', 'MS'),
 ('chicken', 'EN'),
 ('🐔', 'NOT_LANG'),
 ('and', 'EN'),
 ('fish', 'EN'),
 ('pda', 'MS'),
 ('hari', 'MS'),
 ('isnin,', 'MS'),
 ('tarikh', 'MS'),
 ('22', 'NOT_LANG'),
 ('mei', 'MS')]
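
In the outputs above, `str.split()` leaves punctuation attached to words, as in `'isnin,'`. Since `predict` expects tokenized input, one option is to separate punctuation first. The sketch below uses a plain regular expression; `simple_tokenize` is a hypothetical helper, not part of Malaya.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    """Split text into runs of word characters and single non-space symbols."""
    # \w+ keeps words and numbers together; [^\w\s] emits each
    # punctuation mark or symbol as its own token.
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("pda hari isnin, tarikh 22 mei"))
```

The resulting list can be passed to `model.predict` in place of `string.split()`, so that `isnin` and the comma are classified separately.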