# Segmentation

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/segmentation](https://github.com/huseinzol05/Malaya/tree/master/example/segmentation).
    
</div>

<div class="alert alert-info">

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.
    
</div>

In [1]:
%%time

import malaya

CPU times: user 4.46 s, sys: 690 ms, total: 5.15 s
Wall time: 5.24 s


A common problem with social media text is missing spaces between words. Text segmentation can help restore them, for example:

1. huseinsukamakan ayam,dia sgtrisaukan -> husein suka makan ayam, dia sgt risaukan.
2. drmahathir sangat menekankan budaya budakzamansekarang -> dr mahathir sangat menekankan budaya budak zaman sekarang.
3. ceritatunnajibrazak -> cerita tun najib razak.
4. TunM sukakan -> Tun M sukakan.

Segmentation only:

1. Fixes spacing errors.
2. Does not correct grammar.
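Since segmentation is meant to change spacing only, one quick sanity check is to compare the input and output with all whitespace removed. The sketch below illustrates that idea; the helper name `only_spacing_changed` is hypothetical and not part of Malaya.

```python
def only_spacing_changed(original: str, segmented: str) -> bool:
    """Return True if both strings have the same characters once whitespace is removed."""
    return ''.join(original.split()) == ''.join(segmented.split())


# Input/output pairs taken from this tutorial.
ok = only_spacing_changed('ceritatunnajibrazak', 'cerita tun najib razak')
# This output carries extra trailing text, so the check fails.
repeated = only_spacing_changed('ceritatunnajibrazak', 'cerita tun najib razak cerita')
```

A check like this can flag outputs where a model added, dropped, or repeated characters instead of only inserting spaces.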

In [2]:
string1 = 'huseinsukamakan ayam,dia sgtrisaukan'
string2 = 'drmahathir sangat menekankan budaya budakzamansekarang'
string3 = 'ceritatunnajibrazak'
string4 = 'TunM sukakan'
string_hard = 'IPOH-AhliDewanUndangan Negeri(ADUN) HuluKinta, MuhamadArafat Varisai Mahamadmenafikanmesejtularmendakwa beliau akan melompatparti menyokong UMNO membentuk kerajaannegeridiPerak.BeliauyangjugaKetua Penerangan Parti Keadilan Rakyat(PKR)Perak dalam satumesejringkaskepadaSinar Harian menjelaskan perkara itutidakbenarsama sekali.'
string_socialmedia = 'aqxsukalah apeyg tejadidekat mamattu'

### Viterbi algorithm

The Viterbi algorithm is commonly used to solve this problem. Malaya also provides a Viterbi segmenter that uses n-grams from Bahasa papers and Wikipedia.
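To give a feel for the idea, here is a minimal sketch of Viterbi-style segmentation over a toy unigram table. It is not Malaya's implementation: the function `viterbi_segment`, the probabilities in `toy_probs`, and the unknown-word penalty are all assumptions made for illustration.

```python
import math


def viterbi_segment(text, unigram_probs, max_word_length=20, unknown_logprob=-20.0):
    """Split `text` into the most probable word sequence under a unigram model."""
    n = len(text)
    # best_score[i] is the best log-probability of any segmentation of text[:i].
    best_score = [0.0] + [-math.inf] * n
    # best_start[i] is where the last word of that best segmentation begins.
    best_start = [0] * (n + 1)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_length), end):
            word = text[start:end]
            if word in unigram_probs:
                logprob = math.log(unigram_probs[word])
            else:
                # Assumed penalty: unknown chunks cost a fixed amount per character.
                logprob = unknown_logprob * len(word)
            score = best_score[start] + logprob
            if score > best_score[end]:
                best_score[end] = score
                best_start[end] = start

    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = best_start[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Toy probabilities, invented for this example only.
toy_probs = {'cerita': 0.2, 'tun': 0.2, 'najib': 0.2, 'razak': 0.2, 'tu': 0.1, 'n': 0.1}
segmented = viterbi_segment('ceritatunnajibrazak', toy_probs)
```

With this toy table, `tun` scores higher than the split `tu` + `n`. The outcome depends entirely on the n-gram statistics, so a different table can prefer a different split.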

```python
def viterbi(max_split_length: int = 20, **kwargs):
    """
    Load Segmenter class using viterbi algorithm.

    Parameters
    ----------
    max_split_length: int, (default=20)
        max length of words in a sentence to segment
    validate: bool, optional (default=True)
        if True, malaya will check model availability and download if not available.

    Returns
    -------
    result : malaya.segmentation.SEGMENTER class
    """
```

In [4]:
viterbi = malaya.segmentation.viterbi()

#### Segmentize

```python
def segment(self, strings: List[str]):
    """
    Segment strings.
    Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """
```

In [5]:
%%time

viterbi.segment([string1, string2, string3, string4])

CPU times: user 109 ms, sys: 1.04 ms, total: 110 ms
Wall time: 110 ms


['husein suka makan ayam,dia sgt risau kan',
 'dr mahathir sangat mene kan kan budaya budak zaman sekarang',
 'cerita tu n najib razak',
 'Tun M suka kan']

In [6]:
%%time

viterbi.segment([string_hard, string_socialmedia])

CPU times: user 8.45 ms, sys: 157 µs, total: 8.6 ms
Wall time: 8.69 ms


['IPOH - Ahli Dewan Undangan Negeri(ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamadmenafikanmesejtularmendakwa belia u akan me lompat part i me nyo ko ng UMNO mem bentuk kerajaannegeridi Perak. Beliauyangjuga Ketua Penerangan Parti Keadilan Rakyat(PKR) Perak dalam satumesejringkaskepada Sinar Harian men jel ask an perkara it u tidak benar sama sekali.',
 'aq x suka lah ape yg te jadi dekat mama ttu']

### List available Transformer model

In [7]:
malaya.segmentation.available_transformer()

| Model | Size (MB) | Quantized Size (MB) | Sequence Accuracy |
|-------|-----------|---------------------|-------------------|
| small | 42.7 | 13.1 | 0.8217 |
| base | 234.0 | 63.8 | 0.8759 |


### Load Transformer model

```python
def transformer(model: str = 'small', quantized: bool = False, **kwargs):
    """
    Load transformer encoder-decoder model to Segmentize.

    Parameters
    ----------
    model : str, optional (default='small')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends on the machine.

    Returns
    -------
    result: malaya.model.tf.Segmentation class
    """
```

In [21]:
model = malaya.segmentation.transformer(model = 'small')
quantized_model = malaya.segmentation.transformer(model = 'small', quantized = True)



In [22]:
model_base = malaya.segmentation.transformer(model = 'base')
quantized_model_base = malaya.segmentation.transformer(model = 'base', quantized = True)



#### Predict using greedy decoder

```python
def greedy_decoder(self, strings: List[str]):
    """
    Segment strings using greedy decoder.
    Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """
```

In [10]:
%%time

model.greedy_decoder([string1, string2, string3, string4])

CPU times: user 1.12 s, sys: 432 ms, total: 1.55 s
Wall time: 959 ms


['husein suka makan ayam, dia sgt risaukan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan']

In [11]:
%%time

quantized_model.greedy_decoder([string1, string2, string3, string4])

CPU times: user 1.12 s, sys: 464 ms, total: 1.58 s
Wall time: 888 ms


['husein suka makan ayam, dia sgt risaukan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan']

In [12]:
%%time

model_base.greedy_decoder([string1, string2, string3, string4])

CPU times: user 5.58 s, sys: 2.88 s, total: 8.46 s
Wall time: 4.08 s


['husein suka makan ayam, dia sgt risaukan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak cerita',
 'Tun M sukakan Tun M sukakan']

In [13]:
%%time

quantized_model_base.greedy_decoder([string1, string2, string3, string4])

CPU times: user 5.73 s, sys: 2.96 s, total: 8.69 s
Wall time: 3.81 s


['husein suka makan ayam, dia sgt risaukan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak cerita tun',
 'Tun M sukakan Tun M sukakan']

In [14]:
%%time

model.greedy_decoder([string_hard, string_socialmedia])

CPU times: user 2.52 s, sys: 499 ms, total: 3.02 s
Wall time: 768 ms


['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg tejadid dekat mamat tu']

In [15]:
%%time

quantized_model.greedy_decoder([string_hard, string_socialmedia])

CPU times: user 2.62 s, sys: 447 ms, total: 3.07 s
Wall time: 756 ms


['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg tejadid dekat mamat tu']

In [16]:
%%time

model_base.greedy_decoder([string_hard, string_socialmedia])

CPU times: user 17.8 s, sys: 10.2 s, total: 28 s
Wall time: 5.84 s


['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg teja di dekat mamat tu aq xsukalah ape yg teja di dekat mamat tu']

In [17]:
%%time

quantized_model_base.greedy_decoder([string_hard, string_socialmedia])

CPU times: user 17.6 s, sys: 9.63 s, total: 27.3 s
Wall time: 5.85 s


['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg teja di dekat mamat tu aq xsukalah ape yg teja di dekat mamat tu']

**When strings are batched, a short string may repeat itself in the output. To avoid this, pass a single string per call**:

In [18]:
%%time

quantized_model_base.greedy_decoder([string_socialmedia])

CPU times: user 1.37 s, sys: 532 ms, total: 1.9 s
Wall time: 652 ms


['aq xsukalah ape yg teja di dekat mamat tu']

In [19]:
%%time

quantized_model_base.greedy_decoder([string3])

CPU times: user 648 ms, sys: 228 ms, total: 876 ms
Wall time: 289 ms


['cerita tun najib razak']

In [20]:
%%time

quantized_model_base.greedy_decoder([string4])

CPU times: user 495 ms, sys: 202 ms, total: 697 ms
Wall time: 225 ms


['Tun M sukakan']
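The single-string calls above can be wrapped in a small helper that sends one string per call and collects the results. This is only a sketch: `decode_one_by_one` and `fake_decoder` are hypothetical names, and the stand-in decoder exists so the example runs without loading a model.

```python
from typing import Callable, List


def decode_one_by_one(decode: Callable[[List[str]], List[str]], strings: List[str]) -> List[str]:
    """Call `decode` with a batch of size one for each string and collect the outputs."""
    results = []
    for text in strings:
        results.extend(decode([text]))
    return results


# Stand-in decoder (not a Malaya model) that records the batch size of each call.
calls = []


def fake_decoder(batch: List[str]) -> List[str]:
    calls.append(len(batch))
    return [text.upper() for text in batch]


outputs = decode_one_by_one(fake_decoder, ['a', 'bc'])
```

With a loaded model, the same helper could be called as `decode_one_by_one(quantized_model_base.greedy_decoder, [string3, string4])`, trading batch throughput for the per-string behaviour shown above.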

#### Predict using beam decoder

```python
def beam_decoder(self, strings: List[str]):
    """
    Segment strings using beam decoder, beam width size 3, alpha 0.5.
    Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """
```

In [11]:
%%time

quantized_model.beam_decoder([string_socialmedia])

CPU times: user 1.38 s, sys: 1.87 s, total: 3.25 s
Wall time: 654 ms


['aq xsukalah ape yg tejadid dekat mamat tu']

In [12]:
%%time

quantized_model_base.beam_decoder([string_socialmedia])

CPU times: user 6.77 s, sys: 3.71 s, total: 10.5 s
Wall time: 2.43 s


['aq xsukalah ape yg teja di dekat mamat tu']

**Expect the beam decoder to be much slower than the greedy decoder**.
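Outside a notebook, where `%%time` is not available, the two decoders can be compared with a small timing helper. The sketch below uses only the standard library; `time_decoder` is a hypothetical helper, and the stand-in decoder is there so the example runs without a model.

```python
import time


def time_decoder(decode, strings, repeats=3):
    """Return (last_result, best_wall_seconds) over `repeats` calls to `decode`."""
    best = float('inf')
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = decode(strings)
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in decoder (not a Malaya model), used only to make the sketch runnable.
result, seconds = time_decoder(lambda batch: [text.upper() for text in batch], ['abc'])
```

With loaded models, one could pass `quantized_model.greedy_decoder` and `quantized_model.beam_decoder` with the same input and compare the returned timings. Actual numbers depend on the machine.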