# Noisy MS to EN HuggingFace

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/noisy-ms-en-translation-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/noisy-ms-en-translation-huggingface).
    
</div>

<div class="alert alert-warning">

This module is trained on standard language and augmented local language structures; proceed with caution.
    
</div>

<div class="alert alert-warning">

The HuggingFace interface requires TensorFlow >= 2.0.
    
</div>

In [1]:
%%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)

CPU times: user 5.67 s, sys: 1.03 s, total: 6.7 s
Wall time: 8.5 s


### List available HuggingFace models

In [2]:
malaya.translation.ms_en.available_huggingface()

INFO:malaya.translation.ms_en:tested on 100k MS-EN test set generated from teacher semisupervised model, https://huggingface.co/datasets/mesolitica/ms-en
INFO:malaya.translation.ms_en:tested on FLORES200 MS-EN (zsm_Latn-eng_Latn) pair, https://github.com/facebookresearch/flores/tree/main/flores200


| Model | Size (MB) | BLEU | SacreBLEU Verbose | SacreBLEU-chrF++-FLORES200 | Suggested length |
|---|---|---|---|---|---|
| mesolitica/t5-super-tiny-finetuned-noisy-ms-en | 50.8 | 59.928971 | 79.8/64.0/54.1/46.6 (BP = 1.000 ratio = 1.008 ... | 59.12 | 256 |
| mesolitica/t5-tiny-finetuned-noisy-ms-en | 139.0 | 65.906915 | 83.0/69.3/60.7/54.1 (BP = 1.000 ratio = 1.001 ... | 59.91 | 256 |
| mesolitica/t5-small-finetuned-noisy-ms-en | 242.0 | 63.806657 | 82.1/67.5/58.3/51.3 (BP = 1.000 ratio = 1.001 ... | 62.6 | 256 |
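
The figures above can guide model selection. The following is a minimal sketch, not part of Malaya: `MODELS` simply copies the size, BLEU and chrF++ values from the table, and `pick_model` is a hypothetical helper that returns the best-scoring model that fits a size budget.

```python
# Values copied from the table above: (size in MB, BLEU, chrF++ on FLORES200).
MODELS = {
    'mesolitica/t5-super-tiny-finetuned-noisy-ms-en': (50.8, 59.928971, 59.12),
    'mesolitica/t5-tiny-finetuned-noisy-ms-en': (139.0, 65.906915, 59.91),
    'mesolitica/t5-small-finetuned-noisy-ms-en': (242.0, 63.806657, 62.6),
}


def pick_model(max_size_mb, metric='chrf'):
    """Hypothetical helper: best-scoring model no larger than max_size_mb."""
    index = {'bleu': 1, 'chrf': 2}[metric]
    candidates = {
        name: row for name, row in MODELS.items() if row[0] <= max_size_mb
    }
    if not candidates:
        raise ValueError(f'no model fits within {max_size_mb} MB')
    return max(candidates, key=lambda name: candidates[name][index])


print(pick_model(150))                 # mesolitica/t5-tiny-finetuned-noisy-ms-en
print(pick_model(300, metric='bleu'))  # mesolitica/t5-tiny-finetuned-noisy-ms-en
print(pick_model(300, metric='chrf'))  # mesolitica/t5-small-finetuned-noisy-ms-en
```

Note that the two metrics disagree on the best model: the tiny model has the highest BLEU, while the small model has the highest chrF++ on FLORES200.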


### Load Transformer models

```python
def huggingface(model: str = 'mesolitica/t5-tiny-finetuned-noisy-ms-en', **kwargs):
    """
    Load HuggingFace model to translate MS-to-EN.

    Parameters
    ----------
    model : str, optional (default='mesolitica/t5-tiny-finetuned-noisy-ms-en')
        Model to load. Allowed values:

        * ``'mesolitica/t5-super-tiny-finetuned-noisy-ms-en'`` - https://huggingface.co/mesolitica/t5-super-tiny-finetuned-noisy-ms-en
        * ``'mesolitica/t5-tiny-finetuned-noisy-ms-en'`` - https://huggingface.co/mesolitica/t5-tiny-finetuned-noisy-ms-en
        * ``'mesolitica/t5-small-finetuned-noisy-ms-en'`` - https://huggingface.co/mesolitica/t5-small-finetuned-noisy-ms-en

    Returns
    -------
    result: malaya.model.huggingface.Generator
    """
```

In [10]:
transformer = malaya.translation.ms_en.transformer()

INFO:malaya_boilerplate.frozen_graph:running Users/huseinzolkepli/.cache/huggingface/hub using device /device:CPU:0


In [3]:
transformer_noisy = malaya.translation.ms_en.huggingface(model = 'mesolitica/t5-small-finetuned-noisy-ms-en')

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at mesolitica/t5-small-finetuned-noisy-ms-en.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


### Translate

```python
def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: keyword arguments passed to the HuggingFace `generate` method.

    Returns
    -------
    result: List[str]
    """
```
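
Because `generate` takes a list of strings, a long list can be fed to it in smaller batches to keep memory use bounded. The sketch below is an assumption about usage rather than a Malaya feature: `batched` is a hypothetical helper, and the commented lines assume a model has already been loaded as `transformer_noisy`.

```python
from typing import Iterator, List


def batched(strings: List[str], batch_size: int = 4) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    for start in range(0, len(strings), batch_size):
        yield strings[start:start + batch_size]


batches = list(batched(['a', 'b', 'c', 'd', 'e'], batch_size=2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]

# With a loaded model, each batch could be translated in turn:
# results = []
# for batch in batched(strings, batch_size=8):
#     results.extend(transformer_noisy.generate(batch, max_length=256))
```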

**For better results, always split the text at sentence boundaries before translating.**
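
One way to follow this advice is a simple regular-expression splitter. This is a rough sketch, not Malaya's own tokenizer: `split_sentences` is a hypothetical helper, the sample text is made up for illustration, and the rule is naive (it would also split after abbreviations that end with a full stop).

```python
import re
from typing import List

# Split on whitespace that follows '.', '!' or '?'.
# Naive rule: abbreviations ending in a full stop are also treated as sentence ends.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def split_sentences(text: str) -> List[str]:
    """Hypothetical helper: return the non-empty sentences found in text."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


text = 'Saya suka makan nasi. Awak pula suka apa? Jom makan!'
sentences = split_sentences(text)
print(sentences)
# ['Saya suka makan nasi.', 'Awak pula suka apa?', 'Jom makan!']

# With a loaded model, the sentences could then be translated as one batch
# and joined back together:
# translated = transformer_noisy.generate(sentences, max_length=256)
# print(' '.join(translated))
```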

In [5]:
from pprint import pprint

In [6]:
# https://www.sinarharian.com.my/article/89678/BERITA/Politik/Saya-tidak-mahu-sentuh-isu-politik-Muhyiddin

string_news1 = 'TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.'
pprint(string_news1)

('TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh '
 'mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal '
 'kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas '
 'berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika '
 'berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan '
 'Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.')


In [7]:
# https://www.sinarharian.com.my/article/90021/BERITA/Politik/Tun-Mahathir-Anwar-disaran-bersara-untuk-selesai-kemelut-politik

string_news2 = 'ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.'
pprint(string_news2)

('ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila '
 'masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. '
 'Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya '
 'mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun '
 'Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri '
 'Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.')


In [8]:
string_news3 = 'Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan lanjutan tempoh.'
pprint(string_news3)

('Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, '
 'kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi '
 'mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat '
 'asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) '
 'pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan '
 'lanjutan tempoh.')


In [9]:
# https://qcikgubm.blogspot.com/2018/02/contoh-soalan-dan-jawapan-karangan.html

string_karangan = 'Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. Setelah menyedari  hakikat ini, para pelajar akan lebih berminat untuk menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat membantu memberikan pengetahuan am tentang kerjaya ini'
pprint(string_karangan)

('Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang '
 'akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di '
 'Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang '
 'masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar '
 'berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi '
 'masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar '
 'disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki '
 'sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. '
 'Setelah menyedari  hakikat ini, para pelajar akan lebih berminat untuk '
 'menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat '
 'membantu memberikan pengetahuan am tentang kerjaya ini')


In [12]:
%%time

pprint(transformer_noisy.generate([string_news1, string_news2, string_news3, string_karangan],
                                 max_length = 1000))

['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues at this time, instead focusing on the welfare of the people '
 "and efforts to boost the country's economy affected by the Covid-19 "
 "pandemic. The prime minister explained this when addressing a Leaders' "
 'Meeting with Gambir State Assembly (assembly) community leaders at the Bukit '
 'Gambir Multipurpose Hall today.',
 'ALOR SETAR - Pakatan Harapan (PH) political turmoil has not ended as it '
 "failed to finalize the agreed Prime Minister's candidate. Sik MP Ahmad "
 "Tarmizi Sulaiman said in this regard he had suggested former United People's "
 "Party (UN) chairman Tun Dr Mahathir Mohamad and People's Justice Party (PKR) "
 'president Datuk Seri Anwar Ibrahim resign from politics as a solution.',
 'Senior Minister (Security cluster) Datuk Seri Ismail Sabri Yaakob said the '
 'relaxation was given as the government was aware of the problems they were '
 'facing to renew the document. He ad

### Compare results using local language structure

In [6]:
strings = [
    'ak tak paham la',
    'jam 8 di pasar KK memang org ramai 😂, pandai dia pilih tmpt.',
    'Jadi haram jadah😀😃🤭',
    'nak gi mana tuu',
    'Macam nak ambil half day',
    "Bayangkan PH dan menang pru-14. Pastu macam-macam pintu belakang ada. Last-last Ismail Sabri naik. That's why I don't give a fk about politics anymore. Sumpah dah fk up dah.",
]

In [9]:
%%time

transformer_noisy.generate(strings, max_length = 1000)

CPU times: user 18.2 s, sys: 2.56 s, total: 20.8 s
Wall time: 8.86 s


["I don't understand.",
 'At 8pm in the KK market is a public, good at choosing a place.',
 'So illegal spit ',
 'where is it?',
 "It's like taking half a day",
 "Imagine PH and won pru-14. So all sorts of back doors are there. Last-last Ismail Sabri went up. That's why I don't give a fk about politics anymore. Oath is going up."]

In [11]:
%%time

transformer.greedy_decoder(strings)

CPU times: user 20.3 s, sys: 6.72 s, total: 27.1 s
Wall time: 17.4 s


["I don't understand it",
 "At 8 o'clock in the KK market, he is good at choosing tmpt.",
 "So it's illegal",
 'Where to go',
 'Like taking half day',
 "Imagine PH and winning pru-14. There are so many back doors available. Last-last Ismail Sabri went up. That's why I don't give a fk about politics anymore. The swear is fk up."]

### Compare with Google Translate using googletrans

Install it with:

```bash
pip3 install googletrans==4.0.0rc1
```

In [12]:
from googletrans import Translator

translator = Translator()

In [13]:
for t in strings:
    r = translator.translate(t, src='ms', dest = 'en')
    print(r.text)

I don't understand
At 8 o'clock in the KK market is a lot of people 😂, he's good at choosing TMPT.
So it's illegal to make it
Where are you going
It's like taking half day
Imagine PH and won the GE-14.There must be all kinds of back doors.Last-last Ismail Sabri went up.That's why I don't give a fk about politics anymore.I swear it's up.
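
To make the three systems easier to compare, their outputs can be printed next to each source string. The sketch below copies a few of the shorter outputs shown in this tutorial; `side_by_side` is a hypothetical helper and the system labels are chosen here for illustration.

```python
# A few source strings and the outputs shown earlier in this tutorial.
sources = ['ak tak paham la', 'nak gi mana tuu', 'Macam nak ambil half day']
outputs = {
    'noisy HuggingFace': ["I don't understand.", 'where is it?', "It's like taking half a day"],
    'transformer': ["I don't understand it", 'Where to go', 'Like taking half day'],
    'googletrans': ["I don't understand", 'Where are you going', "It's like taking half day"],
}


def side_by_side(sources, outputs):
    """Hypothetical helper: list each source string with every system's output."""
    lines = []
    for i, source in enumerate(sources):
        lines.append(f'MS: {source}')
        for system, texts in outputs.items():
            lines.append(f'  {system:<18} {texts[i]}')
    return '\n'.join(lines)


report = side_by_side(sources, outputs)
print(report)
```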
