# MS to EN Noisy HuggingFace

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/noisy-ms-en-translation-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/noisy-ms-en-translation-huggingface).
    
</div>

<div class="alert alert-warning">

This module is trained on standard language and augmented local language structures; proceed with caution.
    
</div>

In [1]:
%%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)

CPU times: user 4.04 s, sys: 3.13 s, total: 7.17 s
Wall time: 3.56 s


### List available HuggingFace models

In [2]:
malaya.translation.ms_en.available_huggingface()

INFO:malaya.translation.ms_en:tested on FLORES200 EN-MS (eng_Latn-zsm_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200
INFO:malaya.translation.ms_en:for noisy, tested on noisy twitter google translation, https://huggingface.co/datasets/mesolitica/augmentation-test-set


| Model | Size (MB) | BLEU | SacreBLEU Verbose | SacreBLEU-chrF++-FLORES200 | Suggested length |
|---|---|---|---|---|---|
| mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased-v2 | 139 | 37.260485 | 68.3/44.1/30.5/21.4 (BP = 0.995 ratio = 0.995 ... | 61.29 | 256 |
| mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2 | 242 | 42.010218 | 71.7/49.0/35.6/26.1 (BP = 0.989 ratio = 0.989 ... | 64.67 | 256 |
| mesolitica/finetune-translation-t5-base-standard-bahasa-cased-v2 | 892 | 43.408853 | 72.3/50.5/37.1/27.7 (BP = 0.987 ratio = 0.987 ... | 65.44 | 256 |
| mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v3 | 139 | 60.000967 | 77.9/63.9/54.6/47.7 (BP = 1.000 ratio = 1.036 ... | | 256 |
| mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3 | 242 | 64.062582 | 80.1/67.7/59.1/52.5 (BP = 1.000 ratio = 1.042 ... | | 256 |
| mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v2 | 892 | 64.583819 | 80.2/68.1/59.8/53.2 (BP = 1.000 ratio = 1.048 ... | | 256 |
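
The sizes and BLEU scores listed above can help when choosing a checkpoint for a constrained environment. The following is a minimal sketch, not part of Malaya: the helper `best_model_under` and the constant `NOISY_MODELS` are hypothetical names, and the rows are copied by hand from the noisy models in the table. Note that the noisy and standard models are scored on different test sets, so their BLEU values should not be compared with each other.

```python
from typing import List, Optional, Tuple

# Rows copied from the table above: (model name, size in MB, BLEU).
# Only the noisy checkpoints are listed, since they share one test set.
NOISY_MODELS: List[Tuple[str, int, float]] = [
    ("mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v3", 139, 60.000967),
    ("mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3", 242, 64.062582),
    ("mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v2", 892, 64.583819),
]


def best_model_under(
    models: List[Tuple[str, int, float]], max_size_mb: int
) -> Optional[str]:
    """Return the highest-BLEU model whose size fits the budget, or None."""
    candidates = [m for m in models if m[1] <= max_size_mb]
    if not candidates:
        return None
    # Pick the candidate with the largest BLEU score (third field).
    return max(candidates, key=lambda m: m[2])[0]


print(best_model_under(NOISY_MODELS, 300))
```

The returned name can then be passed as the `model` argument when loading a checkpoint.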


### Load Transformer models

```python
def huggingface(
    model: str = 'mesolitica/finetune-translation-t5-small-standard-bahasa-cased',
    force_check: bool = True,
    **kwargs,
):
    """
    Load HuggingFace model to translate MS-to-EN.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2')
        Check available models at `malaya.translation.ms_en.available_huggingface()`.
    force_check: bool, optional (default=True)
        Force check that the model is one of the Malaya models.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya.torch_model.huggingface.Generator
    """
```

In [4]:
transformer = malaya.translation.ms_en.huggingface()

In [5]:
transformer_noisy = malaya.translation.ms_en.huggingface(model = 'mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3')

### Translate

```python
def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: keyword arguments passed to the HuggingFace `generate` method.

    Returns
    -------
    result: List[str]
    """
```

**For better results, always split the text at sentence boundaries before translating.**
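
One way to follow this advice is sketched below. It is an illustration under stated assumptions, not Malaya's own splitter: `split_sentences` and `translate_by_sentence` are hypothetical helpers, and the regular expression is deliberately naive (it splits after `.`, `!` or `?` followed by whitespace, so abbreviations such as `Dr.` would be split incorrectly). The `generate` argument is any callable that maps a list of strings to a list of translations.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_by_sentence(
    text: str, generate: Callable[[List[str]], List[str]]
) -> str:
    """Translate each sentence separately, then join the results with spaces."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    return " ".join(generate(sentences))


# Stand-in for a real model so the sketch runs without downloading weights.
def fake_generate(batch: List[str]) -> List[str]:
    return [s.upper() for s in batch]


print(split_sentences("Saya suka makan. Awak pula? Jom!"))
print(translate_by_sentence("Saya suka makan. Awak pula?", fake_generate))
```

With a loaded model, the stand-in could be replaced by something like `lambda batch: transformer_noisy.generate(batch, max_length = 256)`, where 256 is the suggested length from the model table.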

In [6]:
from pprint import pprint

In [7]:
# https://www.sinarharian.com.my/article/89678/BERITA/Politik/Saya-tidak-mahu-sentuh-isu-politik-Muhyiddin

string_news1 = 'TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.'
pprint(string_news1)

('TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh '
 'mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal '
 'kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas '
 'berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika '
 'berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan '
 'Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.')


In [8]:
# https://www.sinarharian.com.my/article/90021/BERITA/Politik/Tun-Mahathir-Anwar-disaran-bersara-untuk-selesai-kemelut-politik

string_news2 = 'ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.'
pprint(string_news2)

('ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila '
 'masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. '
 'Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya '
 'mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun '
 'Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri '
 'Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.')


In [9]:
string_news3 = 'Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan lanjutan tempoh.'
pprint(string_news3)

('Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, '
 'kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi '
 'mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat '
 'asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) '
 'pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan '
 'lanjutan tempoh.')


In [10]:
# https://qcikgubm.blogspot.com/2018/02/contoh-soalan-dan-jawapan-karangan.html

string_karangan = 'Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. Setelah menyedari  hakikat ini, para pelajar akan lebih berminat untuk menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat membantu memberikan pengetahuan am tentang kerjaya ini'
pprint(string_karangan)

('Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang '
 'akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di '
 'Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang '
 'masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar '
 'berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi '
 'masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar '
 'disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki '
 'sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. '
 'Setelah menyedari  hakikat ini, para pelajar akan lebih berminat untuk '
 'menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat '
 'membantu memberikan pengetahuan am tentang kerjaya ini')


In [11]:
%%time

pprint(transformer_noisy.generate([string_news1, string_news2, string_news3, string_karangan],
                                 max_length = 1000))

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues for now, but instead wanted to focus on the welfare of the '
 "people as well as efforts to revive the country's economy which was affected "
 'by the Covid-19 pandemic. The Prime Minister explained the matter when '
 "speaking at the Leaders' Meeting with community leaders of the Gambir State "
 'Legislative Assembly (DUN) at the Bukit Gambir Multipurpose Hall today.',
 'ALOR SETAR - The Pakatan Harapan (PH) political crisis has not ended when it '
 'still fails to finalize the Prime Minister candidate who was mutually agreed '
 'upon. Sik Member of Parliament, Ahmad Tarmizi Sulaiman said, in this regard, '
 'he suggested that the former Chairman of Parti Pribumi Bersatu Malaysia '
 '(Bersatu), Tun Dr Mahathir Mohamad and the President of Parti Keadilan '
 'Rakyat (PKR), Datuk Seri Anwar Ibrahim resign from politics as a solution.',
 'Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Ya

### Compare results using local language structure

In [12]:
strings = [
    'ak tak paham la',
    'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:',
    "Memanglah. Ini tak payah expert, aku pun tau. It's a gesture, bodoh.",
    'jam 8 di pasar KK memang org ramai 😂, pandai dia pilih tmpt.',
    'Jadi haram jadah😀😃🤭',
    'nak gi mana tuu',
    'Macam nak ambil half day',
    "Bayangkan PH dan menang pru-14. Pastu macam-macam pintu belakang ada. Last-last Ismail Sabri naik. That's why I don't give a fk about politics anymore. Sumpah dah fk up dah.",
]

In [13]:
%%time

transformer.generate(strings, max_length = 1000)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


CPU times: user 11.2 s, sys: 17.8 ms, total: 11.2 s
Wall time: 997 ms


["I don't understand",
 'Hi guys! I noticed yesterday & there are many people who got this cookies, right. So dayni I want to share some post mortem of our first batch:',
 "Indeed. This doesn't bother expert, I also know. It's a gesture, stupid.",
 'at 8 at the KK market, there are many people, he is good at choosing tmpt.',
 "So it's illegal jadah",
 'want to go where is it',
 'Like taking half day',
 "Imagine PH and won pru-14. There are all kinds of back doors. Last-last Ismail Sabri went up. That's why I don't give a fk about politics anymore. The oath is fk up."]

In [14]:
%%time

transformer_noisy.generate(strings, max_length = 1000)

CPU times: user 9.36 s, sys: 5.6 ms, total: 9.36 s
Wall time: 809 ms


["I don't understand",
 'Hi guys! I noticed yesterday & today many people got these cookies, right? So today I want to share some post mortem of our first batch:',
 "Indeed. This doesn't need an expert, I know too. It's a gesture, stupid.",
 "at 8 o'clock at the KK market there are many people, he is good at choosing a place.",
 "So it's illegal ",
 'where do you want to go?',
 "It's like taking half a day",
 "Imagine PAKATAN HARAPAN and winning pru-14. After that there are various back doors. Last-last Ismail Sabri went up. That's why I don't give a fk about politics anymore. I swear it's already up."]
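
Reading two long output lists against each other is error-prone, so it can help to line the translations up per input. The helper below is a small sketch for that purpose; `side_by_side` is a hypothetical name, not a Malaya function, and the sample data is copied from the first input and its two outputs above.

```python
from typing import Dict, List


def side_by_side(sources: List[str], outputs: Dict[str, List[str]]) -> str:
    """Format each source string followed by every system's translation of it."""
    for name, translations in outputs.items():
        if len(translations) != len(sources):
            raise ValueError(f"{name!r} has {len(translations)} outputs, "
                             f"expected {len(sources)}")
    lines = []
    for i, src in enumerate(sources):
        lines.append(f"[{i}] source: {src}")
        for name, translations in outputs.items():
            lines.append(f"    {name}: {translations[i]}")
    return "\n".join(lines)


# Sample data taken from the outputs shown above.
sample_sources = ["ak tak paham la"]
sample_outputs = {
    "standard": ["I don't understand"],
    "noisy": ["I don't understand"],
}
print(side_by_side(sample_sources, sample_outputs))
```

In the notebook, the same call could receive `strings` together with the lists returned by `transformer.generate` and `transformer_noisy.generate`.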

### Compare with Google Translate using googletrans

Install it with:

```bash
pip3 install googletrans==4.0.0rc1
```

In [15]:
from googletrans import Translator

translator = Translator()

In [16]:
for t in strings:
    r = translator.translate(t, src='ms', dest = 'en')
    print(r.text)

I don't understand
Hi guys!I noticed yesterday & today many have got these cookies.So today I want to share some post mortem of our first batch:
That's it.This is not an expert, I know.It's a gesture, stupid.
At 8 o'clock in the KK market is a lot of people 😂, he's good at choosing TMPT.
So it's illegal to make it
Where are you going
It's like taking half day
Imagine PH and won the GE-14.There must be all kinds of back door.Last-last Ismail Sabri went up.That's why I don't give a fk about politics anymore.I swear it's up.
