# Generate Synthetic Malay Gaming Dataset

We attempt to generate audio files based on 3646 gaming transcripts that are generated via ChatGPT 3.5.

### Considerations:

- Gaming terms are typically English words.
- Malay gaming communities use gaming terms in their original form and rarely translate them into Malay.
- Our existing VITS model is trained purely on phonemized Malay text from news, Wikipedia and parliament, which is not relevant to a gaming setting.
- To our knowledge, no dataset provides text-audio pairs relevant to gaming. We therefore see a need to generate a synthetic Malay gaming dataset to supplement our VITS model training.

### Methodology:

Continuing from `gaming_terms.ipynb`, we noted that the VITS model we trained cannot generate comprehensible audio for all the gaming terms, likely because it was trained only on Malay words from Wikipedia, news and parliament (from osman news and parliament dataset).

As such, we adopted the alternative of using the [Malaya Speech VITS model](https://malaya-speech.readthedocs.io/en/latest/tts-vits.html), specifically:

- Use [Malaya Speech VITS model](https://malaya-speech.readthedocs.io/en/latest/tts-vits.html) to infer the gaming text directly.
- Manually amend the characters so that comprehensible audio files are generated.

Steps are as follows:

<style>
    table {
        width: 70%;
        table-layout:fixed;
    }
    th:nth-child(1),
    td:nth-child(1) {
        width: 3%;
    }
    th:nth-child(2),
    td:nth-child(2) {
        width: 17%;
    }
    th:nth-child(3),
    td:nth-child(3) {
        width: 80%;
    }
</style>

| S/N | Step | Description |
| -- | -- | -- |
| 1. | Load ChatGPT transcripts | &bull; Load 3646 gaming transcripts generated by ChatGPT 3.5. |
| 2. | Check for Gaming Terms | &bull; Check the number of transcripts that do not contain any gaming terms. <br>&bull; Generate audio for transcripts that do not contain any gaming terms. |
| 3. | Extract English Terms | &bull; Check for English terms found in transcripts despite prompting ChatGPT to produce only Malay words. <br>&bull; Generate `gpt_mapping` dictionary to map English terms that are not found in the existing `mapping` dictionary and combined gaming terms. |
| 4. | Normalize GPT Corpus | &bull; Normalize gaming terms via `mapping` dictionary in `gaming.yaml`. <br>&bull; Normalize remaining English terms via `gpt_mapping` dictionary. |
| 5. | Generate Audio for GPT Corpus | &bull; Generate audio for GPT Corpus via batch processing to avoid memory overflow. |
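
Step 5 relies on batching to keep memory bounded. The sketch below illustrates the idea only: `iter_batches` is a hypothetical helper and the batch size of 500 is an illustrative value. Neither is taken from `gen_gpt_audio`, whose implementation is not shown here.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical stand-ins for the 3646 transcripts; 500 is an illustrative batch size.
transcripts = [f"transcript {i}" for i in range(3646)]
batches = list(iter_batches(transcripts, batch_size=500))
print(len(batches), len(batches[-1]))
```

Only one batch needs to be held by the audio generator at a time, so peak memory depends on the batch size rather than on the corpus size.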


# Environment Setup

The following additional Python modules are required to run this Jupyter notebook:

```
translators==5.8.9
eng_to_ipa==0.0.2
bs4==0.0.1
nemo_text_processing==0.2.2rc0
herpetologist==0.0.9
malaya==5.0
malaya_speech==1.3.0.2
omegaconf==2.3.0
whisper-openai==1.0.0
pandarallel==1.6.5
polyglot==16.7.4
PySastrawi==1.2.0
```

ICU (International Components for Unicode) is required by polyglot to operate. PyICU is a Python extension implemented in C++ that wraps the C/C++ ICU library. On Ubuntu, install the ICU development package and then build PyICU against it:

```
sudo apt-get install pkg-config libicu-dev
pip install --no-binary=:pyicu: pyicu
```

Please refer to [PyICU - PyPI](https://pypi.org/project/PyICU/) for installation instructions for other operating systems.

# Import Libraries

In [2]:
import pandas as pd
from pathlib import Path
import eng_to_ipa
import numpy as np
from omegaconf import OmegaConf
import shutil
import sys
import logging
from pandarallel import pandarallel
import IPython.display as ipd
import herpetologist
import malaya
from collections import Counter
import string
import csv
import parselmouth
from ast import literal_eval
import re
import time
from phonemizer.backend import EspeakBackend


# Append `tts-melayu` folder to sys path
sys.path.append(Path.cwd().parent.as_posix())

%load_ext autoreload
%autoreload 2
from notebooks.src_gaming.gaming_utils import (
    gen_audio, gen_filelist, normalize_gaming,
)
from notebooks.src_gaming.gaming_gpt_utils import (
    read_gpt, extract_en_words,
    extract_gaming, normalize_gpt,
    gen_gaming_counter, gen_gpt_audio,
    gen_en_counter, gen_new_en, append_translated,
    normalize_gpt_phon,
    display_yaml, detect_language, expand_compound,
    phonemize_gpt, create_dict
)
from src.vits.text.cleaners import malay_cleaners2, english_cleaners2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Load Configurations

In [3]:
# Load configurable parameters in gaming.yaml
cfg = OmegaConf.load("/home/ckabundant/Documents/tts-melayu/notebooks/src_gaming/gaming.yaml")
# args = OmegaConf.load(cfg.paths.gpt_yaml_path)

# Initialize pandarallel
pandarallel.initialize(progress_bar=True, nb_workers=cfg.general.num_workers)

cfg.paths

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


{'repo_dir': '/home/ckabundant/Documents/tts-melayu', 'gaming_dir': '${paths.repo_dir}/data/gaming', 'gaming_path': '${paths.gaming_dir}/dataframe/gaming_terms.csv', 'additional_path': '${paths.gaming_dir}/vocab/additional_term.txt', 'updated_path': '${paths.gaming_dir}/dataframe/updated_terms.csv', 'combined_path': '${paths.gaming_dir}/dataframe/combined_gaming_terms.csv', 'special_path': '${paths.gaming_dir}/vocab/special.txt', 'diff_path': '${paths.gaming_dir}/dataframe/stats_diff.csv', 'same_path': '${paths.gaming_dir}/dataframe/stats_same.csv', 'oov_path': '${paths.gaming_dir}/dataframe/oov.csv', 'acr_path': '${paths.gaming_dir}/dataframe/acr.csv', 'cer_path': '${paths.gaming_dir}/dataframe/cer.csv', 'malaya_norm_dir': '${paths.gaming_dir}/malaya_vits_n/wav', 'malaya_unnorm_dir': '${paths.gaming_dir}/malaya_vits_un/wav', 'gaming_gt_path': '${paths.gaming_dir}/malaya_vits_n/gaming_gt.csv', 'gpt_path': '${paths.gaming_dir}/vocab/gpt.txt', 'gpt_csv_path': '${paths.gaming_dir}/datafra

# Table of Contents <a id='home'></a>
1. &nbsp;&nbsp; [Load GPT Transcripts](#load_gpt)
2. &nbsp;&nbsp; [Gaming Terms Analysis](#gaming_analysis)
    - Section 2.1 &nbsp;&nbsp; [Extract Gaming Terms used](#extract_gaming)
    - Section 2.2 &nbsp;&nbsp; [Identify Gaming Terms Used](#gaming_used)
    - Section 2.3 &nbsp;&nbsp; [Identify Transcripts without Gaming Terms](#no_gaming)
3. &nbsp;&nbsp; [English Words Analysis](#english_gpt)
    - Section 3.1 &nbsp;&nbsp; [Extract English Terms used](#extract_english)
    - Section 3.2 &nbsp;&nbsp; [Identify Unique English Words](#unique_english)
    - Section 3.3 &nbsp;&nbsp; [Translate Unique English Words](#translate_unique)
    - Section 3.4 &nbsp;&nbsp; [Identify Untranslated English Words](#untranslated_unique)
    - Section 3.5 &nbsp;&nbsp; [Identify Wrong Translations](#wrong_translation)
4. &nbsp;&nbsp; [Generate GPT Mapping](#gen_gpt_mapping)
    - Section 4.1 &nbsp;&nbsp; [Generate Audio for Unique English Words](#gen_audio_unique)
    - Section 4.2 &nbsp;&nbsp; [Map to Malay Words](#map_to_malay)
    - Section 4.3 &nbsp;&nbsp; [Map to Modified](#map_to_modified)
5. &nbsp;&nbsp; [Normalize GPT Corpus](#normalize_gpt_corpus)
    - Section 5.1 &nbsp;&nbsp; [Normalize Gaming Terms](#norm_gaming)
    - Section 5.2 &nbsp;&nbsp; [Normalize English Words](#norm_en)
6. &nbsp;&nbsp; [Generate Audio for GPT Corpus](#gen_audio_gpt)
    - Section 6.1 &nbsp;&nbsp; [Generate Text Manifest for GPT Corpus](#gen_gpt_txt)
7. &nbsp;&nbsp; [Normalize Text for Phonemization](#norm_gpt_phon)
    - Section 7.1 &nbsp;&nbsp; [Apply eng_to_ipa](#eng2ipa_gpt)
    - Section 7.2 &nbsp;&nbsp; [Analyse Unphonemized Words](#unphon_analysis)
    - Section 7.3 &nbsp;&nbsp; [Analyse Phonemized Words](#phon_analysis)
    - Section 7.4 &nbsp;&nbsp; [Update mapping](#update_mapping)
8. &nbsp;&nbsp; [Generate GPT Phonemes](#gpt_phonemes)
    - Section 8.1 &nbsp;&nbsp; [Phonemize with eng_to_ipa](#phon_eng2ipa)
    - Section 8.2 &nbsp;&nbsp; [espeak_ms](#espeak_ms)
    - Section 8.3 &nbsp;&nbsp; [espeak_en_ms](#espeak_en_ms)
    - Section 8.4 &nbsp;&nbsp; [eng2ipa_espeak_ms](#eng2ipa_espeak_ms)
9. &nbsp;&nbsp; [Generate Text Manifest for GPT Phonemes](#gpt_text_manifest)
    - Section 9.1 &nbsp;&nbsp; [Text Manifest (espeak_ms)](#espeak_ms_txt)
    - Section 9.2 &nbsp;&nbsp; [Text Manifest (espeak_en_ms)](#espeak_en_ms_txt)
    - Section 9.3 &nbsp;&nbsp; [Text Manifest (eng2ipa_espeak_ms)](#eng2ipa_espeak_ms_txt)
    - Section 9.4 &nbsp;&nbsp; [Text Manifest (norm_phon)](#norm_phon_txt)

# 1. &nbsp;&nbsp; [Load GPT Transcripts](#home) <a id='load_gpt'></a>

We prompt ChatGPT 3.5 to generate 50 example gaming transcripts in Malay, each containing at least one of the 5 gaming terms provided (multiple gaming terms may appear within the same transcript). We repeat the process until all 743 gaming terms are used.

Steps taken are as follows:

1. Read the `gpt.txt` file and process its content into the required format.
2. Load the content as a DataFrame.
3. Assign an id to each transcript in the DataFrame.

Our observations:

73 transcripts did not contain any gaming terms. This may be because ChatGPT attempts to simulate actual conversations between gamers, e.g. a gamer replying "Serius?" to another gamer's query.
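
The three loading steps can be sketched in plain Python. This assumes one transcript per line in the input text, which is an assumption because the actual format handled by `read_gpt` is not shown; `load_transcripts` is a hypothetical name and plain dictionaries stand in for the DataFrame.

```python
from typing import Dict, List


def load_transcripts(raw_text: str) -> List[Dict[str, str]]:
    """Split raw text into transcripts and assign a sequential id to each."""
    # Assumption: one transcript per line; blank lines are ignored.
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    return [{"id": f"gpt_{i}", "text": text} for i, text in enumerate(lines)]


sample = "Macam mana gameplay dia ?\n\nOkey , aku cover left zone .\n"
records = load_transcripts(sample)
for record in records:
    print(record["id"], record["text"])
```

The `gpt_{i}` id format mirrors the one used in the notebook cell that inserts the `id` column.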

In [3]:
# Read gpt.txt file to DataFrame
# Saved pre-processed gpt.txt
df_gpt = read_gpt(cfg.paths.gpt_path)
df_gpt

Unnamed: 0,text
0,"Dengan mendapat 1-up , saya boleh teruskan per..."
1,Grafik 16-bit memberikan sentuhan nostalgia ya...
2,Saya berjaya melakukan 1CC pada permainan arke...
3,Pertarungan 1v1 itu benar-benar mencabar kemah...
4,Grafik 2.5D memberikan dimensi yang menarik ke...
...,...
3641,Macam mana gameplay dia ?
3642,Zone dalam game merujuk kepada kawasan atau wi...
3643,"Okey , aku cover left zone ."
3644,Zoning dalam game adalah strategi untuk kawal ...


In [4]:
# Assign id to each transcript in DataFrame
df_gpt.insert(0, "id", [f"gpt_{i}" for i in df_gpt.index])
df_gpt.to_csv(cfg.paths.gpt_csv_path, index=False)
df_gpt

Unnamed: 0,id,text
0,gpt_0,"Dengan mendapat 1-up , saya boleh teruskan per..."
1,gpt_1,Grafik 16-bit memberikan sentuhan nostalgia ya...
2,gpt_2,Saya berjaya melakukan 1CC pada permainan arke...
3,gpt_3,Pertarungan 1v1 itu benar-benar mencabar kemah...
4,gpt_4,Grafik 2.5D memberikan dimensi yang menarik ke...
...,...,...
3641,gpt_3641,Macam mana gameplay dia ?
3642,gpt_3642,Zone dalam game merujuk kepada kawasan atau wi...
3643,gpt_3643,"Okey , aku cover left zone ."
3644,gpt_3644,Zoning dalam game adalah strategi untuk kawal ...


# 2. &nbsp;&nbsp; [Gaming Terms Analysis](#home) <a id='gaming_analysis'></a>

We extract the list of gaming terms for each transcript in the GPT corpus and analyse them. Steps taken are as follows:

<style>
    table {
        width: 70%;
        table-layout:fixed;
    }
    th:nth-child(1),
    td:nth-child(1) {
        width: 3%;
    }
    th:nth-child(2),
    td:nth-child(2) {
        width: 17%;
    }
    th:nth-child(3),
    td:nth-child(3) {
        width: 80%;
    }
</style>

| S/N | Step | Description |
| -- | -- | -- |
| 1. | [Extract Gaming Terms](#extract_gaming) | &bull; Extract the list of gaming terms found in each transcript. <br>&bull; Append the list of gaming terms to the DataFrame. |
| 2. | [Identify Gaming Terms Used](#gaming_used) | &bull; Identify gaming terms used and generate the list of unused gaming terms. <br>&bull; Based on the generated list, we could potentially use ChatGPT to generate transcripts for the missing gaming terms to supplement our dataset. |
| 3. | [Identify Transcripts without Gaming Terms](#no_gaming) | &bull; Generate list of gaming transcripts without gaming terms. |
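
As an illustration of step 1, the sketch below matches terms as whole words, ignoring case. It is not the notebook's `extract_gaming` (which reads its terms from `cfg.paths.combined_path` and is not shown here); `find_terms` and the small vocabulary are hypothetical.

```python
import re
from typing import Iterable, List


def find_terms(text: str, terms: Iterable[str]) -> List[str]:
    """Return the terms that occur in `text` as whole words, ignoring case."""
    found = []
    for term in terms:
        # Lookarounds instead of \b so that terms such as "1-up" match cleanly.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


vocabulary = ["gameplay", "mana", "zone", "2.5D graphics"]
first = find_terms("Macam mana gameplay dia ?", vocabulary)
second = find_terms("Grafik 2.5D memberikan dimensi yang menarik", vocabulary)
print(first)
print(second)
```

The first call picks up `mana` even though it is used here as an ordinary Malay word, and the second call finds nothing because the vocabulary stores `2.5D graphics` while the transcript uses the Malay word order `Grafik 2.5D`.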

## 2.1 &nbsp;&nbsp; [Extract Gaming Terms](#home) <a id='extract_gaming'></a>

In [5]:
# Extract gaming terms present in transcript
df_gpt["gaming"] = df_gpt["text"].parallel_map(lambda x: extract_gaming(x, cfg.paths.combined_path))
df_gpt.to_csv(cfg.paths.gpt_csv_path, index=False)
df_gpt

Unnamed: 0,id,text,gaming
0,gpt_0,"Dengan mendapat 1-up , saya boleh teruskan per...",[1-up]
1,gpt_1,Grafik 16-bit memberikan sentuhan nostalgia ya...,[16-bit]
2,gpt_2,Saya berjaya melakukan 1CC pada permainan arke...,[1CC]
3,gpt_3,Pertarungan 1v1 itu benar-benar mencabar kemah...,[1v1]
4,gpt_4,Grafik 2.5D memberikan dimensi yang menarik ke...,[]
...,...,...,...
3641,gpt_3641,Macam mana gameplay dia ?,"[gameplay, mana]"
3642,gpt_3642,Zone dalam game merujuk kepada kawasan atau wi...,[zone]
3643,gpt_3643,"Okey , aku cover left zone .",[zone]
3644,gpt_3644,Zoning dalam game adalah strategi untuk kawal ...,[zoning]


## 2.2 &nbsp;&nbsp; [Identify Gaming Terms Used](#home) <a id='gaming_used'></a>

Our observations:

1. 717 out of 781 gaming terms were used in the GPT corpus; 64 gaming terms were not used.
2. `mana` is also a Malay word for "where", which accounts for its high frequency.
3. ChatGPT appears to favour certain words, even though each prompt asked for 50 examples based on the 5 gaming terms provided.
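
A minimal sketch of how usage frequency and unused terms can be derived from the per-transcript term lists. `count_term_usage` is a hypothetical helper, not the notebook's `gen_gaming_counter`, and the sample data is illustrative.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_term_usage(
    term_lists: Iterable[List[str]], vocabulary: Iterable[str]
) -> Tuple[Counter, List[str]]:
    """Count how often each term was extracted and list vocabulary terms never seen."""
    counter = Counter(term for terms in term_lists for term in terms)
    unused = sorted(term for term in vocabulary if term not in counter)
    return counter, unused


extracted = [["gameplay", "mana"], ["zone"], ["zone"], []]
vocabulary = ["gameplay", "mana", "zone", "buff"]
counter, unused = count_term_usage(extracted, vocabulary)
print(counter.most_common())
print(unused)
```

The unused list is what could be fed back into further prompts to cover the missing gaming terms.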

In [7]:
df_gaming_counter, unused_gaming = gen_gaming_counter(df_gpt, cfg.paths.combined_path)
df_gaming_counter

Total number of combined gaming terms    : 781
Number of gaming terms used              : 717
Number of gaming terms unused            : 64


Unnamed: 0,frequency
mana,159
mode,88
gameplay,63
platform,55
level,53
...,...
telegraph,1
theory,1
toxicity,1
multiplayer online battle arena,1


##### List of unused gaming terms

In [8]:
unused_gaming

['Dummied out',
 'shoot em up',
 'idle game',
 'augmented reality',
 'microtransaction',
 'booster pack',
 'pocket',
 'joke character',
 'idle animation',
 'blacklist',
 'OP',
 'backwards compatibility',
 'metastory',
 'Debug mode',
 'auto battler',
 'asset flipping',
 'boosting',
 'bonus stage',
 'Lets Play',
 'clocked',
 'best-in-slot',
 'asynchronous gameplay',
 'buff',
 'non-player character',
 'battle pass',
 'bullet sponge',
 'metagame',
 'beta release',
 '4X',
 'assault mode',
 'transmogrification',
 'attract mode',
 'boss-rush',
 '64-bit',
 'minigame',
 'in-app purchase',
 'campaign mode',
 'iframes',
 'KDR',
 'button mashing',
 'micro',
 'IAP',
 'miniboss',
 'tile-matching video game',
 'wave',
 'infinite life',
 'inventory management',
 'stat squish',
 'ban wave',
 'borderless fullscreen windowed',
 'bullet hell',
 'newb',
 'battle royale game',
 '8K resolution',
 'mob',
 'board',
 'infinite health',
 'bottomless pit',
 'breach',
 'asymmetric gameplay',
 'massively multiplaye

## 2.3 &nbsp;&nbsp; [Identify Transcripts without Gaming Terms](#home) <a id='no_gaming'></a>

Our observations:

1. `2.5D graphics`, `4K resolution`, and `8K resolution` appear in Malay word order, i.e. `Grafik 2.5D`, `Resolusi 4K` and `Resolusi 8K` respectively, which is why they are not matched as gaming terms.
2. A number of English words were detected despite prompting ChatGPT to generate only Malay gaming transcripts. ChatGPT may treat these words as common in gaming even though they are not classified as gaming terms on Wikipedia.
3. Examples of English words used: world-building, tech tree, auto-targetting, zipper, smart move, etc.

In [9]:
# Select transcripts whose list of extracted gaming terms is empty
df_no = df_gpt.loc[df_gpt["gaming"].map(lambda x: len(x) == 0), :]
df_no.to_csv(cfg.paths.gpt_no_path)
df_no

Unnamed: 0,id,text,gaming
4,gpt_4,Grafik 2.5D memberikan dimensi yang menarik ke...,[]
5,gpt_5,Suasana retro terasa apabila bermain dengan gr...,[]
8,gpt_8,Penggunaan grafik 3D memberikan pengalaman per...,[]
9,gpt_9,Kesemua butiran jelas terlihat pada resolusi 4...,[]
10,gpt_10,Saya tidak dapat melepaskan permainan ini deng...,[]
...,...,...,...
3623,gpt_3623,"Chill , semua orang ada off days .",[]
3628,gpt_3628,Grafik dia memang impressive .,[]
3635,gpt_3635,"Focus fire satu-satu , jangan panic .",[]
3640,gpt_3640,Serius ?,[]


In [8]:
for row in df_no.itertuples(index=False, name=None):
    print(row)

('Grafik 2.5D memberikan dimensi yang menarik kepada permainan ini .', [])
('Suasana retro terasa apabila bermain dengan grafik 2D yang klasik .', [])
('Penggunaan grafik 3D memberikan pengalaman permainan yang lebih hidup .', [])
('Kesemua butiran jelas terlihat pada resolusi 4K ini .', [])
('Saya tidak dapat melepaskan permainan ini dengan grafik 2D yang indah .', [])
('Resolusi 4K memberikan ketajaman visual yang tidak dapat dipercayai .', [])
('Saya sentiasa kagum dengan kecantikan grafik 3D dalam permainan ini .', [])
('Grafik 2.5D memberikan kedalaman visual yang hebat pada permainan ini .', [])
('Keindahan grafik 2D ini benar-benar menawan hati pemain .', [])
('Grafik 3D membuatkan dunia permainan ini kelihatan hidup .', [])
('Resolusi 4K memberikan kejelasan yang memukau pada setiap adegan .', [])
('Saya suka permainan ini kerana grafik 2D yang membawa kembali kenangan .', [])
('Saya teruja dengan kualiti grafik 2D pada permainan ini .', [])
('Saya sentiasa terpesona dengan keu

# 3. &nbsp;&nbsp; [English Words Analysis](#home) <a id='english_gpt'></a>

We extract all English words (including gaming terms) for analysis, because incorrect pronunciation is more likely to result from English words (the minority class) than from Malay words.

Steps are as follows:

<style>
    table {
        width: 70%;
        table-layout:fixed;
    }
    th:nth-child(1),
    td:nth-child(1) {
        width: 3%;
    }
    th:nth-child(2),
    td:nth-child(2) {
        width: 17%;
    }
    th:nth-child(3),
    td:nth-child(3) {
        width: 80%;
    }
</style>

| S/N | Step | Description |
| -- | -- | -- |
| 1. | [Extract English Words](#extract_english) | &bull; Extract all English terms from the GPT corpus via [polyglot](https://github.com/aboSamoor/polyglot). <br>&bull; English terms that are not yet found in the `mapping` dictionary and combined gaming terms will either be translated or modified so that the Malaya VITS model is able to generate comprehensible audio. |
| 2. | [Identify Unique English Words](#unique_english) | &bull; Identify the list of English words used in the GPT corpus that are not found in `mapping` and combined gaming terms. |
| 3. | [Translate Unique English Words](#translate_unique) | &bull; Translate these unique English words to reduce the number of English words present in the GPT corpus. |
| 4. | [Identify Untranslated English Words](#untranslated_unique) | &bull; Identify untranslated English words, for which we have to check whether the Malaya VITS model is able to generate comprehensible audio. <br>&bull; Characters may have to be altered to produce the desired pronunciation (best effort basis). |
| 5. | [Identify Wrong Translations](#wrong_translation) | &bull; Identify English words whose length differs from that of the translation by more than 5 characters. |

In [10]:
df_gpt["en_words"] = df_gpt["text"].parallel_map(extract_en_words)
df_gpt.to_csv(cfg.paths.gpt_csv_path, index=False)
df_gpt

Unnamed: 0,id,text,gaming,en_words
0,gpt_0,"Dengan mendapat 1-up , saya boleh teruskan per...",[1-up],[]
1,gpt_1,Grafik 16-bit memberikan sentuhan nostalgia ya...,[16-bit],"[Grafik, 16-bit, unik]"
2,gpt_2,Saya berjaya melakukan 1CC pada permainan arke...,[1CC],[]
3,gpt_3,Pertarungan 1v1 itu benar-benar mencabar kemah...,[1v1],[]
4,gpt_4,Grafik 2.5D memberikan dimensi yang menarik ke...,[],"[Grafik, dimensi]"
...,...,...,...,...
3641,gpt_3641,Macam mana gameplay dia ?,"[gameplay, mana]",[gameplay]
3642,gpt_3642,Zone dalam game merujuk kepada kawasan atau wi...,[zone],"[Zone, game, peta]"
3643,gpt_3643,"Okey , aku cover left zone .",[zone],"[Okey, cover, left, zone]"
3644,gpt_3644,Zoning dalam game adalah strategi untuk kawal ...,[zoning],"[game, strategi, secure]"


## 3.1 &nbsp;&nbsp; [Extract English Words](#home) <a id='extract_english'></a>

Update a Counter object with the list of English words used in each transcript.

Our observations:

1. Malay loanwords that are spelled to sound like their English counterparts are classified as English words by polyglot, e.g. `elemen`, `strategi`, `Okey`, `responsiviti`.
2. Some Malay words are wrongly captured as English words, e.g. `sering`, which means "often" in English.

In [22]:
df_en_word = gen_en_counter(df_gpt)
df_en_word = df_en_word.reset_index()
df_en_word.columns = ["words", "frequency"]
df_en_word.to_csv(cfg.paths.gpt_en_path)
df_en_word

Total number of English words detected by polyglot : 1833


Unnamed: 0,words,frequency
0,game,1030
1,elemen,181
2,strategi,140
3,sering,123
4,player,112
...,...,...
1828,comebacks,1
1829,enthusiasm,1
1830,leave,1
1831,lasting,1


## 3.2 &nbsp;&nbsp; [Identify Unique English Words](#home) <a id='unique_english'></a> 

Identify the list of English words that are in neither the gaming terms nor the `mapping` dictionary.

Our rationale:

1. The Malaya VITS model pronounces Indonesian and actual Malay words easily. Hence the issue is handling non-Malay words, specifically English.
2. By identifying the English words that are in neither the gaming terms nor the `mapping` dictionary, we are able to either translate these words to Malay or map them if they cannot be translated.

In addition, we will perform translation on these unique English words using [translators](https://pypi.org/project/translators/)

A total of 1510 English words were not found in the combined gaming terms or in `mapping`.
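
The filtering step amounts to a set difference. The sketch below is not the notebook's `gen_new_en`; `select_new_words` is a hypothetical helper, and both the case-insensitive comparison and the alphabetical ordering are assumptions. The mapping entry is invented for illustration.

```python
from typing import Dict, Iterable, List


def select_new_words(
    en_words: Iterable[str], gaming_terms: Iterable[str], mapping: Dict[str, str]
) -> List[Dict[str, str]]:
    """Keep detected words that are neither gaming terms nor keys of `mapping`."""
    known = {term.lower() for term in gaming_terms} | {key.lower() for key in mapping}
    new_words = sorted({word.lower() for word in en_words} - known)
    return [{"id": f"new_en_{i}", "words": word} for i, word in enumerate(new_words)]


en_words = ["game", "zipper", "ability", "zone", "access"]
gaming_terms = ["zone", "gameplay"]
mapping = {"game": "gem"}  # hypothetical mapping entry, not taken from gaming.yaml
new_rows = select_new_words(en_words, gaming_terms, mapping)
for row in new_rows:
    print(row["id"], row["words"])
```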

In [12]:
df_en_word_new = gen_new_en(cfg, df_en_word)
df_en_word_new

Original number of English words detected : 1833
Number of English words that were not found in combined gaming terms and in mapping : 1510
Number of English words removed from list : 323


Unnamed: 0,id,words
0,new_en_0,abilities
1,new_en_1,ability
2,new_en_2,abrupt
3,new_en_3,access
4,new_en_4,accessibility
...,...,...
1505,new_en_1505,zipper
1506,new_en_1506,zombie
1507,new_en_1507,zona
1508,new_en_1508,zones


## 3.3 &nbsp;&nbsp; [Translate Unique English Words](#home) <a id='translate_unique'></a> 

We use the Malaya transformer module and the Malaya dictionary module to translate the English text to Malay. We did not use the `translators` module because it limits the number of words that can be translated. Translating English words that are not found in `mapping` and combined gaming terms reduces the number of English words present in the GPT corpus:

- An excess of English words may affect the sound quality of spoken Malay words.
- We do not want the model to be confused between English and Malay words and therefore generate incomprehensible audio (i.e. a mix of English and Malay words).

Our observations:

1. It appears that the Malaya dictionary translates English words to Malay while the Malaya transformer translates English words to Indonesian, e.g. `access` becomes 'akses' via the dictionary versus 'capaian' via the transformer.

In [13]:
df_en_word_new = append_translated(cfg, df_en_word_new)
df_en_word_new

INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v23-preprocessing/english-malay-200k.json


2024-01-28 17:18:35.580746: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled


Unnamed: 0,id,words,dictionary,transformer,words_len,dictionary_len,transformer_len
0,new_en_0,abilities,abilities,kebolehan,9,9,9
1,new_en_1,ability,kebolehan,keupayaan,7,9,9
2,new_en_2,abrupt,abrupt,Tiba-tiba,6,6,9
3,new_en_3,access,akses,capaian,6,5,7
4,new_en_4,accessibility,aksesibiliti,kebolehcapaian,13,12,14
...,...,...,...,...,...,...,...
1505,new_en_1505,zipper,zip,ritsleting,6,3,10
1506,new_en_1506,zombie,zombie,zombi,6,6,5
1507,new_en_1507,zona,zona,"zonaCity in Illinois, United States",4,4,35
1508,new_en_1508,zones,zones,zon,5,5,3


## 3.4 &nbsp;&nbsp; [Identify Untranslated English Words](#home) <a id='untranslated_unique'></a>

We checked for English words left untranslated by the Malaya dictionary, by the Malaya transformer, and by both.

Our observations:

1. The Malaya dictionary is more conservative in translation (823 words untranslated) than the Malaya transformer (404 words untranslated).
2. However, the Malaya dictionary output appears closer to conversational Malay, while the Malaya transformer appears to translate English words to Indonesian.
3. 390 words are untranslated by both the Malaya dictionary and the Malaya transformer. This is likely because either the original words are already proper Malay words (polyglot classifies Malay words that sound like English as English words), or no suitable translation exists, so both methods return the original word.
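
The three counts compare each translation with the original word. A minimal sketch in plain Python, using a handful of rows copied from the tables in this notebook; `count_untranslated` is a hypothetical helper and tuples stand in for the DataFrame.

```python
from typing import Dict, List, Tuple

# (word, dictionary output, transformer output), copied from rows shown in this notebook.
ROWS: List[Tuple[str, str, str]] = [
    ("abilities", "abilities", "kebolehan"),
    ("ability", "kebolehan", "keupayaan"),
    ("access", "akses", "capaian"),
    ("action-adventure", "action-adventure", "action-adventure"),
    ("zombie", "zombie", "zombi"),
]


def count_untranslated(rows: List[Tuple[str, str, str]]) -> Dict[str, int]:
    """Count rows where a translator returned the input word unchanged."""
    return {
        "dictionary": sum(1 for word, d, _ in rows if d == word),
        "transformer": sum(1 for word, _, t in rows if t == word),
        "both": sum(1 for word, d, t in rows if d == word and t == word),
    }


counts = count_untranslated(ROWS)
print(counts)
```

A word counted under "both" is either already acceptable Malay or a term neither translator can handle, so these rows are the ones that need manual review.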

##### Untranslated English words via Malaya dictionary

In [14]:
# Identify words that are identical in `words` and `dictionary` columns
df_dictionary = df_en_word_new.loc[df_en_word_new["dictionary"] == df_en_word_new["words"], :]
print(f"Number of untranslated English words via Malaya dictionary : {len(df_dictionary)}")
df_dictionary

Number of untranslated English words via Malaya dictionary : 823


Unnamed: 0,id,words,dictionary,transformer,words_len,dictionary_len,transformer_len
0,new_en_0,abilities,abilities,kebolehan,9,9,9
2,new_en_2,abrupt,abrupt,Tiba-tiba,6,6,9
8,new_en_8,accountable,accountable,bertanggungjawab,11,11,16
10,new_en_10,action-adventure,action-adventure,action-adventure,16,16,16
11,new_en_11,actions,actions,tindakan,7,7,8
...,...,...,...,...,...,...,...
1504,new_en_1504,zero-player,zero-player,pemain-sifar,11,11,12
1506,new_en_1506,zombie,zombie,zombi,6,6,5
1507,new_en_1507,zona,zona,"zonaCity in Illinois, United States",4,4,35
1508,new_en_1508,zones,zones,zon,5,5,3


##### Untranslated English words via Malaya transformer

In [15]:
# Identify words that are identical in `words` and `transformer` columns
df_transformer = df_en_word_new.loc[df_en_word_new["transformer"] == df_en_word_new["words"], :]
print(f"Number of untranslated English words via Malaya transformer : {len(df_transformer)}")
df_transformer

Number of untranslated English words via Malaya transformer : 404


Unnamed: 0,id,words,dictionary,transformer,words_len,dictionary_len,transformer_len
10,new_en_10,action-adventure,action-adventure,action-adventure,16,16,16
15,new_en_15,adaptabiliti,adaptabiliti,adaptabiliti,12,12,12
17,new_en_17,adaptasi,adaptasi,adaptasi,8,8,8
25,new_en_25,adegan,adegan,adegan,6,6,6
26,new_en_26,adil,adil,adil,4,4,4
...,...,...,...,...,...,...,...
1478,new_en_1478,wasd,wasd,wasd,4,4,4
1487,new_en_1487,whiffs,whiffs,whiffs,6,6,6
1489,new_en_1489,wii,wii,wii,3,3,3
1495,new_en_1495,wow,wow,wow,3,3,3


##### Untranslated English words via Malaya dictionary and Malaya transformer

In [16]:
# Identify words that are identical in all 3 columns (i.e. `words`, `dictionary`, and `transformer`)
df_dictionary_transformer = df_en_word_new.loc[
    (
        (df_en_word_new["dictionary"] == df_en_word_new["transformer"])
        & (df_en_word_new["transformer"] == df_en_word_new["words"])
    ),
    :,
]

print(f"Number of untranslated English words via Malaya dictionary and transformer : {len(df_dictionary_transformer)}")
df_dictionary_transformer

Number of untranslated English words via Malaya dictionary and transformer : 390


Unnamed: 0,id,words,dictionary,transformer,words_len,dictionary_len,transformer_len
10,new_en_10,action-adventure,action-adventure,action-adventure,16,16,16
15,new_en_15,adaptabiliti,adaptabiliti,adaptabiliti,12,12,12
17,new_en_17,adaptasi,adaptasi,adaptasi,8,8,8
25,new_en_25,adegan,adegan,adegan,6,6,6
26,new_en_26,adil,adil,adil,4,4,4
...,...,...,...,...,...,...,...
1478,new_en_1478,wasd,wasd,wasd,4,4,4
1487,new_en_1487,whiffs,whiffs,whiffs,6,6,6
1489,new_en_1489,wii,wii,wii,3,3,3
1495,new_en_1495,wow,wow,wow,3,3,3


## 3.5 &nbsp;&nbsp; [Identify Wrong Translations](#home) <a id='wrong_translation'></a>

We identify all translations whose length differs from that of the original word by more than 5 characters, for both the Malaya dictionary and the Malaya transformer.

Our observations:

1. Words not seen by the transformer are often returned with spurious text appended, typically a location in the United States:

```
haah -> haahCity in Virginia, United States
mario -> marioCity in Virginia, United States
mvp -> mvpCity in California, United States
progresi -> progresiCity in Illinois, United States
sarana -> saranaCity in Illinois, United States
youtube -> youtubeCity in Minnesota, United States
zona -> zonaCity in Illinois, United States
loading -> loading... (28/2019) Ia juga mengaku tak bisa
```

2. Proper Malay words are being translated:

```
kritikan -> kritikStock label
perjalanan -> perjalanan perjalanan
```

3. Inconsistent output between the Malaya dictionary and the Malaya transformer makes it difficult to decide which translation is better, especially as we are not native Malay speakers. For example (original word versus Malaya dictionary versus Malaya transformer):

```
'overwhelming' versus 'sangat menggembirakan' versus 'luar biasa'
'fix' versus 'menetapkan' versus 'baiki'
'crowd' versus 'orang ramai' versus 'kerumunan'
``` 
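
The length-gap heuristic used in this section can be sketched as follows, with word pairs copied from the tables and examples in this notebook. `is_suspicious` mirrors the `> 5` comparison in the cells below; `has_appended_text` is an additional, hypothetical check for the appended-text pattern in observation 1 and is not part of the notebook.

```python
LENGTH_GAP = 5  # threshold used in this section


def is_suspicious(word: str, translation: str, max_gap: int = LENGTH_GAP) -> bool:
    """Flag a translation whose length differs from the word by more than `max_gap`."""
    return abs(len(word) - len(translation)) > max_gap


def has_appended_text(word: str, translation: str) -> bool:
    """Flag outputs that repeat the source word and then append extra text."""
    return translation != word and translation.startswith(word)


pairs = [
    ("fix", "menetapkan"),
    ("fix", "baiki"),
    ("crowd", "orang ramai"),
    ("zona", "zonaCity in Illinois, United States"),
]
for word, translation in pairs:
    print(word, translation, is_suspicious(word, translation), has_appended_text(word, translation))
```

A length gap is only a cheap proxy: `fix` to `menetapkan` is flagged because of its length, yet the flag says nothing about whether the translation is actually wrong, so flagged rows still need manual inspection.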

##### Potentially wrong translation via Malaya dictionary

In [17]:
# Words whose length differs from the Malaya dictionary translation by more than 5 characters
df_wrong_dictionary = df_en_word_new.loc[abs(df_en_word_new["words_len"] - df_en_word_new["dictionary_len"]) > 5, :]
print(f"Number of potentially wrong translation via Malaya dictionary: {len(df_wrong_dictionary)}")
df_wrong_dictionary

Number of potentially wrong translation via Malaya dictionary: 49


Unnamed: 0,id,words,dictionary,transformer,words_len,dictionary_len,transformer_len
14,new_en_14,adapt,menyesuaikan,adaptasi,5,12,8
47,new_en_47,alienate,mengasingkan diri,mengasingkan,8,17,12
56,new_en_56,ambush,serangan hendap,serang hendap,6,15,13
121,new_en_121,blame,menyalahkan,salahkan,5,11,8
242,new_en_242,crowd,orang ramai,kerumunan,5,11,9
259,new_en_259,decide,membuat keputusan,putuskan,6,17,8
269,new_en_269,define,mentakrifkan,takrifan,6,12,8
304,new_en_304,directional,arah,arah,11,4,4
373,new_en_373,encouragement,galakan,galakan,13,7,7
454,new_en_454,fix,menetapkan,baiki,3,10,5


##### Potentially wrong translation via Malaya transformer

In [18]:
# Words whose length differs from the Malaya transformer translation by more than 5 characters
df_wrong_transformer = df_en_word_new.loc[abs(df_en_word_new["words_len"] - df_en_word_new["transformer_len"]) > 5, :]
print(f"Number of potentially wrong translation via Malaya transformer : {len(df_wrong_transformer)}")
df_wrong_transformer

Number of potentially wrong translation via Malaya transformer : 57


Unnamed: 0,id,words,dictionary,transformer,words_len,dictionary_len,transformer_len
56,new_en_56,ambush,serangan hendap,serang hendap,6,15,13
177,new_en_177,clock/clocked,clock/clocked,jam/jam,13,13,7
258,new_en_258,debugging,debugging,penyahpepijatan,9,9,15
304,new_en_304,directional,arah,arah,11,4,4
317,new_en_317,distribute,mengedarkan,edar,10,11,4
319,new_en_319,distrust,kesangsian,ketidakpercayaan,8,10,16
373,new_en_373,encouragement,galakan,galakan,13,7,7
374,new_en_374,end,akhir,penghujung,3,5,10
375,new_en_375,endless,endless,tidak berkesudahan,7,7,18
381,new_en_381,enhance,meningkatkan,pertingkatkan,7,12,13


## 3.6 &nbsp;&nbsp; [Identify Untranslated English Words](#home) <a id='untranslated'></a>

Our observations:

1. 390 words (mostly Malay words) were left unchanged by both the Malaya dictionary and the Malaya transformer.

In [None]:
# Extract untranslated English words i.e. select words that are identical across all 3 columns
cond = (df_en_word_new["words"] == df_en_word_new["dictionary"]) & (
    df_en_word_new["dictionary"] == df_en_word_new["transformer"]
)

df_untranslated = df_en_word_new.loc[cond, ["id", "words", "dictionary", "transformer"]]
df_untranslated

Unnamed: 0,id,words,dictionary,transformer
10,new_en_10,action-adventure,action-adventure,action-adventure
15,new_en_15,adaptabiliti,adaptabiliti,adaptabiliti
17,new_en_17,adaptasi,adaptasi,adaptasi
25,new_en_25,adegan,adegan,adegan
26,new_en_26,adil,adil,adil
...,...,...,...,...
1478,new_en_1478,wasd,wasd,wasd
1487,new_en_1487,whiffs,whiffs,whiffs
1489,new_en_1489,wii,wii,wii
1495,new_en_1495,wow,wow,wow


In [None]:
for row in df_untranslated.itertuples(index=False, name=None):
    print(row)

('new_en_10', 'action-adventure', 'action-adventure', 'action-adventure')
('new_en_15', 'adaptabiliti', 'adaptabiliti', 'adaptabiliti')
('new_en_17', 'adaptasi', 'adaptasi', 'adaptasi')
('new_en_25', 'adegan', 'adegan', 'adegan')
('new_en_26', 'adil', 'adil', 'adil')
('new_en_30', 'adrenalin', 'adrenalin', 'adrenalin')
('new_en_40', 'agresif', 'agresif', 'agresif')
('new_en_43', 'aktif', 'aktif', 'aktif')
('new_en_44', 'alami', 'alami', 'alami')
('new_en_45', 'alat', 'alat', 'alat')
('new_en_48', 'aliran', 'aliran', 'aliran')
('new_en_52', 'alternatif', 'alternatif', 'alternatif')
('new_en_57', 'analisis', 'analisis', 'analisis')
('new_en_58', 'analog', 'analog', 'analog')
('new_en_60', 'animasi', 'animasi', 'animasi')
('new_en_63', 'antara', 'antara', 'antara')
('new_en_64', 'anti-cheat', 'anti-cheat', 'anti-cheat')
('new_en_66', 'antisipasi', 'antisipasi', 'antisipasi')
('new_en_69', 'apresiasi', 'apresiasi', 'apresiasi')
('new_en_75', 'asal', 'asal', 'asal')
('new_en_77', 'aspek', '

# 4. &nbsp;&nbsp; [Generate gpt_mapping](#home) <a id='gen_gpt_mapping'></a>

The objective is to generate a mapping (i.e. `gpt_mapping`) that maps unique English words to words whose characters are modified to replicate the desired pronunciation. Unique English words are English words found in the GPT corpus that appear neither in the gaming terms nor in the existing `mapping` for gaming terms in `gaming.yaml`.

Detailed steps are as follows:

<style>
    table {
        width: 70%;
        table-layout:fixed;
    }
    th:nth-child(1),
    td:nth-child(1) {
        width: 3%;
    }
    th:nth-child(2),
    td:nth-child(2) {
        width: 17%;
    }
    th:nth-child(3),
    td:nth-child(3) {
        width: 80%;
    }
</style>

| S/N | Step | Description |
| -- | -- | -- |
| 1. | [Generate Audio for Unique English Words](#gen_audio_unique) | &bull; Generate audio for all 1515 English words. <br>&bull; This makes it easy to assess how well the Malay VITS model infers English words. |
| 2. | [Map to Malay Words](#map_to_malay) | &bull; Determine the appropriate translated Malay word. <br>&bull; Update `gpt_mapping` to map English words to the selected translated Malay word. |
| 3. | [Map to Modified](#map_to_modified) | &bull; Perform language detection on the `malay_word` column via polyglot. <br>&bull; Identify words that polyglot detects as English with a confidence of more than 90%. <br>&bull; Update `gpt_mapping` to map English words to their modified equivalent. |

Our considerations:

1. We assume the Malaya dictionary is a better translation source than the Malaya transformer. Therefore, we translate the 1515 unique English words with the Malaya dictionary first and fall back to the Malaya transformer only when the dictionary cannot translate a word. This keeps the GPT corpus predominantly Malay.
2. Gaming terms are not translated, since the Malay gaming community uses gaming terms in their native English form.
3. Modified words are words that cannot be translated by either the Malaya dictionary or the Malaya transformer. They are modified character-wise, or rewritten as a combination of several English words, to produce the appropriate pronunciation, e.g. 'insertion' -> 'innsirshion'.
4. A sentence may not be grammatically correct after translation. However, the focus is on obtaining the correct sound. Users can correct the grammar themselves and reuse the same mapping to generate the audio files.
5. We noticed that Malay translations differ between the Malaya dictionary, the Malaya transformer and Google Translate (web page). As we are not native Malay speakers, we are unsure whether the correct Malay words are used.
6. Malay words that were originally generated by ChatGPT are retained as-is, even where the Malaya dictionary or transformer translated them.

## 4.1 &nbsp;&nbsp; [Generate Audio for Unique English Words](#home) <a id='gen_audio_unique'></a>

We went through all 1515 unique English words and identified 381 words that required either translation or character modification. Refer to `gaming.yaml` under the `notebooks/src_gaming` folder for the full list.

In [None]:
gen_audio(df_en_word_new, "words", cfg.paths.gpt_en_new_dir)

  0%|          | 0/1511 [00:00<?, ?it/s]

100%|██████████| 1511/1511 [22:01<00:00,  1.14it/s]


## 4.2 &nbsp;&nbsp; [Map to Malay words](#home) <a id='map_to_malay'></a>

Determine the appropriate Malay translation for each word by taking the Malaya dictionary result first and falling back to the Malaya transformer result when the dictionary has none. Then update `gpt_mapping` to map each English word to the selected Malay word.
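
The selection rule can be sketched as follows. This is a minimal illustration and not the notebook's actual `select_malay_word` helper, whose body is not shown here: the name `select_malay_word_sketch` and the test "an unchanged output means the source could not translate the word" are assumptions. The real helper appears to apply additional checks, since the `zona` row in the output below keeps `zona` even though the transformer returned a different string.

```python
def select_malay_word_sketch(word: str, dictionary: str, transformer: str) -> str:
    """Prefer the dictionary translation, then the transformer, else keep the word."""
    # Assumption: a source returns its input unchanged when it cannot translate it.
    if dictionary != word:
        return dictionary
    if transformer != word:
        return transformer
    return word


# Rows copied from the output table in this section: (words, dictionary, transformer)
examples = [
    ("abilities", "abilities", "kebolehan"),
    ("ability", "kebolehan", "keupayaan"),
    ("zombie", "zombie", "zombi"),
]
selected = [select_malay_word_sketch(*row) for row in examples]
print(selected)
```

For these three rows the sketch agrees with the `malay_word` column shown in the output table.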

##### Determine appropriate translated Malay words

In [19]:
# Select from `dictionary`, followed by `transformer`
df_en_word_new["malay_word"] = df_en_word_new.parallel_apply(
    # Access columns by label rather than by position
    lambda row: select_malay_word(row["words"], row["dictionary"], row["transformer"]),
    axis=1,
)
df_en_word_new.to_csv(cfg.paths.gpt_en_new_path, index=False)
df_en_word_new

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=189), Label(value='0 / 189'))), HB…

Unnamed: 0,id,words,dictionary,transformer,words_len,dictionary_len,transformer_len,malay_word
0,new_en_0,abilities,abilities,kebolehan,9,9,9,kebolehan
1,new_en_1,ability,kebolehan,keupayaan,7,9,9,kebolehan
2,new_en_2,abrupt,abrupt,Tiba-tiba,6,6,9,Tiba-tiba
3,new_en_3,access,akses,capaian,6,5,7,akses
4,new_en_4,accessibility,aksesibiliti,kebolehcapaian,13,12,14,aksesibiliti
...,...,...,...,...,...,...,...,...
1505,new_en_1505,zipper,zip,ritsleting,6,3,10,zip
1506,new_en_1506,zombie,zombie,zombi,6,6,5,zombi
1507,new_en_1507,zona,zona,"zonaCity in Illinois, United States",4,4,35,zona
1508,new_en_1508,zones,zones,zon,5,5,3,zon


##### Update `gpt_mapping` to map English word to selected translated Malay word

We adjusted the selected Malay words below (in yaml format) by removing the following entries from `gpt_mapping`:

1. Malay words that were generated by ChatGPT (i.e. ekspres, kepintarank, kritikan, periode, perjalanan, pulih, respons).
2. Words deemed more appropriate in their English form, which are therefore retained in English (i.e. e sports, kill death assist, sniper, top, top-down, x axis, y axis, z axis, zipper).

A total of 747 key-value pairs were created for `gpt_mapping`.
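
A minimal sketch of how such a mapping could be assembled from the selected words while skipping the removed entries. The function `build_mapping_sketch` and its arguments are hypothetical and are not the notebook's `display_yaml`. The value clean-up (lowercasing, hyphen replaced by a space) is inferred from the `abrupt` row, where `Tiba-tiba` in the table becomes `tiba tiba` in the yaml listing.

```python
def build_mapping_sketch(rows, excluded):
    """Build an English -> Malay mapping from (word, malay_word) pairs."""
    mapping = {}
    for word, malay_word in rows:
        # Skip manually removed entries and words that were left untranslated.
        if word in excluded or malay_word == word:
            continue
        # Assumed clean-up: lowercase the value and replace hyphens with spaces.
        mapping[word] = malay_word.replace("-", " ").lower()
    return mapping


# (words, malay_word) pairs copied from the tables in this notebook
rows = [
    ("abilities", "kebolehan"),
    ("abrupt", "Tiba-tiba"),
    ("zipper", "zip"),   # removed: retained in its English form
    ("wasd", "wasd"),    # untranslated, handled later as a modified word
]
mapping_sketch = build_mapping_sketch(rows, excluded={"zipper"})
for key, value in mapping_sketch.items():
    print(f"  {key}: {value}")
```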

In [20]:
# Display initial English to Malay words mapping in yaml format
display_yaml(cfg, df_en_word_new)

  abilities: kebolehan
  abrupt: tiba tiba
  accessibility: aksesibiliti
  accessible: boleh diakses
  accomplished: dicapai
  accomplishment: pencapaian
  accountable: bertanggungjawab
  actions: tindakan
  activate: aktifkan
  active: aktif
  adaptability: kebolehsuaian
  adaptation: adaptasi
  addicted: ketagihan
  addicting: ketagihan
  addictive: ketagihan
  additional: tambahan
  address: alamat
  adjust: melaraskan
  adjustment: pelarasan
  admit: mengaku
  advanced: maju
  advancement: kemajuan
  advantage: kelebihannya
  aesthetic: estetik
  affected: terjejas
  after: selepas
  aggressive: agresif
  agreed: bersetuju
  alert: amaran
  alienate: mengasingkan diri
  allocation: peruntukan
  alternative: alternatif
  amazement: kagum
  ambush: serangan hendap
  annoying: menjengkelkan
  anti fair: anti adil
  appreciated: dihargai
  approaches: pendekatan
  area of effect: kawasan kesan
  areas: kawasan
  artbook: buku seni
  aspect: aspek
  atmosphere: atmosfera
  attacks: sera

## 4.3 &nbsp;&nbsp; [Map to Modified Words](#home) <a id='map_to_modified'></a>

We performed language detection on the `malay_word` column to help identify the actual English words among the 390 untranslated words derived earlier.

Our observations:

1. 97 words in the `malay_word` column were classified as English by polyglot with a confidence level of more than 90%.
2. Some of these 97 words are actually Malay words, e.g. adaptabiliti, alternatif, etc. It appears that polyglot does not reliably distinguish English from Malay words.
3. **Sound generated by the Malay VITS model may differ between runs**. This is likely due to the generator process, which introduces randomness into the audio output, e.g. "on disc" may sometimes emphasize the "c" and sound like "on disc c".

As such, we went through all 390 untranslated words and the 97 words that polyglot flagged as English after translation to check whether the Malaya VITS model infers them properly. Where a word was mispronounced, we modified it in one of two ways to obtain an appropriate pronunciation:

1. Combine multiple English words.
2. Modify characters in the word.
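
The set of words to review by ear can be sketched as the union of the two groups described above. This is an illustrative sketch only: `needs_review` is a hypothetical helper, and the rows are copied from the tables in this notebook.

```python
def needs_review(row, threshold=90.0):
    """Flag a word for manual listening."""
    # Group 1: neither the dictionary nor the transformer changed the word.
    untranslated = row["words"] == row["dictionary"] == row["transformer"]
    # Group 2: the selected word is detected as English above the threshold.
    flagged_english = row["lang"] == "en" and row["lang_sc"] > threshold
    return untranslated or flagged_english


rows = [
    {"id": "new_en_1", "words": "ability", "dictionary": "kebolehan",
     "transformer": "keupayaan", "lang": "ms", "lang_sc": 90.0},
    {"id": "new_en_10", "words": "action-adventure", "dictionary": "action-adventure",
     "transformer": "action-adventure", "lang": "en", "lang_sc": 94.0},
    {"id": "new_en_53", "words": "alternative", "dictionary": "alternatif",
     "transformer": "alternatif", "lang": "en", "lang_sc": 91.0},
]
review_ids = [row["id"] for row in rows if needs_review(row)]
print(review_ids)
```

Note that `new_en_53` is flagged even though `alternatif` is a valid Malay word, which is why every flagged word still has to be checked manually.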


We updated `gpt_mapping` with the following mappings of English words to their modified equivalents:

```
  2D: two dee
  2.5D: two point five dee
  3D: three dee
  4K: four kay
  8K: eg kay
  audio visual: audiovisual
  cheat: cheet
  comebacks: come back
  digital: d gi tal
  gear: yeer
  genre: john ra
  geocaching: geocashing
  idea: i dea
  pacing: paysing
  sci fi: sigh fie
  scoped: scope
  sniper: snightper
  traverse: travers
  v sync: vee sync
  wasd: dubberliu a ass dee
  whiffs: whiff
  wow: waoh
  x axis: ax axis
  xbox: ax box
  y axis: wai axis
  z axis: z axis
```


##### Apply language detection on `malay_word` column

In [21]:
df_en_word_new[["lang", "lang_sc"]] = df_en_word_new["malay_word"].parallel_apply(
    lambda x: pd.Series(detect_language(x))
)
df_en_word_new.to_csv(cfg.paths.gpt_en_new_path, index=False)
df_en_word_new

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=189), Label(value='0 / 189'))), HB…

Unnamed: 0,id,words,dictionary,transformer,words_len,dictionary_len,transformer_len,malay_word,lang,lang_sc
0,new_en_0,abilities,abilities,kebolehan,9,9,9,kebolehan,ms,90.0
1,new_en_1,ability,kebolehan,keupayaan,7,9,9,kebolehan,ms,90.0
2,new_en_2,abrupt,abrupt,Tiba-tiba,6,6,9,Tiba-tiba,id,90.0
3,new_en_3,access,akses,capaian,6,5,7,akses,no,85.0
4,new_en_4,accessibility,aksesibiliti,kebolehcapaian,13,12,14,aksesibiliti,no,92.0
...,...,...,...,...,...,...,...,...,...,...
1505,new_en_1505,zipper,zip,ritsleting,6,3,10,zip,en,80.0
1506,new_en_1506,zombie,zombie,zombi,6,6,5,zombi,rw,85.0
1507,new_en_1507,zona,zona,"zonaCity in Illinois, United States",4,4,35,zona,en,83.0
1508,new_en_1508,zones,zones,zon,5,5,3,zon,en,80.0


##### English words detected

In [22]:
# Select words that are English and have language confidence more than 90%
df_en_after = df_en_word_new.loc[(df_en_word_new["lang"] == "en") & (df_en_word_new["lang_sc"] > 90), :]
df_en_after

Unnamed: 0,id,words,dictionary,transformer,words_len,dictionary_len,transformer_len,malay_word,lang,lang_sc
10,new_en_10,action-adventure,action-adventure,action-adventure,16,16,16,action-adventure,en,94.0
15,new_en_15,adaptabiliti,adaptabiliti,adaptabiliti,12,12,12,adaptabiliti,en,92.0
52,new_en_52,alternatif,alternatif,alternatif,10,10,10,alternatif,en,91.0
53,new_en_53,alternative,alternatif,alternatif,11,10,10,alternatif,en,91.0
64,new_en_64,anti-cheat,anti-cheat,anti-cheat,10,10,10,anti-cheat,en,91.0
...,...,...,...,...,...,...,...,...,...,...
1383,new_en_1383,time-to-kill,time-to-kill,time-to-kill,12,12,12,time-to-kill,en,92.0
1400,new_en_1400,tradisional,tradisional,tradisional,11,11,11,tradisional,en,92.0
1401,new_en_1401,traditional,tradisional,tradisional,11,11,11,tradisional,en,92.0
1405,new_en_1405,travel,perjalanan,perjalanan,6,10,10,perjalanan,en,91.0


In [23]:
for row in df_en_after.loc[:, ["id", "words", "malay_word", "lang", "lang_sc"]].itertuples(index=False, name=None):
    print(row)

('new_en_10', 'action-adventure', 'action-adventure', 'en', 94.0)
('new_en_15', 'adaptabiliti', 'adaptabiliti', 'en', 92.0)
('new_en_52', 'alternatif', 'alternatif', 'en', 91.0)
('new_en_53', 'alternative', 'alternatif', 'en', 91.0)
('new_en_64', 'anti-cheat', 'anti-cheat', 'en', 91.0)
('new_en_66', 'antisipasi', 'antisipasi', 'en', 91.0)
('new_en_88', 'audio-visual', 'audio-visual', 'en', 92.0)
('new_en_89', 'audiovisual', 'audiovisual', 'en', 92.0)
('new_en_100', 'bahan-bahan', 'bahan-bahan', 'en', 92.0)
('new_en_106', 'battle', 'pertempuran', 'en', 92.0)
('new_en_107', 'battles', 'pertempuran', 'en', 92.0)
('new_en_112', 'bergerombol', 'bergerombol', 'en', 92.0)
('new_en_185', 'combat', 'pertempuran', 'en', 92.0)
('new_en_207', 'connected', 'disambungkan', 'en', 92.0)
('new_en_238', 'creativity', 'kreativiti', 'en', 91.0)
('new_en_294', 'diimplementasikan', 'diimplementasikan', 'en', 94.0)
('new_en_296', 'dikurangkan', 'dikurangkan', 'en', 92.0)
('new_en_300', 'dipercayai', 'diperca

# 5. &nbsp;&nbsp; [Normalize GPT Corpus](#home) <a id='normalize_gpt_corpus'></a>

We normalize gaming terms in the GPT corpus first (via the `mapping` dictionary), then normalize the remaining terms (via `gpt_mapping`). If we normalized the corpus with `gpt_mapping` first, gaming terms might be amended accidentally.
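
The effect of the ordering can be illustrated with a small sketch. It is not the notebook's `normalize_gaming` or `normalize_gpt`: the helper `apply_mapping` and the two dictionary entries are hypothetical, chosen only to show a gaming term that contains a word also present in `gpt_mapping`.

```python
import re


def apply_mapping(text, mapping):
    """Replace whole-term occurrences of each key with its mapped value."""
    if not mapping:
        return text
    # Longest keys first, so multi-word terms win over the words they contain.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: mapping[m.group(0)], text)


# Hypothetical entries, for illustration only
gaming_mapping = {"kill steal": "kill steel"}
gpt_mapping_demo = {"steal": "mencuri"}

text = "Dia suka kill steal ."
gaming_first = apply_mapping(apply_mapping(text, gaming_mapping), gpt_mapping_demo)
gpt_first = apply_mapping(apply_mapping(text, gpt_mapping_demo), gaming_mapping)
print(gaming_first)
print(gpt_first)
```

With the gaming mapping applied first, the gaming term is rewritten as a unit. With the order reversed, `steal` is translated on its own and the gaming term no longer matches.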

## 5.1 &nbsp;&nbsp; [Normalize Gaming Terms](#home) <a id='norm_gaming'></a>

Normalize gaming terms via `mapping` in `gaming.yaml`.

In [26]:
df_gpt["norm_gaming"] = df_gpt["text"].parallel_map(normalize_gaming)
df_gpt

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=456), Label(value='0 / 456'))), HB…

Unnamed: 0,id,text,gaming,en_words,norm_gaming
0,gpt_0,"Dengan mendapat 1-up , saya boleh teruskan per...",[1-up],[],"Dengan mendapat one up , saya boleh teruskan p..."
1,gpt_1,Grafik 16-bit memberikan sentuhan nostalgia ya...,[16-bit],"[Grafik, 16-bit, unik]",Grafik sixteen beat memberikan sentuhan nostal...
2,gpt_2,Saya berjaya melakukan 1CC pada permainan arke...,[1CC],[],Saya berjaya melakukan One C C pada permainan ...
3,gpt_3,Pertarungan 1v1 itu benar-benar mencabar kemah...,[1v1],[],Pertarungan one versus one itu benar benar men...
4,gpt_4,Grafik 2.5D memberikan dimensi yang menarik ke...,[],"[Grafik, dimensi]",Grafik 2.5D memberikan dimensi yang menarik ke...
...,...,...,...,...,...
3641,gpt_3641,Macam mana gameplay dia ?,"[gameplay, mana]",[gameplay],Macam mana game play dia ?
3642,gpt_3642,Zone dalam game merujuk kepada kawasan atau wi...,[zone],"[Zone, game, peta]",Zone dalam game merujuk kepada kawasan atau wi...
3643,gpt_3643,"Okey , aku cover left zone .",[zone],"[Okey, cover, left, zone]","Okey , aku cover left zone ."
3644,gpt_3644,Zoning dalam game adalah strategi untuk kawal ...,[zoning],"[game, strategi, secure]",Zoning dalam game adalah strategi untuk kawal ...


## 5.2 &nbsp;&nbsp; [Normalize English Words](#home) <a id='norm_en'></a>

Normalize remaining English words with `gpt_mapping` in `gaming.yaml`.

In [27]:
df_gpt["norm_gpt"] = df_gpt["norm_gaming"].parallel_map(normalize_gpt)
df_gpt.to_csv(cfg.paths.gpt_csv_path, index=False)
df_gpt

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=456), Label(value='0 / 456'))), HB…

Unnamed: 0,id,text,gaming,en_words,norm_gaming,norm_gpt
0,gpt_0,"Dengan mendapat 1-up , saya boleh teruskan per...",[1-up],[],"Dengan mendapat one up , saya boleh teruskan p...","Dengan mendapat one up , saya boleh teruskan p..."
1,gpt_1,Grafik 16-bit memberikan sentuhan nostalgia ya...,[16-bit],"[Grafik, 16-bit, unik]",Grafik sixteen beat memberikan sentuhan nostal...,Grafik sixteen beat memberikan sentuhan nostal...
2,gpt_2,Saya berjaya melakukan 1CC pada permainan arke...,[1CC],[],Saya berjaya melakukan One C C pada permainan ...,Saya berjaya melakukan One C C pada permainan ...
3,gpt_3,Pertarungan 1v1 itu benar-benar mencabar kemah...,[1v1],[],Pertarungan one versus one itu benar benar men...,Pertarungan one versus one itu benar benar men...
4,gpt_4,Grafik 2.5D memberikan dimensi yang menarik ke...,[],"[Grafik, dimensi]",Grafik 2.5D memberikan dimensi yang menarik ke...,Grafik two point five dee memberikan dimensi y...
...,...,...,...,...,...,...
3641,gpt_3641,Macam mana gameplay dia ?,"[gameplay, mana]",[gameplay],Macam mana game play dia ?,Macam mana game play dia ?
3642,gpt_3642,Zone dalam game merujuk kepada kawasan atau wi...,[zone],"[Zone, game, peta]",Zone dalam game merujuk kepada kawasan atau wi...,Zone dalam game merujuk kepada kawasan atau wi...
3643,gpt_3643,"Okey , aku cover left zone .",[zone],"[Okey, cover, left, zone]","Okey , aku cover left zone .","Okey , aku cover ditinggalkan zone ."
3644,gpt_3644,Zoning dalam game adalah strategi untuk kawal ...,[zoning],"[game, strategi, secure]",Zoning dalam game adalah strategi untuk kawal ...,Zoning dalam game adalah strategi untuk kawal ...


In [28]:
for row in df_gpt.loc[:, ["id", "text", "norm_gpt"]].itertuples(index=False, name=None):
    print(row)

('gpt_0', 'Dengan mendapat 1-up , saya boleh teruskan permainan ini tanpa kehilangan nyawa .', 'Dengan mendapat one up , saya boleh teruskan permainan ini tanpa kehilangan nyawa .')
('gpt_1', 'Grafik 16-bit memberikan sentuhan nostalgia yang unik pada permainan ini .', 'Grafik sixteen beat memberikan sentuhan nostalgia yang unik pada permainan ini .')
('gpt_2', 'Saya berjaya melakukan 1CC pada permainan arked lama semalam !', 'Saya berjaya melakukan One C C pada permainan arked lama semalam !')
('gpt_3', 'Pertarungan 1v1 itu benar-benar mencabar kemahiran permainan saya .', 'Pertarungan one versus one itu benar benar mencabar kemahiran permainan saya .')
('gpt_4', 'Grafik 2.5D memberikan dimensi yang menarik kepada permainan ini .', 'Grafik two point five dee memberikan dimensi yang menarik kepada permainan ini .')
('gpt_5', 'Suasana retro terasa apabila bermain dengan grafik 2D yang klasik .', 'Suasana retro terasa apabila bermain dengan grafik two dee yang klasik .')
('gpt_6', 'Penin

# 6. &nbsp;&nbsp; [Generate Audio for GPT Corpus](#home) <a id='gen_audio_gpt'></a>

Generate audio via the Malaya VITS model for the 3646 Malay gaming transcripts that were generated by ChatGPT. Note that:

1. Batch processing is implemented to prevent memory overflow.
2. Audio files are saved in the `gpt/wav` folder under the `gaming` folder.
3. Text manifest `gpt.txt` is generated and saved in the `gpt` folder.
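
The batching and the manifest layout can be sketched as below. This is a hedged illustration: `batched` and `build_filelist` are hypothetical stand-ins for the notebook's `gen_gpt_audio` and `gen_filelist`, whose bodies are not shown, and the actual synthesis call is omitted. The `<gaming_dir>/gpt/wav/<id>.wav` layout follows the paths in the manifest output, and `/data/gaming` is a placeholder directory.

```python
from pathlib import PurePosixPath


def batched(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_filelist(records, gaming_dir, name):
    """Pair each transcript with the wav path it is expected to be saved under."""
    wav_dir = PurePosixPath(gaming_dir) / name / "wav"
    return [[str(wav_dir / f"{rec_id}.wav"), text] for rec_id, text in records]


records = [
    ("gpt_0", "Dengan mendapat one up , saya boleh teruskan permainan ini tanpa kehilangan nyawa ."),
    ("gpt_1", "Grafik sixteen beat memberikan sentuhan nostalgia yang unik pada permainan ini ."),
    ("gpt_2", "Saya berjaya melakukan One C C pada permainan arked lama semalam !"),
]

# Process a bounded number of transcripts at a time to limit memory use.
batch_sizes = [len(batch) for batch in batched(records, batch_size=2)]
filelist = build_filelist(records, "/data/gaming", "gpt")
print(batch_sizes)
print(filelist[0][0])
```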

In [None]:
# Generate audio for GPT corpus in batches
df_gpt = pd.read_csv(cfg.paths.gpt_csv_path)
gen_gpt_audio(df_gpt, "norm_gpt", cfg.paths.gpt_dir, batch_size=100)

## 6.1 &nbsp;&nbsp; [Generate Text Manifest](#home) <a id='gen_gpt_txt'></a>

In [59]:
# Generate text manifest for all phoneme sets
# df_gpt = pd.read_csv(cfg.paths.gpt_csv_path)
gen_filelist(df_gpt, cfg.paths.gaming_dir, "gpt", "norm_gpt")

[['/home/ckabundant/Documents/tts-melayu/data/gaming/gpt/wav/gpt_0.wav',
  'Dengan mendapat one up , saya boleh teruskan permainan ini tanpa kehilangan nyawa .'],
 ['/home/ckabundant/Documents/tts-melayu/data/gaming/gpt/wav/gpt_1.wav',
  'Grafik sixteen beat memberikan sentuhan nostalgia yang unik pada permainan ini .'],
 ['/home/ckabundant/Documents/tts-melayu/data/gaming/gpt/wav/gpt_2.wav',
  'Saya berjaya melakukan One C C pada permainan arked lama semalam !'],
 ['/home/ckabundant/Documents/tts-melayu/data/gaming/gpt/wav/gpt_3.wav',
  'Pertarungan one versus one itu benar benar mencabar kemahiran permainan saya .'],
 ['/home/ckabundant/Documents/tts-melayu/data/gaming/gpt/wav/gpt_4.wav',
  'Grafik two point five dee memberikan dimensi yang menarik kepada permainan ini .'],
 ['/home/ckabundant/Documents/tts-melayu/data/gaming/gpt/wav/gpt_5.wav',
  'Suasana retro terasa apabila bermain dengan grafik two dee yang klasik .'],
 ['/home/ckabundant/Documents/tts-melayu/data/gaming/gpt/wav/