# Generate GPT Phonemes

We have successfully generated text-audio pairs for Malay gaming transcripts produced by ChatGPT 3.5. The next step is a process that converts these GPT transcripts into phonemes, which can then be used to train a vanilla VITS model.

### Our considerations:

1. Phonemes produced by espeak-en differ from those produced by eng_to_ipa, even though both are IPA. For example:

<style>
    table {
        width: 30%;
        table-layout:fixed;
    }
    th:nth-child(1),
    td:nth-child(1) {
        width: 33%;
    }
    th:nth-child(2),
    td:nth-child(2) {
        width: 33%;
    }
    th:nth-child(3),
    td:nth-child(3) {
        width: 34%;
    }
</style>

| Word | eng_to_ipa | espeak-en |
| -- | -- | -- |
| games | geɪmz | ɡˈeɪmz |
| as | ɛz | æz |
| a | ə | ɐ |
| product | ˈprɑdəkt | pɹˈɑːdʌkt |
| service | ˈsərvɪs | sˈɜːvɪs |

2. Ideally, we should standardize on a single phonemizer to ensure consistency. However, we have shown that phonemes generated by espeak-en and espeak-ms do not produce the desired audio files.

3. We noted that eng_to_ipa cannot handle heteronyms and returns a single default pronunciation for each word or character. This is an issue because we cannot select the right phonemes when customizing them:

    - E.g. 'a' defaults to the 'er' sound rather than the 'eh' sound. Hence, the phonemes generated for acronyms such as 'w a s d' would be incorrect.
    - Nevertheless, we will maintain our customized phonemes in `phon_mapping` for users to consider.

4. The slight differences in phonemes may help produce the intonation required for a Malay accent. However, we will need more synthetic data samples (at least 10,000) for the VITS model to learn the differences.

5. Nevertheless, we will first normalize the text transcripts specifically for phonemization via eng_to_ipa and espeak-en, i.e. create the mappings required to map English words to word combinations, and generate customized phonemes where appropriate.
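
To make the comparison in consideration 1 concrete, the sketch below checks whether two IPA strings differ only in stress or length marks. It is illustrative only: `strip_prosody` and `same_segments` are hypothetical helpers, and we assume that the espeak-en output uses the IPA script g (U+0261) where eng_to_ipa uses the ASCII letter.

```python
# Characters treated as prosodic marks rather than phonemes (an assumption
# made for this comparison only): primary stress, secondary stress, length.
PROSODIC_MARKS = {"\u02c8", "\u02cc", "\u02d0"}

# Assumed look-alike folding: IPA script g (U+0261) -> ASCII "g".
FOLD = {"\u0261": "g"}


def strip_prosody(ipa: str) -> str:
    """Drop stress/length marks and fold look-alike characters."""
    return "".join(FOLD.get(ch, ch) for ch in ipa if ch not in PROSODIC_MARKS)


def same_segments(ipa_a: str, ipa_b: str) -> bool:
    """True when two IPA strings differ only in prosodic marks."""
    return strip_prosody(ipa_a) == strip_prosody(ipa_b)


# 'games': the two outputs differ only in the stress mark and the g glyph.
print(same_segments("ge\u026amz", "\u0261\u02c8e\u026amz"))
# 'as': the vowels differ, so the difference is in the segments themselves.
print(same_segments("\u025bz", "\u00e6z"))
```

A check like this separates cosmetic differences (stress placement) from genuine vowel or consonant differences, which are the ones more likely to affect the synthesized audio.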

### Our intentions:

There are 3 methods to generate the phoneme list for VITS training, and we chose [espeak_eng2ipa_ms_phon](#espeak_eng2ipa_ms_phon). The reasons are provided below.

<style>
    table {
        width: 70%;
        table-layout:fixed;
    }
    th:nth-child(1),
    td:nth-child(1) {
        width: 3%;
    }
    th:nth-child(2),
    td:nth-child(2) {
        width: 17%;
    }
    th:nth-child(3),
    td:nth-child(3) {
        width: 80%;
    }
</style>

| S/N | Phoneme List | Description |
| -- | -- | -- |
| 1. | [espeak_ms_phon](#espeak_ms_phon) | &bull; Entire GPT corpus is phonemized by espeak-ms. <br>&bull; Unfortunately, all numbers are converted to their Malay readings, which differ from the GPT text-audio pairs. |
| 2. | [espeak_en_ms_phon](#espeak_en_ms_phon) | &bull; English words are phonemized by espeak-en. <br>&bull; Malay words are phonemized by espeak-ms. <br>&bull; This method would be ideal if polyglot could distinguish between English and Malay words accurately, which is not the case. |
| 3. | [espeak_eng2ipa_ms_phon](#espeak_eng2ipa_ms_phon) | &bull; English words are phonemized by eng_to_ipa. <br>&bull; Malay words and English words that are not phonemized by eng_to_ipa are phonemized by espeak-ms. <br>&bull; eng_to_ipa can only phonemize proper English words; if it cannot, it returns the original word with an asterisk appended. |

Note that:

1. There should be no unphonemized words, since espeak-ms can handle any outlier words (as shown in an earlier section), although we are not able to verify the accuracy of the resulting phonemes.
2. We implement a workaround, based on eng_to_ipa, for applying espeak-en to English words and espeak-ms to Malay words:

    - eng_to_ipa phonemizes English words and appends an asterisk to non-English words, so we use it to check whether phonemization succeeds. If it does, we phonemize the word with espeak-en instead.
    - If phonemization fails, eng_to_ipa tags the word with an asterisk. We can therefore identify non-English words and phonemize them via espeak-ms.
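
The workaround above can be sketched as follows. This is a minimal illustration rather than the notebook's implementation: `route_word` is a hypothetical helper, and the three phonemizers are passed in as plain callables (toy stand-ins here) so that the snippet runs without eng_to_ipa or espeak installed.

```python
from typing import Callable

Phonemizer = Callable[[str], str]


def route_word(word: str, eng2ipa: Phonemizer,
               espeak_en: Phonemizer, espeak_ms: Phonemizer) -> str:
    """Use eng_to_ipa only as an English detector, then pick an espeak voice."""
    probe = eng2ipa(word)
    if probe.endswith("*"):
        # Asterisk tag: eng_to_ipa does not know the word, treat it as Malay.
        return espeak_ms(word)
    return espeak_en(word)


# Toy stand-ins (assumptions) that mimic the behaviour described above.
KNOWN_ENGLISH = {"games", "service"}


def fake_eng2ipa(word: str) -> str:
    return word if word in KNOWN_ENGLISH else word + "*"


def fake_espeak_en(word: str) -> str:
    return f"<en:{word}>"


def fake_espeak_ms(word: str) -> str:
    return f"<ms:{word}>"


tokens = "games permainan service".split()
print([route_word(t, fake_eng2ipa, fake_espeak_en, fake_espeak_ms) for t in tokens])
```

In the real pipeline the stand-ins would be replaced by `eng_to_ipa.convert` and the espeak-en and espeak-ms phonemizers.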

The steps are as follows:

| S/N | Step | Description |
| -- | -- | -- |
| 1. | [Create mapping](#create_mapping) | &bull; Create `mapping_dict` for phonemization from `mapping` in `gaming.yaml`. <br>&bull; Remove duplicate keys in `gpt_mapping` that are also present in `mapping_dict`. <br>&bull; Create `custom_dict` dictionary and `translated_dict` dictionaries from `gpt_mapping` in `gaming.yaml`. <br>&bull; Create `phon_dict` dictionary from `phon_mapping` in `gaming.yaml`. |
| 2. | [Apply eng_to_ipa](#eng2ipa_gpt) | &bull; Apply phonemization on words in GPT corpus, which are detected as English by polyglot <br>&bull; Objectives are to analyse words that are phonemized (which should be English words in theory) and unphonemized words (which should be non-English words). |
| 3. | [Analyse Unphonemized Words](#unphon_analysis) | &bull; Check whether any proper English words are not phonemized by eng_to_ipa. <br>&bull; Remove unphonemized English words if they are found in `gpt_mapping`. <br>&bull; Create a mapping (i.e. `eng2ipa_mapping`) that maps English words to combinations of proper English words to replicate their pronunciation, e.g. twinking -> "twin" + "king". <br>&bull; If combining words fails to replicate the proper pronunciation, create a mapping (i.e. `phon_mapping`) that maps English words to customized phonemes based on the CMU English vocabulary (on a best-effort basis). |
| 4. | [Analyse Phonemized Words](#phon_analysis) | &bull; Check whether Malay words are incorrectly phonemized by eng_to_ipa. <br>&bull; Update the list of Malay words that are incorrectly phonemized (i.e. `wrong_malay`). |
| 5. | [Update Mapping](#update_mapping) | &bull; Update `eng2ipa_mapping` with `mapping` in `gaming.yaml` as not all English gaming terms were utilized in our current GPT corpus. <br>&bull; Update `eng2ipa_mapping` with `gpt_mapping`, which contains both customized/modified English words and English words translated to Malay. <br>&bull; Update `phon_mapping` with gaming terms whose proper pronunciation cannot be replicated by combining English words (if any). <br>&bull; Include `edge_mapping` for case-sensitive gaming terms, e.g. 'HoT' vs 'hot'. |
| 6. | [Append Normalized Text for Phonemization](#append_norm_phon) | &bull; Append normalized GPT text transcript customized for phonemization to DataFrame. |
| 7. | [Generate GPT Phonemes](#gpt_phonemes) | &bull; Generate different phonemes sets for GPT corpus. |
| 8. | [Generate Text Manifest for GPT Phonemes](#gpt_text_manifest) | &bull; Generate text manifest for generated phoneme sets. |
| 9. | [Generate Text Manifest for Testing](#gen_test_manifest) | &bull; Generate text manifest to test out list of gaming terms in `combined_gaming_terms.csv`. |

Note that:

- `gpt_mapping` contains both English-to-Malay translations and custom words that are not detected by polyglot, e.g. '2.5D'. We will update the custom words in `gpt_mapping` to ensure that they can be phonemized by eng_to_ipa.
- Whole process is similar to [Section 4: Generate GPT Mapping](#gen_gpt_mapping) except that we are creating a larger mapping (i.e. `eng2ipa_mapping`) that maps English word to combinations of proper English words instead of modifying the characters in English word to achieve proper pronunciation (best effort basis).
- Although `init_mapping` maps English words to combinations of actual/proper English words, we will leverage on `mapping`, which covers all gaming terms, so as not to miss out any gaming terms.
- Proper English words are defined as sub-words (e.g. 'vo') or actual English words that are found inside CMU English vocabulary.
- **We will only use the keys in `mapping` to generate the mapping from scratch as these mappings were constructed specifically for Malaya VITS model; and therefore not applicable for phonemization.**
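
As a rough illustration of the "proper English words" rule, the sketch below flags mapping values that contain tokens outside a given vocabulary. `invalid_values` and the toy vocabulary are assumptions made for this example; in practice the vocabulary would come from the CMU English dictionary.

```python
def invalid_values(mapping: dict, vocabulary: set) -> dict:
    """Map each key to the tokens of its value that are missing from `vocabulary`."""
    problems = {}
    for key, value in mapping.items():
        unknown = [tok for tok in value.lower().split() if tok not in vocabulary]
        if unknown:
            problems[key] = unknown
    return problems


# Toy data (assumed): a tiny vocabulary and three candidate mappings.
toy_vocab = {"twin", "king", "cool", "down"}
toy_mapping = {
    "twinking": "twin king",
    "cooldown": "cool down",
    "zerging": "ze ging",
}
print(invalid_values(toy_mapping, toy_vocab))
```

Entries that are flagged would either need a different word combination or a customized phoneme entry in `phon_mapping`.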

# Environment Setup

The following additional Python modules are required to run this Jupyter notebook:

```
translators==5.8.9
eng_to_ipa==0.0.2
bs4==0.0.1
nemo_text_processing==0.2.2rc0
herpetologist==0.0.9
malaya==5.0
malaya_speech==1.3.0.2
omegaconf==2.3.0
whisper-openai==1.0.0
pandarallel==1.6.5
polyglot==16.7.4
PySastrawi==1.2.0
```

ICU (International Components for Unicode) is required by polyglot to operate. PyICU is a python extension implemented in C++ that wraps the C/C++ ICU library. Installation of PyICU on Ubuntu is via binary packages of ICU and PyICU:

```
sudo apt-get install pkg-config libicu-dev
pip install --no-binary=:pyicu: pyicu
```

Please refer to [PyICU - PyPI](https://pypi.org/project/PyICU/) for installation method for other operating systems.
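
Before running the notebook, a quick check along these lines can confirm that the modules are importable. This is a sketch using only the standard library; `missing_modules` is a hypothetical helper, and the names listed are import names (for example, PyICU is imported as `icu`), which can differ from the package names above.

```python
from importlib.util import find_spec


def missing_modules(import_names):
    """Return the top-level import names that cannot be located."""
    return [name for name in import_names if find_spec(name) is None]


# A subset of the modules imported later in this notebook.
required = ["eng_to_ipa", "omegaconf", "pandarallel", "polyglot", "icu", "bs4"]
print("Missing:", missing_modules(required))
```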

# Import Libraries

In [None]:
import pandas as pd
from pathlib import Path
import eng_to_ipa
from omegaconf import OmegaConf
import sys
from pandarallel import pandarallel
import IPython.display as ipd
import string
import re


# Append `tts-melayu` folder to sys path
sys.path.append(Path.cwd().parent.as_posix())

%load_ext autoreload
%autoreload 2
from notebooks.src_gaming.gaming_utils import (
    gen_filelist, normalize_gaming, append_ipa,
)
from notebooks.src_gaming.gaming_gpt_phon_utils import (
    normalize_gpt_phon, expand_compound,
    phonemize_gpt, create_dict, save_json,
    extract_phon_from_mapping, keys_in_values,
    keys_in_keys, gen_unphon_gaming, extract_phon_unphon,
    check_word, extract_unique, extract_non_english,
    compute_word_count, phon_eng2ipa_espeak, common_translated_gaming,
)
from src.vits.text.cleaners import english_cleaners2
from notebooks.src_gaming.gaming_gpt_utils import normalize_gpt


# Load Configurations

In [None]:
# Load configurable parameters in gaming.yaml
cfg = OmegaConf.load("/home/ckabundant/Documents/tts-melayu/notebooks/src_gaming/gaming.yaml")
args = OmegaConf.load(cfg.paths.gpt_yaml_path)

# Initialize pandarallel
pandarallel.initialize(progress_bar=True, nb_workers=cfg.general.num_workers)

cfg.paths

# Table of Contents <a id='home'></a>
1. &nbsp;&nbsp; [Create mapping](#create_mapping)
    - Section 1.1 &nbsp;&nbsp; [Create mapping_dict](#mapping_dict)
    - Section 1.2 &nbsp;&nbsp; [Create edge_dict](#edge_dict)
    - Section 1.3 &nbsp;&nbsp; [Create phon_mapping](#phon_dict)
    - Section 1.4 &nbsp;&nbsp; [Extract Unphonemized Gaming Terms](#unphon_gaming)
    - Section 1.5 &nbsp;&nbsp; [Update mapping_dict](#update_mapping_dict)
    - Section 1.6 &nbsp;&nbsp; [Create gpt_dict](#gpt_dict)
    - Section 1.7 &nbsp;&nbsp; [Create custom_dict and translated_dict](#custom_translated_dict)
    - Section 1.8 &nbsp;&nbsp; [Get Phonemized Keys](#get_phon_keys)
    - Section 1.9 &nbsp;&nbsp; [Get Unphonemized Keys](#get_unphon_keys)
2. &nbsp;&nbsp; [Apply eng_to_ipa](#eng2ipa_gpt)
3. &nbsp;&nbsp; [Analyse Unphonemized Words](#unphon_analysis)
    - Section 3.1 &nbsp;&nbsp; [Expand compound words](#expand_compound)
    - Section 3.2 &nbsp;&nbsp; [Phonemize expanded words](#phon_expand)
    - Section 3.3 &nbsp;&nbsp; [Extract unphonemized words](#extract_unphon)
    - Section 3.4 &nbsp;&nbsp; [Update dictionaries](#update_dict)
4. &nbsp;&nbsp; [Analyse Phonemized Words](#phon_analysis)
    - Section 4.1 &nbsp;&nbsp; [Extract Phonemized Words](#extract_en_phon)
    - Section 4.2 &nbsp;&nbsp; [Identify Wrong Malay Words](#wrong_malay)
5. &nbsp;&nbsp; [Generate gaming_gpt.yaml](#gen_gaming_yaml)
    - Section 5.1 &nbsp;&nbsp; [Update custom_dict and translated_dict](#update_custom_translated)
    - Section 5.2 &nbsp;&nbsp; [Combine All Mapping](#combine_all_mapping)
6. &nbsp;&nbsp; [Append Normalized Text](#append_norm_phon)
7. &nbsp;&nbsp; [Generate GPT Phonemes](#gpt_phonemes)
    - Section 7.1 &nbsp;&nbsp; [Phonemize with eng_to_ipa](#phon_eng2ipa)
    - Section 7.2 &nbsp;&nbsp; [espeak_ms](#espeak_ms)
    - Section 7.3 &nbsp;&nbsp; [espeak_en_ms](#espeak_en_ms)
    - Section 7.4 &nbsp;&nbsp; [eng2ipa_espeak_ms](#eng2ipa_espeak_ms)
8. &nbsp;&nbsp; [Generate Text Manifest for GPT Phonemes](#gpt_text_manifest)
    - Section 8.1 &nbsp;&nbsp; [Text Manifest (espeak_ms)](#espeak_ms_txt)
    - Section 8.2 &nbsp;&nbsp; [Text Manifest (espeak_en_ms)](#espeak_en_ms_txt)
    - Section 8.3 &nbsp;&nbsp; [Text Manifest (eng2ipa_espeak_ms)](#eng2ipa_espeak_ms_txt)
    - Section 8.4 &nbsp;&nbsp; [Text Manifest (norm_phon)](#norm_phon_txt)
9. &nbsp;&nbsp; [Generate Test Filelist](#gen_test_filelist)
    - Section 9.1 &nbsp;&nbsp; [Text Manifest (gpt_char)](#gpt_char)
    - Section 9.2 &nbsp;&nbsp; [Text Manifest (gpt_espeak_en)](#gpt_espeak_en)
    - Section 9.3 &nbsp;&nbsp; [Text Manifest (gpt_eng2ipa)](#gpt_eng2ipa)



## 1. &nbsp;&nbsp; [Create mapping](#home) <a id='create_mapping'></a>

Before phonemization, we have to create a new mapping to normalize the GPT corpus, for the following reasons:

1. We are not able to use the mapping that was used to generate the audio pairs, since it maps to modified words whose characters have been altered so that they are no longer actual English words.
2. Since espeak and eng_to_ipa are unlikely to phonemize these modified words properly, we need to create a new mapping (i.e. `eng2ipa_mapping`) that maps English words to proper English words.

Reasons for using eng_to_ipa to phonemize are as follows:

1. As shown in earlier sections, polyglot is not able to differentiate between English words and corresponding Malay words that sound similar, e.g. 'active' vs 'aktif'.
2. Therefore we use eng_to_ipa, which should phonemize only English words, i.e. non-English words will be tagged with an asterisk.

Unfortunately, we have to scrutinize the phonemes generated by eng_to_ipa as phonemization is not perfect:

1. Words that are relatively new and not yet in the CMU English vocabulary cannot be phonemized, e.g. 'gank' used in gaming.
2. Malay words (like 'ada', 'yang', etc., which shouldn't be phonemized) were phonemized by eng_to_ipa with an English pronunciation, e.g. 'ada', which should be pronounced as 'ah da', was phonemized to sound like 'eh da'.
3. eng_to_ipa does not handle heteronyms, e.g. there is only one pronunciation for 'read', which defaults to the 'reed' sound.

We will generate six mappings, namely `mapping_dict`, `edge_dict`, `gpt_dict`, `custom_dict`, `translated_dict` and `phon_dict`. `translated_dict` will be appended to `custom_dict` to form `eng2ipa_mapping`. We will generate `gaming_gpt.yaml` by combining `phon_dict`, `edge_dict` and `eng2ipa_mapping`. The steps are as follows:

<style>
    table {
        width: 70%;
        table-layout:fixed;
    }
    th:nth-child(1),
    td:nth-child(1) {
        width: 3%;
    }
    th:nth-child(2),
    td:nth-child(2) {
        width: 17%;
    }
    th:nth-child(3),
    td:nth-child(3) {
        width: 80%;
    }
</style>

| S/N | Step | Description |
| -- | -- | -- |
| 1. | [Create mapping_dict](#mapping_dict) | &bull; Amend `mapping` to cater for eng_to_ipa phonemization. |
| 2. | [Create edge_dict](#edge_dict) | &bull; Map case-sensitive gaming terms, e.g. 'HoT' versus 'hot'. |
| 3. | [Create phon_mapping](#phon_dict) | &bull; Convert `phon_mapping` in `gaming.yaml` to dictionary. |
| 4. | [Extract Unphonemized Gaming Terms](#unphon_gaming) | &bull; Extract `term` and `eng2ipa` columns from `combined_gaming_terms.csv`. |
| 5. | [Update mapping_dict](#update_mapping_dict) | &bull; Add unphonemized gaming terms to `mapping_dict`. |
| 6. | [Create gpt_dict](#gpt_dict) | &bull; Convert `gpt_mapping` in `gaming.yaml` to dictionary. |
| 7. | [Create custom_dict and translated_dict](#custom_translated_dict) | &bull; Split `gpt_dict` dictionary into `custom_dict` and `translated_dict` for further processing. |
| 8. | [Get Phonemized Keys](#get_phon_keys) | &bull; Get list of phonemized keys for `custom_dict` and `translated_dict`. |
| 9. | [Get Unphonemized Keys](#get_unphon_keys) | &bull; Get list of unphonemized keys (from eng_to_ipa) in `custom_dict`. |
| 10. | [Generate gaming_gpt.yaml](#gen_gaming_yaml) | &bull; Generate `gaming_gpt.yaml` with `mapping_dict`, `edge_dict`, `custom_dict` and `translated_dict`. |
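
The way the dictionaries come together can be sketched with plain Python dictionaries. This is an assumption-laden illustration: `merge_disjoint` is a hypothetical helper, the toy entries are made up, and the section names used for the combined structure (`phon_mapping`, `edge_mapping`, `eng2ipa_mapping`) are our guess at how `gaming_gpt.yaml` could be laid out.

```python
def merge_disjoint(first: dict, second: dict) -> dict:
    """Merge two dictionaries, refusing to silently overwrite shared keys."""
    clashes = sorted(first.keys() & second.keys())
    if clashes:
        raise ValueError(f"Keys defined twice: {clashes}")
    return {**first, **second}


# Toy stand-ins for the real dictionaries built in this section.
toy_custom = {"2.5D": "two point five dee"}
toy_translated = {"abilities": "kebolehan"}
toy_phon = {"gank": "ˈɡæŋk"}
toy_edge = {"HoT": "H o tea"}

# `translated_dict` is appended to `custom_dict` to form `eng2ipa_mapping`.
toy_eng2ipa_mapping = merge_disjoint(toy_custom, toy_translated)

# Assumed layout of the combined configuration.
toy_gaming_gpt = {
    "phon_mapping": toy_phon,
    "edge_mapping": toy_edge,
    "eng2ipa_mapping": toy_eng2ipa_mapping,
}
print(sorted(toy_gaming_gpt["eng2ipa_mapping"]))
```

Raising on duplicate keys, rather than letting one dictionary win, makes accidental overlaps visible before the combined mapping is written out.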


## 1.1 &nbsp;&nbsp; [Create mapping_dict](#home) <a id='mapping_dict'></a>

We attempt to use `mapping` in `gaming.yaml` (total of 380 key-value pairs) as far as possible:

1. Combine proper English words to replicate the pronunciation.
2. If the word combination doesn't generate the proper pronunciation (slight deviation), we provide customized phonemes instead.

In [None]:
# Convert `mapping` OmegaConf dictionary to `mapping_dict` dictionary
mapping_dict = OmegaConf.to_object(cfg.mapping)
mapping_dict

## 1.2 &nbsp;&nbsp; [Create edge_dict](#home) <a id='edge_dict'></a>

This dictionary maps words whose meaning depends on their case, e.g. 'HoT' versus 'hot'.

In [None]:
edge_dict = {
    "DoT": "dee o tea",
    "HoT": "H o tea",
    "t-pose": "tea pos",
    "T-pose": "tea pos",
}
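
One way such a case-sensitive dictionary could be applied is to run it before any case-insensitive mapping, so that 'HoT' is rewritten before it can be confused with 'hot'. The sketch below is a hypothetical illustration (`apply_mapping` and `normalize_sketch` are not the project's functions), using toy dictionaries.

```python
import re


def apply_mapping(text: str, mapping: dict, ignore_case: bool) -> str:
    """Replace whole-word occurrences of each key with its mapped value."""
    flags = re.IGNORECASE if ignore_case else 0
    # Longest keys first so multi-word terms win over their sub-terms.
    for key in sorted(mapping, key=len, reverse=True):
        pattern = rf"(?<!\w){re.escape(key)}(?!\w)"
        # A function replacement avoids backslash processing in the value.
        text = re.sub(pattern, lambda _m, value=mapping[key]: value, text, flags=flags)
    return text


def normalize_sketch(text: str, edge: dict, general: dict) -> str:
    """Apply case-sensitive edge terms first, then the general mapping."""
    text = apply_mapping(text, edge, ignore_case=False)
    return apply_mapping(text, general, ignore_case=True)


toy_edge_dict = {"HoT": "H o tea"}
toy_general_dict = {"cooldown": "cool down"}
print(normalize_sketch("HoT has a long Cooldown but the lava is hot",
                       toy_edge_dict, toy_general_dict))
```

With these toy inputs, the lowercase 'hot' at the end of the sentence is left untouched while 'HoT' is expanded.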

## 1.3 &nbsp;&nbsp; [Create phon_mapping](#home) <a id='phon_dict'></a>

We leverage on `phon_mapping` in `gaming.yaml` by converting it to `phon_dict`; and compare our custom phonemes with phonemes generated by espeak-en.

Our observations:

1. Our custom phonemes are slightly different from those generated by espeak-en.
2. While we are not able to discern the correct phonemes, we assume that espeak-en is more accurate than our custom phonemes.

##### Convert `phon_mapping` to dictionary

In [None]:
# Convert OmegaConf dictionary `phon_mapping` to dictionary
phon_dict = OmegaConf.to_object(cfg.phon_mapping)

for k, v in phon_dict.items():
    print(f"{k:<25} : {v}")

##### Custom phonemes versus espeak-en phonemes

In [None]:
# Compare between custom phonemes with that generated by espeak-en
df_phon = pd.DataFrame.from_dict(phon_dict, orient="index")
df_phon = df_phon.reset_index()
df_phon.columns = ["words", "custom_phon"]

# Append phonemes generated by espeak-en (via `english_cleaners2`) to DataFrame
df_phon["espeak_en"] = df_phon["words"].parallel_map(english_cleaners2)
df_phon

## 1.4 &nbsp;&nbsp; [Extract Unphonemized Gaming Terms](#home) <a id='unphon_gaming'></a>

Our observations:

1. A list of 260 unphonemized gaming terms was generated from `combined_gaming_terms.csv`.

In [None]:
# Generate unphonemized gaming list
unphon_gaming_list = gen_unphon_gaming(cfg.paths.combined_path)
print(len(unphon_gaming_list))

unphon_gaming_list

## 1.5 &nbsp;&nbsp; [Update mapping_dict](#home) <a id='update_mapping_dict'></a>

We will first phonemize unphonemized gaming terms after removing the hyphen. The final unphonemized gaming terms (together with its mapping) will be added to `mapping_dict`. Subsequently, phonemized gaming terms are removed from `mapping_dict`.

Our observations:

1. 260 gaming terms were left unphonemized by eng_to_ipa, largely due to the presence of hyphens.
2. Of the 260 gaming terms, 86 unphonemized gaming terms were not found in init_mapping.
3. Final unphonemized gaming terms consist of 50 terms that will require mapping.
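
The retry-after-removing-hyphens idea can be sketched as below. `still_unphonemized` is a hypothetical helper, and `toy_convert` is a toy stand-in that mimics the asterisk tagging described earlier; in the notebook the converter would be `eng_to_ipa.convert`.

```python
def still_unphonemized(terms, convert):
    """Return de-hyphenated terms that the converter still tags with '*'."""
    leftover = []
    for term in terms:
        dehyphenated = term.replace("-", " ")
        ipa = convert(dehyphenated)
        # A term counts as unphonemized if any of its words carries the tag.
        if any(token.endswith("*") for token in ipa.split()):
            leftover.append(dehyphenated)
    return leftover


# Toy converter (assumption): known words map to IPA, unknown words get '*'.
TOY_VOCAB = {"add": "æd", "on": "ɑn", "squad": "skwɑd"}


def toy_convert(text: str) -> str:
    return " ".join(TOY_VOCAB.get(word, word + "*") for word in text.lower().split())


print(still_unphonemized(["add-on", "gank-squad"], toy_convert))
```

Here 'add-on' is resolved once the hyphen is removed, whereas 'gank squad' remains and would need a mapping.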

##### Identify gaming terms in unphon_gaming not found in mapping_dict

In [None]:
unphon_gaming_not_in_mapping = extract_unique([unphon_gaming_list, mapping_dict.keys()])
unphon_gaming_not_in_mapping

##### Convert unphon_gaming_not_in_mapping to DataFrame

In [None]:
df_unphon = pd.DataFrame({"words": unphon_gaming_not_in_mapping})
df_unphon["eng2ipa"] = df_unphon["words"].parallel_map(eng_to_ipa.convert)
df_unphon

##### Extract final unphonemized words

In [None]:
df_final_unphon = extract_phon_unphon(df_unphon, "unphon")
print(len(df_final_unphon))
df_final_unphon

##### Create `update_dict` and `update_phon`

We create customized phonemes for words that cannot be represented by word combinations.

Our observations:

1. 36 key-value pairs to add to `mapping_dict`.
2. 13 key-value pairs to add to `phon_dict`.

In [None]:
update_dict = {
    "miniboss": "mini boss",
    "poggers": "pork girls",
    "button mashing": "button mash shing",
    "aimbot": "aim bought",
    "completionist": "completion nist",
    "telegraphing": "telegraph fin",
    "waifu": "wai fu",
    "chiptune": "chip tune",
    "integrated i/o": "integrated I O",
    "zerging": "ze ging",
    "anti rpg": "anti R pee gee",
    "real time corruptor": "real time corrupt ter",
    "minimap": "mini map",
    "cpu versus cpu": "C pee you versus C pee you",
    "leaderboard": "leader board",
    "overwatch": "over watch",
    "on disc dlc": "on disc D el C",
    "cheevo": "chee vo",
    "arena fps": "arena F pee ass",
    "transmog": "trans mock",
    "always on drm": "always on dee R em",
    "team deathmatch": "team death match",
    "bullshot": "bull shot",
    "softlock": "soft lock",
    "hitscan": "hit scan",
    "duping": "dupe ping",
    "sistering": "sister ring",
    "permadeath": "perma death",
    "animatic": "any matic",
    "metastory": "meta story",
    "deathmatch": "death match",
    "waggle": "wag girl",
    "nerfing": "nerve fin",
    "superboss": "super boss",
    "spamming": "spam ming",
    "bot": "bought",
}

update_phon = {
    "proc": "ˈpɹɑk",
    "battler": "ˈbætəlɚ",
    "newb": "ˈnoʊb",
    "noob": "ˈnoʊb",
    "laner": "ˈleɪnɚ",
    "gank": "ˈɡæŋk",
    "squish": "ˈskwɪʃ",
    "telefrag": "ˈtɛliˈfɹæk",
    "metroidvania": "ˈmɛtˌɹɔɪdˈvˈeɪniə",
    "frag": "ˈfɹæk",
    "freemium": "ˈfɹimiəm",
    "influencer": "ˈɪnfluənsɚ",
    "gibs": "ˈɡɪbz",
}

##### Update `mapping_dict`

Our observations:

1. 416 key-value pairs in mapping_dict (up from 380 key-value pairs)

In [None]:
mapping_dict.update(update_dict)
print(len(mapping_dict))
mapping_dict

##### Update `phon_dict`

Our observations:

1. 23 key-value pairs currently in `phon_dict`.

In [None]:
phon_dict.update(update_phon)

# sort `phon_dict`
phon_dict = dict(sorted(phon_dict.items()))
print(len(phon_dict))
phon_dict

##### Update `mapping_dict` based on phonemized keys

Our observations:

1. 215 keys in `mapping` can be phonemized by eng_to_ipa.
2. Out of 215 keys, 170 keys will be removed.
3. Of the remaining 45 keys, mapping for 16 keys will be updated, which are mainly dealing with numbers.
4. The updated `mapping_dict` is left with 210 key-value pairs.

In [None]:
# Print out list of phonemized keys to generate dictionary to remove relevant phonemized keys
df_phon_mapping = extract_phon_from_mapping(mapping_dict, "phon")

for row in df_phon_mapping.itertuples(index=False, name=None):
    print(row)

In [None]:
# List of phonemized keys to remove from `mapping_dict`
delete_list = [
    "act",
    "ace",
    "achievement",
    "achievements",
    "add on",
    "adds",
    "aiming down sights",
    "animation priority",
    "area",
    "asymmetric gameplay",
    "asynchronous gameplay",
    "arcade",
    "asset",
    "badge",
    "base",
    "balancing",
    "blacklist",
    "buff",
    "bug",
    "camping",
    "cartridge",
    "casual",
    "character",
    "cheating",
    "checkpoint",
    "cinematic",
    "clap",
    "clapped",
    "clapping",
    "clicker",
    "clicking",
    "clipping",
    "clock",
    "clocked",
    "clone",
    "closed",
    "clutch",
    "combo",
    "console",
    "coyote",
    "cracked",
    "crowd control",
    "crunch",
    "cut in",
    "damage",
    "de",
    "devolution",
    "dialog",
    "dialogue",
    "difficulty",
    "directional pad",
    "diverse",
    "digital rights management",
    "dungeon",
    "dynamic",
    "early access",
    "emergent game play",
    "extra sensory perception cheats",
    "fear of missing out",
    "field of view",
    "final",
    "frame buffer",
    "frames",
    "gameplay",
    "game mechanics",
    "game sense",
    "gamer",
    "gamers",
    "gating",
    "god",
    "grind",
    "grinding",
    "hate",
    "heal",
    "health",
    "hit point",
    "horde",
    "hud",
    "identity",
    "idle",
    "in",
    "infinite",
    "invasion",
    "item",
    "johns",
    "joystick",
    "juggernaut",
    "juggling",
    "jump",
    "kill stealing",
    "launch game",
    "lets",
    "localization",
    "live service games",
    "loot",
    "match fixing",
    "macro",
    "magic",
    "main",
    "maxed out",
    "micro",
    "monetization",
    "motion blur",
    "mud",
    "multiplayer",
    "multiplier",
    "no clip mode",
    "nuke",
    "odd ball",
    "on disc",
    "overpowered",
    "out of bounds",
    "pause",
    "palette swap",
    "peak",
    "perks",
    "pervasive game",
    "physical",
    "physics",
    "point of no return",
    "pixel",
    "pog",
    "pub",
    "pug",
    "pull",
    "purchase",
    "quest giver",
    "radar",
    "rage",
    "ratio",
    "ray tracing",
    "reactivity",
    "reboot",
    "remake",
    "rhythm game",
    "runner",
    "saved game",
    "scaling",
    "scuffed",
    "secret level",
    "sequence breaking",
    "skins",
    "smurf",
    "storytelling",
    "specialization",
    "squeaker",
    "status effects",
    "status",
    "snip",
    "stream sniping",
    "streaming",
    "sweat",
    "telegraph",
    "tick",
    "timed",
    "title screen",
    "tower dive",
    "toxicity",
    "turn based",
    "underpowered",
    "underworld",
    "unlock",
    "under levelled",
    "ultimate",
    "upgrade",
    "virtual reality",
    "wall bang",
    "wall climb",
    "whale",
    "youtube bait",
]

In [None]:
# Generate updated dictionary to combine actual English words to generate desired pronunciation
updated_dict = {
    "2.5d graphics": "two point five dee graphics",
    "2d graphics": "two dee graphics",
    "3d graphics": "three dee graphics",
    "8k resolution": "eight kay resolution",
    "8 bit": "eight bit",
    "16 bit": "sixteen bit",
    "32 bit": "thirty two bit",
    "64 bit": "sixty four bit",
    "abandonware": "abandon ware",
    "borderless fullscreen windowed": "border less full screen windowed",
    "indie game": "in dee game",
    "kd ratio": "kay dee ratio",
    "lp": "lets play",
    "noclip mode": "no clip mode",
    "vaporware": "vapor ware",
    "wasd keys": "w a s d keys",
}

In [None]:
# Update mapping_dict with updated_dict
mapping_dict.update(updated_dict)

# Remove irrelevant keys based on delete_list
for key in delete_list:
    mapping_dict.pop(key)

# Sort by key in ascending order
mapping_dict = dict(sorted(mapping_dict.items()))
mapping_dict
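
Note that `dict.pop(key)` raises `KeyError` if a key in the deletion list is misspelled or has already been removed, for example when the cell is run twice. A more forgiving variant, shown here as a hypothetical helper `drop_keys` on toy data, reports missing keys instead of failing.

```python
def drop_keys(mapping: dict, keys) -> tuple:
    """Return a pruned copy of `mapping` and the keys that were not present."""
    pruned = dict(mapping)
    missing = []
    for key in keys:
        if key in pruned:
            del pruned[key]
        else:
            missing.append(key)
    return pruned, missing


toy_mapping_dict = {"act": "act", "aggro": "ag grow"}
pruned, missing = drop_keys(toy_mapping_dict, ["act", "not a key"])
print(pruned, missing)
```

Working on a copy also keeps the original dictionary intact, which makes the cell safe to re-run.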

##### Update `mapping_dict` based on unphonemized keys

Our observations:

1. 210 key-value pairs with unphonemized key were detected.
2. We updated 51 key-value pairs to ensure that English words are mapped to proper word combination.
3. Certain words are not able to be represented with word combinations (e.g. 'downloadable', 'mudflation', 'monetization', etc.). We will generate the corresponding customized phonemes; and remove from `mapping_dict`. 

In [None]:
# Print out list of unphonemized keys to decide which mappings need to be updated
df_unphon_mapping = extract_phon_from_mapping(mapping_dict, "unphon")

for row in df_unphon_mapping.itertuples(index=False, name=None):
    print(row)

In [None]:
# Generate updated dictionary to combine actual English words to generate desired pronunciation
updated_dict_1 = {
    "aggro": "ag grow",
    "control pt": "control point",
    "cooldown": "cool down",
    "co op": "cooperative game play",
    "corruptor": "corrupt ter",
    "cranking 90s": "cranking nineties",
    "ctf": "capture the flag",
    "dbno": "down but not out",
    "debuff": "dee buff",
    "demake": "dee make",
    "destructible": "destructable",
    "dpm": "damage per minute",
    "drm": "digital rights management",
    "emulator": "E mule later",
    "esports": "E sports",
    "exp": "experience point",
    "fangame": "fan game",
    "foozle": "foo ze",
    "fotm": "flavor of the month",
    "gimp": "gym",
    "git gud": "git good",
    "goty": "game of the year",
    "griefer": "grief fur",
    "hitbox": "hit box",
    "iap": "in app purchase",
    "iframes": "eye frames",
    "microtransaction": "micro transaction",
    "pentakill": "penta kill",
    "playthrough": "play through",
    "pulc": "premium unlockable content",
    "qa": "queue A",
    "qq": "queue queue",
    "ragedoll": "rage doll",
    "ragequit": "rage quit",
    "remorting": "re mort ting",
    "rngesus": "R and jesus",
    "roguelike": "rogue like",
    "roguelite": "rogue light",
    "save scumming": "save scum ming",
    "shovelware": "shovel ware",
    "speedrunning": "speed running",
    "theorycraft": "theory craft",
    "thumbstick": "thumb stick",
    "touchscreen": "touch screen",
    "trickjump": "trick jump",
    "vrr": "variable refresh rate",
    "walkthrough": "walk through",
    "wallbang": "wall bang",
    "wallhack": "wall hack",
    "wp": "W pee",
    "xp": "experience point",
}

In [None]:
# Update `mapping_dict` with `updated_dict_1`
mapping_dict.update(updated_dict_1)

# Sort by key in ascending order
mapping_dict = dict(sorted(mapping_dict.items()))
mapping_dict

## 1.6 &nbsp;&nbsp; [Create gpt_dict](#home) <a id='gpt_dict'></a>

In [None]:
# Convert `gpt_mapping` OmegaConf dictionary to `gpt_dict` dictionary
gpt_dict = OmegaConf.to_object(cfg.gpt_mapping)
gpt_dict

## 1.7 &nbsp;&nbsp; [Create custom_dict and translated_dict](#home) <a id='custom_translated_dict'></a>

`translated_mapping` starts from 'abilities' in `gpt_mapping`. Convert `gpt_dict` keys to list and determine the index for 'abilities'.

In [None]:
# Separate `gpt_dict` to `custom_dict` and `translated_dict`
custom_dict, translated_dict = create_dict(gpt_dict)
custom_dict

In [None]:
translated_dict

##### Remove words in `translated_dict` that are found in values of `mapping_dict`

The rationale is to prevent normalized gaming terms from being amended again by `eng2ipa_mapping` (i.e. the combined mapping).

Our observations:

1. 'counter terrorist', 'flavor', 'lets', 'minute', 'variable', 'year' are found in the values of `mapping_dict`.
2. Keys in `translated_dict` are different from keys in `custom_dict`.
3. Keys that are common in `custom_dict` and `mapping_dict` are '2.5D', '2D', '3D', '4K', '8K', and 'wasd'.
4. Similarly, keys in `translated_dict` are different from keys in `mapping_dict`.

In [None]:
# Check for common keys in `custom_dict` and `translated_dict`
keys_in_keys(custom_dict, translated_dict)

In [None]:
# Check for common keys in `custom_dict` and `mapping_dict`
keys_in_keys(custom_dict, mapping_dict)

In [None]:
# Check for common keys in `translated_dict` and `mapping_dict`
keys_in_keys(translated_dict, mapping_dict)

In [None]:
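# Quick check: does `term` occur in `text` as a whole word (case-insensitive)?
term = "product"
text = "Games as a Product"

# Match the term as the whole string, surrounded by whitespace,
# at the start followed by a space, or at the end preceded by a space
pattern = rf"^{re.escape(term)}$|\s+{re.escape(term)}\s+|^{re.escape(term)} | {re.escape(term)}$"
if re.search(pattern, text, flags=re.IGNORECASE):
    print("True")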

# re.findall(pattern, text, flags=re.IGNORECASE)
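
The four alternatives in the pattern above can be collapsed into a single expression with lookarounds. The sketch below is an alternative formulation, not the project's code; `contains_term` is a hypothetical helper. Like the original pattern, it treats only whitespace and the string boundaries as delimiters, so a term followed directly by punctuation is not matched.

```python
import re


def contains_term(term: str, text: str) -> bool:
    """True if `term` appears delimited by whitespace or the string edges."""
    pattern = rf"(?<!\S){re.escape(term)}(?!\S)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_term("product", "Games as a Product"))
print(contains_term("product", "Games as a Production"))
print(contains_term("product", "Product, then service"))
```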

In [None]:
# Check for common keys in `translated_dict` and gaming terms
remove_list = common_translated_gaming(cfg.paths.combined_path, translated_dict)
print(remove_list)

# Remove above keys in translated_dict
for key in remove_list:
    translated_dict.pop(key)

translated_dict

In [None]:
# Find list of keys in translated_dict that are also found in value of mapping_dict
remove_key = keys_in_values(translated_dict, mapping_dict)
print(remove_key)

# Remove above keys in translated_dict
for key in remove_key:
    translated_dict.pop(key)

translated_dict
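
For readers without access to the helper module, the idea behind checking keys against the values of another dictionary can be sketched as follows. `appears_in_values` is a hypothetical stand-in and may differ from the actual `keys_in_values`; the toy entries are made up.

```python
def appears_in_values(candidates, mapping: dict) -> list:
    """Return candidate keys that occur as whole words or phrases in any value."""
    # Pad each value with spaces so that only whole-token matches count.
    padded = [f" {' '.join(value.lower().split())} " for value in mapping.values()]
    return [key for key in candidates
            if any(f" {key.lower()} " in value for value in padded)]


toy_translated_dict = {"flavor": "perisa", "minute": "minit", "shield": "perisai"}
toy_mapping_values = {"fotm": "flavor of the month", "dpm": "damage per minute"}

overlap = appears_in_values(toy_translated_dict, toy_mapping_values)
print(overlap)

# Keep only the entries that would not re-translate an already normalized term.
kept = {k: v for k, v in toy_translated_dict.items() if k not in overlap}
print(kept)
```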

##### Remove keys in `mapping_dict`, `custom_dict`, and `translated_dict` that are also present in `phon_dict`

Our observations:

1. 8 keys in `mapping_dict` are also present in `phon_dict`: 'downloadable', 'gimp', 'judder', 'mudflation', 'noob', 'transmogrification', 'unlockable', 'unlocks'.
2. 203 key-value pairs are left in `mapping_dict`.
3. No keys in either `custom_dict` or `translated_dict` are found in `phon_dict`.

In [None]:
# Identify list of keys that are present in `mapping_dict` and `phon_dict`
remove_keys = keys_in_keys(mapping_dict, phon_dict)
print(remove_keys)

# Remove duplicated keys
for key in remove_keys:
    mapping_dict.pop(key)

mapping_dict

In [None]:
# Identify list of keys that are present in `custom_dict` and `phon_dict`
remove_key = keys_in_keys(custom_dict, phon_dict)
remove_key

In [None]:
# Identify list of keys that are present in `translated_dict` and `phon_dict`
remove_key = keys_in_keys(translated_dict, phon_dict)
remove_key
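
The helpers `keys_in_keys` and `keys_in_values` are defined outside this notebook, so their exact behaviour is not shown here. The following is a minimal sketch of what such overlap checks could look like, assuming plain `dict` inputs and whole-word matching inside values. `drop_keys` and all sample data are hypothetical and for illustration only.

```python
def keys_in_keys(first: dict, second: dict) -> list:
    """Return the sorted keys of `first` that are also keys of `second`."""
    return sorted(set(first) & set(second))


def keys_in_values(first: dict, second: dict) -> list:
    """Return the sorted keys of `first` that occur as a whole word in any value of `second`."""
    value_words = set()
    for value in second.values():
        value_words.update(str(value).lower().split())
    return sorted(key for key in first if key.lower() in value_words)


def drop_keys(target: dict, keys) -> None:
    """Hypothetical helper: remove `keys` from `target` in place, skipping missing keys."""
    for key in keys:
        target.pop(key, None)


# Toy data (not taken from the real dictionaries)
mapping = {"noob": "newbie", "dlc": "downloadable content"}
phon = {"noob": "<phonemes>", "haah": "ˈhɑɑ"}
translated = {"content": "kandungan", "game": "permainan"}

overlap = keys_in_keys(mapping, phon)            # keys shared by both dictionaries
in_values = keys_in_values(translated, mapping)  # keys that reappear inside mapping values
drop_keys(mapping, overlap)
```

Using `pop(key, None)` in the sketch keeps a cell safe to re-run after the keys have already been removed. Note that this word-level check would not detect multi-word keys.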

## 1.8 &nbsp;&nbsp; [Get Phonemized Keys](#home) <a id='get_phon_keys'></a>

Our observations:

1. 17 keys in `custom_dict` are phonemized.
2. The phonemes for 'sci fi' (ˈɛsˈsiˈaɪ fi) are incorrect: 'sci' is spelled out letter by letter (s-c-i) instead of being pronounced as a word.
3. 721 keys in `translated_dict` are phonemized. No action is required as all 721 keys are proper English words.
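
eng_to_ipa re-prints a word it cannot find with a trailing `*`, which is what the cells in 1.8 and 1.9 test for. Those cells inspect only the last character of the converted string, so for a multi-word key only the final word is checked. The sketch below checks every token instead. `is_fully_phonemized`, `partition_keys` and `fake_convert` are hypothetical names, and `fake_convert` is a stand-in for `eng_to_ipa.convert` so that the example runs without the library.

```python
from typing import Callable, Iterable, List, Tuple


def is_fully_phonemized(ipa: str) -> bool:
    """True when no whitespace-separated token carries the trailing '*' marker."""
    return all(not token.endswith("*") for token in ipa.split())


def partition_keys(
    keys: Iterable[str], convert: Callable[[str], str]
) -> Tuple[List[str], List[str]]:
    """Split `keys` into (phonemized, unphonemized) according to `convert`."""
    phonemized, unphonemized = [], []
    for key in keys:
        target = phonemized if is_fully_phonemized(convert(key)) else unphonemized
        target.append(key)
    return phonemized, unphonemized


def fake_convert(text: str) -> str:
    """Stand-in converter: unknown words come back unchanged with a '*' appended."""
    known = {"games": "geɪmz", "as": "ɛz", "a": "ə", "service": "ˈsərvɪs"}
    return " ".join(known.get(word, word + "*") for word in text.split())


phon_keys, unphon_keys = partition_keys(
    ["games", "xbox", "xbox games", "games as a service"], fake_convert
)
```

With the last-character check, 'xbox games' would count as phonemized because its final word converts cleanly; the token-level check places it in the unphonemized group.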

##### Phonemized key-value pair for custom_dict

In [None]:
custom_phon_keys = [key for key in custom_dict.keys() if eng_to_ipa.convert(key)[-1] != "*"]

for key in custom_phon_keys:
    print(f"{key:<15} : {custom_dict[key]}")

##### Phonemized keys for translated_dict

In [None]:
translated_phon_keys = [key for key in translated_dict.keys() if eng_to_ipa.convert(key)[-1] != "*"]

for key in translated_phon_keys:
    print(f"{key:<15} : {translated_dict[key]}")

## 1.9 &nbsp;&nbsp; [Get Unphonemized Keys](#home) <a id='get_unphon_keys'></a>

Our observations:

1. 11 keys in `custom_dict` are unphonemized.
2. 13 keys in `translated_dict` are unphonemized, i.e. only 13 out of 734 English words in `translated_dict` are unphonemized.

##### Unphonemized keys for custom_dict

In [None]:
custom_unphon_keys = [key for key in custom_dict.keys() if eng_to_ipa.convert(key)[-1] == "*"]

for key in custom_unphon_keys:
    print(f"{key:<15} : {custom_dict[key]}")

##### Unphonemized keys for translated_dict

In [None]:
translated_unphon_keys = [key for key in translated_dict.keys() if eng_to_ipa.convert(key)[-1] == "*"]

for key in translated_unphon_keys:
    print(f"{key:<20} : {translated_dict[key]}")

# 2. &nbsp;&nbsp; [Apply eng_to_ipa](#home) <a id='eng2ipa_gpt'></a>

Phonemize all 1833 English words detected by polyglot in the GPT corpus.

In [None]:
df_en_word = pd.read_csv(cfg.paths.gpt_en_path)
df_en_word["eng2ipa"] = df_en_word["words"].parallel_map(eng_to_ipa.convert)
df_en_word.to_csv(cfg.paths.gpt_en_path, index=False)
df_en_word

# 3. &nbsp;&nbsp; [Analyse Unphonemized Words](#home) <a id='unphon_analysis'></a>

The words that eng_to_ipa could not phonemize fall into the following groups:

<style>
    table {
        width: 70%;
        table-layout:fixed;
    }
    th:nth-child(1),
    td:nth-child(1) {
        width: 3%;
    }
    th:nth-child(2),
    td:nth-child(2) {
        width: 17%;
    }
    th:nth-child(3),
    td:nth-child(3) {
        width: 80%;
    }
</style>

| S/N | Unphonemized Word | Examples |
| -- | -- | -- |
| 1. | Compound words linked by hyphen | action-adventure, role-playing, no-scope, etc. |
| 2. | Malay words | aktif, evolusi, tangan, etc. |
| 3. | Compound words without white space | artbook, microtransactions, roguelike, etc. |
| 4. | Abbreviations with slash | i/o, k/d, etc. |

Hence, we took the following approaches:

1. Expand compound words, since the individual words that make up a compound can be phonemized separately, then remove any duplicates.
2. Phonemize the expanded words and extract the remaining unphonemized words for analysis.
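
The expansion step can be sketched as follows. This is a minimal illustration, assuming that `expand_compound` splits on hyphens and slashes and de-duplicates the parts. The real helper takes a pandas Series and returns a DataFrame, whereas this version works on a plain list to stay self-contained.

```python
import re


def expand_compound(words):
    """Split each word on hyphens or slashes and return the unique parts in first-seen order."""
    seen = set()
    expanded = []
    for word in words:
        for part in re.split(r"[-/]", word):
            part = part.strip()
            # Skip empty fragments (e.g. from a trailing hyphen) and duplicates
            if part and part not in seen:
                seen.add(part)
                expanded.append(part)
    return expanded


sample = ["action-adventure", "role-playing", "k/d", "action", "i/o"]
expanded = expand_compound(sample)
```

Because 'action' already appears as part of 'action-adventure', it is kept only once, which is the kind of de-duplication that shrinks the word list after expansion.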

## 3.1 &nbsp;&nbsp; [Expand compound words](#home) <a id='expand_compound'></a>

In [None]:
# Expand compound words by splitting each compound word on hyphens or slashes.
df_expand = expand_compound(df_en_word["words"])
df_expand

## 3.2 &nbsp;&nbsp; [Phonemize expanded words](#home) <a id='phon_expand'></a>

Our observations:

1. After expanding compound words and removing duplicates, the number of English words found in the GPT corpus drops from 1833 to 1806.

In [None]:
# Phonemize English words detected by polyglot and append to DataFrame
df_expand["eng2ipa"] = df_expand["words"].parallel_map(eng_to_ipa.convert)
df_expand

## 3.3 &nbsp;&nbsp; [Extract unphonemized words](#home) <a id='extract_unphon'></a>

Our observations:

1. 433 unphonemized words are detected out of the 1806 English words used in the GPT corpus.
2. After removing unphonemized words that are found in `mapping_dict`, `custom_dict`, `translated_dict`, and `phon_dict`, a total of 318 words are left, most of which are Malay words.
3. 11 of the unphonemized words are English: 'tps', 'pokemon', 'modding', 'platformer', 'parkour', 'platformers', 'scrolling', 'scumming', 'tump', 'haah', and 'mvp'.

All 11 unphonemized words except 'haah' are added to `custom_dict`, while 'haah' is added to `phon_dict`.

##### Filter out unphonemized words

In [None]:
df_expand_unphon = extract_phon_unphon(df_expand, "unphon")
df_expand_unphon

##### Extract unique words not in `mapping_dict`, `custom_dict`, `translated_dict` and `phon_dict`

In [None]:
unphon_en = df_expand_unphon["words"].to_list()
unique_unphon_en = extract_unique(
    [unphon_en, mapping_dict.keys(), custom_dict.keys(), translated_dict.keys(), phon_dict.keys()]
)
unique_unphon_en
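
`extract_unique` is also defined outside this notebook. A minimal sketch of the filtering it appears to perform is shown below, assuming it keeps the items of the first collection that occur in none of the others. The sample data is a toy subset.

```python
def extract_unique(collections):
    """Return items of the first collection found in none of the others, keeping order."""
    first, *others = collections
    known = set()
    for other in others:
        known.update(other)
    seen = set()
    unique = []
    for item in first:
        # Drop items already covered by a dictionary, and repeated items
        if item not in known and item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


unphon = ["tps", "noob", "aktif", "xbox", "aktif"]
mapping = {"noob": "newbie"}
custom = {"xbox": "axe box"}
remaining = extract_unique([unphon, mapping.keys(), custom.keys()])
```

Here 'noob' and 'xbox' are dropped because a dictionary already covers them, leaving 'tps' and 'aktif' for manual review.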

## 3.4 &nbsp;&nbsp; [Update dictionaries](#home) <a id='update_dict'></a>

In [None]:
# Update 10 unphonemized English words to `custom_dict`
update_dict = {
    "tps": "tee pee ass",
    "pokemon": "poke ki mon",
    "modding": "mod ding",
    "platformer": "platform mer",
    "platformers": "platform mers",
    "parkour": "park call",
    "scrolling": "scroll ling",
    "scumming": "scum ming",
    "tump": "thump",
    "mvp": "em vee pee",
}

update_phon = {"haah": "ˈhɑɑ"}

# Update `custom_dict`
custom_dict.update(update_dict)

# Update `phon_dict`
phon_dict.update(update_phon)

print(phon_dict)
custom_dict

# 4. &nbsp;&nbsp; [Analyse Phonemized Words](#home) <a id='phon_analysis'></a>

Identify Malay words that are phonemized by eng_to_ipa. Steps taken:

1. Extract phonemized words.
2. Identify Malay words that are incorrectly phonemized by eng_to_ipa.

## 4.1 &nbsp;&nbsp; [Extract Phonemized Words](#home) <a id='extract_en_phon'></a>

In [None]:
# Filter out phonemized words
df_expand_phon = extract_phon_unphon(df_expand)
df_expand_phon

## 4.2 &nbsp;&nbsp; [Identify Wrong Malay Words](#home) <a id='wrong_malay'></a>

We apply phonemization to all words in the GPT corpus and identify Malay words that are incorrectly phonemized by eng_to_ipa.

Our observations:

1. Most English words are used in gaming terms. Hence, there is no need for further translation to Malay.
2. 76 Malay words were wrongly phonemized by eng_to_ipa.

In [None]:
# Load gpt corpus as DataFrame
df_gpt = pd.read_csv(cfg.paths.gpt_csv_path)

# Extract phonemized words from entire GPT corpus
phon_gpt = extract_non_english(df_gpt)
phon_gpt

##### Identify Malay words that are incorrectly phonemized by eng_to_ipa and update list to `gaming_gpt.yaml`

In [None]:
phon_malay = [
    "a",
    "ada",
    "adalah",
    "agar",
    "ajar",
    "alam",
    "aman",
    "antar",
    "baca",
    "bang",
    "beli",
    "berjaya",
    "bila",
    "bos",
    "cara",
    "cuba",
    "dah",
    "dan",
    "daya",
    "di",
    "dia",
    "fantastik",
    "gila",
    "hari",
    "ia",
    "ikon",
    "industri",
    "isu",
    "jaya",
    "je",
    "kan",
    "kadar",
    "kasi",
    "kat",
    "kaya",
    "ke",
    "kempen",
    "kira",
    "kita",
    "kot",
    "kredit",
    "la",
    "lain",
    "lama",
    "lupa",
    "mahal",
    "mana",
    "masa",
    "mata",
    "maya",
    "minda",
    "minit",
    "mula",
    "muzik",
    "naik",
    "ni",
    "okey",
    "performa",
    "peta",
    "pintar",
    "polis",
    "ronda",
    "saling",
    "sama",
    "sana",
    "servis",
    "siri",
    "tac",
    "tak",
    "tanya",
    "tim",
    "tu",
    "wang",
    "yang",
    "zaman",
    "zona",
]

# 5. &nbsp;&nbsp; [Generate gaming_gpt.yaml](#home) <a id='gen_gaming_yaml'></a>

We update our mapping dictionaries, i.e. `phon_dict`, `edge_dict`, `mapping_dict`, `custom_dict` and `translated_dict`, together with the `phon_malay` list, before saving them as `gaming_gpt.yaml`.

## 5.1 &nbsp;&nbsp; [Update custom_dict and translated_dict](#home) <a id='update_custom_translated'></a>

Actions to be taken:

<style>
    table {
        width: 70%;
        table-layout:fixed;
    }
    th:nth-child(1),
    td:nth-child(1) {
        width: 3%;
    }
    th:nth-child(2),
    td:nth-child(2) {
        width: 27%;
    }
    th:nth-child(3),
    td:nth-child(3) {
        width: 70%;
    }
</style>

| S/N | Action | Rationale |
| -- | -- | -- |
| 1. | Create word combinations for unphonemized keys in `custom_dict` | &bull; Update word combinations for '8K', 'geocaching', 'wasd', and 'xbox'. |
| 2. | Remove phonemized keys in `custom_dict` | &bull; `custom_dict` is developed to modify English words for Malaya VITS model inference. <br>&bull; Therefore, there is no need to keep English words that eng_to_ipa can already phonemize. |
| 3. | Create customized phonemes | &bull; Create customized phonemes for 'mods', 'sci fi', 'scoped', and 'whiffs', and remove these words from `custom_mapping`. |

##### Update `custom_dict`

In [None]:
custom_dict.update({"8K": "eight kay", "geocaching": "geo caching", "wasd": "w a s d", "xbox": "axe box"})
custom_dict

##### Remove phonemized `custom_dict` keys found in `custom_phon_keys` list

In [None]:
for key in custom_phon_keys:
    custom_dict.pop(key)

custom_dict

##### Update `phon_dict` and `custom_dict`

In [None]:
# Create customized phonemes
update_phon = {
    "mods": "ˈmɔdz",
    "sci fi": "ˌsaɪˈfaɪ",
    "scoped": "ˈskoʊpt",
    "whiffs": "ˈwɪfs",
}

# Update `phon_dict`
phon_dict.update(update_phon)

# Remove words with created phonemes from `custom_dict`
for key in update_phon.keys():
    # 'sci fi' has been removed earlier under custom_phon_keys
    if key != "sci fi":
        custom_dict.pop(key)

print(len(custom_dict))
custom_dict

##### Save `custom_dict`, `translated_dict` and `phon_dict`

In [None]:
save_json(custom_dict, cfg.paths.custom_dict_path)
save_json(translated_dict, cfg.paths.translated_dict_path)
save_json(phon_dict, cfg.paths.phon_dict_path)

## 5.2 &nbsp;&nbsp; [Combine All Mapping](#home) <a id='combine_all_mapping'></a>

Steps are as follows:

1. Combine `phon_dict`, `mapping_dict`, `edge_dict`, `custom_dict`, `translated_dict` and `phon_malay` into `gaming_gpt_dict`.
2. Convert `gaming_gpt_dict` to an OmegaConf dictionary, `gaming_gpt_mapping`.

##### Generate `gaming_gpt_mapping`

In [None]:
# Sort dictionaries
phon_dict = dict(sorted(phon_dict.items()))
mapping_dict = dict(sorted(mapping_dict.items()))
custom_dict = dict(sorted(custom_dict.items()))
translated_dict = dict(sorted(translated_dict.items()))
phon_malay = sorted(phon_malay)

gaming_gpt_dict = {
    "phon_mapping": phon_dict,
    "mapping": mapping_dict,
    "edge_mapping": edge_dict,
    "custom_mapping": custom_dict,
    "translated_mapping": translated_dict,
    "phon_malay": phon_malay,
}

gaming_gpt_dict

##### Convert to OmegaConf dictionary and save as `gaming_gpt.yaml`

In [None]:
gaming_gpt_mapping = OmegaConf.create(gaming_gpt_dict)
print(OmegaConf.to_yaml(gaming_gpt_mapping))

# Save as yaml file
OmegaConf.save(gaming_gpt_mapping, cfg.paths.gpt_yaml_path)

# 6. &nbsp;&nbsp; [Append Normalized Text for Phonemization](#home) <a id='append_norm_phon'></a>

We have updated `eng2ipa_mapping` with gaming terms and relevant English words from GPT corpus.

In [None]:
df_gpt["norm_phon"] = df_gpt["text"].parallel_map(lambda x: normalize_gpt_phon(x, cfg.paths.gpt_yaml_path))
df_gpt.to_csv(cfg.paths.gpt_csv_path, index=False)
df_gpt

In [None]:
for row in df_gpt.loc[:, ["norm_gpt", "norm_phon"]].itertuples(index=False, name=None):
    print(row)
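
`normalize_gpt_phon` is defined elsewhere and reads its mappings from `gaming_gpt.yaml`. As a rough illustration of the normalization idea only, the sketch below replaces whole-word occurrences of mapping keys in a single regex pass. `apply_mapping` is a hypothetical name, it takes a plain `dict` rather than a YAML path, and it is not the notebook's actual implementation.

```python
import re


def apply_mapping(text: str, mapping: dict) -> str:
    """Replace whole-word occurrences of each key (case-insensitive), longest keys first."""
    if not mapping:
        return text
    # Longest keys first so that multi-word keys win over their shorter parts
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(key) for key in keys) + r")(?!\w)",
        flags=re.IGNORECASE,
    )
    lowered = {key.lower(): value for key, value in mapping.items()}
    return pattern.sub(lambda match: lowered[match.group(1).lower()], text)


custom = {"xbox": "axe box", "mvp": "em vee pee", "wasd": "w a s d"}
normalized = apply_mapping("My Xbox MVP uses wasd, not mvps", custom)
```

The lookarounds `(?<!\w)` and `(?!\w)` play the same role as the start, end and whitespace alternatives in the earlier `pattern` cell, and they also handle keys followed by punctuation. Doing all substitutions in one pass avoids a replacement value being rewritten by a later key.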

# 7. &nbsp;&nbsp; [Generate GPT Phonemes](#home) <a id='gpt_phonemes'></a>

## 7.1 &nbsp;&nbsp; [Phonemize with eng_to_ipa](#home) <a id='phon_eng2ipa'></a>

In [None]:
df_gpt = phonemize_gpt(df_gpt, "eng2ipa")
df_gpt.to_csv(cfg.paths.gpt_csv_path, index=False)
df_gpt

In [None]:
for row in df_gpt.loc[:, ["text", "eng2ipa"]].itertuples(index=False, name=None):
    print(row)

## 7.2 &nbsp;&nbsp; [espeak_ms](#home) <a id='espeak_ms'></a>

We perform phonemization of GPT corpus via espeak-ms for the purpose of timing the process.

In [None]:
df_gpt = phonemize_gpt(df_gpt, "espeak_ms")
df_gpt.to_csv(cfg.paths.gpt_csv_path, index=False)
df_gpt

In [None]:
for row in df_gpt.loc[:, ["id", "norm_phon", "espeak_ms"]].itertuples(index=False, name=None):
    print(row)

## 7.3 &nbsp;&nbsp; [espeak_en_ms](#home) <a id='espeak_en_ms'></a>

We attempt to phonemize only English words via espeak-en and the remaining words (largely Malay) via espeak-ms. This involves iterating through each word in the transcript and using polyglot to determine whether it is an English word. Phonemizing English words via espeak-en should produce phonemes that more accurately capture the prosody and intonation of English speakers.

In [None]:
df_gpt = phonemize_gpt(df_gpt, "espeak_en_ms", phon_mapping=list(args.phon_mapping))
df_gpt.to_csv(cfg.paths.gpt_csv_path, index=False)
df_gpt

In [None]:
for row in df_gpt.loc[:, ["id", "norm_phon", "espeak_en_ms"]].itertuples(index=False, name=None):
    print(row)
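
The per-word routing described in this section can be sketched as below. The language detector and both phonemizers are passed in as callables, so the example runs without polyglot or espeak installed. `phonemize_mixed` and the lambda stand-ins are hypothetical and only tag each word with the backend that would handle it.

```python
from typing import Callable


def phonemize_mixed(
    text: str,
    is_english: Callable[[str], bool],
    phonemize_en: Callable[[str], str],
    phonemize_ms: Callable[[str], str],
) -> str:
    """Send each word to the English or Malay phonemizer based on `is_english`."""
    phonemes = []
    for word in text.split():
        backend = phonemize_en if is_english(word) else phonemize_ms
        phonemes.append(backend(word))
    return " ".join(phonemes)


# Stand-ins: a fixed word list instead of polyglot, tags instead of real phonemes
english_words = {"game", "boss"}
result = phonemize_mixed(
    "game ni ada boss",
    is_english=lambda word: word.lower() in english_words,
    phonemize_en=lambda word: f"<en:{word}>",
    phonemize_ms=lambda word: f"<ms:{word}>",
)
```

Phonemizing word by word loses any cross-word context a phonemizer might otherwise use, which is one trade-off of this approach.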

## 7.4 &nbsp;&nbsp; [eng2ipa_espeak_ms](#home) <a id='eng2ipa_espeak_ms'></a>

We use eng_to_ipa and `phon_mapping` to phonemize English words, and espeak-ms to phonemize Malay words.

In [None]:
df_gpt = phonemize_gpt(df_gpt, "eng2ipa_espeak_ms", phon_mapping=list(args.phon_mapping))
df_gpt.to_csv(cfg.paths.gpt_csv_path, index=False)
df_gpt

In [None]:
for row in df_gpt.loc[:, ["id", "norm_phon", "eng2ipa_espeak_ms"]].itertuples(index=False, name=None):
    print(row)
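
One possible lookup order for this variant is sketched below: customized phonemes first, then the English converter, then the Malay phonemizer as a fallback. This is an assumption, not the behaviour of `phonemize_gpt`, whose internals are not shown here. In particular, the notebook passes `list(args.phon_mapping)`, while the sketch assumes a `dict` of word to phonemes, and it falls back to Malay when the converter returns a `*`-marked word. `phonemize_word` and the stand-in callables are hypothetical.

```python
from typing import Callable, Dict


def phonemize_word(
    word: str,
    phon_mapping: Dict[str, str],
    convert_en: Callable[[str], str],
    phonemize_ms: Callable[[str], str],
) -> str:
    """Resolve one word: custom phonemes, then English converter, then Malay fallback."""
    key = word.lower()
    if key in phon_mapping:
        return phon_mapping[key]  # customized phonemes take priority
    ipa = convert_en(key)
    if not ipa.endswith("*"):     # '*' marks a word the converter does not know
        return ipa
    return phonemize_ms(key)


# Stand-ins for eng_to_ipa.convert and an espeak-ms backend
phon_mapping = {"mods": "ˈmɔdz"}
known_en = {"games": "geɪmz"}
convert_en = lambda word: known_en.get(word, word + "*")
phonemize_ms = lambda word: f"<ms:{word}>"

out = [
    phonemize_word(word, phon_mapping, convert_en, phonemize_ms)
    for word in ["mods", "Games", "tangan"]
]
```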

# 8. &nbsp;&nbsp; [Generate Text Manifest for GPT Phonemes](#home) <a id='gpt_text_manifest'></a>

Generate text manifest files under the `gpt` folder for the normalized GPT corpus (from the `norm_phon` column) and for the 3 phoneme versions (i.e. espeak_ms, espeak_en_ms, and eng2ipa_espeak_ms).

## 8.1 &nbsp;&nbsp; [Text Manifest (espeak_ms)](#home) <a id='espeak_ms_txt'></a>

In [None]:
gen_filelist(df_gpt, cfg.paths.gaming_dir, "gpt_ms", "espeak_ms")

## 8.2 &nbsp;&nbsp; [Text Manifest (espeak_en_ms)](#home) <a id='espeak_en_ms_txt'></a>

In [None]:
gen_filelist(df_gpt, cfg.paths.gaming_dir, "gpt_en_ms", "espeak_en_ms")

## 8.3 &nbsp;&nbsp; [Text Manifest (eng2ipa_espeak_ms)](#home) <a id='eng2ipa_espeak_ms_txt'></a>

In [None]:
gen_filelist(df_gpt, cfg.paths.gaming_dir, "gpt_eng2ipa_ms", "eng2ipa_espeak_ms")

## 8.4 &nbsp;&nbsp; [Text Manifest (norm_phon)](#home) <a id='norm_phon_txt'></a>

In [None]:
gen_filelist(df_gpt, cfg.paths.gaming_dir, "norm_phon", "norm_phon")
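
`gen_filelist` is defined outside this notebook, so the manifest layout is not visible here. The sketch below assumes the `wav_path|text` line format used by VITS filelists and a `<name>/<name>.txt` output path; both are assumptions. `write_filelist`, the `id`-based file names and the sample rows are hypothetical.

```python
import tempfile
from pathlib import Path


def write_filelist(rows, out_dir, name, column):
    """Write one `<id>.wav|<text>` line per row to `<out_dir>/<name>/<name>.txt`."""
    folder = Path(out_dir) / name
    folder.mkdir(parents=True, exist_ok=True)
    out_path = folder / f"{name}.txt"
    lines = [f"{row['id']}.wav|{row[column]}" for row in rows]
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path


# Placeholder rows; real rows would come from `df_gpt`
rows = [
    {"id": "gpt_0001", "espeak_ms": "fonem satu"},
    {"id": "gpt_0002", "espeak_ms": "fonem dua"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = write_filelist(rows, tmp, "gpt_ms", "espeak_ms")
    relative = path.relative_to(tmp).as_posix()
    content = path.read_text(encoding="utf-8")
```

Writing with an explicit `encoding="utf-8"` matters for phoneme manifests, since IPA symbols are non-ASCII.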

# 9. &nbsp;&nbsp; [Generate Test Filelist](#home) <a id='gen_test_manifest'></a>

We have trained VITS model checkpoints (minimum 100 epochs) using raw words (i.e. normalized gaming terms), phonemes generated by espeak-en, and phonemes generated by eng_to_ipa. We intend to generate text manifests based on the gaming list to test the effectiveness of our models.

## 9.1 &nbsp;&nbsp; [Text Manifest (gpt_char)](#home) <a id='gpt_char'></a>

In [None]:
df_combined = pd.read_csv(cfg.paths.combined_path)
df_combined["norm_term"] = df_combined["term"].parallel_map(lambda x: normalize_gpt_phon(x, cfg.paths.gpt_yaml_path))
df_combined

In [None]:
# Generate `gpt_char.txt` filelist under `gpt_char` folder in `gaming` folder
gen_filelist(df_combined, cfg.paths.gaming_dir, "gpt_char", "norm_term")

## 9.2 &nbsp;&nbsp; [Text Manifest (gpt_espeak_en)](#home) <a id='gen_espeak_en_filelist'></a>

In [None]:
df_combined = append_ipa(df_combined, "ipa_norm_en")
df_combined

In [None]:
# Generate `gpt_espeak_en.txt` filelist under `gpt_espeak_en` folder in `gaming` folder
gen_filelist(df_combined, cfg.paths.gaming_dir, "gpt_espeak_en", "ipa_norm_en")

## 9.3 &nbsp;&nbsp; [Text Manifest (gpt_eng2ipa)](#home) <a id='gpt_eng2ipa'></a>

In [None]:
df_combined = phonemize_gpt(df_combined, "eng2ipa_espeak_ms", "norm_term", args.phon_mapping)
df_combined

In [None]:
# Generate `gpt_eng2ipa.txt` filelist under `gpt_eng2ipa` folder in `gaming` folder
gen_filelist(df_combined, cfg.paths.gaming_dir, "gpt_eng2ipa", "eng2ipa_espeak_ms")