# Create single Language Model + Scorer for Voice-Chess

Coqui STT v1.3.0 + KenLM (latest) / 4-gram

This notebook does not require a GPU.

Default directory structure on Google Drive (pre-generated):
```
/voice-chess        # application
  /lm-raw-text      # raw command text generated
  /checkpoints      # checkpoints for languages (produced by previous trainings)
    /en             # each in own directory
    /tr
    ...
  /lm               # Language models and scorers generated
  /models           # Acoustic models collected (not used here, but needed for inference in application)
```

**INPUT:**

1. Put your localized raw chess commands in the /voice-chess/lm-raw-text directory as a <languagecode>.txt file (e.g. en.txt).
2. The alphabet for the model you will use should be in the /voice-chess/checkpoints/<languagecode> directory (e.g. /voice-chess/checkpoints/en/*).
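
Before running the notebook, it can help to confirm that both inputs are in place. The following is a minimal sketch that assumes the directory layout above. `check_inputs` is a hypothetical helper and not part of Coqui STT; the `alphabet.txt` file name is taken from the `generate_scorer_package` help text.

```python
from pathlib import Path


def check_inputs(drive_path, language_code):
    """Return a list of missing input files (empty if everything is in place)."""
    root = Path(drive_path)
    required = [
        # Raw chess commands for this language
        root / "lm-raw-text" / f"{language_code}.txt",
        # Alphabet of the acoustic model (file name is an assumption, see lead-in)
        root / "checkpoints" / language_code / "alphabet.txt",
    ]
    return [str(path) for path in required if not path.is_file()]
```

For example, `check_inputs("/content/drive/MyDrive/voice-chess", "tr")` returns an empty list once both files exist on the mounted drive.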

**OUTPUT:**

Three result files are written to the /voice-chess/lm directory. Existing files will be overwritten.

- `<languagecode>.scorer`: the scorer package; this is the file the application uses
- `<languagecode>-vocab.txt`: the vocabulary, containing all tokens
- `<languagecode>.binary`: the KenLM language model

## Specify Language Code

In [1]:
LANGUAGECODE = "tr"

In [2]:
# Other Constants, adapt if needed
COQUI="1.3.0"                                     # Coqui STT version used
DRIVEPATH="/content/drive/MyDrive/voice-chess"    # Where you keep your work on Google Drive
LOCALPATH="/content/data/lm"                      # A local working directory in Colab
TEXTDIR="lm-raw-text"                             # Subdirectory names
LMDIR="lm"
CHECKPOINSTDIR="checkpoints"

## Mount Google Drive

In [3]:
# Switch back to v1 - See: https://colab.research.google.com/notebooks/tensorflow_version.ipynb#scrollTo=NeWVBhf1VxlH
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [4]:
# mount your private google drive
from google.colab import drive
import shutil
drive.mount('/content/drive')

Mounted at /content/drive


## Basic Setup

In [5]:
# Install Coqui STT 
!git clone --depth 1 --branch v{COQUI} https://github.com/coqui-ai/STT.git
!cd STT; pip install -U pip wheel setuptools; pip install .

Cloning into 'STT'...
remote: Enumerating objects: 2202, done.
remote: Counting objects: 100% (2202/2202), done.
remote: Compressing objects: 100% (1400/1400), done.
remote: Total 2202 (delta 822), reused 1829 (delta 712), pack-reused 0
Receiving objects: 100% (2202/2202), 12.99 MiB | 31.97 MiB/s, done.
Resolving deltas: 100% (822/822), done.
Note: checking out '148fa74387a2082555dabd243193c6ca8cb19016'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-22.1.2-py3-none-any.whl (2.1 MB)

In [6]:
# TensorFlow GPU (optional)
# Only needed to run evaluation, which also requires a voice and text corpus, so it is left out here.
#!pip install tensorflow-gpu==1.15.4

In [7]:
# Get KenLM
!git clone https://github.com/kpu/kenlm.git && cd kenlm && mkdir build && cd build/ && cmake .. && make -j 4

Cloning into 'kenlm'...
remote: Enumerating objects: 14102, done.[K
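
Because the build output is long, a failed build is easy to miss. The sketch below checks that the KenLM tools exist and are executable in the build directory before the language model is generated. `missing_kenlm_tools` is a hypothetical helper. The tool list is an assumption: the `generate_lm.py` help text names `lmplz` and `filter`, and `build_binary` is the KenLM tool that produces binary models.

```python
import os

# Tools expected in the --kenlm_bins directory (assumption, see lead-in)
KENLM_TOOLS = ("lmplz", "filter", "build_binary")


def missing_kenlm_tools(bin_dir):
    """Return the names of KenLM tools that are absent or not executable in bin_dir."""
    return [
        tool
        for tool in KENLM_TOOLS
        if not os.access(os.path.join(bin_dir, tool), os.X_OK)
    ]
```

With the paths used in this notebook, `missing_kenlm_tools("/content/kenlm/build/bin/")` should return an empty list after a successful build.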

In [8]:
# Get Native Client for Scorer (Colab image is currently Ubuntu x64)
%cd /content/STT/data/lm
!wget https://github.com/coqui-ai/STT/releases/download/v{COQUI}/native_client.tflite.Linux.tar.xz
!tar -xJvf native_client.tflite.Linux.tar.xz
# fix for https://github.com/coqui-ai/STT/pull/2029/files
!cp /content/STT/data/lm/libkenlm.so /usr/lib/libkenlm.so
!ls -al

/content/STT/data/lm
--2022-07-06 09:41:19--  https://github.com/coqui-ai/STT/releases/download/v1.3.0/native_client.tflite.Linux.tar.xz
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/344354127/b635c5e9-a618-47a6-a952-c6427d245062?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220706%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220706T094119Z&X-Amz-Expires=300&X-Amz-Signature=fcc4441b54732e36e1b8ab7cfaec167642494632924beb282da5a91f58965efc&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=344354127&response-content-disposition=attachment%3B%20filename%3Dnative_client.tflite.Linux.tar.xz&response-content-type=application%2Foctet-stream [following]
--2022-07-06 09:41:19--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/34

In [9]:
# Check TensorFlow version and GPU availability
import tensorflow as tf
print([tf.__version__, tf.test.is_gpu_available()])

['1.15.4', False]


In [10]:
# Get more detailed CPU/GPU info
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 5534542053953408386, name: "/device:XLA_CPU:0"
 device_type: "XLA_CPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 7641230085753301768
 physical_device_desc: "device: XLA_CPU device"]

## Directory Structure

In [11]:
# Create the local working directory (the corpus itself is read directly from Drive later)
!mkdir -p {LOCALPATH}
!ls -al {LOCALPATH}

total 8
drwxr-xr-x 2 root root 4096 Jul  6 09:41 .
drwxr-xr-x 3 root root 4096 Jul  6 09:41 ..


## Generate Language Model

In [12]:
# See your options
!python3 generate_lm.py --help

usage: generate_lm.py [-h] --input_txt INPUT_TXT --output_dir OUTPUT_DIR
                      --top_k TOP_K --kenlm_bins KENLM_BINS --arpa_order
                      ARPA_ORDER --max_arpa_memory MAX_ARPA_MEMORY
                      --arpa_prune ARPA_PRUNE --binary_a_bits BINARY_A_BITS
                      --binary_q_bits BINARY_Q_BITS --binary_type BINARY_TYPE
                      [--discount_fallback]

Generate lm.binary and top-k vocab for Coqui STT.

optional arguments:
  -h, --help            show this help message and exit
  --input_txt INPUT_TXT
                        Path to a file.txt or file.txt.gz with sample
                        sentences
  --output_dir OUTPUT_DIR
                        Directory path for the output
  --top_k TOP_K         Use top_k most frequent words for the vocab.txt file.
                        These will be used to filter the ARPA file.
  --kenlm_bins KENLM_BINS
                        File path to the KENLM binaries lmplz, filter and
       

In [15]:
# Generate
!python3 ./generate_lm.py \
  --input_txt {DRIVEPATH}/{TEXTDIR}/{LANGUAGECODE}.txt \
  --output_dir {LOCALPATH}/ \
  --top_k 1000 \
  --discount_fallback \
  --kenlm_bins /content/kenlm/build/bin/ \
  --arpa_order 4 \
  --arpa_prune "0" \
  --max_arpa_memory "85%" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie



Converting to lowercase and counting word occurrences ...
| | #                                              | 15979 Elapsed Time: 0:00:00

Saving top 1000 words ...

Calculating word statistics ...
  Your text file has 53992 words in total
  It has 387 unique words
  Your top-1000 words are 100.0000 percent of all words
  Your most common word "vezir" occurred 4882 times
  The least common word in your top-k is "şahmat" with 1 times
  The first word with 2 occurrences is "vezire" at place 361

Creating ARPA file ...
=== 1/5 Counting and sorting n-grams ===
Reading /content/data/lm/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 2309988352 bytes == 0x56477eaf2000 @  0x7fe2abe121e7 0x56477c6bf912 0x56477c65a62e 0x56477c63941b 0x56477c625176 0x7fe2a9fabc87 0x56477c626cda
tcmalloc: large alloc 9239928832 bytes == 0x5648085ec000 @  0x7fe2abe121e7 0x56477c6bf912 0x56477c6ae93a 0x56477c6af378 0x56477c639
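
The word statistics printed at the start of this output can be approximated in a few lines. This is a sketch of the idea only (lowercase, count, keep the `top_k` most frequent words), not the actual code of `generate_lm.py`; `word_statistics` is a hypothetical name.

```python
from collections import Counter


def word_statistics(lines, top_k):
    """Lowercase the text, count words, and summarise the top_k most frequent ones."""
    counter = Counter()
    for line in lines:
        counter.update(line.lower().split())
    total = sum(counter.values())
    top = counter.most_common(top_k)
    covered = sum(count for _, count in top)
    return {
        "total_words": total,
        "unique_words": len(counter),
        "top_k_coverage_percent": 100.0 * covered / total if total else 0.0,
        "most_common": top[0] if top else None,
    }
```

This also shows why the coverage above is 100 percent: the corpus has only 387 unique words, so a `top_k` of 1000 keeps all of them.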

## Generate Scorer

In [16]:
# See your options
!./generate_scorer_package --help

Options:
  --help                        show help message
  --checkpoint arg              Path to a checkpoint directory corresponding to
                                the model this scorer will be used with. The 
                                alphabet will be loaded from an alphabet.txt 
                                file in the checkpoint directory. Words with 
                                characters not in the alphabet will not be 
                                included in the vocabulary. Optional if using 
                                bytes output mode.
  --lm arg                      Path of KenLM binary LM file. Must be built 
                                without including the vocabulary (use the -v 
                                flag). See generate_lm.py for how to create a 
                                binary LM.
  --vocab arg                   Path of vocabulary file. Must contain words 
                                separated by whitespace.
  --packag

In [17]:
# Generate the scorer; the default alpha and beta values below are somewhat arbitrary
# API Change (2022-02): --alphabet => --checkpoint
!./generate_scorer_package \
  --checkpoint {DRIVEPATH}/{CHECKPOINSTDIR}/{LANGUAGECODE} \
  --lm {LOCALPATH}/lm.binary \
  --vocab {LOCALPATH}/vocab-1000.txt \
  --package {LOCALPATH}/kenlm.scorer \
  --default_alpha 0.931289039105002 \
  --default_beta 1.1834137581510284

387 unique words read from vocabulary file.
Doesn't look like a character based (Bytes Are All You Need) model.
--force_bytes_output_mode was not specified, using value infered from vocabulary contents: false
Package created in /content/data/lm/kenlm.scorer.
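
To give a rough intuition for `--default_alpha` and `--default_beta`: in the common formulation of LM-assisted CTC decoding, alpha weights the language model score and beta rewards each decoded word. The sketch below shows that formulation under this assumption. It is not the Coqui STT decoder, which applies these terms incrementally during beam search, and `combined_score` is a hypothetical name.

```python
def combined_score(acoustic_log_prob, lm_log_prob, word_count, alpha, beta):
    """Combine acoustic and language model scores for one candidate transcript.

    alpha scales the language model log-probability;
    beta is a per-word bonus that counteracts the LM's bias toward short outputs.
    """
    return acoustic_log_prob + alpha * lm_log_prob + beta * word_count
```

A larger alpha makes the decoder trust the chess-command language model more; a larger beta favours transcripts with more words.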


## Save Intermediate Results

In [18]:
!ls -al {LOCALPATH}

total 360
drwxr-xr-x 2 root root   4096 Jul  6 09:59 .
drwxr-xr-x 3 root root   4096 Jul  6 09:41 ..
-rw-r--r-- 1 root root 180496 Jul  6 09:59 kenlm.scorer
-rw-r--r-- 1 root root 167761 Jul  6 09:59 lm.binary
-rw-r--r-- 1 root root   4622 Jul  6 09:59 vocab-1000.txt


In [19]:
# Copy to drive while renaming
!cp {LOCALPATH}/lm.binary {DRIVEPATH}/{LMDIR}/{LANGUAGECODE}.binary
!cp {LOCALPATH}/vocab-1000.txt {DRIVEPATH}/{LMDIR}/{LANGUAGECODE}-vocab.txt
!cp {LOCALPATH}/kenlm.scorer {DRIVEPATH}/{LMDIR}/{LANGUAGECODE}.scorer
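
The same copy-and-rename step can be written in plain Python with `shutil`, which is easier to reuse when several languages are processed in a loop. This is a minimal sketch; `export_results` is a hypothetical helper, and the file names mirror the `cp` commands in this cell.

```python
import shutil
from pathlib import Path


def export_results(local_path, target_dir, language_code):
    """Copy the three generated files to target_dir, renamed by language code."""
    renames = {
        "lm.binary": f"{language_code}.binary",
        "vocab-1000.txt": f"{language_code}-vocab.txt",
        "kenlm.scorer": f"{language_code}.scorer",
    }
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for source_name, target_name in renames.items():
        destination = target / target_name
        # copy2 overwrites an existing destination file
        shutil.copy2(Path(local_path) / source_name, destination)
        copied.append(str(destination))
    return copied
```

In this notebook the call would be `export_results(LOCALPATH, f"{DRIVEPATH}/{LMDIR}", LANGUAGECODE)`.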

In [20]:
!ls -al {DRIVEPATH}/{LMDIR}/{LANGUAGECODE}*.*

-rw------- 1 root root 167761 Jul  6 09:59 /content/drive/MyDrive/voice-chess/lm/tr.binary
-rw------- 1 root root 180496 Jul  6 09:59 /content/drive/MyDrive/voice-chess/lm/tr.scorer
-rw------- 1 root root   4622 Jul  6 09:59 /content/drive/MyDrive/voice-chess/lm/tr-vocab.txt


In [21]:
# Flush pending writes to Google Drive and unmount
drive.flush_and_unmount()