# Create single Language Model + Scorer for Voice-Chess

Coqui v1.3.0 + KenLM latest / 5-gram

This notebook does NOT require GPU.

Default directory structure on Google Drive ([lc]: Language Code):
```
/voice-chess        # application
  /lm-raw-text      # raw command text 
    [lc].txt        # INPUT <= generated by python generator
  /checkpoints      # alphabet for languages (produced by previous trainings)
    /[lc]           # each in own directory
      alphabet.txt  # OUTPUT => Generated by this notebook - alphabet of the language
    /[lc]
    ...
  /lm               # Language models and scorers generated
    [lc]-vocab.txt  # OUTPUT => generated vocabulary file by this notebook
    [lc].binary     # OUTPUT => generated Language Model file by this notebook
    [lc].scorer     # OUTPUT => generated KenLM scorer file by this notebook => Should be used on voice-chess server
  /models           # Acoustic models collected (not used here, but needed for inference in application)
    [lc].tflite     # Pre-trained acoustic model for the language
```

**INPUT:**

1. Put your localized raw chess commands in the /voice-chess/lm-raw-text directory as a `[lc].txt` file (e.g. en.txt).

**OUTPUT:**

1. Your alphabet will be (re-)generated under the /voice-chess/checkpoints/[lc] directory (e.g. /voice-chess/checkpoints/en/alphabet.txt).

2. Three result files are written to the /voice-chess/lm directory. Existing files will be overwritten. Of these, only the `[lc].scorer` file is used in the application. The other two:
   - `[lc]-vocab.txt` lists all tokens (the vocabulary)
   - `[lc].binary` is your language model
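
The input and output locations described above can be summarized as a small path helper. This is a minimal sketch and not part of the notebook itself: `voice_chess_paths` is a hypothetical name, and the default root is assumed to be the Google Drive path used in the constants of this notebook.

```python
from pathlib import PurePosixPath


def voice_chess_paths(lc, root="/content/drive/MyDrive/voice-chess"):
    """Return the input and output paths used for language code `lc`."""
    base = PurePosixPath(root)
    return {
        "raw_text": base / "lm-raw-text" / f"{lc}.txt",          # INPUT
        "alphabet": base / "checkpoints" / lc / "alphabet.txt",  # OUTPUT
        "vocab": base / "lm" / f"{lc}-vocab.txt",                # OUTPUT
        "binary": base / "lm" / f"{lc}.binary",                  # OUTPUT
        "scorer": base / "lm" / f"{lc}.scorer",                  # OUTPUT, used by the server
    }


for name, path in voice_chess_paths("en").items():
    print(f"{name:9s} {path}")
```

Printing the paths before running the notebook is a quick way to confirm that the raw text file is where the later cells expect it.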

<H2>SPECIFY LANGUAGE CODE</H2>

In [None]:
LANGUAGECODE = "en"

In [None]:
# Other Constants, adapt if needed
COQUI="1.3.0"                                     # Coqui STT version used
DRIVEPATH="/content/drive/MyDrive/voice-chess"    # Where you keep your work on Google Drive
LOCALPATH="/content/data/lm"                      # A local working directory in Colab
TEXTDIR="lm-raw-text"                             # Subdirectory holding the raw command text (input)
# Directories on drive
CHECKPOINSTDIR="checkpoints"
LMDIR="lm"
LMRAWTEXTDIR="lm-raw-text"
MODELSDIR="models"

## Mount Google Drive

In [None]:
# Switch back to v1 - See: https://colab.research.google.com/notebooks/tensorflow_version.ipynb#scrollTo=NeWVBhf1VxlH
%tensorflow_version 1.x

In [None]:
# mount your private google drive
from google.colab import drive
import shutil
drive.mount('/content/drive')

## Basic Setup

In [None]:
# Install Coqui STT 
!git clone --depth 1 --branch v{COQUI} https://github.com/coqui-ai/STT.git
!cd STT && pip install -U pip wheel setuptools && pip install .

In [None]:
# Tensorflow GPU
# Needed if you want to run evaluate to test / voice & text corpus needed for that, so we leave it out
#!pip install tensorflow-gpu==1.15.4

In [None]:
# Get KenLM
!git clone https://github.com/kpu/kenlm.git && cd kenlm && mkdir build && cd build/ && cmake .. && make -j 4

In [None]:
# Get Native Client for Scorer (Colab image is currently Ubuntu x64)
%cd /content/STT/data/lm
!wget https://github.com/coqui-ai/STT/releases/download/v{COQUI}/native_client.tflite.Linux.tar.xz
!tar -xJvf native_client.tflite.Linux.tar.xz
# fix for https://github.com/coqui-ai/STT/pull/2029/files
!cp /content/STT/data/lm/libkenlm.so /usr/lib/libkenlm.so
!ls -al

In [None]:
# Check TensorFlow version and GPU availability
import tensorflow as tf
print([tf.__version__, tf.test.is_gpu_available()])

In [None]:
# Get more detailed CPU/GPU info
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

## Directory Structure

In [None]:
# Local
!mkdir -p {LOCALPATH}
!ls -al {LOCALPATH}
# Drive
!mkdir -p {DRIVEPATH}/{CHECKPOINSTDIR}/{LANGUAGECODE}
!mkdir -p {DRIVEPATH}/{LMDIR}
!mkdir -p {DRIVEPATH}/{LMRAWTEXTDIR}
!mkdir -p {DRIVEPATH}/{MODELSDIR}
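
The same directory setup can also be done in plain Python, which avoids relying on shell variable expansion. The following is a minimal sketch: `ensure_dirs` is a hypothetical helper, and the subdirectory names are assumed to match the constants defined earlier in this notebook.

```python
import os
import tempfile


def ensure_dirs(drive_path, language_code,
                subdirs=("lm", "lm-raw-text", "models")):
    """Create the per-language checkpoint directory and the shared subdirectories."""
    created = [os.path.join(drive_path, "checkpoints", language_code)]
    created += [os.path.join(drive_path, name) for name in subdirs]
    for path in created:
        os.makedirs(path, exist_ok=True)  # same effect as `mkdir -p`
    return created


# Demonstration against a throwaway directory instead of Google Drive.
with tempfile.TemporaryDirectory() as tmp:
    for path in ensure_dirs(tmp, "en"):
        print(os.path.relpath(path, tmp), os.path.isdir(path))
```

Because `exist_ok=True` is set, re-running the cell on an existing Drive layout does not raise an error.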

## Get Alphabet

In [None]:
# Common Voice Utils & covo
!pip uninstall commonvoice-utils -y
!pip install git+https://github.com/ftyers/commonvoice-utils.git

In [None]:
# See covo command line arguments
!covo help

In [None]:
# (Re-)create Alphabet
!covo alphabet {LANGUAGECODE} > {DRIVEPATH}/{CHECKPOINSTDIR}/{LANGUAGECODE}/alphabet.txt
!cat {DRIVEPATH}/{CHECKPOINSTDIR}/{LANGUAGECODE}/alphabet.txt
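
Since the scorer step later reads this alphabet, it can be useful to check that every character in the raw command text is covered by it. The sketch below is an optional sanity check and not part of the original notebook. It assumes the usual alphabet file layout (one label per line, lines starting with `#` are comments, `\#` stands for a literal `#`); `load_alphabet` and `uncovered_characters` are hypothetical helper names.

```python
def load_alphabet(lines):
    """Parse alphabet lines: one label per line, '#' starts a comment, '\\#' is a literal '#'."""
    labels = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("\\#"):
            labels.append("#")
        elif line.startswith("#") or line == "":
            continue  # skip comments and empty lines
        else:
            labels.append(line)
    return set(labels)


def uncovered_characters(text, alphabet):
    """Return the characters in `text` that have no label in `alphabet`."""
    return sorted({ch for ch in text if ch != "\n" and ch not in alphabet})


# Toy alphabet: a space plus a few letters (digits are deliberately missing).
alphabet = load_alphabet(["# comment\n", " \n", "a\n", "e\n", "k\n", "n\n",
                          "t\n", "i\n", "g\n", "h\n"])
print(uncovered_characters("knight takes e4\n", alphabet))
```

An empty result means the raw text uses only characters that appear in the alphabet. Anything listed would need to be rewritten in the text (for example, spelling out digits) or added to the alphabet.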

## Generate Language Model

In [None]:
# See your options
!python3 generate_lm.py --help

In [None]:
# Generate
!python3 ./generate_lm.py \
  --input_txt {DRIVEPATH}/{TEXTDIR}/{LANGUAGECODE}.txt \
  --output_dir {LOCALPATH}/ \
  --top_k 500000 \
  --discount_fallback \
  --kenlm_bins /content/kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie
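
The `--top_k 500000` option limits the vocabulary to the most frequent words, which is why the vocabulary file produced above is named `vocab-500000.txt`. The idea can be sketched as follows. This is a simplified illustration that assumes lowercasing and whitespace tokenization; it is not the actual code of `generate_lm.py`, and `top_k_vocabulary` is a hypothetical name.

```python
from collections import Counter


def top_k_vocabulary(lines, top_k):
    """Count lowercased, whitespace-separated words and keep the `top_k` most frequent."""
    counter = Counter()
    for line in lines:
        counter.update(line.lower().split())
    return [word for word, _ in counter.most_common(top_k)]


commands = ["Knight takes e4", "Bishop takes e4", "knight to f3"]
print(top_k_vocabulary(commands, top_k=3))
```

For a small, generated chess command corpus the number of distinct words is far below 500000, so in practice the limit keeps every word.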

## Generate Scorer

In [None]:
# See your options
!./generate_scorer_package --help

In [None]:
# Generate scorer; default_alpha (language model weight) and default_beta (word insertion bonus) are somewhat arbitrary starting values
# API Change (2022-02): --alphabet => --checkpoint
!./generate_scorer_package \
  --checkpoint {DRIVEPATH}/{CHECKPOINSTDIR}/{LANGUAGECODE} \
  --lm {LOCALPATH}/lm.binary \
  --vocab {LOCALPATH}/vocab-500000.txt \
  --package {LOCALPATH}/kenlm.scorer \
  --default_alpha 0.931289039105002 \
  --default_beta 1.1834137581510284
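
To give an intuition for the two defaults stored in the scorer: in CTC beam search decoding with an external scorer, alpha is commonly described as the weight of the language model and beta as a per-word bonus. The sketch below is a conceptual illustration only. The real decoder applies these terms inside the beam search, the function name `combined_score` is hypothetical, and the log-probabilities are made-up numbers.

```python
def combined_score(acoustic_logprob, lm_logprob, word_count, alpha, beta):
    """Combine acoustic and language-model evidence for one candidate transcript."""
    return acoustic_logprob + alpha * lm_logprob + beta * word_count


ALPHA = 0.931289039105002
BETA = 1.1834137581510284

# Hypothetical log-probabilities for two candidate transcripts of the same audio.
candidates = {
    "knight takes e4": combined_score(-4.0, -3.0, 3, ALPHA, BETA),
    "night takes e for": combined_score(-3.8, -9.0, 4, ALPHA, BETA),
}
print(max(candidates, key=candidates.get))
```

A larger alpha makes the decoder trust the chess command language model more, while a larger beta favors transcripts with more words.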

## Save Results

In [None]:
!ls -al {LOCALPATH}

In [None]:
# Copy to drive while renaming
!cp {LOCALPATH}/lm.binary {DRIVEPATH}/{LMDIR}/{LANGUAGECODE}.binary
!cp {LOCALPATH}/vocab-500000.txt {DRIVEPATH}/{LMDIR}/{LANGUAGECODE}-vocab.txt
!cp {LOCALPATH}/kenlm.scorer {DRIVEPATH}/{LMDIR}/{LANGUAGECODE}.scorer
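
The copy-and-rename step can also be written with `shutil`, which makes the mapping from generic file names to language-specific names explicit. This is a minimal sketch under the assumption that the three files were produced with the names used above; `save_results` is a hypothetical helper.

```python
import os
import shutil
import tempfile


def save_results(local_path, lm_dir, language_code, top_k=500000):
    """Copy the three generated files into `lm_dir`, renamed by language code."""
    renames = {
        "lm.binary": f"{language_code}.binary",
        f"vocab-{top_k}.txt": f"{language_code}-vocab.txt",
        "kenlm.scorer": f"{language_code}.scorer",
    }
    os.makedirs(lm_dir, exist_ok=True)
    for source, target in renames.items():
        # shutil.copy overwrites an existing target, like `cp`.
        shutil.copy(os.path.join(local_path, source), os.path.join(lm_dir, target))
    return sorted(renames.values())


# Demonstration with empty placeholder files in temporary directories.
with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as remote:
    for name in ("lm.binary", "vocab-500000.txt", "kenlm.scorer"):
        open(os.path.join(local, name), "w").close()
    print(save_results(local, remote, "en"))
```

If one of the source files is missing, `shutil.copy` raises `FileNotFoundError`, which surfaces a failed generation step earlier than a silent `cp` error in a notebook cell.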

In [None]:
!ls -al {DRIVEPATH}/{LMDIR}/{LANGUAGECODE}*.*

In [None]:
# Flush disk to Google Drive
drive.flush_and_unmount()