<a href="https://colab.research.google.com/github/LUMII-AILab/NLP_Course/blob/main/notebooks/OpenGrm.ipynb" target="_new"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

# OpenGrm toolkit

## NGram library

A command-line toolkit for making and manipulating n-gram language models encoded as weighted finite-state transducers (FSTs): https://www.openfst.org/twiki/bin/view/GRM/NGramLibrary

The library provides operations for counting n-grams and for smoothing, pruning, applying, and evaluating models.

In [None]:
# Install the (Mini)Conda package manager in the Colab environment.
# See https://docs.conda.io for more detail.

# Download the installation script and make it executable
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh

# Install Miniconda (-b: batch mode, no prompts; -f: no error if the prefix already exists; -p: install prefix)
!./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local

In [None]:
# Install NGram via Conda (-c: channel, -y: non-interactive)
!conda install -c conda-forge ngram -y

In [None]:
# Get a sample text corpus: a normalized copy of Oscar Wilde's "The Importance of Being Earnest".
!wget https://www.openfst.org/twiki/pub/GRM/NGramQuickTour/earnest.txt

In [None]:
# Generate a symbol table from the corpus
!ngramsymbols earnest.txt earnest.syms

# Using the symbol table, compile each line of the corpus into an FST
# and store the results in a single FST archive (FAR)
!farcompilestrings --fst_type=compact --symbols=earnest.syms --keep_symbols earnest.txt earnest.far
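
A symbol table maps every distinct token to an integer label, so that each corpus line can be stored as a sequence of arc labels. The Python sketch below illustrates the idea only; it is not the `ngramsymbols` implementation. The function name `build_symbol_table` and the toy corpus are made up for this example, and the placement of `<unk>` at the end of the table is an assumption.

```python
def build_symbol_table(lines):
    """Map each distinct whitespace-separated token to an integer id."""
    # Label 0 is reserved for epsilon (the empty string) in OpenFst.
    table = {"<epsilon>": 0}
    for line in lines:
        for token in line.split():
            if token not in table:
                table[token] = len(table)
    # Assumption: one extra symbol for out-of-vocabulary words, added last.
    table.setdefault("<unk>", len(table))
    return table


# Hypothetical two-line corpus, one sentence per line.
corpus = ["A HAND BAG", "A BAG"]
symbols = build_symbol_table(corpus)

# Each sentence becomes a sequence of integer labels.
encoded = [[symbols[token] for token in line.split()] for line in corpus]
print(symbols)
print(encoded)
```

Comparing the printed table with the contents of `earnest.syms` shows how closely the real file follows this layout.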

In [None]:
# Sanity check: print the strings stored in the FAR back out as text
!farprintstrings earnest.far > earnest_2.txt

In [None]:
# Extract raw (non-normalized) n-gram counts up to order 2 (unigrams and bigrams) from the compiled corpus
!ngramcount --order=2 earnest.far earnest.cnts

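# Create a normalized and smoothed n-gram language model
# (Katz smoothing is the default; use --method to select another)
!ngrammake earnest.cnts earnest.mod

# Print summary information about the model
!ngraminfo earnest.mod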
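
To make the difference between counts and a normalized model concrete, here is a minimal Python sketch of bigram counting followed by maximum-likelihood normalization. It is an illustration under simplifying assumptions, not what the toolkit does internally: the boundary markers `<s>` and `</s>`, the function names, and the toy sentences are all invented for the example, and no smoothing is applied, so unseen bigrams get probability zero.

```python
from collections import Counter


def bigram_counts(sentences, bos="<s>", eos="</s>"):
    """Count bigrams and their histories, with sentence-boundary markers."""
    history_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = [bos] + sentence.split() + [eos]
        # Every token except the last one is the history of some bigram.
        history_counts.update(tokens[:-1])
        pair_counts.update(zip(tokens, tokens[1:]))
    return history_counts, pair_counts


def mle_prob(history_counts, pair_counts, prev, word):
    """Unsmoothed estimate P(word | prev) = count(prev, word) / count(prev)."""
    if history_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, word)] / history_counts[prev]


# Hypothetical toy corpus.
sentences = ["A HAND BAG", "A BAG"]
history_counts, pair_counts = bigram_counts(sentences)

p_bag_given_a = mle_prob(history_counts, pair_counts, "A", "BAG")
p_unseen = mle_prob(history_counts, pair_counts, "BAG", "HAND")
print(p_bag_given_a, p_unseen)
```

The zero probability for the unseen bigram is the problem that smoothing addresses: a smoothed model reserves some probability mass for events that never occurred in the training data.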

In [None]:
# Test the model: sample random text from it
!ngramrandgen earnest.mod | farprintstrings

In [None]:
# Other commands in the toolkit (not run here):
# ngrammerge - for model interpolation
# ngramshrink - for model pruning
# ngramperplexity - for model evaluation
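
Perplexity is the usual evaluation measure for a language model: the exponential of the average negative log-probability the model assigns to each token of held-out text. The sketch below computes it from a list of per-token probabilities. The function name and the sample numbers are invented for illustration, and the exact conventions used by `ngramperplexity` (for example, how end-of-sentence and out-of-vocabulary tokens are counted) are not reproduced here.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) over N token probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-avg_log_prob)


# A model that assigns probability 1/4 to every token is exactly as
# uncertain as a uniform choice among 4 options.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])

# Hypothetical probabilities with the same average log-probability.
mixed_ppl = perplexity([0.5, 0.5, 0.125, 0.125])
print(uniform_ppl, mixed_ppl)
```

Lower perplexity means the model finds the test text less surprising; a single zero-probability token would make the value undefined, which is another reason models are smoothed.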

In [None]:
# Serialize the model in the standard ARPA format
!ngramprint --ARPA earnest.mod earnest.ARPA
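
An ARPA file is plain text: a `\data\` header with n-gram totals, then one section per order in which each line holds a base-10 log-probability, the n-gram itself, and an optional base-10 backoff weight. The sketch below parses such text in Python. The `SAMPLE` string is a hand-written illustration, not output of `ngramprint`, and `parse_arpa` is a hypothetical helper that assumes well-formed input.

```python
# Hand-written illustrative ARPA text (not produced by ngramprint).
SAMPLE = r"""\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.4771 </s>
-99 <s> -0.3010
-0.4771 A -0.3010

\2-grams:
-0.3010 <s> A
-0.3010 A </s>

\end\
"""


def parse_arpa(text):
    """Return {order: {ngram_tuple: (log10_prob, log10_backoff_or_None)}}."""
    model, order = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, the header, the footer, and the count summary.
        if not line or line in ("\\data\\", "\\end\\") or line.startswith("ngram "):
            continue
        # Section marker such as "\2-grams:" sets the current order.
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            model[order] = {}
            continue
        if order is None:
            continue  # ignore anything before the first section
        fields = line.split()
        log_prob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # A trailing extra field, if present, is the backoff weight.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        model[order][words] = (log_prob, backoff)
    return model


model = parse_arpa(SAMPLE)
print({order: len(entries) for order, entries in model.items()})
```

Reading the exported file with `open("earnest.ARPA").read()` and passing the text to `parse_arpa` should work the same way, provided the file follows this layout.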