# Language Identification with a Byte Language Model

1. Copy this Colab notebook into your Google Drive by clicking **Copy to Drive**. Check the [slides](https://docs.google.com/presentation/d/1MnY4LpI8oIodwVuvoSs-H-zmpmdN5iFu3NBvZ4Egsx8/edit?usp=sharing).
2. You need background knowledge of [Python](https://www.python.org/) and [NumPy](https://numpy.org/).
3. Run the cells yourself and tweak the code so that the byte-wise language model, i.e., `ByteLM`, works as expected.
4. Identify the language of the test file `languages/unk.test` by using the code for the byte-wise language model, i.e., `ByteLM`.
5. Save this Colab notebook as a **pdf** via **Print** in the file menu and submit it to https://edu-portal.naist.jp/ under **NLP #3** of **2025 NAIST 4102 NLP** using the report submission portal. Please make sure that **all the code, execution results and your answers are visible** in the **pdf** for the assessment. If you violate the format requirement, **your score will be zero**. Check the [slides](https://docs.google.com/presentation/d/1MnY4LpI8oIodwVuvoSs-H-zmpmdN5iFu3NBvZ4Egsx8/edit?usp=sharing).
6. The due date is **December 19th, 2025 JST**.

For help regarding [Colab](https://colab.research.google.com/) or any technical issues, ask our TA, Ashmari Pramodya Pussewala Kankanange via <pussewala.ashmari.ow4@naist.ac.jp>.




In [None]:
#@markdown Please fill in your name, student id and email address.

NAME = 'Your Name' #@param {type: 'string'}
STUDENT_ID = 'student-id-number' #@param {type: 'string'}
EMAIL = 'your-account-name@naist.ac.jp' #@param {type: 'string'}

#@markdown ---

## Instructions

We will give *70 points* for fixing a bug in the `ByteLM` class so that its `perplexity` method returns correct perplexity values, i.e., finite values rather than, e.g., `Inf`.
In addition, *30 points* will be credited when identifying the language of a test file, `languages/unk.test`. See each section for details.

* You can use any external libraries so long as you don't break the APIs documented in the comments of the `ByteLM` class. They are marked by "DO NOT CHANGE" etc.

* When changing `ByteLM`, leave comments as a justification of how the bug was resolved in the corresponding code block.

* When identifying the language of the test file, add your code as a justification in the corresponding code block with comments.

* You need to run the code blocks, keep the results in the notebook and explain the results in text blocks; otherwise, it is impossible to make an assessment.

* Make sure to **submit your file in pdf** and not in other formats, e.g., `.ipynb`.

### Extras

Those who try "unique methods" will be given at most *10 points*. Uniqueness is determined by whether a submission employs a smoothing method that other students have not tried. If you tune any hyperparameters, please leave your experiments in this Colab, e.g., your code and results. A unique method for hyperparameter tuning will also count toward the extra points.
However, at most *10 points* will be deducted for violating the rules, e.g., changing parts of the code/APIs that should not be modified.

## Download datasets

We will download the datasets to train and test your language models. The `languages` directory has two subdirectories, `dev` and `devtest`. Each file name takes the form of `{ISO639-language-code}.{dev,devtest}`, and you can find the list of ISO 639 language codes to map a three-letter code into its corresponding language name at [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes).
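The naming scheme above can be handled with `pathlib`. The following is a minimal sketch; the helper names `language_code` and `list_language_files` are hypothetical and are not part of the provided notebook code.

```python
from pathlib import Path
from typing import Dict


def language_code(path: str) -> str:
  """Returns the file name without its `.dev` or `.devtest` suffix."""
  return Path(path).stem


def list_language_files(directory: str, suffix: str) -> Dict[str, str]:
  """Maps each language code to its file path for files ending in `suffix`."""
  return {p.stem: str(p) for p in sorted(Path(directory).glob(f"*{suffix}"))}
```

For example, `language_code("languages/dev/zho_simpl.dev")` gives `"zho_simpl"`, and `list_language_files("languages/dev", ".dev")` could be used to enumerate the training files once the archive has been unzipped.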


In [None]:
# Download the file to `/content` directory.
!gdown 12CDdzmMuEInj0bqhMjlGLsONxZm4Tv6_


# Unzip it.
!unzip -o languages.zip

## Import libraries

Add necessary imports here if you want to use additional libraries.

In [None]:
# Import libraries used in this colab. Add more when necessary.
import collections
from typing import Any, Dict, List, Tuple

from google.colab import files
import numpy as np

## Language model implementation

`ByteLM` is a language model class that loads training data, estimates parameters and tests the model on a file to compute byte-wise perplexity.
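To make the parameter estimation step more concrete, the sketch below repeats the counting loop used in `ByteLM.__init__` on a single toy line, `b"ab"`. The toy input and the choice of `order = 2` are assumptions made only to keep the result small.

```python
import collections

import numpy as np

BOS = 0  # Same special symbol as `ByteLM.BOS`.
order = 2  # Toy order, chosen only for illustration.

# Each context (a tuple of bytes) maps to counts of the byte that follows it.
ngram_counts = collections.defaultdict(lambda: np.zeros([256]))
buffer = [BOS] + list(b"ab")  # A list of integers: [0, 97, 98].
for n in range(1, order + 1):
  for i in range(len(buffer) - n + 1):
    ngram = buffer[i:i + n]
    ngram_counts[tuple(ngram[:-1])][ngram[-1]] += 1

contexts = sorted(ngram_counts)  # [(), (0,), (97,)]
```

The empty context `()` holds the unigram counts, which include the BOS symbol itself, while `(0,)` and `(97,)` hold the counts of the bytes that follow BOS and `a`, respectively.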

In [None]:
class ByteLM:
  """Byte language model.

  This is a very naive language model, in which byte-wise n-gram probabilities
  are estimated by maximum likelihood without handling unseen, i.e.,
  out-of-vocabulary, n-grams.

  You may want to tweak the `__init__`, `initial_state` and `logprob` methods
  to alleviate the problem. However, do not change `perplexity`, which is used
  to check whether the model is implemented correctly. When changing parts of
  the code, please keep it readable by using appropriate variable names or
  adding comments. Feel free to add additional methods if necessary.

  Usages:
    ```python
    lm = ByteLM('path/to/train/data')
    perplexity, prob = lm.perplexity('path/to/test/data')
    ```
  """

  # DO NOT CHANGE BOS VALUE.
  # 0 will never appear in a text, so it is used as a special
  # beginning-of-sentence symbol, i.e., BOS.
  BOS: int = 0

  def __init__(self, filename: str, order: int=3) -> None:
    """Initializes `ByteLM`.

    You can change the arguments for this method if necessary, e.g., adding
    hyperparameters to this model.

    Args:
      filename: str, text file to train this language model.
      order: int, the n-gram order that should be greater than 1.
    """
    if order <= 1:
      raise ValueError(f'`order` must be greater than 1: {order}')
    self.order = order

    # Collect n-gram counts. Each key of the dictionary is a tuple of
    # integers, i.e., an (n-1)-gram context, and its value is a 256-dimensional
    # vector, i.e., counts for the following byte.
    ngram_counts = collections.defaultdict(lambda: np.zeros([256]))
    with open(filename, 'br') as f:
      for line in f:  # read as a byte string.
        buffer = [self.BOS] + list(line)  # `buffer` is now a list of integers.
        for n in range(1, self.order + 1):
          for i in range(len(buffer) - n + 1):
            ngram = buffer[i:i + n]
            ngram_counts[tuple(ngram[:-1])][ngram[-1]] += 1

    # Maximum likelihood estimate for language model.
    self.ngrams: Dict[Tuple[int, ...], np.ndarray] = {}
    for context, counts in ngram_counts.items():
      probs = counts / np.sum(counts)
      # Computes log probabilities, but assigns -inf for zero probabilities.
      log_probs = np.where(probs == 0.0, -np.inf, np.log(probs))
      self.ngrams[context] = log_probs

  def initial_state(self) -> Any:
    """Returns an initial state for language model computation.

    You can change the code in this method, but keep the API, e.g., input
    arguments, so that the `perplexity()` method works as expected.

    Returns:
      A state representation for log probabilities computation.
    """
    return []

  def logprob(self, state: Any, x: int) -> Tuple[np.ndarray, Any]:
    """Returns log probabilities for the current input byte.

    You can change the code in this method, but keep the API, e.g., input
    arguments, so that the `perplexity()` method works as expected.

    It is a naive method for backing off to lower order n-grams, and may not be
    optimal for achieving a low perplexity.

    Args:
      state: A state to compute log probability.
      x: int, the current byte to compute `p(y | state, x)`.
    Returns:
      A pair of (log_probs, next_state) where `log_probs` is `np.ndarray` of log
      probabilities p(y | state, x) of all bytes y, and `next_state` is a new
      state for the next log probability computation with a new input. Note that
      `log_probs[y]` is equal to `log p(y | state, x)`,
      `log_probs.shape == (256,)`, `np.exp(log_probs) >= 0` and
      `np.sum(np.exp(log_probs)) == 1`.
    """
    # Backoff to lower order when necessary.
    state = (state + [x])[-self.order + 1:]
    for i in range(len(state), 0, -1):
      context = state[-i:]
      assert len(context) < self.order
      ret = self.ngrams.get(tuple(context), None)
      if ret is not None:
        return ret, context

    # Backoff to unigram.
    ret = self.ngrams.get((), None)
    assert ret is not None
    return ret, []

  def perplexity(self, filename: str) -> Tuple[float, float]:
    """Computes perplexity for text data.

    DO NOT CHANGE THE API OR CODE IN THIS METHOD.

    Args:
      filename: str, text file to compute perplexity.
    Returns:
      A pair (perplexity, prob) where `perplexity` is the perplexity computed
      for `filename`. `prob` is the cumulative product, over all the bytes in
      `filename`, of the sum of each predicted distribution, which verifies
      whether this language model is probabilistic. `prob` should be close to
      1; otherwise, this is not a language model.
    """
    # Cumulative log_prob for perplexity computation.
    cumulative_log_prob = 0.0
    # Verify the distribution so that this language model is probabilistic.
    prob = 1.0
    # Total number of bytes.
    total_bytes = 0
    with open(filename, 'br') as f:
      for line in f:
        state = self.initial_state()
        prev_x = self.BOS
        for x in line:
          log_probs, state = self.logprob(state, prev_x)
          assert log_probs.size == 256, f"expected 256, got: {log_probs.size}"
          cumulative_log_prob += log_probs[x]

          probs = np.exp(log_probs)
          assert (probs >= 0).all(), "expected greater than or equal to zero."
          prob *= np.sum(probs)  # Sum of `probs` should be close to 1.

          prev_x = x

        total_bytes += len(line)

    return np.exp(-cumulative_log_prob / total_bytes), prob

## Test your code (70 points in total)

Please run the following code block to report the perplexity of English data using the byte language models trained on English, Japanese and two variants of Chinese. Note that you need to modify the `ByteLM` class to avoid errors, e.g., reporting `Inf` or non-probabilistic modeling. In addition, add comments in the `ByteLM` code block explaining how the change resolves the bug (60 points).
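For reference, the value returned by the `perplexity` method can be written as follows. The symbols are notation introduced here, not names from the code: $N$ is the total number of bytes in the test file, $x_i$ is the $i$-th byte and $c_i$ is the context that the model conditions on when predicting $x_i$.

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid c_i)\right)
$$

A single byte with $p(x_i \mid c_i) = 0$ contributes $\log 0 = -\infty$ to the sum, which is enough to make the perplexity `Inf`.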

You will observe different perplexities using language models trained on different languages. Please explain the reason in the *"Why perplexities are different?"* section (10 points).
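To illustrate why an unsmoothed model can report `Inf`, the toy sketch below compares a maximum-likelihood estimate with additive smoothing on a four-symbol alphabet. The counts, the test sequence and the pseudo-count `alpha` are made-up values for illustration, and additive smoothing is only one of many possible remedies, not necessarily the intended one for `ByteLM`.

```python
import numpy as np

# Toy counts over a 4-symbol alphabet; symbol 3 was never observed.
counts = np.array([5.0, 3.0, 2.0, 0.0])

# Maximum-likelihood estimate: the unseen symbol gets probability zero.
mle_probs = counts / counts.sum()
with np.errstate(divide="ignore"):
  mle_log_probs = np.log(mle_probs)  # log(0) evaluates to -inf.

# Additive smoothing with a hypothetical pseudo-count `alpha`.
alpha = 0.5
smoothed_probs = (counts + alpha) / (counts.sum() + alpha * counts.size)
smoothed_log_probs = np.log(smoothed_probs)

# A toy test sequence that contains the unseen symbol.
test_sequence = [0, 1, 3]
mle_ppl = np.exp(-mle_log_probs[test_sequence].mean())
smoothed_ppl = np.exp(-smoothed_log_probs[test_sequence].mean())
print(f"MLE: {mle_ppl} smoothed: {smoothed_ppl}")
```

The smoothed distribution still sums to one, so it remains a valid probability distribution while assigning a non-zero probability to the unseen symbol.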

In [None]:

# Train language models for English, Japanese and two variants of Chinese. You
# can change the arguments to `ByteLM`, e.g., additional arguments for better
# hyperparameters.
model_eng = ByteLM("languages/dev/eng.dev")
model_jpn = ByteLM("languages/dev/jpn.dev")
model_zho_simpl = ByteLM("languages/dev/zho_simpl.dev")
model_zho_trad = ByteLM("languages/dev/zho_trad.dev")

# DO NOT CHANGE THE FOLLOWING CODE.

# Test on English test data.
perp_eng, prob_eng = model_eng.perplexity("languages/devtest/eng.devtest")
perp_jpn, prob_jpn = model_jpn.perplexity("languages/devtest/eng.devtest")
perp_zho_simpl, prob_zho_simpl = model_zho_simpl.perplexity("languages/devtest/eng.devtest")
perp_zho_trad, prob_zho_trad = model_zho_trad.perplexity("languages/devtest/eng.devtest")

# Print out perplexity and the cumulative product of sum of probabilities for
# each language model.
print(f"English model: perplexity: {perp_eng} prob: {prob_eng}")
print(f"Japanese model: perplexity: {perp_jpn} prob: {prob_jpn}")
print(f"Simplified Chinese model: perplexity: {perp_zho_simpl} prob: {prob_zho_simpl}")
print(f"Traditional Chinese model: perplexity: {perp_zho_trad} prob: {prob_zho_trad}")

# Assertions to make sure the perplexities are finite.
assert np.isfinite(perp_eng)
assert np.isfinite(perp_jpn)
assert np.isfinite(perp_zho_simpl)
assert np.isfinite(perp_zho_trad)

# Assertions to make sure the cumulative products of probabilities are close
# to one.
assert np.allclose(prob_eng, 1.0)
assert np.allclose(prob_jpn, 1.0)
assert np.allclose(prob_zho_simpl, 1.0)
assert np.allclose(prob_zho_trad, 1.0)


### Why perplexities are different? (10 points)

Please add your explanation here.

## Identify the language of `languages/unk.test` (30 points in total)

Use the `ByteLM` class to identify the language of the file `languages/unk.test`. You can train language models for several languages located under `languages/dev/*.dev` using `ByteLM`. Note that each file name takes the form of `ISO639-language-code.dev` so that you can train a small byte language model for each language. Use the models to identify the language of `languages/unk.test` by running the `perplexity` method. You can find the list of ISO 639 language codes to map a three-letter code into its corresponding language name at [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes).

Please add your code in the following block and run it as a justification to identify the language. Then, fill in the answer in the form, *"Please fill in your answer."* (20 points) and explain how you predict the language in the *"How you identify the language?"* section (10 points).
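One common way to turn per-language perplexities into a prediction is to pick the language whose model gives the lowest value. The sketch below shows only that selection step; the function name `most_likely_language` and the numbers in `example` are hypothetical and do not come from the dataset.

```python
from typing import Dict


def most_likely_language(perplexities: Dict[str, float]) -> str:
  """Returns the language code whose model gives the lowest perplexity."""
  return min(perplexities, key=perplexities.get)


# Hypothetical perplexity values, for illustration only.
example = {"eng": 4.2, "jpn": 17.5, "zho_simpl": 21.0}
print(most_likely_language(example))
```

With `ByteLM`, the dictionary could be filled by calling `perplexity("languages/unk.test")` on a model trained for each `languages/dev/*.dev` file and keeping the first element of the returned pair.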

In [None]:
# Add your code here to identify the language of `languages/unk.test` and run it
# as a justification.





In [None]:
#@markdown ### Please fill in your answer. (20 points)
#@markdown You can find the list of ISO 639 language codes for mapping the
#@markdown three-letter code into the language name at
#@markdown [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes).
#@markdown You also need to write your code in the code block above and keep the
#@markdown results of running the code.

LANGUAGE = 'English' #@param {type: 'string'}

#@markdown ---

### How you identify the language? (10 points)

Please add your explanation here.

## Additional codes and logs for experiments

Please add any experiments you have carried out. For example, you can run experiments to find better hyperparameters to train a byte language model. Or, you can investigate several methods to identify the language.

Feel free to add additional code blocks if necessary.
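A hyperparameter experiment can be as simple as evaluating a list of candidate values and recording the score of each. The following is a minimal sketch under that assumption; `sweep` and `best` are hypothetical helper names, and the scoring function is left to the caller.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep(candidates: Iterable[int],
          evaluate: Callable[[int], float]) -> Dict[int, float]:
  """Evaluates every candidate value and returns {candidate: score}."""
  return {candidate: evaluate(candidate) for candidate in candidates}


def best(scores: Dict[int, float]) -> Tuple[int, float]:
  """Returns the (candidate, score) pair with the lowest score."""
  return min(scores.items(), key=lambda item: item[1])
```

For instance, `evaluate` could train `ByteLM("languages/dev/eng.dev", order=n)` for a candidate order `n` and return the first element of `perplexity("languages/devtest/eng.devtest")`, so that `best` reports the order with the lowest perplexity.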


In [None]:
# Add any code to run your experiments, e.g., testing for hyperparameters.
# You can use other data under `languages/devtest`, e.g., `jpn.devtest`, to
# test your code.

