<a href="https://colab.research.google.com/github/SoraWatabe/SocialDeveloper/blob/master/181371_Sora_Watabe_NLP3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Byte Language Model

1. Copy this Colab notebook into your Google Drive.
2. You need background knowledge of [Python](https://www.python.org/) and [NumPy](https://numpy.org/).
3. Run the cells yourself and tweak the code so that the byte-wise language model works as expected. Note that the current implementation does not perform any smoothing and thus yields `inf` perplexity.
4. Save this Colab notebook as a PDF via **Print** in the file menu and submit it to https://edu-portal.naist.jp/ under **Lecture #3** of **2023 NAIST 4102 NLP** using the report submission portal. Please make sure that all the code and the execution results are visible for the assessment.
5. Due date is **December 22nd, 2023**.

For help regarding [Colab](https://colab.research.google.com/) or any technical issues, ask our TA, Yusuke Ide <ide.yusuke.jp6@is.naist.jp>.




In [None]:
#@markdown Please fill in your name, student ID and email address.

name = 'Sora Watabe' #@param {type: 'string'}
student_id = '181371' #@param {type: 'string'}
email = 'watabe.sora.wm8@is.naist.jp' #@param {type: 'string'}

#@markdown ---

## Instructions

We will give 70 points for fixing a bug in the following code so that it returns correct perplexity values, i.e., finite values rather than, e.g., `Inf`. Additional points, up to a maximum of 30, are given based on the ranking of the perplexity among submissions, where lower is better.

* We will rank by the sum of three byte-wise perplexities, not word-wise perplexities, measured on three languages, Czech, German and Chinese.
* The scores are linearly computed among ranks. E.g., 30 is given to the submission with the lowest perplexity, 0 to those with the highest perplexity, and 15 to those in the middle of the ranks.
* Ties are grouped together as a bin, and their scores are computed by taking the average in the bins. For example, if a bin has 3 submissions with their linearly assigned scores of 13, 14 and 15, the average, i.e., $(13 + 14 + 15) / 3 = 14$, will be credited to those three.
* You can use any external libraries so long as you don't break APIs as documented/commented in the code.
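
The ranking rule above can be sketched in code. This is only an illustration under assumptions: the actual grading script is not part of this notebook, the helper name `ranking_points` is hypothetical, the linear spacing is assumed to run evenly from 30 down to 0, and ties are assumed to mean exactly equal perplexity values.

```python
import numpy as np

def ranking_points(perplexities, max_points=30.0):
    """Hypothetical helper: linear points by rank, ties share the average."""
    perps = np.asarray(perplexities, dtype=float)
    n = len(perps)
    if n == 1:
        return np.array([max_points])
    # Sort ascending so that the lowest perplexity comes first.
    order = np.argsort(perps, kind="stable")
    # Evenly spaced points: max_points for the best, 0 for the worst.
    linear = np.linspace(max_points, 0.0, n)
    points = np.empty(n)
    points[order] = linear
    # Average the points inside each bin of tied perplexities.
    for value in np.unique(perps):
        mask = perps == value
        points[mask] = points[mask].mean()
    return points

print(ranking_points([5.0, 3.0, 4.0]))  # three distinct perplexities
print(ranking_points([3.0, 3.0, 4.0]))  # the two best submissions are tied
```

In the second call, the two tied submissions would each receive the average of the points assigned to the first and second ranks.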

### Penalties

The byte-wise perplexity will be penalized by adding a penalty term when the cumulative product of the sums of probabilities is not close to 1, using the following formula:

$penalty = \max(\operatorname{abs}(1 - \prod_{t=1}^{T} \sum_{y=0}^{255} p(y | x_{<t})) - 0.001, 0) \times 100$

where $x_t$ is the byte at position $t$ in a file $\boldsymbol{x}$ and $T$ is the number of bytes in the file. Note that your language model must be probabilistic, in that the probabilities of all 256 byte values must sum to one given any history.
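
As a small worked example of the formula, the sketch below computes the penalty from a list of per-position probability sums. The function name `penalty` and its parameter names are hypothetical; only the constants 0.001 and 100 come from the formula above.

```python
import numpy as np

def penalty(sum_of_probs_per_step, tolerance=0.001, scale=100.0):
    """Penalty computed from the product of per-position probability sums."""
    # Product over positions t of sum_y p(y | x_<t).
    product = np.prod(np.asarray(sum_of_probs_per_step, dtype=float))
    return max(abs(1.0 - product) - tolerance, 0.0) * scale

# A properly normalized model: every distribution sums to 1, so no penalty.
print(penalty([1.0, 1.0, 1.0]))
# An unnormalized model whose distribution sums to 0.99 at each of 3 positions.
print(penalty([0.99, 0.99, 0.99]))
```

The second call gives a penalty of about 2.87, because $0.99^3 \approx 0.9703$ deviates from 1 by more than the tolerance of 0.001. Since the product runs over every byte of the test file, even a tiny systematic deviation per position can add up.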

### Extras

Those who try "unique methods" will be given at most 10 points in addition to the base and ranking points. Uniqueness is determined by whether a submission employs a smoothing method that other students have not tried. However, up to 10 points will be deducted for violating the rules, e.g., changing parts of the code or APIs that should not be modified.

## Download datasets

We will download the datasets to train and test your language models. The datasets are extracted from the [WMT 2023 Shared Task](https://www2.statmt.org/wmt23/translation-task.html).


In [None]:
# Download the file to "/content" directory.
!gdown 1D4ZUslrY_TLRi_DSW7DtGFHm7Fl4YNrJ

# Unzip it.
!unzip -o byte-language-model-2023.zip

# Create smaller training datasets comprising 10000 sentences for development
# purposes.
! head -10000 byte-language-model-2023/news-commentary-v18.cs.txt > news-commentary-v18-small.cs.txt
! head -10000 byte-language-model-2023/news-commentary-v18.de.txt > news-commentary-v18-small.de.txt
! head -10000 byte-language-model-2023/news-commentary-v18.zh.txt > news-commentary-v18-small.zh.txt

Downloading...
From: https://drive.google.com/uc?id=1D4ZUslrY_TLRi_DSW7DtGFHm7Fl4YNrJ
To: /content/byte-language-model-2023.zip
100% 68.7M/68.7M [00:02<00:00, 28.9MB/s]
Archive:  byte-language-model-2023.zip
  inflating: byte-language-model-2023/news-commentary-v18.cs.txt  
  inflating: byte-language-model-2023/news-commentary-v18.de.txt  
  inflating: byte-language-model-2023/news-commentary-v18.zh.txt  
  inflating: byte-language-model-2023/wmttest2022.cs.txt  
  inflating: byte-language-model-2023/wmttest2022.de.txt  
  inflating: byte-language-model-2023/wmttest2022.zh.txt  
  inflating: byte-language-model-2023/wmttest2023.cs.txt  
  inflating: byte-language-model-2023/wmttest2023.de.txt  
  inflating: byte-language-model-2023/wmttest2023.zh.txt  


## Verify the extracted files

We will compute the MD5 checksums to make sure that you are using the correct files. You should see the following output:

```bash
27f491a6c461d2476cc1a2636ca335a5  byte-language-model-2023/news-commentary-v18.cs.txt
c5cf2b0f973eccb12e45a56039d44f99  byte-language-model-2023/news-commentary-v18.de.txt
73393083e83b20fa09d94730ae336ddd  byte-language-model-2023/news-commentary-v18.zh.txt
49fac4990f87cbac47cd0811e0b37f34  byte-language-model-2023/wmttest2022.cs.txt
4fb8d232846ed6f66662163ff1e319e3  byte-language-model-2023/wmttest2022.de.txt
4da911905616d01cce3a06b5354b4146  byte-language-model-2023/wmttest2022.zh.txt
dea2aeda559fe2c6237a95a57c48b589  byte-language-model-2023/wmttest2023.cs.txt
e9ef45facac4b236d22b2525bf6e3391  byte-language-model-2023/wmttest2023.de.txt
d544edaac936c509f958a3f1309124dd  byte-language-model-2023/wmttest2023.zh.txt
```

Note that we will use `news-commentary-v18.{cs,de,zh}.txt` for training and `wmttest2023.{cs,de,zh}.txt` for the final testing. `wmttest2022.{cs,de,zh}.txt` will be used for your development purposes, e.g., debugging or fine tuning hyperparameters.
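
If `md5sum` is not available, e.g., when running outside Colab, the same check can be done in Python with the standard `hashlib` module. This is a minimal sketch; the helper name `md5_of_file` is hypothetical.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Returns the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until an empty bytes object signals EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5_of_file("byte-language-model-2023/wmttest2023.cs.txt")` should match the corresponding value listed above if the file was downloaded and extracted correctly.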

In [None]:
# Runs md5sum on the unzipped files.
# Please make sure that the hash codes are the same.

!md5sum byte-language-model-2023/*

27f491a6c461d2476cc1a2636ca335a5  byte-language-model-2023/news-commentary-v18.cs.txt
c5cf2b0f973eccb12e45a56039d44f99  byte-language-model-2023/news-commentary-v18.de.txt
73393083e83b20fa09d94730ae336ddd  byte-language-model-2023/news-commentary-v18.zh.txt
49fac4990f87cbac47cd0811e0b37f34  byte-language-model-2023/wmttest2022.cs.txt
4fb8d232846ed6f66662163ff1e319e3  byte-language-model-2023/wmttest2022.de.txt
4da911905616d01cce3a06b5354b4146  byte-language-model-2023/wmttest2022.zh.txt
dea2aeda559fe2c6237a95a57c48b589  byte-language-model-2023/wmttest2023.cs.txt
e9ef45facac4b236d22b2525bf6e3391  byte-language-model-2023/wmttest2023.de.txt
d544edaac936c509f958a3f1309124dd  byte-language-model-2023/wmttest2023.zh.txt


## Import libraries

Add the necessary imports here if you want to use additional libraries.

In [None]:
# Import libraries used in this colab.
import collections
from typing import Any, Dict, List, Tuple

from google.colab import files
import numpy as np

## Language model implementation

`ByteLM` is a language model class that loads training data, estimates parameters, and tests the model on a file to compute byte-wise perplexity.
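
A maximum-likelihood estimate assigns probability zero, and hence log probability `-inf`, to any byte that never followed a given context in the training data. A single such byte in the test file is enough to make the perplexity `inf`. One common remedy is additive (add-k) smoothing, sketched below on a single count vector. This is an illustration, not the intended solution: the function name `smoothed_log_probs` and the value of `k` are hypothetical, and other smoothing methods (e.g., interpolation between n-gram orders) may give lower perplexity.

```python
import numpy as np

def smoothed_log_probs(counts, k=0.1):
    """Additive (add-k) smoothing over the 256 byte values."""
    counts = np.asarray(counts, dtype=float)
    # Every byte receives a pseudo-count k, so no probability is zero.
    probs = (counts + k) / (np.sum(counts) + k * counts.size)
    return np.log(probs)

# Toy counts for one context: 'a' was seen 3 times and 'b' once.
counts = np.zeros(256)
counts[ord("a")] = 3
counts[ord("b")] = 1

log_probs = smoothed_log_probs(counts)
# Checks that every byte has a finite log probability.
print(np.isfinite(log_probs).all())
# Checks that the distribution still sums to one.
print(np.isclose(np.exp(log_probs).sum(), 1.0))
```

Because the pseudo-counts are included in the denominator, the smoothed distribution stays normalized, which matters for the penalty term described above.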

In [None]:
class ByteLM:
  """Byte language model.

  This is a very naive language model, in which byte-wise n-gram probabilities
  are estimated by maximum likelihood without handling unseen
  (out-of-vocabulary) events.

  You may want to tweak the `__init__`, `initial_state` and `logprob` methods to
  alleviate the problem. However, do not change `perplexity`, for a fair
  comparison with the models implemented by other students. When changing parts
  of the code, please try to keep it readable by using appropriate variable
  names or adding comments. Feel free to add additional methods, if necessary.

  Usages:
    ```python
    lm = ByteLM(path/to/train/data)
    perplexity, prob = lm.perplexity(path/to/test/data)
    ```
  """

  # DO NOT CHANGE BOS VALUE.
  # 0 never appears in a text, so it is used as a special
  # beginning-of-sentence symbol, i.e., BOS.
  BOS: int = 0

  def __init__(self, filename: str, order: int=3) -> None:
    """Initializes `ByteLM`.

    You can change the arguments for this method if necessary, e.g., adding
    hyperparameters to this model.

    Args:
      filename: str, text file to train this language model.
      order: int, the n-gram order that should be greater than 1.
    """
    if order <= 1:
      raise ValueError(f'`order` must be greater than 1: {order}')
    self.order = order

    # You can revise the code in this method to fix the bug and to achieve
    # lower perplexity.

    # Collect n-gram counts.
    ngram_counts = collections.defaultdict(lambda: np.zeros([256]))
    with open(filename, 'br') as f:
      for line in f:  # read as a byte string.
        buffer = [self.BOS] + list(line)  # `buffer` is now a list of integers.
        for n in range(1, self.order + 1):
          for i in range(len(buffer) - n + 1):
            ngram = buffer[i:i + n]
            ngram_counts[tuple(ngram[:-1])][ngram[-1]] += 1

    # Maximum likelihood estimate for language model.
    self.ngrams: Dict[Tuple[int], np.ndarray] = {}
    for context, counts in ngram_counts.items():
      # Maximum likelihood estimate w/o smoothing.
      probs = counts / np.sum(counts)
      # Computes log probabilities, but assigns -inf for zero probabilities.
      log_probs = np.where(probs == 0.0, -np.inf, np.log(probs))
      self.ngrams[context] = log_probs

  def initial_state(self) -> Any:
    """Returns an initial state for language model computation.

    You can change the code in this method, but keep the API, e.g., input
    arguments, so that the `perplexity()` method works as expected.

    Returns:
      A state representation for log probabilities computation.
    """
    # You can revise the code here for lower perplexity.
    return []

  def logprob(self, state: Any, x: int) -> Tuple[np.ndarray, Any]:
    """Returns log probabilities for the current input byte.

    You can change the code in this method, but keep the API, e.g., input
    arguments, so that the `perplexity()` method works as expected.

    Args:
      state: A state to compute log probability.
      x: int, the current byte to compute `p(y | state, x)`.
    Returns:
      A pair of (log_probs, next_state) where `log_probs` is `np.ndarray` of log
      probabilities p(y | state, x) of all bytes y, and `next_state` is a new
      state for the next log probability computation with a new input. Note that
      `log_probs[y]` is equal to `log p(y | state, x)`,
      `log_probs.shape == (256,)`, `np.exp(log_probs) >= 0` and
      `np.sum(np.exp(log_probs)) == 1`.
    """
    # You may want to revise the code in this method to achieve lower
    # perplexity.

    # Backoff to lower order when necessary.
    state = (state + [x])[-self.order + 1:]
    for i in range(len(state), 0, -1):
      context = state[-i:]
      assert len(context) < self.order
      ret = self.ngrams.get(tuple(context), None)
      if ret is not None:
        return ret, context

    # Backoff to unigram.
    ret = self.ngrams.get((), None)
    assert ret is not None
    return ret, []

  def perplexity(self, filename: str) -> Tuple[float, float]:
    """Computes perplexity for text data.

    DO NOT CHANGE THE API OR CODE IN THIS METHOD.

    Args:
      filename: str, text file to compute perplexity.
    Returns:
      A pair (perplexity, prob) where `perplexity` is the perplexity computed
      for `filename`. `prob` is the cumulative product of probabilities of all
      the bytes in `filename` to verify that this language model is
      probabilistic or not. `prob` should be close to 1, otherwise, this is not
      a language model.
    """
    # Cumulative log_prob for perplexity computation.
    cumulative_log_prob = 0.0
    # Verify the distribution so that this language model is probabilistic.
    prob = 1.0
    # Total number of bytes.
    total_bytes = 0
    with open(filename, 'br') as f:
      for line in f:
        state = self.initial_state()
        prev_x = self.BOS
        for x in line:
          log_probs, state = self.logprob(state, prev_x)
          assert log_probs.size == 256, f"expected 256, got: {log_probs.size}"
          cumulative_log_prob += log_probs[x]

          probs = np.exp(log_probs)
          assert (probs >= 0).all(), "expected greater than or equal to zero."
          prob *= np.sum(probs)  # Sum of `probs` should be close to 1.

          prev_x = x

        total_bytes += len(line)

    return np.exp(-cumulative_log_prob / total_bytes), prob

## Run and report perplexity

Please run the following code block to report the final perplexity of three language models. You can change the hyperparameters to the language model, e.g., arguments for constructing three models, `model_cs`, `model_de` and `model_zh`. However, do not change the rest of the code for a fair comparison with others.

We will rank your "adjusted" perplexity results from three language models after adding penalty terms.
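
Reading off the code in the cell below, the ranked quantity can be summarized as follows. The symbols $\mathrm{PPL}_\ell$ and $N_\ell$ (the total number of bytes in the test file for language $\ell$) are notation introduced here for illustration. Note that `perplexity()` resets the state at the start of each line, so the history $x_{<t}$ only covers the current line, beginning with the BOS symbol.

$\mathrm{PPL}_\ell = \exp\left(-\frac{1}{N_\ell} \sum_{t} \log p(x_t | x_{<t})\right), \qquad adjusted = \sum_{\ell \in \{cs, de, zh\}} \left(\mathrm{PPL}_\ell + penalty_\ell\right)$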

In [None]:
# Computes md5 hash to make sure that the correct datasets are used for training
# and testing.
!md5sum byte-language-model-2023/*

# Train a Czech language model using Czech data. Note that you can modify the
# arguments to `ByteLM` to set hyperparameters to the language model.
model_cs = ByteLM("byte-language-model-2023/news-commentary-v18.cs.txt")

# Train a German language model using German data. Note that you can modify the
# arguments to `ByteLM` to set hyperparameters to the language model.
model_de = ByteLM("byte-language-model-2023/news-commentary-v18.de.txt")

# Train a Chinese language model using Chinese data. Note that you can modify
# the arguments to `ByteLM` to set hyperparameters to the language model.
model_zh = ByteLM("byte-language-model-2023/news-commentary-v18.zh.txt")

# DO NOT CHANGE THE FOLLOWING CODES FOR FAIR COMPARISON.

# Testing on Czech data using the language model trained by Czech data.
perp_cs, prob_cs = model_cs.perplexity("byte-language-model-2023/wmttest2023.cs.txt")

# Testing on German data using the language model trained by German data.
perp_de, prob_de = model_de.perplexity("byte-language-model-2023/wmttest2023.de.txt")

# Testing on Chinese data using the language model trained by Chinese data.
perp_zh, prob_zh = model_zh.perplexity("byte-language-model-2023/wmttest2023.zh.txt")

# Computes total perplexity from three languages.
perp = perp_cs + perp_de + perp_zh

# Computes penalties and simply sums them.
penalty_cs = np.maximum(np.abs(1 - prob_cs) - 0.001, 0) * 100
penalty_de = np.maximum(np.abs(1 - prob_de) - 0.001, 0) * 100
penalty_zh = np.maximum(np.abs(1 - prob_zh) - 0.001, 0) * 100
adjusted_perp = perp + penalty_cs + penalty_de + penalty_zh

# Print out computation results. "adjusted perplexity" is used for ranking.
print(f"cs perplexity: {perp_cs} prob: {prob_cs} penalty: {penalty_cs}")
print(f"de perplexity: {perp_de} prob: {prob_de} penalty: {penalty_de}")
print(f"de perplexity: {perp_zh} prob: {prob_zh} penalty: {penalty_zh}")
print(f"total perplexity: {perp}")
print(f"adjusted: {adjusted_perp:.4f}")

27f491a6c461d2476cc1a2636ca335a5  byte-language-model-2023/news-commentary-v18.cs.txt
c5cf2b0f973eccb12e45a56039d44f99  byte-language-model-2023/news-commentary-v18.de.txt
73393083e83b20fa09d94730ae336ddd  byte-language-model-2023/news-commentary-v18.zh.txt
49fac4990f87cbac47cd0811e0b37f34  byte-language-model-2023/wmttest2022.cs.txt
4fb8d232846ed6f66662163ff1e319e3  byte-language-model-2023/wmttest2022.de.txt
4da911905616d01cce3a06b5354b4146  byte-language-model-2023/wmttest2022.zh.txt
dea2aeda559fe2c6237a95a57c48b589  byte-language-model-2023/wmttest2023.cs.txt
e9ef45facac4b236d22b2525bf6e3391  byte-language-model-2023/wmttest2023.de.txt
d544edaac936c509f958a3f1309124dd  byte-language-model-2023/wmttest2023.zh.txt


  log_probs = np.where(probs == 0.0, -np.inf, np.log(probs))


cs perplexity: inf prob: 0.9999999999966394 penalty: 0.0
de perplexity: inf prob: 0.9999999999785161 penalty: 0.0
de perplexity: inf prob: 0.999999999993634 penalty: 0.0
total perplexity: inf
adjusted: inf


## Add your codes if necessary

You can add arbitrary code here, e.g., to run experiments on smaller training data, i.e., `news-commentary-v18-small.cs.txt` and/or `news-commentary-v18-small.de.txt`, together with development data, `byte-language-model-2023/wmttest2022.cs.txt` and/or `byte-language-model-2023/wmttest2022.de.txt`. You can easily add your code by clicking `+ Code` at the top of this notebook, near the menu bar.

When you tweak any hyperparameters of your model, consider keeping some run results, e.g., results on the development datasets, as justification for your choices.
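
One way to keep such a record is a small sweep that evaluates each candidate setting and prints the results sorted by score. The sketch below is an assumption-laden illustration: the helper `sweep` is hypothetical, and `toy_dev_perplexity` is a made-up stand-in that only makes the example runnable without the datasets.

```python
from typing import Callable, Iterable, List, Tuple

def sweep(evaluate: Callable[[int], float],
          candidates: Iterable[int]) -> List[Tuple[float, int]]:
    """Hypothetical helper: scores every candidate and sorts by score."""
    results = [(evaluate(value), value) for value in candidates]
    # Lowest score (e.g., development perplexity) comes first.
    return sorted(results)

# Toy stand-in for a real evaluation, only to make this sketch runnable.
def toy_dev_perplexity(order: int) -> float:
    return float((order - 4) ** 2) + 1.0

results = sweep(toy_dev_perplexity, [2, 3, 4, 5])
best_score, best_order = results[0]
print(best_order, best_score)
```

In this notebook, `evaluate` could instead train on the small training file and test on the development file, e.g., `lambda n: ByteLM("news-commentary-v18-small.cs.txt", order=n).perplexity("byte-language-model-2023/wmttest2022.cs.txt")[0]`.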

In [None]:
# You can add your code here for testing purposes, e.g., runs on
# development data to tweak your code in `ByteLM` or to find hyperparameters.
# However, do not tune on test data.