## Assignment 1: Build a Toy Llama-2 Language Model

> CISC7021 Applied Natural Language Processing (2024/2025)

In this assignment, we will prepare a toy language model that uses the **Llama-2** architecture and evaluate its perplexity on the provided data sets.

We will learn how to perform continual pre-training of a base language model using the PyTorch and Hugging Face libraries. Detailed instructions for building this language model can be found in the attached notebook file.

Acknowledgement: The base model checkpoint was converted from the [llama2.c](https://github.com/karpathy/llama2.c) project. The data instances were sampled from the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset.

---

🚨 Please note that running this on CPU may be slow. If running on Google Colab or Kaggle, you can avoid this by going to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**. This should be included within the free tier of Colab.

---

We start by doing a `pip install` of all required libraries.
- 🤗 `transformers`, `datasets`, and `accelerate` are Hugging Face libraries.
- By default, Colab has the `transformers` and `pytorch` libraries installed. If you are using a local machine, please install them via `pip` or `conda`.

In [1]:
#!pip install torch torchvision torchaudio
#!pip install transformers

In [2]:
!pip install datasets accelerate -q



### (Optional) Uploading the model/data to Google Colab or Kaggle.

Please upload your dataset and model to computational platforms if you are using Colab or Kaggle environments.

For Colab users, you can mount your Google Drive files by running the following code snippet:

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

### Necessary Packages, Environment Setups

In [1]:
import torch
import transformers

from typing import List, Optional, Tuple, Union
from transformers import LlamaForCausalLM, LlamaTokenizer, AutoTokenizer
from transformers import Trainer, TrainingArguments
from itertools import chain
from datasets import load_dataset

from tqdm.notebook import tqdm
from torch.nn import CrossEntropyLoss

Please set the correct file path based on your environment.

- If you are using Colab, the path may be: `/content/drive/MyDrive/xxxxxx`
- If you are using Kaggle, the path may be: `/kaggle/input/xxxxxx`

In [2]:
# Please set the correct file path based on your environment.
TRAIN_FILE = '/kaggle/input/englishtochinesetext/Model and Datasets/data/zh_train.jsonl'
VALIDATION_FILE = '/kaggle/input/englishtochinesetext/Model and Datasets/data/zh_dev.jsonl'
TEST_FILE = '/kaggle/input/englishtochinesetext/Model and Datasets/data/zh_test.jsonl'
PT_TEST_FILE = '/kaggle/input/englishtochinesetext/Model and Datasets/data/pt_test.jsonl'
EN_TEST_FILE = '/kaggle/input/englishtochinesetext/Model and Datasets/data/en_test.jsonl'
MODEL_FOLDER = "/kaggle/input/englishtochinesetext/Model and Datasets/llama-42m"
JP_TRAIN_FILE = '/kaggle/input/jpdata-o/jp_train.jsonl'
JP_TEST_FILE = '/kaggle/input/datsetjp-en-cn/Model and Datasets/data/jp_test.jsonl'
JP_VALIDATION_FILE='/kaggle/input/datsetjp-en-cn/Model and Datasets/data/jp_dev.jsonl'

Load the model checkpoint onto either a GPU or a CPU (training will be slow on a CPU, but decoding speed is acceptable).

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device type: {device}")

model_path = MODEL_FOLDER
# Load model from local files
model = LlamaForCausalLM.from_pretrained(model_path).to(device)
# Load tokenizer from local files (tokenization runs on the CPU, so no device argument is needed)
tokenizer = AutoTokenizer.from_pretrained(model_path)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Device type: cuda


As we can see from the statistics, this model is much smaller than Llama-2 but shares the same decoder-only architecture.


😄 **You do not need to check complex details!** We just present the architecture and number of parameters here.

In [4]:
total_para = sum(v.numel() for k, v in model.state_dict().items() if k != 'model.embed_tokens.weight') / 1e6
print(model)
print(f"#Parameters: {total_para:.2f}M")

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 512)
    (layers): ModuleList(
      (0-7): 8 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=512, out_features=512, bias=False)
          (k_proj): Linear(in_features=512, out_features=512, bias=False)
          (v_proj): Linear(in_features=512, out_features=512, bias=False)
          (o_proj): Linear(in_features=512, out_features=512, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=512, out_features=1376, bias=False)
          (up_proj): Linear(in_features=512, out_features=1376, bias=False)
          (down_proj): Linear(in_features=1376, out_features=512, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((512,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((512,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((
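The layer shapes in the printout are enough for a rough parameter estimate. The sketch below only multiplies out the shapes shown above; the size of the final norm and of the output projection are assumptions, since the printout is cut off before those lines.

```python
# Shapes read off the printed architecture above.
hidden, intermediate, vocab, n_layers = 512, 1376, 32000, 8

attn = 4 * hidden * hidden        # q_proj, k_proj, v_proj, o_proj (no bias)
mlp = 3 * hidden * intermediate   # gate_proj, up_proj, down_proj (no bias)
norms = 2 * hidden                # input_layernorm + post_attention_layernorm
per_layer = attn + mlp + norms

# ASSUMPTION: the final RMSNorm has `hidden` weights (its line is truncated above).
decoder = n_layers * per_layer + hidden
embedding = vocab * hidden        # embed_tokens, which the cell above leaves out

print(f"per decoder layer : {per_layer / 1e6:.2f}M")
print(f"all decoder layers: {decoder / 1e6:.2f}M")
print(f"embedding table   : {embedding / 1e6:.2f}M")
```

If the output projection has the same 32000 x 512 shape as the embedding table (an assumption, not confirmed by the truncated printout), the decoder plus that projection would land close to the 42M suggested by the checkpoint name.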

### Task 1: Decoding


If you are familiar with the `model.generate()` function in the `transformers` library, please feel free to jump to [Task 1 Playground](#scrollTo=Task_1_Playground).


#### 💡Tutorials: model.generate() function.
---
Minimal example:

```python
prompt = "Once upon a time, " # Input, prefix of generation
```

**Step 1**: Encode raw text using tokenizer model.
```python
tokenized_input = tokenizer.encode(prompt, return_tensors='pt').to(device)
```

**Step 2**: Set decoding hyper-parameters. Get the model output.
```python
output_ids = model.generate(tokenized_input, do_sample=True, max_new_tokens=300, temperature=0.6)
```
Important parameters:
- `max_new_tokens`: The maximum number of tokens to generate, not counting the tokens in the prompt.
- `temperature`: The value used to rescale the next-token probabilities. Higher temperature -> more diverse text. Lower temperature -> more deterministic text.
- `do_sample`: `do_sample=False` selects the greedy decoding strategy. To enable greedy decoding, we also need to set the other sampling parameters, `top_p` and `temperature`, to `None`.
- [If you are interested in other decoding algorithms, please refer to this link for setting parameters.](https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/text_generation#transformers.GenerationConfig)

**Step 3**: Convert model outputs into raw text.
```python
output_text = tokenizer.decode(output_ids[0])
```
or, for a batch of one or more input instances:
```python
output_text = tokenizer.batch_decode(output_ids)
```
Important parameters:
- Setting `skip_special_tokens=True` prevents special tokens, such as `<s>`, from appearing in the results.
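
To make the effect of the `temperature` parameter from Step 2 concrete, here is a minimal sketch in plain Python. It is an illustration of temperature scaling in general, not the code path used inside `model.generate()`; the function name and the three toy scores are made up for this example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.3, 0.6, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At low temperature almost all probability mass sits on the top-scoring token, so sampling behaves nearly like greedy decoding; at high temperature the distribution flattens and less likely tokens are drawn more often.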

---


To understand the outputs of each step, let us do a simple generation task step by step! (Note: the base model is only able to produce fluent story text).

In [5]:
prompt = "Once upon a time, Stella Lou had a dream." # Feel free to use other generation prefix

In [6]:
# Step 1: Encode raw text using the tokenizer. Run tokenization and convert strings into token ids in the vocabulary.
tokenized_input = tokenizer.encode(prompt, return_tensors='pt').to(device)
# See the tokenized results.
print(tokenized_input)

tensor([[    1,  9038,  2501,   263,   931, 29892,   624,  3547,  4562,   750,
           263, 12561, 29889]], device='cuda:0')


In [7]:
# Step 2: Set decoding hyperparameters.

# For greedy decoding
max_new_tokens = 300
do_sample = False  # `do_sample=False` selects greedy decoding. To enable it, we also need to set `top_p` and `temperature` to `None`.
temperature = None

# call generation function model.generate()
output_ids = model.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
    top_p=None,
)

# The decoded results are token ids.
print("=" * 20 + "Token IDs" + "=" * 20)
print(output_ids)

tensor([[    1,  9038,  2501,   263,   931, 29892,   624,  3547,  4562,   750,
           263, 12561, 29889,  2296,  5131,   304,   367,   263, 12456,   985,
         29889,  2296,  5131,   304, 19531,   263,  9560, 10714,   322,   263,
           528,  4901, 20844, 29889,  1205,  1183,   471,  2086,  2319,   322,
           278, 10714,   471,  2086,  4802, 29889,    13,  6716,  2462, 29892,
           624,  3547,  4446,   263,  4802, 29892,   528,  4901, 10714,   297,
           263,  3787, 29889,  2296,  4433,   902, 16823,   565,  1183,  1033,
           505,   372, 29889,  2439, 16823,  1497,  4874,   322, 18093,   372,
           363,   902, 29889,    13,   855,  3547,   471,   577,  9796, 29889,
          2296,  1925,   373,   278, 10714,   322,  3252,   381,   839,  2820,
         29889,  2296,  7091,   763,   263,  1855, 12456,   985, 29889,    13,
          6246,   769, 29892,  1554,  8515,  9559, 29889,   624,  3547,  4687,
           304,  4459,   270,   466,  1537, 29889,  

In [8]:
# Step 3: Convert model outputs into raw text.
# decode token ids into tokens
print("=" * 20 + "Decoded Results" + "=" * 20)
# We only have one input instance. So we directly decode the first item of model output, i.e., `output_ids[0]`.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Once upon a time, Stella Lou had a dream. She wanted to be a princess. She wanted to wear a beautiful dress and a shiny crown. But she was too small and the dress was too big.
One day, Stella saw a big, shiny dress in a store. She asked her mom if she could have it. Her mom said yes and bought it for her.
Stella was so happy. She put on the dress and twirled around. She felt like a real princess.
But then, something strange happened. Stella started to feel dizzy. She couldn't stand up straight. She felt like she was spinning around and around.
Stella's mom saw her and said, "Stella, you need to take a break. You look dizzy."
Stella took off the dress and lay down on the floor. She closed her eyes and took a deep breath. After a few minutes, she felt better.
Stella smiled and said, "Mom, I'm ready to be a princess again!"


#### Another pipeline example: Sampling decoding with temperature.

In [9]:
prompt = "Stella Lou hurt herself."

# Decoding hyperparameters
max_new_tokens = 300
do_sample = True
# The value of temperature used to modulate the next token probabilities.
# Higher temperature -> generate more diverse text. Lower temperature -> generate more deterministic text.
temperature = 0.6

tokenized_input = tokenizer.encode(prompt, return_tensors="pt").to(device)
output_ids = model.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
)
output_text = tokenizer.decode(output_ids[0])
print(output_text)


<s> Stella Lou hurt herself. She had been playing in the park with her friends, but she had not been careful. She had run too fast and fallen down.
"Ouch!" she cried.
Her mom came running over. "What happened?" she asked.
"I fell down," Stella said, tears streaming down her face.
Her mom hugged her and said, "It's ok. Let's get you home and make you feel better."
Stella smiled and nodded. Her mom took her home and put a bandage on her knee. She gave her a big hug and said, "You'll be okay."
Stella felt a little better. She was glad her mom was there to help her. She knew she would be more careful next time she played in the park.<s>


#### Task 1 Playground

---

📚 Task 1: Please generate English stories using various prompts and decoding settings. Feel free to explore any interesting phenomena, such as the impact of different prompts and the effects of various decoding algorithms and parameters. For example, quantify the text properties using linguistically motivated metrics such as story length and Type-Token Ratio (TTR). In addition to objective metrics, you are encouraged to discuss your findings based on subjective case studies.

We provide two types of skeleton code: one that takes a single prompt as input and another that can process batched inputs and decoding. Please use the version that best fits your preferences and data types.

---
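
The two metrics named in the task, story length and Type-Token Ratio, can be computed with a few lines of standard-library Python. This is a minimal sketch: the helper name `story_stats` and the regular-expression word splitter are assumptions chosen for illustration, and a proper word tokenizer may be preferable for a real analysis.

```python
import re

def story_stats(text):
    """Return (number of word tokens, number of distinct words, type-token ratio)."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude word splitter (assumption)
    types = set(tokens)
    ttr = len(types) / len(tokens) if tokens else 0.0
    return len(tokens), len(types), ttr

sample = "She wanted to be a princess. She wanted to wear a beautiful dress."
n_tokens, n_types, ttr = story_stats(sample)
print(f"length={n_tokens} types={n_types} TTR={ttr:.2f}")
```

TTR tends to fall as a text gets longer, so comparisons between decoding settings are more meaningful when the stories are truncated to a similar number of words first.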

In [10]:
# Skeleton Code: Single input (same as previous code blocks)

prompt = "" # ⬅️ try to construct different prompts.

# ⬇️ Try to tune different decoding hyperparameters.
# You can also add more hyperparameters like `top_p`, `top_k`.
max_new_tokens = 300
do_sample = True
temperature = 0.6

tokenized_input = tokenizer.encode(prompt, return_tensors="pt").to(device)
output_ids = model.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    top_k=10,  # sample only from the 10 most probable candidate tokens
    top_p=0.9,
    do_sample=do_sample,
    temperature=temperature,
)
output_text = tokenizer.decode(output_ids[0])
print(output_text)

<s> Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she went to the park with her mom. They played on the swings and went down the slide. 
After a while, Lily's mom said it was time to go home. But Lily didn't want to leave yet. She said, "Mommy, can we stay a little longer? I want to play more." 
Her mom said, "I'm sorry, Lily. We have to go home now. It's getting late and we need to eat dinner." 
Lily felt sad and didn't want to leave. She said, "But mommy, I want to stay and play more. Please?" 
Her mom said, "I know it's hard, but we have to go. We can come back another day." 
Lily understood and they walked home together. As they walked, Lily looked up at the sky and said, "Mommy, the sky is so blue. It's like a big, open sky." 
Her mom smiled and said, "Yes, it is. And when we get home, we can have a yummy dinner and watch a movie together." 
Lily felt happy and excited to spend more time with her mom.<s>


In [11]:
# Skeleton Code: Batched input-output

prompts = ["Once upon a time,", "Tom is a cute kitty."]  # ⬅️ try to construct different prompts.

batch_size = 2 # If you have multiple data inputs, please control the batch size to prevent out-of-memory issues.

# ⬇️ Try to tune different decoding hyperparameters.
# You can also add more hyperparameters like `top_p`, `top_k`.
max_new_tokens = 300
do_sample = True
temperature = 0.6

for i in range(0, len(prompts), batch_size):
    batch_input = prompts[i:i+batch_size]
    tokenized_input = tokenizer(batch_input, return_tensors="pt", padding=True).to(device)

    # For decoder-only models, batched inputs of model.generate() should be in the format of input_ids.
    output_ids = model.generate(
        tokenized_input["input_ids"],
        max_new_tokens=max_new_tokens,
        eos_token_id=1,
        do_sample=do_sample,
        temperature=temperature,
    )
    output_text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

    for idx, result in enumerate(output_text):
        print(f"{result}\n")

Once upon a time,ie was playing in the park. She saw a big, black bird flying above her. It was a very big black bird and it was very loud. The bird was so loud that it made the ground shake.
Suddenly, the bird flew down and landed right in front of the little girl. The girl was scared and she started to cry. The bird said, "Don't be scared, I'm here to help you." The bird then flew up and grabbed the girl's shoe with its beak. The shoe fell down and the girl was happy.
The bird then said, "I have an idea. Let's go and find a big, black bird that can help us." The girl was excited and she said, "Yes!"
The bird flew off and soon came back with a big, black bird. The girl was scared at first, but then the big, black bird said, "Don't be scared. I'm here to help you." The bird then flew around the park, making a lot of noise. The little girl laughed and the bird flew away.
The little girl was happy and she thanked the big, black bird for helping her. She waved goodbye and went home. From 
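
The first story above begins with "Once upon a time,ie", which does not read like a natural continuation. One plausible cause (an assumption, not verified on this run) is that the shorter prompt was padded and the model attended to the padding tokens, because only `input_ids` was passed to `model.generate()`. A common recommendation for decoder-only models is to pad on the left and to pass the attention mask as well. The plain-Python sketch below shows what left padding and its mask look like; `pad_batch` and the token ids are hypothetical.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length; return (input_ids, attention_mask)."""
    width = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        gap = width - len(s)
        if side == "left":
            input_ids.append([pad_id] * gap + s)
            attention_mask.append([0] * gap + [1] * len(s))
        else:
            input_ids.append(s + [pad_id] * gap)
            attention_mask.append([1] * len(s) + [0] * gap)
    return input_ids, attention_mask

batch = [[1, 9038, 2501], [1, 4335, 338, 263, 274]]  # hypothetical token ids
ids, mask = pad_batch(batch, pad_id=2, side="left")
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

With left padding every prompt ends at the last position, so generation continues directly from real text. In the skeleton code this would correspond to setting `tokenizer.padding_side = "left"` before tokenizing and adding `attention_mask=tokenized_input["attention_mask"]` to the `model.generate()` call.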

#### What about other languages?

Oops! This English language model cannot generate stories in other languages!

Why? Let us evaluate the perplexity of different languages in the next task.

In [12]:
prompt = "从前有一只小兔子乖乖"

# Decoding hyperparameters
max_new_tokens = 300
do_sample = True
temperature = 0.3

tokenized_input = tokenizer.encode(prompt, return_tensors="pt").to(device)
output_ids = model.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
)
output_text = tokenizer.decode(output_ids[0])
print(output_text)


<s> 从前有一只小兔子乖乖 couldn't wait to get to the park. He put on his shoes and ran outside.
When he got to the park, he saw a big slide. He wanted to go down it, but he was scared. He looked around and saw a big tree. He thought it would be fun to climb it.
He started to climb the tree, but it was very high. He got scared and started to cry. He wanted to go down the slide, but he was too scared.
Suddenly, a big bird flew down and landed on the tree. It looked at him and said, "Don't be scared. I'll help you." The bird flew down and helped him go down the slide.
When he got to the bottom, he was so happy. He thanked the bird and ran off to play. He had a lot of fun at the park.<s>


### Task 2: Perplexity Evaluation

#### Background

---

Perplexity is a key metric for evaluating language models. It quantifies how well a model predicts a sample, with lower perplexity indicating better performance. For a tokenized sequence $X = (x_0, x_1, \dots, x_t)$, the perplexity is defined mathematically as:

$$\text{Perplexity}(X) = \exp \left( -\frac{1}{t} \sum_{i=1}^t \log p_\theta (x_i | x_{<i}) \right)$$

Here, $p_\theta(x_i | x_{<i})$ is the probability the model assigns to token $x_i$ given its preceding tokens, and the exponent is the negative average log-probability over the $t$ predicted tokens.
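
As a small worked example of the formula, the sketch below computes perplexity directly from a list of per-token probabilities. The probabilities are invented for illustration and do not come from the model.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the predicted tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

confident = [0.5, 0.25, 0.5, 0.25]    # hypothetical p(x_i | x_<i) for i = 1..4
uncertain = [0.01, 0.02, 0.01, 0.02]
uniform = [1 / 32000] * 4             # uniform guess over a 32,000-token vocabulary

print(f"confident: {perplexity(confident):.2f}")
print(f"uncertain: {perplexity(uncertain):.2f}")
print(f"uniform  : {perplexity(uniform):.0f}")
```

Perplexity is the inverse of the geometric mean of the token probabilities. A uniform guess over a 32,000-entry vocabulary (the size of the embedding table printed earlier) therefore scores 32,000, which is a useful reference point when reading very large values.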

---

⚠️ Please make sure to **run the following cell first** to define the evaluation function.

😄 **You do not need to check these complex details! Too hard for beginners!** However, if you are interested, you can compare the following code with the explanations above to better understand how to implement PPL evaluation using PyTorch.

In [12]:
# The following code was adapted from the `evaluate` library. Licensed under the Apache License, Version 2.0 (the "License").
# We modify them to avoid causing serious memory issues in the Colab environment.

def compute_ppl(
        model, tokenizer, inputs, device, batch_size: int = 16, add_start_token: bool = True, max_length=None
):

    if device is not None:
        assert device in ["gpu", "cpu", "cuda"], "device should be either gpu or cpu."
        if device == "gpu":
            device = "cuda"
    else:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    # if batch_size > 1 (which generally leads to padding being required), and
    # if there is not an already assigned pad_token, assign an existing
    # special token to also be the padding token
    if tokenizer.pad_token is None and batch_size > 1:
        existing_special_tokens = list(tokenizer.special_tokens_map_extended.values())
        # check that the model already has at least one special token defined
        assert (
            len(existing_special_tokens) > 0
        ), "If batch_size > 1, model must have at least one special token to use for padding. Please use a different model or set batch_size=1."
        # assign one of the special tokens to also be the pad token
        tokenizer.add_special_tokens({"pad_token": existing_special_tokens[0]})

    if add_start_token and max_length:
        # leave room for <BOS> token to be added:
        assert (
            tokenizer.bos_token is not None
        ), "Input model must already have a BOS token if using add_start_token=True. Please use a different model, or set add_start_token=False"
        max_tokenized_len = max_length - 1
    else:
        max_tokenized_len = max_length

    encodings = tokenizer(
        inputs,
        add_special_tokens=False,
        padding=True,
        truncation=True if max_tokenized_len else False,
        max_length=max_tokenized_len,
        return_tensors="pt",
        return_attention_mask=True,
    )

    encoded_texts = encodings["input_ids"]
    attn_masks = encodings["attention_mask"]

    # check that each input is long enough:
    if add_start_token:
        assert torch.all(torch.ge(attn_masks.sum(1), 1)), "Each input text must be at least one token long."
    else:
        assert torch.all(
            torch.ge(attn_masks.sum(1), 2)
        ), "When add_start_token=False, each input text must be at least two tokens long. Run with add_start_token=True if inputting strings of only one token, and remove all empty input strings."

    ppls = []
    loss_fct = CrossEntropyLoss(reduction="none")

    for start_index in tqdm(range(0, len(encoded_texts), batch_size)):
        end_index = min(start_index + batch_size, len(encoded_texts))
        encoded_batch = encoded_texts[start_index:end_index].to(device)
        attn_mask = attn_masks[start_index:end_index].to(device)

        if add_start_token:
            bos_tokens_tensor = torch.tensor([[tokenizer.bos_token_id]] * encoded_batch.size(dim=0)).to(device)
            encoded_batch = torch.cat([bos_tokens_tensor, encoded_batch], dim=1)
            attn_mask = torch.cat(
                [torch.ones(bos_tokens_tensor.size(), dtype=torch.int64).to(device), attn_mask], dim=1
            )

        labels = encoded_batch

        with torch.no_grad():
            out_logits = model(encoded_batch, attention_mask=attn_mask).logits

            shift_logits = out_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            shift_attention_mask_batch = attn_mask[..., 1:].contiguous()

            perplexity_batch = torch.exp(
                (loss_fct(shift_logits.transpose(1, 2), shift_labels) * shift_attention_mask_batch).sum(1)
                / shift_attention_mask_batch.sum(1)
            )

            ppls += perplexity_batch.tolist()

    del encoded_batch, attn_mask
    if device == "cuda":
        torch.cuda.empty_cache()

    return {"perplexities": ppls, "mean_perplexity": sum(ppls)/float(len(ppls))}


#### 💡Tutorials: compute_ppl() function.

---
Minimal example:

```python
test_dataset = ["Once upon a time,"]

compute_ppl(
    model=model,
    tokenizer=tokenizer,
    device=device,
    inputs=test_dataset,
    batch_size = 16
)
```

Important parameters:
- `inputs`: list of input text, each separate text snippet is one list entry.
- `batch_size`: the batch size to run evaluations.

Returns:
- `perplexity`: a dictionary of the form `{"perplexities": [x.x, x.x, ...], "mean_perplexity": x.x}`, containing the perplexity score of each text in the input list as well as the mean perplexity.
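
Note that `mean_perplexity` is the arithmetic mean of the per-text perplexities (`sum(ppls) / len(ppls)` in the function above), not a perplexity computed over all tokens pooled together. The sketch below contrasts the two on made-up numbers; both helper names are hypothetical.

```python
import math

def mean_of_perplexities(per_text_nll):
    """What `mean_perplexity` reports: average the per-text perplexities."""
    ppls = [math.exp(sum(nll) / len(nll)) for nll in per_text_nll]
    return sum(ppls) / len(ppls)

def pooled_perplexity(per_text_nll):
    """Alternative: pool all tokens first, then exponentiate once."""
    all_nll = [x for nll in per_text_nll for x in nll]
    return math.exp(sum(all_nll) / len(all_nll))

# Hypothetical per-token negative log-likelihoods: one long easy text, one short hard text.
toy = [[0.5] * 100, [5.0] * 2]
print(f"mean of per-text PPL: {mean_of_perplexities(toy):.2f}")
print(f"pooled PPL          : {pooled_perplexity(toy):.2f}")
```

Because every text counts equally in the mean, a few short, poorly predicted texts can dominate the reported score. This is worth keeping in mind when comparing test sets of very different sizes.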


---

#### Task 2 Playground

---

📚 Task 2: Evaluate the perplexity. Ensure that you evaluate both the English and Chinese test data we provided. You are encouraged to collect more diverse text data and discuss your findings regarding the language understanding capacity of the base model.


Note: If you want to reuse the evaluation codes for JSONL data, please structure the content as follows:
```json
{"text": "one data"}
{"text": "two data."}
...
```
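
A file in this format can be produced with the standard `json` module. The following is a minimal sketch; the example sentences and the file name `my_test.jsonl` are placeholders.

```python
import json
import os
import tempfile

stories = ["从前有一只小兔子。", 'She said, "hello".']  # hypothetical examples

path = os.path.join(tempfile.mkdtemp(), "my_test.jsonl")  # hypothetical file name
with open(path, "w", encoding="utf-8") as f:
    for text in stories:
        # json.dumps escapes quotes and newlines; ensure_ascii=False keeps CJK text readable
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

# Read the file back, one JSON object per line, to check the round trip.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line)["text"] for line in f]
print(loaded)
```

Writing each record with `json.dumps` rather than by string concatenation avoids broken lines when a story contains quotation marks or line breaks.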
**You may find that the PPL value for Chinese text is significantly higher than that for English text. This helps explain why the base model could not generate a Chinese story at the end of the previous task.**

---

In [13]:
# Skeleton Code: Evaluate the perplexity (PPL) on a list of raw text.

test_dataset = ["Once upon a time,", "Tom is a cute kitty."] # ⬅️ you can use your examples / or read from raw text file

results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"Perplexity: {dataset_ppl:.2f}")

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


Perplexity: 10.68


In [14]:
# Japanese test set.
data_file = JP_TEST_FILE # ⬅️ you can change your file path
test_dataset = load_dataset('json', data_files={'test': data_file})["test"]["text"]

results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"(Japanese Text) Test Perplexity: {dataset_ppl:.2f}")


# English test set.
data_file = EN_TEST_FILE # ⬅️ you can change your file path
test_dataset = load_dataset('json', data_files={'test': data_file})["test"]["text"]

results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"(English Text) Test Perplexity: {dataset_ppl:.2f}")

# Chinese test set.
data_file = TEST_FILE # ⬅️ you can change your file path
test_dataset = load_dataset('json', data_files={'test': data_file})["test"]["text"]

results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"(Chinese Text) Test Perplexity: {dataset_ppl:.2f}")

# Portuguese test set.
data_file = PT_TEST_FILE # ⬅️ you can change your file path
test_dataset = load_dataset('json', data_files={'test': data_file})["test"]["text"]

results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"(Portuguese Text) Test Perplexity: {dataset_ppl:.2f}")


# Try your own data file!

(Japanese Text) Test Perplexity: 78686.17

(English Text) Test Perplexity: 4.14

(Chinese Text) Test Perplexity: 70030.42

(Portuguese Text) Test Perplexity: 25050.81


In [7]:
# 🚨 Release gpu cache before training the model
if device == "cuda":
    for i in range(torch.cuda.device_count()):
        torch.cuda.set_device(i)
        torch.cuda.empty_cache()

### Task 3: Continual Pre-training (in Chinese or in another language you are proficient in)

Currently, our base English LM is proficient in English but lacks the capability to generate or comprehend other languages (e.g., Chinese). The objective of this task is to enhance the base English LM by continually pre-training it with text in another language, so that the model can understand and generate mini-stories in that language.

We have provided 10,000 Chinese training samples. The training process for any language is the same. We have included useful resource links (in the assignment description PDF) to help you create additional data. If you encounter any issues in creating a dataset in another language, please do not hesitate to contact us.

We have implemented data preprocessing and the training pipeline, so you are not required to optimize these components. Instead, focus on tuning the training hyperparameters and observe the changes in model performance.


---

⚠️ Please **make sure to run the following cell first to pre-process data**.

😄 You do not need to check the details of the whole pipeline construction! Please pay attention to the hyper-parameters of `trainer`.

#### Preprocess Data
Here, we preprocess (tokenize and group) the text for the subsequent evaluation and pre-training phases.

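Load the prepared dataset from Google Drive (or local disk). The template was written for the Chinese data; this run loads the Japanese files (`JP_*`), so the `chinese_` and `zh` variable names below are kept only for continuity.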

In [17]:
chinese_dataset = load_dataset('json', data_files={'train': JP_TRAIN_FILE, 'validation':JP_VALIDATION_FILE, 'test': JP_TEST_FILE})
print(chinese_dataset)
print(chinese_dataset["test"][2]["text"])


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 428
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 9
    })
    test: Dataset({
        features: ['text'],
        num_rows: 9
    })
})
昔々、ある日、深い森に住んでいた一匹のリスがいました。リスの名前はコタロウで、彼は森中を素早く駆け回って遊んでいました。ある日、森に嵐がやってきて、木々が激しく揺れ始めました。コタロウは急いで仲間たちを安全な場所へ導き、嵐が過ぎ去るまで皆を守りました。コタロウの勇気は動物たちに感謝され、彼の名は森中で語り継がれることになりました。


We tokenize the raw text using Llama-2's tokenizer and group the tokenized text as inputs.

In [18]:
block_size = 380

def tokenize_function(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
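
To see what `group_texts` does, the sketch below applies the same concatenate-then-chunk logic to three toy token lists with a block size of 4. It is a simplified stand-in for illustration: `group_into_blocks` takes the block size as an argument and looks up the `input_ids` key by name.

```python
from itertools import chain

def group_into_blocks(batch, block_size):
    """Concatenate all sequences, then cut the result into fixed-size blocks."""
    joined = {k: list(chain(*batch[k])) for k in batch}
    total = len(joined["input_ids"])
    if total >= block_size:
        total = (total // block_size) * block_size  # drop the incomplete tail
    blocks = {
        k: [v[i:i + block_size] for i in range(0, total, block_size)]
        for k, v in joined.items()
    }
    blocks["labels"] = [list(b) for b in blocks["input_ids"]]  # LM targets = inputs
    return blocks

toy = {"input_ids": [[1, 11, 12], [1, 21, 22, 23], [1, 31, 32]]}  # three toy "stories"
out = group_into_blocks(toy, block_size=4)
print(out["input_ids"])
```

The ten toy tokens yield two full blocks, and the last two tokens are discarded. Story boundaries are not preserved, so a block may contain the end of one story and the start of the next.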

In [19]:
tokenized_zh_datasets = chinese_dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
lm_datasets = tokenized_zh_datasets.map(
    group_texts,
    batched=True,
    batch_size=380,
    num_proc=4,
)


#### 💡Tutorials: TrainingArguments().

**Important Training Hyper-parameters**
- `learning_rate`: The initial learning rate for the optimizer.
- `num_train_epochs`: Total number of training epochs to perform (if not an integer, training stops after the corresponding fraction of the last epoch).
- `*_strategy`: The evaluation/saving strategy to adopt during training. Possible values are:
    - `"no"`: No evaluation/saving is done during training.
    - `"steps"`: Evaluation/saving is done (and logged) every `eval_steps`/`save_steps`.
    - `"epoch"`: Evaluation/saving is done at the end of each epoch.
- `per_device_train_batch_size`: The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for training.
- `per_device_eval_batch_size`: The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for evaluation.
- `save_total_limit`: If a value is passed, limits the total number of checkpoints kept on disk.


---

If you are not familiar with the `AdamW` optimizer and learning-rate schedulers, you may use the default settings.

**Optimizer Hyper-parameters**
- `weight_decay`: The weight decay to apply (if not zero) to all layers except bias and LayerNorm weights in the `AdamW` optimizer.
- `adam_beta1`: The beta1 hyperparameter for the `AdamW` optimizer.
- `adam_beta2`: The beta2 hyperparameter for the `AdamW` optimizer.

**Learning schedule**
- `lr_scheduler_type`: The scheduler type to use.
- `warmup_ratio`: Ratio of total training steps used for a linear warmup from 0 to `learning_rate`.

[Explore more parameters here](https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/trainer#transformers.TrainingArguments)
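
The interaction between batch size, epochs, and `warmup_ratio` is easier to see with numbers. The sketch below estimates the number of optimizer steps and evaluates a linear warmup followed by linear decay. It assumes one device and no gradient accumulation; the dataset size of 1,000 blocks is hypothetical, and the function is an illustration of the schedule shape, not the library implementation.

```python
import math

def linear_schedule_lr(step, total_steps, peak_lr, warmup_ratio):
    """Learning rate at `step`: linear warmup to `peak_lr`, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Hypothetical run: 1,000 training blocks, batch size 24, 3 epochs.
steps_per_epoch = math.ceil(1000 / 24)
total_steps = steps_per_epoch * 3

for step in (0, 1, 2, total_steps // 2, total_steps):
    lr = linear_schedule_lr(step, total_steps, peak_lr=3e-5, warmup_ratio=0.01)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

Halving the batch size roughly doubles the number of steps per epoch, which also changes how often a step-based `eval_steps` or `save_steps` setting fires relative to an epoch.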

#### Task 3 Playground

---

📚 Please just run the following code to do continual pre-training. Please try your best to tune the hyperparameters or collect more data to improve model performance.

---

In [12]:
# =========Pre-training hyperparameters, please feel free to tune them~=========
# =Important=
lr = 3e-5
epochs = 300
save_steps=200
strategy="steps"
train_bsz = 24 # reduce the batch size if you encounter out-of-memory errors.
eval_bsz = 16

# If you are not familiar with the AdamW optimizer or learning rate schedulers, keep the default settings.
# =Optimizer=
optimizer = "adamw_torch"
weight_decay = 0.01
adam_beta1 = 0.9
adam_beta2 = 0.98
# =Learning scheduler=
lr_scheduler = "linear"
warmup_ratio = 0.01
# =========End of pre-training hyperparameters=========


training_args = TrainingArguments(
    "llama-42m-zh-fairytales",
    evaluation_strategy = strategy,
    eval_steps=save_steps,
    save_strategy = strategy,
    save_steps=save_steps,
    logging_strategy="steps",
    logging_steps = 10,
    learning_rate=lr,
    weight_decay=weight_decay,
    seed=42,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=eval_bsz,
    save_total_limit=1,
    optim = optimizer,
    lr_scheduler_type = lr_scheduler,
    adam_beta1 = adam_beta1,
    adam_beta2 = adam_beta2,
    warmup_ratio = warmup_ratio,
    num_train_epochs = epochs,
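    # NOTE: in transformers 4.x, `None` falls back to every installed logging
    # integration (e.g. wandb); pass "none" to disable them.
    report_to=None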
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)



In [13]:
trainer.train()


[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


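| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 200  | 4.1838        | 4.068296        |
| 400  | 2.4315        | 2.220478        |
| 600  | 1.0295        | 1.038053        |
| 800  | 0.4104        | 0.821115        |
| 1000 | 0.1445        | 0.831284        |
| 1200 | 0.0417        | 0.873175        |
| 1400 | 0.014         | 0.942895        |
| 1600 | 0.0078        | 0.981071        |
| 1800 | 0.0061        | 0.998302        |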
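
In the loss log above, the training loss keeps falling while the validation loss bottoms out and then rises again, which is the usual sign of overfitting. A small sketch for locating the best evaluation step is shown below; the numbers are copied from the log, and the variable names are illustrative only.

```python
import math

# (step, training loss, validation loss) copied from the training log above
log = [
    (200, 4.1838, 4.068296),
    (400, 2.4315, 2.220478),
    (600, 1.0295, 1.038053),
    (800, 0.4104, 0.821115),
    (1000, 0.1445, 0.831284),
    (1200, 0.0417, 0.873175),
    (1400, 0.014, 0.942895),
    (1600, 0.0078, 0.981071),
    (1800, 0.0061, 0.998302),
]

# Pick the evaluation step with the lowest validation loss
best_step, _, best_val = min(log, key=lambda row: row[2])
print(f"Lowest validation loss {best_val:.6f} at step {best_step}")
print(f"Validation perplexity at that step: {math.exp(best_val):.2f}")

# Gap between validation and training loss at the final step
final_step, final_train, final_val = log[-1]
print(f"Validation/training loss gap at step {final_step}: {final_val - final_train:.4f}")
```

Because this run uses `save_total_limit=1`, only the most recent checkpoint is kept. If you want the checkpoint with the lowest validation loss instead, one option to explore is `load_best_model_at_end=True` together with `metric_for_best_model="eval_loss"` in `TrainingArguments`, or simply training for fewer epochs.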


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)

TrainOutput(global_step=1800, training_loss=1.290800393279642, metrics={'train_runtime': 1780.902, 'train_samples_per_second': 41.777, 'train_steps_per_second': 1.011, 'total_flos': 4292639539200000.0, 'train_loss': 1.290800393279642, 'epoch': 300.0})
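
The figures in this `TrainOutput` can be cross-checked with a little arithmetic. The sketch below uses only the values printed above and assumes that `train_samples_per_second` counts every training example once per epoch.

```python
# Values copied from the TrainOutput above
global_step = 1800
num_epochs = 300.0
train_runtime_s = 1780.902
samples_per_second = 41.777

# Optimizer steps taken in each epoch
steps_per_epoch = global_step / num_epochs

# Total examples processed over the run, and per epoch
# (assumption: throughput counts each example once per epoch)
samples_seen = samples_per_second * train_runtime_s
samples_per_epoch = samples_seen / num_epochs

print(f"Steps per epoch: {steps_per_epoch:.0f}")
print(f"Approx. training examples per epoch: {samples_per_epoch:.0f}")
```

A small number of steps per epoch means each example is revisited many times over 300 epochs, which is consistent with the near-zero training loss in the log.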

Load the continually pre-trained checkpoint and try generating a mini-story in another language.

In [15]:
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Device type: {device}")

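new_model_path = "/kaggle/working/llama-42m-zh-fairytales/checkpoint-1800" # saved checkpoint path; adjust to your environment
model = LlamaForCausalLM.from_pretrained(new_model_path).to(device)
# The Trainer above was not given a tokenizer, so reuse the one from `model_path` (defined earlier in the notebook).
# Tokenizers do not live on a device, so no `device` argument is needed.
tokenizer = AutoTokenizer.from_pretrained(model_path)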

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Device type: cuda


In [16]:

import gc
gc.collect() 

if device == "cuda":
    for i in range(torch.cuda.device_count()):
        with torch.cuda.device(i):
            torch.cuda.empty_cache()
            print(f"Cleared cache for GPU {i}")

Cleared cache for GPU 0
Cleared cache for GPU 1


Evaluate the PPL on the Chinese text (or the other language you chose; this run uses Japanese) again.

The PPL should now be much lower than it was before continual pre-training.

In [17]:
data_file = JP_TEST_FILE
test_dataset = load_dataset('json', data_files={'test': data_file})["test"]["text"]

results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 4)
dataset_ppl = results['mean_perplexity']
print(f"Test Perplexity: {dataset_ppl:.2f}")


Test Perplexity: 3.11
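
As a reminder of what this number means, perplexity is the exponential of the average per-token negative log-likelihood. The sketch below is a pure-Python illustration, not the notebook's `compute_ppl`. The helper names are hypothetical, and it assumes that a "mean perplexity" is the average of per-sequence perplexities.

```python
import math
from typing import List


def sequence_perplexity(token_nlls: List[float]) -> float:
    """exp(mean negative log-likelihood) over the tokens of one sequence."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def mean_perplexity(batch_nlls: List[List[float]]) -> float:
    """Average of per-sequence perplexities (assumed aggregation)."""
    ppls = [sequence_perplexity(nlls) for nlls in batch_nlls]
    return sum(ppls) / len(ppls)


# A model that is uniformly unsure among 4 candidate tokens at every position
# assigns each token probability 1/4, i.e. an NLL of log(4) per token.
uniform_4 = [math.log(4)] * 10
print(sequence_perplexity(uniform_4))  # approximately 4
```

Read this way, a test perplexity close to 3 means the model is, on average, about as uncertain as if it were choosing uniformly among three tokens at each position.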


---

The original English base model was pre-trained on 2 million data samples. Because we use only 10,000 training samples here (0.5% of the original pre-training data), the model can generate a few fluent sentences but may still struggle with long-text generation or with commonsense knowledge expressed in other languages. You can try using more data or more training steps, depending on your computational resources.

---

In [19]:
prompt = "君の名は"

# Decoding hyperparameters
max_new_tokens = 300
do_sample = True
temperature = 0.3

tokenized_input = tokenizer.encode(prompt, return_tensors="pt").to(device)
output_ids = model.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,  # id 1 is the `<s>` token in the Llama tokenizer; generation stops once it is emitted
    do_sample=do_sample,
    temperature=temperature,
)
output_text = tokenizer.decode(output_ids[0])
print(output_text)

<s> 君の名は、ある日、遠い語り力と彼の名前はオリオンで、狐が住んでいました。彼の名前はシロで、森の森中の森の中を彼は旅をしていました。ある日、森に危機が訪れ、シロはその知恵を使って森を守ることを決意しました。彼は皆と協力し、無事に森を守りました。モモの知恵と勇気は、動物たちに感謝され、彼の名は森の守護者として語り継がれることになりました。<s>
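
The `temperature` setting used above rescales the logits before sampling: values below 1 sharpen the distribution toward the most likely token, and values above 1 flatten it. The following is a minimal pure-Python sketch of that idea. The helper name and the example logits are made up for illustration, and this is not the code path `model.generate` uses internally.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens
logits = [2.0, 1.0, 0.5]
for t in (1.0, 0.3):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

With `temperature = 0.3`, most of the probability mass lands on the top token, which tends to make the output more repetitive but also more coherent. Raising it is one way to get more varied stories.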
