# Data Packaging
Let's package our training data so that it can be uploaded to Hugging Face. First we tokenize the data: LLMs do not operate on text directly, because their internal calculations require numbers. Then we pack the tokens into sequences of the maximum sequence length to improve training efficiency. During tokenization we also add special tokens at the beginning and end of each example.


## 1. Tokenizing and creating input_ids

In [1]:
import datasets

dataset = datasets.load_dataset(
    "parquet", 
    data_files="./data/preprocessed_dataset.parquet", 
    split="train"
)
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 40474
})


Use the `shard` method of the Hugging Face `Dataset` object to split the dataset into 10 smaller pieces, or *shards* (think shards of broken glass). The call below keeps only the first shard (`index=0`), roughly a tenth of the rows. Read more about sharding in the [Hugging Face documentation](https://huggingface.co/docs/datasets/en/process#shard).

In [2]:
dataset = dataset.shard(num_shards=10, index=0)
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 4048
})
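
The row count of the shard can be checked with a little arithmetic. The sketch below is plain Python and does not call the `datasets` library; it assumes rows are assigned to shards either as contiguous blocks or by striding (every tenth row), and which of the two is the default may depend on the installed version. The helper names are hypothetical. Both schemes give 4048 rows for shard 0, matching the output above.

```python
def contiguous_shard_size(num_rows: int, num_shards: int, index: int) -> int:
    """Rows in shard `index` when the rows are split into contiguous blocks."""
    base, extra = divmod(num_rows, num_shards)
    # The first `extra` shards each take one additional row.
    return base + (1 if index < extra else 0)


def strided_shard_size(num_rows: int, num_shards: int, index: int) -> int:
    """Rows in shard `index` when every `num_shards`-th row is taken."""
    return len(range(index, num_rows, num_shards))


num_rows, num_shards = 40474, 10
print(contiguous_shard_size(num_rows, num_shards, 0))  # 4048
print(strided_shard_size(num_rows, num_shards, 0))     # 4048

# Every row lands in exactly one shard, so the sizes add up to the total.
sizes = [contiguous_shard_size(num_rows, num_shards, i) for i in range(num_shards)]
print(sum(sizes))  # 40474
```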


Load the tokenizer and try it out:

In [3]:
from transformers import AutoTokenizer
model_path_or_name = "./models/upstage/SOLAR-10.7B-v1.0"
tokenizer = AutoTokenizer.from_pretrained(
    model_path_or_name, 
    use_fast=False
)

In [4]:
tokenizer.tokenize("I'm a short sentence")

['▁I', "'", 'm', '▁a', '▁short', '▁sentence']

Create a helper function:

In [5]:
def tokenization(example):
    # Tokenize
    tokens = tokenizer.tokenize(example["text"])

    # Convert tokens to ids
    token_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Add <bos> and <eos> tokens to the front and back of token_ids
    # bos: beginning of sequence, eos: end of sequence
    token_ids = (
        [tokenizer.bos_token_id]
        + token_ids
        + [tokenizer.eos_token_id]
    )
    example["input_ids"] = token_ids

    # We will be using this column to count the total number of tokens 
    # in the final dataset
    example["num_tokens"] = len(token_ids)
    return example
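
To see what the helper does to a single example without loading a model, here is a minimal sketch using a toy vocabulary. The vocabulary, its ids, and the function name are hypothetical and chosen only for illustration; the real ids come from the tokenizer loaded above.

```python
# Hypothetical toy vocabulary; a real tokenizer supplies its own ids.
TOY_VOCAB = {
    "<s>": 1, "</s>": 2,
    "▁I": 10, "'": 11, "m": 12, "▁a": 13, "▁short": 14, "▁sentence": 15,
}
BOS_ID, EOS_ID = TOY_VOCAB["<s>"], TOY_VOCAB["</s>"]


def wrap_with_special_tokens(tokens):
    """Map tokens to ids, then add BOS at the front and EOS at the back."""
    ids = [TOY_VOCAB[token] for token in tokens]
    return [BOS_ID] + ids + [EOS_ID]


tokens = ["▁I", "'", "m", "▁a", "▁short", "▁sentence"]
input_ids = wrap_with_special_tokens(tokens)
print(input_ids)       # [1, 10, 11, 12, 13, 14, 15, 2]
print(len(input_ids))  # 8, i.e. six tokens plus BOS and EOS
```

The two extra ids per example are why `num_tokens` counts the special tokens as well as the text tokens.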

Tokenize all the examples in the pretraining dataset:

In [6]:
dataset = dataset.map(tokenization, load_from_cache_file=False)
print(dataset)

Dataset({
    features: ['text', 'input_ids', 'num_tokens'],
    num_rows: 4048
})


In [7]:
sample = dataset[3]

print("text", sample["text"][:30])
print("\ninput_ids", sample["input_ids"][:30])
print("\nnum_tokens", sample["num_tokens"])

text The Colorado Climate Center pr

input_ids [1, 415, 15837, 1366, 3314, 6064, 5312, 430, 19102, 304, 1178, 356, 281, 3928, 28725, 9735, 28713, 28725, 264, 1052, 14455, 4623, 28725, 9390, 1452, 274, 28725, 17268, 28713, 28725]

num_tokens 549


Check the total number of tokens in the dataset:

In [8]:
import numpy as np
np.sum(dataset["num_tokens"])

5113663

## 2. Packing the data

![Packing data for training](Images/data_packing.png)

Concatenate the `input_ids` of all examples into a single one-dimensional array:

In [9]:
input_ids = np.concatenate(dataset["input_ids"])
print(len(input_ids))

5113663


In [10]:
max_seq_length = 32

In [11]:
total_length = len(input_ids) - len(input_ids) % max_seq_length
print(total_length)

5113632


Discard the extra tokens at the end of the array so that the number of tokens is exactly divisible by `max_seq_length`:

In [12]:
input_ids = input_ids[:total_length]
print(input_ids.shape)

(5113632,)
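
The truncation arithmetic can be verified by hand. This small sketch uses only the numbers printed above and shows how many full sequences fit and how many trailing tokens are dropped.

```python
total_tokens = 5113663   # length of the concatenated input_ids
max_seq_length = 32

# divmod gives the number of full sequences and the leftover tokens.
num_sequences, leftover = divmod(total_tokens, max_seq_length)
kept_tokens = num_sequences * max_seq_length

print(num_sequences)  # 159801
print(leftover)       # 31
print(kept_tokens)    # 5113632
```

Only 31 tokens are discarded, a negligible share of the roughly 5.1 million tokens in the shard.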


In [13]:
input_ids_reshaped = input_ids.reshape(-1, max_seq_length).astype(np.int32)
input_ids_reshaped.shape  

(159801, 32)

In [14]:
type(input_ids_reshaped)

numpy.ndarray

Convert to Hugging Face dataset:

In [15]:
input_ids_list = input_ids_reshaped.tolist()
packaged_pretrain_dataset = datasets.Dataset.from_dict(
    {"input_ids": input_ids_list}
)
print(packaged_pretrain_dataset)

Dataset({
    features: ['input_ids'],
    num_rows: 159801
})
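
The whole packing step (concatenate, truncate, reshape) can be illustrated on a toy input. The sketch below is a pure-Python equivalent, not the NumPy code used in this notebook; the function name, the token ids, and the sequence length of 6 are hypothetical.

```python
def pack_sequences(examples, max_seq_length):
    """Concatenate token id lists and cut them into fixed-length rows."""
    flat = [token for ids in examples for token in ids]
    # Drop the tail that does not fill a complete row.
    usable = len(flat) - len(flat) % max_seq_length
    return [flat[i:i + max_seq_length] for i in range(0, usable, max_seq_length)]


BOS, EOS = 1, 2  # illustrative special-token ids
docs = [
    [BOS, 11, 12, 13, EOS],
    [BOS, 21, 22, EOS],
    [BOS, 31, 32, 33, 34, 35, EOS],
]

rows = pack_sequences(docs, max_seq_length=6)
for row in rows:
    print(row)
# [1, 11, 12, 13, 2, 1]
# [21, 22, 2, 1, 31, 32]
```

Note that a packed row can span document boundaries: the BOS and EOS ids appear in the middle of rows, which is what lets a model tell where one document ends and the next begins. The last four tokens of the third document do not fill a row and are dropped.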


## 3. Save the packed dataset to disk

In [16]:
packaged_pretrain_dataset.to_parquet("./data/packaged_pretrain_dataset.parquet")

21093732