# Custom Tokenization for DAPT (Domain Adaptive Pre-Training)

This notebook walks through the custom tokenization workflow required for DAPT (Domain Adaptive Pre-training) as shown in the schematic diagram below. 

![pipeline](imgs/tokenization_diagram.png)

### Custom Tokenization Workflow

#### Goal
Given a pre-trained tokenizer trained on general purpose datasets (<b>Original Tokenizer</b>), our goal is to adapt it to a given domain that we want to apply it to (in this notebook, the example domain we are looking at is ChipDesign).

When adapting a pre-trained tokenizer to a given domain, the main goals are to improve tokenization efficiency on domain-specific data, maintain efficiency and language model performance on general purpose datasets, and minimize the effort for retraining/fine-tuning. Since we don't have access to the entire general purpose data used for pretraining the original tokenizer, we want to preserve the existing token mappings, and any new tokens that are added should be strictly an "extension". 

Generally, when adapting a tokenizer to domain-specific data, the goal is to create a tokenizer that is better suited to the vocabulary and structure of that specific domain. This can improve the efficiency and performance of the model on tasks within that domain through efficient representation of domain-specific information.

#### Approach 
The general approach we adopt is to train a <b>Domain Specific Tokenizer</b> from scratch on domain data and use it to identify domain specific tokens that are missing from the original tokenizer. This is done by simply comparing the vocabs of the Original Tokenizer and the newly trained Domain Specific Tokenizer. The missing domain specific tokens are then added to the original tokenizer for extending it to get the final <b>Domain Adapted Tokenizer</b>. 
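
The vocabulary comparison described above can be sketched as a set difference. This is a minimal illustration on toy vocabularies; the dictionaries and the `find_missing_tokens` name are hypothetical and are not part of the notebook's helper modules.

```python
# Toy vocabularies (token -> id). In practice these would come from the
# Original Tokenizer and the Domain Specific Tokenizer.
original_vocab = {"the": 0, "chip": 1, "design": 2, "net": 3}
domain_vocab = {"the": 0, "netlist": 1, "verilog": 2, "chip": 3, "flipflop": 4}


def find_missing_tokens(original, domain):
    """Return domain tokens absent from the original vocabulary, sorted for reproducibility."""
    return sorted(set(domain) - set(original))


missing_tokens = find_missing_tokens(original_vocab, domain_vocab)
print(missing_tokens)  # ['flipflop', 'netlist', 'verilog']
```

Only the tokens in `missing_tokens` are candidates for extending the original tokenizer; tokens already present keep their existing ids.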

#### Tradeoff 
However, there is a tradeoff to adding missing domain specific tokens to the Original Tokenizer. The challenge is to balance this tradeoff between tokenization efficiency on domain data vs disturbance to the performance on general-purpose data as a result of adding domain specific tokens to the Original Tokenizer.

For instance, adding a large number of domain-specific tokens can lead to higher efficiency on domain-specific data, but the DAPT process might take longer, since the loss takes longer to converge when efficiency/performance on the general-purpose data is disturbed.

On the other hand, adding only a small number of domain-specific tokens helps maintain efficiency on general-purpose data, but may leave gaps in coverage of the domain-specific dataset.
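
To make the efficiency side of this tradeoff concrete, the sketch below counts tokens for a short domain string with and without two added domain tokens. It uses a toy greedy longest-match tokenizer, which is an assumption made purely for illustration; real BPE tokenizers apply learned merge rules instead, and all names and vocabularies here are hypothetical.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


base_vocab = {"net", "list", "flip", "flop", "the", " "}
extended_vocab = base_vocab | {"netlist", "flipflop"}

domain_text = "the netlist flipflop"
base_tokens = greedy_tokenize(domain_text, base_vocab)
extended_tokens = greedy_tokenize(domain_text, extended_vocab)

print(len(base_tokens), len(extended_tokens))  # 7 5
```

Fewer tokens per string means shorter sequences on domain data, but every added token also adds an embedding row that has to be learned during DAPT.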

#### Balancing The Tradeoff
To balance this tradeoff, instead of adding all identified missing domain-specific tokens to the original tokenizer, we identify the most frequently occurring tokens using a threshold and only add the ones with usage frequencies above the given threshold to get the final Domain Adapted Tokenizer. 

For identifying the most frequently used tokens, we first extend the Original Tokenizer by adding all identified missing domain-specific tokens to get an <b>Extended Tokenizer</b>. The Extended Tokenizer is then applied to the domain-specific data in order to identify high-frequency tokens. Thus the Extended Tokenizer is just an intermediate step towards building a Domain Adapted Tokenizer.

Finally, the Original Tokenizer is extended using only high frequency tokens to get the final <b>Domain Adapted Tokenizer</b>. 

## Notebook Outline

To achieve the process described above, we’ve developed a step-by-step approach that this notebook will walk you through:

- Step 0: Install pre-requisites and import the required modules
- Step 1: Download llama-2-70b embedding model and tokenizer (<b>Original Tokenizer</b>). Convert the original weights to trainable format and save. 
- Step 2: Train an opt-350m tokenizer from scratch using domain-specific data to get a <b>Domain Specific Tokenizer</b>.
- Step 3: From the vocabulary of the newly trained tokenizer, identify tokens that are absent in the general-purpose tokenizer and are rarely found in general-purpose datasets. Next, expand the general-purpose tokenizer with the newly identified tokens to get an <b>Extended Tokenizer</b>.
- Step 4: Apply the Extended Tokenizer to the domain-specific dataset, analyze the usage frequencies of the newly-added tokens, and select the top-K tokens in a way that their cumulative frequency accounts for approximately 98% (a hyper-parameter) of the total frequency of the new tokens.
- Step 5: Initialize the embeddings of the new tokens by utilizing the general-purpose tokenizer i.e., Original Tokenizer. When a new token is encountered, it is tokenized using the pretrained general-purpose tokenizer. The embedding and output layer weights corresponding to the new token are determined by averaging the embeddings / weights corresponding to the tokens generated by the general-purpose tokenizer.
- Step 6: Merge the new embeddings with the original embedding table (in llama-2-70b) to get the final <b>Domain Adapted Tokenizer</b>.

## Data

In this playbook, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. The data has been processed and curated using [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main) as shown in this [playbook](https://github.com/jvamaraju/ndc_dapt_playbook/tree/dapt_jv).

## NeMo Tools and Resources

* [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)

## Software Requirements
* Access to the latest NeMo Framework NGC Containers
* This playbook has been tested on: nvcr.io/nvidia/nemo:24.07. It is expected to work similarly on other environments. 

## Hardware Requirements
* This playbook can run on CPUs or GPUs. For GPUs, this playbook has been tested on a minimum of 1x A100 80GB.

## Step 0: Install the prerequisites and import the required modules

In [1]:
! pip install datasets sentencepiece jsonlines tokenizers transformers torch ftfy matplotlib
! pip install protobuf==3.20.1
! pip install --upgrade jupyter ipywidgets widgetsnbextension pandas-profiling

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting protobuf==3.20.1
  Downloading protobuf-3.20.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (698 bytes)
Downloading protobuf-3.20.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 42.6 MB/s eta 0:00:00
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 5.28.3
    Uninstalling protobuf-5.28.3:
      Successfully uninstalled protobuf-5.28.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 24.2.0 requires pandas<1.6.0dev0,>=1.3, but you have pandas 2.2.2 which is incompatible.
cudf 24.2.0 requ

In [2]:
import json  # used later to load the selected high-frequency tokens
import os
import sys
import torch
from datasets import Dataset
from datasets import IterableDataset
from datasets import load_dataset
import jsonlines
import glob
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)
from transformers import AutoTokenizer
from tokenization_helper import *
from extend_tokenizer_utils import extend_tokenizer, extend_tokenizer_high_freq_tokens
from get_high_freq_tokens import get_high_freq_tokens
from util import load_weights, merge_embed

## Step 1: Download llama-2-70b embedding model and tokenizer (Original Tokenizer). Convert the original weights to trainable format and save. 

The Original Tokenizer model used here is the llama2 tokenizer which is a Byte Pair Encoding (BPE) model based on sentencepiece.

Here we first log in to Hugging Face before downloading the model, since the model is in a restricted repo.

In [None]:
# Install the hugging face CLI
! pip install -U "huggingface_hub[cli]"
# Generate a user access token at https://huggingface.co/settings/tokens

# To download the model, please login via huggingface-cli login since it is a restricted repo
! huggingface-cli login
# You will be prompted to enter your User Access Token. Copy and paste the token, then press Enter. The CLI will verify the token and save it locally.

In [3]:
# create directory for storing the downloaded hugging face model 
os.makedirs("models/weight/llama2-hf", exist_ok=True)

# create directories for storing the model weights 
os.makedirs("models/weight/llama2/ori_llama2-hf_weight", exist_ok=True)
os.makedirs("models/weight/llama2/new_llama2-hf_weight", exist_ok=True)

# create directories for storing the tokenizers
os.makedirs("models/tokenizer/llama2/original_tokenizer", exist_ok=True)
os.makedirs("models/tokenizer/llama2/new_tokenizer", exist_ok=True)

Before running the next step, make sure you have been granted access to Meta's Llama2 models gated group. You can fill out the form available at https://huggingface.co/meta-llama/Llama-2-7b to request access. (Takes ~20 minutes)

In [4]:
# download llama2-70b model weights and tokenizer 
! huggingface-cli download meta-llama/Llama-2-70b --local-dir ./models/weight/llama2-hf/

# Copy the original tokenizer to a different folder
! cp ./models/weight/llama2-hf/tokenizer.model ./models/tokenizer/llama2/original_tokenizer

# Load embedding and output layer weights (size = (vocab size, embedding dim)) from each snapshot and create a dict
load_path = "./models/weight/llama2-hf"
save_path = './models/weight/llama2/ori_llama2-hf_weight'

if not os.path.exists(save_path):
    os.makedirs(save_path)
    
# Load the weights and store them in a dictionary suitable for NeMo
load_weights(load_path, save_path)

Fetching 17 files:   0%|                                 | 0/17 [00:00<?, ?it/s]Downloading 'Responsible-Use-Guide.pdf' to 'models/weight/llama2-hf/.cache/huggingface/download/Responsible-Use-Guide.pdf.525dc349d71fe257fce4098c146446df6fef4247174f351381e4c3214af126f0.incomplete'
Downloading 'README.md' to 'models/weight/llama2-hf/.cache/huggingface/download/README.md.f6b3e8c0fbb151970936e998ef90fd7e6a6cf1c3.incomplete'
Downloading 'consolidated.01.pth' to 'models/weight/llama2-hf/.cache/huggingface/download/consolidated.01.pth.edf5bc0efacd437b2571beb53245ae2c7343648832c89c983d17e9736af9efcd.incomplete'
Downloading 'consolidated.00.pth' to 'models/weight/llama2-hf/.cache/huggingface/download/consolidated.00.pth.af5747340beaf5414afaeae46def2e8d0740a3002309a0d99a3062b70300bdbd.incomplete'

README.md: 100%|███████████████████████████| 22.3k/22.3k [00:00<00:00, 82.3MB/s]
Download complete. Moving file to models/weight/llama2-hf/README.md
Downloading '.gitattributes' to 'models/weight/llam

In [5]:
# check layers and dimensions (optional)
state_dict = torch.load(f"{load_path}/consolidated.01.pth")
for index, (key, value) in enumerate(state_dict.items()):
    print(f"Index: {index}, layer: {key}, Layer size: {value.size()}")

Index: 0, layer: tok_embeddings.weight, Layer size: torch.Size([32000, 1024])
Index: 1, layer: norm.weight, Layer size: torch.Size([8192])
Index: 2, layer: output.weight, Layer size: torch.Size([4000, 8192])
Index: 3, layer: layers.0.attention.wq.weight, Layer size: torch.Size([1024, 8192])
Index: 4, layer: layers.0.attention.wk.weight, Layer size: torch.Size([128, 8192])
Index: 5, layer: layers.0.attention.wv.weight, Layer size: torch.Size([128, 8192])
Index: 6, layer: layers.0.attention.wo.weight, Layer size: torch.Size([8192, 1024])
Index: 7, layer: layers.0.feed_forward.w1.weight, Layer size: torch.Size([3584, 8192])
Index: 8, layer: layers.0.feed_forward.w2.weight, Layer size: torch.Size([8192, 3584])
Index: 9, layer: layers.0.feed_forward.w3.weight, Layer size: torch.Size([3584, 8192])
Index: 10, layer: layers.0.attention_norm.weight, Layer size: torch.Size([8192])
Index: 11, layer: layers.0.ffn_norm.weight, Layer size: torch.Size([8192])
Index: 12, layer: layers.1.attention.wq.w

## Step 2: Train a tokenizer from scratch using domain-specific data to get a Domain Specific Tokenizer.

First, we train a tokenizer from scratch using domain-specific data.

The tokenizer that we use is the facebook/opt-350m model tokenizer, available on <a href=https://huggingface.co/facebook/opt-350m>Hugging Face</a>. Like the llama-2 tokenizer, the opt-350m tokenizer is a Byte Pair Encoding (BPE) model, and since we are training from scratch we could use either of them. In fact, we can use the tokenizer of any model that is based on BPE, since what matters is the training algorithm inside the tokenizer. However, we chose opt-350m since it has a more general-purpose design and can be used flexibly across different tasks/domains and with various models beyond the OPT series. The llama-2 tokenizer, on the other hand, is designed specifically for the llama-2 architecture, optimizing performance for the tasks that the llama-2 model is intended to handle. 

The two hyperparameters that need to be set here are ```batch_size``` and ```vocab_size```. <br>

```batch_size``` : is the batch size used in the tokenization process.

```vocab_size``` : is the target vocab size for training the tokenizer. This depends on the original tokenizer and should be slightly higher than half of the original vocab size. Note that this doesn't have to equal the number of new tokens that will be added. 
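
As a rough sketch of how these two hyperparameters could be used, the snippet below batches a corpus and passes the batches to the `train_new_from_iterator` method of a fast Hugging Face tokenizer. The internals of the notebook's `train_tokenizer` helper are not shown here, so this is an assumption about the general pattern rather than a description of that helper; `batch_iterator`, `train_from_scratch`, and the toy corpus are hypothetical names.

```python
def batch_iterator(texts, batch_size):
    """Yield consecutive lists of at most `batch_size` texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def train_from_scratch(base_tokenizer, texts, batch_size, vocab_size):
    """Train a new tokenizer with the same pipeline as `base_tokenizer`.

    `base_tokenizer` is assumed to be a fast Hugging Face tokenizer, for example
    the one returned by AutoTokenizer.from_pretrained("facebook/opt-350m").
    """
    return base_tokenizer.train_new_from_iterator(
        batch_iterator(texts, batch_size), vocab_size=vocab_size
    )


# Toy corpus: only the batching is exercised here, no tokenizer is downloaded.
corpus = [f"module m{i}(input clk);" for i in range(2500)]
batch_sizes = [len(batch) for batch in batch_iterator(corpus, batch_size=1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

Here `batch_size` only controls how many texts are handed to the trainer at a time, while `vocab_size` caps the size of the learned vocabulary.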


In [6]:
data_root = "./data/"       # path where the domain specific data is stored
save_root = "./models/tokenizer/llama2/"    # path to save the finetuned opt tokenizer
batch_size = 1000    # batch size used in the tokenization process
vocab_size = 20000   # target vocab size for training opt tokenizer

# ensure that the directory exists before changing permissions
directory = "../code/"
is_directory = os.path.isdir(directory)
print(f"Is a directory: {is_directory}")

# change permissions to ensure we have read, write and execute permissions
! chmod ugo+rwx ../code/

Is a directory: True


In [7]:
# Train a tokenizer from scratch and save output files
keys = ["text"] # keys to extract from json files
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m") # load pre-trained tokenizer (https://huggingface.co/facebook/opt-350m)
# Train the tokenizer from scratch on a new corpus with the same defaults (in terms of special tokens or tokenization pipeline) as the current one.
tokenizer = train_tokenizer(data_root, batch_size, vocab_size, tokenizer, keys)

#Save and print paths
tokenizer.save_pretrained(save_root + "custom_tokenizer_init_" + str(vocab_size) + ".json")



Before Training: 
total token cnt 0



After Training: 
total token cnt 0


('./models/tokenizer/llama2/custom_tokenizer_init_20000.json/tokenizer_config.json',
 './models/tokenizer/llama2/custom_tokenizer_init_20000.json/special_tokens_map.json',
 './models/tokenizer/llama2/custom_tokenizer_init_20000.json/vocab.json',
 './models/tokenizer/llama2/custom_tokenizer_init_20000.json/merges.txt',
 './models/tokenizer/llama2/custom_tokenizer_init_20000.json/added_tokens.json',
 './models/tokenizer/llama2/custom_tokenizer_init_20000.json/tokenizer.json')

## Step 3: From the vocabulary of the newly trained tokenizer, identify tokens that are absent in the general-purpose tokenizer and are rarely found in general-purpose datasets. Next, expand the general-purpose tokenizer with the newly identified tokens to get an Extended Tokenizer.

Here we expand/resize the model embeddings of the original general-purpose tokenizer with the tokens newly identified in this step to get an Extended Tokenizer.

The two hyperparameters that need to be set here are ```split``` and ```model_type```. 

```split```: is the number of partitions (.pt files) to split the embeddings into for the purpose of model parallelism.

```model_type``` : this is the original tokenizer model (llama2 in our case)
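
The output of the next cell reports a token pattern (`[a-zA-Z]`) together with counts of added and dropped tokens. The sketch below shows one way such a split could be computed. It is an assumption about the general idea, not the actual logic of `extend_tokenizer`: the function name, the use of `re.search`, and the marker normalization are all hypothetical choices made for illustration.

```python
import re

TOKEN_PATTERN = re.compile(r"[a-zA-Z]")  # pattern reported in the cell output


def split_candidates(domain_tokens, original_vocab, pattern=TOKEN_PATTERN):
    """Partition domain tokens into (added, dropped) lists."""
    added, dropped = [], []
    for token in domain_tokens:
        # Byte-level BPE vocabularies (such as OPT's) mark a leading space with "Ġ",
        # while SentencePiece uses "▁"; normalize before comparing.
        normalized = token.replace("Ġ", "▁")
        if normalized not in original_vocab and pattern.search(normalized):
            added.append(normalized)
        else:
            dropped.append(token)
    return added, dropped


domain_tokens = ["Ġthe", "Ġnetlist", "verilog", "##", "123", "Ġclk"]
original_vocab = {"▁the", "▁clk"}

added, dropped = split_candidates(domain_tokens, original_vocab)
print(len(added), len(dropped))  # 2 4
```

Tokens already known to the original tokenizer and tokens without any letters end up in `dropped`, so only new, word-like tokens are used to extend the vocabulary.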


In [8]:
split = 8      # number of partitions to split the embeddings of domain-adapted tokenizer
model_type = "llama2" # Add more model_types if you want the codebase to support alternate ones
extend_tokenizer(vocab_size, split, model_type)

Domain vocab size: 258
token pattern:  [a-zA-Z]
Num of added tokens and dropped tokens 54 204
Original model pieces: 32000
input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
model_type: BPE
vocab_size: 32000
self_test_sample_size: 0
input_format: "text"
character_coverage: 0.99995
input_sentence_size: 200000000
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
num_threads: 80
num_sub_iterations: 2
max_sentence_length: 4192
shuffle_input_sentence: true
max_sentencepiece_length: 16
split_by_unicode_script: true
split_by_whitespace: true
split_by_number: true
treat_whitespace_as_suffix: false
split_digits: true
allow_whitespace_only_pieces: true
vocabulary_output_piece_score: true
hard_vocab_limit: true
use_all_vocab: false
byte_fallback: true
required_chars: ""
unk_id: 0
bos_id: 1
eos_id: 2
pad_id: -1
unk_surface: " \342\201\207 "
unk_piece: "<unk>"
bos_piece: "<s>"
eos_piece: "</s>"
pad_piece: "<pad>"


## Step 4: Use the Extended Tokenizer to analyze the frequency of newly added tokens

Here we apply the extended tokenizer to the domain-specific dataset, analyzing the usage frequencies of the newly-added tokens, and selecting the top-K tokens in a way that their cumulative frequency accounts for approximately 98% (a hyper-parameter: ```freq_threshold```) of the total frequency of the new tokens.

The idea is that only high-frequency tokens will be added to the vocabulary of the original tokenizer to get the final domain adapted tokenizer. 

The benefits of high-frequency token analysis have been explored in several studies ([Liu, Mingjie, et al](https://research.nvidia.com/publication/2023-10_chipnemo-domain-adapted-llms-chip-design); [Lian, Haoran, et al](https://arxiv.org/abs/2404.17808)). Previous studies have shown that disparities in token frequencies can result in imbalanced learning difficulties across different tokens. For instance, low-frequency tokens are harder for models to learn ([Su, Zhenpeng, et al](https://arxiv.org/abs/2310.19531); [Lin, Tsung-Yi, et al](https://openaccess.thecvf.com/content_iccv_2017/html/Lin_Focal_Loss_for_ICCV_2017_paper.html)).

We use two functions for frequency analysis. Helper function `analyze_token_usage` applies the extended tokenizer to domain-specific data, and stores the usage/occurrence frequencies of the newly-added tokens at `token_usage_path`. <br>
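
A minimal sketch of the counting step is shown below. It assumes that newly added tokens receive ids at or above the original vocabulary size (32000 for llama2) and that the data has already been encoded into lists of token ids; `count_new_token_usage` is a hypothetical name and not the implementation of `analyze_token_usage`.

```python
from collections import Counter

ORIGINAL_VOCAB_SIZE = 32000  # assumption: ids >= this value belong to newly added tokens


def count_new_token_usage(encoded_batches, original_vocab_size=ORIGINAL_VOCAB_SIZE):
    """Count how often each newly added token id occurs in the encoded data."""
    usage = Counter()
    for batch in encoded_batches:
        for token_ids in batch:
            usage.update(t for t in token_ids if t >= original_vocab_size)
    return usage


# Two toy batches of already-encoded documents.
encoded_batches = [
    [[1, 32001, 15, 32005], [32001, 7]],
    [[32001, 32005, 32010]],
]
token_usage = count_new_token_usage(encoded_batches)
print(token_usage.most_common())  # [(32001, 3), (32005, 2), (32010, 1)]
```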

Helper function `get_high_freq_tokens` looks at the token usage frequencies from above and performs a binary search to search for domain specific tokens with usage frequency above the specified threshold (`freq_threshold` parameter). It stores the tokens it finds at `high_freq_tokens_path`.
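
The selection rule can be sketched as follows: rank tokens by usage, accumulate their counts, and binary-search for the smallest prefix whose cumulative count reaches `freq_threshold` of the total. This is an illustrative sketch on made-up counts; the function name is hypothetical and the real `get_high_freq_tokens` helper may differ in its details.

```python
from bisect import bisect_left
from itertools import accumulate


def select_high_freq_tokens(token_usage, freq_threshold):
    """Return the most-used tokens whose cumulative usage reaches the threshold."""
    ranked = sorted(token_usage.items(), key=lambda item: item[1], reverse=True)
    cumulative = list(accumulate(count for _, count in ranked))
    if not cumulative or cumulative[-1] == 0:
        return []
    target = freq_threshold * cumulative[-1]
    # Binary search for the first prefix whose cumulative count reaches the target.
    top_k = bisect_left(cumulative, target) + 1
    return [token for token, _ in ranked[:top_k]]


token_usage = {"netlist": 500, "verilog": 300, "flipflop": 150, "clkbuf": 40, "xyzzy": 10}
selected = select_high_freq_tokens(token_usage, freq_threshold=0.98)
print(selected)  # ['netlist', 'verilog', 'flipflop', 'clkbuf']
```

With these counts the first three tokens cover 95% of the usage and the first four cover 99%, so four tokens are kept and the rarest one is left out.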

In [9]:
split = 8      # number of partitions to split the embeddings of domain-adapted tokenizer
model_type = "llama2"
tag = "code_gen"
keys = ["text"]
# path to the saved extended tokenizer (from previous step)
extended_tokenizer_path = f"./models/tokenizer/{model_type}/new_tokenizer/tokenizer_{tag}.model"
# path to save token usage frequency analysis results
token_usage_path = f"./models/tokenizer/{model_type}/new_tokenizer/{model_type}_token_usage.json"

In [10]:
# analyze tokens using frequency analysis
analyze_token_usage(data_root, extended_tokenizer_path, batch_size, keys, token_usage_path)

vocab_size:  33024


In [11]:
# path to save selected high-frequency tokens (new tokens to be added)
high_freq_tokens_path = f"./models/tokenizer/{model_type}/new_tokenizer/{model_type}_freq_analy_new_token.json"

# hyperparameter 
freq_threshold = 0.98

In [12]:
# selecting the top-K tokens in a way that their cumulative frequency accounts for approximately 98%
get_high_freq_tokens(token_usage_path, high_freq_tokens_path, float(freq_threshold))

## Step 5: Initialize the embeddings of the new tokens by utilizing the general-purpose tokenizer

Here we use the `extend_tokenizer_high_freq_tokens` helper function to first add the high-frequency tokens identified in Step 4 to the original tokenizer vocab.

Both the embedding table and the output layer weights of the original model depend on the vocab size. Since the vocab size has now changed due to the addition of high-frequency domain-specific tokens, both of these need to be updated.

`extend_sentencepiece` initializes the embeddings of the new tokens by utilizing the general-purpose tokenizer. When a new token (a word or subword unit) is encountered, it is first broken down (tokenized) using the pretrained general-purpose tokenizer. 

The new token doesn’t have a predefined embedding (a numerical representation). The embedding of the new token is determined by averaging the embeddings of the tokens generated by the general-purpose tokenizer. For example, if the new token is split into three sub-tokens, the embeddings of these three sub-tokens are averaged to form the embedding of the new token.

Similarly, the weights in the output layer corresponding to the new token are also initialized to the average of the tokens generated by the general-purpose tokenizer. For example, if the new token is split into three sub-tokens, the weights corresponding to these three sub-tokens are averaged to form the weights corresponding to the new token.
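
The averaging rule can be illustrated with a tiny embedding table. The sketch below uses plain Python lists so that it runs without extra dependencies; the table values, the `average_rows` name, and the sub-token ids are made up for illustration. The same averaging applies to the corresponding rows of the output layer.

```python
def average_rows(table, row_ids):
    """Element-wise mean of the selected rows of a 2-D list."""
    rows = [table[i] for i in row_ids]
    return [sum(column) / len(rows) for column in zip(*rows)]


# Toy embedding table: 4 original tokens, embedding dimension 3.
embedding_table = [
    [0.0, 0.0, 0.0],  # id 0
    [1.0, 2.0, 3.0],  # id 1, e.g. "net"
    [3.0, 4.0, 5.0],  # id 2, e.g. "list"
    [9.0, 9.0, 9.0],  # id 3
]

# Suppose the original tokenizer splits the new token "netlist" into ids [1, 2].
sub_token_ids = [1, 2]
new_embedding = average_rows(embedding_table, sub_token_ids)
embedding_table.append(new_embedding)  # the new token takes the next free id

print(new_embedding)  # [2.0, 3.0, 4.0]
```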

Once done, in Step 6 we will merge the new embeddings with the original embedding table (in llama2) to get the final Domain Adapted Tokenizer.

In [13]:
ori_tokenizer_path = f"./models/tokenizer/{model_type}/original_tokenizer/tokenizer.model" # original sentencepiece tokenizer model
new_vocab_path = f"./models/tokenizer/{model_type}/new_tokenizer/freq_vocab.json" # path to record added new tokens
old_ebd_path = f"./models/weight/{model_type}/ori_{model_type}-hf_weight/" # original embeddings
new_ebd_path = f"./models/weight/{model_type}/new_{model_type}-hf_weight/" # path to store augmented embeddings
domain_adapter_tokenizer_path = f"./models/tokenizer/{model_type}/new_tokenizer/tokenizer_freq.model" # augmented sentencepiece model
split = 8 # num of partitions to split the augmented embeddings

In [14]:
# load the selected high-frequency tokens; the file is closed automatically
with open(high_freq_tokens_path, "r") as f:
    new_tokens = json.load(f)
print("new_tokens: ", new_tokens)
extend_tokenizer_high_freq_tokens(data_root, ori_tokenizer_path, new_tokens, new_vocab_path, domain_adapter_tokenizer_path, old_ebd_path, new_ebd_path, split)

new_tokens:  []
token_cnt with original tokenizer: 
total token cnt 0
original vocab_size:  32000
added normal vocab:  0
added dummy vocab:  1024
new vocab_size:  33024
padded vocab:  768
total cnt (with padding vocab):  33792
token_cnt with customized tokenizer: 
total token cnt 0
word_embedding shape:  torch.Size([32000, 8192])
output_layer shape:  torch.Size([32000, 8192])
Completed saving new embeddings


In [15]:
print(new_ebd_path) #New weights

./models/weight/llama2/new_llama2-hf_weight/


In [16]:
print(domain_adapter_tokenizer_path) # domain-adapted tokenizer

./models/tokenizer/llama2/new_tokenizer/tokenizer_freq.model


## Step 6: Merge the new embeddings with the original embedding table (in llama2) to get the final Domain Adapted Tokenizer and Embeddings.

Helper function `merge_embed` takes the original embeddings downloaded from hugging face, and the augmented embeddings generated in Step 5 above, merges them and then saves the result at `save_path`.

The figure below illustrates the embedding table modification. Each row corresponds to a unique token and each column represents a dimension of the embedding vector, so the size of the vocabulary determines the number of rows in the embedding table. The embedding layer in the LLM, which is responsible for converting the data into numerical vectors, uses the embedding table to perform this conversion. The dimensionality of the embedding layer is given by the number of columns in the embedding table. <br>

![pipeline](imgs/embedding_table.png)
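
The effect of extending the vocabulary on the stored weights can be checked with some simple shape arithmetic. The sketch below assumes the layout visible in the checkpoint listings printed in this notebook (8 partitions, `tok_embeddings.weight` split along the hidden dimension and `output.weight` split along the vocabulary); `shard_shapes` is a hypothetical helper and not part of `merge_embed`.

```python
def shard_shapes(vocab_size, hidden_size, num_shards):
    """Per-partition shapes for the embedding table and the output layer.

    Assumes tok_embeddings is split along the hidden dimension and
    output is split along the vocabulary dimension.
    """
    if vocab_size % num_shards or hidden_size % num_shards:
        raise ValueError("vocab_size and hidden_size must be divisible by num_shards")
    return {
        "tok_embeddings.weight": (vocab_size, hidden_size // num_shards),
        "output.weight": (vocab_size // num_shards, hidden_size),
    }


before = shard_shapes(vocab_size=32000, hidden_size=8192, num_shards=8)
after = shard_shapes(vocab_size=33024, hidden_size=8192, num_shards=8)

print(before)  # {'tok_embeddings.weight': (32000, 1024), 'output.weight': (4000, 8192)}
print(after)   # {'tok_embeddings.weight': (33024, 1024), 'output.weight': (4128, 8192)}
```

Only the vocabulary dimension grows (from 32000 to 33024 rows in total); the embedding dimension is unchanged.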

In [None]:
os.makedirs(f"./models/weight/new_merged_{model_type}-hf", exist_ok=True)  # relative path, matching save_path below

In [22]:
old_ebd_path = f"./models/weight/{model_type}-hf" # original embeddings downloaded from hf
new_ebd_path = f"./models/weight/{model_type}/new_{model_type}-hf_weight" # augmented embeddings
save_path = f"./models/weight/new_merged_{model_type}-hf" # Path to adapted llama2 weights
merge_embed(old_ebd_path, new_ebd_path, save_path)

Directory './models/weight/new_merged_llama2-hf' exists
Directory './models/weight/new_merged_llama2-hf' exists
Directory './models/weight/new_merged_llama2-hf' exists
Directory './models/weight/new_merged_llama2-hf' exists
Directory './models/weight/new_merged_llama2-hf' exists
Directory './models/weight/new_merged_llama2-hf' exists
Directory './models/weight/new_merged_llama2-hf' exists
Directory './models/weight/new_merged_llama2-hf' exists


### New weights and tokenizer are stored at:

In [23]:
print(new_ebd_path) #New weights

./models/weight/llama2/new_llama2-hf_weight


In [24]:
print(domain_adapter_tokenizer_path) # domain-adapted tokenizer

./models/tokenizer/llama2/new_tokenizer/tokenizer_freq.model


In [25]:
# check layers and dimensions (optional)
state_dict = torch.load(f'{save_path}/consolidated.01.pth')
for index, (key, value) in enumerate(state_dict.items()):
    print(f"Index: {index}, layer: {key}, Layer size: {value.size()}")

Index: 0, layer: tok_embeddings.weight, Layer size: torch.Size([33024, 1024])
Index: 1, layer: norm.weight, Layer size: torch.Size([8192])
Index: 2, layer: output.weight, Layer size: torch.Size([4128, 8192])
Index: 3, layer: layers.0.attention.wq.weight, Layer size: torch.Size([1024, 8192])
Index: 4, layer: layers.0.attention.wk.weight, Layer size: torch.Size([128, 8192])
Index: 5, layer: layers.0.attention.wv.weight, Layer size: torch.Size([128, 8192])
Index: 6, layer: layers.0.attention.wo.weight, Layer size: torch.Size([8192, 1024])
Index: 7, layer: layers.0.feed_forward.w1.weight, Layer size: torch.Size([3584, 8192])
Index: 8, layer: layers.0.feed_forward.w2.weight, Layer size: torch.Size([8192, 3584])
Index: 9, layer: layers.0.feed_forward.w3.weight, Layer size: torch.Size([3584, 8192])
Index: 10, layer: layers.0.attention_norm.weight, Layer size: torch.Size([8192])
Index: 11, layer: layers.0.ffn_norm.weight, Layer size: torch.Size([8192])
Index: 12, layer: layers.1.attention.wq.w

## Next Step

The final Domain Adapted Tokenizer obtained using this notebook can be used in a continual pre-training pipeline for domain adaptive pre-training.