We use the-verdict.txt to demonstrate manual tokenization. First, count the total number of characters:

In [1]:
with open('../../datasets/the_verdict/the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()
print("Total number of characters:", len(raw_text)) # 20479

Total number of characters: 20479


A simple regex-based tokenizer that splits on punctuation, double dashes, and whitespace:

In [2]:
import re

def tokenize_scratch(text: str) -> list[str]:
    result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    result = [item for item in result if item.strip()]
    return result

tokenized_text = tokenize_scratch(raw_text)
print("Total number of tokens (from scratch):", len(tokenized_text))

Total number of tokens (from scratch): 4690
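
To see what the regular expression in tokenize_scratch does, here is a minimal sketch on a short made-up sentence (the sample string is illustrative and not taken from the-verdict.txt). The capturing group keeps the separators in the result, and the filter then drops empty strings and whitespace-only items:

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "Hello, world. Is this-- a test?"

# Same pattern as tokenize_scratch: the capturing group keeps the separators
parts = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only items
tokens = [item for item in parts if item.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```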


To convert tokens into token IDs, we first need to build a vocabulary:

In [3]:
def build_vocab_scratch(tokenized_text: list[str]) -> tuple[dict[str, int], int]:
    all_tokens = sorted(set(tokenized_text))
    vocab_size = len(all_tokens)
    vocab = {token: idx for idx, token in enumerate(all_tokens)}
    return vocab, vocab_size

vocab, vocab_size = build_vocab_scratch(tokenized_text)
print("Total number of unique tokens (vocab size):", vocab_size)

Total number of unique tokens (vocab size): 1130
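
The same construction on a toy token list makes the mapping easier to inspect. This is only a sketch: the token list below is made up, and the point is that duplicates collapse and IDs follow alphabetical order:

```python
# Toy token list with a repeated token (illustrative only)
toy_tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Deduplicate and sort, then number the tokens in order
unique_sorted = sorted(set(toy_tokens))
toy_vocab = {token: idx for idx, token in enumerate(unique_sorted)}

print(toy_vocab)
# {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
```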


Wrap the vocabulary in a tokenizer class, which also builds an inverse vocabulary that maps token IDs back to their tokens:

In [4]:
class SimpleTokenizer:
    def __init__(self, vocab: dict[str, int]):
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}
        
    def encode(self, text: str) -> list[int]:
        tokenized_text = tokenize_scratch(text)
        ids = [self.str_to_int[token] for token in tokenized_text]
        return ids
    
    def decode(self, ids: list[int]) -> str:
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
    
tokenizer = SimpleTokenizer(vocab)
text = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
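
Two details of this class are worth isolating. The sketch below uses a made-up four-token vocabulary (an assumption for illustration) to show how the decode regex removes the space that join() puts before punctuation, and why a token missing from the vocabulary would make encode fail with a KeyError:

```python
import re

# Joining tokens with spaces leaves a space before each punctuation mark
joined = " ".join(["Hello", ",", "world", "!"])  # 'Hello , world !'

# Same substitution as SimpleTokenizer.decode
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)
print(cleaned)  # Hello, world!

# Hypothetical tiny vocabulary: a plain dict lookup fails for unseen tokens
toy_vocab = {"Hello": 0, ",": 1, "world": 2, "!": 3}
try:
    toy_vocab["tea"]
    lookup_failed = False
except KeyError as err:
    lookup_failed = True
    print("Unknown token:", err)
```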


Next, extend the tokenizer with two special tokens, <|unk|> and <|endoftext|>.
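
As a minimal sketch of the idea, assuming a made-up seven-entry vocabulary and a hypothetical helper named encode_with_unk, any token that is not in the vocabulary is mapped to the ID of <|unk|>:

```python
# Hypothetical toy vocabulary with the two special tokens appended at the end
toy_vocab = {
    "do": 0, "you": 1, "like": 2, "tea": 3, "?": 4,
    "<|endoftext|>": 5, "<|unk|>": 6,
}

def encode_with_unk(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Look up each token, falling back to the <|unk|> ID when it is missing."""
    unk_id = vocab["<|unk|>"]
    return [vocab.get(token, unk_id) for token in tokens]

toy_ids = encode_with_unk(["do", "you", "like", "coffee", "?"], toy_vocab)
print(toy_ids)  # [0, 1, 2, 6, 4]
```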

In [5]:
def build_vocab_scratch(tokenized_text: list[str]) -> tuple[dict[str, int], int]:
    all_tokens = sorted(list(set(tokenized_text)))
    all_tokens.extend(["<|endoftext|>", "<|unk|>"])
    vocab_size = len(all_tokens)
    vocab = {token: idx for idx, token in enumerate(all_tokens)}
    return vocab, vocab_size

class SimpleTokenizer:
    def __init__(self, vocab: dict[str, int]):
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}

    def encode(self, text: str) -> list[int]:
        tokenized_text = tokenize_scratch(text)
        # Map out-of-vocabulary tokens to the <|unk|> special token
        tokenized_text = [token if token in self.str_to_int else "<|unk|>" for token in tokenized_text]
        ids = [self.str_to_int[token] for token in tokenized_text]
        return ids

    def decode(self, ids: list[int]) -> str:
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that join() inserted before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

vocab, vocab_size = build_vocab_scratch(tokenized_text)
print("Total number of unique tokens (vocab size):", vocab_size)
tokenizer = SimpleTokenizer(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
ids = tokenizer.encode(text)
print(ids)
print(tokenizer.decode(ids))

Total number of unique tokens (vocab size): 1132
[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]
<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


GPT models use byte pair encoding (BPE).

Implementing BPE is relatively involved, so we use the tiktoken library, which implements the algorithm efficiently with a Rust core:

In [6]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace." 
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"}) 
print(ids)
print(tokenizer.decode(ids))

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]
Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


Next we implement a data loader that uses a sliding window to extract input-target pairs from the training dataset. First, tokenize the full text:

In [7]:
enc_text = tokenizer.encode(raw_text)
print("Total number of tokens (with gpt2 tokenizer):", len(enc_text)) # 5145

Total number of tokens (with gpt2 tokenizer): 5145


The BPE tokenizer's encode method performs tokenization and conversion to token IDs in a single step. The data loader is implemented as follows:

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPT2Dataset(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        
        token_ids = tokenizer.encode(txt)
        
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
            
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
    
def create_GPT2_dataloader(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPT2Dataset(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers,
    )
    return dataloader
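
The sliding-window loop in GPT2Dataset can be traced by hand on a short list. The sketch below is plain Python with made-up values (ten consecutive integers standing in for token IDs, max_length=4, stride=2); each target is the input shifted one position to the right:

```python
# Stand-in token IDs, chosen so the shift is easy to see
token_ids = list(range(10))
max_length, stride = 4, 2

pairs = []
# Same loop bounds as GPT2Dataset.__init__
for i in range(0, len(token_ids) - max_length, stride):
    input_chunk = token_ids[i : i + max_length]
    target_chunk = token_ids[i + 1 : i + max_length + 1]
    pairs.append((input_chunk, target_chunk))

for input_chunk, target_chunk in pairs:
    print(input_chunk, "->", target_chunk)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

With stride smaller than max_length, consecutive windows overlap. Setting stride equal to max_length gives windows that do not overlap.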

The last step in preparing input text for LLM training is converting token IDs into embedding vectors:

In [None]:
input_ids = torch.tensor([2, 3, 5, 1])
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight.data)
print(embedding_layer(input_ids))
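
Conceptually, an embedding layer is a lookup table: each token ID selects one row of the weight matrix. The plain-Python sketch below illustrates that idea with a made-up 6×3 matrix (the values are chosen for readability and are not the randomly initialized weights that torch.nn.Embedding would produce):

```python
# Hypothetical 6x3 weight matrix: row r holds [10r, 10r + 1, 10r + 2]
weights = [[float(10 * row + col) for col in range(3)] for row in range(6)]

input_ids = [2, 3, 5, 1]

# Embedding lookup: pick the row that belongs to each token ID
embedded = [weights[token_id] for token_id in input_ids]

for token_id, vector in zip(input_ids, embedded):
    print(token_id, "->", vector)
# 2 -> [20.0, 21.0, 22.0]
# 3 -> [30.0, 31.0, 32.0]
# 5 -> [50.0, 51.0, 52.0]
# 1 -> [10.0, 11.0, 12.0]
```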

In [None]:
vocab_size = 50257
hidden_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, hidden_dim)

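This converts every token in each batch into a 256-dimensional embedding vector. With a batch size of 8 and 4 tokens per sample, the result is an 8×4×256 three-dimensional tensor: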

In [None]:
max_length = 4
dataloader = create_GPT2_dataloader(
    raw_text, batch_size=8, max_length=max_length, 
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs shape:\n", inputs.shape)

token_embeddings = token_embedding_layer(inputs)
print("Embedding inputs shape:\n", token_embeddings.shape)

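For the absolute position embeddings used by GPT models, we only need to create another embedding layer with the same embedding dimension as token_embedding_layer: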

In [None]:
pos_embedding_layer = torch.nn.Embedding(max_length, hidden_dim)
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print("Position embedding shape:\n", pos_embeddings.shape) # (4, 256)

The position embeddings are added directly to the token embeddings to produce the input embeddings fed into the LLM:

In [None]:
input_embeddings = token_embeddings + pos_embeddings
print("Input embeddings shape:\n", input_embeddings.shape) # (8, 4, 256)
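
The addition works because PyTorch broadcasts the (4, 256) position tensor across the batch dimension of the (8, 4, 256) token tensor. The sketch below imitates that with nested Python lists at a much smaller, made-up size (batch 2, sequence length 3, dimension 2): every sample receives the same position vectors.

```python
batch, seq_len, dim = 2, 3, 2

# Made-up token embeddings: every entry is 1.0, shape (batch, seq_len, dim)
token_emb = [[[1.0] * dim for _ in range(seq_len)] for _ in range(batch)]

# Made-up position embeddings: position p is the vector [p, p], shape (seq_len, dim)
pos_emb = [[float(p)] * dim for p in range(seq_len)]

# Add the same position vectors to every sample in the batch
summed = [
    [
        [t + p for t, p in zip(tok_vec, pos_vec)]
        for tok_vec, pos_vec in zip(sample, pos_emb)
    ]
    for sample in token_emb
]

print(summed[0])  # [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
print(summed[1])  # [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
```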

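Finally, we look at how byte pair encoding (BPE) is implemented.

We start with the concept of bytes. Consider converting text into a byte array: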

In [None]:
text = "This is some text"
byte_ary = bytearray(text, "utf-8")
print(byte_ary)

Calling list() on a bytearray object treats each byte as a separate element:

In [None]:
ids = list(byte_ary)
print(ids)
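
For ASCII text, one character is one byte, but that does not hold in general. As a small illustration (the sample strings are made up), UTF-8 uses two bytes for "é" and three bytes for each of the two Chinese characters below:

```python
# "A", "é", and "你好" written with escapes so the code points are unambiguous
samples = ["A", "\u00e9", "\u4f60\u597d"]

byte_lists = {s: list(s.encode("utf-8")) for s in samples}

for s, byte_values in byte_lists.items():
    print(repr(s), "characters:", len(s), "bytes:", byte_values)
# 'A' characters: 1 bytes: [65]
# 'é' characters: 1 bytes: [195, 169]
# '你好' characters: 2 bytes: [228, 189, 160, 229, 165, 189]
```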

A BPE tokenizer has a vocabulary in which each token ID corresponds to a whole word or a subword rather than a single character:

In [None]:
gpt2_tokenizer = tiktoken.get_encoding("gpt2")
gpt2_tokenizer.encode("This is some text")

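Since a byte consists of 8 bits, a single byte can represent $2^8=256$ possible values, ranging from 0 to 255. BPE tokenizers typically use these 256 values as their first 256 single-character tokens: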

In [None]:
gpt2_tokenizer = tiktoken.get_encoding("gpt2")

for i in range(300):
    decoded = gpt2_tokenizer.decode([i])
    print(f"{i}: {decoded}")

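The goal of the BPE algorithm is to build a vocabulary of frequently occurring subwords (such as 298: ent, which appears in words like entangle, entertain, enter, entrance, and entity) or even complete words.

The following Python class implements the algorithm described above, mimicking the tiktoken Python interface: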

In [None]:
import json
import re
from collections import Counter, deque
from functools import lru_cache

class BPETokenizerSimple:
    """
    Simple BPE tokenizer implementation.
    """
    
    def __init__(self):
        # Maps token_id to token_str (e.g., {11246: "some"})
        self.vocab: dict[int, str] = {}
        # Maps token_str to token_id (e.g., {"some": 11246})
        self.inverse_vocab: dict[str, int] = {}
        # Dictionary of BPE merges: {(token_id1, token_id2): merged_token_id}
        self.bpe_merges: dict[tuple[int, int], int] = {}
        # For the official OpenAI GPT-2 merges, use a rank dict:
        # of form {(string_A, string_B): rank}, where lower rank = higher priority
        self.bpe_ranks: dict[tuple[str, str], int] = {}

    def train(self, text: str, vocab_size: int, allowed_special: set[str] = {"<|endoftext|>"}):
        """
        Train the BPE tokenizer on the given text from scratch.
        
        Args:
            text (str): The text to train the tokenizer on.
            vocab_size (int): The desired vocabulary size.
            allowed_special (set): A set of special tokens to allow in the vocabulary.
        """
        
        # Preprocess: Replace spaces with "Ġ"
        # Note that Ġ is a particularity of the GPT-2 BPE implementation
        # E.g., "Hello world" might be tokenized as ["Hello", "Ġworld"]
        # (GPT-4 BPE would tokenize it as ["Hello", " world"])
        processed_text = []
        for i, ch in enumerate(text):
            if ch == " " and i != 0:
                processed_text.append("Ġ")
            if ch != " ":
                processed_text.append(ch)
        processed_text = "".join(processed_text)
        
        # Initialize vocab with unique characters, including "Ġ" if present
        # Start with the first 256 Unicode code points (ASCII plus Latin-1)
        unique_chars = [chr(i) for i in range(256)]
        unique_chars.extend(
            ch for ch in sorted(set(processed_text))
            if ch not in unique_chars
        )
        if "Ġ" not in unique_chars:
            unique_chars.append("Ġ")
        self.vocab = {i: ch for i, ch in enumerate(unique_chars)}
        self.inverse_vocab = {ch: i for i, ch in self.vocab.items()}
        
        # Add allowed special tokens
        for token in allowed_special:
            if token not in self.inverse_vocab:
                new_id = len(self.vocab)
                self.vocab[new_id] = token
                self.inverse_vocab[token] = new_id
                
        # Tokenize the processed_text into token IDs
        token_ids = [self.inverse_vocab[ch] for ch in processed_text]
        
        # BPE steps 1-3: Repeatedly find and replace frequent pairs
        for new_id in range(len(self.vocab), vocab_size):
            pair_id = self.find_freq_pair(token_ids, mode="most")
            if pair_id is None:
                break
            token_ids = self.replace_pair(token_ids, pair_id, new_id)
            self.bpe_merges[pair_id] = new_id
            
        # Build the vocabulary with merged tokens
        for (p0, p1), new_id in self.bpe_merges.items():
            merged_token = self.vocab[p0] + self.vocab[p1]
            self.vocab[new_id] = merged_token
            self.inverse_vocab[merged_token] = new_id
            
    @staticmethod
    def find_freq_pair(token_ids: list[int], mode="most"):
        pairs = Counter(zip(token_ids[:-1], token_ids[1:]))
        if not pairs:
            return None
        match mode:
            case "most":
                # return max(pairs.items(), key=lambda x: x[1])[0]
                return pairs.most_common(1)[0][0]
            case "least":
                # return min(pairs.items(), key=lambda x: x[1])[0]
                return pairs.most_common()[-1][0]
            case _:
                raise ValueError("Invalid mode. Choose 'most' or 'least'")
    
    @staticmethod
    def replace_pair(token_ids: list[int], pair_id: tuple[int, int], new_id: int):
        dq = deque(token_ids)
        replaced: list[int] = []
        
        while dq:
            current = dq.popleft()
            if dq and (current, dq[0]) == pair_id:
                replaced.append(new_id)
                dq.popleft()
            else:
                replaced.append(current)
        
        return replaced

    def load_vocab_and_merges_from_gpt2(self, vocab_path, bpe_merges_path):
        """
        Load pre-trained vocabulary and BPE merges from OpenAI's GPT-2 files.
        
        Args:
            vocab_path (str): Path to the vocab file (GPT-2 calls it 'encoder.json' or 'vocab.json').
            bpe_merges_path (str): Path to the bpe_merges file  (GPT-2 calls it 'vocab.bpe' or 'merges.txt').
        """
        
        # Load vocabulary
        with open(vocab_path, "r", encoding="utf-8") as file:
            loaded_vocab = json.load(file)
            # Convert loaded vocabulary to correct format
            self.vocab = {int(v): k for k, v in loaded_vocab.items()}
            self.inverse_vocab = {k: int(v) for k, v in loaded_vocab.items()}

        # Handle newline character without adding a new token
        if "\n" not in self.inverse_vocab:
            # GPT-2's byte-level vocabulary stores the newline byte as "Ċ";
            # prefer it so that "<|endoftext|>" is not overwritten by '\n'
            fallback_token = next(
                (token for token in ["Ċ", "<|endoftext|>", "Ġ", ""] if token in self.inverse_vocab), None
            )
            if fallback_token is not None:
                newline_token_id = self.inverse_vocab[fallback_token]
            else:
                # If no fallback token is available, raise an error
                raise KeyError("No suitable token found in vocabulary to map '\\n'.")
            self.inverse_vocab["\n"] = newline_token_id
            self.vocab[newline_token_id] = "\n"
            
        # Load GPT-2 merges and store them with an assigned "rank"
        self.bpe_ranks = {}  # reset ranks
        with open(bpe_merges_path, "r", encoding="utf-8") as file:
            lines = file.readlines()
            if lines and lines[0].startswith("#"):
                lines = lines[1:]
            rank = 0
            for line in lines:
                pair = tuple(line.strip().split())
                if len(pair) == 2:
                    token1, token2 = pair
                    # If token1 or token2 not in vocab, skip
                    if token1 in self.inverse_vocab and token2 in self.inverse_vocab:
                        self.bpe_ranks[(token1, token2)] = rank
                        rank += 1
                    else:
                        print(f"Skipping pair {pair} as one token is not in the vocabulary.")

    def encode(self, text: str, allowed_special: set[str] | None = None):
        """
        Encode the input text into a list of token IDs, with tiktoken-style handling of special tokens.
    
        Args:
            text (str): The input text to encode.
            allowed_special (set or None): Special tokens to allow passthrough. If None, special handling is disabled.
    
        Returns:
            List of token IDs.
        """

        token_ids: list[int] = []
        
        # If special token handling is enabled
        if allowed_special is not None and len(allowed_special) > 0:
            # Build regex to match allowed special tokens
            special_pattern = (
                "(" + "|".join(re.escape(tok) for tok in sorted(allowed_special, key=len, reverse=True)) + ")"
            )
            # Encoding text while handling special tokens
            last_index = 0
            for mch in re.finditer(special_pattern, text):
                prefix = text[last_index : mch.start()]
                # Encode prefix without special handling
                token_ids.extend(self.encode(prefix, allowed_special=None))  
                # Encode special token
                special_token = mch.group(0)
                if special_token in self.inverse_vocab:
                    token_ids.append(self.inverse_vocab[special_token])
                else:
                    raise ValueError(f"Special token {special_token} not found in vocabulary.")
                last_index = mch.end()
            # Remaining part to process normally
            text = text[last_index:]  
            
        # If no special tokens, or remaining text after special token split:
        tokens: list[str] = []
        lines = text.split("\n")
        for i, line in enumerate(lines):
            if i > 0:
                tokens.append("\n")
            words = line.split()
            for j, word in enumerate(words):
                if j == 0 and i > 0:
                    tokens.append("Ġ" + word)
                elif j == 0:
                    tokens.append(word)
                else:
                    tokens.append("Ġ" + word)

        for token in tokens:
            if token in self.inverse_vocab:
                token_ids.append(self.inverse_vocab[token])
            else:
                token_ids.extend(self.tokenize_with_bpe(token))
    
        return token_ids
        
    def tokenize_with_bpe(self, token: str):
        """
        Tokenize a single token which is not in vocabulary using BPE merges.

        Args:
            token (str): The token to tokenize.

        Returns:
            list[int]: The list of token IDs after applying BPE.
        """
        
        # Tokenize the token into individual characters (as initial token IDs)
        raw_token_ids = [self.inverse_vocab.get(ch, None) for ch in token]
        token_ids = [tid for tid in raw_token_ids if tid is not None]
        if None in raw_token_ids:
            missing_chars = [ch for ch, tid in zip(token, raw_token_ids) if tid is None]
            raise ValueError(f"Characters not found in vocab: {missing_chars}")
            
        # If we haven't loaded OpenAI's GPT-2 merges, use my approach
        if not self.bpe_ranks:
            can_merge = True
            while can_merge and len(token_ids) > 1:
                can_merge = False
                new_tokens: list[int] = []
                i = 0
                while i < len(token_ids) - 1:
                    pair = (token_ids[i], token_ids[i + 1])
                    if pair in self.bpe_merges:
                        merged_token_id = self.bpe_merges[pair]
                        new_tokens.append(merged_token_id)
                        # Merged pair {pair} -> {merged_token_id} ('{self.vocab[merged_token_id]}')
                        i += 2  # Skip the next token as it's merged
                        can_merge = True
                    else:
                        new_tokens.append(token_ids[i])
                        i += 1
                if i < len(token_ids):
                    new_tokens.append(token_ids[i])
                token_ids = new_tokens
            return token_ids
            
        # Otherwise, do GPT-2-style merging with the ranks:
        # 1) Convert token_ids back to string "symbols" for each ID
        symbols = [self.vocab[id_num] for id_num in token_ids]

        # Repeatedly merge all occurrences of the lowest-rank pair
        while True:
            # Collect all adjacent pairs
            pairs = set(zip(symbols[:-1], symbols[1:]))
            if not pairs:
                break

            # Find the pair with the lowest rank
            bigram = min(pairs, key=lambda x: self.bpe_ranks.get(x, float('inf')))

            # If no valid ranked pair is present, we're done
            if bigram is None or bigram not in self.bpe_ranks:
                break
            # else merge all occurrences of that pair
            first, second = bigram
            new_symbols: list[str] = []
            i = 0
            while i < len(symbols):
                # If we see (first, second) at position i, merge them
                if i < len(symbols) - 1 and symbols[i] == first and symbols[i + 1] == second:
                    new_symbols.append(first + second)  # merged symbol
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            symbols = new_symbols
            if len(symbols) == 1:
                break

        # Finally, convert merged symbols back to IDs
        merged_ids = [self.inverse_vocab[sym] for sym in symbols]
        return merged_ids

    def decode(self, token_ids):
        """
        Decode a list of token IDs back into a string.

        Args:
            token_ids (list[int]): The list of token IDs to decode.

        Returns:
            str: The decoded string.
        """
        buffer: list[str] = []
        for i, token_id in enumerate(token_ids):
            if token_id not in self.vocab:
                raise ValueError(f"Token ID {token_id} not found in vocab.")
            token = self.vocab[token_id]
            if token == "\n":
                # Add a space before the newline if one is not already present,
                # then always emit the newline itself
                if buffer and not buffer[-1].endswith(" "):
                    buffer.append(" ")
                buffer.append("\n")
            elif token.startswith("Ġ"):
                buffer.append(" " + token[1:])
            else:
                buffer.append(token)
        return "".join(buffer)
        
    def save_vocab_and_merges(self, vocab_path, bpe_merges_path):
        """
        Save the vocabulary and BPE merges to JSON files.

        Args:
            vocab_path (str): Path to save the vocabulary.
            bpe_merges_path (str): Path to save the BPE merges.
        """
        # Save vocabulary
        with open(vocab_path, "w", encoding="utf-8") as file:
            json.dump(self.vocab, file, ensure_ascii=False, indent=2)

        # Save BPE merges as a list of dictionaries
        with open(bpe_merges_path, "w", encoding="utf-8") as file:
            merges_list = [{"pair": list(pair), "new_id": new_id}
                           for pair, new_id in self.bpe_merges.items()]
            json.dump(merges_list, file, ensure_ascii=False, indent=2)
            
    def load_vocab_and_merges(self, vocab_path, bpe_merges_path):
        """
        Load the vocabulary and BPE merges from JSON files.

        Args:
            vocab_path (str): Path to the vocabulary file.
            bpe_merges_path (str): Path to the BPE merges file.
        """
        # Load vocabulary
        with open(vocab_path, "r", encoding="utf-8") as file:
            loaded_vocab = json.load(file)
            self.vocab = {int(k): v for k, v in loaded_vocab.items()}
            self.inverse_vocab = {v: int(k) for k, v in loaded_vocab.items()}

        # Load BPE merges
        with open(bpe_merges_path, "r", encoding="utf-8") as file:
            merges_list = json.load(file)
            for merge in merges_list:
                pair = tuple(merge["pair"])
                new_id = merge["new_id"]
                self.bpe_merges[pair] = new_id
                
    @lru_cache(maxsize=None)
    def get_special_token_id(self, token):
        """
        Get the ID of a special token.

        Args:
            token (str): The special token.

        Returns:
            int or None: The ID of the token if it exists, otherwise None.
        """
        return self.inverse_vocab.get(token, None)

Using the BPE tokenizer defined above:

In [None]:
tokenizer = BPETokenizerSimple()
tokenizer.train(raw_text, vocab_size=1000, allowed_special={"<|endoftext|>"})
print(len(tokenizer.vocab))

In [None]:
print(len(tokenizer.bpe_merges))

In [None]:
input_text = "Jack embraced beauty through art and life."
token_ids = tokenizer.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)
print(tokenizer.decode(token_ids) == input_text)

input_text = "Jack embraced beauty through art and life.<|endoftext|>"
token_ids = tokenizer.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)
print(tokenizer.decode(token_ids) == input_text)

input_text = "Jack embraced bea 74uty through art and life.<|endoftext|>"
token_ids = tokenizer.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)
print(tokenizer.decode(token_ids) == input_text)

print("Number of characters:", len(input_text))
print("Number of token IDs:", len(token_ids))

In [None]:
print(token_ids)
print(tokenizer.decode(token_ids))

for token_id in token_ids:
    print(f"{token_id} -> {tokenizer.decode([token_id])}")

In [None]:
# Save trained tokenizer
tokenizer.save_vocab_and_merges(
    vocab_path="../../datasets/the_verdict/bpe_vocab.json", bpe_merges_path="../../datasets/the_verdict/bpe_merges.txt")
# Load tokenizer
tokenizer2 = BPETokenizerSimple()
tokenizer2.load_vocab_and_merges(
    vocab_path="../../datasets/the_verdict/bpe_vocab.json", bpe_merges_path="../../datasets/the_verdict/bpe_merges.txt")

print(tokenizer2.decode(token_ids))
tokenizer2.decode(
    tokenizer2.encode("This is some text with \n newline characters.")
)

Loading the GPT-2 vocabulary:

In [None]:
tokenizer_1 = BPETokenizerSimple()
gpt2_dir = "../../models/gpt2/"
tokenizer_1.load_vocab_and_merges_from_gpt2(gpt2_dir+'vocab.json', gpt2_dir+'merges.txt')
tokenizer_2 = tiktoken.get_encoding("gpt2")

In [None]:
input_text = "Jack embraced beauty through art and life.<|endoftext|>"
token_ids = tokenizer_1.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)
decoded_text = tokenizer_1.decode(token_ids)
print(decoded_text)

In [None]:
input_text = "    Jack embraced    "
token_ids = tokenizer_2.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)
decoded_text = tokenizer_2.decode(token_ids)
print(decoded_text)

In [None]:
input_text = "Jack embraced beauty through art and life.<|endoftext|>"
token_ids = tokenizer_2.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)
decoded_text = tokenizer_2.decode(token_ids)
print(decoded_text)

In [None]:
input_text = "     Jack embraced     "
token_ids = tokenizer_1.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)
decoded_text = tokenizer_1.decode(token_ids)
print(decoded_text)

In [None]:
input_text = "     Jack embraced     "
token_ids = tokenizer_2.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)
decoded_text = tokenizer_2.decode(token_ids)
print(decoded_text)