# Data Processing

In this notebook we explore, implement, and compare different BPE tokenization approaches for the CNN/DailyMail dataset, and we apply preprocessing techniques such as truncation (as concluded in notebook 01) before building the final data pipeline.

In [19]:
# Libraries
from datasets import load_dataset
import pandas as pd

from collections import Counter, deque
from functools import lru_cache
import json

In [3]:
# Load the data as we did in notebook 01

dataset = load_dataset("cnn_dailymail", "3.0.0")

train_df = pd.DataFrame(dataset['train'])
validation_data_df = pd.DataFrame(dataset['validation'])
test_data_df = pd.DataFrame(dataset['test'])

## BPE

BPE (Byte Pair Encoding) is a bottom-up algorithm: it starts from the individual characters (bytes) of the sequence and iteratively merges the most frequent adjacent pairs until the tokenization is complete. This reduces the number of unknown (OOV) tokens and yields a smaller vocabulary than word-based tokenization. The code cells below are based on the chapter 2 notebook bpe-from-scratch.ipynb from Sebastian Raschka's public LLMs-from-scratch repository.

### Steps of the BPE algorithm

1. Initialization
    1. Start with the text as a sequence of individual characters.
    2. The initial vocabulary contains every character found in the corpus.
    3. Each character is assigned an ID (for example, 0-255).
2. Count the most frequent pairs
    1. The algorithm scans the whole text and counts how often each pair of adjacent tokens appears.
3. Merge the most frequent pair
    1. Assign a new ID to the most frequent pair and replace its occurrences in the sequence.
    2. Repeat steps 2 and 3, merging the most frequent pair each time.
    3. Stop when no further compression is possible (no pair occurs more than once) or when the target vocabulary size is reached.
4. Decoding
    1. Use the lookup table to restore the original text by reversing the process.
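
Step 2 can be sketched in a few lines. This is a minimal illustration rather than Raschka's implementation: `count_pairs` and `most_frequent_pair` are hypothetical helper names, and ties are resolved in favor of the pair that appears first in the sequence.

```python
from collections import Counter


def count_pairs(tokens):
    """Count how often each pair of adjacent tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))


def most_frequent_pair(tokens):
    """Return the most frequent adjacent pair, or None if there are no pairs."""
    pairs = count_pairs(tokens)
    if not pairs:
        return None
    # max() keeps the first pair it encounters among equally frequent ones
    return max(pairs.items(), key=lambda item: item[1])[0]


chars = list("the cat in the hat")
print(count_pairs(chars)[("t", "h")])
print(most_frequent_pair(chars))
```

Counting over adjacent positions with `zip(tokens, tokens[1:])` works the same way whether the tokens are characters or integer IDs, which is why the same idea applies to every later iteration.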

### Example of the BPE algorithm

Raschka's notebook walks through an example that is worth studying. Take the sequence "the cat in the hat".

#### Iteration 1

1. Identify the pair frequencies
   1. The pair "th" appears twice in the sequence: "`th`e cat in `th`e hat"
2. Replace the pair with an ID
   1. Replace `th` with an ID that is not yet in use, for example 256
   2. The sequence then looks like this: "`<256>`e cat in `<256>`e hat"
   3. And the vocabulary is updated as follows:

        0: ...

        1: ...

        ...

        256: `"th"`

#### Iteration 2

1. Identify the pair frequencies
   1. In the previous sequence, "<256>e" now appears twice: "`<256>e` cat in `<256>e` hat"
2. Replace the pair with an ID
   1. Replace `<256>e` with a new ID that is not yet in use, for example 257
   2. The sequence then looks like this: "`<257>` cat in `<257>` hat"
   3. The vocabulary is updated as follows:
        
        0: ...

        1: ...

        ...

        256: "th"

        257: `"<256>e"`

#### Iteration 3

1. Identify the pair frequencies
   1. In the previous sequence, "<257> " now appears twice: "`<257> `cat in `<257> `hat"
2. Replace the pair with an ID
   1. Replace `<257> ` with an ID that is not yet in use, for example 258
   2. The sequence then looks like this: "`<258>`cat in `<258>`hat"
   3. The vocabulary is updated as follows:

        0: ...

        1: ...

        ...

        256: "th"

        257: "<256>e"

        258: `"<257> "`
    
#### Iteration 4

...
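
The three iterations above can be reproduced with a short standalone sketch. It is an illustration under stated assumptions, not the class implemented later: `merge_pair` and `run_merges` are hypothetical names, initial IDs come from `ord()` (so characters are assumed to have code points below 256), and new vocabulary entries are expanded to plain text ("the") instead of the nested form "<256>e".

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2  # skip the second element of the merged pair
        else:
            merged.append(ids[i])
            i += 1
    return merged


def run_merges(text, num_merges, first_new_id=256):
    """Apply up to `num_merges` merges; return the final IDs and the new vocab entries."""
    ids = [ord(ch) for ch in text]  # assumption: code points below 256
    vocab = {}
    for new_id in range(first_new_id, first_new_id + num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        # Ties go to the pair seen first in the sequence
        pair = max(pairs.items(), key=lambda item: item[1])[0]
        ids = merge_pair(ids, pair, new_id)
        # Expand both halves to text so the vocabulary stays readable
        left, right = (vocab.get(p, chr(p)) for p in pair)
        vocab[new_id] = left + right
    return ids, vocab


ids, vocab = run_merges("the cat in the hat", num_merges=3)
print(vocab)
print(len(ids))
```

Each merge here removes two tokens from the 18-character sequence, because the merged pair occurs twice.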


### BPE implementation

We now implement the educational code provided by Raschka and use it on our news-summarization dataset.

#### The BPETokenizerSimple class

In [None]:
class BPETokenizerSimple:
    def __init__(self):
        # Maps token_id to token_str (e.g., {11246: "some"})
        self.vocab = {}
        # Maps token_str to token_id (e.g., {"some": 11246})
        self.inverse_vocab = {}
        # Dictionary of BPE merges: {(token_id1, token_id2): merged_token_id}
        self.bpe_merges = {}

        # For the official OpenAI GPT-2 merges, use a rank dict:
        #  of form {(string_A, string_B): rank}, where lower rank = higher priority
        self.bpe_ranks = {}

    def train(self, text, vocab_size, allowed_special={"<|endoftext|>"}):
        """
        Train the BPE tokenizer from scratch.

        Args:
            text (str): The training text.
            vocab_size (int): The desired vocabulary size.
            allowed_special (set): A set of special tokens to include.
        """

        # Preprocess: Replace spaces with "Ġ"
        # Note that Ġ is a particularity of the GPT-2 BPE implementation
        # E.g., "Hello world" might be tokenized as ["Hello", "Ġworld"]
        # (GPT-4 BPE would tokenize it as ["Hello", " world"])
        processed_text = []
        for i, char in enumerate(text):
            if char == " " and i != 0:
                processed_text.append("Ġ")
            if char != " ":
                processed_text.append(char)
        processed_text = "".join(processed_text)

        # Initialize vocab with unique characters, including "Ġ" if present
        # Start with the first 256 ASCII characters
        unique_chars = [chr(i) for i in range(256)]
        unique_chars.extend(
            char for char in sorted(set(processed_text))
            if char not in unique_chars
        )
        if "Ġ" not in unique_chars:
            unique_chars.append("Ġ")

        self.vocab = {i: char for i, char in enumerate(unique_chars)}
        self.inverse_vocab = {char: i for i, char in self.vocab.items()}

        # Add allowed special tokens
        if allowed_special:
            for token in allowed_special:
                if token not in self.inverse_vocab:
                    new_id = len(self.vocab)
                    self.vocab[new_id] = token
                    self.inverse_vocab[token] = new_id

        # Tokenize the processed_text into token IDs
        token_ids = [self.inverse_vocab[char] for char in processed_text]

        # BPE steps 1-3: Repeatedly find and replace frequent pairs
        for new_id in range(len(self.vocab), vocab_size):
            pair_id = self.find_freq_pair(token_ids, mode="most")
            if pair_id is None:
                break
            token_ids = self.replace_pair(token_ids, pair_id, new_id)
            self.bpe_merges[pair_id] = new_id

        # Build the vocabulary with merged tokens
        for (p0, p1), new_id in self.bpe_merges.items():
            merged_token = self.vocab[p0] + self.vocab[p1]
            self.vocab[new_id] = merged_token
            self.inverse_vocab[merged_token] = new_id

    def load_vocab_and_merges_from_openai(self, vocab_path, bpe_merges_path):
        """
        Load pre-trained vocabulary and BPE merges from OpenAI's GPT-2 files.

        Args:
            vocab_path (str): Path to the vocab file (GPT-2 calls it 'encoder.json').
            bpe_merges_path (str): Path to the bpe_merges file  (GPT-2 calls it 'vocab.bpe').
        """
        # Load vocabulary
        with open(vocab_path, "r", encoding="utf-8") as file:
            loaded_vocab = json.load(file)
            # Convert loaded vocabulary to correct format
            self.vocab = {int(v): k for k, v in loaded_vocab.items()}
            self.inverse_vocab = {k: int(v) for k, v in loaded_vocab.items()}

        # Handle newline character without adding a new token
        if "\n" not in self.inverse_vocab:
            # Use an existing token ID as a placeholder for '\n'
            # Preferentially use "<|endoftext|>" if available
            fallback_token = next((token for token in ["<|endoftext|>", "Ġ", ""] if token in self.inverse_vocab), None)
            if fallback_token is not None:
                newline_token_id = self.inverse_vocab[fallback_token]
            else:
                # If no fallback token is available, raise an error
                raise KeyError("No suitable token found in vocabulary to map '\\n'.")

            self.inverse_vocab["\n"] = newline_token_id
            self.vocab[newline_token_id] = "\n"

        # Load GPT-2 merges and store them with an assigned "rank"
        self.bpe_ranks = {}  # reset ranks
        with open(bpe_merges_path, "r", encoding="utf-8") as file:
            lines = file.readlines()
            if lines and lines[0].startswith("#"):
                lines = lines[1:]

            rank = 0
            for line in lines:
                pair = tuple(line.strip().split())
                if len(pair) == 2:
                    token1, token2 = pair
                    # If token1 or token2 not in vocab, skip
                    if token1 in self.inverse_vocab and token2 in self.inverse_vocab:
                        self.bpe_ranks[(token1, token2)] = rank
                        rank += 1
                    else:
                        print(f"Skipping pair {pair} as one token is not in the vocabulary.")

    def encode(self, text, allowed_special=None):
        """
        Encode the input text into a list of token IDs, with tiktoken-style handling of special tokens.
    
        Args:
            text (str): The input text to encode.
            allowed_special (set or None): Special tokens to allow passthrough. If None, special handling is disabled.
    
        Returns:
            List of token IDs.
        """
        import re
    
        token_ids = []
    
        # If special token handling is enabled
        if allowed_special is not None and len(allowed_special) > 0:
            # Build regex to match allowed special tokens
            special_pattern = (
                "(" + "|".join(re.escape(tok) for tok in sorted(allowed_special, key=len, reverse=True)) + ")"
            )
    
            last_index = 0
            for match in re.finditer(special_pattern, text):
                prefix = text[last_index:match.start()]
                token_ids.extend(self.encode(prefix, allowed_special=None))  # Encode prefix without special handling
    
                special_token = match.group(0)
                if special_token in self.inverse_vocab:
                    token_ids.append(self.inverse_vocab[special_token])
                else:
                    raise ValueError(f"Special token {special_token} not found in vocabulary.")
                last_index = match.end()
    
            text = text[last_index:]  # Remaining part to process normally
    
            # Check if any disallowed special tokens are in the remainder
            disallowed = [
                tok for tok in self.inverse_vocab
                if tok.startswith("<|") and tok.endswith("|>") and tok in text and tok not in allowed_special
            ]
            if disallowed:
                raise ValueError(f"Disallowed special tokens encountered in text: {disallowed}")
    
        # If no special tokens, or remaining text after special token split:
        tokens = []
        lines = text.split("\n")
        for i, line in enumerate(lines):
            if i > 0:
                tokens.append("\n")
            words = line.split()
            for j, word in enumerate(words):
                if j == 0 and i > 0:
                    tokens.append("Ġ" + word)
                elif j == 0:
                    tokens.append(word)
                else:
                    tokens.append("Ġ" + word)
    
        for token in tokens:
            if token in self.inverse_vocab:
                token_ids.append(self.inverse_vocab[token])
            else:
                token_ids.extend(self.tokenize_with_bpe(token))
    
        return token_ids

    def tokenize_with_bpe(self, token):
        """
        Tokenize a single token using BPE merges.

        Args:
            token (str): The token to tokenize.

        Returns:
            List[int]: The list of token IDs after applying BPE.
        """
        # Tokenize the token into individual characters (as initial token IDs)
        token_ids = [self.inverse_vocab.get(char, None) for char in token]
        if None in token_ids:
            missing_chars = [char for char, tid in zip(token, token_ids) if tid is None]
            raise ValueError(f"Characters not found in vocab: {missing_chars}")

        # If we haven't loaded OpenAI's GPT-2 merges, use my approach
        if not self.bpe_ranks:
            can_merge = True
            while can_merge and len(token_ids) > 1:
                can_merge = False
                new_tokens = []
                i = 0
                while i < len(token_ids) - 1:
                    pair = (token_ids[i], token_ids[i + 1])
                    if pair in self.bpe_merges:
                        merged_token_id = self.bpe_merges[pair]
                        new_tokens.append(merged_token_id)
                        # Uncomment for educational purposes:
                        # print(f"Merged pair {pair} -> {merged_token_id} ('{self.vocab[merged_token_id]}')")
                        i += 2  # Skip the next token as it's merged
                        can_merge = True
                    else:
                        new_tokens.append(token_ids[i])
                        i += 1
                if i < len(token_ids):
                    new_tokens.append(token_ids[i])
                token_ids = new_tokens
            return token_ids

        # Otherwise, do GPT-2-style merging with the ranks:
        # 1) Convert token_ids back to string "symbols" for each ID
        symbols = [self.vocab[id_num] for id_num in token_ids]

        # Repeatedly merge all occurrences of the lowest-rank pair
        while True:
            # Collect all adjacent pairs
            pairs = set(zip(symbols, symbols[1:]))
            if not pairs:
                break

            # Find the pair with the best (lowest) rank
            min_rank = float("inf")
            bigram = None
            for p in pairs:
                r = self.bpe_ranks.get(p, float("inf"))
                if r < min_rank:
                    min_rank = r
                    bigram = p

            # If no valid ranked pair is present, we're done
            if bigram is None or bigram not in self.bpe_ranks:
                break

            # Merge all occurrences of that pair
            first, second = bigram
            new_symbols = []
            i = 0
            while i < len(symbols):
                # If we see (first, second) at position i, merge them
                if i < len(symbols) - 1 and symbols[i] == first and symbols[i+1] == second:
                    new_symbols.append(first + second)  # merged symbol
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            symbols = new_symbols

            if len(symbols) == 1:
                break

        # Finally, convert merged symbols back to IDs
        merged_ids = [self.inverse_vocab[sym] for sym in symbols]
        return merged_ids

    def decode(self, token_ids):
        """
        Decode a list of token IDs back into a string.

        Args:
            token_ids (List[int]): The list of token IDs to decode.

        Returns:
            str: The decoded string.
        """
        decoded_string = ""
        for i, token_id in enumerate(token_ids):
            if token_id not in self.vocab:
                raise ValueError(f"Token ID {token_id} not found in vocab.")
            token = self.vocab[token_id]
            if token == "\n":
                if decoded_string and not decoded_string.endswith(" "):
                    decoded_string += " "  # Add space if not present before a newline
                decoded_string += token
            elif token.startswith("Ġ"):
                decoded_string += " " + token[1:]
            else:
                decoded_string += token
        return decoded_string

    def save_vocab_and_merges(self, vocab_path, bpe_merges_path):
        """
        Save the vocabulary and BPE merges to JSON files.

        Args:
            vocab_path (str): Path to save the vocabulary.
            bpe_merges_path (str): Path to save the BPE merges.
        """
        # Save vocabulary
        with open(vocab_path, "w", encoding="utf-8") as file:
            json.dump(self.vocab, file, ensure_ascii=False, indent=2)

        # Save BPE merges as a list of dictionaries
        with open(bpe_merges_path, "w", encoding="utf-8") as file:
            merges_list = [{"pair": list(pair), "new_id": new_id}
                           for pair, new_id in self.bpe_merges.items()]
            json.dump(merges_list, file, ensure_ascii=False, indent=2)

    def load_vocab_and_merges(self, vocab_path, bpe_merges_path):
        """
        Load the vocabulary and BPE merges from JSON files.

        Args:
            vocab_path (str): Path to the vocabulary file.
            bpe_merges_path (str): Path to the BPE merges file.
        """
        # Load vocabulary
        with open(vocab_path, "r", encoding="utf-8") as file:
            loaded_vocab = json.load(file)
            self.vocab = {int(k): v for k, v in loaded_vocab.items()}
            self.inverse_vocab = {v: int(k) for k, v in loaded_vocab.items()}

        # Load BPE merges
        with open(bpe_merges_path, "r", encoding="utf-8") as file:
            merges_list = json.load(file)
            for merge in merges_list:
                pair = tuple(merge["pair"])
                new_id = merge["new_id"]
                self.bpe_merges[pair] = new_id

    @lru_cache(maxsize=None)
    def get_special_token_id(self, token):
        return self.inverse_vocab.get(token, None)

    @staticmethod
    def find_freq_pair(token_ids, mode="most"):
        pairs = Counter(zip(token_ids, token_ids[1:]))

        if not pairs:
            return None

        if mode == "most":
            return max(pairs.items(), key=lambda x: x[1])[0]
        elif mode == "least":
            return min(pairs.items(), key=lambda x: x[1])[0]
        else:
            raise ValueError("Invalid mode. Choose 'most' or 'least'.")

    @staticmethod
    def replace_pair(token_ids, pair_id, new_id):
        dq = deque(token_ids)
        replaced = []

        while dq:
            current = dq.popleft()
            if dq and (current, dq[0]) == pair_id:
                replaced.append(new_id)
                # Remove the 2nd token of the pair, 1st was already removed
                dq.popleft()
            else:
                replaced.append(current)

        return replaced

#### Applying the BPETokenizerSimple class to our dataset

In [38]:
print(train_df["article"].iloc[3])

WASHINGTON (CNN) -- Doctors removed five small polyps from President Bush's colon on Saturday, and "none appeared worrisome," a White House spokesman said. The polyps were removed and sent to the National Naval Medical Center in Bethesda, Maryland, for routine microscopic examination, spokesman Scott Stanzel said. Results are expected in two to three days. All were small, less than a centimeter [half an inch] in diameter, he said. Bush is in good humor, Stanzel said, and will resume his activities at Camp David. During the procedure Vice President Dick Cheney assumed presidential power. Bush reclaimed presidential power at 9:21 a.m. after about two hours. Doctors used "monitored anesthesia care," Stanzel said, so the president was asleep, but not as deeply unconscious as with a true general anesthetic. He spoke to first lady Laura Bush -- who is in Midland, Texas, celebrating her mother's birthday -- before and after the procedure, Stanzel said. Afterward, the president played with his

In [None]:
# For this example we pick the fourth article (index 3) of our training set

text = train_df["article"].iloc[3]

In [None]:
# Use Raschka's class to train BPE, with a special token that marks the end of the sequence

tokenizer = BPETokenizerSimple()
tokenizer.train(text, vocab_size=1000, allowed_special={"<EOS>"})

In [None]:
# Vocabulary size

print(len(tokenizer.vocab))

1000


In [None]:
# Number of merges learned during training

print(len(tokenizer.bpe_merges))

742


In [39]:
input_text = "Bush had five harmless polyps removed and returned to work."
token_ids = tokenizer.encode(input_text)
print(token_ids)

[300, 256, 420, 100, 256, 447, 101, 256, 420, 114, 109, 354, 115, 115, 256, 340, 115, 256, 361, 307, 282, 110, 100, 256, 262, 116, 279, 110, 307, 256, 275, 256, 119, 287, 107, 46]


In [None]:
# Appending the special token EOS (End Of Sequence) without allowed_special: it is split into individual characters

input_text = "Bush had five harmless polyps removed and returned to work.<EOS> "
token_ids = tokenizer.encode(input_text)
print(token_ids)

[300, 256, 420, 100, 256, 447, 101, 256, 420, 114, 109, 354, 115, 115, 256, 340, 115, 256, 361, 307, 282, 110, 100, 256, 262, 116, 279, 110, 307, 256, 275, 256, 119, 287, 107, 46, 60, 69, 79, 83, 62]


In [42]:
input_text = "Bush had five harmless polyps removed and returned to work.<EOS>"
token_ids = tokenizer.encode(input_text, allowed_special={"<EOS>"})
print(token_ids) 

[300, 256, 420, 100, 256, 447, 101, 256, 420, 114, 109, 354, 115, 115, 256, 340, 115, 256, 361, 307, 282, 110, 100, 256, 262, 116, 279, 110, 307, 256, 275, 256, 119, 287, 107, 46, 257]


In the previous cell, the special token is mapped to ID 257: it is the first ID assigned after the base characters (0-255) and "Ġ" (256).
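
The last two cells differ only in `allowed_special`: without it, `<EOS>` is treated as ordinary text and broken into characters. The sketch below isolates the regex-based split that `encode` performs before tokenizing. `split_on_special` is a hypothetical helper written for illustration; it is not part of Raschka's class.

```python
import re


def split_on_special(text, allowed_special):
    """Split `text` into ordinary chunks and allowed special tokens, in order."""
    if not allowed_special:
        return [text]
    # Longest tokens first, so a longer special token wins over a shorter prefix
    alternatives = sorted(allowed_special, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in alternatives) + ")"
    # The capturing group makes re.split keep the special tokens in the result
    return [chunk for chunk in re.split(pattern, text) if chunk]


print(split_on_special("returned to work.<EOS>", {"<EOS>"}))
```

Ordinary chunks would then go through the BPE merges, while each special token is looked up directly in the vocabulary.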

In [43]:
print("Number of characters:", len(input_text))
print("Number of token IDs:", len(token_ids))

Number of characters: 64
Number of token IDs: 37


As we can see, the number of characters differs from the number of tokens, because the algorithm groups the most frequent pairs into single tokens. We can therefore describe the size of a sequence by the number of tokens it contains.

In the previous notebook we faced the question of what to do with summaries that are much longer than the mean. That length was computed on the raw sequences, but it should be measured on the tokenized sequences instead.

From there, we can truncate the sequences to keep only the number of tokens we are interested in.
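
A minimal sketch of both ideas, measuring length in tokens and truncating to a token budget. `token_lengths` and `truncate_ids` are hypothetical helpers, and `char_encode` is a stand-in encoder (one ID per character) used only to keep the example self-contained; in the notebook, `tokenizer.encode` would take its place.

```python
def token_lengths(texts, encode):
    """Length of each text measured in tokens, for any `encode` callable."""
    return [len(encode(text)) for text in texts]


def truncate_ids(token_ids, max_tokens, eos_id=None):
    """Keep at most `max_tokens` IDs; optionally end a cut sequence with `eos_id`."""
    if max_tokens <= 0:
        return []
    if len(token_ids) <= max_tokens:
        return list(token_ids)
    truncated = list(token_ids[:max_tokens])
    if eos_id is not None:
        truncated[-1] = eos_id  # keep an end-of-sequence marker after cutting
    return truncated


def char_encode(text):
    """Stand-in encoder for the demo: one ID per character (not BPE)."""
    return [ord(ch) for ch in text]


print(token_lengths(["Bush had", "five polyps removed"], char_encode))
print(truncate_ids(char_encode("five polyps removed"), max_tokens=5, eos_id=257))
```

Overwriting the last kept position with the end-of-sequence ID is one possible design choice; simply cutting the list is equally valid if the model does not rely on that marker.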

In [45]:
print(token_ids)

[300, 256, 420, 100, 256, 447, 101, 256, 420, 114, 109, 354, 115, 115, 256, 340, 115, 256, 361, 307, 282, 110, 100, 256, 262, 116, 279, 110, 307, 256, 275, 256, 119, 287, 107, 46, 257]


In [46]:

for token_id in token_ids:
    print(f"{token_id} -> {tokenizer.decode([token_id])}")

300 -> Bush
256 ->  
420 -> ha
100 -> d
256 ->  
447 -> fiv
101 -> e
256 ->  
420 -> ha
114 -> r
109 -> m
354 -> le
115 -> s
115 -> s
256 ->  
340 -> polyp
115 -> s
256 ->  
361 -> remov
307 -> ed
282 ->  a
110 -> n
100 -> d
256 ->  
262 -> re
116 -> t
279 -> ur
110 -> n
307 -> ed
256 ->  
275 -> to
256 ->  
119 -> w
287 -> or
107 -> k
46 -> .
257 -> <EOS>


Here we can see in more detail how the algorithm merges the most frequent pairs and assigns each one an ID in the vocabulary.

In [48]:
tokenizer.decode(
    tokenizer.encode("This is some text.")
)

'This is some text.'

In [49]:
tokenizer.decode(
    tokenizer.encode("This is some text with \n newline characters.")
)

'This is some text with \n newline characters.'

### Saving and loading the tokenizer

In [None]:
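# Save the trained tokenizer (the target directory must exist before writing)
import os

os.makedirs("bpe-tokenizer-raschka", exist_ok=True)
tokenizer.save_vocab_and_merges(vocab_path="bpe-tokenizer-raschka/vocab.json", bpe_merges_path="bpe-tokenizer-raschka/bpe_merges.txt")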

In [52]:
# Load the tokenizer

tokenizer2 = BPETokenizerSimple()
tokenizer2.load_vocab_and_merges(vocab_path="bpe-tokenizer-raschka/vocab.json", bpe_merges_path="bpe-tokenizer-raschka/bpe_merges.txt")

In [53]:
print(tokenizer2.decode(token_ids))

Bush had five harmless polyps removed and returned to work.<EOS>


In [54]:
tokenizer2.decode(
    tokenizer2.encode("This is some text with \n newline characters.")
)

'This is some text with \n newline characters.'

Raschka does not recommend using his from-scratch BPE tokenizer in practice; he recommends libraries that are optimized for the task. Here we present two of them: tiktoken from OpenAI, which Raschka himself recommends, and the Hugging Face tokenizer.

## Tokenization with tiktoken (OpenAI)

This library is recommended for its computational performance and is the tokenizer used by OpenAI models (such as GPT-2 and GPT-4). It is useful to try as a high-performance, pre-trained BPE tokenizer.

In [None]:
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

In [62]:
enc.encode("tiktoken es genial!")

[83, 8251, 2488, 878, 80401, 0]

In [60]:
def num_tokens_from_string(text):
    """Return the number of tokens in a string."""
    num_tokens = len(enc.encode(text))
    return num_tokens

In [73]:
enc.encode("Hello, I'm Luis")

[13225, 11, 5477, 37972]

In [61]:
num_tokens_from_string("Hello, I'm Luis")

4

In [63]:

enc.decode([83, 8251, 2488, 878, 80401, 0])

'tiktoken es genial!'

In [65]:

[enc.decode_single_token_bytes(token) for token in [83, 8251, 2488, 878, 80401, 0]]

[b't', b'ikt', b'oken', b' es', b' genial', b'!']

In [81]:
train_df.iloc[0,0]

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details o

In [82]:
enc.encode(train_df.iloc[0,0])

[43,
 137750,
 11,
 14318,
 350,
 77254,
 8,
 2230,
 23564,
 45666,
 8253,
 19041,
 22950,
 143342,
 35640,
 3158,
 316,
 261,
 9822,
 8989,
 455,
 5749,
 3653,
 4987,
 13,
 16,
 5749,
 8,
 46505,
 472,
 501,
 18304,
 220,
 1157,
 402,
 10715,
 11,
 889,
 501,
 104340,
 290,
 3905,
 14219,
 9831,
 261,
 29176,
 402,
 2395,
 13,
 19041,
 22950,
 143342,
 472,
 23564,
 45666,
 306,
 392,
 78006,
 45666,
 326,
 290,
 10735,
 328,
 290,
 37739,
 1,
 2514,
 290,
 67981,
 328,
 91602,
 6000,
 2549,
 2846,
 290,
 2375,
 11,
 290,
 5612,
 19556,
 5003,
 501,
 853,
 860,
 8935,
 316,
 76611,
 399,
 1232,
 9342,
 4194,
 402,
 5661,
 13653,
 11,
 8879,
 326,
 46658,
 13531,
 13,
 392,
 40,
 4128,
 3496,
 316,
 413,
 1001,
 328,
 2617,
 1665,
 1218,
 11,
 472,
 6780,
 472,
 1023,
 3716,
 220,
 1157,
 11,
 24645,
 3877,
 9247,
 261,
 18965,
 10325,
 1669,
 5801,
 503,
 3543,
 6771,
 3532,
 501,
 6967,
 448,
 21083,
 163640,
 11965,
 495,
 2944,
 13,
 392,
 40,
 4128,
 2411,
 17291,
 413,
 11884,
 1

In [84]:
num_tokens_from_string(train_df.iloc[0,0])

556

In [89]:
sample_article = train_df["article"].iloc[0]
sample_highlight = train_df["highlights"].iloc[0]

print("\nExample article:")
print(sample_article[:500] + "...")
print("\nExample summary:")
print(sample_highlight)

article_ids_tiktoken = enc.encode(sample_article)
highlight_ids_tiktoken = enc.encode(sample_highlight)

print(f"\ntiktoken tokens for the article (first 50): {article_ids_tiktoken[:50]}...")
print(f"Total number of tiktoken tokens (article): {len(article_ids_tiktoken)}")
print(f"tiktoken tokens for the summary: {highlight_ids_tiktoken}")
print(f"Total number of tiktoken tokens (summary): {len(highlight_ids_tiktoken)}")

print(f"\nDecoded: {enc.decode(article_ids_tiktoken[:50])}...")


Example article:
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as s...

Example summary:
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .

tiktoken tokens for the article (first 50): [43, 137750, 11, 14318, 350, 77254, 8, 2230, 23564, 45666, 8253, 19041, 22950, 143342, 35640, 3158, 316, 261, 9822, 8989, 455, 5749, 3653, 4987, 13, 16, 5749, 8, 46505, 472, 501, 18304, 220

# References

- Raschka's BPE notebook: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb
- tiktoken repository: https://github.com/openai/tiktoken